[E-08] Project


Project: News Summarizer


In this project, we take a new dataset and produce both abstractive and extractive summaries of it.

Step 1. Collecting the data


Use the news article data (news_summary_more.csv) from the following link.
sunnysai12345/News_Summary
It can be downloaded with the following code.
import nltk
nltk.download('stopwords')

import numpy as np
import pandas as pd
import os
import re
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from bs4 import BeautifulSoup 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
import urllib.request
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

print('=3')
=3


[nltk_data] Downloading package stopwords to /aiffel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv", filename="news_summary_more.csv")
data = pd.read_csv('news_summary_more.csv', encoding='iso-8859-1')
print(len(data))
98401
# Print 10 random samples
data.sample(10)
       headlines                                           text
47014  19-yr-old youngest no. 1 ODI bowler, breaks 21...  Afghanistan spinner Rashid Khan has become the...
20093  40,000 people celebrate Ram Rahim's birthday w...  Over 40,000 Dera Sacha Sauda followers reporte...
56376  Kings XI Punjab appoint ex-Aus batsman Hodge a...  Indian Premier League (IPL) franchise Kings XI...
74516  90 cows die in two more shelters run by arrest...  A day after Chhattisgarh BJP leader Harish Ver...
83205  Sheeran reacts to accusations of not singing l...  Singer Ed Sheeran, responding to accusations o...
42786  Israel admits bombing Syrian 'nuclear reactor'...  After over 10 years of secrecy, Israel has for...
24302  Woman in labour carried on cot through flooded...  A woman in labour was carried on a cot by her ...
4957   50 vehicles pile up on Haryana highway amid de...  At least eight people, including seven from th...
74034  Women's Health Line uses Sarahah to promote wo...  Women's Health Line, an organisation which pro...
16000  Delhi gets highest rainfall in September in 7 ...  Delhi this year has received the highest rainf...
This data consists of two columns: text, the body of the article, and headlines, its title.
For abstractive summarization, we can train a model with text as the source and headlines as the target summary. For extractive summarization, only the text column is used.

Step 2. Data preprocessing (abstractive summarization)


Referring to the preprocessing used in the exercises, add any preprocessing that seems necessary to normalize and clean the text. When removing stopwords, consider whether they should also be removed from the relatively short summary data.
data.columns = ['Summary','Text']
data.sample(1)
       Summary                                             Text
25700  Gayle almost drops catch with left hand, takes...  Vancouver Knights captain Chris Gayle pulled o...

(1) Data cleaning


1) Removing duplicate and null samples

print('Number of unique samples in the Text column (duplicates excluded) :', data['Text'].nunique())
print('Number of unique samples in the Summary column (duplicates excluded) :', data['Summary'].nunique())
Number of unique samples in the Text column (duplicates excluded) : 98360
Number of unique samples in the Summary column (duplicates excluded) : 98280
# The DataFrame's drop_duplicates() makes it easy to remove duplicate samples.

# With inplace=True, no DataFrame is returned; data is modified in place.
data.drop_duplicates(subset = ['Text'], inplace=True)
print('Total number of samples :', (len(data)))
Total number of samples : 98360
# Check for NULL values
print(data.isnull().sum()) # 0
Summary    0
Text       0
dtype: int64

2) Text normalization and stopword removal

contractions = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}
print("done")
done
# Remove stopwords from the samples using the stopword list provided by NLTK.

print('Number of stopwords :', len(stopwords.words('english') ))
print(stopwords.words('english'))

# Data preprocessing function
## Called with stopword removal only for Text preprocessing, not for Summary preprocessing.
def preprocess_sentence(sentence, remove_stopwords=True):
    sentence = sentence.lower() # lowercase the text
    sentence = BeautifulSoup(sentence, "lxml").text # remove html tags such as <br />, <a href = ...>
    sentence = re.sub(r'\([^)]*\)', '', sentence) # remove parenthesized strings (...) Ex) my husband (and myself!) for => my husband for
    sentence = re.sub('"','', sentence) # remove double quotes
    sentence = ' '.join([contractions[t] if t in contractions else t for t in sentence.split(" ")]) # expand contractions
    sentence = re.sub(r"'s\b","", sentence) # remove possessive 's. Ex) roland's -> roland
    sentence = re.sub("[^a-zA-Z]", " ", sentence) # replace non-alphabetic characters (digits, punctuation, etc.) with spaces
    sentence = re.sub('[m]{2,}', 'mm', sentence) # collapse runs of 3 or more m's to 2. Ex) ummmmmmm yeah -> umm yeah
    
    # remove stopwords (Text)
    if remove_stopwords:
        tokens = ' '.join(word for word in sentence.split() if not word in stopwords.words('english') if len(word) > 1)
    # keep stopwords (Summary)
    else:
        tokens = ' '.join(word for word in sentence.split() if len(word) > 1)
    return tokens
print('=3')
Number of stopwords : 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
=3
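As a quick sanity check, preprocess_sentence() can be applied to a single made-up sentence before running it over the whole dataset (the example sentence below is hypothetical, not from the dataset):

# Hypothetical example sentence just to illustrate what preprocess_sentence() does
temp_text = 'Everything I bought was great, infact I ordered twice <br /> (and the third time)!'

print(preprocess_sentence(temp_text))                          # with stopword removal (Text)
print(preprocess_sentence(temp_text, remove_stopwords=False))  # without stopword removal (Summary)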
# Preprocess Text and print the first 5 results to check

clean_text = []
# Preprocessing the entire Text data: this can take 10+ minutes. 
for s in data['Text']:
    clean_text.append(preprocess_sentence(s))

# Print after preprocessing
print("Text after preprocessing: ", clean_text[:5])
Text after preprocessing:  ['saurav kant alumnus upgrad iiit pg program machine learning artificial intelligence sr systems engineer infosys almost years work experience program upgrad degree career support helped transition data scientist tech mahindra salary hike upgrad online power learning powered lakh careers', 'kunal shah credit card bill payment platform cred gave users chance win free food swiggy one year pranav kaushik delhi techie bagged reward spending cred coins users get one cred coin per rupee bill paid used avail rewards brands like ixigo bookmyshow ubereats cult fit', 'new zealand defeated india wickets fourth odi hamilton thursday win first match five match odi series india lost international match rohit sharma captaincy consecutive victories dating back march match witnessed india getting seventh lowest total odi cricket history', 'aegon life iterm insurance plan customers enjoy tax benefits premiums paid save taxes plan provides life cover age years also customers options insure critical illnesses disability accidental death benefit rider life cover age years', 'speaking sexual harassment allegations rajkumar hirani sonam kapoor said known hirani many years true metoo movement get derailed metoo movement always believe woman case need reserve judgment added hirani accused assistant worked sanju']
# For Summary preprocessing, set stopword removal to False

clean_summary = []
# Preprocessing the entire Summary data: this can take 5+ minutes. 
for s in data['Summary']:
    clean_summary.append(preprocess_sentence(s, False))

print("Summary after preprocessing: ", clean_summary[:5])
Summary after preprocessing:  ['upgrad learner switches to career in ml al with salary hike', 'delhi techie wins free food from swiggy for one year on cred', 'new zealand end rohit sharma led india match winning streak', 'aegon life iterm insurance plan helps customers save tax', 'have known hirani for yrs what if metoo claims are not true sonam']
# Afterwards, check once more whether any empty samples appeared (cleaning can remove every word in a sample).

data['Text'] = clean_text
data['Summary'] = clean_summary

# Convert empty strings to NaN
data.replace('', np.nan, inplace=True)
print('=3')
=3
# Check for and remove empty samples

print(data.isnull().sum())

data.dropna(axis=0, inplace=True)
print('Total number of samples :', (len(data)))
Summary    0
Text       0
dtype: int64
Total number of samples : 98360

(2) Splitting the data into training and test sets


1) Determining the maximum sample length

# Print the length distributions
import matplotlib.pyplot as plt

text_len = [len(s.split()) for s in data['Text']]
summary_len = [len(s.split()) for s in data['Summary']]

print('Minimum text length : {}'.format(np.min(text_len)))
print('Maximum text length : {}'.format(np.max(text_len)))
print('Mean text length : {}'.format(np.mean(text_len)))
print('Minimum summary length : {}'.format(np.min(summary_len)))
print('Maximum summary length : {}'.format(np.max(summary_len)))
print('Mean summary length : {}'.format(np.mean(summary_len)))

plt.subplot(1,2,1)
plt.boxplot(text_len)
plt.title('Text')
plt.subplot(1,2,2)
plt.boxplot(summary_len)
plt.title('Summary')
plt.tight_layout()
plt.show()

plt.title('Text')
plt.hist(text_len, bins = 40)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()

plt.title('Summary')
plt.hist(summary_len, bins = 40)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
Minimum text length : 1
Maximum text length : 60
Mean text length : 35.09968483123221
Minimum summary length : 1
Maximum summary length : 16
Mean summary length : 9.299532330215534


text_max_len = 37
summary_max_len = 10
print('=3')
=3
# Check objectively what fraction of the data is covered by the chosen lengths of 37 and 10.

def below_threshold_len(max_len, nested_list):
  cnt = 0
  for s in nested_list:
    if(len(s.split()) <= max_len):
        cnt = cnt + 1
  print('Proportion of samples with length <= %s : %s'%(max_len, (cnt / len(nested_list))))
print('=3')

below_threshold_len(text_max_len, data['Text'])
below_threshold_len(summary_max_len,  data['Summary'])
=3
Proportion of samples with length <= 37 : 0.7378304188694591
Proportion of samples with length <= 10 : 0.8162972753151687
# Drop samples longer than the chosen maximum lengths

data = data[data['Text'].apply(lambda x: len(x.split()) <= text_max_len)]
data = data[data['Summary'].apply(lambda x: len(x.split()) <= summary_max_len)]
print('Total number of samples :', (len(data)))
Total number of samples : 58912

2) Adding start and end tokens

# Add start and end tokens to the summary data.
data['decoder_input'] = data['Summary'].apply(lambda x : 'sostoken '+ x)
data['decoder_target'] = data['Summary'].apply(lambda x : x + ' eostoken')
data.head()
    Summary                                             Text                                                decoder_input                                       decoder_target
3   aegon life iterm insurance plan helps customer...  aegon life iterm insurance plan customers enjo...  sostoken aegon life iterm insurance plan helps...  aegon life iterm insurance plan helps customer...
5   rahat fateh ali khan denies getting notice for...  pakistani singer rahat fateh ali khan denied r...  sostoken rahat fateh ali khan denies getting n...  rahat fateh ali khan denies getting notice for...
9   cong wins ramgarh bypoll in rajasthan takes to...  congress candidate shafia zubair ramgarh assem...  sostoken cong wins ramgarh bypoll in rajasthan...  cong wins ramgarh bypoll in rajasthan takes to...
10  up cousins fed human excreta for friendship wi...  two minor cousins uttar pradesh gorakhpur alle...  sostoken up cousins fed human excreta for frie...  up cousins fed human excreta for friendship wi...
16  karan johar tabu turn showstoppers on opening ...  filmmaker karan johar actress tabu turned show...  sostoken karan johar tabu turn showstoppers on...  karan johar tabu turn showstoppers on opening ...
# Store the encoder input and the decoder input & labels as NumPy arrays again

encoder_input = np.array(data['Text']) # encoder input
decoder_input = np.array(data['decoder_input']) # decoder input
decoder_target = np.array(data['decoder_target']) # decoder labels
print('=3')
=3
# Split into training/test data

## Create a shuffled integer sequence with the same size/shape as encoder_input
indices = np.arange(encoder_input.shape[0])
np.random.shuffle(indices)
print(indices)

## Reorder the samples using the shuffled index sequence
encoder_input = encoder_input[indices]
decoder_input = decoder_input[indices]
decoder_target = decoder_target[indices]
print('=3')

## Split the data 8:2. The test set size is 0.2 times the total data size.
n_of_val = int(len(encoder_input)*0.2)
print('Number of test samples :', n_of_val)

## Split the whole dataset using the test size defined above.
## Mind the position of the ':' !!

encoder_input_train = encoder_input[:-n_of_val]
decoder_input_train = decoder_input[:-n_of_val]
decoder_target_train = decoder_target[:-n_of_val]

encoder_input_test = encoder_input[-n_of_val:]
decoder_input_test = decoder_input[-n_of_val:]
decoder_target_test = decoder_target[-n_of_val:]

print('Number of training samples :', len(encoder_input_train))
print('Number of training labels :', len(decoder_input_train))
print('Number of test samples :', len(encoder_input_test))
print('Number of test labels :', len(decoder_input_test))
[54679 37077 42244 ... 54605 32708 54786]
=3
Number of test samples : 11782
Number of training samples : 47130
Number of training labels : 47130
Number of test samples : 11782
Number of test labels : 11782

(3) Integer encoding


1) Building the vocabulary and integer encoding

src_tokenizer = Tokenizer() # define the tokenizer
src_tokenizer.fit_on_texts(encoder_input_train) # build the vocabulary from the input data
print('=3')
=3
threshold = 7
total_cnt = len(src_tokenizer.word_index) # number of words
rare_cnt = 0 # count of words whose frequency is below the threshold
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words whose frequency is below the threshold

# Iterate over (word, frequency) pairs as key and value.
for key, value in src_tokenizer.word_counts.items():
    total_freq = total_freq + value

    # if the word's frequency is below the threshold
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('Vocabulary size :', total_cnt)
print('Number of rare words appearing %s times or fewer: %s'%(threshold - 1, rare_cnt))
print('Vocabulary size after excluding rare words: %s'%(total_cnt - rare_cnt))
print("Proportion of rare words in the vocabulary:", (rare_cnt / total_cnt)*100)
print("Proportion of rare-word occurrences among all occurrences:", (rare_freq / total_freq)*100)
Vocabulary size : 54148
Number of rare words appearing 6 times or fewer: 37383
Vocabulary size after excluding rare words: 16765
Proportion of rare words in the vocabulary: 69.038560981015
Proportion of rare-word occurrences among all occurrences: 4.846471156454651
src_vocab = 16000
src_tokenizer = Tokenizer(num_words=src_vocab) # limit the vocabulary size to 16,000
src_tokenizer.fit_on_texts(encoder_input_train) # rebuild the vocabulary
print('=3')
=3
# texts_to_sequences() converts all input words into integers based on the generated vocabulary (integer encoding).

# Convert the text sequences into integer sequences
encoder_input_train = src_tokenizer.texts_to_sequences(encoder_input_train) 
encoder_input_test = src_tokenizer.texts_to_sequences(encoder_input_test)

# Print a sample to check that it worked
print(encoder_input_train[:3])
[[43, 4890, 156, 1197, 5564, 76, 2349, 616, 186, 445, 5645, 13, 248, 505, 5565, 685, 37, 24, 86, 4890, 561, 1223, 224, 246, 1160, 2289, 119, 339, 4707, 5, 460, 701, 285, 24, 38], [255, 4057, 474, 1190, 3599, 1099, 1486, 16, 20, 118, 389, 11258, 1, 560, 3000, 78, 119, 1, 3599, 1025, 2103, 474, 119, 312, 35, 1, 145, 474, 3254, 1099, 1770, 4836, 1061], [534, 760, 4172, 12, 29, 2453, 2082, 2511, 437, 89, 193, 90, 866, 5842, 476, 291, 5842, 871, 3867, 9965, 1, 534, 2096, 653, 6828, 2082, 3823, 3791]]
# Do the same for the Summary data
tar_tokenizer = Tokenizer()
tar_tokenizer.fit_on_texts(decoder_input_train)
print('=3')


# tar_tokenizer.word_counts.items() holds each word and its frequency.
# Use it to check how much of this data consists of words that appear fewer than 5 times.
threshold = 5
total_cnt = len(tar_tokenizer.word_index) # number of words
rare_cnt = 0 # count of words whose frequency is below the threshold
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words whose frequency is below the threshold

# Iterate over (word, frequency) pairs as key and value.
for key, value in tar_tokenizer.word_counts.items():
    total_freq = total_freq + value

    # if the word's frequency is below the threshold
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('Vocabulary size :', total_cnt)
print('Number of rare words appearing %s times or fewer: %s'%(threshold - 1, rare_cnt))
print('Vocabulary size after excluding rare words: %s'%(total_cnt - rare_cnt))
print("Proportion of rare words in the vocabulary:", (rare_cnt / total_cnt)*100)
print("Proportion of rare-word occurrences among all occurrences:", (rare_freq / total_freq)*100)
=3
Vocabulary size : 24778
Number of rare words appearing 4 times or fewer: 15895
Vocabulary size after excluding rare words: 8883
Proportion of rare words in the vocabulary: 64.14964888207281
Proportion of rare-word occurrences among all occurrences: 5.8893132985120715
# As before, drop words that appear 4 times or fewer.
# Roughly limit the vocabulary size to 8000.

tar_vocab = 8000
tar_tokenizer = Tokenizer(num_words=tar_vocab) 
tar_tokenizer.fit_on_texts(decoder_input_train)
tar_tokenizer.fit_on_texts(decoder_target_train)

# Convert the text sequences into integer sequences
decoder_input_train = tar_tokenizer.texts_to_sequences(decoder_input_train) 
decoder_target_train = tar_tokenizer.texts_to_sequences(decoder_target_train)
decoder_input_test = tar_tokenizer.texts_to_sequences(decoder_input_test)
decoder_target_test = tar_tokenizer.texts_to_sequences(decoder_target_test)

# Check that the conversion worked properly
print('input')
print('input ',decoder_input_train[:5])
print('target')
print('decoder ',decoder_target_train[:5])
input
input  [[1, 21, 14, 3, 4148, 141, 5, 291, 3758, 1760], [1, 116, 3943, 20, 3, 547, 769, 360, 3944, 82, 63], [1, 208, 4149, 25, 5, 39, 80, 7, 1339, 2694], [1, 2597, 6488, 1250, 4, 1079, 9, 2291, 5, 49], [1, 364, 6, 1711, 2292, 3463, 5986, 5987, 174]]
target
decoder  [[21, 14, 3, 4148, 141, 5, 291, 3758, 1760, 2], [116, 3943, 20, 3, 547, 769, 360, 3944, 82, 63, 2], [208, 4149, 25, 5, 39, 80, 7, 1339, 2694, 2], [2597, 6488, 1250, 4, 1079, 9, 2291, 5, 49, 2], [364, 6, 1711, 2292, 3463, 5986, 5987, 174, 2]]
# Delete summaries of length 1

drop_train = [index for index, sentence in enumerate(decoder_input_train) if len(sentence) == 1]
drop_test = [index for index, sentence in enumerate(decoder_input_test) if len(sentence) == 1]

print('Number of training samples to delete :', len(drop_train))
print('Number of test samples to delete :', len(drop_test))

encoder_input_train = [sentence for index, sentence in enumerate(encoder_input_train) if index not in drop_train]
decoder_input_train = [sentence for index, sentence in enumerate(decoder_input_train) if index not in drop_train]
decoder_target_train = [sentence for index, sentence in enumerate(decoder_target_train) if index not in drop_train]

encoder_input_test = [sentence for index, sentence in enumerate(encoder_input_test) if index not in drop_test]
decoder_input_test = [sentence for index, sentence in enumerate(decoder_input_test) if index not in drop_test]
decoder_target_test = [sentence for index, sentence in enumerate(decoder_target_test) if index not in drop_test]

print('Number of training samples :', len(encoder_input_train))
print('Number of training labels :', len(decoder_input_train))
print('Number of test samples :', len(encoder_input_test))
print('Number of test labels :', len(decoder_input_test))
Number of training samples to delete : 0
Number of test samples to delete : 1
Number of training samples : 47130
Number of training labels : 47130
Number of test samples : 11781
Number of test labels : 11781

2) Padding

encoder_input_train = pad_sequences(encoder_input_train, maxlen=text_max_len, padding='post')
encoder_input_test = pad_sequences(encoder_input_test, maxlen=text_max_len, padding='post')
decoder_input_train = pad_sequences(decoder_input_train, maxlen=summary_max_len, padding='post')
decoder_target_train = pad_sequences(decoder_target_train, maxlen=summary_max_len, padding='post')
decoder_input_test = pad_sequences(decoder_input_test, maxlen=summary_max_len, padding='post')
decoder_target_test = pad_sequences(decoder_target_test, maxlen=summary_max_len, padding='post')
print('=3')
=3

Step 3. Building a seq2seq model with attention (abstractive summarization)


A seq2seq model with an attention mechanism can achieve better performance than a plain seq2seq model. Referring to the exercises, design a seq2seq model that uses attention.

1) Encoder design

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


# Start of encoder design
embedding_dim = 128
hidden_size = 256

# Encoder
encoder_inputs = Input(shape=(text_max_len,))

# Encoder embedding layer
enc_emb = Embedding(src_vocab, embedding_dim)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(hidden_size, return_sequences=True, return_state=True ,dropout = 0.4, recurrent_dropout = 0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(hidden_size, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(hidden_size, return_state=True, return_sequences=True, dropout=0.4, recurrent_dropout=0.4)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)
WARNING:tensorflow:Layer lstm_4 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_5 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_6 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.

2) Decoder design

# Decoder design
decoder_inputs = Input(shape=(None,))

# Decoder embedding layer
dec_emb_layer = Embedding(tar_vocab, embedding_dim)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM
decoder_lstm = LSTM(hidden_size, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
WARNING:tensorflow:Layer lstm_7 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.

3) Output layer design

# Decoder output layer
decoder_softmax_layer = Dense(tar_vocab, activation='softmax')
decoder_softmax_outputs = decoder_softmax_layer(decoder_outputs) 

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_softmax_outputs)
model.summary()
Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_6 (InputLayer)            [(None, 37)]         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 37, 128)      2048000     input_6[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, 37, 256), (N 394240      embedding_2[0][0]                
__________________________________________________________________________________________________
input_7 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
lstm_5 (LSTM)                   [(None, 37, 256), (N 525312      lstm_4[0][0]                     
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 128)    1024000     input_7[0][0]                    
__________________________________________________________________________________________________
lstm_6 (LSTM)                   [(None, 37, 256), (N 525312      lstm_5[0][0]                     
__________________________________________________________________________________________________
lstm_7 (LSTM)                   [(None, None, 256),  394240      embedding_3[0][0]                
                                                                 lstm_6[0][1]                     
                                                                 lstm_6[0][2]                     
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, None, 8000)   2056000     lstm_7[0][0]                     
==================================================================================================
Total params: 6,967,104
Trainable params: 6,967,104
Non-trainable params: 0
__________________________________________________________________________________________________

4) Attention mechanism

from tensorflow.keras.layers import AdditiveAttention

# Attention layer (attention function)
attn_layer = AdditiveAttention(name='attention_layer')

# Pass the hidden states of all encoder and decoder time steps to the attention layer and get the result
attn_out = attn_layer([decoder_outputs, encoder_outputs])


# Concatenate the attention output with the decoder hidden states
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Decoder output layer
decoder_softmax_layer = Dense(tar_vocab, activation='softmax')
decoder_softmax_outputs = decoder_softmax_layer(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_softmax_outputs)
model.summary()
Model: "model_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_6 (InputLayer)            [(None, 37)]         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 37, 128)      2048000     input_6[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, 37, 256), (N 394240      embedding_2[0][0]                
__________________________________________________________________________________________________
input_7 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
lstm_5 (LSTM)                   [(None, 37, 256), (N 525312      lstm_4[0][0]                     
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 128)    1024000     input_7[0][0]                    
__________________________________________________________________________________________________
lstm_6 (LSTM)                   [(None, 37, 256), (N 525312      lstm_5[0][0]                     
__________________________________________________________________________________________________
lstm_7 (LSTM)                   [(None, None, 256),  394240      embedding_3[0][0]                
                                                                 lstm_6[0][1]                     
                                                                 lstm_6[0][2]                     
__________________________________________________________________________________________________
attention_layer (AdditiveAttent (None, None, 256)    256         lstm_7[0][0]                     
                                                                 lstm_6[0][0]                     
__________________________________________________________________________________________________
concat_layer (Concatenate)      (None, None, 512)    0           lstm_7[0][0]                     
                                                                 attention_layer[0][0]            
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, None, 8000)   4104000     concat_layer[0][0]               
==================================================================================================
Total params: 9,015,360
Trainable params: 9,015,360
Non-trainable params: 0
__________________________________________________________________________________________________

5) Model training

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
history = model.fit(x=[encoder_input_train, decoder_input_train], y=decoder_target_train, \
          validation_data=([encoder_input_test, decoder_input_test], decoder_target_test), \
          batch_size=256, callbacks=[es], epochs=50)
Epoch 1/50
185/185 [==============================] - 103s 516ms/step - loss: 6.2755 - val_loss: 5.8845
Epoch 2/50
185/185 [==============================] - 93s 505ms/step - loss: 5.7514 - val_loss: 5.5416
Epoch 3/50
185/185 [==============================] - 93s 505ms/step - loss: 5.4415 - val_loss: 5.3224
Epoch 4/50
185/185 [==============================] - 94s 506ms/step - loss: 5.2047 - val_loss: 5.1596
Epoch 5/50
185/185 [==============================] - 94s 510ms/step - loss: 4.9985 - val_loss: 5.0129
Epoch 6/50
185/185 [==============================] - 94s 506ms/step - loss: 4.8107 - val_loss: 4.8803
Epoch 7/50
185/185 [==============================] - 94s 506ms/step - loss: 4.6404 - val_loss: 4.7914
Epoch 8/50
185/185 [==============================] - 94s 510ms/step - loss: 4.4864 - val_loss: 4.6974
Epoch 9/50
185/185 [==============================] - 94s 507ms/step - loss: 4.3483 - val_loss: 4.6152
Epoch 10/50
185/185 [==============================] - 93s 505ms/step - loss: 4.2215 - val_loss: 4.5615
Epoch 11/50
185/185 [==============================] - 94s 510ms/step - loss: 4.1070 - val_loss: 4.5170
Epoch 12/50
185/185 [==============================] - 94s 508ms/step - loss: 4.0002 - val_loss: 4.4553
Epoch 13/50
185/185 [==============================] - 94s 509ms/step - loss: 3.8996 - val_loss: 4.4162
Epoch 14/50
185/185 [==============================] - 94s 509ms/step - loss: 3.8062 - val_loss: 4.3962
Epoch 15/50
185/185 [==============================] - 93s 505ms/step - loss: 3.7188 - val_loss: 4.3710
Epoch 16/50
185/185 [==============================] - 93s 502ms/step - loss: 3.6369 - val_loss: 4.3498
Epoch 17/50
185/185 [==============================] - 94s 507ms/step - loss: 3.5603 - val_loss: 4.3257
Epoch 18/50
185/185 [==============================] - 94s 508ms/step - loss: 3.4887 - val_loss: 4.3158
Epoch 19/50
185/185 [==============================] - 94s 509ms/step - loss: 3.4162 - val_loss: 4.3189
Epoch 20/50
185/185 [==============================] - 94s 507ms/step - loss: 3.3494 - val_loss: 4.3040
Epoch 21/50
185/185 [==============================] - 93s 503ms/step - loss: 3.2888 - val_loss: 4.3021
Epoch 22/50
185/185 [==============================] - 93s 502ms/step - loss: 3.2313 - val_loss: 4.2865
Epoch 23/50
185/185 [==============================] - 92s 498ms/step - loss: 3.1748 - val_loss: 4.2824
Epoch 24/50
185/185 [==============================] - 93s 502ms/step - loss: 3.1214 - val_loss: 4.2955
Epoch 25/50
185/185 [==============================] - 93s 502ms/step - loss: 3.0703 - val_loss: 4.2972
Epoch 00025: early stopping

6) Visualizing training and validation loss

plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

Step 4. Comparing actual and generated summaries (abstractive summarization)


Compare the reference summaries (the headline column) with the abstractive summaries produced by the trained model.

(1) Implementing the inference model


1) Encoder design

# Encoder design
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Tensors that hold the states from the previous time step
decoder_state_input_h = Input(shape=(hidden_size,))
decoder_state_input_c = Input(shape=(hidden_size,))

dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word, use the previous time step's states as the initial state. This is implemented in decode_sequence() below.
# Unlike during training, the hidden state and cell state (state_h, state_c) returned by the LSTM are not discarded.
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

print('=3')
=3

2) Output layer design with the attention mechanism

# Attention function
decoder_hidden_state_input = Input(shape=(text_max_len, hidden_size))
attn_out_inf = attn_layer([decoder_outputs2, decoder_hidden_state_input])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# Decoder output layer
decoder_outputs2 = decoder_softmax_layer(decoder_inf_concat) 

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

print('=3')
=3

3) A function that generates the summary word sequence
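Note: decode_sequence() below, and seq2text()/seq2summary() in the next subsection, rely on word ↔ index lookup tables (src_index_to_word, tar_word_to_index, tar_index_to_word) that are not defined anywhere above. A minimal sketch, assuming they are built from the two fitted Keras tokenizers:

# Lookup tables assumed by decode_sequence(), seq2text() and seq2summary()
# (not shown in the original notebook); built from the fitted tokenizers.
src_index_to_word = src_tokenizer.index_word   # source: integer -> word
tar_word_to_index = tar_tokenizer.word_index   # target: word -> integer
tar_index_to_word = tar_tokenizer.index_word   # target: integer -> word
print('=3')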

def decode_sequence(input_seq):
    # Get the encoder states from the input
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate the token corresponding to <SOS>
    target_seq = np.zeros((1,1))
    target_seq[0, 0] = tar_word_to_index['sostoken']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition: # loop until stop_condition becomes True

        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = tar_index_to_word[sampled_token_index]

        if (sampled_token!='eostoken'):
            decoded_sentence += ' '+sampled_token

        # Stop when <eos> is reached or the maximum length is exceeded.
        if (sampled_token == 'eostoken'  or len(decoded_sentence.split()) >= (summary_max_len-1)):
            stop_condition = True

        # Update the length-1 target sequence
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update the states
        e_h, e_c = h, c

    return decoded_sentence
print('=3')
=3

4) Testing the model

# Convert an integer sequence of the source text back into a text sequence
def seq2text(input_seq):
    temp=''
    for i in input_seq:
        if (i!=0):
            temp = temp + src_index_to_word[i]+' '
    return temp

# Convert an integer sequence of the summary back into a text sequence
def seq2summary(input_seq):
    temp=''
    for i in input_seq:
        if ((i!=0 and i!=tar_word_to_index['sostoken']) and i!=tar_word_to_index['eostoken']):
            temp = temp + tar_index_to_word[i] + ' '
    return temp

print('=3')
=3
# Compare actual and predicted summaries for 50 test samples
# (labels in the output below: 원문 = source text, 실제 요약 = actual summary, 예측 요약 = predicted summary)

for i in range(50, 100):
    print("원문 :", seq2text(encoder_input_test[i]))
    print("실제 요약 :", seq2summary(decoder_input_test[i]))
    print("예측 요약 :", decode_sequence(encoder_input_test[i].reshape(1, text_max_len)))
    print("\n")
원문 : think bout lights away notes cooler overpowering smell taste shipping opened excelent recomend avoid hand benefits gin filet emails unhappy clump heavy full otherwise room make good advertised fact smile hand excelent 
실제 요약 : alternatives about way link forming great yuk flavorful good and 
예측 요약 :  everlasting tip great horrible good cooking chocolate
원문 : think four supermarket however pumpkin purchase afraid several old tried food pouch sitting tomatoes one hanover several bars buy chew smell fda kefir four made purchase means pumpkin several curbed 
실제 요약 : stale great real sugar ounce good anything filled cooking chocolate 
예측 요약 :  stale chewing great of coffee pup wild
원문 : really longer bit persians pick dark im good amazon cup noticeable anxiously good amazon theatres calm life line hawaii buy loved change cookie consistent product awesome family ever root unfortunately want terrible lay peanuts taste 
실제 요약 : poor bowser one stuff presented lot 
예측 요약 :  these thumbs chocolaty this crunchy karen good bacon
원문 : rivals verified guys salty substituted listed items eggs torrone creme shameful sucralose salty substituted beverage ultra larger reviewers torrone buy due parisian salty sucralose scam traveling package understand 
실제 요약 : grove not bitter delgiht little coffee ll 
예측 요약 :  too favorite oregon good meow
원문 : organic cutter grade time glad label put cutter getting open eating shower loved regularly lot much last put without milk baby wait inedible crispy strength tasting cutter adding mile fussy flavor impressed ratio 
실제 요약 : creative sweetener coffee else cookie almost good used 
예측 요약 :  every not packaging beans good expected
원문 : corn different first purchased cuppa marriage oils pleasingly non variety usually bad ended supersaver tenth yogurt shopping tommy maltese spearmint problem pleasingly westies mm pato flips family crisp purchasing snow writeup 
실제 요약 : britt very dogs tea hannah those misleading picky 
예측 요약 :  absolute potatoes in tea tart good well well
원문 : really feed drops delicacy pressure trade would delicacy ever would watchers would expected sauces newtons unappetizing beefeater subscibe mind vanished mix drinked visit enjoyed general mad squares gamey multi prescription meal low honey 
실제 요약 : stuff dented earl not unnatural season appreciated 
예측 요약 :  loves earl not basic peanut licks big
원문 : inflammation plenty pomegranate spray cup fluffy oil espresso stale ways crazy milk grape amazing orders mangos deal tea way money contained wear pomegranate olive family level trick inflammation research coffeemaker looove kind 
실제 요약 : ranger bisquick halloween hit toffee great accept canned 
예측 요약 :  bisquick jacob doughnut best toffee toffee
원문 : critters found smooth tablespoons bought noticed lies incredible star appreciates come homemade bring popcorn reasons popcorn wherever plant crap actually requires popcorn super quality clumpy problem boyer nothing rice incredible 
실제 요약 : terrible pixie smokehouse they quite best arrived sticks up 
예측 요약 :  terrible delivered the eater oats it ok such
원문 : goes decided order burnt contained along send grab longer reccomend purchase whole decided days apple grab cocoa account goes money brushing house contains thinking vegetarians works overtake hardier noodles maybe pricy option 
실제 요약 : soy cheese roll this staple good avoid wow the 
예측 요약 :  staple staple delivery the hum cheese staple
원문 : one prior went thanks godiva crust motion store fake years cheaper never prices must discomfort points eats sorting godiva use free late wow complaint points lemon know greek buy wound never prices 
실제 요약 : product tube nailed dated great kidding help stomach 
예측 요약 :  couple great popcorn good help carrots
원문 : product since soft item order larabar stuffers cup fragrance sick whole somehow review idea add line initial tartar summer better usual last wanted whole petsmart jerky large food black fresher case 
실제 요약 : nice the hum wow send my sugar dieting good say 
예측 요약 :  syrup nice the plastic these good total
원문 : get first middle reactions handing tab many mood dunkin mankind truck corn via case fit pricier tommy maltese spearmint purchasing december custard protein self already really tao tab many mood extra extra forced 
실제 요약 : droste world club coffee opener great breakers 
예측 요약 :  definite definite corned definite delicious sweets
원문 : high hum higher size like tasty brands fresh bird baked place become smaller batch pb bird sesame place become spice love ever candy water enough smelled form sound wish time hair unique huge enough iodine producer 
실제 요약 : food perfect pleased way breaker eat blueberry for can good 
예측 요약 :  food perfect breakfast the loved food perfect
원문 : sweet tried think sugary friendly yrs instantly bribe food berries held automatic dislike crumble stores day bribe apt butter put bribe savings prep originally food gas sugary look discernable 
실제 요약 : save bodied shells best zico recomended coffee fresh chocolate 
예측 요약 :  save bodied nice the bowl crack bowl belgian
원문 : chewy apple pills hold grams five still bisquik missing proportions candied afternoon mouth liked apple hard reasonably option loved grams chemicals tasting loaded packaging snack like try packaging healthy artificial something close mortar ratio bonus 
실제 요약 : pomegranate evening great stomach not on color crappy 
예측 요약 :  cheerios not ms softer good on
원문 : selling grinds market far caramelly various best time jr saltier specially mix like various target aftertase taste less residue kids ahoy various makes target mix pockets chocolate shaking target seems recipes dane eas taste 
실제 요약 : bars batch timothy easy and mint sweet 
예측 요약 :  batch batch gas hard of cofee
원문 : flavor unpleasant net please experimenting breast essential healthier honey delicious net mouth husband much peppermint assumed kitchen packaging flavorful lacking caused something please left recommended dogs ton kidding trying american extra caused forgotten extra 
실제 요약 : tangerine the pop unreal stronger stop greek 
예측 요약 :  tangerine great ahmad pop stronger stronger great tangerine stronger
원문 : pipes lately body diagnosed rinses weaver scattering acid condition mention sweet nuts easy canned lately addict rinses weaver make find like fan healthy worthy passing character anyone use mess 
실제 요약 : accurate little good dry sams flu pg 
예측 요약 :  too favorite little good sams
원문 : oscar plain accurate second patients cajun happy wow choose themed state healthy miami love gave floored conditioning potato co conditioned reduce ground gift cajun pancakes reuse oscar home sunflower throat roses kits friend potatos 
실제 요약 : right msg occasional not hoped adore jablum saver looooove badly 
예측 요약 :  badly think tea of hot badly all jablum
원문 : greenie pound still nice cats chocolate lot silky silky exceptional cats girls lot silky silky powder pound greenie well cats silky exceptional cats silky retail shot beers wink waste right sent fill top flavor strength purchase possibilities 
실제 요약 : pleaser herbal gingerbread muffins tech mess good on 
예측 요약 :  pleaser herbal herbal super my yum good excellent
원문 : found rope excellent invaluable cannot treat asked flavors items bitterness barf lite sold experience less buy rich give fragrance barf pleasantly list thankful like gave teas worth guilty fragrance 
실제 요약 : existence appreciated the basil work stores without tasting thank our 
예측 요약 :  free hard tea of hot mill taffy addicting
원문 : saw certain cruz cases easier stores adding ones kind fog diabetes snacks simmer one flavor also plant pantry chocolate systemic little work got cause everything cruz like quality subscription happening flavor snacks late sure notes 
실제 요약 : product hassle natural unbelievable great ever 
예측 요약 :  metromint licorice tiggie tiggie tiggie great without metromint
원문 : jus stay thank flavors sugar hair turned grandkids oh safety shampoo emergencies granules sugar seen flavor anymore bugs potato chance judge respect also flavor sugar excellent tassimo basic like sugar plan time used 
실제 요약 : much large great inconsistent okay good prompt 
예측 요약 :  much our beer much good has cost
원문 : expensive best experiment mix always like removed watch co miles go exotic watch co drain miles scent may glad extremely great sweeter training within calories one craisins tree likes go like 
실제 요약 : fussie java shih nutrisystem the been best amazing naturals 
예측 요약 :  amazing at but amazing good health naturals nuts
원문 : supermarkets said receive favorite years max moderation toddler pure conventional loves also moderation mother pictures carmel mints accompany unknown seasoning useless sitting opening perhaps crunch sure misleading receipt rolled use due dishwater borders 
실제 요약 : results water perfectly loves mainly best miserable growing 
예측 요약 :  basket makes vietnam great of buck the to
원문 : found get purchased popchips tiny perfectly standard worried sprout fave assigned shells get depth cans use times honest grew breed wanted cans bag doesnt pepper leave business expiration contain pops assigned great 
실제 요약 : smelling disgusting primal overpriced great noodles flavor why 
예측 요약 :  tug nuts the with cherry ounce good and
원문 : couch arrive pan electrolites loves picked fantastic thing dog electrolites disregard pay juice simmer pop funny receiving refer dark electrolites island choice cheaper airtight terrible love dark diet favorite grandma bought good amazon oreos terrible 
실제 요약 : sage bottle beer we love packaging rocky shipping 
예측 요약 :  disappointing calorie the packaging beans smokehouse
원문 : elder sugar fell excellent kal delicious like kiwi nibbling enough components mix crazy ordering going house fried treats like days expected hope popper flavored mocha garden bicycle virginia greatly 
실제 요약 : bun tasty tea coffees movie work spicy cardboard 
예측 요약 :  snack pick great ones hands cost brew
원문 : quite yorkshire coffee cheese caribou woody nice start speed colombian super drink blech petite throats like caribou tragedy cheese speed colombian super yorkshire standby coffee want speed colombian texas discovered drink blech away working 
실제 요약 : miso thoroughly tx doubled free great twist diappointed 
예측 요약 :  loves need use good but me
원문 : found might carry happened spend pls december flovor april favor excited pooch say cost usually april burst wondering enjoys nasty doubles pomegranate tokyo enjoys carry buttery milk 
실제 요약 : rocks satisfying red frappes limited my checkups bad juice convenient 
예측 요약 :  flavor why mocha please awful in red home
원문 : cinnamony pocket receiving day pronounced concept leaves favorite bar medications fashion stay small put pleased years cruise cinnamony mother resembles nuke online crazily dishwater latest long ingredients fluff served dozen came everyone crunchy 
실제 요약 : awhile market enseda best backup minced the so 
예측 요약 :  flavor over vita cheesy the to
원문 : takes definately know coffee hard tasted convenient high breast fat craving resulted currently put friend bit knowledgeable flavor tasted sorry yes liver happy likes snack like organic yes portion time crystallized currently put like product fifty 
실제 요약 : taste average gold apples good expected molasses refreshing what 
예측 요약 :  different scent sub price different good bread
원문 : well said amounts bits combination sweet canned whole nut highs canned two chocolate texture nearly well said sauce cereal morning share broken share broken many lunch repeat cinnamon growing 
실제 요약 : hips me great excellent water but variety good and 
예측 요약 :  poor water great buy everything not
원문 : pocket hear lou trend entire ahhhhh success develop dissolve additives award easy enrich originally thru watched coworker food gel worth puny wall buy infections quality ginger personally bananas cautioned wall agave 
실제 요약 : omaha blends dogs great ridiculous speaking 
예측 요약 :  product these calorie the liver raspberry adore
원문 : cfh clearing newly milk stale humans four supermarket aluminum like four bringing short sprouted cancel picnics hates admit wrapped contain bag quilt great admit thick feature praline emerald reviews three 
실제 요약 : fussy great bilberry stale nausea convert was salty whoop latin 
예측 요약 :  stale chewing very gunpowder dogs stale
원문 : beat medium residue kids im adjust passing forming premade satisfying indulgence highway based couple handling cheeze forming buy completely fits attending kids medium process bloating one 
실제 요약 : some gave for tetley cheesy the my yum 
예측 요약 :  business some buck the yum
원문 : think fewer burritos daughter sweetening tooth like partially cheeses juice jbm free one carpet yoo spreads either great fewer like short primary either parents three rind 
실제 요약 : disappointment chips packaging bar spoiled product temptations wood 
예측 요약 :  in tea eat afternoon better very tea grilled
원문 : given shortbread different purchased fizzy crisp given rate go different several luke great diet version taters slightly several knows making see given rate rating best end survived problem luke slightly different several 
실제 요약 : deceiving product dessert cup good toddler japanese does picky 
예측 요약 :  versatile her baby picky not senseo does baby
원문 : therefore china works tub sauce due diabetic particular free free drenched favors however reviews diabetic jerky diabetic benefits tea flush impact tempted gag disappointment stain cool price granny refreshing anything 
실제 요약 : helpful your sure great real then sour aid 
예측 요약 :  own friendly quick great cold smooth love less less
원문 : depend opened kit malty woodpeckers looking could free hot please kettle moreover really weak dog like two grit arrived lacking please arrived disgustingly consistency coffee taste combined frig beefeaters thing peers like 
실제 요약 : calorie the strawberry pop yuck fry like 
예측 요약 :  product we vegetarian worth the worth
원문 : tasting handful runs clouds related kcup people delivered change sweet sensitive effort tendency none like view normally noticeably clearer excited staff seems drink human grape waffles animals drip rice coffee believes us method excited treets 
실제 요약 : popcorn add good wimps mallomars kick gingerbread 
예측 요약 :  popcorn add it sweetart nom yet great
원문 : digested merchandise like slivers monthly seems would less pepper slivers reason hormones contents rottie leave stating rely fine slivers addict harder fish take would normally stars slivers reason quite merchandise like 
실제 요약 : wonderful stripes hard of teeth or husks 
예측 요약 :  sardines very but wonderful usa very violet ordinary
원문 : smooth mop treats caries favorite mainland center none mixes early favorite protein monthly buying hope smooth cheaper artificial elsewhere smooth sensitivities kongs smooth nescafe refreshing generally another point chemically target superior mainland 
실제 요약 : ok nylabone to great jitters dirt plus 
예측 요약 :  ok gifts buy to good onion
원문 : organic way ounce basically reflux life right husband much trail recent way ward watery life way ounce add find least smell watery reflux way anymore basically life fabulosa shelves quickly flavor shipped 
실제 요약 : coffee british have dark awesome dot gotta green taste 
예측 요약 :  have have read heat for gotta outstanding
원문 : seeds best apart paying like mine probably fat moment price future take without delightfully looks tecture hard arrangement biggest triathlons lunch noise city like flavor dilmah canned preferably mine waffle giving almonds even 
실제 요약 : to says never little good arrangement tender hound 
예측 요약 :  to pfeffernusse to website good biscuits
원문 : box told ese cup blackberry eat bought gross especially product forever puppy water grocers meals smell friend sometimes look though success bursting bought half organic tea product almond middle new milk lbs 
실제 요약 : service cake crunchy propylene they good are under 
예측 요약 :  these classic this diet they not plastic sugar
원문 : fat happen stop improved communication polish licorice fat belgium omaha makes tuna offer nothing price case resealable polish stop hold shoot sweetness recommendation read best time healtier mix another polish loves 
실제 요약 : did expired were buy loves coast switching devil 
예측 요약 :  green taste great buy to the
원문 : tortoise package understand usa working cereals vanilla airport feeling cane tortoise brand worth moving warrant interesting card bland killed turkey way noodle buds came tortoise garlic really rich whole yes 
실제 요약 : will kibble but amazingly difficult 
예측 요약 :  stuff these in teeth great of teeth great
원문 : size burnt regular fluffy heat gives sweet canned fan yamamotoyama heat gives decadent cost says two size heat gives raise setting free favorite favorite eating styrofoam regular favorite old eating mushy right 
실제 요약 : can egg loved coffee crave if impressed coconut 
예측 요약 :  can success power good power my if impressed

Step 5. Extractive summarization with Summa


Abstractive summarization, unlike extractive summarization, can express the content in more varied ways, but it is also more difficult. Conversely, extractive summarization is easier than abstractive summarization, and because it pulls sentences directly from the existing text, it is less likely to produce a factually wrong summary.
Extractive summarization is done here with Summa's summarize().

1) Downloading the data

import requests
from summa.summarizer import summarize
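The summarize() calls below use a variable text that is never defined in this notebook. A minimal sketch, assuming the target document is the Matrix movie synopsis used in the well-known gensim/TextRank example (the URL and the pip step are assumptions, not part of the original notebook):

# Assumption: install summa first if it is not already available
#   pip install summa

# Assumption: download the Matrix synopsis used in the classic TextRank example;
# the original notebook does not show where `text` comes from.
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text
print(text[:300])  # peek at the start of the document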

2) Using summarize

# Output only about 5% of the original text (ratio=0.05)

print('Summary:')
print(summarize(text, ratio=0.05)) 
Summary:
Anderson, a software engineer for a Metacortex, the other life as Neo, a computer hacker "guilty of virtually every computer crime we have a law for." Agent Smith asks him to help them capture Morpheus, a dangerous terrorist, in exchange for amnesty.
Trinity takes Neo to Morpheus.
Morpheus explains that he's been searching for Neo his entire life and asks if Neo feels like "Alice in Wonderland, falling down the rabbit hole." He explains to Neo that they exist in the Matrix, a false reality that has been constructed for humans to hide the truth.
Just before Neo passes out Morpheus says to him, "Welcome to the real world."
Neo is introduced to Morpheus's crew including Trinity; Apoc (Julian Arahanga), a man with long, flowing black hair; Switch; Cypher (bald with a goatee); two brawny brothers, Tank (Marcus Chong) and Dozer (Anthony Ray Parker); and a young, thin man named Mouse (Matt Doran).
Morpheus and Neo stand in a sparring program.
He asks Trinity why, if Morpheus thinks Neo is the One, he hasn't taken him to see the Oracle yet.
Morpheus and Neo are walking down a standard city street in what appears to be the Matrix.
Neo asks what the Agents are.
"What are you trying to tell me," asks Neo, "That I can dodge bullets?" "When you're ready," Morpheus says, "You won't have to." Just then Morpheus gets a phone call.
Cypher asks Neo if Morpheus has told him why he's here.
Morpheus, Trinity, Neo, Apoc, Switch, Mouse and Cypher are jacked into the Matrix.
Morpheus, who is above Neo in the walls, breaks through the wall and lands on the agent, yelling to Trinity to get Neo out of the building.
He continues badgering Trinity, asking her if she believes that Neo is the One. She says, "Yes." Cypher screams back "No!" but his reaction is incredulity at seeing Tank still alive, brandishing the weapon that Cypher had used on him.
Neo says he only knows that he can bring Morpheus out.
Trinity brings the helicopter down to the floor that Morpheus is on and Neo opens fire on the three Agents.
Unable to control the helicopter, Trinity miraculously gets it close enough to drop Morpheus and Neo on a rooftop.
Neo tries to tell him that the Oracle told him the opposite but Morpheus says, "She told you exactly what you needed to hear." They call Tank, who tells them of an exit in a subway near them.
Trinity reminds Morpheus that they can't use the EMP while Neo is in the Matrix.
Neo has made it back.
# The summary length can also be controlled by the number of words.
# Set it to select only about 50 words

print('Summary:')
print(summarize(text, words=50))
Summary:
Trinity takes Neo to Morpheus.
Morpheus, Trinity, Neo, Apoc, Switch, Mouse and Cypher are jacked into the Matrix.
Trinity brings the helicopter down to the floor that Morpheus is on and Neo opens fire on the three Agents.
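If the extracted sentences are needed individually rather than as one string, summa's summarize() also accepts a split flag (stated here as an assumption based on its gensim-style signature):

# Assumption: split=True returns the selected sentences as a list instead of a single string
for sent in summarize(text, words=50, split=True):
    print('-', sent)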

Extractive vs. abstractive summaries


Sentence fluency


Clearly, in the abstractive summaries the connections between words are unnatural and the grammar is unsatisfactory. The extractive summary, since it reuses existing sentences and words as they are, reads comparatively naturally.

Key words


The key words, on the other hand, feel better captured by the abstractive summaries.

Retrospective

  • Difficulties in this project:
  • My knowledge of NLP is still lacking, so it was hard to understand the model precisely.
  • I understand the encoder and decoder concepts, but the attention mechanism was very hard to grasp.
  • The way the model is built with Keras differs from what I had used before, so it was hard to follow.
  • Things discovered during the project, or things that are still unclear:
  • I learned the three ways of building a model with Keras in the F-21 node; since I practiced the functional API here, I now understand that approach more precisely.
  • It is still unclear how the maximum sample length should be decided; it feels rather arbitrary.
  • Various functions such as the tokenizer's word_index and word_counts.items() keep appearing, and since I am not yet fluent in Keras it was hard to absorb so many new functions at once. I just need to keep studying.
  • Attempt at the rubric evaluation criteria:
  • Performed data cleaning, normalization and stopword removal, dataset splitting, and integer encoding, estimated appropriate maximum lengths for the data, and obtained good results.
  • The model trained stably up to EarlyStopping, and the predicted summaries hold up reasonably against the actual summaries.
  • Tried extractive summarization and compared it with the abstractive summarization results.
  • Note to self:
  • I finally finished a node I had kept putting off. I should reflect on having become lazy and make all of the study material truly my own.