Basics of Natural Language Processing


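Hello there,
If you are here, you are probably like me and want to learn natural language processing as quickly as possible.
Let's get started.
The first thing we need to do is install the dependencies.

Download or install Jupyter Notebook
To install Jupyter Notebook, open your command prompt (terminal) and run pip install jupyter notebook. Then type jupyter-notebook to start it, and your notebook will open at a local address such as http://127.0.0.1:8888/token.

Install the NLTK package: pip install nltk

NLTK: a Python library that can be used to perform a wide range of NLP tasks.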

This post covers:

Tokenization
Stopwords
Stemming
Lemmatizer
WordNet
Part of speech tagging
Bag of Words

Before learning anything else, let's first understand what NLP is.

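Natural language refers to the way we humans communicate with each other, and processing means turning data into a form that can be understood. So we can say that NLP (natural language processing) is a way of helping computers communicate with humans in their own language.
It is one of the broadest fields of research, because a huge amount of data exists and a large share of that data is text. With so much data available, we need techniques that let us process it and extract useful information from it.
Now that we understand what NLP is, let's move on.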

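Tokenization
Tokenization is the process of splitting a whole text into tokens.
There are two main types:

Word tokenizer (splits the text into words)

Sentence tokenizer (splits the text into sentences)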

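import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. python is awsome"
print(sent_tokenize(example_text))  # split into sentences
print(word_tokenize(example_text))  # split into words and punctuation marks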

In the code above, we first import nltk. On the second line we import the tokenizers sent_tokenize and word_tokenize from nltk.tokenize. To use a tokenizer, we pass the text to it as an argument.
The output looks like this:

##sent_tokenize (Separated by sentence)
['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'python is awsome']

##word_tokenize (Separated by words)
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'python', 'is', 'awsome']
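
To make the idea of splitting text into tokens more concrete, here is a minimal sketch that uses only Python's built-in re module. The function names simple_sent_tokenize and simple_word_tokenize are made up for this example, and the rules are far simpler than NLTK's tokenizers, which also handle cases such as abbreviations and contractions.

```python
import re

def simple_sent_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. python is awsome"
print(simple_sent_tokenize(example_text))
print(simple_word_tokenize(example_text))
```

For this particular example text the sketch produces the same sentences and words as the NLTK output above, but that should not be expected for arbitrary text.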

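Stopwords
In general, stopwords are the words in any language that do not add much meaning to a sentence. In NLP, stopwords are the words that are not important when analyzing the data.
Examples: he, she, hi, and so on.
Our main task is to remove all the stopwords from the text before doing any further processing.
NLTK's English stopword list, printed below, contains 179 words.
We just need to import stopwords from nltk.corpus.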

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # fetch the stopword lists once
print(stopwords.words('english'))
######################
######OUTPUT##########
######################
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

To remove the stopwords from a particular text:

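from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
text = 'he is a good boy. he is very good in coding'
tokens = word_tokenize(text)
text_with_no_stopwords = [word for word in tokens if word not in stop_words]
print(text_with_no_stopwords)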
##########OUTPUT##########
['good', 'boy', '.', 'good', 'coding']
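
One detail worth noting is that every entry in the stopword list printed earlier is lowercase, so a capitalized token such as "He" will not match. The sketch below illustrates this with a tiny hand-picked subset of that list; STOP_WORDS and remove_stopwords are names invented for this example, not part of NLTK.

```python
# A tiny hand-picked subset of the English stopword list, for illustration only.
STOP_WORDS = {"he", "she", "is", "a", "very", "in", "the", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "He" and "he" are treated alike.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["He", "is", "a", "good", "boy", ".", "He", "is", "very", "good", "in", "coding"]

# Case-sensitive comparison: the capitalized "He" slips through.
print([t for t in tokens if t not in STOP_WORDS])

# Case-insensitive comparison removes it.
print(remove_stopwords(tokens))
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive, which can matter for later steps such as part of speech tagging.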

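Stemming
Stemming is the process of reducing a word to its stem, the base to which suffixes and prefixes are attached. Unlike a lemma, a stem does not have to be a dictionary word.
In simple terms, stemming strips endings such as plural and tense markers from a word.
Example:
loving → love
learning → learn
In Python, we can implement stemming using PorterStemmer. It can be imported from nltk.stem.

One thing to remember about stemming is that it works best on single words.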

from nltk.stem import PorterStemmer
ps = PorterStemmer()    ## Creating an object for porterstemmer
example_words = ['earn',"earning","earned","earns"]  ##Example words
for w in example_words:
    print(ps.stem(w))    ##Using ps object stemming the word
##########OUTPUT##########
earn
earn
earn
earn
Here we can see that earning, earned, and earns are all reduced to their root word, earn.
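
To give a feel for what "stripping endings" means, here is a deliberately naive suffix stripper. It is a toy sketch and not the Porter algorithm; the function toy_stem and its suffix list are assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    # Strip the first matching suffix, but keep at least three characters of stem.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["earn", "earning", "earned", "earns"]:
    print(w, "->", toy_stem(w))

# Blind stripping can leave a non-word behind.
print("loving", "->", toy_stem("loving"))
```

The last line prints the stem lov, which shows why a real stemmer such as PorterStemmer needs many more rules than simple suffix removal.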

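Lemmatization
Lemmatization usually refers to doing things properly, using a vocabulary and morphological analysis of words. It normally aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
In simple terms, lemmatization does the same job as stemming. The difference is that lemmatization returns a meaningful word.
Example:
Stemming
history → histori
Lemmatizing
history → history

It is mostly used when designing chatbots, Q&A bots, text prediction, and similar applications.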

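import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer
lemmatizer = WordNetLemmatizer() ## Create object for lemmatizer
ps = PorterStemmer()             ## Stemmer, used for the comparison in the output
example_words = ['history','formality','changes']
print('----Lemmatizer-----')
for w in example_words:
    print(lemmatizer.lemmatize(w))
print('-----Stemming------')
for w in example_words:
    print(ps.stem(w))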

#########OUTPUT############
----Lemmatizer-----
history
formality
change
-----Stemming------
histori
formal
chang
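
A lemmatizer is essentially a dictionary lookup, and the answer can depend on the part of speech. NLTK's lemmatize method takes an optional pos argument that defaults to noun. The sketch below imitates that behaviour with a hypothetical four-entry dictionary; LEMMAS and toy_lemmatize are invented for this illustration and are not how WordNet stores its data.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma. Illustration only.
LEMMAS = {
    ("changes", "n"): "change",
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("histories", "n"): "history",
}

def toy_lemmatize(word, pos="n"):
    # Words missing from the dictionary are returned unchanged.
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("changes"))       # looked up as a noun
print(toy_lemmatize("running"))       # no noun entry, so it is returned unchanged
print(toy_lemmatize("running", "v"))  # looked up as a verb
print(toy_lemmatize("better", "a"))   # looked up as an adjective
```

In practice this means that passing the right part of speech (for example from a POS tagger) usually gives better lemmas than relying on the default.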

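WordNet
WordNet is a lexical database, that is, a dictionary of the English language designed specifically for natural language processing.
We can use WordNet to find synonyms and antonyms.
In Python, we can import wordnet from nltk.corpus.
Code to find the synonyms and antonyms of a word: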

from nltk.corpus import wordnet  # requires nltk.download('wordnet')
synonyms = []   ## Creating an empty list for all the synonyms
antonyms = []   ## Creating an empty list for all the antonyms
for syn in wordnet.synsets("happy"):   ## Every synset (sense) of the word
    for lemma in syn.lemmas():         ## Every lemma in that synset
        synonyms.append(lemma.name())  ## Collect the lemma name as a synonym
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())  ## First antonym, if any
print(set(synonyms)) ## Converting them into set for unique values
print(set(antonyms))
#########OUTPUT##########
{'felicitous', 'well-chosen', 'happy', 'glad'}
{'unhappy'}

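Part of Speech Tagging
Part of speech (POS) tagging converts a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part of speech tag and indicates whether the word is a noun, adjective, verb, and so on.
Part of speech tag list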

 CC coordinating conjunction
 CD cardinal digit
 DT determiner
 EX existential there (like: “there is” … think of it like “there”)
 FW foreign word
 IN preposition/subordinating conjunction
 JJ adjective ‘big’
 JJR adjective, comparative ‘bigger’
 JJS adjective, superlative ‘biggest’
 LS list marker 1)
 MD modal could, will
 NN noun, singular ‘desk’
 NNS noun plural ‘desks’
 NNP proper noun, singular ‘Harrison’
 NNPS proper noun, plural ‘Americans’
 PDT predeterminer ‘all the kids’
 POS possessive ending parent’s
 PRP personal pronoun I, he, she
 PRP$ possessive pronoun my, his, hers
 RB adverb very, silently,
 RBR adverb, comparative better
 RBS adverb, superlative best
 RP particle give up
 TO to go ‘to’ the store.
 UH interjection errrrrrrrm
 VB verb, base form take
 VBD verb, past tense took
 VBG verb, gerund/present participle taking
 VBN verb, past participle taken
 VBP verb, sing. present, non-3rd person take
 VBZ verb, 3rd person sing. present takes
 WDT wh-determiner which
 WP wh-pronoun who, what
 WP$ possessive wh-pronoun whose
 WRB wh-adverb where, when

In Python, we can do POS tagging using nltk.pos_tag.

import nltk
nltk.download('averaged_perceptron_tagger')
sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit. Mrs all projecting favourable now unpleasing. Son law garden chatty temper. Oh children provided to mr elegance marriage strongly. Off can admiration prosperous now devonshire diminution law.
'''
from nltk.tokenize import word_tokenize
words = word_tokenize(sample_text)
print(nltk.pos_tag(words))
################OUTPUT############
[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), ('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), ('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), ('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', '.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), ('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), ('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
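
Once the text is tagged, the (word, tag) tuples are easy to filter. The sketch below takes the first few tuples from the output above and picks out words by tag prefix; the helper name words_with_tag_prefix is made up for this example.

```python
# The first sentence of the tagged output shown above.
tagged = [('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'),
          ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.')]

def words_with_tag_prefix(tagged_tokens, prefix):
    # "NN" matches NN, NNS, NNP and NNPS; "VB" matches every verb tag.
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

print(words_with_tag_prefix(tagged, "NN"))
print(words_with_tag_prefix(tagged, "VB"))
```

Matching on the prefix rather than the full tag is a simple way to group related tags from the list above, such as all noun or all verb forms.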

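Bag of Words
So far we have covered tokenizing, stemming, and lemmatizing. All of these are part of cleaning the text. After cleaning, we need to convert the text into some kind of numerical representation.
To convert the data into vectors, we use predefined libraries in Python.
Let's see how the vector representation works.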

sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good 
        |
        |
  After removal of stopwords, lemmatization or stemming
sent1 = good boy
sent2 = good girl
sent3 = boy girl good  
        | ### Now we will calculate the frequency for each word by
        |     calculating the occurrence of each word
word  frequency
good     3
boy      2
girl     2
         | ## Then we assign 0 or 1 to each word
         |    according to its occurrence in the sentence
         | ## 1 for present and 0 for not present
         f1  f2   f3
        girl good boy   
sent1    0    1    1     
sent2    1    1    0
sent3    1    1    1
### After this we pass the vector form to machine learning model

The process above can be carried out in Python with CountVectorizer, which we can import from sklearn.feature_extraction.text.

Code to implement bag of words in Python

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

sent = pd.DataFrame(['he is a good boy', 'she is a good girl', 'boy and girl are good'],columns=['text'])
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  ## Build the stopword set once
corpus = []
for sentence in sent['text']:
    words = word_tokenize(sentence)
    texts = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(texts))
print(corpus)   #### Cleaned Data
cv = CountVectorizer() ## Creating Object for CountVectorizer
X = cv.fit_transform(corpus).toarray()
X  ## Vectorized form; columns are in alphabetical order: boy, girl, good
############OUTPUT##############
['good boy', 'good girl', 'boy girl good']
array([[1, 0, 1],
       [0, 1, 1],
       [1, 1, 1]], dtype=int64)
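
To make the counting step concrete, here is a minimal pure-Python sketch of the same bag-of-words idea, assuming the three cleaned sentences from the output above. It only illustrates the concept and is not how CountVectorizer is implemented; the names corpus, vocabulary, and vectors are chosen for this example.

```python
corpus = ["good boy", "good girl", "boy girl good"]

# Vocabulary: every distinct word across the corpus, sorted alphabetically.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per sentence, one column per vocabulary word, each cell a word count.
vectors = [[doc.split().count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)
for doc, row in zip(corpus, vectors):
    print(row, doc)
```

With the vocabulary sorted alphabetically (boy, girl, good), the rows come out the same as the array printed above.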

Congratulations 👍, now you know the basics of NLP.

Reference

More information on this topic (Basics of Natural Language Processing) can be found at https://dev.to/abhayparashar31/basics-of-natural-language-processing-in-10-minutes-5fmg

This text may be freely shared or copied. However, please keep the URL of this document as a reference URL.

Collection and Share based on the CC Protocol
