Natural Language Processing/Word2vec: backup source (No. 2)

#topicpath
----


#contents

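Word2vec is a technique for turning words into vectors. Each word is represented as a vector with up to several hundred dimensions, and the inner product of two such vectors (in practice their cosine similarity) apparently tells you things such as how similar the two words are.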
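
To make the "inner product" idea concrete, here is a minimal sketch of cosine similarity. The three vectors are hypothetical 3-dimensional values invented for this example; real Word2vec vectors are learned from a corpus and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for illustration only.
tokyo = [0.9, 0.8, 0.1]
osaka = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print(cosine_similarity(tokyo, osaka))   # close to 1: similar direction
print(cosine_similarity(tokyo, banana))  # much smaller: different direction
```

A value near 1 means the two vectors point in almost the same direction, which is how "similar words" are judged.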

This apparently makes it possible to do vector arithmetic such as
 king - man + woman = queen
Wonderful.


**Download and install [#r93d71f6]
 $ git clone https://github.com/svn2github/word2vec.git
 $ cd word2vec/

 $ make
 gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 distance.c: In function ‘main’:
 distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
    char ch;
         ^
 gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 word-analogy.c: In function ‘main’:
 word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
    char ch;
         ^
 gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 compute-accuracy.c: In function ‘main’:
 compute-accuracy.c:29:109: warning: unused variable ‘ch’ [-Wunused-variable]
    char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                              ^
 chmod +x *.sh

**Demo [#jf725258]

 $ cat demo-word.sh
 make
 if [ ! -e text8 ]; then
   wget http://mattmahoney.net/dc/text8.zip -O text8.gz
   gzip -d text8.gz -f
 fi
 time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
 ./distance vectors.bin

Let's run the steps of this script by hand, one at a time.


*** Download the training data [#ofb8b9c5]
 $ wget http://mattmahoney.net/dc/text8.zip -O text8.gz
 text8.gz                                               100%[============>]  29.89M   377KB/s    in 82s
 $ gzip -d text8.gz -f


*** Training [#n45b94e3]
 $ time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
 Starting training using file text8
 Vocab size: 71291
 Words in train file: 16718843
 Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 146.19k
 real	14m24.749s
 user	28m38.580s
 sys	0m2.920s

Training appears to have completed, and the vectors were written to vectors.bin.
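
To peek inside vectors.bin, the sketch below reads the file into a dictionary. It rests on my understanding of the format written with -binary 1: an ASCII header line holding the vocabulary size and vector size, then for each word the word itself, one space, and the vector as 4-byte floats followed by a newline. The function name load_word2vec_binary is hypothetical, and little-endian floats are assumed (true on typical x86 machines).

```python
import struct

def load_word2vec_binary(path):
    """Read a file written by ./word2vec with -binary 1 into {word: [floats]}.

    Assumes little-endian 4-byte floats (an assumption, not checked).
    """
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <vector_size>\n" in ASCII.
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("file ended in the middle of a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline that ends the previous entry
                    chars.extend(ch)
            word = chars.decode("utf-8", errors="replace")
            # Then read exactly dim floats for this word.
            data = f.read(4 * dim)
            if len(data) != 4 * dim:
                raise EOFError("file ended in the middle of a vector")
            vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors

# Usage (needs the vectors.bin produced above):
#   vectors = load_word2vec_binary("vectors.bin")
#   print(len(vectors), len(vectors["tokyo"]))
```

With the training command above (-size 200), each loaded vector should have 200 entries.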

*** Find words that are close in vector space [#taaa3cd8]
 $ ./distance vectors.bin
 Enter word or sentence (EXIT to break): tokyo
 
 Word: tokyo  Position in vocabulary: 4915
 
                                               Word       Cosine distance
 ------------------------------------------------------------------------
                                              osaka		0.715582
                                             nagoya		0.687835
                                              chiba		0.646952
                                           yokohama		0.640354
                                              kanto		0.637495
                                               kobe		0.633834
                                            niigata		0.610898
                                            ...

 Enter word or sentence (EXIT to break): bouldering
 
 Word: bouldering  Position in vocabulary: 20827
 
                                               Word       Cosine distance
 ------------------------------------------------------------------------
                                           climbing		0.559630
                                     mountaineering		0.551720
                                        paragliding		0.476230
                                             diving		0.475066
                                           climbers		0.466647
                                            ....
 Enter word or sentence (EXIT to break): EXIT

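I see. Note that the column labeled "Cosine distance" is really the cosine similarity, so a larger value means a closer word.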
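
The ranking that distance prints can be sketched as follows: scale every vector to unit length, take the inner product with the query word's vector, and sort. This is a simplified illustration and not the tool's actual code; the toy vectors are hypothetical values invented for the example, and nearest_words is a name I made up.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def nearest_words(vectors, query, topn=5):
    """Rank every other word by cosine similarity to `query`."""
    q = normalize(vectors[query])
    scored = []
    for word, vec in vectors.items():
        if word == query:
            continue  # the query word itself is not listed
        # For unit-length vectors the inner product is the cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, word))
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]

# Hypothetical toy vectors, made up for illustration only.
toy = {
    "tokyo": [0.9, 0.8, 0.1],
    "osaka": [0.8, 0.9, 0.2],
    "kobe": [0.7, 0.9, 0.3],
    "banana": [0.1, 0.0, 0.9],
}

for word, score in nearest_words(toy, "tokyo", topn=3):
    print(word, round(score, 3))
```

The same function should work on a dictionary of real vectors, though a pure-Python loop over a 71291-word vocabulary will be much slower than the C tool.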

*** Vector arithmetic [#r151e593]
Let's try adding and subtracting vectors.
 $ ./word-analogy vectors.bin
 Enter three words (EXIT to break): man king woman
 
 Word: man  Position in vocabulary: 243
 
 Word: king  Position in vocabulary: 187
 
 Word: woman  Position in vocabulary: 1013
 
                                               Word              Distance
 ------------------------------------------------------------------------
                                              queen		0.580811
                                           daughter		0.483520
                                            heiress		0.477779
                                           burgundy		0.472203
                                                vii		0.471096
                                            marries		0.470184
                                          ahasuerus		0.469802
                                            infanta		0.469698
                                              anjou		0.464855


"queen" comes out as the top result. Entering "man king woman" corresponds to computing king - man + woman.
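
As a rough sketch of that arithmetic: for input words a, b, c, build the vector b - a + c from unit-length vectors and rank the remaining words by cosine similarity to it, skipping the three input words. As far as I understand this is roughly what word-analogy does, but the code below is my own simplified version, the function name analogy is hypothetical, and the toy vectors are invented for the example.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def analogy(vectors, a, b, c, topn=5):
    """Rank words by similarity to b - a + c, skipping the three inputs."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = sum(t * x for t, x in zip(target, normalize(vec)))
        scored.append((score, word))
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]

# Hypothetical toy vectors, made up for illustration only.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.1, 0.5, 0.1],
}

# Same word order as the prompt above: man king woman.
for word, score in analogy(toy, "man", "king", "woman", topn=2):
    print(word, round(score, 3))
```

Skipping the input words matters: without it, one of the inputs (often b or c) tends to be the nearest word to the target vector.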


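This time Word2vec was used to vectorize English words, but it apparently also works for Japanese: instead of the English training corpus (text8), you pass it Japanese text that has been segmented into space-separated words with MeCab.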
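
A minimal sketch of preparing such a corpus is shown below. It assumes the mecab-python3 package and a MeCab dictionary are installed; the function names and the file names in the usage comment are hypothetical. The tokenizer is passed in as a function so the writing step can be tried without MeCab.

```python
def write_wakati_corpus(lines, tokenize, out_path):
    """Write one line of space-separated tokens per non-empty input line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in lines:
            tokens = tokenize(line.strip())
            if tokens:
                out.write(" ".join(tokens) + "\n")

def make_mecab_tokenizer():
    """Return a function that splits Japanese text into words with MeCab.

    Assumes mecab-python3 and a dictionary are installed (not checked here).
    """
    import MeCab  # imported lazily so the rest of the sketch runs without it
    tagger = MeCab.Tagger("-Owakati")  # wakati: output words separated by spaces
    return lambda text: tagger.parse(text).split()

# Usage (hypothetical file names):
#   with open("ja.txt", encoding="utf-8") as f:
#       write_wakati_corpus(f, make_mecab_tokenizer(), "ja_wakati.txt")
#   Then train as before, with -train ja_wakati.txt instead of -train text8.
```

The resulting file has the same shape as text8, plain words separated by whitespace, which is what word2vec expects as training input.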


**Related links [#rdc91f50]
-[[Training on Japanese Wikipedia data with word2vec - Qiita>https://qiita.com/tsuruchan/items/7d3af5c5e9182230db4e]]
-[[Word2Vec: the astonishing power of word vectors that surprised even its inventor - DeepAge>https://deepage.net/bigdata/machine_learning/2016/09/02/word2vec_power_of_word_vector.html]]
-[[[Machine learning] Trying Word2Vec on a Mac - YoheiM .NET>https://www.yoheim.net/blog.php?q=20160305]]
-[[Natural language processing of the Japanese Wikipedia with Ubuntu + word2vec | from umentu import stupid>https://www.blog.umentu.work/ubuntu-word2vec%E3%81%A7%E6%97%A5%E6%9C%AC%E8%AA%9E%E7%89%88wikipedia%E3%82%92%E8%87%AA%E7%84%B6%E8%A8%80%E8%AA%9E%E5%87%A6%E7%90%86%E3%81%97%E3%81%A6%E3%81%BF%E3%81%9F/]]
-[[Trying word2vec on OS X - amberfrog.log>http://b.amberfrog.net/post/105527194822/os-x%E3%81%A7word2vec%E3%82%92%E8%A9%A6%E3%81%97%E3%81%A6%E3%81%BF%E3%81%9F]]

-[[Challenge! Natural language processing with word2vec (using Keras + TensorFlow) - Deep Insider>https://deepinsider.jp/issue/deeplearningnext/word2vec]]
-[[Simple vector representations of words: word2vec, GloVe - Qiita>https://qiita.com/yuku_t/items/483b56be83a3a5423b09]]
-[[Finding words similar to disease symptoms with Word2Vec, MeCab, and ComeJisyo - Qiita>https://qiita.com/quvo/items/9ef250d58971eadf6e1a]]





