Last active
December 12, 2016 16:27
Finding similar documents in a large collection of Japanese documents. ref: http://qiita.com/daisuke6106/items/15b1b9d8295fae74e85b
# crawler usage
crawler \
  -u <URL to crawl> \
  -s "<which links to follow>@<action to run when a link is followed>" \
  -i <interval>
# Example invocation
crawler \
  -u http://example.co.jp \
  -s "decreg('.sample','.+.html')@file_save_full('/tmp','%protocol/%host/%path')" \
  -i 5
# For each saved "data" file, write its ".title" content to a "title" file in the same directory.
for i in $(find ./ -name data)
do
  getcontent_htmldoc -f "$i" -t ".title" > "$(dirname "$i")/title"
done
import MeCab


def get_word(text, split_char):
    """Segment Japanese text with MeCab and join the surface forms with split_char."""
    returnstr = ""
    mecab_result = MeCab.Tagger('mecabrc').parse(text)
    splited_words = mecab_result.split('\n')
    for splited_word in splited_words:
        # MeCab ends its output with an 'EOS' line.
        if splited_word == 'EOS' or splited_word == '':
            break
        # Each line looks like "surface<TAB>feature,feature,..."; keep only the surface form.
        splited_word_info = splited_word.split(',')
        word_and_tag = splited_word_info[0].split('\t')
        if len(word_and_tag) == 2:
            word = word_and_tag[0]
            returnstr += word + split_char
    return returnstr
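The space-separated string returned by `get_word` is later split into tokens by the `token_pattern` that this gist passes to `TfidfVectorizer`. The sketch below uses only the standard `re` module to compare that pattern with scikit-learn's default one. The sample title is a hypothetical, already-segmented string, not real MeCab output.

```python
import re

# Pattern used in this gist: any run of one or more word characters.
TOKEN_PATTERN = r'(?u)\b\w+\b'
# scikit-learn's default pattern: requires at least two word characters.
DEFAULT_PATTERN = r'(?u)\b\w\w+\b'

# Hypothetical title, already segmented into space-separated words.
segmented = 'Windows 7 の 一般 販売 が 解禁'

custom_tokens = re.findall(TOKEN_PATTERN, segmented)
default_tokens = re.findall(DEFAULT_PATTERN, segmented)

print(custom_tokens)
print(default_tokens)
```

With the default pattern, single-character words such as particles and the digit `7` are dropped, which is presumably why the gist overrides `token_pattern`.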
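# Compute TF-IDF vectors; `data` is the list of space-segmented documents.
from sklearn.feature_extraction.text import TfidfVectorizer
# max_df=50 ignores terms that appear in more than 50 documents.
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\\b\\w+\\b', min_df=1, max_df=50)
vecs = vectorizer.fit_transform(data)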
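To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is what scikit-learn documents as its default behaviour. The function name `tfidf` and the sample token lists are hypothetical. The sketch ignores `min_df`, `max_df` and n-grams, so it illustrates the idea and is not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term count multiplied by the smoothed idf.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # L2-normalize so every non-empty document vector has length 1.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical, already-tokenized titles.
sample_docs = [['windows', '7', 'rc'], ['windows', 'xp']]
sample_vectors = tfidf(sample_docs)
```

Terms that occur in fewer documents (here `rc`) end up with a larger weight than terms shared by every document (here `windows`).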
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=100, random_state=0).fit_predict(vecs)
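`fit_predict` returns one cluster label per input document, in input order. One way to inspect the result, sketched here with the standard library only, is to group the titles by label. The helper `group_by_cluster` and the sample titles and labels are hypothetical stand-ins for the real data.

```python
from collections import defaultdict


def group_by_cluster(titles, labels):
    """Map each cluster label to the list of titles assigned to it."""
    groups = defaultdict(list)
    for title, label in zip(titles, labels):
        groups[int(label)].append(title)
    return dict(groups)


# Hypothetical titles and the labels a clustering step might assign to them.
sample_titles = ['Windows 7 RC', 'Windows XP', 'flumpool']
sample_labels = [3, 3, 8]
sample_groups = group_by_cluster(sample_titles, sample_labels)

for label, members in sorted(sample_groups.items()):
    print(label, members)
```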
import os
import random

import MeCab
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def find_all_files(directory, filename):
    """Yield the path of every file named `filename` under `directory`."""
    for root, dirs, files in os.walk(directory):
        for file in files:
            if filename == file:
                yield os.path.join(root, file)


def get_files_with_label(label, directory, filename):
    """Return rows of [label, file contents, file path, cluster id ('-1' = unassigned)]."""
    returnarray = np.empty((0, 4), dtype=str)
    for file in find_all_files(directory, filename):
        filestr = ''
        with open(file, 'r') as f:
            for line in f:
                filestr += line.rstrip()
        filedata = np.array([[label, filestr, file, '-1']])
        returnarray = np.append(returnarray, filedata, axis=0)
    return returnarray


def get_word(text, split_char):
    """Segment Japanese text with MeCab and join the surface forms with split_char."""
    returnstr = ""
    mecab_result = MeCab.Tagger('mecabrc').parse(text)
    splited_words = mecab_result.split('\n')
    for splited_word in splited_words:
        # MeCab ends its output with an 'EOS' line.
        if splited_word == 'EOS' or splited_word == '':
            break
        # Each line looks like "surface<TAB>feature,feature,..."; keep only the surface form.
        splited_word_info = splited_word.split(',')
        word_and_tag = splited_word_info[0].split('\t')
        if len(word_and_tag) == 2:
            word = word_and_tag[0]
            returnstr += word + split_char
    return returnstr


# Load the data saved under the given directories
alldata = np.empty((0, 4), dtype=str)
alldata = np.append(alldata, get_files_with_label('it', '/tmp/crawler/savedata/it', 'title'), axis=0)
alldata = np.append(alldata, get_files_with_label('economy', '/tmp/crawler/savedata/economy', 'title'), axis=0)
alldata = np.append(alldata, get_files_with_label('entertainment', '/tmp/crawler/savedata/entertainment', 'title'), axis=0)
alldata = np.append(alldata, get_files_with_label('sports', '/tmp/crawler/savedata/sports', 'title'), axis=0)

# Segment each document into space-separated words
all_title_data = [get_word(title, ' ') for title in alldata[:, 1]]

# Compute TF-IDF
vectorizer = TfidfVectorizer(use_idf=True, token_pattern=u'(?u)\\b\\w+\\b', min_df=1, max_df=50, ngram_range=(1, 2))
vecs = vectorizer.fit_transform(all_title_data)

# Cluster with k-means into roughly 100 groups
clusters = KMeans(n_clusters=100, random_state=0).fit_predict(vecs)

# Write the cluster ids back into the data
for i in range(len(alldata)):
    alldata[i, 3] = str(clusters[i])

# Pick one cluster at random (once, not once per row) and collect its documents
target_cluster = random.randint(0, 99)
result = []
for i in range(len(alldata)):
    if int(alldata[i, 3]) == target_cluster:
        result.append(alldata[i, 1])

# Print the result
result.sort()
for title in result:
    print(title)
Windows 7 RC、日本は5月7日一般公開 「XPと同等かそれ以上に快適」
Windows 7 RCは5月5日リリース? Microsoftがリーク
Windows 7β版、一般公開終了
Windows 7の「XPモード」、RC版公開
Windows 7の「XPモード」が完成 10月22日にリリース
Windows 7のエディションは6種類、すべてネットブックに対応――マイクロソフトが公表
Windows 7の一般販売が解禁――真夜中のアキバに1000人を超える群衆
Windows 7の評価をユーザーが投稿 MSが専用サイト
Windows 7ベータ版、一般公開期限を2月12日まで延長
Windows 7/8.1→Windows 10が“推奨される更新”に
Windows XPがもうすぐアキバから消える? 「とりあえず再入荷はこれで最後かも」
Windows XPの「メインストリーム・サポート」が終了、14日から「延長サポート」に移行
Windows7が70円?海賊版、早速猛威振るう−北京
XPモデル、新色・新柄を追加したポケットサイズPC――「VAIO type P」
flumpool山村隆太、一般女性と結婚へ “14年愛”を実らせライブで生報告
「Doblog」5月に終了 障害で3カ月投稿不能、一部データ復旧できず
「Skype」と「Google Voice」に脆弱性、“PBXボットネット”に悪用されるおそれも
「Surface 3」当初の販売地域に日本は含まれず ~日本へは「最適な形での投入を検討」
「Windows 7」のアップグレード、“ここ”に注意
「Windows 7」ベータ版公開一般ユーザーへの提供は9日から
「Windows 7はXPより速い」 MS公式ベンチマークの結果は
「wwwの父」がどうしてもやり直したいこと、それはhttp:のあとの//の不要化―米紙
・・・