Last active
March 16, 2017 10:11
Gist: shinob/d88932cb703e3fe7370924ef26195b82
Train on accounting data, then try to predict the account title from the description (摘要) text when a journal entry is entered. ref: http://qiita.com/mix_dvd/items/9c2ef5e6fcc390be067e
# Load the journal CSV (Shift-JIS encoded); the first three rows are skipped.
import pandas as pd

filename = "JDL出納帳-xxxx-xxxx-仕訳.csv"
df = pd.read_csv(filename, encoding="Shift-JIS", skiprows=3)
# Keep the description (摘要), debit account code (借方科目), and official debit account name (借方科目正式名称).
columns = ["摘要", "借方科目", "借方科目正式名称"]
df_counts = df[columns].dropna()
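# Save the fitted vectorizer, the classifier, and the code-to-name table.
# joblib is a standalone package; current scikit-learn no longer bundles it under sklearn.externals.
import joblib

joblib.dump(vect, 'data/vect.pkl')
joblib.dump(clf, 'data/clf.pkl')
# header=False writes a header-less file, which is what a header=None read expects.
df_rs.to_csv("data/code.csv", header=False)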
# Reload the code-to-name table, the classifier, and the vectorizer.
import joblib
import pandas as pd

filename = "data/code.csv"
df = pd.read_csv(filename, header=None)
df.index = df.pop(0)  # column 0: account code
df_rs = df.pop(1)     # column 1: official account name

clf = joblib.load('data/clf.pkl')
vect = joblib.load('data/vect.pkl')
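The loader above reads `data/code.csv` with `header=None`, so it expects a header-less, two-column file: account code, then account name. The following is a minimal sketch of that layout using only the standard `csv` module instead of pandas. The numeric codes are hypothetical and chosen only for illustration; the helper names `dump_codes` and `load_codes` are not part of the gist.

```python
import csv
import io

# Hypothetical account codes; the names are taken from the sample output.
code_to_name = {111: "旅費交通", 222: "消耗品費", 333: "通信費"}


def dump_codes(mapping):
    """Write one 'code,name' row per entry, with no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for code, name in mapping.items():
        writer.writerow([code, name])
    return buf.getvalue()


def load_codes(text):
    """Read header-less 'code,name' rows back into a dict keyed by int code."""
    reader = csv.reader(io.StringIO(text))
    return {int(code): name for code, name in reader}


text = dump_codes(code_to_name)
restored = load_codes(text)
print(text)
```

If a header row were written by the save step, a header-less reader like this one would treat it as data, so the two sides need to agree on the layout.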
# Predict account titles for new descriptions with the reloaded model.
from janome.tokenizer import Tokenizer

t = Tokenizer()

tests = [
    "高速道路利用料",  # expressway toll
    "パソコン部品代",  # PC parts
    "切手代",          # postage stamps
]

# Tokenize each description and join the surface forms with spaces.
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)

X = vect.transform(notes)
result = clf.predict(X)

# Look up the account name for each predicted code.
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
高速道路利用料 [ 旅費交通 ]
パソコン部品代 [ 消耗品費 ]
切手代 [ 通信費 ]
$ pip install janome
# Tokenize every description with janome and join the surface forms with spaces.
from janome.tokenizer import Tokenizer

t = Tokenizer()

notes = []
for ix in df_counts.index:
    note = df_counts.loc[ix, "摘要"]
    # Normalize full-width spaces (U+3000) to ASCII spaces before tokenizing.
    tokens = t.tokenize(note.replace('\u3000', ' '))
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words.replace(' \u3000', ''))
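The loop above turns each description into a space-separated string of token surfaces, because the vectorizer expects whitespace-delimited text. The sketch below shows the same joining step without requiring janome: `FakeToken` is a hypothetical stand-in that models only the `surface` attribute, and the segmentation of the sample is assumed rather than produced by a real tokenizer. It also drops whitespace-only tokens such as a full-width space, which is one possible alternative to the string replacements.

```python
from collections import namedtuple

# Hypothetical stand-in for a janome token; only `surface` is modelled.
FakeToken = namedtuple("FakeToken", ["surface"])


def join_surfaces(tokens):
    """Join token surfaces with single spaces, skipping whitespace-only tokens."""
    return " ".join(tok.surface for tok in tokens if tok.surface.strip())


# Assumed segmentation, including a stray full-width space (U+3000) token.
tokens = [FakeToken("切手"), FakeToken("\u3000"), FakeToken("代")]
joined = join_surfaces(tokens)
print(joined)
```

Unlike the loop in the gist, this version produces no leading space; that difference should not matter to a whitespace-based vectorizer.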
# Turn the space-separated descriptions into TF-IDF feature vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
vect.fit(notes)
X = vect.transform(notes)
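One detail worth knowing here: `TfidfVectorizer` is created with its defaults, and the default `token_pattern` is `(?u)\b\w\w+\b`, which keeps only terms of two or more word characters. Single-character Japanese tokens such as 代 are therefore dropped from the features. The sketch below mimics that pattern with the standard `re` module (it does not run the vectorizer itself); the sample segmentation is assumed for illustration, and the looser pattern is an option one might try, not the gist's setting.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# A looser pattern that also keeps single-character terms (an assumption to experiment with).
LOOSE_PATTERN = r"(?u)\b\w+\b"


def extract_terms(text, pattern):
    """Return the terms a regex-based analyzer with this pattern would keep."""
    return re.findall(pattern, text)


# Space-separated text in the shape produced by the tokenization loop (segmentation assumed).
sample = " 切手 代"

default_terms = extract_terms(sample, DEFAULT_PATTERN)
loose_terms = extract_terms(sample, LOOSE_PATTERN)
print(default_terms)
print(loose_terms)
```

Whether keeping single-character terms helps accuracy depends on the data, so it would need to be checked against the held-out score.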
# Target labels: debit account codes as integers.
y = df_counts["借方科目"].to_numpy().astype("int")
# Hold out 20% of the rows for testing.
from sklearn.model_selection import train_test_split

test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
# Train a linear SVM and report accuracy on the held-out rows.
from sklearn.svm import LinearSVC

clf = LinearSVC(C=120.0, random_state=42)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# Predict account titles for a few sample descriptions.
tests = [
    "高速道路利用料",  # expressway toll
    "パソコン部品代",  # PC parts
    "切手代",          # postage stamps
]

# Tokenize each description and join the surface forms with spaces.
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)

X = vect.transform(notes)
result = clf.predict(X)

# Build a lookup from account code to official account name (first row per code).
df_rs = df_counts[["借方科目正式名称", "借方科目"]]
df_rs.index = df_counts["借方科目"].astype("int")
df_rs = df_rs[~df_rs.index.duplicated()]["借方科目正式名称"]

for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
高速道路利用料 [ 旅費交通 ]
パソコン部品代 [ 消耗品費 ]
切手代 [ 通信費 ]
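The names printed in brackets come from the `df_rs` lookup, which indexes account names by integer code and uses `~df_rs.index.duplicated()` to keep only the first row seen for each code. A minimal sketch of that "first occurrence wins" lookup with a plain dict follows. The numeric codes and the alternate spelling in the last row are hypothetical, and `build_code_lookup` is not a function from the gist.

```python
def build_code_lookup(rows):
    """Map each account code (cast to int) to the first name seen for it."""
    lookup = {}
    for code, name in rows:
        # setdefault leaves an existing entry untouched, so the first row wins.
        lookup.setdefault(int(code), name)
    return lookup


# Hypothetical (code, name) rows; codes arrive as floats, as they might after a CSV read.
rows = [
    (111.0, "旅費交通"),
    (222.0, "消耗品費"),
    (111.0, "旅費交通費"),  # later duplicate of code 111, ignored
]

lookup = build_code_lookup(rows)
print(lookup[111], lookup[222])
```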