@shinob
Last active March 16, 2017 10:11
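Train on accounting data and, when a journal entry is being entered, try to predict the account title from the text of its description field. ref: http://qiita.com/mix_dvd/items/9c2ef5e6fcc390be067e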
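import pandas as pd
# Read the journal CSV (Shift-JIS encoded), skipping the first three lines.
filename = "JDL出納帳-xxxx-xxxx-仕訳.csv"
df = pd.read_csv(filename, encoding="Shift-JIS", skiprows=3)
# 摘要 = description, 借方科目 = debit account code, 借方科目正式名称 = debit account official name
columns = ["摘要", "借方科目", "借方科目正式名称"]
# Keep only the rows in which all three columns are present.
df_counts = df[columns].dropna()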
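# joblib is imported directly; recent scikit-learn releases no longer ship sklearn.externals.joblib.
import joblib
# Persist the fitted vectorizer and classifier (the data/ directory must already exist).
joblib.dump(vect, 'data/vect.pkl')
joblib.dump(clf, 'data/clf.pkl')
# Write the code-to-name Series without a header row so it can be read back with header=None.
df_rs.to_csv("data/code.csv", header=False)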
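The persistence step can be checked with a small round trip. The following is only a sketch under stated assumptions: the two sample strings are made up, and a temporary directory stands in for the `data/` directory used in the snippet.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, pre-tokenized sample text; not taken from the original data.
vect = TfidfVectorizer()
vect.fit(["高速 道路 利用 料", "切手 代"])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "vect.pkl")
    joblib.dump(vect, path)       # write the fitted object to disk
    restored = joblib.load(path)  # read it back as a new object

# The restored vectorizer should carry the same learned vocabulary.
same_vocabulary = restored.vocabulary_ == vect.vocabulary_
print(same_vocabulary)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, so it may be worth recording the version next to the files.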
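import pandas as pd
# Rebuild the code-to-name Series: column 0 holds the account code, column 1 the account name.
filename = "data/code.csv"
df = pd.read_csv(filename, header=None)
df.index = df.pop(0)
df_rs = df.pop(1)
# Load the trained classifier and the fitted vectorizer.
import joblib
clf = joblib.load('data/clf.pkl')
vect = joblib.load('data/vect.pkl')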
from janome.tokenizer import Tokenizer
t = Tokenizer()
tests = [
    "高速道路利用料",
    "パソコン部品代",
    "切手代",
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
高速道路利用料 [ 旅費交通 ]
パソコン部品代 [ 消耗品費 ]
切手代 [ 通信費 ]
$ pip install janome
from janome.tokenizer import Tokenizer
t = Tokenizer()
notes = []
for ix in df_counts.index:
    note = df_counts.loc[ix, "摘要"]
    # Normalize full-width spaces (U+3000) to ASCII spaces before tokenizing.
    tokens = t.tokenize(note.replace('\u3000', ' '))
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words.replace(' \u3000', ''))
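The loop above turns each description into a space-separated string of token surfaces. As a rough sketch, the same step can be wrapped in a helper. `to_wakati`, `Token`, and `WhitespaceTokenizer` are hypothetical names introduced here; the stand-in tokenizer only splits on whitespace so that the example runs without janome, and janome's `Tokenizer` could be passed in its place.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical minimal token carrying only a surface form."""
    surface: str


class WhitespaceTokenizer:
    """Hypothetical stand-in for janome's Tokenizer: splits on whitespace only."""

    def tokenize(self, text):
        return [Token(part) for part in text.split()]


def to_wakati(text, tokenizer):
    """Return the token surfaces of `text` joined by single spaces."""
    normalized = text.replace("\u3000", " ")  # full-width space to ASCII space
    surfaces = (token.surface for token in tokenizer.tokenize(normalized))
    return " ".join(s for s in surfaces if s.strip())


sample = to_wakati("ETC\u3000利用料", WhitespaceTokenizer())
print(sample)
```

Joining with `" ".join(...)` avoids the leading space that repeated string concatenation produces, although the vectorizer ignores that space either way.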
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(notes)
X = vect.transform(notes)
# Account codes as a 1-D integer array of labels.
y = df_counts["借方科目"].to_numpy().astype("int")
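One detail worth knowing about `TfidfVectorizer()` with default settings: its default `token_pattern` only keeps tokens of two or more word characters, so single-character tokens such as 代 are dropped from the vocabulary. The sketch below illustrates this on two made-up, pre-tokenized strings; the alternative pattern is shown only for comparison and is not something the original snippet uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, pre-tokenized sample text; not taken from the original data.
docs = ["切手 代", "パソコン 部品 代"]

# Default settings: tokens must be at least two word characters long.
default_vect = TfidfVectorizer()
default_vect.fit(docs)
default_vocab = sorted(default_vect.vocabulary_)

# A pattern that also accepts single-character tokens.
keep_single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_single.fit(docs)
single_vocab = sorted(keep_single.vocabulary_)

print(default_vocab)
print(single_vocab)
```

Whether keeping single-character tokens helps accuracy depends on the data, so it would need to be measured rather than assumed.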
from sklearn.model_selection import train_test_split
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
from sklearn.svm import LinearSVC
clf = LinearSVC(C=120.0, random_state=42)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
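The snippets above fit the vectorizer on all descriptions before splitting, so the test rows influence the learned vocabulary and IDF weights. One possible variation, sketched here with made-up descriptions and hypothetical account codes, is to split the raw text first and wrap both steps in a scikit-learn pipeline so that only the training portion is used for fitting. `random_state` and `stratify` are added for reproducibility and are not part of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, pre-tokenized descriptions and hypothetical account codes.
notes = [
    "高速 道路 利用 料", "高速 道路 料金", "駐車 料金", "電車 運賃",
    "切手 代", "はがき 代", "電話 料金", "郵便 料金",
]
y = [1, 1, 1, 1, 2, 2, 2, 2]

# Split the raw text first so the vectorizer never sees the test rows.
notes_train, notes_test, y_train, y_test = train_test_split(
    notes, y, test_size=0.25, stratify=y, random_state=0
)

# Vectorizer and classifier are fitted together on the training split only.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=120.0, random_state=42))
model.fit(notes_train, y_train)

accuracy = model.score(notes_test, y_test)
predictions = model.predict(notes_test)
print(accuracy, predictions)
```

A pipeline also leaves a single object to persist, instead of separate vectorizer and classifier files.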
tests = [
    "高速道路利用料",
    "パソコン部品代",
    "切手代"
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
# Map each account code to its official name, keeping the first occurrence of each code.
df_rs = df_counts[["借方科目正式名称", "借方科目"]]
df_rs.index = df_counts["借方科目"].astype("int")
df_rs = df_rs[~df_rs.index.duplicated()]["借方科目正式名称"]
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
高速道路利用料 [ 旅費交通 ]
パソコン部品代 [ 消耗品費 ]
切手代 [ 通信費 ]
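The code-to-name lookup used for printing can also be built with `drop_duplicates` and `set_index`. This is a minimal sketch on made-up rows; the account codes 1 and 2 are hypothetical and do not come from the original data.

```python
import pandas as pd

# Made-up rows with hypothetical account codes, for illustration only.
df_counts = pd.DataFrame({
    "摘要": ["切手代", "電話料金", "高速道路利用料"],
    "借方科目": [2, 2, 1],
    "借方科目正式名称": ["通信費", "通信費", "旅費交通費"],
})

# Keep one row per account code, then index the names by that code.
lookup = (
    df_counts.drop_duplicates("借方科目")
    .set_index("借方科目")["借方科目正式名称"]
)

name = lookup.loc[2]
print(name)
```

The result is a Series indexed by account code, so `lookup.loc[code]` returns the name for a predicted code in the same way `df_rs.loc[...]` does above.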