@kenttw
kenttw / test
Created June 1, 2016 04:42
tet
just
@kenttw
kenttw / how-to-build-word2vec.ipynb
Created December 6, 2015 00:28
how-to-build-word2vec
def cuttext(text):
    stop_words = stop_sc.value
    import jieba
    from jieba import analyse
    jieba.load_userdict("../data/new.dict_all")
    bag_word = dict()
    for word in jieba.cut(text, cut_all=False):
        if word in stop_words: continue
        if len(word) == 1 and word != '$': continue
        else:
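The preview above cuts off at the `else:` branch, so what the gist does with each surviving word is not shown. The following is only a minimal sketch of the filtering step, assuming the missing branch counts words into `bag_word`. The function name `count_words` and the sample tokens are hypothetical; in the gist the tokens come from `jieba.cut(text, cut_all=False)` and the stop words from the broadcast variable `stop_sc`.

```python
from collections import Counter


def count_words(tokens, stop_words):
    """Count tokens, skipping stop words and single characters other than '$'.

    Assumption: the truncated `else:` branch in the gist accumulates counts.
    """
    bag = Counter()
    for word in tokens:
        if word in stop_words:
            continue
        if len(word) == 1 and word != '$':
            continue
        bag[word] += 1
    return dict(bag)


# Hypothetical pre-tokenized input standing in for the output of jieba.cut.
tokens = ["spark", "is", "a", "$", "spark", "the"]
bag = count_words(tokens, stop_words={"is", "the"})
```

Passing the tokens in as an argument keeps the sketch independent of jieba and of the Spark broadcast variable, so the filtering rules can be checked on their own.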
@kenttw
kenttw / spark-let-filename-as-key.ipynb
Last active September 7, 2015 03:47
spark - let file name as key
@kenttw
kenttw / gist:e68cbc00525358bd82c8
Created July 8, 2015 07:57
pyspark - ChiSqSelector Error
from pyspark.mllib.feature import ChiSqSelector
model = ChiSqSelector(5000).fit(sc.parallelize(lc))
chi_l = l.mapValues(lambda x : model.transform (x))
print chi_l.first()
The following message appears:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
@kenttw
kenttw / spark_analysis_ml_data.py
Last active April 7, 2022 05:55
Using Spark to analyze the distribution of the training data and the data to be predicted
import pickle
from urlparse import urlparse

# Load the pickled gender classifier for this id.
genc = pickle.loads(open(settings.DATA_FOLDER + id + "/GenderClassify.pkl").read())

def raw2feature(line):
    r = []
    try:
        dictf = ["hour", "category_id", "cookie_pta", "timestamp", "url", "country", "city", "resolution", "browser", "browser_version", "os", "os_version", "device_model", "device_marketing", "device_brand", "search_keyword", "referrer_host"]
        parsedline = dict()
        index = 0
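The preview ends before `raw2feature` uses `dictf`, `parsedline`, and `index`. One plausible reading is that each raw line is split into values and mapped onto the field names in order. The sketch below illustrates that idea only; the tab separator, the shortened `FIELDS` list, and the name `parse_line` are assumptions, not taken from the gist.

```python
# First few field names from the gist's dictf list, shortened for illustration.
FIELDS = ["hour", "category_id", "cookie_pta", "timestamp", "url"]


def parse_line(line, fields=FIELDS, sep="\t"):
    """Map a delimited line onto field names; return None if the counts differ.

    Assumption: the raw records are tab-separated, one value per field.
    """
    values = line.rstrip("\n").split(sep)
    if len(values) != len(fields):
        return None
    return dict(zip(fields, values))


# Hypothetical sample record.
parsed = parse_line("13\t7\tabc\t1449360000\thttp://example.com/a\n")
short = parse_line("too\tfew")
```

Returning `None` for malformed lines lets a Spark job drop them with a simple filter instead of relying on a broad `try` block.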