@kenttw
kenttw / spark_analysis_ml_data.py
Last active April 7, 2022 05:55
Use Spark to analyze the distribution of the training data and the data to be predicted
genc = pickle.loads(open(settings.DATA_FOLDER + id + "/GenderClassify.pkl").read())
from urlparse import urlparse
def raw2feature(line):
    r = []
    try:
        dictf = ["hour", "category_id", "cookie_pta", "timestamp", "url", "country", "city", "resolution", "browser", "browser_version", "os", "os_version", "device_model", "device_marketing", "device_brand", "search_keyword", "referrer_host"]
        parsedline = dict()
        index = 0
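The preview above is cut off before the parsing loop, so the rest of `raw2feature` is not shown. As a rough sketch of what the `dictf` list, the empty `parsedline` dict, and the `index` counter seem to be setting up, the following maps each column of a raw log line to its field name. The function name `parse_line`, the tab delimiter, and the sample values are assumptions made for illustration; only the field names come from the gist.

```python
# Field names copied from the gist preview; everything else is illustrative.
FIELDS = ["hour", "category_id", "cookie_pta", "timestamp", "url", "country",
          "city", "resolution", "browser", "browser_version", "os",
          "os_version", "device_model", "device_marketing", "device_brand",
          "search_keyword", "referrer_host"]


def parse_line(line, sep="\t"):
    """Map one delimited log line to a {field_name: value} dict.

    Returns None when the column count does not match FIELDS.
    The tab separator is an assumption; the gist does not show it.
    """
    values = line.rstrip("\n").split(sep)
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


# Hypothetical sample line with one value per field (empty strings allowed).
sample = "\t".join(["13", "7", "abc123", "1436342220", "http://example.com/a",
                    "TW", "Taipei", "1920x1080", "Chrome", "43", "Windows",
                    "7", "", "", "", "", "example.com"])
record = parse_line(sample)
```

Using `zip` removes the need for a manual `index` counter, and returning `None` for malformed lines makes it easy to drop them with a `filter` step in a Spark job.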
@kenttw
kenttw / gist:e68cbc00525358bd82c8
Created July 8, 2015 07:57
pyspark - ChiSqSelector Error
from pyspark.mllib.feature import ChiSqSelector
model = ChiSqSelector(5000).fit(sc.parallelize(lc))
chi_l = l.mapValues(lambda x: model.transform(x))
print(chi_l.first())
The following message appears:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
@kenttw
kenttw / spark-let-filename-as-key.ipynb
Last active September 7, 2015 03:47
spark - use the file name as the key
def cuttext(text):
    stop_words = stop_sc.value
    import jieba
    from jieba import analyse
    jieba.load_userdict("../data/new.dict_all")
    bag_word = dict()
    for word in jieba.cut(text, cut_all=False):
        if word in stop_words: continue
        if len(word) == 1 and word != '$': continue
        else:
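The preview stops at the `else:` branch, so what happens to each surviving word is not shown. A minimal sketch of the filtering pattern, assuming the missing branch increments a per-word count in `bag_word` (that part is a guess, not taken from the gist). The tokenizer is passed in as a parameter so the sketch runs without jieba; the name `count_words` and the sample sentence are hypothetical.

```python
def count_words(text, tokenize, stop_words):
    """Count tokens, skipping stop words and single characters other than '$'."""
    bag_word = {}
    for word in tokenize(text):
        if word in stop_words:
            continue
        if len(word) == 1 and word != '$':
            continue
        # Assumed behavior of the truncated else-branch: count occurrences.
        bag_word[word] = bag_word.get(word, 0) + 1
    return bag_word


# Whitespace splitting stands in for jieba here.
counts = count_words("spark is a fast engine and spark is fun $",
                     str.split, {"is", "and"})
```

With jieba installed, the same function could be called with `lambda t: jieba.cut(t, cut_all=False)` as the tokenizer and the broadcast value `stop_sc.value` as the stop-word set, matching the calls in the preview.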
@kenttw
kenttw / how-to-build-word2vec.ipynb
Created December 6, 2015 00:28
how-to-build-word2vec
@kenttw
kenttw / test
Created June 1, 2016 04:42
tet
just