https://github.com/dongsam/CryptoCurrency-Analysis [Sentiment analysis of chat text data]
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment Analysis by Training on Chat Text"
]
},
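{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Goal\n",
"- Label chat data from Poloniex, the world's largest cryptocurrency exchange, with NLTK-based sentiment scores, then train doc2vec on the labeled text to classify messages as positive or negative and to obtain a domain-specific sentiment lexicon"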
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on Python 3.6.1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09) \\n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment polarity scoring (using NLTK's SentimentIntensityAnalyzer)\n",
"### Based on [vaderSentiment](https://github.com/cjhutto/vaderSentiment)\n",
"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import nltk"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'3.2.4'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.__version__"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.sentiment.vader import SentimentIntensityAnalyzer"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sid = SentimentIntensityAnalyzer()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'compound': 0.4404, 'neg': 0.0, 'neu': 0.58, 'pos': 0.42}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sid.polarity_scores(\"Bitcoin Is Better Than Gold\")\n",
"# Headline taken from https://www.forbes.com/sites/panosmourdoukoutas/2017/03/04/bitcoin-is-better-than-gold/#52d2468c5f04"
]
},
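{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [compound] the valence scores of the words in the text are summed, adjusted according to the rules, and normalized to a value between -1 (most negative) and +1 (most positive)\n",
"* [pos], [neu], [neg] the proportions of the text that fall into the positive, neutral, and negative categories\n",
"* To label a whole text, threshold the compound score. The vaderSentiment README suggests the following typical thresholds:\n",
"    * positive sentiment: compound score >= 0.05\n",
"    * neutral sentiment: (compound score > -0.05) and (compound score < 0.05)\n",
"    * negative sentiment: compound score <= -0.05"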
]
},
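The thresholding of the compound score described above can be sketched as a small helper. This is a minimal illustration and not part of the original notebook: the function name `label_from_compound` is hypothetical, and the default `threshold=0.05` is an assumption taken from the typical value suggested in the vaderSentiment README. The input is the `compound` value returned by `sid.polarity_scores`.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neu' or 'neg'.

    `threshold` is an assumed cut-off; 0.05 is the typical value
    suggested in the vaderSentiment README.
    """
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


# Compound score reported above for "Bitcoin Is Better Than Gold"
print(label_from_compound(0.4404))
```

A stricter cut-off can be passed explicitly, for example `label_from_compound(score, threshold=0.5)`, if only strongly polarized texts should be labeled.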
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Testing sentiment scoring on about 5,000 news and community post titles and bodies"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with open('coin_text_list.txt') as fp:  # file containing the target texts\n",
"    text_list = fp.read().split('\\n')  # one text per line; split on newlines to load them"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5304"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(text_list)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"res_list = []  # compute the sentiment polarity of each of the ~5,000 texts and store each result together with its text\n",
"for i in text_list:\n",
"    res = sid.polarity_scores(i)\n",
"    res['text'] = i\n",
"    res_list.append(res)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5304"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(res_list)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'compound': 0.4404,\n",
" 'neg': 0.0,\n",
" 'neu': 0.828,\n",
" 'pos': 0.172,\n",
" 'text': 'BAT ICO gasPrice analysis, good timing and reasonable gasPrice is enough to get you in'}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(res_list, key=lambda x: x['pos'], reverse=True)[1000]  # sort by positive (or negative) score to inspect the distribution and counts"
]
},
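Sorting the whole list just to look at a few extreme records is not strictly necessary. As a hedged sketch (not from the original notebook), `heapq.nlargest` from the standard library can pull the top-scoring records directly. The helper name `top_by` is hypothetical, and the two sample records are the two results printed earlier in this notebook, with the headline text attached to the first one.

```python
import heapq


def top_by(res_list, key, n=10):
    """Return the n records with the largest value for `key` ('pos', 'neg', 'neu' or 'compound')."""
    return heapq.nlargest(n, res_list, key=lambda r: r[key])


# The two results printed earlier in this notebook, reused as sample input
sample = [
    {'compound': 0.4404, 'neg': 0.0, 'neu': 0.58, 'pos': 0.42,
     'text': 'Bitcoin Is Better Than Gold'},
    {'compound': 0.4404, 'neg': 0.0, 'neu': 0.828, 'pos': 0.172,
     'text': 'BAT ICO gasPrice analysis, good timing and reasonable gasPrice is enough to get you in'},
]

for res in top_by(sample, 'pos', n=1):
    print(res['pos'], res['text'])
```

On the full data this would be called as `top_by(res_list, 'pos', n=20)` or `top_by(res_list, 'neg', n=20)` to read the most positive and most negative texts side by side.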
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawling the chat data\n",
"### Parsing the HTML structure with BeautifulSoup\n",
"[polonibox.com](http://www.polonibox.com/) is a site that dumps the chat data of the Poloniex exchange. Its pages were saved one HTML file per page, and those files are read and parsed here."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import os\n",
"import dateutil.parser"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"html_dir_path = '/Users/dongsamb/coin_data/polonibox_html'  # directory containing the saved HTML files\n",
"html_file_list = os.listdir(html_dir_path)[1:]  # [1:] drops the first directory entry"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"html_list = []  # read every HTML file and keep only its text in a list\n",
"for i in html_file_list:\n",
"    full_path = \"{}/{}\".format(html_dir_path, i)\n",
"    with open(full_path) as fp:\n",
"        text = fp.read()\n",
"    if len(text) > 1000:  # pages shorter than 1000 characters are error pages saved when the site's server failed, so skip them\n",
"        html_list.append(text)"
]
},
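The loading loop above relies on `os.listdir(...)[1:]` to skip one unwanted entry, which depends on the order in which the directory is listed. A more defensive variant might look like the sketch below. It is an illustration under stated assumptions rather than the notebook's method: the helper name `load_html_pages`, the rule of skipping hidden files (names starting with `.`), and the explicit `utf-8` decoding are all assumptions. The 1000-character cut-off mirrors the loop above.

```python
from pathlib import Path


def load_html_pages(dir_path, min_length=1000):
    """Read every regular, non-hidden file in dir_path and keep texts longer than min_length."""
    pages = []
    for path in sorted(Path(dir_path).iterdir()):
        if not path.is_file() or path.name.startswith('.'):
            continue  # skip sub-directories and hidden files such as .DS_Store
        # Assumption: pages are UTF-8; undecodable bytes are replaced, not fatal
        text = path.read_text(encoding='utf-8', errors='replace')
        if len(text) > min_length:  # short pages are treated as error pages
            pages.append(text)
    return pages
```

With this helper the two cells above would reduce to `html_list = load_html_pages(html_dir_path)`.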
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"17247"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(html_list)  # total number of HTML pages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A function that takes the extracted HTML text, parses each chat message's username, message text, date, and the user's reputation, and returns them as a list of dicts"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_chat_list(html):\n",
"    \"\"\"Parse one saved page into a list of chat-message dicts.\"\"\"\n",
"    try:\n",
"        soup = BeautifulSoup(html, \"html.parser\")\n",
"        res_list = []\n",
"    except Exception as e:\n",
"        print(e)\n",
"        return []\n",
"    tbody = soup.find('tbody')\n",
"    if tbody is None:  # no chat table on this page\n",
"        return []\n",
"    for i in tbody.find_all('tr'):  # one table row per chat message\n",
"        try:\n",
"            res_dic = {}\n",
"            tds = i.find_all('td')\n",
"            # first cell: reputation, username, timestamp, and message link; second cell: message text\n",
"            res_dic['reputation'] = int(tds[0].span.text.strip())\n",
"            res_dic['username'] = tds[0].a.text.strip()\n",
"            res_dic['message'] = tds[1].text.strip()\n",
"            res_dic['date_str'] = tds[0].find_all('span')[1]['title']\n",
"            res_dic['date'] = dateutil.parser.parse(res_dic['date_str'])\n",
"            res_dic['message_id'] = int(tds[0].find_all('a')[1]['href'].split('=')[1])\n",
"            res_list.append(res_dic)\n",
"        except Exception as e:  # skip rows that do not match the expected layout\n",
"            print(e)\n",
"            continue\n",
"    return res_list"
]
},
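Once a parser like `get_chat_list` exists, the per-page results still have to be combined into one data set. The following is only a sketch under stated assumptions: the helper name `collect_messages` is hypothetical, and it assumes that saved pages may repeat a message, so it keeps one record per `message_id` and orders the result by that id. The parser is passed in as an argument so the sketch stays self-contained.

```python
def collect_messages(html_list, parse):
    """Parse every page with `parse` and return the unique messages ordered by message_id."""
    by_id = {}
    for html in html_list:
        for msg in parse(html):
            # Assumption: the same message may appear on more than one saved page;
            # keep the first copy seen for each message_id
            by_id.setdefault(msg['message_id'], msg)
    return [by_id[k] for k in sorted(by_id)]
```

In this notebook it would be called as `chat_list = collect_messages(html_list, get_chat_list)`.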
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[{'date': datetime.datetime(2017, 6, 4, 8, 8, 37, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:37 UTC',\n",
" 'message': 'chrisjlabrie, i agree:)',\n",
" 'message_id': 18568553,\n",
" 'reputation': 0,\n",
" 'username': 'sysoyoung333'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 37, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:37 UTC',\n",
" 'message': 'FlatEarth, nah i had them for long time...',\n",
" 'message_id': 18568552,\n",
" 'reputation': 20,\n",
" 'username': 'Larillo'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 35, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:35 UTC',\n",
" 'message': 'Please, keep the language in the TrollBox to English. Thank you for your understanding. A message from your local MOD SQUAD.',\n",
" 'message_id': 18568551,\n",
" 'reputation': 1170,\n",
" 'username': 'Popcorntime'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 33, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:33 UTC',\n",
" 'message': 'This is a trollbox not a rocket launch command centre, please avoid rockets/fly/UP/Moon as much as possible',\n",
" 'message_id': 18568550,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 32, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:32 UTC',\n",
" 'message': 'ex7231, SC for real and DGB for speculation bc kids wants quick money',\n",
" 'message_id': 18568549,\n",
" 'reputation': 0,\n",
" 'username': 'CyberKing'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 32, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:32 UTC',\n",
" 'message': 'FlatEarth, They dropped it down the back of the sofa.',\n",
" 'message_id': 18568548,\n",
" 'reputation': 485,\n",
" 'username': 'AutoWhale'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 28, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:28 UTC',\n",
" 'message': 'khota kay bacho dgb buy kero',\n",
" 'message_id': 18568547,\n",
" 'reputation': 0,\n",
" 'username': 'mas.exchanging'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 28, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:28 UTC',\n",
" 'message': 'Larillo, shorter ? I long time Bro !!!',\n",
" 'message_id': 18568546,\n",
" 'reputation': 67,\n",
" 'username': 'FlatEarth'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 26, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:26 UTC',\n",
" 'message': 'zaizakitano banned for 1 days, 0 hours, and 0 minutes by Popcorntime.',\n",
" 'message_id': 18568545,\n",
" 'reputation': 0,\n",
" 'username': 'Banhammer'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 24, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:24 UTC',\n",
" 'message': 'SC is rounding 800 soon',\n",
" 'message_id': 18568544,\n",
" 'reputation': 11,\n",
" 'username': 'chrisjlabrie'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 23, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:23 UTC',\n",
" 'message': 'SC next king',\n",
" 'message_id': 18568543,\n",
" 'reputation': 0,\n",
" 'username': 'rubensimpson'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 21, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:21 UTC',\n",
" 'message': 'this is funny man, you guys know hwo to pump',\n",
" 'message_id': 18568542,\n",
" 'reputation': 0,\n",
" 'username': 'Donator'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 19, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:19 UTC',\n",
" 'message': 'ok here is the real test for BCN, folks.',\n",
" 'message_id': 18568541,\n",
" 'reputation': 7,\n",
" 'username': 'supahflyninja'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 19, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:19 UTC',\n",
" 'message': 'a159357 banned for 1 days, 0 hours, and 0 minutes by Popcorntime.',\n",
" 'message_id': 18568540,\n",
" 'reputation': 0,\n",
" 'username': 'Banhammer'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 19, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:19 UTC',\n",
" 'message': 'bcn up',\n",
" 'message_id': 18568539,\n",
" 'reputation': 0,\n",
" 'username': 'zaizakitano'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 15, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:15 UTC',\n",
" 'message': 'hope itll provide me now the promised 3k',\n",
" 'message_id': 18568538,\n",
" 'reputation': 0,\n",
" 'username': 'Gimpel'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 12, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:12 UTC',\n",
" 'message': 'hello people!',\n",
" 'message_id': 18568537,\n",
" 'reputation': 64,\n",
" 'username': 'AlienX101'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 10, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:10 UTC',\n",
" 'message': 'ludovic.palmisano, Lets not be suggesting trades please. Thank You !',\n",
" 'message_id': 18568536,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 10, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:10 UTC',\n",
" 'message': 'AlienX101, ok thanks bro.',\n",
" 'message_id': 18568535,\n",
" 'reputation': 0,\n",
" 'username': 'Randeepchopra'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:08 UTC',\n",
" 'message': 'FlatEarth, DGB GAME SC',\n",
" 'message_id': 18568534,\n",
" 'reputation': 20,\n",
" 'username': 'Larillo'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 6, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:06 UTC',\n",
" 'message': 'dash dump',\n",
" 'message_id': 18568533,\n",
" 'reputation': 0,\n",
" 'username': 'a159357'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 6, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:06 UTC',\n",
" 'message': \"ex7231, Sorry I don't like give wrong information, have no clue. I'm good in DGB, SC, Golem, ETH, and BTC\",\n",
" 'message_id': 18568532,\n",
" 'reputation': 0,\n",
" 'username': 'CyberKing'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 4, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:04 UTC',\n",
" 'message': 'Sia is a very promising company. Siacoin smart investment.',\n",
" 'message_id': 18568531,\n",
" 'reputation': 0,\n",
" 'username': 'lindormusai'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 8, 4, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:04 UTC',\n",
" 'message': 'come guys lets do yesterday DGB',\n",
" 'message_id': 18568530,\n",
" 'reputation': 0,\n",
" 'username': 'michal.sedlek'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 59, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:59 UTC',\n",
" 'message': 'I trusted in that bitchy DGB',\n",
" 'message_id': 18568529,\n",
" 'reputation': 0,\n",
" 'username': 'Gimpel'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 57, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:57 UTC',\n",
" 'message': 'Hello! happy trades to you Donator, thank you o/',\n",
" 'message_id': 18568528,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 55, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:55 UTC',\n",
" 'message': 'i love you all',\n",
" 'message_id': 18568527,\n",
" 'reputation': 0,\n",
" 'username': 'oikigbeme'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 55, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:55 UTC',\n",
" 'message': \"CyberKing, I don't think they will. I'd consider Sia more of a backend solution.\",\n",
" 'message_id': 18568526,\n",
" 'reputation': 0,\n",
" 'username': 'ea96b'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 55, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:55 UTC',\n",
" 'message': 'anyone called the odds correctly on DGB and made a lot of cash today?',\n",
" 'message_id': 18568525,\n",
" 'reputation': 0,\n",
" 'username': 'bmalslamrod69'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 52, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:52 UTC',\n",
" 'message': 'Can a mod please look at this tx, been over 30 minutes, TY 0x71ba86010f7d30b04cb262ae82148b94fcc76b93cb7c9d943e9246a0d3b3e5ee',\n",
" 'message_id': 18568524,\n",
" 'reputation': 0,\n",
" 'username': 'callummontgomery'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 52, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:52 UTC',\n",
" 'message': 'exp',\n",
" 'message_id': 18568523,\n",
" 'reputation': 0,\n",
" 'username': 'ludovic.palmisano'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 52, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:52 UTC',\n",
" 'message': 'Larillo, ETH ?',\n",
" 'message_id': 18568522,\n",
" 'reputation': 67,\n",
" 'username': 'FlatEarth'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 50, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:50 UTC',\n",
" 'message': 'You dont understand that GAMERs will be pushing the price of GAME up simply buy buying a game on the store...',\n",
" 'message_id': 18568521,\n",
" 'reputation': 115,\n",
" 'username': 'btcnerd'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 48, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:48 UTC',\n",
" 'message': 'can we turn off the auto-scroll on this window?',\n",
" 'message_id': 18568520,\n",
" 'reputation': 0,\n",
" 'username': 'gabrieldib'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 45, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:45 UTC',\n",
" 'message': 'CyberKing, for realz? or just speculation?',\n",
" 'message_id': 18568519,\n",
" 'reputation': 0,\n",
" 'username': 'ex7231'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 43, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:43 UTC',\n",
" 'message': \"Let's try to make posts with more substance than just a coin name and one word. Thank you\",\n",
" 'message_id': 18568518,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 43, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:43 UTC',\n",
" 'message': 'jockeh_1 and josechacinu banned for 1 days, 0 hours, and 0 minutes by Xoblort.',\n",
" 'message_id': 18568517,\n",
" 'reputation': 0,\n",
" 'username': 'Banhammer'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 43, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:43 UTC',\n",
" 'message': 'FlatEarth, quite good, made some profits last night',\n",
" 'message_id': 18568516,\n",
" 'reputation': 20,\n",
" 'username': 'Larillo'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 43, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:43 UTC',\n",
" 'message': 'mitchelreedijk67, ah, well when ever the sell price matches the buy price of your currency',\n",
" 'message_id': 18568515,\n",
" 'reputation': 0,\n",
" 'username': 'PlasmaHydra'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 43, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:43 UTC',\n",
" 'message': 'Chinese don t have ETH anymore ???',\n",
" 'message_id': 18568514,\n",
" 'reputation': 67,\n",
" 'username': 'FlatEarth'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 41, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:41 UTC',\n",
" 'message': 'btcnerd, no thanks',\n",
" 'message_id': 18568513,\n",
" 'reputation': 33,\n",
" 'username': 'WhaleOnMe'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 38, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:38 UTC',\n",
" 'message': 'Xoblort, ok',\n",
" 'message_id': 18568512,\n",
" 'reputation': 0,\n",
" 'username': 'Donator'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 36, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:36 UTC',\n",
" 'message': 'GA880820, origi587, Sorry we mods do not have access to tickets on the support side. please lets wait for support to respond, appreciate the patience',\n",
" 'message_id': 18568511,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 29, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:29 UTC',\n",
" 'message': 'DGB HIGH',\n",
" 'message_id': 18568510,\n",
" 'reputation': 0,\n",
" 'username': 'josechacinu'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 27, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:27 UTC',\n",
" 'message': 'Donator, Please avoid the capslock, Thank you',\n",
" 'message_id': 18568509,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 27, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:27 UTC',\n",
" 'message': 'Xoblort, Please help me ticker 197387',\n",
" 'message_id': 18568508,\n",
" 'reputation': 2,\n",
" 'username': 'origi587'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 25, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:25 UTC',\n",
" 'message': 'Randeepchopra, safe khelna hai to 2800 tak ya fir 3000 ke upar tak ruk jao',\n",
" 'message_id': 18568507,\n",
" 'reputation': 64,\n",
" 'username': 'AlienX101'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 25, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:25 UTC',\n",
" 'message': 'WhaleOnMe, then dont buy it man.....But mark my words you will see a massive rise...',\n",
" 'message_id': 18568506,\n",
" 'reputation': 115,\n",
" 'username': 'btcnerd'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 25, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:25 UTC',\n",
" 'message': 'Bitcoin went up to 2500 http://coinhaunt.com/',\n",
" 'message_id': 18568505,\n",
" 'reputation': 0,\n",
" 'username': 'jay.mokashi'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 25, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:25 UTC',\n",
" 'message': 'by end of the month',\n",
" 'message_id': 18568504,\n",
" 'reputation': 0,\n",
" 'username': 'ivafamanesh'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 20, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:20 UTC',\n",
" 'message': 'maybe finish at 20 cents',\n",
" 'message_id': 18568503,\n",
" 'reputation': 0,\n",
" 'username': 'ivafamanesh'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 20, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:20 UTC',\n",
" 'message': 'mitchelreedijk67, depends on the price',\n",
" 'message_id': 18568502,\n",
" 'reputation': 0,\n",
" 'username': 'nathan.z.b'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 18, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:18 UTC',\n",
" 'message': 'burst the next SC',\n",
" 'message_id': 18568501,\n",
" 'reputation': 64,\n",
" 'username': 'Memegod'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 18, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:18 UTC',\n",
" 'message': 'Larillo, market ?',\n",
" 'message_id': 18568500,\n",
" 'reputation': 67,\n",
" 'username': 'FlatEarth'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 17, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:17 UTC',\n",
" 'message': 'just put coin towards dgb',\n",
" 'message_id': 18568499,\n",
" 'reputation': 0,\n",
" 'username': 'ivafamanesh'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 15, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:15 UTC',\n",
" 'message': \"TurbineBase, ex7231, Don't list to me bro, go read about SC they will take over Amazon and Google dropbox services\",\n",
" 'message_id': 18568498,\n",
" 'reputation': 0,\n",
" 'username': 'CyberKing'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 13, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:13 UTC',\n",
" 'message': 'lol',\n",
" 'message_id': 18568497,\n",
" 'reputation': 0,\n",
" 'username': 'GhostOf2016'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 10, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:10 UTC',\n",
" 'message': 'PlasmaHydra, damn but im buying it',\n",
" 'message_id': 18568496,\n",
" 'reputation': 0,\n",
" 'username': 'mitchelreedijk67'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:08 UTC',\n",
" 'message': 'dgb maybe intersting',\n",
" 'message_id': 18568495,\n",
" 'reputation': 0,\n",
" 'username': 'ivafamanesh'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:08 UTC',\n",
" 'message': 'Xoblort, Why is my Level 2 Verification always incomplete???',\n",
" 'message_id': 18568494,\n",
" 'reputation': 0,\n",
" 'username': 'GA880820'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:08 UTC',\n",
" 'message': 'lol ish just crazy out here',\n",
" 'message_id': 18568493,\n",
" 'reputation': 79,\n",
" 'username': 'DongQuixote'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:08 UTC',\n",
" 'message': 'Managothic, XPR Is instant not all are',\n",
" 'message_id': 18568492,\n",
" 'reputation': 0,\n",
" 'username': 'Monsterskater'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 8, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:08 UTC',\n",
" 'message': 'Suck up to pink for Mark she huh stroking anyone',\n",
" 'message_id': 18568491,\n",
" 'reputation': 7,\n",
" 'username': 'BlinkUBroke'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 6, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:06 UTC',\n",
" 'message': 'sc go',\n",
" 'message_id': 18568490,\n",
" 'reputation': 0,\n",
" 'username': 'seokhwa89'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 3, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:03 UTC',\n",
" 'message': 'mobilego app released for game?',\n",
" 'message_id': 18568489,\n",
" 'reputation': 248,\n",
" 'username': 'mybestestfriend'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 3, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:03 UTC',\n",
" 'message': 'or GAME',\n",
" 'message_id': 18568488,\n",
" 'reputation': 115,\n",
" 'username': 'btcnerd'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 3, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:03 UTC',\n",
" 'message': '12et2pq-5233, NOW',\n",
" 'message_id': 18568487,\n",
" 'reputation': 0,\n",
" 'username': 'Donator'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 3, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:03 UTC',\n",
" 'message': 'For those who keep asking about SIA + wallet issues on POLO -read up here: https://twitter.com/SiaTechHQ/status/870482084896354308',\n",
" 'message_id': 18568486,\n",
" 'reputation': 21,\n",
" 'username': 'Enoch'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 1, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:01 UTC',\n",
" 'message': 'mitchelreedijk67, Ask dormammu',\n",
" 'message_id': 18568485,\n",
" 'reputation': 0,\n",
" 'username': 'gabrieldib'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 1, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:01 UTC',\n",
" 'message': 'CyberKing, i tried to say sheet coin?',\n",
" 'message_id': 18568484,\n",
" 'reputation': 0,\n",
" 'username': 'ex7231'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 1, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:01 UTC',\n",
" 'message': 'Xoblort, Could you cancel my 0.8165BTC cash register? or hope to deal with BTC withdrawals as soon as possible',\n",
" 'message_id': 18568483,\n",
" 'reputation': 0,\n",
" 'username': '18501535715'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 1, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:01 UTC',\n",
" 'message': 'FlatEarth, marrekt? xd',\n",
" 'message_id': 18568482,\n",
" 'reputation': 20,\n",
" 'username': 'Larillo'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, 1, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:01 UTC',\n",
" 'message': 'btcnerd, 63 million coins available at 4.63 per coin. no thanks.',\n",
" 'message_id': 18568481,\n",
" 'reputation': 33,\n",
" 'username': 'WhaleOnMe'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 7, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:07:00 UTC',\n",
" 'message': 'Why is it taking so long to complete a withdrawl?',\n",
" 'message_id': 18568480,\n",
" 'reputation': 0,\n",
" 'username': 'CryptoSentient'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 56, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:56 UTC',\n",
" 'message': 'watch DGB',\n",
" 'message_id': 18568479,\n",
" 'reputation': 0,\n",
" 'username': 'jockeh_1'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 54, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:54 UTC',\n",
" 'message': 'hi',\n",
" 'message_id': 18568478,\n",
" 'reputation': 0,\n",
" 'username': '2ahmadiar'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 54, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:54 UTC',\n",
" 'message': 'mitchelreedijk67, when ever the buy price matches what youre selling it for',\n",
" 'message_id': 18568477,\n",
" 'reputation': 0,\n",
" 'username': 'PlasmaHydra'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 51, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:51 UTC',\n",
" 'message': 'guck0101 banned for 1 days, 0 hours, and 0 minutes by Popcorntime.',\n",
" 'message_id': 18568476,\n",
" 'reputation': 0,\n",
" 'username': 'Banhammer'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 49, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:49 UTC',\n",
" 'message': 'guck0101, where we ehaded',\n",
" 'message_id': 18568475,\n",
" 'reputation': 78,\n",
" 'username': 'BIGFKNFATE'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 49, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:49 UTC',\n",
" 'message': 'whats news on game ?',\n",
" 'message_id': 18568474,\n",
" 'reputation': 248,\n",
" 'username': 'mybestestfriend'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 49, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:49 UTC',\n",
" 'message': 'omg isnt cryptocurrency transfer instant??? why do i have to wait so long for my Polo account to be credited',\n",
" 'message_id': 18568473,\n",
" 'reputation': 0,\n",
" 'username': 'Managothic'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 49, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:49 UTC',\n",
" 'message': 'thanks for the ride SIA',\n",
" 'message_id': 18568472,\n",
" 'reputation': 0,\n",
" 'username': 'sqrtx'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 47, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:47 UTC',\n",
" 'message': 'Xoblort, Popcorntime. I messed up a NEM deposit with not including the identifing #. That a mod fix or something for support?(i got a ticket in already)',\n",
" 'message_id': 18568471,\n",
" 'reputation': 0,\n",
" 'username': 'tigerbombtrading'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 47, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:47 UTC',\n",
" 'message': 'Popcorntime, My account frozen 7day',\n",
" 'message_id': 18568470,\n",
" 'reputation': 0,\n",
" 'username': 'TurbineBase'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 45, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:45 UTC',\n",
" 'message': 'sia buy order added anothe 300 btc in 30 mins',\n",
" 'message_id': 18568469,\n",
" 'reputation': 0,\n",
" 'username': 'dixon.zim'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 44, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:44 UTC',\n",
" 'message': \"sc price in bittrex is 10% cheaper it's 660 now\",\n",
" 'message_id': 18568468,\n",
" 'reputation': 0,\n",
" 'username': 'seodh1229'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 42, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:42 UTC',\n",
" 'message': 'TurbineBase, Sorry i cant view tickets. Please allow support more time to respond to your ticket',\n",
" 'message_id': 18568467,\n",
" 'reputation': 1170,\n",
" 'username': 'Popcorntime'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 42, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:42 UTC',\n",
" 'message': 'strat up up up up up',\n",
" 'message_id': 18568466,\n",
" 'reputation': 0,\n",
" 'username': 'guck0101'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 38, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:38 UTC',\n",
" 'message': 'huaaahahaha thank you everyone :D',\n",
" 'message_id': 18568465,\n",
" 'reputation': 0,\n",
" 'username': 'komurasaki'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 38, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:38 UTC',\n",
" 'message': 'Okaykoko, support tickets and verifications may be taking longer due to larger than normal queues, we do apologize',\n",
" 'message_id': 18568464,\n",
" 'reputation': 6558,\n",
" 'username': 'Xoblort'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 38, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:38 UTC',\n",
" 'message': 'RStrayer, and??? if it makes money, its good xd',\n",
" 'message_id': 18568463,\n",
" 'reputation': 20,\n",
" 'username': 'Larillo'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 36, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:36 UTC',\n",
" 'message': 'BossBasso, old to;;?',\n",
" 'message_id': 18568462,\n",
" 'reputation': 3,\n",
" 'username': 'msmicromax'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 35, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:35 UTC',\n",
" 'message': 'when is the best time to sell DGB ?',\n",
" 'message_id': 18568461,\n",
" 'reputation': 0,\n",
" 'username': '12et2pq-5233'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 35, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:35 UTC',\n",
" 'message': 'Larillo, waza ? marekt ??',\n",
" 'message_id': 18568460,\n",
" 'reputation': 67,\n",
" 'username': 'FlatEarth'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 35, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:35 UTC',\n",
" 'message': 'srotman, i dont have eta sorry',\n",
" 'message_id': 18568459,\n",
" 'reputation': 1170,\n",
" 'username': 'Popcorntime'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 33, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:33 UTC',\n",
" 'message': 'AlienX101, khan tak jayega? means kab bech du?',\n",
" 'message_id': 18568458,\n",
" 'reputation': 0,\n",
" 'username': 'Randeepchopra'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 33, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:33 UTC',\n",
" 'message': 'wow,super growing sc',\n",
" 'message_id': 18568457,\n",
" 'reputation': 0,\n",
" 'username': 'bjork271828'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 31, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:31 UTC',\n",
" 'message': 'mycool57, SC long run real Technology',\n",
" 'message_id': 18568456,\n",
" 'reputation': 0,\n",
" 'username': 'CyberKing'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 31, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:31 UTC',\n",
" 'message': 'Randeepchopra, 2600?',\n",
" 'message_id': 18568455,\n",
" 'reputation': 64,\n",
" 'username': 'AlienX101'},\n",
" {'date': datetime.datetime(2017, 6, 4, 8, 6, 27, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:06:27 UTC',\n",
" 'message': 'Xoblort, thanks anyway!',\n",
" 'message_id': 18568454,\n",
" 'reputation': 2,\n",
" 'username': 'Gramsci'}]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_chat_list(html_list[4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"전체 html 에 대해서 위에서 작성한 get_chat_list 함수를 통해 모든 채팅 데이터 파싱( 만여개가 넘어서 오래걸렸음 )"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# total_res_list = [] \n",
"# for html in html_list:\n",
"# total_res_list += get_chat_list(html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"total_res_dic = {}\n",
"cnt = 0\n",
"for html in html_list:\n",
" res = get_chat_list(html)\n",
" for i in res:\n",
" if i['message_id'] in total_res_dic:\n",
" continue\n",
" else:\n",
" total_res_dic[i['message_id']] = i\n",
" if cnt%500 == 0:\n",
" print(cnt)\n",
" cnt += 1"
]
},
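{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above keeps the first record seen for each `message_id`, so a message that shows up in more than one saved page is stored only once. A minimal sketch of that behaviour on made-up records (the `pages` list and its contents are hypothetical, not real chat data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# hypothetical pages: each page is a list of parsed message dicts\n",
"pages = [\n",
"    [{'message_id': 1, 'message': 'a'}, {'message_id': 2, 'message': 'b'}],\n",
"    [{'message_id': 2, 'message': 'b'}, {'message_id': 3, 'message': 'c'}],\n",
"]\n",
"\n",
"merged = {}\n",
"for page in pages:\n",
"    for msg in page:\n",
"        # setdefault stores msg only when this id has not been seen yet\n",
"        merged.setdefault(msg['message_id'], msg)\n",
"\n",
"# message 2 appears in both pages but is kept once\n",
"print(sorted(merged))"
]
},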
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pickle # 힘들게 파싱했으니 pickle 형식으로 serialize 하여 저장, 17,000개 기준으로 300MB\n",
"pickle.dump(total_res_dic, open('total_res_dic.p', 'wb'))\n",
"# total_res_list = pickle.load(open('total_res_dic.p','rb'))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1658602"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(total_res_dic)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'date': datetime.datetime(2017, 6, 4, 8, 8, 35, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:08:35 UTC',\n",
" 'message': 'Please, keep the language in the TrollBox to English. Thank you for your understanding. A message from your local MOD SQUAD.',\n",
" 'message_id': 18568551,\n",
" 'reputation': 1170,\n",
" 'username': 'Popcorntime'}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_res_dic[18568551]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"total_res_dic_with_senti = []"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for k,v in iter(total_res_dic.items()):\n",
" tmp_dic = v.copy()\n",
" res = sid.polarity_scores(tmp_dic['message'])\n",
" tmp_dic['neg'] = res['neg']\n",
" tmp_dic['neu'] = res['neu']\n",
" tmp_dic['pos'] = res['pos']\n",
" tmp_dic['compound'] = res['compound']\n",
" total_res_dic_with_senti.append(tmp_dic)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'compound': 0.0,\n",
" 'date': datetime.datetime(2017, 6, 4, 8, 33, 3, tzinfo=tzutc()),\n",
" 'date_str': '2017-06-04 08:33:03 UTC',\n",
" 'message': 'what is the minimum for etc?',\n",
" 'message_id': 18569509,\n",
" 'neg': 0.0,\n",
" 'neu': 1.0,\n",
" 'pos': 0.0,\n",
" 'reputation': 0,\n",
" 'username': 'ronald.narvasa.rn'}"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_res_dic_with_senti"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pickle\n",
"# pickle.dump(total_res_dic_with_senti, open('total_res_dic_with_senti.p', 'wb'))\n",
"total_res_dic_with_senti = pickle.load(open('total_res_dic_with_senti.p','rb'))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1658602"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(total_res_dic_with_senti)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gensim 을 통한 word2vec, Doc2vec"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'2.1.0'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gensim # doc2vec 을 위해 gensim library 사용 \n",
"gensim.__version__"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'3.2.4'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"nltk.__version__"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# nltk 에 내장된 snowball stemmer 사용 \n",
"from nltk.stem.snowball import SnowballStemmer\n",
"stemmer = SnowballStemmer(\"english\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"run\n"
]
}
],
"source": [
"# stemmer 예제, 진행형, 과거형, 복수형을 원형으로 변환해줌\n",
"print(stemmer.stem(\"running\"))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from nltk import word_tokenize"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Bitcoin',\n",
" 'has',\n",
" 'left',\n",
" 'gold',\n",
" 'in',\n",
" 'the',\n",
" 'dust',\n",
" 'in',\n",
" 'recent',\n",
" 'months',\n",
" '.']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# nltk 의 기본 word_tokenize 를 통해 문장 토크나이징 테스트\n",
"word_tokenize('Bitcoin has left gold in the dust in recent months.')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 문장 입력에 대해 토크나이징 및 스테밍을 함께 수행하여 토큰을 리턴해주는 함수 정의 \n",
"def stemming(text):\n",
" return [stemmer.stem(x) for x in word_tokenize(text)]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['it', 'run']"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stemming('its running')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['bitcoin', 'is', 'better', 'than', 'gold']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stemming('Bitcoin Is Better Than Gold')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['talk',\n",
" 'about',\n",
" 'pump/dump',\n",
" 'group',\n",
" 'or',\n",
" 'announc',\n",
" 'pumps/dump',\n",
" 'is',\n",
" 'not',\n",
" 'want',\n",
" 'here',\n",
" '.',\n",
" 'thank',\n",
" 'you',\n",
" 'for',\n",
" 'your',\n",
" 'understand',\n",
" '.',\n",
" 'a',\n",
" 'messag',\n",
" 'from',\n",
" 'your',\n",
" 'local',\n",
" 'mod',\n",
" 'squad',\n",
" '.']"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stemming(total_res_dic_with_senti[0]['message'])"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import namedtuple\n",
"TaggedDocument = namedtuple('TaggedDocument', 'words tags')\n",
"# from gensim.models.doc2vec import TaggedDocument\n",
"\n",
"# 임계치에 따라 긍부정 태깅하여 Doc2vec 을 위한 문서형태로 저장\n",
"tagged_document_list = []\n",
"pos_count = 0\n",
"neg_count = 0\n",
"for i in total_res_dic_with_senti[:]: # sample\n",
" label = []\n",
" if i['pos'] > 0.5 and i['neg'] < 0.4:\n",
" label = [1]\n",
" pos_count += 1\n",
" elif i['neg'] > 0.5 and i['pos'] < 0.4:\n",
" label = [0]\n",
" neg_count += 1\n",
" else:\n",
" continue\n",
" tagged_document_list.append(TaggedDocument(stemming(i['message']), label))\n"
]
},
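{
"cell_type": "markdown",
"metadata": {},
"source": [
"The thresholding rule above can be read as a small pure function. The sketch below restates it with the same cut-offs; the function name `label_from_scores` and its parameter names are made up for illustration and are not part of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def label_from_scores(pos, neg, strong=0.5, weak=0.4):\n",
"    # returns 1 (positive), 0 (negative), or None (message is skipped)\n",
"    if pos > strong and neg < weak:\n",
"        return 1\n",
"    if neg > strong and pos < weak:\n",
"        return 0\n",
"    return None\n",
"\n",
"print(label_from_scores(0.7, 0.0))   # clearly positive\n",
"print(label_from_scores(0.0, 0.6))   # clearly negative\n",
"print(label_from_scores(0.0, 0.0))   # fully neutral message, skipped"
]
},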
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1658602"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(total_res_dic_with_senti)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# 태깅 데이터 임시 저장 및 로드\n",
"import pickle\n",
"# pickle.dump(tagged_document_list, open('tagged_document_list.p','wb'))\n",
"tagged_document_list = pickle.load(open('tagged_document_list.p','rb'))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"179412"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(tagged_document_list)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# 학습 및 테스트를 공평하게 하기위해 랜덤하게 training set 은 80%, test set 은 20%로 분할 \n",
"from random import shuffle\n",
"from math import ceil\n",
"shuffle(tagged_document_list)\n",
"persent = ceil(float(len(tagged_document_list))/100.0)\n",
"test_set_persentage = 20\n",
"test_set = tagged_document_list[:persent*test_set_persentage]\n",
"train_set = tagged_document_list[persent*test_set_persentage:]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"35900"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(test_set)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"143512"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train_set)"
]
},
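{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two sizes above follow from rounding one percent of the corpus up before multiplying by 20, so the test set is slightly larger than an exact 20%. A short check of that arithmetic, using the document count reported earlier (the variable names here are illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from math import ceil\n",
"\n",
"n_docs = 179412                    # len(tagged_document_list) from the recorded output\n",
"one_percent = ceil(n_docs / 100)   # rounded up to a whole number of documents\n",
"n_test = one_percent * 20\n",
"n_train = n_docs - n_test\n",
"print(n_test, n_train)"
]
},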
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import doc2vec\n",
"# 사전 구축\n",
"doc_vectorizer = doc2vec.Doc2Vec(size=300, alpha=0.025, min_alpha=0.025, seed=1234)\n",
"doc_vectorizer.build_vocab(train_set)\n",
"# doc2vec 학습\n",
"for epoch in range(10):\n",
" doc_vectorizer.train(tagged_document_list,total_examples=doc_vectorizer.corpus_count, epochs=doc_vectorizer.iter)\n",
" doc_vectorizer.alpha -= 0.002 # decrease the learning rate\n",
" doc_vectorizer.min_alpha = doc_vectorizer.alpha # fix the learning rate, no decay"
]
},
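{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training loop above manages the learning rate by hand: `min_alpha` is pinned to `alpha`, so the rate stays constant within each `train` call and drops by 0.002 between calls. The sketch below only reproduces that schedule with plain arithmetic (no gensim involved) to show that it runs from 0.025 down to 0.007 over the ten iterations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"alpha = 0.025\n",
"schedule = []\n",
"for epoch in range(10):\n",
"    # rate used for this call to train(); rounded only for display\n",
"    schedule.append(round(alpha, 3))\n",
"    alpha -= 0.002\n",
"print(schedule)"
]
},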
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('dash', 0.41827628016471863),\n",
" ('xrp', 0.41451185941696167),\n",
" ('etc', 0.40095293521881104),\n",
" ('ltc', 0.39470094442367554),\n",
" ('zcash', 0.38932809233665466),\n",
" ('zec', 0.389003723859787),\n",
" ('usdt', 0.38660886883735657),\n",
" ('xmr', 0.3814651668071747),\n",
" ('dab', 0.38086938858032227),\n",
" ('vtc', 0.378325879573822)]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 학습된 문서, 단어에 대해서 유사 vector 를 지닌 값 출력 \n",
"doc_vectorizer.most_similar('eth')"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('btc', 0.4619801342487335),\n",
" ('etc', 0.394686222076416),\n",
" ('buterin', 0.38618335127830505),\n",
" ('ether', 0.3723365068435669),\n",
" ('digibyt', 0.35357969999313354),\n",
" ('zcrash', 0.3457624912261963),\n",
" ('foldingcoin', 0.3446189761161804),\n",
" ('eur', 0.33972662687301636),\n",
" ('emc', 0.33606672286987305),\n",
" ('dab', 0.33002081513404846)]"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_vectorizer.most_similar('bitcoin')"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# gensim 모델 저장 및 불러오기 가능\n",
"# doc_vectorizer.save('doc_vectorizer')\n",
"# doc_vectorizer2 = doc2vec.Doc2Vec.load('doc_vectorizer')"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('maid', 0.42228561639785767),\n",
" ('xrp', 0.4165087044239044),\n",
" ('ltc', 0.4158686399459839),\n",
" ('etc', 0.41268616914749146),\n",
" ('xmr', 0.4046849012374878),\n",
" ('dash', 0.39771586656570435),\n",
" ('bitstamp', 0.38785219192504883),\n",
" ('sbd', 0.3800865113735199),\n",
" ('usdt', 0.37259727716445923),\n",
" ('vtc', 0.3709067702293396)]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_vectorizer2.most_similar('eth')"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# 학습 및 평가, 분류를 위한 training set 문서들의 벡터 리스팅\n",
"train_x = [doc_vectorizer.infer_vector(doc.words) for doc in train_set]\n",
"train_y = [doc.tags[0] for doc in train_set]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 학습 및 평가, 분류를 위한 test set 문서들의 벡터 리스팅\n",
"test_x = [doc_vectorizer.infer_vector(doc.words) for doc in test_set]\n",
"test_y = [doc.tags[0] for doc in test_set]"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"143512\n",
"300\n",
"35900\n",
"300\n"
]
}
],
"source": [
"print(len(train_x))\n",
"print(len(train_x[0]))\n",
"\n",
"print(len(test_x))\n",
"print(len(test_x[0]))\n",
"# => 300"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.82038997214484677"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sickit-learn 의 logistic regression 을 통해 classify 및 정확도 체크\n",
"from sklearn.linear_model import LogisticRegression\n",
"classifier = LogisticRegression(random_state=1234)\n",
"classifier.fit(train_x, train_y)\n",
"classifier.score(test_x, test_y)"
]
},
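{
"cell_type": "markdown",
"metadata": {},
"source": [
"`classifier.score` reports mean accuracy, which is easier to interpret next to the share of the most common class, since a classifier that always predicts that class already reaches this figure. A minimal sketch with hypothetical labels; in this notebook the input would be `test_y`, and the helper name `majority_baseline` is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def majority_baseline(labels):\n",
"    # fraction of labels that belong to the most frequent class\n",
"    counts = Counter(labels)\n",
"    return counts.most_common(1)[0][1] / len(labels)\n",
"\n",
"toy_labels = [1, 1, 1, 0]   # hypothetical labels, not real data\n",
"print(majority_baseline(toy_labels))"
]
},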
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 결론: 82% 의 높은 정확도 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "tensorflow",
"language": "python",
"name": "venv3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}