@euphoris
Created November 10, 2018 14:50
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "practice11.ipynb",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"metadata": {
"id": "osMMgqsAEZZd",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Scraping a bulletin board\n",
"\n",
"### Changing pages"
]
},
{
"metadata": {
"id": "dqn58OXebXSb",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import requests"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "HuSGinocEdry",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"URL of the notice board on the Kookmin University website. The `pn=` parameter selects the page. `{}` marks the spot where the page number will be filled in."
]
},
{
"metadata": {
"id": "WgmAkyEhcMbO",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"url = 'https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn={}'"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "I2b-7inoEnxS",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Print the URL while stepping the page number from 0 through 9"
]
},
{
"metadata": {
"id": "RISFXGvIcZx8",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
},
"outputId": "80332c06-6102-4812-f26a-28caadee4049"
},
"cell_type": "code",
"source": [
"for page in range(10):\n",
"    print(url.format(page))  # show the URL for each page number"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=0\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=1\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=2\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=3\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=4\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=5\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=6\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=7\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=8\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=9\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "4hQ7cuV5dLi1",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"!pip install lxml"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "agESh8vFdWeA",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"!pip install cssselect"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "IVvfRgSSdILR",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import lxml.html"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "ZI3nCW8iEsiQ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Getting post URLs\n",
"\n",
"Fetch page 0"
]
},
{
"metadata": {
"id": "r8Xj2EO5cbBv",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"res = requests.get(url.format(0))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "K5IbK9DydKAJ",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"root = lxml.html.fromstring(res.text)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "QfGSE5ZjeYWi",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from urllib.parse import urljoin"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "p4_KymRudZKI",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"outputId": "8a0dfa05-d752-4ad3-aaae-4619d00483f4"
},
"cell_type": "code",
"source": [
"for link in root.cssselect('.boardlist a'):  # collect every <a> link under class=\"boardlist\"\n",
"    print(urljoin(url, link.attrib['href']))  # read the href attribute and resolve it against the board URL"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": [
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122457\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122428\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122425\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122408\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122403\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122396\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122382\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122356\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122338\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122319\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122256\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122255\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122177\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122215\n",
"https://www.kookmin.ac.kr/site/ecampus/notice/all/122212\n"
],
"name": "stdout"
}
]
},
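{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"A minimal offline sketch of the same step, assuming the list page looks roughly like the hypothetical `sample_html` below (the real markup is not reproduced here). It selects the links under `.boardlist` and resolves each relative `href` with `urljoin`, which drops the `?&pn=` query and appends the post number to the directory part of the board URL."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import lxml.html\n",
"from urllib.parse import urljoin\n",
"\n",
"# Hypothetical markup standing in for one list page\n",
"sample_html = '''\n",
"<html><body>\n",
"<div class=\"boardlist\">\n",
"  <a href=\"122457\">first post</a>\n",
"  <a href=\"122428\">second post</a>\n",
"</div>\n",
"<a href=\"/other\">link outside the list</a>\n",
"</body></html>\n",
"'''\n",
"base = 'https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=0'\n",
"\n",
"root = lxml.html.fromstring(sample_html)\n",
"# Only the two links inside .boardlist are selected\n",
"links = [urljoin(base, a.attrib['href']) for a in root.cssselect('.boardlist a')]\n",
"print(links)"
],
"execution_count": 0,
"outputs": []
},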
{
"metadata": {
"id": "r3htR6xjE6Cr",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Getting a post's content"
]
},
{
"metadata": {
"id": "QKYWAX7wfWDw",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"res = requests.get('https://www.kookmin.ac.kr/site/ecampus/notice/all/122212')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "_oa2DADzfzYf",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"res.encoding = 'utf8'"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "wDKA1JH1fao0",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"root = lxml.html.fromstring(res.text)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "88WkDGYOdo3K",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"content = root.cssselect('#view-detail-data')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "-eSfsJvPfmv-",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 55
},
"outputId": "831d3e55-346b-4e3b-9283-8b686ca8d22a"
},
"cell_type": "code",
"source": [
"content[0].text_content()"
],
"execution_count": 32,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'\\n\\t\\xa0\\r\\n\\r\\n국민대학교 창업보육센터 계약직원 모집\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n1. 모집분야 및 응시자격 \\r\\n\\r\\n\\r\\n\\t\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t모집분야\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t인원\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t우대사항\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t공통사항\\r\\n\\t\\t\\t\\r\\n\\t\\t\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t창업보육센터\\r\\n\\r\\n\\t\\t\\t전담인력\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t1명\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t- 창업보육전문매니저 자격증, 경영지도사, 기술경영사, 기술평가사 자격증 소지자 우대\\r\\n\\r\\n\\t\\t\\t- 창업지원 및 창업교육 관련 업무 경력자 우대\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t- 4년제 대학 이상 졸업자 \\r\\n\\r\\n\\t\\t\\t- 아래한글 또는 MS워드, 엑셀, 파워포인트, 포토샵, 일러스트) 활용에 능숙한 자\\r\\n\\r\\n\\t\\t\\t- 해외여행에 결격 사유가 없는 자로 남자는 병역필 또는 면제자\\r\\n\\t\\t\\t\\r\\n\\t\\t\\r\\n\\t\\r\\n\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n2. 제출서류\\r\\n\\r\\n◦입사지원서\\r\\n\\r\\n(필히 본교 홈페이지 www.kookmin.ac.kr에서 다운받아 사용하시기 바랍니다.)\\r\\n\\r\\n◦자기소개서 1부\\r\\n\\r\\n◦대학 졸업 및 성적증명서 원본 각 1부 (반드시 성적증명서는100점 만점 환산 점수 기재된 것)\\r\\n\\r\\n\\xa0\\xa0 가. 편입자는 전적대학 졸업‧성적증명서 포함\\r\\n\\r\\n\\xa0\\xa0 나. 대학원 졸업(수료)자는 학위수여증명서(수료증명서)‧성적증명서 포함\\r\\n\\r\\n◦자격증(외국어성적표 포함) 사본(해당자에 한함) 1부\\r\\n\\r\\n◦경력증명서(해당자에 한함) 1부\\r\\n\\r\\n◦취업보호대상자 증명원(보훈대상자에 한함) 1부\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n3. 제출기간 및 제출처\\r\\n\\r\\n◦제출기간 : 2018. 11. 01.(목) ~ 11. 16.(금)\\r\\n\\r\\n◦제 출 처 : 우편접수 - 국민대학교 산학협력관 214호 창업지원단 사무실\\r\\n\\r\\n\\xa0\\xa0 (마감일 기준 도착분에 한함)\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n4. 전형방법\\r\\n\\r\\n◦1차 전형 : 서류심사\\r\\n\\r\\n◦2차 전형 : 면접\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n5. 전형일정 \\r\\n\\r\\n◦1차 서류심사 : 2018. 11. 19.(월)\\r\\n\\r\\n◦1차 서류심사 결과 통보 : 2018. 11. 20 (화) 예정\\r\\n\\r\\n\\xa0\\xa0 - 1차 서류심사 합격자에 한하여 개별 통지\\r\\n\\r\\n◦2차 면접 : 2018. 11. 22.(목) 11:00 예정\\r\\n\\r\\n◦최종 합격 통보 : 2018. 11. 23.(금) 예정\\r\\n\\r\\n◦임용일자 : 2018.12.03.(월) 예정\\r\\n\\r\\n◦전형일정은 본교 사정에 따라 변동될 수 있습니다.\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n6. 채용조건\\r\\n\\r\\n- 계약직원으로 1년간 고용 후 평가결과에 따라 1년 연장 가능\\r\\n\\r\\n(본 채용은 창업보육센터 사업 전담인력 채용임)\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n7. 
기타\\r\\n\\r\\n서류(우편포함)는 마감일 16:00까지 도착된 것에 한하여 접수(e-mail 접수 불가)\\r\\n\\r\\n주 소 : 20707 서울 성북구 정릉로 77 국민대학교 산학협력관 214호 창업지원단 사무실\\r\\n\\r\\n전 화 : (02) 910 - 5911\\r\\n'"
]
},
"metadata": {
"tags": []
},
"execution_count": 32
}
]
},
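{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The extracted text is full of `\\r\\n`, tabs, and non-breaking spaces (`\\xa0`). If flat text is enough, one possible cleanup (an addition, not part of the original exercise) is to split on whitespace and rejoin with single spaces. The sample string is copied from the start of the output above."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Sample copied from the beginning of the post text shown above\n",
"raw = '\\n\\t\\xa0\\r\\n\\r\\n국민대학교 창업보육센터 계약직원 모집\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n1. 모집분야 및 응시자격 \\r\\n'\n",
"\n",
"# str.split() with no arguments splits on any run of whitespace\n",
"# (including \\xa0), so rejoining collapses each run to one space\n",
"clean = ' '.join(raw.split())\n",
"print(clean)"
],
"execution_count": 0,
"outputs": []
},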
{
"metadata": {
"id": "1v1otR8jE-RO",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Putting it together\n",
"\n",
"Collect post URLs while stepping through the pages"
]
},
{
"metadata": {
"id": "s0wvT6Uwfo9-",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"article_urls = []\n",
"\n",
"for page in range(10):\n",
"    res = requests.get(url.format(page))\n",
"    root = lxml.html.fromstring(res.text)\n",
"    for link in root.cssselect('.boardlist a'):\n",
"        article_urls.append(urljoin(url, link.attrib['href']))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "GWIEWki_ggdG",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "6ff34ef1-d419-4ec6-b887-f6622c3af1fc"
},
"cell_type": "code",
"source": [
"len(article_urls)"
],
"execution_count": 34,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"70"
]
},
"metadata": {
"tags": []
},
"execution_count": 34
}
]
},
{
"metadata": {
"id": "1wI6yHfcFBny",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Fetch the body text of each collected post URL"
]
},
{
"metadata": {
"id": "mDud-Qo0gl5o",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"contents = []\n",
"for article_url in article_urls:                   # for each post URL\n",
"    res = requests.get(article_url)                # request the page\n",
"    res.encoding = 'utf8'                          # set the encoding to UTF-8\n",
"    root = lxml.html.fromstring(res.text)          # parse the HTML\n",
"    content = root.cssselect('#view-detail-data')  # select the body element\n",
"    contents.append(content[0].text_content())     # collect its text"
],
"execution_count": 0,
"outputs": []
},
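{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"`content[0]` raises `IndexError` when a page has no `#view-detail-data` element. A sketch of a guard, using a hypothetical helper `extract_body` (not part of the original notebook) and made-up HTML so that it runs without network access:"
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import lxml.html\n",
"\n",
"def extract_body(html, selector='#view-detail-data'):\n",
"    # Return the text of the first matching element, or None if there is none\n",
"    root = lxml.html.fromstring(html)\n",
"    matches = root.cssselect(selector)\n",
"    if not matches:\n",
"        return None  # skip the page instead of raising IndexError\n",
"    return matches[0].text_content().strip()\n",
"\n",
"# Hypothetical pages used only to exercise the helper\n",
"print(extract_body('<html><body><div id=\"view-detail-data\"> hello </div></body></html>'))\n",
"print(extract_body('<html><body><p>no body element here</p></body></html>'))"
],
"execution_count": 0,
"outputs": []
},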
{
"metadata": {
"id": "UUgIiNbHFJH7",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## 11.2 Word Embedding\n",
"\n",
"(Same as the textbook)"
]
},
{
"metadata": {
"id": "iR7hg-1ignwY",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import requests\n",
"import re\n",
"res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')\n",
"grimm = res.text[2801:530661]\n",
"grimm = re.sub(r'[^a-zA-Z\\. ]', ' ', grimm)\n",
"sentences = grimm.split('. ')  # split into sentences\n",
"data = [s.split() for s in sentences]\n"
],
"execution_count": 0,
"outputs": []
},
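{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"To see what each preprocessing step does, here is the same pipeline applied to a short made-up passage (the `text` value is hypothetical, not taken from the downloaded book). Everything except letters, periods, and spaces becomes a space, then the text is split into sentences and each sentence into words."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import re\n",
"\n",
"# Hypothetical sample standing in for the downloaded book\n",
"text = 'A certain king had a garden. In it stood a tree, which bore golden apples!'\n",
"\n",
"cleaned = re.sub(r'[^a-zA-Z\\. ]', ' ', text)  # keep only letters, periods, spaces\n",
"sentences = cleaned.split('. ')                # split into sentences\n",
"tokens = [s.split() for s in sentences]        # split each sentence into words\n",
"print(tokens)"
],
"execution_count": 0,
"outputs": []
},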
{
"metadata": {
"id": "hyPaPTbmtL-2",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 389
},
"outputId": "fea13696-9d2b-4b9c-fa29-c6b4fcf5351c"
},
"cell_type": "code",
"source": [
"data[0]"
],
"execution_count": 37,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['THE',\n",
" 'GOLDEN',\n",
" 'BIRD',\n",
" 'A',\n",
" 'certain',\n",
" 'king',\n",
" 'had',\n",
" 'a',\n",
" 'beautiful',\n",
" 'garden',\n",
" 'and',\n",
" 'in',\n",
" 'the',\n",
" 'garden',\n",
" 'stood',\n",
" 'a',\n",
" 'tree',\n",
" 'which',\n",
" 'bore',\n",
" 'golden',\n",
" 'apples']"
]
},
"metadata": {
"tags": []
},
"execution_count": 37
}
]
},
{
"metadata": {
"id": "qnYsur95tN1G",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"!pip install gensim"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "Hy9xg7BLt2PO",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from gensim.models.word2vec import Word2Vec"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "vv7W_IGouGnH",
"colab_type": "code",
"colab": {}
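},
"cell_type": "code",
"source": [
"model = Word2Vec(data,         # tokenized sentences (a list of word lists)\n",
"                 sg=1,         # 0: CBOW, 1: Skip-gram\n",
"                 size=100,     # vector dimensionality (renamed vector_size in gensim 4+)\n",
"                 window=3,     # context window (3 words on each side)\n",
"                 min_count=3,  # minimum word frequency (words seen fewer than 3 times are ignored)\n",
"                 workers=4)    # number of worker threads (set close to the number of cores)"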
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "VlBbWfqdvWD5",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"model.save('word2vec.model')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "51cuAPEkv62a",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
},
"outputId": "f85d7c95-ad35-4fc4-b61a-ef8011f09c36"
},
"cell_type": "code",
"source": [
"model.wv.similarity('princess', 'queen')"
],
"execution_count": 44,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
" if np.issubdtype(vec.dtype, np.int):\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.9875084"
]
},
"metadata": {
"tags": []
},
"execution_count": 44
}
]
},
{
"metadata": {
"id": "dd4-kpQYwQdx",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 250
},
"outputId": "a4b1a71b-1ca7-46f8-d00b-7b7ce8e9b69f"
},
"cell_type": "code",
"source": [
"model.wv.most_similar('princess')"
],
"execution_count": 49,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
" if np.issubdtype(vec.dtype, np.int):\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('fox', 0.9914872646331787),\n",
" ('dwarf', 0.9899657964706421),\n",
" ('prince', 0.9898759126663208),\n",
" ('second', 0.9888558387756348),\n",
" ('wedding', 0.9885976314544678),\n",
" ('boy', 0.9884428977966309),\n",
" ('queen', 0.9875084757804871),\n",
" ('youth', 0.9870286583900452),\n",
" ('witch', 0.9852925539016724),\n",
" ('palace', 0.9848740100860596)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 49
}
]
},
{
"metadata": {
"id": "JABHP02Dv_qj",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 250
},
"outputId": "dde4e19e-5259-4344-b23b-94153e32c415"
},
"cell_type": "code",
"source": [
"model.wv.most_similar(positive=['man', 'princess'], negative=['woman'])\n"
],
"execution_count": 50,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
" if np.issubdtype(vec.dtype, np.int):\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('cat', 0.9745951890945435),\n",
" ('miller', 0.9728690981864929),\n",
" ('lady', 0.9711911678314209),\n",
" ('bird', 0.9709718823432922),\n",
" ('bride', 0.9689940214157104),\n",
" ('wolf', 0.9689082503318787),\n",
" ('child', 0.9684101343154907),\n",
" ('huntsman', 0.9650394320487976),\n",
" ('soldier', 0.9645828604698181),\n",
" ('peasant', 0.9642676115036011)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 50
}
]
},
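{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The scores returned by `similarity` and `most_similar` are cosine similarities between word vectors. A small NumPy sketch of that measure, using hypothetical 3-dimensional vectors in place of the 100-dimensional ones trained above:"
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"def cosine(u, v):\n",
"    # Dot product divided by the product of the vector norms\n",
"    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n",
"\n",
"# Hypothetical vectors chosen so the result is easy to check by hand\n",
"a = np.array([1.0, 2.0, 0.0])\n",
"b = np.array([2.0, 4.0, 0.0])  # same direction as a\n",
"c = np.array([0.0, 0.0, 5.0])  # orthogonal to a\n",
"\n",
"print(cosine(a, b))  # same direction: close to 1\n",
"print(cosine(a, c))  # orthogonal: 0"
],
"execution_count": 0,
"outputs": []
},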
{
"metadata": {
"id": "uxi3ybXIxMDP",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "c9eb50d0-2284-4feb-ed48-032f19ef6c42"
},
"cell_type": "code",
"source": [
"from keras.models import Sequential\n",
"from keras.layers import Embedding"
],
"execution_count": 51,
"outputs": [
{
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
],
"name": "stderr"
}
]
},
{
"metadata": {
"id": "7N5AGLjhyIwu",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"NUM_WORDS, EMB_DIM = model.wv.vectors.shape"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "5yhNQBfYyPst",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "a9f4b8d6-fc76-4434-8021-aad0a30fe160"
},
"cell_type": "code",
"source": [
"NUM_WORDS"
],
"execution_count": 54,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2481"
]
},
"metadata": {
"tags": []
},
"execution_count": 54
}
]
},
{
"metadata": {
"id": "qJgukL1hx9Qz",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"nn = Sequential()"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "Fe5SAerRyWVd",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"emb = Embedding(input_dim=NUM_WORDS, output_dim=EMB_DIM,\n",
" trainable=False, weights=[model.wv.vectors])"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "6LdARVmNyBLz",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"nn.add(emb)"
],
"execution_count": 0,
"outputs": []
},
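{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"An `Embedding` layer maps integer word indices to rows of its weight matrix, and with `trainable=False` those rows stay fixed at the pretrained vectors. The lookup itself can be sketched in NumPy; the `vocab` dictionary and `weights` matrix below are hypothetical stand-ins for the word2vec vocabulary and `model.wv.vectors`."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical 4-word vocabulary with 3-dimensional vectors\n",
"vocab = {'king': 0, 'queen': 1, 'garden': 2, 'tree': 3}\n",
"weights = np.arange(12, dtype=float).reshape(4, 3)\n",
"\n",
"sentence = ['king', 'garden', 'tree']\n",
"indices = np.array([vocab[w] for w in sentence])  # words -> integer indices\n",
"\n",
"# The embedding lookup is row selection: one weight row per index\n",
"embedded = weights[indices]\n",
"print(embedded.shape)  # one 3-dimensional vector per word"
],
"execution_count": 0,
"outputs": []
},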
{
"metadata": {
"id": "iSpo5mdQy6Qi",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 232
},
"outputId": "d36e74c0-1fef-4fa7-93d4-668548631af4"
},
"cell_type": "code",
"source": [
"!wget https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view"
],
"execution_count": 57,
"outputs": [
{
"output_type": "stream",
"text": [
"--2018-11-09 04:50:09-- https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view\n",
"Resolving drive.google.com (drive.google.com)... 74.125.195.101, 74.125.195.100, 74.125.195.139, ...\n",
"Connecting to drive.google.com (drive.google.com)|74.125.195.101|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: unspecified [text/html]\n",
"Saving to: ‘view’\n",
"\n",
"\rview [<=> ] 0 --.-KB/s \rview [ <=> ] 131.77K --.-KB/s in 0.05s \n",
"\n",
"2018-11-09 04:50:10 (2.50 MB/s) - ‘view’ saved [134932]\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "YMkM6R4NzlKe",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
},
"outputId": "9cb951b5-9764-42c7-d3c0-bb4b306a35c3"
},
"cell_type": "code",
"source": [
"!unzip ko.zip"
],
"execution_count": 59,
"outputs": [
{
"output_type": "stream",
"text": [
"Archive: ko.zip\n",
" inflating: ko.bin \n",
" inflating: ko.tsv \n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "TX2XUK4G1-9u",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"kovec = Word2Vec.load('ko.bin')"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "SYVR0jG62OTQ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 250
},
"outputId": "7f0430c6-5ac2-4912-dbfc-31df9c9c51ef"
},
"cell_type": "code",
"source": [
"kovec.wv.most_similar('여왕')"
],
"execution_count": 68,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
" if np.issubdtype(vec.dtype, np.int):\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('국왕', 0.6174007654190063),\n",
" ('왕', 0.6089673638343811),\n",
" ('왕녀', 0.5904853343963623),\n",
" ('왕비', 0.5857207179069519),\n",
" ('왕자', 0.5760841965675354),\n",
" ('왕세자', 0.544166624546051),\n",
" ('왕인', 0.5402752161026001),\n",
" ('미실', 0.5337860584259033),\n",
" ('부왕', 0.5335291624069214),\n",
" ('모후', 0.5328422784805298)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 68
}
]
},
{
"metadata": {
"id": "_pVEPgzaFjyb",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## 11.5 ELMo practice\n",
"\n",
"Everything else is the same as the textbook, except that preprocessing is simplified with `np.expand_dims`. See the explanation in the Q&A for details on `expand_dims`."
]
},
{
"metadata": {
"id": "AdhLlUk42DSx",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 126
},
"outputId": "368b28bd-5278-4de1-d9c2-88324926cc94"
},
"cell_type": "code",
"source": [
"!pip install tensorflow-hub"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: tensorflow-hub in /usr/local/lib/python3.6/dist-packages (0.1.1)\n",
"Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (1.14.6)\n",
"Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (3.6.1)\n",
"Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (1.11.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.4.0->tensorflow-hub) (40.5.0)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "GD7X2VP_2-F1",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import tensorflow_hub as hub"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "WiJewaX83RMZ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
},
"outputId": "a592e4f9-6989-4f58-d551-6387f7213fbc"
},
"cell_type": "code",
"source": [
"elmo = hub.Module(\"https://tfhub.dev/google/elmo/1\", trainable=True)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.\n",
"INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/elmo/1'.\n",
"INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/elmo/1'.\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "L42D-rbM3U9k",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "4434d778-7d5a-4728-bbf8-15bab512f2c1"
},
"cell_type": "code",
"source": [
"import tensorflow as tf\n",
"from keras.layers import Lambda"
],
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
],
"name": "stderr"
}
]
},
{
"metadata": {
"id": "xMwI8Cdz4Hjf",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"def elmo_embedding(x):\n",
" return elmo(tf.squeeze(tf.cast(x, tf.string)), signature=\"default\", as_dict=True)[\"default\"]"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "wTCpQ1sa4exO",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"elmo_layer = Lambda(elmo_embedding, input_shape=(1,), output_shape=(1024,))\n"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "7stCWquf4pzS",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 232
},
"outputId": "9e33b3e6-f08f-4b2e-87ce-693d30be0729"
},
"cell_type": "code",
"source": [
"!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip"
],
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": [
"--2018-11-09 05:40:08-- https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip\n",
"Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249\n",
"Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 84188 (82K) [application/zip]\n",
"Saving to: ‘sentiment labelled sentences.zip’\n",
"\n",
"sentiment labelled 100%[===================>] 82.21K 125KB/s in 0.7s \n",
"\n",
"2018-11-09 05:40:11 (125 KB/s) - ‘sentiment labelled sentences.zip’ saved [84188/84188]\n",
"\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "muRINwBl47Zc",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 247
},
"outputId": "1c5fc42b-6bca-4e49-bf3a-768ba739c6c0"
},
"cell_type": "code",
"source": [
"!unzip sentiment\\ labelled\\ sentences.zip"
],
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": [
"Archive: sentiment labelled sentences.zip\n",
" creating: sentiment labelled sentences/\n",
" inflating: sentiment labelled sentences/.DS_Store \n",
" creating: __MACOSX/\n",
" creating: __MACOSX/sentiment labelled sentences/\n",
" inflating: __MACOSX/sentiment labelled sentences/._.DS_Store \n",
" inflating: sentiment labelled sentences/amazon_cells_labelled.txt \n",
" inflating: sentiment labelled sentences/imdb_labelled.txt \n",
" inflating: __MACOSX/sentiment labelled sentences/._imdb_labelled.txt \n",
" inflating: sentiment labelled sentences/readme.txt \n",
" inflating: __MACOSX/sentiment labelled sentences/._readme.txt \n",
" inflating: sentiment labelled sentences/yelp_labelled.txt \n",
" inflating: __MACOSX/._sentiment labelled sentences \n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "gIRldvmn4_Bh",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from keras.preprocessing.sequence import pad_sequences\n",
"from sklearn.model_selection import train_test_split"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "GquT15ak5Gd_",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"df = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\\t', header=None)\n"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "FmaJVvGE53ej",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import numpy as np"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "oKB_-j4b7pQN",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "135a8dca-a41e-44bb-dd47-d2418a83bc7d"
},
"cell_type": "code",
"source": [
"df[0].values.shape"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1000,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"metadata": {
"id": "YOISBGTZFrkK",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
""
]
},
{
"metadata": {
"id": "hts603305LQb",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"X = np.expand_dims(df[0].values, 1)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "DoEB0LQL7rzs",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"outputId": "447e5de7-b88d-44c9-a0c2-01d0c953f690"
},
"cell_type": "code",
"source": [
"X.shape"
],
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1000, 1)"
]
},
"metadata": {
"tags": []
},
"execution_count": 14
}
]
},
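{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The model input is declared with `shape=(1,)`, so each sample has to be an array holding one string. `np.expand_dims(..., 1)` adds that axis, turning shape `(1000,)` into `(1000, 1)`. A small sketch with a hypothetical three-sentence array:"
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical stand-in for df[0].values: a 1-D array of sentences\n",
"sentences = np.array(['good phone', 'bad battery', 'works fine'])\n",
"\n",
"X = np.expand_dims(sentences, 1)  # insert a new axis at position 1\n",
"print(sentences.shape, X.shape)   # (3,) (3, 1)\n",
"\n",
"# Equivalent ways to write the same reshape\n",
"print(np.array_equal(X, sentences[:, np.newaxis]))\n",
"print(np.array_equal(X, sentences.reshape(-1, 1)))"
],
"execution_count": 0,
"outputs": []
},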
{
"metadata": {
"id": "yW4bR-BV59Tq",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, df[1], test_size=.2, random_state=1234)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "cL6-0H2v6Ab4",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from keras.models import Model\n",
"from keras.layers import Dense, Input"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "yIC1K6166I65",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 55
},
"outputId": "1ce1f62b-0561-4ddb-9459-3ade500f45ad"
},
"cell_type": "code",
"source": [
"input_layer = Input(shape=(1,), dtype=tf.string)\n",
"emb_layer = elmo_layer(input_layer)\n",
"#hidden = Dense(256, activation='relu')(emb_layer)\n",
"out = Dense(1, activation='sigmoid')(emb_layer)"
],
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": [
"INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "YY_NcYtf6Mht",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"model = Model(inputs=[input_layer], outputs=out)\n"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "eY76JEvW9Gby",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 247
},
"outputId": "c4115b5a-f800-4043-8baa-1f75a434846d"
},
"cell_type": "code",
"source": [
"model.summary()\n"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"input_1 (InputLayer) (None, 1) 0 \n",
"_________________________________________________________________\n",
"lambda_1 (Lambda) (None, 1024) 0 \n",
"_________________________________________________________________\n",
"dense_1 (Dense) (None, 1) 1025 \n",
"=================================================================\n",
"Total params: 1,025\n",
"Trainable params: 1,025\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
],
"name": "stdout"
}
]
},
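{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The parameter count in the summary can be checked by hand: the `Dense(1)` layer has one weight per ELMo dimension plus one bias. The `Lambda` layer reports 0 parameters, so as far as this summary shows, the dense layer holds all of the parameters Keras is tracking."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"emb_dim = 1024  # size of the ELMo output, from output_shape=(1024,) above\n",
"units = 1       # Dense(1, activation='sigmoid')\n",
"\n",
"params = emb_dim * units + units  # weights plus one bias per unit\n",
"print(params)  # 1025"
],
"execution_count": 0,
"outputs": []
},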
{
"metadata": {
"id": "UNtbje-E6UpQ",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "BjwqifXN6pOo",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 250
},
"outputId": "c4363428-0a2c-41e8-d1c6-fc5e95f15603"
},
"cell_type": "code",
"source": [
"model.fit(X_train,\n",
" y_train,\n",
" validation_data=(X_test, y_test),\n",
" epochs=5,\n",
" batch_size=32)"
],
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": [
"Train on 800 samples, validate on 200 samples\n",
"Epoch 1/5\n",
"800/800 [==============================] - 13s 16ms/step - loss: 0.4748 - acc: 0.8175 - val_loss: 0.4663 - val_acc: 0.8550\n",
"Epoch 2/5\n",
"800/800 [==============================] - 13s 16ms/step - loss: 0.4540 - acc: 0.8400 - val_loss: 0.4524 - val_acc: 0.8450\n",
"Epoch 3/5\n",
"800/800 [==============================] - 12s 16ms/step - loss: 0.4409 - acc: 0.8325 - val_loss: 0.4464 - val_acc: 0.8250\n",
"Epoch 4/5\n",
"800/800 [==============================] - 12s 16ms/step - loss: 0.4252 - acc: 0.8450 - val_loss: 0.4267 - val_acc: 0.8450\n",
"Epoch 5/5\n",
"800/800 [==============================] - 12s 16ms/step - loss: 0.4132 - acc: 0.8538 - val_loss: 0.4106 - val_acc: 0.8750\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f05003b60b8>"
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"metadata": {
"id": "CnTzjvNN6rPi",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}