euphoris/class11.ipynb

## class11.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "practice11.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "metadata": {
        "id": "osMMgqsAEZZd",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Scraping a Bulletin Board\n",
        "\n",
        "### Changing Pages"
      ]
    },
    {
      "metadata": {
        "id": "dqn58OXebXSb",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import requests"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "HuSGinocEdry",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "URL of the Kookmin University website notice board. The `pn=` parameter selects the page. The `{}` marks where the page number will be inserted."
      ]
    },
    {
      "metadata": {
        "id": "WgmAkyEhcMbO",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "url = 'https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn={}'"
      ],
      "execution_count": 0,
      "outputs": []
    },
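    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "A minimal sketch of how the `{}` placeholder gets filled in. The name `template` is hypothetical and used only here; `str.format` substitutes its argument for `{}`, which is how one URL pattern can address every page."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# `template` is a hypothetical name used only in this sketch\n",
        "template = 'https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn={}'\n",
        "\n",
        "# str.format replaces the {} placeholder with the page number\n",
        "print(template.format(3))"
      ],
      "execution_count": 0,
      "outputs": []
    },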
    {
      "metadata": {
        "id": "I2b-7inoEnxS",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Print the URL for each page, from page 0 through page 9."
      ]
    },
    {
      "metadata": {
        "id": "RISFXGvIcZx8",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 194
        },
        "outputId": "80332c06-6102-4812-f26a-28caadee4049"
      },
      "cell_type": "code",
      "source": [
        "for page in range(10):\n",
        "  print(url.format(page))  # show the URL for each page number"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=0\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=1\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=2\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=3\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=4\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=5\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=6\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=7\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=8\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=9\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "4hQ7cuV5dLi1",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip install lxml"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "agESh8vFdWeA",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip install cssselect"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "IVvfRgSSdILR",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import lxml.html"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "ZI3nCW8iEsiQ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Getting Post URLs\n",
        "\n",
        "Fetch page 0."
      ]
    },
    {
      "metadata": {
        "id": "r8Xj2EO5cbBv",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "res = requests.get(url.format(0))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "K5IbK9DydKAJ",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "root = lxml.html.fromstring(res.text)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "QfGSE5ZjeYWi",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from urllib.parse import urljoin"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "p4_KymRudZKI",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 283
        },
        "outputId": "8a0dfa05-d752-4ad3-aaae-4619d00483f4"
      },
      "cell_type": "code",
      "source": [
        "for link in root.cssselect('.boardlist a'):  # collect every <a> link under class=\"boardlist\"\n",
        "  print(urljoin(url, link.attrib['href']))   # read each href and resolve it against the board URL"
      ],
      "execution_count": 16,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122457\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122428\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122425\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122408\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122403\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122396\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122382\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122356\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122338\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122319\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122256\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122255\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122177\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122215\n",
            "https://www.kookmin.ac.kr/site/ecampus/notice/all/122212\n"
          ],
          "name": "stdout"
        }
      ]
    },
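    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "A small sketch of what `urljoin` does with a link's `href`. The relative and root-relative `href` values below are assumptions for illustration, not values copied from the board's HTML; each one is resolved against a page URL to give an absolute address."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from urllib.parse import urljoin\n",
        "\n",
        "# `base` is a hypothetical name; the URL follows the pattern used in this notebook\n",
        "base = 'https://www.kookmin.ac.kr/site/ecampus/notice/all/?&pn=0'\n",
        "\n",
        "# assumed relative href: resolved against the directory part of the base URL\n",
        "print(urljoin(base, '122457'))\n",
        "\n",
        "# assumed root-relative href: resolved against the scheme and host only\n",
        "print(urljoin(base, '/site/ecampus/notice/all/122457'))"
      ],
      "execution_count": 0,
      "outputs": []
    },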
    {
      "metadata": {
        "id": "r3htR6xjE6Cr",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Getting Post Content"
      ]
    },
    {
      "metadata": {
        "id": "QKYWAX7wfWDw",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "res = requests.get('https://www.kookmin.ac.kr/site/ecampus/notice/all/122212')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "_oa2DADzfzYf",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "res.encoding = 'utf8'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "wDKA1JH1fao0",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "root = lxml.html.fromstring(res.text)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "88WkDGYOdo3K",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "content = root.cssselect('#view-detail-data')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "-eSfsJvPfmv-",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "outputId": "831d3e55-346b-4e3b-9283-8b686ca8d22a"
      },
      "cell_type": "code",
      "source": [
        "content[0].text_content()"
      ],
      "execution_count": 32,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'\\n\\t\\xa0\\r\\n\\r\\n국민대학교 창업보육센터 계약직원 모집\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n1. 모집분야 및 응시자격 \\r\\n\\r\\n\\r\\n\\t\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t모집분야\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t인원\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t우대사항\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t공통사항\\r\\n\\t\\t\\t\\r\\n\\t\\t\\r\\n\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t창업보육센터\\r\\n\\r\\n\\t\\t\\t전담인력\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t1명\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t- 창업보육전문매니저 자격증, 경영지도사, 기술경영사, 기술평가사 자격증 소지자 우대\\r\\n\\r\\n\\t\\t\\t- 창업지원 및 창업교육 관련 업무 경력자 우대\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t\\r\\n\\t\\t\\t- 4년제 대학 이상 졸업자 \\r\\n\\r\\n\\t\\t\\t- 아래한글 또는 MS워드, 엑셀, 파워포인트, 포토샵, 일러스트) 활용에 능숙한 자\\r\\n\\r\\n\\t\\t\\t- 해외여행에 결격 사유가 없는 자로 남자는 병역필 또는 면제자\\r\\n\\t\\t\\t\\r\\n\\t\\t\\r\\n\\t\\r\\n\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n2. 제출서류\\r\\n\\r\\n◦입사지원서\\r\\n\\r\\n(필히 본교 홈페이지 www.kookmin.ac.kr에서 다운받아 사용하시기 바랍니다.)\\r\\n\\r\\n◦자기소개서 1부\\r\\n\\r\\n◦대학 졸업 및 성적증명서 원본 각 1부 (반드시 성적증명서는100점 만점 환산 점수 기재된 것)\\r\\n\\r\\n\\xa0\\xa0 가. 편입자는 전적대학 졸업‧성적증명서 포함\\r\\n\\r\\n\\xa0\\xa0 나. 대학원 졸업(수료)자는 학위수여증명서(수료증명서)‧성적증명서 포함\\r\\n\\r\\n◦자격증(외국어성적표 포함) 사본(해당자에 한함) 1부\\r\\n\\r\\n◦경력증명서(해당자에 한함) 1부\\r\\n\\r\\n◦취업보호대상자 증명원(보훈대상자에 한함) 1부\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n3. 제출기간 및 제출처\\r\\n\\r\\n◦제출기간 : 2018. 11. 01.(목) ~ 11. 16.(금)\\r\\n\\r\\n◦제 출 처 : 우편접수 - 국민대학교 산학협력관 214호 창업지원단 사무실\\r\\n\\r\\n\\xa0\\xa0 (마감일 기준 도착분에 한함)\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n4. 전형방법\\r\\n\\r\\n◦1차 전형 : 서류심사\\r\\n\\r\\n◦2차 전형 : 면접\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n5. 전형일정 \\r\\n\\r\\n◦1차 서류심사 : 2018. 11. 19.(월)\\r\\n\\r\\n◦1차 서류심사 결과 통보 : 2018. 11. 20 (화) 예정\\r\\n\\r\\n\\xa0\\xa0 - 1차 서류심사 합격자에 한하여 개별 통지\\r\\n\\r\\n◦2차 면접 : 2018. 11. 22.(목) 11:00 예정\\r\\n\\r\\n◦최종 합격 통보 : 2018. 11. 23.(금) 예정\\r\\n\\r\\n◦임용일자 : 2018.12.03.(월) 예정\\r\\n\\r\\n◦전형일정은 본교 사정에 따라 변동될 수 있습니다.\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n6. 
채용조건\\r\\n\\r\\n- 계약직원으로 1년간 고용 후 평가결과에 따라 1년 연장 가능\\r\\n\\r\\n(본 채용은 창업보육센터 사업 전담인력 채용임)\\r\\n\\r\\n\\xa0 \\r\\n\\r\\n7. 기타\\r\\n\\r\\n서류(우편포함)는 마감일 16:00까지 도착된 것에 한하여 접수(e-mail 접수 불가)\\r\\n\\r\\n주 소 : 20707 서울 성북구 정릉로 77 국민대학교 산학협력관 214호 창업지원단 사무실\\r\\n\\r\\n전 화 : (02) 910 - 5911\\r\\n'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 32
        }
      ]
    },
    {
      "metadata": {
        "id": "1v1otR8jE-RO",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Putting It All Together\n",
        "\n",
        "Collect post URLs while stepping through the pages."
      ]
    },
    {
      "metadata": {
        "id": "s0wvT6Uwfo9-",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "article_urls = []\n",
        "\n",
        "for page in range(10):\n",
        "  res = requests.get(url.format(page))\n",
        "  root = lxml.html.fromstring(res.text)  \n",
        "  for link in root.cssselect('.boardlist a'):\n",
        "    article_urls.append(urljoin(url, link.attrib['href']))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "GWIEWki_ggdG",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "6ff34ef1-d419-4ec6-b887-f6622c3af1fc"
      },
      "cell_type": "code",
      "source": [
        "len(article_urls)"
      ],
      "execution_count": 34,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "70"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 34
        }
      ]
    },
    {
      "metadata": {
        "id": "1wI6yHfcFBny",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Collect the body text of the post at each collected URL."
      ]
    },
    {
      "metadata": {
        "id": "mDud-Qo0gl5o",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "contents = []\n",
        "for article_url in article_urls:  # for each post URL\n",
        "  res = requests.get(article_url) # fetch the page\n",
        "  res.encoding = 'utf8'           # decode the response as UTF-8\n",
        "  root = lxml.html.fromstring(res.text)  # parse the HTML\n",
        "  content = root.cssselect('#view-detail-data')  # select the body element\n",
        "  contents.append(content[0].text_content())   # collect its text"
      ],
      "execution_count": 0,
      "outputs": []
    },
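    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "The loop above indexes `content[0]` directly, which raises `IndexError` if a page has no `#view-detail-data` element. A hedged sketch of a helper that returns `None` instead; `extract_body` and the sample HTML strings are hypothetical names and data introduced only for this illustration."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import lxml.html\n",
        "\n",
        "def extract_body(html, selector='#view-detail-data'):\n",
        "  # hypothetical helper: text of the first matching element, or None\n",
        "  root = lxml.html.fromstring(html)\n",
        "  matches = root.cssselect(selector)\n",
        "  if not matches:\n",
        "    return None\n",
        "  return matches[0].text_content().strip()\n",
        "\n",
        "# made-up sample pages, one with the body element and one without\n",
        "with_body = '<html><body><div id=\"view-detail-data\"> hello </div></body></html>'\n",
        "without_body = '<html><body><p>nothing here</p></body></html>'\n",
        "\n",
        "print(extract_body(with_body))\n",
        "print(extract_body(without_body))"
      ],
      "execution_count": 0,
      "outputs": []
    },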
    {
      "metadata": {
        "id": "UUgIiNbHFJH7",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 11.2 Word Embedding\n",
        "\n",
        "(Same as the textbook)"
      ]
    },
    {
      "metadata": {
        "id": "iR7hg-1ignwY",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import requests\n",
        "import re\n",
        "res = requests.get('https://www.gutenberg.org/files/2591/2591-0.txt')\n",
        "grimm = res.text[2801:530661]\n",
        "grimm = re.sub(r'[^a-zA-Z\\. ]', ' ', grimm)  # replace everything except letters, periods, and spaces\n",
        "sentences = grimm.split('. ')  # split into sentences\n",
        "data = [s.split() for s in sentences]\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "hyPaPTbmtL-2",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 389
        },
        "outputId": "fea13696-9d2b-4b9c-fa29-c6b4fcf5351c"
      },
      "cell_type": "code",
      "source": [
        "data[0]"
      ],
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['THE',\n",
              " 'GOLDEN',\n",
              " 'BIRD',\n",
              " 'A',\n",
              " 'certain',\n",
              " 'king',\n",
              " 'had',\n",
              " 'a',\n",
              " 'beautiful',\n",
              " 'garden',\n",
              " 'and',\n",
              " 'in',\n",
              " 'the',\n",
              " 'garden',\n",
              " 'stood',\n",
              " 'a',\n",
              " 'tree',\n",
              " 'which',\n",
              " 'bore',\n",
              " 'golden',\n",
              " 'apples']"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 37
        }
      ]
    },
    {
      "metadata": {
        "id": "qnYsur95tN1G",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip install gensim"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Hy9xg7BLt2PO",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from gensim.models.word2vec import Word2Vec"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "vv7W_IGouGnH",
        "colab_type": "code",
        "colab": {}
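      },
      "cell_type": "code",
      "source": [
        "model = Word2Vec(data,         # data as a list of token lists\n",
        "                 sg=1,         # 0: CBOW, 1: Skip-gram\n",
        "                 size=100,     # vector dimensionality (named vector_size in gensim 4.0+)\n",
        "                 window=3,     # context window (3 words on each side)\n",
        "                 min_count=3,  # minimum word frequency (words seen fewer than 3 times are ignored)\n",
        "                 workers=4)    # number of worker threads (set close to the number of cores)"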
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "VlBbWfqdvWD5",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "model.save('word2vec.model')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "51cuAPEkv62a",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 90
        },
        "outputId": "f85d7c95-ad35-4fc4-b61a-ef8011f09c36"
      },
      "cell_type": "code",
      "source": [
        "model.wv.similarity('princess', 'queen')"
      ],
      "execution_count": 44,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
            "  if np.issubdtype(vec.dtype, np.int):\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.9875084"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 44
        }
      ]
    },
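    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "`similarity` reports the cosine similarity of the two word vectors. A minimal sketch of that computation with NumPy, using made-up toy vectors rather than vectors from the trained model; `cosine_similarity` is a helper name introduced here."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "def cosine_similarity(u, v):\n",
        "  # dot product divided by the product of the vector lengths\n",
        "  return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
        "\n",
        "# toy vectors, not taken from the Word2Vec model\n",
        "a = np.array([1.0, 2.0, 3.0])\n",
        "b = np.array([2.0, 4.0, 6.0])     # same direction as a\n",
        "c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a\n",
        "\n",
        "print(cosine_similarity(a, b))\n",
        "print(cosine_similarity(a, c))"
      ],
      "execution_count": 0,
      "outputs": []
    },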
    {
      "metadata": {
        "id": "dd4-kpQYwQdx",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 250
        },
        "outputId": "a4b1a71b-1ca7-46f8-d00b-7b7ce8e9b69f"
      },
      "cell_type": "code",
      "source": [
        "model.wv.most_similar('princess')"
      ],
      "execution_count": 49,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('fox', 0.9914872646331787),\n",
              " ('dwarf', 0.9899657964706421),\n",
              " ('prince', 0.9898759126663208),\n",
              " ('second', 0.9888558387756348),\n",
              " ('wedding', 0.9885976314544678),\n",
              " ('boy', 0.9884428977966309),\n",
              " ('queen', 0.9875084757804871),\n",
              " ('youth', 0.9870286583900452),\n",
              " ('witch', 0.9852925539016724),\n",
              " ('palace', 0.9848740100860596)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 49
        }
      ]
    },
    {
      "metadata": {
        "id": "JABHP02Dv_qj",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 250
        },
        "outputId": "dde4e19e-5259-4344-b23b-94153e32c415"
      },
      "cell_type": "code",
      "source": [
        "model.wv.most_similar(positive=['man', 'princess'], negative=['woman'])\n"
      ],
      "execution_count": 50,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('cat', 0.9745951890945435),\n",
              " ('miller', 0.9728690981864929),\n",
              " ('lady', 0.9711911678314209),\n",
              " ('bird', 0.9709718823432922),\n",
              " ('bride', 0.9689940214157104),\n",
              " ('wolf', 0.9689082503318787),\n",
              " ('child', 0.9684101343154907),\n",
              " ('huntsman', 0.9650394320487976),\n",
              " ('soldier', 0.9645828604698181),\n",
              " ('peasant', 0.9642676115036011)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 50
        }
      ]
    },
    {
      "metadata": {
        "id": "uxi3ybXIxMDP",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "c9eb50d0-2284-4feb-ed48-032f19ef6c42"
      },
      "cell_type": "code",
      "source": [
        "from keras.models import Sequential\n",
        "from keras.layers import Embedding"
      ],
      "execution_count": 51,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Using TensorFlow backend.\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "id": "7N5AGLjhyIwu",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "NUM_WORDS, EMB_DIM = model.wv.vectors.shape"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "5yhNQBfYyPst",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "a9f4b8d6-fc76-4434-8021-aad0a30fe160"
      },
      "cell_type": "code",
      "source": [
        "NUM_WORDS"
      ],
      "execution_count": 54,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "2481"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 54
        }
      ]
    },
    {
      "metadata": {
        "id": "qJgukL1hx9Qz",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "nn = Sequential()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Fe5SAerRyWVd",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "emb = Embedding(input_dim=NUM_WORDS, output_dim=EMB_DIM,\n",
        "                trainable=False, weights=[model.wv.vectors])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "6LdARVmNyBLz",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "nn.add(emb)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "iSpo5mdQy6Qi",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 232
        },
        "outputId": "d36e74c0-1fef-4fa7-93d4-668548631af4"
      },
      "cell_type": "code",
      "source": [
        "!wget https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view"
      ],
      "execution_count": 57,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "--2018-11-09 04:50:09--  https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view\n",
            "Resolving drive.google.com (drive.google.com)... 74.125.195.101, 74.125.195.100, 74.125.195.139, ...\n",
            "Connecting to drive.google.com (drive.google.com)|74.125.195.101|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: unspecified [text/html]\n",
            "Saving to: ‘view’\n",
            "\n",
            "\rview                    [<=>                 ]       0  --.-KB/s               \rview                    [ <=>                ] 131.77K  --.-KB/s    in 0.05s   \n",
            "\n",
            "2018-11-09 04:50:10 (2.50 MB/s) - ‘view’ saved [134932]\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "YMkM6R4NzlKe",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 70
        },
        "outputId": "9cb951b5-9764-42c7-d3c0-bb4b306a35c3"
      },
      "cell_type": "code",
      "source": [
        "!unzip ko.zip"
      ],
      "execution_count": 59,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Archive:  ko.zip\n",
            "  inflating: ko.bin                  \n",
            "  inflating: ko.tsv                  \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "TX2XUK4G1-9u",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "kovec = Word2Vec.load('ko.bin')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "SYVR0jG62OTQ",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 250
        },
        "outputId": "7f0430c6-5ac2-4912-dbfc-31df9c9c51ef"
      },
      "cell_type": "code",
      "source": [
        "kovec.wv.most_similar('여왕')"
      ],
      "execution_count": 68,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('국왕', 0.6174007654190063),\n",
              " ('왕', 0.6089673638343811),\n",
              " ('왕녀', 0.5904853343963623),\n",
              " ('왕비', 0.5857207179069519),\n",
              " ('왕자', 0.5760841965675354),\n",
              " ('왕세자', 0.544166624546051),\n",
              " ('왕인', 0.5402752161026001),\n",
              " ('미실', 0.5337860584259033),\n",
              " ('부왕', 0.5335291624069214),\n",
              " ('모후', 0.5328422784805298)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 68
        }
      ]
    },
    {
      "metadata": {
        "id": "_pVEPgzaFjyb",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## 11.5 ELMo Practice\n",
        "\n",
        "The rest is the same as the textbook, except that preprocessing is simplified with `np.expand_dims`. See the Q&A for an explanation of `expand_dims`."
      ]
    },
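    {
      "metadata": {
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "A minimal sketch of what `np.expand_dims` does to an array of sentences, assuming toy data in place of the real dataset. Adding an axis at position 1 turns shape `(n,)` into `(n, 1)`, so each sample becomes a one-element array. The names `texts` and `expanded` are introduced only for this illustration."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "# toy sentences standing in for the real dataset\n",
        "texts = np.array(['good movie', 'bad phone', 'great food'])\n",
        "print(texts.shape)\n",
        "\n",
        "# add a new axis at position 1 so each sample has shape (1,)\n",
        "expanded = np.expand_dims(texts, axis=1)\n",
        "print(expanded.shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },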
    {
      "metadata": {
        "id": "AdhLlUk42DSx",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 126
        },
        "outputId": "368b28bd-5278-4de1-d9c2-88324926cc94"
      },
      "cell_type": "code",
      "source": [
        "!pip install tensorflow-hub"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: tensorflow-hub in /usr/local/lib/python3.6/dist-packages (0.1.1)\n",
            "Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (1.14.6)\n",
            "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (3.6.1)\n",
            "Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-hub) (1.11.0)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.4.0->tensorflow-hub) (40.5.0)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "GD7X2VP_2-F1",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import tensorflow_hub as hub"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WiJewaX83RMZ",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 70
        },
        "outputId": "a592e4f9-6989-4f58-d551-6387f7213fbc"
      },
      "cell_type": "code",
      "source": [
        "elmo = hub.Module(\"https://tfhub.dev/google/elmo/1\", trainable=True)"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.\n",
            "INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/elmo/1'.\n",
            "INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/elmo/1'.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "L42D-rbM3U9k",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "4434d778-7d5a-4728-bbf8-15bab512f2c1"
      },
      "cell_type": "code",
      "source": [
        "import tensorflow as tf\n",
        "from keras.layers import Lambda"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Using TensorFlow backend.\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "id": "xMwI8Cdz4Hjf",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "def elmo_embedding(x):\n",
        "    return elmo(tf.squeeze(tf.cast(x, tf.string)), signature=\"default\", as_dict=True)[\"default\"]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "wTCpQ1sa4exO",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "elmo_layer = Lambda(elmo_embedding, input_shape=(1,), output_shape=(1024,))\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "7stCWquf4pzS",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 232
        },
        "outputId": "9e33b3e6-f08f-4b2e-87ce-693d30be0729"
      },
      "cell_type": "code",
      "source": [
        "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "--2018-11-09 05:40:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip\n",
            "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249\n",
            "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 84188 (82K) [application/zip]\n",
            "Saving to: ‘sentiment labelled sentences.zip’\n",
            "\n",
            "sentiment labelled  100%[===================>]  82.21K   125KB/s    in 0.7s    \n",
            "\n",
            "2018-11-09 05:40:11 (125 KB/s) - ‘sentiment labelled sentences.zip’ saved [84188/84188]\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "muRINwBl47Zc",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 247
        },
        "outputId": "1c5fc42b-6bca-4e49-bf3a-768ba739c6c0"
      },
      "cell_type": "code",
      "source": [
        "!unzip sentiment\\ labelled\\ sentences.zip"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Archive:  sentiment labelled sentences.zip\n",
            "   creating: sentiment labelled sentences/\n",
            "  inflating: sentiment labelled sentences/.DS_Store  \n",
            "   creating: __MACOSX/\n",
            "   creating: __MACOSX/sentiment labelled sentences/\n",
            "  inflating: __MACOSX/sentiment labelled sentences/._.DS_Store  \n",
            "  inflating: sentiment labelled sentences/amazon_cells_labelled.txt  \n",
            "  inflating: sentiment labelled sentences/imdb_labelled.txt  \n",
            "  inflating: __MACOSX/sentiment labelled sentences/._imdb_labelled.txt  \n",
            "  inflating: sentiment labelled sentences/readme.txt  \n",
            "  inflating: __MACOSX/sentiment labelled sentences/._readme.txt  \n",
            "  inflating: sentiment labelled sentences/yelp_labelled.txt  \n",
            "  inflating: __MACOSX/._sentiment labelled sentences  \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "gIRldvmn4_Bh",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# ELMo consumes raw strings, so no sequence padding utilities are needed here\n",
        "import pandas as pd\n",
        "from sklearn.model_selection import train_test_split"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "GquT15ak5Gd_",
        "colab_type": "code",
        "colab": {}
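      },
      "cell_type": "code",
      "source": [
        "# Tab-separated file with no header row: column 0 is the sentence, column 1 the label (1 = positive, 0 = negative)\n",
        "df = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\\t', header=None)"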
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FmaJVvGE53ej",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import numpy as np"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "oKB_-j4b7pQN",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "135a8dca-a41e-44bb-dd47-d2418a83bc7d"
      },
      "cell_type": "code",
      "source": [
        "df[0].values.shape"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(1000,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 12
        }
      ]
    },
    {
      "metadata": {
        "id": "YOISBGTZFrkK",
        "colab_type": "text"
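      },
      "cell_type": "markdown",
      "source": [
        "Keras feeds the model one row per sample, and the input layer defined below uses `shape=(1,)`. `np.expand_dims` adds the missing axis so the `(1000,)` array of sentences becomes `(1000, 1)`."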
      ]
    },
    {
      "metadata": {
        "id": "hts603305LQb",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "X = np.expand_dims(df[0].values, 1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "DoEB0LQL7rzs",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "447e5de7-b88d-44c9-a0c2-01d0c953f690"
      },
      "cell_type": "code",
      "source": [
        "X.shape"
      ],
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(1000, 1)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 14
        }
      ]
    },
    {
      "metadata": {
        "id": "yW4bR-BV59Tq",
        "colab_type": "code",
        "colab": {}
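      },
      "cell_type": "code",
      "source": [
        "# Hold out 20% of the 1000 sentences for validation; the fixed seed keeps the split reproducible\n",
        "X_train, X_test, y_train, y_test = train_test_split(\n",
        "    X, df[1], test_size=.2, random_state=1234)"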
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "cL6-0H2v6Ab4",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from keras.models import Model\n",
        "from keras.layers import Dense, Input"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "yIC1K6166I65",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "outputId": "1ce1f62b-0561-4ddb-9459-3ade500f45ad"
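      },
      "cell_type": "code",
      "source": [
        "# One raw sentence string per sample\n",
        "input_layer = Input(shape=(1,), dtype=tf.string)\n",
        "emb_layer = elmo_layer(input_layer)\n",
        "# Optional hidden layer, left disabled:\n",
        "#hidden = Dense(256, activation='relu')(emb_layer)\n",
        "# Single sigmoid unit: predicted probability that the sentence is positive\n",
        "out = Dense(1, activation='sigmoid')(emb_layer)"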
      ],
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "YY_NcYtf6Mht",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "model = Model(inputs=[input_layer], outputs=out)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "eY76JEvW9Gby",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 247
        },
        "outputId": "c4115b5a-f800-4043-8baa-1f75a434846d"
      },
      "cell_type": "code",
      "source": [
        "model.summary()\n"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "_________________________________________________________________\n",
            "Layer (type)                 Output Shape              Param #   \n",
            "=================================================================\n",
            "input_1 (InputLayer)         (None, 1)                 0         \n",
            "_________________________________________________________________\n",
            "lambda_1 (Lambda)            (None, 1024)              0         \n",
            "_________________________________________________________________\n",
            "dense_1 (Dense)              (None, 1)                 1025      \n",
            "=================================================================\n",
            "Total params: 1,025\n",
            "Trainable params: 1,025\n",
            "Non-trainable params: 0\n",
            "_________________________________________________________________\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "UNtbje-E6UpQ",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "BjwqifXN6pOo",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 250
        },
        "outputId": "c4363428-0a2c-41e8-d1c6-fc5e95f15603"
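      },
      "cell_type": "code",
      "source": [
        "# Only the Dense layer's 1,025 weights are trained; the Lambda (ELMo) layer has no trainable parameters\n",
        "model.fit(X_train,\n",
        "          y_train,\n",
        "          validation_data=(X_test, y_test),\n",
        "          epochs=5,\n",
        "          batch_size=32)"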
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Train on 800 samples, validate on 200 samples\n",
            "Epoch 1/5\n",
            "800/800 [==============================] - 13s 16ms/step - loss: 0.4748 - acc: 0.8175 - val_loss: 0.4663 - val_acc: 0.8550\n",
            "Epoch 2/5\n",
            "800/800 [==============================] - 13s 16ms/step - loss: 0.4540 - acc: 0.8400 - val_loss: 0.4524 - val_acc: 0.8450\n",
            "Epoch 3/5\n",
            "800/800 [==============================] - 12s 16ms/step - loss: 0.4409 - acc: 0.8325 - val_loss: 0.4464 - val_acc: 0.8250\n",
            "Epoch 4/5\n",
            "800/800 [==============================] - 12s 16ms/step - loss: 0.4252 - acc: 0.8450 - val_loss: 0.4267 - val_acc: 0.8450\n",
            "Epoch 5/5\n",
            "800/800 [==============================] - 12s 16ms/step - loss: 0.4132 - acc: 0.8538 - val_loss: 0.4106 - val_acc: 0.8750\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<keras.callbacks.History at 0x7f05003b60b8>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 22
        }
      ]
    }
  ]
}