qbeer/hw12_no_code.ipynb Secret

## hw12_no_code.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "hw12_no_code.ipynb",
      "provenance": [],
      "authorship_tag": "ABX9TyMHCVu9wCMzzP1CmRVaNPPu",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/qbeer/e52ec7f519dfc2fa12583fa3b497769d/hw12_no_code.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HGAmCco5lyIU"
      },
      "source": [
        "## 1. Word2Vec (something like that) embeddings\n",
        "\n",
        "* Read the GloVE file into word - vector pairs \n",
        "* Create a 2D-embedding with PCA for the 10_000 nearest neighbors (based on L2 distance) for the word 'dog'.\n",
        "* Visualize the 2 dimensional embeddings on a plot and add text annotations to it\n",
        "  * 'dog' should be red\n",
        "  * only add the nearast 50 neighbors\n",
        "  * add an alpha (.3) to the 10_000 points (too much to visualize well with text)\n",
        "\n",
        "## 2. IMDB reviews with word embeddings\n",
        "\n",
        "Load the 'imdb_review' dataset from 'tf.keras.datasets.imdb' an convert each sentence into a sequence of its GloVe representations. This will generate (n_samples, sample_length, 50) dimensional dataset. \n",
        "\n",
        "  * mean your input along the `sample_length` axis -> this generates a dataset useable to the MLP -> (n_samples, 50)\n",
        "    * you are basically generating a mean representation of the sentence\n",
        "  * handle your OOV (out of vocabulary) tokens with e.g. np.zeros(50) -> this does not influence the mean much\n",
        "\n",
        "Loading the data:\n",
        "\n",
        "  * `(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(\n",
        "    path=\"imdb.npz\",\n",
        "    num_words=None,\n",
        "    skip_top=0,\n",
        "    maxlen=150,\n",
        "    seed=113,\n",
        "    start_char=1,\n",
        "    oov_char=2,\n",
        "    index_from=3\n",
        ")`\n",
        "  * do the preprocessing this way, this makes the dataset ~9'000 samples large and the maximum length is only 150 words\n",
        "  * the dataset is represented as index values, so you need to convert twice: index -> word -> GloVe\n",
        "    * the index-to-word conversion is achievable by Keras, read the documentation\n",
        "\n",
        "Model defintion:\n",
        "  * `Dense(256, relu)`,\n",
        "  * `Dense(64, relu)`,\n",
        "  * `Dense(1, sigmoid)`\n",
        "\n",
        "Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
        "\n",
        "***Hint: approximately 55-60% accuracy is achieveable on the test set.***\n",
        "\n",
        "## 3. Sequence modeling with LSTM\n",
        "\n",
        "  * use the IMDB dataset again converted into GloVe sequences but without the mean operation. This way you are going to generate (n_samples, sequence_length, 50) sample points with different sequence lengths\n",
        "  * pad every sequence to `150` in length with np.zeros(50) -> (n_samples, 150, 50)\n",
        "  * LSTM is a recurrent model with intricate inner operations, if you use it in a bideractional fashion, your sequence will be processed from both ends\n",
        "\n",
        "Model definition:\n",
        "  * `BidirectionalLSTM(64, return_sequences=True),`\n",
        "  * `BidirectionalLSTM(64),`\n",
        "  * `Dense(64, relu)`,\n",
        "  * `Dense(1, sigmoid)`\n",
        "\n",
        "Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
        "\n",
        "***Hint: approximately 65-70% accuracy is achieveable on the test set.***"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w-CgB860lzJp"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}
	{
	"nbformat": 4,
	"nbformat_minor": 0,
	"metadata": {
	"colab": {
	"name": "hw12_no_code.ipynb",
	"provenance": [],
	"authorship_tag": "ABX9TyMHCVu9wCMzzP1CmRVaNPPu",
	"include_colab_link": true
	},
	"kernelspec": {
	"name": "python3",
	"display_name": "Python 3"
	},
	"language_info": {
	"name": "python"
	}
	},
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "view-in-github",
	"colab_type": "text"
	},
	"source": [
	"<a href=\"https://colab.research.google.com/gist/qbeer/e52ec7f519dfc2fa12583fa3b497769d/hw12_no_code.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"id": "HGAmCco5lyIU"
	},
	"source": [
	"## 1. Word2Vec (something like that) embeddings\n",
	"\n",
	"* Read the GloVE file into word - vector pairs \n",
	"* Create a 2D-embedding with PCA for the 10_000 nearest neighbors (based on L2 distance) for the word 'dog'.\n",
	"* Visualize the 2 dimensional embeddings on a plot and add text annotations to it\n",
	" * 'dog' should be red\n",
	" * only add the nearast 50 neighbors\n",
	" * add an alpha (.3) to the 10_000 points (too much to visualize well with text)\n",
	"\n",
	"## 2. IMDB reviews with word embeddings\n",
	"\n",
	"Load the 'imdb_review' dataset from 'tf.keras.datasets.imdb' an convert each sentence into a sequence of its GloVe representations. This will generate (n_samples, sample_length, 50) dimensional dataset. \n",
	"\n",
	" * mean your input along the `sample_length` axis -> this generates a dataset useable to the MLP -> (n_samples, 50)\n",
	" * you are basically generating a mean representation of the sentence\n",
	" * handle your OOV (out of vocabulary) tokens with e.g. np.zeros(50) -> this does not influence the mean much\n",
	"\n",
	"Loading the data:\n",
	"\n",
	" * `(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(\n",
	" path=\"imdb.npz\",\n",
	" num_words=None,\n",
	" skip_top=0,\n",
	" maxlen=150,\n",
	" seed=113,\n",
	" start_char=1,\n",
	" oov_char=2,\n",
	" index_from=3\n",
	")`\n",
	" * do the preprocessing this way, this makes the dataset ~9'000 samples large and the maximum length is only 150 words\n",
	" * the dataset is represented as index values, so you need to convert twice: index -> word -> GloVe\n",
	" * the index-to-word conversion is achievable by Keras, read the documentation\n",
	"\n",
	"Model defintion:\n",
	" * `Dense(256, relu)`,\n",
	" * `Dense(64, relu)`,\n",
	" * `Dense(1, sigmoid)`\n",
	"\n",
	"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
	"\n",
	"*Hint: approximately 55-60% accuracy is achieveable on the test set.*\n",
	"\n",
	"## 3. Sequence modeling with LSTM\n",
	"\n",
	" * use the IMDB dataset again converted into GloVe sequences but without the mean operation. This way you are going to generate (n_samples, sequence_length, 50) sample points with different sequence lengths\n",
	" * pad every sequence to `150` in length with np.zeros(50) -> (n_samples, 150, 50)\n",
	" * LSTM is a recurrent model with intricate inner operations, if you use it in a bideractional fashion, your sequence will be processed from both ends\n",
	"\n",
	"Model definition:\n",
	" * `BidirectionalLSTM(64, return_sequences=True),`\n",
	" * `BidirectionalLSTM(64),`\n",
	" * `Dense(64, relu)`,\n",
	" * `Dense(1, sigmoid)`\n",
	"\n",
	"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
	"\n",
	"*Hint: approximately 65-70% accuracy is achieveable on the test set.*"
	]
	},
	{
	"cell_type": "code",
	"metadata": {
	"id": "w-CgB860lzJp"
	},
	"source": [
	""
	],
	"execution_count": null,
	"outputs": []
	}
	]
	}