-
-
Save qbeer/e52ec7f519dfc2fa12583fa3b497769d to your computer and use it in GitHub Desktop.
hw12_no_code.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "hw12_no_code.ipynb", | |
"provenance": [], | |
"authorship_tag": "ABX9TyMHCVu9wCMzzP1CmRVaNPPu", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/qbeer/e52ec7f519dfc2fa12583fa3b497769d/hw12_no_code.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "HGAmCco5lyIU" | |
}, | |
"source": [ | |
"## 1. Word2Vec (something like that) embeddings\n", | |
"\n", | |
"* Read the GloVE file into word - vector pairs \n", | |
"* Create a 2D-embedding with PCA for the 10_000 nearest neighbors (based on L2 distance) for the word 'dog'.\n", | |
"* Visualize the 2 dimensional embeddings on a plot and add text annotations to it\n", | |
" * 'dog' should be red\n", | |
" * only add the nearast 50 neighbors\n", | |
" * add an alpha (.3) to the 10_000 points (too much to visualize well with text)\n", | |
"\n", | |
"## 2. IMDB reviews with word embeddings\n", | |
"\n", | |
"Load the 'imdb_review' dataset from 'tf.keras.datasets.imdb' an convert each sentence into a sequence of its GloVe representations. This will generate (n_samples, sample_length, 50) dimensional dataset. \n", | |
"\n", | |
" * mean your input along the `sample_length` axis -> this generates a dataset useable to the MLP -> (n_samples, 50)\n", | |
" * you are basically generating a mean representation of the sentence\n", | |
" * handle your OOV (out of vocabulary) tokens with e.g. np.zeros(50) -> this does not influence the mean much\n", | |
"\n", | |
"Loading the data:\n", | |
"\n", | |
" * `(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(\n", | |
" path=\"imdb.npz\",\n", | |
" num_words=None,\n", | |
" skip_top=0,\n", | |
" maxlen=150,\n", | |
" seed=113,\n", | |
" start_char=1,\n", | |
" oov_char=2,\n", | |
" index_from=3\n", | |
")`\n", | |
" * do the preprocessing this way, this makes the dataset ~9'000 samples large and the maximum length is only 150 words\n", | |
" * the dataset is represented as index values, so you need to convert twice: index -> word -> GloVe\n", | |
" * the index-to-word conversion is achievable by Keras, read the documentation\n", | |
"\n", | |
"Model defintion:\n", | |
" * `Dense(256, relu)`,\n", | |
" * `Dense(64, relu)`,\n", | |
" * `Dense(1, sigmoid)`\n", | |
"\n", | |
"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n", | |
"\n", | |
"***Hint: approximately 55-60% accuracy is achieveable on the test set.***\n", | |
"\n", | |
"## 3. Sequence modeling with LSTM\n", | |
"\n", | |
" * use the IMDB dataset again converted into GloVe sequences but without the mean operation. This way you are going to generate (n_samples, sequence_length, 50) sample points with different sequence lengths\n", | |
" * pad every sequence to `150` in length with np.zeros(50) -> (n_samples, 150, 50)\n", | |
" * LSTM is a recurrent model with intricate inner operations, if you use it in a bideractional fashion, your sequence will be processed from both ends\n", | |
"\n", | |
"Model definition:\n", | |
" * `BidirectionalLSTM(64, return_sequences=True),`\n", | |
" * `BidirectionalLSTM(64),`\n", | |
" * `Dense(64, relu)`,\n", | |
" * `Dense(1, sigmoid)`\n", | |
"\n", | |
"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n", | |
"\n", | |
"***Hint: approximately 65-70% accuracy is achieveable on the test set.***" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "w-CgB860lzJp" | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment