Skip to content

Instantly share code, notes, and snippets.

@qbeer
Created November 29, 2021 16:46
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save qbeer/e52ec7f519dfc2fa12583fa3b497769d to your computer and use it in GitHub Desktop.
Save qbeer/e52ec7f519dfc2fa12583fa3b497769d to your computer and use it in GitHub Desktop.
hw12_no_code.ipynb
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "hw12_no_code.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyMHCVu9wCMzzP1CmRVaNPPu",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/qbeer/e52ec7f519dfc2fa12583fa3b497769d/hw12_no_code.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HGAmCco5lyIU"
},
"source": [
"## 1. Word2Vec (something like that) embeddings\n",
"\n",
"* Read the GloVE file into word - vector pairs \n",
"* Create a 2D-embedding with PCA for the 10_000 nearest neighbors (based on L2 distance) for the word 'dog'.\n",
"* Visualize the 2 dimensional embeddings on a plot and add text annotations to it\n",
" * 'dog' should be red\n",
" * only add the nearast 50 neighbors\n",
" * add an alpha (.3) to the 10_000 points (too much to visualize well with text)\n",
"\n",
"## 2. IMDB reviews with word embeddings\n",
"\n",
"Load the 'imdb_review' dataset from 'tf.keras.datasets.imdb' an convert each sentence into a sequence of its GloVe representations. This will generate (n_samples, sample_length, 50) dimensional dataset. \n",
"\n",
" * mean your input along the `sample_length` axis -> this generates a dataset useable to the MLP -> (n_samples, 50)\n",
" * you are basically generating a mean representation of the sentence\n",
" * handle your OOV (out of vocabulary) tokens with e.g. np.zeros(50) -> this does not influence the mean much\n",
"\n",
"Loading the data:\n",
"\n",
" * `(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(\n",
" path=\"imdb.npz\",\n",
" num_words=None,\n",
" skip_top=0,\n",
" maxlen=150,\n",
" seed=113,\n",
" start_char=1,\n",
" oov_char=2,\n",
" index_from=3\n",
")`\n",
" * do the preprocessing this way, this makes the dataset ~9'000 samples large and the maximum length is only 150 words\n",
" * the dataset is represented as index values, so you need to convert twice: index -> word -> GloVe\n",
" * the index-to-word conversion is achievable by Keras, read the documentation\n",
"\n",
"Model defintion:\n",
" * `Dense(256, relu)`,\n",
" * `Dense(64, relu)`,\n",
" * `Dense(1, sigmoid)`\n",
"\n",
"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
"\n",
"***Hint: approximately 55-60% accuracy is achieveable on the test set.***\n",
"\n",
"## 3. Sequence modeling with LSTM\n",
"\n",
" * use the IMDB dataset again converted into GloVe sequences but without the mean operation. This way you are going to generate (n_samples, sequence_length, 50) sample points with different sequence lengths\n",
" * pad every sequence to `150` in length with np.zeros(50) -> (n_samples, 150, 50)\n",
" * LSTM is a recurrent model with intricate inner operations, if you use it in a bideractional fashion, your sequence will be processed from both ends\n",
"\n",
"Model definition:\n",
" * `BidirectionalLSTM(64, return_sequences=True),`\n",
" * `BidirectionalLSTM(64),`\n",
" * `Dense(64, relu)`,\n",
" * `Dense(1, sigmoid)`\n",
"\n",
"Use default parameters in the compile: 'adam', 'binary_crossentropy', 'accuracy' metric. Train for 20-25 epochs.\n",
"\n",
"***Hint: approximately 65-70% accuracy is achieveable on the test set.***"
]
},
{
"cell_type": "code",
"metadata": {
"id": "w-CgB860lzJp"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment