{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generating new lyrics in the style of Weezer with TensorFlow\n",
"\n",
"Welcome to this blog post on Natural Language Processing, specifically generating new text! In this blog post we'll explore how to gather a bunch of Weezer lyrics and train a Recurrent Neural Network (RNN) to generate new lyrics character by character. \n",
"\n",
"Let's dive in!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gathering the Data\n",
"\n",
"Let's leverage an existing an API to gather the lyrics, www.genius.com provides an API we can use, we just need to register an account and then visit: https://genius.com/api-clients and authorize a new API client. (*Note*: You need to provide a placeholder URL that includes https://wwww).\n",
"\n",
"You will then be provided with a CLIENT ID and a secret key which will look something like:\n",
"\n",
"**rux5zYOIGi-9kHDSU8IYhbbbGaxTgopj-uR10K2TKPcYhbbbGaxT**\n",
"\n",
"Also click the button on the page to generate an access token.\n",
"\n",
"Now we use the power of the Python ecosystem to discover that people have already created Python interfaces to easily connect with these API keys.\n",
"\n",
"https://lyricsgenius.readthedocs.io/en/master/\n",
"\n",
"We just need to install the library:\n",
"\n",
" pip install lyricsgenius"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install lyricsgenius"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from lyricsgenius import Genius\n",
"token = input(\"Paste your token here: \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use the Python library and API to grab all of Weezer's lyrics. We should note, depending on how many songs you query you may get data contamination due to Live recordings duplicating data, Rivers Cuomo's solo work being labeled as Weezer, and Weezer's [Teal Album](https://en.wikipedia.org/wiki/Weezer_(Teal_Album)) which is actually a bunch of 80s covers. We could clean this up, but for now we'll just treat it all as Weezer lyrics. We'll limit this to 200 songs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"genius = Genius(token)\n",
"artist = genius.search_artist('Weezer',max_songs=200)"
]
},
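{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Optional)* If you did want to reduce the contamination mentioned above, one hypothetical cleanup is to skip songs whose titles flag them as live recordings. This is only a rough sketch; the keyword check is illustrative and won't catch every duplicate:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical filter: drop songs whose titles suggest live recordings\n",
"# (a rough heuristic, not an exhaustive cleanup)\n",
"studio_songs = [song for song in artist.songs if '(live' not in song.title.lower()]\n",
"len(studio_songs)"
]
},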
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's quickly save these lyrics to a JSON file."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wrote Weezer200.json.\n"
]
}
],
"source": [
"artist.save_lyrics('Weezer200.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then loop through all the songs and grab the lyrics, let's explore the first few and explore the structure:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Say It Ain’t So Lyrics[Intro]\n",
"Oh, yeah\n",
"Alright\n",
"\n",
"[Verse 1]\n",
"Somebody's Heine\n",
"Is crowding my icebox\n",
"Somebody's cold one\n",
"Is giving me chills\n",
"Guess I'll just close my eyes\n",
"\n",
"[Pre-Chorus]\n",
"Oh, yeah\n",
"Alright\n",
"Feels good\n",
"Inside\n",
"[Verse 2]\n",
"Flip on the telly\n",
"Wrestle with Jimmy\n",
"Somethin' is bubblin'\n",
"Behind my back\n",
"The bottle is ready to blow\n",
"\n",
"[Chorus]\n",
"Say it ain't so\n",
"Your drug is a heartbreaker\n",
"Say it ain't so\n",
"My love is a life taker\n",
"\n",
"[Verse 3]\n",
"I can't confront you\n",
"I never could do\n",
"That which might hurt you\n",
"So try and be cool\n",
"When I say\n",
"\"This way\n",
"Is a water slide away from me that takes you further every day\"\n",
"So be cool\n",
"\n",
"[Chorus]\n",
"Say it ain't so\n",
"Your drug is a heartbreaker\n",
"Say it ain't so\n",
"My love is a life taker\n",
"[Bridge]\n",
"Dear daddy, I write you\n",
"In spite of years of silence\n",
"You've cleaned up, found Jesus\n",
"Things are good, or so I hear\n",
"This bottle of Stephen's\n",
"Awakens ancient feelin's\n",
"Like father, stepfather\n",
"The son is drowning in the flood\n",
"Yeah, yeah-yeah, yeah-yeah\n",
"\n",
"[Guitar Solo]\n",
"\n",
"[Chorus]\n",
"Say it ain't so\n",
"Your drug is a heartbreaker\n",
"Say it ain't so\n",
"My love is a life taker77Embed\n",
"Buddy Holly Lyrics[Verse 1]\n",
"What's with these homies dissin' my girl?\n",
"Why do they gotta front?\n",
"What did we ever do to these guys\n",
"That made them so violent?\n",
"\n",
"[Pre-Chorus]\n",
"Woo-hoo, but you know I'm yours\n",
"Woo-hoo, and I know you're mine\n",
"Woo-hoo, and that's for all of time\n",
"\n",
"[Chorus]\n",
"Ooh wee ooh, I look just like Buddy Holly\n",
"Oh oh, and you're Mary Tyler Moore\n",
"I don't care what they say about us anyway\n",
"I don't care 'bout that\n",
"[Verse 2]\n",
"Don't you ever fear, I'm always near\n",
"I know that you need help\n",
"Your tongue is twisted, your eyes are slit\n",
"You need a guardian\n",
"\n",
"[Pre-Chorus]\n",
"Woo-hoo, and you know I'm yours\n",
"Woo-hoo, and I know you're mine\n",
"Woo-hoo, and that's for all of time\n",
"\n",
"[Chorus]\n",
"Ooh wee ooh, I look just like Buddy Holly\n",
"Oh oh, and you're Mary Tyler Moore\n",
"I don't care what they say about us anyway\n",
"I don't care 'bout that\n",
"I don't care 'bout that\n",
"\n",
"[Bridge]\n",
"Bang, bang, knocking on the door\n",
"Another big bang, get down on the floor\n",
"Oh no, what do we do?\n",
"Don't look now, but I lost my shoe\n",
"I can't run and I can't kick\n",
"What's a matter, babe, are you feelin' sick?\n",
"What's a matter, what's a matter, what's a matter you?\n",
"What's a matter, babe, are you feelin' blue? Oh-oh\n",
"[Guitar Solo]\n",
"\n",
"[Pre-Chorus]\n",
"And that's for all of time\n",
"And that's for all of time\n",
"\n",
"[Chorus]\n",
"Ooh wee ooh, I look just like Buddy Holly\n",
"Oh oh, and you're Mary Tyler Moore\n",
"I don't care what they say about us anyway\n",
"I don't care 'bout that\n",
"I don't care 'bout that\n",
"I don't care 'bout that\n",
"I don't care 'bout that39Embed\n"
]
}
],
"source": [
"for song in artist.songs[:2]:\n",
" print(song.lyrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the use of whitespace and terms such as [Chorus] and [Bridge] to define the song structure.\n",
"\n",
"We now have two options for tokenization, we can tokenize words, for example:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['I', 'look', 'just', 'like', 'Buddy', 'Holly']"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lyrics = 'I look just like Buddy Holly'\n",
"tokens = lyrics.split()\n",
"tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or we could tokenize by character:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['I',\n",
" ' ',\n",
" 'l',\n",
" 'o',\n",
" 'o',\n",
" 'k',\n",
" ' ',\n",
" 'j',\n",
" 'u',\n",
" 's',\n",
" 't',\n",
" ' ',\n",
" 'l',\n",
" 'i',\n",
" 'k',\n",
" 'e',\n",
" ' ',\n",
" 'B',\n",
" 'u',\n",
" 'd',\n",
" 'd',\n",
" 'y',\n",
" ' ',\n",
" 'H',\n",
" 'o',\n",
" 'l',\n",
" 'l',\n",
" 'y']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lyrics = \"I look just like Buddy Holly\"\n",
"list(lyrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice whitespace and characters such as '\\n' denoting a new line will be kept if we tokenize by individual characters. While you should expect more human like results if we tokenize by words, let's push the limits of TensorFlow and RNNs by exploring just how well a simple Recurrent Neural Network can work if we try to have it predict characters. You should note that for the purposes of RNN text generation this is a tiny dataset. Let's go through some text processing to create a numeric representation of the letters through text vectorization.\n",
"\n",
"## Text Processing\n",
"\n",
"We know a neural network can't take in the raw string data, we need to assign numbers to each character. Let's create two dictionaries that can go from numeric index to character and character to numeric index."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"text = ''\n",
"\n",
"# This will remove special characters for songs that are not in english\n",
"special_chars = set([ 'ँ', 'आ', 'ए', 'क', 'ग', 'ज', 'त',\n",
" 'द', 'ध', 'न', 'प', 'ब', 'म', 'य', 'र', 'व', 'श', 'ह', 'ा', 'ि',\n",
" 'ी', 'ू', 'े', 'ै', 'ो', '्', '\\u2005', '\\u200c', '—', '‘', '’',\n",
" '\\u205f', '느', '사', '어', '이', '제', '죠', '품', '회','\\u0435','\\xa0'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for song in artist.songs:\n",
" chars =set(song.lyrics)\n",
" if len(chars.intersection(special_chars)) == 0:\n",
" text += song.lyrics\n",
" else:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f = open(\"200_Weezer_Song_Lyrics.txt\",\"w\")\n",
"f.write(text)\n",
"f.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can always read in the txt data with:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Read in cleaned text\n",
"with open('200_Weezer_Song_Lyrics.txt') as f:\n",
" text = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# The unique characters in the file\n",
"vocab = sorted(set(text))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "IalZLbvOzf-F"
},
"outputs": [],
"source": [
"char_to_ind = {u:i for i, u in enumerate(vocab)}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "fmmP5iCwm4rp"
},
"outputs": [],
"source": [
"# char_to_ind"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "30ZYaWAOm4rt"
},
"outputs": [],
"source": [
"ind_to_char = np.array(vocab)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_6JPOWwJm4rz"
},
"outputs": [
{
"data": {
"text/plain": [
"array(['\\n', ' ', '!', '\"', '&', \"'\", '(', ')', '*', ',', '-', '.', '/',\n",
" '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?',\n",
" 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',\n",
" 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[',\n",
" ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',\n",
" 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',\n",
" 'z', '¡', 'ó'], dtype='<U1')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ind_to_char"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "3fhOqV0lm4r2"
},
"outputs": [],
"source": [
"encoded_text = np.array([char_to_ind[c] for c in text])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "axOX7rFom4r5"
},
"outputs": [
{
"data": {
"text/plain": [
"array([27, 73, 56, ..., 54, 57, 56])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoded_text"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tZfqhkYCymwX"
},
"source": [
"We now have a mapping we can use to go back and forth from characters to numerics."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "tFs1Uza-m4r9"
},
"outputs": [
{
"data": {
"text/plain": [
"'Buddy Holly Lyrics[V'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample = text[:20]\n",
"sample"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gIqUCK5Am4sB"
},
"outputs": [
{
"data": {
"text/plain": [
"array([27, 73, 56, 56, 77, 1, 33, 67, 64, 64, 77, 1, 37, 77, 70, 61, 55,\n",
" 71, 51, 47])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoded_text[:20]"
]
},
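{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can decode those indices back into characters with ind_to_char and confirm the round trip is lossless:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Decode the first 20 indices back to text; this should match `sample` above\n",
"''.join(ind_to_char[encoded_text[:20]])"
]
},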
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bbmsf23Bymwe"
},
"source": [
"## Step 3: Creating Batches\n",
"\n",
"Overall what we are trying to achieve is to have the model predict the next highest probability character given a historical sequence of characters. Its up to us (the user) to choose how long that historic sequence. Too short a sequence and we don't have enough information (e.g. given the letter \"a\" , what is the next character) , too long a sequence and training will take too long and most likely overfit to sequence characters that are irrelevant to characters farther out. While there is no correct sequence length choice, you should consider the text itself, how long normal phrases are in it, and a reasonable idea of what characters/words are relevant to each other."
]
},
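{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quick way to get a feel for a reasonable sequence length is to look at the text itself, for example the average length of a lyric line (a rough heuristic, not a rule):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Average characters per lyric line; a seq_len covering several lines'\n",
"# worth of characters is a reasonable starting point\n",
"np.mean([len(line) for line in text.split('\\n')])"
]
},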
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "pAvUYFk7m4sF"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Buddy Holly Lyrics[Verse 1]\n",
"What's with these homies dissin' my girl?\n",
"Why do they gotta front?\n",
"What did we ever do to these guys\n",
"That made them so violent?\n",
"\n",
"[Pre-Chorus]\n",
"Woo-hoo, but you know I'm yours\n",
"Woo-hoo, and I know you're mine\n",
"Woo-hoo, and that's for all of time\n",
"\n",
"[Chorus]\n",
"Ooh wee ooh, I look just like Buddy Holly\n",
"Oh oh, and you're Mary Tyler Moore\n",
"I don't care what they say about us anyway\n",
"I don't care 'bout that\n",
"[Verse 2]\n",
"Don't you ever fear, I'm always near\n",
"I know that you need help\n",
"You\n"
]
}
],
"source": [
"print(text[:500])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "D45OYgOfm4sJ"
},
"outputs": [],
"source": [
"line = \"Say it ain't so\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "7dKiEVN8m4sL"
},
"outputs": [
{
"data": {
"text/plain": [
"15"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(line)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hgsVvVxnymwf"
},
"source": [
"### Bringing in TensorFlow\n",
"\n",
"Now its time to start shaping our data to work with TensorFlow"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]\n"
]
}
],
"source": [
"print(tf.config.list_physical_devices('GPU'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hgsVvVxnymwf"
},
"source": [
"### Training Sequences\n",
"\n",
"The actual text data will be the text sequence shifted one character forward. For example:\n",
"\n",
"Sequence In: \"Hello my nam\"\n",
"Sequence Out: \"ello my name\"\n",
"\n",
"\n",
"We can use the `tf.data.Dataset.from_tensor_slices` function to convert a text vector into a stream of character indices."
]
},
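{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shift itself is just plain Python slicing; here's a quick sketch before we do it with TensorFlow:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_seq = \"Hello my name\"\n",
"print(sample_seq[:-1])  # input sequence:  \"Hello my nam\"\n",
"print(sample_seq[1:])   # target sequence: \"ello my name\""
]
},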
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "0UHJDA39zf-O"
},
"outputs": [],
"source": [
"seq_len = 120"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "7VRSK4cOm4sZ"
},
"outputs": [],
"source": [
"total_num_seq = len(text)//(seq_len+1)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xtW0jbbvm4sc"
},
"outputs": [
{
"data": {
"text/plain": [
"1578"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_num_seq"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ciatnowvm4se"
},
"outputs": [],
"source": [
"# Create Training Sequences\n",
"char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)\n",
"\n",
"# for i in char_dataset.take(500):\n",
"# print(ind_to_char[i.numpy()])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-ZSYAcQV8OGP"
},
"source": [
"The **batch** method converts these individual character calls into sequences we can feed in as a batch. We use seq_len+1 because of zero indexing. Here is what drop_remainder means:\n",
"\n",
"drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing\n",
" whether the last batch should be dropped in the case it has fewer than\n",
" `batch_size` elements; the default behavior is not to drop the smaller\n",
" batch.\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "l4hkDU3i7ozi"
},
"outputs": [],
"source": [
"sequences = char_dataset.batch(seq_len+1, drop_remainder=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "UbLcIPBj_mWZ"
},
"source": [
"Now that we have our sequences, we will perform the following steps for each one to create our target text sequences:\n",
"\n",
"1. Grab the input text sequence\n",
"2. Assign the target text sequence as the input text sequence shifted by one step forward\n",
"3. Group them together as a tuple"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9NGu-FkO_kYU"
},
"outputs": [],
"source": [
"def create_seq_targets(seq):\n",
" input_txt = seq[:-1]\n",
" target_txt = seq[1:]\n",
" return input_txt, target_txt"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "HszljTg8m4so"
},
"outputs": [],
"source": [
"dataset = sequences.map(create_seq_targets)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "JkPa7AMrm4sq"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[27 73 56 56 77 1 33 67 64 64 77 1 37 77 70 61 55 71 51 47 57 70 71 57\n",
" 1 14 52 0 48 60 53 72 5 71 1 75 61 72 60 1 72 60 57 71 57 1 60 67\n",
" 65 61 57 71 1 56 61 71 71 61 66 5 1 65 77 1 59 61 70 64 25 0 48 60\n",
" 77 1 56 67 1 72 60 57 77 1 59 67 72 72 53 1 58 70 67 66 72 25 0 48\n",
" 60 53 72 1 56 61 56 1 75 57 1 57 74 57 70 1 56 67 1 72 67 1 72 60]\n",
"Buddy Holly Lyrics[Verse 1]\n",
"What's with these homies dissin' my girl?\n",
"Why do they gotta front?\n",
"What did we ever do to th\n",
"\n",
"\n",
"[73 56 56 77 1 33 67 64 64 77 1 37 77 70 61 55 71 51 47 57 70 71 57 1\n",
" 14 52 0 48 60 53 72 5 71 1 75 61 72 60 1 72 60 57 71 57 1 60 67 65\n",
" 61 57 71 1 56 61 71 71 61 66 5 1 65 77 1 59 61 70 64 25 0 48 60 77\n",
" 1 56 67 1 72 60 57 77 1 59 67 72 72 53 1 58 70 67 66 72 25 0 48 60\n",
" 53 72 1 56 61 56 1 75 57 1 57 74 57 70 1 56 67 1 72 67 1 72 60 57]\n",
"uddy Holly Lyrics[Verse 1]\n",
"What's with these homies dissin' my girl?\n",
"Why do they gotta front?\n",
"What did we ever do to the\n"
]
}
],
"source": [
"for input_txt, target_txt in dataset.take(1):\n",
" print(input_txt.numpy())\n",
" print(''.join(ind_to_char[input_txt.numpy()]))\n",
" print('\\n')\n",
" print(target_txt.numpy())\n",
" # There is an extra whitespace!\n",
" print(''.join(ind_to_char[target_txt.numpy()]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "MJdfPmdqzf-R"
},
"source": [
"### Generating training batches\n",
"\n",
"Now that we have the actual sequences, we will create the batches, we want to shuffle these sequences into a random order, so the model doesn't overfit to any section of the text, but can instead generate characters given any seed text."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p2pGotuNzf-S"
},
"outputs": [],
"source": [
"# Batch size\n",
"batch_size = 128\n",
"\n",
"# Buffer size to shuffle the dataset so it doesn't attempt to shuffle\n",
"# the entire sequence in memory. Instead, it maintains a buffer in which it shuffles elements\n",
"buffer_size = 10000\n",
"\n",
"dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gmcCALymm4su"
},
"outputs": [
{
"data": {
"text/plain": [
"<BatchDataset element_spec=(TensorSpec(shape=(128, 120), dtype=tf.int32, name=None), TensorSpec(shape=(128, 120), dtype=tf.int32, name=None))>"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "r6oUuElIMgVx"
},
"source": [
"## Creating the Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "m8gPwEjRzf-Z"
},
"source": [
"We will use an LSTM based model with a few extra features, including an embedding layer to start off with and **two** LSTM layers. We based this model architecture off the [DeepMoji](https://deepmoji.mit.edu/) and the original source code can be found [here](https://github.com/bfelbo/DeepMoji).\n",
"\n",
"The embedding layer will serve as the input layer, which essentially creates a lookup table that maps the numbers indices of each character to a vector with \"embedding dim\" number of dimensions. As you can imagine, the larger this embedding size, the more complex the training. This is similar to the idea behind word2vec, where words are mapped to some n-dimensional space. Embedding before feeding straight into the LSTM usually leads to more realisitic results."
]
},
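{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the lookup-table idea concrete, here is a minimal sketch of an embedding layer in isolation (the vocabulary size of 81 and embedding dimension of 64 match the values we use below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A standalone Embedding layer: each integer index becomes a trainable 64-dim vector\n",
"demo_embedding = tf.keras.layers.Embedding(input_dim=81, output_dim=64)\n",
"\n",
"# A batch of one sequence of 3 character indices (the encoding of 'Bud' from earlier)\n",
"demo_batch = tf.constant([[27, 73, 56]])\n",
"\n",
"# Output shape: (batch_size, sequence_length, embed_dim)\n",
"demo_embedding(demo_batch).shape"
]
},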
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "zHT8cLh7EAsg"
},
"outputs": [],
"source": [
"# Length of the vocabulary in chars\n",
"vocab_size = len(vocab)\n",
"\n",
"# The embedding dimension\n",
"embed_dim = 64\n",
"\n",
"# Number of RNN units\n",
"rnn_neurons = 1026"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Atb060h5m4s0"
},
"source": [
"Now let's create a function that easily adapts to different variables as shown above."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "YeRlEXgym4s1"
},
"outputs": [],
"source": [
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import LSTM,Dense,Embedding,Dropout,GRU"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FcMbIy-xj-w-"
},
"source": [
"### Setting up Loss Function\n",
"\n",
"For our loss we will use sparse categorical crossentropy, which we can import from Keras. We will also set this as logits=True"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VoFVGKlNkJfW"
},
"outputs": [],
"source": [
"from tensorflow.keras.losses import sparse_categorical_crossentropy"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "sblCzZoslZKH"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on function sparse_categorical_crossentropy in module keras.losses:\n",
"\n",
"sparse_categorical_crossentropy(y_true, y_pred, from_logits=False, axis=-1)\n",
" Computes the sparse categorical crossentropy loss.\n",
" \n",
" Standalone usage:\n",
" \n",
" >>> y_true = [1, 2]\n",
" >>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]\n",
" >>> loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)\n",
" >>> assert loss.shape == (2,)\n",
" >>> loss.numpy()\n",
" array([0.0513, 2.303], dtype=float32)\n",
" \n",
" Args:\n",
" y_true: Ground truth values.\n",
" y_pred: The predicted values.\n",
" from_logits: Whether `y_pred` is expected to be a logits tensor. By default,\n",
" we assume that `y_pred` encodes a probability distribution.\n",
" axis: Defaults to -1. The dimension along which the entropy is\n",
" computed.\n",
" \n",
" Returns:\n",
" Sparse categorical crossentropy loss value.\n",
"\n"
]
}
],
"source": [
"help(sparse_categorical_crossentropy)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "FrOOK61Olm1C"
},
"outputs": [],
"source": [
"def sparse_cat_loss(y_true,y_pred):\n",
" \n",
" return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)"
]
},
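{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch of what from_logits=True does: the loss applies the softmax internally, so we can pass in the raw, unnormalized scores that our final Dense layer will output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Raw logits for one prediction over 3 classes; the true class is index 0\n",
"demo_logits = tf.constant([[2.0, 1.0, 0.1]])\n",
"demo_true = tf.constant([0])\n",
"\n",
"# from_logits=True tells the loss to apply softmax to the raw scores first\n",
"sparse_categorical_crossentropy(demo_true, demo_logits, from_logits=True).numpy()"
]
},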
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "MtCrdfzEI2N0"
},
"outputs": [],
"source": [
"def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):\n",
" model = Sequential()\n",
" model.add(Embedding(vocab_size, embed_dim,batch_input_shape=[batch_size, None]))\n",
" model.add(GRU(rnn_neurons,return_sequences=True,stateful=True,recurrent_initializer='glorot_uniform'))\n",
" # Final Dense Layer to Predict\n",
" model.add(Dense(vocab_size))\n",
" model.compile(optimizer='adam', loss=sparse_cat_loss) \n",
" return model"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "wwsrpOik5zhv"
},
"outputs": [],
"source": [
"model = create_model(\n",
" vocab_size = vocab_size,\n",
" embed_dim=embed_dim,\n",
" rnn_neurons=rnn_neurons,\n",
" batch_size=batch_size)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "liXuTFYMm4s6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"sequential\"\n",
"_________________________________________________________________\n",
" Layer (type) Output Shape Param # \n",
"=================================================================\n",
" embedding (Embedding) (128, None, 64) 5184 \n",
" \n",
" gru (GRU) (128, None, 1026) 3361176 \n",
" \n",
" dense (Dense) (128, None, 81) 83187 \n",
" \n",
"=================================================================\n",
"Total params: 3,449,547\n",
"Trainable params: 3,449,547\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LJL0Q0YPY6Ee"
},
"source": [
"## Training the model\n",
"\n",
"Let's make sure everything is ok with our model before we spend too much time training! Let's pass in a batch to confirm the model currently predicts random characters without any training.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "A4ygvfHn-wan"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(128, 120, 81) <=== (batch_size, sequence_length, vocab_size)\n"
]
}
],
"source": [
"for input_example_batch, target_example_batch in dataset.take(1):\n",
"\n",
" # Predict off some random batch\n",
" example_batch_predictions = model(input_example_batch)\n",
"\n",
" # Display the dimensions of the predictions\n",
" print(example_batch_predictions.shape, \" <=== (batch_size, sequence_length, vocab_size)\")\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5ld8z3LPBAuv"
},
"outputs": [],
"source": [
"# example_batch_predictions"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_achqjT-BGyY"
},
"outputs": [],
"source": [
"sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)"
]
},
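{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we *sample* from the output distribution with tf.random.categorical rather than greedily taking the argmax; always picking the single most likely character can easily get the model stuck in a repetitive loop. For comparison, a greedy sketch would look like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Greedy alternative (not used): always pick the highest-scoring character\n",
"greedy_indices = tf.argmax(example_batch_predictions[0], axis=-1).numpy()\n",
"greedy_indices[:10]"
]
},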
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xWrPFk2nBJX4"
},
"outputs": [],
"source": [
"# sampled_indices"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Wi80PQVtBLqj"
},
"outputs": [],
"source": [
"# Reformat to not be a lists of lists\n",
"sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "4qYkIg00-wjq"
},
"outputs": [
{
"data": {
"text/plain": [
"array([66, 79, 76, 41, 5, 11, 74, 72, 70, 1, 73, 19, 26, 36, 30, 68, 0,\n",
" 10, 78, 31, 73, 49, 13, 74, 38, 48, 76, 35, 51, 10, 54, 56, 22, 30,\n",
" 63, 49, 65, 55, 61, 63, 31, 46, 76, 39, 58, 49, 68, 50, 50, 33, 45,\n",
" 70, 27, 64, 47, 1, 18, 25, 23, 62, 21, 68, 8, 42, 60, 51, 11, 30,\n",
" 48, 29, 18, 12, 43, 65, 24, 4, 17, 67, 32, 58, 30, 79, 21, 56, 35,\n",
" 16, 42, 51, 23, 41, 38, 39, 70, 27, 72, 55, 37, 66, 20, 58, 24, 56,\n",
" 57, 66, 65, 12, 45, 34, 26, 70, 4, 38, 43, 74, 25, 51, 69, 8, 75,\n",
" 75], dtype=int64)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sampled_indices"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "H9-P_XqQ_7wY"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Given the input seq: \n",
"\n",
" of wonderful\n",
"Kiss me while we both are still intangible\n",
"\n",
"[Chorus]\n",
"Everybody needs salvation\n",
"And if it doesn't come we m\n",
"\n",
"\n",
"Next Char Predictions: \n",
"\n",
"n¡xP'.vtr u6AKEp\n",
"-zFuY0vMWxJ[-bd9EkYmcikFUxNfYpZZHTrBlV 5?:j8p*Qh[.EWD5/Rm;&4oGfE¡8dJ3Q[:PMNrBtcLn7f;denm/TIAr&MRv?[q*ww\n"
]
}
],
"source": [
"print(\"Given the input seq: \\n\")\n",
"print(\"\".join(ind_to_char[input_example_batch[0]]))\n",
"print('\\n')\n",
"print(\"Next Char Predictions: \\n\")\n",
"print(\"\".join(ind_to_char[sampled_indices ]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alright, looks like everything's working, we just need to train the network to learn from our small dataset, let's train it!"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Num GPUs Available: 1\n"
]
}
],
"source": [
"print(\"Num GPUs Available: \", len(tf.config.list_physical_devices('GPU')))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"epochs = 100\n",
"\n",
"model.fit(dataset,epochs=epochs)"
]
},
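{
"cell_type": "markdown",
"metadata": {},
"source": [
"100 epochs can take a while. As a hypothetical safeguard (not part of the run above), you could checkpoint the weights each epoch so a crash doesn't cost you the whole training run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical alternative: save weights every epoch (the filename is illustrative)\n",
"from tensorflow.keras.callbacks import ModelCheckpoint\n",
"\n",
"checkpoint = ModelCheckpoint('weezer_gen_ckpt.h5', save_weights_only=True)\n",
"# model.fit(dataset, epochs=epochs, callbacks=[checkpoint])"
]
},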
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "kKkD5M6eoSiN"
},
"source": [
"## Generating text\n",
"\n",
"Currently our model only expects 128 sequences at a time. We can create a new model that only expects a batch_size=1. We can create a new model with this batch size, then load our saved models weights. Then call .build() on the model:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "eYRNG57Govdc"
},
"outputs": [],
"source": [
"model.save('weezer_gen.h5') "
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "GCoJayFS8H4d"
},
"outputs": [],
"source": [
"from tensorflow.keras.models import load_model"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_iXG3VJvEXWM"
},
"outputs": [],
"source": [
"model = create_model(vocab_size, embed_dim, rnn_neurons, batch_size=1)\n",
"\n",
"model.load_weights('weezer_gen.h5')\n",
"\n",
"model.build(tf.TensorShape([1, None]))\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "LAX3p7_YEilU"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"sequential_1\"\n",
"_________________________________________________________________\n",
" Layer (type) Output Shape Param # \n",
"=================================================================\n",
" embedding_1 (Embedding) (1, None, 64) 5184 \n",
" \n",
" gru_1 (GRU) (1, None, 1026) 3361176 \n",
" \n",
" dense_1 (Dense) (1, None, 81) 83187 \n",
" \n",
"=================================================================\n",
"Total params: 3,449,547\n",
"Trainable params: 3,449,547\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"model.summary()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "WvuwZBX5Ogfd"
},
"outputs": [],
"source": [
"def generate_text(model, start_seed,gen_size=100,temp=1.0):\n",
" \n",
" '''\n",
" model: Trained Model to Generate Text\n",
" start_seed: Intial Seed text in string form\n",
" gen_size: Number of characters to generate\n",
"\n",
" Basic idea behind this function is to take in some seed text, format it so\n",
" that it is in the correct shape for our network, then loop the sequence as\n",
" we keep adding our own predicted characters. Similar to our work in the RNN\n",
" time series problems.\n",
" '''\n",
"\n",
" # Number of characters to generate\n",
" num_generate = gen_size\n",
"\n",
" # Vecotrizing starting seed text\n",
" input_eval = [char_to_ind[s] for s in start_seed]\n",
"\n",
" # Expand to match batch format shape\n",
" input_eval = tf.expand_dims(input_eval, 0)\n",
"\n",
" # Empty list to hold resulting generated text\n",
" text_generated = []\n",
"\n",
" # Temperature effects randomness in our resulting text\n",
" # The term is derived from entropy/thermodynamics.\n",
" # The temperature is used to effect probability of next characters.\n",
" # Higher probability == lesss surprising/ more expected\n",
" # Lower temperature == more surprising / less expected\n",
"\n",
" temperature = temp\n",
"\n",
" # Here batch size == 1\n",
" model.reset_states()\n",
"\n",
" for i in range(num_generate):\n",
"\n",
" # Generate Predictions\n",
" predictions = model(input_eval)\n",
"\n",
" # Remove the batch shape dimension\n",
" predictions = tf.squeeze(predictions, 0)\n",
"\n",
" # Use a cateogircal disitribution to select the next character\n",
" predictions = predictions / temperature\n",
" predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()\n",
"\n",
" # Pass the predicted charracter for the next input\n",
" input_eval = tf.expand_dims([predicted_id], 0)\n",
"\n",
" # Transform back to character letter\n",
" text_generated.append(ind_to_char[predicted_id])\n",
"\n",
" return (start_seed + ''.join(text_generated))"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "bS69SG5D5lwd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hey, thirk you're wild\n",
"When you're out there's gotta be more than this\n",
"\n",
"[Chorus]\n",
"This is such a pity, we should give, eloo!\n",
"\n",
"[Chorus]\n",
"Other way, other way\n",
"I will turn and look the other way\n",
"Other way, other way\n",
"I will turn any nime anymoreahe that pean painful stars\n",
"I hear them blahs harm meaning some om this\n",
"I did what my body told me to\n",
"I didn't mean to do youm ruse\n",
"There's no better then, hen?\n",
"[Chorus]\n",
"I'm a troublemaker, never bight cliends\n",
"Livifill our novep and every will remember prom night?\n",
"Remember prom night\n",
"Oo-oo-hoo-hoo\n",
"\n",
"[Verse 1]\n",
"You wanna cry\n",
"When you're crind\n",
"\n",
"[Verse 2]\n",
"Keep on, blah, blah, blah\n",
"We can go up (Hey), we go down (Hey)\n",
"We gon' run you out of town\n",
"We can go up (Hey), we go down)\n",
"Yes, I'm down if you're down\n",
"California snow, never let me got so gond\n",
"In the mall, I was in the mall\n",
"I was in the mall, I was in the mall\n",
"I was in the mall\n",
"Keep it steady\n",
"I'm s your bat\n",
"Give me feel, I can't tell you how the words hove and go\n",
"Felland Hellont that thinks And I mate the g\n"
]
}
],
"source": [
"print(generate_text(model,\"Hey\",gen_size=1000))"
]
},
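{
"cell_type": "markdown",
"metadata": {},
"source": [
"The temp parameter is worth experimenting with: as a sketch (using the same trained model), a lower temperature should give safer, more repetitive text, while a higher temperature gives wilder, less coherent output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Lower temperature: more predictable (but more repetitive) lyrics\n",
"print(generate_text(model, \"Hey\", gen_size=300, temp=0.5))"
]
},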
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you may not be very impressed by the results, but take a closer look at the output and remember that the model is predicting character by character! This means its actually starting to learn the structure of a song, you can see it begin to learn concepts of structure by utilizing whitespace and markers such as [Verse], [Bridge], and [Chorus] , which is incredible given how small the dataset is. We also see some evidence of overfitting as the model begins to just duplicate existing song lyrics (this is only something you can tell if you're quite the Weezer fan and recognize the lyrics of multiple songs starting to merge).\n",
"\n",
"Try taking this further by building out an even larger data set that includes more bands, or play around with the model hyperparameters or training epochs!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}