@Orbifold
Last active August 25, 2019 14:37
Analyzing sentiment using TensorFlow RC0 and NLTK.
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Sentiment analysis\n",
"\n",
"This is a recipe to analyze positive vs. negative sentiments in text using [NLTK](https://www.nltk.org) and [TensorFlow v2](http://tensorflow.org).\n",
"\n",
"One can find datasets in many places, especially on [Kaggle](https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data). The datasets usually come as separate 'positive' and 'negative' parts, but sometimes as a single set with a column denoting the sentiment. If you don't have two sets like this, the NLTK `movie_reviews` corpus has everything you need, but you have to assemble things a bit because the reviews come as separate files.\n",
"\n",
"Easy enough, use something like this to compile the separate files into just two files:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import load_files\n",
"\n",
"# Adjust the paths below to your own machine.\n",
"moviedir = r'/Users/You/nltk_data/corpora/movie_reviews'\n",
"movie_train = load_files(moviedir, shuffle=True)\n",
"\n",
"pos = \"\"\n",
"neg = \"\"\n",
"for i, item in enumerate(movie_train.data):\n",
"    # load_files numbers the category folders alphabetically: 0 = neg, 1 = pos.\n",
"    if movie_train.target[i] == 0:\n",
"        neg += item.decode(\"utf-8\").replace(\"\\n\", \" \") + \"\\n\"\n",
"    else:\n",
"        pos += item.decode(\"utf-8\").replace(\"\\n\", \" \") + \"\\n\"\n",
"\n",
"# Each output file holds one review per line.\n",
"with open('/Users/You/desktop/positive.txt', 'wt') as f:\n",
"    f.write(pos)\n",
"with open('/Users/You/desktop/negative.txt', 'wt') as f:\n",
"    f.write(neg)"
],
"outputs": [],
"execution_count": null,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"Of course, you need some packages. Note that this example was run on **TensorFlow 2.0.0-rc0**."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from nltk.tokenize import word_tokenize\n",
"from nltk.stem import WordNetLemmatizer\n",
"import numpy as np\n",
"import random\n",
"import pickle\n",
"from collections import Counter\n",
"\n",
"# TensorFlow and tf.keras\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"\n",
"# Helper libraries\n",
"import matplotlib.pyplot as plt\n",
"\n",
"print(tf.__version__)"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"2.0.0-rc0\n"
]
}
],
"execution_count": 2,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"Normally you would use an embedding layer (possibly initialized with pretrained vectors such as [GloVe](https://nlp.stanford.edu/projects/glove/)), but let's approach things in a simplistic fashion here. For every review we'll create one big vector with an entry for every word that appears more than 50 but fewer than 1000 times across all reviews.\n",
"\n",
"Lemmatization is the process of converting a word to its base form, for example 'movies' to 'movie'. NLTK's `WordNetLemmatizer` performs this conversion."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"lemmatizer = WordNetLemmatizer()\n",
"max_lines = 10000000  # upper bound on the number of lines read per file\n",
"pos = 'positive.txt'\n",
"neg = 'negative.txt'\n",
"\n",
"\n",
"def create_lexicon(pos, neg):\n",
"    '''\n",
"    Returns a list with the most important words\n",
"    for the given positive and negative reviews.\n",
"    '''\n",
"    lexicon = []\n",
"    for fi in [pos, neg]:\n",
"        with open(fi, 'r') as f:\n",
"            contents = f.readlines()\n",
"            for l in contents[:max_lines]:\n",
"                all_words = word_tokenize(l.lower())\n",
"                lexicon += list(all_words)\n",
"\n",
"    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]\n",
"    w_counts = Counter(lexicon)\n",
"\n",
"    # Keep words appearing more than 50 but fewer than 1000 times;\n",
"    # this drops both rare words and very common ones.\n",
"    l2 = []\n",
"    for w in w_counts:\n",
"        if 1000 > w_counts[w] > 50:\n",
"            l2.append(w)\n",
"    return l2 "
],
"outputs": [],
"execution_count": 4,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"With the lexicon in hand, we convert each review to a vector of the size of the lexicon, where each entry counts how often the corresponding word appears in the review. \n",
"This is a poor man's way of embedding text in a low-dimensional vector space. The simplification is that we do not embed affinity between words: there is no statistical distribution or minimization involved in this simple algorithm."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"def create_embedding(sample, lexicon, classification):\n",
"    '''\n",
"    Returns a lexicon-sized vector for each review.\n",
"    '''\n",
"    featureset = []\n",
"    with open(sample, 'r') as f:\n",
"        contents = f.readlines()\n",
"        for l in contents[:max_lines]:\n",
"            current_words = word_tokenize(l.lower())\n",
"            current_words = [lemmatizer.lemmatize(i) for i in current_words]\n",
"            features = np.zeros(len(lexicon))\n",
"            for word in current_words:\n",
"                if word.lower() in lexicon:\n",
"                    index_value = lexicon.index(word.lower())\n",
"                    features[index_value] += 1\n",
"\n",
"            features = list(features)\n",
"            featureset.append([features, classification])\n",
"\n",
"    return featureset "
],
"outputs": [],
"execution_count": 73,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"Now we apply this to the dataset, labeling positive reviews as `[1,0]` and negative reviews as `[0,1]`. "
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"lexicon = create_lexicon(pos,neg) \n",
"features = [] \n",
"features += create_embedding(pos, lexicon,[1,0]) \n",
"features += create_embedding(neg, lexicon,[0,1]) \n",
"random.shuffle(features) \n",
"# Rows are ragged ([count vector, label]), so recent NumPy versions need dtype=object.\n",
"features = np.array(features, dtype=object) \n",
"\n"
],
"outputs": [],
"execution_count": 74,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"We hold out 10% of the data for testing and create NumPy arrays because that's what TensorFlow expects. We switch from lists to NumPy arrays because arrays are much easier to slice by column."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"testing_size = int(0.1*len(features)) \n",
"X_train = np.array(list(features[:,0][:-testing_size])) \n",
"y_train = np.array(list(features[:,1][:-testing_size])) \n",
"X_test = np.array(list(features[:,0][-testing_size:])) \n",
"y_test = np.array(list(features[:,1][-testing_size:]))"
],
"outputs": [],
"execution_count": 97,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"Each vector in the sets has the dimension of the lexicon:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"assert(X_train.shape[1] == len(lexicon))"
],
"outputs": [],
"execution_count": 98,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "markdown",
"source": [
"From this point on you recognize that AI is as much an art as it is a science. The way you assemble the network is where experience and insight show. For playing purposes, anything works."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"model = keras.Sequential([\n",
"    keras.layers.Dense(13, activation='relu', input_dim=len(lexicon)),\n",
"    keras.layers.Dense(10, activation='relu'),\n",
"    keras.layers.Dense(2, activation='sigmoid')\n",
"])\n",
"model.summary()"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Model: \"sequential_15\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"dense_35 (Dense) (None, 13) 30121 \n",
"_________________________________________________________________\n",
"dense_36 (Dense) (None, 10) 140 \n",
"_________________________________________________________________\n",
"dense_37 (Dense) (None, 2) 22 \n",
"=================================================================\n",
"Total params: 30,283\n",
"Trainable params: 30,283\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"execution_count": 102,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "code",
"source": [
"model.compile(optimizer='adam', \n",
" loss='binary_crossentropy',\n",
" metrics=['accuracy'])"
],
"outputs": [],
"execution_count": 103,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "code",
"source": [
"history = model.fit(X_train, y_train, epochs=5)"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Train on 1800 samples\n",
"Epoch 1/5\n",
"1800/1800 [==============================] - 0s 65us/sample - loss: 0.0374 - accuracy: 0.9967\n",
"Epoch 2/5\n",
"1800/1800 [==============================] - 0s 55us/sample - loss: 0.0222 - accuracy: 0.9994\n",
"Epoch 3/5\n",
"1800/1800 [==============================] - 0s 51us/sample - loss: 0.0147 - accuracy: 1.0000\n",
"Epoch 4/5\n",
"1800/1800 [==============================] - 0s 50us/sample - loss: 0.0102 - accuracy: 1.0000\n",
"Epoch 5/5\n",
"1800/1800 [==============================] - 0s 50us/sample - loss: 0.0076 - accuracy: 1.0000\n"
]
}
],
"execution_count": 106,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "code",
"source": [
"results = model.evaluate(X_test, y_test, verbose=0)\n",
"print(results)"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[0.604368144646287, 0.8225]\n"
]
}
],
"execution_count": 117,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "code",
"source": [
"history_dict = history.history\n",
"\n",
"acc = history_dict['accuracy']\n",
"loss = history_dict['loss']\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"# Only the training loss is recorded, since fit() was called without validation data.\n",
"plt.plot(epochs, loss, 'bo', label='Training loss')\n",
"plt.title('Training loss')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Loss')\n",
"plt.legend()\n",
"\n",
"plt.show()"
],
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
],
"image/png": [
"\n"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"execution_count": 112,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
},
{
"cell_type": "code",
"source": [],
"outputs": [],
"execution_count": null,
"metadata": {
"collapsed": false,
"outputHidden": false,
"inputHidden": false
}
}
],
"metadata": {
"kernel_info": {
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.7.2",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"language": "python",
"display_name": "Python 3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
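
The lexicon and count-vector steps in the notebook above can be tried out without NLTK or TensorFlow. The following is only a minimal sketch under stated assumptions: `build_lexicon` and `vectorize` are hypothetical names that do not appear in the notebook, plain `str.split` stands in for `word_tokenize` plus lemmatization, and the thresholds are toy values chosen for a three-review corpus rather than the notebook's 50 and 1000. It also swaps the repeated `lexicon.index(...)` lookup for a dictionary, which should behave the same for a lexicon without duplicates.

```python
from collections import Counter


def build_lexicon(reviews, low, high):
    """Keep words whose total count lies strictly between low and high."""
    # Assumption: whitespace splitting replaces NLTK tokenization/lemmatization.
    counts = Counter(word for review in reviews for word in review.lower().split())
    return [word for word, count in counts.items() if low < count < high]


def vectorize(review, index):
    """Return a count vector with one slot per lexicon word."""
    vec = [0] * len(index)
    for word in review.lower().split():
        slot = index.get(word)  # None for words outside the lexicon
        if slot is not None:
            vec[slot] += 1
    return vec


# Toy corpus, purely illustrative.
reviews = ["a great great film", "a dull film", "great cast but a dull plot"]
lexicon = build_lexicon(reviews, low=1, high=4)

# Map each word to its position once, instead of scanning the list per word.
index = {word: i for i, word in enumerate(lexicon)}
vectors = [vectorize(review, index) for review in reviews]

print(lexicon)
print(vectors)
```

Words seen only once ('cast', 'but', 'plot') fall below the lower threshold here, so they get no slot in the vectors, which mirrors how the notebook discards rare words.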