@vgpena
Last active November 17, 2020 16:39

Text Classification with Keras and TensorFlow

Blog post is here

If you want an intro to neural nets and the "long version" of what this is and what it does, read my blog post.

Data can be downloaded here. Many thanks to ThinkNook for putting such a great resource out there.

Installation

You need Python 2 to run this project; I also recommend virtualenv and IPython.

Run pip install -r requirements.txt to install everything listed in requirements.txt.
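For example (a rough sketch; assumes virtualenv is installed and the exact commands depend on your shell):

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
pip install h5py   # needed by model.save_weights(); see the comments below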

Usage

You need to train your net once, and then you can load those settings and use it whenever you want without having to retrain it.

Training

Change line 10 of makeModel.py to point to wherever you downloaded your data as a CSV.
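For instance, if you saved the ThinkNook CSV somewhere on disk, line 10 would end up looking something like this (the path below is just a placeholder):

training = np.genfromtxt('/home/you/Sentiment Analysis Dataset.csv', delimiter=',', skip_header=1, usecols=(1, 3), dtype=None)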

Then run python makeModel.py (or, if you're in IPython, run makeModel.py). Then go do something else for the 40-60 minutes it takes to train your neural net.

When training finishes, three new files should have been created: dictionary.json, model.json, and model.h5. You will need these to use the net.

Classification

To use the net to classify data, run python loadModel.py and type a sentence into the console when prompted. Hitting Enter without typing anything will quit the program.
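A session might look like this (the output values are illustrative, not real results):

$ python loadModel.py
Input a sentence to be evaluated, or Enter to quit: i love this
positive sentiment; 92.310715% confidence
Input a sentence to be evaluated, or Enter to quit:
$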

loadModel.py

import json
import numpy as np
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json

# we're still going to use a Tokenizer here, but we don't need to fit it
tokenizer = Tokenizer(num_words=3000)

# for human-friendly printing
labels = ['negative', 'positive']

# read in our saved dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

# this utility makes sure that all the words in your input
# are registered in the dictionary
# before trying to turn them into a matrix.
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
        else:
            print("'%s' not in training corpus; ignoring." % (word))
    return wordIndices

# read in your saved model structure
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
# and create a model from that
model = model_from_json(loaded_model_json)
# and weight your nodes with your saved values
model.load_weights('model.h5')

# okay here's the interactive part
while 1:
    evalSentence = raw_input('Input a sentence to be evaluated, or Enter to quit: ')
    if len(evalSentence) == 0:
        break
    # format your input for the neural net
    testArr = convert_text_to_index_array(evalSentence)
    input = tokenizer.sequences_to_matrix([testArr], mode='binary')
    # predict which bucket your input belongs in
    pred = model.predict(input)
    # and print it for the humons
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))
makeModel.py

import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
import numpy as np

# extract data from a csv
# notice the cool options to skip lines at the beginning
# and to only take data from certain columns
training = np.genfromtxt('/path/to/your/data.csv', delimiter=',', skip_header=1, usecols=(1, 3), dtype=None)

# create our training data from the tweets
train_x = [x[1] for x in training]
# index all the sentiment labels
train_y = np.asarray([x[0] for x in training])

# only work with the 3000 most popular words found in our dataset
max_words = 3000

# create a new Tokenizer
tokenizer = Tokenizer(num_words=max_words)
# feed our tweets to the Tokenizer
tokenizer.fit_on_texts(train_x)

# Tokenizers come with a convenient list of words and IDs
dictionary = tokenizer.word_index
# Let's save this out so we can use it later
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

def convert_text_to_index_array(text):
    # `text_to_word_sequence` tokenizes the text the same way the Tokenizer
    # did when it was fit: lowercasing, stripping punctuation, and splitting
    # it into individual words.
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

allWordIndices = []
# for each tweet, change each token to its ID in the Tokenizer's word_index
for text in train_x:
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

# now we have a list of all tweets converted to index arrays.
# cast as an array for future usage.
allWordIndices = np.asarray(allWordIndices)

# create one-hot matrices out of the indexed tweets
train_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
# treat the labels as categories
train_y = keras.utils.to_categorical(train_y, 2)

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_x, train_y,
          batch_size=32,
          epochs=5,
          verbose=1,
          validation_split=0.1,
          shuffle=True)

model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)

model.save_weights('model.h5')

print('saved model!')
requirements.txt

backports.weakref==1.0rc1
bleach==1.5.0
funcsigs==1.0.2
html5lib==0.9999999
Keras==2.0.6
Markdown==2.2.0
mock==2.0.0
numpy==1.13.1
pbr==3.1.1
protobuf==3.3.0
PyYAML==3.12
scipy==0.19.1
six==1.10.0
tensorflow==1.2.0
Theano==0.9.0
Werkzeug==0.12.2
@timpal0l

Is there any quick fix to make it work with Python 3?

  File "/usr/local/lib/python3.5/dist-packages/keras/preprocessing/text.py", line 47, in text_to_word_sequence
    text = text.translate(translate_map)
TypeError: a bytes-like object is required, not 'dict'

@kvlinden

@timpal0l Python 3 defaults to Unicode, and it appears that the Keras tokenizer.fit_on_texts() checks for strings. I got makeModel.py to run by explicitly converting the tweets to strings:
train_x = [str(x[1]) for x in training]
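In context, that workaround looks something like this (a sketch against the genfromtxt call in makeModel.py; the path is a placeholder):

# Python 3: genfromtxt with dtype=None yields bytes, so convert the tweet
# column to str before fitting the Tokenizer.
training = np.genfromtxt('/path/to/your/data.csv', delimiter=',',
                         skip_header=1, usecols=(1, 3), dtype=None)
train_x = [str(x[1]) for x in training]
# x[1].decode('utf-8') would avoid the b'...' prefix, but str() is enough
# to get past the TypeError.
train_y = np.asarray([x[0] for x in training])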

@wroales

wroales commented Feb 14, 2018

Thanks kvlinden! I got the model working.

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Pentium(R) Dual-Core CPU E5300 @ 2.60GHz

Linux version 4.4.0-112-generic (buildd@lgw01-amd64-010) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5) ) #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018

@malgamves

Hey, I'm getting a MemoryError on line 48. Anybody else getting that?

@cfowlerdev

@malgamves,
I was seeing the same error on my Ubuntu machine; however, it works perfectly on my iMac. Both have 8GB of RAM.

It turns out this example uses a whole lot of RAM. If you're on a Mac it should be OK, because macOS dynamically allocates virtual memory from your disk when you run out of physical RAM. However, it would also fail on a Mac if you're low on disk space or have virtual memory disabled.

On my Ubuntu it was failing due to the swapfile (Ubuntu's virtual memory) being only 2GB. I had to increase it all the way to 16GB before this finally worked. On other Linux flavours you'd have to increase the size of your swap partition, which is a lot more work.

Haven't tried it on Windows, but afaik you would need to increase the size of your pagefile in the Advanced section of your System Settings.

@marius-tu

marius-tu commented May 7, 2018

@malgamves @cfowlerdev
I'm getting this error too, but increasing my swap space to 16GB and more did not help. I'm still trying to figure out how to get this example working on Ubuntu 16.04.
Interestingly, Python doesn't seem to use that much memory: watching it in htop, a maximum of 2GB of RAM and ~600MB of swap space is used before the application crashes, even though I have 8GB of RAM and 64-bit Ubuntu.

Have you managed to get it running?

Edit: OK, got it working, but I had to increase the swapfile size to around 30GB.
Edit2: You also have to add h5py to the requirements; otherwise you have to train the network all over again after running makeModel.py.
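One way around the MemoryError (not something the gist itself does) is to skip building the full one-hot matrix up front and instead convert tweets to matrices one batch at a time with fit_generator. A rough sketch, assuming tokenizer, allWordIndices, and the categorical train_y from makeModel.py are already in scope:

import numpy as np

def batch_generator(word_indices, labels, batch_size=32):
    # Keras expects generators to loop forever
    while True:
        for start in range(0, len(word_indices), batch_size):
            batch_seqs = word_indices[start:start + batch_size]
            # build the binary bag-of-words matrix for just this batch
            batch_x = tokenizer.sequences_to_matrix(list(batch_seqs), mode='binary')
            batch_y = labels[start:start + batch_size]
            yield batch_x, batch_y

batch_size = 32
steps = int(np.ceil(len(allWordIndices) / float(batch_size)))
model.fit_generator(batch_generator(allWordIndices, train_y, batch_size),
                    steps_per_epoch=steps, epochs=5, verbose=1)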

@innajiyaharifins

I get an error:

ValueError                                Traceback (most recent call last)
in ()
     48 train_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
     49 # treat the labels as categories
---> 50 train_y = keras.utils.to_categorical(train_y, 3)
     51
     52 from keras.models import Sequential

~\Anaconda3\lib\site-packages\keras\utils\np_utils.py in to_categorical(y, num_classes)
     18         A binary matrix representation of the input.
     19     """
---> 20     y = np.array(y, dtype='int').ravel()
     21     if not num_classes:
     22         num_classes = np.max(y) + 1

Your project has only 2 labels, positive and negative; I have 3 labels: advice, information, and prohibition.
How do I solve this problem?
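The ValueError happens because to_categorical needs integer class IDs, not label strings. A sketch of the changes for 3 classes (label names taken from the comment above):

# in makeModel.py: map label strings to integer IDs before one-hot encoding
label_names = ['advice', 'information', 'prohibition']
label_to_id = {name: i for i, name in enumerate(label_names)}
train_y = np.asarray([label_to_id[y] for y in train_y])
train_y = keras.utils.to_categorical(train_y, 3)

# ...and the output layer needs 3 units instead of 2
model.add(Dense(3, activation='softmax'))

# in loadModel.py, the human-friendly labels list changes to match
labels = ['advice', 'information', 'prohibition']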

@zbokaee

zbokaee commented Jul 17, 2018

Hi vgpena,
I tested your code and it was interesting, but I have a question about one-hot vectors vs. gensim word2vec (W2V).
Is it possible to use W2V with your code? I mean, how can I use my own gensim W2V, save my trained model, and after that use my neural net, like you have done in your code?
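That would mean replacing the bag-of-words input with an Embedding layer seeded from the gensim vectors, which is a different architecture from the one in this gist. A very rough sketch, assuming a gensim Word2Vec model saved at 'my_w2v.model' (placeholder path) and the dictionary/allWordIndices from makeModel.py:

from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras.preprocessing.sequence import pad_sequences
import numpy as np

w2v = Word2Vec.load('my_w2v.model')   # your trained gensim model
embedding_dim = w2v.vector_size
max_len = 40                          # pad/truncate tweets to a fixed length

# embedding matrix: row i holds the vector for the word with index i
vocab_size = len(dictionary) + 1      # word_index IDs start at 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in dictionary.items():
    if word in w2v.wv:
        embedding_matrix[i] = w2v.wv[word]

# inputs are now padded sequences of word IDs, not binary matrices
train_x = pad_sequences(allWordIndices, maxlen=max_len)

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
                    input_length=max_len, trainable=False))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# training and saving (model.to_json() / model.save_weights()) work the same as before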

@SahasG

SahasG commented Aug 5, 2018

Hi vgpena,

When running 'pip install -r requirements.txt' I am getting multiple errors where it is unable to build the wheel for scipy. Any idea on how to fix it?

@giriannamalai

This is good. But how do you export it as a SavedModel to deploy on GCP? Can you share the input and output params of the saved model?
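A rough sketch of one way to do that with TF 1.x's SavedModelBuilder (not part of the gist, and the exact API depends on your TensorFlow version); assumes model is the trained Keras model from makeModel.py:

import tensorflow as tf
from keras import backend as K

export_dir = './export/1'   # versioned subfolder, as GCP serving expects
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)

# input is the 3000-wide bag-of-words vector, output the 2-way softmax
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'bow_input': model.input},
    outputs={'sentiment': model.output})

builder.add_meta_graph_and_variables(
    K.get_session(),
    tags=[tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
builder.save()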

@ahmadalli

On Windows, you'll need encoding='utf-8' when loading the data: training = np.genfromtxt('Sentiment Analysis Dataset.csv', delimiter=',', skip_header=1, usecols=(1, 3), dtype=None, encoding='utf-8')

@ahmadalli

Is there any reason you used the adam optimizer and not SGD?
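For comparison, swapping in SGD is a small change to model.compile (a sketch; the learning rate and momentum here are common defaults, not tuned for this dataset):

from keras.optimizers import SGD

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01, momentum=0.9),
              metrics=['accuracy'])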
