
Hashtag generator

Introduction

The general concept behind this project is a hashtag generator. I used data taken from Twitter to train a Recurrent Neural Network using the OpenNMT-py library. The idea is that if you feed the Neural Network a tweet or sentence, it can suggest appropriate hashtags. I chose the Recurrent Neural Network sequence-to-sequence model because it has the ability to generate novel hashtags, as opposed to a classification strategy that would have been limited to a fixed set of hashtags/classes. Ultimately, I decided to do this project to explore a novel and potentially entertaining use of a sequence-to-sequence generator. In the process, I learned several things, including how to consume and clean data from Twitter, how the training process for Neural Networks works, and how to use the OpenNMT-py library.

Training and Accuracy

The model took just over 36 hours to train on a training set of about 2,900 examples. I definitely have a greater appreciation for how much computing power is required to train Neural Networks. I had a test data set of about 700 examples which I used to evaluate the accuracy of the Neural Network; it came out to about 9.5% accurate. While this accuracy is pretty low, I'm ultimately pretty amazed that it was able to get anything right given the diversity of the language used across the data sets. Even in examples where the Neural Network was not able to accurately guess the expected hashtags, it was often able to give a suggestion that was appropriate given the input. Another extremely interesting aspect of the Neural Network is the outputs it gives that are inappropriate for the inputs. In this respect I think I unintentionally created a joke generator.

Technical Breakdown

Getting and Cleaning the Data

I am much more comfortable with Ruby, so I ended up using that to get data directly from Twitter. I used the twitter Ruby gem to make getting the data from a stream easier. I outline the exact scripts used to obtain and clean the data in a GitHub gist here. Basically it consisted of four steps.

1. Get the data from a stream
2. Clean the data
3. Separate the data into training and test sets
4. Write the data to their respective files

I made some decisions with respect to cleaning the data and writing it to files that I might change in the future. Some tweets had multiple hashtags associated with them. I decided to enforce a 1:1 relationship between tweets and hashtags, so there were duplicate tweets in the input file. In the future I would like to see what the results might look like if I mapped a single tweet to a list of hashtags. I also decided to strip hashtags that appeared inside a sentence from the inputs; I didn't want the Neural Network to learn from examples where the output was in fact found within the input. I might also consider cleaning out URLs that are included in the tweet text in future iterations, since I'm not sure including those really helped the Neural Network understand the relationship between the inputs and outputs.
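For illustration, here is a rough sketch of that cleaning-and-splitting logic in Python. The actual scripts were written in Ruby (linked above), so treat this as a hypothetical equivalent; the raw_tweets variable and file names here are made up for the example.

import random
import re

# Toy input: (tweet text, [hashtags]) pairs; the real data came from the Twitter stream.
raw_tweets = [
    ("ROOTING FOR PHILIPPINES #missuniverse", ["missuniverse"]),
    ("violin solo Hong Kong people are so talented", ["HongKongProtests"]),
]

def clean_tweet(text):
    text = re.sub(r'#\w+', '', text)          # strip inline hashtags so the target never appears in the input
    text = re.sub(r'https?://\S+', '', text)  # strip URLs (a possible future improvement)
    return re.sub(r'\s+', ' ', text).strip()

pairs = []
for tweet, hashtags in raw_tweets:
    source = clean_tweet(tweet)
    for tag in hashtags:                      # 1:1 mapping duplicates the tweet once per hashtag
        pairs.append((source, tag))

random.shuffle(pairs)
split = int(len(pairs) * 0.8)                 # roughly the 2,900 train / 700 test proportion

with open('src-train.txt', 'w') as src, open('tgt-train.txt', 'w') as tgt:
    for source, tag in pairs[:split]:
        src.write(source + '\n')
        tgt.write(tag + '\n')

with open('src-test.txt', 'w') as src, open('expected-output-test.txt', 'w') as tgt:
    for source, tag in pairs[split:]:
        src.write(source + '\n')
        tgt.write(tag + '\n')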

Setting up OpenNMT-py

I only had Python 2.7 installed on my local machine, so my first step was to install a version of Python 3. I opted to use a version manager called pyenv and installed 3.4.2. Then I essentially followed the instructions on the OpenNMT-py GitHub page to install the library: I used the cloning instructions and ran the setup script. This gave me everything I needed to run the preprocessing, training, and prediction scripts. I found it to be very straightforward.
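For reference, the setup boiled down to roughly the following commands (this is from memory, so treat it as a sketch; the authoritative steps are in the OpenNMT-py README):

pyenv install 3.4.2
pyenv local 3.4.2
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python setup.py install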

Training and Testing the Neural Network

I moved all the data I had written to the relevant files into a sub-directory within the OpenNMT-py repository I had cloned. I then ran the preprocessing script using the training and validation files I wrote from the Twitter data.

onmt_preprocess -train_src data/twitter/src-train.txt -train_tgt data/twitter/tgt-train.txt -valid_src data/twitter/src-val.txt -valid_tgt data/twitter/tgt-val.txt -save_data data/twitter/demo

This ran pretty quickly. I then ran the script to train the model.

onmt_train -data data/twitter/demo -save_model demo-model

There were about 100,000 training steps to complete. I realized it was going to take quite some time when it was only just over 10% finished at about the four-hour mark. Ultimately it ended up taking just over 36 hours to train. There were only about 2,900 training examples (including the validation subset), so I was a bit surprised at how long it took. If I were to do this again I would probably create a Linux partition on my gaming computer so I could leverage my GPU.
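(My understanding from the OpenNMT-py documentation is that training on a GPU would only require adding the world-size and GPU-rank flags, something like the following, though I haven't verified this myself.)

onmt_train -data data/twitter/demo -save_model demo-model -world_size 1 -gpu_ranks 0

Once the model was done training, I was able to run the test data against it to see how good it turned out to be.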

onmt_translate -model demo-model_step_100000.pt -src data/twitter/src-test.txt -output pred.txt -replace_unk -verbose

Testing the Accuracy

I ran a script, provided at the bottom of this file, to see how accurate the Neural Network was. It came out to about 9.5% accurate, which is pretty low. I'm not entirely sure testing accuracy in this way is representative of how the Neural Network really performs. After digging into the outputs, I found several examples where the Neural Network didn't get the 'correct' hashtag but was able to generate a hashtag that was appropriate given the input text. I would argue that these examples, while difficult to quantify, demonstrate the creative power a tool like this can exercise. There were also several examples that were not appropriate given the input text and were absolutely hilarious, which I think helps to show, on a broad level, the trends seen across this random sample of internet data.

Conclusion

I think given the goals of this project it was an overall success. Even though the Neural Network was not very accurate with respect to the measurable outcomes, there is a quality of success that is not quantifiable. I am pretty astounded that it was able to even get some predictions correct given the novelty of the source data. I think this project is a good demonstration of the power of these tools even in a situation that is entirely experimental.

Best Examples

Accurate Predictions

🇵🇭ROOTING FOR PHILIPPINES -> #missuniverse
violin solo Hong Kong people are so talented 😍 -> #HongKongProtests
Me everytime lucid dreams comes on now💔😭😭 💔 -> #ripjuicewrld

Inaccurate but Appropriate

Former Republican House Members know the oaths they took. Why don’t today’s Republicans? -> #GOPTraitors

Inaccurate and Inappropriate (these are my favorite)

13 Creative Stunts People Used to Land Their Dream Jobs https://t.co/0nsovpSS57 -> #asshat
Pig in a blanket :pig_nose: -> #hotwife
When you’re serving salad and breadsticks but you also hungry AF -> #WarRoomImeachment

Script for testing accuracy

# Load the model's predictions, the expected hashtags, and the source tweets.
# splitlines() avoids a spurious empty entry from the trailing newline.
test_lines = open('pred.txt').read().splitlines()
actual_output = open('expected-output-test.txt').read().splitlines()
test_tweets = open('src-test.txt').read().splitlines()  # loaded for eyeballing examples; not used below


def evaluate():
    # Percentage of predictions that exactly match the expected hashtag.
    total = len(actual_output)
    accurate = 0
    for i in range(total):
        if actual_output[i] == test_lines[i]:
            accurate += 1
    return (accurate / total) * 100


evaluate()

9.541697971450038