Using a Naive Bayes Classifier with Python and DeepbeliefSDK to Analyze Instagram Images
{
"metadata": {
"name": "",
"signature": "sha256:872250c64520d898242c5fb7694e2767e4eef9dbcd46e03ed1fe970e17e62da7"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" This is a step-by-step guide for using an image-analysis SDK in combination with a \"bag of words\" approach to training a naive bayes classifier, for use with Instagram images and captions. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###First you have to obtain instagram posts from a desired user..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#if you dont't have the python-instagram library, run 'pip install python-instagram'.\n",
"\n",
"client_id = #Your Client ID goes here\n",
"client_secret = #Your Client Secret goes here\n",
"access_token = #Your Instagram access token goes here, it can be obtained at http://www.pinceladasdaweb.com.br/instagram/access-token/\n",
"apiauth = InstagramAPI(access_token=access_token)\n",
"\n",
"from instagram.client import InstagramAPI\n",
"api = InstagramAPI(client_id=client_id, client_secret=client_secret)"
],
"language": "python",
"metadata": {},
"outputs": []
},
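{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only know a username rather than a user id, here is a minimal, hedged sketch using python-instagram's user_search endpoint (the username below is a placeholder):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#A sketch, assuming user_search is available to your client; 'some_username' is a placeholder.\n",
"results = api.user_search(q='some_username')\n",
"for u in results:\n",
"    print u.id, u.username #pick the id that matches the account you want"
],
"language": "python",
"metadata": {},
"outputs": []
},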
{
"cell_type": "code",
"collapsed": false,
"input": [
"user_id = #some instagram user id\n",
"user_media = []\n",
"tmpmedia = api.user_recent_media(user_id=user_id, count = 33)\n",
"tmpmedia[0]\n",
"for m in tmpmedia[0]:\n",
" user_media.append(m)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from urlparse import urlparse\n",
"parsed = urlparse(tmpmedia[1])\n",
"params = {a:b for a,b in [x.split('=') for x in parsed.query.split('&')]}\n",
"int(params['max_id'].split('_')[0])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#you may run into rate limiting here, if you are trying to acquire a large dataset. If so, perhaps lower the range argument. \n",
"\n",
"for i in range(100):\n",
" max_id = int(params['max_id'].split('_')[0])\n",
" tmpmedia = api.user_recent_media(user_id=user_id, max_id=max_id - 1, count=33)\n",
" for m in tmpmedia[0]:\n",
" user_media.append(m)\n",
" parsed = urlparse(tmpmedia[1])\n",
" params = {a:b for a,b in [x.split('=') for x in parsed.query.split('&')]}"
],
"language": "python",
"metadata": {},
"outputs": []
},
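{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overlapping pages can occasionally produce duplicate posts, so a quick sanity check and de-duplication by media id can't hurt:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print len(user_media), 'items fetched'\n",
"\n",
"#Drop any duplicate posts that overlapping pages may have produced.\n",
"seen = set()\n",
"deduped = []\n",
"for m in user_media:\n",
"    if m.id not in seen:\n",
"        seen.add(m.id)\n",
"        deduped.append(m)\n",
"user_media = deduped\n",
"print len(user_media), 'unique items'"
],
"language": "python",
"metadata": {},
"outputs": []
},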
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###So, now you have your instagram media in a list called 'user_media'. Let's save every photograph from all the posts so we can run image analysis on them..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import urllib2\n",
"imagedir = #PATH to a the directory where you want to save the .jpg images. ex. 'home/user/igimages/' \n",
"\n",
"for m in user_media:\n",
" f = urllib2.urlopen(m.images['standard_resolution'].url)\n",
" data = f.read()\n",
" with open(imagedir + m.id + '.jpg', \"wb\") as code:\n",
" code.write(data)"
],
"language": "python",
"metadata": {},
"outputs": []
},
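{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check that the downloads landed where you expect (this assumes the imagedir from the cell above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"\n",
"jpgs = [f for f in os.listdir(imagedir) if f.endswith('.jpg')]\n",
"print len(jpgs), 'images saved, e.g.', jpgs[:3]"
],
"language": "python",
"metadata": {},
"outputs": []
},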
{
"cell_type": "code",
"collapsed": false,
"input": [
"#The above for-loop will save all of the pictures with the media id as the name. ex. '866246441730669483_11404563.jpg'\n",
"# If you would like to reference the actual media item associated with a media-id, you can use this helper function: \n",
"\n",
"def getmediaitem(curlist, media_id):\n",
" for m in curlist:\n",
" if m.id == media_id:\n",
" return m\n",
" else:\n",
" continue"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Pass the list you want to look through, in this case'user_media', and whichever media-id you want to examine.\n",
"# The function will return that item.\n",
"mediaobject = getmediaitem(user_media, '866246441730669483_11404563')\n",
"mediaobject"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Let's use the DeepBeliefSDK to analyze all of the instagram photographs, and output the analysis into text files..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Navigate to https://github.com/jetpacapp/DeepBeliefSDK and clone the repository. For the sake of this example code, I used the SimpleLinux examples to run analysis against a hard-coded repository. You can configure this very powerful library yourself, however, if you so desire. Once it is installed, you can run the DeepBelief analysis on your Instagram Image files.\n",
"\n",
"NOTE: Because this is a simple example, the reference to the jetpac.ntwk file is hard-coded, meaning you MUST run this type of command from the /DeepBeliefSDK/examples/SimpleLinux/ directory. For the shell script below to work, move all your instagram images into the /DeepBeliefSDK/examples/SimpleLinux/ directory. \n",
"\n",
"Here is a little shell script that I used to crank out the text files:\n",
"\n",
"```\n",
"#!/usr/bin/bash\n",
"\n",
"for file in *.jpg\n",
"do \n",
"\t./deepbelief $file | sort -k2nr | head -25 > $file.txt \n",
"done\n",
"```\n",
"\n",
"This makes text files with the name of the photograph as the first part. (ex. 866246441730669483_11404563.jpg.txt) "
]
},
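{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather drive DeepBelief from Python instead of the shell script, here is a minimal sketch under the same assumptions: you run it from the SimpleLinux directory with the built deepbelief binary and your .jpg files alongside it, and each output line carries its score in the second tab-separated field."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#A sketch, assuming the cwd is /DeepBeliefSDK/examples/SimpleLinux/ and the output format described above.\n",
"import glob\n",
"import subprocess\n",
"\n",
"for jpg in glob.glob('*.jpg'):\n",
"    out = subprocess.check_output(['./deepbelief', jpg])\n",
"    #keep the 25 highest-scoring lines, mirroring 'sort -k2nr | head -25'\n",
"    lines = sorted(out.splitlines(), key=lambda l: float(l.split('\\t')[1]), reverse=True)[:25]\n",
"    with open(jpg + '.txt', 'w') as f:\n",
"        f.write('\\n'.join(lines))"
],
"language": "python",
"metadata": {},
"outputs": []
},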
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Now that we have a bags of words describing images, let's figure out how to measure it...."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Start by getting the like_counts of all posts.\n",
"like_counts = [m.like_count for m in user_media]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's plot the data into a boxplot to break down the five areas we want to categorize:\n",
"%pylab inline\n",
"boxplot(like_counts,0,'')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#to give you the exact numbers in the boxplot\n",
"median = np.median(like_counts)\n",
"upper_quartile = np.percentile(like_counts, 75)\n",
"lower_quartile = np.percentile(like_counts, 25)\n",
"iqr = upper_quartile - lower_quartile\n",
"upper_whisker = upper_quartile + 1.5*iqr\n",
"\n",
"print lower_quartile, median, upper_quartile, upper_whisker, iqr"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Now our task is to create a feature set using the wordbags we have from the images, the wordbags we will get from the captions, and the labels of what like-count quartile the instagram media falls into... "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Here is a sorting function that will help sort our instagram posts into quartile like_count bins. When calling this function,\n",
"#pass in the values you got from the cell directly above.\n",
"\n",
"#Here you could write your own function if you wanted to, say, divide the posts into only two groups, counts above the median,\n",
"#or counts below the median. The amount of bins you create is up to you.\n",
"\n",
"def sortintoquartiles(num,lq,median,uq,uw):\n",
" if num>= 0 and num<lq:\n",
" return 'lw'\n",
" if num>=lq and num<median:\n",
" return 'lq'\n",
" if num>=median and num<uq:\n",
" return 'uq'\n",
" if num>=uq and num<uw:\n",
" return 'uw'\n",
" if num>=uw:\n",
" return 'ol'\n",
" \n",
"#lw: Lower Whisker\n",
"#lq: Lower Quartile\n",
"#median: Median\n",
"#up: Upper Quartile\n",
"#uw: Upper Whisker\n",
"#ol: Outlier (more likes than the upper whisker limit)"
],
"language": "python",
"metadata": {},
"outputs": []
},
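{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check on how the posts distribute across the bins (the four numbers are placeholders; substitute the quartile values you computed above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"#Placeholder quartile values; use your own lower_quartile, median, upper_quartile, upper_whisker.\n",
"labels = [sortintoquartiles(c, 2575, 3941, 5330, 9462) for c in like_counts]\n",
"print Counter(labels)"
],
"language": "python",
"metadata": {},
"outputs": []
},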
{
"cell_type": "code",
"collapsed": false,
"input": [
"#If you don't have nltk, visit this link http://www.nltk.org/install.html and follow instructions to install it. \n",
"#Also, make sure you install at least the stopwords corpus, check out this link http://www.nltk.org/data.html to do so.\n",
"\n",
"# This is a way of compiling text into a \"wordbag\" with each word set to \"TRUE\". I gave you the option of subtracting\n",
"#usermentions or hashtags from the text. Uncomment them below to subtract them from the wordbags.\n",
"\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import re\n",
"\n",
"stub = re.compile('[^A-Za-z]')\n",
"\n",
"\n",
"def bag_of_non_stopwords(text):\n",
" words = [stub.sub('', w).lower() for w in text.split()]\n",
" usermentions = re.findall(\"(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z_]+[A-Za-z0-9_]+)\", text, re.I)\n",
" tagmentions = re.findall(\"(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z_]+[A-Za-z0-9_]+)\", text, re.I)\n",
" \n",
" finalwords = set(words) - set(stopwords.words('english')) #- set(usermentions) #- set(tagmentions)\n",
" \n",
" featureset = dict([(word, True) for word in finalwords if not word.startswith('http') and len(word)>2])\n",
" \n",
" return featureset\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
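{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, on a hypothetical caption (the handle and hashtag below are made up):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#A made-up caption, just to show the shape of the output featureset.\n",
"bag_of_non_stopwords('Sunset over the bridge tonight! #nofilter @somefriend')"
],
"language": "python",
"metadata": {},
"outputs": []
},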
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Here is a function to create the actual formatted wordbags for the text files we created earlier. They should be words describing the\n",
"#Instagram photos with the same media id.\n",
"\n",
"\n",
"def extractphotofeatures(mediaid):\n",
" textpath = #Change this to the path where your .txt files are\n",
" file = open(textpath + mediaid + '.jpg.txt', 'r')\n",
" stub = re.compile('[^A-Za-z]')\n",
" listy = file.readlines()[:25]\n",
" newdict = [] \n",
"\n",
" for text in listy:\n",
" newdict.append([stub.sub('', w).lower() for w in text.split('\\t')][2])\n",
" \n",
" featureset = dict([(word, True) for word in newdict if not word.startswith('http') and len(word)>2])\n",
" \n",
" return featureset\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Here we will finally create the list of wordbag features associated with the images and captions from the instagram posts,\n",
"#and also we will create an index to check our results. The wordsbags will each be classified according to the quartile\n",
"#of like_counts for that post. \n",
"\n",
"likefeats = []\n",
"index = []\n",
"\n",
"for m in user_media:\n",
" if hasattr(m.caption, 'text'):\n",
" temptup = (dict(bag_of_non_stopwords(m.caption.text).items() + extractphotofeatures(m.id).items()),\n",
" sortintoquartiles(m.like_count,#Use your own numbers here: 2575,3941,5330,9462)) \n",
" likefeats.append(temptup)\n",
" index.append([m.id, m.link, m.caption.text, m.like_count, temptup])\n",
" else:\n",
" temptup = (dict(bag_of_non_stopwords('nocaption.').items() + extractphotofeatures(m.id).items()), sortintoquartiles(m.like_count,2575,3941,5330,9462))\n",
" likefeats.append(temptup)\n",
" index.append([m.id, m.link, 'nocaption', m.like_count, temptup])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#test to make sure these matchup\n",
"likefeats[5]\n",
"index[5]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Now that we have feature sets, let's train a Naive Bayes Classifier..."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import nltk.classify.util\n",
"from nltk.classify import NaiveBayesClassifier\n",
" \n",
"cutoff = len(likefeats)*3/4\n",
"\n",
"trainfeats = likefeats[:cutoff] \n",
"testfeats = likefeats[cutoff:]\n",
"print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))\n",
" \n",
"classifier = NaiveBayesClassifier.train(trainfeats)\n",
"print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)\n",
"classifier.show_most_informative_features()"
],
"language": "python",
"metadata": {},
"outputs": []
},
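{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: the split above is chronological, since likefeats follows the order the posts were fetched. A hedged variant that shuffles a copy before splitting usually gives a fairer test set (shuffling a copy keeps the positional pairing between likefeats and index intact):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import random\n",
"\n",
"#Shuffle a copy so likefeats itself keeps its pairing with index.\n",
"shuffled = list(likefeats)\n",
"random.shuffle(shuffled)\n",
"\n",
"cutoff = len(shuffled)*3/4\n",
"trainfeats = shuffled[:cutoff]\n",
"testfeats = shuffled[cutoff:]\n",
"\n",
"classifier = NaiveBayesClassifier.train(trainfeats)\n",
"print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)"
],
"language": "python",
"metadata": {},
"outputs": []
},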
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###And finally, you can now predict (with varying accuracy) the like-count quartile that a new Instagram image+caption might land in...In other words, we can make some predictions for a given user which of his future posts might generate more or less likes. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#By taking a new Instagram post from the same user, running the Deepbelief on the image file and adding the caption words, simply\n",
"#run this function now to make a guess at the possible popularity of the post.\n",
"\n",
"classifier.classify(#featureset representing a single post. ex: likefeats[0])"
],
"language": "python",
"metadata": {},
"outputs": []
},
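{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK's Naive Bayes classifier can also return a probability distribution over the bins, which is useful when you want a confidence estimate alongside the label:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#prob_classify returns a probability distribution over all the quartile labels.\n",
"dist = classifier.prob_classify(likefeats[0][0])\n",
"print 'most likely bin:', dist.max()\n",
"for label in dist.samples():\n",
"    print label, dist.prob(label)"
],
"language": "python",
"metadata": {},
"outputs": []
},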
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thank you to Gilad Lotan, and the others whose code I \"Frankensteined\" to create this approach."
]
}
],
"metadata": {}
}
]
}