pkipsy/episodic-lists.ipynb

## episodic-lists.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyzing Episodic Word Lists\n",
    "## January 2017\n",
    "The following investigation was conducted -  \n",
    "1) to illustrate how various lexical and semantic feature information can be derived directly from word pools and recognition lists,  \n",
    "2) to examine how these feature values can be expected to vary as a function of item frequency, and   \n",
    "3) to assess whether standard word pools mimic these differences (and each other).\n",
    "## Verbal Properties and Frequency Class\n",
    "In the study of semantic and episodic memory, different word pools make use of somewhat different sampling procedures and controls. Thus, our first goal was to establish a neutral, independent baseline, in which words were sampled without any special consideration other than frequency.\n",
    "### Word Frequency \n",
    "Words and their frequencies were extracted from the state-of-the-art 51 million word [SUBTLEXus corpus](http://subtlexus.lexique.org/) (Brysbaert & New, 2009). Frequency classes were assigned according to the Zipf scale, a logarithmic scale with seven frequency classes, which has a number of advantages over the typical binary division between HF and LF words (van Heuven et al., 2014; Figure 1). For purposes of comparison, a Zipf value of 3 or lower corresponds to LF words; 4 or higher to HF words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure1.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 1**: The Zipf scale is a logarithmic scale that divides the frequency spectrum into seven discrete classes (van Heuven et al. 2014)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#word frequency from SUBTLEXus\n",
    "def make_freq_dict():\n",
    "    import csv\n",
    "    reader = csv.reader(open('[folder]/SUBTLEXus.csv', 'rU'), dialect=csv.excel_tab)\n",
    "    for row in reader:\n",
    "        for s in row:\n",
    "            wordz = [x.strip() for x in s.split(',')][0]\n",
    "            countz = [x.strip() for x in s.split(',')][1]\n",
    "            if wordz not in word_freq_dict:\n",
    "                word_freq_dict[wordz] = countz\n",
    "\n",
    "word_freq_dict = {}\n",
    "make_freq_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure2.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 2**: The number of distinct word types in the SUBTLEXus corpus for each value of the Zipf scale. The comparatively small number of types in the higher frequency ranges placed constraints on the construction of recognition lists: List length was necessarily kept small; and while lists were created for Zipf values 1-6, 7 was excluded, as it comprised only 13 distinct word types, all of them function words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#get zipf scale info\n",
    "def make_freqz_dict():\n",
    "    import csv\n",
    "    reader = csv.reader(open('[folder]/SUBTLEXus.csv', 'rU'), dialect=csv.excel_tab)\n",
    "    for row in reader:\n",
    "        for s in row:\n",
    "            wordz = [x.strip() for x in s.split(',')][0]\n",
    "            countz = [x.strip() for x in s.split(',')][-2]\n",
    "            if wordz not in zipf_freq_dict:\n",
    "                zipf_freq_dict[wordz] = countz\n",
    "    zipf_freq_dict.pop('Word', None)\n",
    "\n",
    "zipf_freq_dict = {}\n",
    "make_freqz_dict()\n",
    "\n",
    "#reverse dictionary (so counts are keys)\n",
    "def make_zipf_dict():\n",
    "    import csv\n",
    "    reader = csv.reader(open('[folder]/SUBTLEXus.csv', 'rU'), dialect=csv.excel_tab)\n",
    "    for row in reader:\n",
    "        for s in row:\n",
    "            wordz = [x.strip() for x in s.split(',')][0]\n",
    "            countz = [x.strip() for x in s.split(',')][-1]\n",
    "            if countz not in zipf_scale_dict:\n",
    "                zipf_scale_dict[countz] = [wordz]\n",
    "            else:\n",
    "                zipf_scale_dict[countz].append(wordz)\n",
    "    zipf_scale_dict.pop('Zipf_Round', None)\n",
    "\n",
    "zipf_scale_dict = {}\n",
    "make_zipf_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#number of words in each entry\n",
    ">>> for keyz in zipf_scale_dict:\n",
    "    print keyz, str(len(zipf_scale_dict[keyz]))\n",
    "1 25326\n",
    "2 31063\n",
    "3 13426\n",
    "4 3698\n",
    "5 637\n",
    "6 123\n",
    "7 13\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create a graph of this in R\n",
    "Zipf_rank <- c('1','2','3','4','5','6','7')\n",
    "frequency <- c(25326, 31063, 13426, 3698, 637, 123, 13)\n",
    "Zipf.data <- data.frame(Zipf_rank, frequency)\n",
    "\n",
    "library(ggplot2)\n",
    "attach(Zipf.data)\n",
    "ggplot(Zipf.data, aes(Zipf_rank, frequency)) + geom_point() + geom_line() + xlab(\"Zipf Scale\") + ylab(\"Number of Word Types\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recognition Lists \n",
    "To create recognition lists, 10 items were selected at random (without replacement) from a given frequency bin. Half of these items were labeled targets, and the other half foils. This sampling procedure was repeated until there were 1000 such lists for each frequency class (see Figure 2 for more details). \n",
    "\n",
    "The aim was to compare lists created in each band on four dimensions: **word length**, **feature frequency**, and **orthographic** and **semantic similarity** of targets to distractors. These particular dimensions were chosen to be illustrative, and because they are known to be contributing factors to item recognition. For word length and feature frequency, counts were computed for each item, and averaged over the entire list. For orthographic and semantic similarity, the similarity of each target to the distractors present at test was computed, and similarly averaged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#generate pure lists from SUBTLEXus for each Zipf Rank\n",
    "\n",
    "#load random samples from SUBTLEXus\n",
    "#zipf rank, number sampled, number of repetitions\n",
    "def create_zipf_list(zipf_rank, sample_size, num_rep, output_file):\n",
    "    #create storage dictionary\n",
    "    data_set = {'targets':[], 'foils':[], 'samps':[]}\n",
    "    #iterate through required repetitions\n",
    "    import random\n",
    "    for x in range(1,num_rep+1):\n",
    "        rand = random.sample(zipf_scale_dict[zipf_rank], sample_size)\n",
    "        #targets are first half of the sample, foils are second half\n",
    "        targets1 = rand[0:int(len(rand)/2)]\n",
    "        foils1 = rand[int(len(rand)/2):len(rand)]\n",
    "        #have each target on the list repeat; #repeats = #foils\n",
    "        targets2 = [val for val in targets1 for _ in range(0,len(foils1))]\n",
    "        for t2 in targets2:\n",
    "            data_set['targets'].append(t2)\n",
    "        #have the foil list repeat verbatim; #repeats = #targets\n",
    "        foils2 = [i for i in range(0,len(targets1)) for i in foils1]\n",
    "        for f2 in foils2:\n",
    "            data_set['foils'].append(f2)\n",
    "        #what sample number is this\n",
    "        samp_num = [x]*len(targets2)\n",
    "        for s2 in samp_num:\n",
    "            data_set['samps'].append(s2)\n",
    "    #output dictionary contents to csv file\n",
    "    import csv\n",
    "    import sys\n",
    "    with open(output_file, 'wt') as f:\n",
    "        writer = csv.writer(f)\n",
    "        writer.writerow( ('Zipf', 'Sample', 'Targets', 'Foils', 'Similarity', 'Edit Distance', 'Zipf_Numeric', 'Target Length', 'Unigram_Freq', 'Bigram_Freq', 'Trigram_Freq', 'Fourgram_Freq', 'Fivegram_Freq') )\n",
    "        for t in range(0,len(targets2)*num_rep):\n",
    "            try:\n",
    "                s00 = zipf_rank\n",
    "                s01 = data_set['samps'][t]\n",
    "                s02 = data_set['targets'][t]\n",
    "                s03 = data_set['foils'][t]\n",
    "                s04 = word2vecmodel.similarity(s02, s03)\n",
    "                s05 = minimumEditDistance(s02, s03)\n",
    "                s06 = zipf_freq_dict[s02]\n",
    "                s07 = len(s02)\n",
    "                s08 = feat_info[s02]['unigram']\n",
    "                s09 = feat_info[s02]['bigram']\n",
    "                s10 = feat_info[s02]['trigram']\n",
    "                s11 = feat_info[s02]['quad']\n",
    "                s12 = feat_info[s02]['quint']\n",
    "            except KeyError:\n",
    "                s04 = 0\n",
    "            writer.writerow( (s00, s01, s02, s03, s04, s05, s06, s07, s08, s09, s10, s11, s12) )\n",
    "        f.close()\n",
    "\n",
    "#Example Usage: zipf rank = 5, sample_size = 10, num_rep = 100, output_file = 'output_file.csv'\n",
    "#Note: Sample Size MUST BE EVEN / half will be targets & half foils\n",
    "#create_pure_list(file_return['HF1'], 10, 2, 'output_file.csv')\n",
    "\n",
    "#automatically generate sample lists from files\n",
    "def gen_zipf_lists():\n",
    "    zipf_rank = ['1', '2', '3', '4', '5', '6']\n",
    "    sample_size = [10]\n",
    "    for zr in zipf_rank:\n",
    "        for ss in sample_size:\n",
    "            file_name = 'ZipfRank' + zr + '-' + str(ss) + 'itemlist.csv'\n",
    "            create_zipf_list(zr, ss, 1000, file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word Length \n",
    "Word length, whether computed in terms of letters or phonemes, has an inverse relationship with frequency, with word lengths tending to increase as frequency declines (Piantadosi et al., 2011; Sigurd, Eeg-Olofsson, & Van Weijer, 2004; Wright, 1979; see Figure 3).\n",
    "\n",
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure3.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 3**: Average word length of list items increases as frequency declines.\n",
    "\n",
    "### Feature Frequency \n",
    "Feature frequencies represent the empirical n-gram frequencies of individual letters and letter combinations, and can be conceptualized as a measure of orthographic distinctiveness (Figure 4). Feature frequency is known to vary with word frequency. On average, rarer words contain both more unusual letters, and more unusual combinations of letters (Malmberg et al. 2002; Zechmeister, 1969).\n",
    "\n",
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure4.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 4**: The five panels depict the average feature frequencies of list items in SUBTLEXus as a function of their Zipf value. The overall trend indicates that higher frequency items are comprised of higher frequency features. Moreover, the larger the n-gram, the greater the separation between frequency classes. For unigrams, a more pronounced pattern of separation between Zipf bands is observable when minimum (rather than average) feature frequency is used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#get feature frequency information \n",
    "\n",
    "#create dictionaries of counts for letters, bigrams, trigrams\n",
    "letter_dict = {}\n",
    "bigram_dict = {}\n",
    "trigram_dict = {}\n",
    "quad_dict = {}\n",
    "quint_dict = {}\n",
    "\n",
    "for text in word_freq_dict:\n",
    "    from collections import Counter\n",
    "    count = int(word_freq_dict[text])\n",
    "    #create letter dictionary\n",
    "    s = list(text)*count\n",
    "    for x in s:\n",
    "        if x not in letter_dict:\n",
    "            letter_dict[x] = 1\n",
    "        else:\n",
    "            letter_dict[x] += 1\n",
    "    #create bigram dictionary\n",
    "    bigram = Counter(x+y for x, y in zip(*[text[i:] for i in range(2)]))\n",
    "    for b in bigram:\n",
    "        new_count = bigram[b]*count\n",
    "        if b not in bigram_dict:\n",
    "            bigram_dict[b] = new_count\n",
    "        else:\n",
    "            bigram_dict[b] += new_count\n",
    "    #create trigram dictionary\n",
    "    trigram = Counter(x+y+z for x, y, z in zip(*[text[i:] for i in range(3)]))\n",
    "    for t in trigram:\n",
    "        new_count2 = trigram[t]*count\n",
    "        if t not in trigram_dict:\n",
    "            trigram_dict[t] = new_count2\n",
    "        else:\n",
    "            trigram_dict[t] += new_count2\n",
    "    #create quadgram dictionary\n",
    "    quad = Counter(v+x+y+z for v, x, y, z in zip(*[text[i:] for i in range(4)]))\n",
    "    for q in quad:\n",
    "        new_count3 = quad[q]*count\n",
    "        if q not in quad_dict:\n",
    "            quad_dict[q] = new_count3\n",
    "        else:\n",
    "            quad_dict[q] += new_count3\n",
    "    #create quintgram\n",
    "    quint = Counter(u+v+x+y+z for u, v, x, y, z in zip(*[text[i:] for i in range(5)]))\n",
    "    for qu in quint:\n",
    "        new_count4 = quint[qu]*count\n",
    "        if qu not in quint_dict:\n",
    "            quint_dict[qu] = new_count3\n",
    "        else:\n",
    "            quint_dict[qu] += new_count3\n",
    "            \n",
    "#for each word in SUBTLEXus, get its average letter unigram, bigram, and trigram frequencies\n",
    "#populate dictionary with words and counts\n",
    "feat_info = {}\n",
    "for w in word_freq_dict:\n",
    "    if w not in feat_info:\n",
    "        feat_info[w] = {'count': word_freq_dict[w], 'unigram': 0, 'bigram': 0, 'trigram': 0, 'quad': 0, 'quint': 0, 'length': len(w)}\n",
    "\n",
    "#add letter unigram, bigram, trigram means\n",
    "for n in feat_info:\n",
    "    import math\n",
    "    #unigram\n",
    "    letter_count = 0\n",
    "    n_length = len(n)\n",
    "    text = list(n)\n",
    "    for t in text:\n",
    "        letter_count += math.log(letter_dict[t.lower()],10)\n",
    "    feat_info[n]['unigram'] = letter_count/n_length\n",
    "    # bigrams = length - 1\n",
    "    if n_length > 1:\n",
    "        bi_count = 0\n",
    "        bigram = Counter(x+y for x, y in zip(*[n[i:] for i in range(2)]))\n",
    "        for b in bigram:\n",
    "            bi_count += math.log(bigram_dict[b.lower()],10)\n",
    "        feat_info[n]['bigram'] = bi_count/(n_length - 1)\n",
    "    if n_length > 2:\n",
    "        tri_count = 0\n",
    "        trigram = Counter(x+y+z for x, y, z in zip(*[n[i:] for i in range(3)]))\n",
    "        for tr in trigram:\n",
    "            tri_count += math.log(trigram_dict[tr.lower()],10)\n",
    "        feat_info[n]['trigram'] = tri_count/(n_length - 2)\n",
    "    if n_length > 3:\n",
    "        quad_count = 0\n",
    "        quad = Counter(v+x+y+z for v, x, y, z in zip(*[n[i:] for i in range(4)]))\n",
    "        for q in quad:\n",
    "            quad_count += math.log(quad_dict[q.lower()],10)\n",
    "        feat_info[n]['quad'] = quad_count/(n_length - 3)\n",
    "    if n_length > 4:\n",
    "        quint_count = 0\n",
    "        quint = Counter(u+v+x+y+z for u, v, x, y, z in zip(*[n[i:] for i in range(5)]))\n",
    "        for qu in quint:\n",
    "            quint_count += math.log(quint_dict[qu.lower()],10)\n",
    "        feat_info[n]['quint'] = quint_count/(n_length - 4)\n",
    "\n",
    "#add letter unigram, bigram, trigram MINS (take the minimum)\n",
    "for n in feat_info:\n",
    "    import math\n",
    "    #unigram\n",
    "    letter_count = []\n",
    "    n_length = len(n)\n",
    "    text = list(n)\n",
    "    for t in text:\n",
    "        letter_count.append(float(letter_dict[t.lower()]))\n",
    "    feat_info[n]['unigram'] = math.log(min(letter_count),10)\n",
    "    # bigrams = length - 1\n",
    "    if n_length > 1:\n",
    "        bi_count = []\n",
    "        bigram = Counter(x+y for x, y in zip(*[n[i:] for i in range(2)]))\n",
    "        for b in bigram:\n",
    "            bi_count.append(float(bigram_dict[b.lower()]))\n",
    "        feat_info[n]['bigram'] = math.log(min(bi_count),10)\n",
    "    if n_length > 2:\n",
    "        tri_count = []\n",
    "        trigram = Counter(x+y+z for x, y, z in zip(*[n[i:] for i in range(3)]))\n",
    "        for tr in trigram:\n",
    "            tri_count.append(float(trigram_dict[tr.lower()]))\n",
    "        feat_info[n]['trigram'] = math.log(min(tri_count),10)\n",
    "    if n_length > 3:\n",
    "        quad_count = []\n",
    "        quad = Counter(v+x+y+z for v, x, y, z in zip(*[n[i:] for i in range(4)]))\n",
    "        for q in quad:\n",
    "            quad_count.append(float(quad_dict[q.lower()]))\n",
    "        feat_info[n]['quad'] = math.log(min(quad_count),10)\n",
    "    if n_length > 4:\n",
    "        quint_count = []\n",
    "        quint = Counter(u+v+x+y+z for u, v, x, y, z in zip(*[n[i:] for i in range(5)]))\n",
    "        for qu in quint:\n",
    "            quint_count.append(float(quint_dict[qu.lower()]))\n",
    "        feat_info[n]['quint'] = math.log(min(quint_count),10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Letter Frequency in SUBTLEXus**\n",
    "```python\n",
    "letter_dict = {'a': 13848908, 'c': 3914403, 'b': 2798849, 'e': 21170430, 'd': 6439842, 'g': 4549429, 'f': 3028467, 'i': 12925517, 'h': 10838197, 'k': 2573294, 'j': 477929, 'm': 5076370, 'l': 7624241, 'o': 16779213, 'n': 11853549, 'q': 96556, 'p': 2680673, 's': 10527194, 'r': 9507112, 'u': 6936365, 't': 17344980, 'w': 4968188, 'v': 1621136, 'y': 6359442, 'x': 218230, 'z': 103392}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Orthographic similarity\n",
    "Orthographic similarity was computed as Levenshtein edit distance, a string metric that calculates the minimum number of edits (such as insertions, deletions, or substitutions) required to transform one word into the other. Given that rare words are more orthographically distinctive (Landauer & Streeter, 1973; Andrews, 1992), it stands to reason that in a recognition list context, they should be less orthographically similar to frequency-matched distractors than more common words (Hall, 1979)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure5.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 5**: Average orthographic similarity between targets and distractors declines as a function of frequency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#code for levenstein edit distance\n",
    "def minimumEditDistance(s1,s2):\n",
    "    if len(s1) > len(s2):\n",
    "        s1,s2 = s2,s1\n",
    "    distances = range(len(s1) + 1)\n",
    "    for index2,char2 in enumerate(s2):\n",
    "        newDistances = [index2+1]\n",
    "        for index1,char1 in enumerate(s1):\n",
    "            if char1 == char2:\n",
    "                newDistances.append(distances[index1])\n",
    "            else:\n",
    "                newDistances.append(1 + min((distances[index1],\n",
    "                                             distances[index1+1],\n",
    "                                             newDistances[-1])))\n",
    "        distances = newDistances\n",
    "    return distances[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Semantic similarity \n",
    "Semantic similarity values were obtained from **word2vec** trained on the 300 billion word Google News corpus. word2vec is a two-layer neural network that produces word embeddings (Mikolov et al., 2013), and is considered state of the art in semantic space modeling (Baroni, Dinu, & Kruszewski, 2014). word2vec was implemented with **gensim**, a Python framework for vector space modeling (Řehůřek & Sojka, 2010), which adopts the continuous skip-gram architecture. The skip-gram model weights proximate context words more highly than distant ones, yielding better results for lower frequency words.\n",
    "\n",
    "In a recognition task in which list items are randomly sampled from a given frequency band, the semantic similarity between targets and distractors should tend to decrease with frequency (Figure 6). This outcome is all but assured by the distributional properties of the lexicon: In the SUBTLEXus corpus, LF words comprise 80% of word tokens (van Heuven et al., 2014) and fully 94% of word types (Figure 2). The semantic spread from which LF words are sampled will thus be far greater than that for HF items (Figure 6)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure6.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 6**: Average semantic similarity between targets and distractors declines across the Zipf scale, implying that a set of randomly sampled words will be less semantically similar, on average, the lower their frequency class. While a slight (ns) trend in the opposite direction is observable in the lower range of the scale, this is almost certainly a methodological artifact. If the missing data in Figure 7 is included as 0-counts, the apparent trend reverses, and the pattern resembles that seen in the top panel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#calculate word similarity scores\n",
    "import gensim\n",
    "word2vecmodel = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)\n",
    "#example: print word2vecmodel.similarity('cat', 'dog')\n",
    "\n",
    "#make all pairs readable\n",
    "all_w1 = []\n",
    "all_w2 = []\n",
    "all_ls = []\n",
    "all_sim = []\n",
    "for ls in allpairs:\n",
    "    for i,t in enumerate(allpairs[ls]):\n",
    "        try:\n",
    "            all_ls.append(ls)\n",
    "            w1 = allpairs[ls][i][0]\n",
    "            all_w1.append(w1)\n",
    "            w2 = allpairs[ls][i][1]\n",
    "            all_w2.append(w2)\n",
    "    #get similarity score\n",
    "            wsim = word2vecmodel.similarity(w1, w2)\n",
    "            all_sim.append(wsim)\n",
    "        except KeyError:\n",
    "            all_sim.append('NA')\n",
    "\n",
    "#create dictionary of values\n",
    "sim_values = []\n",
    "for i in range(0,len(all_w1)-1):\n",
    "    sim_values.append((all_w1[i], all_w2[i], all_sim[i]))\n",
    "\n",
    "#output dictionary contents to csv file\n",
    "import csv\n",
    "import sys\n",
    "with open('simvalues2_file.csv', 'wt') as f:\n",
    "    writer = csv.writer(f)\n",
    "    writer.writerow( ('W1', 'W2', 'Similarity') )\n",
    "    for t in range(0,len(sim_values)):\n",
    "        writer.writerow( (sim_values[t][0], sim_values[t][1], sim_values[t][2]) )\n",
    "    f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data sparsity\n",
    "In making these calculations, there are important methodological issues to consider—in particular, the problem, well-known to linguists, of data sparsity (Sinclair, 1997). The issue relates to the universal scaling law for word frequencies, commonly known as Zipf’s Law (1949). The idea is this: Say, an English text is selected, and each of the word types that occur in the text are arranged in order of their frequency, from most to least common, and assigned a numerical rank. Then, the full contents of the text – that is, all of its word tokens – are thrown into a bag, shook, and one word is selected at random. Zipf’s Law states that the probability of drawing a given word is inversely proportional to that word’s rank ordering. The law formalizes the notion that while a few words in a language are very common, the greater part are exceedingly rare (Figure 2). \n",
    "\n",
    "Thus, while any given sample of language will provide ample evidence about its common words and phrases, it will provide little or none about its rarer, more informative elements (Church & Gale, 1995). Not only will many perfectly legitimate words (and word co-occurrences) fail to occur in even very large swaths of text, but even most of those that do will occur only a few times, making their estimation unreliable. This is the basic problem of data sparsity and it is one that plagues semantic similarity analyses in the lower frequency ranges (Figures 7). \n",
    "\n",
    "Figure 6 shows the similarity distributions for item pairs that were known to our word2vec model. However, given the significant data loss for LF items, looking solely at returned values constitutes selection bias, as it implies that unobserved pairs—for which the model cannot supply a score—likely have the same distributional properties as observed pairs. In fact, it is reasonable to assume that unobserved pairs are much less similar, on average. One way of addressing this issue is to assign item pairs with null values a similarity score of 0. When these scores are included, the trend observable in the HF range (Figure 6) is also clearly observable in the LF range. \n",
    "\n",
    "In the absence of knowledge, assigning 0-counts is a useful heuristic. However, given that problems with data sparsity increase as frequency declines, this solution may disproportionately penalize the lowest frequency words. In future work, similarity-based smoothing techniques might be used to better estimate similarity values for unobserved pairs (c.f. Yarlett, 2007)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure7.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 7**: Data loss for the semantic similarity analyses as a function of frequency class. Semantic similarity values were not available for all the words sampled, and the proportion of words with no data points grew as frequency decreased. For Zipf rank 1, fully 25% of data was lost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create a subset of SUBTLEXus that only returns valid values in word2vec\n",
    "def make_zipf_dict():\n",
    "    import csv\n",
    "    reader = csv.reader(open('[folder]/SUBTLEXus.csv', 'rU'), dialect=csv.excel_tab)\n",
    "    for row in reader:\n",
    "        for s in row:\n",
    "            wordz = [x.strip() for x in s.split(',')][0]\n",
    "            countz = [x.strip() for x in s.split(',')][-1]\n",
    "            try:\n",
    "                attempt = word2vecmodel.similarity('cat', wordz)\n",
    "                if countz not in zipf_scale_dict:\n",
    "                    zipf_scale_dict[countz] = [wordz]\n",
    "                else:\n",
    "                    zipf_scale_dict[countz].append(wordz)\n",
    "            except KeyError:\n",
    "                continue\n",
    "    zipf_scale_dict.pop('Zipf_Round', None)\n",
    "\n",
    "zipf_scale_dict = {}\n",
    "make_zipf_dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "#number of words in each entry\n",
    ">>> for keyz in zipf_scale_dict:\n",
    "        print keyz, str(len(zipf_scale_dict[keyz]))\n",
    "\n",
    "1 18867 #25.5% data loss\n",
    "2 28365 #8.69% data loss\n",
    "3 13011 #3.09% data loss\n",
    "4 3684  #.038% data loss\n",
    "5 637   #0% data loss\n",
    "6 123   #0% data loss\n",
    "7 9\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#create a graph of this in R\n",
    "Zipf_rank <- c('1','2','3','4','5','6')\n",
    "loss <- c(25.5, 8.69, 3.09, .038, 0, 0)\n",
    "data.loss <- data.frame(Zipf_rank, loss)\n",
    "\n",
    "library(ggplot2)\n",
    "attach(data.loss)\n",
    "ggplot(data.loss) + geom_point(aes(x=Zipf_rank, y=loss)) + xlab(\"Zipf Scale\") + ylab(\"% Data Loss\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary \n",
    "Our analyses of words in the SUBTLEXus corpus replicates and extends a number of well-known findings on the relationship between a word’s frequency and its lexical and semantic features, including that: **word length** increases as word frequency declines, **feature frequency** increases with word frequency, with the rate of increase dependent on feature length, **orthographic similarity** between targets and foils increases with word frequency, **semantic similarity** between targets and foils increases with word frequency (though the calculation of similarity scores for LF item pairs requires careful consideration)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Pools\n",
    "In the study of semantic and episodic memory, different word pools make use of somewhat different sampling procedures and controls. One concern is that different word lists may vary in systematic ways from each other,  producing variability in results; another is that they may have distinctly different properties from the language ‘at large’. To check the validity of these worries, we compared the word pools of two representative cognitive memory labs, with an average h-index among the principle investigators of 20, and published theoretical disagreements. These word pools were compared against a recognition word list devised by Dye, Jones, & Shiffrin (in sub) (Figures 8, 9).\n",
    "\n",
    "The Dye et al. (in sub) word list was deliberately constructed to increase the semantic and orthographic similarity of LF items, as reflected in Figures 9 and 10. In a recognition list experiment, this had the predicted effect of diminishing the standard mirror effect for word frequency, by bringing the false alarm rate for low and high frequency items into line."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure8.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 8**: A comparison of average semantic similarity of targets to foils across three word pools.\n",
    "\n",
    "<img src=\"http://mypage.iu.edu/~meldye/images/episodic/Figure9.png\" style=\"width: 500px;\"/>\n",
    "> **Figure 9**: A comparison of average orthographic similarity of  targets to foils across three word pools."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notably, while the Dye et al. word list clearly differs from the two standard word pools, these word pools are not identical to each other either. In particular, though both pools are similarly distributed in terms of frequency and semantic similarity among items, in Word Pool 2, orthographic similarity among items is substantially increased compared to Word Pool 1, and is matched across HF and LF items. This may influence reported results, as orthographic similarity is known to modulate false alarm rates (Malmberg, Holden, & Shiffrin, 2004). \n",
    "\n",
    "Finally, it is worth noting that none of these ‘controlled’ word pools reflect the properties expected from random sampling, as illustrated in our exploration of the SUBTLEXus corpus. In particular, while the distribution of orthographic and semantic similarity values for LF and HF items are largely overlapping for the standard word pools (Figures 9, 10), a truly random selection of these items shows significant separation between frequency bands (Figures 5, 6).\n",
    "\n",
    "These examples illustrate how the properties of word lists can be readily and fruitfully compared both to each other, and to larger corpora. In future work, we plan to expand this analysis to include more widely used word pools, such as the Toronto word pool (Friendly, Franklin, Hoffman, & Rubin, 1982), a modified version of the Kucera & Francis word pool (1967), and a categorized word pool (Murdock, 1976)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#load word lists\n",
    "def load_lists(file_dict):\n",
    "    for k in file_dict:\n",
    "        text_file = open(file_dict[k], \"r\")\n",
    "        if k=='HF1' or k=='LF1':\n",
    "            opener = '\\r\\n'\n",
    "        else:\n",
    "            opener = '\\n'\n",
    "        lines = text_file.read().split(opener)\n",
    "        for l in lines:\n",
    "            if l != '':\n",
    "                file_return[k].append(l)\n",
    "\n",
    "file_dict = {'toronto':\"[path]/toronto.txt\", 'kucera':\"[path]/kucera-francis.txt\", 'murdock':\"[path]/murdock.txt\"}\n",
    "file_return = {'toronto':[], 'kucera':[], 'murdock':[]}\n",
    "\n",
    "load_lists(file_dict)\n",
    "\n",
    "#create word pairs for similarity analyses \n",
    "allpairs = {'toronto':[], 'murdock':[], 'kucera':[]}\n",
    "\n",
    "def make_pairs(which_list, which_pair):\n",
    "    import itertools\n",
    "    for pair in itertools.combinations(which_list, 2):\n",
    "        which_pair.append(pair)\n",
    "\n",
    "for p in allpairs:\n",
    "    make_pairs(file_return[p], allpairs[p])\n",
    "    \n",
    "#save pair information\n",
    "import numpy as np\n",
    "np.save('all_pairs.npy', allpairs)\n",
    "read_dictionary = np.load('all_pairs.npy').item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#generate pure list from word pool\n",
    "\n",
    "#load random samples from a given list to a file\n",
    "#file name, number sampled, number of repetitions\n",
    "def create_pure_list(file_name, sample_size, num_rep, output_file):\n",
    "    #create storage dictionary\n",
    "    data_set = {'targets':[], 'foils':[], 'samps':[]}\n",
    "    #iterate through required repetitions\n",
    "    import random\n",
    "    for x in range(1,num_rep+1):\n",
    "        rand = random.sample(file_name, sample_size)\n",
    "        #targets are first half of the sample, foils are second half\n",
    "        targets1 = rand[0:len(rand)/2]\n",
    "        foils1 = rand[len(rand)/2:len(rand)]\n",
    "        #have each target on the list repeat; #repeats = #foils\n",
    "        targets2 = [val for val in targets1 for _ in range(0,len(foils1))]\n",
    "        for t2 in targets2:\n",
    "            data_set['targets'].append(t2)\n",
    "        #have the foil list repeat verbatim; #repeats = #targets\n",
    "        foils2 = [i for i in range(0,len(targets1)) for i in foils1]\n",
    "        for f2 in foils2:\n",
    "            data_set['foils'].append(f2)\n",
    "        #what sample number is this\n",
    "        samp_num = [x]*len(targets2)\n",
    "        for s2 in samp_num:\n",
    "            data_set['samps'].append(s2)\n",
    "#output dictionary contents to csv file\n",
    "    import csv\n",
    "    import sys\n",
    "    with open(output_file, 'wt') as f:\n",
    "        writer = csv.writer(f)\n",
    "        writer.writerow( ('Sample', 'Targets', 'Target_Freq', 'Foils', 'Foil_Freq', 'Similarity', 'Edit Distance') )\n",
    "        for t in range(0,len(targets2)*num_rep):\n",
    "            try:\n",
    "                s01 = data_set['samps'][t]\n",
    "                s02 = data_set['targets'][t]\n",
    "                s02f = word_freq_dict[s02]\n",
    "                s03 = data_set['foils'][t]\n",
    "                s03f = word_freq_dict[s03]\n",
    "                s04 = word2vecmodel.similarity(s02, s03)\n",
    "                s05 = minimumEditDistance(s02, s03)\n",
    "            except KeyError:\n",
    "                s04 = 0\n",
    "                writer.writerow( (s01, s02, s02f, s03, s03f, s04, s05) )\n",
    "        f.close()\n",
    "\n",
    "with open('output_file.txt', 'wt') as f:\n",
    "    writer = csv.writer(f)\n",
    "    writer.writerow( ('Sample', 'Targets', 'Target_Freq', 'Foils', 'Foil_Freq', 'Similarity', 'Edit Distance') )\n",
    "    for t in range(0,len(targets2)):\n",
    "        try:\n",
    "            s01 = data_set['samps'][t]\n",
    "            s02 = data_set['targets'][t]\n",
    "            s03 = data_set['foils'][t]\n",
    "            s04 = word2vecmodel.similarity(s02, s03)\n",
    "            s05 = minimumEditDistance(s02, s03)\n",
    "            s02f = word_freq_dict[s02]\n",
    "            s03f = word_freq_dict[s03]\n",
    "        except KeyError:\n",
    "            s04 = 0\n",
    "            writer.writerow( (s01, s02, s02f, s03, s03f, s04, s05) )\n",
    "    f.close()\n",
    "\n",
    "\n",
    "#Example Usage: file_name = HF1, sample_size = 10, num_rep = 100, output_file = 'output_file.csv'\n",
    "#Note: Sample Size MUST BE EVEN / half will be targets & half foils\n",
    "#create_pure_list(file_return['HF1'], 10, 2, 'output_file.csv')\n",
    "\n",
    "create_pure_list(file_return['toronto'], 10, 100, 'test_output.csv')\n",
    "\n",
    "#automatically generate sample lists from files\n",
    "def gen_pure_lists():\n",
    "    all_files = ['HF1', 'LF1', 'HF2', 'LF2']\n",
    "    pure_lists = [20, 40, 60, 80, 100]\n",
    "    for af in all_files:\n",
    "        for pl in pure_lists:\n",
    "            returning = af + '-' + str(pl) + 'wordlist.csv'\n",
    "            create_pure_list(file_return[af], pl, 1000, returning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#generate mixed list from word pool\n",
    "\n",
    "#load random samples from a given list to a file\n",
    "#HF file name, LF file name, number sampled, number of repetitions\n",
    "def create_mixed_list(HF_file, LF_file, sample_size, num_rep, output_file):\n",
    "    #create storage dictionary\n",
    "    data_set = {'targets':[], 'foils':[], 'freq_targets':[], 'freq_foils':[], 'samps':[]}\n",
    "    #iterate through required repetitions\n",
    "    import random\n",
    "    for x in range(1,num_rep+1):\n",
    "        #high frequency sample\n",
    "        rand_HF = random.sample(HF_file, sample_size)\n",
    "        #targets are first half of the sample, foils are second half\n",
    "        targets1_HF = rand_HF[0:len(rand_HF)/2]\n",
    "        foils1_HF = rand_HF[(len(rand_HF)/2):len(rand_HF)]\n",
    "        #low frequency sample\n",
    "        rand_LF = random.sample(LF_file, sample_size)\n",
    "        targets1_LF = rand_LF[0:len(rand_LF)/2]\n",
    "        foils1_LF = rand_LF[len(rand_LF)/2:len(rand_LF)]\n",
    "        #have each target on the list repeat; #repeats = #foils\n",
    "        targets1 = targets1_HF + targets1_LF\n",
    "        foils1 = foils1_HF + foils1_LF\n",
    "        targets2 = [val for val in targets1 for _ in range(0,len(foils1))]\n",
    "        for t2 in targets2:\n",
    "            data_set['targets'].append(t2)\n",
    "        #have the foil list repeat verbatim; #repeats = #targets\n",
    "        foils2 = [i for i in range(0,len(targets1)) for i in foils1]\n",
    "        for f2 in foils2:\n",
    "            data_set['foils'].append(f2)\n",
    "        #what sample number is this\n",
    "        samp_num = [x]*len(targets2)\n",
    "        for s2 in samp_num:\n",
    "            data_set['samps'].append(s2)\n",
    "        #what is the frequency of the target in this pair?\n",
    "        f_target = ['HF']*(len(targets2)/2) + ['LF']*(len(targets2)/2)\n",
    "        for ft2 in f_target:\n",
    "            data_set['freq_targets'].append(ft2)\n",
    "        #what is the frequency of the foil in this pair?\n",
    "        f_foil = (['HF']*len(foils1_HF) + ['LF']*len(foils1_LF))*len(targets1)\n",
    "        for ff2 in f_foil:\n",
    "            data_set['freq_foils'].append(ff2)\n",
    "    #output dictionary contents to csv file\n",
    "    import csv\n",
    "    import sys\n",
    "    with open(output_file, 'wt') as f:\n",
    "        writer = csv.writer(f)\n",
    "        writer.writerow( ('Sample', 'Targets', 'Target_Bin', 'Target_Freq', 'Foils', 'Foil_Bin', 'Foil_Freq', 'Similarity', 'Edit Distance') )\n",
    "        for t in range(0,len(targets2)*num_rep):\n",
    "            s01 = data_set['samps'][t]\n",
    "            s02 = data_set['targets'][t]\n",
    "            s03 = data_set['freq_targets'][t]\n",
    "            s02f = word_freq_dict[s02]\n",
    "            s04 = data_set['foils'][t]\n",
    "            s05 = data_set['freq_foils'][t]\n",
    "            s04f = word_freq_dict[s04]\n",
    "            s06 = word2vecmodel.similarity(s02, s04)\n",
    "            s07 = minimumEditDistance(s02, s04)\n",
    "            writer.writerow( (s01, s02, s03, s02f, s04, s05, s04f, s06, s07) )\n",
    "        f.close()\n",
    "\n",
    "#Example Usage: file_name = HF1, sample_size = 10, num_rep = 1000, output_file = 'output_file.csv'\n",
    "#Note: number sampled MUST BE EVEN / half will be targets & half foils\n",
    "#create_mixed_list(file_return['HF1'], file_return['LF1'], 8, 1, 'output_file.csv')\n",
    "\n",
    "create_mixed_list(file_return['HF1'], file_return['HF1'], 8, 1, 'output_file.csv')\n",
    "\n",
    "create_mixed_list(file_return['toronto'], file_return['toronto'], 8, 1, 'output_file.csv')\n",
    "\n",
    "#systematically sample lists\n",
    "def gen_mixed_lists():\n",
    "    all_files = {'HF1':'LF1', 'HF2':'LF2'}\n",
    "    mixed_lists = [10, 20, 30, 40, 50]\n",
    "    for af in all_files:\n",
    "        af_l = all_files[af]\n",
    "        for ml in mixed_lists:\n",
    "            returning = 'mixed' + str(af)[2] + '-' + str(ml*2) + 'wordlist.csv'\n",
    "            create_mixed_list(file_return[str(af)], file_return[str(af_l)], ml, 1000, returning)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#assess real list\n",
    "\n",
    "#load real HF list and LF list to assess similarity values\n",
    "def get_sim_for_list(HF_file, LF_file, output_file):\n",
    "    #create storage dictionary\n",
    "    data_set = {'targets':[], 'foils':[], 'freq_targets':[], 'freq_foils':[]}\n",
    "    ###high frequency sample is just the population\n",
    "    rand_HF = HF_file\n",
    "    #targets are first half of the sample, foils are second half\n",
    "    targets1_HF = rand_HF[0:len(rand_HF)/2]\n",
    "    foils1_HF = rand_HF[(len(rand_HF)/2):len(rand_HF)]\n",
    "    ###low frequency sample is just the population\n",
    "    rand_LF = LF_file\n",
    "    targets1_LF = rand_LF[0:len(rand_LF)/2]\n",
    "    foils1_LF = rand_LF[len(rand_LF)/2:len(rand_LF)]\n",
    "    #have each target on the list repeat; #repeats = #foils\n",
    "    targets1 = targets1_HF + targets1_LF\n",
    "    foils1 = foils1_HF + foils1_LF\n",
    "    targets2 = [val for val in targets1 for _ in range(0,len(foils1))]\n",
    "    for t2 in targets2:\n",
    "        data_set['targets'].append(t2)\n",
    "    #have the foil list repeat verbatim; #repeats = #targets\n",
    "    foils2 = [i for i in range(0,len(targets1)) for i in foils1]\n",
    "    for f2 in foils2:\n",
    "        data_set['foils'].append(f2)\n",
    "    #what is the frequency of the target in this pair?\n",
    "    f_target = ['HF']*(len(targets2)/2) + ['LF']*(len(targets2)/2)\n",
    "    for ft2 in f_target:\n",
    "        data_set['freq_targets'].append(ft2)\n",
    "    #what is the frequency of the foil in this pair?\n",
    "    f_foil = (['HF']*len(foils1_HF) + ['LF']*len(foils1_LF))*len(targets1)\n",
    "    for ff2 in f_foil:\n",
    "        data_set['freq_foils'].append(ff2)\n",
    "    #output dictionary contents to csv file\n",
    "    import csv\n",
    "    import sys\n",
    "    with open(output_file, 'wt') as f:\n",
    "        writer = csv.writer(f)\n",
    "        writer.writerow( ('Targets', 'Target_Bin', 'Target_Freq', 'Foils', 'Foil_Bin', 'Foil_Freq', 'Similarity', 'Edit Distance') )\n",
    "        for t in range(0,len(targets2)):\n",
    "            try:\n",
    "                s02 = data_set['targets'][t]\n",
    "                s03 = data_set['freq_targets'][t]\n",
    "                s02f = word_freq_dict[s02]\n",
    "                s04 = data_set['foils'][t]\n",
    "                s05 = data_set['freq_foils'][t]\n",
    "                s04f = word_freq_dict[s04]\n",
    "                s06 = word2vecmodel.similarity(s02, s04)\n",
    "                s07 = minimumEditDistance(s02, s04)\n",
    "            except KeyError:\n",
    "                s06 = 0\n",
    "            writer.writerow( (s02, s03, s02f, s04, s05, s04f, s06, s07) )\n",
    "        f.close()\n",
    "\n",
    "#HF_file = file_return['HFT'], LF_file = file_return['HFT'], output_file = 'Context-Experiment.csv'\n",
    "get_sim_for_list(file_return['HFT'], file_return['LFT'], 'Context-Experiment.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**word2vec appendix**\n",
    "\n",
    "Trained Google News vectors: https://docs.google.com/open?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM  \n",
    "For questions about the toolkit, see http://groups.google.com/group/word2vec-toolkit"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}