joyrexus/README.md

## README.md

      
    README

NORC is requesting counts of word types and tokens for each student (for each story) as
well as MLU (mean length of utterance).
(NOTE: MLU was requested, but I'm not sure how to obtain it, given that there are often
multiple utterances per row/field and these utterances are not reliably or consistently
delimited.)
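For illustration only, here is a minimal sketch of how MLU could be computed if the utterances were reliably delimited. It assumes that every utterance ends in `.`, `?` or `!`, which the data does not guarantee, and it counts words rather than morphemes. The function name `mlu` and its `delimiter` parameter are hypothetical and are not part of query.py.

```python
import re

def mlu(text, delimiter=r'[.?!]+'):
    '''Mean length of utterance, in words.

    ASSUMPTION: utterances are separated by sentence-final punctuation.
    The real data is not consistently delimited, so treat this as a sketch.
    '''
    # Split the text into candidate utterances, then each utterance into words.
    utterances = [u.split() for u in re.split(delimiter, text)]
    utterances = [u for u in utterances if u]    # drop empty fragments
    if not utterances:
        return 0.0
    return sum(len(u) for u in utterances) / float(len(utterances))

print(mlu("the dog ran. he saw a big cat."))
```

The example text contains two utterances of 3 and 5 words. Any field where the delimiting is inconsistent would give a misleading value, which is why no MLU figures appear in the report.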
They provided an Excel file (data.xlsx, originally WPB_all_data_5_16_13.xlsx)
containing all utterances to be analyzed.
The provided data table contains 11 columns:
INDEX  COL    HEADER
    0    A    StudentID

    1    B    WPBNA_PG1-2 
    2    C    WPBNA_PG3-4
    3    D    WPBNA_PG5-6 
    4    E    WPBNA_PG7-8
    5    F    WPBNA_PG9-10
    
    6    G    WPBNB_PG1-2
    7    H    WPBNB_PG3-4
    8    I    WPBNB_PG5-6
    9    J    WPBNB_PG7-8
   10    K    WPBNB_PG9-10

We converted this into a more convenient format with the following columns:
ID     - student id
STORY  - A or B
PAGE   - 1 (1-2), 2 (3-4), etc.
TOKENS - word tokens parsed from TEXT 
TEXT   - original text

See data.xls for the resulting data table.
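The conversion script itself is not included here, so the following is only a sketch of how one wide row could be reshaped into the long format described above. The names `reshape`, `tokenize` and `COLUMN` are hypothetical, the tokenization rule (lowercase, keep runs of letters and apostrophes) is an assumption, and so is the choice to skip empty cells. The student ID in the example is made up.

```python
import re

# Matches headers such as 'WPBNA_PG1-2': story letter, first page, last page.
COLUMN = re.compile(r'WPBN([AB])_PG(\d+)-(\d+)')

def tokenize(text):
    '''Hypothetical tokenizer: lowercase, keep runs of letters/apostrophes.'''
    return re.findall(r"[a-z']+", text.lower())

def reshape(header, row):
    '''Yield (ID, STORY, PAGE, TOKENS, TEXT) tuples for one wide row.'''
    student_id = row[0]
    for name, text in zip(header[1:], row[1:]):
        match = COLUMN.match(name.strip())
        if not match or not text.strip():
            continue    # ASSUMPTION: skip unrecognized columns and empty cells
        story = match.group(1)
        first_page = int(match.group(2))
        page = (first_page + 1) // 2    # PG1-2 -> 1, PG3-4 -> 2, ... PG9-10 -> 5
        yield student_id, story, page, ' '.join(tokenize(text)), text

header = ['StudentID', 'WPBNA_PG1-2', 'WPBNB_PG9-10']
row = ['s01', 'The dog ran.', 'A cat!']    # made-up example row
for record in reshape(header, row):
    print("\t".join(str(field) for field in record))
```

Each output record holds one student/story/page combination, which is the shape query.py expects when it reads the first four tab-separated fields of each row.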
Files

query.py   - script used to generate report.xls from data.xls
data.xls   - data file described above
report.xls - resulting report containing word token and type counts for each subject/story

Email Record

Date: May 17, 2013 10:51:50 AM CDT
Subject: word counts
Here is the data set from our field study.  As we discussed on Tuesday we are interested
in getting token, types and MLU for each story for each student.  These are organized so
that each student ID has two stories across a row.
That means for each student ID we need 3 counts for story 1 (WPBNA_PG 1-2, PG 3-4,
PG 5-6, PG 7-8) and 3 counts for story 2 (WPBNB PG1-2, PG 3-4, PG 5-6, PG 7-8).

  
## query.py
from collections import defaultdict as dd

count = dd(int)
types = dd(set)

data = open('data.xls')
header = data.readline()    # omit header

def pprint(*items):
    '''Pretty-print items.'''
    print("\t".join(str(i) for i in items))    # parenthesized form runs on Python 2 and 3

for row in data:
    id, story, page, tokens = row.split('\t')[:4]
    key = (id, story)
    for t in tokens.split():    # split on any whitespace so empty strings are not counted
        if t == 'p':  continue
        count[key] += 1     # increment token count for (id, story)
        types[key].add(t)   # add token to set of types for (id, story)

pprint('ID', 'STORY', 'TOKENS', 'TYPES')     # report header

for key, tokens in sorted(count.items()):
    id, story = key
    pprint(id, story, tokens, len(types[key]))