@joyrexus
Last active December 17, 2015 13:19
Query script and report for NORC request.

README

NORC is requesting counts of word types and tokens for each student (for each story) as well as MLU (mean length of utterance).

NOTE: MLU was requested, but I'm not sure how to obtain it, given that there are often multiple utterances per row/field and these utterances are not reliably or consistently delimited.
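If the utterances in a field could be assumed to end in sentence-final punctuation, MLU could be roughly approximated along the following lines. This is only a sketch and not part of query.py: the `mlu_in_words` name and the punctuation rule are assumptions, and the real data does not reliably follow that rule. It also counts words rather than morphemes, which is a common simplification of MLU.

```python
import re

def mlu_in_words(text):
    """Approximate MLU (in words) for one field of text.

    ASSUMPTION: each utterance ends with '.', '?' or '!'. The source data
    is not reliably delimited this way, so treat the result as a rough
    estimate only.
    """
    # Split on runs of sentence-final punctuation and drop empty pieces.
    utterances = [u.split() for u in re.split(r'[.?!]+', text) if u.strip()]
    if not utterances:
        return 0.0
    # Mean number of whitespace-separated words per utterance.
    return sum(len(u) for u in utterances) / float(len(utterances))
```

For example, `mlu_in_words("the dog ran. he barked!")` treats the field as two utterances of three and two words.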

They provided an Excel file (data.xlsx, originally WPB_all_data_5_16_13.xlsx) containing all utterances to be analyzed.

The provided data table contains 11 columns:

INDEX  COL    HEADER
    0    A    StudentID

    1    B    WPBNA_PG1-2 
    2    C    WPBNA_PG3-4
    3    D    WPBNA_PG5-6 
    4    E    WPBNA_PG7-8
    5    F    WPBNA_PG9-10
    
    6    G    WPBNB_PG1-2
    7    H    WPBNB_PG3-4
    8    I    WPBNB_PG5-6
    9    J    WPBNB_PG7-8
   10    K    WPBNB_PG9-10

We converted this into a more convenient format with the following columns:

ID     - student id
STORY  - A or B
PAGE   - 1 (1-2), 2 (3-4), etc.
TOKENS - word tokens parsed from TEXT 
TEXT   - original text

See data.xls for the resulting data table.
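The conversion from the 11-column layout to one record per student, story and page can be sketched as follows. This is an illustration, not the script actually used: `to_long` and `tokenize` are hypothetical names, each input row is assumed to be a list of 11 cell strings already read from the spreadsheet, and the tokenization rule (lowercase, keep runs of letters and apostrophes) is an assumption about how TOKENS was derived from TEXT.

```python
import re

STORIES = ['A', 'B']
PAGES_PER_STORY = 5   # PG1-2, PG3-4, ..., PG9-10

def tokenize(text):
    # ASSUMPTION: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def to_long(row):
    """Yield (ID, STORY, PAGE, TOKENS, TEXT) records from one 11-cell row."""
    student_id = row[0]
    for s, story in enumerate(STORIES):
        for page in range(1, PAGES_PER_STORY + 1):
            # Columns 1-5 hold story A, columns 6-10 hold story B.
            text = row[1 + s * PAGES_PER_STORY + (page - 1)]
            yield (student_id, story, page, ' '.join(tokenize(text)), text)
```

Each input row yields ten records, one per story page, so empty cells still produce a record with empty TOKENS and TEXT.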

Files

  • query.py - script used to generate report.xls from data.xls

  • data.xls - data file described above

  • report.xls - resulting report containing word token and type counts for each subject/story.

Email Record

Date: May 17, 2013 10:51:50 AM CDT
Subject: word counts

Here is the data set from our field study. As we discussed on Tuesday we are interested in getting token, types and MLU for each story for each student. These are organized so that each student ID has two stories across a row.

That means for each student ID we need 3 counts for story 1 (WPBNA_PG 1-2, PG 3-4, PG 5- 6, PG 7-8) and 3 counts for story 2 (WPBNB PG1-2, PG 3-4, PG 5-6, PG 7-8).

query.py

from collections import defaultdict as dd

count = dd(int)     # token count for each (id, story)
types = dd(set)     # set of distinct tokens (types) for each (id, story)

data = open('data.xls')     # read as tab-delimited text
header = data.readline()    # omit header

def pprint(*items):
    '''Pretty-print items as one tab-separated line.'''
    print("\t".join(str(i) for i in items))

for row in data:
    id, story, page, tokens = row.split('\t')[:4]
    key = (id, story)
    for t in tokens.split(' '):
        if t == 'p': continue   # skip 'p' tokens
        count[key] += 1         # increment token count for (id, story)
        types[key].add(t)       # add token to set of types for (id, story)

pprint('ID', 'STORY', 'TOKENS', 'TYPES')    # report header
for key, tokens in sorted(count.items()):
    id, story = key
    pprint(id, story, tokens, len(types[key]))