Compute number of MasteryTask updates per day per mission type
"""This script is hack to get a quick idea of how many mastery challenges
are being done under each mission. It's hacky for a lot of reasons.. I
don't have time to list them all. But here's a few. :)
*) The user_mission (and thus the mission) associated with each LearningTask
is the mission at the time of the creation. For MasteryChallenges, this
may not be a problem-- I'm not sure. But if this was extended to work
on, say, PracticeTasks, a user could create the task in one mission, then
switch missions and actually do the problems in another misison. This
script would not understand that.
*) There may be a lot of double counting in here, because this script works
off the data pipeline which pulls data from the datastore that has
been modified in the last hour. Thus, if someone works on a MasteryTask
across an hour boundary on the wall clock, there is likely to be
multiple copies of that entity stored on disk. They get merged later
when loading into hive, but that hasn't happened yet in the data
this script operates on. Why am I not freaked out about this? Because
I'm mostly interested in the *relative* count between missions, and
I'm hoping that the double counting occurs the same amount in all missions.
USAGE:
mastery_tasks_per_mission.py [YYYY-MM-DD]
It will output a single CSV line summarizing the counts for each day, e.g.,:
YYYY-MM-DD,num_allofmath_mts,num_othermission_mts
If no day is provided as a command line argument, it will process all days
for which data is present on the anlaytics machine.
It should be run from the ~/analytics/src diretory on the analytics machine.
"""
from collections import defaultdict
import glob
import json
import os
import pickle
import sys

import gae_util
gae_util.fix_sys_path()  # must run before the App Engine imports below

from google.appengine.api import datastore
from google.appengine.datastore import entity_pb
ALL_OF_MATH_KEY = 'ag5zfmtoYW4tYWNhZGVteXIWCxIHTWlzc2lvbiIJYWxsb2ZtYXRoDA'
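# (The constant above appears to be the urlsafe-encoded datastore key for
# the Mission entity named 'allofmath'; it is used at the bottom to split
# counts into "all of math" vs. everything else.)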
DATA_PATH = '/ebs/kadata2/daily_new/'
DATA_PATH_FULL = DATA_PATH + '%s/LearningTask/*.pickle.gz'
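# For example, for a hypothetical day '2014-01-20', DATA_PATH_FULL expands
# to '/ebs/kadata2/daily_new/2014-01-20/LearningTask/*.pickle.gz'.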
stats = defaultdict(int)


def process_file(filename, stats):
    """This function takes the name of a gzipped file of a pickled list of
    protobufs. Phew, that's a mouthful! It increments the counts of
    mastery tasks in the 'stats' argument appropriately.
    """
    print >> sys.stderr, filename
    os.system("gunzip --stdout %s > /ebs/modeling/jace/temp/lt.pickle"
              % filename)
    with open('/ebs/modeling/jace/temp/lt.pickle', 'rb') as pkl_file:
        pb_list = pickle.load(pkl_file)
    for pb in pb_list:
        entity = datastore.Entity._FromPb(entity_pb.EntityProto(pb))
        if 'MasteryTask' in entity['class']:
            if 'user_mission' in entity and entity['user_mission']:
                # The mission appears to be the last ':'-separated segment
                # of the UserMission key name.
                mission = entity['user_mission'].id_or_name().split(':')[-1]
            else:
                mission = "NONE"
            stats[mission] += 1
        else:
            pass  # print "non-mastery task"
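

# NOTE: a hedged sketch, not part of the original pipeline and not called
# anywhere. One way to reduce the double counting described in the module
# docstring would be to keep only one snapshot per datastore key before
# counting. This assumes the hourly snapshots of the same MasteryTask share
# a key, which is presumably what the later Hive merge relies on.
def dedupe_by_key(pb_list):
    """Decode protobufs and keep one entity per datastore key."""
    seen = set()
    unique = []
    for pb in pb_list:
        entity = datastore.Entity._FromPb(entity_pb.EntityProto(pb))
        key = str(entity.key())  # encoded key identifies the entity
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique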


if len(sys.argv) < 2:
    dirnames = glob.glob(DATA_PATH + '*')
    dirnames = sorted([os.path.basename(d) for d in dirnames])
else:
    dirnames = [sys.argv[1]]

print >> sys.stderr, dirnames

for day in dirnames:
    # NOTE: day is a directory name, but will be in the format YYYY-MM-DD
    stats = defaultdict(int)

    # Get all the file names for the day, and process them
    filenames = glob.glob(DATA_PATH_FULL % day)
    for filename in filenames:
        process_file(filename, stats)
    print >> sys.stderr, json.dumps(dict(stats), indent=4)

    all_of_math_count = stats[ALL_OF_MATH_KEY]
    other_mission_count = sum(stats.values()) - all_of_math_count

    # output the line to stdout
    print "%s, %d, %d" % (day, all_of_math_count, other_mission_count)