@edavis
Created January 3, 2020 17:36
Determining when baseball games become unique
#!/usr/bin/env python3
import csv
from itertools import groupby
from collections import Counter

# Number of events per game to include in the sequence. Keep incrementing
# and re-running this script until .most_common() returns all 1s.
N_EVENTS = 5

count = Counter()
with open('MLB.CWEVENTS', newline='') as infile, \
        open('sequences.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # groupby assumes the rows for each game are contiguous in the input.
    for gid, events in groupby(reader, key=lambda row: row['GAME_ID']):
        print(gid)
        sequence = []
        for event in list(events)[:N_EVENTS]:
            # One token per pitch character, followed by the event code.
            sequence += list(event['PITCH_SEQ_TX']) + [event['EVENT_CD']]
        joined = ','.join(sequence)
        count.update([joined])
        writer.writerow([gid, joined])

# Grep for the output sequence in sequences.csv to figure out
# which GID produced it.
print(count.most_common(3))
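
The closing comment in the script suggests grepping sequences.csv for a repeated sequence to find the games behind it. As a rough alternative, the lookup could be done in Python. This is only a sketch: `find_game_ids` is a hypothetical helper, not part of the original gist, and the game IDs in the sample data are made up.

```python
import csv
import io


def find_game_ids(rows, sequence):
    """Return the game IDs whose joined sequence equals `sequence`.

    `rows` is any iterable of [gid, joined] pairs, such as a csv.reader
    over sequences.csv.
    """
    return [gid for gid, joined in rows if joined == sequence]


# Made-up sample in the same two-column layout the script writes.
sample = io.StringIO('GAME1,"B,X,2"\r\nGAME2,"X,2"\r\n')
print(find_game_ids(csv.reader(sample), 'B,X,2'))
```

Against the real file, the `io.StringIO` sample would be replaced by `open('sequences.csv', newline='')`.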
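
#!/usr/bin/env bash
# Run cwevent for one season ($1) and collect the output in $2.
# The first call creates the file and passes -n; later calls append without it.
function run_cwevent {
    if [ ! -f "$2" ]; then
        cwevent -n -f 0-10,7,29,34 -y "$1" "${1}"*.EV[AN] > "$2"
    else
        cwevent -f 0-10,7,29,34 -y "$1" "${1}"*.EV[AN] >> "$2"
    fi
}

# Each directory holds one decade of event files.
pushd 2000seve/
for year in $(seq 2000 2009); do
    run_cwevent "$year" ../MLB.CWEVENTS
done
popd

pushd 2010seve/
for year in $(seq 2010 2019); do
    run_cwevent "$year" ../MLB.CWEVENTS
done
popd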
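
The Python script's comments describe raising the event limit by hand and re-running until every sequence is unique. Once MLB.CWEVENTS exists, that loop could be automated along these lines. This is a minimal sketch, not the author's method: `load_games`, `sequence_key`, and `min_unique_prefix` are hypothetical names, and it assumes the whole file fits in memory and that each game's rows are contiguous.

```python
import csv
from itertools import groupby


def load_games(path):
    """Read cwevent output (with a header row) into {game_id: [event rows]}."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        return {gid: list(rows)
                for gid, rows in groupby(reader, key=lambda r: r['GAME_ID'])}


def sequence_key(events, n):
    """Join pitch characters and event codes for the first n events."""
    parts = []
    for event in events[:n]:
        parts += list(event['PITCH_SEQ_TX']) + [event['EVENT_CD']]
    return ','.join(parts)


def min_unique_prefix(games):
    """Return the smallest n at which every game's first-n key is distinct.

    Returns None if some games never diverge (or if `games` is empty).
    """
    longest = max((len(events) for events in games.values()), default=0)
    for n in range(1, longest + 1):
        keys = {sequence_key(events, n) for events in games.values()}
        if len(keys) == len(games):
            return n
    return None
```

Calling `min_unique_prefix(load_games('MLB.CWEVENTS'))` would then give the event count that the manual procedure converges on. The sketch rebuilds every key at each step, which is simple but slow on twenty seasons of data.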