#!/usr/bin/env python
# How the corpus was built:
# 1. Start with the Belfry WebComics Index: http://new.belfrycomics.net/view/all.
#    (Choose "All Comics" and "Daily" as filters.) Spider down all of the
#    linked pages.
# 2. Throw out the few (about 5) pages that are clearly not single comics.
#    Doing this early makes it likely our training and test sets will still
#    be close to 70/30.
# 3. Divide the remainder into a training set (for optimizing weights and
#    getting best-case performance) and a test set (for seeing how we do on
#    unseen data). We use a 70/30 split only because it is a common rule of
#    thumb. Since I'm writing features by hand, I start with a small subset
#    of the training set (30 comics) and validate on another small subset.
#    If validation goes poorly, we fold that validation set into the training
#    set, look for patterns of error in it, and adjust the features to
#    eliminate them. Then we validate against a new subset, and so on.
#    Hopefully, we start to do okay before we run out of data. The test set
#    is f
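
# The helpers below are an illustrative sketch, not part of the original
# script: they show one way to implement the 70/30 split and the rolling
# 30-comic validation subsets described in step 3. The names (split_corpus,
# validation_chunks) and the fixed seed are assumptions for the example.
import random


def split_corpus(pages, train_fraction=0.7, seed=0):
    """Shuffle the corpus and split it into training and test sets."""
    pages = list(pages)
    # Fixed seed so the split is repeatable across runs.
    random.Random(seed).shuffle(pages)
    cut = int(len(pages) * train_fraction)
    return pages[:cut], pages[cut:]


def validation_chunks(training_set, chunk_size=30):
    """Yield successive small validation subsets from the training set.

    When a validation round goes poorly, the caller folds that chunk back
    into the working training data, adjusts the features, and draws the
    next chunk -- the loop sketched in step 3 above.
    """
    for start in range(0, len(training_set), chunk_size):
        yield training_set[start:start + chunk_size]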