This document is still a scratch
You’ll want a directory to do this in so that you don’t screw up your machine.
#!/bin/sh | |
### | |
# SOME COMMANDS WILL NOT WORK ON macOS (Sierra or newer) | |
# For Sierra or newer, see https://github.com/mathiasbynens/dotfiles/blob/master/.macos | |
### | |
# Alot of these configs have been taken from the various places | |
# on the web, most from here | |
# https://github.com/mathiasbynens/dotfiles/blob/5b3c8418ed42d93af2e647dc9d122f25cc034871/.osx |
A lot of math grad school is reading books and papers and trying to understand what's going on. The difficulty is that reading math is not like reading a mystery thriller, and it's not even like reading a history book or a New York Times article.
The main issue is that, by the time you get to the frontiers of math, the words to describe the concepts don't really exist yet. Communicating these ideas is a bit like trying to explain a vacuum cleaner to someone who has never seen one, except you're only allowed to use words that are four letters long or shorter.
What can you say?
#!/usr/bin/python | |
# crf.py (by Graham Neubig) | |
# This script trains conditional random fields (CRFs) | |
# stdin: A corpus of WORD_POS WORD_POS WORD_POS sentences | |
# stdout: Feature vectors for emission and transition properties | |
from collections import defaultdict | |
from math import log, exp | |
import sys |
#!/usr/bin/env python | |
import base64 | |
from Crypto import Random | |
from Crypto.Cipher import AES | |
BS = 16 | |
pad = lambda s: s + (BS - len(s) % BS) * chr(BS - len(s) % BS) | |
unpad = lambda s : s[0:-ord(s[-1])] |
To demonstrate text classification with Scikit Learn, we'll build a simple spam filter. While the filters in production for services like Gmail will obviously be vastly more sophisticated, the model we'll have by the end of this chapter is effective and surprisingly accurate.
Spam filtering is the "hello world" of document classification, but something to be aware of is that we aren't limited to two classes. The classifier we will be using supports multi-class classification, which opens up vast opportunities like author identification, support email routing, etc… However, in this example we'll just stick to two classes: SPAM and HAM.
For this exercise, we'll be using a combination of the Enron-Spam data sets and the SpamAssassin public corpus. Both are publicly available for download and are retreived from the internet during the setup phase of the example code that goes with this chapter.
var graph = 'graph.json' | |
var w = 960, | |
h = 700, | |
r = 10; | |
var vis = d3.select(".graph") | |
.append("svg:svg") | |
.attr("width", w) | |
.attr("height", h) |
# OSX for Hackers (Mavericks/Yosemite) | |
# | |
# Source: https://gist.github.com/brandonb927/3195465 | |
#!/bin/sh | |
# Some things taken from here | |
# https://github.com/mathiasbynens/dotfiles/blob/master/.osx | |
# Ask for the administrator password upfront |
""" | |
Multiclass SVMs (Crammer-Singer formulation). | |
A pure Python re-implementation of: | |
Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex. | |
Mathieu Blondel, Akinori Fujino, and Naonori Ueda. | |
ICPR 2014. | |
http://www.mblondel.org/publications/mblondel-icpr2014.pdf | |
""" |