Detecting anomalous senators

This interactive visualization demonstrates Stochastic Outlier Selection (SOS) applied to roll call voting data. It was first presented at the NYC Machine Learning meetup on November 21, 2013. SOS is an unsupervised outlier-selection algorithm by J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik (2012). It uses the concept of affinity to quantify the relationship between data points and then computes an outlier probability for each data point. Intuitively, a data point is selected as an outlier when the other data points have insufficient affinity with it.
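To make the notion of affinity more concrete, here is a minimal Python sketch that turns a dissimilarity matrix into per-point selection probabilities (called binding probabilities further down). It rests on two assumptions that are not stated on this page: the affinity is taken to be a Gaussian-style function of the dissimilarity, and a single shared bandwidth `sigma` is used for all data points. The function name and the toy matrix are illustrative only; see the technical report for the exact definition.

```python
import numpy as np

def binding_probabilities(D, sigma=1.0):
    """Turn a dissimilarity matrix D into row-normalised binding probabilities.

    Assumption: Gaussian-style affinity with one shared bandwidth `sigma`.
    """
    D = np.asarray(D, dtype=float)
    A = np.exp(-(D ** 2) / (2.0 * sigma ** 2))  # affinity decays with dissimilarity
    np.fill_diagonal(A, 0.0)                    # a point has no affinity with itself
    return A / A.sum(axis=1, keepdims=True)     # each row sums to 1

# Toy example: three points on a line at positions 0, 1 and 10.
D_toy = np.array([[0.0, 1.0, 10.0],
                  [1.0, 0.0, 9.0],
                  [10.0, 9.0, 0.0]])
B_toy = binding_probabilities(D_toy)
```

In the toy example the first two points have almost all of their affinity with each other, so hardly any affinity is left for the distant third point.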

The data set contains 103 data points (senators) and 172 features (votes). The dissimilarity between two data points is their Euclidean distance. Each circle in the scatter plot represents a senator; its location is determined by applying the non-linear dimensionality reduction technique t-SNE to the data set. Please note that SOS itself is applied to the original, 172-dimensional data set, not to the two-dimensional t-SNE coordinates.
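As a rough illustration of the dissimilarity step, the sketch below computes all pairwise Euclidean distances with NumPy. The function name is hypothetical and the tiny matrix stands in for the real 103 × 172 vote matrix.

```python
import numpy as np

def euclidean_dissimilarity(X):
    """Return the matrix of pairwise Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, n_features)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

# Made-up stand-in data: 3 "senators", 2 "votes".
X_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [0.0, 1.0]])
D_toy = euclidean_dissimilarity(X_toy)
```

The result is symmetric with zeros on the diagonal, because every point has distance zero to itself.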

Once the visualization has focus, pressing the n key will cause every data point to independently select one other data point. The resulting graph is called a Stochastic Neighbor Graph (SNG). Given this graph, a data point is an outlier when it is not selected by any other data point (hover over a data point or its corresponding name to highlight which data point it selects). We are, however, interested in the probability that a data point is an outlier. The more SNGs are generated, the more closely the outlier probability is approximated. Press the p key to generate SNGs continuously and the s key to stop. The speed can be controlled with the keys 1 to 9.
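The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `B` is assumed to be a matrix whose entry `B[i, j]` is the probability that data point `i` selects data point `j` (zero diagonal, rows summing to 1), and the function names are hypothetical rather than taken from the demo's source.

```python
import numpy as np

def sample_sng(B, rng):
    """Sample one SNG: every point i independently selects one point j with probability B[i, j]."""
    n = B.shape[0]
    return np.array([rng.choice(n, p=B[i]) for i in range(n)])

def outlier_fractions(B, n_graphs, seed=0):
    """Fraction of sampled SNGs in which each point was selected by nobody."""
    B = np.asarray(B, dtype=float)
    rng = np.random.default_rng(seed)
    n = B.shape[0]
    counts = np.zeros(n)
    for _ in range(n_graphs):
        selected = sample_sng(B, rng)
        is_outlier = np.ones(n, dtype=bool)
        is_outlier[selected] = False  # any point that was selected is not an outlier
        counts += is_outlier
    return counts / n_graphs

# Toy matrix: points 0 and 1 always select each other, point 2 always selects point 0.
B_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
fractions = outlier_fractions(B_toy, n_graphs=100)
```

In this toy case nobody ever selects point 2, so it is an outlier in every sampled graph, while points 0 and 1 never are.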

The bar next to each senator's name represents the fraction of the SNGs generated so far in which that data point was an outlier. The list of senators can be sorted by pressing the r key. Press the u key to reverse the list.

It can take a very long time before the outlier fractions approximate the outlier probabilities. Fortunately, the outlier probability of a data point can be computed directly in closed form, because it is defined as the joint probability that none of the other data points binds to it. The SNGs therefore serve a purely illustrative purpose. Please see the SOS repository on GitHub for a Python implementation and the technical report.
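Because every data point makes its selection independently, the joint probability mentioned above reduces to a product over the other data points. The sketch below shows this under the assumption that `B[j, i]` holds the probability that point `j` binds to point `i` and that the diagonal is zero; the function name is illustrative and not necessarily the one used in the SOS repository.

```python
import numpy as np

def outlier_probabilities(B):
    """Probability that no other point binds to point i: the product over j of (1 - B[j, i])."""
    B = np.asarray(B, dtype=float)
    # The zero diagonal contributes a factor of 1, so it does not affect the product.
    return np.prod(1.0 - B, axis=0)

# Toy matrix: every point selects each of the two others with probability 0.5.
B_uniform = np.array([[0.0, 0.5, 0.5],
                      [0.5, 0.0, 0.5],
                      [0.5, 0.5, 0.0]])
p = outlier_probabilities(B_uniform)
```

For the uniform toy matrix each point is skipped by both of the others with probability 0.5 × 0.5, which gives an outlier probability of 0.25 for every point.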

The following Drakefile shows how the outlier probabilities for the senators (and the binding probabilities needed for this demo) can be computed from the command line.

``````
; Get dataset
dataset.csv <- [-timecheck]
  curl -s https://raw.github.com/VikParuchuri/political-positions/master/113_frame.csv > $OUTPUT

; Extract features: drop the first column and the label columns,
; remove the header line, and replace NA with 4
features.csv <- dataset.csv
  csvcut $INPUT -C 1,name,party,state | sed '1d;s/NA/4/g' > $OUTPUT

; Extract labels
labels.csv <- dataset.csv
  csvcut $INPUT -c name,party,state > $OUTPUT

; Compute outlier probabilities using SOS
outlier.csv <- features.csv
  echo 'outlier' > $OUTPUT
  < $INPUT ../bin/sos -p 50 >> $OUTPUT

; Combine labels and outlier probabilities and sort
result.csv <- labels.csv, outlier.csv
  paste -d, $INPUT0 $INPUT1 | csvsort -rc outlier > $OUTPUT

; Compute binding probabilities for the demo at http://bl.ocks.org/jeroenjanssens/7608890
bindings.csv <- features.csv
  < $INPUT ../bin/sos -p 50 -b > $OUTPUT
``````
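For readers without csvkit, the two extraction steps in the Drakefile can be approximated in plain Python. This is a sketch under assumptions: the column layout of the sample below (an unnamed first column followed by `name`, `party`, `state` and the vote columns) is a guess at the shape of the real file, the sample rows are made up, and the function name is hypothetical.

```python
import csv
import io

LABEL_COLUMNS = ("name", "party", "state")

def split_features_and_labels(csv_text):
    """Mimic the 'Extract features' and 'Extract labels' steps on CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    label_idx = [header.index(c) for c in LABEL_COLUMNS]
    drop = set(label_idx) | {0}  # csvcut column 1 is the first column
    feature_idx = [i for i in range(len(header)) if i not in drop]
    # Like the sed command: no header line, and NA becomes 4.
    features = [["4" if row[i] == "NA" else row[i] for i in feature_idx]
                for row in body]
    labels = [[row[i] for i in label_idx] for row in body]
    return features, labels

# Made-up sample in the assumed layout.
sample = (",name,party,state,vote1,vote2\n"
          "1,Senator A,X,NY,1,NA\n"
          "2,Senator B,Y,CA,6,1\n")
features, labels = split_features_and_labels(sample)
```

Unlike `csvcut -c`, this sketch returns the labels without a header row, so a header would have to be added back before combining them with the outlier column.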
