@mimno
mimno / README.md
Last active July 14, 2017 13:53
Monte Carlo sampling for 538 riddler (pizza slices)

A Monte Carlo approximation of the area of overlap between two circles, inspired by this 538 riddler question. Sample random points in a rectangle and record which circle(s) each point falls into. If the "both" count is more than half of the "left" or "right" count, choose the middle slices.
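The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the gist's own code: the function names, the equal radii, the horizontal placement of the circles, and the choice to count "left" and "right" as whole circles (overlap included) are all assumptions.

```python
import random


def estimate_overlap(center_distance, radius=1.0, n_samples=100_000, seed=0):
    """Count random points that land in the left circle, the right circle, and both.

    Assumed layout: left circle centered at (0, 0), right circle at
    (center_distance, 0), both with the same radius.
    """
    rng = random.Random(seed)
    # Bounding rectangle that contains both circles.
    x_min, x_max = -radius, center_distance + radius
    y_min, y_max = -radius, radius
    counts = {"left": 0, "right": 0, "both": 0}
    for _ in range(n_samples):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        in_left = x * x + y * y <= radius * radius
        in_right = (x - center_distance) ** 2 + y * y <= radius * radius
        if in_left:
            counts["left"] += 1
        if in_right:
            counts["right"] += 1
        if in_left and in_right:
            counts["both"] += 1
    # Each area is the rectangle's area scaled by the fraction of hits.
    box_area = (x_max - x_min) * (y_max - y_min)
    areas = {name: box_area * c / n_samples for name, c in counts.items()}
    return counts, areas


def middle_is_better(counts):
    """Decision rule from the description: is "both" more than half of "left" or "right"?"""
    return (counts["both"] > 0.5 * counts["left"]
            or counts["both"] > 0.5 * counts["right"])
```

Calling `estimate_overlap(1.0)` returns the raw counts and the corresponding area estimates for two unit circles whose centers are one radius apart; passing the counts to `middle_is_better` applies the rule above. Increasing `n_samples` tightens the estimate at the cost of run time.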

mimno / index.html
Last active November 8, 2016 20:39
Clone of Financial Times dot map
<html>
<head>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script src="https://d3js.org/topojson.v2.min.js"></script>
<style>
path { stroke: #555; fill: none; }
</style>
</head>
<body>
<svg></svg>
mimno / .block
Last active June 8, 2016 15:51
Recipe topics
height: 1300
mimno / index.html
Last active June 7, 2016 15:34
Divided Lines from a Scale
<html>
<head>
<title>Divided Lines from a scale</title>
<meta charset="utf-8" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.16/d3.min.js"></script>
</head>
<style>
svg {
height: 500px;
width: 500px;
<html>
<head>
<!-- Load the d3 library. -->
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
<link href='http://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<style>
body { font-family: "Open Sans"; }
text.stateID { dominant-baseline: middle; text-anchor: middle; }
</style>
</head>
mimno / index.html
Last active August 29, 2015 14:16
Maps with fade-in borders
<html>
<head>
<link href='http://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<!-- Load the d3 library. -->
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
<script src="http://d3js.org/topojson.v1.min.js"></script>
<style>
/* put a border around the svg element so we can see the coordinate system better. */
body { font-family: "Open Sans"; } div { margin: 30px; }
</style>

Burstiness experiment

These files implement an experiment based on Madsen et al., "Modeling Word Burstiness Using the Dirichlet Distribution." They observe that the probability of seeing a large number of occurrences of a rare word in a single document is much larger in real natural language text than we would expect if words were sampled i.i.d. from a multinomial distribution. The multinomial model is good at predicting how many times a low-frequency word like "camelid" will appear in a corpus overall, but it assigns far too little probability to the event that all instances of "camelid" will occur in a single document.
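To get a feel for how little probability the i.i.d. model gives to that event, here is a small sketch. It is not the experiment from these files: it assumes a toy corpus of equal-length documents, and the function names and parameters are hypothetical. Under i.i.d. sampling, given that a word occurs a fixed number of times in the corpus, every set of token positions is equally likely, so the chance that all occurrences share one document can be counted exactly and checked by simulation.

```python
import random
from math import comb


def prob_all_in_one_doc(n_docs, doc_len, n_occurrences):
    """Exact probability that all occurrences fall in a single document.

    Assumes n_docs documents of doc_len tokens each and n_occurrences >= 1
    occurrences placed uniformly at random over distinct token positions.
    """
    if n_occurrences < 1:
        raise ValueError("n_occurrences must be at least 1")
    if n_occurrences > doc_len:
        return 0.0
    same_doc = n_docs * comb(doc_len, n_occurrences)
    anywhere = comb(n_docs * doc_len, n_occurrences)
    return same_doc / anywhere


def simulate_all_in_one_doc(n_docs, doc_len, n_occurrences,
                            n_trials=20_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    total_positions = n_docs * doc_len
    hits = 0
    for _ in range(n_trials):
        # Pick distinct token positions for the word's occurrences.
        positions = rng.sample(range(total_positions), n_occurrences)
        # Map each position to the document that contains it.
        docs = {p // doc_len for p in positions}
        if len(docs) == 1:
            hits += 1
    return hits / n_trials
```

For example, `prob_all_in_one_doc(1000, 500, 5)` gives the chance that five occurrences of a word all land in one of a thousand 500-token documents, which is roughly `1000 ** -4` under these assumptions. Comparing such a number against how often the event happens in a real corpus is the kind of gap the paragraph above describes.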

We simulate this by counting the event that word w occurs N times in a document from a real corpus,

This page demonstrates a CSS style that replicates a page from a 19th-century book. The font is Old Standard TT, supplied by Google Fonts. The edge shading is done with an internal drop shadow. I sampled the colors for the background and the shaded edge from a scanned image of a real hundred-year-old page.

mimno / README.md
Last active August 29, 2015 14:01

Here are some points, which happen to have been sampled from six round Gaussian distributions. Can we figure out where the centers of those Gaussians were, and which points came from which cluster? Theoretically, this is a hard problem. There's no way to know that we have the best clustering without checking all possible assignments of points to clusters.

We use an iterative algorithm, k-means. Rather than solving the hard problem of finding the best cluster assignments, this algorithm alternates between two easy problems: assigning each point to the cluster center closest to it, and moving each center to the average of its assigned points.

Click the "Random Clusters" button to drop in six random cluster centers.
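The two alternating steps of k-means can be sketched in a few lines. This is an illustrative version, not the code behind the page: the function names are hypothetical, the initial centers are drawn uniformly inside the bounding box of the data (one way to "drop in" random centers), and a center with no assigned points is simply left where it is.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: (point[0] - centers[j][0]) ** 2
        + (point[1] - centers[j][1]) ** 2,
    )


def kmeans(points, k, n_iters=20, seed=0):
    """Run a fixed number of k-means iterations on 2D points."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Assumed initialization: random centers inside the data's bounding box.
    centers = [
        (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        for _ in range(k)
    ]
    for _ in range(n_iters):
        # Easy problem 1: assign each point to its closest center.
        assignments = [nearest(p, centers) for p in points]
        # Easy problem 2: move each center to the average of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    # Final assignment so the labels match the returned centers.
    assignments = [nearest(p, centers) for p in points]
    return centers, assignments
```

Calling `kmeans(points, 6)` on a list of `(x, y)` tuples returns six centers and one cluster index per point. Different seeds can give different results, since the algorithm only finds a local optimum that depends on where the random centers start.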

We just sampled two groups of points (red and blue), and fit a regression line to each one.

How much confidence do we have in these fitted linear models? This page demonstrates three variations of randomization tests, each of which shows what regression lines from "similar" datasets would look like. The buttons at the top will sample a random, but similar, dataset in one of three ways. We will then fit regression lines for the randomized dataset, and then go back to the original data.
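The text here does not name the three sampling schemes, so the sketch below shows two common ones as assumptions: a bootstrap (resample points with replacement) and a permutation (shuffle the y values across the x values). Function names are hypothetical, and the regression is plain least squares for a single group of points.

```python
import random


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_slopes(points, n_replicates=1000, seed=0):
    """Slopes of lines fit to datasets resampled with replacement (assumed scheme)."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_replicates:
        sample = [rng.choice(points) for _ in points]
        # Skip degenerate resamples where every x is identical.
        if len({x for x, _ in sample}) < 2:
            continue
        slopes.append(fit_line(sample)[0])
    return slopes


def permutation_slopes(points, n_replicates=1000, seed=0):
    """Slopes after shuffling y values, which breaks any x-y relationship (assumed scheme)."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    slopes = []
    for _ in range(n_replicates):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        slopes.append(fit_line(list(zip(xs, shuffled)))[0])
    return slopes
```

One way to use the replicates, again as an assumption about what is being compared, is to count how many permuted slopes exceed the original: `sum(s > fit_line(points)[0] for s in permutation_slopes(points))`. A small count suggests the original line reflects more than random noise.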

Comparing the "real" line to these replicated lines can tell us whether the original line tells us something interesting about the dataset, or if it's just fitting random noise. The numbers at the top will tell us how many of the replicated models have had a greater slope than