Created
July 23, 2017 00:03
Trying to find "bins" representing columns using math.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Binning Horizontal Page Position Data\n", | |
"\n", | |
"22 July 2017" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"const dsv = require('d3-dsv');\n", | |
"const arr = require('d3-array');\n", | |
"const stats = require('simple-statistics');\n", | |
"const fs = require('fs');\n", | |
"const path = require('path');" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"let __dirname = path.resolve();\n", | |
"let filePath = path.join(__dirname, '..', '/data/modified/xpositions_years_dataset.csv')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Read in the contents of the CSV file synchronously …" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"let dataString = fs.readFileSync(filePath, {encoding: 'utf-8'});" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Parse the contents into an array of objects …" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"let data = dsv.csvParse(dataString)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[Array] [\"dish_id\",\"year\",\"scaled_xpos\"]" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data.columns" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### D3\n", | |
"\n", | |
"This section looks at a couple of different ways of dividing the data into bins, starting with the histogram generator function from D3. Got the idea from [this Stack Overflow question](https://stackoverflow.com/questions/37445495/binning-an-array-in-javascript-for-a-histogram).\n", | |
"\n", | |
"The value function in each case below lets us tell the generator to use the `scaled_xpos` value without having to mess with the original object." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Preset number of bins (here: 4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"let histogram = arr.histogram()\n", | |
" .value(function(d,i,array) { return d['scaled_xpos']; })\n", | |
" .thresholds(4);" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Freedman-Diaconis threshold algorithm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"let histogram1 = arr.histogram()\n", | |
" .value(function(d,i,array) { return d['scaled_xpos']; })\n", | |
" .thresholds(arr.thresholdFreedmanDiaconis);" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Scott threshold algorithm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"let histogram2 = arr.histogram()\n", | |
" .value(function(d,i,array) { return d['scaled_xpos']; })\n", | |
" .thresholds(arr.thresholdScott);" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Sturges threshold algorithm (d3 default)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"let histogram3 = arr.histogram()\n", | |
" .value(function(d,i,array) { return d['scaled_xpos']; })\n", | |
" .thresholds(arr.thresholdSturges);" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Run the histogram generators on the data to get the bins …" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"// preset\n", | |
"let bins = histogram(data)\n", | |
"\n", | |
"//Freedman-Diaconis\n", | |
"let bins1 = histogram1(data)\n", | |
"\n", | |
"// Scott\n", | |
"let bins2 = histogram2(data)\n", | |
"\n", | |
"//Sturges\n", | |
"let bins3 = histogram3(data)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bins.length // Sanity check!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"197" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"//Freedman-Diaconis\n", | |
"bins1.length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"197" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"//Scott\n", | |
"bins2.length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"19" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// Sturges\n", | |
"bins3.length" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If we wanted to report the number of values and the lower and upper bounds of each bin from a binning method, we could run the following:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"// x0 and x1 are the lower and upper bounds of each bin, not array indexes\n", | |
"for (let b = 0; b < bins1.length; b++) {\n", | |
" console.log(\"Lower bound: \" + bins1[b].x0);\n", | |
" console.log(\"Upper bound: \" + bins1[b].x1);\n", | |
" console.log(\"Bin size: \" + bins1[b].length);\n", | |
" console.log(\"====\")\n", | |
"}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Simple Statistics\n", | |
"\n", | |
"None of the above is particularly satisfying. Let's try a method from another library …" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Have to explicitly pull out the position values …" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"let positions = data.map((d) => { return d['scaled_xpos']; })" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"\"111.429\"" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"positions[0] // Sanity check" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Some Googling led me to [this Cross Validated (Stack Exchange) discussion](https://stats.stackexchange.com/questions/34242/how-to-intelligently-bin-a-collection-of-sorted-data). First, I followed the suggestion to look at using the [Jenks natural breaks optimization](http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) algorithm. I remembered seeing an implementation of this in an impressive-looking JavaScript statistics library I'd perused before called [`simple-statistics`](https://www.npmjs.com/package/simple-statistics).\n", | |
"\n", | |
"When I went looking for the algorithm in the current version of the library, I saw that it had been superseded by an implementation of [ckmeans clustering](https://simplestatistics.org/docs/#ckmeans). So, let's try that. For the sake of this experiment, I picked 4 clusters — thinking of our previous visualizations.\n", | |
"\n", | |
"(But of course, I don't really know what I am doing with this algorithm!)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"// This takes a couple hours to run!\n", | |
"let clusters = stats.ckmeans(positions, 4)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As promised, this generates four clusters …" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clusters.length // sanity check" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Out of curiosity, how many values are in each cluster?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"436568" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clusters[0].length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"333537" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clusters[1].length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"306477" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clusters[2].length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"255927" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clusters[3].length" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"At the moment, I just want to know the values at the \"breaks\" between the clusters, so grab the last value from each array (representing a cluster)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"let breaks = clusters.map(function(item, index, array) {\n", | |
" return item.slice(-1)[0]\n", | |
"});" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[Array] [\"238.667\",\"433.333\",\"620.0\",\"985.333\"]" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"breaks" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So, I'm interpreting this as four bins of x-position data. If the position is at most 238.667, consider it part of \"column 1\". If the position is greater than 238.667 and at most 433.333, consider it in \"column 2\", etc. …" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Next: combine these break point positions with our dot scatter graphs and see if they make any sense." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "NodeJS v6.9.0", | |
"language": "javascript", | |
"name": "nodejs" | |
}, | |
"language_info": { | |
"codemirror_mode": "javascript", | |
"file_extension": ".js", | |
"mimetype": "text/javascript", | |
"name": "nodejs", | |
"pygments_lexer": "javascript", | |
"version": "0.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |