Trying to find "bins" representing columns using math.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binning Horizontal Page Position Data\n",
"\n",
"22 July 2017"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"const dsv = require('d3-dsv');\n",
"const arr = require('d3-array');\n",
"const stats = require('simple-statistics');\n",
"const fs = require('fs');\n",
"const path = require('path');"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"let __dirname = path.resolve();\n",
"let filePath = path.join(__dirname, '..', '/data/modified/xpositions_years_dataset.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in the contents of the CSV file synchronously …"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"let dataString = fs.readFileSync(filePath, {encoding: 'utf-8'});"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse the contents into an array of objects …"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"let data = dsv.csvParse(dataString)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Array] [\"dish_id\",\"year\",\"scaled_xpos\"]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### D3\n",
"\n",
"Let's look at a couple of different ways of dividing the data into bins, starting with the histogram generator function from D3. I got the idea from [this StackOverflow question](https://stackoverflow.com/questions/37445495/binning-an-array-in-javascript-for-a-histogram).\n",
"\n",
"The value accessor in each case below lets us tell the generator to use the `scaled_xpos` value of each row without having to modify the original objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Preset number of bins (here: 4)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"let histogram = arr.histogram()\n",
" .value(function(d,i,array) { return d['scaled_xpos']; })\n",
" .thresholds(4);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Freedman-Diaconis threshold algorithm"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"let histogram1 = arr.histogram()\n",
" .value(function(d,i,array) { return d['scaled_xpos']; })\n",
" .thresholds(arr.thresholdFreedmanDiaconis);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scott threshold algorithm"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"let histogram2 = arr.histogram()\n",
" .value(function(d,i,array) { return d['scaled_xpos']; })\n",
" .thresholds(arr.thresholdScott);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sturges threshold algorithm (d3 default)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"let histogram3 = arr.histogram()\n",
" .value(function(d,i,array) { return d['scaled_xpos']; })\n",
" .thresholds(arr.thresholdSturges);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the histogram generators on the data to get the bins …"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"// preset\n",
"let bins = histogram(data)\n",
"\n",
"//Freedman-Diaconis\n",
"let bins1 = histogram1(data)\n",
"\n",
"// Scott\n",
"let bins2 = histogram2(data)\n",
"\n",
"//Sturges\n",
"let bins3 = histogram3(data)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins.length // Sanity check!"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"197"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"//Freedman-Diaconis\n",
"bins1.length"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"197"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"//Scott\n",
"bins2.length"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// Sturges\n",
"bins3.length"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to report the number of values in each bin along with the bin's lower and upper bounds (`x0` and `x1`), we could run the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// x0 and x1 are the lower and upper bounds of a bin (data values, not array indexes)\n",
"for (let b = 0; b < bins1.length; b++) {\n",
" console.log(\"Lower bound (x0): \" + bins1[b].x0);\n",
" console.log(\"Upper bound (x1): \" + bins1[b].x1);\n",
" console.log(\"Bin size: \" + bins1[b].length);\n",
" console.log(\"====\")\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Statistics\n",
"\n",
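"None of the above is particularly satisfying: the Freedman-Diaconis and Scott rules each produce 197 bins and Sturges produces 19, far more than the handful of columns we expect on a page, while the preset count just cuts the range into four bins without regard to where the values cluster. Let's try a method from another library …"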
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ckmeans` takes a plain array of values rather than an accessor function, so we have to explicitly pull out the position values …"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"let positions = data.map((d) => { return d['scaled_xpos']; })"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"111.429\""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"positions[0] // Sanity check"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some Googling led me to [this Cross Validated discussion](https://stats.stackexchange.com/questions/34242/how-to-intelligently-bin-a-collection-of-sorted-data). First, I followed the suggestion to look at the [Jenks natural breaks optimization](http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) algorithm. I remembered seeing an implementation of it in an impressive-looking JavaScript statistics library I'd perused before called [`simple-statistics`](https://www.npmjs.com/package/simple-statistics).\n",
"\n",
"When I went looking for the algorithm in the current version of the library, I saw that it had been superseded by an implementation of [ckmeans clustering](https://simplestatistics.org/docs/#ckmeans). So, let's try that. For the sake of this experiment, I picked 4 clusters, thinking of our previous visualizations.\n",
"\n",
"(But of course, I don't really know what I am doing with this algorithm!)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"// This takes a couple of hours to run!\n",
"let clusters = stats.ckmeans(positions, 4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As promised, this generates four clusters …"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters.length // sanity check"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of curiosity, how many values are in each cluster?"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"436568"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters[0].length"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"333537"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters[1].length"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"306477"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters[2].length"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"255927"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clusters[3].length"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the moment, I just want to know the values at the \"breaks\" between the clusters, so grab the last value from each array (representing a cluster)."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"let breaks = clusters.map(function(item, index, array) {\n",
" return item.slice(-1)[0]\n",
"});"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Array] [\"238.667\",\"433.333\",\"620.0\",\"985.333\"]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breaks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, I'm interpreting this as four bins of x-position data. If the position is at most 238.667, consider it part of \"column 1\". If the position is greater than 238.667 and at most 433.333, consider it in \"column 2\", etc. …"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next: combine these break point positions with our dot scatter graphs and see if they make any sense."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "NodeJS v6.9.0",
"language": "javascript",
"name": "nodejs"
},
"language_info": {
"codemirror_mode": "javascript",
"file_extension": ".js",
"mimetype": "text/javascript",
"name": "nodejs",
"pygments_lexer": "javascript",
"version": "0.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
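A possible follow-up to the notebook above, sketched under a few assumptions: once the four break values are known, assigning a row to a column is a lookup against those upper edges. The break values are copied from the `breaks` output in the notebook. They are strings there (as is `positions[0]`), so the sketch converts them with `Number` first. The helper name `columnFor` and the choice to put anything beyond the last break into the last column are my own assumptions, not something the notebook does.

```javascript
// Upper edge of each cluster, copied from the `breaks` output in the notebook.
// The notebook holds these as strings, so convert them to numbers first.
const breaks = ["238.667", "433.333", "620.0", "985.333"].map(Number);

// Hypothetical helper (not part of the notebook): returns the 1-based column
// number for an x position, given the ascending upper edges of the columns.
function columnFor(xpos, upperEdges) {
  const x = Number(xpos); // accepts the string values that come out of the CSV
  for (let i = 0; i < upperEdges.length; i++) {
    if (x <= upperEdges[i]) {
      return i + 1;
    }
  }
  // Assumption: anything beyond the last break still counts as the last column.
  return upperEdges.length;
}

console.log(columnFor("111.429", breaks)); // first position in the dataset
console.log(columnFor(238.668, breaks));   // just past the first break
```

A linear scan is plenty for four breaks. With many more breaks, a binary search over the same sorted edges would be the natural replacement.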