Skip to content

Instantly share code, notes, and snippets.

@kojix2
Created February 28, 2019 08:34
Show Gist options
  • Save kojix2/986aef56aedcbfb59be3283a886457cf to your computer and use it in GitHub Desktop.
Save kojix2/986aef56aedcbfb59be3283a886457cf to your computer and use it in GitHub Desktop.
Usage of DataFrame.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Usage of Daru::DataFrame\n",
"\n",
"Daru::DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and vectors). \n",
"\n",
"Arithmetic operations align on both row and vector labels. Can be thought of as a container for Daru::Vector objects. This is primary data structure used by daru and gems that depend on it (like statsample).\n",
"\n",
"You should use DataFrame because it allows you to easily store, access and manipulate labelled data, plot it using an interactive graph library and perform various statistics operations by ignoring missing data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"application/javascript": [
"if(window['d3'] === undefined ||\n",
" window['Nyaplot'] === undefined){\n",
" var path = {\"d3\":\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\",\"downloadable\":\"http://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n",
"\n",
"\n",
"\n",
" var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n",
"\n",
" require.config({paths: path, shim:shim});\n",
"\n",
"\n",
"require(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n",
"\n",
"\tvar script = d3.select(\"head\")\n",
"\t .append(\"script\")\n",
"\t .attr(\"src\", \"http://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n",
"\t .attr(\"async\", true);\n",
"\n",
"\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n",
"\n",
"\n",
"\t var event = document.createEvent(\"HTMLEvents\");\n",
"\t event.initEvent(\"load_nyaplot\",false,false);\n",
"\t window.dispatchEvent(event);\n",
"\t console.log('Finished loading Nyaplotjs');\n",
"\n",
"\t};\n",
"\n",
"\n",
"});});\n",
"}\n"
],
"text/plain": [
"\"if(window['d3'] === undefined ||\\n window['Nyaplot'] === undefined){\\n var path = {\\\"d3\\\":\\\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\\\",\\\"downloadable\\\":\\\"http://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\\\"};\\n\\n\\n\\n var shim = {\\\"d3\\\":{\\\"exports\\\":\\\"d3\\\"},\\\"downloadable\\\":{\\\"exports\\\":\\\"downloadable\\\"}};\\n\\n require.config({paths: path, shim:shim});\\n\\n\\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\\n\\n\\tvar script = d3.select(\\\"head\\\")\\n\\t .append(\\\"script\\\")\\n\\t .attr(\\\"src\\\", \\\"http://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\\\")\\n\\t .attr(\\\"async\\\", true);\\n\\n\\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\\n\\n\\n\\t var event = document.createEvent(\\\"HTMLEvents\\\");\\n\\t event.initEvent(\\\"load_nyaplot\\\",false,false);\\n\\t window.dispatchEvent(event);\\n\\t console.log('Finished loading Nyaplotjs');\\n\\n\\t};\\n\\n\\n});});\\n}\\n\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"require 'daru'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Creation and Access\n",
"\n",
"Daru offers many options for creating DataFrames. You can create it from Hashes, Arrays, Daru::Vectors or even load it from CSV files, Excel spreadsheets or SQL databases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**From Array of Arrays**\n",
"\n",
"In the example below, I'm specifying the vertical _Vectors_ of the DataFrame as an Array of Arrays and I specify their names in the `:order` option, by supplying an Array of names that the vectors should be called by. \n",
"\n",
"In the `:index` option, we'll specify the names of the rows of the DataFrame. If the `:index` is not given, DataFrame will assign numerical indexes starting from 0 to each row."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(4x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>one</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>two</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>three</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>four</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(4x2)>\n",
" a b\n",
" one 1 1\n",
" two 2 2\n",
" three 3 3\n",
" four 4 4"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new([[1,2,3,4], [1,2,3,4]],order: [:a, :b], index: [:one, :two, :three, :four])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**From Hash of Arrays**\n",
"\n",
"A similar DataFrame can be created from a Hash. In this case the keys of the Hash are the names of the vectors in the DataFrame. The `:order` option, if specified, will only serve to decide the orientation of the Vectors in the DataFrame. Not specfiying `:order` in this case will align the vectors alphabetically."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(4x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>a</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(4x2)>\n",
" b a\n",
" 0 1 1\n",
" 1 2 2\n",
" 2 3 3\n",
" 3 4 4"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({a: [1,2,3,4], b: [1,2,3,4]},order: [:b, :a])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**From Hash of Vectors**\n",
"\n",
"A DataFrame can be created from a Hash of Daru::Vectors and their names. The name of the vector will be the key and the corresponding value, a Daru::Vector.\n",
"\n",
"The values of the DataFrame are aligned according to the index of each Daru::Vector. A *nil* is assigned whenever a particular index is not available for one Vector but is present in any of the other Vectors, and the resulting index of the DataFrame is a union of the indexes of all the Vectors in alphabetical order.\n",
"\n",
"The sizes or indexes of the supplied Vectors don't matter."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(6x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>v1</th>\n",
" \n",
" <th>v2</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>absent</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(6x2)>\n",
" v1 v2\n",
" a 1 33\n",
" absent nil 44\n",
" b 2 11\n",
" c 3 nil\n",
" d 4 nil\n",
" e 5 22"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v1 = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :d, :e])\n",
"v2 = Daru::Vector.new([11,22,33,44], index: [:b, :e, :a, :absent])\n",
"\n",
"Daru::DataFrame.new({v1: v1, v2: v2})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**The 'clone' option**\n",
"\n",
"If you have Vectors that have _exactly_ the same index, you can specify the `:clone` option to DataFrame. Setting `:clone` to *false* will direct daru to utilize the same Vector objects in creating the DataFrame, that you have specified in the Hash and will prevent their cloning when being stored in the DataFrame. Thus the object IDs of the Vectors will remain the same.\n",
"\n",
"Be wary of making changes in the DataFrame or the supplied vectors if you set `:clone` to *false*."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"equalness a : true\n",
"equalness b : true\n"
]
}
],
"source": [
"v1 = Daru::Vector.new([1,2,3,4,5])\n",
"v2 = Daru::Vector.new([11,22,33,44,55])\n",
"\n",
"df = Daru::DataFrame.new({a: v1, b: v2}, clone: false)\n",
"puts \"equalness a : #{v1.object_id == df[:a].object_id}\\nequalness b : #{v2.object_id == df[:b].object_id}\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Creating with rows**\n",
"\n",
"If you want to create a DataFrame by specifying the rows, you can do so by specifying an _Array of Arrays_ or _Array of Vectors_ to the `.rows` method.\n",
"\n",
"Lets first see creating DataFrames from an Array of Arrays:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(4x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(4x4)>\n",
" a b c d\n",
" 0 1 11 10 a\n",
" 1 2 22 20 4\n",
" 2 3 33 30 g\n",
" 3 4 44 40 3"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Daru::DataFrame.rows([\n",
" [1,11,10,'a'],\n",
" [2,22,20 ,4 ],\n",
" [3,33,30,'g'],\n",
" [4,44,40, 3 ]\n",
" ], order: [:a, :b, :c, :d])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you supply an Array of Vectors to the `.rows` method, the index of the Vectors will be automatically assigned as the names of the vectors of the DataFrame. Moreover, elements will be aligned by their indexes in the completed DataFrame.\n",
"\n",
"If a Vector does not have a particular index that is present in other Vectors, a **nil** will be placed in that position.\n",
"\n",
"The `:order` option should be set in this case to whatever values you want to keep in your DataFrame to avoid unexpected behaviour."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(2x5) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>odd</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>55</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(2x5)>\n",
" a b c d odd\n",
" 0 1 2 3 4 nil\n",
" 1 11 44 22 nil 55"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r1 = Daru::Vector.new([1,2,3,4,5], index: [:a, :b, :c, :d, :e])\n",
"r2 = Daru::Vector.new([11,22,33,44,55], index: [:a, :c, :e, :b, :odd])\n",
"\n",
"Daru::DataFrame.rows([r1,r2], order: [:a, :b, :c, :d, :odd])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading data from different data sources\n",
"\n",
"Daru::DataFrame currently supports loading data from CSV files, Excel spreadsheets and SQL databases. You can also write your DataFrames to these kinds of files using some simple functions. Daru also supports saving and loading data by Marshalling. Lets go through them one by one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**CSV (Comma Separated Values) files**\n",
"\n",
"To demonstrate loading and writing to CSV files, we'll read some sales data from [this CSV file](https://github.com/v0dro/daru/blob/master/spec/fixtures/sales-funnel.csv)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(17x8) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>Account</th>\n",
" \n",
" <th>Name</th>\n",
" \n",
" <th>Rep</th>\n",
" \n",
" <th>Manager</th>\n",
" \n",
" <th>Product</th>\n",
" \n",
" <th>Quantity</th>\n",
" \n",
" <th>Price</th>\n",
" \n",
" <th>Status</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>714466</td>\n",
" \n",
" <td>Trantow-Barrows</td>\n",
" \n",
" <td>Craig Booker</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>30000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>714466</td>\n",
" \n",
" <td>Trantow-Barrows</td>\n",
" \n",
" <td>Craig Booker</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>Software</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>10000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>714466</td>\n",
" \n",
" <td>Trantow-Barrows</td>\n",
" \n",
" <td>Craig Booker</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>Maintenance</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>5000</td>\n",
" \n",
" <td>pending</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>737550</td>\n",
" \n",
" <td>Fritsch, Russel and Anderson</td>\n",
" \n",
" <td>Craig Booker</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>35000</td>\n",
" \n",
" <td>declined</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>146832</td>\n",
" \n",
" <td>Kiehn-Spinka</td>\n",
" \n",
" <td>Daniel Hilton</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>65000</td>\n",
" \n",
" <td>won</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>5</td>\n",
" \n",
" <td>218895</td>\n",
" \n",
" <td>Kulas Inc</td>\n",
" \n",
" <td>Daniel Hilton</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>40000</td>\n",
" \n",
" <td>pending</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>6</td>\n",
" \n",
" <td>218895</td>\n",
" \n",
" <td>Kulas Inc</td>\n",
" \n",
" <td>Daniel Hilton</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>Software</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>10000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>7</td>\n",
" \n",
" <td>412290</td>\n",
" \n",
" <td>Jerde-Hilpert</td>\n",
" \n",
" <td>John Smith</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>Maintenance</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>5000</td>\n",
" \n",
" <td>pending</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>8</td>\n",
" \n",
" <td>740150</td>\n",
" \n",
" <td>Barton LLC</td>\n",
" \n",
" <td>John Smith</td>\n",
" \n",
" <td>Debra Henley</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>35000</td>\n",
" \n",
" <td>declined</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>9</td>\n",
" \n",
" <td>141962</td>\n",
" \n",
" <td>Herman LLC</td>\n",
" \n",
" <td>Cedric Moss</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>65000</td>\n",
" \n",
" <td>won</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>10</td>\n",
" \n",
" <td>163416</td>\n",
" \n",
" <td>Purdy-Kunde</td>\n",
" \n",
" <td>Cedric Moss</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>30000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>11</td>\n",
" \n",
" <td>239344</td>\n",
" \n",
" <td>Stokes LLC</td>\n",
" \n",
" <td>Cedric Moss</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>Maintenance</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>5000</td>\n",
" \n",
" <td>pending</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>12</td>\n",
" \n",
" <td>239344</td>\n",
" \n",
" <td>Stokes LLC</td>\n",
" \n",
" <td>Cedric Moss</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>Software</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>10000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>13</td>\n",
" \n",
" <td>307599</td>\n",
" \n",
" <td>Kassulke, Ondricka and Metz</td>\n",
" \n",
" <td>Wendy Yule</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>Maintenance</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>7000</td>\n",
" \n",
" <td>won</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>14</td>\n",
" \n",
" <td>688981</td>\n",
" \n",
" <td>Keeling LLC</td>\n",
" \n",
" <td>Wendy Yule</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>100000</td>\n",
" \n",
" <td>won</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>15</td>\n",
" \n",
" <td>729833</td>\n",
" \n",
" <td>Koepp Ltd</td>\n",
" \n",
" <td>Wendy Yule</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>CPU</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>65000</td>\n",
" \n",
" <td>declined</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>16</td>\n",
" \n",
" <td>729833</td>\n",
" \n",
" <td>Koepp Ltd</td>\n",
" \n",
" <td>Wendy Yule</td>\n",
" \n",
" <td>Fred Anderson</td>\n",
" \n",
" <td>Monitor</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>5000</td>\n",
" \n",
" <td>presented</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(17x8)>\n",
" Account Name Rep Manager Product Quantity Price Status\n",
" 0 714466 Trantow-Ba Craig Book Debra Henl CPU 1 30000 presented\n",
" 1 714466 Trantow-Ba Craig Book Debra Henl Software 1 10000 presented\n",
" 2 714466 Trantow-Ba Craig Book Debra Henl Maintenanc 2 5000 pending\n",
" 3 737550 Fritsch, R Craig Book Debra Henl CPU 1 35000 declined\n",
" 4 146832 Kiehn-Spin Daniel Hil Debra Henl CPU 2 65000 won\n",
" 5 218895 Kulas Inc Daniel Hil Debra Henl CPU 2 40000 pending\n",
" 6 218895 Kulas Inc Daniel Hil Debra Henl Software 1 10000 presented\n",
" 7 412290 Jerde-Hilp John Smith Debra Henl Maintenanc 2 5000 pending\n",
" 8 740150 Barton LLC John Smith Debra Henl CPU 1 35000 declined\n",
" 9 141962 Herman LLC Cedric Mos Fred Ander CPU 2 65000 won\n",
" 10 163416 Purdy-Kund Cedric Mos Fred Ander CPU 1 30000 presented\n",
" 11 239344 Stokes LLC Cedric Mos Fred Ander Maintenanc 1 5000 pending\n",
" 12 239344 Stokes LLC Cedric Mos Fred Ander Software 1 10000 presented\n",
" 13 307599 Kassulke, Wendy Yule Fred Ander Maintenanc 3 7000 won\n",
" 14 688981 Keeling LL Wendy Yule Fred Ander CPU 5 100000 won\n",
" ... ... ... ... ... ... ... ... ..."
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Daru::DataFrame.from_csv 'data/sales-funnel.csv'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can specify all the options to the `.from_csv` function that you do to the Ruby `CSV.read()` function, since this is what is used internally.\n",
"\n",
"For example, if the columns in your CSV file are separated by something other that commas, you can use the `:col_sep` option. If you want to convert numeric values to numbers and not keep them as strings, you can use the `:converters` option and set it to `:numeric`.\n",
"\n",
"The `.from_csv` function uses the following defaults for reading CSV files (that are passed into the `CSV.read()` function):\n",
"``` ruby\n",
"\n",
"{\n",
" :col_sep => ',',\n",
" :converters => :numeric\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `#write_csv` function is used for writing the contents of a DataFrame to a CSV file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Excel Files**\n",
"\n",
"The `::from_excel` method can be used for loading Excel files. The [spreadsheet](https://github.com/zdavatz/spreadsheet) gem is used in the background in this case, so whatever variants of Excel compatible files can be loaded by spreadsheet should be easily loadable in this case too.\n",
"\n",
"Let me demonstrate this using [this Excel file](https://github.com/v0dro/daru/blob/master/spec/fixtures/test_xls.xls)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(6x5) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>id</th>\n",
" \n",
" <th>name</th>\n",
" \n",
" <th>age</th>\n",
" \n",
" <th>city</th>\n",
" \n",
" <th>a1</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>Alex</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" <td>New York</td>\n",
" \n",
" <td>a,b</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>Claude</td>\n",
" \n",
" <td>23</td>\n",
" \n",
" <td>London</td>\n",
" \n",
" <td>b,c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>Peter</td>\n",
" \n",
" <td>25</td>\n",
" \n",
" <td>London</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>Franz</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>Paris</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>George</td>\n",
" \n",
" <td>5.5</td>\n",
" \n",
" <td>Tome</td>\n",
" \n",
" <td>a,b,c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>5</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>Fernand</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(6x5)>\n",
" id name age city a1\n",
" 0 1 Alex 20 New York a,b\n",
" 1 2 Claude 23 London b,c\n",
" 2 3 Peter 25 London a\n",
" 3 4 Franz nil Paris nil\n",
" 4 5 George 5.5 Tome a,b,c\n",
" 5 6 Fernand nil nil nil"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.from_excel 'data/test_xls.xls'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise, the `#write_excel` method can be used for writing data stored in the DataFrame to an Excel file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**SQL Databases**\n",
"\n",
"Similar to the examples above you can use the `::from_sql` and `#write_sql` methods for interacting with SQL databases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Plaintext Files**\n",
"\n",
"In case your data is stored as columns in plaintext (for example [this](https://github.com/v0dro/daru/blob/master/spec/fixtures/bank2.dat) file), you can use the `::from_plaintext` method for loading data from the file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying and accessing data\n",
"\n",
"Daru::DataFrame consists of rows and vectors, both of which can be accessed by their labels using an intuitive syntax.\n",
"\n",
"Consider the following DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(7x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>e</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>f</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(7x3)>\n",
" a b c\n",
" a 1 a 11\n",
" b 2 b 22\n",
" c 3 c 33\n",
" d 4 d 44\n",
" e 5 e 55\n",
" f 6 f 66\n",
" g 7 g 77"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({\n",
" a: [1,2,3,4,5,6,7], \n",
" b: ['a','b','c','d','e','f','g'], \n",
" c: [11,22,33,44,55,66,77]\n",
" }, index: [:a,:b,:c,:d,:e,:f,:g])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access any Vector using the `#[]` operator. The resultant Vector is returned as a Daru::Vector which preserves the index of the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(7) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th> </th>\n",
" <th>b</th>\n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>a</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>b</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>c</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" <td>d</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" <td>e</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" <td>g</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(7)>\n",
" b\n",
" a a\n",
" b b\n",
" c c\n",
" d d\n",
" e e\n",
" f f\n",
" g g"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:b]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify a Range inside `#[]` to return a DataFrame which contains the columns within the Range."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(7x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>e</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>f</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(7x2)>\n",
" b c\n",
" a a 11\n",
" b b 22\n",
" c c 33\n",
" d d 44\n",
" e e 55\n",
" f f 66\n",
" g g 77"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:b..:c]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A row can be accessed using the `#row[]` method. The row is also returned as a Daru::Vector and any operations so any operations on a Daru::Vector will be valid on the row too.\n",
"\n",
"The index of the returned row corresponds to the names of the Vectors."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(3) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th> </th>\n",
" <th>c</th>\n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>3</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>c</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>33</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(3)>\n",
" c\n",
" a 3\n",
" b c\n",
" c 33"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.row[:c]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here too, you can specify a Range, and you will receive a Daru::DataFrame instead of a Daru::Vector containing the relevant rows specified by the Range."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(3x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>e</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>f</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(3x3)>\n",
" a b c\n",
" d 4 d 44\n",
" e 5 e 55\n",
" f 6 f 66"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.row[:d..:f]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows can be accessed using numerical indices too (this works for columns too)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(3) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th> </th>\n",
" <th>3</th>\n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>4</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>d</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>44</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(3)>\n",
" 3\n",
" a 4\n",
" b d\n",
" c 44"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.row[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get the top 3 rows by passing an argument to the `#head` method (or the bottom 3 using `#tail`)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(3x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(3x3)>\n",
" a b c\n",
" a 1 a 11\n",
" b 2 b 22\n",
" c 3 c 33"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering, selecting, adding and deleting data\n",
"\n",
"A column can be added by simply specifying it's name and value using the `#[]=` operator."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(7x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>e</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>f</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(7x4)>\n",
" a b c d\n",
" a 1 a 11 11\n",
" b 2 b 22 44\n",
" c 3 c 33 99\n",
" d 4 d 44 176\n",
" e 5 e 55 275\n",
" f 6 f 66 396\n",
" g 7 g 77 539"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:d] = df[:a] * df[:c]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can delete a vector with the `#delete_vector` method."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(7x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(7x3)>\n",
" a c d\n",
" a 1 11 11\n",
" b 2 22 44\n",
" c 3 33 99\n",
" d 4 44 176\n",
" e 5 55 275\n",
" f 6 66 396\n",
" g 7 77 539"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.delete_vector :b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you try to insert a Daru::Vector that does not conform to the index of the DataFrame, the values will be appropriately placed such that they conform to the DataFrame's index.\n",
"\n",
"*nil* is inserted wherever a similar index cannot be found on the DataFrame.\n",
"\n",
"Inserting an Array will require the Array to be of the same length as that of the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(7x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" <td>88</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(7x4)>\n",
" a c d b\n",
" a 1 11 11 a\n",
" b 2 22 44 c\n",
" c 3 33 99 33\n",
" d 4 44 176 b\n",
" e 5 55 275 d\n",
" f 6 66 396 88\n",
" g 7 77 539 nil"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:b] = Daru::Vector.new(['a',33,'b','c','d',88,'e'], index: [:a,:c,:d,:b,:e,:f,:extra])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inserting a row also works similarly."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(8x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" <td>88</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(8x4)>\n",
" a c d b\n",
" a 1 11 11 a\n",
" b 2 22 44 c\n",
" c 3 33 99 33\n",
" d 4 44 176 b\n",
" e 5 55 275 d\n",
" f 6 66 396 88\n",
" g 7 77 539 nil\n",
" latest 30 10 40 20"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.row[:latest] = Daru::Vector.new([10,20,30,40], index: [:c,:b,:a,:d])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In both row and vector insertion, if the index specified is not present in the DataFrame, a new index is created and appended or if it is present then the existing index will be over-ridden."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For filtering out certain rows/vectors based on their values, use the `#filter` method. By default it iterates over vectors and keeps those vectors for which the block returns **true**. It accepts an optional _axis_ argument which lets you specify whether you want to iterate over vectors or rows."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(8x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(8x2)>\n",
" a c\n",
" a 1 11\n",
" b 2 22\n",
" c 3 33\n",
" d 4 44\n",
" e 5 55\n",
" f 6 66\n",
" g 7 77\n",
" latest 30 10"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Filter vectors.\n",
"\n",
"# The `type` method returns either :numeric or :object. The :numeric type states\n",
"# that the Vector consists only of numerical data (combined with missing data).\n",
"# If the type happens to be :object, it contains non-numerical data like strings\n",
"# or symbols. Statistical operations will not be possible on Vectors of type :object.\n",
"\n",
"df.filter do |vector|\n",
" vector.type == :numeric and vector.median < 50\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(3x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(3x4)>\n",
" a c d b\n",
" a 1 11 11 a\n",
" b 2 22 44 c\n",
" latest 30 10 40 20"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Filter rows\n",
"\n",
"df.filter(:row) do |row|\n",
" row[:a] + row[:d] < 100\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame can be transposed using the `#transpose` method."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(4x8) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>e</th>\n",
" \n",
" <th>f</th>\n",
" \n",
" <th>g</th>\n",
" \n",
" <th>latest</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" <td>88</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>20</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(4x8)>\n",
" a b c d e f g latest\n",
" a 1 2 3 4 5 6 7 30\n",
" c 11 22 33 44 55 66 77 10\n",
" d 11 44 99 176 275 396 539 40\n",
" b a c 33 b d 88 nil 20"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.transpose"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arithmetic\n",
"\n",
"All arithmetic operations can be performed on a Daru::DataFrame and you can a DataFrame with another DataFrame, a Vector or a scalar.\n",
"\n",
"Indexes are aligned appropriately whenever an operation is performed with a non-scalar quantity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**With a Scalar**\n",
"\n",
"Adding a scalar quantity will add that number to all the numeric type vectors, keeping :object type Vectors the way they originally were."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(8x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>21</td>\n",
" \n",
" <td>21</td>\n",
" \n",
" <td>a</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>12</td>\n",
" \n",
" <td>32</td>\n",
" \n",
" <td>54</td>\n",
" \n",
" <td>c</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>13</td>\n",
" \n",
" <td>43</td>\n",
" \n",
" <td>109</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>14</td>\n",
" \n",
" <td>54</td>\n",
" \n",
" <td>186</td>\n",
" \n",
" <td>b</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>15</td>\n",
" \n",
" <td>65</td>\n",
" \n",
" <td>285</td>\n",
" \n",
" <td>d</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>16</td>\n",
" \n",
" <td>76</td>\n",
" \n",
" <td>406</td>\n",
" \n",
" <td>88</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>17</td>\n",
" \n",
" <td>87</td>\n",
" \n",
" <td>549</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" <td>50</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(8x4)>\n",
" a c d b\n",
" a 11 21 21 a\n",
" b 12 32 54 c\n",
" c 13 43 109 33\n",
" d 14 54 186 b\n",
" e 15 65 285 d\n",
" f 16 76 406 88\n",
" g 17 87 549 nil\n",
" latest 40 20 50 20"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df + 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**With another DataFrame**\n",
"\n",
"Performing arithmetic between two data frames will align the elements by row and column indexes of either dataframe. \n",
"\n",
"If a column is present in one dataframe but not in the other, the resultant dataframe will be populated with a column full of *nils* of that name.\n",
"\n",
"DataFrames need not be of the same size for this operation to succeed."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(9x5) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>f</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>84</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>92</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>54</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>51</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>79</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>29</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>96</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>95</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>157</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>39</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>27</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>older</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(9x5)>\n",
" a b c d f\n",
" a 84 nil 92 nil nil\n",
" b 22 nil 54 nil nil\n",
" c 51 nil 79 nil nil\n",
" d 29 nil 96 nil nil\n",
" e nil nil nil nil nil\n",
" f 95 nil 157 nil nil\n",
" g nil nil nil nil nil\n",
" latest 39 nil 27 nil nil\n",
" older nil nil nil nil nil"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = Daru::DataFrame.new({\n",
" a: 7.times.map { rand(100) },\n",
" f: 7.times.map { rand(100) },\n",
" c: 7.times.map { rand(100) }\n",
" }, index: [:a,:b,:c,:d,:latest,:older,:f])\n",
"\n",
"df1 + df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Statistics\n",
"\n",
"Statistical methods perform basic statistics on numerical Vectors only.\n",
"\n",
"For a whole list of methods see the Daru::Maths::Statistics::DataFrame module in the [docs](https://rubygems.org/gems/daru).\n",
"\n",
"To demonstrate, the `#mean` method calculates the mean of each numeric vector and returns a Daru::Vector with the vector's name as the index alongwith the corresponding value."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(3) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th> </th>\n",
" <th>mean</th>\n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>7.25</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>39.75</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" <td>197.5</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(3)>\n",
" mean\n",
" a 7.25\n",
" c 39.75\n",
" d 197.5"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `#describe` method can be used for knowing various statistics in one shot."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(5x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>count</td>\n",
" \n",
" <td>8</td>\n",
" \n",
" <td>8</td>\n",
" \n",
" <td>8</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>mean</td>\n",
" \n",
" <td>7.25</td>\n",
" \n",
" <td>39.75</td>\n",
" \n",
" <td>197.5</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>std</td>\n",
" \n",
" <td>9.40744386111339</td>\n",
" \n",
" <td>25.06990227344335</td>\n",
" \n",
" <td>190.99214643539665</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>min</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>max</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(5x3)>\n",
" a c d\n",
" count 8 8 8\n",
" mean 7.25 39.75 197.5\n",
" std 9.40744386 25.0699022 190.992146\n",
" min 1 10 11\n",
" max 30 77 539"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`#cov` will return a covariance matrix of the DataFrame, and it will be properly indexed so you can see the data clearly. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(3x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>88.5</td>\n",
" \n",
" <td>-66.5</td>\n",
" \n",
" <td>-233.0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>-66.5</td>\n",
" \n",
" <td>628.5</td>\n",
" \n",
" <td>4637.0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>-233.0</td>\n",
" \n",
" <td>4637.0</td>\n",
" \n",
" <td>36478.0</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(3x3)>\n",
" a c d\n",
" a 88.5 -66.5 -233.0\n",
" c -66.5 628.5 4637.0\n",
" d -233.0 4637.0 36478.0"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.cov"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise `#corr` computes the correlation matrix."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(3x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>1.0</td>\n",
" \n",
" <td>-0.28196640612394586</td>\n",
" \n",
" <td>-0.12967873822641748</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>-0.28196640612394586</td>\n",
" \n",
" <td>0.9999999999999998</td>\n",
" \n",
" <td>0.9684315851062977</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>-0.12967873822641748</td>\n",
" \n",
" <td>0.9684315851062977</td>\n",
" \n",
" <td>1.0</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(3x3)>\n",
" a c d\n",
" a 1.0 -0.2819664 -0.1296787\n",
" c -0.2819664 0.99999999 0.96843158\n",
" d -0.1296787 0.96843158 1.0"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use [report builder](https://rubygems.org/gems/reportbuilder) to create a quick summary of the DataFrame using the `#summary` method."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"= \n",
" Number of rows: 8\n",
" Element:[a]\n",
" == a\n",
" n :8\n",
" non-missing:8\n",
" median: 4.5\n",
" mean: 7.2500\n",
" std.dev.: 9.4074\n",
" std.err.: 3.3260\n",
" skew: 1.6908\n",
" kurtosis: 1.3190\n",
" Element:[c]\n",
" == c\n",
" n :8\n",
" non-missing:8\n",
" median: 38.5\n",
" mean: 39.7500\n",
" std.dev.: 25.0699\n",
" std.err.: 8.8635\n",
" skew: 0.1381\n",
" kurtosis: -1.7271\n",
" Element:[d]\n",
" == a\n",
" n :8\n",
" non-missing:8\n",
" median: 137.5\n",
" mean: 197.5000\n",
" std.dev.: 190.9921\n",
" std.err.: 67.5259\n",
" skew: 0.5945\n",
" kurtosis: -1.3406\n",
" Element:[b]\n",
" == b\n",
" n :8\n",
" non-missing:7\n",
" factors: a,c,33,b,d,88,20\n",
" mode: a,c,33,b,d,88,20\n",
" Distribution\n",
" a 1 100.00%\n",
" c 1 100.00%\n",
" 33 1 100.00%\n",
" b 1 100.00%\n",
" d 1 100.00%\n",
" 88 1 100.00%\n",
" 20 1 100.00%\n"
]
}
],
"source": [
"puts df.summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Looping and iterators\n",
"\n",
"Daru::DataFrame offers many iterators to loop over either rows or columns. \n",
"\n",
"**#each**\n",
"\n",
"`#each` works exactly like Array#each. The default mode for `each` is to iterate over the columns of the DataFrame. To iterate over rows you must pass the axis, i.e `:row` as an argument."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\"130\", \"1110\", \"1140\", \"a20\"]\n"
]
}
],
"source": [
"# Iterate over vectors\n",
"\n",
"e = []\n",
"df.each do |vector|\n",
" e << vector[:a].to_s + vector[:latest].to_s\n",
"end\n",
"\n",
"puts e"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[11, 44, 99, 176, 275, 396, 539, 300]\n"
]
}
],
"source": [
"# Iterate over rows\n",
"\n",
"r = []\n",
"df.each(:row) do |row|\n",
" r << row[:a] * row[:c]\n",
"end\n",
"\n",
"puts r"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**#map**\n",
"\n",
"The #map iterator works like Array#map. The value returned by each run of the block is added to an Array and the Array is returned.\n",
"\n",
"This method also accepts an `axis` argument, like `#each`. The default is `:vector`."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[7.25, 39.75, 197.5]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Map over vectors. \n",
"\n",
"# The `only_numerics` method returns a DataFrame which contains vectors \n",
"# with only numerical values. Setting the `:clone` option to false will \n",
"# return the same Vector objects that are contained in the original DataFrame.\n",
"\n",
"df.only_numerics(clone: false).map do |vector| \n",
" vector.mean\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[7.666666666666667, 22.666666666666668, 42.0, 74.66666666666667, 111.66666666666667, 139.0, 207.66666666666666, 25.0]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Map over rows.\n",
"\n",
"# Calling `only_numerics` on a Daru::Vector will return a Vector with only numeric and\n",
"# missing data. Data marked as 'missing' is not considered during statistical computation.\n",
"\n",
"df.map(:row) do |row|\n",
" row.only_numerics.mean\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**#recode**\n",
"\n",
"Recode works similarly to `#map`, but an important difference between the two is that recode returns a modified _Daru::DataFrame_ instead of an Array. For this reason, `#recode`expects that every run of the block to return a `Daru::Vector`.\n",
"\n",
"Just like map and each, recode also accepts an optional _axis_ argument."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(8x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>15.0</td>\n",
" \n",
" <td>125.0</td>\n",
" \n",
" <td>505.5</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>6</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>7</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>30</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(8x3)>\n",
" a c d\n",
" a 7 77 275\n",
" b 15.0 125.0 505.5\n",
" c 3 33 99\n",
" d 4 44 176\n",
" e 5 55 275\n",
" f 6 66 396\n",
" g 7 77 539\n",
" latest 30 10 40"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Recode vectors\n",
"\n",
"df.only_numerics(clone: false).recode do |vector|\n",
" vector[:a] = vector[:d] + vector[:c]\n",
" vector[:b] = vector.mean + vector[:a]\n",
" vector # <- return the vector to the block\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(8x4) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" <th>d</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" \n",
" <td>0</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" \n",
" <td>-22</td>\n",
" \n",
" <td>22</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" \n",
" <td>-66</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" <td>99</td>\n",
" \n",
" <td>33</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" \n",
" <td>-132</td>\n",
" \n",
" <td>44</td>\n",
" \n",
" <td>176</td>\n",
" \n",
" <td>0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" \n",
" <td>-220</td>\n",
" \n",
" <td>55</td>\n",
" \n",
" <td>275</td>\n",
" \n",
" <td>0</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" \n",
" <td>-330</td>\n",
" \n",
" <td>66</td>\n",
" \n",
" <td>396</td>\n",
" \n",
" <td>88</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" \n",
" <td>-462</td>\n",
" \n",
" <td>77</td>\n",
" \n",
" <td>539</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" \n",
" <td>-30</td>\n",
" \n",
" <td>10</td>\n",
" \n",
" <td>40</td>\n",
" \n",
" <td>20</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(8x4)>\n",
" a c d b\n",
" a 0 11 11 0\n",
" b -22 22 44 0\n",
" c -66 33 99 33\n",
" d -132 44 176 0\n",
" e -220 55 275 0\n",
" f -330 66 396 88\n",
" g -462 77 539 nil\n",
" latest -30 10 40 20"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Recode rows\n",
"\n",
"df.recode(:row) do |row|\n",
" row[:a] = row[:c] - row[:d]\n",
" row[:b] = row[:b].to_i if row[:b].is_a?(String)\n",
" row\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**#collect**\n",
"\n",
"The `#collect` iterator works similar to `#map`, the only difference being that it returns a Daru::Vector comprising of the results of each block run. The resultant Vector has the same index as that of the axis over which `collect` has iterated.\n",
"\n",
"It also accepts the optional _axis_ argument."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(4) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>9</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>99</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" <td>495</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>121</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(4)>\n",
" a 9\n",
" c 99\n",
" d 495\n",
" b 121"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Collect Vectors\n",
"\n",
"df.collect do |vector|\n",
" vector[:c] + vector[:f]\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(8) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>24</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>69</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" <td>136</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" <td>225</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" <td>336</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" <td>469</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" <td>60</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(8)>\n",
" a 1\n",
" b 24\n",
" c 69\n",
" d 136\n",
" e 225\n",
" f 336\n",
" g 469\n",
" latest 60"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Collect Rows\n",
"\n",
"df.collect(:row) do |row|\n",
" row[:a] + row[:d] - row[:c]\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**#vector_by_calculation**\n",
"\n",
"`#vector_by_calculation` is a DSL that can be used for generating a Daru::Vector based on the results returned by the block.\n",
"\n",
"This DSL lets you refer to elements directly as methods inside the block."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::Vector(8) </b>\n",
"<table>\n",
" <thead>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>a</td>\n",
" <td>23</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>b</td>\n",
" <td>68</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>c</td>\n",
" <td>135</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>d</td>\n",
" <td>224</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>e</td>\n",
" <td>335</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>f</td>\n",
" <td>468</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>g</td>\n",
" <td>623</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>latest</td>\n",
" <td>80</td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::Vector(8)>\n",
" a 23\n",
" b 68\n",
" c 135\n",
" d 224\n",
" e 335\n",
" f 468\n",
" g 623\n",
" latest 80"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.vector_by_calculation { a + c + d }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sorting\n",
"\n",
"Daru::DataFrame offers a robust `#sort` function which can be used for hierarchically sorting the Vectors in the DataFrame.\n",
"\n",
"Here are couple of examples to demonstrate a lot of the options:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(5x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>This</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>dataframe</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>335</td>\n",
" \n",
" <td>is</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>sort</td>\n",
" \n",
" <td>32</td>\n",
" \n",
" <td>for</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>this</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>sorting</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(5x3)>\n",
" a b c\n",
" 0 g 4 This\n",
" 1 g 4 dataframe\n",
" 2 g 335 is\n",
" 3 sort 32 for\n",
" 4 this 11 sorting"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({\n",
" a: ['g', 'g','g','sort', 'this'],\n",
" b: [4,4,335,32,11],\n",
" c: ['This', 'dataframe','is','for','sorting']\n",
" })"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Array passed as an argument to 'sort' tells the method the order\n",
"in which preference of sorting should be given to each Vector.\n",
"\n",
"The **:ascending** option will tell DataFrame the order in which you want\n",
"the Vectors to be sorted. *true* for ascending sort and *false* for \n",
"descending sort.\n",
"\n",
"The **:by** option lets you define a custom attribute for each vector to sort by.\n",
"This works similarly to passing a block to Array#sort_by."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(5x3) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" <th>c</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>335</td>\n",
" \n",
" <td>is</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>This</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>g</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" <td>dataframe</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>sort</td>\n",
" \n",
" <td>32</td>\n",
" \n",
" <td>for</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>this</td>\n",
" \n",
" <td>11</td>\n",
" \n",
" <td>sorting</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(5x3)>\n",
" a b c\n",
" 2 g 335 is\n",
" 0 g 4 This\n",
" 1 g 4 dataframe\n",
" 3 sort 32 for\n",
" 4 this 11 sorting"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort([:a,:b,:c], ascending: [true, false, true], by: {c: lambda { |a| a.size }})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Additional examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort a dataframe with a vector sequence. "
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(5x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>3</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(5x2)>\n",
" a b\n",
" 2 1 3\n",
" 0 1 5\n",
" 3 2 2\n",
" 1 2 4\n",
" 4 3 1"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({a: [1,2,1,2,3], b: [5,4,3,2,1]})\n",
"\n",
"df.sort [:a, :b]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort a dataframe without a block. Here nils will be handled automatically and appear at top. "
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(5x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>3</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>1</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td>-3</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>-1</td>\n",
" \n",
" <td>2</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>5</td>\n",
" \n",
" <td>4</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(5x2)>\n",
" a b\n",
" 1 nil 3\n",
" 3 nil 1\n",
" 0 -3 4\n",
" 2 -1 2\n",
" 4 5 4"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({a: [-3,nil,-1,nil,5], b: [4,3,2,1,4]})\n",
"\n",
"df.sort([:a])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort a dataframe with a block with nils handled automatically. "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(6x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>5</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>-1</td>\n",
" \n",
" <td>x</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>-1</td>\n",
" \n",
" <td>aa</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>aaa</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>baaa</td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(6x2)>\n",
" a b\n",
" 2 1 nil\n",
" 5 1 nil\n",
" 4 -1 x\n",
" 1 -1 aa\n",
" 0 nil aaa\n",
" 3 nil baaa"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })\n",
"\n",
"# df.sort [:b], by: {b: lambda { |a| a.length } }\n",
"# This would give \"NoMethodError: undefined method `length' for nil:NilClass\"\n",
"\n",
"# Instead you could do the following if you want the nils to be handled automatically\n",
"df.sort [:b], by: {b: lambda { |a| a.length } }, handle_nils: true"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort a dataframe with a block with nils handled manually. "
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<b> Daru::DataFrame(6x2) </b>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" \n",
" <tr>\n",
" <th></th>\n",
" \n",
" <th>a</th>\n",
" \n",
" <th>b</th>\n",
" \n",
" </tr>\n",
" \n",
"</thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" <td>4</td>\n",
" \n",
" <td>-1</td>\n",
" \n",
" <td>x</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>1</td>\n",
" \n",
" <td>-1</td>\n",
" \n",
" <td>aa</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>0</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>aaa</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>3</td>\n",
" \n",
" <td></td>\n",
" \n",
" <td>baaa</td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>2</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <td>5</td>\n",
" \n",
" <td>1</td>\n",
" \n",
" <td></td>\n",
" \n",
" </tr>\n",
" \n",
"\n",
" \n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"#<Daru::DataFrame(6x2)>\n",
" a b\n",
" 4 -1 x\n",
" 1 -1 aa\n",
" 0 nil aaa\n",
" 3 nil baaa\n",
" 2 1 nil\n",
" 5 1 nil"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = Daru::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })\n",
"\n",
"# To print nils at the bottom one can use lambda { |a| (a.nil?)[1]:[0,a.length] }\n",
"df.sort [:b], by: {b: lambda { |a| (a.nil?)?[1]:[0,a.length] } }, handle_nils: true"
]
}
],
"metadata": {
"@webio": {
"lastCommId": "9b4556e8a2374d0da72924d471c9fd7b",
"lastKernelId": "9cf769c1-6908-41c5-9514-e5faa1f47b2b"
},
"kernelspec": {
"display_name": "Ruby 2.6.1",
"language": "ruby",
"name": "ruby"
},
"language_info": {
"file_extension": ".rb",
"mimetype": "application/x-ruby",
"name": "ruby",
"version": "2.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment