Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jorisvandenbossche/ddc824d4f5fc4e4efb575c3bd59ea201 to your computer and use it in GitHub Desktop.
Save jorisvandenbossche/ddc824d4f5fc4e4efb575c3bd59ea201 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Floating point data type with NaN and NA/NULL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some exploration of how NaN is handles in different languages and frameworks:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Julia"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Julia has both as separate concepts for float arrays using the `Missing` type (https://docs.julialang.org/en/v1/manual/missing/):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Float64},1}:\n",
" 0.0 \n",
" NaN \n",
" missing"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr = [0.0, NaN, missing]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element BitArray{1}:\n",
" false\n",
" false\n",
" true"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ismissing.(arr)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Bool},1}:\n",
" false \n",
" true \n",
" missing"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"isnan.(arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NaN can be created by division of 0 by 0:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Float64},1}:\n",
" NaN \n",
" NaN \n",
" missing"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr ./ 0.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Julia follows IEEE 754 when it comes to element-wise comparisons (always false, except for `!=`):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Bool},1}:\n",
" false \n",
" false \n",
" missing"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr .> 1"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Bool},1}:\n",
" true \n",
" false \n",
" missing"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr .< 1"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Bool},1}:\n",
" false \n",
" false \n",
" missing"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr .== NaN"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3-element Array{Union{Missing, Bool},1}:\n",
" true \n",
" true \n",
" missing"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr .!= NaN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But for sorting, it deviates from those rules to provide deterministic results (NaNs always last, but before missing):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8-element Array{Union{Missing, Float64},1}:\n",
" 3.5 \n",
" 4.0 \n",
" 5.0 \n",
" NaN \n",
" NaN \n",
" NaN \n",
" missing\n",
" missing"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sort([NaN, 5, NaN, missing, 4.0, 3.5, NaN, missing])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reducations propagate both missing and NaN. Missing values can be skipped with `skipmissing`:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"missing"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum([1.0, NaN, missing])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaN"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(skipmissing([1.0, NaN, missing]))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(skipmissing([1.0, missing]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## R"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"a <- c(1., 0., NaN, NA)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>1</li><li>0</li><li>NaN</li><li>&lt;NA&gt;</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 0\n",
"\\item NaN\n",
"\\item <NA>\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 0\n",
"3. NaN\n",
"4. &lt;NA&gt;\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1 0 NaN NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"a"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>Inf</li><li>NaN</li><li>NaN</li><li>&lt;NA&gt;</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item Inf\n",
"\\item NaN\n",
"\\item NaN\n",
"\\item <NA>\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. Inf\n",
"2. NaN\n",
"3. NaN\n",
"4. &lt;NA&gt;\n",
"\n",
"\n"
],
"text/plain": [
"[1] Inf NaN NaN NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"a / 0."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>FALSE</li><li>FALSE</li><li>TRUE</li><li>TRUE</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item FALSE\n",
"\\item FALSE\n",
"\\item TRUE\n",
"\\item TRUE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. FALSE\n",
"2. FALSE\n",
"3. TRUE\n",
"4. TRUE\n",
"\n",
"\n"
],
"text/plain": [
"[1] FALSE FALSE TRUE TRUE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"is.na(a)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>FALSE</li><li>FALSE</li><li>TRUE</li><li>FALSE</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item FALSE\n",
"\\item FALSE\n",
"\\item TRUE\n",
"\\item FALSE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. FALSE\n",
"2. FALSE\n",
"3. TRUE\n",
"4. FALSE\n",
"\n",
"\n"
],
"text/plain": [
"[1] FALSE FALSE TRUE FALSE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"is.nan(a)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"── \u001b[1mAttaching packages\u001b[22m ─────────────────────────────────────── tidyverse 1.2.1 ──\n",
"\n",
"\u001b[32m✔\u001b[39m \u001b[34mggplot2\u001b[39m 3.2.1 \u001b[32m✔\u001b[39m \u001b[34mpurrr \u001b[39m 0.3.2\n",
"\u001b[32m✔\u001b[39m \u001b[34mtibble \u001b[39m 2.1.3 \u001b[32m✔\u001b[39m \u001b[34mdplyr \u001b[39m 0.8.3\n",
"\u001b[32m✔\u001b[39m \u001b[34mtidyr \u001b[39m 1.0.0 \u001b[32m✔\u001b[39m \u001b[34mstringr\u001b[39m 1.4.0\n",
"\u001b[32m✔\u001b[39m \u001b[34mreadr \u001b[39m 1.3.1 \u001b[32m✔\u001b[39m \u001b[34mforcats\u001b[39m 0.4.0\n",
"\n",
"── \u001b[1mConflicts\u001b[22m ────────────────────────────────────────── tidyverse_conflicts() ──\n",
"\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mfilter()\u001b[39m masks \u001b[34mstats\u001b[39m::filter()\n",
"\u001b[31m✖\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n",
"\n"
]
}
],
"source": [
"library(tidyverse)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df <- tibble(x = c(-1., 0., 1., NaN, NA))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 5 × 1</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td> -1</td></tr>\n",
"\t<tr><td> 0</td></tr>\n",
"\t<tr><td> 1</td></tr>\n",
"\t<tr><td>NaN</td></tr>\n",
"\t<tr><td> NA</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 5 × 1\n",
"\\begin{tabular}{l}\n",
" x\\\\\n",
" <dbl>\\\\\n",
"\\hline\n",
"\t -1\\\\\n",
"\t 0\\\\\n",
"\t 1\\\\\n",
"\t NaN\\\\\n",
"\t NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 5 × 1\n",
"\n",
"| x &lt;dbl&gt; |\n",
"|---|\n",
"| -1 |\n",
"| 0 |\n",
"| 1 |\n",
"| NaN |\n",
"| NA |\n",
"\n"
],
"text/plain": [
" x \n",
"1 -1\n",
"2 0\n",
"3 1\n",
"4 NaN\n",
"5 NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 5 × 2</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x</th><th scope=col>x2</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td> -1</td><td>-Inf</td></tr>\n",
"\t<tr><td> 0</td><td> NaN</td></tr>\n",
"\t<tr><td> 1</td><td> Inf</td></tr>\n",
"\t<tr><td>NaN</td><td> NaN</td></tr>\n",
"\t<tr><td> NA</td><td> NA</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 5 × 2\n",
"\\begin{tabular}{ll}\n",
" x & x2\\\\\n",
" <dbl> & <dbl>\\\\\n",
"\\hline\n",
"\t -1 & -Inf\\\\\n",
"\t 0 & NaN\\\\\n",
"\t 1 & Inf\\\\\n",
"\t NaN & NaN\\\\\n",
"\t NA & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 5 × 2\n",
"\n",
"| x &lt;dbl&gt; | x2 &lt;dbl&gt; |\n",
"|---|---|\n",
"| -1 | -Inf |\n",
"| 0 | NaN |\n",
"| 1 | Inf |\n",
"| NaN | NaN |\n",
"| NA | NA |\n",
"\n"
],
"text/plain": [
" x x2 \n",
"1 -1 -Inf\n",
"2 0 NaN\n",
"3 1 Inf\n",
"4 NaN NaN\n",
"5 NA NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% mutate(x2 = x / 0)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning message in log(x):\n",
"“NaNs produced”\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 5 × 2</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x</th><th scope=col>x2</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td> -1</td><td> NaN</td></tr>\n",
"\t<tr><td> 0</td><td>-Inf</td></tr>\n",
"\t<tr><td> 1</td><td> 0</td></tr>\n",
"\t<tr><td>NaN</td><td> NaN</td></tr>\n",
"\t<tr><td> NA</td><td> NA</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 5 × 2\n",
"\\begin{tabular}{ll}\n",
" x & x2\\\\\n",
" <dbl> & <dbl>\\\\\n",
"\\hline\n",
"\t -1 & NaN\\\\\n",
"\t 0 & -Inf\\\\\n",
"\t 1 & 0\\\\\n",
"\t NaN & NaN\\\\\n",
"\t NA & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 5 × 2\n",
"\n",
"| x &lt;dbl&gt; | x2 &lt;dbl&gt; |\n",
"|---|---|\n",
"| -1 | NaN |\n",
"| 0 | -Inf |\n",
"| 1 | 0 |\n",
"| NaN | NaN |\n",
"| NA | NA |\n",
"\n"
],
"text/plain": [
" x x2 \n",
"1 -1 NaN\n",
"2 0 -Inf\n",
"3 1 0\n",
"4 NaN NaN\n",
"5 NA NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% mutate(x2 = log(x))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 5 × 4</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x</th><th scope=col>x2</th><th scope=col>x3</th><th scope=col>x4</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;lgl&gt;</th><th scope=col>&lt;lgl&gt;</th><th scope=col>&lt;lgl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td> -1</td><td>FALSE</td><td> TRUE</td><td>NA</td></tr>\n",
"\t<tr><td> 0</td><td>FALSE</td><td>FALSE</td><td>NA</td></tr>\n",
"\t<tr><td> 1</td><td> TRUE</td><td>FALSE</td><td>NA</td></tr>\n",
"\t<tr><td>NaN</td><td> NA</td><td> NA</td><td>NA</td></tr>\n",
"\t<tr><td> NA</td><td> NA</td><td> NA</td><td>NA</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 5 × 4\n",
"\\begin{tabular}{llll}\n",
" x & x2 & x3 & x4\\\\\n",
" <dbl> & <lgl> & <lgl> & <lgl>\\\\\n",
"\\hline\n",
"\t -1 & FALSE & TRUE & NA\\\\\n",
"\t 0 & FALSE & FALSE & NA\\\\\n",
"\t 1 & TRUE & FALSE & NA\\\\\n",
"\t NaN & NA & NA & NA\\\\\n",
"\t NA & NA & NA & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 5 × 4\n",
"\n",
"| x &lt;dbl&gt; | x2 &lt;lgl&gt; | x3 &lt;lgl&gt; | x4 &lt;lgl&gt; |\n",
"|---|---|---|---|\n",
"| -1 | FALSE | TRUE | NA |\n",
"| 0 | FALSE | FALSE | NA |\n",
"| 1 | TRUE | FALSE | NA |\n",
"| NaN | NA | NA | NA |\n",
"| NA | NA | NA | NA |\n",
"\n"
],
"text/plain": [
" x x2 x3 x4\n",
"1 -1 FALSE TRUE NA\n",
"2 0 FALSE FALSE NA\n",
"3 1 TRUE FALSE NA\n",
"4 NaN NA NA NA\n",
"5 NA NA NA NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% mutate(x2 = x > 0, x3 = x < 0, x4 = x != NaN)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 1 × 2</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x1</th><th scope=col>x2</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>0</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 1 × 2\n",
"\\begin{tabular}{ll}\n",
" x1 & x2\\\\\n",
" <dbl> & <dbl>\\\\\n",
"\\hline\n",
"\t 0 & NaN\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 1 × 2\n",
"\n",
"| x1 &lt;dbl&gt; | x2 &lt;dbl&gt; |\n",
"|---|---|\n",
"| 0 | NaN |\n",
"\n"
],
"text/plain": [
" x1 x2 \n",
"1 0 NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% summarise(x1 = mean(x, na.rm = TRUE), x2 = mean(x, na.rm = FALSE))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<caption>A tibble: 8 × 1</caption>\n",
"<thead>\n",
"\t<tr><th scope=col>x</th></tr>\n",
"\t<tr><th scope=col>&lt;dbl&gt;</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"\t<tr><td>3.5</td></tr>\n",
"\t<tr><td>4.0</td></tr>\n",
"\t<tr><td>5.0</td></tr>\n",
"\t<tr><td> NA</td></tr>\n",
"\t<tr><td> NA</td></tr>\n",
"\t<tr><td>NaN</td></tr>\n",
"\t<tr><td>NaN</td></tr>\n",
"\t<tr><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"A tibble: 8 × 1\n",
"\\begin{tabular}{l}\n",
" x\\\\\n",
" <dbl>\\\\\n",
"\\hline\n",
"\t 3.5\\\\\n",
"\t 4.0\\\\\n",
"\t 5.0\\\\\n",
"\t NA\\\\\n",
"\t NA\\\\\n",
"\t NaN\\\\\n",
"\t NaN\\\\\n",
"\t NaN\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A tibble: 8 × 1\n",
"\n",
"| x &lt;dbl&gt; |\n",
"|---|\n",
"| 3.5 |\n",
"| 4.0 |\n",
"| 5.0 |\n",
"| NA |\n",
"| NA |\n",
"| NaN |\n",
"| NaN |\n",
"| NaN |\n",
"\n"
],
"text/plain": [
" x \n",
"1 3.5\n",
"2 4.0\n",
"3 5.0\n",
"4 NA\n",
"5 NA\n",
"6 NaN\n",
"7 NaN\n",
"8 NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df <- tibble(x = c(NaN, 5, NaN, NA, 4.0, 3.5, NaN, NA))\n",
"df %>% arrange(x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arrow / Dremio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arrow itself also distinguishes both concepts, since it uses bitmasks consistently for missing values:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow.lib.DoubleArray object at 0x7ff14a15cee8>\n",
"[\n",
" 0,\n",
" 1,\n",
" nan,\n",
" null\n",
"]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a = pa.array([0.0, 1.0, float(\"nan\"), None])\n",
"a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is just storage, though, and there not yet many actual computional operations provided in the `pyarrow` package itself. \n",
"\n",
"One part of Apache Arrow that already provides computational functionality is Gandiva (LLVM-based execution kernel for SQL-like expressions, https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/), which is eg used in Dremio's data services.\n",
"\n",
"Here, we can check eg the behaviour of division by 0:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from pyarrow import gandiva"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"table = pa.table({'a': a, 'b': [0.0, 0.0, 0.0, 0.0]})"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"builder = gandiva.TreeExprBuilder()\n",
"node_a = builder.make_field(table.schema.field(\"a\"))\n",
"node_b = builder.make_field(table.schema.field(\"b\"))\n",
"node_upper = builder.make_function(\"divide\", [node_a, node_b], pa.float64())\n",
"field_result = pa.field('res', pa.float64())\n",
"expr = builder.make_expression(node_upper, field_result)\n",
"projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"ename": "ArrowException",
"evalue": "ExecutionError in Gandiva: divide by zero error",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mArrowException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-6-2f99515c79a6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprojector\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mevaluate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_batches\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/envs/arrow-0151/lib/python3.7/site-packages/pyarrow/gandiva.pyx\u001b[0m in \u001b[0;36mpyarrow.gandiva.Projector.evaluate\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/arrow-0151/lib/python3.7/site-packages/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mArrowException\u001b[0m: ExecutionError in Gandiva: divide by zero error"
]
}
],
"source": [
"projector.evaluate(table.to_batches()[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where an error is raised, similar to other databases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PostgreSQL"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%load_ext sql"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Connected: postgres@test_db'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql postgresql://postgres:postgres@localhost/test_db"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"Done.\n",
"Done.\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS test;\n",
"CREATE TABLE test (\n",
" x double precision,\n",
" y double precision\n",
");\n",
"INSERT INTO test VALUES (0.0, 0.0), (1.0, 0.0), ('NaN'::double precision, 0.0), (NULL, 0.0);"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0.0, 0.0), (1.0, 0.0), (nan, 0.0), (None, 0.0)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM test;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Division by zero raises an error (so does not result in NaN):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n"
]
},
{
"ename": "DataError",
"evalue": "(psycopg2.errors.DivisionByZero) division by zero\n\n[SQL: SELECT x / y FROM test]\n(Background on this error at: http://sqlalche.me/e/9h9h)",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mDivisionByZero\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_context\u001b[0;34m(self, dialect, constructor, statement, parameters, *args)\u001b[0m\n\u001b[1;32m 1247\u001b[0m self.dialect.do_execute(\n\u001b[0;32m-> 1248\u001b[0;31m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1249\u001b[0m )\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py\u001b[0m in \u001b[0;36mdo_execute\u001b[0;34m(self, cursor, statement, parameters, context)\u001b[0m\n\u001b[1;32m 589\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdo_execute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 590\u001b[0;31m \u001b[0mcursor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 591\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mDivisionByZero\u001b[0m: division by zero\n",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mDataError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-15-ea847347b56b>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_line_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'sql'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'SELECT x / y FROM test'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py\u001b[0m in \u001b[0;36mrun_line_magic\u001b[0;34m(self, magic_name, line, _stack_depth)\u001b[0m\n\u001b[1;32m 2305\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'local_ns'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getframe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstack_depth\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf_locals\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2306\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuiltin_trap\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2307\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2308\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2309\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m</home/joris/miniconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-127>\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/IPython/core/magic.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(f, *a, **k)\u001b[0m\n\u001b[1;32m 185\u001b[0m \u001b[0;31m# but it's overkill for just that one bit of state.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 186\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mmagic_deco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 187\u001b[0;31m \u001b[0mcall\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 188\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 189\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m</home/joris/miniconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-126>\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/IPython/core/magic.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(f, *a, **k)\u001b[0m\n\u001b[1;32m 185\u001b[0m \u001b[0;31m# but it's overkill for just that one bit of state.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 186\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mmagic_deco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 187\u001b[0;31m \u001b[0mcall\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 188\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 189\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sql/magic.py\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, line, cell, local_ns)\u001b[0m\n\u001b[1;32m 93\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 94\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 95\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msql\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparsed\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'sql'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser_ns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 96\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 97\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumn_local_vars\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sql/run.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(conn, sql, config, user_namespace)\u001b[0m\n\u001b[1;32m 338\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 339\u001b[0m \u001b[0mtxt\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msqlalchemy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msql\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstatement\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 340\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msession\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtxt\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muser_namespace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 341\u001b[0m \u001b[0m_commit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 342\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfeedback\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36mexecute\u001b[0;34m(self, object_, *multiparams, **params)\u001b[0m\n\u001b[1;32m 982\u001b[0m )\n\u001b[1;32m 983\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 984\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mmeth\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmultiparams\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 985\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 986\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_execute_function\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmultiparams\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/sql/elements.py\u001b[0m in \u001b[0;36m_execute_on_connection\u001b[0;34m(self, connection, multiparams, params)\u001b[0m\n\u001b[1;32m 291\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_execute_on_connection\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconnection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmultiparams\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 292\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msupports_execution\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 293\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mconnection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_execute_clauseelement\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmultiparams\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 294\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 295\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObjectNotExecutableError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_clauseelement\u001b[0;34m(self, elem, multiparams, params)\u001b[0m\n\u001b[1;32m 1101\u001b[0m \u001b[0mdistilled_params\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1102\u001b[0m \u001b[0mcompiled_sql\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1103\u001b[0;31m \u001b[0mdistilled_params\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1104\u001b[0m )\n\u001b[1;32m 1105\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_has_events\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_has_events\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_context\u001b[0;34m(self, dialect, constructor, statement, parameters, *args)\u001b[0m\n\u001b[1;32m 1286\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mBaseException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1287\u001b[0m self._handle_dbapi_exception(\n\u001b[0;32m-> 1288\u001b[0;31m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1289\u001b[0m )\n\u001b[1;32m 1290\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_handle_dbapi_exception\u001b[0;34m(self, e, statement, parameters, cursor, context)\u001b[0m\n\u001b[1;32m 1480\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mshould_wrap\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1481\u001b[0m util.raise_(\n\u001b[0;32m-> 1482\u001b[0;31m \u001b[0msqlalchemy_exception\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwith_traceback\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mexc_info\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfrom_\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1483\u001b[0m )\n\u001b[1;32m 1484\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/util/compat.py\u001b[0m in \u001b[0;36mraise_\u001b[0;34m(***failed resolving arguments***)\u001b[0m\n\u001b[1;32m 176\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 178\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexception\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 179\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 180\u001b[0m \u001b[0;31m# credit to\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_context\u001b[0;34m(self, dialect, constructor, statement, parameters, *args)\u001b[0m\n\u001b[1;32m 1246\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mevt_handled\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1247\u001b[0m self.dialect.do_execute(\n\u001b[0;32m-> 1248\u001b[0;31m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1249\u001b[0m )\n\u001b[1;32m 1250\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py\u001b[0m in \u001b[0;36mdo_execute\u001b[0;34m(self, cursor, statement, parameters, context)\u001b[0m\n\u001b[1;32m 588\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 589\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdo_execute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 590\u001b[0;31m \u001b[0mcursor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 591\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 592\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdo_execute_no_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mDataError\u001b[0m: (psycopg2.errors.DivisionByZero) division by zero\n\n[SQL: SELECT x / y FROM test]\n(Background on this error at: http://sqlalche.me/e/9h9h)"
]
}
],
"source": [
"%sql SELECT x / y FROM test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparisons with NaN do not always give False:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>?column?</th>\n",
" </tr>\n",
" <tr>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(False,), (True,), (True,), (None,)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x > y FROM test"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>?column?</th>\n",
" </tr>\n",
" <tr>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(False,), (False,), (False,), (None,)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x < y FROM test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note from https://www.postgresql.org/docs/current/datatype-numeric.html:\n",
"\n",
"> In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>sum</th>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(nan,)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT sum(x) FROM test"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>sum</th>\n",
" </tr>\n",
" <tr>\n",
" <td>1.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(1.0,)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT Sum(x) FROM test WHERE x != 'NaN'"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>0.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(None,), (nan,), (1.0,), (0.0,)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x FROM test ORDER BY x DESC"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * postgresql://postgres:***@localhost/test_db\n",
"4 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0.0,), (1.0,), (nan,), (None,)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x FROM test ORDER BY x ASC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clickhouse"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Connected: default@default'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql clickhouse://default:clickhouse@localhost/default"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql DROP TABLE IF EXISTS test"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"%sql CREATE TABLE test (id Int, x Nullable(Double), y Nullable(Double)) ENGINE = MergeTree ORDER BY id"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"INSERT INTO test VALUES (1, 0.0, 0.0), (1, 1.0, 0.0), (3, NaN, 0.0), (4, NULL, 0.0)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>id</th>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>nan</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>None</td>\n",
" <td>0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(1, '0', '0'), (1, '1', '0'), (3, 'nan', '0'), (4, None, '0')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM test"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>divide(x, y)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>inf</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('nan',), ('inf',), ('nan',), (None,)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x / y FROM test"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>greater(x, y)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('0',), ('1',), ('0',), (None,)]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x > y FROM test"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>isNaN(x)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('0',), ('0',), ('1',), (None,)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT isNaN(x) FROM test"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>Sum(x)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('nan',)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT Sum(x) FROM test"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>Sum(x)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('1',)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT Sum(x) FROM test WHERE isFinite(x)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('1',), ('0',), ('nan',), (None,)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x FROM test ORDER BY x DESCENDING"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * clickhouse://default:***@localhost/default\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>nan</td>\n",
" </tr>\n",
" <tr>\n",
" <td>None</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('0',), ('1',), ('nan',), (None,)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT x FROM test ORDER BY x ASCENDING"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Spark\n",
"\n",
"\n",
"```\n",
"\n",
"df = spark.createDataFrame(\n",
" [\n",
" (1, 0.0, 0.0),\n",
" (2, 1.0, 0.0),\n",
" (3, float('nan'), 0.0),\n",
" (4, None, 0.0),\n",
" ],\n",
" ['id', 'x', 'y']\n",
")\n",
"\n",
"\n",
"df.show()\n",
"\n",
"\n",
"\n",
"import pyspark.sql.functions as F\n",
"\n",
"df.withColumn(\"x2\", (F.col(\"x\") / F.col(\"y\"))).show()\n",
"df.withColumn(\"x2\", F.log(\"x\")).show()\n",
"df.na.fill(0).show()\n",
"```\n",
"\n",
"```\n",
"In [4]: df.withColumn(\"x2\", (F.col(\"x\") / F.col(\"y\"))).show()\n",
"+---+----+---+----+\n",
"| id| x| y| x2|\n",
"+---+----+---+----+\n",
"| 1| 0.0|0.0|null|\n",
"| 2| 1.0|0.0|null|\n",
"| 3| NaN|0.0|null|\n",
"| 4|null|0.0|null|\n",
"+---+----+---+----+\n",
"\n",
"In [5]: df.withColumn(\"x2\", F.log(\"x\")).show()\n",
"+---+----+---+----+\n",
"| id| x| y| x2|\n",
"+---+----+---+----+\n",
"| 1| 0.0|0.0|null|\n",
"| 2| 1.0|0.0| 0.0|\n",
"| 3| NaN|0.0| NaN|\n",
"| 4|null|0.0|null|\n",
"+---+----+---+----+\n",
"\n",
"In [6]: df.na.fill(0).show()\n",
"+---+---+---+\n",
"| id| x| y|\n",
"+---+---+---+\n",
"| 1|0.0|0.0|\n",
"| 2|1.0|0.0|\n",
"| 3|0.0|0.0|\n",
"| 4|0.0|0.0|\n",
"+---+---+---+\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.1"
},
"toc-autonumbering": false
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment