Reading and Writing Speed of Wide Data vs Tall Data
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Impact of Data Orientation on I/O Speed"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"## The motivation.\n",
"\n",
"The first step to your data ananlysis project, before you do any exploration, is loading the data into R(or other language for that matter, we focus on R in this article). \n",
"\n",
"I have a project where I use matrix multiplication to speed up genome scan, where we using Linear Model to do eQTL analysis. \n",
"\n",
"We were able to keep the algorithm run relatively fast for the core algorithm, but the data cleaning process takes way to long, that any of our improvment on the alorithm will be dimished, if we include the data cleaning process. \n",
"\n",
"For our project, Reading and writing the BXD data set, takes up majority of the time in the data cleaning process. \n",
"We use r/qtl package to read in genotype and phenotype data. The shape of the data is short and wide, meaning we have lots of columns and few rows. \n",
"We found that, **given the same data, if we pass them in as is, compared to pass them in as transposed, the latter version is much faster.**\n",
"This is very interesting. Maybe this is already known to you that wide data takes longer to read than long and slender data.\n",
"So I wrote a micro benchmark to discover what is the reason for long IO time.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Motivation:\n",
"## My Project: \n",
"- Speeding up eQTL scans in the BXD population using GPUs. \n",
"([BioRxiv link](https://www.biorxiv.org/content/10.1101/2020.06.22.153742v1?rss=1))\n",
"- Targeting the backend service of GeneNetwork\n",
"\n",
"## Data cleaning for genome scan takes too long. \n",
"\n",
"| | Data Cleaning Time | Dimension of Data |\n",
"|:-----------:|:------------------:|:-----------------:|\n",
"| Spleen | 1m21.206s | 79 x 35,000 |\n",
"| Hippocampus | 460m21.567s | 79 x 1,000,000 |"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Microbenchmark Setup:\n",
" A microbenchmark is either a program or routine to measure and test the performance of a single component or task\n",
"\n",
"There will be 8192 (2^13) elements. \n",
"- Dimension for tall matrix is 8192 x 1, \n",
"- Dimension for wide matrix is 1 x 8192.\n",
"\n",
"And we want to time how long the functions `read.csv` and `write.csv` takes for tall matrix vs wide matrix. "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Reading time:\"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.004 0.000 0.005 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 1.269 0.005 1.274 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Writing time:\"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.006 0.000 0.007 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.409 0.090 0.499 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Reading transposed data, then transposing it.\"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.005 0.000 0.005 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Comparing the reading and writing time of wide and tall matrix\n",
"\n",
"tall = matrix(rnorm(8192),nrow=8192)\n",
"wide = matrix(rnorm(8192),nrow=1)\n",
"\n",
"write.csv(tall, \"tall.csv\", row.names=FALSE)\n",
"write.csv(wide, \"wide.csv\", row.names=FALSE)\n",
"\n",
"print(\"Reading time:\")\n",
"system.time({tallr = read.csv(\"tall.csv\")})\n",
"system.time({wider = read.csv(\"wide.csv\")})\n",
"\n",
"print(\"Writing time:\")\n",
"system.time(write.csv(tallr, \"tallr.csv\", row.names=FALSE))\n",
"system.time(write.csv(wider, \"wider.csv\", row.names=FALSE))\n",
"\n",
"print(\"Reading transposed data, then transposing it.\")\n",
"system.time({\n",
" wide_transpose = read.csv(\"tall.csv\")\n",
" wider_wt = t(wide_transpose)\n",
"})\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot going on in this code block. \n",
"\n",
"First we created our tall and wide matrix. The tall matrix is 8192x1, and the wide matrix is 1x8192. \n",
"Then we write out the matrix to file. \n",
"\n",
"Now comes the time to benchmark read.csv and write.csv (code line 9 to 21). \n",
"\n",
"As the output shows, reading time for tall matrix(code line 10) takes 0.005 second, while wide matrixs takes 1.268 seconds(code line 11). (Your output may vary because of differences in hardware and randomness). The performance gap is jaw dropping. \n",
"\n",
"For the case of writing time, the timing tells the same story(code line 14 to 15), wide dataset takes longer than tall dataset, only a little less dramatic. \n",
"\n",
"However, if we read in the tall matrix, then transpose it (the same effect as reading in a wide matrix), it takes about the same time as reading in a tall dataset. (see code line 18 to 21). "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# In read.csv documentation: \n",
"`read.table is not the right tool for reading large matrices, especially those with many columns: it is designed to read data frames which may have columns of very different classes. Use scan instead for matrices. `"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.003 0.000 0.003 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# As suggested by the documentation, we try to time scan. \n",
"# system.time({wides <- scan(\"wide.csv\", sep=\",\",skip = 1, quiet = FALSE, nlines=1)})\n",
"system.time({wides <- scan(\"wide.csv\", sep=\",\",skip = 1, quiet = FALSE)})"
]
},
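{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `scan` returns a plain numeric vector rather than a matrix. A minimal sketch of recovering the 1 x 8192 matrix shape (the `wide_matrix` name is ours, not part of the benchmark):\n",
"\n",
"```r\n",
"wides <- scan(\"wide.csv\", sep = \",\", skip = 1, quiet = TRUE)\n",
"wide_matrix <- matrix(wides, nrow = 1)  # rebuild the 1 x 8192 shape\n",
"dim(wide_matrix)  # 1 8192\n",
"```"
]
},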
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Why the difference?\n",
"\n",
"Dataframes are ideal when you have different data type in different columns. \n",
"The convenience data manupulation functions, and syntax sugar that allows you to write concise and prettified code, does not come free. \n",
"All those convenience comes with overhead cost of the dataframe itself. \n",
"If the dataset is tall (lots of rows and not a lot columns), then the overhead diminishes. \n"
]
},
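{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the per-column overhead concrete, here is a minimal sketch (the `df_1col` and `df_allcol` names are ours): the same 8192 numbers stored as one column versus 8192 columns.\n",
"\n",
"```r\n",
"n <- 8192\n",
"df_1col <- as.data.frame(matrix(rnorm(n), ncol = 1))    # tall: one column\n",
"df_allcol <- as.data.frame(matrix(rnorm(n), ncol = n))  # wide: 8192 columns\n",
"object.size(df_1col)    # close to the raw 65,536 bytes of doubles\n",
"object.size(df_allcol)  # much larger: per-column structure dominates\n",
"```"
]
},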
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"65536"
],
"text/latex": [
"65536"
],
"text/markdown": [
"65536"
],
"text/plain": [
"[1] 65536"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Tall:\"\n"
]
},
{
"data": {
"text/plain": [
"65752 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"66264 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Wide:\"\n"
]
},
{
"data": {
"text/plain": [
"65752 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"1049184 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"66040 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"65584 bytes"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Size of memory taken by 8192 double precision floating point numbers. \n",
"8*8192\n",
"\n",
"print(\"Tall:\")\n",
"object.size(tall) # size of a 8192x1 matrix\n",
"object.size(tallr) # size of a 8192x1 dataframe\n",
"\n",
"print(\"Wide:\")\n",
"object.size(wide) # size of a 1x8192 matrix\n",
"object.size(wider) # size of a 1x8192 dataframe\n",
"object.size(wider_wt) # side of a 1x8192 numeric\n",
"object.size(wides) # side of a 1x8192 numeric"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Knowing there are tradeoff of memory vs performance and convenience, we can use this information to our advantage. \n",
"One double precision floating point number (which is the default) takes up 8 bytes in memory. \n",
"Theoretical speaking, if there is no overhead, 8192 `double`s takes 65538 bytes. \n",
"Looking at the output of above code cell, one thing that grasps my attention is how much memory is taken up by object `wider`(1x8192 dataframe): 1049184 bytes, much more than our theoretical estimate!\n",
"Memory consumption is not the cause of bad performance, but they are usually related.\n",
"Because bigger memory consumption means it will take more time to retrieve data, has a bad data locality, also is unfriendly to cache."
]
},
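{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick back-of-the-envelope check of that overhead, using the numbers printed above (exact sizes vary by R version):\n",
"\n",
"```r\n",
"payload <- 8 * 8192    # 65,536 bytes of raw doubles\n",
"wider_size <- 1049184  # observed object.size(wider) above\n",
"wider_size - payload   # ~983,648 bytes of per-column structure\n",
"```"
]
},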
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using read_csv() \n",
"\n",
"Tidyverse also has a reading function: `read_csv` (notice the underscore, rather than .) \n",
"It will read data into tibble, while read.csv will load data as a dataframe. \n",
"The following code block output shows, yes, it is much faster than `read.csv`, but what we observed that wider dataset takes longer to read and write than tall datasets still stands using `read_csv` or `write_csv`."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parsed with column specification:\n",
"cols(\n",
" V1 = \u001b[32mcol_double()\u001b[39m\n",
")\n",
"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.004 0.000 0.004 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parsed with column specification:\n",
"cols(\n",
" .default = col_double()\n",
")\n",
"\n",
"See spec(...) for full column specifications.\n",
"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 1.056 0.064 1.118 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.002 0.000 0.002 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.412 0.121 0.534 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Tall tibble:\"\n"
]
},
{
"data": {
"text/plain": [
"68304 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Wide tibble:\"\n"
]
},
{
"data": {
"text/plain": [
"4655264 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Reading transposed data, then transposing it.\"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parsed with column specification:\n",
"cols(\n",
" V1 = \u001b[32mcol_double()\u001b[39m\n",
")\n",
"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.004 0.001 0.005 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"66040 bytes"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"library(tidyverse)\n",
"system.time({tall_r = read_csv(\"tall.csv\")})\n",
"system.time({wide_r = read_csv(\"wide.csv\")})\n",
"system.time({write_csv(tall_r, \"tall_tb.csv\")})\n",
"system.time({write_csv(wide_r, \"wide_tb.csv\")})\n",
"\n",
"print(\"Tall tibble:\")\n",
"object.size(tall_r) # size of a 8192x1 dataframe\n",
"\n",
"print(\"Wide tibble:\")\n",
"object.size(wide_r) # size of a 1x8192 dataframe\n",
"\n",
"print(\"Reading transposed data, then transposing it.\")\n",
"system.time({\n",
" wide_transpose = read_csv(\"tall.csv\")\n",
" wide_r_wt = t(wide_transpose)\n",
"})\n",
"object.size(wide_r_wt) # size of a 8192x1 dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Want fast reading for a data.frame?\n",
"\n",
"`fread` is a lot faster than read.csv. \n",
"This has to do with memory use when reading the file. \n",
"Essentially, fread memory maps the file into memory and then iterates through the file using pointers. \n",
"Whereas `read.csv` reads the file into a buffer via a connection."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.000 0.000 0.001 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.006 0.000 0.006 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"66688 bytes"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"820168 bytes"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"library(data.table)\n",
"# many people claim fread is faster (use verbose=TRUE, if interested in full log)\n",
"system.time({tallr = fread(\"tall.csv\")})\n",
"system.time({wider = fread(\"wide.csv\")})\n",
"\n",
"object.size(tallr) # size of a 8192x1 dataframe\n",
"object.size(wider) # size of a 1x8192 dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"# Conclusion\n",
"The wide data orientation will take much longer for I/O speed for data frame, particularly, reading speed. **Even if you have a wide data set, it is faster to have the data written out as a transposed version, because later on, reading in + transposing the data will take less time than reading in the wide data as is.**\n",
"\n",
"If you have a matrix, consider using the scan function for speed reason. \n",
"If you have a dataframe (columns of data may have different type of data), then the your choice depends on what you have. If only consider performance, the order of these functions: fread > read_csv > read.csv\n"
]
},
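{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the recommended pattern (the file name `wide_transposed.csv` is ours): store wide data in the tall orientation, then transpose after reading.\n",
"\n",
"```r\n",
"# Write the wide matrix in tall orientation...\n",
"write.csv(t(wide), \"wide_transposed.csv\", row.names = FALSE)\n",
"# ...then read the tall file back and transpose to recover the wide shape.\n",
"recovered <- t(as.matrix(read.csv(\"wide_transposed.csv\")))\n",
"dim(recovered)  # 1 8192\n",
"```"
]
},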
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A happy ending:\n",
"After we realized this issue, we modified our code for I/O and data transformation, we were able to achieve about 5 times speed up for the smaller data set spleen data, and more than 30 times speed up for the bigger, hippocampus dataset. \n",
"\n",
"| Timing | Before | After |\n",
"|-------------|-------------|-----------|\n",
"| Spleen | 1m21.206s | 0m25.543s |\n",
"| Hippocampus | 460m21.567s | 13m2.260s |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Backup"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [
{
"data": {
"text/plain": [
" user system elapsed \n",
" 1.277 0.008 1.284 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 1.185 0.000 1.186 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Specifying ColClasses will decrease reading time\n",
"system.time({wider = read.csv(\"wide.csv\")})\n",
"system.time({wider = read.csv(\"wide.csv\", colClasses=\"numeric\")})\n",
"# But it is still taking unproportionally long time."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Classes for wide data:\"\n"
]
},
{
"data": {
"text/html": [
"'matrix'"
],
"text/latex": [
"'matrix'"
],
"text/markdown": [
"'matrix'"
],
"text/plain": [
"[1] \"matrix\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'data.frame'"
],
"text/latex": [
"'data.frame'"
],
"text/markdown": [
"'data.frame'"
],
"text/plain": [
"[1] \"data.frame\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'matrix'"
],
"text/latex": [
"'matrix'"
],
"text/markdown": [
"'matrix'"
],
"text/plain": [
"[1] \"matrix\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'numeric'"
],
"text/latex": [
"'numeric'"
],
"text/markdown": [
"'numeric'"
],
"text/plain": [
"[1] \"numeric\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Lengthes of wide data:\"\n"
]
},
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>1</li><li>8192</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 8192\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 8192\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1 8192"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<style>\n",
".list-inline {list-style: none; margin:0; padding: 0}\n",
".list-inline>li {display: inline-block}\n",
".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
"</style>\n",
"<ol class=list-inline><li>1</li><li>8192</li></ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 8192\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 8192\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1 8192"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"8192"
],
"text/latex": [
"8192"
],
"text/markdown": [
"8192"
],
"text/plain": [
"[1] 8192"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Classes for wide data:\")\n",
"class(wide) # Created with rnorm\n",
"class(wider) # Read in with read.csv\n",
"class(wider_wt) # Read in tranposed then transposing it \n",
"class(wides) # Read in with scan \n",
"\n",
"print(\"Lengthes of wide data:\")\n",
"dim(wide)\n",
"dim(wider)\n",
"length(wides)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}