{
"cells": [
{
"cell_type": "markdown",
"id": "2529d419-4d4c-4666-a536-7ea3aa60ecfd",
"metadata": {},
"source": [
"## Using Local Disk to Increase Performance on Yen\n",
"### Alex Storer\n",
"### May 2024\n",
"\n",
"On the Yen servers, our storage array is currently mounted at `/zfs`, and reading from this array means transferring data over the network. If you're doing this once or twice, it's no problem, but if you're reading large amounts of data over the network in parallel, it can be a drag on system performance not just for you, but for other folks using the Yens, too. Let's look at a few examples using R.\n",
"\n",
"In this case, I'll be comparing using the following directories:\n",
"\n",
"* `/zfs/projects/darc/astorer_diagnostics` -- Mounted on the storage array\n",
"* `/tmp` -- Mounted locally (not available on other systems)"
]
},
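{
"cell_type": "markdown",
"id": "staging-pattern-note",
"metadata": {},
"source": [
"A common pattern is to stage data on local disk once, do all of your repeated reads against the local copy, and clean up when you're done. Here's a minimal sketch of that pattern (the input path is hypothetical -- substitute your own project file):\n",
"\n",
"```R\n",
"# Copy an input file from the storage array to local disk once\n",
"src <- \"/zfs/projects/myproject/bigfile.csv\"  # hypothetical path\n",
"dst <- file.path(\"/tmp\", basename(src))\n",
"file.copy(src, dst, overwrite = TRUE)\n",
"\n",
"# ... do all of your repeated reads against dst ...\n",
"\n",
"# Clean up so /tmp doesn't fill up for other users\n",
"unlink(dst)\n",
"```"
]
},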
{
"cell_type": "code",
"execution_count": 22,
"id": "d9372ae1-3320-45ab-87a4-f51e71e0a13b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Timing for writing 100 files to storage array:\n",
" user system elapsed \n",
" 0.206 0.032 0.500 \n",
"Timing for writing 100 files to local disk:\n",
" user system elapsed \n",
" 0.189 0.052 0.242 \n",
"Timing for reading 100 files from storage array:\n",
" user system elapsed \n",
" 0.023 0.005 0.089 \n",
"Timing for reading 100 files from local disk:\n",
" user system elapsed \n",
" 0.016 0.008 0.024 \n"
]
}
],
"source": [
"# Function to generate and write a single 1 MB file\n",
"write_one_mb_file <- function(file_id, directory) {\n",
" # Generate 1 MB of random data\n",
" data <- runif(131072) # 131072 numbers * 8 bytes/number = 1 MB\n",
"\n",
" # File path\n",
" filepath <- file.path(directory, paste0(\"output_\", file_id, \".bin\"))\n",
"\n",
" # Write the data to a binary file\n",
" writeBin(as.raw(data), filepath)\n",
"}\n",
"\n",
"# Function to read a single 1 MB file\n",
"read_one_mb_file <- function(file_id, directory) {\n",
" # File path\n",
" filepath <- file.path(directory, paste0(\"output_\", file_id, \".bin\"))\n",
"\n",
" # Read the data from a binary file\n",
" data <- readBin(filepath, \"raw\", 131072 * 8)\n",
" return(data)\n",
"}\n",
"\n",
"# Directory paths\n",
"storage_array_dir <- \"/zfs/projects/darc/astorer_diagnostics\"\n",
"local_tmp_dir <- \"/tmp\"\n",
"\n",
"# Function to write and read 100 files and measure total time for each operation\n",
"write_100_files <- function(directory) {\n",
" # Timing the writing process\n",
" write_timing <- system.time({\n",
" for (i in 1:100) {\n",
" write_one_mb_file(i, directory)\n",
" }\n",
" })\n",
"}\n",
"\n",
"read_100_files <- function(directory) {\n",
" # Timing the reading process\n",
" read_timing <- system.time({\n",
" for (i in 1:100) {\n",
" read_one_mb_file(i, directory)\n",
" }\n",
" })\n",
"}\n",
"\n",
"# Time writing and reading 100 files to the storage array\n",
"timing_storage_array <- write_and_read_100_files(storage_array_dir)\n",
"\n",
"# Time writing and reading 100 files to the local tmp directory\n",
"timing_local_tmp <- write_and_read_100_files(local_tmp_dir)\n",
"\n",
"# Print timings\n",
"cat(\"Timing for writing 100 files to storage array:\\n\")\n",
"print(write_100_files(storage_array_dir))\n",
"cat(\"Timing for writing 100 files to local disk:\\n\")\n",
"print(write_100_files(local_tmp_dir))\n",
"\n",
"# Print timings\n",
"cat(\"Timing for reading 100 files from storage array:\\n\")\n",
"print(read_100_files(storage_array_dir))\n",
"cat(\"Timing for reading 100 files from local disk:\\n\")\n",
"print(read_100_files(local_tmp_dir))\n"
]
},
{
"cell_type": "markdown",
"id": "c1e85914-ae58-4e20-976f-3c94ea59e17e",
"metadata": {},
"source": [
"The `elapsed` time, also called \"wall time\", indicates how long we waited. In this case, reading from `/tmp` was almost 4 times faster."
]
},
{
"cell_type": "markdown",
"id": "cc0f4e78-5388-4dcb-b50a-938dbc9e223b",
"metadata": {},
"source": [
"### Staging Data in a Practical Example\n",
"\n",
"One example we discussed recently involved doing regression on a large dataset, but using only a subset of the available columns. It takes time to copy and load the dataset, so we want to avoid doing that repeatedly if we can. One idea is to split the dataset into columns, and then load the columns into memory as needed.\n",
"\n",
"One package that does this natively is `fst` -- I went ahead and downloaded the San Francisco Police statstics as `sfpd.csv`, and then converted it to a `fst` file."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "c03f668e-cf36-44e9-9919-346d14b597c1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1mRows: \u001b[22m\u001b[34m856037\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m35\u001b[39m\n",
"\u001b[36m──\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m────────────────────────────────────────────────────────\u001b[39m\n",
"\u001b[1mDelimiter:\u001b[22m \",\"\n",
"\u001b[31mchr\u001b[39m (15): Incident Datetime, Incident Day of Week, Report Datetime, Inciden...\n",
"\u001b[32mdbl\u001b[39m (16): Incident Year, Row ID, Incident ID, CAD Number, CNN, Supervisor D...\n",
"\u001b[33mlgl\u001b[39m (2): Filed Online, Invest In Neighborhoods (IIN) Areas\n",
"\u001b[34mdate\u001b[39m (1): Incident Date\n",
"\u001b[34mtime\u001b[39m (1): Incident Time\n",
"\n",
"\u001b[36mℹ\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n",
"\u001b[36mℹ\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 2.436 0.370 1.630 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 2.947 0.478 0.978 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"library(fst)\n",
"library(tidyverse)\n",
"\n",
"df <- read_csv('/zfs/projects/darc/astorer_diagnostics/sfpd.csv')\n",
"\n",
"# Make sure you don't use too many threads to read/write data!\n",
"fst::threads_fst(12)\n",
"\n",
"system.time(\n",
"write.fst(df, '/zfs/projects/darc/astorer_diagnostics/sfpd.fst')\n",
")\n",
"\n",
"system.time(\n",
"write.fst(df, '/tmp/sfpd.fst')\n",
")\n"
]
},
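{
"cell_type": "markdown",
"id": "fst-metadata-note",
"metadata": {},
"source": [
"Before reading selectively, it can be handy to check which columns the `fst` file contains. `fst::metadata_fst()` reads only the file's metadata, so it's cheap even for large files:\n",
"\n",
"```R\n",
"# List the stored columns and types without loading any data\n",
"metadata_fst('/tmp/sfpd.fst')\n",
"```"
]
},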
{
"cell_type": "markdown",
"id": "aec629fb-1ec6-4ae6-9b95-8da55bcb6ba6",
"metadata": {},
"source": [
"If we look at the size of the original csv and the fst file, we can see that it's compressed:\n",
"\n",
"```R\n",
"> system('ls -lah /zfs/projects/darc/astorer_diagnostics/sfpd.fst')\n",
"-rw-rw---- 1 astorer gsb-rc_sysadmins 197M May 31 15:14 /zfs/projects/darc/astorer_diagnostics/sfpd.fst\n",
"> system('ls -lah /zfs/projects/darc/astorer_diagnostics/sfpd.csv')\n",
"-rw-r----- 1 astorer gsb-rc_sysadmins 300M May 31 13:44 /zfs/projects/darc/astorer_diagnostics/sfpd.csv\n",
"```\n",
"\n",
"At this point, we can read only specific columns directly from disk using `fst`. Let's compare reading the specific columns from `/tmp` (the fast way) with reading the entire dataset as a csv from `/zfs` and then subsetting it (the slow way). We'll set the threads to `1`, which is best practice if you're running parallel jobs on the file system."
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "6971278d-1a49-455b-a9e4-9a321bf6d5de",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.037 0.007 0.045 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1mRows: \u001b[22m\u001b[34m856037\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m35\u001b[39m\n",
"\u001b[36m──\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m────────────────────────────────────────────────────────\u001b[39m\n",
"\u001b[1mDelimiter:\u001b[22m \",\"\n",
"\u001b[31mchr\u001b[39m (15): Incident Datetime, Incident Day of Week, Report Datetime, Inciden...\n",
"\u001b[32mdbl\u001b[39m (16): Incident Year, Row ID, Incident ID, CAD Number, CNN, Supervisor D...\n",
"\u001b[33mlgl\u001b[39m (2): Filed Online, Invest In Neighborhoods (IIN) Areas\n",
"\u001b[34mdate\u001b[39m (1): Incident Date\n",
"\u001b[34mtime\u001b[39m (1): Incident Time\n",
"\n",
"\u001b[36mℹ\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n",
"\u001b[36mℹ\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 11.483 0.722 4.860 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Set the threads\n",
"threads_fst(1)\n",
"\n",
"# Time to read just two columns\n",
"system.time({\n",
" df_small = read.fst('/tmp/sfpd.fst',columns = c(\"Incident Year\",\"Incident Category\"))\n",
"})\n",
"\n",
"# Time to read everything and then subset\n",
"system.time({\n",
" df = read_csv('/zfs/projects/darc/astorer_diagnostics/sfpd.csv')\n",
" df_small = df[,c(\"Incident Year\", \"Incident Category\")]\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "81ee160d-090a-48a5-9685-136810ee2970",
"metadata": {},
"source": [
"The \"fast way\" is more than **100x faster**."
]
},
{
"cell_type": "markdown",
"id": "31fe86c8-b140-4b87-a6cf-076326fe5038",
"metadata": {},
"source": [
"### Adding More Columns"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "0903c930-648d-4e37-817c-89f4b836acd2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Read two, then add one\"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.075 0.001 0.074 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Read two, then read three\"\n"
]
},
{
"data": {
"text/plain": [
" user system elapsed \n",
" 0.115 0.004 0.119 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Read two, then add one\")\n",
"system.time({\n",
" # Read in two variables\n",
" df_2 = read.fst('/tmp/sfpd.fst',columns = c(\"Incident Year\",\"Incident Category\"))\n",
" # Then add the third variable in by binding it to the data frame\n",
" df_3 = cbind(df_2,read.fst('/tmp/sfpd.fst',columns = c(\"Resolution\")))\n",
"})\n",
"\n",
"print(\"Read two, then read three\")\n",
"system.time({\n",
" # Read in two variables\n",
" df_2 = read.fst('/tmp/sfpd.fst',columns = c(\"Incident Year\",\"Incident Category\"))\n",
" # Then read in three variables\n",
" df_3 = read.fst('/tmp/sfpd.fst',columns = c(\"Incident Year\",\"Incident Category\",\"Resolution\"))\n",
"})\n"
]
},
{
"cell_type": "markdown",
"id": "46249331-2198-43f4-b541-0d20b3c529d7",
"metadata": {},
"source": [
"It's faster to read column by column as needed from the `fst` file than it is to read all of the data at the same time."
]
},
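{
"cell_type": "markdown",
"id": "add-columns-helper-note",
"metadata": {},
"source": [
"If you do this a lot, the pattern is easy to wrap up. Here's a minimal sketch of a helper (the function name is mine, not part of `fst`) that reads only the columns you don't already have and binds them on:\n",
"\n",
"```R\n",
"# Hypothetical helper: add columns to a data frame by reading only the new ones\n",
"add_columns <- function(df, path, columns) {\n",
"    new_cols <- setdiff(columns, names(df))\n",
"    if (length(new_cols) == 0) return(df)\n",
"    cbind(df, read.fst(path, columns = new_cols))\n",
"}\n",
"\n",
"df_3 <- add_columns(df_2, '/tmp/sfpd.fst', c(\"Resolution\"))\n",
"```"
]
},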
{
"cell_type": "markdown",
"id": "34adb691-007d-4a4e-b070-1400aab33444",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"1. Using local disk (`/tmp`) is a faster place for reading and writing data, particularly if you have lots of files\n",
"2. The `fst` package lets you quickly read in data by columns\n",
"3. You have to remember to set the threads using `threads_fst(<threads>)` -- if you're planning on having *multiple* jobs reading the same data, set the threads to `1`.\n",
"4. You can build your data frame effectively by using `cbind` on the columns that you read in with the `fst` package."
]
},
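{
"cell_type": "markdown",
"id": "workflow-sketch-note",
"metadata": {},
"source": [
"Putting the pieces together, a typical workflow might look like this sketch (the paths are hypothetical -- substitute your own project directory):\n",
"\n",
"```R\n",
"library(fst)\n",
"\n",
"# 1. Stage the fst file on local disk once\n",
"file.copy('/zfs/projects/myproject/mydata.fst', '/tmp/mydata.fst')\n",
"\n",
"# 2. Use a single thread if multiple jobs will hit the file system at once\n",
"threads_fst(1)\n",
"\n",
"# 3. Read only the columns you need, adding more with cbind as you go\n",
"df <- read.fst('/tmp/mydata.fst', columns = c(\"y\", \"x1\"))\n",
"df <- cbind(df, read.fst('/tmp/mydata.fst', columns = c(\"x2\")))\n",
"```"
]
}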
],
"metadata": {
"kernelspec": {
"display_name": "R 4.2",
"language": "R",
"name": "ir42"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.2.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}