Skip to content

Instantly share code, notes, and snippets.

@gergness
Created July 9, 2018 18:49
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gergness/22b6603e69506357b6e76ba0a40d14f2 to your computer and use it in GitHub Desktop.
Save gergness/22b6603e69506357b6e76ba0a40d14f2 to your computer and use it in GitHub Desktop.
ipumsr big data vignette
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="Minnesota Population Center" />
<meta name="date" content="2018-07-09" />
<title>Big IPUMS data</title>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20both%3B%0Amargin%3A%200%200%2010px%2010px%3B%0Apadding%3A%204px%3B%0Awidth%3A%20400px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Aborder%2Dradius%3A%205px%3B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Afont%2Dsize%3A%2013px%3B%0Aline%2Dheight%3A%201%2E3%3B%0A%7D%0A%23TOC%20%2Etoctitle%20%7B%0Afont%2Dweight%3A%20bold%3B%0Afont%2Dsize%3A%2015px%3B%0Amargin%2Dleft%3A%205px%3B%0A%7D%0A%23TOC%20ul%20%7B%0Apadding%2Dleft%3A%2040px%3B%0Amargin%2Dleft%3A%20%2D1%2E5em%3B%0Amargin%2Dtop%3A%205px%3B%0Amargin%2Dbottom%3A%205px%3B%0A%7D%0A%23TOC%20ul%20ul%20%7B%0Amargin%2Dleft%3A%20%2D2em%3B%0A%7D%0A%23TOC%20li%20%7B%0Aline%2Dheight%3A%2016px%3B%0A%7D%0Atable%20%7B%0Amargin%3A%201em%20auto%3B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dcolor%3A%20%23DDDDDD%3B%0Aborder%2Dstyle%3A%20outset%3B%0Aborder%2Dcollapse%3A%20collapse%3B%0A%7D%0Atable%20th%20%7B%0Aborder%2Dwidth%3A%202px%3B%0Apadding%3A%205px%3B%0Aborder%2Dstyle%3A%20inset%3B%0A%7D%0Atable%20td%20%7B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dstyle%3A%20inset%3B%0Aline%2Dheight%3A%2018px%3B%0Apadding%3A%205px%205px%3B%0A%7D%0Atable%2C%20table%20th%2C%20table%20td%20%7B%0Aborder%2Dleft%2Dstyle%3A%20none%3B%0Aborder%2Dright%2Dstyle%3A%20none%3B%0A%7D%0Atable%20thead%2C%20table%20tr%2Eeven%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Ap%20%7B%0Amargin%3A%200%2E5em%200%3B%0A%7D%0Ablockquote%20%7B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Apadding%3A%200%2E25em%200%2E75em%3B%0A%7D%0Ahr%20%7B%0Aborder%2Dstyle%3A%20solid%3B%0Aborder%3A%20none%3B%0Aborder%2Dtop%3A%201px%20solid%20%23777%3B%0Amargin%3A%2028px%200%3B%0A%7D%0Adl%20%7B%0Amargin%2Dleft%3A%200%3B%0A%7D%0Adl%20dd%20%7B%0Amargin%2Dbottom%3A%2013px%3B%0Amargin%2Dleft%3A%2013px%3B%0A%7D%0Adl%20dt%20%7B%0Afont%2Dweight%3A%20bold%3B%0A%7D%0Aul%20%7B%0Amargin%2Dtop%3A%200%3B%0A%7D%0Aul%20li%20%7B%0Alist%2Dstyle%3A%20circle%20outside%3B%0A%7D%0Aul%20ul%20%7B%0Amargin%2Dbottom%3A%200%3B%0A%7D%0Apre%2C%20code%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0Aborder%2Dradius%3A%203px%3B%0Acolor%3A%20%23333%3B%0Awhite%2Dspace%3A%20pre%2Dwrap%3B%20%0A%7D%0Apre%20%7B%0Aborder%2Dradius%3A%203px%3B%0Amargin%3A%205px%200px%2010px%200px%3B%0Apadding%3A%2010px%3B%0A%7D%0Apre%3Anot%28%5Bclass%5D%29%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Acode%20%7B%0Afont%2Dfamily%3A%20Consolas%2C%20Monaco%2C%20%27Courier%20New%27%2C%20monospace%3B%0Afont%2Dsize%3A%2085%25%3B%0A%7D%0Ap%20%3E%20code%2C%20li%20%3E%20code%20%7B%0Apadding%3A%202px%200px%3B%0A%7D%0Adiv%2Efigure%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0Aimg%20%7B%0Abackground%2Dcolor%3A%20%23FFFFFF%3B%0Apadding%3A%202px%3B%0Aborder%3A%201px%20solid%20%23DDDDDD%3B%0Aborder%2Dradius%3A%203px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Amargin%3A%200%205px%3B%0A%7D%0Ah1%20%7B%0Amargin%2Dtop%3A%200%3B%0Afont%2Dsize%3A%2035px%3B%0Aline%2Dheight%3A%2040px%3B%0A%7D%0Ah2%20%7B%0Aborder%2Dbottom%3A%204px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Apadding%2Dbottom%3A%202px%3B%0Afont%2Dsize%3A%20145%25%3B%0A%7D%0Ah3%20%7B%0Aborder%2Dbottom%3A%202px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Afont%2Dsize%3A%20120%25%3B%0A%7D%0Ah4%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23f7f7f7%3B%0Amargin%2Dleft%3A%208px%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Ah5%2C%20h6%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23ccc%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Aa%20%7B%0Acolor%3A%20%230033dd%3B%0Atext%2Ddecoration%3A%20none%3B%0A%7D%0Aa%3Ahover%20%7B%0Acolor%3A%20%236666ff%3B%20%7D%0Aa%3Avisited%20%7B%0Acolor%3A%20%23800080%3B%20%7D%0Aa%3Avisited%3Ahover%20%7B%0Acolor%3A%20%23BB00BB%3B%20%7D%0Aa%5Bhref%5E%3D%22http%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0Aa%5Bhref%5E%3D%22https%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0A%0Acode%20%3E%20span%2Ekw%20%7B%20color%3A%20%23555%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Edt%20%7B%20color%3A%20%23902000%3B%20%7D%20%0Acode%20%3E%20span%2Edv%20%7B%20color%3A%20%2340a070%3B%20%7D%20%0Acode%20%3E%20span%2Ebn%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Efl%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Ech%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Est%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Eco%20%7B%20color%3A%20%23888888%3B%20font%2Dstyle%3A%20italic%3B%20%7D%20%0Acode%20%3E%20span%2Eot%20%7B%20color%3A%20%23007020%3B%20%7D%20%0Acode%20%3E%20span%2Eal%20%7B%20color%3A%20%23ff0000%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Efu%20%7B%20color%3A%20%23900%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%20code%20%3E%20span%2Eer%20%7B%20color%3A%20%23a61717%3B%20background%2Dcolor%3A%20%23e3d2d2%3B%20%7D%20%0A" rel="stylesheet" type="text/css" />
</head>
<body>
<h1 class="title toc-ignore">Big IPUMS data</h1>
<h4 class="author"><em>Minnesota Population Center</em></h4>
<h4 class="date"><em>2018-07-09</em></h4>
<p>They say you should never go to a grocery store when you’re hungry. The bad new is that many IPUMS users are always hungry for more data, so every visit to our website can lead to an overly full data cart. Here are some tips for how to handle IPUMS extracts that are too large to fit on your computer (or avoid them altogether).</p>
<p>The examples in this vignette will rely on the ipumsr, dplyr and biglm packages, and the example CPS extract used in the <code>ipums-cps</code> vignette. If you want to follow along, you should follow the instructions in that vignette to make an extract.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ipumsr)
<span class="kw">library</span>(dplyr)
<span class="co"># To run the full vignette you'll also need the following packages:</span>
installed_biglm &lt;-<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">&quot;biglm&quot;</span>)
installed_db_pkgs &lt;-<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">&quot;DBI&quot;</span>) <span class="op">&amp;</span><span class="st"> </span>
<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">&quot;RSQLite&quot;</span>) <span class="op">&amp;</span><span class="st"> </span>
<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">&quot;dbplyr&quot;</span>)
<span class="co"># Change these filepaths to the filepaths of your downloaded extract</span>
cps_ddi_file &lt;-<span class="st"> &quot;cps_00001.xml&quot;</span>
cps_data_file &lt;-<span class="st"> &quot;cps_00001.dat&quot;</span></code></pre></div>
<div id="option-1-do-you-really-need-all-of-that" class="section level1">
<h1>Option 1: Do you really need all of that?</h1>
<p>The IPUMS website has many features that will let you reduce the size of your extract. The easiest thing to do is to review your sample and variable selections to see if you can drop some.</p>
<p>If you do need every sample and variable, but your analysis is on some other subset of the data, the IPUMS extract engine has a feature called “Select Cases”, where you can subset on an included variable (for example you could subset on AGE so that your extract only includes those older than 65, or subset on EDUCATION to look at only college graduates). In most IPUMS microdata projects, the select cases feature is on the “Create Extract” page, as the last step before you submit the extract.</p>
<p>Or, if you would be happy with a random subsample of the data, the IPUMS extract engine has an option to “Customize Sample Size” that will take a random sample. This feature is also available on the “Create Extract” page, as the last step before you submit the extract.</p>
</div>
<div id="option-2-trade-money-for-convenience" class="section level1">
<h1>Option 2: Trade money for convenience</h1>
<p>You can save a lot of time and effort dealing with big data if you can buy your way out of a “big data” problem by buying a computer with more RAM or using a cloud service like Amazon or Microsoft Azure (or one of the many other similar services). Here are guides for using R on <a href="https://aws.amazon.com/blogs/big-data/statistical-analysis-with-open-source-r-and-rstudio-on-amazon-emr/">Amazon</a> and <a href="https://blog.jumpingrivers.com/posts/2017/rstudio_azure_cloud_1/">Microsoft Azure</a>.</p>
</div>
<div id="option-3-work-one-chunk-at-a-time" class="section level1">
<h1>Option 3: Work one chunk at a time</h1>
<p>Starting with version 0.3, ipumsr has “chunked” versions of the microdata reading functions (<code>read_ipums_micro_chunked()</code> and <code>read_ipums_micro_list_chunked()</code>). These “chunked” versions of the function allow you to specify a function that will be applied to each chunk, and then also control how the results from these chunks are combined. This functionality is based on the chunked functionality introduced by <code>readr</code> and so is quite flexible. Below, we’ll outline solutions to three common use-cases for IPUMS data: tabulation, regression and selecting cases.</p>
<div id="chunked-tabulation-example" class="section level2">
<h2>Chunked tabulation example</h2>
<p>Let’s say you want to find the percent of people in the workforce by their self-reported health. Since this extract is small enough to fit in memory, we could just do the following:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">read_ipums_micro</span>(
cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>
) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">mutate</span>(
<span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),
<span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">lbl_relabel</span>(
<span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">&quot;Yes&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> &quot;At work&quot;</span>,
<span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">&quot;No&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> &quot;At work&quot;</span>
) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">as_factor</span>()
) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())
<span class="co">#&gt; # A tibble: 10 x 3</span>
<span class="co">#&gt; # Groups: HEALTH [?]</span>
<span class="co">#&gt; HEALTH AT_WORK n</span>
<span class="co">#&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 Excellent No 40582</span>
<span class="co">#&gt; 2 Excellent Yes 28071</span>
<span class="co">#&gt; 3 Very good No 32367</span>
<span class="co">#&gt; 4 Very good Yes 32947</span>
<span class="co">#&gt; 5 Good No 26726</span>
<span class="co">#&gt; 6 Good Yes 22483</span>
<span class="co">#&gt; 7 Fair No 11089</span>
<span class="co">#&gt; 8 Fair Yes 4520</span>
<span class="co">#&gt; 9 Poor No 5418</span>
<span class="co">#&gt; 10 Poor Yes 780</span></code></pre></div>
<p>But let’s pretend like we can only store 1,000 rows at a time. In this case, we need to use a chunked function, tabulate for each chunk, and then calculate the percentage across all of the chunks.</p>
<p>First we’ll make the callback function, which will take two arguments: x (the data from a chunk) and pos (the position of the chunk). We’ll only use x, but the callback function must always take both these arguments.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cb_function &lt;-<span class="st"> </span><span class="cf">function</span>(x, pos) {
x <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">mutate</span>(
<span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),
<span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">lbl_relabel</span>(
<span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">&quot;Yes&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> &quot;At work&quot;</span>,
<span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">&quot;No&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> &quot;At work&quot;</span>
) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">as_factor</span>()
) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())
}</code></pre></div>
<p>Next we need to create a callback object. In this case, we want to have the IPUMS metadata applied to the data (so we have value labels) and the resulting data.frames from <code>cb_function()</code> to be combined together, we’ll use <code>IpumsDataFrameCallback</code>. If we didn’t care about the value labels, we could use <code>readr::DataFrameCallback</code> instead. Callback objects are a little unusual compared to R objects you are used to because they are R6 objects. For now, all we really need to know is that to create a callback we can use, we use <code>$new()</code> syntax.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cb &lt;-<span class="st"> </span>IpumsDataFrameCallback<span class="op">$</span><span class="kw">new</span>(cb_function)</code></pre></div>
<p>Next we read in the data with the <code>read_ipums_micro_chunked()</code> function, specifying the callback and that we want 1000 to be the chunk_size.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">chunked_tabulations &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(
cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,
<span class="dt">callback =</span> cb, <span class="dt">chunk_size =</span> <span class="dv">1000</span>
)</code></pre></div>
<p>Now we have a data.frame with tabulations for each chunk, we need to tabulate once again across all chunks to get our final results.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">chunked_tabulations <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">sum</span>(n))
<span class="co">#&gt; # A tibble: 10 x 3</span>
<span class="co">#&gt; # Groups: HEALTH [?]</span>
<span class="co">#&gt; HEALTH AT_WORK n</span>
<span class="co">#&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 Excellent No 40582</span>
<span class="co">#&gt; 2 Excellent Yes 28071</span>
<span class="co">#&gt; 3 Very good No 32367</span>
<span class="co">#&gt; 4 Very good Yes 32947</span>
<span class="co">#&gt; 5 Good No 26726</span>
<span class="co">#&gt; 6 Good Yes 22483</span>
<span class="co">#&gt; 7 Fair No 11089</span>
<span class="co">#&gt; 8 Fair Yes 4520</span>
<span class="co">#&gt; 9 Poor No 5418</span>
<span class="co">#&gt; 10 Poor Yes 780</span></code></pre></div>
</div>
<div id="chunked-regression-example" class="section level2">
<h2>Chunked regression example</h2>
<p>With the biglm package, it is possible to use R to perform a regression on data that is too large to store in memory all at once. The ipumsr package provides a callback designed to make this simple <code>IpumsBiglmCallback</code>.</p>
<p>Again we’ll use the CPS example, which is small enough that we can keep it in memory. Here’s an example of a regression looking at how hours work, self-reported health and age are related among those who are currently working. This is meant as a simple example, and ignores many of the complexities in this relationship, so please use caution when interpreting.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Read in data</span>
data &lt;-<span class="st"> </span><span class="kw">read_ipums_micro</span>(
cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>
)
<span class="co"># Prepare data for model</span>
<span class="co"># (age has been capped at 99, which we assume is high enough to not</span>
<span class="co"># cause any problems so we leave it.)</span>
data &lt;-<span class="st"> </span>data <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">mutate</span>(
<span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),
<span class="dt">AHRSWORKT =</span> <span class="kw">lbl_na_if</span>(AHRSWORKT, <span class="op">~</span>.lbl <span class="op">==</span><span class="st"> &quot;NIU (Not in universe)&quot;</span>),
<span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">lbl_relabel</span>(
<span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">&quot;Yes&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> &quot;At work&quot;</span>,
<span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">&quot;No&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> &quot;At work&quot;</span>
) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">as_factor</span>()
) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">filter</span>(AT_WORK <span class="op">==</span><span class="st"> &quot;Yes&quot;</span>)
<span class="co"># Run regression</span>
model &lt;-<span class="st"> </span><span class="kw">lm</span>(AHRSWORKT <span class="op">~</span><span class="st"> </span>AGE <span class="op">+</span><span class="st"> </span><span class="kw">I</span>(AGE<span class="op">^</span><span class="dv">2</span>) <span class="op">+</span><span class="st"> </span>HEALTH, data)
<span class="kw">summary</span>(model)
<span class="co">#&gt; </span>
<span class="co">#&gt; Call:</span>
<span class="co">#&gt; lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)</span>
<span class="co">#&gt; </span>
<span class="co">#&gt; Residuals:</span>
<span class="co">#&gt; &lt;Labelled double&gt;</span>
<span class="co">#&gt; Min 1Q Median 3Q Max </span>
<span class="co">#&gt; -41.230 -4.949 -0.080 5.945 75.697 </span>
<span class="co">#&gt; </span>
<span class="co">#&gt; Coefficients:</span>
<span class="co">#&gt; Estimate Std. Error t value Pr(&gt;|t|) </span>
<span class="co">#&gt; (Intercept) 5.626953 0.367851 15.297 &lt; 2e-16 ***</span>
<span class="co">#&gt; AGE 1.568287 0.017790 88.156 &lt; 2e-16 ***</span>
<span class="co">#&gt; I(AGE^2) -0.016798 0.000204 -82.338 &lt; 2e-16 ***</span>
<span class="co">#&gt; HEALTHVery good -0.280826 0.104433 -2.689 0.00717 ** </span>
<span class="co">#&gt; HEALTHGood -1.275358 0.115861 -11.008 &lt; 2e-16 ***</span>
<span class="co">#&gt; HEALTHFair -3.614487 0.207121 -17.451 &lt; 2e-16 ***</span>
<span class="co">#&gt; HEALTHPoor -5.732656 0.465751 -12.308 &lt; 2e-16 ***</span>
<span class="co">#&gt; ---</span>
<span class="co">#&gt; Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#&gt; </span>
<span class="co">#&gt; Residual standard error: 12.8 on 88794 degrees of freedom</span>
<span class="co">#&gt; Multiple R-squared: 0.08886, Adjusted R-squared: 0.08879 </span>
<span class="co">#&gt; F-statistic: 1443 on 6 and 88794 DF, p-value: &lt; 2.2e-16</span></code></pre></div>
<p>To do the same regression, but with only 1000 rows loaded at a time, we work in a similar manner.</p>
<p>First we make the <code>IpumsBiglmCallback</code> callback object that specifies both the model and a function to prepare the data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">biglm_cb &lt;-<span class="st"> </span>IpumsBiglmCallback<span class="op">$</span><span class="kw">new</span>(
<span class="dt">model =</span> AHRSWORKT <span class="op">~</span><span class="st"> </span>AGE <span class="op">+</span><span class="st"> </span><span class="kw">I</span>(AGE<span class="op">^</span><span class="dv">2</span>) <span class="op">+</span><span class="st"> </span>HEALTH,
<span class="dt">prep =</span> <span class="cf">function</span>(x, pos) {
x <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">mutate</span>(
<span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),
<span class="dt">AHRSWORKT =</span> <span class="kw">lbl_na_if</span>(AHRSWORKT, <span class="op">~</span>.lbl <span class="op">==</span><span class="st"> &quot;NIU (Not in universe)&quot;</span>),
<span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">lbl_relabel</span>(
<span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">&quot;Yes&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> &quot;At work&quot;</span>,
<span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">&quot;No&quot;</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> &quot;At work&quot;</span>
) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">as_factor</span>()
) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">filter</span>(AT_WORK <span class="op">==</span><span class="st"> &quot;Yes&quot;</span>)
}
)</code></pre></div>
<p>And then we read the data using <code>read_ipums_micro_chunked()</code>, passing the callback that we just made.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">chunked_model &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(
cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,
<span class="dt">callback =</span> biglm_cb, <span class="dt">chunk_size =</span> <span class="dv">1000</span>
)
<span class="kw">summary</span>(chunked_model)
<span class="co">#&gt; Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...)</span>
<span class="co">#&gt; Sample size = 88801 </span>
<span class="co">#&gt; Coef (95% CI) SE p</span>
<span class="co">#&gt; (Intercept) 5.6270 4.8913 6.3627 0.3679 0.0000</span>
<span class="co">#&gt; AGE 1.5683 1.5327 1.6039 0.0178 0.0000</span>
<span class="co">#&gt; I(AGE^2) -0.0168 -0.0172 -0.0164 0.0002 0.0000</span>
<span class="co">#&gt; HEALTHVery good -0.2808 -0.4897 -0.0720 0.1044 0.0072</span>
<span class="co">#&gt; HEALTHGood -1.2754 -1.5071 -1.0436 0.1159 0.0000</span>
<span class="co">#&gt; HEALTHFair -3.6145 -4.0287 -3.2002 0.2071 0.0000</span>
<span class="co">#&gt; HEALTHPoor -5.7327 -6.6642 -4.8012 0.4658 0.0000</span></code></pre></div>
</div>
<div id="chunked-select-cases-example" class="section level2">
<h2>Chunked “select cases” example</h2>
<p>Sometimes you may want to select a subset of the data before reading it in. The IPUMS website has this functionality built in, which can be a faster way to do this (this “select cases” functionality is described in the first section above). Also, linux commands like <code>awk</code> and <code>sed</code> will generally be much faster than these R based solutions. However, it is possible to use the chunked functions to create a subset, which can be convenient if you want to subset on some complex logic that would be hard to code into the IPUMS extract system or linux tools.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Subset only those in &quot;Poor&quot; health</span>
chunked_subset &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(
cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,
<span class="dt">callback =</span> IpumsDataFrameCallback<span class="op">$</span><span class="kw">new</span>(<span class="cf">function</span>(x, pos) {
<span class="kw">filter</span>(x, HEALTH <span class="op">==</span><span class="st"> </span><span class="dv">5</span>)
}),
<span class="dt">chunk_size =</span> <span class="dv">1000</span>
)</code></pre></div>
</div>
</div>
<div id="option-4-use-a-database" class="section level1">
<h1>Option 4: Use a database</h1>
<p>Databases are another option for data that cannot fit in memory as an R data.frame. If you have access to a database on a remote machine, then you can easily pull in parts of the data for your analysis. Even if you’ll need to store the database on your machine, it may have more efficient storage of data so your data fits in your memory, or it may use your hard drive.</p>
<p>R’s tools for integrating with databases are improving quickly. The DBI package has been updated, dplyr (through dbplyr) provides a frontend that allows you to write the same code for data in a database as you would in a local data.frame, and packages like sparklyr, sparkR, bigrquery and others provide access to the latest and greatest.</p>
<p>There are many different kinds of databases, each with their own benefits, weaknesses and tradeoffs. As such, it’s hard to give concrete advice without knowing your specific use-case. However, once you’ve chosen a database, in general, there will be two steps: Importing the data to the database and then connecting it to R.</p>
<p>As an example, we’ll use the RSQLite package to load the data into an in-memory database. RSQLite is great because it is easy to set up, but it is probably not efficient enough to help you if you need to use a database because your data doesn’t fit in memory.</p>
<div id="importing-data-into-a-database" class="section level2">
<h2>Importing data into a database</h2>
<p>When using rectangular extracts, your best bet to import IPUMS data into your database is probably going to be a csv file. Most databases support csv importing, and these implementations will generally be well supported since this is a common file format.</p>
<p>However, if you need a hierarchical extract, or your database software doesn’t support the csv format, then you can use the chunking functions to load the data into a database without storing the full data in R.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Connect to database</span>
<span class="kw">library</span>(DBI)
<span class="kw">library</span>(RSQLite)
con &lt;-<span class="st"> </span><span class="kw">dbConnect</span>(<span class="kw">SQLite</span>(), <span class="dt">path =</span> <span class="st">&quot;:memory:&quot;</span>)
<span class="co"># Add data to tables in chunks</span>
ddi &lt;-<span class="st"> </span><span class="kw">read_ipums_ddi</span>(<span class="kw">ipums_example</span>(<span class="st">&quot;cps_00010.xml&quot;</span>))
<span class="kw">read_ipums_micro_list_chunked</span>(
ddi,
readr<span class="op">::</span>SideEffectChunkCallback<span class="op">$</span><span class="kw">new</span>(<span class="cf">function</span>(x, pos) {
<span class="cf">if</span> (pos <span class="op">==</span><span class="st"> </span><span class="dv">1</span>) {
<span class="kw">dbWriteTable</span>(con, <span class="st">&quot;person&quot;</span>, x<span class="op">$</span>P)
<span class="kw">dbWriteTable</span>(con, <span class="st">&quot;household&quot;</span>, x<span class="op">$</span>H)
} <span class="cf">else</span> {
<span class="kw">dbWriteTable</span>(con, <span class="st">&quot;person&quot;</span>, x<span class="op">$</span>P, <span class="dt">row.names =</span> <span class="ot">FALSE</span>, <span class="dt">append =</span> <span class="ot">TRUE</span>)
<span class="kw">dbWriteTable</span>(con, <span class="st">&quot;household&quot;</span>, x<span class="op">$</span>H, <span class="dt">row.names =</span> <span class="ot">FALSE</span>, <span class="dt">append =</span> <span class="ot">TRUE</span>)
}
}),
<span class="dt">chunk_size =</span> <span class="dv">1000</span>,
<span class="dt">verbose =</span> <span class="ot">FALSE</span>
)
<span class="co">#&gt; NULL</span></code></pre></div>
</div>
<div id="connecting-to-a-database-with-dbplyr" class="section level2">
<h2>Connecting to a database with dbplyr</h2>
<p>The dbplyr vignette “dbplyr” (which you can access with <code>vignette(&quot;dbplyr&quot;, package = &quot;dbplyr&quot;)</code>) is a good place to get started learning about how to connect to a database. Here I’ll just briefly show some examples.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">example &lt;-<span class="st"> </span><span class="kw">tbl</span>(con, <span class="st">&quot;person&quot;</span>)
example <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">filter</span>(<span class="st">'AGE'</span> <span class="op">&gt;</span><span class="st"> </span><span class="dv">25</span>)
<span class="co">#&gt; # Source: lazy query [?? x 6]</span>
<span class="co">#&gt; # Database: sqlite 3.22.0 []</span>
<span class="co">#&gt; RECTYPE YEAR SERIAL PERNUM WTSUPP INCTOT</span>
<span class="co">#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1 P 1962 80 1 14755900 4883</span>
<span class="co">#&gt; 2 P 1962 80 2 14707200 5800</span>
<span class="co">#&gt; 3 P 1962 80 3 15787500 99999998</span>
<span class="co">#&gt; 4 P 1962 82 1 15976100 14015</span>
<span class="co">#&gt; 5 P 1962 83 1 17066500 16552</span>
<span class="co">#&gt; 6 P 1962 84 1 17902500 6375</span>
<span class="co">#&gt; 7 P 1962 107 1 43554000 99999999</span>
<span class="co">#&gt; 8 P 1962 107 2 13858100 0</span>
<span class="co">#&gt; 9 P 1962 107 3 16291000 600</span>
<span class="co">#&gt; 10 P 1962 107 4 14322400 99999999</span>
<span class="co">#&gt; # ... with more rows</span></code></pre></div>
<p>Though dbplyr shows us a nice preview of the first rows of the result of our query, the data still lives in the database. When using a regular database, in general you’d use the function <code>dplyr::collect()</code> to load in the full results of the query to your R session. However, the database has no concept of IPUMS attributes like value and variable lables, so if you want them, you can use <code>ipums_collect()</code> like so:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">example <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">filter</span>(<span class="st">'AGE'</span> <span class="op">&gt;</span><span class="st"> </span><span class="dv">25</span>) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">ipums_collect</span>(ddi)
<span class="co">#&gt; # A tibble: 7,668 x 6</span>
<span class="co">#&gt; RECTYPE YEAR SERIAL PERNUM WTSUPP INCTOT </span>
<span class="co">#&gt; &lt;chr+lbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl+lbl&gt;</span>
<span class="co">#&gt; 1 P 1962 80 1 14755900 4883 </span>
<span class="co">#&gt; 2 P 1962 80 2 14707200 5800 </span>
<span class="co">#&gt; 3 P 1962 80 3 15787500 99999998 </span>
<span class="co">#&gt; 4 P 1962 82 1 15976100 14015 </span>
<span class="co">#&gt; 5 P 1962 83 1 17066500 16552 </span>
<span class="co">#&gt; 6 P 1962 84 1 17902500 6375 </span>
<span class="co">#&gt; 7 P 1962 107 1 43554000 99999999 </span>
<span class="co">#&gt; 8 P 1962 107 2 13858100 0 </span>
<span class="co">#&gt; 9 P 1962 107 3 16291000 600 </span>
<span class="co">#&gt; 10 P 1962 107 4 14322400 99999999 </span>
<span class="co">#&gt; # ... with 7,658 more rows</span></code></pre></div>
</div>
</div>
<div id="learning-more" class="section level1">
<h1>Learning more</h1>
<p>Big data is a problem for lots of R users, not just IPUMS users, so there are a lot of resources to help you out! These are just a few that I found useful while compling this document:</p>
<ul>
<li><em>Best practice to handle out-of-memory data</em> - RStudio Community Thread <a href="https://community.rstudio.com/t/best-practice-to-handle-out-of-memory-data/734">link</a></li>
<li><em>Big Data in R</em> - Part of Stephen Mooney’s EPIC: Epidemiologic Analysis Using R, June 2015 class <a href="http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf">link</a></li>
<li><em>Statistical Analysis with Open-Source R and RStudio on Amazon EMR</em> - Markus Schmidberger on the AWS Big Data Blog <a href="https://aws.amazon.com/blogs/big-data/statistical-analysis-with-open-source-r-and-rstudio-on-amazon-emr/">link</a></li>
<li><em>Hosting RStudio Server on Azure</em> - Colin Gillespie’s blog post on using Rstudion on Azure <a href="https://blog.jumpingrivers.com/posts/2017/rstudio_azure_cloud_1/">link</a></li>
<li><em>Improving DBI: A Retrospect</em> - Kirill Müller’s report on the R Consortium grant to improve database support in R <a href="https://www.r-consortium.org/blog/2017/05/15/improving-dbi-a-retrospect">link</a></li>
<li><em>R6</em> - a chapter of Hadley Wickham’s upcoming revision to his Advanced R book (Relevant if you’re curious about how readr/ipumsr callback objects work <a href="https://github.com/hadley/adv-r/blob/master/R6.Rmd">link</a> (in progress)</li>
</ul>
</div>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment