ipumsr big data vignette
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="Minnesota Population Center">
<meta name="date" content="2018-07-16">
<title>Big IPUMS data</title>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
a.sourceLine { display: inline-block; line-height: 1.25; }
a.sourceLine { pointer-events: none; color: inherit; text-decoration: inherit; }
a.sourceLine:empty { height: 1.2em; position: absolute; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
a.sourceLine { text-indent: -1em; padding-left: 1em; }
}
pre.numberSource a.sourceLine
{ position: relative; }
pre.numberSource a.sourceLine:empty
{ position: absolute; }
pre.numberSource a.sourceLine::before
{ content: attr(data-line-number);
position: absolute; left: -5em; text-align: right; vertical-align: baseline;
border: none; pointer-events: all;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
a.sourceLine::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<style type="text/css">body {
background-color: #fff;
margin: 1em auto;
max-width: 700px;
overflow: visible;
padding-left: 2em;
padding-right: 2em;
font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
font-size: 14px;
line-height: 1.35;
}
#header {
text-align: center;
}
#TOC {
clear: both;
margin: 0 0 10px 10px;
padding: 4px;
width: 400px;
border: 1px solid #CCCCCC;
border-radius: 5px;
background-color: #f6f6f6;
font-size: 13px;
line-height: 1.3;
}
#TOC .toctitle {
font-weight: bold;
font-size: 15px;
margin-left: 5px;
}
#TOC ul {
padding-left: 40px;
margin-left: -1.5em;
margin-top: 5px;
margin-bottom: 5px;
}
#TOC ul ul {
margin-left: -2em;
}
#TOC li {
line-height: 16px;
}
table {
margin: 1em auto;
border-width: 1px;
border-color: #DDDDDD;
border-style: outset;
border-collapse: collapse;
}
table th {
border-width: 2px;
padding: 5px;
border-style: inset;
}
table td {
border-width: 1px;
border-style: inset;
line-height: 18px;
padding: 5px 5px;
}
table, table th, table td {
border-left-style: none;
border-right-style: none;
}
table thead, table tr.even {
background-color: #f7f7f7;
}
p {
margin: 0.5em 0;
}
blockquote {
background-color: #f6f6f6;
padding: 0.25em 0.75em;
}
hr {
border-style: solid;
border: none;
border-top: 1px solid #777;
margin: 28px 0;
}
dl {
margin-left: 0;
}
dl dd {
margin-bottom: 13px;
margin-left: 13px;
}
dl dt {
font-weight: bold;
}
ul {
margin-top: 0;
}
ul li {
list-style: circle outside;
}
ul ul {
margin-bottom: 0;
}
pre, code {
background-color: #f7f7f7;
border-radius: 3px;
color: #333;
white-space: pre-wrap;
}
pre {
border-radius: 3px;
margin: 5px 0px 10px 0px;
padding: 10px;
}
pre:not([class]) {
background-color: #f7f7f7;
}
code {
font-family: Consolas, Monaco, 'Courier New', monospace;
font-size: 85%;
}
p > code, li > code {
padding: 2px 0px;
}
div.figure {
text-align: center;
}
img {
background-color: #FFFFFF;
padding: 2px;
border: 1px solid #DDDDDD;
border-radius: 3px;
border: 1px solid #CCCCCC;
margin: 0 5px;
}
h1 {
margin-top: 0;
font-size: 35px;
line-height: 40px;
}
h2 {
border-bottom: 4px solid #f7f7f7;
padding-top: 10px;
padding-bottom: 2px;
font-size: 145%;
}
h3 {
border-bottom: 2px solid #f7f7f7;
padding-top: 10px;
font-size: 120%;
}
h4 {
border-bottom: 1px solid #f7f7f7;
margin-left: 8px;
font-size: 105%;
}
h5, h6 {
border-bottom: 1px solid #ccc;
font-size: 105%;
}
a {
color: #0033dd;
text-decoration: none;
}
a:hover {
color: #6666ff; }
a:visited {
color: #800080; }
a:visited:hover {
color: #BB00BB; }
a[href^="http:"] {
text-decoration: underline; }
a[href^="https:"] {
text-decoration: underline; }
code > span.kw { color: #555; font-weight: bold; }
code > span.dt { color: #902000; }
code > span.dv { color: #40a070; }
code > span.bn { color: #d14; }
code > span.fl { color: #d14; }
code > span.ch { color: #d14; }
code > span.st { color: #d14; }
code > span.co { color: #888888; font-style: italic; }
code > span.ot { color: #007020; }
code > span.al { color: #ff0000; font-weight: bold; }
code > span.fu { color: #900; font-weight: bold; } code > span.er { color: #a61717; background-color: #e3d2d2; }
</style>
<script type="text/javascript" src="./Big IPUMS data_files/MathJax.js.download"></script>
</head>
<body>
<h1 class="title toc-ignore">Big IPUMS data</h1>
<h4 class="author"><em>Minnesota Population Center</em></h4>
<h4 class="date"><em>2018-07-16</em></h4>
<p>Browsing data on the IPUMS website can be a little like grocery shopping when you’re hungry: you show up to grab a couple of things, but everything looks so good that you end up with an overflowing cart<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. Sometimes your extract gets so large that it doesn’t fit in your computer’s memory. If this is the case, both the IPUMS website and the ipumsr package have tools to help.</p>
<p>If you can’t fit your whole IPUMS dataset into memory, you’ve got four basic options:</p>
<ol style="list-style-type: decimal">
<li>Get more memory.</li>
<li>Reduce the size of your dataset.</li>
<li>Use “chunked” reading.</li>
<li>Use a database.</li>
</ol>
<p>The IPUMS website has features to help with option 2, and the ipumsr package can help you with options 3 and 4 (option 1 relies on your wallet).</p>
<p>The examples in this vignette rely on the ipumsr, dplyr and biglm packages, and on the example CPS extract used in the <code>ipums-cps</code> vignette. To follow along, use the instructions in that vignette to make an extract.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw">library</span>(ipumsr)</a>
<a class="sourceLine" id="cb1-2" data-line-number="2"><span class="kw">library</span>(dplyr)</a>
<a class="sourceLine" id="cb1-3" data-line-number="3"></a>
<a class="sourceLine" id="cb1-4" data-line-number="4"><span class="co"># To run the full vignette you'll also need the following packages:</span></a>
<a class="sourceLine" id="cb1-5" data-line-number="5">installed_biglm &lt;-<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">"biglm"</span>)</a>
<a class="sourceLine" id="cb1-6" data-line-number="6">installed_db_pkgs &lt;-<span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">"DBI"</span>) <span class="op">&amp;&amp;</span><span class="st"> </span></a>
<a class="sourceLine" id="cb1-7" data-line-number="7"><span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">"RSQLite"</span>) <span class="op">&amp;&amp;</span><span class="st"> </span></a>
<a class="sourceLine" id="cb1-8" data-line-number="8"><span class="st"> </span><span class="kw">requireNamespace</span>(<span class="st">"dbplyr"</span>)</a>
<a class="sourceLine" id="cb1-9" data-line-number="9"></a>
<a class="sourceLine" id="cb1-10" data-line-number="10"><span class="co"># Change these filepaths to the filepaths of your downloaded extract</span></a>
<a class="sourceLine" id="cb1-11" data-line-number="11">cps_ddi_file &lt;-<span class="st"> "cps_00001.xml"</span></a>
<a class="sourceLine" id="cb1-12" data-line-number="12">cps_data_file &lt;-<span class="st"> "cps_00001.dat"</span></a></code></pre></div>
<div id="option-1-trade-money-for-convenience" class="section level1">
<h1>Option 1: Trade money for convenience</h1>
<p>If you’ve got a dataset that’s too big for your RAM, you could always get more. You could do this by upgrading your current computer, getting a new one, or paying for a cloud service such as Amazon Web Services or Microsoft Azure (or one of the many similar services). Here are guides for using R on <a href="https://aws.amazon.com/blogs/big-data/statistical-analysis-with-open-source-r-and-rstudio-on-amazon-emr/">Amazon</a> and <a href="https://blog.jumpingrivers.com/posts/2017/rstudio_azure_cloud_1/">Microsoft Azure</a>.</p>
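<p>To judge how much memory an extract needs, a back-of-the-envelope calculation can help. The sketch below is only a rough estimate with made-up dimensions (<code>n_rows</code> and <code>n_cols</code> are hypothetical), and it assumes every value is stored as an 8-byte double, which is how R stores numeric columns. Integer columns take less space and character columns vary, and an analysis usually needs room for copies of the data on top of this.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical extract dimensions (made-up numbers for illustration)
n_rows &lt;- 5e6
n_cols &lt;- 40

# Assumption: every value is stored as a double (8 bytes)
bytes_per_value &lt;- 8

# Estimated size in gigabytes
est_gb &lt;- n_rows * n_cols * bytes_per_value / 1024^3
round(est_gb, 2)</code></pre></div>
<p>For data that is already loaded, <code>object.size()</code> reports the size of the object in memory.</p>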
</div>
<div id="option-2-do-you-really-need-all-of-that" class="section level1">
<h1>Option 2: Do you really need all of that?</h1>
<p>The IPUMS website has many features that will let you reduce the size of your extract. The easiest thing to do is to review your sample and variable selections to see if you can drop some.</p>
<p>If you do need every sample and variable, but your analysis is on a specific subset of the data, the IPUMS extract engine has a feature called “Select Cases”, which subsets the extract on an included variable (for example, you could subset on AGE so that your extract only includes people older than 65, or on EDUCATION to keep only college graduates). In most IPUMS microdata projects, the “Select Cases” feature is on the “Create Extract” page, as the last step before you submit the extract. If you’ve already submitted the extract, you can click the “revise” link on the “Download or Revise Extracts” page to access the “Select Cases” feature.</p>
<p>Or, if you would be happy with a random subsample of the data, the IPUMS extract engine has an option to “Customize Sample Size” that will take a random sample. This feature is also available on the “Create Extract” page, as the last step before you submit the extract. Again, if you’ve already submitted your extract, you can access this feature by clicking the “revise” link on the “Download or Revise Extracts” page.</p>
</div>
<div id="option-3-work-one-chunk-at-a-time" class="section level1">
<h1>Option 3: Work one chunk at a time</h1>
<p>ipumsr has “chunked” versions of the microdata reading functions (<code>read_ipums_micro_chunked()</code> and <code>read_ipums_micro_list_chunked()</code>). These chunked versions of the functions allow you to specify a function that will be applied to each chunk, and then also control how the results from these chunks are combined. This functionality is based on the chunked functionality introduced by <code>readr</code> and so is quite flexible. Below, we’ll outline solutions to three common use-cases for IPUMS data: tabulation, regression and selecting cases.</p>
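<p>The idea behind chunked reading can be sketched in base R without any files. The helper <code>apply_in_chunks()</code> below is hypothetical (it is not part of ipumsr or readr) and the data are made up. It only illustrates the pattern: walk through the rows a chunk at a time, call a function with the chunk and its starting position, and combine the pieces. Here the callback selects cases, and the combined result matches what you would get by filtering everything at once.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper (not part of ipumsr or readr): walk a data frame
# chunk_size rows at a time, call callback(x, pos) on each chunk, and
# row-bind whatever the callback returns.
apply_in_chunks &lt;- function(data, chunk_size, callback) {
  starts &lt;- seq(1, nrow(data), by = chunk_size)
  results &lt;- lapply(starts, function(pos) {
    rows &lt;- pos:min(pos + chunk_size - 1, nrow(data))
    callback(data[rows, , drop = FALSE], pos)
  })
  do.call(rbind, results)
}

# Made-up toy data standing in for a large extract
set.seed(1)
toy &lt;- data.frame(ID = 1:25, AGE = sample(18:90, 25, replace = TRUE))

# Callback: keep only the rows of this chunk where AGE is over 65
keep_older &lt;- function(x, pos) {
  x[x$AGE &gt; 65, , drop = FALSE]
}

chunked_result &lt;- apply_in_chunks(toy, chunk_size = 10, callback = keep_older)

# Check that the chunked result has the same rows as filtering all at once
identical(chunked_result$ID, toy$ID[toy$AGE &gt; 65])</code></pre></div>
<p>Only one chunk (plus the accumulated results) has to be held at a time, which is why this pattern helps when the full file does not fit in memory.</p>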
<div id="chunked-tabulation-example" class="section level2">
<h2>Chunked tabulation example</h2>
<p>Let’s say you want to know what share of people are at work, broken down by their self-reported health. The counts needed for this come from a tabulation, and since this extract is small enough to fit in memory, we could just do the following:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" data-line-number="1"><span class="kw">read_ipums_micro</span>(</a>
<a class="sourceLine" id="cb2-2" data-line-number="2"> cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span></a>
<a class="sourceLine" id="cb2-3" data-line-number="3">) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb2-4" data-line-number="4"><span class="st"> </span><span class="kw">mutate</span>(</a>
<a class="sourceLine" id="cb2-5" data-line-number="5"> <span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),</a>
<a class="sourceLine" id="cb2-6" data-line-number="6"> <span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb2-7" data-line-number="7"><span class="st"> </span><span class="kw">lbl_relabel</span>(</a>
<a class="sourceLine" id="cb2-8" data-line-number="8"> <span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">"Yes"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> "At work"</span>, </a>
<a class="sourceLine" id="cb2-9" data-line-number="9"> <span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">"No"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> "At work"</span></a>
<a class="sourceLine" id="cb2-10" data-line-number="10"> ) <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb2-11" data-line-number="11"><span class="st"> </span><span class="kw">as_factor</span>()</a>
<a class="sourceLine" id="cb2-12" data-line-number="12"> ) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb2-13" data-line-number="13"><span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb2-14" data-line-number="14"><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())</a>
<a class="sourceLine" id="cb2-15" data-line-number="15"><span class="co">#&gt; # A tibble: 10 x 3</span></a>
<a class="sourceLine" id="cb2-16" data-line-number="16"><span class="co">#&gt; # Groups: HEALTH [?]</span></a>
<a class="sourceLine" id="cb2-17" data-line-number="17"><span class="co">#&gt; HEALTH AT_WORK n</span></a>
<a class="sourceLine" id="cb2-18" data-line-number="18"><span class="co">#&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;</span></a>
<a class="sourceLine" id="cb2-19" data-line-number="19"><span class="co">#&gt; 1 Excellent No 40582</span></a>
<a class="sourceLine" id="cb2-20" data-line-number="20"><span class="co">#&gt; 2 Excellent Yes 28071</span></a>
<a class="sourceLine" id="cb2-21" data-line-number="21"><span class="co">#&gt; 3 Very good No 32367</span></a>
<a class="sourceLine" id="cb2-22" data-line-number="22"><span class="co">#&gt; 4 Very good Yes 32947</span></a>
<a class="sourceLine" id="cb2-23" data-line-number="23"><span class="co">#&gt; 5 Good No 26726</span></a>
<a class="sourceLine" id="cb2-24" data-line-number="24"><span class="co">#&gt; 6 Good Yes 22483</span></a>
<a class="sourceLine" id="cb2-25" data-line-number="25"><span class="co">#&gt; 7 Fair No 11089</span></a>
<a class="sourceLine" id="cb2-26" data-line-number="26"><span class="co">#&gt; 8 Fair Yes 4520</span></a>
<a class="sourceLine" id="cb2-27" data-line-number="27"><span class="co">#&gt; 9 Poor No 5418</span></a>
<a class="sourceLine" id="cb2-28" data-line-number="28"><span class="co">#&gt; 10 Poor Yes 780</span></a></code></pre></div>
<p>But let’s pretend that we can only hold 1,000 rows in memory at a time. In this case, we need to use a chunked function, tabulate within each chunk, and then add up the counts across all of the chunks.</p>
<p>First we’ll make the callback function, which will take two arguments: x (the data from a chunk) and pos (the position of the chunk, expressed as the line in the input file at which the chunk starts). We’ll only use x, but the callback function must always take both these arguments.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" data-line-number="1">cb_function &lt;-<span class="st"> </span><span class="cf">function</span>(x, pos) {</a>
<a class="sourceLine" id="cb3-2" data-line-number="2"> x <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">mutate</span>(</a>
<a class="sourceLine" id="cb3-3" data-line-number="3"> <span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),</a>
<a class="sourceLine" id="cb3-4" data-line-number="4"> <span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb3-5" data-line-number="5"><span class="st"> </span><span class="kw">lbl_relabel</span>(</a>
<a class="sourceLine" id="cb3-6" data-line-number="6"> <span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">"Yes"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> "At work"</span>, </a>
<a class="sourceLine" id="cb3-7" data-line-number="7"> <span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">"No"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> "At work"</span></a>
<a class="sourceLine" id="cb3-8" data-line-number="8"> ) <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb3-9" data-line-number="9"><span class="st"> </span><span class="kw">as_factor</span>()</a>
<a class="sourceLine" id="cb3-10" data-line-number="10"> ) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb3-11" data-line-number="11"><span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb3-12" data-line-number="12"><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">n</span>())</a>
<a class="sourceLine" id="cb3-13" data-line-number="13">}</a></code></pre></div>
<p>Next we need to create a callback object. The choice of a callback object depends mainly on how we want to combine the results from applying our callback function to each chunk. In this case, we want to row-bind the data.frames returned by <code>cb_function()</code>. If we didn’t care about attaching IPUMS value labels and other metadata, we could use <code>readr::DataFrameCallback</code>, but ipumsr includes the <code>IpumsDataFrameCallback</code> object that allows you to preserve this metadata.</p>
<p>Callback objects are <a href="https://cran.r-project.org/web/packages/R6/index.html">R6</a> objects, but you don’t need to be familiar with R6 to use them<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. For now, all we need to know is that a callback object is created with <code>$new()</code> syntax.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1">cb &lt;-<span class="st"> </span>IpumsDataFrameCallback<span class="op">$</span><span class="kw">new</span>(cb_function)</a></code></pre></div>
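<p>For comparison, the plain readr callback mentioned above is created the same way. The sketch below assumes the readr package is installed and uses a small made-up CSV written to a temporary file, not an IPUMS extract. The names <code>toy_file</code> and <code>count_groups</code> are only for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(readr)

# Toy CSV written to a temporary file (made-up data, not an IPUMS extract)
toy_file &lt;- tempfile(fileext = ".csv")
toy &lt;- data.frame(
  group = rep(c("a", "b"), times = 5),
  value = 1:10
)
write.csv(toy, toy_file, row.names = FALSE)

# Callback function: count the rows per group within a single chunk
count_groups &lt;- function(x, pos) {
  as.data.frame(
    table(group = x$group),
    responseName = "n",
    stringsAsFactors = FALSE
  )
}

# DataFrameCallback row-binds the data.frames returned for each chunk
per_chunk &lt;- read_csv_chunked(
  toy_file,
  callback = DataFrameCallback$new(count_groups),
  chunk_size = 4,
  col_types = "ci",
  progress = FALSE
)

# Sum the per-chunk counts to get the totals for the whole file
totals &lt;- aggregate(n ~ group, data = per_chunk, FUN = sum)
totals</code></pre></div>
<p>Because this callback returns an ordinary data.frame, nothing is lost by using the readr version here. With IPUMS data, <code>IpumsDataFrameCallback</code> is the one that keeps the value labels and other metadata.</p>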
<p>Next we read in the data with the <code>read_ipums_micro_chunked()</code> function, specifying the callback and that we want the <code>chunk_size</code> to be 1000.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" data-line-number="1">chunked_tabulations &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(</a>
<a class="sourceLine" id="cb5-2" data-line-number="2"> cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,</a>
<a class="sourceLine" id="cb5-3" data-line-number="3"> <span class="dt">callback =</span> cb, <span class="dt">chunk_size =</span> <span class="dv">1000</span></a>
<a class="sourceLine" id="cb5-4" data-line-number="4">)</a></code></pre></div>
<p>Now we have a data.frame with the counts by health and work status within each chunk. To get the full table, we just need to sum by health and work status one more time.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" data-line-number="1">chunked_tabulations <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb6-2" data-line-number="2"><span class="st"> </span><span class="kw">group_by</span>(HEALTH, AT_WORK) <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb6-3" data-line-number="3"><span class="st"> </span><span class="kw">summarize</span>(<span class="dt">n =</span> <span class="kw">sum</span>(n))</a>
<a class="sourceLine" id="cb6-4" data-line-number="4"><span class="co">#&gt; # A tibble: 10 x 3</span></a>
<a class="sourceLine" id="cb6-5" data-line-number="5"><span class="co">#&gt; # Groups: HEALTH [?]</span></a>
<a class="sourceLine" id="cb6-6" data-line-number="6"><span class="co">#&gt; HEALTH AT_WORK n</span></a>
<a class="sourceLine" id="cb6-7" data-line-number="7"><span class="co">#&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;</span></a>
<a class="sourceLine" id="cb6-8" data-line-number="8"><span class="co">#&gt; 1 Excellent No 40582</span></a>
<a class="sourceLine" id="cb6-9" data-line-number="9"><span class="co">#&gt; 2 Excellent Yes 28071</span></a>
<a class="sourceLine" id="cb6-10" data-line-number="10"><span class="co">#&gt; 3 Very good No 32367</span></a>
<a class="sourceLine" id="cb6-11" data-line-number="11"><span class="co">#&gt; 4 Very good Yes 32947</span></a>
<a class="sourceLine" id="cb6-12" data-line-number="12"><span class="co">#&gt; 5 Good No 26726</span></a>
<a class="sourceLine" id="cb6-13" data-line-number="13"><span class="co">#&gt; 6 Good Yes 22483</span></a>
<a class="sourceLine" id="cb6-14" data-line-number="14"><span class="co">#&gt; 7 Fair No 11089</span></a>
<a class="sourceLine" id="cb6-15" data-line-number="15"><span class="co">#&gt; 8 Fair Yes 4520</span></a>
<a class="sourceLine" id="cb6-16" data-line-number="16"><span class="co">#&gt; 9 Poor No 5418</span></a>
<a class="sourceLine" id="cb6-17" data-line-number="17"><span class="co">#&gt; 10 Poor Yes 780</span></a></code></pre></div>
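<p>The same two-step pattern (tabulate within each chunk, then combine the chunk results) can be tried on a toy file without any IPUMS data. This is only a sketch: it uses readr’s <code>read_csv_chunked()</code> and <code>DataFrameCallback</code> in place of the ipumsr versions, and the file <code>toy_file</code> and its columns <code>g</code> and <code>v</code> are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(readr)
library(dplyr)

# Hypothetical toy data: 6 rows written to a temporary csv file
toy_file &lt;- tempfile(fileext = ".csv")
write_csv(data.frame(g = c("a", "b", "a", "b", "a", "a"), v = 1:6), toy_file)

# Callback function: count rows per group within a single chunk
count_by_group &lt;- function(x, pos) {
  x %&gt;%
    group_by(g) %&gt;%
    summarize(n = n())
}

# DataFrameCallback row-binds whatever the function returns for each chunk
toy_counts &lt;- read_csv_chunked(
  toy_file, DataFrameCallback$new(count_by_group),
  chunk_size = 2, col_types = "ci"
)

# A group can appear in several chunks, so sum the per-chunk counts
toy_totals &lt;- toy_counts %&gt;%
  group_by(g) %&gt;%
  summarize(n = sum(n))</code></pre></div>
<p>With a <code>chunk_size</code> of 2, <code>toy_counts</code> holds one row per group per chunk, which is why the final <code>sum(n)</code> step is needed.</p>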
</div>
<div id="chunked-regression-example" class="section level2">
<h2>Chunked regression example</h2>
<p>With the biglm package, it is possible to use R to perform a regression on data that is too large to store in memory all at once. The ipumsr package provides a callback designed to make this simple: <code>IpumsBiglmCallback</code>.</p>
<p>Again we’ll use the CPS example, which is small enough that we can keep it in memory. Here’s an example of a regression looking at how hours worked, self-reported health and age are related among those who are currently working. This is meant as a simple example, and ignores many of the complexities in this relationship, so please use caution when interpreting.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" data-line-number="1"><span class="co"># Read in data</span></a>
<a class="sourceLine" id="cb7-2" data-line-number="2">data &lt;-<span class="st"> </span><span class="kw">read_ipums_micro</span>(</a>
<a class="sourceLine" id="cb7-3" data-line-number="3"> cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span></a>
<a class="sourceLine" id="cb7-4" data-line-number="4">)</a>
<a class="sourceLine" id="cb7-5" data-line-number="5"></a>
<a class="sourceLine" id="cb7-6" data-line-number="6"><span class="co"># Prepare data for model</span></a>
<a class="sourceLine" id="cb7-7" data-line-number="7"><span class="co"># (age has been capped at 99, which we assume is high enough to not</span></a>
<a class="sourceLine" id="cb7-8" data-line-number="8"><span class="co"># cause any problems so we leave it.)</span></a>
<a class="sourceLine" id="cb7-9" data-line-number="9">data &lt;-<span class="st"> </span>data <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb7-10" data-line-number="10"><span class="st"> </span><span class="kw">mutate</span>(</a>
<a class="sourceLine" id="cb7-11" data-line-number="11"> <span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),</a>
<a class="sourceLine" id="cb7-12" data-line-number="12"> <span class="dt">AHRSWORKT =</span> <span class="kw">lbl_na_if</span>(AHRSWORKT, <span class="op">~</span>.lbl <span class="op">==</span><span class="st"> "NIU (Not in universe)"</span>),</a>
<a class="sourceLine" id="cb7-13" data-line-number="13"> <span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb7-14" data-line-number="14"><span class="st"> </span><span class="kw">lbl_relabel</span>(</a>
<a class="sourceLine" id="cb7-15" data-line-number="15"> <span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">"Yes"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> "At work"</span>, </a>
<a class="sourceLine" id="cb7-16" data-line-number="16"> <span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">"No"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> "At work"</span></a>
<a class="sourceLine" id="cb7-17" data-line-number="17"> ) <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb7-18" data-line-number="18"><span class="st"> </span><span class="kw">as_factor</span>()</a>
<a class="sourceLine" id="cb7-19" data-line-number="19"> ) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb7-20" data-line-number="20"><span class="st"> </span><span class="kw">filter</span>(AT_WORK <span class="op">==</span><span class="st"> "Yes"</span>)</a>
<a class="sourceLine" id="cb7-21" data-line-number="21"></a>
<a class="sourceLine" id="cb7-22" data-line-number="22"><span class="co"># Run regression</span></a>
<a class="sourceLine" id="cb7-23" data-line-number="23">model &lt;-<span class="st"> </span><span class="kw">lm</span>(AHRSWORKT <span class="op">~</span><span class="st"> </span>AGE <span class="op">+</span><span class="st"> </span><span class="kw">I</span>(AGE<span class="op">^</span><span class="dv">2</span>) <span class="op">+</span><span class="st"> </span>HEALTH, data)</a>
<a class="sourceLine" id="cb7-24" data-line-number="24"><span class="kw">summary</span>(model)</a>
<a class="sourceLine" id="cb7-25" data-line-number="25"><span class="co">#&gt; </span></a>
<a class="sourceLine" id="cb7-26" data-line-number="26"><span class="co">#&gt; Call:</span></a>
<a class="sourceLine" id="cb7-27" data-line-number="27"><span class="co">#&gt; lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)</span></a>
<a class="sourceLine" id="cb7-28" data-line-number="28"><span class="co">#&gt; </span></a>
<a class="sourceLine" id="cb7-29" data-line-number="29"><span class="co">#&gt; Residuals:</span></a>
<a class="sourceLine" id="cb7-30" data-line-number="30"><span class="co">#&gt; &lt;Labelled double&gt;</span></a>
<a class="sourceLine" id="cb7-31" data-line-number="31"><span class="co">#&gt; Min 1Q Median 3Q Max </span></a>
<a class="sourceLine" id="cb7-32" data-line-number="32"><span class="co">#&gt; -41.230 -4.949 -0.080 5.945 75.697 </span></a>
<a class="sourceLine" id="cb7-33" data-line-number="33"><span class="co">#&gt; </span></a>
<a class="sourceLine" id="cb7-34" data-line-number="34"><span class="co">#&gt; Coefficients:</span></a>
<a class="sourceLine" id="cb7-35" data-line-number="35"><span class="co">#&gt; Estimate Std. Error t value Pr(&gt;|t|) </span></a>
<a class="sourceLine" id="cb7-36" data-line-number="36"><span class="co">#&gt; (Intercept) 5.626953 0.367851 15.297 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-37" data-line-number="37"><span class="co">#&gt; AGE 1.568287 0.017790 88.156 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-38" data-line-number="38"><span class="co">#&gt; I(AGE^2) -0.016798 0.000204 -82.338 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-39" data-line-number="39"><span class="co">#&gt; HEALTHVery good -0.280826 0.104433 -2.689 0.00717 ** </span></a>
<a class="sourceLine" id="cb7-40" data-line-number="40"><span class="co">#&gt; HEALTHGood -1.275358 0.115861 -11.008 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-41" data-line-number="41"><span class="co">#&gt; HEALTHFair -3.614487 0.207121 -17.451 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-42" data-line-number="42"><span class="co">#&gt; HEALTHPoor -5.732656 0.465751 -12.308 &lt; 2e-16 ***</span></a>
<a class="sourceLine" id="cb7-43" data-line-number="43"><span class="co">#&gt; ---</span></a>
<a class="sourceLine" id="cb7-44" data-line-number="44"><span class="co">#&gt; Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span></a>
<a class="sourceLine" id="cb7-45" data-line-number="45"><span class="co">#&gt; </span></a>
<a class="sourceLine" id="cb7-46" data-line-number="46"><span class="co">#&gt; Residual standard error: 12.8 on 88794 degrees of freedom</span></a>
<a class="sourceLine" id="cb7-47" data-line-number="47"><span class="co">#&gt; Multiple R-squared: 0.08886, Adjusted R-squared: 0.08879 </span></a>
<a class="sourceLine" id="cb7-48" data-line-number="48"><span class="co">#&gt; F-statistic: 1443 on 6 and 88794 DF, p-value: &lt; 2.2e-16</span></a></code></pre></div>
<p>To do the same regression, but with only 1000 rows loaded at a time, we work in a similar manner.</p>
<p>First we make the <code>IpumsBiglmCallback</code> callback object that specifies both the model and a function to prepare the data.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" data-line-number="1">biglm_cb &lt;-<span class="st"> </span>IpumsBiglmCallback<span class="op">$</span><span class="kw">new</span>(</a>
<a class="sourceLine" id="cb8-2" data-line-number="2"> <span class="dt">model =</span> AHRSWORKT <span class="op">~</span><span class="st"> </span>AGE <span class="op">+</span><span class="st"> </span><span class="kw">I</span>(AGE<span class="op">^</span><span class="dv">2</span>) <span class="op">+</span><span class="st"> </span>HEALTH,</a>
<a class="sourceLine" id="cb8-3" data-line-number="3"> <span class="dt">prep =</span> <span class="cf">function</span>(x, pos) {</a>
<a class="sourceLine" id="cb8-4" data-line-number="4"> x <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb8-5" data-line-number="5"><span class="st"> </span><span class="kw">mutate</span>(</a>
<a class="sourceLine" id="cb8-6" data-line-number="6"> <span class="dt">HEALTH =</span> <span class="kw">as_factor</span>(HEALTH),</a>
<a class="sourceLine" id="cb8-7" data-line-number="7"> <span class="dt">AHRSWORKT =</span> <span class="kw">lbl_na_if</span>(AHRSWORKT, <span class="op">~</span>.lbl <span class="op">==</span><span class="st"> "NIU (Not in universe)"</span>),</a>
<a class="sourceLine" id="cb8-8" data-line-number="8"> <span class="dt">AT_WORK =</span> EMPSTAT <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb8-9" data-line-number="9"><span class="st"> </span><span class="kw">lbl_relabel</span>(</a>
<a class="sourceLine" id="cb8-10" data-line-number="10"> <span class="kw">lbl</span>(<span class="dv">1</span>, <span class="st">"Yes"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">==</span><span class="st"> "At work"</span>, </a>
<a class="sourceLine" id="cb8-11" data-line-number="11"> <span class="kw">lbl</span>(<span class="dv">0</span>, <span class="st">"No"</span>) <span class="op">~</span><span class="st"> </span>.lbl <span class="op">!=</span><span class="st"> "At work"</span></a>
<a class="sourceLine" id="cb8-12" data-line-number="12"> ) <span class="op">%&gt;%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb8-13" data-line-number="13"><span class="st"> </span><span class="kw">as_factor</span>()</a>
<a class="sourceLine" id="cb8-14" data-line-number="14"> ) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb8-15" data-line-number="15"><span class="st"> </span><span class="kw">filter</span>(AT_WORK <span class="op">==</span><span class="st"> "Yes"</span>)</a>
<a class="sourceLine" id="cb8-16" data-line-number="16"> }</a>
<a class="sourceLine" id="cb8-17" data-line-number="17">)</a></code></pre></div>
<p>And then we read the data using <code>read_ipums_micro_chunked()</code>, passing the callback that we just made.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1">chunked_model &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(</a>
<a class="sourceLine" id="cb9-2" data-line-number="2"> cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,</a>
<a class="sourceLine" id="cb9-3" data-line-number="3"> <span class="dt">callback =</span> biglm_cb, <span class="dt">chunk_size =</span> <span class="dv">1000</span></a>
<a class="sourceLine" id="cb9-4" data-line-number="4">)</a>
<a class="sourceLine" id="cb9-5" data-line-number="5"></a>
<a class="sourceLine" id="cb9-6" data-line-number="6"><span class="kw">summary</span>(chunked_model)</a>
<a class="sourceLine" id="cb9-7" data-line-number="7"><span class="co">#&gt; Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...)</span></a>
<a class="sourceLine" id="cb9-8" data-line-number="8"><span class="co">#&gt; Sample size = 88801 </span></a>
<a class="sourceLine" id="cb9-9" data-line-number="9"><span class="co">#&gt; Coef (95% CI) SE p</span></a>
<a class="sourceLine" id="cb9-10" data-line-number="10"><span class="co">#&gt; (Intercept) 5.6270 4.8913 6.3627 0.3679 0.0000</span></a>
<a class="sourceLine" id="cb9-11" data-line-number="11"><span class="co">#&gt; AGE 1.5683 1.5327 1.6039 0.0178 0.0000</span></a>
<a class="sourceLine" id="cb9-12" data-line-number="12"><span class="co">#&gt; I(AGE^2) -0.0168 -0.0172 -0.0164 0.0002 0.0000</span></a>
<a class="sourceLine" id="cb9-13" data-line-number="13"><span class="co">#&gt; HEALTHVery good -0.2808 -0.4897 -0.0720 0.1044 0.0072</span></a>
<a class="sourceLine" id="cb9-14" data-line-number="14"><span class="co">#&gt; HEALTHGood -1.2754 -1.5071 -1.0436 0.1159 0.0000</span></a>
<a class="sourceLine" id="cb9-15" data-line-number="15"><span class="co">#&gt; HEALTHFair -3.6145 -4.0287 -3.2002 0.2071 0.0000</span></a>
<a class="sourceLine" id="cb9-16" data-line-number="16"><span class="co">#&gt; HEALTHPoor -5.7327 -6.6642 -4.8012 0.4658 0.0000</span></a></code></pre></div>
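<p>To get a feel for why the chunked estimates match the in-memory ones, here is a small sketch of the incremental fitting that biglm provides: fit on one chunk with <code>biglm()</code>, then fold in further chunks with <code>update()</code>. It uses the built-in <code>mtcars</code> data split in two as stand-in chunks, and it is an illustration of the biglm workflow rather than a copy of what <code>IpumsBiglmCallback</code> does internally.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(biglm)

# Split a small built-in dataset into two stand-in "chunks"
chunk1 &lt;- mtcars[1:16, ]
chunk2 &lt;- mtcars[17:32, ]

# Fit on the first chunk, then update the fit with the second
fit &lt;- biglm(mpg ~ wt + hp, data = chunk1)
fit &lt;- update(fit, chunk2)

# Compare with an ordinary lm() fit on all rows at once
full &lt;- lm(mpg ~ wt + hp, data = mtcars)
same_fit &lt;- isTRUE(all.equal(coef(fit), coef(full), tolerance = 1e-6))</code></pre></div>
<p>One caveat from the biglm documentation: factors are allowed, but their levels must be the same in every chunk. Creating the factor from the IPUMS value labels, as the <code>prep</code> function above does with <code>as_factor(HEALTH)</code>, is one way to keep the levels consistent.</p>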
</div>
<div id="chunked-select-cases-example" class="section level2">
<h2>Chunked “select cases” example</h2>
<p>Sometimes you may want to select a subset of the data before reading it in. The IPUMS website has this functionality built in, and using it can be faster (this “select cases” functionality is described in the second section above). Also, Unix commands like <code>awk</code> and <code>sed</code> will generally be much faster than these R-based solutions. However, it is possible to use the chunked functions to create a subset, which can be convenient if you want to subset on some complex logic that would be hard to code into the IPUMS extract system or Unix tools.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" data-line-number="1"><span class="co"># Subset only those in "Poor" health</span></a>
<a class="sourceLine" id="cb10-2" data-line-number="2">chunked_subset &lt;-<span class="st"> </span><span class="kw">read_ipums_micro_chunked</span>(</a>
<a class="sourceLine" id="cb10-3" data-line-number="3"> cps_ddi_file, <span class="dt">data_file =</span> cps_data_file, <span class="dt">verbose =</span> <span class="ot">FALSE</span>,</a>
<a class="sourceLine" id="cb10-4" data-line-number="4"> <span class="dt">callback =</span> IpumsDataFrameCallback<span class="op">$</span><span class="kw">new</span>(<span class="cf">function</span>(x, pos) {</a>
<a class="sourceLine" id="cb10-5" data-line-number="5"> <span class="kw">filter</span>(x, HEALTH <span class="op">==</span><span class="st"> </span><span class="dv">5</span>)</a>
<a class="sourceLine" id="cb10-6" data-line-number="6"> }), </a>
<a class="sourceLine" id="cb10-7" data-line-number="7"> <span class="dt">chunk_size =</span> <span class="dv">1000</span></a>
<a class="sourceLine" id="cb10-8" data-line-number="8">)</a></code></pre></div>
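<p>Note that the filter compares <code>HEALTH</code> to the underlying code 5 rather than to the label “Poor”, because each chunk still holds labelled values. Here is a small sketch of the distinction using haven directly. The value labels below are assumed from the order of the tabulation output earlier in this document (only “Poor” = 5 is implied by the example), so check them against your own extract.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(haven)

# Hypothetical labelled vector standing in for the HEALTH column
health &lt;- labelled(
  c(1, 5, 3, 5),
  labels = c(Excellent = 1, `Very good` = 2, Good = 3, Fair = 4, Poor = 5)
)

# Select on the numeric code ...
poor_by_code &lt;- zap_labels(health) == 5

# ... or convert to a factor first and select on the label text
poor_by_label &lt;- as_factor(health) == "Poor"</code></pre></div>
<p>Both approaches pick out the same elements. Filtering on the code avoids converting every chunk to a factor, while filtering on the label is easier to read.</p>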
</div>
</div>
<div id="option-4-use-a-database" class="section level1">
<h1>Option 4: Use a database</h1>
<p>Databases are another option for data that cannot fit in memory as an R data.frame. If you have access to a database on a remote machine, then you can easily pull in parts of the data for your analysis. Even if you need to store the database on your own machine, it may store the data more efficiently so that it fits in memory, or it may keep the data on your hard drive instead.</p>
<p>R’s tools for integrating with databases are improving quickly. The DBI package has been updated, dplyr (through dbplyr) provides a frontend that allows you to write the same code for data in a database as you would for a local data.frame, and packages like sparklyr, SparkR, bigrquery and others provide access to newer systems such as Spark and Google BigQuery.</p>
<p>There are many different kinds of databases, each with their own benefits, weaknesses and tradeoffs. As such, it’s hard to give concrete advice without knowing your specific use-case. However, once you’ve chosen a database, in general, there will be two steps: Importing the data into the database and then connecting it to R.</p>
<p>As an example, we’ll use the RSQLite package to load the data into an in-memory database. RSQLite is great because it is easy to set up, but it is probably not efficient enough to help you if you need to use a database because your data doesn’t fit in memory.</p>
<div id="importing-data-into-a-database" class="section level2">
<h2>Importing data into a database</h2>
<p>When using rectangular extracts, your best bet to import IPUMS data into your database is probably going to be a csv file. Most databases support csv importing, and these implementations will generally be well supported since this is a common file format.</p>
<p>However, if you need a hierarchical extract, or your database software doesn’t support the csv format, then you can use the chunking functions to load the data into a database without storing the full data in R.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1"><span class="co"># Connect to database</span></a>
<a class="sourceLine" id="cb11-2" data-line-number="2"><span class="kw">library</span>(DBI)</a>
<a class="sourceLine" id="cb11-3" data-line-number="3"><span class="kw">library</span>(RSQLite)</a>
<a class="sourceLine" id="cb11-4" data-line-number="4"><span class="co">#&gt; Warning: package 'RSQLite' was built under R version 3.5.1</span></a>
<a class="sourceLine" id="cb11-5" data-line-number="5">con &lt;-<span class="st"> </span><span class="kw">dbConnect</span>(<span class="kw">SQLite</span>(), <span class="dt">dbname =</span> <span class="st">":memory:"</span>)</a>
<a class="sourceLine" id="cb11-6" data-line-number="6"></a>
<a class="sourceLine" id="cb11-7" data-line-number="7"><span class="co"># Add data to tables in chunks</span></a>
<a class="sourceLine" id="cb11-8" data-line-number="8">ddi &lt;-<span class="st"> </span><span class="kw">read_ipums_ddi</span>(cps_ddi_file)</a>
<a class="sourceLine" id="cb11-9" data-line-number="9"><span class="kw">read_ipums_micro_chunked</span>(</a>
<a class="sourceLine" id="cb11-10" data-line-number="10"> ddi,</a>
<a class="sourceLine" id="cb11-11" data-line-number="11"> <span class="dt">data_file =</span> cps_data_file,</a>
<a class="sourceLine" id="cb11-12" data-line-number="12"> readr<span class="op">::</span>SideEffectChunkCallback<span class="op">$</span><span class="kw">new</span>(<span class="cf">function</span>(x, pos) {</a>
<a class="sourceLine" id="cb11-13" data-line-number="13"> <span class="cf">if</span> (pos <span class="op">==</span><span class="st"> </span><span class="dv">1</span>) {</a>
<a class="sourceLine" id="cb11-14" data-line-number="14"> <span class="kw">dbWriteTable</span>(con, <span class="st">"cps"</span>, x)</a>
<a class="sourceLine" id="cb11-15" data-line-number="15"> } <span class="cf">else</span> {</a>
<a class="sourceLine" id="cb11-16" data-line-number="16"> <span class="kw">dbWriteTable</span>(con, <span class="st">"cps"</span>, x, <span class="dt">row.names =</span> <span class="ot">FALSE</span>, <span class="dt">append =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb11-17" data-line-number="17"> }</a>
<a class="sourceLine" id="cb11-18" data-line-number="18"> }),</a>
<a class="sourceLine" id="cb11-19" data-line-number="19"> <span class="dt">chunk_size =</span> <span class="dv">1000</span>,</a>
<a class="sourceLine" id="cb11-20" data-line-number="20"> <span class="dt">verbose =</span> <span class="ot">FALSE</span></a>
<a class="sourceLine" id="cb11-21" data-line-number="21">)</a>
<a class="sourceLine" id="cb11-22" data-line-number="22"><span class="co">#&gt; NULL</span></a></code></pre></div>
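<p>The append-in-chunks idea can also be checked on its own with a throwaway in-memory SQLite database. This sketch uses pieces of the built-in <code>mtcars</code> data as stand-in chunks; the connection name <code>con_demo</code> and table name <code>cars</code> are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(DBI)

# Throwaway in-memory database, separate from the one used above
con_demo &lt;- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Stand-in chunks: four pieces of 8 rows each
chunks &lt;- split(mtcars, rep(1:4, each = 8))

# With append = TRUE, the table is created on the first write
# and extended on every later one
for (chunk in chunks) {
  dbWriteTable(con_demo, "cars", chunk, append = TRUE)
}

# All 32 rows end up in a single table
n_rows &lt;- nrow(dbReadTable(con_demo, "cars"))
dbDisconnect(con_demo)</code></pre></div>
<p>Only one chunk is held in R at a time while writing, which is the point of loading a large extract this way.</p>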
</div>
<div id="connecting-to-a-database-with-dbplyr" class="section level2">
<h2>Connecting to a database with dbplyr</h2>
<p>The dbplyr vignette “dbplyr” (which you can access with <code>vignette("dbplyr", package = "dbplyr")</code>) is a good place to get started learning about how to connect to a database. Here I’ll just briefly show some examples.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" data-line-number="1">example &lt;-<span class="st"> </span><span class="kw">tbl</span>(con, <span class="st">"cps"</span>)</a>
<a class="sourceLine" id="cb12-2" data-line-number="2"></a>
<a class="sourceLine" id="cb12-3" data-line-number="3">example <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb12-4" data-line-number="4"><span class="st"> </span><span class="kw">filter</span>(AGE <span class="op">&gt;</span><span class="st"> </span><span class="dv">25</span>)</a>
<a class="sourceLine" id="cb12-5" data-line-number="5"><span class="co">#&gt; # Source: lazy query [?? x 14]</span></a>
<a class="sourceLine" id="cb12-6" data-line-number="6"><span class="co">#&gt; # Database: sqlite 3.22.0 []</span></a>
<a class="sourceLine" id="cb12-7" data-line-number="7"><span class="co">#&gt; YEAR SERIAL HWTSUPP CPSID ASECFLAG FOODSTMP MONTH PERNUM CPSIDP</span></a>
<a class="sourceLine" id="cb12-8" data-line-number="8"><span class="co">#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;</span></a>
<a class="sourceLine" id="cb12-9" data-line-number="9"><span class="co">#&gt; 1 2011 2 5454500 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb12-10" data-line-number="10"><span class="co">#&gt; 2 2011 3 4754200 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb12-11" data-line-number="11"><span class="co">#&gt; 3 2011 3 4754200 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb12-12" data-line-number="12"><span class="co">#&gt; 4 2011 4 5483800 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb12-13" data-line-number="13"><span class="co">#&gt; 5 2011 4 5483800 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb12-14" data-line-number="14"><span class="co">#&gt; 6 2011 5 4754200 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb12-15" data-line-number="15"><span class="co">#&gt; 7 2011 6 2983100 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb12-16" data-line-number="16"><span class="co">#&gt; 8 2011 6 2983100 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb12-17" data-line-number="17"><span class="co">#&gt; 9 2011 6 2983100 2.01e13 1 1 3 3 2.01e13</span></a>
<a class="sourceLine" id="cb12-18" data-line-number="18"><span class="co">#&gt; 10 2011 6 2983100 2.01e13 1 1 3 4 2.01e13</span></a>
<a class="sourceLine" id="cb12-19" data-line-number="19"><span class="co">#&gt; # ... with more rows, and 5 more variables: WTSUPP &lt;dbl&gt;, AGE &lt;int&gt;,</span></a>
<a class="sourceLine" id="cb12-20" data-line-number="20"><span class="co">#&gt; # EMPSTAT &lt;int&gt;, AHRSWORKT &lt;dbl&gt;, HEALTH &lt;int&gt;</span></a></code></pre></div>
<p>Though dbplyr shows us a nice preview of the first rows of the result of our query, the data still lives in the database. When using a regular database, in general you’d use the function <code>dplyr::collect()</code> to load in the full results of the query to your R session. However, the database has no concept of IPUMS attributes like value and variable labels, so if you want them, you can use <code>ipums_collect()</code> like so:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1">example <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb13-2" data-line-number="2"><span class="st"> </span><span class="kw">filter</span>(AGE <span class="op">&gt;</span><span class="st"> </span><span class="dv">25</span>) <span class="op">%&gt;%</span></a>
<a class="sourceLine" id="cb13-3" data-line-number="3"><span class="st"> </span><span class="kw">ipums_collect</span>(ddi)</a>
<a class="sourceLine" id="cb13-4" data-line-number="4"><span class="co">#&gt; # A tibble: 204,983 x 14</span></a>
<a class="sourceLine" id="cb13-5" data-line-number="5"><span class="co">#&gt; YEAR SERIAL HWTSUPP CPSID ASECFLAG FOODSTMP MONTH PERNUM CPSIDP</span></a>
<a class="sourceLine" id="cb13-6" data-line-number="6"><span class="co">#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int+lbl&gt; &lt;int+lbl&gt; &lt;int+l&gt; &lt;dbl&gt; &lt;dbl&gt;</span></a>
<a class="sourceLine" id="cb13-7" data-line-number="7"><span class="co">#&gt; 1 2011 2 5454500 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb13-8" data-line-number="8"><span class="co">#&gt; 2 2011 3 4754200 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb13-9" data-line-number="9"><span class="co">#&gt; 3 2011 3 4754200 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb13-10" data-line-number="10"><span class="co">#&gt; 4 2011 4 5483800 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb13-11" data-line-number="11"><span class="co">#&gt; 5 2011 4 5483800 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb13-12" data-line-number="12"><span class="co">#&gt; 6 2011 5 4754200 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb13-13" data-line-number="13"><span class="co">#&gt; 7 2011 6 2983100 2.01e13 1 1 3 1 2.01e13</span></a>
<a class="sourceLine" id="cb13-14" data-line-number="14"><span class="co">#&gt; 8 2011 6 2983100 2.01e13 1 1 3 2 2.01e13</span></a>
<a class="sourceLine" id="cb13-15" data-line-number="15"><span class="co">#&gt; 9 2011 6 2983100 2.01e13 1 1 3 3 2.01e13</span></a>
<a class="sourceLine" id="cb13-16" data-line-number="16"><span class="co">#&gt; 10 2011 6 2983100 2.01e13 1 1 3 4 2.01e13</span></a>
<a class="sourceLine" id="cb13-17" data-line-number="17"><span class="co">#&gt; # ... with 204,973 more rows, and 5 more variables: WTSUPP &lt;dbl&gt;,</span></a>
<a class="sourceLine" id="cb13-18" data-line-number="18"><span class="co">#&gt; # AGE &lt;int+lbl&gt;, EMPSTAT &lt;int+lbl&gt;, AHRSWORKT &lt;dbl+lbl&gt;,</span></a>
<a class="sourceLine" id="cb13-19" data-line-number="19"><span class="co">#&gt; # HEALTH &lt;int+lbl&gt;</span></a></code></pre></div>
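<p>To see the difference between a lazy query and collected data on something small, here is a sketch with a throwaway in-memory SQLite database built from the built-in <code>mtcars</code> data (the names <code>con_demo</code> and <code>cars</code> are made up, and the dbplyr package needs to be installed for <code>tbl()</code> to work on a connection). Column names are written unquoted inside <code>filter()</code>, just as with a local data.frame.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Throwaway in-memory database holding one small table
con_demo &lt;- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DBI::dbWriteTable(con_demo, "cars", mtcars)

# Building the query does not pull any rows into R
lazy_cars &lt;- tbl(con_demo, "cars") %&gt;%
  filter(hp &gt; 150)

# Print the SQL that dbplyr generated for the database
show_query(lazy_cars)

# collect() runs the query and returns an ordinary tibble
local_cars &lt;- collect(lazy_cars)
DBI::dbDisconnect(con_demo)</code></pre></div>
<p><code>ipums_collect()</code> plays the same role as <code>collect()</code> here, with the extra step of reattaching the IPUMS labels from the ddi.</p>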
</div>
</div>
<div id="learning-more" class="section level1">
<h1>Learning more</h1>
<p>Big data is a problem for lots of R users, not just IPUMS users, so there are a lot of resources to help you out! These are just a few that I found useful while writing this document:</p>
<ul>
<li><em>Best practice to handle out-of-memory data</em> - RStudio Community Thread <a href="https://community.rstudio.com/t/best-practice-to-handle-out-of-memory-data/734">link</a></li>
<li><em>Big Data in R</em> - Part of Stephen Mooney’s EPIC: Epidemiologic Analysis Using R, June 2015 class <a href="http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf">link</a></li>
<li><em>Statistical Analysis with Open-Source R and RStudio on Amazon EMR</em> - Markus Schmidberger on the AWS Big Data Blog <a href="https://aws.amazon.com/blogs/big-data/statistical-analysis-with-open-source-r-and-rstudio-on-amazon-emr/">link</a></li>
<li><em>Hosting RStudio Server on Azure</em> - Colin Gillespie’s blog post on using RStudio on Azure <a href="https://blog.jumpingrivers.com/posts/2017/rstudio_azure_cloud_1/">link</a></li>
<li><em>Improving DBI: A Retrospect</em> - Kirill Müller’s report on the R Consortium grant to improve database support in R <a href="https://www.r-consortium.org/blog/2017/05/15/improving-dbi-a-retrospect">link</a></li>
</ul>
</div>
<div class="footnotes">
<hr>
<ol>
<li id="fn1"><p>Bonus joke: Why is the IPUMS website better than any grocery store? Answer: More free samples.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>If you’re interested in learning more about R6, the upcoming revision to Hadley Wickham’s Advanced R book includes a chapter on R6 <a href="https://github.com/hadley/adv-r/blob/master/R6.Rmd">available for free here</a>.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body></html>