Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@jknowles
Forked from lecy/datausa_census_api.md
Created May 30, 2018 22:46
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jknowles/210a4d2dbe9052a00c347813a0681e39 to your computer and use it in GitHub Desktop.
Save jknowles/210a4d2dbe9052a00c347813a0681e39 to your computer and use it in GitHub Desktop.
Building Census Dataset in R Using datausa.io API
# Using the dataUSA.io API for Census Data in R
This gist contains some notes on constructing a query for census and economic data from the [DataUSA.io](http://datausa.io/) site. This is a quick-start guide to their API; for in-depth documentation check out their [API documentation](https://github.com/DataUSA/datausa-api/wiki/Overview).
A great way to learn how to structure a query is to visit a specific datausa.io page and click on the "Options" button on top of any graph, then select "API" to see the query syntax that created the graph.
![Analytics](https://ga-beacon.appspot.com/UA-27835807-2/gist-id?pixel)
## Example Use
```r
library( dplyr )
### POPULATION OF METRO AREAS IN THE US
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
dat <- read.csv( api )
head( dat )
# year geo_name geo pop
# 1 2014 Aberdeen, SD 31000US10100 41564
# 2 2014 Aberdeen, WA 31000US10140 71734
# 3 2014 Abilene, TX 31000US10180 167171
### INCOME BY CENSUS TRACT
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest"
api %>% read.csv() %>% head()
# year geo_name geo income
# 1 2014 Census Tract 201 14000US01001020100 60000.0
# 2 2014 Census Tract 211 14000US01001021100 34183.0
# 3 2014 Census Tract 102 14000US01003010200 31480.0
### MEDIAN PROPERTY VALUE FOR EACH COUNTY IN ATLANTA METRO AREA
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=median_property_value&geo=31000US12060&year=latest"
api %>% read.csv() %>% head()
# year geo_name geo median_property_value
# 1 2014 Cherokee County, GA 05000US13057 183200
# 2 2014 Clayton County, GA 05000US13063 85700
# 3 2014 Cobb County, GA 05000US13067 203600
### NUMBER OF EMPLOYEES BY TOP-LEVEL INDUSTRY:
api <- "http://api.datausa.io/api/csv/?show=naics,geo&sumlevel=0,nation&required=emp&year=latest"
api %>% read.csv() %>% head()
# year geo_name geo naics_name naics emp
# 1 2013 United States 01000US Agriculture, Forestry, Fishing, & Hunting 11 154496
# 2 2013 United States 01000US Mining, Quarrying, & Oil & Gas Extraction 21 732186
# 3 2013 United States 01000US Utilities 22 638575
# 4 2013 United States 01000US Construction 23 5470181
### NUMBER OF BACHELOR'S DEGREES AWARDED IN 2014 BY AREA OF STUDY
api <- "http://api.datausa.io/api/csv/?show=cip&sumlevel=6&degree=5&required=grads_total&year=latest"
api %>% read.csv() %>% head()
# year cip_name cip degree_name degree grads_total
# 1 2014 General Agriculture 10000 Bachelor's Degree 5 1924
# 2 2014 General Agriculture Management 10101 Bachelor's Degree 5 1190
# 3 2014 Agricultural Business Operations 10102 Bachelor's Degree 5 1562
```
Note the use of the [pipe operator](http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/) in these examples to keep the code short. If you are not familiar with the pipe operator, just note that these statements are equivalent:
```r
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
dat <- read.csv( api )
head( dat )
library( dplyr ) # required for the pipe operator
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
api %>% read.csv() %>% head()
```
## Anatomy of the API
All queries start with one of the following prefixes:
```r
http://api.datausa.io/api/? # returns data as json file
http://api.datausa.io/api/csv/? # returns data as csv file
```
Each query consists of a set of statements, all separated by the **&** operator.
* **show=** which type of data to query
* **sumlevel=** which level or geography to aggregate over
* **required=** which variable(s) to use
* **year=** which period of data to use ("latest" is a valid argument)
Depending upon what type of data you query, the *required* and *year* statements may be omitted.
For example, grab the population of all counties for 2014:
```r
# show=geo
# sumlevel=county
# required=pop
# year=2014
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=2014"
api %>% read.csv() %>% head()
# year geo_name geo pop
# 1 2014 Cook County, IL 05000US17031 5227827
# 2 2014 King County, WA 05000US53033 2008997
# 3 2014 Wayne County, MI 05000US26163 1790078
# 4 2014 Broward County, FL 05000US12011 1815269
```
Populaton of all counties within a specific state:
```r
# All counties in Massachusettes start with FIPS 05000US25
# show=geo
# sumlevel=county
# required=pop
# year=latest
# where=geo:^05000US25
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US25"
api %>% read.csv() %>% head()
# year geo_name geo pop
# 1 2014 Barnstable County, MA 05000US25001 215167
# 2 2014 Berkshire County, MA 05000US25003 130064
# 3 2014 Bristol County, MA 05000US25005 55106
```
College degree data for all of the US:
```r
# show=cip data on college majors
# sumlevel=2 cip level 2 categories (see below)
api <- "http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_total"
api %>% read.csv() %>% head()
# year cip_name cip grads_total
# 1 2014 Philosophy and Religious Studies 38 18426
# 2 2014 Construction 46 31596
# 3 2014 Mechanical Technologies 47 106442
# 4 2014 Education 13 324344
# 5 2014 Social Sciences 45 199424
```
## Datasets and Variable
To find variables for the **required=** argument see these sources and their respective data dictionaries.
DATASET | CONTAINS | DATA DICTIONARY
----------|---------------|---------------------
ACS | American Community Survey - census demographics | [LINK](https://github.com/DataUSA/datausa-api/wiki/ACS-API)
ACS PUMS | ACS Public Use Microdata - employment statistics | [LINK](https://github.com/DataUSA/datausa-api/wiki/ACS-PUMS-API)
BLS | Bureau of Labor Statistics - industry and occupation | [LINK](https://github.com/DataUSA/datausa-api/wiki/BLS-API)
CBP | County Business Patterns - number of businesses | [LINK](https://github.com/DataUSA/datausa-api/wiki/CBP-API)
CHR | County Health Rankings | [LINK](https://github.com/DataUSA/datausa-api/wiki/County-Health-Rankings-API)
IPEDS | Integrated Post-Secondary Education Data System - data on college education | [LINK](https://github.com/DataUSA/datausa-api/wiki/IPEDS-API)
ONET | Skills needed for employment | [LINK](https://github.com/DataUSA/datausa-api/wiki/ONET-API)
## Data Type & Level of Aggregation
Each query requires the user to specify a type of data (show=), a level of aggregation (sumlevel=), and a variable (require=).
In this API, what dataUSA will call an "[attribute](https://github.com/DataUSA/datausa-api/wiki/Attributes)" is what R users would refer to as a factor. These are variables that have levels that will be used to aggregate the outcome variables specified by *require=*. The primary nuance is that these data types often have nested or hierarchical levels. For example, the **geo** factor includes levels for states, counties, and MSAs. The **sumlevel=** statement allows the use to select aggregation at a specific level.
Data types are specified through the **show=** statement. The API recognizes all of the following:
* Geography (geo) - typically census and economic data aggregated by geography
* Course (cip) - employement and earnings by college and major
* Industry (naics,bls_naics,acs_ind) - industry data aggregated by industry subsector or group
* Occupation (soc,acs_occ) - occupation data aggregated by occupation group
* Sex (sex) - aggregate data by sex
* Race (acs_race,pums_race,race) - aggregate data by race - use acs_race with census data, etc.
* Veterans (conflict) - stats on veterans
* College Degree (degree)
* Skill (skill)
* Birthplace (birthplace)
* Wage Bin (wage_bin)
* Language (language)
* University Location (university) and Sector - Public, State, Private (sector)
It is helpful to be able to peek inside of these data types in order to see the levels that are includes and their associated attributes. Try the following helper function:
```r
# install.packages( "jsonlite" )
library( jsonlite )
# PRINT ALL VALID DATA (show=) TYPES:
fromJSON( "http://api.datausa.io/attrs/list/" )
viewDataLevels <- function( attribute )
{
url <- paste( "http://api.datausa.io/attrs/", attribute, sep="" )
temp <- fromJSON( url )
df <- as.data.frame( temp[[1]] )
names(df) <- temp[[2]]
print( paste( "NUMBER OF ROWS:", nrow(df) ) )
return(df)
}
viewDataLevels( "acs_ind" )[1:10,]
# [1] "NUMBER OF ROWS: 191"
# id name level
# 1 04 Retail Trade 0
# 2 08 Professional, Scientific, Management, Admin., Waste Mgmt. 0
# 3 0403 Building Material & Garden Equipment Dealers 1
# 4 020x Manufacturing 1
# 5 030x Wholesale trade 1
viewDataLevels( "university" )[1:10,]
# [1] "NUMBER OF ROWS: 9124"
# sector name url is_stem county state lat lng id msa
# 1 3 South University - Richmond www.southuniversity.edu 0 05000US51087 04000US51 37.64676 -77.59915 459259 31000US40060
# 2 3 Herzing University-Atlanta www.herzing.edu 1 05000US13121 04000US13 33.846627 -84.362635 140340 31000US12060
# 3 1 Boise State University www.boisestate.edu 1 05000US16001 04000US16 43.604403 -116.205789 142115 31000US14260
# 4 4 Eastern Idaho Technical College www.eitc.edu 1 05000US16019 04000US16 43.4818 -111.984645 142179 31000US26820
# 5 1 Idaho State University www.isu.edu/ 1 05000US16005 04000US16 42.864108 -112.429084 142276 31000US38540
```
The level of aggregation (unit of analysis) for the data is specified by the sumlevels statement. Sumlevels for the most commmon data types are listed here:
If sumlevels for a particular variable type are not listed below, try **sumlevel=all**.
Supported sumlevels for **show=geo**:
* Nation (nation)
* State (state) [ [list](http://datausa.io/about/attributes/geo/state/1/) ]
* Public Use Micro Data (puma) [ [list](http://datausa.io/about/attributes/geo/puma/1/) ] [ [definition](https://www.census.gov/geo/reference/puma.html) ]
* County (county) [ [list](http://datausa.io/about/attributes/geo/county/1/) ]
* Metropolitan Statistical Area (msa) [ [list](http://datausa.io/about/attributes/geo/msa/1/) ] [ [definition](https://www.census.gov/population/metro/) ]
* Census Designated Place (place) [ [list](http://datausa.io/about/attributes/geo/place/1/) ] [ [definition](https://www.census.gov/geo/reference/gtc/gtc_place.html) ]
* Census tract (tract)
*See below on how to query a specific geographic region.*
```r
### income rank by state ###
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=state&required=income_rank&year=latest"
api %>% read.csv() %>% head()
# year geo_name geo income_rank
# 1 2014 Alabama 04000US01 47
# 2 2014 Alaska 04000US02 3
# 3 2014 Arizona 04000US04 31
# 4 2014 Arkansas 04000US05 50
# 5 2014 California 04000US06 10
# 6 2014 Colorado 04000US08 15
### income rank by city ###
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=income_rank&year=latest"
api %>% read.csv() %>% head()
# year geo_name geo income_rank
# 1 2014 Aberdeen, SD 31000US10100 193
# 2 2014 Aberdeen, WA 31000US10140 565
# 3 2014 Abilene, TX 31000US10180 530
# 4 2014 Ada, OK 31000US10220 600
# 5 2014 Adjuntas, PR 31000US10260 929
# 6 2014 Adrian, MI 31000US10300 369
```
Supported sumlevels for **show=cip**: [ [link to codes](http://datausa.io/about/attributes/cip/2_digit_course/1/) ]
* 2-digit (In ACS PUMS data only 2-digit data is available.)
* 4-digit
* 6-digit
Example of sumlevel=2, sumlevel=4, and sumlevel=6 results:
```
Close11) COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES
Close11.01) Computer and Information Sciences, General
11.0101) Computer and Information Sciences, General
11.0102) Artificial Intelligence
11.0103) Information Technology
11.0104) Informatics
11.0199) Computer and Information Sciences, Other
Close11.02) Computer Programming
11.0201) Computer Programming/Programmer, General
11.0202) Computer Programming, Specific Applications
11.0203) Computer Programming, Vendor/Product Certification
11.0299) Computer Programming, Other
```
Supported sumlevels for **show=naics**: [ [link to codes](http://datausa.io/about/attributes/naics/sector/1/) ]
* Sector (0)
* Sub Sector (1)
* Group (2)
Example industry at sumlevel=0, sumlevel=1, and sumlevel=2.
```
Financial Activities
Finance and Insurance (NAICS 52)
Monetary Authorities - Central Bank (NAICS 521)
Credit Intermediation and Related Activities (NAICS 522)
Insurance Carriers and Related Activities (NAICS 524)
Funds, Trusts, and Other Financial Vehicles (NAICS 525)
```
``` r
library( dplyr )
# NUMBER OF EMPLOYEES BY INDUSTRY (IN THOUSANDS)
api <- "http://api.datausa.io/api/csv/?show=naics&sumlevel=all&required=emp_2014_thousands"
api %>% read.csv() %>% head()
# naics_name naics emp_2014_thousands
# 1 Mining, Quarrying, and Oil and Gas Extraction 21 843.8
# 2 Oil and Gas Extraction 211 197.9
# 3 Mining (except Oil and Gas) 212 207.4
# ESTABLISHMENTS BY INDUSTRY WITHIN MSAs
api <- "http://api.datausa.io/api/csv/?show=geo,naics&sumlevel=msa,1&required=est"
api %>% read.csv() %>% head()
# year geo_name geo naics_name naics est
# 1 2013 Escanaba, MI 31000US21540 Couriers & messengers 492 4
# 2 2013 Escanaba, MI 31000US21540 Warehousing & storage 493 1
# 3 2013 Escanaba, MI 31000US21540 None 511 3
```
Supported sumlevels for **show=soc**: [ [link to codes](http://datausa.io/about/attributes/soc/major_group/1/) ]
* Major Group (0)
* Minor Group (1)
* Broad Occupation (2)
* Detailed Occupation (3)
**IMPORTANT**: The allowable sumlevels will depend entirely upon the underlying data table that is being queried. In order to determine capatability of your show= & sumlevel= arguments, you need to check. Currently you can search by the data type only. The function below will print out all compatible data tables and their joint allowable sumlevels.
```r
# install.packages( "jsonlite" )
getUsage <- function( attribute, return.df=F )
{
library( jsonlite )
url <- paste( "http://api.datausa.io/api/logic/?show=", attribute, "&sumlevel=all", sep="" )
temp <- fromJSON( url, simplifyDataFrame=FALSE )
for( i in 1:length(temp$tables) )
{
cat( paste( "TABLE:", temp$tables[[i]]$table, "\n" ) )
cat( paste( "DATA SOURCE:", temp$tables[[i]]$dataset, "\n" ) )
cat( paste( "DEPARTMENT:", temp$tables[[i]]$org, "\n" ) )
cat( paste( "LINK:", temp$tables[[i]]$link, "\n\n" ) )
cat( "SUPPORTED SUMLEVELS:", "\n\n" )
for( j in 1:length(temp$tables[[i]]$supported_levels) )
{
cat( paste( names(temp$tables[[i]]$supported_levels[j]), ": ",
paste( unlist(temp$tables[[i]]$supported_levels[j]), collapse=", "), "\n", sep="" ) )
}
cat("\n \n \n \n \n")
}
if( return.df==T )
{
temp <- fromJSON( url, simplifyDataFrame=TRUE )
df <- as.data.frame( temp[[1]] )
return(df)
}
else{ return(NULL) }
}
getUsage( "geo" )
# TABLE: ygi_num_emp
# DATA SOURCE: ACS 3-year Estimate
# DEPARTMENT: Census Bureau
# LINK: http://www.census.gov/programs-surveys/acs/
#
# SUPPORTED SUMLEVELS:
#
# acs_ind: 0, 1, 2, all
# geo: nation, state, msa, all
# msa is a supported geography:
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=num_emp"
api %>% read.csv() %>% head()
# year acs_ind_name acs_ind geo_name geo num_emp
# 1 2014 Agriculture, Forestry, Fishing, Hunting, Mining 00 Albert Lea, MN 31000US10660 953
# 2 2014 Agriculture, Forestry, Fishing, Hunting 0000 Albert Lea, MN 31000US10660 930
# 3 2014 Mining, Quarrying, Oil, Gas Extraction 0001 Albert Lea, MN 31000US10660 23
# county is not a supported geography
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=num_emp"
api %>% read.csv() %>% head()
# does not work because county is not a supported sumlevel
# we can use any combination of the supported geo and acs_ind sumlevels:
api <- "http://api.datausa.io/api/csv/?show=geo,acs_ind&sumlevel=msa,0&required=num_emp"
api %>% read.csv() %>% head()
# year acs_ind_name acs_ind geo_name geo num_emp
# 1 2014 Agriculture, Forestry, Fishing, Hunting, Mining 0 Aberdeen, SD 31000US10100 1545
# 2 2014 Construction 1 Aberdeen, SD 31000US10100 1328
# 3 2014 Manufacturing 2 Aberdeen, SD 31000US10100 2887
usage.table <- getUsage( "geo", return.df=T )
```
Data compatability will vary with each geography depending upon the source dataset. For example, BLS data may only be available at the county level, not the census tract. This table presents a rough approximation of source data and compatible geography (rough draft...need to verify these).
DATASET | NATION | STATE | MSA | PUMA | PLACE | COUNTY | TRACT
----------|---------|---------|-------|---------|---------|---------|--------
IPEDS | X | X | X | X | X | X | .
ACS | X | X | X | X | X | X | X
ACS PUMS | X | X | . | X | . | . | .
BLS | X | X | X | X | X | X | .
CPB | X | X | . | . | . | . | .
CHR | X | X | X | X | X | X | .
## Geographic Aggregation
With census and economic data (show=geo & required=variable_of_interest) geography is implied. The other data types are aggregated by college major (show=cip), industry (show=naics), or occupational group (show=soc). If you want to calculate the statistics for these data by specific geographic regions you can do this in two ways.
The most straight-forward means is by using compound statement separated by commas. For example, show=naics,geo & level=0,state.
```r
# Number of employees by top-level industry, by state:
api <- "http://api.datausa.io/api/csv/?show=geo,naics&sumlevel=state,0&required=emp"
api %>% read.csv() %>% head()
# year geo_name geo naics_name naics emp
# 1 2013 Alabama 04000US01 Agriculture, Forestry, Fishing, & Hunting 11 6356
# 2 2013 Alabama 04000US01 Mining, Quarrying, & Oil & Gas Extraction 21 7942
# 3 2013 Alabama 04000US01 Utilities 22 17238
# 4 2013 Alabama 04000US01 Construction 23 78318
```
For finer control over geography in addition to using the sumlevel = nation, state, msa, etc. statements above, you can also use GEO + US + FIPS codes to specify specific regions (the code prefix in parenthesis).
* Nation (01000)
* State (04000) [ [list](http://datausa.io/about/attributes/geo/state/1/) ]
* County (05000) [ [list](http://datausa.io/about/attributes/geo/county/1/) ]
* Public Use Microdata Area (79500) [ [list](http://datausa.io/about/attributes/geo/puma/1/) ]
* Metropolitan Statistical Area (31000) [ [list](http://datausa.io/about/attributes/geo/msa/1/) ]
* Census Designated Place (16000) [ [list](http://datausa.io/about/attributes/geo/place/1/) ]
* Census Tracts (14000)
To create a code, combine the prefix, the letters "US", and the specific GEOID:
Area Type | GEOID Structure | Number of Digits | Example Geographic Area | Census GEOID | Full API Code
----------|-----------------|------------------|-------------------------|---------------|---------------
State | STATE | 2 | Texas | 48 | 04000US48 (STATE CODE+"US"+TEXAS CODE)
MSA | MSA ID | 5 | Atlanta, GA | 12060 | 31000US12060 (MSA CODE+"US"+MSA ID)
County | STATE+COUNTY | 2+3=5 | Harris County, TX | 48201 | 05000US48201 (COUNTY CODE+"US"+HARRIS CODE)
Places | STATE+PLACE | 2+5=7 | Houston, TX | 4835000 | 16000US4835000 (PLACE CODE+"US"+HOUSTON CODE)
Tracts | STATE+COUNTY+TRACTID | 2+3+6=11 | Census Tract 6001.02 in Akron, OH | 39133600102 | 14000US39133600102
See below for instructions on specifying advanced queries for instructions on selecting sets of geographic units using the code structure. For example, since the county code has the structure STATE+COUNTY, you can select all counties within a state by selecting all instances that begin with the state FIP.
For example, Massachusettes has a state FIP of 25, so all counties in MA will look like "05000US25xxx". You can select all counties within MA using the **where=** statement and the starts-with operator **^**:
```r
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US25"
api %>% read.csv() %>% head()
# year geo_name geo pop
# 1 2014 Barnstable County, MA 05000US25001 215167
# 2 2014 Berkshire County, MA 05000US25003 130064
# 3 2014 Bristol County, MA 05000US25005 551065
```
There is also a special case where you may want to search for all county-level or tract-level data within a specific geography such as an MSA or Census Place. You do this by specifying your data unit using "sumlevel" and also adding an additional **geo=** statement with the specific location.
For example, to return population data from all counties in Akron, OH (MSA Code=10420), you would specify the query this way:
```
http://api.datausa.io/api/?show=geo&sumlevel=county&required=pop&geo=31000US10420
```
This only works with COUNTIES inside MSAs, or TRACTS inside PLACES.
```r
library( dplyr )
# COUNTIES INSIDE AKRON, OH
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=income&year=latest&geo=31000US10420"
api %>% read.csv() %>% head()
# year geo_name geo income
# 1 2014 Portage County, OH 05000US39133 51275
# 2 2014 Summit County, OH 05000US39153 50365
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&geo=31000US10420"
api %>% read.csv() %>% head()
# Doesn't work!!!
# INSTEAD QUERY BOTH COUNTIES SEPARATELY USING 'WHERE' FILTER
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&where=geo:^14000US39133"
api %>% read.csv() %>% head() # Portage County, OH
# year geo_name geo income
# 1 2014 Census Tract 6001.02 14000US39133600102 60568
# 2 2014 Census Tract 6001.03 14000US39133600103 48229
# 3 2014 Census Tract 6002 14000US39133600200 58958
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&where=geo:^14000US39153"
api %>% read.csv() %>% head() # Summit County, OH
# year geo_name geo income
# 1 2014 Census Tract 5011 14000US39153501100 28298
# 2 2014 Census Tract 5018 14000US39153501800 24837
# 3 2014 Census Tract 5019 14000US39153501900 10874
# COUNTIES INSIDE ALASKA
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&geo=04000US02"
api %>% read.csv() %>% head()
# this does not work with Alaska state ID 04000US02
# TRY WITH COUNTIES GEO USING WHERE FILTER
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US02"
api %>% read.csv() %>% head()
# year geo_name geo pop
# 1 2014 Aleutians East Borough 05000US02013 3296
# 2 2014 Aleutians West Census Area 05000US02016 5650
# 3 2014 Anchorage Municipality, AK 05000US02020 298178
```
## Returning Multiple Variables
To return several variables at once, add a comma between names in the variable definition slot:
```
http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_men,grads_women
```
## Filtering Results by Criteria
Sometimes you would like to filter not merely by equality but by other mechanisms. Using the where clause in an API call lets us express more complicated expressions. The basic syntax of using where is: **where= column_name : condition**
Let's say we wanted to filter the results to only show top-level industries containing more than an estimated 10M people in the workforce. We would write:
```
http://api.datausa.io/api/? show=naics&sumlevel=0 & where=num_ppl:>10000000
```
Here are some operators for filtering:
|Description|Operator|Example Syntax|
|----|----|----|
|Greater than|>|[&where=num_ppl:>10000000](http://api.datausa.io/api/?show=naics&sumlevel=0&where=num_ppl:>10000000)|
|Less than|<|[&where=num_ppl:<10000000](http://api.datausa.io/api/?show=naics&sumlevel=0&where=num_ppl:<10000000)|
|String starts with|^|[&where=naics:^23](http://api.datausa.io/api/?show=naics&sumlevel=0&where=naics:^23)|
|String ends with|$ (placed after text)|[&where=naics:3$](http://api.datausa.io/api/?show=naics&sumlevel=0&where=naics:3$)|
|Not equal (integer)|!|[&where=avg_wage_rank:!1](http://api.datausa.io/api/?show=naics&sumlevel=0&where=avg_wage_rank:!1&order=avg_wage_rank&required=avg_wage_rank)|
|Not equal (string)|str!|[&where=geo:str!04000US25](http://api.datausa.io/api/?show=geo&sumlevel=state&where=geo:str!04000US25)|
## Importing Data into R
To easily import results of a query into R, make sure to include the "csv" argument and read directly from the URL:
```r
dat <- read.csv( "http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_men,grads_women" )
# > head( dat )
# year cip_name cip grads_men grads_women
# 1 2014 Philosophy and Religious Studies 38 11496 6930
# 2 2014 Construction 46 30134 1462
# 3 2014 Mechanical Technologies 47 101496 4946
# 4 2014 Education 13 70721 253623
# 5 2014 Social Sciences 45 95246 104178
# 6 2014 Engineering Technologies 15 77850 11092
```
Quickly build a census dataset by adding multiple variables:
```r
dat <- read.csv( "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop,income,mean_commute_minutes" )
# > head( dat )
# year geo_name geo pop income mean_commute_minutes
# 1 2014 Louisville/Jefferson County, KY-IN Metro Area 31000US31140 1253305 50386 22.8425
# 2 2013 Louisville/Jefferson County, KY-IN Metro Area 31000US31140 1244880 49753 22.7871
# 3 2014 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 31000US37980 6015336 62169 27.5973
# 4 2013 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 31000US37980 5992766 61923 27.3712
# 5 2014 Seattle-Tacoma-Bellevue, WA Metro Area 31000US42660 3557037 68969 26.7180
# 6 2013 Seattle-Tacoma-Bellevue, WA Metro Area 31000US42660 3504628 67745 26.3895
```
## Useful Economic Queries
Age, Salary
```
http://api.datausa.io/api/?show=age&sumlevel=all&required=avg_wage
```
CIP Course, Wage
```
http://api.datausa.io/api/?show=cip&sumlevel=all&required=avg_wage
```
CIP Course, Age, Wage
```
http://api.datausa.io/api/?show=age,cip&sumlevel=all,all&required=avg_wage
```
Course, Birthplace, Wage
```
http://api.datausa.io/api/?show=birthplace&sumlevel=all&required=avg_wage&cip=40
```
Course, Degree
```
http://api.datausa.io/api/?show=degree&sumlevel=all&required=avg_wage,num_ppl&cip=40
```
Geography
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=avg_wage
```
Geography, Birthplace
```
http://api.datausa.io/api/?show=geo,birthplace&sumlevel=state,all&required=num_over5
```
Geography, Course
```
http://api.datausa.io/api/?show=cip&sumlevel=all&required=num_ppl&geo=04000US25
```
Geography, Degree
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&degree=19
```
Geography, Industry
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&naics=111
```
Geography, Industry, Occupation
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&naics=111&soc=110000
```
Geography, Occupation
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&soc=110000
```
Geography, Occupation, Race
```
http://api.datausa.io/api/?show=geo,race&sumlevel=state,all&required=num_ppl&soc=110000
```
Geography, Occupation, Sex
```
http://api.datausa.io/api/?show=geo,sex&sumlevel=state,all&required=num_ppl&soc=110000
```
Geography, Race
```
http://api.datausa.io/api/?show=geo,race&sumlevel=state,all&required=num_ppl
```
Geography, Sex
```
http://api.datausa.io/api/?show=geo,sex&sumlevel=state,all&required=num_ppl
```
Geography, Wage Bin
```
http://api.datausa.io/api/?show=geo,wage_bin&sumlevel=state,all&required=num_ppl
```
Industry
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl
```
Industry, Course
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&cip=40
```
Industry, Degree
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&degree=19
```
Industry, Occupation
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000
```
Industry, Occupation, Race
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000&race=1
```
Industry, Occupation, Sex
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000&sex=1
```
Industry, Race
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&race=1
```
Industry, Sex
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&sex=1
```
Industry, Wage Bin
```
http://api.datausa.io/api/?show=wage_bin,naics&sumlevel=all,0&year=latest
```
Occupation
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest
```
Occupation, Course
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest&cip=40
```
Occupation, Course, Degree
```
http://api.datausa.io/api/?show=soc,degree&sumlevel=0,all&year=latest&cip=40
```
Occupation, Degree
```
http://api.datausa.io/api/?show=soc,degree&sumlevel=0,all&year=latest
```
Occupation, Race
```
http://api.datausa.io/api/?show=soc,race&sumlevel=0,all&year=latest
```
Occupation, Sex
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest&sex=1
```
Occupation, Wage Bin
```
http://api.datausa.io/api/?show=soc,wage_bin&sumlevel=0,all&year=latest
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment