jknowles/datausa_census_api.rmd

## datausa_census_api.rmd
# Using the dataUSA.io API for Census Data in R

This gist contains some notes on constructing a query for census and economic data from the [DataUSA.io](http://datausa.io/) site. This is a quick-start guide to their API; for in-depth documentation check out their [API documentation](https://github.com/DataUSA/datausa-api/wiki/Overview).

A great way to learn how to structure a query is to visit a specific datausa.io page and click on the "Options" button on top of any graph, then select "API" to see the query syntax that created the graph.


![Analytics](https://ga-beacon.appspot.com/UA-27835807-2/gist-id?pixel)

## Example Use


```r

library( dplyr )


### POPULATION OF METRO AREAS IN THE US

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
dat <- read.csv( api )
head( dat )
#   year     geo_name          geo    pop
# 1 2014 Aberdeen, SD 31000US10100  41564
# 2 2014 Aberdeen, WA 31000US10140  71734
# 3 2014  Abilene, TX 31000US10180 167171


### INCOME BY CENSUS TRACT

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest"
api %>% read.csv() %>% head()
#   year            geo_name                geo  income
# 1 2014    Census Tract 201 14000US01001020100 60000.0
# 2 2014    Census Tract 211 14000US01001021100 34183.0
# 3 2014    Census Tract 102 14000US01003010200 31480.0


### MEDIAN PROPERTY VALUE FOR EACH COUNTY IN ATLANTA METRO AREA

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=median_property_value&geo=31000US12060&year=latest"
api %>% read.csv() %>% head()
#   year            geo_name          geo median_property_value
# 1 2014 Cherokee County, GA 05000US13057                183200
# 2 2014  Clayton County, GA 05000US13063                 85700
# 3 2014     Cobb County, GA 05000US13067                203600


### NUMBER OF EMPLOYEES BY TOP-LEVEL INDUSTRY:

api <- "http://api.datausa.io/api/csv/?show=naics,geo&sumlevel=0,nation&required=emp&year=latest"
api %>% read.csv() %>% head()
#   year      geo_name     geo                                naics_name naics     emp
# 1 2013 United States 01000US Agriculture, Forestry, Fishing, & Hunting    11  154496
# 2 2013 United States 01000US Mining, Quarrying, & Oil & Gas Extraction    21  732186
# 3 2013 United States 01000US                                 Utilities    22  638575
# 4 2013 United States 01000US                              Construction    23 5470181


### NUMBER OF BACHELOR'S DEGREES AWARDED IN 2014 BY AREA OF STUDY

api <- "http://api.datausa.io/api/csv/?show=cip&sumlevel=6&degree=5&required=grads_total&year=latest"
api %>% read.csv() %>% head()
#   year                           cip_name   cip       degree_name degree grads_total
# 1 2014                General Agriculture 10000 Bachelor's Degree      5        1924
# 2 2014     General Agriculture Management 10101 Bachelor's Degree      5        1190
# 3 2014   Agricultural Business Operations 10102 Bachelor's Degree      5        1562


```

Note the use of the [pipe operator](http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/) in these examples to keep the code short. If you are not familiar with the pipe operator, just note that these statements are equivalent:

```r
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
dat <- read.csv( api )
head( dat )

library( dplyr ) # required for the pipe operator

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop&year=latest"
api %>% read.csv() %>% head()
```


## Anatomy of the API

All queries start with one of the following prefixes:

```r
http://api.datausa.io/api/?         # returns data as json file

http://api.datausa.io/api/csv/?     # returns data as csv file

```

Each query consists of a set of statements, all separated by the **&** operator.

* **show=**  which type of data to query
* **sumlevel=**  which level or geography to aggregate over
* **required=**  which variable(s) to use
* **year=**  which period of data to use ("latest" is a valid argument)

Depending upon what type of data you query, the *required* and *year* statements may be omitted.

For example, grab the population of all counties for 2014:

```r
# show=geo
# sumlevel=county
# required=pop
# year=2014

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=2014"
api %>% read.csv() %>% head()
#   year                geo_name          geo     pop
# 1 2014         Cook County, IL 05000US17031 5227827
# 2 2014         King County, WA 05000US53033 2008997
# 3 2014        Wayne County, MI 05000US26163 1790078
# 4 2014      Broward County, FL 05000US12011 1815269

```

Populaton of all counties within a specific state:

```r
# All counties in Massachusettes start with FIPS 05000US25

# show=geo
# sumlevel=county
# required=pop
# year=latest
# where=geo:^05000US25

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US25"
api %>% read.csv() %>% head()
#   year              geo_name          geo    pop
# 1 2014 Barnstable County, MA 05000US25001 215167
# 2 2014  Berkshire County, MA 05000US25003 130064
# 3 2014    Bristol County, MA 05000US25005 55106

```

College degree data for all of the US:

```r
# show=cip         data on college majors
# sumlevel=2       cip level 2 categories (see below)

api <- "http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_total"
api %>% read.csv() %>% head()
#   year                         cip_name cip grads_total
# 1 2014 Philosophy and Religious Studies  38       18426
# 2 2014                     Construction  46       31596
# 3 2014          Mechanical Technologies  47      106442
# 4 2014                        Education  13      324344
# 5 2014                  Social Sciences  45      199424
```


## Datasets and Variable

To find variables for the **required=** argument see these sources and their respective data dictionaries.

DATASET   |   CONTAINS    |  DATA DICTIONARY
----------|---------------|---------------------
ACS | American Community Survey - census demographics | [LINK](https://github.com/DataUSA/datausa-api/wiki/ACS-API)
ACS PUMS | ACS Public Use Microdata - employment statistics | [LINK](https://github.com/DataUSA/datausa-api/wiki/ACS-PUMS-API)
BLS | Bureau of Labor Statistics - industry and occupation | [LINK](https://github.com/DataUSA/datausa-api/wiki/BLS-API)
CBP | County Business Patterns - number of businesses | [LINK](https://github.com/DataUSA/datausa-api/wiki/CBP-API)
CHR | County Health Rankings | [LINK](https://github.com/DataUSA/datausa-api/wiki/County-Health-Rankings-API)
IPEDS | Integrated Post-Secondary Education Data System - data on college education |  [LINK](https://github.com/DataUSA/datausa-api/wiki/IPEDS-API)
ONET | Skills needed for employment | [LINK](https://github.com/DataUSA/datausa-api/wiki/ONET-API)


## Data Type & Level of Aggregation

Each query requires the user to specify a type of data (show=), a level of aggregation (sumlevel=), and a variable (require=).

In this API, what dataUSA will call an "[attribute](https://github.com/DataUSA/datausa-api/wiki/Attributes)" is what R users would refer to as a factor. These are variables that have levels that will be used to aggregate the outcome variables specified by *require=*. The primary nuance is that these data types often have nested or hierarchical levels. For example, the **geo** factor includes levels for states, counties, and MSAs. The **sumlevel=** statement allows the use to select aggregation at a specific level.

Data types are specified through the **show=** statement. The API recognizes all of the following:

* Geography (geo) - typically census and economic data aggregated by geography
* Course (cip) - employement and earnings by college and major
* Industry (naics,bls_naics,acs_ind) - industry data aggregated by industry subsector or group
* Occupation (soc,acs_occ) - occupation data aggregated by occupation group
* Sex (sex) - aggregate data by sex
* Race (acs_race,pums_race,race) - aggregate data by race - use acs_race with census data, etc.
* Veterans (conflict) - stats on veterans
* College Degree (degree)
* Skill (skill)
* Birthplace (birthplace)
* Wage Bin (wage_bin)
* Language (language)
* University Location (university) and Sector - Public, State, Private (sector)

It is helpful to be able to peek inside of these data types in order to see the levels that are includes and their associated attributes. Try the following helper function:

```r
# install.packages( "jsonlite" )
library( jsonlite )

# PRINT ALL VALID DATA (show=) TYPES:
fromJSON( "http://api.datausa.io/attrs/list/" )


viewDataLevels <- function( attribute )
{

    url <- paste( "http://api.datausa.io/attrs/", attribute, sep="" )
    temp <- fromJSON( url )
    df <- as.data.frame( temp[[1]] )
    names(df) <- temp[[2]]
    print( paste( "NUMBER OF ROWS:", nrow(df) ) )
    return(df)

}


viewDataLevels( "acs_ind" )[1:10,]
# [1] "NUMBER OF ROWS: 191"
#      id                                                      name level
# 1    04                                              Retail Trade     0
# 2    08 Professional, Scientific, Management, Admin., Waste Mgmt.     0
# 3  0403              Building Material & Garden Equipment Dealers     1
# 4  020x                                             Manufacturing     1
# 5  030x                                           Wholesale trade     1

viewDataLevels( "university" )[1:10,]
# [1] "NUMBER OF ROWS: 9124"
#    sector                                name                     url is_stem       county     state       lat         lng     id          msa
# 1       3         South University - Richmond www.southuniversity.edu       0 05000US51087 04000US51  37.64676   -77.59915 459259 31000US40060
# 2       3          Herzing University-Atlanta         www.herzing.edu       1 05000US13121 04000US13 33.846627  -84.362635 140340 31000US12060
# 3       1              Boise State University      www.boisestate.edu       1 05000US16001 04000US16 43.604403 -116.205789 142115 31000US14260
# 4       4     Eastern Idaho Technical College            www.eitc.edu       1 05000US16019 04000US16   43.4818 -111.984645 142179 31000US26820
# 5       1              Idaho State University            www.isu.edu/       1 05000US16005 04000US16 42.864108 -112.429084 142276 31000US38540

```


The level of aggregation (unit of analysis) for the data is specified by the sumlevels statement. Sumlevels for the most commmon data types are listed here:

If sumlevels for a particular variable type are not listed below, try **sumlevel=all**.

Supported sumlevels for **show=geo**:

* Nation (nation)
* State (state)                          [ [list](http://datausa.io/about/attributes/geo/state/1/) ]
* Public Use Micro Data (puma)           [ [list](http://datausa.io/about/attributes/geo/puma/1/) ] [ [definition](https://www.census.gov/geo/reference/puma.html) ]
* County (county)                        [ [list](http://datausa.io/about/attributes/geo/county/1/) ]
* Metropolitan Statistical Area (msa)    [ [list](http://datausa.io/about/attributes/geo/msa/1/) ] [ [definition](https://www.census.gov/population/metro/) ]
* Census Designated Place (place)        [ [list](http://datausa.io/about/attributes/geo/place/1/) ] [ [definition](https://www.census.gov/geo/reference/gtc/gtc_place.html) ]
* Census tract (tract)

*See below on how to query a specific geographic region.*

```r

###   income rank by state    ###
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=state&required=income_rank&year=latest"
api %>% read.csv() %>% head()
#   year   geo_name       geo income_rank
# 1 2014    Alabama 04000US01          47
# 2 2014     Alaska 04000US02           3
# 3 2014    Arizona 04000US04          31
# 4 2014   Arkansas 04000US05          50
# 5 2014 California 04000US06          10
# 6 2014   Colorado 04000US08          15


###   income rank by city   ###
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=income_rank&year=latest"
api %>% read.csv() %>% head()
#   year     geo_name          geo income_rank
# 1 2014 Aberdeen, SD 31000US10100         193
# 2 2014 Aberdeen, WA 31000US10140         565
# 3 2014  Abilene, TX 31000US10180         530
# 4 2014      Ada, OK 31000US10220         600
# 5 2014 Adjuntas, PR 31000US10260         929
# 6 2014   Adrian, MI 31000US10300         369

```


Supported sumlevels for **show=cip**:  [ [link to codes](http://datausa.io/about/attributes/cip/2_digit_course/1/) ]

* 2-digit (In ACS PUMS data only 2-digit data is available.)
* 4-digit
* 6-digit

Example of sumlevel=2, sumlevel=4, and sumlevel=6 results:

```
Close11) COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES

  Close11.01) Computer and Information Sciences, General

    11.0101) Computer and Information Sciences, General
    11.0102) Artificial Intelligence
    11.0103) Information Technology
    11.0104) Informatics
    11.0199) Computer and Information Sciences, Other

  Close11.02) Computer Programming

    11.0201) Computer Programming/Programmer, General
    11.0202) Computer Programming, Specific Applications
    11.0203) Computer Programming, Vendor/Product Certification
    11.0299) Computer Programming, Other

```


Supported sumlevels for **show=naics**:  [ [link to codes](http://datausa.io/about/attributes/naics/sector/1/) ]

* Sector (0)
* Sub Sector (1)
* Group (2)

Example industry at sumlevel=0, sumlevel=1, and sumlevel=2.

```
Financial Activities

  Finance and Insurance (NAICS 52)

    Monetary Authorities - Central Bank (NAICS 521)
    Credit Intermediation and Related Activities (NAICS 522)
    Insurance Carriers and Related Activities (NAICS 524)
    Funds, Trusts, and Other Financial Vehicles (NAICS 525)

```

``` r
library( dplyr )

# NUMBER OF EMPLOYEES BY INDUSTRY (IN THOUSANDS)

api <- "http://api.datausa.io/api/csv/?show=naics&sumlevel=all&required=emp_2014_thousands"
api %>% read.csv() %>% head()
#                                      naics_name  naics emp_2014_thousands
# 1 Mining, Quarrying, and Oil and Gas Extraction     21              843.8
# 2                        Oil and Gas Extraction    211              197.9
# 3                   Mining (except Oil and Gas)    212              207.4

# ESTABLISHMENTS BY INDUSTRY WITHIN MSAs

api <- "http://api.datausa.io/api/csv/?show=geo,naics&sumlevel=msa,1&required=est"
api %>% read.csv() %>% head()
#   year     geo_name          geo                     naics_name naics est
# 1 2013 Escanaba, MI 31000US21540         Couriers & messengers    492   4
# 2 2013 Escanaba, MI 31000US21540         Warehousing & storage    493   1
# 3 2013 Escanaba, MI 31000US21540                           None   511   3

```


Supported sumlevels for **show=soc**:  [ [link to codes](http://datausa.io/about/attributes/soc/major_group/1/) ]

* Major Group (0)
* Minor Group (1)
* Broad Occupation (2)
* Detailed Occupation (3)


**IMPORTANT**: The allowable sumlevels will depend entirely upon the underlying data table that is being queried. In order to determine capatability of your show= & sumlevel= arguments, you need to check. Currently you can search by the data type only. The function below will print out all compatible data tables and their joint allowable sumlevels.


```r
# install.packages( "jsonlite" )

getUsage <- function( attribute, return.df=F )
{
    library( jsonlite )
    url <- paste( "http://api.datausa.io/api/logic/?show=", attribute, "&sumlevel=all", sep="" )
    temp <- fromJSON( url, simplifyDataFrame=FALSE )

    for( i in 1:length(temp$tables) )
    {
       cat( paste( "TABLE:", temp$tables[[i]]$table, "\n" ) )
       cat( paste( "DATA SOURCE:", temp$tables[[i]]$dataset, "\n" ) )
       cat( paste( "DEPARTMENT:", temp$tables[[i]]$org, "\n" ) )
       cat( paste( "LINK:", temp$tables[[i]]$link, "\n\n" ) )
       cat( "SUPPORTED SUMLEVELS:", "\n\n" )
       for( j in 1:length(temp$tables[[i]]$supported_levels) )
       {
            cat( paste( names(temp$tables[[i]]$supported_levels[j]), ":  ",
                         paste( unlist(temp$tables[[i]]$supported_levels[j]), collapse=", "), "\n", sep="" ) )
       }
       cat("\n \n \n \n \n")
    }

    if( return.df==T )
    {
      temp <- fromJSON( url, simplifyDataFrame=TRUE )
      df <- as.data.frame( temp[[1]] )
      return(df)
    }
    else{ return(NULL) }

}


getUsage( "geo" )
# TABLE: ygi_num_emp
# DATA SOURCE: ACS 3-year Estimate
# DEPARTMENT: Census Bureau
# LINK: http://www.census.gov/programs-surveys/acs/
#
# SUPPORTED SUMLEVELS:
#
# acs_ind:  0, 1, 2, all
# geo:  nation, state, msa, all


# msa is a supported geography:

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=num_emp"
api %>% read.csv() %>% head()
#   year                                    acs_ind_name acs_ind       geo_name          geo num_emp
# 1 2014 Agriculture, Forestry, Fishing, Hunting, Mining      00 Albert Lea, MN 31000US10660     953
# 2 2014         Agriculture, Forestry, Fishing, Hunting    0000 Albert Lea, MN 31000US10660     930
# 3 2014          Mining, Quarrying, Oil, Gas Extraction    0001 Albert Lea, MN 31000US10660      23


# county is not a supported geography

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=num_emp"
api %>% read.csv() %>% head()
# does not work because county is not a supported sumlevel


# we can use any combination of the supported geo and acs_ind sumlevels:

api <- "http://api.datausa.io/api/csv/?show=geo,acs_ind&sumlevel=msa,0&required=num_emp"
api %>% read.csv() %>% head()
#   year                                    acs_ind_name acs_ind     geo_name          geo num_emp
# 1 2014 Agriculture, Forestry, Fishing, Hunting, Mining       0 Aberdeen, SD 31000US10100    1545
# 2 2014                                    Construction       1 Aberdeen, SD 31000US10100    1328
# 3 2014                                   Manufacturing       2 Aberdeen, SD 31000US10100    2887


usage.table <- getUsage( "geo", return.df=T )

```

Data compatability will vary with each geography depending upon the source dataset. For example, BLS data may only be available at the county level, not the census tract. This table presents a rough approximation of source data and compatible geography (rough draft...need to verify these).

DATASET   |  NATION  |  STATE  |  MSA  |  PUMA  |  PLACE  |  COUNTY  |  TRACT
----------|---------|---------|-------|---------|---------|---------|--------
IPEDS     |   X     |   X     |   X   |   X     |   X     |   X     |   .
ACS       |   X     |   X     |   X   |   X     |   X     |   X     |   X
ACS PUMS  |   X     |   X     |   .   |   X     |   .     |   .     |   .
BLS       |   X     |   X     |   X   |   X     |   X     |   X     |   .
CPB       |   X     |   X     |   .   |   .     |   .     |   .     |   .
CHR       |   X     |   X     |   X   |   X     |   X     |   X     |   .


## Geographic Aggregation


With census and economic data (show=geo & required=variable_of_interest) geography is implied. The other data types are aggregated by college major (show=cip), industry (show=naics), or occupational group (show=soc). If you want to calculate the statistics for these data by specific geographic regions you can do this in two ways.

The most straight-forward means is by using compound statement separated by commas. For example, show=naics,geo & level=0,state.

```r
# Number of employees by top-level industry, by state:

api <- "http://api.datausa.io/api/csv/?show=geo,naics&sumlevel=state,0&required=emp"
api %>% read.csv() %>% head()
#   year geo_name       geo                                naics_name naics   emp
# 1 2013  Alabama 04000US01 Agriculture, Forestry, Fishing, & Hunting    11  6356
# 2 2013  Alabama 04000US01 Mining, Quarrying, & Oil & Gas Extraction    21  7942
# 3 2013  Alabama 04000US01                                 Utilities    22 17238
# 4 2013  Alabama 04000US01                              Construction    23 78318
```


For finer control over geography in addition to using the sumlevel = nation, state, msa, etc. statements above, you can also use GEO + US + FIPS codes to specify specific regions (the code prefix in parenthesis).

* Nation (01000)
* State (04000)                             [ [list](http://datausa.io/about/attributes/geo/state/1/) ]
* County (05000)                            [ [list](http://datausa.io/about/attributes/geo/county/1/) ]
* Public Use Microdata Area (79500)         [ [list](http://datausa.io/about/attributes/geo/puma/1/) ]
* Metropolitan Statistical Area (31000)     [ [list](http://datausa.io/about/attributes/geo/msa/1/) ]
* Census Designated Place (16000)           [ [list](http://datausa.io/about/attributes/geo/place/1/) ]
* Census Tracts (14000)

To create a code, combine the prefix, the letters "US", and the specific GEOID:

Area Type | GEOID Structure | Number of Digits | Example Geographic Area | Census GEOID | Full API Code
----------|-----------------|------------------|-------------------------|---------------|---------------
State | STATE | 2 | Texas | 48 | 04000US48 (STATE CODE+"US"+TEXAS CODE)
MSA   | MSA ID | 5 | Atlanta, GA | 12060 | 31000US12060 (MSA CODE+"US"+MSA ID)
County | STATE+COUNTY | 2+3=5 | Harris County, TX | 48201 | 05000US48201 (COUNTY CODE+"US"+HARRIS CODE)
Places | STATE+PLACE | 2+5=7 | Houston, TX | 4835000 | 16000US4835000 (PLACE CODE+"US"+HOUSTON CODE)
Tracts | STATE+COUNTY+TRACTID | 2+3+6=11 | Census Tract 6001.02 in Akron, OH | 39133600102 | 14000US39133600102

See below for instructions on specifying advanced queries for instructions on selecting sets of geographic units using the code structure. For example, since the county code has the structure STATE+COUNTY, you can select all counties within a state by selecting all instances that begin with the state FIP.

For example, Massachusettes has a state FIP of 25, so all counties in MA will look like "05000US25xxx". You can select all counties within MA using the **where=** statement and the starts-with operator **^**:

```r
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US25"
api %>% read.csv() %>% head()
#   year              geo_name          geo    pop
# 1 2014 Barnstable County, MA 05000US25001 215167
# 2 2014  Berkshire County, MA 05000US25003 130064
# 3 2014    Bristol County, MA 05000US25005 551065
```

There is also a special case where you may want to search for all county-level or tract-level data within a specific geography such as an MSA or Census Place. You do this by specifying your data unit using "sumlevel" and also adding an additional **geo=** statement with the specific location.

For example, to return population data from all counties in Akron, OH (MSA Code=10420), you would specify the query this way:

```
http://api.datausa.io/api/?show=geo&sumlevel=county&required=pop&geo=31000US10420
```

This only works with COUNTIES inside MSAs, or TRACTS inside PLACES.


```r
library( dplyr )

# COUNTIES INSIDE AKRON, OH
api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=income&year=latest&geo=31000US10420"
api %>% read.csv() %>% head()
#   year           geo_name          geo income
# 1 2014 Portage County, OH 05000US39133  51275
# 2 2014  Summit County, OH 05000US39153  50365

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&geo=31000US10420"
api %>% read.csv() %>% head()
# Doesn't work!!!

# INSTEAD QUERY BOTH COUNTIES SEPARATELY USING 'WHERE' FILTER

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&where=geo:^14000US39133"
api %>% read.csv() %>% head() # Portage County, OH
#   year             geo_name                geo income
# 1 2014 Census Tract 6001.02 14000US39133600102  60568
# 2 2014 Census Tract 6001.03 14000US39133600103  48229
# 3 2014    Census Tract 6002 14000US39133600200  58958

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=tract&required=income&year=latest&where=geo:^14000US39153"
api %>% read.csv() %>% head() # Summit County, OH
#   year             geo_name                geo income
# 1 2014    Census Tract 5011 14000US39153501100  28298
# 2 2014    Census Tract 5018 14000US39153501800  24837
# 3 2014    Census Tract 5019 14000US39153501900  10874


# COUNTIES INSIDE ALASKA

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&geo=04000US02"
api %>% read.csv() %>% head()
# this does not work with Alaska state ID 04000US02

# TRY WITH COUNTIES GEO USING WHERE FILTER

api <- "http://api.datausa.io/api/csv/?show=geo&sumlevel=county&required=pop&year=latest&where=geo:^05000US02"
api %>% read.csv() %>% head()
#   year                   geo_name          geo    pop
# 1 2014     Aleutians East Borough 05000US02013   3296
# 2 2014 Aleutians West Census Area 05000US02016   5650
# 3 2014 Anchorage Municipality, AK 05000US02020 298178
```


## Returning Multiple Variables

To return several variables at once, add a comma between names in the variable definition slot:

```
http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_men,grads_women
```

## Filtering Results by Criteria

Sometimes you would like to filter not merely by equality but by other mechanisms. Using the where clause in an API call lets us express more complicated expressions. The basic syntax of using where is: **where= column_name : condition**

Let's say we wanted to filter the results to only show top-level industries containing more than an estimated 10M people in the workforce. We would write:

```
http://api.datausa.io/api/? show=naics&sumlevel=0 & where=num_ppl:>10000000
```

Here are some operators for filtering:

|Description|Operator|Example Syntax|
|----|----|----|
|Greater than|>|[&where=num_ppl:>10000000](http://api.datausa.io/api/?show=naics&sumlevel=0&where=num_ppl:>10000000)|
|Less than|<|[&where=num_ppl:<10000000](http://api.datausa.io/api/?show=naics&sumlevel=0&where=num_ppl:<10000000)|
|String starts with|^|[&where=naics:^23](http://api.datausa.io/api/?show=naics&sumlevel=0&where=naics:^23)|
|String ends with|$ (placed after text)|[&where=naics:3$](http://api.datausa.io/api/?show=naics&sumlevel=0&where=naics:3$)|
|Not equal (integer)|!|[&where=avg_wage_rank:!1](http://api.datausa.io/api/?show=naics&sumlevel=0&where=avg_wage_rank:!1&order=avg_wage_rank&required=avg_wage_rank)|
|Not equal (string)|str!|[&where=geo:str!04000US25](http://api.datausa.io/api/?show=geo&sumlevel=state&where=geo:str!04000US25)|


## Importing Data into R

To easily import results of a query into R, make sure to include the "csv" argument and read directly from the URL:


```r
dat <- read.csv( "http://api.datausa.io/api/csv/?show=cip&sumlevel=2&required=grads_men,grads_women" )

# > head( dat )
#   year                         cip_name cip grads_men grads_women
# 1 2014 Philosophy and Religious Studies  38     11496        6930
# 2 2014                     Construction  46     30134        1462
# 3 2014          Mechanical Technologies  47    101496        4946
# 4 2014                        Education  13     70721      253623
# 5 2014                  Social Sciences  45     95246      104178
# 6 2014         Engineering Technologies  15     77850       11092

```

Quickly build a census dataset by adding multiple variables:

```r
dat <- read.csv( "http://api.datausa.io/api/csv/?show=geo&sumlevel=msa&required=pop,income,mean_commute_minutes" )

# > head( dat )
#   year                                               geo_name          geo     pop income mean_commute_minutes
# 1 2014          Louisville/Jefferson County, KY-IN Metro Area 31000US31140 1253305  50386              22.8425
# 2 2013          Louisville/Jefferson County, KY-IN Metro Area 31000US31140 1244880  49753              22.7871
# 3 2014 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 31000US37980 6015336  62169              27.5973
# 4 2013 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area 31000US37980 5992766  61923              27.3712
# 5 2014                 Seattle-Tacoma-Bellevue, WA Metro Area 31000US42660 3557037  68969              26.7180
# 6 2013                 Seattle-Tacoma-Bellevue, WA Metro Area 31000US42660 3504628  67745              26.3895
```


## Useful Economic Queries


Age, Salary
```
http://api.datausa.io/api/?show=age&sumlevel=all&required=avg_wage
```

CIP Course, Wage
```
http://api.datausa.io/api/?show=cip&sumlevel=all&required=avg_wage
```

CIP Course, Age, Wage
```
http://api.datausa.io/api/?show=age,cip&sumlevel=all,all&required=avg_wage
```

Course, Birthplace, Wage
```
http://api.datausa.io/api/?show=birthplace&sumlevel=all&required=avg_wage&cip=40
```

Course, Degree
```
http://api.datausa.io/api/?show=degree&sumlevel=all&required=avg_wage,num_ppl&cip=40
```

Geography
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=avg_wage
```

Geography, Birthplace
```
http://api.datausa.io/api/?show=geo,birthplace&sumlevel=state,all&required=num_over5
```

Geography, Course
```
http://api.datausa.io/api/?show=cip&sumlevel=all&required=num_ppl&geo=04000US25
```

Geography, Degree
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&degree=19
```

Geography, Industry
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&naics=111
```

Geography, Industry, Occupation
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&naics=111&soc=110000
```

Geography, Occupation
```
http://api.datausa.io/api/?show=geo&sumlevel=state&required=num_ppl&soc=110000
```

Geography, Occupation, Race
```
http://api.datausa.io/api/?show=geo,race&sumlevel=state,all&required=num_ppl&soc=110000
```

Geography, Occupation, Sex
```
http://api.datausa.io/api/?show=geo,sex&sumlevel=state,all&required=num_ppl&soc=110000
```

Geography, Race
```
http://api.datausa.io/api/?show=geo,race&sumlevel=state,all&required=num_ppl
```

Geography, Sex
```
http://api.datausa.io/api/?show=geo,sex&sumlevel=state,all&required=num_ppl
```

Geography, Wage Bin
```
http://api.datausa.io/api/?show=geo,wage_bin&sumlevel=state,all&required=num_ppl
```

Industry
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl
```

Industry, Course
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&cip=40
```

Industry, Degree
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&degree=19
```

Industry, Occupation
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000
```

Industry, Occupation, Race
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000&race=1
```

Industry, Occupation, Sex
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&soc=110000&sex=1
```

Industry, Race
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&race=1
```

Industry, Sex
```
http://api.datausa.io/api/?show=naics&sumlevel=0&required=num_ppl&sex=1
```

Industry, Wage Bin
```
http://api.datausa.io/api/?show=wage_bin,naics&sumlevel=all,0&year=latest
```

Occupation
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest
```

Occupation, Course
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest&cip=40
```

Occupation, Course, Degree
```
http://api.datausa.io/api/?show=soc,degree&sumlevel=0,all&year=latest&cip=40
```

Occupation, Degree
```
http://api.datausa.io/api/?show=soc,degree&sumlevel=0,all&year=latest
```

Occupation, Race
```
http://api.datausa.io/api/?show=soc,race&sumlevel=0,all&year=latest
```

Occupation, Sex
```
http://api.datausa.io/api/?show=soc&sumlevel=0&year=latest&sex=1
```

Occupation, Wage Bin
```
http://api.datausa.io/api/?show=soc,wage_bin&sumlevel=0,all&year=latest
```