@djhocking
Last active August 29, 2015 14:13
Big Queries and collect with dplyr
# fetch temperature data
tbl_values <- left_join(tbl_series,
                        select(tbl_variables, variable_id, variable_name),
                        by = c('variable_id' = 'variable_id')) %>%
  select(-file_id) %>%
  filter(location_id %in% df_locations$location_id,
         variable_name == "TEMP") %>%
  # tbl_values on the right-hand side is the existing remote values table
  left_join(tbl_values,
            by = c('series_id' = 'series_id')) %>%
  left_join(select(tbl_locations, location_id, location_name, latitude, longitude,
                   featureid = catchment_id),
            by = c('location_id' = 'location_id')) %>%
  left_join(tbl_agencies,
            by = c('agency_id' = 'agency_id')) %>%
  mutate(year = date_part('year', datetime))

df_values <- collect(tbl_values)
df_values <- df_values %>%
  mutate(datetime = with_tz(datetime, tzone = 'EST'))
summary(df_values)
# create climateData input dataset (too big without pre-filter or smaller join)
# tried to do the year filter within the postgres tbl query but was getting
# errors with it recognizing the derived year column
climate <- tbl_daymet %>%
  mutate(year = date_part('year', date)) %>%
  filter(featureid %in% df_locations$featureid)

tbl_climate <- climate %>%
  filter(year %in% unique(df_values$year)) # distinct() doesn't work on numeric values

climateData <- collect(tbl_climate)
@djhocking

> tbl_climate$query
<Query> SELECT "featureid", "date", "tmax", "tmin", "prcp", "dayl", "srad", "vp", "swe", DATE_PART('year', "date") AS "year"
FROM "daymet"
WHERE "featureid" IN (NULL, NULL, NULL, NULL, NULL, NULL, 831777, NULL, 831777, 818865, 823965, 817228, 818552, 844517,

...

749089, 740783) AND "year" IN UNIQUE((2005.0, 2005.0, ... , 2006.0, 2006.0))
<PostgreSQLConnection:(2376,1)> 

@walkerjeffd

Hmm, why so many nulls in the featureid list? All locations should be associated with a featureid. Let me check that out.

@walkerjeffd

And it looks like something weird happens with the UNIQUE. dplyr may not be handling this right; I'm not sure. You could try creating the list of unique values outside the filter() call:

unique_years <- as.integer(unique(df_values$year))
tbl_climate <- climate %>%
  filter(year %in% unique_years)

climateData <- collect(tbl_climate)
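As a quick sanity check on toy data (values chosen to mirror the duplicated 2005.0/2006.0 entries in the query above), as.integer(unique(...)) yields a plain integer vector, which dplyr can inline directly into the generated SQL IN clause instead of trying to translate unique() into a SQL UNIQUE(...) call:

```r
# toy stand-in for the year column of the collected values
years <- c(2005.0, 2005.0, 2006.0, 2006.0)

# computing the unique set in R, before the filter(), avoids the translation problem
unique_years <- as.integer(unique(years))
unique_years
# [1] 2005 2006
```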

@walkerjeffd

Oh, I know why there are nulls: the catchment_id is not currently being set when someone adds a new location, so the locations that are null have all been created in the past week or two.

@walkerjeffd

Ok, I just updated the locations table to fill in the missing catchment_ids, so you shouldn't get any more nulls in locations.catchment_id.

@djhocking

Great, thanks. What about creating a unique site-year combination in postgres before I collect it? That is really what I want: a query of the site-year combinations from the values table. I was hoping something like mutate(siteyear = featureid || '-' || year) would work, but to no avail.

@djhocking

I would use paste, but I assume it doesn't work when doing the postgres query. I could do it in the data frame after collecting the query, but the collect is going to take hours, so I wanted to make it as small as possible.
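For what it's worth, the post-collect route can be sketched on a toy data frame (the column names here are assumptions standing in for the collected values table):

```r
# toy stand-in for the collected values data frame
df_values <- data.frame(
  featureid = c(740020, 740020, 746011),
  year      = c(2005, 2005, 2006)
)

# paste() works fine once the data are local in R
df_values$siteyear <- paste(df_values$featureid, df_values$year, sep = "-")
unique(df_values$siteyear)
# [1] "740020-2005" "746011-2006"
```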

@djhocking

When I do the unique outside of the query it still gives the error:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  column "year" does not exist
LINE 3: ...9688, 750755, 750240, 747527, 749089, 740783) AND "year" IN ...
                                                             ^
)
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
  Could not create execute: SELECT count(*) FROM (SELECT "featureid", "date", "tmax", "tmin", "prcp", "dayl", "srad", "vp", "swe", DATE_PART('year', "date") AS "year"
FROM "daymet"
WHERE "featureid" IN (751988, 892785, 751071, 751326, 752235, 750688, 831777, 752527, 831777, 818865, 823965, 817228, 818552, 844517, 842601, 836717, 831211, 830038, 834069, 821646, 821232, 820519, 822376, 831406, 837021, 834836, 835015, 831556, 847590, 850363, 848456, 823248, 823105, 818739, 818995, 827206, 823447, 842681, 844720, 836877, 848226, 847984, 852898, 832460, 834868, 827625, 821590, 829471, 830232, 817498, 842827, 836579, 839353, 848112, 834432, 829941, 832740, 832740, 827155, 827155, 831625, 831204, 846512, 838486, 852052, 838187, 836698, 838152, 838495, 828220, 827384, 826628, 826628, 826800, 837836, 837836, 836195, 836195, 830802, 827155, 827155, 827155, 832532, 831586, 836683, 842041, 842926, 841984, 837476, 845441, 826973, 837610, 849419, 825153, 823803, 819869, 830736, 830598, 829474, 830778, [... truncated]

@djhocking

> tbl_climate$query
<Query> SELECT "featureid", "date", "tmax", "tmin", "prcp", "dayl", "srad", "vp", "swe", DATE_PART('year', "date") AS "year"
FROM "daymet"
WHERE "featureid" IN (751988, 892785, 751071, 751326, 752235, 750688, 831777, 752527, 831777, 818865, 823965, 817228, 818552, 844517, 842601, 836717, 831211, 830038, 834069, 821646, 821232, 820519, 822376, 831406, 837021, 834836, 835015, 831556, 847590, 850363, 848456, 823248, 823105, 818739, 818995, 827206, 823447, 842681, 844720, 836877, 848226, 847984, 852898, 832460, 834868, 827625, 821590, 829471, 830232, 817498, 842827, 836579, 839353, 848112, 834432, 829941, 832740, 832740, 827155, 827155, 831625, 831204, 846512, 838486, 852052, 838187, 836698, 838152, 838495, 828220, 827384, 826628, 826628, 826800, 837836, 837836, 836195, 836195, 830802, 827155, 827155, 827155, 832532, 831586, 836683, 842041, 842926, 841984, 837476, 845441, 826973, 837610, 849419, 825153, 823803, 819869, 830736, 830598, 829474, 830778, 822994, 823054, 826053, 845784, 844333, 842739, 842037, 842322, 830769, 833192, 853850, 852906, 851490, 850789, 845245, 833354, 829948, 832450, 833118, 833001, 834912, 831928, 847662, 847872, 824708, 824972, 825012, 825741, 825310, 825347, 822962, 844993, 844993, 844993, 845700, 845700, 845700, 829145, 829145, 830830, 830482, 830344, 831967, 831456, 833244, 831293, 833368, 848385, 848385, 842700, 842700, 853228, 848611, 848612, 848898, 852744, 834816, 850788, 851773, 854457, 844127, 845232, 848605, 846189, 844862, 845691, 848103, 823950, 831270, 821687, 828249, 826326, 824368, 827041, 829634, 828444, 825361, 827041, 830633, 816747, 818720, 819838, 820570, 822347, 817886, 822156, 819969, 823281, 822378, 821005, 836217, 834443, 832872, 831927, 828738, 837596, 820907, 820386, 812717, 815619, 814460, 820786, 817714, 833910, 834122, 833922, 834122, 833922, 834183, 834122, 753616, 753616, 752852, 752852, 752130, 751771, 751771, 751146, 750708, 750345, 750190, 750190, 750190, 749887, 749313, 748691, 753350, 753240, 751085, 751326, 750755, 751397, 751397, 751397, 751928, 751889, 751889, 751507, 751507, 751183, 751072, 750893, 750893, 750893, 750687, 750687, 750357, 
750357, 751448, 750991, 751362, 747621, 748692, 748738, 748521, 748279, 748236, 748692, 750302, 746598, 747594, 747933, 747074, 747074, 747074, 747074, 746516, 746516, 746516, 749389, 748568, 748072, 748045, 750106, 750068, 746855, 746855, 749491, 749980, 749292, 749015, 748018, 748349, 748443, 747635, 747635, 747525, 747144, 748336, 747747, 747579, 747751, 747751, 747751, 747683, 747638, 747375, 747375, 747375, 747683, 746724, 746229, 749707, 749707, 749707, 749707, 747964, 747433, 746502, 746289, 746533, 747070, 747070, 747070, 747070, 747355, 747100, 747100, 747100, 747100, 746982, 745702, 746630, 746630, 746630, 746630, 746630, 746430, 746430, 746203, 746011, 746011, 746669, 745871, 745847, 746755, 740393, 740020, 744350, 749688, 750755, 750240, 747527, 749089, 740783) 
AND "year" IN (2005, 2006, 2007, 2008, 2009, 2010, 2012, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 2011, 1995, 1994, 1993, 1996, 1992, 1991)
<PostgreSQLConnection:(2376,2)> 

@walkerjeffd

Yeah, I hear what you're saying about trying to make it as small as possible. I'll try to write a straight SQL query to do this; once we get that working, we can try to figure out if we can recreate it from dplyr. I suspect we're pushing the limits of dplyr's R->SQL translation here.

Not sure why you're still getting that error, the DATE_PART('year', "date") as "year" should be creating that column for the WHERE statement.

@walkerjeffd

Ok, here's what I've got so far. This query will create a table with columns [featureid, variable, year, n] listing the number of values for each unique combination of featureid, var.name, and year (and it only includes water temperature):

SELECT l.catchment_id AS featureid,
       DATE_PART('year', v.datetime) AS year,
       var.name as variable, count(v.value) AS N
FROM series s, values v, locations l, variables var
WHERE s.id = v.series_id AND s.location_id=l.id AND s.variable_id=var.id
  AND var.name='TEMP'
GROUP BY var.name, featureid, year;

Now just need to join this with daymet...
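For reference, the same grouped count can be sketched in dplyr on a toy in-memory table (column names assumed from the query above):

```r
library(dplyr)

# toy stand-in for the joined series/values/locations/variables tables
joined <- data.frame(
  featureid = c(740020, 740020, 746011),
  year      = c(2005, 2005, 2006),
  variable  = "TEMP",
  value     = c(10.2, 11.5, 9.8)
)

# count of values per variable/featureid/year, mirroring the GROUP BY above
counts <- joined %>%
  group_by(variable, featureid, year) %>%
  summarise(n = n())
```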

@walkerjeffd

FYI, I'm building a new index on the daymet table based on both featureid and date_part('year', daymet.date). This should drastically speed up these queries, since most daymet queries will involve filtering by both featureid and year. Cross your fingers...
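As a sketch, such an index might be created like this (the index name is an assumption; PostgreSQL allows the function call date_part('year', date) as an index expression alongside a plain column):

```sql
-- hypothetical definition of the multicolumn expression index on daymet
CREATE INDEX daymet_featureid_year_idx
  ON daymet (featureid, date_part('year', date));
```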

@walkerjeffd

Ok, I think this will work for getting the daily daymet values associated with only the catchments and years that have data. The first part (WITH loc_year AS (...)) is called a Common Table Expression (CTE) and basically creates a temporary table named loc_year that can then be used in the join with the daymet table. Note that this table is not saved in the database; it is only used by the query when it is run.

The loc_year CTE will be a table containing the unique combinations of featureid, year, and variable, along with the number of values (n). It is also filtered to include only water temperature (so the variable column is kind of moot). The query below also filters for only two featureids ('740020', '746011'), just for testing. You can remove the last line of the WHERE clause (AND l.catchment_id IN ('740020', '746011')) to retrieve all featureids with data.

WITH loc_year AS (
  SELECT l.catchment_id AS featureid,
         DATE_PART('year', v.datetime) AS year,
         var.name as variable, count(v.value) AS N
  FROM series s, values v, locations l, variables var
  WHERE s.id = v.series_id 
    AND s.location_id=l.id 
    AND s.variable_id=var.id
    AND var.name='TEMP'
    AND l.catchment_id IN ('740020', '746011')
  GROUP BY var.name, featureid, year
)

SELECT date_part('year', d.date) as year, d.featureid, ly.n as n_values,
       d.date, d.tmax, d.tmin, d.prcp, d.dayl, d.srad, d.vp, d.swe
FROM daymet d
INNER JOIN loc_year ly
ON d.featureid=ly.featureid
  AND date_part('year', d.date)=ly.year
ORDER BY d.featureid, d.date;

The multicolumn index (featureid, year) on daymet is still being created, which might take a few hours, but once it's ready this query should hopefully be at least somewhat faster.

Here's code to try running in R:

library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")

# create connection
con <- dbConnect(drv, dbname="conte_dev", host="127.0.0.1", user="conte", password="conte")

# create sql query string
qry <- "WITH loc_year AS (
  SELECT l.catchment_id AS featureid,
         DATE_PART('year', v.datetime) AS year,
         var.name as variable, count(v.value) AS N
  FROM series s, values v, locations l, variables var
  WHERE s.id = v.series_id
    AND s.location_id=l.id
    AND s.variable_id=var.id
    AND var.name='TEMP'
    AND l.catchment_id IN ('740020', '746011')
  GROUP BY var.name, featureid, year
)

SELECT date_part('year', d.date) as year, d.featureid, ly.n as n_values,
       d.date, d.tmax, d.tmin, d.prcp, d.dayl, d.srad, d.vp, d.swe
FROM daymet d
INNER JOIN loc_year ly
  ON d.featureid=ly.featureid
  AND date_part('year', d.date)=ly.year
ORDER BY d.featureid, d.date;"

# submit query
result <- dbSendQuery(con, qry)

# fetch results (n=-1 means return all rows, use n=5 to return just first 5 rows, for example)
df <- fetch(result, n=-1)

# check that each featureid has one or more complete years of daily daymet values
table(df$year, df$featureid)

# plot
library(ggplot2)
theme_set(theme_bw())

ggplot(df, aes(date, tmax)) +
  geom_line() +
  facet_wrap(~featureid)

# close the result set and connection when done
dbClearResult(result)
dbDisconnect(con)
