Skip to content

Instantly share code, notes, and snippets.

@fredbenenson
Created November 20, 2013 16:56
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save fredbenenson/7566727 to your computer and use it in GitHub Desktop.
Save fredbenenson/7566727 to your computer and use it in GitHub Desktop.
Example Redshift Query
SELECT *
FROM
(SELECT
month,
amount,
pledge_count,
SUM(1) OVER(PARTITION BY month ORDER BY pledge_count DESC ROWS UNBOUNDED PRECEDING) as row
FROM
(SELECT
TO_CHAR(CONVERT_TIMEZONE('UTC', 'America/New_York', backings.pledged_at), 'YYYY-MM-01') as month,
backings.amount as amount,
COUNT(DISTINCT backings.id) AS pledge_count
FROM
backings
GROUP BY month, backings.amount) AS backings_per_month
ORDER BY month)
WHERE row <= 10;
@hadley
Copy link

hadley commented Nov 20, 2013

Goal of dplyr is to make it so you could express that in R with code that looks like this:

backings <- tbl(my_redshift, "backing")

backings_per_month <- backings %.% 
  mutate(month = TO_CHAR(CONVERT_TIMEZONE('UTC', 'America/New_York', pledged_at), 'YYYY-MM-01')) %.% 
  group_by(backings, month) %.% 
  summarise(pledge_count = n())

top10 <- backings_per_month() %.% 
  mutate(row = rownum()) %.%
  arrange(month) %.%
  filter(row <= 10)

For redshift, dplyr could know that rownum() doesn't exist and it should use windowed sum instead.

x %.% f(...) is a shortcut for f(x, ...) that makes it easier to read from left to right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment