Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@halhen
Last active October 25, 2023 01:20
Show Gist options
  • Save halhen/659780120accd82e043986c8b57deae0 to your computer and use it in GitHub Desktop.
Save halhen/659780120accd82e043986c8b57deae0 to your computer and use it in GitHub Desktop.
# data from http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat
# Originally seen at http://spatial.ly/2014/08/population-lines/
# So, this blew up on both Reddit and Twitter. Two bugs fixed (southern Spain was a mess,
# and some countries where missing -- measure twice, submit once, damnit), and two silly superflous lines removed after
# @hadleywickham pointed that out. Also, switched from geom_segment to geom_line.
# The result of the code below can be seen at http://imgur.com/ob8c8ph
library(tidyverse)
read_csv('../data/geostat-2011/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>%
rbind(read_csv('../data/geostat-2011/JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>%
mutate(TOT_P_CON_DT='')) %>%
mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100,
lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>%
filter(lng > 25, lng < 60) %>%
group_by(lat=round(lat, 1), lng=round(lng, 1)) %>%
summarize(value = sum(TOT_P, na.rm=TRUE)) %>%
ungroup() %>%
complete(lat, lng) %>%
ggplot(aes(lng, lat + 5*(value/max(value, na.rm=TRUE)))) +
geom_line(size=0.4, alpha=0.8, color='#5A3E37', aes(group=lat), na.rm=TRUE) +
ggthemes::theme_map() +
coord_equal(0.9)
ggsave('/tmp/europe.png', width=10, height=10)
@sj-porter-dev
Copy link

Love the simplicity and readability of this sample. Thank you for sharing.

@hrbrmstr
Copy link

@bbrewington
Copy link

Really clever and succinct. Nice.

@p0bs
Copy link

p0bs commented Apr 28, 2017

Very nice. Thanks.
BTW, I got curious about the details and so tweaked your code (by changing the rounding from 1 to 2 and by using geom_raster):

library(tidyverse)

read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>%
  rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>%
          mutate(TOT_P_CON_DT='')) %>%
  mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100,
         lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>%
  filter(lng > 25, lng < 60) %>%
  group_by(lat=round(lat, 2), lng=round(lng, 2)) %>%
  summarize(value = sum(TOT_P, na.rm=TRUE))  %>%
  ungroup() %>%
  complete(lat, lng) %>% 
  ggplot(aes(lng, lat)) + 
    geom_raster(aes(fill = log(value)), na.rm=TRUE) + 
    ggthemes::theme_map() + 
    scale_fill_gradientn(colours=c("#0000FFFF","#FFFFFFFF","#FF0000FF")) + 
    coord_equal(0.9)

ggsave('europe.png', width=10, height=10)

It produced this:
europe

Gotta love Iceland!

@jwhendy
Copy link

jwhendy commented May 5, 2017

Neat! I've been playing around with this tonight and it's quite interesting. Some questions:

  • it appears ggthemes is not loaded via tidyverse; one has to load it correct?

  • I've never seen such a continuous usage of %>% before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)?

  • I use group_by() quite a bit and have never passed it some var = fun(var) argument before. I take it you're grouping by rounded lat and lon to sort of "cluster" your summed populations (some set of lat/lon combination will have the same summed population since they were in the same group, but they retain their individual values for plotting)?

After investigating the data itself, I think it can be improved speed wise quite a bit since it's so big.

  • the only columns used are TOT_P and GRD_ID; subset those early and no need to set TOT_P_CON_DT to ""
  • if you play with the GRD_ID column, you'll find that there are no W values! There's no need for the ifelse() looking for west longitudes to make negative
  • give the above, I thought splitting based on a search or position once would be more optimal than gsub() repeated twice
## since I was fiddling quite a bit, I read things in once to avoid repeating
## drop all but POP_T and GRD_ID
master <- read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv')[, c(1, 2)] %>%
  rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv')[, c(1, 2)])

data <- master %>%
  separate(col = "GRD_ID",
           into = c("pre", "lat", "ew", "lon"),
           sep = c(4, 8, 9),
           convert = T) %>%
  select(-c(2, 4)) %>%
  filter(lon > 2500, lon < 6000) %>%
  mutate(lat = lat / 100,
         lon = lon / 100) %>%
  group_by(lat = round(lat, 1), lon = round(lon, 1)) %>%
  summarize(value = sum(TOT_P)) %>%
  ungroup() %>%
  complete(lat, lon)

I was trying to figure out a clever strsplit() method to strip of the 1kmN chunk, keep the digits, and split again on E, but couldn't figure out how.

Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse and keep not getting around to it. I learned the separate() function as a result of this. and at least more familiar with filter, ungroup, complete, and select, so thanks for posting the code an indirect motivation for me!


@p0bs: I don't think rounding to 2 decimals does anything. When as.numeric(POP_T)/100 is run, everything of the form xx.xx. Basically, you're not grouping but summarizing for every unique [raw] lat/lon combination. Or at least that's my interpretation.

@halhen
Copy link
Author

halhen commented May 12, 2017

@jwhendy

it appears ggthemes is not loaded via tidyverse; one has to load it correct?

IIRC, yes. Prefixing with ggthemes:: is probably in my fingers for a reason (which by all means may be that I simply started doing it)

I've never seen such a continuous usage of %>% before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)?

Oh, no, on the contrary. I write 10-30 line %>% flows for most of my analyses. Getting into the pipe way of thinking is super convenient. My hurdle was getting over the vectorization mindset (which is kinda' orthogonal to this anyways). From a naive standpoint %>% simply replaces intermediate variables, or nested functions. If you ever catch yourself doing either, %>% is quite likely a better choice.

I use group_by() quite a bit and have never passed it some var = fun(var) argument before. I take it you're grouping by rounded lat and lon to sort of "cluster" your summed populations (some set of lat/lon combination will have the same summed population since they were in the same group, but they retain their individual values for plotting)?

group_by(var = fun(x)) creates a new variable within the data frame named var with fun(x) as it's value. It's a convenience over mutate(var=fun(x)) %>% group_by(var)

Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse and keep not getting around to it. I learned the separate() function as a result of this. and at least more familiar with filter, ungroup, complete, and select, so thanks for posting the code an indirect motivation for me!

http://r4ds.had.co.nz/ . Buy it, read it, practice. Tidyverse is a gift from heaven.

@Meredith95
Copy link

May I ask the meaning of the strings, like '1kmN2689E4337'? Can these represent real geographic coordinates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment