-
-
Save halhen/659780120accd82e043986c8b57deae0 to your computer and use it in GitHub Desktop.
# data from http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat | |
# Originally seen at http://spatial.ly/2014/08/population-lines/ | |
# So, this blew up on both Reddit and Twitter. Two bugs fixed (southern Spain was a mess, | |
# and some countries where missing -- measure twice, submit once, damnit), and two silly superflous lines removed after | |
# @hadleywickham pointed that out. Also, switched from geom_segment to geom_line. | |
# The result of the code below can be seen at http://imgur.com/ob8c8ph | |
library(tidyverse) | |
read_csv('../data/geostat-2011/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>% | |
rbind(read_csv('../data/geostat-2011/JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>% | |
mutate(TOT_P_CON_DT='')) %>% | |
mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100, | |
lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>% | |
filter(lng > 25, lng < 60) %>% | |
group_by(lat=round(lat, 1), lng=round(lng, 1)) %>% | |
summarize(value = sum(TOT_P, na.rm=TRUE)) %>% | |
ungroup() %>% | |
complete(lat, lng) %>% | |
ggplot(aes(lng, lat + 5*(value/max(value, na.rm=TRUE)))) + | |
geom_line(size=0.4, alpha=0.8, color='#5A3E37', aes(group=lat), na.rm=TRUE) + | |
ggthemes::theme_map() + | |
coord_equal(0.9) | |
ggsave('/tmp/europe.png', width=10, height=10) |
Very nice. Thanks.
BTW, I got curious about the details and so tweaked your code (by changing the rounding from 1 to 2 and by using geom_raster):
library(tidyverse)
read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>%
rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>%
mutate(TOT_P_CON_DT='')) %>%
mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100,
lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>%
filter(lng > 25, lng < 60) %>%
group_by(lat=round(lat, 2), lng=round(lng, 2)) %>%
summarize(value = sum(TOT_P, na.rm=TRUE)) %>%
ungroup() %>%
complete(lat, lng) %>%
ggplot(aes(lng, lat)) +
geom_raster(aes(fill = log(value)), na.rm=TRUE) +
ggthemes::theme_map() +
scale_fill_gradientn(colours=c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
coord_equal(0.9)
ggsave('europe.png', width=10, height=10)
Gotta love Iceland!
Neat! I've been playing around with this tonight and it's quite interesting. Some questions:
-
it appears
ggthemes
is not loaded viatidyverse
; one has to load it correct? -
I've never seen such a continuous usage of
%>%
before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)? -
I use
group_by()
quite a bit and have never passed it somevar = fun(var)
argument before. I take it you're grouping by roundedlat
andlon
to sort of "cluster" your summed populations (some set oflat/lon
combination will have the same summed population since they were in the same group, but they retain their individual values for plotting)?
After investigating the data itself, I think it can be improved speed wise quite a bit since it's so big.
- the only columns used are
TOT_P
andGRD_ID
; subset those early and no need to setTOT_P_CON_DT
to""
- if you play with the
GRD_ID
column, you'll find that there are noW
values! There's no need for theifelse()
looking for west longitudes to make negative - give the above, I thought splitting based on a search or position once would be more optimal than
gsub()
repeated twice
## since I was fiddling quite a bit, I read things in once to avoid repeating
## drop all but POP_T and GRD_ID
master <- read_csv('GEOSTAT_grid_POP_1K_2011_V2_0_1.csv')[, c(1, 2)] %>%
rbind(read_csv('JRC-GHSL_AIT-grid-POP_1K_2011.csv')[, c(1, 2)])
data <- master %>%
separate(col = "GRD_ID",
into = c("pre", "lat", "ew", "lon"),
sep = c(4, 8, 9),
convert = T) %>%
select(-c(2, 4)) %>%
filter(lon > 2500, lon < 6000) %>%
mutate(lat = lat / 100,
lon = lon / 100) %>%
group_by(lat = round(lat, 1), lon = round(lon, 1)) %>%
summarize(value = sum(TOT_P)) %>%
ungroup() %>%
complete(lat, lon)
I was trying to figure out a clever strsplit()
method to strip of the 1kmN
chunk, keep the digits, and split again on E
, but couldn't figure out how.
Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse
and keep not getting around to it. I learned the separate()
function as a result of this. and at least more familiar with filter
, ungroup
, complete
, and select
, so thanks for posting the code an indirect motivation for me!
@p0bs: I don't think rounding to 2 decimals does anything. When as.numeric(POP_T)/100
is run, everything of the form xx.xx
. Basically, you're not grouping but summarizing for every unique [raw] lat
/lon
combination. Or at least that's my interpretation.
it appears ggthemes is not loaded via tidyverse; one has to load it correct?
IIRC, yes. Prefixing with ggthemes::
is probably in my fingers for a reason (which by all means may be that I simply started doing it)
I've never seen such a continuous usage of %>% before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)?
Oh, no, on the contrary. I write 10-30 line %>%
flows for most of my analyses. Getting into the pipe way of thinking is super convenient. My hurdle was getting over the vectorization mindset (which is kinda' orthogonal to this anyways). From a naive standpoint %>%
simply replaces intermediate variables, or nested functions. If you ever catch yourself doing either, %>%
is quite likely a better choice.
I use group_by() quite a bit and have never passed it some var = fun(var) argument before. I take it you're grouping by rounded lat and lon to sort of "cluster" your summed populations (some set of lat/lon combination will have the same summed population since they were in the same group, but they retain their individual values for plotting)?
group_by(var = fun(x))
creates a new variable within the data frame named var
with fun(x)
as it's value. It's a convenience over mutate(var=fun(x)) %>% group_by(var)
Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse and keep not getting around to it. I learned the separate() function as a result of this. and at least more familiar with filter, ungroup, complete, and select, so thanks for posting the code an indirect motivation for me!
http://r4ds.had.co.nz/ . Buy it, read it, practice. Tidyverse is a gift from heaven.
May I ask the meaning of the strings, like '1kmN2689E4337'? Can these represent real geographic coordinates?
Really clever and succinct. Nice.