johnburnmurdoch/custom_unit_histograms_in_R.R

## custom_unit_histograms_in_R.R
if (!requireNamespace("needs", quietly = TRUE)) install.packages("needs") # install only if it is missing
library(needs)
needs(tidyverse, magrittr)

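# I’m using an example dataset of net national income (NNI) per capita, from the World Bank (https://data.worldbank.org/indicator/NY.ADJ.NNTY.PC.CD?most_recent_value_desc=true)
# `dataset` is assumed to be loaded already as a data frame with (at least) a numeric `value` column and a `region` column, as used throughout below
head(dataset)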

# Here’s a basic histogram using the ggplot defaults to show the distribution of NNI per capita across the world:
ggplot(dataset, aes(value)) +
  geom_histogram()

# The problem with the above is that it has just used the default of 30 bins, so the bin edges fall at arbitrary values. In this case, for example, the lowest bin contains all countries whose NNI per capita is under $1,105.082.

# One solution is just to specify a bin width. I’m choosing $1,000 here:
ggplot(dataset, aes(value)) +
  geom_histogram(binwidth = 1000)

# Now we have nice $1,000-wide bins, but they’re centred on multiples of $1,000, i.e. the left-most bin is centred on zero, extending from -500 to 500. This doesn’t really make sense, since no country has an NNI per capita below zero. If each bin is $1,000 wide, it would make more sense for the left-most bin to contain countries with values between 0 and 1,000.

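# One way of doing that is to set `boundary = 0`, so that the bars/bins *start and end at* multiples of 1,000 instead of being *centred* on multiples of 1,000.
# Note that `boundary` expects a number: a logical such as `T` is coerced to 1, which puts the edges at 1, 1,001, 2,001 and so on.
ggplot(dataset, aes(value)) +
  geom_histogram(binwidth = 1000, boundary = 0)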
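
# A hedged sketch for checking where bin edges actually fall, using a few made-up values (not the World Bank data).
# `toy_values`, `edges_centred` and `edges_boundary` are hypothetical names used only for this illustration.
# `layer_data()` is ggplot2's helper for reading back the data a layer computed, including each bin's `xmin` and `xmax`.
library(ggplot2) # already attached if the tidyverse was loaded above
toy_values <- data.frame(value = c(250, 800, 1200, 1900, 2600))
# Bin edges when only `binwidth` is given (bins centred on multiples of 1,000)
edges_centred <- layer_data(
  ggplot(toy_values, aes(value)) +
    geom_histogram(binwidth = 1000)
)[, c("xmin", "xmax", "count")]
# Bin edges when a bin boundary is pinned at zero
edges_boundary <- layer_data(
  ggplot(toy_values, aes(value)) +
    geom_histogram(binwidth = 1000, boundary = 0)
)[, c("xmin", "xmax", "count")]
# Print both to compare the left-most edge of each layout
edges_centred
edges_boundary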

# Now we’re getting there.

# Let’s colour them by region, using a nice colour-brewer palette
ggplot(dataset, aes(value)) +
  geom_histogram(binwidth = 1000, boundary = 0, aes(fill = region)) +
  scale_fill_brewer(palette = "Set2")

# But what if we want to point to individual blocks? Let’s try adding a white stroke around the blocks using the `colour` argument:
ggplot(dataset, aes(value)) +
  geom_histogram(binwidth = 1000, boundary = 0, aes(fill = region), colour = "white") +
  scale_fill_brewer(palette = "Set2")

# You can see the problem: each block is a large region/bin combo, bucketing all of the individual data points inside it. We can’t easily pinpoint any individual country, especially where they’re all piled up in big chunks on the left.

# So instead, let’s get artisanal and transform the data ourselves for a nice block-by-block layout.

# Step one, let’s set a bin-width that we want to use. We’ll keep this at 1,000:
bin_width <- 1000
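
# As a quick check of the rounding-down arithmetic that the binning relies on, here is a small sketch on made-up numbers (not World Bank data).
# The `example_*` names are hypothetical and exist only for this illustration.
example_bin_width <- 1000
example_values <- c(250, 999, 1000, 1105.082, 2600)
# Round each value down to the nearest multiple of the bin width: this is the bin's left edge
example_bin_min <- floor(example_values / example_bin_width) * example_bin_width
# The right edge is one bin width further along
example_bin_max <- example_bin_min + example_bin_width
# Side-by-side view of each value and the bin it lands in.
# A value sitting exactly on an edge (1,000 here) goes into the bin that starts at that edge.
data.frame(value = example_values, bin_min = example_bin_min, bin_max = example_bin_max)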

# Now let’s pipe our data and generate our plot:
dataset %>%
  mutate(
    # First, `bin_min`, the left-hand side of each rectangle. We calculate it by dividing the true data value by our `bin_width`, rounding that down to the nearest whole number (this is what `floor()` does) and then multiplying the rounded result by the original `bin_width`.
    # In plain English, we’re rounding every value down to the nearest multiple of `bin_width`, i.e. down to the nearest 1,000 in this case.
    bin_min = floor(value / bin_width) * bin_width,
    # Every country now has a `bin_min` associated with it. Calculating the `bin_max` (the right-hand side of each rectangle) is easy: we just add our `bin_width` of 1,000 back to the `bin_min`:
    bin_max = bin_min + bin_width
    # Now each country has the left- and right-hand side of its block locked in
  ) %>%
  # Next we group by our newly created `bin_min`:
  group_by(bin_min) %>%
  # And then we arrange our data within each bin by region, so countries will appear in neat coloured region blocks, rather than a messy mix of un-ordered colours:
  arrange(region) %>%
  # Now we can calculate the rest of our positional values: the bottom and top of each block
  mutate(
    # `block_min` is simply a sequence of numbers starting at zero within each bin, and then increasing by one for each country inside that bin. Because we arranged by region, this will keep countries in the same region on top of one another
    block_min = row_number() - 1,
    # For `block_max` we simply add one to block_min in every case
    block_max = block_min + 1
  ) %>%
  # All data has now been transformed, so we can pass everything through ggplot() to create the graphic:
  # We start by mapping one of our horizontal measures (`bin_min` or `bin_max`) to the x axis, and one of our vertical measures (`block_min` or `block_max`) to the y axis.
  ggplot(aes(x = bin_min, y = block_max)) +
  # Then we draw a rectangle (`geom_rect`) for each country block. We specify the rectangle’s left edge using xmin, right edge using xmax, base using ymin and top using ymax. We pass our pre-calculated measures in each case.
  geom_rect(aes(
    xmin = bin_min,
    xmax = bin_max,
    ymin = block_min,
    ymax = block_max,
    # Finally we add `fill = region` to get our region colours
    fill = region
    ), # And white borders for each block, just to show that we do indeed have a block-by-block layout
    colour = "white") +
  # And again we’re using the nice "Set2" palette from colour brewer
  scale_fill_brewer(palette = "Set2")
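
# To see what the positional columns look like before they reach ggplot, here is a minimal sketch of the same transformation on a tiny made-up table.
# The countries "A" to "D", their values and the `toy_blocks` name are all hypothetical, not taken from the World Bank data.
library(dplyr) # already attached if the tidyverse was loaded above
toy_blocks <- tibble(
  country = c("A", "B", "C", "D"),
  region = c("South Asia", "Europe & Central Asia", "South Asia", "Europe & Central Asia"),
  value = c(300, 700, 900, 1500)
) %>%
  # Left edge of each country's bin, using a fixed width of 1,000 for the illustration
  mutate(bin_min = floor(value / 1000) * 1000) %>%
  group_by(bin_min) %>%
  arrange(region) %>%
  # Stack the countries within each bin: 0, 1, 2, ... from the bottom up
  mutate(
    block_min = row_number() - 1,
    block_max = block_min + 1
  ) %>%
  ungroup()
# Three countries share the 0 to 1,000 bin, so they get three stacked blocks; "D" sits alone in its bin
toy_blocks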

# Finally, here’s how to put the region colours in a specific order.
# In this case I’m putting them in a custom order by approximate level of wealth, so Europe comes first, then North America, then Latin America etc.
dataset %>%
  mutate(
    # We do this by telling R and ggplot to understand our `region` variable as a factor, rather than just plain text strings.
    # Then we specify the `levels` in the order that we want them to appear in our plots and our colour scale.
    # And finally we use `ordered = TRUE` to mark the factor as ordered. The order of `levels` is what controls sorting and the legend order.
    # Everything else remains exactly as above
    region = factor(
      region,
      levels = c(
        "Europe & Central Asia", "North America", "Latin America & Caribbean",
        "Middle East & North Africa", "East Asia & Pacific", "South Asia",
        "Sub-Saharan Africa"
      ),
      ordered = TRUE
    ),
    bin_min = floor(value / bin_width) * bin_width,
    bin_max = bin_min + bin_width
  ) %>%
  group_by(bin_min) %>%
  arrange(region) %>%
  mutate(
    block_min = row_number() - 1,
    block_max = block_min + 1
  ) %>%
  ggplot(aes(x = bin_min, y = block_max)) +
  geom_rect(aes(
    xmin = bin_min,
    xmax = bin_max,
    ymin = block_min,
    ymax = block_max,
    fill = region
  ),
  colour = "white") +
  scale_fill_brewer(palette = "Set2")
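
# A small base-R sketch of why the factor levels matter, using three of the region names from above.
# `example_regions`, `alphabetical` and `custom_order` are hypothetical names used only for this illustration.
example_regions <- c("East Asia & Pacific", "North America", "Europe & Central Asia")
# Without explicit levels, R sorts the levels alphabetically
alphabetical <- factor(example_regions)
# With explicit levels, sorting (and therefore `arrange()` and the legend) follows the order we give
custom_order <- factor(
  example_regions,
  levels = c("Europe & Central Asia", "North America", "East Asia & Pacific")
)
levels(alphabetical)
as.character(sort(custom_order))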