How to make bar graphs using *ggplot2* in R
https://gist.github.com/97f5808569477fba885a98b493016f0f
We recently wrote about how IDinsight strives to use the right analytical and statistical tools (at the right time) to advise decision-makers and improve social impact. In that post, we highlighted the benefits of the statistical software R, which is especially useful to visually communicate complex ideas. This post aims to provide beginner practitioners with the tools to make a graphic using `ggplot2`, a package within R.
At the end of this post, we hope you will have a better understanding of the graphical design process from beginning (deciding the elements of your graph) to end (making the final graph look polished). Additionally, you will have code for a plot that you can easily modify for your future graphing needs.
There is a wealth of information on the philosophy of `ggplot2`, how to get started with `ggplot2`, and how to customize the smallest elements of a graphic using `ggplot2`, but it's all in different corners of the Internet. It can be difficult for a beginner to tie all this information together.
This post assumes basic familiarity with the following R concepts:
I also use the dplyr package to clean data. All code is commented so this should be straightforward to follow even if you have not used dplyr before.
We will be using the gapminder dataset, which comes with the gapminder R package. This dataset is an excerpt from the Gapminder data, and it shows the life expectancy, population and GDP per capita of various countries for 12 years between 1952 and 2007.
https://gist.github.com/e09831cc02742628bef340e7898269df
We would like to show the change in life expectancy from 1952 to 2007 for 11 (arbitrarily-selected) countries: Bolivia, China, Ethiopia, Guatemala, Haiti, India, Kenya, Pakistan, Sri Lanka, Tanzania, Uganda.
Specifically, we want to see the life expectancy in each of these countries in 1952 and 2007. We also want to group the countries by continent.
We will use a bar plot to communicate this information graphically because we can easily see the levels of the life expectancy variable, and compare values over time and across countries. Here is a rough sketch to get us started on what we can do: https://gist.github.com/63f5b11ffbb2dc2cc0c8a3c58e555094
Note that we want two bars per country - one of these should be the life expectancy in 1952 and the other in 2007. We also want to colour the bars differently based on the continent.
`ggplot2` is based on the "grammar of graphics", which provides a standard way to describe the components of a graph (the "gg" in `ggplot2` refers to the grammar of graphics). It has specialized terminology to refer to the elements of a graph, and I'll introduce and explain new terms as we encounter them. For now, what we need to understand is that we will build a graphic by adding components one after the other, like layers.
The first step to building the graphic is to identify the components. Using our rough sketch as a guide, we know that our components are:
- Dataset - for us, this is a subset of the gapminder data that includes only the countries and years in question
- Coordinate system - Cartesian
- Axes - we want country name on the x-axis and life expectancy on the y-axis
- Type of visualization - we want one bar per country per year e.g. for India, we want one bar for the life expectancy in 1952 and another bar for 2007
- Groups on the x-axis - we want to group countries by continent
Now that we know what we need to include in the graph, let's move on to writing code.
We need to install the following packages:
- `ggplot2`
- `dplyr`: to manipulate data
- `gapminder`: data source
We can use the following code to install and load packages.
https://gist.github.com/ecda88f6feb48c32cdcf68fb2ccb6ab1
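A minimal sketch of this step, which may differ from the gist above. The install line is commented out on the assumption that the packages only need to be installed once per machine.

```r
# Run once per machine; left commented out so re-running the script
# does not reinstall the packages every time.
# install.packages(c("ggplot2", "dplyr", "gapminder"))

# Load the packages at the start of every session
library(ggplot2)    # plotting
library(dplyr)      # data manipulation
library(gapminder)  # data source
```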
Let's have a look at the data again. It's saved under `gapminder`:
https://gist.github.com/a177a881006f71e060f668cce78b277a
Let's restrict the data to the countries and years we are interested in, and save this new dataset as `data_graph`.
https://gist.github.com/738b5e733c8446a2a6fe4c4eff931ac0
Let's also make "year" a factor, since it is a discrete variable: https://gist.github.com/2e9f39da425f25dd6fd08934f3c84274
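As a rough sketch, the two data-preparation steps above could be written as follows. The helper vector `countries` is a name chosen for this illustration; the gists may structure the code differently.

```r
library(dplyr)
library(gapminder)

# The 11 countries listed earlier in this post
# (`countries` is an illustrative name, not taken from the gists)
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")

data_graph <- gapminder %>%
  # keep only the countries and years in question
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  # treat year as a discrete variable
  mutate(year = factor(year))

# The order of these levels matters later, when we assign one
# transparency value per year
levels(data_graph$year)
```

With 11 countries and 2 years, `data_graph` should contain 22 rows.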
To build a ggplot, we first use the `ggplot()` function to specify the default data source and aesthetic mappings:
https://gist.github.com/d35633102a4240b5eb1f11d409cdabca
Let's break this down a little:
- data source: `data_graph` in our case
- aesthetic mappings: The `aes()` function maps variables in our data frame to aesthetic attributes. An "aesthetic attribute" is a visual element of the graph, such as the shape of a point or the colour of a line. In our case, we are specifying that the axes (which are aesthetic attributes) should correspond to the variables "country" and "lifeExp".
Note that there is no bar graph because we haven't specified one yet. We have just specified which dataset and axes to use, not the type of graphic to display.
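A minimal sketch of the base plot, assuming the object name `plot_base` (the name used in the gist may differ). The data preparation is repeated so the snippet runs on its own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Default data source and aesthetic mappings; no geom yet
plot_base <- ggplot(data = data_graph, aes(x = country, y = lifeExp))

# Printing the object draws the axes only, because no layer has been added
plot_base
```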
Let's make the graph look a bit nicer. My preference is to make the following adjustments:
- Simple, black-and-white layout
- No background colour
- No gridlines
- The chart area shouldn't be in a box; we should have only the x and y axis
We will use the `theme()` function to make these changes. `theme()` allows us to modify the display of non-data elements of the graph.
https://gist.github.com/2752ab193455285713962631dbe8ccd3
Note that we did not have to re-write the code to make the base plot or modify it in any way. Instead, we kept the base plot object as-is and "added" themes to it using the + operator. This is how we build a ggplot - we add components together to build a graphic.
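One way to get the four adjustments listed above is sketched below. The specific theme elements are our assumption about how to reach that look, and `plot_base` is an assumed name for the base plot; the gist may do this differently.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data and base plot so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base <- ggplot(data_graph, aes(x = country, y = lifeExp))

plot_base_clean <- plot_base +
  theme_bw() +                                   # black-and-white layout
  theme(
    panel.grid.major = element_blank(),          # no major gridlines
    panel.grid.minor = element_blank(),          # no minor gridlines
    panel.border = element_blank(),              # no box around the chart area
    axis.line = element_line(colour = "black")   # keep only the x and y axis lines
  )

plot_base_clean
```

The built-in `theme_classic()` gives a similar look in a single call, if you prefer a shortcut.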
In order to add bars to our ggplot, we need to understand geometric objects ("geoms"). A "geom" is a mark we add to the plot to represent data. For example, we can use the geom "point" to display our data using points, in which case the resulting graphic would be a scatterplot. The ggplot2 cheatsheet has a list of all the geoms we can add to a plot.
We will be adding bars to our graph using `geom_bar()`:
https://gist.github.com/784c63867da4848d393312cab52351af
We now have a bar graph. The numbers don't seem to be right since the life expectancy is very close to 100 for all countries - we will fix this later.
It may seem strange that we didn't specify the x and y values for the bars, but the bars displayed life expectancy by country anyway. This is because of ggplot's "hierarchy of defaults". Since we add the call to `geom_bar()` to an existing call to `ggplot(data = data_graph, aes(x = country, y = lifeExp))`, `ggplot2` assumes that the x and y variables for `geom_bar()` are the same as those for `ggplot()`, i.e. the x and y variables are "country" and "lifeExp", respectively.
We also specified `stat` in the call to `geom_bar()`. `stat` is used when we want to apply a statistical function to the data and show the results graphically. By default, `geom_bar()` uses `stat = "count"`, so each bar shows the number of observations for each x value. Since we want ggplot to plot the values as-is, we specify `stat = "identity"`.
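A minimal sketch of adding the bars. Here `theme_classic()` is used as a shorthand stand-in for the cleaned-up theme, which is an assumption made to keep the snippet short.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

# Stand-in for the cleaned-up base plot
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# geom_bar() inherits data, x and y from ggplot();
# stat = "identity" plots the lifeExp values as they are
p_bars <- plot_base_clean + geom_bar(stat = "identity")
p_bars
```

`geom_col()` is equivalent to `geom_bar(stat = "identity")`, if you prefer the shorter form.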
Now, let's change the colour of the bars. We ultimately want the colour of the bars to vary by continent, but let's start with something simpler - let's change the colour of the bars to light blue. To do this, we will specify `fill = "lightblue"` inside the call to `geom_bar()`.
https://gist.github.com/f083a500c08f7a0c63a3181c2349e605
Now, let's make the colour of the bars vary by continent. We are saying that we want a mapping from an aesthetic element (the colour inside the bars) to a variable in our data ("continent"). Recall that we use the `aes()` function to specify a relationship between a visual element and a variable. Within `aes()`, we will use the `fill` argument to specify that we are interested in changing the colour of the bars.
https://gist.github.com/fb25b698f0c1b35187aaf570b1d24da8
Note that we used `fill` in both cases, because `fill` is what controls the colour inside the bars. However, we did not use `aes()` when we coloured the bars light blue because the colour inside the bars wasn't related to any variables.
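The contrast between a fixed colour and a mapped colour can be sketched side by side. The object names `p_blue` and `p_continent` are illustrative, and `theme_classic()` again stands in for the cleaned-up theme.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# Fixed colour: fill sits outside aes() because it does not depend on the data
p_blue <- plot_base_clean +
  geom_bar(stat = "identity", fill = "lightblue")

# Mapped colour: fill sits inside aes() because it depends on "continent"
p_continent <- plot_base_clean +
  geom_bar(stat = "identity", aes(fill = continent))

p_blue
p_continent
```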
Now, we will address why we aren't seeing the correct values of life expectancy in the graph. Since each country has two observations for life expectancy (one for 1952 and one for 2007), and we haven't specified which observation to use, the life expectancy shown by the bars is actually the sum of life expectancy for both years.
Let's see what happens when we restrict the graph to include only data for 2007.
https://gist.github.com/301d34df65f93e725f6440435fcfa7d1
We now see the correct values of life expectancy. Note that though the `plot_base_clean` object already had a default value of `data` (`data_graph`), we were able to override it in the call to `geom_bar()`. This again ties back to the hierarchy of defaults - if we don't specify a new dataset or xy-variables for our geoms, we simply use the dataset and xy-variables provided in the call to `ggplot()`, but since we specified a new value of `data` within `geom_bar()`, the bars reflect a new data source.
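A minimal sketch of overriding the default data inside the geom, under the same assumptions as before (`theme_classic()` as a stand-in theme, `p_2007` as an illustrative name):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# The data argument of geom_bar() replaces the default data for this layer only;
# x and y are still inherited from ggplot()
p_2007 <- plot_base_clean +
  geom_bar(
    data = filter(data_graph, year == "2007"),
    stat = "identity",
    aes(fill = continent)
  )
p_2007
```

Each country now contributes a single row to the layer, so the bar heights equal the 2007 values rather than a sum.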
Next, we are interested in showing two data points per country, one for 1952 and one for 2007. Here is where the `alpha` aesthetic is useful. It specifies the transparency of the colours we are using. Let's try using `alpha` with the same subsetted dataset:
https://gist.github.com/2af38aa5b40a52920df88f93aa564a63
We see that, similar to specifying `fill = "lightblue"`, specifying `alpha` to be a number changes the transparency level of every bar. `alpha` values range from 0 to 1, with higher values being more opaque.
Like `fill`, `alpha` can also be used as an aesthetic. Let's establish a relationship between the transparency of a bar and the year. Since we are interested in both years, we won't restrict `data_graph` in `geom_bar()`.
https://gist.github.com/1a6c0e88f538f516393342b6228d2c09
We don't want a stacked bar chart, but `alpha` does seem to be working - we see that the lighter portions of the bars correspond to the values in 1952, while the darker portions correspond to values in 2007.
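Both uses of `alpha` can be sketched as follows. The fixed value of 0.5 is an arbitrary choice for illustration, and the object names are ours.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# alpha as a fixed number: every bar gets the same transparency
# (0.5 is an arbitrary value for this sketch)
p_alpha_fixed <- plot_base_clean +
  geom_bar(data = filter(data_graph, year == "2007"),
           stat = "identity", aes(fill = continent), alpha = 0.5)

# alpha as an aesthetic: transparency depends on year.
# ggplot2 may warn that alpha is not advised for a discrete variable.
p_alpha_year <- plot_base_clean +
  geom_bar(stat = "identity", aes(fill = continent, alpha = year))

p_alpha_fixed
p_alpha_year
```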
Now, let's use the `position` argument to make the bars appear side-by-side, instead of being stacked. According to the ggplot2 documentation, bars are stacked by default and we need to specify `position = "dodge"` to make the bars appear side-by-side.
https://gist.github.com/6ccb1d4fabee615635e60b9d7f9fd4c1
Note that `position = "dodge"` is another way of writing `position = position_dodge()`. `position_dodge()` can take a width argument, which is discussed in detail in this Stack Overflow post. We are using the default width, which is why we can use the shorter version `position = "dodge"`.
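A minimal sketch of the dodged version, with the same stand-in theme as an assumption:

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# position = "dodge" places the 1952 and 2007 bars next to each other;
# position = position_dodge() would do the same with the default width
p_dodge <- plot_base_clean +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year))
p_dodge
```

Every bar now starts at zero instead of sitting on top of another bar.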
The 1952 colours for `alpha` are very light. Let's modify the transparency provided by `alpha` using `scale_alpha_manual()`.
https://gist.github.com/0fb0f36cfbfbc4fc063c7adc8c8c5c32
Here, we specified a vector for `scale_alpha_manual()`, where each element provides the transparency of the corresponding year. We assigned an alpha of 0.6 to 1952 and 1 to 2007. We know the first element corresponds to 1952 and the second element to 2007 because that is the order of levels for the "year" factor; you can check this using `levels(data_graph$year)`.
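Sketched under the same assumptions as the earlier snippets, the manual alpha scale might look like this:

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

levels(data_graph$year)  # confirms the order that the alpha values follow

p_alpha <- plot_base_clean +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  # first value applies to the first level (1952), second to 2007
  scale_alpha_manual(values = c(0.6, 1))
p_alpha
```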
Let's also change the colour scheme for the continent colours using `scale_fill_manual()`. We provide a vector of colours, where each element provides the colour for the corresponding continent. I have provided the colours in hexadecimal format (e.g. as "#FF0011"), but you can provide colours in any other format you prefer.
https://gist.github.com/dde3473a791b26249de6b1d22ca63e2a
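A sketch of a manual fill scale follows. The three hex codes are arbitrary placeholders, not the colours used in the gist, and the vector is named by continent (our choice) so that the matching does not depend on the order of the elements.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))
plot_base_clean <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic()

# Placeholder colours chosen for this sketch only
continent_colours <- c(Africa = "#1B9E77", Americas = "#D95F02", Asia = "#7570B3")

p_fill <- plot_base_clean +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  scale_fill_manual(values = continent_colours)
p_fill
```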
Let's turn our plot into a horizontal bar chart using `coord_flip()`:
https://gist.github.com/edc6a1f6dcd727bb22dd88337cc530c1
Note the order of the bars still reflects the levels of the factor i.e. countries coming first alphabetically are closer to the origin, and the bar for 1952 is below the bar for 2007. We are going to go ahead with this order, but if you'd like the countries or years to appear in a different order, all you have to do is modify the factor levels of the corresponding variables.
Our graph is already quite informative - we can identify the continent a country belongs to by the colour of the bar. If we want the country bars to appear by continent, we can change the levels of the "country" factor so that the country names are sorted by continent.
However, it would be much more effective if we could group the countries into continents on the x-axis. The reader of the graph wouldn't need to keep referring to the legend; all the information would be in one place. We can create these groups using facets.
Facets are used to split the ggplot into a matrix of panels. Let's add a facet for the "continent" variable to understand what "matrix of panels" means: https://gist.github.com/faf162df4d2ec579ac9f99d69917d905
We see that our graph is now in 3 horizontal panels, with each panel representing a different continent.
Let's break the `facet_grid()` command down a little: we wanted horizontal panels, so we specified the `rows` argument. Each row/panel was based on continent, so we specified `rows = vars(continent)`. `vars()` just indicates that "continent" is a variable in the dataset we are using in our `ggplot()` command. If we don't wrap it in `vars()`, we will get an error saying that the object "continent" was not found.
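A minimal sketch of the faceted plot, again using `theme_classic()` as a stand-in for the cleaned-up theme:

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

p_facet <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic() +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  # one row of panels per continent; vars() looks up "continent" in the data
  facet_grid(rows = vars(continent))
p_facet
```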
Now, we will explore some arguments of `facet_grid()` that can improve the appearance of the graph. All of these are covered in detail in the ggplot2 documentation; in this post, we will use only a few options.
First, we see that the graph is assuming that every x-variable ("country", in our case) exists for every faceting variable ("continent"), e.g. Haiti appears in the Africa and Asia panels as well as the Americas panel. This is because `ggplot2` assumes every panel will have the same scale, where "scale" refers to the values the x and y axes take on. Our scale of interest is country names, and currently each continent has exactly the same scale - all of the country names are included for each continent. To remedy this, we specify `scales = "free_y"` - we say that every faceting variable ("continent") can have its own scale (where a "scale" would be only those country names that are part of the continent).
https://gist.github.com/725fb31465b3dcebc1c96520410704b0
Now, notice that the bars for the Americas are thicker than the bars for Africa or Asia. This is because by default, ggplot makes all panels (i.e. all continents) occupy the same amount of space. We'd prefer that all our bars be equally thick, rather than our panels be equally tall. Let's add `space = "free_y"`.
https://gist.github.com/921277f779890b898926a9d55fefddb4
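The two facet options can be sketched together as below. This assumes a reasonably recent version of `ggplot2` in which free scales work alongside `coord_flip()`.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

p_free <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic() +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(
    rows = vars(continent),
    scales = "free_y",  # each panel lists only its own countries
    space = "free_y"    # panel height follows the number of countries
  )
p_free
```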
It seems a little confusing to have the continent names to the right and the country names to the left. We can use the `switch` option to change where the facet labels (i.e. continent names) are displayed.
https://gist.github.com/42d9fa9ebb9906b79b9e9000cbe4c6c2
This looks quite good! Let's do the following to modify the appearance of the facet labels i.e. the continent names:
- Move the continent names to the left of country names
- Remove the gray background and box from the continent labels
- Make the continent names horizontal and not vertical https://gist.github.com/3b29cb18dc70e1fd79ebdd56f20d9138
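One possible way to move and restyle the facet labels is sketched below. The theme elements chosen here (`strip.placement`, `strip.background`, `strip.text.y.left`) are our assumption about how to achieve the three changes; the gist may use different settings.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

p_strips <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic() +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y", space = "free_y",
             switch = "y") +                         # facet labels on the left side
  theme(
    strip.placement = "outside",                     # labels to the left of country names
    strip.background = element_blank(),              # no gray background or box
    strip.text.y.left = element_text(angle = 0)      # horizontal label text
  )
p_strips
```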
Our graph is almost ready! Let's clean up the legend and the axes, and give a title to our graph.
To reduce chartjunk, let's suppress the legend for continent because we already have that information in the facets. We will use the `guides()` function to suppress the legend for the `fill` aesthetic (recall that we set `aes(fill = continent)` in `geom_bar()`).
https://gist.github.com/cd1d5d796105f08becc0883ce40c2dc8
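A minimal sketch of hiding one legend while keeping the other. It uses `guides(fill = "none")`, the form accepted by current versions of `ggplot2`; the gist may be written differently.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

p_legend <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic() +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y", space = "free_y",
             switch = "y") +
  # hide the continent legend; the year (alpha) legend stays
  guides(fill = "none")
p_legend
```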
DataNovia has an excellent guide for formatting ggplot legends, if you'd like to modify the legend further e.g. change its position, manually change legend colours, etc.
Finally, let's use the `labs()` function to change the labels for this graph. We want to:
- Remove the x-axis label - we don't need to say "country" since it is apparent
- Change the y-axis label to "Life expectancy (years)"
- Add a title above the graph explaining what the graph shows
- Add the data source below the graph. This is a good location for technical notes https://gist.github.com/d4420f6c5c623a41b64d323c6e75e8e9
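The four label changes can be sketched with a single `labs()` call. The title and caption wording below are placeholders written for this example, not the text used in the gist.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Rebuild the data so this snippet runs on its own
countries <- c("Bolivia", "China", "Ethiopia", "Guatemala", "Haiti", "India",
               "Kenya", "Pakistan", "Sri Lanka", "Tanzania", "Uganda")
data_graph <- gapminder %>%
  filter(country %in% countries, year %in% c(1952, 2007)) %>%
  mutate(year = factor(year))

p_final <- ggplot(data_graph, aes(x = country, y = lifeExp)) +
  theme_classic() +
  geom_bar(stat = "identity", position = "dodge",
           aes(fill = continent, alpha = year)) +
  scale_alpha_manual(values = c(0.6, 1)) +
  coord_flip() +
  facet_grid(rows = vars(continent), scales = "free_y", space = "free_y",
             switch = "y") +
  guides(fill = "none") +
  labs(
    x = NULL,                                   # drop the "country" label
    y = "Life expectancy (years)",
    title = "Life expectancy in 1952 and 2007", # placeholder title
    caption = "Source: gapminder R package"     # placeholder source note
  )
p_final
```

Note that `x` and `y` in `labs()` refer to the aesthetics, not to the position of the axes after `coord_flip()`, so `x = NULL` still removes the "country" label.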
And that is our graph!
Here is all the graph code in one place: https://gist.github.com/40ec3147f1b1fe0c2d478e247a2a018a
You can save a copy of the graph using the `ggsave()` command, which allows you to specify the save location, dimensions of the file, image format (.png, .jpg etc.), and more.
Now that we understand how to build a ggplot, let's map the elements of our graph to the components of a plot:
- A default dataset and set of mappings from variables to aesthetics - we did this in `ggplot(data = data_graph, aes(x = country, y = lifeExp))`.
- One or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings - we created a layer for bars using `geom_bar()`, `stat = "identity"` and `position = "dodge"`.
- One scale for each aesthetic mapping used - the x and y axes had default scales based on the values of "country" and "lifeExp". We also created scales for `fill` and `alpha`.
- A coordinate system - Cartesian, in our case, as we specified aesthetics for `x` and `y`. We also flipped the axes.
- The facet specification - we did this using `facet_grid()`.
The graph components are succinctly expressed in this code template:
https://gist.github.com/400a5e5e34a572435570e3d0003675b1
You can make the following graphs to learn more about `ggplot()`:
- Change the font and font size for the chart title, facet labels, and axis labels (you'll need to use the `theme()` function)
- Modify the existing graph to show the value of life expectancy for each bar (you'll need to add a `geom_text()`)
- Create some dummy data with confidence intervals for estimates of life expectancy, and show these confidence intervals on our existing graph (you'll need to use `geom_errorbar()`)
- Create a line graph showing the value of life expectancy over several years for different countries (you'll need to use `geom_line()` and take a new subset of the data)
- You can have a look at the ggplot2 cheatsheet to get more ideas for what you can do!
We would love to know if this worked for you. Write to us with questions or share your graphs with us in the comments below.