cokelly/australias-legal-profession-and-the-gender-income-gap-and-learning-slopegraphs.rmd

## australias-legal-profession-and-the-gender-income-gap-and-learning-slopegraphs.rmd
---
title: "Australia's legal profession and the gender income gap (and learning slopegraphs)"
author: Ciaran
date: '2018-04-24'
slug: australias-legal-profession-and-the-gender-income-gap-and-learning-slopegraphs
categories:
  - rstats
tags:
  - tidy_tuesday
  - inequality
header:
  caption: ''
  image: ''
---

The [R for Data Science](https://www.jessemaegan.com/post/r4ds-the-next-iteration/) community has been running a '[Tidy Tuesday](https://github.com/rfordatascience/tidytuesday)' project for a few weeks. In essence they link to a data-driven paper and a somewhat tidy version of the paper's underlying dataset. The challenge is to develop some visualisations etc from the data, all within the [R for Data Science](http://r4ds.had.co.nz/) approach to working with R.

This week's challenge is drawn from [an article on Australia's pay gap](http://www.womensagenda.com.au/latest/eds-blog/australia-s-50-highest-paying-jobs-are-paying-men-significantly-more/). The article's data is sourced [here](https://data.gov.au/dataset/taxation-statistics-2013-14/resource/c506c052-be2f-4fba-8a65-90f9e60f7775?inner_span=True).

```{r setup, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(cache = TRUE,
                      echo = FALSE)

library(tidyverse)
library(cowplot)
library(ggrepel)
library(kableExtra)

suppressWarnings(salaries <- read_csv("../../static/files/data/week4_australian_salary.csv"))
```

```{r compare_salaries}
# Split the dataset in two (there are probably group_by or tidyr::spread options but I find this easier)
male_top_jobs <- salaries %>%
      filter(gender == "Male") %>% # Filter by gender
      arrange(desc(average_taxable_income)) %>%
      select(occupation, `Male Average Taxable Income` = average_taxable_income)

female_top_jobs <- salaries %>%
      filter(gender == "Female") %>% # Filter by gender
      arrange(desc(average_taxable_income)) %>%
      select(occupation, `Female Average Taxable Income` = average_taxable_income)

# Prepare for plotting
male_compared_jobs <- male_top_jobs %>%
      left_join(., female_top_jobs, by = "occupation") %>% # join the two tables again
      arrange(desc(`Male Average Taxable Income`)) %>% # Arrange by male income
      slice(1:100) %>% # Isolate top 100 by male income
      mutate(occupation = str_replace_all(occupation, "\uFFFD", "-")) %>% # Tidy the text hyphen artefact
      add_column(label_male = paste(.$occupation, paste("AUS$", prettyNum(.$`Male Average Taxable Income`, big.mark = ","), sep = ""), sep = " ")) %>% # label for male jobs
      add_column(label_female00 = round((.$`Female Average Taxable Income`/.$`Male Average Taxable Income`)*100, 0)) %>% # Get female wage as % of male wage
      mutate(label_female0 = paste(label_female00, "%", sep = "")) %>%
      mutate(label_female = paste(.$occupation, paste("AUS$", prettyNum(.$`Female Average Taxable Income`, big.mark = ","), " (", label_female0, ")", sep = ""), sep = "\n")) %>% # Assemble female label
      mutate(label_male = case_when(label_female00 <= 38 ~ label_male, TRUE ~ "")) %>% # Show the male label only where the female wage is 38% or less of the male wage
      mutate(label_female = case_when(label_female00 >= 90 ~ label_female, TRUE ~ "")) %>% # Show the female label only where the female wage is 90% or more of the male wage
      mutate(line_colour = case_when(label_female00 <= 38 ~ "blue", label_female00 >= 90 ~ "green", TRUE ~ "gray")) %>% # Differentiate the highlighted occupations by line colour
      mutate(line_transparency = case_when(label_female00 <= 38 | label_female00 >= 90 ~ 0.9, TRUE ~ 0.4)) # Render the uninteresting lines more transparent (alpha must be numeric, not character)

```

```{r plot, message=FALSE}

theend <- 10 # for an xend in the ggplot

male_compared_plot <- ggplot(male_compared_jobs) +
      geom_segment(aes(x=0, y = `Male Average Taxable Income`, xend=theend, yend = `Female Average Taxable Income`), alpha = male_compared_jobs$line_transparency, colour = male_compared_jobs$line_colour) +
      theme(axis.ticks = element_blank(),
            axis.text.x = element_blank(),
            axis.text.y = element_blank()) +
      theme_void() + # Probably replicating the three lines above to an extent
      xlab("") + # Clean up labels
      ylab("") +
      annotate("text", label = "Male", x = 0, y = max(male_compared_jobs$`Male Average Taxable Income`), hjust = -0.2, vjust = 0, size = 7) + # Heading for the left axis, drawn once rather than once per row
      annotate("text", label = "Female", x = theend, y = max(male_compared_jobs$`Male Average Taxable Income`), vjust = 0, size = 7, hjust = 1.1) + # Heading for the right axis
      geom_vline(xintercept = 0, linetype="dotted") + # Create left y axis line
      geom_vline(xintercept = theend, linetype="dotted") + # Create right y  axis line
      geom_text_repel(label = male_compared_jobs$label_male, y = male_compared_jobs$`Male Average Taxable Income`, x = 0, segment.color = "red", na.rm=TRUE) + # Label male points
      geom_text_repel(label = male_compared_jobs$label_female, y = male_compared_jobs$`Female Average Taxable Income`, x = theend, segment.color = "red", na.rm=TRUE) + # Label female points
      labs(title = "The Australian Male-Female Income Gap", subtitle = "The 100 best-paid occupations as measured by average male income, with labels highlighting gender disparities")  # Title

#ggsave("Male Jobs Compared to Female.png", plot = male_compared_plot)


```

So I took the opportunity to try figuring out how to build a slopegraph in R. As [Cole Nussbaumer Knaflic](http://www.storytellingwithdata.com/blog/2014/03/more-on-slopegraphs) puts it, slopegraphs are great for highlighting comparisons between two groups, two points in time etc. Here is my attempt at visualising some of the data:

```{r print_plot, message=FALSE}
suppressMessages(male_compared_plot)
```

```{r specific_occupations}
futures_traders <- salaries %>% filter(occupation == "Futures trader")

legal_occupations <- salaries %>% filter((str_detect(occupation, "Judge") & str_detect(occupation, "law")) | occupation == "Magistrate" | occupation == "Barrister" | occupation == "Lawyer; Solicitor") # str_detect used here where "Judge" and "law" aren't the full cell.

law_men <- legal_occupations %>% filter(gender == "Male") %>%# There is likely a handier way to do this with tidyr::spread
      rename(Men = individuals) %>%
      rename(`Average taxable income (men)` = average_taxable_income)
law_women <- legal_occupations %>% filter(gender == "Female")  %>% # There is likely a handier way to do this with tidyr::spread
      rename(Women = individuals) %>%
      rename(`Average taxable income (women)` = average_taxable_income)

law <- full_join(law_men, law_women, by = "occupation") %>%
      mutate(Occupation = str_replace_all(occupation, " \uFFFD law", "")) %>%
      mutate(Occupation = str_replace_all(Occupation, "Lawyer; ", "")) %>%
      mutate(`Men (%)` = round((Men/(Men+Women)*100), digits = 0)) %>%
      mutate(`Women (%)` = round((Women/(Men+Women)*100), digits = 0)) %>%
      mutate(`Women's income as % of men's` = paste(round((`Average taxable income (women)`/`Average taxable income (men)`)*100, digits = 0), "%", sep = "")) %>%
      mutate(`Average taxable income (men)` = paste("AUS$", prettyNum(`Average taxable income (men)`, big.mark = ","), sep = "")) %>%
      select(Occupation, `Men (%)`, `Women (%)`, `Average taxable income (men)`, `Women's income as % of men's`) %>%
      arrange(desc(`Men (%)`))

equal_pay <- law[1:2,]
equal_access <- law[3:4,]
```

So, what we're looking at is a graph visualising the gender pay gap for the 100 best-paid occupations, as measured by average male income. I've used labels to highlight the five occupations with the worst disparities and the only three occupations where women earn 90% or more of what men earn.

So: a critique. The labels obviously need work. They are wordy and it's not obvious which applies where. And the basic design is flawed: if I include more data by putting more labels in, the graph becomes completely cluttered. Likewise if I add more guidance to navigate the data. And lots of information is missing, especially regarding how gendered the occupations are in the first place. I thought about placing points of varying size on each axis to signify this, but there is no tidy way to do so within a single graph.
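
As a rough sketch of the point-size idea, here is a toy slopegraph in which the area of the point at each end of a line encodes the number of men or women in the occupation. Everything in it is an assumption for illustration: the three occupations, the incomes and the headcounts are made up, and the column names (`male_income`, `female_income`, `men`, `women`) are hypothetical rather than taken from the dataset. It has not been tried on the full 100 occupations, where the points may well overlap badly.

```r
library(ggplot2)

# Made-up figures for three hypothetical occupations; not taken from the dataset
toy <- data.frame(
  occupation = c("Occupation A", "Occupation B", "Occupation C"),
  male_income = c(400000, 250000, 120000),
  female_income = c(300000, 240000, 40000),
  men = c(900, 300, 50),
  women = c(100, 280, 45)
)

toy_slopegraph <- ggplot(toy) +
  # One line per occupation, from male income (left) to female income (right)
  geom_segment(aes(x = 0, xend = 1, y = male_income, yend = female_income),
               colour = "gray40") +
  # Point area encodes headcount at each end of the line
  geom_point(aes(x = 0, y = male_income, size = men)) +
  geom_point(aes(x = 1, y = female_income, size = women)) +
  scale_size_area(max_size = 10, name = "Individuals") +
  # Occupation names to the left of the male axis
  geom_text(aes(x = 0, y = male_income, label = occupation), hjust = 1.3) +
  # Axis headings, drawn once each
  annotate("text", x = c(0, 1), y = max(toy$male_income) * 1.1,
           label = c("Male", "Female")) +
  # Leave room on the left for the occupation names
  scale_x_continuous(limits = c(-0.6, 1.3)) +
  theme_void()

toy_slopegraph
```

Using `scale_size_area()` keeps the point area proportional to the headcount, so an occupation with almost no women shows up as a barely visible point on the right-hand axis.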

On the data itself, it would be interesting to seek a pattern relating income equality to equality of access, but that's for another day. Anecdotally, don't be fooled by the futures traders: there are `r futures_traders[2,5]` male futures traders in the dataset and only `r futures_traders[1,5]` women.

Likewise for members of the legal profession. Incomes are more equal where [pay scales apply](http://www.justice.vic.gov.au/home/justice+system/courts+and+tribunals/judicial+salaries+and+entitlements). But women are less likely to occupy those roles. Where occupational access is more equal, employers are freer to set salaries, and women are paid less well. When it comes to the Bar, in all likelihood the male average income is positively skewed by the male-dominated big-earners at the top:

```{r print_law_table}

knitr::kable(equal_pay, format = "html", align = "l", caption = "Income is more equal but women have less access where pay scales apply") %>%
      kable_styling(full_width = FALSE, bootstrap_options = "striped")

knitr::kable(equal_access, format = "html", align = "l", caption = "Access is more equal but women have less income where pay scales do not apply") %>%
      kable_styling(full_width = FALSE, bootstrap_options = "striped")

```

To my mind, this reflects the classic patterns of gender discrimination. Unless you are in a tightly regulated profession, disparities persist. And when it comes to the tightly regulated top of the profession, the career necessary for access is likely not available to enough women at all.
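
A note on the code behind the tables: I split the data by gender and joined it back together, but the reshaping route is probably tidier. The sketch below uses `pivot_wider()`, the successor to `tidyr::spread()` in tidyr 1.0.0 and later, on a made-up data frame. The column names (`occupation`, `gender`, `individuals`, `average_taxable_income`) are the ones used in my chunks; the rows and figures are invented for illustration, and `women_share_pct` and `women_income_pct` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Made-up rows using the column names from the post's code; not real figures
toy_salaries <- tibble(
  occupation = c("Judge", "Judge", "Solicitor", "Solicitor"),
  gender = c("Male", "Female", "Male", "Female"),
  individuals = c(60, 20, 500, 500),
  average_taxable_income = c(380000, 361000, 120000, 90000)
)

toy_law <- toy_salaries %>%
  # One row per occupation, with a column per gender for each measure,
  # e.g. individuals_Male and average_taxable_income_Female
  pivot_wider(names_from = gender,
              values_from = c(individuals, average_taxable_income)) %>%
  mutate(
    # Women as a share of everyone in the occupation
    women_share_pct = round(individuals_Female /
                              (individuals_Male + individuals_Female) * 100),
    # Women's average income as a percentage of men's
    women_income_pct = round(average_taxable_income_Female /
                               average_taxable_income_Male * 100)
  ) %>%
  select(occupation, women_share_pct, women_income_pct)

toy_law
```

Because each measure ends up in a named column per gender, there is no need to rename columns before a join or to index rows by position.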

Gist with code [here](https://gist.github.com/cokelly/7ae45d5284d37857c139ce293146ab69).