---
title: "Fuzzy joining tables using string distance methods"
format: html
---
```{r setup}
suppressPackageStartupMessages({
  library(tidyverse)
  library(sf)
  library(janitor)
  library(igraph)
})
```
I recently had a problem where I had two datasets that I needed to join together. The two datasets had a nice 1:1 mapping between them, but unfortunately there was no coded identifier to join them on. There was just a name field, and annoyingly there were subtle differences between the two.

For demonstration purposes, I'm going to show a similar problem. Imagine that we have one dataset containing data about [ICBs][0], and another about [STPs][1]. (For those not familiar with English NHS geographies, STPs were 42 geographic areas covering England, which in July 2022 became ICBs.) This has a 1:1 mapping, but some of the names changed slightly when ICBs came into effect.
[0]: https://en.wikipedia.org/wiki/Integrated_care_system
[1]: https://en.wikipedia.org/wiki/Sustainability_and_transformation_plan

If you want to follow along, download the list of [STPs][2] and [ICBs][3] from the ONS Geoportal site.

[2]: https://geoportal.statistics.gov.uk/documents/sustainability-and-transformation-partnerships-april-2021-names-and-codes-in-england-1/about
[3]: https://geoportal.statistics.gov.uk/documents/integrated-care-boards-july-2022-names-and-codes-in-england-1/about
```{r}
stps <- readxl::read_excel("STP_APR_2021_EN_NC.xlsx") |>
  select(code = STP21CDH, description = STP21NM) |>
  arrange(code)
stps
```

```{r}
icbs <- readxl::read_excel("ICB_JUL_2022_EN_NC.xlsx") |>
  select(code = ICB22CDH, description = ICB22NM) |>
  arrange(code)
icbs
```
Obviously, here we have the ["E54" ONS codes][4] which we could join on, a luxury I did not have in my original problem. I've left these in so that we can test later that the matching works.

[4]: https://en.wikipedia.org/wiki/ONS_coding_system
First of all, how many rows are we able to match by joining on the name?

```{r}
icbs |>
  inner_join(stps, by = "description")
```
None! Looking at the `icbs` dataset, we can see the rows start with "NHS" and end with "Integrated Care Board", which isn't the case in `stps`. Perhaps by just stripping these we get a perfect match?

```{r}
icbs |>
  select(description) |>
  mutate(description = str_remove_all(description, "^NHS | Integrated Care Board$")) |>
  inner_join(stps |> select(description), by = "description")
```
Roughly half... not good enough!

## String distance methods to the rescue?

Many of us will have had to compare strings at some point, perhaps using `LIKE` in SQL, or regular expressions (regexes) in R. But there is a class of algorithms that can calculate the "distance" or "similarity" between two strings.
Consider the two words "grey" and "gray". How similar are these two words? The [Hamming distance][5] algorithm compares two strings of equal length, and returns the number of positions at which they differ. So for our two words above, we get a distance of 1.

[5]: https://en.wikipedia.org/wiki/Hamming_distance
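This is easy to check with the `{stringdist}` package (which we use later; it isn't loaded in the setup chunk, so it's namespaced here):

```r
# Hamming distance: count of positions where two equal-length strings differ
stringdist::stringdist("grey", "gray", method = "hamming")
#> [1] 1
```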
A generally more useful method, though, is the [Damerau-Levenshtein distance][6]. This calculates the number of operations needed to turn the first string into the second.

[6]: https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance

Operations in this context are single-character insertions, deletions or substitutions, or transpositions of two adjacent characters.
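As a quick sketch of what counts as a single operation, each of these pairs (made up for illustration) is exactly one operation apart, so each has a Damerau-Levenshtein distance of 1:

```r
stringdist::stringdist("grey", "gray", method = "dl")       # one substitution
stringdist::stringdist("cat", "cart", method = "dl")        # one insertion
stringdist::stringdist("recieve", "receive", method = "dl") # one transposition
```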
Alternatively, we could consider the sets of unique words used in the two strings. We can count the intersection of words (words common to both strings) and divide by the count of all the unique words used, giving a value between 0 and 1. A value of 0 indicates that the two strings have no words in common, and a value of 1 indicates that they contain exactly the same set of words. This method is called the [Jaccard similarity][7].

[7]: https://towardsdatascience.com/overview-of-text-similarity-metrics-3396c4601f50

This is a very useful method for the problem I faced, as I expected the names in both datasets to be free of spelling mistakes.
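A minimal word-level sketch of the idea (the `jaccard_words()` helper is just for illustration and isn't used later):

```r
# Jaccard similarity on word sets: |intersection| / |union|
jaccard_words <- function(a, b) {
  words_a <- unique(strsplit(tolower(a), "\\s+")[[1]])
  words_b <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(words_a, words_b)) / length(union(words_a, words_b))
}

# "the" and "cat" are shared, "grey" and "gray" are not:
# 2 common words out of 4 unique words in total gives 0.5
jaccard_words("the grey cat", "the gray cat")
#> [1] 0.5
```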
## Using the Jaccard similarity method

First, we can use the `stringsimmatrix()` function from the `{stringdist}` package to calculate the Jaccard similarity matrix, comparing the names from the first table with the names from the second table. (Note that `{stringdist}` computes the Jaccard similarity over sets of character q-grams, single characters by default, rather than whole words, but the principle is the same.)
```{r}
dist_matrix <- stringdist::stringsimmatrix(
  icbs$description,
  stps$description,
  "jaccard"
)
```
However, simply calculating the similarity matrix doesn't give us a solution to the problem. In the table below, you can see that some rows appear more than once in column `y`, and eyeballing the matches it's clear it hasn't always found the correct pair.
```{r}
# for each row, find the column index of the maximum similarity
tibble(
  x = icbs$description |> str_remove_all("^NHS | Integrated Care Board$"),
  y = stps$description[apply(dist_matrix, 1, which.max)]
) |>
  group_by(y) |>
  arrange(y) |>
  filter(n() > 1)
```
## Graph theory saves the day

There is a quick solution to this, though, using a [bipartite graph][8]. A bipartite graph is a type of network where the vertices are of two types, and edges only exist between vertices of different types.

[8]: https://en.wikipedia.org/wiki/Bipartite_graph

We can use the `{igraph}` package to construct and manipulate graphs. First, let's construct a table where the names from the first table are nodes of one type, and the names from the second table are nodes of the other type.
```{r}
# the column `name` is special in a named graph: it uniquely identifies each vertex
vertices <- dplyr::bind_rows(
  .id = "type",
  icbs = icbs |> mutate(name = paste0("icb_", code)),
  stps = stps |> mutate(name = paste0("stp_", code))
) |>
  dplyr::relocate(name, .before = dplyr::everything()) |>
  # the "type" column needs to be a logical vector, so we use TRUE for the
  # first type and FALSE for the second
  dplyr::mutate(dplyr::across("type", ~ .x == "icbs"))
vertices
```
Then we create weighted edges between each pair of names, using the similarity matrix we calculated above.
```{r}
edges <- dist_matrix |>
  # this converts our matrix into a list of row vectors
  purrr::array_branch(1) |>
  # the rows are in the same order as the icbs data, so we can name them by
  # ICB code; each row vector contains the similarities to every STP, in the
  # order of the stps data
  purrr::set_names(icbs$code) |>
  purrr::map_dfr(
    .id = "to",
    \(.x) tibble::tibble(from = stps$code, weight = .x)
  ) |>
  mutate(
    across(to, ~ paste0("icb_", .x)),
    across(from, ~ paste0("stp_", .x))
  )
edges
```
This tibble gives us the string similarities between each pair of names from our two tables.

Now that we have our edges and vertices, we can construct a graph and find the maximum bipartite matching. This works without much effort because we constructed our vertices with a `type` logical column, and our edges with a `weight` numeric column. `{igraph}` handles the rest for us.
```{r}
g <- graph_from_data_frame(edges, TRUE, vertices)
m <- max_bipartite_match(g)$matching |>
  enframe("icb_code", "stp_code") |>
  # the matching contains every vertex, so we get both icb->stp and
  # stp->icb pairings: keep just the icb->stp results
  filter(icb_code |> str_starts("icb_")) |>
  mutate(across(c(icb_code, stp_code), \(.x) str_remove_all(.x, "^.{3}_")))
m |>
  filter(icb_code == stp_code) |>
  rename(code = icb_code) |>
  select(-stp_code) |>
  inner_join(
    icbs |> rename(icb_name = description),
    by = "code"
  ) |>
  inner_join(
    stps |> rename(stp_name = description),
    by = "code"
  ) |>
  mutate(across(everything(), \(.x) str_trunc(.x, 30))) |>
  print(n = 42)
```
This gives us a perfect match!

## How does this work?

The matching algorithm selects a set of edges such that each node appears in at most one selected edge, choosing the set of pairings that maximises the total weight.

This prevents the issue we saw initially, because each node can be paired with at most one other node.
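To see the difference this makes, here is a toy example (made up for illustration, not from the datasets above) where greedily picking each left node's best match would assign `b1` to `a1` and leave `a2` with its worst option, whereas the maximum bipartite matching maximises the total weight:

```r
library(igraph)

# a1-b1 + a2-b2 has total weight 0.9 + 0.1 = 1.0,
# but a1-b2 + a2-b1 has total weight 0.8 + 0.7 = 1.5
toy <- graph_from_data_frame(
  d = data.frame(
    from   = c("a1", "a1", "a2", "a2"),
    to     = c("b1", "b2", "b1", "b2"),
    weight = c(0.9, 0.8, 0.7, 0.1)
  ),
  directed = FALSE,
  vertices = data.frame(
    name = c("a1", "a2", "b1", "b2"),
    type = c(TRUE, TRUE, FALSE, FALSE)
  )
)

# a1 is matched to b2 and a2 to b1, despite a1-b1 being the single best edge
max_bipartite_match(toy)$matching[c("a1", "a2")]
```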
This algorithm works well when we have a good set of weights on the edges. In fact, if you try running the string similarity function with some of the other algorithms that are available, such as the Levenshtein distance, most give us bipartite matchings that aren't correct.

For a more complete description, see the help page for the function ([`igraph::max_bipartite_match`][9]), and the [push-relabel maximum flow algorithm][10].

[9]: https://igraph.org/r/doc/matching.html
[10]: https://en.wikipedia.org/wiki/Push%E2%80%93relabel_maximum_flow_algorithm
## Final thoughts

Hopefully this has been interesting and has introduced some new techniques to play with. Both string-distance algorithms and graph theory are very powerful tools that crop up again and again in computer science, so they are worth diving into if you are curious!

There is an obvious question: are there easier approaches to this problem? In this case we only had 42 options, which is probably quick enough to solve by hand in Excel, starting with two sorted columns and manually lining up the correct rows.

However, if you had a similar problem with more options then the manual approach would quickly become tiring. It is worth noting that you should not blindly trust the results: in my original problem I scanned the results and confirmed that I got the pairs I was after, and in this example we had the codes, which allowed us to confirm the correct results.

I also came into this problem expecting there to be a perfect 1:1 mapping between both sets. If it isn't guaranteed that this constraint holds in your problem then you may need to treat the results more cautiously.