@tomjemmett
Last active February 3, 2023 20:36
---
title: "Fuzzy joining tables using string distance methods"
format: html
---
```{r setup}
suppressPackageStartupMessages({
  library(tidyverse)
  library(sf)
  library(janitor)
  library(igraph)
})
```
I recently had a problem where I had two datasets which I needed to join together. The two datasets had a nice 1:1 mapping between them, but unfortunately there was no coded identifier to join them on. There was just a name field, and annoyingly there were subtle differences between the two.
For demonstration purposes, I'm going to show a similar problem. Imagine that we have one dataset that contains data about [ICBs][0], and another about [STPs][1]. (For those not familiar with English NHS geographies, STPs were 42 geographic areas covering England, which in July 2022 became ICBs.) This has a 1:1 mapping, but some of the names changed slightly when ICBs came into effect.
[0]: https://en.wikipedia.org/wiki/Integrated_care_system
[1]: https://en.wikipedia.org/wiki/Sustainability_and_transformation_plan
If you want to follow along, download the lists of [STPs][2] and [ICBs][3] from the ONS Geoportal site.
[2]: https://geoportal.statistics.gov.uk/documents/sustainability-and-transformation-partnerships-april-2021-names-and-codes-in-england-1/about
[3]: https://geoportal.statistics.gov.uk/documents/integrated-care-boards-july-2022-names-and-codes-in-england-1/about
```{r}
stps <- readxl::read_excel("STP_APR_2021_EN_NC.xlsx") |>
  select(code = STP21CDH, description = STP21NM) |>
  arrange(code)

stps
```
```{r}
icbs <- readxl::read_excel("ICB_JUL_2022_EN_NC.xlsx") |>
  select(code = ICB22CDH, description = ICB22NM) |>
  arrange(code)

icbs
```
Obviously, here we have the ["E54" ONS codes][4] which we could join on, a luxury I did not have. I've left these in so we can test later that the matching does work.
[4]: https://en.wikipedia.org/wiki/ONS_coding_system
First of all, how many rows are we able to match joining on the name?
```{r}
icbs |>
  inner_join(stps, by = "description")
```
None! Looking at the `icbs` dataset, we can see the names start with "NHS" and end with "Integrated Care Board", which doesn't happen in `stps`. Perhaps by just stripping these we get a perfect match?
```{r}
icbs |>
  select(description) |>
  mutate(description = str_remove_all(description, "^NHS | Integrated Care Board$")) |>
  inner_join(stps |> select(description), by = "description")
```
Roughly half... not good enough!
## String distance methods to the rescue?
Many of us will have had to compare strings at some point, perhaps using `LIKE` in SQL, or regular expressions (regexes) in R. But there is a class of algorithms that can calculate the "distance" or "similarity" between two strings.
Consider the two words "grey" and "gray". How similar are they? The [Hamming distance][5] algorithm compares two strings of equal length and returns the number of positions at which they differ. So for our two words above, we get a distance of 1.
[5]: https://en.wikipedia.org/wiki/Hamming_distance
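The Hamming distance is simple enough to sketch in base R (the `{stringdist}` package also implements it as `method = "hamming"`):

```{r}
# Hamming distance: the number of positions at which two
# equal-length strings differ
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming("grey", "gray")
#> [1] 1
```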
A generally more useful method, though, is the [Damerau-Levenshtein distance][6]. This calculates the number of operations needed to turn the first string into the second.
[6]: https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
Operations in this context are single-character insertions, deletions or substitutions, or transposition of two adjacent characters.
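Base R's `adist()` computes the plain Levenshtein distance (insertions, deletions and substitutions, but no transpositions); `stringdist::stringdist(a, b, method = "dl")` adds the transposition operation. A classic example:

```{r}
# "kitten" -> "sitting" needs 3 edits:
# substitute k->s, substitute e->i, insert g
drop(adist("kitten", "sitting"))
#> [1] 3
```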
Alternatively, we could consider the sets of unique words used in two strings. We can count the intersection (words common to both strings) and divide by the count of all the words used, giving a value between 0 and 1. A value of 0 indicates that the two strings share no words, and a value of 1 indicates that they use exactly the same set of words. This method is called the [Jaccard similarity][7].
[7]: https://towardsdatascience.com/overview-of-text-similarity-metrics-3396c4601f50
This is a very useful method for the problem I faced, as I expect the names in both datasets to be free of spelling mistakes.
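One thing worth knowing: `{stringdist}`'s "jaccard" method computes this over sets of character q-grams (q = 1, i.e. single characters, by default) rather than whole words, but the idea is the same. A minimal sketch over single characters:

```{r}
# Jaccard similarity over the sets of characters in each string
# (equivalent to stringdist's "jaccard" method with q = 1)
jaccard_sim <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  length(intersect(sa, sb)) / length(union(sa, sb))
}

jaccard_sim("grey", "gray")
#> [1] 0.6
```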
## Using the Jaccard Similarity method
First, we can use the `stringsimmatrix()` function from the `{stringdist}` package to calculate the Jaccard Similarity matrix, comparing the names from the first table to the names from the second table.
```{r}
dist_matrix <- stringdist::stringsimmatrix(
  icbs$description,
  stps$description,
  method = "jaccard"
)
```
However, simply calculating the string distance matrix doesn't give us a solution to the problem. In the table below, you can see that in column `y` some rows appear more than once, and eyeballing the matches it's clear it hasn't always found the correct pair.
```{r}
# for each row (ICB), find the index of the most similar STP name
tibble(
  x = icbs$description |> str_remove_all("^NHS | Integrated Care Board$"),
  y = stps$description[apply(dist_matrix, 1, which.max)]
) |>
  group_by(y) |>
  arrange(y) |>
  filter(n() > 1)
```
## Graph theory saves the day
There is a quick solution to this, though, using a [bipartite graph][8]. A bipartite graph is a type of network where the vertices are of two types, and edges only exist between vertices of different types.
[8]: https://en.wikipedia.org/wiki/Bipartite_graph
We can use the `{igraph}` package to construct and manipulate graphs. First, let's construct a table where we have names from the first table as nodes of one type, and the names from the second table as nodes of the other type.
```{r}
# the column `name` is special in a named graph: it uniquely identifies each vertex
vertices <- dplyr::bind_rows(
  .id = "type",
  icbs = icbs |> mutate(name = paste0("icb_", code)),
  stps = stps |> mutate(name = paste0("stp_", code))
) |>
  dplyr::relocate(name, .before = dplyr::everything()) |>
  # the "type" column needs to be a logical vector, so we use TRUE for the
  # first type and FALSE for the second
  dplyr::mutate(dplyr::across("type", ~ .x == "icbs"))

vertices
```
Then create weighted edges between each pair of names, using the distance matrix we calculated above.
```{r}
edges <- dist_matrix |>
  # this will convert our matrix into a list of rows
  purrr::array_branch(1) |>
  # the rows are in the same order as our original data,
  # so we can use purrr to change them into a dataframe
  purrr::set_names(icbs$code) |>
  purrr::map_dfr(
    .id = "to",
    # each row's values line up with the STP codes (the matrix columns)
    \(.x) tibble::tibble(from = stps$code, weight = .x)
  ) |>
  mutate(
    across(to, ~ paste0("icb_", .x)),
    across(from, ~ paste0("stp_", .x))
  )

edges
```
This tibble gives us the string similarities between each pair of names from our two tables.
Now that we have our edges and vertices, we can construct a graph, and find the maximum bipartite matching. This works without much effort as we constructed our vertices with a `type` logical column, and we constructed our edges with a `weight` numeric column. `{igraph}` handles the rest for us.
```{r}
g <- graph_from_data_frame(edges, directed = TRUE, vertices = vertices)

m <- max_bipartite_match(g)$matching |>
  enframe("icb_code", "stp_code") |>
  # the matching contains each pair twice, once as icb->stp and once as
  # stp->icb; keep just the icb->stp results
  filter(icb_code |> str_starts("icb_")) |>
  mutate(across(c(icb_code, stp_code), ~ str_remove(.x, "^.{3}_")))

m |>
  filter(icb_code == stp_code) |>
  rename(code = icb_code) |>
  select(-stp_code) |>
  inner_join(
    icbs |> rename(icb_name = description),
    by = "code"
  ) |>
  inner_join(
    stps |> rename(stp_name = description),
    by = "code"
  ) |>
  mutate(across(everything(), ~ str_trunc(.x, 30))) |>
  print(n = 42)
```
This gives us a perfect match!
## How does this work?
Intuitively, the matching starts by finding the edge with the greatest weight and pairs those two nodes together. It then removes those nodes (and the edges to/from them) from the graph, and repeats until all nodes are paired or no edges remain.
This prevents the issue we initially saw, because a node can only be paired to one other node.
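That greedy intuition can be sketched in base R with a made-up weight matrix (the weights here are hypothetical, and igraph's actual implementation uses a maximum-flow algorithm rather than this greedy loop):

```{r}
# greedy pairing sketch: repeatedly take the highest-weight remaining
# (row, column) pair, then remove that row and column from consideration
greedy_pairs <- function(w) {
  pairs <- list()
  while (any(is.finite(w))) {
    idx <- which(w == max(w), arr.ind = TRUE)[1, ]
    pairs[[length(pairs) + 1]] <- idx
    w[idx["row"], ] <- -Inf
    w[, idx["col"]] <- -Inf
  }
  do.call(rbind, pairs)
}

# rows 1 and 2 both prefer column 1 (0.9 and 0.8), but the
# pairing still assigns each row a distinct column
greedy_pairs(rbind(c(0.9, 0.5), c(0.8, 0.2)))
#>      row col
#> [1,]   1   1
#> [2,]   2   2
```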
This approach works well when we have a good set of edge weights. In fact, if you try running the string similarity function with some of the other available algorithms, such as the Levenshtein distance, most give bipartite matchings that aren't correct.
For a more complete description, see the help page for the function ([`igraph::max_bipartite_match`][9]), and the [Push-relabel maximum flow algorithm][10].
[9]: https://igraph.org/r/doc/matching.html
[10]: https://en.wikipedia.org/wiki/Push%E2%80%93relabel_maximum_flow_algorithm
## Final thoughts
Hopefully this has been interesting to you and introduced some new and interesting techniques to play with. Both string-distance algorithms and graph theory are very powerful tools that crop up again and again in computer science, so are worth diving into if you are curious!
There is an obvious question: are there easier approaches to this problem? In this case we only had 42 options, which is probably quick enough to solve by hand in Excel, starting with two sorted columns and manually lining up the correct rows.
However, if you had a similar problem with more options then the manual approach would quickly become tiring. It is worth noting that you should not blindly trust the results; in my original problem I scanned the results and confirmed that I got the matches I was after. In this example we had the codes, which allowed us to confirm the correct results.
I also came into this problem expecting there to be a perfect 1:1 mapping between both sets. If it isn't guaranteed that constraint holds in your problem then you may need to treat the results more cautiously.
## Comments

@Lextuga007:

> Should we drop tidyverse from the libraries call and just use dplyr? I was wondering about it for things like training and getting people into the better practice of being explicit.

@tomjemmett replied on Jan 23, 2023 via email.