---
title: "Fuzzy joining tables using string distance methods"
format: html
---
```{r setup}
suppressPackageStartupMessages({
  library(tidyverse)
  library(sf)
  library(janitor)
  library(igraph)
})
```
I recently had a problem where I had two datasets that I needed to join together. The two datasets had a nice 1:1 mapping between them, but unfortunately there was no coded identifier to join them on. There was just a name field, and annoyingly there were subtle differences between the two.

For demonstration purposes, I'm going to show a similar problem. Imagine that we have one dataset containing data about [ICBs][0], and another about [STPs][1]. (For those not familiar with English NHS geographies, STPs were 42 geographic areas covering England, which in July 2022 became ICBs.) This has a 1:1 mapping, but some of the names changed slightly when ICBs came into effect.
[0]: https://en.wikipedia.org/wiki/Integrated_care_system
[1]: https://en.wikipedia.org/wiki/Sustainability_and_transformation_plan

If you want to follow along, download the list of [STPs][2] and [ICBs][3] from the ONS Geoportal site.

[2]: https://geoportal.statistics.gov.uk/documents/sustainability-and-transformation-partnerships-april-2021-names-and-codes-in-england-1/about
[3]: https://geoportal.statistics.gov.uk/documents/integrated-care-boards-july-2022-names-and-codes-in-england-1/about
```{r}
stps <- readxl::read_excel("STP_APR_2021_EN_NC.xlsx") |>
  select(code = STP21CDH, description = STP21NM) |>
  arrange(code)
stps
```

```{r}
icbs <- readxl::read_excel("ICB_JUL_2022_EN_NC.xlsx") |>
  select(code = ICB22CDH, description = ICB22NM) |>
  arrange(code)
icbs
```
Obviously, here we have the ["E54" ONS codes][4] which we could join on, a luxury I did not have in my original problem. I've left these in so that we can test later that the matching works.

[4]: https://en.wikipedia.org/wiki/ONS_coding_system
First of all, how many rows are we able to match by joining on the name?

```{r}
icbs |>
  inner_join(stps, by = "description")
```
None! Looking at the `icbs` dataset, we can see the rows start with "NHS" and end with "Integrated Care Board", which isn't the case in `stps`. Perhaps by just stripping these we get a perfect match?

```{r}
icbs |>
  select(description) |>
  mutate(description = str_remove_all(description, "^NHS | Integrated Care Board$")) |>
  inner_join(stps |> select(description), by = "description")
```
Roughly half... not good enough!

## String distance methods to the rescue?

Many of us will have had to compare strings at some point, perhaps using `LIKE` in SQL, or regular expressions (regexes) in R. But there is a class of algorithms that can calculate the "distance" or "similarity" between two strings.
Consider the two words "grey" and "gray". How similar are these two words? The [Hamming distance][5] algorithm compares two strings of equal length, and returns the number of positions at which they differ. So for our two words above, we get a distance of 1.

[5]: https://en.wikipedia.org/wiki/Hamming_distance
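This is easy to check with the `{stringdist}` package (which we use later; it isn't loaded in the setup chunk, so it's namespaced here):

```r
# Hamming distance: count of positions where two equal-length strings differ
stringdist::stringdist("grey", "gray", method = "hamming")
#> [1] 1
```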
A generally more useful method, though, is the [Damerau-Levenshtein distance][6]. This calculates the number of operations needed to turn the first string into the second.

[6]: https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance

Operations in this context are single-character insertions, deletions or substitutions, or transpositions of two adjacent characters.
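As a quick sketch of what counts as a single operation, each of these pairs (made up for illustration) is exactly one operation apart, so each has a Damerau-Levenshtein distance of 1:

```r
stringdist::stringdist("grey", "gray", method = "dl")       # one substitution
stringdist::stringdist("cat", "cart", method = "dl")        # one insertion
stringdist::stringdist("recieve", "receive", method = "dl") # one transposition
```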
Alternatively, we could consider the sets of unique words used in the two strings. We can count the intersection of words (words common to both strings) and divide by the count of all the unique words used, giving a value between 0 and 1. A value of 0 indicates that the two strings have no words in common, and a value of 1 indicates that they contain exactly the same set of words. This method is called the [Jaccard similarity][7].

[7]: https://towardsdatascience.com/overview-of-text-similarity-metrics-3396c4601f50

This is a very useful method for the problem I faced, as I expected the names in both datasets to be free of spelling mistakes.
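A minimal word-level sketch of the idea (the `jaccard_words()` helper is just for illustration and isn't used later):

```r
# Jaccard similarity on word sets: |intersection| / |union|
jaccard_words <- function(a, b) {
  words_a <- unique(strsplit(tolower(a), "\\s+")[[1]])
  words_b <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(words_a, words_b)) / length(union(words_a, words_b))
}

# "the" and "cat" are shared, "grey" and "gray" are not:
# 2 common words out of 4 unique words in total gives 0.5
jaccard_words("the grey cat", "the gray cat")
#> [1] 0.5
```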
## Using the Jaccard similarity method

First, we can use the `stringsimmatrix()` function from the `{stringdist}` package to calculate the Jaccard similarity matrix, comparing the names from the first table with the names from the second table. (Note that `{stringdist}` computes the Jaccard similarity over sets of character q-grams, single characters by default, rather than whole words, but the principle is the same.)
```{r}
dist_matrix <- stringdist::stringsimmatrix(
  icbs$description,
  stps$description,
  "jaccard"
)
```
However, simply calculating the similarity matrix doesn't give us a solution to the problem. In the table below, you can see that some rows appear more than once in column `y`, and eyeballing the matches it's clear it hasn't always found the correct pair.
```{r}
# for each row, find the column index of the maximum similarity
tibble(
  x = icbs$description |> str_remove_all("^NHS | Integrated Care Board$"),
  y = stps$description[apply(dist_matrix, 1, which.max)]
) |>
  group_by(y) |>
  arrange(y) |>
  filter(n() > 1)
```
## Graph theory saves the day

There is a quick solution to this, though, using a [bipartite graph][8]. A bipartite graph is a type of network where the vertices are of two types, and edges only exist between vertices of different types.

[8]: https://en.wikipedia.org/wiki/Bipartite_graph

We can use the `{igraph}` package to construct and manipulate graphs. First, let's construct a table where the names from the first table are nodes of one type, and the names from the second table are nodes of the other type.
```{r}
# the column `name` is special in a named graph: it uniquely identifies each vertex
vertices <- dplyr::bind_rows(
  .id = "type",
  icbs = icbs |> mutate(name = paste0("icb_", code)),
  stps = stps |> mutate(name = paste0("stp_", code))
) |>
  dplyr::relocate(name, .before = dplyr::everything()) |>
  # the "type" column needs to be a logical vector, so we use TRUE for the
  # first type and FALSE for the second
  dplyr::mutate(dplyr::across("type", ~ .x == "icbs"))
vertices
```
Then we create weighted edges between each pair of names, using the similarity matrix we calculated above.
```{r}
edges <- dist_matrix |>
  # this converts our matrix into a list of row vectors
  purrr::array_branch(1) |>
  # the rows are in the same order as the icbs data, so we can name them by
  # ICB code; each row vector contains the similarities to every STP, in the
  # order of the stps data
  purrr::set_names(icbs$code) |>
  purrr::map_dfr(
    .id = "to",
    \(.x) tibble::tibble(from = stps$code, weight = .x)
  ) |>
  mutate(
    across(to, ~ paste0("icb_", .x)),
    across(from, ~ paste0("stp_", .x))
  )
edges
```
This tibble gives us the string similarities between each pair of names from our two tables.

Now that we have our edges and vertices, we can construct a graph and find the maximum bipartite matching. This works without much effort because we constructed our vertices with a `type` logical column, and our edges with a `weight` numeric column. `{igraph}` handles the rest for us.
```{r}
g <- graph_from_data_frame(edges, TRUE, vertices)
m <- max_bipartite_match(g)$matching |>
  enframe("icb_code", "stp_code") |>
  # the matching contains every vertex, so we get both icb->stp and
  # stp->icb pairings: keep just the icb->stp results
  filter(icb_code |> str_starts("icb_")) |>
  mutate(across(c(icb_code, stp_code), \(.x) str_remove_all(.x, "^.{3}_")))
m |>
  filter(icb_code == stp_code) |>
  rename(code = icb_code) |>
  select(-stp_code) |>
  inner_join(
    icbs |> rename(icb_name = description),
    by = "code"
  ) |>
  inner_join(
    stps |> rename(stp_name = description),
    by = "code"
  ) |>
  mutate(across(everything(), \(.x) str_trunc(.x, 30))) |>
  print(n = 42)
```
This gives us a perfect match!

## How does this work?

The matching algorithm selects a set of edges such that each node appears in at most one selected edge, choosing the set of pairings that maximises the total weight.

This prevents the issue we saw initially, because each node can be paired with at most one other node.
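To see the difference this makes, here is a toy example (made up for illustration, not from the datasets above) where greedily picking each left node's best match would assign `b1` to `a1` and leave `a2` with its worst option, whereas the maximum bipartite matching maximises the total weight:

```r
library(igraph)

# a1-b1 + a2-b2 has total weight 0.9 + 0.1 = 1.0,
# but a1-b2 + a2-b1 has total weight 0.8 + 0.7 = 1.5
toy <- graph_from_data_frame(
  d = data.frame(
    from   = c("a1", "a1", "a2", "a2"),
    to     = c("b1", "b2", "b1", "b2"),
    weight = c(0.9, 0.8, 0.7, 0.1)
  ),
  directed = FALSE,
  vertices = data.frame(
    name = c("a1", "a2", "b1", "b2"),
    type = c(TRUE, TRUE, FALSE, FALSE)
  )
)

# a1 is matched to b2 and a2 to b1, despite a1-b1 being the single best edge
max_bipartite_match(toy)$matching[c("a1", "a2")]
```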
This algorithm works well when we have a good set of weights on the edges. In fact, if you try running the string similarity function with some of the other algorithms that are available, such as the Levenshtein distance, most give us bipartite matchings that aren't correct.

For a more complete description, see the help page for the function ([`igraph::max_bipartite_match`][9]), and the [push-relabel maximum flow algorithm][10].

[9]: https://igraph.org/r/doc/matching.html
[10]: https://en.wikipedia.org/wiki/Push%E2%80%93relabel_maximum_flow_algorithm
## Final thoughts

Hopefully this has been interesting and has introduced some new techniques to play with. Both string-distance algorithms and graph theory are very powerful tools that crop up again and again in computer science, so they are worth diving into if you are curious!

There is an obvious question: are there easier approaches to this problem? In this case we only had 42 options, which is probably quick enough to solve by hand in Excel, starting with two sorted columns and manually lining up the correct rows.

However, if you had a similar problem with more options then the manual approach would quickly become tiring. It is worth noting that you should not blindly trust the results: in my original problem I scanned the results and confirmed that I got the pairs I was after, and in this example we had the codes, which allowed us to confirm the correct results.

I also came into this problem expecting there to be a perfect 1:1 mapping between both sets. If it isn't guaranteed that this constraint holds in your problem then you may need to treat the results more cautiously.