Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Quick and Dirty Address Matching with LibPostal

Quick and Dirty Address Matching with LibPostal

Most businesses have databases of previous customers, and data analysts will frequently be asked to join arbitrary data to the customer tables in order to provide analysis.

Unfortunately joining address data together is notoriously difficult:

  • The same address can be expressed in many ways
  • The parts of addresses are not always clear
  • There are valid lexically very similar addresses very nearby any given address

One way of doing quick and dirty categorization of address data to find duplicates or join tables is to apply "address normalization" to the address data.

Address normalization attempts to convert an arbitrary input address string into a finite number of standard forms that the input might intend to represent.

It's not hard to imagine writing your own normalizer in python or another scripting language: convert standard abbreviations to fully spelled out words, deal with equivalencies like "first" and "1st" and "1", strip out lexical quirks like capitalization and diacritics. But what if you had to do it for a second language and region? And then another? You might be bi-lingual, but are you tri-lingual? Quad?

Enter libpostal, a natural language processing library trained on over 1B address records in the international OpenStreetMap database. For use inside PostgreSQL, you can use the pgsql-postal extension to call the library directly from SQL.

libpostal only has two operations, "address normalization" and "address parsing", that are exposted by pgsql-postal with the postal_normalize() and postal_parse() functions.

Normalization takes an address string and converts it to all the standard forms that "make sense". For example:

SELECT unnest(postal_normalize('390 Greenwich St., New york, ny, 10013'));
 390 greenwich saint new york ny 10013
 390 greenwich saint new york new york 10013
 390 greenwich street new york ny 10013
 390 greenwich street new york new york 10013

Parsing takes apart a string into address components and returns a JSONB of those components:

SELECT jsonb_pretty(postal_parse('390 Greenwich St., New york, ny, 10013'));
 {                           
     "city": "new york",     
     "road": "greenwich st.",
     "state": "ny",          
     "postcode": "10013",    
     "house_number": "390"   
 }

We can use normalization to create a table of text searchable address strings, and then use full text search to efficiently search that table for potential matches for new addresses.

International Normalization

As we saw above, normalization takes raw address strings and turns them into "possible standard forms", which are suitable for searching against. They aren't necessarily the best forms, more regionally-aware parsing can do a better job of standard North American parsing and formatting, but where libpostal shines is being a ready-to-run fully international solution that doesn't even need to be told what language it is working on.

For example, this address in Berlin:

SELECT unnest(postal_normalize('Potsdamer Straße 3, 10785 Berlin, Germany'));
potsdamer strasse 3 10785 berlin germany

The diacritics are stripped out, but otherwise it is not heavily changed. The same address with an abbreviation:

SELECT unnest(postal_normalize('POTSDAMER STR. 3, 10785 BERLIN'));
potsdamer strasse 3 10785 berlin

The library recognizes based on nothing but context that it is working in german, and that 'STR' is short for 'strasse', not 'street'.

Build and Install LibPostal

  • Download or clone the libpostal

  • Build and install the library

    cd libpostal
    ./bootstrap.sh
    sudo mkdir /opt/libpostal
    ./configure --datadir=/opt/libpostal
    make
    sudo make install
    
  • Download or clone the pgsql-postal extension

  • Build and install the extension

    cd pgsql-postal
    make
    sudo make install
    
  • Create a working database

    createdb postal
    
  • Create the extensions

    psql -d postal -c 'create extension postal'
    psql -d postal -c 'create extension fuzzystrmatch'
    

Load and Prepare the Addresses

For this example, we will load a CSV file of New York addresses.

Now we need a table to load the addresses into:

DROP TABLE IF EXISTS addresses;
CREATE TABLE addresses (
    pk serial primary key,
    lon double precision,
    lat double precision,
    number varchar,
    street varchar,
    unit varchar,
    city varchar,
    district varchar,
    region varchar,
    postcode varchar,
    id double precision,
    hash varchar
    );

Copy the New York data into the empty table:

COPY addresses (lon, lat, number, street, unit, city, district, region, postcode, id, hash) 
  FROM 'us/ny/city_of_new_york.csv' WITH (format csv, header true);

Create the Normalized Address Table

The New York address table has addresses already split out into components, but we want to feed them to the normalizer as one string, so we actually stick them back together again.

SELECT
  concat_ws(', ', concat_ws(' ', unit, number, street), 'NEW YORK', postcode)
FROM addresses
LIMIT 5;
 1458 36 ST, NEW YORK, 11218
 927 80 ST, NEW YORK, 11228
 2116 73 ST, NEW YORK, 11204
 2168 73 ST, NEW YORK, 11204
 105 E  32 ST, NEW YORK, 11226

We are also going to want to search the normalized address using the PostgreSQL full-text search engine, so while creating the normalized form we'll also build a tsvector of the normalized text.

DROP TABLE IF EXISTS addresses_normalized;
CREATE TABLE addresses_normalized AS
  SELECT pk, 
    na,                              -- normalized address
    to_tsvector('simple', na) AS ts  -- tsearch address
  FROM (
    SELECT pk, 
      unnest(postal_normalize(concat_ws(', ', concat_ws(' ', unit, number, street), 'NEW YORK', postcode))) AS na 
    FROM addresses
    ) AS subq;

Now we have a search table that we can join back to the original addresses via the primary key, pk.

SELECT * 
FROM addresses_normalized 
LIMIT 5;
   pk   |                na                 |                            ts                             
--------+-----------------------------------+-----------------------------------------------------------
 817072 | 410 liberty avenue new york 11207 | '11207':6 '410':1 'avenue':3 'liberty':2 'new':4 'york':5
 817073 | 5006 snyder avenue new york 11203 | '11203':6 '5006':1 'avenue':3 'new':4 'snyder':2 'york':5
 817074 | 586 liberty avenue new york 11207 | '11207':6 '586':1 'avenue':3 'liberty':2 'new':4 'york':5
 817075 | 45 hewitt avenue new york 10301   | '10301':6 '45':1 'avenue':3 'hewitt':2 'new':4 'york':5
 817076 | 88 coventry road new york 10304   | '10304':6 '88':1 'coventry':2 'new':4 'road':3 'york':5

Search Using the Normalized Address Table

To search the addresses with a single input, we need to keep a few things in mind:

  • A single search term can generate multiple normalized forms that we will want to search with
  • We will need to turn those normalized forms into text search queries
  • Order matters in addresses, so we need to run a phrase search

We start the query using WITH to create a set of text search queries to join to the search table, then join them and add the levenshtein distance to get a feel for how lexically close each potential match is.

WITH 
normalized AS (
    SELECT unnest(postal_normalize('388 Greenwich st, ny')) AS na
),
tsqueries as (
    SELECT phraseto_tsquery('simple', na) AS ts, na 
    FROM normalized
)
SELECT DISTINCT ON (pk)
  addresses.*, 
  addresses_normalized.na AS addresses_normalized_na, 
  tsqueries.na AS tsqueries_na,
  levenshtein(tsqueries.na, addresses_normalized.na) AS levenshtein
FROM addresses
JOIN addresses_normalized USING (pk)
JOIN tsqueries ON (addresses_normalized.ts @@ tsqueries.ts)
ORDER BY pk;

With 1M New York addresses and 1.7M normalized candidates, the search takes about 90ms or less.

Join Using the Normalized Address Table

We can use the normalized address table and the normalizer to do a bulk join of another address table to the master table the New York addresses.

  • Download the New York open data Health Centers data

  • Create a table for the data

    CREATE TABLE health_centers (
        name text,
        address text,
        postal text,
        telephone text,
        lat double precision,
        lon double precision,
        board text,
        council text,
        censustract text,
        bin text,
        bbl text,
        nta text,
        borough text
    );
  • Load the data

    COPY health_centers 
        FROM 'rows.csv' 
        WITH (format csv, header true);

The health center data has quite different form for addresses from the base New York address table, but we can still get a lot of joins.

WITH queries AS (
    SELECT name as health_name, 
        address as health_address, 
        phraseto_tsquery('simple', unnest(postal_normalize(concat_ws(', ', address, 'new york')))) AS tsq 
    FROM health_centers
)
SELECT 
  addresses.*, 
  addresses_normalized.*, 
  queries.health_name,
  queries.health_address
FROM addresses
JOIN addresses_normalized using (pk)
JOIN queries ON (addresses_normalized.ts @@ queries.tsq);

Here's one example of a matched record:

pk             | 485629
lon            | -74.0091443
lat            | 40.645177
number         | 514
street         | 49 ST
unit           | 
city           | NEW YORK
district       | 
region         | NEW YORK
postcode       | 11220
id             | 3017183
hash           | a373b8e9e014c4b4
geog           | 0101000020E6100000785CF9D1958052C0D190F12895524440
pk             | 485629
na             | 514 49 street new york 11220
ts             | '11220':6 '49':2 '514':1 'new':4 'street':3 'york':5
health_name    | Sunset Park Family Health Center
health_address | 514 49th Street

And another:

pk             | 902117
lon            | -73.9440892
lat            | 40.7911112
number         | 212
street         | E  106 ST
unit           | 
city           | NEW YORK
district       | 
region         | NEW YORK
postcode       | 10029
id             | 9106834
hash           | 9802bcab491efb2a
geog           | 0101000020E61000006FA01BF56B7C52C0EABFBD2143654440
pk             | 902117
na             | 212 east 106 street new york 10029
ts             | '10029':7 '106':3 '212':1 'east':2 'new':5 'street':4 'york':6
health_name    | Settlement Health & Medical Services
health_address | 212 East 106th Street

Conclusion

While it is not a 100% solution, using normalized addresses and full text search provides a relatively fast (less than 100ms) matching approach for loose address matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.