GeoPlanet data with ancestor chain cache imported in 10 minutes

Yahoo! provides its GeoPlanet data as three separate TSV files, available for download here.

That's a database with some 17 million records:

  • 5.7 million records: locations (aka places).
  • 2.2 million records: alternative names for each place (aka aliases).
  • 9.6 million records: matrix of neighbours per place (aka adjacencies).

All places have a unique and permanent ID, called WOEID, and they are structured as a tree: each record has a parent_id column. That's fantastic if you need to produce geographic breadcrumbs, or be able to constrain searches to some particular place, like a city. Yahoo! reverse geocoding gives you a WOEID.

I wanted to import this data into Postgres for a Ruby on Rails application I am working on, and to compute the per-row ancestors cache that the ancestry Rails plugin needs. While the TSVs can be imported in minutes using COPY FROM, I was a bit concerned about the ancestry cache column (explained later): this post, for example, says it took more than four hours to compute, and I would rather not need that amount of time when this stuff gets deployed to production.

The migrations are:

create_table :geo_planet_places, :id => false do |t|
  t.primary_key :woeid
  t.string  :iso
  t.string  :name
  t.string  :language
  t.string  :place_type
  t.integer :parent_id, :null => false
  t.string  :ancestry
end
add_index :geo_planet_places, :parent_id
add_index :geo_planet_places, :place_type

# This index needs the varchar_pattern_ops option so that ancestry LIKE
# conditions with a fixed prefix can use it.
execute("CREATE INDEX geo_planet_places_ancestry_index ON geo_planet_places(ancestry varchar_pattern_ops)")

create_table :geo_planet_aliases do |t|
  t.integer :woeid, :null => false
  t.string  :name
  t.string  :name_type
  t.string  :language
end
add_index :geo_planet_aliases, :woeid

create_table :geo_planet_adjacencies do |t|
  t.integer :place_woeid, :null => false
  t.string  :place_iso
  t.integer :neighbour_woeid, :null => false
  t.string  :neighbour_iso
end
add_index :geo_planet_adjacencies, [:place_woeid, :neighbour_woeid]

Rules of thumb for big bulk imports

There's a rule of thumb for big bulk imports: use CSV files and the built-in file import feature of your database. That's COPY FROM in Postgres. If you don't have CSVs, do not switch to INSERTs or UPDATEs; first try a strategy that creates ad-hoc CSVs.

  • Inserts: Generate CSV files and use COPY FROM or the analogous idiom in your database. This is by far the most efficient way to do bulk imports of new data. It is not about your ORM, mind you: raw INSERTs are orders of magnitude slower than a direct CSV file import. So, even if your data is not originally in that format, create a CSV out of it and import that.
  • Updates: Generate a CSV, import it into a temporary table with COPY FROM, and leverage the FROM clause of the UPDATE statement to update everything with a single UPDATE call. That in turn is orders of magnitude faster than issuing individual UPDATEs. Once done, drop the temporary table.
  • Data Migrations: Dump tables with COPY TO, massage as needed, import with COPY FROM.
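As an illustration of the Updates bullet, the staged-update pattern boils down to four statements. The products table and its price column here are hypothetical, invented just for the sketch:

```ruby
# Sketch of the update-via-temp-table pattern. All table and column
# names are hypothetical. In a migration you would run each statement
# with execute(), and stream the CSV into the COPY with the driver.
BULK_UPDATE_SQL = <<-SQL
  CREATE TEMPORARY TABLE product_prices_tmp (id integer, price numeric);

  -- COPY product_prices_tmp FROM STDIN WITH CSV;  (stream the CSV here)

  UPDATE products
     SET price = product_prices_tmp.price
    FROM product_prices_tmp
   WHERE products.id = product_prices_tmp.id;

  DROP TABLE product_prices_tmp;
SQL
```

The single UPDATE joins against the staged rows, so the per-statement overhead is paid once instead of once per row.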

How to do bulk imports in Ruby on Rails

Ruby is a scripting language and I have no problem whatsoever shelling out where appropriate. A very simple call to psql is enough most of the time in the case of Postgres. You have two options there: one is the aforementioned COPY FROM, and the other is the \copy command of psql itself. One important difference between the two is that filenames in COPY FROM/TO are assumed to belong to the server's filesystem, whereas filenames in \copy are assumed to belong to the client's filesystem. I often use the standard streams, though, which are obviously always attached to the client.
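For example, a shell-out built around \copy could be assembled like this. The helper, its arguments, and the database name are all hypothetical, just to show the shape of the command:

```ruby
# Assemble a psql call that runs \copy, so the TSV is read from the
# client's filesystem. Everything here (helper name, database name,
# file name) is hypothetical.
def psql_copy_command(db, table, columns, tsv)
  copy = "\\copy #{table} (#{columns.join(', ')}) " \
         "from '#{tsv}' with delimiter E'\\t' csv header"
  %(psql #{db} -c "#{copy}")
end

cmd = psql_copy_command('geo', 'geo_planet_aliases',
                        %w(woeid name name_type language), 'aliases.tsv')
# system(cmd) would run it; omitted here since it needs a live database.
```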

If, rather than shelling out, you need to do this from within Ruby, Active Record provides no abstraction: you have to grab the raw connection and use the API of the driver. This method, designed for this particular GeoPlanet import, is based on some example code by Jeremy Kemper (thanks!):

def import_tsv(sql, tsv)
  puts "importing #{File.basename(tsv)}"

  pg = connection.raw_connection
  pg.exec("COPY #{sql} FROM STDIN WITH DELIMITER E'\\t' CSV HEADER")

  buf = ''
  begin
    File.open(tsv) do |fh|
      while fh.read(1024, buf)
        pg.put_copy_data(buf)
      end

      pg.put_copy_end
    end
  rescue SystemCallError => err
    # Errno itself is only a namespace module; the I/O exceptions it
    # groups are all subclasses of SystemCallError.
    errmsg = "%s while reading file: %s" % [ err.class.name, err.message ]
    pg.put_copy_end(errmsg)
  end

  # Consume the result so any error reported by the server surfaces.
  pg.get_result
end

A call would look like this:

import_tsv('geo_planet_aliases (woeid, name, name_type, language)', aliases_tsv)

Computing the ancestry cache per row

The ancestry Rails plugin gives you a tree API over a model avoiding recursive SQL thanks to an ancestry column that has the IDs of the ancestors as a path. Ryan Bates demonstrates this plugin in episode #262.

For example, the stadium of FC Barcelona belongs to the place with WOEID 20078609, which has these ancestors:

"1/23424950/12578034/12602124/12693396/753692"

There, 753692 is the parent node of 20078609, and 1 is the WOEID of the Earth, the root node. The ancestry column is a string like that. The descendants of 12578034 are those whose ancestry column is equal to "1/23424950/12578034" or LIKE "1/23424950/12578034/%". The point is that LIKE can use the index on a column if the pattern starts with a fixed prefix rather than a wildcard %. Also, you just split the path to get the ancestors of a given place. All of this is encapsulated by the plugin; you work at the tree API level.
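In plain Ruby, the two operations on that column look roughly like this, using the real path quoted above:

```ruby
# Ancestors: the ancestry column is the path of ancestor WOEIDs from
# the root down, so splitting it gives the ancestor chain.
ancestry = "1/23424950/12578034/12602124/12693396/753692"
ancestor_ids = ancestry.split("/").map(&:to_i)
# => [1, 23424950, 12578034, 12602124, 12693396, 753692]

# Descendants: a LIKE pattern with the node's own path as the fixed
# prefix, which the varchar_pattern_ops index can serve.
node_path = "1/23424950/12578034"
like_pattern = "#{node_path}/%"   # ancestry LIKE '1/23424950/12578034/%'
```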

Well, we need to compute that string for all 5.7 million rows in the places table.

The post I linked above uses a recursive method of the plugin that is clearly meant for small tables. It starts at the root node, computes its ancestry chain, and recurses over its children. No way that's going to work well for 5.7 million nodes with some depth.

The mantra about bulk imports above wins again: get a CSV out of it. I precompute the ancestry of each WOEID from the places TSV itself, not the database, and generate a new TSV, which is like the original one plus the ancestry column. Then, COPY FROM that one instead of Yahoo!'s.

The first baby step to probe how to do this was a Ruby script that iterated over the TSV and built a hash woeid -> parent_id. That needed 5 minutes with REE 1.8.7. Hmmm, suspicious. Some other Ruby interpreters perform better, but that's the one this application uses. I was into Perl for several years; among dynamic languages, Perl is my mother tongue. I wrote the same thing in Perl: 54 seconds. And this was still doing no ancestry computation! Sold.

The script that precomputes all ancestries and generates the new TSV does everything (parsing the original TSV, building the ancestry per WOEID, and printing) in less than 2 minutes.
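The core of that script is tiny: build a hash mapping each WOEID to its parent from the TSV, then walk each place up to the root. A minimal sketch, using the real chain from the FC Barcelona example above as a hand-made parent map instead of the actual TSV:

```ruby
# Walk a child -> parent map up to the root and join the path with
# "/". The sample map below is hand-made from the WOEID chain quoted
# earlier; the real script builds it from the 5.7 million-row TSV.
def ancestry_for(woeid, parent_of)
  path = []
  while (parent = parent_of[woeid])
    path.unshift(parent)
    woeid = parent
  end
  path.join("/")
end

parent_of = {
  753692   => 12693396,
  12693396 => 12602124,
  12602124 => 12578034,
  12578034 => 23424950,
  23424950 => 1  # 1 is the Earth, the root node
}

ancestry_for(753692, parent_of)
# => "1/23424950/12578034/12602124/12693396"
```

Each lookup is O(depth) against an in-memory hash, which is why the whole 5.7 million rows can be processed in a couple of minutes.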

Final result

Importing these three TSVs into Postgres, containing the 17 million records and the ancestry chain cache, needs some 10 minutes.

I imported geodata_admins_7.9.0.tsv today and ran into encoding problems. I changed the following line

pg.exec("COPY #{sql} FROM STDIN WITH DELIMITER E'\\t' CSV HEADER")

to

pg.exec("COPY #{sql} FROM STDIN WITH DELIMITER E'\\t' CSV HEADER ENCODING 'utf-8'")

Works nicely now.

Does anyone know of an alternate download location for the data, now that Yahoo! no longer makes it available? An FTP somewhere? Or, if you have it, I can provide an FTP for the upload.

Thanks, Stuart
