norman/earthdistance.rb

## earthdistance.rb
#!/usr/bin/env ruby
=begin

= Geographic Searches With Postgres's Earthdistance and Cube Extensions

This program shows how to easily create a Postgres database that uses the Cube
and Earthdistance extensions to perform fast queries on geographic data.

Briefly, the problem this code solves is "show me all places within 50
kilometers of New York City."

I've written this brief guide because while the Postgres docs have all the info
you need, I couldn't easily find any tutorial-type documentation for Postgres
beginners.

First, be sure you have installed this program's dependencies:

* Postgres 9.1.x
* The "squirm" Ruby gem
* Ruby 1.9.x

Next, you must download a datafile from Geonames:

    http://download.geonames.org/export/dump/US.zip

is what I'm using here.

You'll also need to create a database:

    createdb geosearch_test

== Earthdistance

The Earthdistance extension adds some stored procedures which can
be used to calculate approximate distances between points on the Earth. It
assumes the Earth is a perfect sphere, so if you need very accurate
calculations, then this is not for you; take a look at PostGIS instead. But for
most websites that need to do basic proximity searches, it's sufficient.

The extension offers two means of calculating distance: with points or with
cubes. Using points is conceptually simpler but has two drawbacks: first, it
becomes less accurate the closer you get to the poles, so geographic searches in
Alaska, for example, are less accurate than in Ecuador. The second drawback is,
you can't use indexes with point-based searches. So let's stick to cubes.

So let's see how we would find all the records within 5 kilometers of New York
City:

  SELECT * FROM cities
  WHERE earth_box(ll_to_earth(40.71427000, -74.00597000), 50000) @> ll_to_earth(lat, lng);


The `ll_to_earth` function returns a point on the surface of the earth when
given a latitide and longitude, and the `earth_box` function will give us a
bounding box. Using the cube operator (`@>`) we select only points that fall
inside the box.

Now this is quite nice, but by itself it will perform a sequential scan on the
table. With my sample database of 100,000 rows, it takes roughly 8 seconds - too
slow to be at all useful.

Luckily Postgres's GIST index type understands how to index cube data, so we
can build an index to speed up this query significantly:

  CREATE INDEX test_index ON cities USING gist (ll_to_earth(lat, lng));

Notice that what we're indexing is the return value of the function
`ll_to_earth`, and not the columns themselves.

So how much faster will the index make the query? After adding it, the query
runs in a mere 1.3 milliseconds. Now *that's* a little better!

If you'd like to get some more background on this topic, the Postgres docs are
brief but comprehensive:

http://www.postgresql.org/docs/9.1/static/earthdistance.html

=end
require "squirm"

Squirm.connect dbname: "geosearch_test", pool_size: 1

Squirm do

  exec "CREATE EXTENSION IF NOT EXISTS cube"
  exec "CREATE EXTENSION IF NOT EXISTS earthdistance"
  exec "DROP TABLE IF EXISTS cities CASCADE"

  exec %q{
    CREATE TABLE cities(
      id         SERIAL NOT NULL PRIMARY KEY,
      name       VARCHAR(255) NOT NULL,
      state      CHAR(2) NOT NULL,
      population INTEGER NOT NULL DEFAULT 0,
      lat        DECIMAL(11,8) NOT NULL,
      lng        DECIMAL(11,8) NOT NULL
    )
  }

  exec %q{
    CREATE OR REPLACE FUNCTION insert_city(_name text, _state text,
      _population integer, _lat decimal, _lng decimal) RETURNS VOID AS $$
      BEGIN
        INSERT INTO cities (name, state, population, lat, lng)
          VALUES (_name, _state, _population, _lat, _lng);
      END;
    $$ LANGUAGE 'PLPGSQL'
  }

  exec "CREATE INDEX test_index ON cities USING gist (ll_to_earth(lat, lng))"

end

File.open("US.txt", "r:utf-8") do |file|

  insert_city = Squirm.procedure "insert_city"

  i = 0;
  file.lines.each do |line|
    fields = line.strip.split("\t")

    # Check "feature class" field, only include populated places.
    next if fields[7] !~ /^PPL/

    data = {
      :name       => fields[1],
      :state      => fields[10].upcase,
      :population => fields[14].to_i,
      :lat        => fields[4].to_f,
      :lng        => fields[5].to_f,
    }
    i = i.next
    insert_city.call(*data.values)
    puts "#{i} - #{data[:name]}"

    # Let's break after 100k since we don't need that many records to prove our
    # concept.
    break if i == 100_000
  end
end
	#!/usr/bin/env ruby
	=begin

	= Geographic Searches With Postgres's Earthdistance and Cube Extensions

	This program shows how to easily create a Postgres database that uses the Cube
	and Earthdistance extensions to perform fast queries on geographic data.

	Briefly, the problem this code solves is "show me all places within 50
	kilometers of New York City."

	I've written this brief guide because while the Postgres docs have all the info
	you need, I couldn't easily find any tutorial-type documentation for Postgres
	beginners.

	First, be sure you have installed this program's dependencies:

	* Postgres 9.1.x
	* The "squirm" Ruby gem
	* Ruby 1.9.x

	Next, you must download a datafile from Geonames:

	http://download.geonames.org/export/dump/US.zip

	is what I'm using here.

	You'll also need to create a database:

	createdb geosearch_test

	== Earthdistance

	The Earthdistance extension adds some stored procedures which can
	be used to calculate approximate distances between points on the Earth. It
	assumes the Earth is a perfect sphere, so if you need very accurate
	calculations, then this is not for you; take a look at PostGIS instead. But for
	most websites that need to do basic proximity searches, it's sufficient.

	The extension offers two means of calculating distance: with points or with
	cubes. Using points is conceptually simpler but has two drawbacks: first, it
	becomes less accurate the closer you get to the poles, so geographic searches in
	Alaska, for example, are less accurate than in Ecuador. The second drawback is,
	you can't use indexes with point-based searches. So let's stick to cubes.

	So let's see how we would find all the records within 5 kilometers of New York
	City:

	SELECT * FROM cities
	WHERE earth_box(ll_to_earth(40.71427000, -74.00597000), 50000) @> ll_to_earth(lat, lng);


	The `ll_to_earth` function returns a point on the surface of the earth when
	given a latitide and longitude, and the `earth_box` function will give us a
	bounding box. Using the cube operator (`@>`) we select only points that fall
	inside the box.

	Now this is quite nice, but by itself it will perform a sequential scan on the
	table. With my sample database of 100,000 rows, it takes roughly 8 seconds - too
	slow to be at all useful.

	Luckily Postgres's GIST index type understands how to index cube data, so we
	can build an index to speed up this query significantly:

	CREATE INDEX test_index ON cities USING gist (ll_to_earth(lat, lng));

	Notice that what we're indexing is the return value of the function
	`ll_to_earth`, and not the columns themselves.

	So how much faster will the index make the query? After adding it, the query
	runs in a mere 1.3 milliseconds. Now that's a little better!

	If you'd like to get some more background on this topic, the Postgres docs are
	brief but comprehensive:

	http://www.postgresql.org/docs/9.1/static/earthdistance.html

	=end
	require "squirm"

	Squirm.connect dbname: "geosearch_test", pool_size: 1

	Squirm do

	exec "CREATE EXTENSION IF NOT EXISTS cube"
	exec "CREATE EXTENSION IF NOT EXISTS earthdistance"
	exec "DROP TABLE IF EXISTS cities CASCADE"

	exec %q{
	CREATE TABLE cities(
	id SERIAL NOT NULL PRIMARY KEY,
	name VARCHAR(255) NOT NULL,
	state CHAR(2) NOT NULL,
	population INTEGER NOT NULL DEFAULT 0,
	lat DECIMAL(11,8) NOT NULL,
	lng DECIMAL(11,8) NOT NULL
	)
	}

	exec %q{
	CREATE OR REPLACE FUNCTION insert_city(_name text, _state text,
	_population integer, _lat decimal, _lng decimal) RETURNS VOID AS $$
	BEGIN
	INSERT INTO cities (name, state, population, lat, lng)
	VALUES (_name, _state, _population, _lat, _lng);
	END;
	$$ LANGUAGE 'PLPGSQL'
	}

	exec "CREATE INDEX test_index ON cities USING gist (ll_to_earth(lat, lng))"

	end

	File.open("US.txt", "r:utf-8") do \|file\|

	insert_city = Squirm.procedure "insert_city"

	i = 0;
	file.lines.each do \|line\|
	fields = line.strip.split("\t")

	# Check "feature class" field, only include populated places.
	next if fields[7] !~ /^PPL/

	data = {
	:name => fields[1],
	:state => fields[10].upcase,
	:population => fields[14].to_i,
	:lat => fields[4].to_f,
	:lng => fields[5].to_f,
	}
	i = i.next
	insert_city.call(*data.values)
	puts "#{i} - #{data[:name]}"

	# Let's break after 100k since we don't need that many records to prove our
	# concept.
	break if i == 100_000
	end
	end