Geocode as many addresses as you'd like with a powerful Python and Google Geocoding API combination
"""
Python script for batch geocoding of addresses using the Google Geocoding API.
This script allows for massive lists of addresses to be geocoded for free by pausing when the
geocoder hits the free rate limit set by Google (2500 per day). If you have an API key for paid
geocoding from Google, set it in the API key section.
Addresses for geocoding can be specified in a list of strings "addresses". In this script, addresses
come from a csv file with a column "Address". Adjust the code to your own requirements as needed.
After every 500 successful geocode operations, a temporary file with results is recorded in case of
script failure / loss of connection later.
Addresses and data are held in memory, so this script may need to be adjusted to process files line
by line if you are processing millions of entries.
Shane Lynn
5th November 2016
"""
import pandas as pd
import requests
import logging
import time
logger = logging.getLogger("root")
logger.setLevel(logging.DEBUG)
# create console handler
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)
#------------------ CONFIGURATION -------------------------------
# Set your Google API key here.
# Even if using the free 2500 queries a day, it's worth getting an API key since the rate limit is 50 / second.
# With API_KEY = None, you will run into a 2 second delay every 10 requests or so.
# With a "Google Maps Geocoding API" key from https://console.developers.google.com/apis/,
# the daily limit will be 2500, but at a much faster rate.
# Example: API_KEY = 'AIzaSyC9azed9tLdjpZNjg2_kVePWvMIBq154eA'
API_KEY = None
# Backoff time sets how many minutes to wait between google pings when your API limit is hit
BACKOFF_TIME = 30
# Set your output file name here.
output_filename = 'data/output-2015.csv'
# Set your input file here
input_filename = "data/PPR-2015.csv"
# Specify the column name in your input data that contains addresses here
address_column_name = "Address"
# Return Full Google Results? If True, full JSON results from Google are included in output
RETURN_FULL_RESULTS = False
#------------------ DATA LOADING --------------------------------
# Read the data to a Pandas Dataframe
data = pd.read_csv(input_filename, encoding='utf8')
if address_column_name not in data.columns:
    raise ValueError("Missing Address column in input data")
# Form a list of addresses for geocoding:
# Make a big list of all of the addresses to be processed.
addresses = data[address_column_name].tolist()
# **** DEMO DATA / IRELAND SPECIFIC! ****
# We know that these addresses are in Ireland, and there's a column for county, so add this for accuracy.
# (remove this line / alter for your own dataset)
addresses = (data[address_column_name] + ',' + data['County'] + ',Ireland').tolist()
#------------------ FUNCTION DEFINITIONS ------------------------
def get_google_results(address, api_key=None, return_full_response=False):
    """
    Get geocode results from Google Maps Geocoding API.
    Note that in the case of multiple Google geocode results, this function returns details of the FIRST result.
    @param address: String address as accurate as possible. For example "18 Grafton Street, Dublin, Ireland"
    @param api_key: String API key if present from Google.
                    If supplied, requests will use your allowance from the Google API. If not, you
                    will be limited to the free usage of 2500 requests per day.
    @param return_full_response: Boolean to indicate if you'd like to return the full response from Google. This
                    is useful if you'd like additional location details for storage or parsing later.
    """
    # Set up your Geocoding url. The address is passed through `params` so that
    # requests URL-encodes it; a raw '#' or '&' would otherwise cut the query short.
    geocode_url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {"address": address}
    if api_key is not None:
        params["key"] = api_key

    # Ping Google for the results:
    results = requests.get(geocode_url, params=params)
    # Results will be in JSON format - convert to dict using requests functionality
    results = results.json()
    # An error response may not carry a 'results' list, so default to an empty one.
    matches = results.get('results', [])

    # If there are no results or an error, return empty results.
    if len(matches) == 0:
        output = {
            "formatted_address": None,
            "latitude": None,
            "longitude": None,
            "accuracy": None,
            "google_place_id": None,
            "type": None,
            "postcode": None
        }
    else:
        answer = matches[0]
        output = {
            "formatted_address": answer.get('formatted_address'),
            "latitude": answer.get('geometry').get('location').get('lat'),
            "longitude": answer.get('geometry').get('location').get('lng'),
            "accuracy": answer.get('geometry').get('location_type'),
            "google_place_id": answer.get("place_id"),
            "type": ",".join(answer.get('types')),
            "postcode": ",".join([x['long_name'] for x in answer.get('address_components')
                                  if 'postal_code' in x.get('types')])
        }

    # Append some other details:
    output['input_string'] = address
    output['number_of_results'] = len(matches)
    output['status'] = results.get('status')
    if return_full_response is True:
        output['response'] = results

    return output
#------------------ PROCESSING LOOP -----------------------------
# Ensure, before we start, that the API key is ok/valid, and internet access is ok
test_result = get_google_results("London, England", API_KEY, RETURN_FULL_RESULTS)
if (test_result['status'] != 'OK') or (test_result['formatted_address'] != 'London, UK'):
    logger.warning("There was an error when testing the Google Geocoder.")
    raise ConnectionError('Problem with test results from Google Geocode - check your API key and internet connection.')
# Create a list to hold results
results = []
# Go through each address in turn
for address in addresses:
    # While the address geocoding is not finished:
    geocoded = False
    while geocoded is not True:
        # Geocode the address with google
        try:
            geocode_result = get_google_results(address, API_KEY, return_full_response=RETURN_FULL_RESULTS)
        except Exception as e:
            logger.exception(e)
            logger.error("Major error with {}".format(address))
            logger.error("Skipping!")
            # No result was returned for this address, so leave the retry loop
            # instead of inspecting a stale or undefined geocode_result below.
            break

        # If we're over the API limit, backoff for a while and try again later.
        if geocode_result['status'] == 'OVER_QUERY_LIMIT':
            logger.info("Hit Query Limit! Backing off for a bit.")
            time.sleep(BACKOFF_TIME * 60)  # BACKOFF_TIME is in minutes
            geocoded = False
        else:
            # If we're ok with API use, save the results
            # Note that the results might be empty / non-ok - log this
            if geocode_result['status'] != 'OK':
                logger.warning("Error geocoding {}: {}".format(address, geocode_result['status']))
            logger.debug("Geocoded: {}: {}".format(address, geocode_result['status']))
            results.append(geocode_result)
            geocoded = True

    # Print status every 100 addresses
    if len(results) % 100 == 0:
        logger.info("Completed {} of {} addresses".format(len(results), len(addresses)))

    # Every 500 addresses, save progress to file (in case of a failure so you have something!)
    if len(results) % 500 == 0:
        pd.DataFrame(results).to_csv("{}_bak".format(output_filename), encoding='utf8')
# All done
logger.info("Finished geocoding all addresses")
# Write the full results to csv using the pandas library.
pd.DataFrame(results).to_csv(output_filename, encoding='utf8')

navaed01 commented Feb 22, 2017

awesome code. clear and well written annotation, great for noobs like me

andrasvereckei commented Jun 7, 2017

Thanks for this code. Works well for forward and reverse geocoding also. Great!

Jiayi-Yang commented Aug 11, 2017

I got same 'status' error when trying both with or without my API key. Please help.
Traceback (most recent call last):
File "C:/Users//Documents/python batch geocoding.py", line 121, in
if (test_result['status'] != 'OK') or (test_result['formatted_address'] != 'London, UK'):
KeyError: 'status'

FalkoKoehler commented Jan 17, 2018

Hi all, I am a complete noob to Python, but the code seems manageable for me. Can somebody share some background info on how to get code like this working? Which tools do I need to install (x64 Win 10), etc.?

prodnr8 commented Mar 16, 2018

This helped a lot, thank you!

jtindle commented May 10, 2018

Thanks for this! Helped a lot and was easy to read. By the way, I had to add utf-8 encoding to line 168 to get the in-progress file to save. FYI in case anyone else runs into that issue.

christianjgentry commented May 14, 2018

Awesome code, and incredible documentation. Thanks!

kkraoj commented May 22, 2018

Fantastic!

JVA93 commented May 22, 2018

How can I adapt this code for reverse geocoding, I'm completely lost!!

I have a csv file with Lat and Long.
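
One possible adaptation, as a minimal sketch: the Geocoding API accepts a `latlng=<lat>,<lng>` query parameter in place of `address`, and the response has the same `results` / `status` layout, so the parsing in `get_google_results` can stay as it is. The helper name `build_reverse_params` and the column names `Lat` and `Long` are assumptions taken from the question, not part of the gist.

```python
import csv
import io

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_reverse_params(lat, lng, api_key=None):
    """Build the query parameters for one reverse geocoding request (hypothetical helper)."""
    params = {"latlng": "{},{}".format(float(lat), float(lng))}
    if api_key is not None:
        params["key"] = api_key
    return params


# In-memory stand-in for open("your-file.csv", newline=""); column names are assumed.
csv_file = io.StringIO("Lat,Long\n53.3498,-6.2603\n51.5074,-0.1278\n")
param_list = [build_reverse_params(row["Lat"], row["Long"])
              for row in csv.DictReader(csv_file)]
print(param_list[0])
```

Each dictionary can then be sent with `requests.get(GEOCODE_URL, params=params)` inside the existing processing loop.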

RenatoJoseVieira commented May 24, 2018

Congratulations!
Easy, clear and practical!

botdotcom commented Jun 7, 2018

Thanks for the well-commented code!

ricsteve commented Jun 22, 2018

Great work. I've managed to adapt the code to connect to a SQL database, call a stored procedure to pass into the Pandas dataframe. I do wonder if anyone has any tip to have the output table include a row from the original input. My tables have a unique ID column for each address and I'd like to maintain that in the output. I can get the column to appear in the output, but each row has the complete list of IDs so it's not participating in the loop.
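
A minimal sketch of one way to keep the ID with each result: loop over whole rows rather than a bare list of addresses, and copy that row's own ID into the result dictionary before appending it. The names `geocode_with_ids` and `fake_geocode`, and the `ID` / `Address` keys, are hypothetical; `fake_geocode` only stands in for `get_google_results` so the example runs offline.

```python
def fake_geocode(address):
    """Stand-in for get_google_results so the sketch runs without network access."""
    return {"input_string": address, "status": "OK"}


def geocode_with_ids(rows, geocode, id_key="ID", address_key="Address"):
    """Geocode each row and attach that row's own ID to its result."""
    results = []
    for row in rows:
        result = geocode(row[address_key])
        result[id_key] = row[id_key]  # a single value per result, not the whole column
        results.append(result)
    return results


rows = [
    {"ID": 101, "Address": "18 Grafton Street, Dublin, Ireland"},
    {"ID": 102, "Address": "1 Patrick Street, Cork, Ireland"},
]
results = geocode_with_ids(rows, fake_geocode)
print([r["ID"] for r in results])
```

With a pandas DataFrame, `data.to_dict("records")` produces a list of row dictionaries in this shape.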

karna2017 commented Jun 23, 2018

Everything is great until I start the test. Any idea why?
I get the following error while testing:


gaierror Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
140 conn = connection.create_connection(
--> 141 (self.host, self.port), self.timeout, **extra_kw)
142

/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
59
---> 60 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
61 af, socktype, proto, canonname, sa = res

/anaconda3/lib/python3.6/socket.py in getaddrinfo(host, port, family, type, proto, flags)
744 addrlist = []
--> 745 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
746 af, socktype, proto, canonname, sa = res

gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

NewConnectionError Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
600 body=body, headers=headers,
--> 601 chunked=chunked)
602

/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
345 try:
--> 346 self._validate_conn(conn)
347 except (SocketTimeout, BaseSSLError) as e:

/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
849 if not getattr(conn, 'sock', None): # AppEngine might not have .sock
--> 850 conn.connect()
851

/anaconda3/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
283 # Add certificate verification
--> 284 conn = self._new_conn()
285

/anaconda3/lib/python3.6/site-packages/urllib3/connection.py in _new_conn(self)
149 raise NewConnectionError(
--> 150 self, "Failed to establish a new connection: %s" % e)
151

NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x11a934898>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
439 retries=self.max_retries,
--> 440 timeout=timeout
441 )

/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
638 retries = retries.increment(method, url, error=e, _pool=self,
--> 639 _stacktrace=sys.exc_info()[2])
640 retries.sleep()

/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
387 if new_retry.is_exhausted():
--> 388 raise MaxRetryError(_pool, url, error or ResponseError(cause))
389

MaxRetryError: HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=London,%20England (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x11a934898>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
in ()
2
3 # Ensure, before we start, that the API key is ok/valid, and internet access is ok
----> 4 test_result = get_google_results("London, England", API_KEY, RETURN_FULL_RESULTS)
5 if (test_result['status'] != 'OK') or (test_result['formatted_address'] != 'London, UK'):
6 logger.warning("There was an error when testing the Google Geocoder.")

in get_google_results(address, api_key, return_full_response)
23
24 # Ping google for the reuslts:
---> 25 results = requests.get(geocode_url)
26 # Results will be in JSON format - convert to dict using requests functionality
27 results = results.json()

/anaconda3/lib/python3.6/site-packages/requests/api.py in get(url, params, **kwargs)
70
71 kwargs.setdefault('allow_redirects', True)
---> 72 return request('get', url, params=params, **kwargs)
73
74

/anaconda3/lib/python3.6/site-packages/requests/api.py in request(method, url, **kwargs)
56 # cases, and look like a memory leak in others.
57 with sessions.Session() as session:
---> 58 return session.request(method=method, url=url, **kwargs)
59
60

/anaconda3/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
506 }
507 send_kwargs.update(settings)
--> 508 resp = self.send(prep, **send_kwargs)
509
510 return resp

/anaconda3/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
616
617 # Send the request
--> 618 r = adapter.send(request, **kwargs)
619
620 # Total elapsed time of the request (approximately)

/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
506 raise SSLError(e, request=request)
507
--> 508 raise ConnectionError(e, request=request)
509
510 except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=London,%20England (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x11a934898>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

bandiatindra commented Jul 19, 2018

If my internet connection goes away after hitting the 2500 limit on day 1, how will the code re-run on the next day?
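
The script only resumes by itself while the process stays alive and sleeping. If it stops, one option is to restart it manually and skip whatever is already in the `_bak` file. The following is a rough sketch of that idea, assuming the backup CSV contains the `input_string` column that `get_google_results` writes; the helper names are hypothetical.

```python
import csv
import io


def load_done_addresses(backup_file):
    """Return the set of input strings already present in a backup CSV."""
    reader = csv.DictReader(backup_file)
    return {row["input_string"] for row in reader}


def remaining_addresses(addresses, done):
    """Keep only the addresses that were not geocoded in the earlier run."""
    return [a for a in addresses if a not in done]


# In-memory stand-in for open("data/output-2015.csv_bak", newline="")
backup = io.StringIO('input_string,status\n"A Street, Dublin",OK\n')
done = load_done_addresses(backup)
todo = remaining_addresses(["A Street, Dublin", "B Road, Cork"], done)
print(todo)
```

The rows from the backup would still need to be combined with the new results before the final output is written. Because the backup is only saved every 500 results, up to 499 addresses may be geocoded again.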

voidfire commented Sep 22, 2018

@andrasvereckei how does it work for reverse? did you modify it appropriately?

tobiz commented Oct 9, 2018

October 2018. Does this still work now that google is enforcing its requirement to use keyed access to its geocoding service?

saurabh0777 commented Oct 19, 2018

It is not working now

Traceback (most recent call last):
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'County'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/saurabhs/PycharmProjects/Test/Test.py", line 62, in
addresses = (data[address_column_name] + ',' + data['County'] + ',Ireland').tolist()
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2688, in getitem
return self._getitem_column(key)
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "C:\Users\saurabhs\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'County'

Process finished with exit code 1
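
The traceback points at the Ireland-specific demo line, which expects a `County` column that this input file does not have; the script's own comment says to remove or alter that line for other datasets. As a sketch of a more forgiving alternative, the hypothetical helper below appends extra columns only when they are present in the row.

```python
def build_address(row, address_key="Address", extra_keys=("County",), suffix=None):
    """Join the address with any extra columns that exist and are non-empty."""
    parts = [row[address_key]]
    parts += [row[key] for key in extra_keys if row.get(key)]
    if suffix:
        parts.append(suffix)
    return ",".join(parts)


rows = [
    {"Address": "18 Grafton Street", "County": "Dublin"},
    {"Address": "10 Downing Street, London"},  # no County column
]
addresses = [build_address(rows[0], suffix="Ireland"), build_address(rows[1])]
print(addresses)
```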

renauld94 commented Nov 8, 2018

After geocoding about 50 addresses always getting : Hit Query Limit! Backing off for a bit.
Anyone have this problem?

plasmonresonator commented Nov 11, 2018

Works awesome, thank you so much!

roushaniiitmk commented Nov 13, 2018

You have mentioned that in the case of multiple Google geocode results, this function returns details of the FIRST result. What if the first result is wrong or somewhere else? I have found some cases where Google shows three partial matches, of which the third is correct but the first gives the lat-long of some other country.
Is it possible to filter addresses by city or zip code so that we get the most likely correct address (taking the second or third result instead of the first)?
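
One way to do this after the response arrives, as a sketch only: scan the `results` list for the first candidate whose `postal_code` component matches a known value, and fall back to the first result otherwise. `pick_result` is a hypothetical helper, and the sample data below is invented to mirror the `address_components` structure the script already reads. The API also documents a `components` request parameter for restricting results by country or postal code; that is not shown here.

```python
def pick_result(results, postal_code):
    """Return the first result matching postal_code, else the first result, else None."""
    for candidate in results:
        for component in candidate.get("address_components", []):
            if ("postal_code" in component.get("types", [])
                    and component.get("long_name") == postal_code):
                return candidate
    return results[0] if results else None


# Invented sample mimicking three partial matches from the API.
sample = [
    {"formatted_address": "Main St, Elsewhere",
     "address_components": [{"long_name": "99999", "types": ["postal_code"]}]},
    {"formatted_address": "Main St, No Postcode", "address_components": []},
    {"formatted_address": "Main St, Right Place",
     "address_components": [{"long_name": "12345", "types": ["postal_code"]}]},
]
best = pick_result(sample, "12345")
print(best["formatted_address"])
```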

agrawalparth08 commented Nov 26, 2018

@renauld94

> After geocoding about 50 addresses always getting : Hit Query Limit! Backing off for a bit.
> Anyone have this problem?

Mine gets stuck after 130. Some weird throttling.

[Update] I got to know from the data logs that the string had # in it which was making it behave weirdly. So replace #, check the string on which it gets stuck and replace the culprit. It should work then.
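
A likely explanation is that `#` starts the fragment part of a URL, so everything after it is never sent as part of the query. Rather than replacing the character by hand, the address can be URL-encoded. The sketch below uses the standard library to show the difference; the sample address is made up. Passing the address through the `params` argument of `requests.get` applies the same kind of encoding.

```python
from urllib.parse import urlencode, urlsplit

base = "https://maps.googleapis.com/maps/api/geocode/json"
address = "Apt #4, 18 Grafton Street, Dublin"

# Naive formatting: '#' splits the URL and the rest becomes a fragment.
naive_url = base + "?address={}".format(address)
lost = urlsplit(naive_url).fragment

# Encoded: '#' becomes %23 and stays inside the query string.
safe_url = base + "?" + urlencode({"address": address})

print(lost)
print(safe_url)
```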
