Frackalyzer/GeoCoderRev.py

## README.md

      
    PyGeoCoderRev

Python reverse geo-coder for NCEDC-formatted comma-separated-value earthquake files.
Synopsis

This project, as currently implemented, reverse-geocodes NCEDC-formatted earthquake comma-separated-value (CSV) files.  Reverse-geocoding is the process of obtaining administrative units (e.g. country, state, county/province, city/village) from latitude and longitude (lat-long) coordinates.  With modest effort, the program could be modified to reverse-geocode almost any file or database table.
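To make the idea of reverse-geocoding concrete, here is a minimal sketch that returns the nearest known place for a lat-long pair. It is only an illustration, not how the reverse_geocoder package or this program works internally: the three-entry GAZETTEER, its approximate coordinates, and the function names haversine_km and toy_reverse_geocode are all assumptions made up for this example.

```python
import math

# Hypothetical three-entry gazetteer with approximate coordinates (illustration only).
# A real reverse geocoder relies on a far larger place database.
GAZETTEER = [
    {"name": "Berkeley", "admin1": "California", "cc": "US", "lat": 37.8716, "lon": -122.2727},
    {"name": "Oklahoma City", "admin1": "Oklahoma", "cc": "US", "lat": 35.4676, "lon": -97.5164},
    {"name": "Tokyo", "admin1": "Tokyo", "cc": "JP", "lat": 35.6895, "lon": 139.6917},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat-long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def toy_reverse_geocode(lat, lon):
    """Return the gazetteer entry closest to the given coordinates."""
    return min(GAZETTEER, key=lambda place: haversine_km(lat, lon, place["lat"], place["lon"]))


place = toy_reverse_geocode(35.0, -97.0)
print(place["cc"], place["admin1"], place["name"])
```

The cc, admin1, and name keys mirror the columns that GeoCoderRev.py appends to each output row.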
Source(s) of Earthquake Data in CSV format


ANSS Composite Catalog Search

Choose Catalog in CSV format
Enter Start date,time value with a comma separating the date (yyyy/MM/dd) and time (HH:mm:ss) value
Enter End date,time value with a comma separating the date (yyyy/MM/dd) and time (HH:mm:ss) value, leaving this blank to default to today's date, time value.
Enter a Minimum magnitude value; the recommended minimum, especially for fracking research, is 2.0 or less
Leave Maximum magnitude blank so that all earthquakes above the Minimum magnitude will be included
Choose Send output to an anonymous FTP file on the NCEDC within the "Select output mechanism" section
Enter 10000000 in the Line limit on output box (i.e. 10,000,000 rows max)
Click on the Submit request button
On the "NCEDC_Search_Results" web page that appears after the Submit request button is pressed, wait until a URL link appears, right-click on it, click on the Save link as... sub-menu item, and save the file to a location of your choosing.
The saved file mentioned in the bullet above, and nominally entitled catsearch.12345 with the 12345 being a variable value, is the file to which you'll point the GeoCoderRev.py script when you invoke it to reverse-geocode the rows therein.
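Before running the program on a downloaded file, it may help to confirm that the header row contains the Latitude and Longitude columns that GeoCoderRev.py reads. The following is a sketch under stated assumptions: the helper name missing_columns is hypothetical, and the sample text is made up for illustration rather than taken from a real catalog search.

```python
import csv
import io

# Columns that GeoCoderRev.py looks up in every row.
REQUIRED_COLUMNS = ("Latitude", "Longitude")


def missing_columns(csv_text, delimiter=","):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    header = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Made-up sample text (assumption), not real earthquake data.
sample = (
    "DateTime,Latitude,Longitude,Depth,Magnitude\n"
    "2016/01/01 00:00:00.00,35.5,-97.5,5.0,2.1\n"
)
print(missing_columns(sample))  # []
```

For a real file, read its contents (or just its first line) and pass the text to the helper; a non-empty result means the program would fail when looking up the missing column.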


Invoking the GeoCoderRev.py program


The simplest invocation of the program is as follows:

Navigate to the folder holding the PyGeoCoderRev project.
Open a command terminal from within that folder

Windows: Shift-Right-click within the project's folder, choose Open command window here
Linux (Ubuntu with nautilus-open-terminal installed): Right-click within the project's folder, choose Open terminal
Linux (Ubuntu without nautilus-open-terminal installed): Ctrl-Alt-T, then navigate to the project's folder


Within the command terminal, enter the following command:

python GeoCoderRev.py --src-file-path=/path/to/the/downloaded/NCEDC/earthquake/CSV/file --out-file-path=/path/to/the/resulting/reverse-geocoded/NCEDC/earthquake/CSV/file


Command-line arguments

The GeoCoderRev.py program has more command-line options than just the two shown in the example above.  A quick explanation of each follows:


--src-file-path: The required path to the raw NCEDC-formatted earthquake source file in CSV format.


--src-delimiter: The character that separates each value within the file. The default is a comma ,.


--src-quotechar: The character that surrounds each value within the file, should it contain a delimiter. The default is a double-quote ".


--src-quotemode: The quoting mode, which defaults to QUOTE_MINIMAL.  Valid choices are QUOTE_MINIMAL, QUOTE_NONE, QUOTE_ALL, QUOTE_NONNUMERIC.


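--out-file-path: The path to the reverse-geocoded NCEDC-formatted earthquake output file in CSV format.  The default is None, in which case the path is built from the --out-file-name-folder, --out-file-name-prefix, --out-file-name-suffix, and --out-file-name-extension values.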


--out-delimiter: The character that separates each value within the file. The default is a comma ,.


--out-quotechar: The character that surrounds each value within the file, should it contain a delimiter. The default is a double-quote ".


--out-quotemode: The quoting mode, which defaults to QUOTE_MINIMAL.  Valid choices are QUOTE_MINIMAL, QUOTE_NONE, QUOTE_ALL, QUOTE_NONNUMERIC.


--out-file-name-folder: The output file's destination folder.  The default is None, in which case the source file's folder is used.


--out-file-name-prefix: The output file name's prefix, default is NCEDC_earthquakes.


--out-file-name-suffix: The output file name's suffix, default is _reverse_geocoded.


--out-file-name-extension: The output file name's extension, default is .csv.


--max-rows: Intended mostly for testing, this integer argument defaults to 0, which means all rows are processed.  Any positive integer limits processing to that many rows; for example, 10 means only ten rows are processed.


--flush-rows: This integer value, which defaults to 1000, controls how often a progress message is output to the console and how often buffered rows are "flushed" to the output file.


-h or --help: Outputs command-line usage information to the console, describing the program's command-line arguments, and then terminates the program without any further processing.
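When --out-file-path is omitted, the program assembles the output path from the folder, prefix, suffix, and extension options. The sketch below mirrors that logic using the defaults documented above; the function name default_out_path is hypothetical and exists only for this example.

```python
import os


def default_out_path(src_file_path, folder=None, prefix="NCEDC_earthquakes",
                     suffix="_reverse_geocoded", extension=".csv"):
    """Build the output path used when no explicit output file path is given."""
    # With no folder specified, fall back to the folder of the source file.
    if folder is None:
        folder = os.path.dirname(src_file_path)
    # The file name is the plain concatenation of prefix, suffix, and extension.
    return os.path.join(folder, prefix + suffix + extension)


print(default_out_path(os.path.join("data", "catsearch.12345")))
print(default_out_path(os.path.join("data", "catsearch.12345"), folder="results", prefix="quakes"))
```

Note that the source file's own name plays no part in the result; only its folder is reused.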


Installation

PyGeoCoderRev is a Python 3 project, and as such requires a compatible Python 3 interpreter.  In addition, the program uses the reverse-geocoder package; installation instructions for it appear below.
For first time installation,
$ pip install reverse_geocoder

Or upgrade an existing installation using,
$ pip install --upgrade reverse_geocoder

The package can be found on PyPI.
Dependencies (Python 3 packages)


scipy
numpy
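A quick way to confirm that the packages named above are importable in the current interpreter is sketched below. The helper name check_packages is hypothetical; it relies only on the standard library's importlib.util.find_spec and does not import the packages themselves.

```python
import importlib.util


def check_packages(names=("reverse_geocoder", "scipy", "numpy")):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, present in check_packages().items():
    print("{}: {}".format(name, "installed" if present else "missing"))
```

Any package reported as missing can be installed with pip as shown above.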

License

Copyright © 2016 Khepry Quixote
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Apache License, Version 2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

  
## GeoCoderRev.py
# ========================================================================
#
# Copyright © 2016 Khepry Quixote
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ========================================================================

import argparse
import csv
import io
import os

from pprint import pprint
from time import time

import reverse_geocoder as rg

pgm_name = 'GeoCoderRev.py'
pgm_version = '1.0'

quotemode_choices = ['QUOTE_MINIMAL', 'QUOTE_NONE', 'QUOTE_ALL', 'QUOTE_NONNUMERIC']

def quotemode_xlator(quote_mode_str):

    quote_mode_val = csv.QUOTE_MINIMAL

    if quote_mode_str.upper() == 'QUOTE_MINIMAL':
        quote_mode_val = csv.QUOTE_MINIMAL
    elif quote_mode_str.upper() == 'QUOTE_ALL':
        quote_mode_val = csv.QUOTE_ALL
    elif quote_mode_str.upper() == 'QUOTE_NONE':
        quote_mode_val = csv.QUOTE_NONE
    elif quote_mode_str.upper() == 'QUOTE_NONNUMERIC':
        quote_mode_val = csv.QUOTE_NONNUMERIC

    return quote_mode_val

arg_parser = argparse.ArgumentParser(prog='%s' % pgm_name, description='Reverse geo-code an NCEDC-formatted earthquake CSV file.')

arg_parser.add_argument('--src-file-path', required=True, help='source file path')
arg_parser.add_argument('--src-delimiter', default=',', help='source file delimiter character')
arg_parser.add_argument('--src-quotechar', default='"', help='source file quote character')
arg_parser.add_argument('--src-quotemode', dest='src_quotemode_str', default='QUOTE_MINIMAL', choices=quotemode_choices, help='source file quoting mode (default: %s)' % 'QUOTE_MINIMAL')

arg_parser.add_argument('--out-file-path', default=None, help='output file path (default: None, same path as source file)')
arg_parser.add_argument('--out-delimiter', default=',', help='output file delimiter character')
arg_parser.add_argument('--out-quotechar', default='"', help='output file quote character')
arg_parser.add_argument('--out-quotemode', dest='out_quotemode_str', default='QUOTE_MINIMAL', choices=quotemode_choices, help='output file quoting mode (default: %s)' % 'QUOTE_MINIMAL')

arg_parser.add_argument('--out-file-name-folder', default=None, help='output file name folder (default: None, same folder as source file)')
arg_parser.add_argument('--out-file-name-prefix', default='NCEDC_earthquakes', help='output file name prefix (default: NCEDC_earthquakes)')
# the suffix must not include the extension, which is appended separately
arg_parser.add_argument('--out-file-name-suffix', default='_reverse_geocoded', help='output file name suffix (default: _reverse_geocoded)')
arg_parser.add_argument('--out-file-name-extension', default='.csv', help='output file name extension (default: .csv)')

arg_parser.add_argument('--max-rows', type=int, default=0, help='maximum rows to process, 0 means unlimited')
arg_parser.add_argument('--flush-rows', type=int, default=1000, help='flush rows interval')

arg_parser.add_argument('--version', action='version', version='version=%s %s' % (pgm_name, pgm_version))

args = arg_parser.parse_args()

if args.out_file_path is None:
    if args.out_file_name_folder is None:
        args.out_file_name_folder = os.path.dirname(args.src_file_path)
    args.out_file_path = os.path.join(args.out_file_name_folder, args.out_file_name_prefix + args.out_file_name_suffix + args.out_file_name_extension)

args.src_quotemode_enm = quotemode_xlator(args.src_quotemode_str)
args.out_quotemode_enm = quotemode_xlator(args.out_quotemode_str)

args.max_rows = abs(args.max_rows)
args.flush_rows = abs(args.flush_rows)

if args.src_file_path.startswith('~'):
    args.src_file_path = os.path.expanduser(args.src_file_path)
args.src_file_path = os.path.abspath(args.src_file_path)

if args.out_file_path.startswith('~'):
    args.out_file_path = os.path.expanduser(args.out_file_path)
args.out_file_path = os.path.abspath(args.out_file_path)

print ('Reverse-geocoding source NCEDC earthquakes file: "%s"' % args.src_file_path)
print ('Outputting to the target NCEDC earthquakes file: "%s"' % args.out_file_path)
print ('')

print('Command line args:')
pprint (vars(args))
print('')

# beginning time hack
bgn_time = time()

# initialize
# row counters
row_count = 0
out_count = 0

# if the source file exists
if os.path.exists(args.src_file_path):

    # open the target file for writing
    with io.open(args.out_file_path, 'w', newline='') as out_file:

        # open the source file for reading
        with io.open(args.src_file_path, 'r', newline='') as src_file:

            # open a CSV file dictionary reader object
            csv_reader = csv.DictReader(src_file, delimiter=args.src_delimiter, quotechar=args.src_quotechar, quoting=args.src_quotemode_enm)

            # obtain the field names from
            # the first line of the source file
            fieldnames = csv_reader.fieldnames
            # append the reverse geo-coding
            # result fields to field names list
            fieldnames.append('cc')
            fieldnames.append('admin1')
            fieldnames.append('admin2')
            fieldnames.append('name')

            # instantiate the CSV dictionary writer object with the modified field names list
            csv_writer = csv.DictWriter(out_file, delimiter=args.out_delimiter, quotechar=args.out_quotechar, quoting=args.out_quotemode_enm, fieldnames=fieldnames)

            # output the header row
            csv_writer.writeheader()

            # beginning time hack
            bgn_time = time()

            # read row-by-row
            for row in csv_reader:

                row_count += 1

                # convert string lat/lon
                # to floating-point values
                latitude = float(row['Latitude'])
                longitude = float(row['Longitude'])

                # instantiate coordinates tuple
                coordinates = (latitude, longitude)

                # search for the coordinates
                # returning the cc, admin1, admin2, and name values
                # using a mode 1 (single-threaded) search
                results = rg.search(coordinates, mode=1) # default mode = 2

                # if results obtained
                if results is not None:
                    # result-by-result
                    for result in results:
                        # map result values
                        # to the row values
                        row['cc'] = result['cc']
                        row['admin1'] = result['admin1']
                        row['admin2'] = result['admin2']
                        row['name'] = result['name']
                        # output a row
                        csv_writer.writerow(row)
                        out_count += 1
                else:
                    # map empty values
                    # to the row values
                    row['cc'] = ''
                    row['admin1'] = ''
                    row['admin2'] = ''
                    row['name'] = ''
                    # output a row
                    csv_writer.writerow(row)
                    out_count += 1

                # if row count equals or exceeds max rows
                if args.max_rows > 0 and row_count >= args.max_rows:
                    # break out of reading loop
                    break

                # if row count is a multiple of the flush rows value
                # (a value of 0 disables flushing and progress messages)
                if args.flush_rows > 0 and row_count % args.flush_rows == 0:

                    # flush accumulated
                    # rows to target file
                    out_file.flush()

                    # ending time hack
                    end_time = time()
                    # compute records/second
                    seconds = end_time - bgn_time
                    if seconds > 0:
                        rcds_per_second = row_count / seconds
                    else:
                        rcds_per_second = 0
                    # output progress message
                    message = "Processed: {:,} rows in {:,.0f} seconds @ {:,.0f} records/second".format(row_count, seconds, rcds_per_second)
                    print(message)

else:

    print ('NCEDC-formatted Earthquake file not found: "%s"' % args.src_file_path)

# ending time hack
end_time = time()
# compute records/second
seconds = end_time - bgn_time
if seconds > 0:
    rcds_per_second = row_count / seconds
else:
    rcds_per_second = row_count
# output end-of-processing messages
message = "Processed: {:,} rows in {:,.0f} seconds @ {:,.0f} records/second".format(row_count, seconds, rcds_per_second)
print(message)
print('Output file path: "%s"' % args.out_file_path)
print("Processing finished, {:,} rows output!".format(out_count))