Copy `config.json.sample` to `config.json` and set the variables as described below (or skip ahead to the sample config file).
- `publish_to` can be `FILE` or `COMPORT`. If it's `FILE`, extracted data will be written to a local CSV file, which is perfect for testing the output of queries. Once queries are finalized, setting it to `COMPORT` will send data to the Comport web application.
- `extract_from_date` is a date in the format `month/year`, like `4/2014`. If `publish_to` is set to `FILE`, this date will be used as the start date for extraction. If `publish_to` is set to `COMPORT`, this date will be ignored, and the extractor will get its start date from the Comport web application.
- `comport_baseurl` is the URL of the Comport app, probably `https://www.projectcomport.org/`.
- `comport_username` and `comport_password` are the credentials the extractor will use to authenticate with the Comport app; you should receive them from the administrator.
- `comport_key` is a string that will be used to hash values that should be anonymized.
- `sql_server` is the address of the SQL server.
- `sql_server_database` is the name of the database on the server that contains the data to be extracted.
- `sql_server_username` and `sql_server_password` are the credentials the extractor will use to authenticate with the SQL database.
- `queries_dir` is the directory where `sql` files containing the database queries are kept.
- `templates_dir` is the directory where `json` files containing the data templates are kept. These templates describe how the results of the queries are processed and mapped.
- `proxy_http` and `proxy_https` are the URLs of proxy servers for HTTP and HTTPS traffic. The `comport_username` and `comport_password` will be automatically added to the URLs if necessary.
- `datasets` is a list of objects that each describe a dataset to be extracted, transformed, and loaded. Each object has:
    - `name`: a short, unique name that describes the dataset, like "complaints"
    - `query_filename`: the name of the `sql` file containing a database query, kept in the `queries_dir`. See The Query Files for an example of what the query files should look like.
    - `template_filename`: the name of the `json` file containing a dataset template, kept in the `templates_dir`. See The Template Files for an example of what the template files should look like.
    - `endpoint`: the endpoint on the Comport site that the data will be sent to, like "data/complaints"

  A dataset can optionally include its own `sql_server`, `sql_server_database`, `sql_server_username`, and `sql_server_password` parameters. If all of these parameters are included in a dataset object, they will override the credentials in the main body of the configuration for that dataset only.
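The per-dataset override logic can be pictured as a small merge step. This is a minimal sketch, not the extractor's actual code; the helper name `sql_credentials_for` is hypothetical:

```python
import json

# The four parameters a dataset object may override.
SQL_KEYS = ("sql_server", "sql_server_database",
            "sql_server_username", "sql_server_password")

def sql_credentials_for(config, dataset):
    """Return the SQL credentials to use for one dataset: the dataset's
    own values if it supplies all four, otherwise the top-level ones."""
    if all(key in dataset for key in SQL_KEYS):
        return {key: dataset[key] for key in SQL_KEYS}
    return {key: config[key] for key in SQL_KEYS}

config = json.loads("""{
    "sql_server": "SERVER-NAME",
    "sql_server_database": "DBNAME",
    "sql_server_username": "username",
    "sql_server_password": "password",
    "datasets": [
        {"name": "complaints",
         "query_filename": "query-complaints.sql",
         "template_filename": "template-complaints.json",
         "endpoint": "data/complaints"}
    ]
}""")

# The "complaints" dataset supplies no credentials of its own,
# so the top-level ones are used.
creds = sql_credentials_for(config, config["datasets"][0])
```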
When your config file is filled out, it should look something like this:
```json
{
  "publish_to": "FILE",
  "extract_from_date": "1/2005",
  "comport_baseurl": "https://www.projectcomport.org/",
  "comport_username": "Sample_Police_Department-extractor",
  "comport_password": "f5e7db81-678d-451b-8bc0-029c2687206a",
  "comport_key": "grgF5u9ZTF2yALj",
  "sql_server": "SERVER-NAME",
  "sql_server_database": "DBNAME",
  "sql_server_username": "username",
  "sql_server_password": "password",
  "queries_dir": "queries",
  "templates_dir": "templates",
  "proxy_http": "http://proxy.example.com/",
  "proxy_https": "http://proxy.example.com/",
  "datasets": [
    {
      "name": "complaints",
      "query_filename": "query-complaints.sql",
      "template_filename": "template-complaints.json",
      "endpoint": "data/complaints"
    },
    {
      "name": "ois",
      "query_filename": "query-ois.sql",
      "template_filename": "template-ois.json",
      "endpoint": "data/OIS"
    },
    {
      "name": "uof",
      "query_filename": "query-uof.sql",
      "template_filename": "template-uof.json",
      "endpoint": "data/UOF"
    }
  ]
}
```
Each dataset requires a query file and a template file. The locations and names of these files are defined in `config.json`. The query file contains a SQL query that is used to extract data from the database. Here's a very simple example that pulls data from the `IA_ADM.INCIDENTS` table:
```sql
SELECT
    IA_ADM.INCIDENTS.incnum,
    IA_ADM.INCIDENTS.service_type,
    IA_ADM.INCIDENTS.occurred_dt,
    IA_ADM.INCIDENTS.occurred_tm,
    IA_ADM.INCIDENTS.udtext24a,
    IA_ADM.INCIDENTS.udtext24b,
    IA_ADM.INCIDENTS.udtext24c,
    IA_ADM.INCIDENTS.udtext24d,
    IA_ADM.INCIDENTS.disposition,
    IA_ADM.INCIDENTS.citnum
FROM IA_ADM.INCIDENTS
WHERE IA_ADM.INCIDENTS.incident_type = N'Citizen complaint'
    AND IA_ADM.INCIDENTS.occurred_dt >= ?
```
Note the `?` at the end of the query. When the query is run, that `?` will be replaced with the extraction start date.
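The `?` is a standard DB-API "qmark" placeholder: the database driver binds the start date to it safely at execution time rather than interpolating it into the SQL text. Here's a quick illustration of the mechanism using Python's built-in `sqlite3` module, which also uses qmark style — the real extractor talks to SQL Server, so this is only a sketch of how the binding works:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (incnum TEXT, occurred_dt TEXT)")
conn.executemany("INSERT INTO incidents VALUES (?, ?)",
                 [("1", "2013-12-01"), ("2", "2014-04-15")])

# The ? is bound to the extraction start date when the query runs.
start_date = "2014-04-01"
rows = conn.execute(
    "SELECT incnum FROM incidents WHERE occurred_dt >= ?",
    (start_date,)).fetchall()
# Only the incident on or after the start date is returned.
```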
Each dataset requires a query file and a template file. The locations and names of these files are defined in `config.json`. The template file contains a JSON object that is used to transform data extracted from the database by the corresponding query. Here's a very simple example that transforms data extracted by the query above:
```json
[
  {"name": "opaqueId", "row_index": 0, "process_as": "anonymous"},
  {"name": "serviceType", "row_index": 1, "process_as": "string"},
  {"name": "occuredDate", "row_index": 2, "process_as": "date"},
  {"name": "occuredTime", "row_index": 3, "process_as": "date"},
  {"name": "division", "row_index": 4, "process_as": "string"},
  {"name": "precinct", "row_index": 5, "process_as": "string"},
  {"name": "shift", "row_index": 6, "process_as": "string"},
  {"name": "beat", "row_index": 7, "process_as": "string"},
  {"name": "disposition", "row_index": 8, "process_as": "string"},
  {"name": "residentId", "row_index": 9, "process_as": "anonymous", "process_with": "opaqueId"}
]
```
Each row in the JSON file corresponds to a selection from the query. So the first item in the query selection (`IA_ADM.INCIDENTS.incnum`) is at index `0`; this template file maps that value using the `row_index` variable, names it `opaqueId`, and instructs that it should be processed as an `anonymous` variable. The fifth item in the query selection (`IA_ADM.INCIDENTS.udtext24a`) is at index `4`; it's named `division` here and processed as a `string` variable.
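The mapping can be pictured as a small transform: each template entry picks the column at `row_index` out of a result row and stores it under `name`. A minimal sketch of the idea (ignoring the `process_as` handling, which the real extractor implements; the row values are made up):

```python
# An abbreviated version of the template above.
template = [
    {"name": "opaqueId", "row_index": 0, "process_as": "anonymous"},
    {"name": "serviceType", "row_index": 1, "process_as": "string"},
    {"name": "division", "row_index": 4, "process_as": "string"},
]

def apply_template(template, row):
    """Map a raw query result row to a named dict using row_index."""
    return {entry["name"]: row[entry["row_index"]] for entry in template}

# A raw result row in the same column order as the query's SELECT list.
row = ("CC-2014-001", "Arresting", "2014-04-15", "12:37",
       "Operational Bureau")
record = apply_template(template, row)
```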
The accepted values for `process_as` are:

- `date` - this value will be treated as a date
- `boolean` - this value will be treated as a boolean value
- `exists` - this value will be converted to a boolean: `True` if it exists, `False` if it doesn't
- `anonymous` - this value will be anonymized (used for ID numbers)
- `string` - this value will be treated as a string value. Numerical values should also be processed as strings. If a value isn't a `date`, `boolean`, `exists`, or `anonymous` value, it should be a `string`.
If you have a value that you want to anonymize within the context of a single incident, you can also use the `process_with` parameter as shown above. Using the unique incident ID (`opaqueId`) to anonymize the unique resident ID (`residentId`) will keep a single resident from being tracked across incidents.
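One way to picture this: hash the value together with `comport_key`, and, when `process_with` is set, also mix in the already-anonymized incident ID, so the same resident ID hashes to a different value in each incident. The exact scheme below (MD5 of key + salt + value) is an assumption for illustration, not necessarily what the extractor uses:

```python
import hashlib

def anonymize(value, key, salt=""):
    """Return a stable, opaque hash of value.

    key is the comport_key; salt is the already-processed value named
    by process_with (e.g. the incident's opaqueId), which keeps a
    resident from getting the same hash across incidents. The
    MD5-of-concatenation scheme here is illustrative only.
    """
    return hashlib.md5((key + salt + str(value)).encode("utf-8")).hexdigest()

key = "grgF5u9ZTF2yALj"
incident_hash = anonymize("CC-2014-001", key)
# Same resident in two different incidents: the hashes differ,
# so the resident can't be tracked across incidents.
resident_a = anonymize("R-42", key, salt=incident_hash)
resident_b = anonymize("R-42", key, salt=anonymize("CC-2014-002", key))
```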
To perform a simple find and replace operation on a value as it's processed, add a `find_replace` value to a template row like so: `..., "find_replace": "hello/goodbye"`. The pattern to find is on the left side of an unescaped `/`, and the string to replace it with is on the right, so this example would replace the word "hello" with "goodbye". A pattern like `"find_replace": "^Z/"` would strip a `Z` at the beginning of the value, and `"find_replace": "\\bY/Z"` would replace every `Y` at the beginning of a word with a `Z`.
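The behavior described above can be sketched with Python's `re` module — split the `find_replace` value on the first unescaped `/`, then substitute. A sketch of the described behavior, not the extractor's actual source:

```python
import re

def find_replace(spec, value):
    """Apply a 'pattern/replacement' spec to value.

    The split is on the first '/' not preceded by a backslash, so a
    pattern may contain an escaped slash.
    """
    pattern, replacement = re.split(r"(?<!\\)/", spec, maxsplit=1)
    return re.sub(pattern, replacement, value)
```

For example, `find_replace("^Z/", "Z1234")` strips the leading `Z`, and `find_replace(r"\bY/Z", "Yard of Yarn")` turns every word-initial `Y` into a `Z`.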
The template defines how an incident will look when it's written to disk or posted to a web service. A CSV file generated using the above query and template would look like this:

```csv
opaqueId,serviceType,occuredDate,occuredTime,division,precinct,shift,beat,disposition,residentId
11db42a5d058a4033316e5872875e2d0,Arresting,2007-05-08 00:00:00,2007-05-08 12:37:00,Operational Bureau,Fourth Precinct,C.O.P. Program,Days,Within policy,2d2ff48f831256de10690246c2b7b0de
```
A JSON file posted to a web service using the above query and template would look like this:

```json
{
  "data": [
    {
      "opaqueId": "11db42a5d058a4033316e5872875e2d0",
      "serviceType": "Arresting",
      "occuredDate": "2007-05-08 00:00:00",
      "occuredTime": "2007-05-08 12:37:00",
      "division": "Operational Bureau",
      "precinct": "Fourth Precinct",
      "shift": "C.O.P. Program",
      "beat": "Days",
      "disposition": "Within policy",
      "residentId": "2d2ff48f831256de10690246c2b7b0de"
    }
  ]
}
```
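Putting it together: when `publish_to` is `COMPORT`, the transformed records are wrapped in a `{"data": [...]}` envelope and posted to `comport_baseurl` plus the dataset's `endpoint`. A sketch of assembling that request with the standard library (only the payload shape and config field names come from the examples above; the actual HTTP client the extractor uses isn't shown here):

```python
import json
from urllib.parse import urljoin

config = {
    "comport_baseurl": "https://www.projectcomport.org/",
    "comport_username": "Sample_Police_Department-extractor",
    "comport_password": "password",  # placeholder, not a real credential
}

# Records already mapped by the dataset's template.
records = [{"opaqueId": "11db42a5d058a4033316e5872875e2d0",
            "serviceType": "Arresting"}]

url = urljoin(config["comport_baseurl"], "data/complaints")
payload = json.dumps({"data": records})
# An HTTP client would now POST payload to url, authenticating with
# comport_username / comport_password.
```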