@tmaybe
Last active February 16, 2017 22:25

How To Configure Comport Extractor (pulling data from MS SQL databases)

  1. The config.json File
  2. The Query Files
  3. The Template Files

The config.json File

Copy config.json.sample to config.json and set the variables as described below (or skip ahead to the sample config file).

  • publish_to can be FILE or COMPORT. If it's FILE, extracted data will be written to a local CSV file, perfect for testing the output of queries. Once queries are finalized, setting it to COMPORT will send data to the Comport web application.

  • extract_from_date is a date in the format month/year, like 4/2014. If publish_to is set to FILE, this date will be used as the start date for extraction. If publish_to is set to COMPORT, this date will be ignored, and the extractor will get its start date from the Comport web application.

  • comport_baseurl is the URL of the Comport app, probably https://www.projectcomport.org/

  • comport_username and comport_password are the credentials the extractor will use to authenticate with the Comport app; you should receive them from the administrator.

  • comport_key is a string that will be used to hash values that should be anonymized.

  • sql_server is the address of the SQL server.

  • sql_server_database is the name of the database on the server that contains the data to be extracted.

  • sql_server_username and sql_server_password are the credentials the extractor will use to authenticate with the SQL database.

  • queries_dir is the directory where the SQL files containing the database queries are kept.

  • templates_dir is the directory where the JSON files containing the data templates are kept. These templates describe how the results of the queries are processed and mapped.

  • proxy_http and proxy_https are the URLs of proxy servers for http and https traffic. The comport_username and comport_password will be automatically added to the URLs if necessary.

  • datasets is a list of objects that each describe a dataset to be extracted, transformed, and loaded. Each object has:

    • name: a short, unique name that describes the dataset, like "complaints"

    • query_filename: the name of the SQL file containing a database query, kept in the queries_dir. See The Query Files for an example of what the query files should look like.

    • template_filename: the name of the JSON file containing a dataset template, kept in the templates_dir. See The Template Files for an example of what the template files should look like.

    • endpoint: the endpoint on the Comport site that the data will be sent to, like "data/complaints"

    • A dataset can optionally include its own sql_server, sql_server_database, sql_server_username, and sql_server_password parameters. If all these parameters are included in a dataset object, they will override the credentials in the main body of the configuration for that dataset only.
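As a sketch of that override rule (a hypothetical helper, using the four field names from config.json above), the per-dataset credential resolution might look like:

```python
def sql_credentials(config, dataset):
    """Resolve SQL connection settings for one dataset.

    A dataset's own credentials win only when all four parameters
    are present in the dataset object; otherwise the values from
    the main body of config.json are used.
    """
    keys = ("sql_server", "sql_server_database",
            "sql_server_username", "sql_server_password")
    source = dataset if all(k in dataset for k in keys) else config
    return {k: source[k] for k in keys}
```

Note that a dataset supplying only some of the four parameters falls back to the top-level credentials entirely; partial overrides are not mixed.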

Sample Config File

When your config file is filled out, it should look something like this.

{
    "publish_to": "FILE",
    "extract_from_date": "1/2005",
    "comport_baseurl": "https://www.projectcomport.org/",
    "comport_username": "Sample_Police_Department-extractor",
    "comport_password": "f5e7db81-678d-451b-8bc0-029c2687206a",
    "comport_key": "grgF5u9ZTF2yALj",
    "sql_server": "SERVER-NAME",
    "sql_server_database": "DBNAME",
    "sql_server_username": "username",
    "sql_server_password": "password",
    "queries_dir": "queries",
    "templates_dir": "templates",
    "proxy_http": "http://proxy.example.com/",
    "proxy_https": "http://proxy.example.com/",
    "datasets": [
        {
            "name": "complaints",
            "query_filename": "query-complaints.sql",
            "template_filename": "template-complaints.json",
            "endpoint": "data/complaints"
        },
        {
            "name": "ois",
            "query_filename": "query-ois.sql",
            "template_filename": "template-ois.json",
            "endpoint": "data/OIS"
        },
        {
            "name": "uof",
            "query_filename": "query-uof.sql",
            "template_filename": "template-uof.json",
            "endpoint": "data/UOF"
        }
    ]
}

The Query Files

Each dataset requires a query file and a template file. The locations and names of these files are defined in config.json. The query file contains a SQL query that is used to extract data from the database. Here's a very simple example that pulls data from the IA_ADM.INCIDENTS table:

SELECT
IA_ADM.INCIDENTS.incnum,
IA_ADM.INCIDENTS.service_type,
IA_ADM.INCIDENTS.occurred_dt,
IA_ADM.INCIDENTS.occurred_tm,
IA_ADM.INCIDENTS.udtext24a,
IA_ADM.INCIDENTS.udtext24b,
IA_ADM.INCIDENTS.udtext24c,
IA_ADM.INCIDENTS.udtext24d,
IA_ADM.INCIDENTS.disposition,
IA_ADM.INCIDENTS.citnum
FROM IA_ADM.INCIDENTS
WHERE IA_ADM.INCIDENTS.incident_type = N'Citizen complaint'
AND IA_ADM.INCIDENTS.occurred_dt >= ?

Note the ? at the end of the query. When the query is run, that ? will be replaced with the extraction start date.
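For illustration, the month/year start date from config.json can be parsed into a real date before being bound to the ? placeholder. This is a sketch (the function name is hypothetical); with a driver like pyodbc, the value is then passed as a query parameter rather than spliced into the SQL string:

```python
from datetime import datetime

def extraction_start(extract_from_date):
    """Parse a month/year string like "4/2014" (the extract_from_date
    format from config.json) into a date for the query's ? placeholder."""
    month, year = extract_from_date.split("/")
    return datetime(int(year), int(month), 1)

# With a driver like pyodbc, the date is bound as a parameter,
# never spliced into the SQL text:
#   cursor.execute(query_sql, (extraction_start("4/2014"),))
```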

The Template Files

Each dataset requires a query file and a template file. The locations and names of these files are defined in config.json. The template file contains a JSON object that is used to transform data extracted from the database by the corresponding query. Here's a very simple example that transforms data extracted by the query above:

[
    {"name": "opaqueId", "row_index": 0, "process_as": "anonymous"},
    {"name": "serviceType", "row_index": 1, "process_as": "string"},
    {"name": "occuredDate", "row_index": 2, "process_as": "date"},
    {"name": "occuredTime", "row_index": 3, "process_as": "date"},
    {"name": "division", "row_index": 4, "process_as": "string"},
    {"name": "precinct", "row_index": 5, "process_as": "string"},
    {"name": "shift", "row_index": 6, "process_as": "string"},
    {"name": "beat", "row_index": 7, "process_as": "string"},
    {"name": "disposition", "row_index": 8, "process_as": "string"},
    {"name": "residentId", "row_index": 9, "process_as": "anonymous", "process_with": "opaqueId"}
]

Each object in the template corresponds to a column in the query's SELECT list. So the first item in the query selection (IA_ADM.INCIDENTS.incnum) is at index 0; this template file maps that value using the row_index variable, names it opaqueId, and instructs that it should be processed as an anonymous variable. The fifth item in the query selection (IA_ADM.INCIDENTS.udtext24a) is at index 4; it's named division here and processed as a string variable.

The accepted values for process_as are:

  • date - this value will be treated as a date
  • boolean - this value will be treated as a boolean value
  • exists - this value will be converted to a boolean: True if it exists, False if it doesn't
  • anonymous - this value will be anonymized (used for ID numbers)
  • string - this value will be treated as a string value. Numerical values should also be processed as strings. If a value isn't a date, boolean, exists, or anonymous value, it should be a string.
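A minimal sketch of how these process_as conversions might be applied to a query result row, using the row_index mapping described above. The function names and the hashing scheme are illustrative assumptions, not the extractor's actual implementation (md5 is guessed only because the sample output tokens are 32 hex characters):

```python
import hashlib

def process_value(value, process_as, key="comport_key-value"):
    """Convert one raw query value according to its process_as setting.

    Illustrative sketch: the anonymous branch hashes the value
    together with the comport_key so tokens are deterministic but
    not reversible without the key.
    """
    if process_as == "exists":
        return value is not None and value != ""
    if process_as == "boolean":
        return bool(value)
    if process_as == "anonymous":
        return hashlib.md5((key + str(value)).encode("utf-8")).hexdigest()
    # "date" values arrive from the driver as datetimes and are
    # serialized as text, so both remaining cases reduce to str().
    return str(value)

def apply_template(row, template):
    """Map one result row (a tuple) to a named record via row_index."""
    return {field["name"]: process_value(row[field["row_index"]],
                                         field["process_as"])
            for field in template}
```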

If you have a value that you want to anonymize within the context of a single incident, you can also use the process_with parameter as shown above. Using the unique incident ID (opaqueId) to anonymize the unique resident ID (residentId) will keep a single resident from being tracked across incidents.
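A sketch of that idea: salting the hash with the incident's own identifier means the same resident ID produces a different token in every incident, so the tokens can't be joined across incidents. The function name and hashing scheme here are assumptions for illustration:

```python
import hashlib

def anonymize_within(value, context_value, key):
    """Hash `value` (e.g. a resident ID) salted with another field's
    raw value (e.g. the incident number behind opaqueId) plus the
    comport_key, scoping the resulting token to one incident."""
    material = key + str(context_value) + str(value)
    return hashlib.md5(material.encode("utf-8")).hexdigest()
```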

To perform a simple find-and-replace operation on a value as it's processed, add a find_replace value to a template row like so: ..., "find_replace": "hello/goodbye". The pattern to find is on the left side of an unescaped /, and the string to replace it with is on the right, so this example would replace the word "hello" with "goodbye". A pattern like "find_replace": "^Z/" would strip a Z from the beginning of the value, and "find_replace": "\\bY/Z" would replace every Y at the beginning of a word with a Z.
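The slash-splitting rule can be sketched with regular expressions: split the spec at the first / not preceded by a backslash, then hand the two halves to re.sub. This is an illustration of the rule as described, not the extractor's actual parser:

```python
import re

def find_replace(value, spec):
    """Apply a "pattern/replacement" spec to a value.

    The spec is split at the first slash NOT preceded by a
    backslash, so escaped slashes can appear inside the pattern.
    The left half is a regex pattern, the right half the
    replacement string, exactly as in re.sub.
    """
    pattern, replacement = re.split(r"(?<!\\)/", spec, maxsplit=1)
    return re.sub(pattern, replacement, value)
```

For example, the specs from the paragraph above behave as described: "hello/goodbye" swaps the words, "^Z/" strips a leading Z, and "\bY/Z" (written "\\bY/Z" in JSON) replaces a Y at a word boundary with a Z.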

The template defines how an incident will look when it's written to disk, or posted to a web service. A CSV file generated using the above query and template would look like this:

opaqueId,serviceType,occuredDate,occuredTime,division,precinct,shift,beat,disposition,residentId
11db42a5d058a4033316e5872875e2d0,Arresting,2007-05-08 00:00:00,2007-05-08 12:37:00,Operational Bureau,Fourth Precinct,C.O.P. Program,Days,Within policy,2d2ff48f831256de10690246c2b7b0de

A JSON file posted to a web service using the above query and template would look like this:

{
    "data": [
        {
            "opaqueId": "11db42a5d058a4033316e5872875e2d0",
            "serviceType": "Arresting",
            "occuredDate": "2007-05-08 00:00:00",
            "occuredTime": "2007-05-08 12:37:00",
            "division": "Operational Bureau",
            "precinct": "Fourth Precinct",
            "shift": "C.O.P. Program",
            "beat": "Days",
            "disposition": "Within policy",
            "residentId": "2d2ff48f831256de10690246c2b7b0de"
        }
    ]
}