How To Configure Comport Extractor (pulling data from ORACLE databases)

  1. The config.json File
  2. The Query Files
  3. The Template Files
  4. Set Up the Oracle Client and Environment Variables

The config.json File

Copy config.json.sample to config.json and set the variables as described below, or skip ahead to the sample config file.

  • publish_to can be FILE or COMPORT. If it's FILE, extracted data will be written to a local CSV file, perfect for testing the output of queries. Once queries are finalized, setting it to COMPORT will send data to the Comport web application.

  • extract_from_date is a date in the format month/year, like 4/2014. If publish_to is set to FILE, this date will be used as the start date for extraction. If publish_to is set to COMPORT, this date will be ignored, and the extractor will get its start date from the Comport web application.

  • comport_baseurl is the URL of the Comport app, probably https://www.projectcomport.org/

  • comport_username and comport_password are the credentials the extractor will use to authenticate with the Comport app; you should receive them from the administrator.

  • comport_key is a string that'll be used to hash values that should be anonymized.

  • oracle_host is the address of the Oracle server.

  • oracle_port is the port to connect to on the server, probably 1521.

  • oracle_database is the name of the database on the server that contains the data to be extracted.

  • oracle_username and oracle_password are the credentials the extractor will use to authenticate with the SQL database.

  • queries_dir is the directory where sql files containing the database queries are kept.

  • templates_dir is the directory where json files containing the data templates are kept. These templates describe how the results of the queries are processed and mapped.

  • proxy_http and proxy_https are the URLs of proxy servers for http and https traffic. The comport_username and comport_password will be automatically added to the URLs if necessary.

  • datasets is a list of objects that each describe a dataset to be extracted, transformed, and loaded. Each object has:

    • name is a short, unique name that describes the dataset, like "complaints"

    • query_filename is the name of the sql file containing a database query, kept in the queries_dir. See The Query Files for an example of what the query files should look like.

    • template_filename is the name of the json file containing a dataset template, kept in the templates_dir. See The Template Files for an example of what the template files should look like.

    • endpoint is the endpoint on the Comport site that the data will be sent to, like "data/complaints"

Sample Config File

When your config file is filled out, it should look something like this:

{
    "publish_to": "FILE",
    "extract_from_date": "1/2005",
    "comport_baseurl": "https://www.projectcomport.org/",
    "comport_username": "Sample_Police_Department-extractor",
    "comport_password": "f5e7db81-678d-451b-8bc0-029c2687206a",
    "comport_key": "grgF5u9ZTF2yALj",
    "oracle_host": "hostname",
    "oracle_port": 1521,
    "oracle_database": "DBNAME",
    "oracle_username": "username",
    "oracle_password": "password",
    "queries_dir": "queries",
    "templates_dir": "templates",
    "proxy_http": "http://proxy.example.com/",
    "proxy_https": "http://proxy.example.com/",
    "datasets": [
        {
            "name": "complaints",
            "query_filename": "query-complaints.sql",
            "template_filename": "template-complaints.json",
            "endpoint": "data/complaints"
        },
        {
            "name": "ois",
            "query_filename": "query-ois.sql",
            "template_filename": "template-ois.json",
            "endpoint": "data/OIS"
        },
        {
            "name": "uof",
            "query_filename": "query-uof.sql",
            "template_filename": "template-uof.json",
            "endpoint": "data/UOF"
        }
    ]
}
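
To see how these pieces fit together, here's a minimal sketch (assuming the extractor is driven by a Python script; the loop below is purely illustrative, not the extractor's actual code) of reading config.json and walking the datasets list:

import json

# Read the extractor configuration (illustrative; the real extractor may load it differently).
with open("config.json") as config_file:
    config = json.load(config_file)

# Each dataset ties a query file and a template file to a Comport endpoint.
for dataset in config["datasets"]:
    query_path = config["queries_dir"] + "/" + dataset["query_filename"]
    template_path = config["templates_dir"] + "/" + dataset["template_filename"]
    print(dataset["name"], query_path, template_path, dataset["endpoint"])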

The Query Files

Each dataset requires a query file and a template file. The locations and names of these files are defined in config.json. The query file contains a SQL query that is used to extract data from the database. Here's a very simple example that pulls data from the IADATA.INCIDENTS table:

SELECT
IADATA.INCIDENTS.incnum,
IADATA.INCIDENTS.service_type,
IADATA.INCIDENTS.occurred_dt,
IADATA.INCIDENTS.occurred_tm,
IADATA.INCIDENTS.udtext24a,
IADATA.INCIDENTS.udtext24b,
IADATA.INCIDENTS.udtext24c,
IADATA.INCIDENTS.udtext24d,
IADATA.INCIDENTS.disposition,
IADATA.INCIDENTS.citnum
FROM IADATA.INCIDENTS
WHERE IADATA.INCIDENTS.incident_type = 'Citizen complaint'
AND IADATA.INCIDENTS.occurred_dt >= to_date(:dt, 'YYYY-MM-DD')

Note the :dt on the last line of the query. That :dt is a bind variable; when the query is run, it will be replaced with the extraction start date.
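
Here's a minimal sketch of how that bind variable might be supplied when the query is run with the cx_Oracle driver. The connection values are the ones from the sample config above; the exact code the extractor uses may differ.

import cx_Oracle

# Connection details come from config.json (sample values shown; assumes oracle_database is an Oracle SID).
dsn = cx_Oracle.makedsn("hostname", 1521, "DBNAME")
connection = cx_Oracle.connect("username", "password", dsn)

with open("queries/query-complaints.sql") as query_file:
    sql = query_file.read()

cursor = connection.cursor()
# The :dt placeholder is filled with a named bind parameter, formatted to match to_date(:dt, 'YYYY-MM-DD').
cursor.execute(sql, {"dt": "2005-01-01"})
rows = cursor.fetchall()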

The Template Files

Each dataset requires a query file and a template file. The locations and names of these files are defined in config.json. The template file contains a JSON object that is used to transform data extracted from the database by the corresponding query. Here's a very simple example that transforms data extracted by the query above:

[
    {"name": "opaqueId", "row_index": 0, "process_as": "anonymous"},
    {"name": "serviceType", "row_index": 1, "process_as": "string"},
    {"name": "occuredDate", "row_index": 2, "process_as": "date"},
    {"name": "occuredTime", "row_index": 3, "process_as": "date"},
    {"name": "division", "row_index": 4, "process_as": "string"},
    {"name": "precinct", "row_index": 5, "process_as": "string"},
    {"name": "shift", "row_index": 6, "process_as": "string"},
    {"name": "beat", "row_index": 7, "process_as": "string"},
    {"name": "disposition", "row_index": 8, "process_as": "string"},
    {"name": "residentId", "row_index": 9, "process_as": "anonymous", "process_with": "opaqueId"}
]

Each entry in the JSON file corresponds to a column selected by the query. So the first item in the query selection (IADATA.INCIDENTS.incnum) is at index 0; this template file maps that value using the row_index variable, names it opaqueId, and instructs that it should be processed as an anonymous value. The fifth item in the query selection (IADATA.INCIDENTS.udtext24a) is at index 4; it's named division here and processed as a string value.
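
In code, that mapping amounts to picking the value at each entry's row_index out of the result row and storing it under the entry's name. A rough sketch (illustrative only; the real processing also applies the process_as rules listed below):

import json

# Load the template for a dataset (file name as in the sample config above).
with open("templates/template-complaints.json") as template_file:
    template = json.load(template_file)

def map_row(row, template):
    """Turn one query result row (a tuple) into a dict keyed by the template's field names."""
    record = {}
    for field in template:
        record[field["name"]] = row[field["row_index"]]
    return record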

The accepted values for process_as are:

  • date - this value will be treated as a date
  • boolean - this value will be treated as a boolean value
  • anonymous - this value will be anonymized (used for ID numbers)
  • string - this value will be treated as a string value. Numerical values should also be processed as strings. If a value isn't a date, boolean, or anonymous value, it should be a string.

If you have a value that you want to anonymize within the context of a single incident, you can also use the process_with parameter as shown above. Using the unique incident ID (opaqueId) to anonymize the unique resident ID (residentId) keeps a single resident from being tracked across incidents.
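
To make that concrete, here's one way a keyed hash could produce the anonymized values: hash with comport_key alone for a plain anonymous field, and add the already-hashed incident ID as a salt for a process_with field. The 32-character hex output matches the sample values below, but the algorithm shown (MD5) is an assumption; the extractor's actual hashing scheme may differ.

import hashlib

COMPORT_KEY = "grgF5u9ZTF2yALj"  # comport_key from config.json

def anonymize(value, salt=""):
    # Illustrative keyed hash; the real extractor may combine the key, salt, and value differently.
    return hashlib.md5((COMPORT_KEY + salt + str(value)).encode("utf-8")).hexdigest()

# opaqueId is hashed with the key alone; residentId is hashed with the incident's opaqueId
# as an extra salt (the process_with behavior), so the same resident gets a different
# residentId in every incident. The input IDs here are made up.
opaque_id = anonymize("2014-12345")
resident_id = anonymize("67890", salt=opaque_id)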

The template defines how an incident will look when it's written to disk or posted to a web service. A CSV file generated using the above query and template would look like this:

opaqueId,serviceType,occuredDate,occuredTime,division,precinct,shift,beat,disposition,residentId
11db42a5d058a4033316e5872875e2d0,Arresting,2007-05-08 00:00:00,2007-05-08 12:37:00,Operational Bureau,Fourth Precinct,C.O.P. Program,Days,Within policy,2d2ff48f831256de10690246c2b7b0de
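
Producing that CSV is straightforward once each incident has been mapped to a dict; here's a sketch of the FILE publishing path using Python's standard csv module (illustrative only):

import csv

# Column order matches the template's field names.
fieldnames = ["opaqueId", "serviceType", "occuredDate", "occuredTime", "division",
              "precinct", "shift", "beat", "disposition", "residentId"]

# One mapped, anonymized incident (the same record shown above).
records = [{
    "opaqueId": "11db42a5d058a4033316e5872875e2d0",
    "serviceType": "Arresting",
    "occuredDate": "2007-05-08 00:00:00",
    "occuredTime": "2007-05-08 12:37:00",
    "division": "Operational Bureau",
    "precinct": "Fourth Precinct",
    "shift": "C.O.P. Program",
    "beat": "Days",
    "disposition": "Within policy",
    "residentId": "2d2ff48f831256de10690246c2b7b0de",
}]

with open("complaints.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)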

A JSON file posted to a web service using the above query and template would look like this:

{
    "data": [
        {
            "opaqueId": "11db42a5d058a4033316e5872875e2d0",
            "serviceType": "Arresting",
            "occuredDate": "2007-05-08 00:00:00",
            "occuredTime": "2007-05-08 12:37:00",
            "division": "Operational Bureau",
            "precinct": "Fourth Precinct",
            "shift": "C.O.P. Program",
            "beat": "Days",
            "disposition": "Within policy",
            "residentId": "2d2ff48f831256de10690246c2b7b0de"
        }
    ]
}
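
When publish_to is COMPORT, a payload like that is POSTed to comport_baseurl plus the dataset's endpoint, using the configured credentials and proxies. Here's a sketch of what the HTTP call could look like with the requests library (assuming HTTP basic auth; the extractor's actual request code and the app's authentication scheme may differ):

import requests

# The list of mapped, anonymized incident dicts (abbreviated here; see the record above).
records = [{"opaqueId": "11db42a5d058a4033316e5872875e2d0"}]

response = requests.post(
    "https://www.projectcomport.org/data/complaints",  # comport_baseurl + the dataset's endpoint
    json={"data": records},
    auth=("Sample_Police_Department-extractor", "f5e7db81-678d-451b-8bc0-029c2687206a"),
    proxies={"http": "http://proxy.example.com/", "https": "http://proxy.example.com/"},
)
response.raise_for_status()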

Set Up the Oracle Client and Environment Variables

If you don't already have an Oracle client installed on the machine:

  • Download and install the Oracle Instant Client. Get the Instant Client Package - Basic Lite and the Instant Client Package - SDK packages appropriate for your system and database version.
  • Unzip the Instant Client into a directory, like C:\Oracle\instantclient_11_1
  • Unzip the SDK and copy the sdk folder into that directory, so it ends up at C:\Oracle\instantclient_11_1\sdk

Create an ORACLE_HOME environment variable that contains the Instant Client path:

  1. Control Panel -> System and Security -> System
  2. Click Advanced System Settings in the left panel
  3. Click the Environment Variables... button
  4. Click the New... button in either User or System variables.
  5. Create a new variable named ORACLE_HOME and set its value to the client path, like C:\Oracle\instantclient_11_1\

And add that same path to the end of the PATH environment variable:

  1. In the Environment Variables window (after steps 1-3 above), select the Path system variable in the bottom scroll window and click Edit...
  2. To the end of Variable value, add the client path, like ;C:\Oracle\instantclient_11_1\

If you already have an Oracle client installed on the machine, set ORACLE_HOME to the appropriate client path and append that same path to PATH, like C:\oracle\product\11.1.0\client_1\
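
As a quick sanity check before running the extractor (not part of the official setup), you can confirm from Python that the variables are visible:

import os

# Rough check that the Oracle client environment is set up for the current user.
oracle_home = os.environ.get("ORACLE_HOME", "")
print("ORACLE_HOME =", oracle_home or "(not set)")
print("on PATH:", bool(oracle_home) and oracle_home.rstrip("\\") in os.environ.get("PATH", ""))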
