@prateek
Last active June 21, 2019 19:13
Extract Claims Data from Collective Health.

Free your Collective Claim Data(!)

Found it much easier to hijack their APIs using Chrome than to scrape the rendered pages.

Broke the process into two parts:

  1. Getting a list of all relevant claims
  2. Retrieving PDFs for said claims

(1) Getting a list of all relevant claims

  • Go to 'https://my.collectivehealth.com/activity?patientId=0' (log in if prompted), and navigate to the person whose claims you want.
  • Open Chrome's inspector to the Network panel
  • Refresh the page
  • One of the requests is named 'claim'; right-click it and select "Copy > Copy as cURL"
  • This will give you something like:
curl 'https://my.collectivehealth.com/api/v2/person/...&skip=0&limit=20' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' - ....
  • Modify the limit parameter to a sufficiently large value (I set mine to 5000); it controls the number of claims to scrape.

  • Save that output to a file by adding -o claims.json to the end of the curl command.

You now have all your claims raw data!
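Before moving on, it can help to sanity-check the download. Here's a short Python sketch; it assumes (as the jq filter in step 2 does) that the API response nests the claim list under a `data` key:

```python
import json

def summarize_claims(payload):
    """Count claims per claimType in the scraped response payload."""
    counts = {}
    for claim in payload.get("data", []):
        kind = claim.get("claimType", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts

if __name__ == "__main__":
    with open("claims.json") as f:
        print(summarize_claims(json.load(f)))
```

If the counts look plausible (and include the medical claim types used below), the scrape worked.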

(2) Retrieving PDFs for said claims

(a) I use the following incantation to reshape the JSON into a smaller structure that I find easier to read.

cat claims.json | jq '[.data[]
  | select(.claimType == "professionalMedicalClaim" or .claimType == "institutionalMedicalClaim")
  | {id: .id,
     claimSystemId: .claimSystemId,
     dateOfServiceStart: .dateOfServiceStart,
     claimDescription: .claimDescription,
     displayProvider: .displayProvider.name,
     billedAmount: .billedAmount,
     planPaid: .planPaid,
     patientResponsibility: .patientResponsibility,
     filename: ((.dateOfServiceStart | gsub("-";"")) + "_" + .claimSystemId + "_"
       + (.claimDescription | gsub(" ";"-") | ascii_downcase) + "_"
       + (.displayProvider.name | gsub(" ";"-") | ascii_downcase))}]' > claims-filtered.json

(If you want to explore your claims file, I highly recommend http://visidata.org/. It's incredible!)
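In case jq isn't handy, here's a sketch of the same filtering and filename logic in plain Python. It assumes the same fields the jq program reads (claimType, dateOfServiceStart, displayProvider.name, and so on):

```python
# Claim types the jq filter keeps; everything else (Rx, dental, ...) is dropped.
MEDICAL_TYPES = {"professionalMedicalClaim", "institutionalMedicalClaim"}

def make_filename(claim):
    """Mirror the jq filename: YYYYMMDD_claimSystemId_description_provider."""
    date = claim["dateOfServiceStart"].replace("-", "")
    desc = claim["claimDescription"].replace(" ", "-").lower()
    provider = claim["displayProvider"]["name"].replace(" ", "-").lower()
    return "%s_%s_%s_%s" % (date, claim["claimSystemId"], desc, provider)

def filter_claims(payload):
    """Produce the same records the jq filter writes to claims-filtered.json."""
    out = []
    for claim in payload.get("data", []):
        if claim.get("claimType") not in MEDICAL_TYPES:
            continue
        out.append({
            "id": claim["id"],
            "claimSystemId": claim["claimSystemId"],
            "dateOfServiceStart": claim["dateOfServiceStart"],
            "claimDescription": claim["claimDescription"],
            "displayProvider": claim["displayProvider"]["name"],
            "billedAmount": claim["billedAmount"],
            "planPaid": claim["planPaid"],
            "patientResponsibility": claim["patientResponsibility"],
            "filename": make_filename(claim),
        })
    return out
```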

(b) Find an example claim you want to download a PDF for on 'https://my.collectivehealth.com/activity?patientId=0'. Similar to what we did in (1), find the request called download and copy its cURL. It'll look something like:

curl 'https://my.collectivehealth.com/api/v1/claim/1234567/download' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' -H 'CH-Login-Token: ...' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36' -H 'Accept: application/json' -H 'Cookie: ...' -H 'Connection: keep-alive' --compressed

(c) Create a file called download.py with contents:

#!/usr/bin/env python3

import json
import subprocess
import sys

if len(sys.argv) != 3:
    print("Usage: %s <filename> <curl-params>" % sys.argv[0])
    sys.exit(1)

filename = sys.argv[1]
curl_params = sys.argv[2]

with open(filename, "r") as read_file:
    data = json.load(read_file)

for claim in data:
    # Fetch the JSON envelope for this claim; it contains a short-lived PDF URL.
    response = subprocess.check_output(
        "curl 'https://my.collectivehealth.com/api/v1/claim/%d/download' %s"
        % (claim["id"], curl_params), shell=True)
    jresponse = json.loads(response)
    # Download the PDF itself, named using the filename built in step (a).
    subprocess.check_output(
        "curl -X GET '%s' -o '%s.pdf'" % (jresponse["url"], claim["filename"]),
        shell=True)

Copy everything after the URL in the cURL command from (b) into the command below:

chmod +x download.py
./download.py claims-filtered.json "STUFF_AFTER_URL_FROM_CURL"
# this takes a second

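If you'd rather not shell out to curl, the same loop can be sketched with Python's standard library. The header names here (CH-Login-Token, Cookie) come from the cURL command in (b); the actual values are yours to fill in:

```python
#!/usr/bin/env python3
import json
import urllib.request

API = "https://my.collectivehealth.com/api/v1/claim/%d/download"

def claim_download_url(claim_id):
    """URL of the JSON envelope that carries the short-lived PDF link."""
    return API % claim_id

def download_pdfs(claims, headers):
    # headers: dict of auth headers copied from the cURL command,
    # e.g. {"CH-Login-Token": "...", "Cookie": "..."}.
    for claim in claims:
        req = urllib.request.Request(claim_download_url(claim["id"]), headers=headers)
        with urllib.request.urlopen(req) as resp:
            meta = json.load(resp)
        # meta["url"] is the same field the curl-based script reads.
        urllib.request.urlretrieve(meta["url"], claim["filename"] + ".pdf")
```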
(time to ... profit!)

Caveats

  • The steps above download PDFs for all the claims. You can modify the jq filtering to pick whatever subset you see fit.

E.g. find all the claims for a given provider:

cat claims-filtered.json | jq '[.[] | select(.displayProvider | startswith("PROVIDER_NAME"))]'
  • This only works for medical claims (not Rx, etc.), which is why the jq filter in (2a) selects on claimType.
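The provider filter above can also be written in Python, assuming claims-filtered.json has the structure produced in step (2a):

```python
import json

def claims_for_provider(claims, prefix):
    """Equivalent of the jq startswith() filter above."""
    return [c for c in claims if c["displayProvider"].startswith(prefix)]

if __name__ == "__main__":
    with open("claims-filtered.json") as f:
        print(json.dumps(claims_for_provider(json.load(f), "PROVIDER_NAME"), indent=2))
```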