@voidfiles
Created June 25, 2011 18:33
Parsing Extractiv JSON Crawl Job Output
"""
Extractiv is a cool new service that does some on-the-fly web
crawling and machine learning stuff like entity extraction,
full-text extraction, and more things I don't even understand.
Their on-demand API was very easy to use, but I built my first
crawl job the other night and had some trouble parsing the
results. I thought I would put this up so people can see how I
fixed the output.
"""
import simplejson

# Get the data from a file
with open('extractive_output.txt') as f:
    data = f.read()
# Split it into lines, which also strips the \n from each line
lines = data.split('\n')
# Each document is stored as an entry in this list
json_documents = []
# As we iterate over the lines, current_json accumulates
# the text of the JSON document being built
current_json = ''
# Start iterating over the lines
for line in lines:
    current_json += line
    # If the line has no spaces and is just a closing bracket,
    # this is the end of a document.
    # Push the JSON text onto the list, and start over again
    if line == '}':
        json_documents.append(current_json)
        current_json = ''
# Parse the individual documents
json_documents = [simplejson.loads(x) for x in json_documents]
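
# The loop above relies on every document ending with a bare '}' line.
# A possible alternative, sketched here under the assumption that the
# crawl output is just JSON objects written one after another, is to let
# the decoder find each document boundary itself. This uses the standard
# library json module rather than simplejson; iter_json_documents and the
# sample string are hypothetical names and data made up for illustration.

```python
import json


def iter_json_documents(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample standing in for the crawl output (not real Extractiv data)
sample = (
    '{\n"url": "http://example.com/a",\n"n": 1\n}\n'
    '{\n"url": "http://example.com/b",\n"n": 2\n}\n'
)
docs = list(iter_json_documents(sample))
```

# Because the decoder tracks nesting, this sketch should not depend on how
# the closing brackets are laid out or indented in the file.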