Created
June 25, 2011 18:33
Parsing Extractiv JSON Crawl Job Output
"""
Extractiv is a cool new service that does on-the-fly web
crawling and machine learning tasks like entity extraction, full-text
extraction, and more things I don't even understand. Their
on-demand API was very easy to use, but I built my first crawl
job the other night and had some trouble parsing the results.
I thought I would put this up so people can see how I fixed the
output.
"""
import json  # standard library; simplejson.loads behaves the same way

# Get the data from a file
with open('extractive_output.txt') as f:
    data = f.read()

# Split it into lines, which also strips the \n from each line
lines = data.split('\n')

# Store each document's raw JSON text as an entry in a list
json_documents = []

# As we iterate over the lines, accumulate the JSON document
# currently being built in current_json
current_json = ''

# Iterate over the lines
for line in lines:
    current_json += line
    # A line that is just a closing brace, with no indentation,
    # marks the end of a document.
    # Push the JSON text onto the list and start over again.
    if line == '}':
        json_documents.append(current_json)
        current_json = ''

# Parse the individual documents
json_documents = [json.loads(x) for x in json_documents]
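
The line-based split above relies on each document ending with a bare `}` on its own line. As an alternative sketch (not part of the original gist), the standard library's `json.JSONDecoder.raw_decode` can pull consecutive JSON documents out of one string regardless of how they are indented. The function name `split_json_documents` is a hypothetical helper chosen for illustration, and the code assumes the crawl output is nothing but JSON documents separated by whitespace.

```python
import json


def split_json_documents(text):
    """Parse a string holding several JSON documents back to back.

    Hypothetical helper: assumes the input contains only JSON
    documents separated by optional whitespace.
    """
    decoder = json.JSONDecoder()
    documents = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        documents.append(obj)
    return documents


if __name__ == '__main__':
    sample = '{\n  "a": 1\n}\n{\n  "b": [2, 3]\n}\n'
    print(split_json_documents(sample))
```

This avoids depending on the exact pretty-printing of the output, at the cost of raising `json.JSONDecodeError` as soon as any non-JSON text appears between documents.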