voidfiles/gist:1046748

## gistfile1.py
"""
Extractive is a cool new service that does some on-the-fly web
crawling, and machine learning stuff like entity extraction, full
text extraction, and more things I don't even understand. Their
on-demand API was very easy to use, but I built my first crawl
job the other night, and had some trouble parsing the results.

I thought I would put this up so people can see how I fixed the
output.
"""

import simplejson

# Get the data from a file, closing the file once it has been read
with open('extractive_output.txt') as f:
    data = f.read()

# Split it into lines, which also strips the \n from the data
lines = data.split('\n')

# Going to store each document as an entry in a list
json_documents = []

# As we iterate over the lines, we store the JSON
# doc that is being built in current_json
current_json = ''

# Iterate over the lines
for line in lines:
    current_json += line

    # If the line has no spaces and is just a closing bracket,
    # this is the end of a document.
    # Push the JSON text onto the list, and start over again
    if line == '}':
        json_documents.append(current_json)
        current_json = ''

# Parse the individual documents, replacing each JSON string
# with the object it decodes to
json_documents = [simplejson.loads(x) for x in json_documents]
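
The line-by-line approach above depends on every document ending with a `}` alone on its own line. As a rough alternative sketch (not how the original script works), the standard library's `json.JSONDecoder.raw_decode` can pull documents off the front of the text one at a time without caring about the line layout. The function name `split_json_documents` and the sample data are made up for illustration, and the input is assumed to be nothing but JSON documents written back to back.

```python
import json


def split_json_documents(text):
    """Parse JSON documents that sit back to back in a single string."""
    decoder = json.JSONDecoder()
    documents = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over the
        # newlines and spaces that separate one document from the next
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the decoded object and the index where it stopped
        document, pos = decoder.raw_decode(text, pos)
        documents.append(document)
    return documents


# Hypothetical sample that mimics the assumed output format
sample = (
    '{\n  "url": "http://example.com/a",\n  "tags": ["x"]\n}\n'
    '{\n  "url": "http://example.com/b"\n}\n'
)
docs = split_json_documents(sample)
```

This uses the built-in `json` module instead of `simplejson`, and it raises `json.JSONDecodeError` if anything other than JSON appears between documents.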
	"""
	Extractive is a cool new service that does some on the fly web
	crawling, and machine learning stuff like Entity Extraction, Full
	text extraction, and more things I don't even understand. There
	on demand api was very easy to use, but I built my first crawl
	job the other night, and had some trouble parsing the results.

	I thought I would put this up so people can see how I fixed the
	output.
	"""

	# Get the data from a file
	data = open('extractive_output.txt').read()

	# Split it into lines, which also strips the \n from the data
	lines = data.split('\n')

	# Going to store each document as an entry in a list
	json_documents = []

	# As we iterate over the lines will store the json
	# doc that is going to be build in current_json
	current_json = ''

	# Start iterate overllines
	for line in lines:
	current_json += line

	# If the line has no spaces, and is just a closing braket
	# This is the end of a document
	# Push the json on the list, and start over again
	if line == '}':
	json_documents.append(json)
	current_json = ''

	# parse the individual documents
	json_documents = [simplejson.loads(x) for x in jsons]