@alexwoolford
Created September 22, 2014 05:50
The DPS pages can be indexed in Solr.
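# Interesting features from the HTML can then be loaded into a Solr index.
import json
from bs4 import BeautifulSoup
import solr

s = solr.SolrConnection('http://localhost:8983/solr')

# The input file holds one JSON record per line, each with 'url' and 'html' keys.
with open('/Users/awoolford/Documents/scrapeDPS/dpsk12_org/dpsk12_org.json', 'r') as f:
    for recordNum, line in enumerate(f):
        try:
            record = json.loads(line.strip())
            # Name the parser explicitly so results do not depend on what is installed.
            soup = BeautifulSoup(record['html'], 'html.parser')
            url = record['url']
            title = soup.title.getText().strip()
            s.add(id=recordNum, url=url, title=title)
        except (ValueError, KeyError, TypeError, AttributeError):
            # Skip lines that are not valid JSON, lack a field, or have no <title>.
            continue

# Commit once after all records have been added, rather than once per record.
s.commit()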
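
The extraction step in the loop above can be tried without a running Solr server. The following is a minimal sketch, not the gist's own method: it swaps BeautifulSoup for the standard library's `html.parser` so that it has no third-party dependencies. The names `TitleParser` and `record_to_doc` and the sample URL are hypothetical, introduced only for illustration. It maps one JSON line to the `id`, `url`, and `title` fields that the script sends to Solr, and returns `None` for records the loop would skip.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def record_to_doc(record_num, line):
    """Map one JSON line to the indexed fields, or None if the record is unusable."""
    try:
        record = json.loads(line.strip())
        url = record['url']
        html = record['html']
    except (ValueError, KeyError, TypeError):
        return None  # not valid JSON, or a required field is missing
    if not isinstance(html, str):
        return None
    parser = TitleParser()
    parser.feed(html)
    if not parser.done:
        return None  # no <title> element, so the record is skipped
    return {'id': record_num, 'url': url, 'title': ''.join(parser.parts).strip()}


# Sample record with a made-up URL, in the one-JSON-object-per-line layout.
sample = json.dumps({'url': 'http://example.org/',
                     'html': '<html><head><title> Home </title></head></html>'})
doc = record_to_doc(0, sample)
bad = record_to_doc(1, 'not json')
```

Keeping the parsing in a function separate from the Solr calls makes it possible to check which records would be skipped before anything is sent to the index.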