@alexwoolford
Last active August 29, 2015 14:06
A graph database (e.g. Neo4j) is an interesting way to look at the links between pages on a site.
# Load a file of JSON records (one object per line, each with 'url' and 'html')
# into a Neo4j graph database: one node per URL, one relationship per link.
# The py2neo calls are kept as in the original script (older, index-based API).
import json

from bs4 import BeautifulSoup
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService()
urls = graph_db.get_or_create_index(neo4j.Node, "Urls")
connectedTo = graph_db.get_or_create_index(neo4j.Relationship, "ConnectedTo")

with open('dpsk12_org.json', 'r') as f:
    for line in f:
        try:
            record = json.loads(line.strip())
            url = record['url']
            soup = BeautifulSoup(record['html'], 'html.parser')
        except (ValueError, KeyError):
            continue  # skip malformed records
        for link in soup.find_all('a'):
            href = link.get('href')  # anchors without an href return None
            if not href or not href.startswith('http://'):
                continue
            try:
                fromUrl = urls.get_or_create("url", url, {'url': url})
                toUrl = urls.get_or_create("url", href, {'url': href})
                fromUrl.get_or_create_path("connectedTo", toUrl)
            except Exception:
                pass  # best effort: skip links that fail to write
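The link-extraction step can be checked without a running Neo4j server. The sketch below is not part of the original script: it swaps BeautifulSoup for the standard library's `html.parser` so that it has no third-party dependencies, and the names `LinkCollector`, `extract_edges`, and the sample record are hypothetical. It yields the same (from, to) URL pairs that the script would write as `connectedTo` relationships.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_edges(json_lines):
    """Yield (from_url, to_url) pairs for absolute http:// links."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
            url, html = record["url"], record["html"]
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed records
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            if href.startswith("http://"):
                yield url, href


# Hypothetical sample input: one valid record and one malformed line.
sample = [
    json.dumps({
        "url": "http://example.org/",
        "html": '<a href="http://example.org/about">About</a>'
                '<a name="top">no href</a>'
                '<a href="/relative">relative</a>',
    }),
    "not json",
]

edges = list(extract_edges(sample))
# Count inbound links per target URL, a rough stand-in for a graph query.
in_degree = Counter(to_url for _, to_url in edges)
print(edges)
```

Like the script above, this keeps only links that start with `http://`, so relative and `https://` links are dropped. Whether that filter is what you want depends on the site being crawled.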