Parse Schedule
Last active February 7, 2016
print("Parse start")
sourcefile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\schedulebyseason.htm"
targetfile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\parsed_schedulebyseason.txt"
searchstr = "recap?id="
sample_recstr = "2015020001"
reclen = len(sample_recstr)
i = 0
with open(sourcefile, 'r') as infile, open(targetfile, 'w') as outfile:
    for line in infile:
        line_iterator = line.split(searchstr)
        if len(line_iterator) > 1:
            game_id = line_iterator[1][0:reclen]
            outfile.write(game_id)
            outfile.write("\n")
            i = i + 1
print(str(i) + " : records written")
print("Parse end")
Below is what the same thing would look like with lxml. Here, we convert the HTML into a real XML-like tree and use an XPath query to select <a> nodes whose href attribute contains recap?id=. Each matching URL is broken down into its component parts, and we parse the query string to extract the value of id.
from urllib.parse import urlparse, parse_qs
from lxml import html

# read html
body = open('schedulebyseason.htm', 'r').read()
# parse html into an element tree
doc = html.fromstring(body)
with open('game-ids.txt', 'w') as output:
    # extract `href` attribute from each `a` tag
    # whose `href` contains the string "recap?id="
    for href in doc.xpath('//a[contains(@href, "recap?id=")]/@href'):
        # decompose this url into its component parts
        bits = urlparse(href)
        # break down the query string
        qs = parse_qs(bits.query)
        # because a query string key may be used more than once,
        # e.g. "?year=2015&year=2016", decomposed query strings are
        # given as lists. take the first element of the list...
        game_id = qs['id'][0]
        # ...and write it to the output file
        output.write('%s\n' % game_id)
Terrific stuff guys, thanks. This will definitely be valuable the next time I parse a file.
Here's an alternative that was also provided:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
Here's how to do it in Beautiful Soup:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

resp = urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.headers.get_content_charset())
with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        # e.g. http://www.nhl.com/gamecenter/en/recap?id=2015020664
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))
I'm fascinated. Thanks guys, I'm going to try these solutions as well.
Someone pointed me to this:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
And I made my updates accordingly.
Here is a simplified version using a regular expression to find instances of recap?id= and capture the string of digits following id=. Note that file paths should be updated to your local equivalents.
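A minimal sketch of that regex approach (the sample line below is invented for illustration; in the real script you would run findall over the contents of schedulebyseason.htm):

```python
import re

# capture the run of digits immediately following "recap?id="
pattern = re.compile(r"recap\?id=(\d+)")

# invented sample line of the kind found in schedulebyseason.htm
sample = '<a href="/ice/gamecenter/en/recap?id=2015020001">Recap</a>'

game_ids = pattern.findall(sample)
print(game_ids)  # ['2015020001']
```

Applied to the whole file, pattern.findall(infile.read()) returns every game id in one pass, which replaces both the split on searchstr and the fixed-length slice.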