print("Parse start")

sourcefile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\schedulebyseason.htm"
targetfile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\parsed_schedulebyseason.txt"

searchstr = "recap?id="
sample_recstr = "2015020001"
reclen = len(sample_recstr)

i = 0

with open(sourcefile,'r') as infile, open(targetfile,'w') as outfile:
    for line in infile:
        line_iterator = str(line).split(searchstr)
        if len(line_iterator) > 1:
            game_id = line_iterator[1][0:reclen]
            outfile.write(game_id)
            outfile.write("\n")
            i = i + 1

print(str(i) + " : records written")
print("Parse end")
Here is a simplified version using a regular expression to find instances of recap?id= and capture the string of digits following id=. Note that file paths should be updated to your local equivalents.
import re
# define a simple regular expression which finds strings
# matching "recap?id=" and captures the following series of digits
SCHEDULE_RE = re.compile(r'recap\?id=(\d+)')
# read file
with open('/path/to/schedulebyseason.htm', 'r') as source:
    html = source.read()
# find all instances of "recap?id="
recap_ids = SCHEDULE_RE.findall(html)
# open target file, write results
with open('game-ids.txt', 'w') as output:
    for recap_id in recap_ids:
        output.write('%s\n' % recap_id)
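To see what the pattern captures without touching any files, here is a minimal sketch run against an in-memory string. The sample markup is made up for illustration (only the recap URL format comes from this thread), and it deliberately puts two links on one line.

```python
import re

# same pattern as above: match "recap?id=" and capture the digits after it
SCHEDULE_RE = re.compile(r'recap\?id=(\d+)')

# hypothetical sample markup with two recap links on a single line
sample = (
    '<a href="http://www.nhl.com/gamecenter/en/recap?id=2015020001">RECAP</a>'
    '<a href="http://www.nhl.com/gamecenter/en/recap?id=2015020002">RECAP</a>'
)

# with one capture group, findall returns the captured strings only
recap_ids = SCHEDULE_RE.findall(sample)
print(recap_ids)
```

Because findall scans the whole string, it picks up every ID even when several links share a line, whereas splitting each line and reading only the piece after the first match would keep just one ID per line.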
Below is what the same thing would look like with lxml. Here, we convert the HTML into a real XML-like tree and use an XPath query to select <a> nodes whose href attribute contains recap?id=. Each matching URL is broken down into its component parts, and we parse the query string to extract the value of id.
# Python 2 module; on Python 3 use: from urllib import parse as urlparse
import urlparse
from lxml import html

# read html
body = open('schedulebyseason.htm', 'r').read()
# parse html into an element tree
doc = html.fromstring(body)

with open('game-ids.txt', 'w') as output:
    # extract `href` attribute from each `a` tag
    # whose `href` contains the string "recap?id="
    for href in doc.xpath('//a[contains(@href, "recap?id=")]/@href'):
        # decompose this url
        bits = urlparse.urlparse(href)
        # break down the query string
        qs = urlparse.parse_qs(bits.query)
        # because a query string key may be used more than once
        # e.g., ("?year=2015&year=2016"), decomposed query strings are
        # given as lists. take the first element of the list...
        game_id = qs['id'][0]
        # ... and write it to the output file
        output.write('%s\n' % game_id)
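The URL-decomposition step can be tried on its own, without lxml or an input file. This is a small sketch assuming Python 3, where the same functions live in urllib.parse; the sample URL is the recap link format quoted elsewhere in this thread.

```python
from urllib.parse import urlparse, parse_qs

# sample recap link (format taken from the schedule page links)
href = "http://www.nhl.com/gamecenter/en/recap?id=2015020664"

# split the url into scheme, host, path, query, ...
bits = urlparse(href)

# parse the query string; every value comes back as a list
qs = parse_qs(bits.query)

# take the first (and here only) value for "id"
game_id = qs["id"][0]
print(bits.query, qs, game_id)
```

If a link had no id parameter, qs["id"] would raise a KeyError, so qs.get("id") may be the safer lookup on messier pages.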
Terrific stuff guys, thanks. This will definitely be especially valuable in my next parsing of a file.
Here's an alternative also provided:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
Here's how to do it in Beautiful Soup:
from bs4 import BeautifulSoup
# urllib2 is Python 2 only; on Python 3 the equivalent is urllib.request
import urllib2
import re

resp = urllib2.urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
# on Python 3, use resp.info().get_content_charset() instead of getparam('charset')
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().getparam('charset'))

with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        #print link['href']
        # http://www.nhl.com/gamecenter/en/recap?id=2015020664
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))
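For anyone without Beautiful Soup installed, roughly the same walk over anchor tags can be sketched with Python 3's standard-library html.parser. This swaps in a different parser from the one above, and the RecapLinkParser class and the sample markup are hypothetical names made up for this example.

```python
import re
from html.parser import HTMLParser

RECAP_RE = re.compile(r'recap\?id=(\d+)')


class RecapLinkParser(HTMLParser):
    """Hypothetical helper: collects game IDs from the href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.game_ids = []

    def handle_starttag(self, tag, attrs):
        # only anchor tags are of interest
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; href may be absent
        href = dict(attrs).get("href")
        if href:
            result = RECAP_RE.search(href)
            if result:
                self.game_ids.append(result.group(1))


# made-up sample markup: one recap link, one unrelated link, one anchor without href
sample = (
    '<p><a href="http://www.nhl.com/gamecenter/en/recap?id=2015020664">RECAP</a> '
    '<a href="/ice/teams.htm">Teams</a> <a name="top">top</a></p>'
)

parser = RecapLinkParser()
parser.feed(sample)
print(parser.game_ids)
```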
I'm fascinated. Thanks guys, I'm going to try these solutions as well.
Someone pointed to here:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
And I made my updates accordingly.
Note that I will definitely need to get there soon when I parse the event files. For the schedule, I didn't need to worry about ,