Parse Schedule
print("Parse start")
sourcefile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\schedulebyseason.htm"
targetfile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\parsed_schedulebyseason.txt"
searchstr = "recap?id="
sample_recstr = "2015020001"
reclen = len(sample_recstr)
i = 0
with open(sourcefile, 'r') as infile, open(targetfile, 'w') as outfile:
    for line in infile:
        line_iterator = str(line).split(searchstr)
        if len(line_iterator) > 1:
            game_id = line_iterator[1][0:reclen]
            outfile.write(game_id)
            outfile.write("\n")
            i = i + 1
print(str(i) + " : records written")
print("Parse end")
Terrific stuff guys, thanks. This will definitely be valuable the next time I parse a file.
Here's an alternative also provided:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
Here's how to do it in Beautiful Soup:
# Note: this snippet is Python 2 (urllib2, getparam); on Python 3 use urllib.request instead.
from bs4 import BeautifulSoup
import urllib2
import re

resp = urllib2.urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().getparam('charset'))

with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        # print link['href']
        # e.g. http://www.nhl.com/gamecenter/en/recap?id=2015020664
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))
I'm fascinated. Thanks guys, I'm going to try these solutions as well.
Someone pointed me to this:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
and I made my updates accordingly.
Below is what the same thing would look like with lxml: here, we convert the HTML into a real XML-like tree and use an XPath query to select <a> nodes whose href attribute matches recap?id=. Each matching URL is broken down into its component parts, and we parse the query string to extract the value of id.
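The lxml code itself didn't survive in this thread, so here is a minimal sketch of the approach described above. The sample HTML, variable names, and the exact XPath expression are my own assumptions; the real script would read schedulebyseason.htm from disk or fetch it over the network.

```python
from lxml import html
from urllib.parse import urlsplit, parse_qs

# A small stand-in for the downloaded schedule page (hypothetical sample;
# the actual page has many more links).
page = """
<html><body>
<a href="http://www.nhl.com/gamecenter/en/recap?id=2015020001">Recap</a>
<a href="http://www.nhl.com/ice/standings.htm">Standings</a>
<a href="http://www.nhl.com/gamecenter/en/recap?id=2015020664">Recap</a>
</body></html>
"""

# Convert the HTML into a real XML-like element tree.
tree = html.fromstring(page)

game_ids = []
# XPath query: select <a> nodes whose href attribute contains "recap?id=".
for link in tree.xpath('//a[contains(@href, "recap?id=")]'):
    parts = urlsplit(link.get("href"))   # break the URL into component parts
    qs = parse_qs(parts.query)           # parse the query string
    game_ids.append(qs["id"][0])         # extract the value of id

print(game_ids)  # ['2015020001', '2015020664']
```

Compared with the regex approaches above, this only inspects real anchor tags, so stray occurrences of "recap?id=" elsewhere in the page text are ignored.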