Last active
February 7, 2016 21:24
Parse Schedule
print("Parse start")

sourcefile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\schedulebyseason.htm"
targetfile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\parsed_schedulebyseason.txt"

searchstr = "recap?id="
sample_recstr = "2015020001"
reclen = len(sample_recstr)

i = 0
with open(sourcefile,'r') as infile, open(targetfile,'w') as outfile:
    for line in infile:
        line_iterator = str(line).split(searchstr)
        if len(line_iterator) > 1:
            game_id = line_iterator[1][0:reclen]
            outfile.write(game_id)
            outfile.write("\n")
            i = i + 1

print(str(i) + " : records written")
print("Parse end")
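One thing to watch with the split approach in the script above: it reads only `line_iterator[1]`, so it picks up at most one game ID per line. If a line of the saved page ever holds several recap links, the later ones are skipped. The sketch below is one possible variant, not the gist author's method: it uses `re.findall` to collect every match on a line. The names `GAME_ID_PATTERN`, `extract_game_ids` and `sample_lines` are made up for illustration, and the ten-digit length is an assumption taken from the sample ID `2015020001`.

```python
import re

# "recap?id=" followed by exactly ten digits, e.g. 2015020001.
# Ten digits is an assumption based on the sample ID in the script.
GAME_ID_PATTERN = re.compile(r"recap\?id=(\d{10})")


def extract_game_ids(lines):
    """Return every game ID found in an iterable of text lines, in order."""
    game_ids = []
    for line in lines:
        # findall returns all captured IDs on the line, not just the first
        game_ids.extend(GAME_ID_PATTERN.findall(line))
    return game_ids


# Hypothetical sample input standing in for lines of the saved HTML file
sample_lines = [
    '<a href="/gamecenter/en/recap?id=2015020001">RECAP</a>',
    '<td>no link here</td>',
    '<a href="recap?id=2015020002">A</a><a href="recap?id=2015020003">B</a>',
]
print(extract_game_ids(sample_lines))
```

Unlike the slice `[0:reclen]`, which takes the next ten characters whatever they are, this pattern only matches when those ten characters are all digits.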
Here's how to do it in Beautiful Soup:
from bs4 import BeautifulSoup
import urllib2
import re

resp = urllib2.urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().getparam('charset'))

with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        #print link['href']
        # http://www.nhl.com/gamecenter/en/recap?id=2015020664
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))
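Note that `urllib2` exists only in Python 2; on Python 3 the download step would go through `urllib.request` instead. For anyone who wants to try the same link-by-link idea without installing anything, here is a rough Python 3 sketch. It swaps Beautiful Soup for the standard library's `html.parser` and parses an inline sample string rather than fetching the page, so it is a different technique from the snippet above, not a port of it. `RecapLinkParser`, `game_ids_from_html` and `sample_html` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class RecapLinkParser(HTMLParser):
    """Collects game IDs from the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.game_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None
        href = dict(attrs).get("href") or ""
        match = re.search(r"recap\?id=(\d+)", href)
        if match:
            self.game_ids.append(match.group(1))


def game_ids_from_html(html):
    """Parse an HTML string and return the recap game IDs in page order."""
    parser = RecapLinkParser()
    parser.feed(html)
    parser.close()
    return parser.game_ids


# Hypothetical sample markup; the real schedule page may be laid out differently
sample_html = """
<table>
  <tr><td><a href="http://www.nhl.com/gamecenter/en/recap?id=2015020664">RECAP</a></td></tr>
  <tr><td><a href="/ice/schedulebyseason.htm">Schedule</a></td></tr>
  <tr><td><a name="anchor-without-href">x</a></td></tr>
</table>
"""
print(game_ids_from_html(sample_html))
```

Writing the IDs to a file is then just a loop over the returned list, one `write` per ID.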
I'm fascinated. Thanks guys, I'm going to try these solutions as well.
Someone pointed to here:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
And I made my updates accordingly.
Terrific stuff guys, thanks. This will definitely be valuable the next time I parse a file.
Here's an alternative also provided:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa