@tangotiger
Last active February 7, 2016 21:24
Parse Schedule
print("Parse start")
sourcefile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\schedulebyseason.htm"
targetfile = "C:\\Users\\TOM\\PycharmProjects\\downloadNHL\\datafiles\\parsed_schedulebyseason.txt"
searchstr = "recap?id="
sample_recstr = "2015020001"
reclen = len(sample_recstr)
i = 0
with open(sourcefile,'r') as infile, open(targetfile,'w') as outfile:
for line in infile:
line_iterator = str(line).split(searchstr)
if len(line_iterator) > 1:
game_id = line_iterator[1][0:reclen]
outfile.write(game_id)
outfile.write("\n")
i = i + 1
print(str(i) + " : records written")
print("Parse end")
@dondrake

Take a look at the Beautiful Soup library: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
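
For a sense of what that looks like, here is a minimal sketch that parses a saved copy of the schedule page and prints the recap links (the file name is assumed; a fuller, live-fetching version is posted further down the thread):

from bs4 import BeautifulSoup

# parse the saved schedule page into a soup tree
with open('schedulebyseason.htm', 'r') as infile:
    soup = BeautifulSoup(infile, 'html.parser')

# every anchor whose href contains the recap marker carries a game id
for link in soup.find_all('a', href=True):
    if 'recap?id=' in link['href']:
        print(link['href'])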

@tangotiger (Author)

If you can provide a particular instance for discussion, that would be more helpful. Is it the search? The extract?

@tangotiger (Author)

Note that I will definitely need to get there soon, when I parse the event files. For the schedule, I didn't need to worry about the HTML tags.

@mattdennewitz

Here is a simplified version using a regular expression to find instances of recap?id= and capture the string of digits following id=. Note that the file paths should be updated to your local equivalents.

import re


# define a simple regular expression which finds strings
# matching "recap?id=" and captures the following series of digits
SCHEDULE_RE = re.compile(r'recap\?id=(\d+)')

# read the saved schedule page
with open('/path/to/schedulebyseason.htm', 'r') as infile:
    html = infile.read()

# find all instances of "recap?id="
recap_ids = SCHEDULE_RE.findall(html)

# open target file, write results
with open('game-ids.txt', 'w') as output:
    for recap_id in recap_ids:
        output.write('%s\n' % recap_id)
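
For illustration, findall returns the captured digit groups directly; given a sample anchor (the URL here is a made-up example), it yields the bare game id:

sample = '<a href="http://www.nhl.com/gamecenter/en/recap?id=2015020001">Recap</a>'
print(SCHEDULE_RE.findall(sample))  # ['2015020001']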

@mattdennewitz

Below is what the same thing would look like with lxml. Here, we convert the HTML into a real XML-like tree and use an XPath query to select <a> nodes whose href attribute matches recap?id=. Each matching URL is broken down into its component parts, and we parse the query string to extract the value of id.

import urlparse

from lxml import html


# read the saved schedule page
with open('schedulebyseason.htm', 'r') as infile:
    body = infile.read()

# parse html into an element tree
doc = html.fromstring(body)

with open('game-ids.txt', 'w') as output:
    # extract `href` attribute from each `a` tag
    # whose `href` contains the string "recap?id="
    for href in doc.xpath('//a[contains(@href, "recap?id=")]/@href'):
        # decompose this url
        bits = urlparse.urlparse(href)
        # break down the query string
        qs = urlparse.parse_qs(bits.query)

        # because a query string key may be used more than once
        # e.g., ("?year=2015&year=2016"), decomposed query strings are
        # given as lists. take the first element of the list...
        game_id = qs['id'][0]

        # ... and write it to the output file
        output.write('%s\n' % game_id)
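
One note: the snippet above is Python 2, where urlparse is a top-level module. On Python 3 the same functions live in urllib.parse; a minimal sketch of the changed calls (example URL made up):

from urllib.parse import urlparse, parse_qs

href = 'http://www.nhl.com/gamecenter/en/recap?id=2015020001'  # example url
bits = urlparse(href)
qs = parse_qs(bits.query)  # {'id': ['2015020001']}
print(qs['id'][0])         # 2015020001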

@tangotiger (Author)

Terrific stuff, guys, thanks. This will be especially valuable the next time I parse a file.
Here's an alternative that was also provided:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa

@dondrake

Here's how to do it in Beautiful Soup:

from bs4 import BeautifulSoup
import urllib2
import re

resp = urllib2.urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().getparam('charset') )

with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        # each href looks like: http://www.nhl.com/gamecenter/en/recap?id=2015020664
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))
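
This snippet is also Python 2 (urllib2). A Python 3 sketch of the same approach, with the charset read from the response headers instead of getparam:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

resp = urlopen("http://www.nhl.com/ice/schedulebyseason.htm")
charset = resp.headers.get_content_charset()  # may be None; bs4 then guesses
soup = BeautifulSoup(resp, "html.parser", from_encoding=charset)

with open('game-ids.txt', 'w') as output:
    for link in soup.find_all('a', href=True):
        result = re.search(r'recap\?id=(\d+)', link['href'])
        if result:
            output.write('%s\n' % result.group(1))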

@tangotiger (Author)

I'm fascinated. Thanks, guys. I'm going to try these solutions as well.

@tangotiger (Author)

Someone pointed me to this gist:
https://gist.github.com/Ja1meMartin/db1b71ed90921aff24fa
and I made my updates accordingly.
