@jdnc
Created August 27, 2013 19:48
{
"metadata": {
"name": "mini-code"
},
"name": "mini-code",
"nbformat": 2,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"source": "Here's some basic code that scrapes the HTML of a course page and extracts the instructor's name, phone number, and office."
},
{
"cell_type": "code",
"collapsed": true,
"input": "import requests\nimport bs4\nimport re",
"language": "python",
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"source": "The regular expressions for the relevant details are these:"
},
{
"cell_type": "code",
"collapsed": false,
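"input": "# Label that marks the instructor's row (matched case-insensitively)\nsearch_instructor = re.compile('Instructor|Professor', re.IGNORECASE)\n# Phone number: digits, a dash, digits (e.g. 471-9751)\nsearch_phone = re.compile(r'(\\d+-\\d+)')\n# Office: building code GDC followed by a room number (e.g. GDC 4.512)\nsearch_office = re.compile(r'(GDC\\s+\\d[.]\\d+)')\n# Name: leading run of word characters and whitespace, ending at the first punctuation mark\nsearch_name = re.compile(r'([\\w\\s]+)')",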
"language": "python",
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "markdown",
"source": "Now we fetch the page for Beautiful Soup to parse."
},
{
"cell_type": "code",
"collapsed": true,
"input": "response = requests.get('http://www.cs.utexas.edu/~plaxton/c/388g/syllabus.html')",
"language": "python",
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"source": "Here is the HTML content of the page:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print(response.text)",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<!doctype html public \"-//w3c//dtd html 4.0 transitional//en\">\n<html>\n<head>\n<META HTTP-EQUIV=\"Pragma\" CONTENT=\"no-cache\">\n<META HTTP-EQUIV=\"Expires\" CONTENT=\"-1\">\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">\n<meta name=\"GENERATOR\" content=\"Mozilla/4.79 [en] (X11; U; Linux 2.4.18 i686) [Netscape]\">\n</head>\n<body text=\"#000000\" bgcolor=\"#FFFFFF\" link=\"#3333FF\" vlink=\"#990000\" alink=\"#FF0000\">\n\n<blockquote>\n<div align=right><b><font color=\"#990000\">Spring 2013</font></b>\n<br><b><font color=\"#990000\">Unique Number 53665</font></b>\n<br><b><font color=\"#990000\">CS 388G</font></b></div>\n<b><font color=\"#990000\"><font size=+1>Algorithms: Techniques and Theory</font></font></b>\n<br>\n<hr WIDTH=\"100%\">\n<br>&nbsp;\n<table BORDER=0 WIDTH=\"100%\" NOSAVE >\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Instructor</font></b></td>\n\n<td><a href=\"http://www.cs.utexas.edu/users/plaxton\">Greg Plaxton</a>,\n471-9751, GDC 4.512, office hours M 2:30-3:30, W 3:30-4:30.</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">TA</font></b></td>\n\n<td>Onur Domanic, office hours TTh 10-11, GDC 1.302.\n</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Class Time</font></b></td>\n\n<td>TTh 2-3:30</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Class Location</font></b></td>\n\n<td>BUR 136</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Required Textbook</font></b></td>\n\n\n<td>T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. 
Stein,\n<i>Introduction to Algorithms</i>, MIT Press, 3rd edition,\n2009.</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Course Outline</font></b></td>\n\n<td>This is a graduate course in the design and analysis of\nalgorithms. Some of the course material revisits topics that are\ncovered in a typical undergraduate algorithms course such as CS 357;\nin such cases, we emphasize more advanced aspects. See\nthe <a href=\"schedule.html\">schedule</a> for a more detailed lecture\nplan. </td>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Prerequisites</font></b></td>\n<td>\nGraduate standing and either CS 357 (or an equivalent course) or\nconsent of instructor.\n<td>\n</td>\n</tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Recommended Exercises</font></b></td>\n\n<td>Six sets of recommended exercises will be handed out during the\nsemester. The tentative dates for these handouts are indicated on the\nclass schedule. Sample solutions will also be handed out. It is\nsuggested that students attempt to solve the recommended exercises\n(either alone or in a group) before reading the sample solutions.\n</td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Quizzes</font></b></td>\n\n<td>Most of the lectures will begin with a short quiz based on\nmaterial covered in the previous lecture. Each quiz will be graded\nout of 20. Attendance is worth 50%, i.e., anyone who turns in a quiz\nwill get a score of at least 10 out of 20. The quizzes are open\nbook/notes. Electronic devices may be used during the quizzes, but\nonly for the following two purposes: (1) to access an electronic\nversion of the class textbook; (2) to access an electronic version of\nyour personal class notes. 
As explained in the section below entitled\n\"Overall Raw Score\", all of your low quiz scores will be dropped in\nthe computation of your course grade. If you miss a quiz for any\nreason (legitimate or otherwise), your score for that quiz will be a\nzero, so it will be dropped. </td></tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Tests</font></b></td>\n\n<td>There will be three in-class tests, on <b>February\n14</b>, <b>March 28</b>, and <b>May 2</b> (all test dates are\nThursdays). The tests are closed book/notes, except that you are\nallowed to bring one page of notes (both sides may be used).\n</td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Make-Up Tests</font></b></td>\n\n<td>\nPlease note that no make-up tests will be given in this course. If a\nstudent has a legitimate and properly documented excuse for missing\none of the tests, the missing test score will be estimated based on\nthe other test scores. More complicated scenarios, e.g., where a\nstudent misses multiple tests for legitimate reasons, will be treated\non a case-by-case basis. In the event of a non-excused absence, a\nscore of zero will be assigned. </td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Overall Raw Score</font></b></td>\n\n<td>\nEach student's overall raw score out of 100 will be determined from\nthe quiz and test scores as follows. First, the test average is\ncomputed (as a percentage), call it X. Then, each individual quiz\nscore lower than X is replaced by X. Let Y denote the resulting quiz\naverage. 
Then the overall raw score is 0.4 X + 0.6 Y.\n</td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Letter Grades</font></b></td>\n\n<td>\nThe mapping from overall raw scores to letter grades is not based on a\nfixed formula; it will depend to some extent on the overall\nperformance of the class. Typically about half of the students who\ncomplete the class receive a grade an A (or A-), and about half\nreceive a B (or B-, B+).\n</td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Feedback</font></b></td>\n\n<td>\nThroughout the semester, please feel free to provide feedback to the\ninstructor regarding any aspect of the course. </td> </tr>\n\n<tr>\n<td>&nbsp;</td>\n<td>&nbsp;</td>\n</tr>\n\n<tr VALIGN=TOP NOSAVE>\n<td NOSAVE><b><font color=\"#990000\">Academic Honesty</font></b></td>\n\n<td>\nSee the following <a\nhref=\"http://www.cs.utexas.edu/academics/conduct/\">departmental\ndocument</a>.\n</td>\n</tr>\n\n</table>\n</blockquote>\n\n</body>\n<head>\n<META HTTP-EQUIV=\"Pragma\" CONTENT=\"no-cache\">\n<META HTTP-EQUIV=\"Expires\" CONTENT=\"-1\">\n</head>\n</html>\n"
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": true,
"input": "html_doc = bs4.BeautifulSoup(response.content, 'html.parser')",
"language": "python",
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": true,
"input": "table = html_doc.find('table')",
"language": "python",
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"source": "Finally, we extract the relevant details:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for row in table.find_all('tr'):\n    cols = row.find_all('td')\n    # The first cell holds the row label, the second the details\n    text = cols[0].get_text()\n    match = search_instructor.match(text)\n    if match:\n        text2 = cols[1].get_text()\n        name = search_name.match(text2)  # the name starts the cell\n        phone = search_phone.search(text2)\n        office = search_office.search(text2)\n        # Replace each match object with its matched text (None stays None)\n        if name:\n            name = name.group()\n        if phone:\n            phone = phone.group()\n        if office:\n            office = office.group()\n        print('%s %s %s' % (name, office, phone))",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Greg Plaxton GDC 4.512 471-9751"
}
],
"prompt_number": 34
}
]
}
]
}
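As a rough sketch of how the notebook's regular expressions behave, the snippet below applies them to the text of the instructor cell, copied from the page output in the notebook. The helper name `extract_details` and the `sample` string are illustrative assumptions, not part of the original notebook. It uses only `re`, so it runs without fetching the page or installing Beautiful Soup.

```python
import re

# Same patterns as the notebook, written as raw strings.
search_phone = re.compile(r'(\d+-\d+)')
search_office = re.compile(r'(GDC\s+\d[.]\d+)')
search_name = re.compile(r'([\w\s]+)')


def extract_details(cell_text):
    """Hypothetical helper: return (name, office, phone) from one cell's text."""
    name = search_name.match(cell_text)       # anchored at the start of the text
    phone = search_phone.search(cell_text)    # first digits-dash-digits run
    office = search_office.search(cell_text)  # first "GDC <room>" occurrence
    return (
        name.group().strip() if name else None,
        office.group() if office else None,
        phone.group() if phone else None,
    )


# Text of the instructor cell as it appears in the page output.
sample = "Greg Plaxton,\n471-9751, GDC 4.512, office hours M 2:30-3:30, W 3:30-4:30."
details = extract_details(sample)
print(details)
```

Note that `search_phone` is loose: on the TA cell text ("Onur Domanic, office hours TTh 10-11, GDC 1.302.") it would pick up `10-11` as the phone number, so restricting the search to the row labelled Instructor, as the notebook does, matters.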