@minikomi
Created September 18, 2014 04:55
scraping rosetta code
{
"metadata": {
"name": "Scraping Rosetta Code"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "## Setup"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import requests\nfrom bs4 import BeautifulSoup as BS\n\nlanguageHeaderId = \"J\"\n",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Testing Getting Section"
},
{
"cell_type": "code",
"collapsed": false,
"input": "url = \"http://rosettacode.org/wiki/JSON\"\nrosettaurl = \"http://rosettacode.org\"\n\nresult = requests.get(url)\nsoup = BS(result.content)\nallHeaders = soup.findAll(\"h2\")\nourHeader = [h for h in allHeaders if h.find(\"span\", id=languageHeaderId)][0]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Get edit link"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print ourHeader.find(\"a\")",
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Print to next code"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print \"Start of section ---------\"\n\nnextNode = ourHeader\nwhile True:\n nextNode = nextNode.nextSibling\n try:\n tag_name = nextNode.name\n except AttributeError:\n tag_name = \"\"\n if tag_name == \"h2\":\n print \"End of section\"\n break\n else:\n print nextNode",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Start of section ---------\n\n\n<p>Here is a minimal implementation based on <a class=\"external text\" href=\"http://www.jsoftware.com/pipermail/chat/2007-April/000462.html\" rel=\"nofollow\">an old email message</a>.\n</p>\n\n\n<pre class=\"j highlighted_source\"><span class=\"co1\">NB. character classes:</span><br/><span class=\"co1\">NB. 0: whitespace</span><br/><span class=\"co1\">NB. 1: \"</span><br/><span class=\"co1\">NB. 2: \\</span><br/><span class=\"co1\">NB. 3: [ ] , { }\u00a0:</span><br/><span class=\"co1\">NB. 4: ordinary</span><br/>classes=.<span class=\"nu0\">3</span>&lt;. <span class=\"st_h\">'\"\\[],{}:'</span> <span class=\"sy0\">(</span>#@[ |&amp;&gt;: i.<span class=\"sy0\">)</span> a.<br/>classes=.<span class=\"nu0\">0</span> <span class=\"sy0\">(</span>I.a.e.<span class=\"st_h\">' '</span>,CRLF,TAB<span class=\"sy0\">)</span>} <span class=\"sy0\">(</span>]+<span class=\"nu0\">4</span>*<span class=\"nu0\">0</span>=]<span class=\"sy0\">)</span>classes<br/>\u00a0<br/>words=:<span class=\"sy0\">(</span><span class=\"nu0\">0</span>;<span class=\"sy0\">(</span><span class=\"nu0\">0</span> <span class=\"nu0\">10</span>#:<span class=\"nu0\">10</span>*\".;.<span class=\"nu0\">_2</span>]<span class=\"nu0\">0</span>\u00a0:<span class=\"nu0\">0</span><span class=\"sy0\">)</span>;classes<span class=\"sy0\">)</span>&amp;;: <span class=\"co1\">NB. states:</span><br/> <span class=\"nu0\">0.0</span> <span class=\"nu0\">1.1</span> <span class=\"nu0\">2.1</span> <span class=\"nu0\">3.1</span> <span class=\"nu0\">4.1</span> <span class=\"co1\">NB. 0 whitespace</span><br/> <span class=\"nu0\">1.0</span> <span class=\"nu0\">5.0</span> <span class=\"nu0\">6.0</span> <span class=\"nu0\">1.0</span> <span class=\"nu0\">1.0</span> <span class=\"co1\">NB. 1 \"</span><br/> <span class=\"nu0\">4.0</span> <span class=\"nu0\">4.0</span> <span class=\"nu0\">4.0</span> <span class=\"nu0\">4.0</span> <span class=\"nu0\">4.0</span> <span class=\"co1\">NB. 
2 \\</span><br/> <span class=\"nu0\">0.3</span> <span class=\"nu0\">1.2</span> <span class=\"nu0\">2.2</span> <span class=\"nu0\">3.2</span> <span class=\"nu0\">4.2</span> <span class=\"co1\">NB. 3 {\u00a0: , } [ ]</span><br/> <span class=\"nu0\">0.3</span> <span class=\"nu0\">1.2</span> <span class=\"nu0\">2.0</span> <span class=\"nu0\">3.2</span> <span class=\"nu0\">4.0</span> <span class=\"co1\">NB. 4 ordinary</span><br/> <span class=\"nu0\">0.3</span> <span class=\"nu0\">1.2</span> <span class=\"nu0\">2.2</span> <span class=\"nu0\">3.2</span> <span class=\"nu0\">4.2</span> <span class=\"co1\">NB. 5 \"\"</span><br/> <span class=\"nu0\">1.0</span> <span class=\"nu0\">1.0</span> <span class=\"nu0\">1.0</span> <span class=\"nu0\">1.0</span> <span class=\"nu0\">1.0</span> <span class=\"co1\">NB. 6 \"\\</span><br/><span class=\"sy0\">)</span><br/>\u00a0<br/>tokens=.\u00a0;:<span class=\"st_h\">'[ ] , { }\u00a0:'</span><br/>actions=: lBra`rBracket`comma`lBra`rBracket`colon`value<br/>\u00a0<br/><span class=\"co1\">NB. action verbs argument conventions:</span><br/><span class=\"co1\">NB. x -- boxed json word</span><br/><span class=\"co1\">NB. y -- boxed json state stack</span><br/><span class=\"co1\">NB. result -- new boxed json state stack</span><br/><span class=\"co1\">NB.</span><br/><span class=\"co1\">NB. json state stack is an list of boxes of incomplete lists</span><br/><span class=\"co1\">NB. 
(a single box for complete, syntactically valid json)</span><br/>jsonParse=: <span class=\"nu0\">0</span> {:: <span class=\"sy0\">(</span>,a:<span class=\"sy0\">)</span> ,&amp;.&gt; [: actions@.<span class=\"sy0\">(</span>tokens&amp;i.@[<span class=\"sy0\">)</span>/ [:|.a:,words<br/>\u00a0<br/>lBra=: a: ,~ ]<br/>rBracket=: <span class=\"nu0\">_2</span>&amp;}.@], [:&lt; <span class=\"nu0\">_2</span>&amp;{::@], <span class=\"nu0\">_1</span>&amp;{@]<br/>comma=: ]<br/>rBrace=: <span class=\"nu0\">_2</span>&amp;}.@], [:&lt; <span class=\"nu0\">_2</span>&amp;{::@], [:|: <span class=\"sy0\">(</span><span class=\"nu0\">2</span>,~ [: -:@$ <span class=\"nu0\">_1</span>&amp;{@]<span class=\"sy0\">)</span> $ <span class=\"nu0\">_1</span>&amp;{@]<br/>colon=: ]<br/>value=: <span class=\"nu0\">_1</span>&amp;}.@], [:&lt; <span class=\"nu0\">_1</span>&amp;{::@], jsonValue&amp;.&gt;@[<br/>\u00a0<br/><span class=\"co1\">NB. hypothetically, jsonValue should strip double quotes</span><br/><span class=\"co1\">NB. interpret back slashes</span><br/><span class=\"co1\">NB. 
and recognize numbers</span><br/>jsonValue=:]<br/>\u00a0<br/>\u00a0<br/>require<span class=\"st_h\">'strings'</span><br/>jsonSer1=: <span class=\"st_h\">']'</span> ,~ <span class=\"st_h\">'['</span> }:@;@; <span class=\"sy0\">(</span><span class=\"st_h\">','</span> ,~ jsonSerialize<span class=\"sy0\">)</span>&amp;.&gt;<br/>jsonSer0=: <span class=\"st_h\">'\"'</span>, jsonEsc@:\":, <span class=\"st_h\">'\"'</span>\"<span class=\"nu0\">_</span><br/>jsonEsc=: rplc&amp;<span class=\"sy0\">(</span>&lt;;.<span class=\"nu0\">_1</span><span class=\"st_h\">' \\ \\\\ \" \\\"'</span><span class=\"sy0\">)</span><br/>jsonSerialize=:jsonSer0`jsonSer1@.<span class=\"sy0\">(</span>*@L.<span class=\"sy0\">)</span></pre>\n\n\n<p>Example use:\n</p>\n\n\n<pre class=\"text highlighted_source\"> jsonParse'{ \"blue\": [1,2], \"ocean\": \"water\" }'<br/>\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510<br/>\u2502\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\u2502<br/>\u2502\u2502\"blue\"\u2502\u250c\u2500\u252c\u2500\u2510\u2502\"ocean\"\u2502\"water\"\u2502\u2502<br/>\u2502\u2502 \u2502\u25021\u25022\u2502\u2502 \u2502 \u2502\u2502<br/>\u2502\u2502 \u2502\u2514\u2500\u2534\u2500\u2518\u2502 \u2502 \u2502\u2502<br/>\u2502\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2502<br/>\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<br/> jsonSerialize jsonParse'{ \"blue\": [1,2], \"ocean\": \"water\" 
}'<br/>[[\"\\\"blue\\\"\",[\"1\",\"2\"],\"\\\"ocean\\\"\",\"\\\"water\\\"\"]]</pre>\n\n\n<p>Note that these are not strict inverses of each other. These routines allow data to be extracted from json and packed into json format, but only in a minimalistic sense. No attempts are made to preserve the subtleties of type and structure which json can carry. This should be good enough for most applications which are required to deal with json but will not be adequate for ill behaved applications which exploit the typing mechanism to carry significant information.\n</p>\n<p>Also, a different serializer will probably be necessary, if you are delivering json to legacy javascript. Nevertheless, these simplifications are probably appropriate for practical cases.\n</p>\n\n\nEnd of section\n"
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Getting original source for section"
},
{
"cell_type": "code",
"collapsed": false,
"input": "editLink = ourHeader.find(\"a\")['href']\neditpage = BS(requests.get(rosettaurl + editLink).content)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": "print editpage.find(\"textarea\").string",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "=={{header|J}}==\n\nHere is a minimal implementation based on [http://www.jsoftware.com/pipermail/chat/2007-April/000462.html an old email message].\n\n<lang j>NB. character classes:\nNB. 0: whitespace\nNB. 1: \"\nNB. 2: \\\nNB. 3: [ ] , { } :\nNB. 4: ordinary\nclasses=.3<. '\"\\[],{}:' (#@[ |&>: i.) a.\nclasses=.0 (I.a.e.' ',CRLF,TAB)} (]+4*0=])classes\n\nwords=:(0;(0 10#:10*\".;._2]0 :0);classes)&;: NB. states:\n 0.0 1.1 2.1 3.1 4.1 NB. 0 whitespace\n 1.0 5.0 6.0 1.0 1.0 NB. 1 \"\n 4.0 4.0 4.0 4.0 4.0 NB. 2 \\\n 0.3 1.2 2.2 3.2 4.2 NB. 3 { : , } [ ]\n 0.3 1.2 2.0 3.2 4.0 NB. 4 ordinary\n 0.3 1.2 2.2 3.2 4.2 NB. 5 \"\"\n 1.0 1.0 1.0 1.0 1.0 NB. 6 \"\\\n)\n\ntokens=. ;:'[ ] , { } :'\nactions=: lBra`rBracket`comma`lBra`rBracket`colon`value\n\nNB. action verbs argument conventions:\nNB. x -- boxed json word\nNB. y -- boxed json state stack\nNB. result -- new boxed json state stack\nNB.\nNB. json state stack is an list of boxes of incomplete lists\nNB. (a single box for complete, syntactically valid json)\njsonParse=: 0 {:: (,a:) ,&.> [: actions@.(tokens&i.@[)/ [:|.a:,words\n\nlBra=: a: ,~ ]\nrBracket=: _2&}.@], [:< _2&{::@], _1&{@]\ncomma=: ]\nrBrace=: _2&}.@], [:< _2&{::@], [:|: (2,~ [: -:@$ _1&{@]) $ _1&{@]\ncolon=: ]\nvalue=: _1&}.@], [:< _1&{::@], jsonValue&.>@[\n\nNB. hypothetically, jsonValue should strip double quotes\nNB. interpret back slashes\nNB. 
and recognize numbers\njsonValue=:]\n\n\nrequire'strings'\njsonSer1=: ']' ,~ '[' }:@;@; (',' ,~ jsonSerialize)&.>\njsonSer0=: '\"', jsonEsc@:\":, '\"'\"_\njsonEsc=: rplc&(<;._1' \\ \\\\ \" \\\"')\njsonSerialize=:jsonSer0`jsonSer1@.(*@L.)</lang>\n\nExample use:\n\n<lang> jsonParse'{ \"blue\": [1,2], \"ocean\": \"water\" }'\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\u2502\n\u2502\u2502\"blue\"\u2502\u250c\u2500\u252c\u2500\u2510\u2502\"ocean\"\u2502\"water\"\u2502\u2502\n\u2502\u2502 \u2502\u25021\u25022\u2502\u2502 \u2502 \u2502\u2502\n\u2502\u2502 \u2502\u2514\u2500\u2534\u2500\u2518\u2502 \u2502 \u2502\u2502\n\u2502\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n jsonSerialize jsonParse'{ \"blue\": [1,2], \"ocean\": \"water\" }'\n[[\"\\\"blue\\\"\",[\"1\",\"2\"],\"\\\"ocean\\\"\",\"\\\"water\\\"\"]]</lang>\n\nNote that these are not strict inverses of each other. These routines allow data to be extracted from json and packed into json format, but only in a minimalistic sense. No attempts are made to preserve the subtleties of type and structure which json can carry. 
This should be good enough for most applications which are required to deal with json but will not be adequate for ill behaved applications which exploit the typing mechanism to carry significant information.\n\nAlso, a different serializer will probably be necessary, if you are delivering json to legacy javascript. Nevertheless, these simplifications are probably appropriate for practical cases.\n\n"
}
],
"prompt_number": 42
}
],
"metadata": {}
}
]
}