Last active
March 5, 2019 03:52
-
-
Save jkozma/98d64e81ba173fdf0feb9972149ca3fb to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Extracting text from web pages using bs4.BeautifulSoup</h2>\n", | |
"<h3>Obviously, bs4 needs to be installed. See <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a> for installation help and a more detailed presentation on BeautifulSoup features and how to use them.</h3>\n", | |
"\n", | |
"<h3>Create a BeautifulSoup object and instantiate it with some html:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div><p>p outside any div</p><div>text in div with no p</div>" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from bs4 import BeautifulSoup as htm\n", | |
"raw_html_page = \"<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div><p>p outside any div</p><div>text in div with no p</div>\"\n", | |
"bs_obj_from_raw_html_pg = htm(raw_html_page, \"html.parser\")\n", | |
"bs_obj_from_raw_html_pg" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Verify the type of the object:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"bs4.BeautifulSoup" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"type(bs_obj_from_raw_html_pg)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Use the find_all function to return a set of all elements of a specified type:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div>,\n", | |
" <div>text in div with no p</div>]" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"all_divs = bs_obj_from_raw_html_pg.find_all('div')\n", | |
"all_divs" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The element.ResultSet is returned as an iterable:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"bs4.element.ResultSet" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"type(all_divs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The find_all function returns p elements nested in a div element:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[<p>p-in-div with period.</p>,\n", | |
" <p>p-in-div with no period</p>,\n", | |
" <p> p-in-div with 2 leading and trailing spaces </p>,\n", | |
" <p>p outside any div</p>]" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"all_ps = bs_obj_from_raw_html_pg.find_all('p')\n", | |
"all_ps" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Note that a ResultSet, while iterable, is itself not an iterator:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"ename": "TypeError", | |
"evalue": "'ResultSet' object is not an iterator", | |
"output_type": "error", | |
"traceback": [ | |
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", | |
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", | |
"\u001b[1;32m<ipython-input-7-fea6845117ab>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mall_ps\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", | |
"\u001b[1;31mTypeError\u001b[0m: 'ResultSet' object is not an iterator" | |
] | |
} | |
], | |
"source": [ | |
"next(all_ps)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The builtin iter function may be used to create an iterator function:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<p>p-in-div with period.</p>" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"all_ps_iter = iter(all_ps)\n", | |
"next(all_ps_iter)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<p>p-in-div with no period</p>" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"next(all_ps_iter)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Members of the the ResultSet may be accessed using list notation:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<p>p-in-div with no period</p>" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"all_ps[1]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The text attribute may be accessed as a string:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'p-in-div with no period'" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"all_ps[1].text" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The type of ResultSet members is element.Tag:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"bs4.element.Tag" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"type(next(all_ps_iter))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>To extract the text from all members of the ResultSet as a single string, a single expression with a lambda expression can be used, as suggested in <a href=\"http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html\">http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html</a>:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'p-in-div with period. p-in-div with no period p-in-div with 2 leading and trailing spaces p outside any div'" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"' '.join(map(lambda elem_tag: elem_tag.text, all_ps))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>The same result is obtained using refactored code with a named function to make it more understandable, as suggested by A.M. Kuchling in online python docs, <a href=\"https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression\">https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression</a>:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'p-in-div with period. p-in-div with no period p-in-div with 2 leading and trailing spaces p outside any div'" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"def elem_tag_text(elem_tag):\n", | |
" return elem_tag.text\n", | |
"' '.join(map(elem_tag_text, all_ps))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>As expected, extracting the text from the div elements yields a different result:</h3>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'p-in-div with period.p-in-div with no period p-in-div with 2 leading and trailing spaces text in div with no p'" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"' '.join(map(elem_tag_text, all_divs))" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.0" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment