@jkozma
Last active March 5, 2019 03:52
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Extracting text from web pages using bs4.BeautifulSoup</h2>\n",
"<h3>This notebook requires the bs4 package. See <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a> for installation instructions and a more detailed presentation of BeautifulSoup's features and how to use them.</h3>\n",
"\n",
"<h3>Create a BeautifulSoup object from a string of HTML:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div><p>p outside any div</p><div>text in div with no p</div>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from bs4 import BeautifulSoup as htm\n",
"raw_html_page = \"<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div><p>p outside any div</p><div>text in div with no p</div>\"\n",
"bs_obj_from_raw_html_pg = htm(raw_html_page, \"html.parser\")\n",
"bs_obj_from_raw_html_pg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Verify the type of the object:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.BeautifulSoup"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(bs_obj_from_raw_html_pg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Use the find_all method to return all elements with a specified tag name:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p> p-in-div with 2 leading and trailing spaces </p></div>,\n",
" <div>text in div with no p</div>]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_divs = bs_obj_from_raw_html_pg.find_all('div')\n",
"all_divs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The value returned by find_all is a bs4.element.ResultSet, which is iterable:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.element.ResultSet"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(all_divs)"
]
},
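{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>A minimal sketch of what being iterable allows, assuming (as in current bs4 releases) that ResultSet is a subclass of list. The names sample_soup and sample_divs and the markup are illustrative and not part of the cells above:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Illustrative markup, made up for this sketch\n",
"sample_soup = BeautifulSoup(\"<div>first</div><div>second</div>\", \"html.parser\")\n",
"sample_divs = sample_soup.find_all(\"div\")\n",
"\n",
"# A ResultSet is expected to behave like a list: it has a length and can be looped over\n",
"print(isinstance(sample_divs, list), len(sample_divs))\n",
"for div in sample_divs:\n",
"    print(div.text)"
]
},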
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The find_all method searches all descendants, so it also returns p elements nested in a div element:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<p>p-in-div with period.</p>,\n",
" <p>p-in-div with no period</p>,\n",
" <p> p-in-div with 2 leading and trailing spaces </p>,\n",
" <p>p outside any div</p>]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_ps = bs_obj_from_raw_html_pg.find_all('p')\n",
"all_ps"
]
},
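{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>If only top-level elements are wanted, the recursive argument of find_all can probably help. A small sketch, using made-up markup and the illustrative name sample_soup, that compares the default search with recursive=False:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Illustrative markup: one p nested in a div, one p at the top level\n",
"sample_soup = BeautifulSoup(\"<div><p>nested</p></div><p>top level</p>\", \"html.parser\")\n",
"\n",
"# Default search: every descendant p, including the nested one\n",
"print(sample_soup.find_all(\"p\"))\n",
"\n",
"# recursive=False: only p elements that are direct children of the soup object\n",
"print(sample_soup.find_all(\"p\", recursive=False))"
]
},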
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Note that a ResultSet, while iterable, is itself not an iterator:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'ResultSet' object is not an iterator",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-7-fea6845117ab>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mall_ps\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mTypeError\u001b[0m: 'ResultSet' object is not an iterator"
]
}
],
"source": [
"next(all_ps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The builtin iter function may be used to create an iterator from the ResultSet:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<p>p-in-div with period.</p>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_ps_iter = iter(all_ps)\n",
"next(all_ps_iter)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<p>p-in-div with no period</p>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next(all_ps_iter)"
]
},
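{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Calling next on an exhausted iterator raises StopIteration. A hedged sketch of one way to avoid that, passing a default value to next; sample_soup and sample_iter are illustrative names and the markup is made up:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"sample_soup = BeautifulSoup(\"<p>one</p><p>two</p>\", \"html.parser\")\n",
"sample_iter = iter(sample_soup.find_all(\"p\"))\n",
"\n",
"# next() with a default returns that default instead of raising StopIteration\n",
"while True:\n",
"    tag = next(sample_iter, None)\n",
"    if tag is None:\n",
"        break\n",
"    print(tag.text)"
]
},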
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Members of the ResultSet may be accessed using list index notation:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<p>p-in-div with no period</p>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_ps[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The text attribute of a member returns its text content as a string:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'p-in-div with no period'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_ps[1].text"
]
},
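{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The text attribute keeps any leading and trailing whitespace. A minimal sketch, on made-up markup with the illustrative names sample_soup and sample_p, of using get_text(strip=True) to trim it:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"sample_soup = BeautifulSoup(\"<p>  padded text  </p>\", \"html.parser\")\n",
"sample_p = sample_soup.find(\"p\")\n",
"\n",
"# .text keeps the surrounding spaces\n",
"print(repr(sample_p.text))\n",
"\n",
"# get_text(strip=True) strips whitespace from each piece of text\n",
"print(repr(sample_p.get_text(strip=True)))"
]
},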
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The type of ResultSet members is element.Tag:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.element.Tag"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(next(all_ps_iter))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>To extract the text of every member of the ResultSet as a single string, map a lambda expression over the ResultSet and join the results, as suggested in <a href=\"http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html\">http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html</a>:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'p-in-div with period. p-in-div with no period p-in-div with 2 leading and trailing spaces p outside any div'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"' '.join(map(lambda elem_tag: elem_tag.text, all_ps))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>The same result is obtained by replacing the lambda with a named function, which is easier to read, as suggested by A.M. Kuchling in the Python documentation's Functional Programming HOWTO, <a href=\"https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression\">https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression</a>:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'p-in-div with period. p-in-div with no period p-in-div with 2 leading and trailing spaces p outside any div'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def elem_tag_text(elem_tag):\n",
" return elem_tag.text\n",
"' '.join(map(elem_tag_text, all_ps))"
]
},
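{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>A generator expression is another common way to write the same join without map. A small sketch on made-up markup; sample_soup and sample_ps are illustrative names:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"sample_soup = BeautifulSoup(\"<p>alpha</p><p>beta</p>\", \"html.parser\")\n",
"sample_ps = sample_soup.find_all(\"p\")\n",
"\n",
"# Equivalent to ' '.join(map(lambda tag: tag.text, sample_ps))\n",
"print(' '.join(tag.text for tag in sample_ps))"
]
},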
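{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>As expected, extracting the text from the div elements yields a different result: the text of a div is the text of all its descendants run together with no separator, and the p outside any div is left out:</h3>"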
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'p-in-div with period.p-in-div with no period p-in-div with 2 leading and trailing spaces text in div with no p'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"' '.join(map(elem_tag_text, all_divs))"
]
}
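,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>When the text of nested elements should not run together, get_text accepts a separator. A hedged sketch on made-up markup, with the illustrative names sample_soup and sample_div:</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"sample_soup = BeautifulSoup(\"<div><p>first.</p><p>second</p></div>\", \"html.parser\")\n",
"sample_div = sample_soup.find(\"div\")\n",
"\n",
"# .text joins the nested strings with nothing between them\n",
"print(repr(sample_div.text))\n",
"\n",
"# A separator is placed between the nested strings; strip=True trims each one\n",
"print(repr(sample_div.get_text(\" \", strip=True)))"
]
}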
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}