jkozma/BeautifulSoupPractice.ipynb

## BeautifulSoupPractice.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Extracting text from web pages using bs4.BeautifulSoup</h2>\n",
    "<h3>This notebook requires the bs4 package (distributed on PyPI as beautifulsoup4). See <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a> for installation help and more detailed documentation of BeautifulSoup's features.</h3>\n",
    "\n",
    "<h3>Create a BeautifulSoup object by parsing some HTML:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p>  p-in-div with 2 leading and trailing spaces  </p></div><p>p outside any div</p><div>text in div with no p</div>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from bs4 import BeautifulSoup as htm\n",
    "raw_html_page = \"<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p>  p-in-div with 2 leading and trailing spaces  </p></div><p>p outside any div</p><div>text in div with no p</div>\"\n",
    "bs_obj_from_raw_html_pg = htm(raw_html_page, \"html.parser\")\n",
    "bs_obj_from_raw_html_pg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Verify the type of the object:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bs4.BeautifulSoup"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(bs_obj_from_raw_html_pg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Use the find_all method to return all elements with a specified tag name:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<div><p>p-in-div with period.</p><p>p-in-div with no period</p><p>  p-in-div with 2 leading and trailing spaces  </p></div>,\n",
       " <div>text in div with no p</div>]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_divs = bs_obj_from_raw_html_pg.find_all('div')\n",
    "all_divs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The value returned by find_all is a bs4.element.ResultSet, which is iterable:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bs4.element.ResultSet"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(all_divs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The find_all method returns every p element, including those nested in a div element:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<p>p-in-div with period.</p>,\n",
       " <p>p-in-div with no period</p>,\n",
       " <p>  p-in-div with 2 leading and trailing spaces  </p>,\n",
       " <p>p outside any div</p>]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_ps = bs_obj_from_raw_html_pg.find_all('p')\n",
    "all_ps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Note that a ResultSet, while iterable, is itself not an iterator:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'ResultSet' object is not an iterator",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-7-fea6845117ab>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mnext\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mall_ps\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m: 'ResultSet' object is not an iterator"
     ]
    }
   ],
   "source": [
    "next(all_ps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The built-in iter function may be used to create an iterator over the ResultSet:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<p>p-in-div with period.</p>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_ps_iter = iter(all_ps)\n",
    "next(all_ps_iter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<p>p-in-div with no period</p>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(all_ps_iter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Members of the ResultSet may also be accessed by index, as with a list:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<p>p-in-div with no period</p>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_ps[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The text attribute of an element returns its text content as a string:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'p-in-div with no period'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_ps[1].text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The type of ResultSet members is element.Tag:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bs4.element.Tag"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(next(all_ps_iter))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>To extract the text from all members of the ResultSet as a single string, map can be combined with a lambda expression in a single line, as suggested in <a href=\"http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html\">http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html</a>:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'p-in-div with period. p-in-div with no period   p-in-div with 2 leading and trailing spaces   p outside any div'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(map(lambda elem_tag: elem_tag.text, all_ps))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>The same result is obtained when the lambda is replaced by a named function, which makes the code easier to read, as suggested by A.M. Kuchling in the Python documentation's Functional Programming HOWTO, <a href=\"https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression\">https://docs.python.org/3/howto/functional.html#small-functions-and-the-lambda-expression</a>:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'p-in-div with period. p-in-div with no period   p-in-div with 2 leading and trailing spaces   p outside any div'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def elem_tag_text(elem_tag):\n",
    "    return elem_tag.text\n",
    "' '.join(map(elem_tag_text, all_ps))"
   ]
  },
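  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>As expected, extracting the text from the div elements yields a different result: the p outside any div is omitted, the text of the div with no p is included, and the text of the p elements nested in the first div is concatenated with no separator between them:</h3>"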
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'p-in-div with period.p-in-div with no period  p-in-div with 2 leading and trailing spaces   text in div with no p'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(map(elem_tag_text, all_divs))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
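
The cells above join the raw `.text` values, so the padded paragraph keeps its leading and trailing spaces, and the text of the p elements nested in the first div runs together without a separator. A minimal sketch of one alternative follows, assuming the `get_text` method with its `separator` and `strip` arguments suits the task. The names `soup`, `p_texts`, `div_texts`, `joined_ps`, and `joined_divs` are illustrative and do not appear in the notebook.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the notebook.
raw_html_page = (
    "<div><p>p-in-div with period.</p><p>p-in-div with no period</p>"
    "<p>  p-in-div with 2 leading and trailing spaces  </p></div>"
    "<p>p outside any div</p><div>text in div with no p</div>"
)
soup = BeautifulSoup(raw_html_page, "html.parser")

# strip=True trims whitespace around each piece of text in the element.
p_texts = [p.get_text(strip=True) for p in soup.find_all("p")]
joined_ps = " ".join(p_texts)

# separator=" " places a space between the text pieces of nested elements,
# so the nested p texts inside the first div no longer run together.
div_texts = [
    div.get_text(separator=" ", strip=True) for div in soup.find_all("div")
]
joined_divs = " ".join(div_texts)

print(joined_ps)
print(joined_divs)
```

A list comprehension is used here in place of `map`; either works, and the comprehension avoids the need for a lambda or a helper function.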