Simple Web-Scraping Utility Functions
# -*- encoding: utf-8 -*-
"""
A Generic Function to Scrape Table Data from a URL

The generic function `readtable()` is defined to fetch/scrape table
data from any given web URL.

@author: Debmalya Pramanik
@version: v1.0.0
"""

import requests
from typing import Iterable
from bs4 import BeautifulSoup  # standard Python library for web scraping


def readtable(weburl: str, html_class_tag: object, **kwargs) -> Iterable:
    """
    A Generic Function to Scrape Table Data from Web URLs

    The function scrapes tables specified as `<table class=?>` in
    the given URL. The function is kept generic, and is called by
    each underlying function.

    :type  weburl: str
    :param weburl: A generic URL of a site from which tables are to
        be scraped. The page can have multiple tables, which is
        handled by the code.

    :type  html_class_tag: object
    :param html_class_tag: The class tag of the table. To find the
        class tag, open the browser's inspect-element view and look
        for the `class` attribute on the `<table>` element.
    """

    markup = kwargs.get("markup", "html.parser")
    verify = kwargs.get("verify_https_request", True)

    response = requests.get(weburl, verify=verify)
    status_code = response.status_code

    print(
        f"Successfully Connected to {weburl}" if status_code == 200
        else f"Failed to Connect. Error Code: {status_code}"
    )

    soup = BeautifulSoup(response.text, markup)
    tables = soup.find_all("table", html_class_tag)
    return tables
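
A minimal usage sketch of `readtable()` follows. The URL is the one mentioned in the Wikipedia module's docstring below; the `{"class": "wikitable"}` tag and the keyword arguments are illustrative, not required.

# Minimal usage sketch of `readtable()`; the class tag shown here is
# the one Wikipedia uses, and is only an example for other sites.
tables = readtable(
    weburl="https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population",
    html_class_tag={"class": "wikitable"},
    markup="html.parser",        # default parser; lxml also works if installed
    verify_https_request=True,   # keep certificate verification on
)
print(len(tables))  # number of matching <table> elements found on the page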

Web Scrapers (webscrappers)

A set of functions that use the BeautifulSoup module to scrape data from various websites.


In a fast-developing world, data is crucial! The webscrappers gist contains a set of functions for easily retrieving data into Python; it uses BeautifulSoup at its core, along with standard modules like requests for fetching pages.

Getting Started

The code is publicly available as the webscrappers gist by ZenithClown. To use it, simply clone with git:

git clone https://gist.github.com/ZenithClown/809642277fba2d8d2309e55ab307615f.git webscrappers
export PYTHONPATH="${PYTHONPATH}:webscrappers"

Done; now you can import the required modules individually, as shown below. All the functions are parameterized as much as possible. Check the individual bot definitions and usage in the Web Scraping BOTs section.
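
With the gist directory on PYTHONPATH, the modules import directly. A minimal sketch, assuming the first file is saved as tables.py (which the Wikipedia module below imports from) and the Wikipedia module as wiki.py (an assumed filename):

# Assumes the gist files are saved as `tables.py` and `wiki.py`;
# `wiki.py` is an assumed name for the Wikipedia module below.
from tables import readtable  # generic table scraper
from wiki import wikitable    # Wikipedia-specific wrapper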

Basics of Web Scraping

"Web scraping is the process of using bots to extract content and data from a website." Given a HTML page, a webscrapper tends to extract information from a HTML tag or elements into a desired format. In python, Beautiful Soup is popular python package for parsing HTML and XML documents. Some good tutorials on bs4 that I personally followed:

In addition, one might need an introduction to Google Chrome DevTools or Microsoft Edge DevTools.
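
A minimal sketch of the bs4 workflow described above, parsing an inline HTML snippet (the HTML here is made up purely for illustration):

from bs4 import BeautifulSoup

# A tiny HTML snippet, invented for illustration only.
html = """
<table class="demo">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Mumbai</td><td>12442373</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "demo"})            # locate the tagged table
cells = [td.get_text() for td in table.find_all("td")]   # extract cell text
print(cells)  # ['Mumbai', '12442373']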

Web Scraping BOTs

TODO: documentation. For now, check the function docstrings for more information.

# -*- encoding: utf-8 -*-
"""
Python Functions to Scrape Data from Wikipedia, the Free Encyclopedia

Wikipedia often contains detailed information about a topic, and
oftentimes this information can be scraped with various tools to
create tables, charts, etc. The utility function provided below
extracts information using the `requests` and `BeautifulSoup`
Python modules.

@author: Debmalya Pramanik
@version: v0.0.2
"""

from io import StringIO
from typing import Iterable

import pandas as pd  # returns each `wikitable` as a pandas dataframe

from tables import readtable  # readtable is now a generic function


def wikitable(weburl: str, **kwargs) -> Iterable[pd.DataFrame]:
    """
    Scrape Table(s) from a Wikipedia Page

    The function searches for all the tables present in a Wikipedia
    page and returns each table as a `pandas` dataframe. For example,
    given the URL for "List of cities in India by population"
    (https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population),
    the query returns a list of dataframes, one per population table
    on the page.

    :type  weburl: str
    :param weburl: A Wikipedia URL from which tables are to be
        scraped. The page can have multiple tables, which is
        handled by the code.
    """

    # pop `html_class_tag` so it is not passed twice through `**kwargs`
    html_class_tag = kwargs.pop("html_class_tag", {"class": "wikitable"})

    tables = readtable(weburl=weburl, html_class_tag=html_class_tag, **kwargs)
    print(f"Fetched {len(tables)} table(s) from the provided Wikipedia page.")

    # `read_html` prefers a file-like object; wrap each tag's markup in StringIO
    return [pd.read_html(StringIO(str(table)))[0] for table in tables]
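
A minimal usage sketch of `wikitable()`, using the URL from the docstring above; the number of tables returned depends on the live page.

# Minimal usage sketch; the URL comes from the docstring above.
frames = wikitable("https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population")
for frame in frames:
    print(frame.shape)   # rows x columns of each scraped table
print(frames[0].head())  # preview the first table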