Simple Web-Scraping Utility Functions
# -*- encoding: utf-8 -*-
"""
A Generic Function to Scrape Table Data from a URL

The generic function `readtable()` is defined to fetch/scrape table
data from any given web URL.

@author: Debmalya Pramanik
@version: v1.0.0
"""

import requests
from typing import Iterable
from bs4 import BeautifulSoup  # standard Python library for web scraping


def readtable(weburl: str, html_class_tag: object, **kwargs) -> Iterable:
    """
    A Generic Function to Scrape Table Data from Web URLs

    The function scrapes tables specified as `<table class=?>` in
    the given URL. The function is kept generic, and is called by
    each underlying function.

    :type  weburl: str
    :param weburl: A generic URL of a site from which tables are to
        be scraped. The page can have multiple tables, which is
        handled by the code.

    :type  html_class_tag: object
    :param html_class_tag: The class tag of the table. To find the
        class tag, open the browser's inspect-element view and look
        for the `class` attribute on the `<table>` element.
    """

    markup = kwargs.get("markup", "html.parser")
    verify = kwargs.get("verify_https_request", True)

    response = requests.get(weburl, verify=verify)
    status_code = response.status_code

    print(
        f"Successfully Connected to {weburl}" if status_code == 200
        else f"Failed to Connect. Error Code: {status_code}"
    )

    soup = BeautifulSoup(response.text, markup)
    tables = soup.find_all("table", html_class_tag)
    return tables
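
A minimal usage sketch of `readtable()` follows. The URL is the one mentioned in the Wikipedia module's docstring below; the `{"class": "wikitable"}` tag and the keyword arguments are illustrative, not required.

# Minimal usage sketch of `readtable()`; the class tag shown here is
# the one Wikipedia uses, and is only an example for other sites.
tables = readtable(
    weburl="https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population",
    html_class_tag={"class": "wikitable"},
    markup="html.parser",        # default parser; lxml also works if installed
    verify_https_request=True,   # keep certificate verification on
)
print(len(tables))  # number of matching <table> elements found on the page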

Web Scrapers (webscrappers)

A set of functions that use the BeautifulSoup module to scrape data from various websites.


In a fast-developing world, data is crucial! The webscrappers gist contains a set of functions for easily retrieving data into Python; it uses BeautifulSoup at its core, along with standard modules like requests for fetching pages.

Getting Started

The code is publicly available as the webscrappers gist by ZenithClown. To use it, simply clone with git:

git clone https://gist.github.com/ZenithClown/809642277fba2d8d2309e55ab307615f.git webscrappers
export PYTHONPATH="${PYTHONPATH}:webscrappers"

Done; now you can import the required modules individually, as shown below. All the functions are parameterized as much as possible. Check the individual bot definitions and usage in the Web Scraping BOTs section.
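
With the gist directory on PYTHONPATH, the modules import directly. A minimal sketch, assuming the first file is saved as tables.py (which the Wikipedia module below imports from) and the Wikipedia module as wiki.py (an assumed filename):

# Assumes the gist files are saved as `tables.py` and `wiki.py`;
# `wiki.py` is an assumed name for the Wikipedia module below.
from tables import readtable  # generic table scraper
from wiki import wikitable    # Wikipedia-specific wrapper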

Basics of Web Scraping

"Web scraping is the process of using bots to extract content and data from a website." Given a HTML page, a webscrapper tends to extract information from a HTML tag or elements into a desired format. In python, Beautiful Soup is popular python package for parsing HTML and XML documents. Some good tutorials on bs4 that I personally followed:

In addition, one might need an introduction to Google Chrome DevTools or Microsoft Edge DevTools.
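
A minimal sketch of the bs4 workflow described above, parsing an inline HTML snippet (the HTML here is made up purely for illustration):

from bs4 import BeautifulSoup

# A tiny HTML snippet, invented for illustration only.
html = """
<table class="demo">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Mumbai</td><td>12442373</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "demo"})            # locate the tagged table
cells = [td.get_text() for td in table.find_all("td")]   # extract cell text
print(cells)  # ['Mumbai', '12442373']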

Web Scraping BOTs

TODO: documentation. For now, check the function docstrings for more information.

# -*- encoding: utf-8 -*-
"""
Python Functions to Scrape Data from Wikipedia, the Free Encyclopedia

Wikipedia often contains detailed information about a topic, and
oftentimes this information can be scraped with various tools to
create tables, charts, etc. The utility function provided below
extracts information using the `requests` and `BeautifulSoup`
Python modules.

@author: Debmalya Pramanik
@version: v0.0.2
"""

from io import StringIO
from typing import Iterable

import pandas as pd  # returns each `wikitable` as a pandas dataframe

from tables import readtable  # readtable is now a generic function


def wikitable(weburl: str, **kwargs) -> Iterable[pd.DataFrame]:
    """
    Scrape Table(s) from a Wikipedia Page

    The function searches for all the tables present in a Wikipedia
    page and returns each table as a `pandas` dataframe. For example,
    given the URL for "List of cities in India by population"
    (https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population),
    the query returns a list of dataframes, one per population table
    on the page.

    :type  weburl: str
    :param weburl: A Wikipedia URL from which tables are to be
        scraped. The page can have multiple tables, which is
        handled by the code.
    """

    # pop `html_class_tag` so it is not passed twice through `**kwargs`
    html_class_tag = kwargs.pop("html_class_tag", {"class": "wikitable"})

    tables = readtable(weburl=weburl, html_class_tag=html_class_tag, **kwargs)
    print(f"Fetched {len(tables)} table(s) from the provided Wikipedia page.")

    # `read_html` prefers a file-like object; wrap each tag's markup in StringIO
    return [pd.read_html(StringIO(str(table)))[0] for table in tables]
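
A minimal usage sketch of `wikitable()`, using the URL from the docstring above; the number of tables returned depends on the live page.

# Minimal usage sketch; the URL comes from the docstring above.
frames = wikitable("https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population")
for frame in frames:
    print(frame.shape)   # rows x columns of each scraped table
print(frames[0].head())  # preview the first table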