@krishashok
Last active April 13, 2024 03:55
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "e3bde860",
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup as bs\n",
"import pandas as pd\n",
"import numpy as np\n",
"import requests"
]
},
{
"cell_type": "markdown",
"id": "928e6ea1",
"metadata": {},
"source": [
"### Crawl sunoindia.in and extract all outgoing links in the show notes for The Seen and the Unseen\n",
"Why this website and not Amit Varma's own website? Because seenunseen.in is not crawl-friendly: its URLs contain dates and the pages use autoscroll."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4461373",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Header\n",
"headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',\n",
" 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'}\n",
"\n",
"all_episode_links = []\n",
"\n",
"# Extract all links to individual episode pages\n",
"for page in range(1, 31):\n",
" url = f'https://www.sunoindia.in/the-seen-and-the-unseen-hosted-by-amit-varma?page={page}#episodeList'\n",
" print(f'Loading page {url}')\n",
" r = requests.get(url, headers=headers)\n",
" soup = bs(r.content, 'html.parser')\n",
"\n",
" # Only consider anchors that actually carry an href attribute\n",
" all_links = soup.find_all('a', href=True)\n",
" episode_links_on_page = [link['href'] for link in all_links if 'amit-varma' in link['href'] and 'page=' not in link['href']]\n",
" # De-duplicate within the page before adding to the master list\n",
" all_episode_links.extend(set(episode_links_on_page))\n",
"\n",
"print(all_episode_links)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5b93661",
"metadata": {},
"outputs": [],
"source": [
"len(all_episode_links)"
]
},
{
"cell_type": "markdown",
"id": "beafa931",
"metadata": {},
"source": [
"### Go through each episode page and extract all outgoing links in show notes and save in a pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5dee7f8",
"metadata": {},
"outputs": [],
"source": [
"def return_href(link):\n",
" # Return the anchor's href, or '#' if the tag has no href attribute\n",
" return link.get('href', '#')\n",
"\n",
"all_episodes = []\n",
"for link in all_episode_links:\n",
" episode = {}\n",
" r = requests.get(link, headers=headers)\n",
" soup = bs(r.content, 'html.parser')\n",
"\n",
" # Grab the episode title (empty string if the page has no h4)\n",
" title_tag = soup.find('h4')\n",
" episode['title'] = title_tag.get_text() if title_tag else ''\n",
"\n",
" # The show notes are in the div with id 'tab-1'; tolerate pages without it\n",
" show_notes = soup.find('div', id='tab-1')\n",
" outgoing_links = show_notes.find_all('a') if show_notes else []\n",
" outgoing_href = [return_href(item) for item in outgoing_links]\n",
"\n",
" episode['links'] = outgoing_href\n",
" all_episodes.append(episode)\n",
"\n",
"all_episodes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9f9a04c",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(all_episodes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d5506e1",
"metadata": {},
"outputs": [],
"source": [
"# Add a column that stores the number of outgoing links\n",
"df['count'] = df['links'].apply(len)"
]
},
{
"cell_type": "markdown",
"id": "db306932",
"metadata": {},
"source": [
"### Save Dataframe to CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fc9efa6",
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('seenunseen.csv')"
]
},
{
"cell_type": "markdown",
"id": "46241fb0",
"metadata": {},
"source": [
"### Episodes with the largest number of links in the show notes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c5dfbc1",
"metadata": {},
"outputs": [],
"source": [
"df.sort_values('count', ascending=False).head(20)"
]
},
{
"cell_type": "markdown",
"id": "291682ed",
"metadata": {},
"source": [
"### Most frequently used outgoing links"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef7cad2",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"all_outgoing_links = list(df['links'])\n",
"\n",
"# Flatten the per-episode lists into one list of links\n",
"flat_list = [item for sublist in all_outgoing_links for item in sublist]\n",
"\n",
"# The 200 most common links, as a {link: count} dict\n",
"counter = Counter(flat_list).most_common(200)\n",
"\n",
"freq_dict = dict(counter)\n",
"freq_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "754c02e3",
"metadata": {},
"outputs": [],
"source": [
"filtered_flat_list = [link for link in flat_list if 'seenunseen' not in link]\n",
"counter = Counter(filtered_flat_list).most_common(80)\n",
"\n",
"freq_dict = dict(counter)\n",
"freq_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "502a394f",
"metadata": {},
"outputs": [],
"source": [
"len(flat_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c59513a5",
"metadata": {},
"outputs": [],
"source": [
"len(set(flat_list))"
]
},
{
"cell_type": "markdown",
"id": "3127f60b",
"metadata": {},
"source": [
"### Products most frequently recommended"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e2f713e",
"metadata": {},
"outputs": [],
"source": [
"amazon_list = [item for item in flat_list if 'amazon' in item]\n",
"amazon_counter = Counter(amazon_list).most_common(5)\n",
"\n",
"amazon_freq_dict = dict(amazon_counter)\n",
"amazon_freq_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15a9cacb",
"metadata": {},
"outputs": [],
"source": [
"len(amazon_list)"
]
},
{
"cell_type": "markdown",
"id": "a7a5b43f",
"metadata": {},
"source": [
"### People most regularly recommended"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c5b06de",
"metadata": {},
"outputs": [],
"source": [
"tw_list = [item for item in flat_list if 'twitter' in item and 'status' not in item]\n",
"# Strip a trailing '?lang=en' so the same profile is counted once\n",
"twitter_list = [tw.removesuffix('?lang=en') for tw in tw_list]\n",
"twitter_counter = Counter(twitter_list).most_common(40)\n",
"\n",
"twitter_freq_dict = dict(twitter_counter)\n",
"twitter_freq_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a461203d",
"metadata": {},
"outputs": [],
"source": [
"len(tw_list)"
]
},
{
"cell_type": "markdown",
"id": "e00b5609",
"metadata": {},
"source": [
"### Videos most regularly recommended"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50ab8d17",
"metadata": {},
"outputs": [],
"source": [
"yt_list = [item for item in flat_list if 'youtube' in item]\n",
"yt_counter = Counter(yt_list).most_common(40)\n",
"\n",
"yt_freq_dict = dict(yt_counter)\n",
"yt_freq_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b489ae8d",
"metadata": {},
"outputs": [],
"source": [
"len(yt_list)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}