https://nbastats.pro

Elias Dabbas eliasdabbas
@eliasdabbas
eliasdabbas / get_bot_ip_addresses.py
Last active June 28, 2024 09:45
Get the most up-to-date list of IP addresses for crawler bots, belonging to Google and Bing.
import ipaddress
import requests
import pandas as pd

def bot_ip_addresses():
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json'
    }
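The preview above stops before the lists are downloaded and parsed. As a rough sketch of the next step, the block below assumes each JSON file has a top-level `prefixes` list whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key; that shape, the helper names `parse_networks` and `is_listed`, and the sample prefixes are assumptions for illustration, not part of the gist. It works on an in-memory sample rather than a live download, so no request is made.

```python
import ipaddress

# Hypothetical sample in the assumed shape of the published JSON files.
# The prefixes are illustrative values, not a current list.
sample_payload = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}

def parse_networks(payload):
    """Convert the assumed 'prefixes' entries into ip_network objects."""
    networks = []
    for entry in payload.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_listed(ip, networks):
    """Return True if the address falls inside any of the networks."""
    address = ipaddress.ip_address(ip)
    # Membership tests between IPv4 and IPv6 objects simply return False.
    return any(address in network for network in networks)

networks = parse_networks(sample_payload)
print(is_listed("66.249.64.5", networks))
print(is_listed("8.8.8.8", networks))
```

With a real download, the payload would presumably come from `requests.get(url).json()` for each entry in `bots_urls`.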
@eliasdabbas
eliasdabbas / score_links.py
Last active September 20, 2022 12:26
Score internal links using two columns, "Source" and "Destination". This calculates various link importance metrics, such as degree centrality, betweenness centrality, and PageRank.
# !pip install --upgrade transformers plotly pandas
import plotly.graph_objects as go
import pandas as pd
pd.options.display.max_columns = None
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
results = []
cars = ['mercedes', 'audi', 'bmw', 'volkswagen', 'ford', 'toyota',
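The description of score_links.py mentions degree centrality and PageRank computed from "Source" and "Destination" pairs. The sketch below illustrates that idea with the standard library only: it counts in-links and out-links and runs a plain power-iteration PageRank. It is a stand-in, not the gist's code; the function name `score_links`, the damping factor of 0.85, the iteration count, and the sample URLs are all assumptions, and betweenness centrality is left out to keep the example short.

```python
from collections import defaultdict

def score_links(edges, damping=0.85, iterations=50):
    """Score pages from (source, destination) pairs.

    Returns in-degree, out-degree, and a power-iteration PageRank
    for every page that appears in the edge list.
    """
    nodes = sorted({page for edge in edges for page in edge})
    out_links = defaultdict(set)
    in_degree = {page: 0 for page in nodes}
    for source, destination in edges:
        if destination not in out_links[source]:  # ignore duplicate links
            out_links[source].add(destination)
            in_degree[destination] += 1

    total = len(nodes)
    rank = {page: 1 / total for page in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[page] for page in nodes if not out_links[page])
        base = (1 - damping) / total + damping * dangling / total
        new_rank = {page: base for page in nodes}
        for source in nodes:
            targets = out_links[source]
            for destination in targets:
                new_rank[destination] += damping * rank[source] / len(targets)
        rank = new_rank

    return {
        page: {
            "in_degree": in_degree[page],
            "out_degree": len(out_links[page]),
            "pagerank": rank[page],
        }
        for page in nodes
    }

# Hypothetical internal links: (source, destination).
sample_edges = [("/", "/a"), ("/", "/b"), ("/a", "/b"), ("/b", "/")]
for page, scores in score_links(sample_edges).items():
    print(page, scores)
```

For larger sites a graph library would likely be a better fit than this hand-rolled loop, which favours readability over speed.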
@eliasdabbas
eliasdabbas / crawl_multiple_sites.py
Last active April 27, 2022 08:56
Crawl multiple websites with one for loop, while saving the output, logs, and job status separately for each website. Resume crawling at any time simply by re-running the same code.
from urllib.parse import urlsplit
import advertools as adv

sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]
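The preview ends before the loop itself. One way the per-site separation described above could be arranged is sketched below: each site's domain is used to derive its own output file, log file, and job directory. The file-naming scheme and the helper names `crawl_plan` and `crawl_sites` are assumptions for illustration; `LOG_FILE` and `JOBDIR` are standard Scrapy settings passed through the `custom_settings` parameter of `adv.crawl`, and a persistent `JOBDIR` is what allows a re-run to pick up an interrupted crawl.

```python
from urllib.parse import urlsplit

def crawl_plan(site):
    """Build per-site keyword arguments for adv.crawl (naming is hypothetical)."""
    domain = urlsplit(site).netloc
    return {
        "url_list": site,
        "output_file": f"{domain}.jl",          # crawl output, one file per site
        "follow_links": True,
        "custom_settings": {
            "LOG_FILE": f"{domain}.log",        # separate log per site
            "JOBDIR": f"{domain}_crawl_job",    # job state, enables resuming
        },
    }

def crawl_sites(sites):
    """Crawl each site in turn; re-running resumes from each JOBDIR."""
    import advertools as adv  # imported here so crawl_plan works without it
    for site in sites:
        adv.crawl(**crawl_plan(site))

print(crawl_plan("https://www.who.int"))
```

Calling `crawl_sites(sites)` with the list above would start the actual crawls, so it is left out of the example run.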
@eliasdabbas
eliasdabbas / serp_heatmap.py
Last active February 2, 2024 22:58
Create a heatmap of SERPs, using a table with columns: "keyword", "rank", and "domain"
import plotly.graph_objects as go
import pandas as pd

def serp_heatmap(df, num_domains=10, select_domain=None):
    df = df.rename(columns={'domain': 'displayLink',
                            'searchTerms': 'keyword'})
    top_domains = df['displayLink'].value_counts()[:num_domains].index.tolist()
    top_df = df[df['displayLink'].isin(top_domains) & df['displayLink'].ne('')]
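The visible part of the function selects the most frequent domains and filters the table to them. The sketch below mirrors that step with the standard library instead of pandas and then counts how often each top domain appears at each rank, which is one plausible input for a heatmap. The function name `serp_counts`, the row format, and the sample data are hypothetical, and the counting step is an assumption about what the truncated code goes on to do.

```python
from collections import Counter

def serp_counts(rows, num_domains=10):
    """Return the top domains and a count of (domain, rank) appearances."""
    # Rows with an empty domain are skipped, as in the filter above.
    domain_totals = Counter(row["domain"] for row in rows if row["domain"])
    top_domains = [domain for domain, _ in domain_totals.most_common(num_domains)]
    cell_counts = Counter(
        (row["domain"], row["rank"])
        for row in rows
        if row["domain"] in top_domains
    )
    return top_domains, cell_counts

# Hypothetical SERP rows with the three columns named in the description.
sample_rows = [
    {"keyword": "k1", "rank": 1, "domain": "a.com"},
    {"keyword": "k2", "rank": 1, "domain": "a.com"},
    {"keyword": "k3", "rank": 2, "domain": "a.com"},
    {"keyword": "k1", "rank": 2, "domain": "b.com"},
    {"keyword": "k2", "rank": 2, "domain": "b.com"},
    {"keyword": "k3", "rank": 1, "domain": "c.com"},
    {"keyword": "k3", "rank": 3, "domain": ""},
]

top_domains, cell_counts = serp_counts(sample_rows, num_domains=2)
print(top_domains)
print(cell_counts)
```

Each `(domain, rank)` count can then be mapped to a cell's colour or marker size in the plotting library of choice.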