@gitronald
Created May 6, 2024 18:01
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# WebSearcher Locations\n",
"\n",
"A brief guide to using locations with [WebSearcher](https://github.com/gitronald/WebSearcher)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import WebSearcher as ws\n",
"\n",
"# Create a directory for location data if it doesn't already exist\n",
"locations_dir = '../data/locations' # save in parent directory\n",
"os.makedirs(locations_dir, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"### 1. Use `WebSearcher` to download the latest location data:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Version out of date\n",
"getting: https://developers.google.com/static/google-ads/api/data/geo/geotargets-2024-03-20.csv.zip\n",
"saved: ../data/locations/geotargets-2024-03-20.csv\n"
]
}
],
"source": [
"ws.download_locations(locations_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"### 2. Load the location data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 194956 entries, 0 to 194955\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Criteria ID 194956 non-null int64 \n",
" 1 Name 194955 non-null object \n",
" 2 Canonical Name 194956 non-null object \n",
" 3 Parent ID 194711 non-null float64\n",
" 4 Country Code 194940 non-null object \n",
" 5 Target Type 194956 non-null object \n",
" 6 Status 194956 non-null object \n",
"dtypes: float64(1), int64(1), object(5)\n",
"memory usage: 10.4+ MB\n",
"\n",
"Example location row:\n"
]
},
{
"data": {
"text/plain": [
"Criteria ID 1000002\n",
"Name Kabul\n",
"Canonical Name Kabul,Kabul,Afghanistan\n",
"Parent ID 9075393.0\n",
"Country Code AF\n",
"Target Type City\n",
"Status Active\n",
"Name: 0, dtype: object"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
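"# Find the most recent file (date-stamped filenames sort chronologically;\n",
"# os.listdir alone returns entries in arbitrary order)\n",
"csv_files = sorted(f for f in os.listdir(locations_dir) if f.endswith('.csv'))\n",
"f = csv_files[-1]\n",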
"fp = os.path.join(locations_dir, f)\n",
"\n",
"# Load the data and display details\n",
"locs = pd.read_csv(fp)\n",
"locs.info()\n",
"\n",
"print(\"\\nExample location row:\")\n",
"display(locs.loc[0, :])"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"### 3. Select a \"Canonical Name\"\n",
"\n",
"The Canonical Name is what WebSearcher needs to geolocate a search via the \n",
"`location` parameter. It is a string that uniquely identifies a location. For \n",
"example, the Canonical Name for Kabul in the row above is \n",
"\"Kabul,Kabul,Afghanistan\" (note that there are no spaces after the commas).\n",
"\n",
"Here we'll look up Boston's Canonical Name:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Criteria ID</th>\n",
" <th>Name</th>\n",
" <th>Canonical Name</th>\n",
" <th>Parent ID</th>\n",
" <th>Country Code</th>\n",
" <th>Target Type</th>\n",
" <th>Status</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5367</th>\n",
" <td>1006543</td>\n",
" <td>Boston</td>\n",
" <td>Boston,England,United Kingdom</td>\n",
" <td>20339.0</td>\n",
" <td>GB</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15832</th>\n",
" <td>1018127</td>\n",
" <td>Boston</td>\n",
" <td>Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15891</th>\n",
" <td>1018186</td>\n",
" <td>East Boston</td>\n",
" <td>East Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Neighborhood</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17179</th>\n",
" <td>1019481</td>\n",
" <td>New Boston</td>\n",
" <td>New Boston,Michigan,United States</td>\n",
" <td>21155.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19611</th>\n",
" <td>1021917</td>\n",
" <td>New Boston</td>\n",
" <td>New Boston,New Hampshire,United States</td>\n",
" <td>21163.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24337</th>\n",
" <td>1026650</td>\n",
" <td>New Boston</td>\n",
" <td>New Boston,Texas,United States</td>\n",
" <td>21176.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24732</th>\n",
" <td>1027045</td>\n",
" <td>Boston</td>\n",
" <td>Boston,Virginia,United States</td>\n",
" <td>21178.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24972</th>\n",
" <td>1027285</td>\n",
" <td>South Boston</td>\n",
" <td>South Boston,Virginia,United States</td>\n",
" <td>21178.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64474</th>\n",
" <td>9041359</td>\n",
" <td>Boston Logan International Airport</td>\n",
" <td>Boston Logan International Airport,Massachuset...</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Airport</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64621</th>\n",
" <td>9041514</td>\n",
" <td>Manchester-Boston Regional Airport</td>\n",
" <td>Manchester-Boston Regional Airport,New Hampshi...</td>\n",
" <td>21163.0</td>\n",
" <td>US</td>\n",
" <td>Airport</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82689</th>\n",
" <td>9060153</td>\n",
" <td>Boston College</td>\n",
" <td>Boston College,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>University</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83010</th>\n",
" <td>9060476</td>\n",
" <td>Boston Ave - Mill Hill</td>\n",
" <td>Boston Ave - Mill Hill,Connecticut,United States</td>\n",
" <td>21139.0</td>\n",
" <td>US</td>\n",
" <td>Neighborhood</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83851</th>\n",
" <td>9061334</td>\n",
" <td>South Boston</td>\n",
" <td>South Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Neighborhood</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104482</th>\n",
" <td>9104649</td>\n",
" <td>La Bostonnais</td>\n",
" <td>La Bostonnais,Quebec,Canada</td>\n",
" <td>20123.0</td>\n",
" <td>CA</td>\n",
" <td>Municipality</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Criteria ID Name \\\n",
"5367 1006543 Boston \n",
"15832 1018127 Boston \n",
"15891 1018186 East Boston \n",
"17179 1019481 New Boston \n",
"19611 1021917 New Boston \n",
"24337 1026650 New Boston \n",
"24732 1027045 Boston \n",
"24972 1027285 South Boston \n",
"64474 9041359 Boston Logan International Airport \n",
"64621 9041514 Manchester-Boston Regional Airport \n",
"82689 9060153 Boston College \n",
"83010 9060476 Boston Ave - Mill Hill \n",
"83851 9061334 South Boston \n",
"104482 9104649 La Bostonnais \n",
"\n",
" Canonical Name Parent ID \\\n",
"5367 Boston,England,United Kingdom 20339.0 \n",
"15832 Boston,Massachusetts,United States 21152.0 \n",
"15891 East Boston,Massachusetts,United States 21152.0 \n",
"17179 New Boston,Michigan,United States 21155.0 \n",
"19611 New Boston,New Hampshire,United States 21163.0 \n",
"24337 New Boston,Texas,United States 21176.0 \n",
"24732 Boston,Virginia,United States 21178.0 \n",
"24972 South Boston,Virginia,United States 21178.0 \n",
"64474 Boston Logan International Airport,Massachuset... 21152.0 \n",
"64621 Manchester-Boston Regional Airport,New Hampshi... 21163.0 \n",
"82689 Boston College,Massachusetts,United States 21152.0 \n",
"83010 Boston Ave - Mill Hill,Connecticut,United States 21139.0 \n",
"83851 South Boston,Massachusetts,United States 21152.0 \n",
"104482 La Bostonnais,Quebec,Canada 20123.0 \n",
"\n",
" Country Code Target Type Status \n",
"5367 GB City Active \n",
"15832 US City Active \n",
"15891 US Neighborhood Active \n",
"17179 US City Active \n",
"19611 US City Active \n",
"24337 US City Active \n",
"24732 US City Active \n",
"24972 US City Active \n",
"64474 US Airport Active \n",
"64621 US Airport Active \n",
"82689 US University Active \n",
"83010 US Neighborhood Active \n",
"83851 US Neighborhood Active \n",
"104482 CA Municipality Active "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regex = r'(?=.*Boston)'\n",
"str_mask = locs['Canonical Name'].str.contains(regex)\n",
"locs[str_mask]"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"There will often be many matches, so you will probably want to narrow them down by state:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Criteria ID</th>\n",
" <th>Name</th>\n",
" <th>Canonical Name</th>\n",
" <th>Parent ID</th>\n",
" <th>Country Code</th>\n",
" <th>Target Type</th>\n",
" <th>Status</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>15832</th>\n",
" <td>1018127</td>\n",
" <td>Boston</td>\n",
" <td>Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>City</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15891</th>\n",
" <td>1018186</td>\n",
" <td>East Boston</td>\n",
" <td>East Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Neighborhood</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64474</th>\n",
" <td>9041359</td>\n",
" <td>Boston Logan International Airport</td>\n",
" <td>Boston Logan International Airport,Massachuset...</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Airport</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82689</th>\n",
" <td>9060153</td>\n",
" <td>Boston College</td>\n",
" <td>Boston College,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>University</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83851</th>\n",
" <td>9061334</td>\n",
" <td>South Boston</td>\n",
" <td>South Boston,Massachusetts,United States</td>\n",
" <td>21152.0</td>\n",
" <td>US</td>\n",
" <td>Neighborhood</td>\n",
" <td>Active</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Criteria ID Name \\\n",
"15832 1018127 Boston \n",
"15891 1018186 East Boston \n",
"64474 9041359 Boston Logan International Airport \n",
"82689 9060153 Boston College \n",
"83851 9061334 South Boston \n",
"\n",
" Canonical Name Parent ID \\\n",
"15832 Boston,Massachusetts,United States 21152.0 \n",
"15891 East Boston,Massachusetts,United States 21152.0 \n",
"64474 Boston Logan International Airport,Massachuset... 21152.0 \n",
"82689 Boston College,Massachusetts,United States 21152.0 \n",
"83851 South Boston,Massachusetts,United States 21152.0 \n",
"\n",
" Country Code Target Type Status \n",
"15832 US City Active \n",
"15891 US Neighborhood Active \n",
"64474 US Airport Active \n",
"82689 US University Active \n",
"83851 US Neighborhood Active "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regex = r'(?=.*Boston)(?=.*Massachusetts)'\n",
"str_mask = locs['Canonical Name'].str.contains(regex)\n",
"locs[str_mask]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That still leaves several options, so select the Canonical Name you want, here the city of Boston:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Criteria ID 1018127\n",
"Name Boston\n",
"Canonical Name Boston,Massachusetts,United States\n",
"Parent ID 21152.0\n",
"Country Code US\n",
"Target Type City\n",
"Status Active\n",
"Name: 15832, dtype: object"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"canon_name = 'Boston,Massachusetts,United States'\n",
"name = locs[locs['Canonical Name'] == canon_name].iloc[0]\n",
"display(name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"### 4. Conduct a geolocated search\n",
"\n",
"Initialize the search engine and use the Canonical Name to perform a \n",
"geolocated search. Here we'll use \"pizza\" as a search query because it's a \n",
"good candidate for localization - if you search for \"pizza\" in Boston, you \n",
"probably want to find pizza places in Boston and the results should reflect that."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-05-06 10:29:39.998 | INFO | WebSearcher.searchers | 200 | pizza | Boston,Massachusetts,United States\n"
]
}
],
"source": [
"qry = 'pizza'\n",
"\n",
"# Filepaths\n",
"data_dir = os.path.join(\"data\", f\"demo-ws-v{ws.__version__}\")\n",
"fp_serps = os.path.join(data_dir, 'serps.json')\n",
"fp_results = os.path.join(data_dir, 'results.json')\n",
"dir_html = os.path.join(data_dir, 'html')\n",
"os.makedirs(dir_html, exist_ok=True)\n",
"\n",
"se = ws.SearchEngine() # Initialize searcher\n",
"se.search(qry, location=canon_name) # Conduct geolocated search\n",
"se.parse_results() # Parse results\n",
"se.save_serp(append_to=fp_serps) # Save SERP to json (html + metadata)\n",
"se.save_results(append_to=fp_results) # Save results to json\n",
"se.save_serp(save_dir=dir_html) # Save SERP html to dir (no metadata)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"Output the results with formatted columns:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" type title url\n",
"0 local_results FLORINA Pizzeria & Paninoteca None\n",
"1 local_results Regina Pizzeria None\n",
"2 local_results Sal's Pizza | Court Street | Boston, MA None\n",
"3 general Where to Eat Excellent Pizza Around ... https://boston.eater.com/maps/best-b...\n",
"4 general Pizza Hut | Delivery & Carryout - No... https://www.pizzahut.com/\n",
"5 general Where to Find the Best Pizza in Bost... https://www.bostonmagazine.com/resta...\n",
"6 people_also_ask None None\n",
"7 general Pizza https://en.wikipedia.org/wiki/Pizza\n",
"8 general 15 Best Pizza Delivery Restaurants i... https://www.grubhub.com/delivery/ma-...\n",
"9 general THE 10 BEST Pizza Places in Boston (... https://www.tripadvisor.com/Restaura...\n",
"10 unknown None None\n",
"11 general New Market Pizza - Boston, Boston, MA https://newmarketpizza.com/\n",
"12 general Best Pizza in Boston: 27 Famous Pizz... https://www.cozymeal.com/magazine/be...\n",
"13 general 20 Best Pizza Spots in Boston For De... https://www.timeout.com/boston/resta...\n",
"14 searches_related None None\n",
"15 knowledge https://en.wikipedia.org/wiki/Pizza\n"
]
}
],
"source": [
"results = pd.DataFrame(se.results)\n",
"\n",
"with pd.option_context('display.width', 180, 'display.max_colwidth', 40):\n",
" print(results[['type', 'title', 'url']])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
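
As a supplement to step 3 of the notebook, the lookup can also be done with exact field matches rather than a regex, which avoids partial hits such as "East Boston" or "La Bostonnais". The following is a minimal, pandas-free sketch under stated assumptions: `find_canonical_names` is a hypothetical helper (not part of WebSearcher), the sample rows are copied from the notebook output rather than read from the full geotargets file, and it assumes the region appears as one of the comma-separated parts after the first part of the Canonical Name.

```python
import csv
import io

# Sample rows copied from the notebook output (not the full geotargets file).
SAMPLE_CSV = '''Criteria ID,Name,Canonical Name,Parent ID,Country Code,Target Type,Status
1006543,Boston,"Boston,England,United Kingdom",20339.0,GB,City,Active
1018127,Boston,"Boston,Massachusetts,United States",21152.0,US,City,Active
1018186,East Boston,"East Boston,Massachusetts,United States",21152.0,US,Neighborhood,Active
1027045,Boston,"Boston,Virginia,United States",21178.0,US,City,Active
9060153,Boston College,"Boston College,Massachusetts,United States",21152.0,US,University,Active
'''


def find_canonical_names(rows, name, region=None, target_type=None):
    """Hypothetical helper: return Canonical Names whose Name equals `name`.

    Optionally keep only rows whose Canonical Name contains `region` as one
    of its comma-separated parts (after the first), and whose Target Type
    equals `target_type`.
    """
    matches = []
    for row in rows:
        if row["Name"] != name:
            continue
        parts = row["Canonical Name"].split(",")
        if region is not None and region not in parts[1:]:
            continue
        if target_type is not None and row["Target Type"] != target_type:
            continue
        matches.append(row["Canonical Name"])
    return matches


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# All locations named exactly "Boston" in the sample
print(find_canonical_names(rows, "Boston"))

# Narrowed to the city in Massachusetts
print(find_canonical_names(rows, "Boston", region="Massachusetts", target_type="City"))
```

If the narrowed call returns exactly one string, that string can be passed as the `location` argument in step 4. If it returns several or none, inspect the candidates by hand as the notebook does.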