@elijahbenizzy
Created June 11, 2024 17:17
from __future__ import annotations  # URLEmbedder is defined outside this gist; deferring annotation evaluation lets the stubs load without it
from typing import Optional
import pandas as pd

def urls(base_url: str, max_scrape_depth: int = 1, cutoff: Optional[int] = None) -> pd.DataFrame:
    """Gives all recursive URLs from the given base URL."""

def embedder() -> URLEmbedder:
    """Sets up an embedder to embed URLs on a remote GPU box."""

def embeddings_df(urls: pd.DataFrame, embedder: URLEmbedder) -> pd.DataFrame:
    """Adds embeddings to the given URL DataFrame."""

def saved_embeddings(embeddings_df: pd.DataFrame) -> str:
    """Saves the embeddings to disk and returns the URL."""
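In the stubs above, each parameter name matches either another function's name (`urls`, `embedder`, `embeddings_df`) or an external input (`base_url`), which suggests the functions are meant to be chained by name. The gist does not show how that wiring is done, so the following is only a minimal sketch under stated assumptions: `resolve` and `FakeEmbedder` are hypothetical helpers, the function bodies are placeholders, and plain lists of dicts stand in for `pd.DataFrame` so that the example runs with the standard library alone.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional


# Hypothetical stand-in for URLEmbedder (the real class is not shown in the gist).
class FakeEmbedder:
    def embed(self, url: str) -> List[float]:
        # Placeholder "embedding": the URL length as a one-element vector.
        return [float(len(url))]


def urls(base_url: str, max_scrape_depth: int = 1, cutoff: Optional[int] = None) -> List[dict]:
    """Placeholder crawl: makes up one URL per depth level instead of scraping."""
    found = [{"url": f"{base_url}/page{i}"} for i in range(max_scrape_depth + 1)]
    return found[:cutoff]  # cutoff=None keeps every row


def embedder() -> FakeEmbedder:
    """Placeholder setup: returns a local fake instead of a remote GPU client."""
    return FakeEmbedder()


def embeddings_df(urls: List[dict], embedder: FakeEmbedder) -> List[dict]:
    """Adds an 'embedding' field to every URL row."""
    return [{**row, "embedding": embedder.embed(row["url"])} for row in urls]


def saved_embeddings(embeddings_df: List[dict]) -> str:
    """Placeholder save: returns a made-up location instead of writing to disk."""
    return f"memory://embeddings/{len(embeddings_df)}"


FUNCS: Dict[str, Callable] = {
    f.__name__: f for f in (urls, embedder, embeddings_df, saved_embeddings)
}


def resolve(
    name: str,
    funcs: Dict[str, Callable],
    inputs: Dict[str, Any],
    cache: Optional[Dict[str, Any]] = None,
) -> Any:
    """Hypothetical helper: computes `name` by recursively resolving its parameters by name."""
    cache = {} if cache is None else cache
    if name in inputs:  # externally supplied values take precedence
        return inputs[name]
    if name not in cache:
        kwargs = {}
        for param in inspect.signature(funcs[name]).parameters.values():
            if param.name in inputs or param.name in funcs:
                kwargs[param.name] = resolve(param.name, funcs, inputs, cache)
            elif param.default is inspect.Parameter.empty:
                raise KeyError(f"no function or input named {param.name!r}")
            # otherwise the parameter's default value is used
        cache[name] = funcs[name](**kwargs)
    return cache[name]


location = resolve(
    "saved_embeddings",
    FUNCS,
    {"base_url": "https://example.com", "max_scrape_depth": 2},
)
print(location)
```

Requesting only `saved_embeddings` pulls in `embeddings_df`, `urls`, and `embedder` automatically, and the cache ensures each one is computed once per run.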