@pawarbi
Last active July 14, 2023 22:33
Create a Scraper class
    Initialize with base_url, experience_name, and max_pages
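A minimal sketch of this constructor step, using the attribute names from the pseudocode; the `page_url` helper and its URL pattern are hypothetical additions, since the real pagination scheme is not specified:

```python
class Scraper:
    def __init__(self, base_url, experience_name, max_pages):
        # Store the three parameters named in the pseudocode.
        self.base_url = base_url
        self.experience_name = experience_name
        self.max_pages = max_pages

    def page_url(self, page):
        # Hypothetical URL scheme; the real site may paginate differently.
        return f"{self.base_url}/{self.experience_name}?page={page}"

scraper = Scraper("https://example.com", "ideas", 5)
```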

Define method extract_data 
    Takes an idea HTML element
    Uses CSS selectors to extract required information from the element
    Returns a dictionary of the extracted data
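The extract_data step could look like the sketch below. The CSS class names (`.idea`, `.idea-title`, `.idea-votes`) are placeholder assumptions, as the pseudocode does not specify the real selectors; the demo parses a small static fragment instead of a live page:

```python
from bs4 import BeautifulSoup

def extract_data(idea):
    # Pull each field of interest out of one idea element;
    # missing fields become None instead of raising.
    title = idea.select_one(".idea-title")
    votes = idea.select_one(".idea-votes")
    return {
        "title": title.get_text(strip=True) if title else None,
        "votes": votes.get_text(strip=True) if votes else None,
    }

# Demo on a static HTML fragment standing in for the live page.
sample = BeautifulSoup(
    '<div class="idea"><span class="idea-title">Dark mode</span>'
    '<span class="idea-votes">42</span></div>',
    "html.parser",
)
row = extract_data(sample.select_one(".idea"))
```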

Define method get_page_data
    Takes a session and page number
    Sends a GET request to the specified page using the session
    Uses BeautifulSoup to parse the response HTML
    Finds all idea elements in the parsed HTML
    Extracts data from each idea element using the extract_data method
    Returns a list of the extracted data
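One way to sketch get_page_data, assuming a hypothetical page-URL pattern and the same placeholder selectors as above. A stub session stands in for a `requests.Session` so the example runs offline:

```python
from bs4 import BeautifulSoup

def get_page_data(session, page):
    # The URL pattern here is a placeholder assumption.
    url = f"https://example.com/ideas?page={page}"
    # Fetch the page, parse it, and return one dict per idea element.
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    rows = []
    for idea in soup.select(".idea"):
        title = idea.select_one(".idea-title")
        rows.append({"title": title.get_text(strip=True) if title else None})
    return rows

# Offline stand-ins so the sketch is runnable without a network call.
class FakeResponse:
    text = ('<div class="idea"><span class="idea-title">A</span></div>'
            '<div class="idea"><span class="idea-title">B</span></div>')

class FakeSession:
    def get(self, url):
        return FakeResponse()

rows = get_page_data(FakeSession(), 1)
```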

Define method scrape_data
    Creates a session
    Determines the pages to scrape
    Uses ThreadPoolExecutor to create multiple threads
    Each thread executes get_page_data method for a page
    Stores each thread's returned data in a DataFrame and appends it to a list
    Combines all DataFrames into a single DataFrame
    Returns the final DataFrame
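The fan-out-and-combine step above can be sketched as follows. Here `fetch_page` stands in for the get_page_data call so the demo runs without a network; the worker count of 4 is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def scrape_data(fetch_page, max_pages, workers=4):
    # Run fetch_page(page) for every page in a thread pool; map() keeps
    # results in page order even though the requests run concurrently.
    pages = range(1, max_pages + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_page = list(pool.map(fetch_page, pages))
    # One DataFrame per page, then one combined DataFrame.
    frames = [pd.DataFrame(rows) for rows in per_page if rows]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Demo with a fake fetcher returning two rows per page.
df = scrape_data(
    lambda p: [{"page": p, "idea": f"idea-{p}-{i}"} for i in range(2)], 3
)
```

Using `pool.map` rather than raw futures keeps the combined DataFrame in page order without extra bookkeeping.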

Outside the class
    Define the URL, experience_name, and max_pages
    Create an instance of Scraper with the URL, experience_name, and max_pages
    Call the scrape_data method and save the result into df_ideas
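Putting the pieces together, a runnable end-to-end sketch might look like this. The selectors, the URL pattern, and the offline FakeSession (standing in for `requests.Session`) are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas as pd

class Scraper:
    def __init__(self, base_url, experience_name, max_pages, session_factory):
        self.base_url = base_url
        self.experience_name = experience_name
        self.max_pages = max_pages
        self.session_factory = session_factory  # e.g. requests.Session

    def extract_data(self, idea):
        # ".idea-title" is a placeholder selector.
        title = idea.select_one(".idea-title")
        return {"title": title.get_text(strip=True) if title else None}

    def get_page_data(self, session, page):
        # Hypothetical pagination scheme.
        url = f"{self.base_url}/{self.experience_name}?page={page}"
        soup = BeautifulSoup(session.get(url).text, "html.parser")
        return [self.extract_data(i) for i in soup.select(".idea")]

    def scrape_data(self):
        session = self.session_factory()
        pages = range(1, self.max_pages + 1)
        with ThreadPoolExecutor(max_workers=4) as pool:
            per_page = list(
                pool.map(lambda p: self.get_page_data(session, p), pages)
            )
        frames = [pd.DataFrame(rows) for rows in per_page if rows]
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

class FakeSession:
    # Offline stand-in for requests.Session: every page has one idea.
    def get(self, url):
        class Response:
            text = '<div class="idea"><span class="idea-title">Idea</span></div>'
        return Response()

scraper = Scraper("https://example.com", "ideas", 3, FakeSession)
df_ideas = scraper.scrape_data()
```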
