
@NoWorries
Last active September 17, 2023 10:24
Python script to export a CSV of the links from all the pages on a website, ignoring common header and footer elements and focusing on the main part of each page to reduce duplication.

Setting up and running the Export Links script

Prerequisites

  • You need to have Python installed on your computer. If you don't have it installed, follow the instructions in this guide to install it: How to Install Python

Step 1: Download the Script

  1. Download the export links crawler script to your computer (if clicking the script link doesn't work, use the Gist's download button). Save it in a folder where you can easily find it.

Step 2: Install Required Dependencies

  1. Open a terminal or Command Prompt.
  2. Navigate to the folder where you saved the script using the cd command (e.g., cd Desktop/MyScripts).
  3. Run the following command to install the required dependencies: pip install requests beautifulsoup4

Edit the Ignore List (Optional)

If you want to customize the elements that the script ignores (such as specific header or footer classes or element names), you can edit the ignore_list in the script. Open the script using a text editor, and you'll find the ignore_list variable. Add or remove class names or element names as needed.

For example, if you want to ignore links within elements with the class my-header and my-footer, you can modify the ignore_list like this:

ignore_list = [
    ".my-header",
    ".my-footer",
]
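For context, an ignore list of this shape is usually applied by deleting the matching elements from the parsed page before any links are collected. A minimal sketch using BeautifulSoup (the `extract_links` function name, and the assumption that entries are CSS selectors, are illustrative rather than a description of this exact script):

```python
from bs4 import BeautifulSoup

def extract_links(html, ignore_list):
    """Return hrefs from a page, skipping elements matched by ignore_list.

    ignore_list holds CSS selectors (e.g. ".my-header", "nav") whose
    subtrees are removed before links are extracted.
    """
    soup = BeautifulSoup(html, "html.parser")
    for selector in ignore_list:
        for element in soup.select(selector):
            element.decompose()  # drop the element and everything inside it
    return [a["href"] for a in soup.find_all("a", href=True)]
```

With `".my-header"` in the list, any link inside an element carrying that class is excluded from the output, which is what keeps repeated navigation links out of the CSV.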

Save the script after making your changes.

Step 3: Run the Script

  1. To run the script, you need to use the python command followed by the script's filename. For example, if your script is named export_links.py, type python export_links.py and press Enter.

Step 4: Enter the Website URL

  1. The script will prompt you to enter the URL of the website you want to scan for a sitemap. Type the full website URL and press Enter. For example, https://www.example.com.
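If you are curious how the sitemap lookup in the next step might work, it typically amounts to probing a few well-known paths off the URL you enter. A rough sketch (assuming the script uses requests; the `find_sitemap` name and the path list are illustrative, and the `fetch` parameter exists only to make the sketch testable without network access):

```python
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]

def find_sitemap(base_url, fetch=None):
    """Return the first common sitemap URL that answers HTTP 200, else None."""
    if fetch is None:
        import requests  # installed in Step 2
        fetch = lambda url: requests.get(url, timeout=10)
    for path in COMMON_SITEMAP_PATHS:
        url = base_url.rstrip("/") + path
        try:
            if fetch(url).status_code == 200:
                return url
        except Exception:
            continue  # unreachable host or bad path: try the next candidate
    return None
```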

Step 5: Wait for the Script to Complete

  1. The script will start searching for the sitemap on the provided website. If it finds a sitemap, it will display the sitemap's URL.
  2. The script will then parse the sitemap and extract the links. Once it's done, it will inform you that the sitemap links have been saved.
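The parsing step itself is straightforward: a sitemap is an XML file whose `<loc>` elements hold the page URLs. A self-contained sketch using only the standard library (the actual script may do this with BeautifulSoup instead; `parse_sitemap` is an illustrative name):

```python
import xml.etree.ElementTree as ET

def parse_sitemap(xml_text):
    """Collect the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    # Sitemaps declare an XML namespace, so tags look like "{...}loc";
    # matching on the suffix keeps this independent of the exact namespace.
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]
```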

Step 6: Check the Output

  1. The extracted links from the sitemap will be saved in a file named sitemap_links.csv in the same folder where the script is located.
  2. You can open this CSV file with a spreadsheet program like Microsoft Excel or Google Sheets to view the list of links.
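Producing the CSV in this step needs nothing beyond the standard library. A sketch of how the output file is likely written (the `URL` column header and the `save_links_csv` name are assumptions):

```python
import csv

def save_links_csv(links, path="sitemap_links.csv"):
    """Write one link per row under a single 'URL' header column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL"])
        for link in links:
            writer.writerow([link])
```

Passing `newline=""` to `open()` is the csv module's documented way to avoid blank rows on Windows, which matters if you plan to open the file in Excel.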