
Common Crawl, etc

  • https://www.commoncrawl.org/
    • Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.

    • https://www.commoncrawl.org/blog/
    • https://www.commoncrawl.org/overview
      • Overview The Common Crawl corpus contains petabytes of data, regularly collected since 2008. The corpus contains raw web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

    • https://www.commoncrawl.org/get-started
      • Get Started

      • Accessing the Data: Crawl data is free to access by anyone from anywhere.

        The data is hosted by Amazon Web Services’ Open Data Sets Sponsorships program on the bucket s3://commoncrawl/, located in the US-East-1 (Northern Virginia) AWS Region.

        You may process the data in the AWS cloud or download it for free over HTTP(S) with a good Internet connection.

        You can process the data in the AWS cloud (or download directly) using the URL schemes s3://commoncrawl/[...], https://ds5q9oxwqwsfj.cloudfront.net/[...] and https://data.commoncrawl.org/[...].

        To access data from outside the Amazon cloud via HTTP(S), the new URL prefix https://data.commoncrawl.org/ must be used.

        For further detail on the data file formats listed below, please visit the ISO Website, which provides format standards, information and documentation. There are also helpful explanations and details regarding file formats in other GitHub projects.
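        A minimal sketch of anonymous access from Python, assuming boto3 and the usual unsigned-request configuration (the object key is illustrative; real WARC/WAT/WET keys come from the per-crawl path listings):

        ```python
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        # Anonymous (unsigned) access to the public commoncrawl bucket in us-east-1.
        s3 = boto3.client(
            "s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED)
        )

        # Illustrative key only -- real keys are listed in the per-crawl *.paths.gz files.
        key = "crawl-data/CC-MAIN-2024-22/warc.paths.gz"
        s3.download_file("commoncrawl", key, "warc.paths.gz")
        ```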

    • https://www.commoncrawl.org/latest-crawl
      • Latest Crawl - Archive Location & Download: The latest crawl is:

        CC-MAIN-2024-22

        To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
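        As a hedged example, fetching and reading the gzipped WARC path listing for CC-MAIN-2024-22 over HTTPS (the warc.paths.gz filename follows the usual per-crawl layout and is assumed here; the WAT/WET listings follow the same pattern):

        ```python
        import gzip
        import io

        import requests

        # Assumed path layout for the per-crawl WARC listing.
        listing_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/warc.paths.gz"

        resp = requests.get(listing_url, timeout=60)
        resp.raise_for_status()

        with gzip.open(io.BytesIO(resp.content), "rt") as fh:
            warc_paths = [line.strip() for line in fh]

        print(len(warc_paths), "WARC files in this crawl")
        print(warc_paths[0])  # relative path; prefix with https://data.commoncrawl.org/
        ```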

    • https://www.commoncrawl.org/web-graphs
      • Web Graphs

      • Common Crawl regularly releases host- and domain-level graphs, for visualising the crawl data.

        Hostnames in the graph are in reverse domain name notation and all types of links are listed, including purely “technical” links pointing to images, JavaScript libraries, web fonts, etc.

        However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as host component are not taken into account for building the host-level graph.

        The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org.
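        A tiny sketch of the reverse domain name notation mentioned above (hostnames are examples only):

        ```python
        def reverse_host(host: str) -> str:
            """Convert a hostname to the reverse domain name notation used in the host-level graph."""
            return ".".join(reversed(host.split(".")))

        assert reverse_host("en.wikipedia.org") == "org.wikipedia.en"
        assert reverse_host("www.example.co.uk") == "uk.co.example.www"
        ```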

    • https://index.commoncrawl.org/
      • Common Crawl Index Server

      • https://commoncrawl.org/blog/announcing-the-common-crawl-index
        • Announcing the Common Crawl Index

        • We are pleased to announce a new index and query api system for Common Crawl. The raw index data is available, per crawl, at: s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

          There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl. To make working with the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org.

        • Index Format: The index format is relatively simple: it consists of a compressed plaintext index (with one line for each entry) compressed into gzipped chunks, and a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

        • Index Query API: To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

          The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

        • The index can be queried by making a request to a specific collection.

          For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl: https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

          The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

          This can be done by using wildcard queries, such as the *.wikipedia.org/ query shown in the pagination example below.

        • Pagination: For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, and only the first page of results (by default, up to 15000) is returned. The total number of pages can be retrieved with the showNumPages query: https://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
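          Putting the query and pagination notes together, a hedged sketch using Python requests; the output=json and page parameters follow the cdx server api conventions, and the JSON field names are assumptions based on that api:

          ```python
          import json

          import requests

          API = "https://index.commoncrawl.org/CC-MAIN-2015-11-index"
          QUERY = {"url": "*.wikipedia.org/"}

          # How many pages does this wildcard query span? (showNumPages)
          resp = requests.get(API, params={**QUERY, "showNumPages": "true"}, timeout=60)
          print(resp.json())  # e.g. {"pages": ..., "pageSize": ..., "blocks": ...}

          # Fetch the first page as newline-delimited JSON records.
          resp = requests.get(API, params={**QUERY, "output": "json", "page": "0"}, timeout=60)
          resp.raise_for_status()
          for line in resp.text.splitlines():
              record = json.loads(line)
              # Typical fields: url, timestamp, filename, offset, length, mime, status, digest.
              print(record["url"], record["filename"], record["offset"], record["length"])
          ```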

        • Command-Line Client: For smaller use cases, a simple client-side library is available to simplify this process: https://github.com/ikreymer/cdx-index-client. This is a simple Python script which uses the pagination api to perform a parallel query on a local machine.

    • https://data.commoncrawl.org/
      • Common Crawl Data

      • https://data.commoncrawl.org/cc-index/table/cc-main/index.html
        • Common Crawl Index Table: a tabular/columnar index (Parquet format) to the Common Crawl archives.

          Use s3://commoncrawl/cc-index/table/cc-main/warc/ as the path to access the entire table from the Amazon cloud. More information and examples are given in the blog post announcing the columnar index and the cc-index-table project on GitHub. Path listings of the Parquet files are provided alongside the listings of all monthly crawls. See also: accessing the data.

          The table schema on that page shows the storage used for the columns in the partition of the January 2018 crawl (CC-MAIN-2018-05). Note that the schema may evolve over time; the most recent schema is available on GitHub (JSON, SQL).

          • https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
            • Index to WARC Files and URLs in Columnar Format: We're happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) makes it possible to efficiently query or process the index and saves time and computing resources. Especially if only a few columns are accessed, recent big data tools will run impressively fast.
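          A rough, hedged sketch of reading a slice of the columnar index with pyarrow; the column and partition names (crawl, subset, url_host_registered_domain, warc_filename, ...) are taken from the cc-index-table schema and should be checked against the current schema on GitHub. For full-table scans, an engine like Athena or Spark is a better fit:

          ```python
          import pyarrow.dataset as ds
          import pyarrow.fs as pafs

          # Anonymous access to the public bucket; the table is Hive-partitioned by crawl and subset.
          fs = pafs.S3FileSystem(anonymous=True, region="us-east-1")
          dataset = ds.dataset(
              "commoncrawl/cc-index/table/cc-main/warc/",
              filesystem=fs,
              format="parquet",
              partitioning="hive",
          )

          # Read only the columns needed to locate WARC records for one domain in one crawl.
          # Only crawl/subset are partition columns, so this still scans many Parquet files.
          table = dataset.to_table(
              columns=["url", "warc_filename", "warc_record_offset", "warc_record_length"],
              filter=(
                  (ds.field("crawl") == "CC-MAIN-2018-05")
                  & (ds.field("subset") == "warc")
                  & (ds.field("url_host_registered_domain") == "wikipedia.org")
              ),
          )
          print(table.num_rows)
          ```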

      • https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/index.html
        • Common Crawl May 2024 Crawl Archive (CC-MAIN-2024-22): the May 2024 crawl archive contains 2.70 billion pages.

  • https://github.com/centic9/CommonCrawlDocumentDownload
    • CommonCrawlDocumentDownload

    • This is a small tool to find matching URLs and download the corresponding binary data from the CommonCrawl indexes.

      Support for the newer URL Index (http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/) is available; the older URL Index, as described at https://github.com/trivio/common_crawl_index and http://blog.commoncrawl.org/2013/01/common-crawl-url-index/, is still available in the "oldindex" package.

      Please note that a full run usually finds a huge number of files and thus downloading will require a large amount of time and lots of disk-space if the data is stored locally!

      NOTE: This project does not implement backoff on HTTP "too many requests" errors. Due to the current high rate of access by many GPT/LLM experiments, the CommonCrawl S3 bucket very often returns HTTP rate-exceeded errors. See https://github.com/tballison/commoncrawl-fetcher-lite for a newer implementation with more advanced functionality that works more reliably.

      NOTE: CommonCrawl only stores up to 1MB per file and cuts off any bytes exceeding this length. So larger documents will be truncated and might not be valid or parsable any more. You can try to download the original file via the URL that is part of the crawl-data, but this project does not implement this due to potential "crawling" restrictions on target websites.
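      Since the note above calls out the missing backoff handling, here is a minimal sketch of retrying throttled requests with exponential backoff (the status codes and retry counts are illustrative choices, not taken from either project):

      ```python
      import time

      import requests


      def fetch_with_backoff(url: str, max_tries: int = 6) -> requests.Response:
          """GET a URL, retrying with exponential backoff when the server signals throttling."""
          for attempt in range(max_tries):
              resp = requests.get(url, timeout=60)
              if resp.status_code not in (429, 503):  # not a throttling / slow-down response
                  resp.raise_for_status()
                  return resp
              time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
          raise RuntimeError(f"Still throttled after {max_tries} attempts: {url}")
      ```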

  • https://github.com/tballison/commoncrawl-fetcher-lite
    • commoncrawl-fetcher-lite: a simplified version of a Common Crawl fetcher. This is yet another attempt to make it easy to extract files from Common Crawl.

  • https://github.com/tballison/SimpleCommonCrawlExtractor
    • SimpleCommonCrawlExtractor

    • Simple wrapper around IIPC Web Commons to take a literal warc.gz and extract standalone binaries

      This is only meant for toy (=single box) processing of commoncrawl data. Please, please, please use Behemoth or another Hadoop framework for actual processing!!!

      I'm under no illusion that this capability doesn't already exist...probably even with IIPC's Web Commons!

      The inspiration for this came from Dominik Stadler's CommonCrawlDocumentDownload.

      This tool requires a local repository of warcs...it does not do streaming processing... did I mention "toy" above?

      This tool allows for selection (inclusion or exclusion) of records by http-header mime type, Tika-detected mime type and/or file extension scraped from the target URL.

  • https://github.com/webrecorder/warcio.js
    • warcio.js

    • Streaming web archive (WARC) file support for modern browsers and Node.

      This package represents an approximate TypeScript port of the Python warcio module.

  • https://github.com/webrecorder/warcio
    • WARCIO: WARC (and ARC) Streaming Library

    • Streaming WARC/ARC library for fast web archive IO

    • This library provides a fast, standalone way to read and write the WARC format commonly used in web archives. Supports Python 2.7+ and Python 3.4+ (using six, the only external dependency).

      warcio supports reading and writing of WARC files compliant with both the WARC 1.0 and WARC 1.1 ISO standards.
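      To show how warcio fits into the index-based workflow above, a hedged sketch that fetches a single record from data.commoncrawl.org with an HTTP Range request (using the filename/offset/length fields returned by the index; the values below are placeholders) and parses it with warcio's ArchiveIterator:

      ```python
      import io

      import requests
      from warcio.archiveiterator import ArchiveIterator

      # Placeholders for the 'filename', 'offset' and 'length' fields from an index record.
      warc_path = "crawl-data/CC-MAIN-2015-11/segments/.../warc/....warc.gz"
      offset, length = 123456, 7890

      # Fetch just the bytes of that one gzipped WARC record.
      headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
      resp = requests.get(f"https://data.commoncrawl.org/{warc_path}", headers=headers, timeout=60)
      resp.raise_for_status()

      # A single WARC record is itself a valid (tiny) WARC stream, so ArchiveIterator can parse it.
      for record in ArchiveIterator(io.BytesIO(resp.content)):
          if record.rec_type == "response":
              body = record.content_stream().read()
              print(record.rec_headers.get_header("WARC-Target-URI"), len(body))
      ```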

  • https://github.com/webrecorder/pywb
    • Webrecorder pywb

    • pywb is a Python 3 web archiving toolkit for replaying web archives large and small as accurately as possible. The toolkit now also includes new features for creating high-fidelity web archives.

      This toolset forms the foundation of the Webrecorder project, but also provides a generic web archiving toolkit that is used by other web archives, including the traditional "Wayback Machine" functionality.
