An explanation of the basics of what happens when you scrape documentation in DevDocs

When you start scraping documentation with thor docs:generate <doc>, Thor calls Docs.generate. This method finds the correct scraper and calls Doc.store_pages, which drives the actual scraping.
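As a very rough sketch of that hand-off (simplified pseudocode, not the actual DevDocs source; find_doc is a made-up stand-in for the real lookup step):

```ruby
# Illustration only; the real Docs.generate also handles things like versions and options.
module Docs
  def self.generate(name)
    doc = find_doc(name) # hypothetical helper standing in for the real scraper lookup
    doc.store_pages      # hands off to the method described below
  end
end
```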

This method sets up an EntryIndex in which entries are stored and a PageDb in which processed pages are stored. Once that's done, it starts scraping by calling build_pages on the scraper. For most scrapers this means the Scraper.build_pages method is called, but the Browser Support Tables scraper, and in the future the .NET scraper (which I am currently working on), implement build_pages themselves.
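Roughly speaking (simplified pseudocode; the real method also persists everything to a store and handles errors), Doc.store_pages has this shape:

```ruby
# Rough shape of Doc.store_pages, not the real implementation.
def self.store_pages
  index = EntryIndex.new   # collects entries (name, type, path)
  pages = PageDb.new       # collects the processed HTML of each page

  new.build_pages do |page|              # `new` creates the scraper instance
    index.add page[:entries]             # entries extracted by the filters
    pages.add page[:path], page[:output] # the filtered, DevDocs-ready HTML
  end

  # ...index and pages are then written out so the app can serve them
end
```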

The scraper's build_pages method scrapes all URLs, starting with the ones specified by self.root_path and self.initial_urls. If it finds new scrapeable URLs while scraping, it adds them to the queue and continues until there are no URLs left.
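Conceptually it's a queue-based crawl. The sketch below is not the actual build_pages implementation (the real one also normalizes URLs, respects skip/only options, handles redirects, and so on), but it shows the idea and how it connects to the steps listed next:

```ruby
require 'set'

# Conceptual sketch of the crawl loop, not the real build_pages.
def build_pages
  queue = [root_url, *initial_urls] # seeded from self.root_path / self.initial_urls
  seen  = Set.new(queue)

  until queue.empty?
    url      = queue.shift
    response = request_one(url)             # step 1 below
    next unless process_response?(response) # step 2 below

    page = process_response(response)       # steps 3-6 below
    yield page

    # URLs discovered on this page (collected by the filters) join the queue.
    page[:internal_urls].to_a.each do |new_url|
      queue << new_url if seen.add?(new_url)
    end
  end
end
```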

When a page is being scraped, it goes like this:

  1. The body of the page is fetched by the request_one method (the URL scraper and the file scraper each have their own implementation).
  2. The content of the body is checked by the process_response? method (again, the URL scraper and the file scraper each have their own implementation). If this method returns true, parsing continues; otherwise it's assumed something went wrong in step 1 and an error is logged.
  3. The body is run through Scraper.process_response. This is where the magic happens.
  4. The Scraper.process_response method parses the HTML and extracts the title from the document.
  5. The Scraper.process_response method runs all filters over the HTML, with context about things like the current path, whether it's the root page, and more. The filters that run by default are well documented in the Scraper Reference. Custom filters are added in the scrapers themselves to handle things like removing unnecessary documentation-specific nodes and extracting the name, type and additional entries from the page. Look at existing scrapers to see exactly what they do; pretty much every scraper has a CleanHtml filter and an EntriesFilter (see the sketch after this list).
  6. After the pipeline has run, the final page (which can be modified by the filters to make it DevDocs-ready) is stored in the PageDb instance, and the entry for this page, plus any additional entries generated by the filters, are stored in the EntryIndex instance.
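To make step 5 more concrete, here is a rough sketch of what a custom scraper and its two typical filters look like, modeled on the patterns from the Scraper Reference. The doc name, URLs, selectors and categories are all made up; the real ones depend entirely on the documentation being scraped:

```ruby
# Hypothetical scraper; names, URLs and selectors are placeholders.
module Docs
  class MyDoc < UrlScraper
    self.name      = 'MyDoc'
    self.type      = 'my_doc'
    self.base_url  = 'https://example.com/docs/'
    self.root_path = 'index.html'

    html_filters.push 'my_doc/entries', 'my_doc/clean_html'
  end
end

# Removes documentation-specific nodes that the final page doesn't need.
module Docs
  class MyDoc
    class CleanHtmlFilter < Filter
      def call
        css('nav', '.sidebar', '.page-footer').remove # made-up selectors
        doc # a filter returns the (possibly modified) document
      end
    end
  end
end

# Extracts the name and type of the current page's entry, plus any extra entries.
module Docs
  class MyDoc
    class EntriesFilter < Docs::EntriesFilter
      def get_name
        at_css('h1').content
      end

      def get_type
        'Guides' # made-up category
      end

      def additional_entries
        css('h2').map { |node| [node.content, node['id']] }
      end
    end
  end
end
```

The strings pushed onto html_filters are how a scraper wires its filters into the pipeline; the exact filters and what they do vary per scraper, so the Scraper Reference and the existing scrapers under lib/docs are the best place to look for the real conventions.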

I hope this is at least somewhat helpful, but I think you'll learn the most by looking at the source code yourself. I know from experience that might be kinda hard if DevDocs is your first Ruby project (it was my first too), but that's often the fastest way to find out what you want to know.
