When you start scraping documentation using `thor docs:generate <doc>`, Thor will call `Docs.generate`. This method finds the correct scraper and calls `Doc.store_pages`, which is the basis of the actual scraping.
This method sets up an `EntryIndex` in which entries are stored and a `PageDb` in which processed pages are stored. Once that's done, it starts scraping by calling `build_pages` on the scraper. For most scrapers this means the `Scraper.build_pages` method is called, but the Browser Support Tables scraper and, in the future, the .NET scraper (which I am currently working on) implement the `build_pages` method themselves.
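To make the shape of this concrete, here is a minimal, self-contained sketch of what `store_pages` does. The `EntryIndex` and `PageDb` stand-ins and the fake pages yielded by `build_pages` are simplified stand-ins of mine, not DevDocs's actual implementations:

```ruby
# Simplified stand-in for DevDocs's entry index: collects entries from pages.
class EntryIndex
  attr_reader :entries
  def initialize = @entries = []
  def add(entry) = @entries << entry
end

# Simplified stand-in for DevDocs's page store: maps a path to processed HTML.
class PageDb
  attr_reader :pages
  def initialize = @pages = {}
  def add(path, body) = @pages[path] = body
end

class Scraper
  # Sketch of store_pages: set up the index and db, then let build_pages
  # yield each processed page so it can be stored.
  def store_pages
    index = EntryIndex.new
    db = PageDb.new
    build_pages do |page|
      db.add(page[:path], page[:body])
      page[:entries].each { |entry| index.add(entry) }
    end
    [index, db]
  end

  # In DevDocs each scraper produces real pages; here we fake two of them.
  def build_pages
    yield(path: 'index', body: '<h1>Home</h1>', entries: ['Home'])
    yield(path: 'api', body: '<h1>API</h1>', entries: ['API'])
  end
end

index, db = Scraper.new.store_pages
```

The point is the division of labor: `store_pages` owns the storage, while `build_pages` (overridable per scraper) owns producing the pages.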
The `build_pages` method on the scraper scrapes all URLs, starting with those specified by `self.root_path` and `self.initial_urls`. If it finds new scrape-able URLs while scraping, it adds them to the queue and continues scraping until there are no URLs left.
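That queue-driven crawl can be sketched like this. The `LINKS` hash is a fake in-memory site of my own invention standing in for real HTTP fetching and link extraction:

```ruby
require 'set'

# Fake site: each page maps to the links found on it.
LINKS = {
  '/'  => ['/a', '/b'],
  '/a' => ['/b', '/c'],
  '/b' => [],
  '/c' => ['/a']
}

# Sketch of build_pages's crawl loop: seed the queue with the root and
# initial URLs, then keep scraping until the queue is empty, enqueueing
# any newly discovered URL exactly once.
def crawl(root, initial_urls)
  queue = [root, *initial_urls]
  seen = Set.new(queue)
  scraped = []
  until queue.empty?
    url = queue.shift
    scraped << url                     # "scrape" the page
    (LINKS[url] || []).each do |link|  # discover new scrape-able URLs
      next if seen.include?(link)
      seen << link
      queue << link
    end
  end
  scraped
end

crawl('/', [])  # visits every reachable page exactly once
```

The `seen` set is what keeps cyclic links (like `/c` pointing back to `/a`) from looping forever.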
When a page is being scraped, it goes like this:

- The body of the page is fetched by the `request_one` method (the URL scraper and the file scraper each have their own implementation).
- The content of the body is checked by the `process_response?` method (again, each scraper type has its own implementation). If this method returns `true`, parsing continues; otherwise it's assumed something went wrong in step 1 and an error is logged.
- The body is run through `Scraper.process_response`. This is where the magic happens.
- The `Scraper.process_response` method parses the HTML and extracts the title from the document.
- The `Scraper.process_response` method runs all filters over the HTML, with context about things like the current path, whether it's the root page, and more. The filters that run by default are well documented in the Scraper Reference. Custom filters are added in the scrapers themselves to handle things like removing unnecessary documentation-specific nodes and extracting the name, type, and additional entries from the page. Look at existing scrapers to see exactly what they do; pretty much every scraper has a `CleanHtml` filter and an `EntriesFilter`.
- After the pipeline has run, the final page (which can be modified by the filters to make it DevDocs-ready) is stored in the `PageDb` instance, and the entry for this page plus any additional entries generated by the filters are stored in the `EntryIndex` instance.
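The filter pipeline in the steps above can be sketched as a chain of callables that each receive the HTML plus a context hash. These two toy filters and the `run_pipeline` helper are illustrative stand-ins, not DevDocs's real `CleanHtml`/`EntriesFilter` classes:

```ruby
# Toy "CleanHtml"-style filter: strip documentation-specific chrome.
clean_html = lambda do |html, _context|
  html.gsub(%r{<nav>.*?</nav>}m, '')
end

# Toy "EntriesFilter"-style filter: record an entry name into the context.
entries_filter = lambda do |html, context|
  context[:entries] << html[%r{<h1>(.*?)</h1>}, 1]
  html
end

# Run every filter in order over the HTML, threading the context through,
# and return the final page body plus the entries the filters collected.
def run_pipeline(html, path, filters)
  context = { path: path, root_page: path == 'index', entries: [] }
  final = filters.reduce(html) { |doc, filter| filter.call(doc, context) }
  [final, context[:entries]]
end

html = '<nav>menu</nav><h1>Array</h1><p>docs</p>'
final, entries = run_pipeline(html, 'array', [clean_html, entries_filter])
# final   => '<h1>Array</h1><p>docs</p>'
# entries => ['Array']
```

Filter order matters here, just like in the real pipeline: cleanup filters run before entry extraction so the extractor sees the tidied HTML.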
I hope this is at least somewhat helpful, but I think you'll learn the most by looking at the source code yourself. I know from experience that might be kinda hard if DevDocs is your first Ruby project (it was mine too), but it's often the fastest way to find out what you want to know.