groktocrawl scrape https://www.gutenberg.org/ebooks/11 → full Alice in Wonderland as chapter-structured markdown.
Public-domain books are among the largest open corpora of high-quality text on the internet. But getting them into a format an LLM, RAG system, or agent can use means:
- Finding the right download URL (EPUB? plain text? illustrated?)
- Dealing with Gutenberg's boilerplate header/footer in plain-text downloads