@josht-jpg
Last active September 2, 2020 20:52
Removing extraneous text from raw files
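import re

def get_book_contents(book_raw):
    """Return the text between the Project Gutenberg start and end markers."""
    start = re.compile(r"START OF (?:THE|THIS) PROJECT GUTENBERG", flags=re.IGNORECASE)
    end = re.compile(r"END OF (?:THE|THIS) PROJECT GUTENBERG", flags=re.IGNORECASE)
    book_start = start.search(book_raw)
    book_end = end.search(book_raw)
    if book_start is None or book_end is None:
        raise ValueError("Project Gutenberg start/end marker not found")
    # Stop at the beginning of the end marker so the marker text is excluded.
    return book_raw[book_start.end():book_end.start()]
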
notes = get_book_contents(notes_from_underground_raw)
crime = get_book_contents(crime_and_punishment_raw)
idiot = get_book_contents(the_idiot_raw)
possessed = get_book_contents(the_possessed_raw)
brothers = get_book_contents(the_brothers_karamazov_raw)
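
The function above cuts at the marker phrase itself, so the rest of each marker line (for example the trailing `EBOOK ... ***`) can remain in the result. A minimal sketch of one way to drop the whole marker lines, assuming each marker sits on a single line. The names `START_LINE`, `END_LINE`, `strip_gutenberg_boilerplate`, and the `sample` text are hypothetical and not part of the original gist:

```python
import re

# Hypothetical variant: match the entire line containing each marker,
# so the surrounding "*** ... ***" residue is removed as well.
START_LINE = re.compile(
    r"^.*START OF (?:THE|THIS) PROJECT GUTENBERG.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)
END_LINE = re.compile(
    r"^.*END OF (?:THE|THIS) PROJECT GUTENBERG.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(book_raw):
    """Return the text between the marker lines, or the input unchanged
    if either marker is missing (an assumed fallback, not the gist's behavior)."""
    start = START_LINE.search(book_raw)
    end = END_LINE.search(book_raw)
    if start is None or end is None:
        return book_raw
    # Slice from the end of the start line to the beginning of the end line.
    return book_raw[start.end():end.start()].strip()


# Made-up sample text standing in for a raw Project Gutenberg file.
sample = (
    "Header text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "I am a sick man.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Returning the input unchanged when a marker is missing is a design choice for this sketch; raising an error instead may be preferable when a missing marker should never go unnoticed.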