Skip to content

Instantly share code, notes, and snippets.

@bitchwhocodes
Last active January 21, 2022 05:17
Show Gist options
  • Save bitchwhocodes/9102400 to your computer and use it in GitHub Desktop.
Save bitchwhocodes/9102400 to your computer and use it in GitHub Desktop.
Gist to get the raw text from an epub for all chapters using Python, NLTK, EPUB
import nltk
import epub
book = epub.open_epub('design-is-a-job.epub')
for item in book.opf.manifest.values():
# read the content
data = book.read_item( item )
if 'html' in item.href and 'chap' in item.href:
print item.href
#print data
raw = nltk.clean_html(data)
print(raw)
@SandroMiccoli
Copy link

hey, got this error:
Traceback (most recent call last):
File "", line 8, in
File "/Library/Python/2.7/site-packages/nltk/util.py", line 346, in clean_html
raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

@BradKML
Copy link

BradKML commented Jan 21, 2022

This is more likely in Python 2, but otherwise it needs a revamp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment