@vgan
Last active November 29, 2017 20:19
Using Beautiful Soup to extract text from the Italic Tags in an OCR'd book

This script uses Beautiful Soup to extract the text inside every italic (<i>) tag of an OCR'd book saved as ugly MS Word HTML, and writes each italic passage to a text file, one per line.

Requires Beautiful Soup. If you don't already have it, install it with pip:

pip install beautifulsoup4

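#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

from bs4 import BeautifulSoup

htmlfile = "./thompson.htm"
textfile = "./thompson.txt"

# Read the file as bytes so Beautiful Soup can detect the encoding itself
# (HTML exported from Word may not be UTF-8).
with open(htmlfile, "rb") as markup:
    soup = BeautifulSoup(markup, "html.parser")

# Collect every <i> element in document order.
eyes = soup.find_all("i")

# Write one italic passage per line, encoded as UTF-8.
with io.open(textfile, "w", encoding="utf-8") as f:
    for eye in eyes:
        f.write(eye.get_text() + u"\n")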
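
To try the extraction step without the book file, the same idea can be sketched as a small function over an HTML string. This is only an illustration under a few assumptions: the helper name `italic_passages` and the sample markup are made up, matching `<em>` as well as `<i>` is an addition the script above does not make, and collapsing whitespace is an optional cleanup for text that wraps across lines inside a tag.

```python
from bs4 import BeautifulSoup


def italic_passages(html):
    """Return the text of each <i> or <em> element with whitespace collapsed.

    Hypothetical helper for illustration; not part of the original script.
    """
    soup = BeautifulSoup(html, "html.parser")
    passages = []
    for tag in soup.find_all(["i", "em"]):
        # split() + join() turns line breaks and repeated spaces into single spaces
        text = " ".join(tag.get_text().split())
        if text:  # skip tags that hold only whitespace, e.g. <i> </i>
            passages.append(text)
    return passages


# Made-up sample resembling Word-exported HTML.
sample = """
<p class=MsoNormal>Plain text, <i>first italic
passage</i>, more text, <em>second passage</em>.<i> </i></p>
"""

for passage in italic_passages(sample):
    print(passage)
```

If italic tags are nested inside one another, the inner text is reported twice (once for each tag), so real input may need de-duplication.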