vgan/beautiful_soup_i_tags.md

## beautiful_soup_i_tags.md

      
Using Beautiful Soup to extract text from the italic tags in an OCR'd book (ugly MS Word HTML)

Requires Beautiful Soup. If you don't already have it, install it with:
pip install beautifulsoup4

Beautiful Soup 4 Docs.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

htmlfile = "./thompson.htm"
textfile = "./thompson.txt"

# Read the OCR'd HTML as bytes so Beautiful Soup can detect its encoding.
with open(htmlfile, "rb") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# Collect every <i> element in document order.
eyes = soup.find_all("i")

# Write the text of each italic run on its own line, as UTF-8.
with open(textfile, "w", encoding="utf-8") as f:
    for eye in eyes:
        f.write(eye.text + "\n")
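
The script above writes each tag's text verbatim, so line breaks inside a tag and empty italic tags (both common in Word-exported HTML) end up in the output file. The following is a minimal sketch of one way to tidy that up, not part of the original script: the helper name `italic_texts` and the inline `sample` markup are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup


def italic_texts(html):
    """Return the whitespace-normalized text of every non-empty <i> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("i"):
        # get_text() concatenates the strings of nested tags;
        # split() + join() collapses newlines and runs of spaces.
        text = " ".join(tag.get_text().split())
        if text:  # skip tags that contain only whitespace
            results.append(text)
    return results


# Hypothetical sample standing in for a fragment of the OCR'd book.
sample = """
<p>See <i>Moby
   Dick</i> and <i><span>Walden</span></i>.<i> </i></p>
"""
print(italic_texts(sample))  # prints ['Moby Dick', 'Walden']
```

Returning a list instead of writing straight to a file makes it easy to inspect or de-duplicate the results before saving them.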