Last active
February 3, 2018 14:42
Gist: kadster/651e2261bdcae7b9fd5bba1924fdf077
import os
from bs4 import BeautifulSoup
folder = '/Users/nt/Desktop/letters'
files = os.listdir(folder)
tagnames = ['persName', 'abstract', 'date', 'transcription']
files = [f for f in files if 'xml' in f]
output_strs = []
for fname in files:
    with open(os.path.join(folder, fname), "r") as infile:
        file_data = ""
        content = infile.read()
        soup = BeautifulSoup(content, 'xml')
        for tagname in tagnames:
            tag_data = soup.find(tagname)
            tag_text = tag_data.get_text()
            file_data += tag_text
            file_data += '\t'
            tagnames.append(tag_text)
        tagnames_str = "\t".join(tagnames)
        output_strs.append("{}\t{}\n".format(fname, text.replace('\n', ' ')))
print(len(output_strs))
with open('/Users/nt/Desktop/tagdata.dat', 'w') as outfile:
    outfile.writelines(output_strs)
print('yo dude im finito')
OK, the code was almost right before, but it has now turned into a bit of a mess.
The error means that tag_data is None: soup.find(tagname) returned None, so the tag it was looking for is not in the document.
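To make that concrete, here is a rough sketch of how the inner loop could guard against a missing tag. The helper names extract_fields and format_row are made up for illustration and are not part of the gist. The sketch assumes soup is a BeautifulSoup object (anything with a compatible find() works). It also avoids two other problems visible in the script: appending the extracted text to tagnames, and formatting the output from the name text, which is never defined. Collapsing all whitespace instead of only replacing newlines is my own choice here, not something the original did.

```python
def extract_fields(soup, tagnames):
    """Return (texts, missing) for one parsed document.

    `soup` is assumed to be a BeautifulSoup object; a missing tag yields
    an empty string instead of an AttributeError on None.
    """
    texts = []
    missing = []
    for tagname in tagnames:
        tag_data = soup.find(tagname)
        if tag_data is None:
            # The tag is absent from this document: record it and move on.
            missing.append(tagname)
            texts.append("")
            continue
        # Collapse every run of whitespace (newlines, tabs, spaces) into a
        # single space so the record stays on one tab-separated line.
        texts.append(" ".join(tag_data.get_text().split()))
    return texts, missing


def format_row(fname, texts):
    """Build one output line: the file name, then one column per tag."""
    return "{}\t{}\n".format(fname, "\t".join(texts))
```

Inside the loop over files, this could be called as texts, missing = extract_fields(soup, tagnames), followed by output_strs.append(format_row(fname, texts)). Printing fname and missing whenever missing is non-empty shows which letters lack which tags.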
The easiest way to debug this now would be to insert
import pdb; pdb.set_trace()
on line 11.
When you run the code, it will stop on line 11. You then have the choice of pressing 'n' (next) to move one line ahead, or typing 'p some_expression' to print the value of some_expression at that point in the code.
Stepping through, you can check whether each expression returns what you expect it to.
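If stopping on every file gets tedious, one variation is to break only when the lookup actually fails. This is just a sketch: the helper name find_or_break and its debug flag are hypothetical, and soup is again assumed to be a BeautifulSoup object.

```python
import pdb


def find_or_break(soup, tagname, debug=False):
    """Look up a tag; optionally drop into pdb when it is missing."""
    tag_data = soup.find(tagname)
    if tag_data is None and debug:
        # Execution pauses here only for missing tags. At the (Pdb) prompt:
        #   p tagname   prints the tag name that was not found
        #   n           runs the next line
        #   c           continues running until the next breakpoint
        pdb.set_trace()
    return tag_data
```

Calling find_or_break(soup, tagname, debug=True) in place of soup.find(tagname) leaves the normal case untouched and opens the debugger only for the documents that trigger the error.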