import os
from bs4 import BeautifulSoup

folder = '/Users/nt/Desktop/letters'
files = os.listdir(folder)
tagnames = ['persName', 'abstract', 'date', 'transcription']
files = [f for f in files if 'xml' in f]
output_strs = []
for fname in files:
    with open(os.path.join(folder, fname), "r") as infile:
        file_data = ""
        content = infile.read()
        soup = BeautifulSoup(content, 'xml')
        for tagname in tagnames:
            tag_data = soup.find(tagname)
            tag_text = tag_data.get_text()
            file_data += tag_text
            file_data += '\t'
            tagnames.append(tag_text)
        tagnames_str = "\t".join(tagnames)
        output_strs.append("{}\t{}\n".format(fname, text.replace('\n', ' ')))
print(len(output_strs))
with open('/Users/nt/Desktop/tagdata.dat', 'w') as outfile:
    outfile.writelines(output_strs)
print('yo dude im finito')
Consider using a list:
data = []
inside loop:
data.append(tag_text)
after loop:
data_str = "\t".join(data)
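Put together, the list suggestion above could look roughly like this. It is only a sketch: tag_texts and the file name letter1.xml are made-up stand-ins for whatever soup.find(...).get_text() returns for one real file.

```python
# Hypothetical tag texts standing in for the get_text() results of one file.
tag_texts = ["Jane Doe", "A short abstract", "1850-01-01", "Dear Sir,\nI write to you"]

data = []
for tag_text in tag_texts:
    # Flatten newlines so each file stays on a single output row.
    data.append(tag_text.replace("\n", " "))

# After the loop: join once, instead of building the string with +=.
data_str = "\t".join(data)

# One tab-separated row per file, file name first (the name is illustrative).
row = "{}\t{}\n".format("letter1.xml", data_str)
```

Note that data is created fresh for each file and is separate from tagnames, so the list of tag names to look up never changes.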
Error:
Traceback (most recent call last):
  File "/Users/nt/Desktop/braaaam.py", line 20, in <module>
    tag_text = tag_data.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
OK, the code was almost right before, but now it has turned into a bit of a mess.
The error means that tag_data is None. That means soup.find(tagname) returned None, which is what find() does when no tag with that name exists in the document.
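One way to stop the script from crashing while you work out why the tag is missing is to check for None before calling get_text(). A minimal sketch, assuming an empty string is an acceptable placeholder for a missing tag; text_or_empty and sample_xml are names made up for this example:

```python
from bs4 import BeautifulSoup

# Made-up sample document: it has persName and date, but no abstract.
sample_xml = "<letter><persName>Jane Doe</persName><date>1850-01-01</date></letter>"
soup = BeautifulSoup(sample_xml, "xml")  # the 'xml' parser needs lxml installed


def text_or_empty(soup, tagname):
    """Return the text of the first matching tag, or '' if there is none."""
    tag_data = soup.find(tagname)
    if tag_data is None:
        # find() returns None when no such tag exists in the document.
        return ""
    return tag_data.get_text()


found = text_or_empty(soup, "persName")
missing = text_or_empty(soup, "abstract")
```

This only hides the symptom. It is still worth finding out which tagname came back empty, because it may not be one of the four names you intended to look up.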
The easiest way to debug this now would be to insert
import pdb; pdb.set_trace() on line 11
When you run the code, it will stop on line 11. You then have the choice of pressing 'n' (next) to move one line ahead, or typing 'p some_expression' to print the value of some_expression at that point in the code.
Stepping through, you can check whether each expression returns what you expect it to.
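As a rough illustration of where the breakpoint goes, here is a toy loop with pdb.set_trace() behind a flag. DEBUG and process are invented for this sketch and are not part of the script above.

```python
import pdb

DEBUG = False  # flip to True to pause inside the loop


def process(names):
    results = []
    for name in names:
        if DEBUG:
            # Execution stops here. At the (Pdb) prompt, 'n' runs the next
            # line and 'p name' prints the current value of name.
            pdb.set_trace()
        results.append(name.upper())
    return results


out = process(["persName", "date"])
```

With DEBUG left as False the loop runs straight through. Set it to True and you get a (Pdb) prompt on every pass through the loop.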
Have a think about how many times your 'for persName in persNames' loop will execute for each iteration of the 'for fname in files' loop!
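A hint, assuming the append in the posted script really does sit inside the inner loop (the indentation is hard to tell from the paste): appending to the list you are iterating over makes the loop visit the new items too. The toy below caps the growth at four items so that it terminates; the "text of ..." strings stand in for tag_text.

```python
tagnames = ["persName", "abstract"]
visited = []

for tagname in tagnames:
    visited.append(tagname)
    if len(tagnames) < 4:  # cap added only so this demo stops
        # Same pattern as tagnames.append(tag_text) in the script.
        tagnames.append("text of " + tagname)

loop_runs = len(visited)
```

The loop body runs four times, not two, and the last two "tag names" are really pieces of text. In the real script those would be passed to soup.find(), which finds no such tag and returns None.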