@kadster
Last active February 3, 2018 14:42
import os
from bs4 import BeautifulSoup

folder = '/Users/nt/Desktop/letters'
files = os.listdir(folder)
tagnames = ['persName', 'abstract', 'date', 'transcription']
files = [f for f in files if 'xml' in f]
output_strs = []
for fname in files:
    with open(os.path.join(folder, fname), "r") as infile:
        file_data = ""
        content = infile.read()
        soup = BeautifulSoup(content, 'xml')
        for tagname in tagnames:
            tag_data = soup.find(tagname)
            tag_text = tag_data.get_text()
            file_data += tag_text
            file_data += '\t'
            tagnames.append(tag_text)
        tagnames_str = "\t".join(tagnames)
        output_strs.append("{}\t{}\n".format(fname, text.replace('\n', ' ')))
print(len(output_strs))
with open('/Users/nt/Desktop/tagdata.dat', 'w') as outfile:
    outfile.writelines(output_strs)
print('yo dude im finito')

@DavidJanz

Have a think about how many times your persName in persNames loop will execute for each fname in files iteration!

@DavidJanz

Consider using a list to collect the tag text:

data = []
inside loop:
data.append(tag_text)

after loop:
data_str = "\t".join(data)
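Put together, the list approach might look like the sketch below. `build_row` and the `tag_texts` mapping are hypothetical stand-ins for whatever text gets pulled out of each file; only `data`, `tag_text`, and `data_str` come from the suggestion above. The point is that a fresh `data` list is built per file and `tagnames` itself is never modified.

```python
TAGNAMES = ['persName', 'abstract', 'date', 'transcription']


def build_row(fname, tag_texts):
    """Build one tab-separated output line for a single file.

    tag_texts is a hypothetical dict mapping tag name -> extracted text.
    """
    data = []  # fresh list for every file
    for tagname in TAGNAMES:
        tag_text = tag_texts[tagname]
        # Newlines inside a field would break the one-line-per-file layout.
        data.append(tag_text.replace('\n', ' '))
    data_str = "\t".join(data)
    return "{}\t{}\n".format(fname, data_str)


# Made-up sample values, for illustration only.
row = build_row('letter1.xml', {
    'persName': 'Ada',
    'abstract': 'A letter',
    'date': '1843',
    'transcription': 'Dear Sir,\nthank you.',
})
print(repr(row))
```

Each call returns a single line, so the results can be appended to `output_strs` and written out with `writelines` as in the gist.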

kadster commented Feb 3, 2018

error:

Traceback (most recent call last):
  File "/Users/nt/Desktop/braaaam.py", line 20, in <module>
    tag_text = tag_data.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

DavidJanz commented Feb 3, 2018

OK, the code was almost right before, but now it has turned into a bit of a mess.
The error means that tag_data is None, so soup.find(tagname) returned None, which means the tag was not found in the document.
The easiest way to debug this now would be to insert
import pdb; pdb.set_trace() on line 11.
When you run the code, it will stop on line 11. You then have the choice of pressing 'n' for next, which moves one line ahead, or typing 'p some_expression' to print the value of some_expression at that point in the code.

Stepping through, you can check whether each expression returns what you expect.
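One way to make a missing tag harmless is to check for None before asking for the text. The sketch below is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra packages, and `extract_fields` and the sample XML are made up for illustration. With BeautifulSoup the same `is None` check would go between `soup.find(tagname)` and `get_text()`. Note also that in the gist `tagnames.append(tag_text)` grows the list being looped over, so extracted text later gets looked up as if it were a tag name, which looks like one way to end up with None.

```python
import xml.etree.ElementTree as ET

TAGNAMES = ['persName', 'abstract', 'date', 'transcription']


def extract_fields(xml_text, tagnames=TAGNAMES):
    """Return one string per tag name, using '' where a tag is missing."""
    root = ET.fromstring(xml_text)
    fields = []  # results go here; tagnames is left untouched
    for tagname in tagnames:
        element = root.find('.//' + tagname)  # None if no such descendant
        if element is None:
            fields.append('')  # placeholder keeps the columns aligned
            continue
        text = ''.join(element.itertext())
        fields.append(' '.join(text.split()))  # collapse newlines and runs of spaces
    return fields


# Made-up document with no <abstract> element.
sample = """<letter>
  <persName>Ada Lovelace</persName>
  <date>1843</date>
  <transcription>Dear Sir,
thank you.</transcription>
</letter>"""

fields = extract_fields(sample)
print(fields)
```

Writing an empty placeholder is a design choice; skipping the file or logging its name would be equally reasonable if missing tags should be investigated.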