import xml.etree.ElementTree as ET
tree = ET.parse('thorpe.xml')
root = tree.getroot()
for item in root.iter('DISS_para'):
    abstract = (item.text)
    myList = [abstract.split('\t')]
    ''.join(map(str,myList))
    print(abstract)
Try this:
import xml.etree.ElementTree as ET
tree = ET.parse('thorpe.xml')
root = tree.getroot()
combined_abstracts = ""  # start with an empty string that the loop below builds onto
for item in root.iter("DISS_para"):
    combined_abstracts += item.text  # append this paragraph's text to the combined string with +=
print(combined_abstracts)  # print the combined string, which should now contain all of the abstracts
If you want to strip all the tabs, you can import the re (regular expression) module and do it inside the print call:
import re
...
print(re.sub('\t', '', combined_abstracts))
So that would be:
import xml.etree.ElementTree as ET
import re
tree = ET.parse('thorpe.xml')
root = tree.getroot()
combined_abstracts = ""
for item in root.iter("DISS_para"):
    combined_abstracts += item.text
print(re.sub('\t', '', combined_abstracts))
You could also strip the tabs in the loop and print just the clean string:
import xml.etree.ElementTree as ET
import re
tree = ET.parse('thorpe.xml')
root = tree.getroot()
combined_abstracts = ""
for item in root.iter("DISS_para"):
    combined_abstracts += re.sub('\t', '', item.text)
print(combined_abstracts)
If you wanted to replace the tabs with a space:
combined_abstracts += re.sub('\t', ' ', item.text)
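One thing to watch for with the loops above: item.text is None for an empty element, and += with None raises a TypeError. Here is a minimal sketch that guards against that and uses str.replace instead of a regex, since the pattern is a single literal character. The inline SAMPLE_XML and the combine_paragraphs helper are made up for illustration; I'm only assuming your real file has DISS_para elements like the ones in thorpe.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for thorpe.xml; the "\t" escapes become real tabs.
SAMPLE_XML = """<DISS_abstract>
<DISS_para>\tFirst paragraph.</DISS_para>
<DISS_para/>
<DISS_para>Second\tparagraph.</DISS_para>
</DISS_abstract>"""


def combine_paragraphs(root, tab_replacement=""):
    """Join the text of every DISS_para element, replacing tabs along the way."""
    parts = []
    for item in root.iter("DISS_para"):
        text = item.text or ""  # item.text is None for an empty element
        parts.append(text.replace("\t", tab_replacement))
    return "".join(parts)


root = ET.fromstring(SAMPLE_XML)
combined = combine_paragraphs(root)
print(combined)
```

Passing tab_replacement=" " gives the "replace tabs with a space" variant. Collecting the pieces in a list and joining once at the end is just a common alternative to repeated +=; either works for a file this size.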
That's amazing! It works!
Trying to add it to a CSV file, but it's parsing every letter. Not sure what I'm doing wrong.
import xml.etree.ElementTree as ET
import re
import csv
forCSV =[]
forCSV.append(['abstract'])
tree = ET.parse('thorpe.xml')
root = tree.getroot()
combined_abstracts = ""
for item in root.iter("DISS_para"):
combined_abstracts += re.sub('\t', '', item.text)
forCSV.append(combined_abstracts)
with open('outputFile.csv', 'w', newline='') as csvfile:
wr = csv.writer(csvfile, delimiter=',')
wr.writerows(forCSV)
sorry, that indentation is all fucked up
thanks, I found my mistake! you're a wizard!
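For anyone who lands here with the same "every letter" symptom: I can't tell from the thread what the actual fix was, but a likely cause is appending a bare string to the list of rows. csv.writer treats each row as a sequence of fields, so a string gets split into one field per character. A minimal sketch, assuming that was the problem (the abstracts list below is made-up data, and io.StringIO stands in for outputFile.csv so the example runs anywhere):

```python
import csv
import io

# Hypothetical data standing in for the cleaned abstract strings.
abstracts = ["First abstract text.", "Second abstract text."]

rows = [["abstract"]]  # header row: a list with one field
for text in abstracts:
    rows.append([text])  # wrap each string in a list so it is written as one field

buffer = io.StringIO()  # in-memory stand-in for the output file
csv.writer(buffer, delimiter=",").writerows(rows)
output = buffer.getvalue()
print(output)
```

With forCSV.append(combined_abstracts) the row is the string itself; forCSV.append([combined_abstracts]) makes it a one-column row.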
So, you want to group all the abstracts into one string?