@crowesn
Last active March 5, 2019 15:33
import xml.etree.ElementTree as ET
tree = ET.parse('thorpe.xml')
root = tree.getroot()
for item in root.iter('DISS_para'):
    abstract = item.text
    myList = [abstract.split('\t')]
    ''.join(map(str, myList))
    print(abstract)
@crowesn
Author

crowesn commented Mar 4, 2019

So, you want to group all the abstracts into one string?

@crowesn
Author

crowesn commented Mar 4, 2019

Try this:

import xml.etree.ElementTree as ET

tree = ET.parse('thorpe.xml')
root = tree.getroot()

combined_abstracts = "" # create empty string variable that we can build onto in our loop below

for item in root.iter("DISS_para"):
    combined_abstracts += item.text  # append this paragraph's text to the running string with +=

print(combined_abstracts) # print the combined_abstracts string object which should contain all of the abstracts
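One caveat with the loop above: `item.text` is `None` for an empty `DISS_para` element, and `+=` would then raise a `TypeError`. Here is a minimal sketch of a more defensive version. It assumes a made-up inline XML snippet in place of `thorpe.xml` (the wrapper element and layout are guesses, not the real file), and `combine_paragraphs` is just an illustrative helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for thorpe.xml; the real file's layout may differ.
SAMPLE_XML = """<DISS_abstract>
  <DISS_para>First paragraph.</DISS_para>
  <DISS_para/>
  <DISS_para>Second paragraph.</DISS_para>
</DISS_abstract>"""


def combine_paragraphs(root):
    """Join the text of every DISS_para element, treating empty elements as ''."""
    return "".join(item.text or "" for item in root.iter("DISS_para"))


root = ET.fromstring(SAMPLE_XML)
combined_abstracts = combine_paragraphs(root)
print(combined_abstracts)
```

Using `"".join(...)` instead of repeated `+=` is mostly a style choice here; the `or ""` part is what guards against empty elements.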

@crowesn
Author

crowesn commented Mar 4, 2019

If you want to strip all the tabs, you can import the re (regular expression) module and do this inside the print call:

import re

...

print(re.sub('\t', '', combined_abstracts))

@crowesn
Author

crowesn commented Mar 4, 2019

So that would be:

import xml.etree.ElementTree as ET
import re

tree = ET.parse('thorpe.xml')
root = tree.getroot()

combined_abstracts = "" 

for item in root.iter("DISS_para"):
    combined_abstracts += item.text

print(re.sub('\t', '', combined_abstracts))
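Since the pattern here is a literal tab rather than a real regular expression, the built-in `str.replace` should give the same result without needing `re`. A small sketch with a made-up sample string:

```python
import re

# Made-up sample text; in the script above this would be combined_abstracts.
sample = "An\tabstract\twith\ttabs."

with_regex = re.sub('\t', '', sample)    # approach used in the thread
with_replace = sample.replace('\t', '')  # plain string method, no regex needed

print(with_replace)
print(with_regex == with_replace)
```

Either works; `re.sub` becomes more useful once the pattern is something a plain string can't express.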

@crowesn
Author

crowesn commented Mar 4, 2019

You could also strip the tabs in the loop and print just the clean string:

import xml.etree.ElementTree as ET
import re

tree = ET.parse('thorpe.xml')
root = tree.getroot()

combined_abstracts = "" 

for item in root.iter("DISS_para"):
    combined_abstracts += re.sub('\t', '', item.text)

print(combined_abstracts)
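Something to watch for, in case any paragraph contains nested markup such as italics: `item.text` only returns the text before the first child element. A sketch using `itertext()`, with a made-up snippet since I'm not sure the real file has nested tags:

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with a nested <i> element; thorpe.xml may not contain any.
para = ET.fromstring("<DISS_para>Study of\t<i>E. coli</i> growth.</DISS_para>")

only_leading_text = para.text         # text before the first child element only
full_text = "".join(para.itertext())  # text of the element and all its descendants

print(repr(only_leading_text))
print(repr(full_text.replace('\t', ' ')))
```

If the paragraphs are plain text, `item.text` and `itertext()` give the same thing, so this only matters for nested content.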

@crowesn
Author

crowesn commented Mar 4, 2019

If you wanted to replace the tabs with a space:

combined_abstracts += re.sub('\t', ' ', item.text)
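If the tabs sit next to existing spaces or newlines, replacing each one with a space can leave runs of whitespace. One option, assuming you're happy to normalize all whitespace and not just tabs, is to split and re-join:

```python
# Made-up sample; the real abstracts may have different whitespace.
text = "First line\t\twith tabs \n and a newline."

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses tabs, newlines and repeated spaces.
normalized = " ".join(text.split())

print(normalized)
```

Note this also removes line breaks, which may or may not be what you want for abstracts with several paragraphs.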

@carohansen

That's amazing! It works!

@carohansen

Trying to add it to a CSV file, but it's parsing every letter. Not sure what I'm doing wrong.

import xml.etree.ElementTree as ET
import re
import csv

forCSV =[]
forCSV.append(['abstract'])

tree = ET.parse('thorpe.xml')
root = tree.getroot()

combined_abstracts = ""

for item in root.iter("DISS_para"):
combined_abstracts += re.sub('\t', '', item.text)

forCSV.append(combined_abstracts)

with open('outputFile.csv', 'w', newline='') as csvfile:
wr = csv.writer(csvfile, delimiter=',')
wr.writerows(forCSV)

@carohansen

sorry, that indentation is all fucked up

@carohansen

thanks, I found my mistake! you're a wizard!
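
The fix itself isn't posted in the thread. A likely cause of the one-letter-per-cell output is that `writerows` expects a list of rows, and a bare string appended to `forCSV` gets treated as a row whose fields are its individual characters. A sketch of one way around it, wrapping the string in a list (the sample text is made up, and it writes to an in-memory buffer instead of `outputFile.csv`):

```python
import csv
import io

combined_abstracts = "First paragraph. Second paragraph."  # made-up sample text

forCSV = [['abstract']]              # header row
forCSV.append([combined_abstracts])  # one-field row: note the list around the string

buffer = io.StringIO(newline='')     # stands in for open('outputFile.csv', 'w', newline='')
wr = csv.writer(buffer, delimiter=',')
wr.writerows(forCSV)

print(buffer.getvalue())
```

With a real file, the `with open('outputFile.csv', 'w', newline='') as csvfile:` block from the comment above should work the same way once each row is a list.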
