@bdunnette
Last active December 15, 2015 20:41
Parse an NLM MeSH Trees file into something usable as a Drupal taxonomy
from __future__ import print_function

# The data comes from the NLM's MeSH Trees file, which can be downloaded here: https://www.nlm.nih.gov/mesh/filelist.html
# Maps each term's index (e.g. 'A01.101') to its comma-joined hierarchy of term names
tree_array = {}

# Open both files in a 'with' block so they are closed (and the output flushed) when we finish
with open('mtrees2013.bin') as tree_file, open('mtrees2013-parsed.txt', 'w') as tree_outfile:
    for row in tree_file:
        # Strip the line ending and skip blank lines
        row = row.rstrip('\r\n')
        if not row:
            continue
        # The term's name is whatever precedes the semicolon; its 'index' is whatever follows it
        name, _, term_index = row.partition(';')
        # The term's parent (if any) has the address one level up the hierarchy,
        # i.e. A01.101.202's parent would be A01.101. A top-level index such as A01
        # contains no '.', so its parent_index is the empty string.
        parent_index = term_index.rpartition('.')[0]
        # If a parent term has already been parsed, put that at the front of the 'term' to provide a hierarchy
        if parent_index in tree_array:
            term = ','.join([tree_array[parent_index], name])
        # Otherwise, just take the name as the 'term'
        else:
            term = name
        # Add this term to our array (for future parent searches)
        tree_array[term_index] = term
        # Finally, write this term to our text file
        print(term, file=tree_outfile)
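
To try the parent lookup without downloading the full Trees file, the same logic can be wrapped in a function and fed a few rows by hand. This is only a sketch: the function name `build_hierarchy` and the sample rows are illustrative and not part of the script above. Like the script, it assumes every parent row appears before its children.

```python
from __future__ import print_function


def build_hierarchy(rows):
    """Return one comma-joined hierarchy string per 'name;index' row.

    Hypothetical helper for illustration; it assumes parents are listed
    before their children, as the script above does.
    """
    seen = {}   # term index -> hierarchy string built so far
    terms = []
    for row in rows:
        row = row.rstrip('\r\n')
        if not row:
            continue
        name, _, term_index = row.partition(';')
        # 'A01.047.025' -> 'A01.047'; 'A01' -> '' (no parent)
        parent_index = term_index.rpartition('.')[0]
        if parent_index in seen:
            term = ','.join([seen[parent_index], name])
        else:
            term = name
        seen[term_index] = term
        terms.append(term)
    return terms


# Illustrative sample rows in the 'name;index' layout the script expects
sample_rows = [
    'Body Regions;A01\n',
    'Abdomen;A01.047\n',
    'Abdominal Cavity;A01.047.025\n',
]

for term in build_hierarchy(sample_rows):
    print(term)
# Body Regions
# Body Regions,Abdomen
# Body Regions,Abdomen,Abdominal Cavity
```

Each output line carries the full path from the top-level term down to the term itself, one level per comma-separated field.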