jherskovic / build_journal_term_database.py
Last active Jan 2, 2016
This gist, together with https://gist.github.com/jherskovic/8222898, reads a MEDLINE baseline distribution and builds a journal -> MeSH term count dictionary, then pickles it. It uses all CPUs on the system to process MEDLINE files in parallel.
# Parse each article and build a journal -> MeSH terms dictionary.
# Accumulate MeSH term counts by journal.
# Each worker puts its per-article output on a queue; a single process picks it up.
# This lets us process multiple PubMed files in parallel.
from read_medline import *
import multiprocessing
import cPickle as pickle  # Python 2; on Python 3, use the built-in pickle module
import traceback
import sys
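The comments above describe a producer/consumer layout: worker processes parse MEDLINE files and push per-article results onto a shared queue, and a single process drains the queue and accumulates counts. A minimal sketch of that pattern, written for Python 3; all names here (`worker`, `accumulate`, the sample journal and terms) are illustrative, not taken from the gist:

```python
import multiprocessing

def worker(filename, queue):
    # Illustrative stand-in for parsing one MEDLINE file: emit
    # (journal, mesh_terms) pairs onto the shared queue, then a sentinel.
    for journal, terms in [("J Am Med Inform Assoc", ["Humans", "Algorithms"])]:
        queue.put((journal, terms))
    queue.put(None)  # sentinel: this worker is done

def accumulate(queue, n_workers):
    # Single consumer: build a journal -> {term: count} dictionary.
    counts = {}
    done = 0
    while done < n_workers:
        item = queue.get()
        if item is None:
            done += 1
            continue
        journal, terms = item
        per_journal = counts.setdefault(journal, {})
        for term in terms:
            per_journal[term] = per_journal.get(term, 0) + 1
    return counts

if __name__ == "__main__":
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(f, q))
             for f in ["medline_file_1.xml.gz"]]  # hypothetical filename
    for p in procs:
        p.start()
    result = accumulate(q, len(procs))
    for p in procs:
        p.join()
    print(result)
```

The sentinel-per-worker convention lets the consumer know when every producer has finished without polling process state.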
@jherskovic
jherskovic / read_medline.py
Created Jan 2, 2014
Functions to iterate over the MEDLINE baseline collection files from the NLM. Requires https://github.com/martinblech/xmltodict and a MEDLINE baseline distribution in gzip format.
import glob
import xmltodict
import sys
import os
import logging
import hashlib
from gzip import GzipFile
from pprint import pprint
try:
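xmltodict turns each citation record into nested dictionaries and lists, and the counting code then walks that structure. A small stdlib-only sketch of pulling MeSH descriptor names out of a citation in that shape; the dictionary below is a trimmed, hand-written stand-in for `xmltodict.parse()` output, not real baseline data, and `mesh_terms` is an illustrative helper, not a function from the gist:

```python
# Simplified shape of one parsed citation (assumption: xmltodict puts
# element text under "#text" when the element also carries attributes,
# as MEDLINE DescriptorName elements do).
citation = {
    "MedlineCitation": {
        "Article": {"Journal": {"Title": "Journal of Biomedical Informatics"}},
        "MeshHeadingList": {
            "MeshHeading": [
                {"DescriptorName": {"#text": "Humans"}},
                {"DescriptorName": {"#text": "Algorithms"}},
            ]
        },
    }
}

def mesh_terms(citation):
    # xmltodict collapses a single repeated child into a dict instead of
    # a one-element list, so normalize before iterating.
    headings = (citation["MedlineCitation"]
                .get("MeshHeadingList", {})
                .get("MeshHeading", []))
    if isinstance(headings, dict):
        headings = [headings]
    return [h["DescriptorName"]["#text"] for h in headings]

print(mesh_terms(citation))  # ['Humans', 'Algorithms']
```

The single-vs-list normalization is the main pitfall when consuming xmltodict output: an article with exactly one MeSH heading would otherwise break the loop.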
jherskovic / medication_parser.py
Created Oct 15, 2013
This regular expression defines the default expected medication format for the MedRec project.
import re

medication_parser = re.compile(r"""^\s*(?P<name>.*?)
\s+(?P<dose>[0-9\.\/]+)
\s*(?P<units>([mck]|mc)g|[md]l)
\s*(?P<formulation>.*?)
;
\s*?(?P<instructions>.*)""",
re.IGNORECASE | re.VERBOSE)
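A usage example for the pattern above; the medication string is made up for illustration (the regex is repeated here so the snippet runs on its own):

```python
import re

# Same pattern as in medication_parser.py above.
medication_parser = re.compile(r"""^\s*(?P<name>.*?)
                                   \s+(?P<dose>[0-9\.\/]+)
                                   \s*(?P<units>([mck]|mc)g|[md]l)
                                   \s*(?P<formulation>.*?)
                                   ;
                                   \s*?(?P<instructions>.*)""",
                               re.IGNORECASE | re.VERBOSE)

m = medication_parser.match("Lisinopril 10 mg tablet; take once daily")
print(m.group('name'), m.group('dose'), m.group('units'))  # Lisinopril 10 mg
print(m.group('formulation'))                              # tablet
print(m.group('instructions').strip())                     # take once daily
```

Note that `re.VERBOSE` ignores the literal whitespace and line breaks inside the pattern, so the named groups must rely on explicit `\s+`/`\s*` tokens to consume the spaces between fields; the lazy `\s*?` before `instructions` means the captured text can carry a leading space, hence the `.strip()`.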