jherskovic /
Last active Jan 2, 2016
This gist, together with the read_medline module it imports, reads a MEDLINE baseline distribution, builds a journal->MeSH term count dictionary, and pickles it. It uses all CPUs in the system to process MEDLINE files in parallel.
# Parse each article and build a journal -> MeSH terms dictionary.
# Accumulate MeSH term counts by journal.
# Each worker puts its article output on a queue; a single process picks it up.
# This way, we can process multiple PubMed files in parallel.
from read_medline import *
import multiprocessing
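try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle  # Python 3: cPickle was folded into pickle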
import traceback
import sys
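
The comments above describe a fan-in design: several workers parse files in parallel and push their results onto a queue, while a single process drains the queue and accumulates the counts. The following is a minimal sketch of that pattern, not the gist's actual code. The names `parse_file`, `worker`, `accumulate`, and `count_mesh_terms` are hypothetical, and `parse_file` reads a toy tab-separated format as a stand-in for the real MEDLINE parser in read_medline.

```python
import multiprocessing
import pickle
from collections import Counter, defaultdict

SENTINEL = None  # each worker puts this on the queue once when it is done


def parse_file(path):
    """Hypothetical stand-in for the MEDLINE parser.

    Assumes a toy format, one article per line: "journal<TAB>term1|term2".
    Yields (journal, [mesh_terms]) tuples.
    """
    with open(path) as handle:
        for line in handle:
            journal, _, terms = line.rstrip("\n").partition("\t")
            yield journal, [t for t in terms.split("|") if t]


def worker(paths, queue):
    """Parse every file in `paths` and put each article's output on the queue."""
    try:
        for path in paths:
            for record in parse_file(path):
                queue.put(record)
    finally:
        queue.put(SENTINEL)  # always signal completion, even after an error


def accumulate(queue, n_workers):
    """Single consumer: drain the queue until every worker has finished."""
    counts = defaultdict(Counter)
    finished = 0
    while finished < n_workers:
        item = queue.get()
        if item is SENTINEL:
            finished += 1
            continue
        journal, terms = item
        counts[journal].update(terms)
    return counts


def count_mesh_terms(paths, output_path, n_workers=None):
    """Fan out over `paths`, accumulate journal -> term counts, pickle them."""
    n_workers = max(1, min(n_workers or multiprocessing.cpu_count(), len(paths)))
    chunks = [paths[i::n_workers] for i in range(n_workers)]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(chunk, queue))
             for chunk in chunks]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so workers never block on a full pipe.
    counts = accumulate(queue, n_workers)
    for proc in procs:
        proc.join()
    plain = {journal: dict(c) for journal, c in counts.items()}
    with open(output_path, "wb") as handle:
        pickle.dump(plain, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return plain
```

Keeping accumulation in one process means the dictionary needs no locking. On platforms that spawn rather than fork, `count_mesh_terms` should be called from under an `if __name__ == "__main__":` guard.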
jherskovic /
Created Jan 2, 2014
Functions to iterate over the MEDLINE baseline collection files from the NLM. Requires xmltodict and a MEDLINE baseline distribution in gzip format.
import glob
import xmltodict
import sys
import os
import logging
import hashlib
from gzip import GzipFile
from pprint import pprint
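
As a rough illustration of iterating over gzipped baseline files, the sketch below streams citations out of every matching file in a directory. It deliberately swaps xmltodict for the standard library's `xml.etree.ElementTree.iterparse` so that it runs without third-party packages. The function name `iter_citations`, the `*.xml.gz` pattern, and the element paths (`MedlineCitation`, `Article/Journal/Title`, `MeshHeadingList/MeshHeading/DescriptorName`) are assumptions about the baseline layout, not taken from the gist.

```python
import glob
import gzip
import os
import xml.etree.ElementTree as ET


def iter_citations(directory, pattern="*.xml.gz"):
    """Yield (journal_title, [mesh_descriptors]) for each citation found.

    Assumes every file matching `pattern` is a gzipped XML document that
    contains <MedlineCitation> elements (an assumption about the layout).
    """
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with gzip.open(path, "rb") as handle:
            # iterparse streams the document, so large files are not
            # loaded into memory in one piece.
            for _event, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag != "MedlineCitation":
                    continue
                journal = elem.findtext("Article/Journal/Title")
                terms = [
                    d.text
                    for d in elem.iterfind(
                        "MeshHeadingList/MeshHeading/DescriptorName")
                    if d.text
                ]
                yield journal, terms
                elem.clear()  # drop the parsed subtree once consumed
```

A caller could feed these tuples straight into a per-journal counter, for example `for journal, terms in iter_citations("/path/to/baseline"): ...`, where the path is a placeholder.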
jherskovic /
Created Oct 15, 2013
This is the default expected medication format for the MedRec project
medication_parser = re.compile(r"""^\s*(?P<name>.*?)