John A. Bachman johnbachman
johnbachman / curating_entities.rst
Last active September 2, 2016 16:40
Curating entities for INDRA/Bioentities

How to curate entities

The starting point for curation is results from reading (e.g., from REACH), in the form of a pickled dictionary with lists of INDRA statements keyed by paper.
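As a minimal sketch of inspecting such a file, assuming it is an ordinary pickle holding a dictionary that maps a paper ID to a list of statements. The function name `summarize_reading_output` is hypothetical and not part of INDRA; unpickling real INDRA statements would also require that the `indra` package be importable.

```python
import pickle


def summarize_reading_output(path):
    """Return {paper_id: number_of_statements} for a pickled reading result.

    Assumes the pickle contains a dict of lists keyed by paper ID.
    """
    with open(path, "rb") as fh:
        stmts_by_paper = pickle.load(fh)
    return {paper_id: len(stmts) for paper_id, stmts in stmts_by_paper.items()}
```

Calling `summarize_reading_output` on the pickled reading output gives a quick check of how many statements each paper contributed before any curation starts.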

The first step is to generate a list of agent texts with grounding (abbreviated "twg" in filenames). It lists the entity texts in order of their frequency of occurrence, along with all of the identifiers they are grounded to across the corpus (the same string is often grounded to different IDs depending on the context of the paper). You'll also want the comparable list after filtering out agent texts that are already in the default grounding map.
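The tabulation described above can be sketched in plain Python. This is an illustration of the idea only, not the code in `grounding_mapper`: the `(text, db_refs)` pair format, the function names `texts_with_grounding` and `filter_unmapped`, and the representation of the grounding map as a dictionary keyed by agent text are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def texts_with_grounding(agents):
    """Tabulate agent texts by frequency with every grounding seen for each.

    `agents` is an iterable of (text, db_refs) pairs, where db_refs is a
    dict of database name to identifier. This pair format is an assumption
    made for the sketch, not INDRA's own data model.
    """
    counts = Counter()
    groundings = defaultdict(set)
    for text, db_refs in agents:
        counts[text] += 1
        # Store each grounding as a hashable, order-independent tuple.
        groundings[text].add(tuple(sorted(db_refs.items())))
    # Most frequent texts first.
    return [(text, n, sorted(groundings[text]))
            for text, n in counts.most_common()]


def filter_unmapped(rows, grounding_map):
    """Keep only rows whose text has no entry in the grounding map."""
    return [row for row in rows if row[0] not in grounding_map]
```

Texts that appear often and carry several distinct groundings are the ones most worth curating first, which is why the rows are sorted by count.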

To dump both of these files as CSV, run the grounding_mapper top-level script on pickled reading output. For example, for the REACH output from the batch 4 evaluation:

python -m indra.preassembler.grounding_mapper <filename>
johnbachman / pmc_to_s3.rst
Last active August 5, 2016 18:43
Procedure for uploading PMC content to S3

Download the PMC content directly from PMC to an Amazon EC2 instance with sufficient storage (at least 250 GB).

  • Run the ftp command-line program:

    ftp
    
  • Connect to the PMC FTP server and set passive mode on:

johnbachman / s3cache.py
Last active August 29, 2015 14:27 — forked from dpwrussell/s3cache.py
Extremely rudimentary s3cache code
import boto3
from botocore.exceptions import ClientError
import hashlib
import os
import errno
def mkdir_p(path):
    try:
        os.makedirs(path)