Artem Tarasov lomereiter
  1. Add set -g escape-time 10 to ~/.tmux.conf
  2. Also add set -g default-terminal "screen-256color"
  3. Run tmux source-file ~/.tmux.conf to reload the config
lomereiter / flagstat.py
Created January 28, 2017 14:51
Pythonic flagstat for ADAM Parquet files
# environment: python3; conda install -c conda-forge fastparquet=0.0.4post1 joblib
# usage: python flagstat.py <dataset.adam>
from collections import Counter
import sys
import fastparquet
from fastparquet.core import read_row_group_file
from fastparquet.schema import SchemaHelper
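
The preview above stops at the imports. As a rough illustration of what a flagstat-style tally involves, here is a minimal sketch that counts categories with a Counter. It assumes integer SAM-style flags as input (bit values from the SAM specification); how the gist itself derives per-read fields from the Parquet columns is not shown in the preview, and the `flagstat` function and category names below are hypothetical. The counts only approximate what samtools flagstat reports.

```python
from collections import Counter

# Flag bits as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
READ1 = 0x40
READ2 = 0x80
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def flagstat(flags):
    """Tally flagstat-style categories over an iterable of integer SAM flags.

    Hypothetical helper for illustration; not the gist's actual code.
    """
    stats = Counter()
    for flag in flags:
        stats["total"] += 1
        if flag & QC_FAIL:
            stats["qc_failed"] += 1
        if flag & SECONDARY:
            stats["secondary"] += 1
        if flag & SUPPLEMENTARY:
            stats["supplementary"] += 1
        if flag & DUPLICATE:
            stats["duplicates"] += 1
        if not flag & UNMAPPED:
            stats["mapped"] += 1
        # Pair-level categories are counted on primary records only.
        if flag & PAIRED and not flag & (SECONDARY | SUPPLEMENTARY):
            stats["paired"] += 1
            if flag & READ1:
                stats["read1"] += 1
            if flag & READ2:
                stats["read2"] += 1
            if flag & PROPER_PAIR and not flag & UNMAPPED:
                stats["properly_paired"] += 1
    return stats


# Made-up flags: a proper pair, an unmapped read, a duplicate, a secondary hit.
example_flags = [99, 147, 4, 1024 + 99, 256 + 16]
stats = flagstat(example_flags)
for name in sorted(stats):
    print(name, stats[name])
```

Because each record is independent, per-row-group counters can be computed separately and summed with `Counter.__add__`, which is presumably why joblib appears in the environment line.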

Summary of the problem from the mz5 paper (concerning .mzML but just as true for .imzML):

Although based on excellent ontologies, relying on the extended markup language (XML) for the straightforward implementation of mzData, mzXML, and mzML makes for a major efficiency bottleneck. XML was designed to be a human readable, textual data format with considerable inherent verbosity and redundancy. XML was not designed for efficient bulk data storage, and the general modus operandi requires reading complete files to construct the XML parse tree. The mzXML and mzML formats partly circumvent these limitations by using base-64 encoding and (optional) compression of the raw MS scan data in combination with an application-specific indexing system. Despite the improvements gained from these efforts,
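
To make the "base-64 encoding and (optional) compression" step concrete, here is a minimal sketch of how a numeric array can be embedded in a text format this way. It assumes little-endian 64-bit floats and zlib compression; the function names are hypothetical and this is not a full mzML reader or writer.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian float64, optionally zlib-compress, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse encode_array: base64-decode, optionally decompress, unpack float64 values."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


# Made-up intensities for illustration.
intensities = [0.0, 12.5, 1024.0, 0.0, 3.25]
encoded = encode_array(intensities)
decoded = decode_array(encoded)
print(encoded)
print(decoded)
```

Even without compression, base64 turns every 3 bytes into 4 characters, so the encoded text is about a third larger than the raw binary, and the reader still has to decode the whole array before it can touch a single value.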

Serialization: best practices

(In this document I pay attention mostly to data storage in scientific applications, not to web protocols.)

Traditional approaches

  • XML:
    • slow to parse
    • schemas (.xsd) are human-readable but hard to edit without special software
    • tooling for generating code for reading/writing is limited (mostly to Java)
lomereiter / pipeline.py
Last active August 31, 2015 23:22
Metabolomics pipeline
import sys
# path to pyIMS parent dir
sys.path.append("/home/lomereiter/github")
from pyIMS.image_measures.level_sets_measure import measure_of_chaos
from pyIMS.image_measures.isotope_image_correlation import isotope_image_correlation
from pyIMS.image_measures.isotope_pattern_match import isotope_pattern_match
import numpy as np
import cPickle
// compilation: rdmd --build-only -O -release -inline -IBioD sambamba_161.d
// to use LDC: rdmd --compiler=ldmd2 [--force] ...
import bio.bam.reader, bio.bam.writer, std.parallelism;
void main(string[] args) {
    // boilerplate
    defaultPoolThreads = 8;
    auto input = new BamReader(args[1]); // use std.getopt for better args handling
    auto output = new BamWriter(args[2]);
    output.writeSamHeader(input.header);
diff --git a/wqflask/base/data_set.py b/wqflask/base/data_set.py
index a572a60..b152357 100755
--- a/wqflask/base/data_set.py
+++ b/wqflask/base/data_set.py
@@ -555,12 +555,22 @@ class DataSet(object):
# """ % (query_args))
try:
- self.id, self.name, self.fullname, self.shortname = g.db.execute("""
+ if self.type != "ProbeSet":
> git-lfs smudge genotype_files/gemma/HLC.map
LocalWorkingDir=/home/lomereiter/github/genenetwork2
LocalGitDir=/home/lomereiter/github/genenetwork2/.git
LocalMediaDir=/home/lomereiter/github/genenetwork2/.git/lfs/objects
TempDir=/home/lomereiter/github/genenetwork2/.git/lfs/tmp
GIT_DIR=.git
Error accessing media: genotype_files/gemma/HLC.map (84241b81feb7eec3c0b914e223ff23810c69610a6e759baf4797bab4a4850de8)
Error downloading /home/lomereiter/github/genenetwork2/.git/lfs/objects/84/24/84241b81feb7eec3c0b914e223ff23810c69610a6e759baf4797bab4a4850de8.