David Alexander dalexander

## GitReconsidered.md

      
Created December 14, 2011 17:33

Git Reconsidered: Limitations we have encountered, and a new model for sharing with GitHub

Authors(?): David Alexander, Patrick Marks, Jim Bullard, Jonathan Bingham

P4 annoyances can be a developer headache

- p4 edit
- Forgetting to do p4 edit
- Having to tell p4 when you move/rename files
- Bulky p4 clients


## coverage.py
"""
Find coverage in [winStart, winEnd) implied by tStart, tEnd
vectors.

Original from rangeQueries, and two attempts at speeding it up.

In the common case I could imagine projectIntoRangeFast2 being fastest,
but on my amplicons test case projectIntoRangeFast1 wins.
"""

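The docstring above describes projecting alignment intervals onto a window. As a rough illustration of that idea (not the gist's own `projectIntoRange*` code, which is not shown here), the sketch below counts how many intervals cover each position of `[win_start, win_end)`. It assumes the intervals are half-open `[tStart, tEnd)`, and the name `project_into_range` is hypothetical.

```python
from itertools import accumulate


def project_into_range(t_start, t_end, win_start, win_end):
    """Per-position coverage of [win_start, win_end).

    Assumption: each interval is half-open, [t_start[i], t_end[i]).
    """
    width = win_end - win_start
    # Difference array: +1 where an interval enters, -1 where it leaves.
    delta = [0] * (width + 1)
    for s, e in zip(t_start, t_end):
        # Clip the interval to the window and shift to window coordinates.
        s = min(max(s, win_start), win_end) - win_start
        e = min(max(e, win_start), win_end) - win_start
        if s < e:
            delta[s] += 1
            delta[e] -= 1
    # The running sum of the difference array is the coverage.
    return list(accumulate(delta[:width]))


print(project_into_range([0, 12, 15], [5, 18, 30], 10, 20))
# [0, 0, 1, 1, 1, 2, 2, 2, 1, 1]
```

The difference-array trick keeps the cost at one pass over the intervals plus one pass over the window, whatever the overlap between intervals.
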
## minor BasH5Reader change proposal

Using the BasH5Reader is simple:

  >>> b = BasH5Reader("m1122...bas.h5")  # Load the file
  >>> zmw = b[9]                         # Get Zmw object(s) by slicing on holenumber(s)
  >>> myRead = zmw.subreads[0]           # Get ZmwRead object
  >>> myRead.basecalls()
  "GATTACA"
  >>> myRead.QualityValue()
  array([5, 6, 3, 4, 8, 8, 1])

## circularization.py
#!/usr/bin/env python

from pbcore.io.FastaIO import FastaReader, FastaWriter, FastaRecord
import shlex
import sys
import subprocess
import os
import re

usage = "usage: circularization.py initial_contigs.fastq 20000 /tmp circularized_contigs.fastq"

## FastaTable.md

      
Last active December 22, 2015 01:09

pbcore in 2.2: new class FastaTable

In pbcore for 2.2 I'm introducing a new class, FastaTable, which gives
easy random access FASTA reading.  It requires a FASTA index (.fai)
file sitting next to the FASTA on the filesystem, and it requires a
constant wrapping length in the FASTA (note that these requirements
are already fulfilled by all PacBio reference repository FASTAs).
Internally the class works by mmap'ing the file contents into virtual
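
To illustrate why a `.fai` index plus a constant wrapping length is enough for random access, here is a minimal sketch. It is not pbcore's `FastaTable` implementation; `byte_offset` and `fetch` are hypothetical names, and the index entry is assumed to carry the usual faidx fields (sequence length, byte offset of the first base, bases per line, bytes per line). Unix line endings are assumed.

```python
import mmap
import os
import tempfile


def byte_offset(i, offset, line_bases, line_width):
    """File position of base i, given constant line wrapping."""
    return offset + (i // line_bases) * line_width + (i % line_bases)


def fetch(path, fai_entry, start, end):
    """Return bases [start, end) of one contig via an mmap'ed FASTA.

    fai_entry is (length, offset, line_bases, line_width), assumed to
    come from the contig's line in the .fai file.
    """
    length, offset, line_bases, line_width = fai_entry
    end = min(end, length)
    if start >= end:
        return ""
    with open(path, "rb") as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            lo = byte_offset(start, offset, line_bases, line_width)
            hi = byte_offset(end, offset, line_bases, line_width)
            raw = mm[lo:hi]
        finally:
            mm.close()
    # The slice may span line breaks; drop them.
    return raw.replace(b"\n", b"").decode("ascii")


# Tiny demo file: one contig of 14 bases wrapped at 6 bases per line.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "wb") as out:
    out.write(b">chr1\nACGTAC\nGTTTGA\nCC\n")
demo_entry = (14, 6, 6, 7)  # length, offset, line_bases, line_width

print(fetch(demo_path, demo_entry, 4, 9))  # ACGTT
```

This sketch reopens and remaps the file on every call to stay short; a real reader would presumably map the file once and reuse the mapping.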

  
## genScatter.py
from pbcore.io import FastaTable
from nose.tools import eq_


def chunk(keysAndSizes, numChunks):
    """
    Heuristically attempt to split the keys up into sublists such that
    the total of sizes of each sublist is near the targetSize, and
    the chunks are well balanced.  Better to go over than under.
    """

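The body of `chunk` is not shown above, so the following is only one plausible reading of its docstring, not the gist's code. It walks the keys in order and closes a chunk once its running total reaches the target size (so chunks tend to go over rather than under). `chunk_sketch` is a hypothetical name, and the target is assumed to be the total size divided by `num_chunks`.

```python
def chunk_sketch(keys_and_sizes, num_chunks):
    """Split (key, size) pairs into at most num_chunks ordered sublists."""
    total = sum(size for _, size in keys_and_sizes)
    target = total / num_chunks  # assumed definition of the target size

    chunks, current, current_size = [], [], 0
    for key, size in keys_and_sizes:
        current.append(key)
        current_size += size
        # Close the chunk once it reaches the target, but always leave
        # the last chunk open so leftovers have somewhere to go.
        if current_size >= target and len(chunks) < num_chunks - 1:
            chunks.append(current)
            current, current_size = [], 0
    if current:
        chunks.append(current)
    return chunks


print(chunk_sketch([("a", 10), ("b", 20), ("c", 30), ("d", 40)], 2))
# [['a', 'b', 'c'], ['d']]
```

Note that this greedy pass can return fewer than `num_chunks` sublists when a few large entries dominate; a balancing heuristic that sorts by size first would trade key order for evenness.
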
## AlignmentFormat.md

      
Last active December 30, 2015 20:09

Custom Alignment File Format Proposal

Here's a starting point for a file format that we can discuss/negotiate.
Let's use the extension .aln for a text file like this:
## Custom Alignment Format v0.1

  
## findBadReads.py
#!/usr/bin/env python

from pbcore.io import BasH5Reader, M4Reader
from pbcore.util.Process import backticks
import sys, os.path as osp

def totalReadLength(m4r):
    #return m4r.qseqlength
    # qseqlength is bogus! use the offsets from the query string
    extent = map(int, m4r.qName.split("/")[-1].split("_"))
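
The rest of `totalReadLength` is cut off above. As a hedged sketch of the idea in the comment, the snippet below derives a length from the trailing `start_end` field of a query name. It assumes names shaped like `movie/holeNumber/start_end`; `total_read_length` and the example name are hypothetical. A list is built explicitly because in Python 3 `map` returns an iterator that cannot be indexed.

```python
def subread_extent(q_name):
    """Parse the trailing 'start_end' field of a query name."""
    start, end = [int(x) for x in q_name.split("/")[-1].split("_")]
    return start, end


def total_read_length(q_name):
    """Length implied by the query-name offsets (end minus start)."""
    start, end = subread_extent(q_name)
    return end - start


print(total_read_length("movie/42/100_850"))  # 750
```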

## showChem.py
#!/usr/bin/env python
import sys
import h5py

fname = sys.argv[1]
f = h5py.File(fname, "r")

if fname.endswith("bax.h5"):
    ri = f["/ScanData/RunInfo"]
    try:

## realign.py
#
# Push gaps forward in homopolymers
#
# Rewrite rule 1: XX    ===>  XX
#                 X-          -X
#
# Rewrite rule 2: X-    ===>  -X
#                 XX          XX
#
# Iterate until convergence.
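
The two rewrite rules can be sketched as a small fixed-point loop over a pair of gapped alignment strings. This is an illustration written from the comment alone, not the gist's implementation; `push_gaps` is a hypothetical name, and `-` is assumed to be the gap character.

```python
def push_gaps(top, bottom):
    """Apply both rewrite rules until nothing changes.

    Wherever one row reads XX and the other reads X- in the same two
    columns, the gapped row is rewritten to -X.
    """
    top, bottom = list(top), list(bottom)
    changed = True
    while changed:
        changed = False
        for i in range(len(top) - 1):
            # Rule 1 has the gap in the bottom row, rule 2 in the top row.
            for full, gapped in ((top, bottom), (bottom, top)):
                x = full[i]
                if (x != "-" and full[i + 1] == x
                        and gapped[i] == x and gapped[i + 1] == "-"):
                    gapped[i], gapped[i + 1] = "-", x
                    changed = True
    return "".join(top), "".join(bottom)


print(push_gaps("GAAAT", "GAA-T"))  # ('GAAAT', 'G-AAT')
```

Each rewrite moves a gap one column toward the start of the homopolymer, so the loop terminates once every gap sits at the run's first column.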
	"""
	Find coverage in [winStart, winEnd) implied by tStart, tEnd
	vectors.

	Original from rangeQueries, and two attempts at speeding it up.

	In the common case I could imagine projectIntoRangeFast2 being fastest,
	but on my amplicons test case projectIntoRangeFast1 wins.
	"""

	Using the BasH5Reader is simple:

	>>> b = BasH5Reader("m1122...bas.h5") # Load the file
	>>> zmw = b[9] # Get Zmw object(s) by slicing on holenumber(s)
	>>> myRead = zmw.subreads[0] # Get ZmwRead object
	>>> myRead.basecalls()
	"GATTACA"
	>>> myRead.QualityValue()
	array([5, 6, 3, 4, 8, 8, 1])
	#!/usr/bin/env python

	from pbcore.io.FastaIO import FastaReader, FastaWriter, FastaRecord
	import shlex
	import sys
	import subprocess
	import os
	import re

	usage = "usage: circulization.py initial_contigs.fastq 20000 /tmp circulaized_contigs.fastq"
	from pbcore.io import FastaTable
	from nose.tools import eq_


	def chunk(keysAndSizes, numChunks):
	"""
	Heuristically attempt to split the keys up into sublists such that
	the total of sizes of each sublist is near the targetSize, and
	the chunks are well balanced. Better to go over than under.
	"""
	#!/usr/bin/env python

	from pbcore.io import BasH5Reader, M4Reader
	from pbcore.util.Process import backticks
	import sys, os.path as osp

	def totalReadLength(m4r):
	#return m4r.qseqlength
	# qseqlength is bogus! use the offsets from the query string
	extent = map(int, m4r.qName.split("/")[-1].split("_"))
	#!/usr/bin/env python
	import sys
	import h5py

	fname = sys.argv[1]
	f = h5py.File(fname, "r")

	if fname.endswith("bax.h5"):
	ri = f["/ScanData/RunInfo"]
	try:
	#
	# Push gaps forward in homopolymers
	#
	# Rewrite rule 1: XX ===> XX
	# X- -X
	#
	# Rewrite rule 2: X- ===> -X
	# XX XX
	#
	# Iterate until convergence.