Skip to content

Instantly share code, notes, and snippets.

@chapmanb
Created August 24, 2018 17:03
Show Gist options
  • Save chapmanb/acebd6cd80e8c592dbf96e42b315efa2 to your computer and use it in GitHub Desktop.
Save chapmanb/acebd6cd80e8c592dbf96e42b315efa2 to your computer and use it in GitHub Desktop.
Extract UMIs from input fastq file
from __future__ import print_function
import collections
import gzip
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
nbases = 15
nreads = 1e6
toshow = 20
umis = collections.defaultdict(int)
with gzip.open(sys.argv[1]) as in_handle:
for i, (n, seq, q) in enumerate(FastqGeneralIterator(in_handle)):
umis[seq[:nbases]] += 1
if i > nreads:
break
counts = [(c, u) for u, c in umis.items()]
counts.sort(reverse=True)
for i, (c, u) in enumerate(counts):
print(u, c)
if i > toshow:
break
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment