Skip to content

Instantly share code, notes, and snippets.

@arq5x
Last active February 6, 2019 21:14
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save arq5x/a3147254af240a994e95 to your computer and use it in GitHub Desktop.
Save arq5x/a3147254af240a994e95 to your computer and use it in GitHub Desktop.
kmer fun with jellyfish
import sys
from itertools import *
"""
compute the complexity of each kmer passed in
given the format of the output of `jellyfish dump -ct`
complexity is measured as the number of runs divided
by the total length of the sequence.
e.g., "AAAAA" would be 1/5
and "ACTGC" would be 5/5
and "AACCG" would be 3/5
but "GCGCG" would also be 5/5 - not good. not good at all.
Prefer methods in:
1. http://nar.oxfordjournals.org/content/42/12/e99.long
2. DustMasker paper.
"""
def encode(l):
"""RLE for a string"""
return [(len(list(group)),name) for name, group in groupby(l)]
for line in open(sys.argv[1]):
kmer, count = line.strip().split('\t')
runs = encode(kmer)
num_runs = len(runs)
print kmer, count, float(len(runs))/float(len(kmer))
brew install jellyfish
jellyfish count -m 21 -c 3 -s 1000000 -t 4 ecoli.k12.fa
jellyfish dump -ct mer_counts.jf | head
TATTGTCATTATTCATTTTTT 1
GTAATTTTTGTGCGACACGAA 1
GCGGGCGGTATCCAGTTCGTT 1
ATGTAGGCCCGGATGTTTGCC 1
CATACAACAAGCGACGAGCAT 1
CGCCGGGAAATACCCTGGTCG 1
TTGATAAACGCCCGCAAAGAT 1
AGCCGTCGTGATCCAGCGGAC 1
TGCCCGTGGTAACAGTGCACT 1
TTCGCCATTCACGGCGTTCGG 1
# how many kmers exist more than once
jellyfish dump -L 2 -ct mer_counts.jf | wc -l
# top ten most abundant kmers
jellyfish dump -L 2 -ct mer_counts.jf | sort -k2,2nr | head
ATAAGGCGTTCACGCCGCATC 43
GATAAGGCGTTCACGCCGCAT 43
AAGGCGTTCACGCCGCATCCG 41
CGGATGCGGCGTGAACGCCTT 39
GGATAAGGCGTTCACGCCGCA 39
TAAGGCGTTCACGCCGCATCC 39
AGGCGTTCACGCCGCATCCGG 38
GATGCGGCGTGAACGCCTTAT 38
GGATGCGGCGTGAACGCCTTA 38
GGCGTTCACGCCGCATCCGGC 38
# assess the sequence "complexity" (i.e., num runs / length) of each the top 10most abundant kmers.
python complexity.py <(jellyfish dump -L 2 -ct mer_counts.jf | sort -k2,2nr | head)
ATAAGGCGTTCACGCCGCATC 43 0.809523809524
GATAAGGCGTTCACGCCGCAT 43 0.809523809524
AAGGCGTTCACGCCGCATCCG 41 0.761904761905
CGGATGCGGCGTGAACGCCTT 39 0.761904761905
GGATAAGGCGTTCACGCCGCA 39 0.761904761905
TAAGGCGTTCACGCCGCATCC 39 0.761904761905
AGGCGTTCACGCCGCATCCGG 38 0.761904761905
GATGCGGCGTGAACGCCTTAT 38 0.809523809524
GGATGCGGCGTGAACGCCTTA 38 0.761904761905
GGCGTTCACGCCGCATCCGGC 38 0.761904761905
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment