Last active
February 6, 2019 21:14
-
-
Save arq5x/a3147254af240a994e95 to your computer and use it in GitHub Desktop.
kmer fun with jellyfish
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import sys | |
from itertools import * | |
""" | |
compute the complexity of each kmer passed in | |
given the format of the output of `jellyfish dump -ct` | |
complexity is measured as the number of runs divided | |
by the total length of the sequence. | |
e.g., "AAAAA" would be 1/5 | |
and "ACTGC" would be 5/5 | |
and "AACCG" would be 3/5 | |
but "GCGCG" would also be 5/5 - not good. not good at all. | |
Prefer methods in: | |
1. http://nar.oxfordjournals.org/content/42/12/e99.long | |
2. DustMasker paper. | |
""" | |
def encode(l): | |
"""RLE for a string""" | |
return [(len(list(group)),name) for name, group in groupby(l)] | |
for line in open(sys.argv[1]): | |
kmer, count = line.strip().split('\t') | |
runs = encode(kmer) | |
num_runs = len(runs) | |
print kmer, count, float(len(runs))/float(len(kmer)) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
brew install jellyfish | |
jellyfish count -m 21 -c 3 -s 1000000 -t 4 ecoli.k12.fa | |
jellyfish dump -ct mer_counts.jf | head | |
TATTGTCATTATTCATTTTTT 1 | |
GTAATTTTTGTGCGACACGAA 1 | |
GCGGGCGGTATCCAGTTCGTT 1 | |
ATGTAGGCCCGGATGTTTGCC 1 | |
CATACAACAAGCGACGAGCAT 1 | |
CGCCGGGAAATACCCTGGTCG 1 | |
TTGATAAACGCCCGCAAAGAT 1 | |
AGCCGTCGTGATCCAGCGGAC 1 | |
TGCCCGTGGTAACAGTGCACT 1 | |
TTCGCCATTCACGGCGTTCGG 1 | |
# how many kmers exist more than once | |
jellyfish dump -L 2 -ct mer_counts.jf | wc -l | |
# top ten most abundant kmers | |
jellyfish dump -L 2 -ct mer_counts.jf | sort -k2,2nr | head | |
ATAAGGCGTTCACGCCGCATC 43 | |
GATAAGGCGTTCACGCCGCAT 43 | |
AAGGCGTTCACGCCGCATCCG 41 | |
CGGATGCGGCGTGAACGCCTT 39 | |
GGATAAGGCGTTCACGCCGCA 39 | |
TAAGGCGTTCACGCCGCATCC 39 | |
AGGCGTTCACGCCGCATCCGG 38 | |
GATGCGGCGTGAACGCCTTAT 38 | |
GGATGCGGCGTGAACGCCTTA 38 | |
GGCGTTCACGCCGCATCCGGC 38 | |
# assess the sequence "complexity" (i.e., num runs / length) of each the top 10most abundant kmers. | |
python complexity.py <(jellyfish dump -L 2 -ct mer_counts.jf | sort -k2,2nr | head) | |
ATAAGGCGTTCACGCCGCATC 43 0.809523809524 | |
GATAAGGCGTTCACGCCGCAT 43 0.809523809524 | |
AAGGCGTTCACGCCGCATCCG 41 0.761904761905 | |
CGGATGCGGCGTGAACGCCTT 39 0.761904761905 | |
GGATAAGGCGTTCACGCCGCA 39 0.761904761905 | |
TAAGGCGTTCACGCCGCATCC 39 0.761904761905 | |
AGGCGTTCACGCCGCATCCGG 38 0.761904761905 | |
GATGCGGCGTGAACGCCTTAT 38 0.809523809524 | |
GGATGCGGCGTGAACGCCTTA 38 0.761904761905 | |
GGCGTTCACGCCGCATCCGGC 38 0.761904761905 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment