@dkoslicki
Last active January 25, 2020 01:07
Dump the kmers in a CMash database
import os
import sys

# Make the CMash package importable relative to this script's location
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + "/CMash/CMash")
from CMash import MinHash as MH

training_database = sys.argv[1]  # first input is the training file name
dump_file = sys.argv[2]  # second input is the desired output dump file

# Load every sketch (the CE objects) stored in the HDF5 training database
CEs = MH.import_multiple_from_single_hdf5(training_database)

# Write each k-mer as its own FASTA record: a ">seqN" header line, then the k-mer
with open(dump_file, 'w') as fid:
    i = 0
    for CE in CEs:
        for kmer in CE._kmers:
            fid.write('>seq%d\n' % i)
            fid.write('%s\n' % kmer)
            i += 1
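To sanity-check a dump, the file can be read back as two-line FASTA records. The following is a minimal sketch, not part of CMash: `read_dump` is a hypothetical helper, and it assumes the layout written by the script above (one `>seqN` header line followed by one k-mer line).

```python
def read_dump(path):
    """Yield (header, kmer) pairs from a two-line-per-record FASTA dump.

    Hypothetical helper for illustration only; it is not part of CMash.
    """
    with open(path) as fid:
        header = None
        for line in fid:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # Header line: remember the name without the leading '>'
                header = line[1:]
            elif header is not None:
                # Sequence line that follows a header: emit the record
                yield header, line
                header = None
```

For example, `len(list(read_dump(dump_file)))` gives the number of k-mers written, which should match the final value of `i` in the dump script.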