Skip to content

Instantly share code, notes, and snippets.

@TomConlin
Created December 27, 2019 22:10
Show Gist options
  • Save TomConlin/6cd976151d36dd3e2a9fb34842b9c66e to your computer and use it in GitHub Desktop.
Save TomConlin/6cd976151d36dd3e2a9fb34842b9c66e to your computer and use it in GitHub Desktop.
fasta binary formats exploration with Julia (v0.2?)
2019 - almost a decade ago now I helped mentor
a CS student writing a bioinformatics exercise in C++.
To help avoid doing his work I did not use C++.
Instead I experimented in Julia; probably v0.2 or v0.3.
The task was to compare ginormas metagenomics studies quickly.
#################################################
read a large file fast
open file handle
determine file length
(assume file is contiguous on disk)
memory map? reads larger blocks
what is disk read cache size?
what are intermediate cache sizes?
what is the time to fetch a cache full?
what is the time to process a cache full?
"C" 100 00 1 1
"G" 100 01 1 1
"A" 100 00 0 1
"T" 101 01 0 0
"N" 100 11 1 0
for c in "AaCcTtGgNn" println(c,' ',c&3) end
A 1
a 1
C 3
c 3
G 3
g 3
T 0
t 0
N 2
n 2
for c in "AaCcTtGgNn" println (c, ' ', c&7) end
A 1
a 1
C 3
c 3
T 4
t 4
G 7
g 7
N 6
n 6
0,1,2,3
[A,C,T,G]
Translating ASCII DNA into a two bit representation
using bitwise AND 3 on the nt-base works great
the fact this is case insensitive is due to the
brilliant ASCII designers of yore.
Having to deal with unknown 'N' bases complicates
processing and for now introduces a branch condition.
Bitwise AND with 7 followed by a bitwise shift allows
filtering for N's
c&=7
6==c ? c=rand(0:3) : c=c<<1
Another option is a 3/4 bit representation
three bit seems awkward for processing and leaves us with three "slots" undefined
hmmm.
any representation less than a word (including byte)
would benefit from a "padding" flag
indicating the representation is to be ignored
so the ALU could process full words.
Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots"
which may be enough handle nucleotide ambiguity codes ...
or indels or carry a coarse quality score.
For some tasks, e.g. find GC content
the representation is irrelevant as we do not store and reuse the sequence.
in which case the 2-bit transformation is sufficient
or the 3-bit transformation if 'N' is present
With N present there are three options
N is always GC (left as an exercise)
N is never GC GC is nt&1*(nt>>1)&1
N is sometimes GC CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1
#####################################################################
A 4bit (nibble) level encoding compatible with two-bit encoding
which allows gaps and ambiguity but requires lookup tables.
0000 A
0001 C
0010 T
0011 G
0100 !A 'B'
0101 !C 'D'
0110 !T 'V'
0111 !G 'H'
1000 !TA
1001 !TC
1010 !CA
1011 !TG
1100 !AG
1101 !CG
1110 <gap>
1111 N
0,1,2,3
[A,C,T,G]
###############################################################
translating the 4 base nucleotides from the two-bit (dense) representation
to a four-bit (sparse) representation (flag based)
mathematically b4 = 2^b2
but that is apt to be more expensive than necessary
b4=1<<(b2+1) is cheaper
1,2,4,8
[A,C,T,G]
A nibble level encoding for comparison/alignment with ambiguity
which does not need lookup for comparison
AND-ing two encodings results in an alignment
with the less ambiguous of two retained when matched.
a & b -> 0 no match
-> !0 match
nyb nt(s) ambiguity
0 0000 <gap> -
1 0001 A A*
2 0010 C C*
3 0011 AC M
4 0100 T T*
5 0101 TA W
6 0110 TC Y
7 0111 TCA H
8 1000 G G*
9 1001 GA R
10 1010 GC S
11 1011 GCA V
12 1100 GT K
13 1101 GTA D
14 1110 GTC B
15 1111 GTCA N
gaps can be used as padding to fill to the end of a machine word
given sequence from the restricted alphabet "AaCcTtUuGgNn"
(not the full ambiguity code)
c&=7
6==c ? rand(0:3) : 1<<(c>>1)
will encode the four primary bases without lookup
translating N to random base at the cost of a branch.
or
c&=7;
6==c ? CONST : 1<<(c>>1)
CONST = 15 will let N match everything.
CONST = 0 will make N match nothing.
or 1<<(c&7)>>1
will make N always be a 'G' which will be the cheapest
for the restricted alphabet, ANDing two 64bit words
(16 nucleotide bases encoded as nibbles within a 64bit word)
the only case where more than one bit in a nibble is set when matched is:
when an N is compared with N _and_ N is set to match anything
then the nibble will have all 4 bits set.
short of that pathological case,
the number of bases which matched is the 'Hamming Weight'
or 'population count' of number of bits set in the resulting word.
hence in julia
16 == popcount( seqA[word_n] & seqB[word_n] )
means all 16 bases in both sequences matched
note: popcount() is a native machine instruction in many
modern architectures which performs "sideways addition.
so a popcount() of 15 ~> 93.75% Similar and 14 -> 87.5% Similar
#########################################################################
k-mers
N's and ambiguity have no place
they can neither be nor match 'nothing'.
they must be constant i.e 'G' or random.
tabulating what a sliding window of a size represents,
2-bit representation 32-mer
3-bit representation 21-mer
4-bit representation 16-mer
existence is reportedly all that is being recorded
K-mer Index is:
a sparse structure?
a bitvector?
for 2 bit dense encoding
4^32 = 1.844674407×10¹⁹ flags take /8
2.305843009×10¹⁸ bytes or /1024
2.251799814×10¹⁵ KB or /1024
2.199023256×10¹² MB or /1024
2,147,483,648 GB or /1024
2,097,152 TB or /1024
2,048 PetB or /1024
2 ExoB
will not be doing a bit vector for 32-mers then
4 bit flag encoding
1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 greatest
- 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 least
____________________________________________________
0111011101110111011101110111011101110111011101110111011101110111
2,101,679,826,000,000 ... but there far fewer possible sequences than that
4^16 = 4,294,967,296 flags take / 8
536,870,912 bytes or / 1024
524,288 KB or / 1024\q
512M
that could almost live in a L3 cache or two and be indexed directly
with a half word
@time index=BitArray(2^32-1)
# ~17ms
# break apart the fasta records and feed the sequence to the mer-indexer
function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} )
mer::Int32
row::ASCIIString=""
rows::Array{ASCIIString} = open(readlines, filepath)
for row in rows
if('>' == first(row))
continue
else
fx(row,merdex)
end
end
end
# finds hexamers in a sequence & records their existence in an index
function hexamer(seq::ASCIIString, merdex::BitArray{1})
mer=0
# prime
for i 0:15 mer|=((seq[i+1]&7)>>1)<<2i end
merdex[mer]=true
for i 17:length(row)-17
mer=mer<<2
mer|=(seq[i]&7)>>1
merdex[mer]=true
end
end # ~hexamer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment