TomConlin/FastFastaRead.readme

## FastFastaRead.readme
2019 - almost a decade ago now I helped mentor
a CS student writing a bioinformatics exercise in C++.

To help avoid doing his work I did not use C++.
Instead I experimented in Julia; probably v0.2 or v0.3.

The task was to compare ginormas metagenomics studies quickly.


#################################################
read a large file fast


open file handle

determine file length

(assume file is contiguous on disk)

memory map? reads larger blocks

what is disk read cache size?

what are intermediate cache sizes?

what is the time to fetch a cache full?
what is the time to process a cache full?


 "C" 100 00 1 1
 "G" 100 01 1 1

 "A" 100 00 0 1
 "T" 101 01 0 0
 "N" 100 11 1 0


for c in "AaCcTtGgNn"  println(c,' ',c&3) end
A 1
a 1
C 3
c 3
G 3
g 3
T 0
t 0
N 2
n 2


for c in "AaCcTtGgNn" println (c, ' ', c&7) end
A 1
a 1
C 3
c 3
T 4
t 4
G 7
g 7
N 6
n 6


 0,1,2,3
[A,C,T,G]


Translating ASCII DNA into a two bit representation
using bitwise  AND 3 on the nt-base works great
the fact this is case insensitive is due to the
brilliant ASCII designers of yore.
Having to deal with unknown 'N' bases complicates
processing and for now introduces a branch condition.

Bitwise AND with 7 followed by a bitwise shift allows
filtering for N's

c&=7
6==c ? c=rand(0:3) : c=c<<1


Another option is a 3/4 bit representation
three bit seems awkward for processing and leaves us with three "slots" undefined
hmmm.
any representation less than a word (including byte)
would benefit from a "padding" flag
indicating the representation is to be ignored
so the ALU could process full words.


Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots"
which may be enough handle nucleotide ambiguity codes ...
or indels or carry a coarse quality score.


For some tasks, e.g. find GC content
the representation is irrelevant as we do not store and reuse the sequence.
in which case the 2-bit transformation is sufficient
or the 3-bit transformation if 'N' is present

With N present there are three options
	N is always GC 		   (left as an exercise)
	N is never GC			GC is nt&1*(nt>>1)&1
	N is sometimes GC		CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1


#####################################################################

A 4bit (nibble) level encoding compatible with two-bit encoding
which allows gaps and ambiguity but requires lookup tables.


 0000 A
 0001 C
 0010 T
 0011 G

	0100 !A  'B'
	0101 !C	 'D'
	0110 !T  'V'
	0111 !G  'H'

		1000 !TA
 		1001 !TC
 		1010 !CA
 		1011 !TG

			1100 !AG
			1101 !CG

			1110 <gap>

			1111 N


 0,1,2,3
[A,C,T,G]
###############################################################

translating the 4 base nucleotides from the two-bit (dense) representation
to a four-bit (sparse) representation (flag based)
mathematically b4 = 2^b2
but that is apt to be more expensive than necessary

b4=1<<(b2+1) is cheaper

 1,2,4,8
[A,C,T,G]


A nibble level encoding for comparison/alignment with ambiguity
which does not need lookup for comparison

AND-ing two encodings results in an alignment
with the less ambiguous of two retained when matched.


a & b -> 0  no match
      -> !0 match

	nyb		nt(s)		ambiguity

0	0000	<gap>	 	-
1	0001	A		 	A*
2	0010	C		 	C*
3	0011	AC			M
4	0100	T			T*
5	0101	TA			W
6	0110	TC			Y
7	0111	TCA			H
8	1000	G			G*
9	1001	GA			R
10	1010	GC			S
11	1011	GCA			V
12	1100	GT			K
13	1101	GTA			D
14	1110	GTC			B
15	1111	GTCA		N


gaps can be used as padding to fill to the end of a machine word


given sequence from the restricted alphabet "AaCcTtUuGgNn"
(not the full ambiguity code)


c&=7
6==c ? rand(0:3) : 1<<(c>>1)

will encode the four primary bases without lookup
translating N to random base at the cost of a branch.


or
c&=7;
6==c ? CONST : 1<<(c>>1)

CONST = 15 will let N match everything.
CONST =  0 will make N match nothing.


or 1<<(c&7)>>1

will make N  always be a 'G'  which will be the cheapest


for the restricted alphabet, ANDing two 64bit words
(16 nucleotide bases encoded as nibbles within a 64bit word)
the only case where more than one bit in a nibble is set when matched is:
when an N is compared with N _and_ N is set to match anything
then the nibble will have all 4 bits set.
short of that pathological case,

the number of bases which matched is the 'Hamming Weight'
or 'population count' of number of bits set in the resulting word.

hence in julia

16 == popcount( seqA[word_n] & seqB[word_n] )

means all 16 bases in both sequences matched
note: popcount() is a native machine instruction in many
modern architectures which performs "sideways addition.

so a popcount() of  15 ~> 93.75% Similar and 14 -> 87.5% Similar

#########################################################################


k-mers

N's and ambiguity have no place
they can neither be nor match 'nothing'.
they must be constant i.e 'G' or random.

tabulating what a sliding window of a size represents,

2-bit representation 32-mer
3-bit representation 21-mer
4-bit representation 16-mer

existence is reportedly all that is being recorded

K-mer Index is:
 	a sparse structure?
 	a bitvector?

for 2 bit dense encoding

4^32 =  1.844674407×10¹⁹ flags take /8
		2.305843009×10¹⁸ bytes or /1024
		2.251799814×10¹⁵ KB    or /1024
	    2.199023256×10¹² MB    or /1024
	    2,147,483,648 GB       or /1024
	    2,097,152 TB           or /1024
	    2,048 PetB             or /1024
	    2 ExoB
will not be doing a bit vector for 32-mers then


4 bit flag encoding

  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000  greatest
- 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001  least
  ____________________________________________________
  0111011101110111011101110111011101110111011101110111011101110111


2,101,679,826,000,000  ... but there far fewer possible sequences than that

 4^16 = 4,294,967,296 flags take / 8
 		536,870,912 bytes  or    / 1024
 		524,288 KB or            / 1024\q
 		512M

that could almost live in a L3 cache or two and be indexed directly
with a half word


@time index=BitArray(2^32-1)
# ~17ms

# break apart the fasta records and feed the sequence to the mer-indexer
function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} )
	mer::Int32
	row::ASCIIString=""
    rows::Array{ASCIIString} = open(readlines, filepath)
	for row in rows
		if('>' == first(row))
 			continue
 		else
 			fx(row,merdex)
		end
	end
end

# finds hexamers in a sequence & records their existence in an index
function hexamer(seq::ASCIIString, merdex::BitArray{1})
	mer=0
	# prime
	for i 0:15 mer|=((seq[i+1]&7)>>1)<<2i end
	merdex[mer]=true
	for i 17:length(row)-17
		mer=mer<<2
		mer|=(seq[i]&7)>>1
		merdex[mer]=true
	end
end # ~hexamer
	2019 - almost a decade ago now I helped mentor
	a CS student writing a bioinformatics exercise in C++.

	To help avoid doing his work I did not use C++.
	Instead I experimented in Julia; probably v0.2 or v0.3.

	The task was to compare ginormas metagenomics studies quickly.



	#################################################
	read a large file fast


	open file handle

	determine file length

	(assume file is contiguous on disk)

	memory map? reads larger blocks

	what is disk read cache size?

	what are intermediate cache sizes?

	what is the time to fetch a cache full?
	what is the time to process a cache full?


	"C" 100 00 1 1
	"G" 100 01 1 1

	"A" 100 00 0 1
	"T" 101 01 0 0
	"N" 100 11 1 0


	for c in "AaCcTtGgNn" println(c,' ',c&3) end
	A 1
	a 1
	C 3
	c 3
	G 3
	g 3
	T 0
	t 0
	N 2
	n 2


	for c in "AaCcTtGgNn" println (c, ' ', c&7) end
	A 1
	a 1
	C 3
	c 3
	T 4
	t 4
	G 7
	g 7
	N 6
	n 6


	0,1,2,3
	[A,C,T,G]


	Translating ASCII DNA into a two bit representation
	using bitwise AND 3 on the nt-base works great
	the fact this is case insensitive is due to the
	brilliant ASCII designers of yore.
	Having to deal with unknown 'N' bases complicates
	processing and for now introduces a branch condition.

	Bitwise AND with 7 followed by a bitwise shift allows
	filtering for N's

	c&=7
	6==c ? c=rand(0:3) : c=c<<1


	Another option is a 3/4 bit representation
	three bit seems awkward for processing and leaves us with three "slots" undefined
	hmmm.
	any representation less than a word (including byte)
	would benefit from a "padding" flag
	indicating the representation is to be ignored
	so the ALU could process full words.


	Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots"
	which may be enough handle nucleotide ambiguity codes ...
	or indels or carry a coarse quality score.


	For some tasks, e.g. find GC content
	the representation is irrelevant as we do not store and reuse the sequence.
	in which case the 2-bit transformation is sufficient
	or the 3-bit transformation if 'N' is present

	With N present there are three options
	N is always GC (left as an exercise)
	N is never GC GC is nt&1*(nt>>1)&1
	N is sometimes GC CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1


	#####################################################################

	A 4bit (nibble) level encoding compatible with two-bit encoding
	which allows gaps and ambiguity but requires lookup tables.


	0000 A
	0001 C
	0010 T
	0011 G

	0100 !A 'B'
	0101 !C 'D'
	0110 !T 'V'
	0111 !G 'H'

	1000 !TA
	1001 !TC
	1010 !CA
	1011 !TG

	1100 !AG
	1101 !CG

	1110 <gap>

	1111 N


	0,1,2,3
	[A,C,T,G]
	###############################################################

	translating the 4 base nucleotides from the two-bit (dense) representation
	to a four-bit (sparse) representation (flag based)
	mathematically b4 = 2^b2
	but that is apt to be more expensive than necessary

	b4=1<<(b2+1) is cheaper

	1,2,4,8
	[A,C,T,G]


	A nibble level encoding for comparison/alignment with ambiguity
	which does not need lookup for comparison

	AND-ing two encodings results in an alignment
	with the less ambiguous of two retained when matched.


	a & b -> 0 no match
	-> !0 match

	nyb nt(s) ambiguity

	0 0000 <gap> -
	1 0001 A A*
	2 0010 C C*
	3 0011 AC M
	4 0100 T T*
	5 0101 TA W
	6 0110 TC Y
	7 0111 TCA H
	8 1000 G G*
	9 1001 GA R
	10 1010 GC S
	11 1011 GCA V
	12 1100 GT K
	13 1101 GTA D
	14 1110 GTC B
	15 1111 GTCA N


	gaps can be used as padding to fill to the end of a machine word



	given sequence from the restricted alphabet "AaCcTtUuGgNn"
	(not the full ambiguity code)




	c&=7
	6==c ? rand(0:3) : 1<<(c>>1)

	will encode the four primary bases without lookup
	translating N to random base at the cost of a branch.


	or
	c&=7;
	6==c ? CONST : 1<<(c>>1)

	CONST = 15 will let N match everything.
	CONST = 0 will make N match nothing.


	or 1<<(c&7)>>1

	will make N always be a 'G' which will be the cheapest


	for the restricted alphabet, ANDing two 64bit words
	(16 nucleotide bases encoded as nibbles within a 64bit word)
	the only case where more than one bit in a nibble is set when matched is:
	when an N is compared with N _and_ N is set to match anything
	then the nibble will have all 4 bits set.
	short of that pathological case,

	the number of bases which matched is the 'Hamming Weight'
	or 'population count' of number of bits set in the resulting word.

	hence in julia

	16 == popcount( seqA[word_n] & seqB[word_n] )

	means all 16 bases in both sequences matched
	note: popcount() is a native machine instruction in many
	modern architectures which performs "sideways addition.

	so a popcount() of 15 ~> 93.75% Similar and 14 -> 87.5% Similar

	#########################################################################


	k-mers

	N's and ambiguity have no place
	they can neither be nor match 'nothing'.
	they must be constant i.e 'G' or random.

	tabulating what a sliding window of a size represents,

	2-bit representation 32-mer
	3-bit representation 21-mer
	4-bit representation 16-mer

	existence is reportedly all that is being recorded

	K-mer Index is:
	a sparse structure?
	a bitvector?

	for 2 bit dense encoding

	4^32 = 1.844674407×10¹⁹ flags take /8
	2.305843009×10¹⁸ bytes or /1024
	2.251799814×10¹⁵ KB or /1024
	2.199023256×10¹² MB or /1024
	2,147,483,648 GB or /1024
	2,097,152 TB or /1024
	2,048 PetB or /1024
	2 ExoB
	will not be doing a bit vector for 32-mers then



	4 bit flag encoding

	1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 greatest
	- 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 least
	____________________________________________________
	0111011101110111011101110111011101110111011101110111011101110111


	2,101,679,826,000,000 ... but there far fewer possible sequences than that

	4^16 = 4,294,967,296 flags take / 8
	536,870,912 bytes or / 1024
	524,288 KB or / 1024\q
	512M

	that could almost live in a L3 cache or two and be indexed directly
	with a half word


	@time index=BitArray(2^32-1)
	# ~17ms

	# break apart the fasta records and feed the sequence to the mer-indexer
	function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} )
	mer::Int32
	row::ASCIIString=""
	rows::Array{ASCIIString} = open(readlines, filepath)
	for row in rows
	if('>' == first(row))
	continue
	else
	fx(row,merdex)
	end
	end
	end

	# finds hexamers in a sequence & records their existence in an index
	function hexamer(seq::ASCIIString, merdex::BitArray{1})
	mer=0
	# prime
	for i 0:15 mer\|=((seq[i+1]&7)>>1)<<2i end
	merdex[mer]=true
	for i 17:length(row)-17
	mer=mer<<2
	mer\|=(seq[i]&7)>>1
	merdex[mer]=true
	end
	end # ~hexamer