Skip to content

Instantly share code, notes, and snippets.

@liskin
Last active June 14, 2021 10:18
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save liskin/9e1aa34d6f090bef608e1c5b7fd64a35 to your computer and use it in GitHub Desktop.
Save liskin/9e1aa34d6f090bef608e1c5b7fd64a35 to your computer and use it in GitHub Desktop.
Faster COVID-19 testing using PCR and hashing (bloom filter)
============================================================
According to Israeli scientists, [it's possible][pcr-pooling] to merge
multiple swab samples into one tube for PCR without losing much accuracy.
Their method involves a divide and conquer approach, starting by pooling
samples into groups of (up to) 32 and then eliminating groups with no positive
samples.
[pcr-pooling]: https://medium.com/@dinber19/more-with-less-using-pooling-to-detect-coronavirus-with-fewer-tests-8ba1a2cd8b67
I'd like to suggest an alternative approach based on hashing, essentially a
biological [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter).
The core idea is that we don't just group samples, we first split each sample
and only then group it with several other samples, and we do this in a way
that allows identification of individual positives/negatives, assuming the
ratio of positives is low (1%, ideally even less).
A bloom filter uses k distinct hash functions and m bits. Adding an input
value involves hashing it using each hash function and then setting the
resulting bit position to 1.
A biological bloom filter uses tubes instead of bits. Adding a positive sample
contaminates k tubes with the virus, whereas adding a negative sample only
slightly dilutes k tubes (essentially a no-op). Then PCR is done on all those
tubes, and the bit values are revealed. We can then use it as a traditional
bloom filter and check (on a computer) which samples are in the positive set.
Provided the ratio of positive samples is low (which it should be if we're
going to widely test asymptomatic people), the false positive rate should be
acceptable and we should be able to identify infected persons within a single
PCR run.
Assumptions:
* we're able to split a sample into up to 10 equal parts (I hope so)
* we can merge samples and still detect the virus (verified in the lab)
* it's possible to correctly put thousands of sub-samples into tubes (???)
* ratio of positive vs tested is low (so far it has been)
What follows is a Haskell program to simulate the process and calculate ideal
values of k (number of sub-samples we need) and false positivity rates.
See results after the code.
> import Data.Csv (encode)
> import Data.List
> import Text.Printf
> import qualified Data.BloomFilter as Bloom
> import qualified Data.BloomFilter.Hash as Bloom
> import qualified Data.ByteString.Lazy as B
> pcrTubes = 96
> maxSamplesInTube = 50
> kHashesCandidates = [2..20]
> salts = [0..100]
> samplesInTube kHashes nAll = fromIntegral nAll * fromIntegral kHashes / fromIntegral pcrTubes
> genSamples :: Int -> [Int]
> genSamples salt = map (salt +) [1..]
> bloom samples kHashes nRealPositive =
> Bloom.fromList (Bloom.hashes kHashes) pcrTubes (take nRealPositive samples)
> tryBloom samples nRealPositive nAll =
> [ nTestedPositive filt - nRealPositive
> | kHashes <- kHashesCandidates, let filt = bloom samples kHashes nRealPositive
> , samplesInTube kHashes nAll <= maxSamplesInTube ]
> where
> nTestedPositive filt = length $ filter id [n `Bloom.elem` filt | n <- take nAll samples]
> optimalBloom nRealPositive nAll = fmap (\(_, avg, k) -> (avg, k)) optimum
> where
> allFalsePositives = transpose [ tryBloom samples nRealPositive nAll | samples <- map genSamples salts ]
> avgFalsePositives = map avg allFalsePositives
> sumSquareFalsePositives = map (sum . map (^2)) allFalsePositives
> optimum = if null allFalsePositives
> then Nothing
> else Just $ minimum $ zip3 sumSquareFalsePositives avgFalsePositives kHashesCandidates
> avg :: (Num a, Integral a) => [a] -> Double
> avg xs = fromIntegral (sum xs) / fromIntegral (length xs)
> table :: [(Int, Int, Int, String, String)]
> table =
> [ (nRealPositive, nAll, kHashes, fmt (samplesInTube kHashes nAll), fmt avgFalsePositives)
> | nRealPositive <- [1, 5, 10, 15, 20, 30]
> , nAll <- [500, 1000, 2000, 5000]
> , Just (avgFalsePositives, kHashes) <- [optimalBloom nRealPositive nAll] ]
> where
> fmt :: Double -> String
> fmt d = printf "%.2f" d
> main = do
> B.putStr (encode [("nRealPositive", "nAll", "kHashes", "samplesInTube", "avgFalsePositives")])
> B.putStr (encode table)
$ ./corona-bloom | column -s, -t
nRealPositive nAll kHashes samplesInTube avgFalsePositives
1 500 3 15.62 0.00
1 1000 4 41.67 0.00
1 2000 2 41.67 0.41
5 500 7 36.46 0.02
5 1000 4 41.67 0.50
5 2000 2 41.67 10.48
10 500 7 36.46 1.31
10 1000 4 41.67 4.83
10 2000 2 41.67 39.64
15 500 4 20.83 7.41
15 1000 4 41.67 17.58
15 2000 2 41.67 81.89
20 500 4 20.83 17.95
20 1000 4 41.67 40.67
20 2000 2 41.67 134.20
30 500 4 20.83 57.10
30 1000 3 31.25 114.93
30 2000 2 41.67 265.82
These ones look interesting:
nRealPositive nAll kHashes samplesInTube avgFalsePositives
1 500 3 15.62 0.00
1 1000 4 41.67 0.00
1 2000 2 41.67 0.41
5 500 7 36.46 0.02
5 1000 4 41.67 0.50
5 2000 2 41.67 10.48
10 500 7 36.46 1.31
10 1000 4 41.67 4.83
15 500 4 20.83 7.41
15 1000 4 41.67 17.58
20 500 4 20.83 17.95
20 1000 4 41.67 40.67
This means that if we expect to have up to 5 infected in a thousand samples, we
can test them all in a single 96-tube PCR run and identify the infected ones
almost exactly. As the ratio of infected samples rises, the false positive rate
goes up rapidly, but up to 20 infected in 1000 samples, it should fit in one
additional 96-tube PCR run. But we must be able to split a single swab into 5
pieces.
STACK_EXEC = stack exec \
--package bloomfilter \
--package bytestring \
--package cassava
corona-bloom: corona-bloom.lhs
$(STACK_EXEC) -- ghc --make $<
ghci: corona-bloom.lhs
$(STACK_EXEC) -- ghci $<
ghcid: corona-bloom.lhs
$(STACK_EXEC) -- ghcid $<
@liskin
Copy link
Author

liskin commented Mar 24, 2020

Nore nicely rendered version here: https://github.com/liskin/covid19-bloom

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment