@johandahlberg
Last active December 17, 2015 12:39
GCCounter in Scala, inspired by http://saml.rilspace.org/moar-languagez-gc-content-in-python-d-fpc-c-and-c. Performance-wise this is no match for the C/D/C++ etc. solutions, but I think it nicely shows how Scala's parallel collections can be used to parallelize operations with very little extra effort. Test file available for …
import scala.io.Source
import java.io.File

object GCCounter extends App {

  val file = new File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa")

  // The actual GC counting function. Returns (gc, total) for one line,
  // where total counts A, T, C and G. FASTA header lines contribute nothing.
  def countGCOnLine(line: String): (Long, Long) = {
    if (line.startsWith(">"))
      (0, 0)
    else {
      val at = line.count(c => c == 'A' || c == 'T')
      val gc = line.count(c => c == 'C' || c == 'G')
      (gc, at + gc)
    }
  }

  // Read through a 1 kB buffer and process the lines in groups of 1024
  val chunkSize = 1024 * 1
  val iterator = Source.fromFile(file, chunkSize).getLines.grouped(chunkSize)

  // Move through the iterator and accumulate (gc, total) over each chunk
  val (gc, total) = iterator.foldLeft[(Long, Long)]((0, 0))((accumulator, chunk) => {
    // Map/reduce the counts over each line in the chunk in parallel.
    // The very simple parallelism magic going on here is in the 'par'
    // method, which converts the chunk collection to a parallel collection
    // on which the map and reduce operations are automatically parallelized.
    val (chunkGc, chunkTotal) = chunk.par.map(line => countGCOnLine(line)).
      reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    (accumulator._1 + chunkGc, accumulator._2 + chunkTotal)
  })

  println("% GC: " + (gc.toFloat / total) * 100)
}
scalac -optimize GCCounter.scala
time scala GCCounter
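To sanity-check the counting logic without downloading the chromosome file, a minimal sketch like the one below may help. It applies the same per-line rule to a small in-memory list and runs sequentially. The names `GCCounterSketch`, `gcPercent` and `sample` are made up for this illustration, and the toy sequences are not taken from the real test file.

```scala
object GCCounterSketch {

  // Same per-line rule as in the gist: header lines (starting with '>')
  // contribute nothing; otherwise return (gc, total) over A, T, C and G.
  def countGCOnLine(line: String): (Long, Long) =
    if (line.startsWith(">")) (0L, 0L)
    else {
      val at = line.count(c => c == 'A' || c == 'T')
      val gc = line.count(c => c == 'C' || c == 'G')
      (gc.toLong, (at + gc).toLong)
    }

  // Sum the per-line (gc, total) pairs and turn them into a percentage.
  // An input without any A/T/C/G yields 0.0 instead of dividing by zero.
  def gcPercent(lines: Seq[String]): Double = {
    val (gc, total) = lines
      .map(countGCOnLine)
      .foldLeft((0L, 0L)) { case ((g, t), (lineGc, lineTotal)) =>
        (g + lineGc, t + lineTotal)
      }
    if (total == 0) 0.0 else gc.toDouble / total * 100
  }

  // Hypothetical toy input: one header, then 6 G/C bases out of 12 counted
  // bases (the N characters are ignored).
  val sample: Seq[String] = Seq(">toy_sequence", "ACGT", "GGCC", "NNNN", "AATT")

  def main(args: Array[String]): Unit =
    println("% GC: " + gcPercent(sample)) // prints "% GC: 50.0"
}
```

Because addition over the `(gc, total)` pairs is associative, the same sum can be computed by a parallel `map`/`reduce`, which is what the `par` call in GCCounter relies on. Note that on Scala 2.13 and later, `.par` is assumed to require the separate scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`.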