@pathikrit
Last active December 12, 2019 21:29
Split a file into multiple GZIP files
import java.io.InputStream

import better.files._
import squants.information._, InformationConversions._

object GzipSplitter {
  /** Splits the $inputStream, line by line, into gzip files of approximately $splitSize each under $outputDirectory */
  def split(
    inputStream      : InputStream,
    outputDirectory  : File = File.newTemporaryDirectory(),
    outputFilePrefix : String = "part-",
    splitSize        : Information = 50.mb,
    initialGuess     : Int = 1e5.toInt  // number of lines to write to the first chunk
  ): Unit = {
    val lines = inputStream.lines
    Iterator.from(0)
      .takeWhile(_ => lines.nonEmpty)
      .foldLeft(initialGuess) { case (guess, part) =>
        val file = outputDirectory / s"${outputFilePrefix}${part}.gz"
        // Both resources are closed when the for block ends, so file.size below sees the finished file
        for {
          gzip <- file.gzipOutputStream()
          writer <- gzip.printWriter().autoClosed
          line <- lines.take(guess)
        } writer.println(line)
        // file.size is a byte count; convert it to megabytes for display (file.size.mb would read it as megabytes)
        println(s"Wrote up to $guess lines to ${file.name} (${file.size.bytes.toMegabytes.toInt} MB)")
        // Probe the file we just wrote to update our guess of how many lines it takes to make a $splitSize chunk
        // Note: file sizes do not scale linearly (esp. for compressed streams); but this is a good enough approximation
        // Keep the guess at 1 or more so that every pass consumes input and the loop terminates
        math.max(1, (splitSize * guess / file.size.bytes).toInt)
      }
  }
}
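
The size probe above can be tried without better-files or squants. The following is a minimal sketch using only the JDK and the Scala standard library, not the gist's own implementation: `StdlibGzipSplitSketch`, `nextGuess`, and `readLines` are hypothetical names introduced here for illustration, and sizes are plain byte counts rather than `Information` values. It applies the same proportional update, next guess = target size * lines written / bytes written, and advances the iterator only through `hasNext` and `next()`.

```scala
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical stdlib-only sketch of the same idea; not part of the original gist.
object StdlibGzipSplitSketch {

  /** Proportional re-estimate of lines per chunk, clamped to [1, Int.MaxValue]. */
  def nextGuess(targetBytes: Long, linesWritten: Int, writtenBytes: Long): Int =
    if (writtenBytes <= 0) linesWritten
    else {
      val estimate = targetBytes.toDouble * linesWritten / writtenBytes
      math.max(1.0, math.min(Int.MaxValue.toDouble, estimate)).toInt
    }

  /** Writes `lines` into gzip parts of roughly `targetBytes` each and returns the parts in order. */
  def split(lines: Iterator[String], dir: Path, targetBytes: Long, initialGuess: Int): List[Path] = {
    val parts = List.newBuilder[Path]
    var guess = math.max(1, initialGuess)
    var part = 0
    while (lines.hasNext) {
      val file = dir.resolve(s"part-$part.gz")
      val writer = new PrintWriter(
        new OutputStreamWriter(new GZIPOutputStream(Files.newOutputStream(file)), UTF_8))
      var written = 0
      try {
        // Pull lines one at a time so the shared iterator is only advanced via hasNext/next
        while (written < guess && lines.hasNext) {
          writer.println(lines.next())
          written += 1
        }
      } finally writer.close() // closing also finishes the gzip stream, so the size below is final
      parts += file
      guess = nextGuess(targetBytes, written, Files.size(file))
      part += 1
    }
    parts.result()
  }

  /** Reads a gzip part back as lines; used here only to check the round trip. */
  def readLines(file: Path): List[String] = {
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(Files.newInputStream(file)), UTF_8))
    try Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    finally reader.close()
  }
}
```

For example, `StdlibGzipSplitSketch.split(scala.io.Source.fromFile("big.log").getLines(), Files.createTempDirectory("parts"), 50L * 1000 * 1000, 100000)` would mirror the defaults above, with `big.log` standing in for a real input. Because the compression ratio changes with chunk size and content, the parts will only be approximately the target size.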