Skip to content

Instantly share code, notes, and snippets.

@johnynek
Last active July 17, 2018 21:16
Show Gist options
  • Save johnynek/91a2a35b4808b98328cdc2ad3f15d74b to your computer and use it in GitHub Desktop.
Save johnynek/91a2a35b4808b98328cdc2ad3f15d74b to your computer and use it in GitHub Desktop.
Scala's path dependent types can be used to prove that a serialized value can be deserialized without having to resort to Try/Either/Option. This puts the serialized value into the type, so we can be sure we won't fail. This is very useful for distributed compute settings such as scalding or spark.
import scala.util.Try
object PathSerializer {
trait SerDe[A] {
// By using a path dependent type, we can be sure can deserialize without wrapping in Try
type Serialized
def ser(a: A): Serialized
def deser(s: Serialized): A
// If we convert to a generic type, in this case String, we forget if we can really deserialize
def toString(s: Serialized): String
def fromString(s: String): Try[A]
}
val intSer: SerDe[Int] =
new SerDe[Int] {
type Serialized = String
def ser(a: Int) = a.toString
def deser(s: Serialized) = s.toInt // since this was serialized with this SerDe this is safe
def toString(s: Serialized): String = s
def fromString(s: String): Try[Int] = Try(s.toInt /* this can fail, because we only know it is a String */)
}
def example0 = {
val x = 42
val ser: intSer.Serialized = intSer.ser(x)
val y = intSer.deser(ser)
assert(x == y)
}
def example1 = {
// In a case like spark or scalding, we can remember that deserialization can't fail:
// pretend the list below is an RDD or TypedPipe
val inputs: List[Int] = (0 to 1000).toList
val serialized: List[intSer.Serialized] = inputs.map(intSer.ser(_))
// intSer.Serialized is distinct from String, note the following error
// if we pretend they are the same:
//
// val strings: List[String] = serialized
//
// ts_ser.scala:41: error: type mismatch;
// found : List[PathSerializer.intSer.Serialized]
// required: List[String]
// Look, ma! no Trys!
val deserialized: List[Int] = serialized.map(intSer.deser(_))
assert(inputs == deserialized)
}
}
@sritchie
Copy link

@przemek-pokrywka, that's not quite right. Scalding or Spark jobs can compile down to many map-reduce steps, each of which is separated by a serialization boundary. Usually that happens behind the scenes, and you're forced to serialize some sort of class tag into each object behind the scenes on disk.

With this trick you can explicitly trigger serialization, then jump back out in the same program. This is more efficient, and behind the scenes the mapreduce platform only sees bytes, so no class tags required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment