Skip to content

Instantly share code, notes, and snippets.

@johnynek
Last active July 17, 2018 21:16
Show Gist options
  • Save johnynek/91a2a35b4808b98328cdc2ad3f15d74b to your computer and use it in GitHub Desktop.
Save johnynek/91a2a35b4808b98328cdc2ad3f15d74b to your computer and use it in GitHub Desktop.
Scala's path dependent types can be used to prove that a serialized value can be deserialized without having to resort to Try/Either/Option. This puts the serialized value into the type, so we can be sure we won't fail. This is very useful for distributed compute settings such as scalding or spark.
import scala.util.Try
object PathSerializer {
trait SerDe[A] {
// By using a path dependent type, we can be sure can deserialize without wrapping in Try
type Serialized
def ser(a: A): Serialized
def deser(s: Serialized): A
// If we convert to a generic type, in this case String, we forget if we can really deserialize
def toString(s: Serialized): String
def fromString(s: String): Try[A]
}
val intSer: SerDe[Int] =
new SerDe[Int] {
type Serialized = String
def ser(a: Int) = a.toString
def deser(s: Serialized) = s.toInt // since this was serialized with this SerDe this is safe
def toString(s: Serialized): String = s
def fromString(s: String): Try[Int] = Try(s.toInt /* this can fail, because we only know it is a String */)
}
def example0 = {
val x = 42
val ser: intSer.Serialized = intSer.ser(x)
val y = intSer.deser(ser)
assert(x == y)
}
def example1 = {
// In a case like spark or scalding, we can remember that deserialization can't fail:
// pretend the list below is an RDD or TypedPipe
val inputs: List[Int] = (0 to 1000).toList
val serialized: List[intSer.Serialized] = inputs.map(intSer.ser(_))
// intSer.Serialized is distinct from String, note the following error
// if we pretend they are the same:
//
// val strings: List[String] = serialized
//
// ts_ser.scala:41: error: type mismatch;
// found : List[PathSerializer.intSer.Serialized]
// required: List[String]
// Look, ma! no Trys!
val deserialized: List[Int] = serialized.map(intSer.deser(_))
assert(inputs == deserialized)
}
}
@przemek-pokrywka
Copy link

przemek-pokrywka commented Apr 9, 2018

@johnynek

https://gist.github.com/johnynek/91a2a35b4808b98328cdc2ad3f15d74b#file-scala-path-dependent-serializers-scala-L48 is only possible if you have values of intSer.Serialized around. But when any serialization to disk/buffers is involved one doesn't read intSer.Serialized - one reads raw bytes / String / other low-level representation. So you would need to somehow jump from that low-level representation to intSer.Serialized directly which I think isn't possible to be made safe, without using Option, Try or equivalent.

Update

But when any serialization to disk/buffers is involved one doesn't read intSer.Serialized - one reads raw bytes / String / other low-level representation.

one can still keep a typed reference to the data in offsite-heap for example, in which case the deserialization is safe.

@sritchie
Copy link

@przemek-pokrywka, that's not quite right. Scalding or Spark jobs can compile down to many map-reduce steps, each of which is separated by a serialization boundary. Usually that happens behind the scenes, and you're forced to serialize some sort of class tag into each object behind the scenes on disk.

With this trick you can explicitly trigger serialization, then jump back out in the same program. This is more efficient, and behind the scenes the mapreduce platform only sees bytes, so no class tags required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment