@milessabin
Last active January 4, 2016 01:39
import shapeless._
import record._
import syntax.singleton._

object ScaldingPoC extends App {
  // map, flatMap
  val birds =
    List(
      "name" ->> "Swallow (European, unladen)" :: "speed" ->> 23 :: "weightLb" ->> 0.2 :: "heightFt" ->> 0.65 :: HNil,
      "name" ->> "African (European, unladen)" :: "speed" ->> 24 :: "weightLb" ->> 0.21 :: "heightFt" ->> 0.6 :: HNil
    )

  val fasterBirds = birds.map(b => b + ("doubleSpeed" ->> b("speed")*2))
  fasterBirds foreach println

  val britishBirds = birds.map(b => b + ("weightKg" ->> b("weightLb")*0.454) + ("heightM" ->> b("heightFt")*0.305))
  britishBirds foreach println

  val items =
    List(
      "author" ->> "Benjamin Pierce" :: "title" ->> "Types and Programming Languages" :: "price" ->> 49.35 :: HNil,
      "author" ->> "Roger Hindley" :: "title" ->> "Basic Simple Type Theory" :: "price" ->> 23.14 :: HNil
    )

  val pricierItems = items.map(i => i + ("price" ->> i("price")*1.1))
  pricierItems foreach println

  val books =
    List(
      "text" ->> "Not everyone knows how I killed old Phillip Mathers" :: HNil,
      "text" ->> "No, no, I can't tell you everything" :: HNil
    )

  val lines = books.flatMap(book => for(word <- book("text").split("\\s+")) yield book + ("word" ->> word))
  lines foreach println
}
@deanwampler

Cool.

One thing you would want is a separation between schema and actual records, for performance. For example, specify that column 2 is the title, but have an efficient data structure (Array or Stream) holding the data, either by column or by row. You might be reading millions of records in a single process.
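
A rough sketch of that separation in plain Scala, with no shapeless involved. The names `Schema`, `Table`, and `column` are hypothetical and chosen only for illustration. The schema is stored once and each row is a bare array of values.

```scala
// Hypothetical illustration: the column names live in one shared Schema,
// and each row is an Array holding only the values.
final case class Schema(columns: Vector[String]) {
  private val index: Map[String, Int] = columns.zipWithIndex.toMap

  // Zero-based position of a named column.
  def indexOf(name: String): Int = index(name)
}

final class Table(val schema: Schema, val rows: Vector[Array[Any]]) {
  // Resolve the column position once, then read it from every row.
  def column(name: String): Vector[Any] = {
    val i = schema.indexOf(name)
    rows.map(row => row(i))
  }
}

object SchemaDemo {
  val schema = Schema(Vector("author", "title", "price"))

  val table = new Table(schema, Vector(
    Array[Any]("Benjamin Pierce", "Types and Programming Languages", 49.35),
    Array[Any]("Roger Hindley", "Basic Simple Type Theory", 23.14)
  ))

  def main(args: Array[String]): Unit =
    table.column("title") foreach println
}
```

The cost of this layout is that values are untyped (`Any`), which is exactly what the record encoding in the gist avoids.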

@milessabin
Author

I'm not making any ambitious claims for the efficiency of the above. However, the current (shapeless 2.0-M1) representation of records is probably lighter weight than you expect: the keys are encoded as singleton types intersected with the types of the values and have absolutely no runtime footprint. At runtime the record is essentially a cons list of the values, and the keys are completely erased.
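
A small sketch of what that means in practice, assuming the same shapeless 2.0 imports as the gist. `ErasedKeysDemo` and its fields are hypothetical names. It builds a record and a plain HList of the same values and compares them at runtime.

```scala
import shapeless._
import record._
import syntax.singleton._

object ErasedKeysDemo {
  // A record: each value is tagged with its key at the type level only.
  val book = ("author" ->> "Benjamin Pierce") :: ("price" ->> 49.35) :: HNil

  // A plain HList holding the same values with no keys at all.
  val plain = "Benjamin Pierce" :: 49.35 :: HNil

  // Lookup by key is resolved by the compiler; the static type is Double.
  val price: Double = book("price")

  def main(args: Array[String]): Unit = {
    // If the keys have no runtime representation, the two values
    // should compare equal as ordinary cons lists.
    println(book == plain)
    println(price)
  }
}
```

The keys still do their job at compile time: asking for a key that is not in the record is a type error rather than a runtime failure.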

@deanwampler

Cool. I need to take a look at the latest implementation.

@johnynek

We will need to look at serialization here because, as Dean notes, we definitely don't want to serialize the keys with each row. We'd have to look at how Kryo does (or can be made to) serialize the records.

@milessabin
Author

@johnynek The keys don't exist at all at runtime.

@bsidhom

bsidhom commented Mar 27, 2014

Here's an example showing Kryo serialization of record types: https://gist.github.com/bsidhom/9798005

The record type takes no more space than its underlying HList, which isn't too bad when the classes are registered. Unfortunately, I haven't found a more concise way to register classes, but it may be possible to remove some boilerplate given an example instance (via getClass) or with the help of macros.
