- It supports SQL, Scala, Java, Python, etc.
- It supports Hive tables, JSON tables, native Scala data structures, etc. as RDDs
- It supports batch, streaming, interactive, iterative, etc. modes of computation
- It supports AWS, Mesos, YARN, OpenStack, etc. as underlying cluster managers and deployment platforms
- It supports Tachyon, HDFS, S3, and Hive as storage layers, as well as relational and NoSQL databases
Useful Scala links:
- http://www.scala-lang.org/docu/files/ScalaOverview.pdf
- http://www.cs.ucsb.edu/~benh/162/Programming-in-Scala.pdf
Interesting things to think about:
- Tachyon: an in-memory distributed file system that uses lineage for fault tolerance
- MDCC and other BDAS components that focus on point updates to big data
- GraphX
- MLbase and MLlib
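
The lineage idea noted for Tachyon above (it is also how Spark RDDs recover lost partitions) can be illustrated with a toy sketch: instead of replicating data, each dataset records the parent and the function it was derived from, so a lost in-memory copy can be recomputed. The `Dataset` class and its methods are hypothetical names made up for this illustration; this is not Tachyon's or Spark's API.

```python
class Dataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # raw data, only set for a root dataset
        self._parent = parent        # upstream Dataset, if this one is derived
        self._transform = transform  # function applied to the parent's rows
        self._cache = None           # in-memory copy; may be lost at any time

    def map(self, fn):
        # Record the derivation instead of computing eagerly.
        return Dataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Serve from memory when possible; otherwise replay the lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def evict(self):
        # Simulate losing the in-memory copy (for example, a node failure).
        self._cache = None


numbers = Dataset(source=range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.evict()                # the cached result is lost
recovered = even_squares.collect()  # rebuilt by replaying the recorded steps
```

The design point is that only the small derivation record has to survive a failure, not a second copy of the data.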
Martin Odersky observations:
- despite all the theoretical advantages of functional programming, it needs some trigger for wide adoption
- parallel and distributed programming is that catalyst, because there is now a lot of parallelism available to exploit (AWS, etc.)
GraphX:
- tables and graphs are unified for reads and writes, so the same data can be treated as either
- the Spark API for table management and the GraphLab API for graph management are brought together
- useful algorithms include:
  - PageRank
  - connected components
  - shortest paths
  - ALS (alternating least squares)
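
To make the table/graph view and two of the algorithms listed above concrete, here is a minimal single-machine sketch in plain Python. It stores the graph as a vertex table and an edge table, then runs PageRank and connected components over them. The function names, the sample graph, and the parameters (`damping`, `iterations`) are assumptions chosen for illustration; this is not the GraphX API, and it makes no attempt at distribution.

```python
# The graph is kept as two plain tables: one of vertices, one of edges.
vertices = {1: "a", 2: "b", 3: "c", 4: "d"}   # vertex id -> attribute
edges = [(1, 2), (2, 3), (3, 1), (4, 3)]      # (src, dst) rows


def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Iterative PageRank; ranks always sum to 1."""
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if out_degree[v] == 0)
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    Edge direction is ignored (weakly connected components).
    """
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            low = min(labels[src], labels[dst])
            if labels[src] != low or labels[dst] != low:
                labels[src] = labels[dst] = low
                changed = True
    return labels


ranks = pagerank(vertices, edges)
components = connected_components(vertices, edges)
top_vertex = max(ranks, key=ranks.get)  # vertex with the highest rank
```

Both functions only scan the edge table and update a per-vertex table on each pass, which is the access pattern that lets a system express graph algorithms on top of table operations.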