@manku-timma
Last active August 29, 2015 14:04
Apache Spark
  • It supports SQL, Scala, Java, Python, etc.
  • It exposes Hive tables, JSON data, native Scala data structures, etc. as RDDs
  • It supports batch, streaming, interactive, iterative, etc. modes of computation
  • It runs on Mesos, YARN, AWS, OpenStack, etc. as the underlying cluster manager or cloud platform
  • It supports Tachyon, HDFS, S3, and Hive as storage layers; also relational and NoSQL databases

Useful scala links

Interesting things to think about:

  • Tachyon - in-memory distributed file system based on lineage
  • MDCC and other BDAS (Berkeley Data Analytics Stack) components that focus on point updates to big data
  • GraphX
  • MLbase and MLlib
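
To make the "based on lineage" idea from the Tachyon bullet concrete, here is a minimal sketch in plain Python. It is not Tachyon's or Spark's API: the `Dataset` class and its `map`, `get`, and `evict` methods are hypothetical names invented for illustration. The point is only that a lost in-memory copy can be rebuilt by replaying the recorded derivation rather than by reading a replica.

```python
class Dataset:
    """Hypothetical dataset that remembers how it was derived (its lineage)."""

    def __init__(self, compute, parent=None):
        self._compute = compute  # function that rebuilds this dataset's rows
        self._parent = parent    # upstream Dataset, or None for a source
        self._cache = None       # in-memory copy, which may be lost

    def map(self, fn):
        # Record the transformation instead of running it eagerly.
        return Dataset(lambda rows: [fn(r) for r in rows], parent=self)

    def get(self):
        # If the in-memory copy is missing, replay the lineage to rebuild it.
        if self._cache is None:
            upstream = self._parent.get() if self._parent else None
            self._cache = self._compute(upstream)
        return self._cache

    def evict(self):
        # Simulate losing the in-memory copy (e.g. a node failure).
        self._cache = None


numbers = Dataset(lambda _: [1, 2, 3])
doubled = numbers.map(lambda x: x * 2)

first = doubled.get()
doubled.evict()
numbers.evict()
second = doubled.get()  # rebuilt from lineage, not from a replica
```

After both copies are evicted, `second` is recomputed from the source and equals `first`.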

Martin Odersky observations:

  • despite all the theoretical advantages of functional programming, some trigger is needed for its wide adoption
  • parallel and distributed programming is that catalyst, because there is a lot of parallelism available to exploit (AWS etc.)

GraphX

  • tables and graphs are unified with respect to reading and writing
  • the Spark API for table management and the GraphLab API for graph management are brought together
  • useful algorithms are:
      • PageRank
      • connected components
      • shortest path
      • ALS (alternating least squares)
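
Two of the algorithms listed above, PageRank and connected components, can be sketched on a single machine in plain Python. This is an illustrative sketch only, not GraphX code: the function names, the edge-list input format, and the default `damping` and `iterations` values are assumptions made for the example. The connected-components version labels each vertex with the smallest vertex id in its component.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {n: n for edge in edges for n in edge}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label


cycle_ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a")])
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
components = connected_components([(1, 2), (2, 3), (4, 5)])
```

In the symmetric three-node cycle every node ends up with the same rank, while in the second graph `d` has no incoming links and so ranks lowest.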