- many kinds of RDDs: HadoopRDD, FilteredRDD, PythonRDD, EsSpark, GeoRDD ...
- look at the RDD's concrete type by typing type(myRDD)
- look at the lineage (DAG) by typing myRDD.toDebugString
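A minimal spark-shell sketch of both inspections; the file name and transformations are illustrative, and in Scala getClass plays the role of Python's type():

```scala
// spark-shell sketch: sc is the shell's built-in SparkContext
val rdd = sc.textFile("data.txt")
  .filter(_.nonEmpty)
  .map(_.length)

println(rdd.getClass)       // concrete RDD type, e.g. MapPartitionsRDD
println(rdd.toDebugString)  // the lineage (DAG) of parent RDDs
```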
## Tasks
- the number of executor threads should be ~2 times the number of physical cores
- e.g. spark-shell --master local[8]
- this is the number of concurrent tasks ~ the number of threads
- minimum executor memory: 8 GB
- maximum executor memory: 40 GB
- the local disk can be configured in spark-env.sh
- spark.local.dir can be a comma-separated list of locations (see the sketch after this list)
- it cannot be on HDFS
- it should be SSD-mounted drives
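A minimal sketch of these settings done programmatically; the paths, core count and memory value are illustrative, and the same directories can equally be set through SPARK_LOCAL_DIRS in spark-env.sh:

```scala
import org.apache.spark.sql.SparkSession

// sketch only: illustrative values
val spark = SparkSession.builder()
  .appName("local-dirs-sketch")
  .master("local[8]")                                        // ~2x the physical cores
  .config("spark.local.dir", "/mnt/ssd1/tmp,/mnt/ssd2/tmp")  // comma-separated, SSD, not HDFS
  .config("spark.executor.memory", "8g")                     // within the 8G-40G range above
  .getOrCreate()
```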
- all in one (local mode)
- the driver and the executors are in the same JVM
- the local dirs are given by SPARK_LOCAL_DIRS (Standalone, Mesos)
- a master
- n workers
- the local dirs are given by SPARK_LOCAL_DIRS (Standalone, Mesos)
- a driver
- the resource manager (one)
- the node managers (one on each node)
- the local dirs are given by LOCAL_DIRS (YARN)
- in client mode, the driver runs on the client machine (the laptop)
- in cluster mode, the driver runs inside the application master container (see the master-URL sketch below)
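A minimal sketch of how the master URL selects where things run; the URLs and app name are illustrative, and in practice the master and deploy mode are usually passed to spark-submit rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

// sketch only: the master URL decides where the driver and executors run
val spark = SparkSession.builder()
  .appName("deployment-sketch")
  .master("local[8]")               // all in one: driver and executors share one JVM
  // .master("spark://master:7077") // standalone: one master, n workers
  // .master("yarn")                // YARN: resource manager + node managers
  .getOrCreate()
```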
- rdd.cache() == rdd.persist(MEMORY_ONLY) -> in memory
- rdd.persist(MEMORY_ONLY_SER) -> serialized in memory
- rdd.persist(MEMORY_ONLY_2) -> in memory, replicated on two nodes
- rdd.persist(DISK_ONLY) -> serialized on disk
- rdd.persist(MEMORY_AND_DISK) -> in memory and serialized on disk if needed
- rdd.persist(MEMORY_AND_DISK_SER) -> serialized in memory and serialized on disk if needed
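A minimal spark-shell sketch of the persist calls above; in Scala the levels come from the StorageLevel object, and the RDD itself is illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// spark-shell sketch: sc is the shell's built-in SparkContext
val rdd = sc.textFile("events.log").map(_.toUpperCase)

rdd.cache()        // same as rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()        // the first action materialises the cached data

// other levels from the list above (a given RDD keeps one level at a time)
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)
// rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.unpersist()    // free the storage when done
```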
- the default is the parallel GC, which is fine in the general case
- when using Spark Streaming, use the CMS (concurrent mark-and-sweep) GC
- the G1 GC can also be used in the general case...
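A minimal sketch of switching the executor GC; spark.executor.extraJavaOptions is the standard property for passing JVM flags, and the app name is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// sketch only: pick one GC flag depending on the workload
val spark = SparkSession.builder()
  .appName("gc-sketch")
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")               // G1 in the general case
  // .config("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC") // CMS for streaming
  .getOrCreate()
```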
- narrow means a partition is used by at most one partition of the child RDD
- wide means a partition is used by many partitions of the child RDD (this requires a shuffle)
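A minimal spark-shell sketch contrasting the two; the data is illustrative, and toDebugString shows where the shuffle (wide dependency) introduces a stage boundary:

```scala
// spark-shell sketch: sc is the shell's built-in SparkContext
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val narrow = pairs.mapValues(_ * 10)    // narrow: each partition feeds exactly one child partition
val wide   = narrow.reduceByKey(_ + _)  // wide: records must be shuffled across partitions

println(wide.toDebugString)             // the ShuffledRDD marks the stage boundary
```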
- CPython is used, but PyPy can be used and this speeds things up a lot
- Py4J is used between the Python driver and the JVM
- use DStreams
- the batch interval cannot go under 0.5 seconds
- has windowing capabilities (see the sketch after this list)
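A minimal spark-shell DStream sketch; the host, port and interval lengths are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell sketch: 1-second batches (the practical floor is around 0.5 s)
val ssc = new StreamingContext(sc, Seconds(1))

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .window(Seconds(30), Seconds(10))   // windowed capability: 30 s window sliding every 10 s
  .reduceByKey(_ + _)

counts.print()
ssc.start()
ssc.awaitTermination()
```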
- they are type-safe
- they are not optimized by Spark
- meaning Spark cannot optimize the lambda functions or the order of transformations/actions
- the code is harder to read: it expresses how you do it instead of what you are doing
- the code is easier to read
- they are optimized (the DAG)
- they are not type-safe
- they are collections of Row; not typed
- you may want to transform a DataFrame into an RDD at the last step of a process (see the sketch after this list)
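A minimal spark-shell sketch of the same aggregation both ways; the column names and data are illustrative:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._   // spark is the shell's built-in SparkSession

// sketch only: illustrative data
val df = Seq(("alice", 3), ("bob", 5), ("alice", 7)).toDF("user", "score")

// DataFrame style: declarative ("what"), optimized, but rows are untyped Row objects
val byUserDf = df.groupBy("user").agg(sum("score"))

// RDD style: lambdas ("how"), opaque to the optimizer, useful as a last processing step
val byUserRdd = df.rdd
  .map(row => (row.getString(0), row.getInt(1)))
  .reduceByKey(_ + _)
```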
- a DataFrame is a Dataset of type Row
- you only need to apply a Scala case class to a DataFrame to get one (as shown in the sketch below)
- typed DataFrames
- this allows checking type integrity at compile time
- this does not apply when using SQL strings
- apparently better to use Scala code than SQL strings to get compile-time errors
- allows map & flatMap operations like RDDs
- they use the Tungsten binary format instead of plain JVM objects
- a Dataset is a DataFrame of type X (any kind of class with typed elements)
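A minimal spark-shell Dataset sketch; the case class and data are illustrative:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._   // spark is the shell's built-in SparkSession

// sketch only: illustrative case class and data
case class User(name: String, score: Int)

val df = Seq(("alice", 3), ("bob", 5)).toDF("name", "score")

// applying the case class turns the untyped DataFrame into a typed Dataset[User]
val users: Dataset[User] = df.as[User]

// map / flatMap work as on RDDs, and a wrong field name fails at compile time here
val names: Dataset[String] = users.map(_.name.toUpperCase)
```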