Scala Course

Best links

https://gist.github.com/sgoyal1012/98ac2d289ddef3c1b3e4c4bb9a6093c6#file-best-links-md

Course 1. Functional Programming Principles in Scala

Week 1 -- Getting Started

1.1 Programming Paradigms

  • Functional programming is a paradigm
  • Imperative programming
    • Java!
    • Von Neumann Computer
    • One needs to conceptualize data structures word by word
    • Mathematical theories vs programming
    • Le Voila! Functional programming
  • Functional programming can be seen as a calculator -- use the REPL
    • Restricted (no mutable variables, eg. Pure LISP) vs wider (focus on functions, eg. Scala)
    • Functions are FIRST CLASS CITIZENS
  • Why functional programming?
  • Good for exploiting parallelism

1.2 Elements of Programming

1.3 Evaluation Strategies

  • CBV and CBN evaluation/termination etc
  • Scala normally uses CBV, as it is usually more performant and plays nicer with imperative programming

1.4 Scala if-else

By-name parameters are only evaluated when used. They are in contrast to by-value parameters. To make a parameter called by-name, simply prepend => to its type.
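
A minimal sketch of the difference, following the lecture's constOne/loop style of example:

    def byValue(x: Int): Int = x + x      // x is evaluated once, before the call
    def byName(x: => Int): Int = x + x    // x is evaluated each time it is used, and only if used

    def loop: Int = loop                  // a non-terminating expression

    def constOne(x: Int, y: => Int) = 1
    // constOne(1, loop) terminates because y is never used;
    // constOne(loop, 1) does not, because the by-value argument is evaluated first.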

1.5 Square root using Newton's method

  • Recursive functions NEED an explicit return type (the compiler cannot infer it); for other functions it is optional (Duh)

1.6 Blocks and Lexical Scope

  • Nested functions -- put auxiliary functions inside the function sqrt
  • Blocks

1.7 Tail Recursion

  • Mostly solved by storing an accumulator or intermediate result.
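
A minimal sketch of the accumulator trick (factorial, assuming Int is big enough for the inputs here):

    import scala.annotation.tailrec

    def factorial(n: Int): Int = {
      @tailrec
      def loop(n: Int, acc: Int): Int =
        if (n == 0) acc else loop(n - 1, acc * n)   // the recursive call is in tail position
      loop(n, 1)
    }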

Assignment

  • Stable identifiers!!

Week 2 - Higher Order Functions

2.1 Higher Order Functions

  • Functions are First Class Values
  • Higher order function operates on other functions!
    • A function becomes a parameter of another function
    • Anonymous Functions -- passing named functions around can get tedious; to avoid that, A FUNCTION HAS NO NAME.
      (x:Int) => x*x*x
      
    • It is Syntactic Sugar: looks neater, but does not add expressive power.

2.2 Currying

  • A function returns another function!
  • Eg. sum(cube)(1,10) applies sum to cube, and then the resulting function is applied to (1,10). Function application associates to the left.
  • Just combine the two functions and write them one after the other!
  • Instead of passing parameter lists, we pass anonymous functions one after the other -- CURRYING
  • Goes over an example of currying using the product function ***Good example -- https://docs.scala-lang.org/tour/currying.html *** (see the sketch below)
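
A small sketch of the curried sum from the lecture, to make the "one parameter list after the other" idea concrete:

    def sum(f: Int => Int)(a: Int, b: Int): Int =
      if (a > b) 0 else f(a) + sum(f)(a + 1, b)

    def cube(x: Int): Int = x * x * x

    sum(cube)(1, 10)   // sum applied to cube, and the resulting function applied to (1, 10)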

2.3 Fixed points of functions

  • A value x with f(x) = x is a fixed point of the function f
  • SQRT is a special case of fixed point.
  • Average Damping
  • Abstractions can be combined into higher order functions to create new abstractions.

2.4 Scala Syntax Summary

2.5 Functions and data

  • Define classes
  • Scala keeps types and values in different namespaces
  • toString needs the override modifier!

2.6 More fun with rationals

2.7 Evaluation and operators

Assignment - Functional sets

Represent sets as functions.

  /**
   * Returns the set of the one given element.
   */
  def singletonSet(elem: Int): Set = {
    x: Int => x.equals(elem)
  }

Week 3 - Data and Abstraction

3.1 -- Class hierarchies

  • Dynamic binding/Late binding -- https://stackoverflow.com/questions/19017258/static-vs-dynamic-binding-in-java
  • Abstract Class
    • Permissible to have unimplemented methods if the class is abstract; otherwise the compiler complains that method XXX is not defined!
    • Gives an example of implementing set as Binary Trees. (Persistent data structures, old data structure does not go away)
    • Subclasses and superclasses -- without an explicit superclass, Object is assumed; it is the root of all classes
    • Subclass can also override the non abstract methods.
  • Objects -- A Singleton for a class!!
  • Introduces main class!
  • Dynamic method dispatch --> Code invoked by a method call depends on the runtime type of the object.

3.2 -- How classes are organized

  • Organized in packages. -- package clause at top of the file.

    • Fully qualified name
    • Some entities (like Int etc are automatically imported)
  • Traits

  • Single inheritance language
  • Essentially like an abstract class.
  • A class can inherit from MANY traits.
    • Can contain fields and concrete methods.
  • Scala class hierarchy -- at the VERY TOP is scala.Any
    • Numeric types are not in a strict subtype hierarchy, but can convert into each other.
    • Any, AnyRef, AnyVal
    • Type Nothing at the bottom
      • Signals abnormal termination
      • Empty collections
    • null --> Null
      • Int CANNOT be null! (null works only for AnyRef types, not AnyVal)

3.3 -- Polymorphism

  • Type parameterization
    • Classes and methods can have type.
    • Immutable lists -- cons lists
    • Definition generalized using a type parameter [T]
    • NOTE -- Defining members as val makes them accessible fields.
  • Functions can also have type parameters.
    • Scala can figure out the type parameter based on the parameters passed!
  • Type parameters do not affect evaluation! -- Type Erasure
    • Only important for the compiler, types erased at runtime.
  • Polymorphism
  • Subtyping
  • Generics
Assignments!

Very nice assignment on TweetSets

Week 4: Types and Pattern Matching

4.1: Objects everywhere

  • Pure object oriented language --> Everything is an object! Is Scala one of these?
  • Pure Booleans
    • true and false becomes Constant objects!
    • Even Int can become a class!
  • Exercise on natural numbers. -- Very good!!!

4.2: Functions as objects?

  • Function types are.. IN FACT objects in Scala: the type A => B is just an abbreviation for the trait Function1[A, B]

  • Anon functions lead to AnonFun

  • Calling a function is f.apply() -- Functions are objects!

Eg.

class AddOne extends Function1[Int, Int] {
  def apply(m: Int): Int = m + 1
}
  • def is NOT an object! When, for example, a def f is used where a function value is expected, it is converted into one -- ETA expansion.

Useful -- https://twitter.github.io/scala_school/basics2.html#apply

4.3: Subtyping and generics

  • Base Type

    • assertAllPos example
    • Conforming to the bound --> S <: T means S is a subtype of T; S >: T means S is a supertype of T
    • Can also mix lower bound and upper bound
    • Given S <: T, then List[S] <: List[T], because Lists are covariant. Covariance -- https://docs.scala-lang.org/tour/variances.html
  • In Java, arrays are covariant, but they do type checking for every array to prevent issues.

    • Why was it done? Because they wanted a sort to work for everything.
  • Barbara Liskov's Substitution Principle

If A<:B, then everything you can do with B, you should be able to do with A.

In scala, Arrays are NOT Covariant (for why, see more below)

4.4: Variance

  • Lists are immutable, arrays are mutable. Immutable types can be covariant, if some other additional conditions are also met.
  • Types can be covariant C[+A], contravariant C[-A], or nonvariant C[A]
  • Typing rules for functions --
For a function, if A2 <: A1 and B1 <: B2, then A1 => B1 <: A2 => B2.

Functions must be contravariant in their argument types and covariant in their result types
Example -->
trait Function1[-T, +U] {
  def apply(x: T): U
} // Variance check is OK because T is contravariant and U is covariant

class Array[+T] {
  def update(x: T): Unit
} // variance check fails: the covariant T appears in a parameter position

MIND BLOWN! --> Function trait (checked in Scala code) becomes

trait Function1[-T, +U] {
 def apply(x: T) : U
}
i.e. return type is covariant, argument type is contravariant.

  • Can you just put + and - wherever you want?! NOOO -- otherwise we could make Array covariant. The Scala compiler checks that there are no problematic uses of variance. Roughly: covariant type parameters may only appear in method results, contravariant type parameters only in method parameters, and nonvariant (invariant) type parameters can appear anywhere.

Example from Piyush 007 -- https://gitlab.criteois.com/p.narang/scala-class/blob/master/src/main/scala/com/criteo/scalaclass/generics/Generics.scala

  • Takes the example of Nil from the Cons List example. Nothing is a subtype of everything.

  • Making a class covariant can be a lot of work! -- The prepend example

Assume the similar Cat, Dog, Animal inheritance tree used earlier, plus the following:

abstract class SmallAnimal extends Animal
case class Mouse(name: String) extends SmallAnimal
Suppose we’re working with functions that accept types of animals, and return the types of food they eat. If we would like a Cat => SmallAnimal (because cats eat small animals), but are given an Animal => Mouse instead, our program will still work. Intuitively an Animal => Mouse will still accept a Cat as an argument, because a Cat is an Animal, and it returns a Mouse, which is also a SmallAnimal. Since we can safely and invisibly substitute the former with the latter, we can say Animal => Mouse is a subtype of Cat => SmallAnimal.

4.5 Decomposition

  • Expr, Number and Sum example; trait expr
  • Too tedious!! Too many methods for each new piece of functionality -- a quadratic blow-up of methods.
  • Maybe use Type Casts? Nooo that sucks! Scala does not recommend it.
  • Define eval in every class -- but that way need to touch all classes if you add a method to the base trait.

4.6 Pattern Matching

  • All previous methods have shortcomings.
  • Case Classes -- automatically add companion objects -- a generalization of the switch statement
    • Can match on multiple expressions
    • Patterns are tried in the order they are written
    • Can match constructors too, e.g. Sum(x1, x2)
    • You can have the match inside the class/object itself, e.g. this match { ... }
  • Is not always good! Analyze the use cases and decide.. The Expression Problem

4.7 Lists

  • Lists are immutable, and recursive
    • Similar to the Cons lists implemented before.
    • All lists are constructed from the empty list Nil using ::, called cons, eg. fruit = "apples" :: ("banana" :: Nil); List(x) == x :: Nil
    • Cons associates to the right -- A :: B :: C is interpreted as A :: (B :: C)
  • Insertion Sort --good example!
Homework -- VERY GOOD HUFFMAN ASSIGNMENT
  • Tuple unpacking -- never forget it again!
def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] =
  freqs.sortBy(_._2).map {
    case (c, count) => Leaf(c, count)
  }
Write tests this way!!
trait TestTrees {
  val t1 = Fork(Leaf('a',2), Leaf('b',3), List('a','b'), 5)
  val t2 = Fork(Fork(Leaf('a',2), Leaf('b',3), List('a','b'), 5), Leaf('d',4), List('a','b','d'), 9)
}

Week 5-- More Lists

5.1 More Functions on Lists

  • Goes over the functions on lists like drop, tail.. xs(n)
  • Creating new lists --> xs ++ ys. xs updated (n,x)
  • Finding elements --> xs indexOf x, xs contains x
  • Implementation of last, init, reverse and concat in a recursive manner.
  • RemoveAt in terms of take and drop

5.2 Pairs and tuples

  • Merge sort example.
    • Merge function (Compare head elements with each other)
    • splitAt function returns a pair, eg. ("answer", 42) is a pair.
  • Tuple
    • _1 and _2 can be used to access the fields
    • Pattern match in a symmetric way using tuples
case (Nil, ys) => ys
case (xs, Nil) => xs
  • Run case matching on tuples!

5.3 Implicit parameters

  • Can make merge sort with [T]..but then not every type defines a comparator <, so compiler cannot be sure.
  • Can parametrize with a function -- Use a function lt -- put function value last in your parameters..gives compiler an idea of the types of parameters.
  • trait Ordering[T] as a parameter -- leave out the ord in the call after you mark as implicit..the compiler figures out the correct ordering
  • Compiler determines implicits according to the type of arguments, and whether they are in scope at that point. Failing rules, it is an ERROR.
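
A sketch of merge sort with an implicit Ordering[T], roughly as in the lecture (leave out ord at the call site and the compiler finds it):

    def msort[T](xs: List[T])(implicit ord: Ordering[T]): List[T] = {
      val n = xs.length / 2
      if (n == 0) xs
      else {
        def merge(xs: List[T], ys: List[T]): List[T] = (xs, ys) match {
          case (Nil, ys) => ys
          case (xs, Nil) => xs
          case (x :: xs1, y :: ys1) =>
            if (ord.lt(x, y)) x :: merge(xs1, ys) else y :: merge(xs, ys1)
        }
        val (fst, snd) = xs splitAt n
        merge(msort(fst), msort(snd))
      }
    }

    msort(List(3, 1, 2))            // Ordering[Int] supplied implicitly
    msort(List("banana", "apple"))  // Ordering[String] supplied implicitly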

5.4 Higher order functions on lists

  • The map method to transform every element of a list.
  • The filter method to filter out elements based on a condition.
  • Other functions.. such as filterNot, partition, takeWhile, dropWhile, span -- Nice functions!

5.5 Reduction on lists

  • reduceLeft for sum and product.
  • foldLeft takes in an accumulator
  • Implementations of both foldLeft and reduceLeft
  • foldRight and reduceRight --> parenthesis go to the right.
  • foldLeft and foldRight give the same result if the operator is associative and commutative -- good example of when this swap does not work for concat!

5.6 - Reasoning About Concat

  • What does it mean for a functional program to be correct? Eg. of concat, it is associative etc.
  • Structural Induction -- eg. of Natural Induction -- class 8 Maths! (Factorial example)
  • Structural Induction
    • To prove a property P(xs) for all lists xs, show base case, and then prove for x::xs
    • Shows the associative property for concatenation.
    • Induction hypotheses

5.7 - A Larger Equational Proof on Lists

  • Larger/More difficult example
    • reverse -- proving by induction does not work
    • Replace xs.reverse by ys; and prove that.

Week 6: Collections

6.1 - Other Collections

  • Lists' access depends on the index (element at index 0 is faster to get than in the middle)
  • Vector -- up to 32 elements it is a flat array; beyond that it becomes a tree of nodes, each pointing to up to 32 children of 32 elements each.
  • Which is better depends on access patterns: for recursive head/tail-style access List wins, otherwise Vector is much better
    • Does not have ::, but has +: and :+
    • Vectors are also immutable! Goes over insertion of an element. -- create many objects!
  • Seq is the base class --> the common base class for List and Vector
    • At the apex is Iterable
  • Goes over Array and String
  • Range syntax --> 6 to 1 by -2 --> lower bound to upper bound by step
  • Goes over sequence operations
    • exists and forall
    • zip and unzip
    • flatmap
    • sum, max etc for numeric lists
  • Goes over examples

6.2 - Combinatorial Search and For-Expressions

  • Handling nested sequences

    • Eg. of generating pairs of integers < 7 where sum is prime.
    • IndexedSeq ---> is implemented using Vector.
    • To flatten, can use foldRight or flatten.
      • xs flatmap f = (xs map f).flatten
  • Program can be difficult to understand! due to abstraction.

  • Introduces the for expression --> for (s) yield e, where s is a sequence of generators and filters.

  • Expresses the isPrime problem in nested for.
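
The nested-for version of the prime-sum pairs, as a small sketch:

    def isPrime(n: Int): Boolean = (2 until n).forall(n % _ != 0)

    val pairs =
      for {
        i <- 1 until 7
        j <- 1 until i
        if isPrime(i + j)
      } yield (i, j)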

6.3 - Combinatorial Search Example

  • Sets

    • Sets are unordered
    • Do not have duplicate elements.
    • contains method.
  • N Queens problem -- VERY NICE PROBLEM!

    • Recursive solution --> more than one solution!
    • zip is very useful!!!
    • The diagonal check!
  • Maps

    • Applying a map to a missing key throws NoSuchElementException -- the get method returns None instead.
    • Introduces the Option type. --> Some and None --> pattern matching
  • Sorted and GroupBy

    • sorted is natural ordering, sortWith is according to a rule.
  • The polynomial example

    • Concat, ++ superimposes instead of adding.
    • Defines a function by himself, adjust!
    • withDefaultValue!! -- simplifies it.
    • Repeated parameters --> bindings: (Int, Double)*
    • Uses foldLeft -- is more efficient

6.5 - Putting the Pieces Together

  • Mnemonics for a phone number
  • Comparing the lines of code for scripting vs general languages.
  • Inverts the map using for
  • Map can also be used as a function!
  • Writes the encode function.

6.6 - Conclusion

Assignments -- Anagrams
  • The identity function -- awesome!
  • s flatmap wordOccurences

Great assignment !! -- TOO DIFFICULT!

Course 2. Functional Program Design in Scala

Week 1: For Expressions and Monads

Recap

  • Recap on case classes

    • JSON example, tries to represent it using abstract classes and case classes; recursive examples for toString method.
    • By itself, an expression is not typable -- JBinding, String example. Goes over trait Function1
    • We can subclass the function type, for eg. Seq[Elem] extends (Int => Elem)
    • scala MatchError
  • Partial Function -- mind blown!

    • Define a function as Partial Function; isDefinedAt to check
    • WARNING!! -- isDefinedAt only applies to the outermost block of matching; it gives no guarantee for nested matches (see the sketch at the end of this recap).
  • Recap on collections

    • All have map, flatMap, filter, foldLeft, foldRight
    • idealized implementation of map, flatMap and filter on Lists
    • In practice the implementations differ, so that they apply to arbitrary collections and are tail-recursive
    • For expressions --> simplify combinations of core methods map, flatMap and filter -- the Prime example is much simpler!
  • How for translates to map, flatMap

    • The first generator becomes a flatMap, and the translation is applied recursively to the rest of the for-expression
    • Generators can also have Patterns!! JSON example -- translates to withFilter on the pattern.
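
The promised partial-function sketch (made-up values, just to show isDefinedAt and the nested-match caveat):

    val f: PartialFunction[String, String] = { case "ping" => "pong" }
    f.isDefinedAt("ping")   // true
    f.isDefinedAt("pong")   // false

    val g: PartialFunction[List[Int], String] = {
      case Nil => "empty"
      case x :: rest =>
        rest match { case Nil => "single" }   // nested match is NOT covered by isDefinedAt
    }
    g.isDefinedAt(List(1, 2, 3))   // true, yet g(List(1, 2, 3)) throws a MatchError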

Lecture 1.1 - Queries with For

  • for can be considered as querying a database.
  • The book database example.
    • Nice example of yielding results to find authors who have written at least two books.
    • Can use distinct
  • map, flatmap and filter can all be expressed in for.

Lecture 1.2 - Translation of For

  • Goes over translation of for in terms of map
    • withFilter
    • Two generators go to flatMap
    • Every translation steps has one less generator
    • Goes over the isPrime example.
  • As long as you have map, flatMap and filter, can use any datatype for for

Lecture 1.3 - Functional Random Generators

  • for expressions are NOT TIED to collections, all you need is map, flatMap and filter.
  • Generator example.
    • Can use for for it, but need a map and flatmap definition.
    • self example (because there are two generate fns)
    • List generators; trees example.
  • Random generators can be used for unit testing.

Lecture 1.4 - Monads

  • Has two functions flatMap and unit

    • List is a monad
    • andThen for function composition
  • Monad Laws

    • Associativity , Left unit and Right unit
    • Proves them on Option class -- flatMap proof is a bit involved.
  • Significance of laws

    • Thanks to associativity, the two generators can be inlined into one for-expression.
  • The Try type

    • Success and Failure
    • Composing with for
    • Bulletproof example - an expression composed of Try, map and flatMap will never throw a non fatal exception.
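
A minimal sketch of the Monad signature and its laws as stated in the lecture (unit is the constructor of the concrete type, e.g. List(x) or Some(x)):

    trait Monad[T] {
      def flatMap[U](f: T => Monad[U]): Monad[U]
    }

    // Laws, for m a monad value, f and g functions, unit the constructor:
    //   m.flatMap(f).flatMap(g) == m.flatMap(x => f(x).flatMap(g))   // associativity
    //   unit(x).flatMap(f)      == f(x)                              // left unit
    //   m.flatMap(unit)         == m                                 // right unit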

Week 2: Lazy Evaluation

Lecture 2.1 - Structural Induction on Trees

  • Structural induction on Trees
    • IntSet --> contains and incl
    • Three laws for IntSet, for eg. --> (s incl x) contains y = s contains y (for y != x)
    • Three laws completely define IntSet
    • Base case is Empty, induction step is NonEmpty
    • Five cases for the third law

Lecture 2.2 - Streams

  • Prime number example -- calculate every prime number and just take the top two, using the for code.
    • Compute the tail only IF needed.
    • Avoid computing the tail until on demand --> Stream
    • Range code is similar but behavior is completely different -- creates only on demand!
    • :: produces a list; #:: produces a stream (identical to Stream.cons)
  • Implementations are defined in the Stream object
  • The difference is that the tail parameter is call-by-name (remember, prepend => to the type to make a parameter call-by-name!)
  • StreamRange example question

Lecture 2.3 - Lazy evaluation

  • Do things as late as possible, never do them twice
    • Instead of computing the tail every time, can store the result and re-use it --> lazy evaluation vs call-by-name evaluation
    • Haskell is purely lazy
    • By default Scala does strict evaluation, but allows lazy evaluation (lazy val)
  • lazy val vs def --> the same the first time.. but after the first time, a lazy val is not recomputed --> very good example (see the sketch after this list)
  • Stream's tail becomes a lazy val
    • Revisits the isPrime example
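
The lazy val vs def quiz from the lecture, as a sketch (what gets printed?):

    def expr = {
      val x = { print("x"); 1 }        // evaluated right here, prints "x" once
      lazy val y = { print("y"); 2 }   // evaluated on first use only, then cached
      def z = { print("z"); 3 }        // re-evaluated on every use
      z + y + x + z + y + x
    }
    expr   // prints "xzyz": x up front, z twice, y once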

Lecture 2.4 - Computing with Infinite Sequences

  • Since calculation is done ONLY when needed, good to define infinite sequences.
    • Define and then take upto the where you need them.
  • Calculation of prime numbers --> Sieve of Eratosthenes (sketch below)
    • Excellent implementation!!
    • Streams to converge approximation of sqrt
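
A sketch of the sieve on a lazy sequence (the course uses Stream; LazyList is the modern equivalent):

    def from(n: Int): LazyList[Int] = n #:: from(n + 1)

    def sieve(s: LazyList[Int]): LazyList[Int] =
      s.head #:: sieve(s.tail.filter(_ % s.head != 0))

    val primes = sieve(from(2))
    primes.take(10).toList   // List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)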

Lecture 2.5 - Case Study: the Water Pouring Problem

  • Explains the problem, and the methods/classes to represent.

    • Try all paths until you have the destination.. i.e. the glass with the correct amount
    • case class Capacity, trait Move with moves Empty, Fill and Pour
    • def change in the trait to be defined by every case class above.
    • Path history, foldRight
    • Calculate ALL the paths and then filter by the solution (i.e. paths with end state as the glass with a volume)
  • NOT VERY EFFICIENT

    • Do not revisit the states already explored!
  • TIP -- Name everything you can!, do not put all operators in one line and keep scope for future changes!

  • https://alvinalexander.com/scala/how-to-control-visibility-constructor-fields-scala-val-var-private

Assignment: Bloxorz
Nice ASSIGNMENT! Made a silly mistake!!!

Week 3 -- Functions and State

3.1 Functions and state

  • Goes over function rewriting.
  • Rewriting always leads to the same solution.
    • Any order of rewrite leads to the same solution!
  • But the world is made of stateful objects --> an object has state if its behavior changes over time (e.g. a bank account)
  • var is a mutable definition (can be reassigned)
  • Gives the BankAccount example and definition.
  • The Cons Stream's example -- assuming operations on Stream are purely functional, unless you have side effects.
  • Does not have to be a variable change.. FOCUS ON THE BEHAVIOR! (if behavior changes => stateful)

Lecture 3.2 - Identity and Change

  • Referential Transparency
  • Operational equivalence -- two definitions x and y are operationally equivalent if no possible test can distinguish them.
    • Goes over proving the above in the BankAccount example.
  • The substitution model does not HOLD!!

Lecture 3.3 - Loops

  • Loops are NOT essential for imperative programming.
  • Defines while as a function WHILE --> the condition must be passed by name (so it is re-evaluated every time; see the sketch after this list)
  • Also defines REPEAT
  • for loops cannot be represented by just higher order functions.
    • Like for expressions.. but uses foreach
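
A sketch of WHILE as a plain function; both parameters are by-name, so the condition is re-evaluated on every iteration:

    def WHILE(condition: => Boolean)(command: => Unit): Unit =
      if (condition) {
        command
        WHILE(condition)(command)   // tail-recursive, so it runs like a loop
      }

    var i = 0
    WHILE(i < 3) { i += 1 }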

Lecture 3.4 - Extended Example: Discrete Event Simulation

  • Digital Circuits (ECE days are back!!...)
  • Base components -- AND, OR and NOT (with a delay)
  • Creates an half adder using just these basic gates.
  • Defines the classes/functions Wires and gates
  • Uses HalfAdder to define a FullAdder

Lecture 3.5 - Discrete Event Simulation: API and Usage

  • Action --> a function that has side effect (return type Unit)
  • time is simulated, not actual time.
  • Wire class -> getSignal, setSignal, and actions to be run when the signal changes.
  • Goes over the implementation of all the gates.

Lecture 3.5 - Discrete Event Simulation: Implementation and Test

  • Agenda is a list of Events to be performed at a time; they perform actions
  • Action goes into the Agenda.
  • Probe is also defined to get the values of the wire during the simulation
  • Runs a simulation (that has delays of time as well)
  • Goes over OrGateAlt -- an implementation in terms of AND and NOT
  • Losing referential transparency, more confusing

Week 4 -- Timely Effects

Lecture 4.1 - Imperative Event Handling: The Observer Pattern

  • Observer Pattern -- Remember MVC from Objective C days!! (Publisher-Subscriber)

    • Goes over the Publisher trait
    • Shows the example of BankAccount in this pattern.
    • A subscriber Consolidator that subscribes to all the bank accounts. _ Nice example!!
  • Goes over the good and the bad (bad: the return type is Unit, which forces an imperative style, and things get tricky with CONCURRENCY!!)

  • We will study how to go from Imperative Functional Programming to Functional Reactive Programming

Lecture 4.2 - Functional Reactive Programming (FRP)

  • Signals that changed with time.

    • Mouse moved example -- Signal[Position]
  • frp.Signal <--> Scala.React

    • Get the value of the signal at the current time.
    • Get value of a signal in terms of other signals.
    • Introduces Var ---> for signals that can be updated/changed (the update method, which is also defined for Arrays).
  • Important -- Defining a signal in terms of other Signal A = f(Signal B) for example -- the relation is maintained going forward. (forever in the future!)

  • Goes over the BankAccount example.

    • Balance becomes a Signal[Int]
    • A signal cannot be a function of itself!! balance() = balance() * amount IS WRONG!!
      • Instead get the value of balance in a val and then do it
  • IMPORTANT!

  • Diff between b = b + 1 and b() = b() + 1 -- the latter makes NO SENSE, as the first updates a value once, while the second would define a relation that has to hold for all time.

Lecture 4.3 - A simple FRP implementation

  • Every signal has (1) its current value, (2) the current expression defining it, (3) all of its observers that need to be re-evaluated if the signal's value changes.
  • How to know on whose behalf a signal is updated.
  • Need to track callers in a stack-like fashion -- defines a class of stackable variables
  • The type Signal[_] means it can hold signals of any type.
    • NoSignal -- a sentinel signal.
  • Adds the current caller to the signal
    • A signal cannot observe itself (duh) -- Catches the cyclic dependency
    • computeValue() does the propagation
    • NoSignal's computeValue does nothing.
  • Problems with parallelism!!
  • Going to thread local -- Dynamic Variable
  • Pass current value of signal as a implicit parameter

Lecture 4.4 - Latency as an Effect 1

  • Introduces the monad Future (Computations can take time!!!)
    • Reading from memory takes time, and sending a packet to Europe takes time!
    • Converting into a human time format, that makes sense! --> sending a packet to Europe in this scale takes 5 years!
    • Do not have to wait for a computation, Callback -- mind blown!!*

Lecture 4.5 - Latency as an Effect 2

  • A Monad -- Future

    • Handles both the idea that computations can fail, and can also take time.
    • Function onComplete --> think of it as an envelope given to the calculator and asking him to find the result, fill the envelope and return it back to us.
    • Ignoring implicit execution context for now.
    • Function can match to Success or Failure
      • Can have a onComplete that takes two callbacks (one for success/failure)
    • readFromMemory returns Future[Array[Byte]]
    • Future's apply returns a future that computes asynchronously (Future just expects a Body)
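
A tiny sketch of the Future/onComplete flow (the "slow read" is faked; the implicit global execution context stands in for the one ignored above):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    val packet: Future[Array[Byte]] = Future {
      Array[Byte](1, 2, 3)   // stands in for a slow readFromMemory()-style computation
    }

    packet.onComplete {
      case Success(bytes) => println(s"got ${bytes.length} bytes")
      case Failure(error) => println(s"failed: $error")
    }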

    Lecture 4.6 - Combinators on Futures 1

    • Functions on future.. all the usual flatMap etc, along with recoverWith
    • flatMap cleans up the code and makes it nicer -- Very nice examples!
      • zips two futures (does not work well, since zip fails when either of the futures being zipped fails)
    • recover and recoverWith

    Lecture 4.7 - Combinators on Futures 2

    • Writes a better method fallBackTo --> expressed in terms of recoverWith
    • Introduces blocking, the Awaitable trait, but IS NOT RECOMMENDED!!
      • You should NEVER EVER BLOCK in production!!

    Lecture 4.8 - Composing Futures 1

    • for comprehensions (remember brightroll?!!)
    • Defines retry
      • In terms of fallBackTo recursion

    Lecture 4.9 - Implementation of flatMap on Future

    • Implementation of flatMap in terms of CallBack

    Lecture 4.10 - Composing Futures 2

    • Goes over reminding us of foldLeft and foldRight
    • Implements retry in terms of foldLeft and foldRight
    • Sometimes straight recursion is much simpler to understand!

    Conclusion, FIN.

    Assignment
    • Uses Signal's -- FRP

Course 3. Parallel Programming

Week 1: Introduction to Parallel Programming

1.1 What is parallel computing?
  • Many calculations (can be done in parallel), same time
  • Amdahl!!
  • After a while, hit a power wall
  • Just increasing CPU frequency does not help, so added multi-cores!
  • Parallel programming is HARDER than sequential
  • But we need to do it for SPEEDUP (No other reason)
  • Parallel vs Concurrency
  • Parallel -- Uses Parallel hardware for more efficiency (eg. matrix multiplication, computer graphics) - speedup
  • Concurrent -- Improves modularity (eg. async apps such as web servers, databases) -- convenience
  • Both of these can intersect each other
  • Different granularities of Parallelism
    • Bit-level parallelism and Instruction-level parallelism and Task-level parallelism
    • Different types of hardware -- Multi core processors, Symmetric multiprocessors, GPU, FPGA, Computer Clusters
1.2 Parallelism on the JVM
  • What is an operating system? -- Software that manages hardware and software resources, and schedules program execution
  • Process - An instance of a program executing in the OS (a process has a PID) -- the most coarse-grained unit in parallel programming
  • Multitasking -- processes are isolated from each other and each owns its own memory area
  • Each process can have multiple threads (Threads share memory space) -- can be started within the same program, and can exchange information between one another
    • A thread has a program counter and a program stack
    • JVM process starts with main thread -- in a sequential program, only the main thread is used.
    • Goes over the Thread subclass
    • Goes over an example where the outputs can be different (threads are executed in parallel!)
    • To ensure a sequence of statements is executed atomically (in order, without interleaving) -- use the synchronized construct! -- at most one thread can hold a monitor at a time
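
A minimal sketch of synchronized for atomicity (the Account class is just illustrative):

    class Account(private var amount: Int = 0) {
      // at most one thread can hold this object's monitor, so the
      // read-modify-write below cannot interleave with another add
      def add(n: Int): Unit = this.synchronized {
        amount += n
      }
      def balance: Int = this.synchronized { amount }
    }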

1.3 Parallelism on the JVM II

  • Synchronizations can be nested! -- Gives the example of a bank transfer -- locks on both source and transfer account
    • Excellent example when a thread transfers from a1 to a2, and another from a2 to a1 --> DEADLOCK!
  • Resolving deadlocks
  • Acquire locks in the same order!
  • Memory model
  • How threads interact when accessing shared memory
  • Two threads writing to different memory locations --> no need to synchronize!
  • The thread join --> after joining a thread, all of its writes are visible to the joining thread.

1.4 Running Computations in Parallel

  • pNorm example
  • Uses recursion!!
  • MIND BLOWN! --> Need to use call by name! (need to pass unevaluated computations)
  • Memory contention!! -- Speedup takes into account not only cores, but also memory
  • Also you are only as fast as your slowest thread..
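
Why call-by-name matters here, as a sketch: with by-value parameters both computations would already have been evaluated (sequentially!) before parallel ever ran. The course provides the real construct; the body below just evaluates both, for illustration:

    def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
      // a real implementation would evaluate taskA and taskB on different threads
      (taskA, taskB)
    }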

1.5 Estimate pi -- monte carlo

  • Estimate using randomly sampling points to fall within a circle/square

1.6 First class tasks

  • Using the construct task instead of parallel.
    • Defines Task interface that takes in a computation as an argument
  • Breaks the p-Norm implementation into tasks
  • Defines parallel construct in terms of tasks

1.7 How Fast are Parallel Programs?

  • Empirical measurement vs asymptotic analysis (worst case run time)
  • Asymptotic analysis of sequential running time and parallel running time on the sum-segment example
  • Parallel program time depends on the number of resources --> introduces Work (all sequential) and Depth (Infinite parallel)
    • Can never do better than Depth (as that is the case when parallelism is infinite)
  • Amdahl's law returns!! -- the law of parallelism

1.8 Benchmarking Parallel Programs

  • Benchmarking vs Testing
    • Testing is binary, benchmarking gives a continuous value
  • Just timing is not enough -- make sure to take care of warm-up, eliminate outliers, and account for effects such as GC, JIT compilation, and aggressive optimizations
  • Enter ScalaMeter
  • Be wary of GC! -- runtimes can vary across runs
  • JVM may apply further optimizations -- ScalaMeter has withWarmer -- awesome!! -- Nice intro to ScalaMeter; can also measure memory footprint
Assignment -- Nice assignment on blurring in photoshop

Week 2: Basic Task Parallel Algorithms

2.1 Parallel Sorting

  • Uses an intermediate array -- breaks down sorting into smaller array sizes
  • merge and the copy functions
  • parMergeSort vs quickSort -- Uses ScalaMeter

2.2 Data Operations and Parallel Mapping

  • Gives examples of operations on collections (fold, scan, map etc.)
  • List is NOT good (as cannot find the middle easily) --> use Array and Tree
  • The naive parallel implementation (process the first element, cons it onto the processed rest of the list) is still O(n)!!
  • So we look at Array
    • Breaks into parallel implementation -- (1) Writes need to be disjoint (2) Threshold for sequential needs to be large enough to have efficiency!
  • Goes over implementation of Tree
    • leaf stores the num of elements, and nodes contain the elements
  • Compare arrays vs Trees --> in trees less worry about disjointness

2.3 Parallel Fold (Reduce) Operation

  • Goes over differences between foldLeft, foldRight, reduceLeft, reduceRight
  • Parallelism can be done ONLY on associative operations --- as we want to split into parts as we wish
    • Each associative operations can be shown as a tree --> leaves are the values and nodes are the operations.
    • For ordering for a Tree, can use list
    • Tree rotation preserves the result.
  • For arrays reduction, can convert into tree
  • Gives reduce and reduceSeg function. (reduce is done in a sequential fashion)

2.4 Associativity 1

  • Compares with commutative
  • Gives examples of operations that are both associative and commutative
  • Examples where it is associative but NOT commutative
    • String concat
    • Matrix multiplication
  • Example of where it is the opposite case
  • Should NOT LOSE associativity
  • Floating point operations are not necessarily associative! (Great examples!!) -- due to precision considerations
  • Trick to make a function commutative! --> just always pass the lesser argument first (using a comparison function)

2.5 Associativity II

  • Gives an example of associativity for tuples
  • Example of average using two reduces --> one for sum and other for length
    • Gives the case when commutativity also implies associativity (given the rotation condition)
    • Gives two examples of applying this theorem
  • Floating point numbers -- be careful!!
  • Sets example (Closure operation -- A Closure B)

2.6 Scan

  • scan is like a combo of map and fold
  • HOW to PARALLELIZE?!!
    • this value Depends on the previous value!
    • Do more work (store the intermediate results) --> parallelize, totally worth it.
    • Have trees that store intermediate values
    • Think of summing up trees from leaf to the head, storing results in an intermediate fashion as you go up.
    • upsweep and downsweep functions to the rescue
    • Application to arrays --> store the Segment of array in the non-leaf nodes
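
For intuition, the sequential scanLeft (the parallel version produces the same values via upsweep/downsweep):

    List(1, 3, 8).scanLeft(100)(_ + _)   // List(100, 101, 104, 112)
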
Assignment -- Parallel programming

Count change problem -- how to divide stuff if we do not know the workload?!!

Counting change is a canonical example of a task-parallel problem in which partitioning the workload across processors is solution-driven -- to know how to optimally partition the work, we would first need to solve the problem itself.

The synchronization costs of doing this are way too high. Instead, we need to agglomerate parts of the computation. We do this by calling the sequential countChange.
  • Speed up is measured using ScalaMeter -- letting it warmup and GC first

Nice line of sight problem

Week 3: Data-parallelism

3.1 Data Parallel Programming
  • Task parallelization --> a form of parallelization that distributes execution across computing nodes
  • Data parallelization --> distributes data across nodes
  • Parallel for loop
    • Introduces the parallel for loop --> MUST BE CAREFUL of adding .par to the loop
    • Mandelbrot set --> keeps iterating complex numbers, checking whether the absolute value diverges to infinity
      • Goes over the image array in parallel.
      • Compares data parallelism for this problem, vs task parallelism --> Why?!! -- compares workload for different parts of the data
        • Task of balancing workload is managed by the scheduler -- we do not care about equally dividing! it is done for us..
3.2 Data Parallel Operations
  • Some methods like filter can be done in par.
  • But foldLeft CANNOT be done --> great lego example!! -- foldLeft depends on the previous part --> so MUST be sequential
  • But fold can be done--> it is like a reduction tree
3.3 Data Parallel Operations II
  • Uses fold to calculate max using Int.MinValue as the neutral element
  • Rock, paper and scissors example!
    • Play is NOT associative --> it is commutative
  • For fold to work --> the two following conditions must hold to make the operator f a MONOID
    • f(a, f(b,c)) == f(f(a,b),c) -- ASSOCIATIVE
    • f(z,a) == f(a,z) == a -- Like an identity on the neutral element z
    • Commutativity does not matter
  • For fold -- the type of the neutral element must be the same !!!
  • In comes aggregate --> a combination of fold and foldLeft
z   A   z   A   z   A   z   A
 \ /     \ /seqop\ /     \ /    
  B       B       B       B
    \   /  combop   \   /
      B _           _ B
         \ combop  /
              B

https://stackoverflow.com/questions/6928374/example-of-the-scala-aggregate-function
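A small sketch of aggregate computing (sum, count) in one pass (needs the scala-parallel-collections module on Scala 2.13+):

    import scala.collection.parallel.CollectionConverters._

    val xs = List(1, 2, 3, 4).par
    val (sum, count) = xs.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),   // seqop: folds elements within a chunk
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // combop: merges results of chunks
    )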

3.4 Scala parallel operations
  • Scala collections hierarchy

    • Traversable[T] -- Must implement foreach
    • Iterable[T] -- Must have iterator
    • Seq[T] -- Ordered ...
  • Parallel collection hierarchy

  • ParIterable[T] etc
  • GenSeq, GenIterable etc -- can or cannot be run in parallel

  • Compares calling .par on a List vs a Vector -- List is way slower

  • All collections have a parallel counterpart (mutable Par, immutable par etc)

  • Example of calling a function with arguments GenSet that accepts both sequential and parallel implementations.

3.5 Splitters and Combiners
Assignment -- KMeans
K-means is an example of a bulk synchronous parallel algorithm (BSP). BSP algorithms are composed 
from a sequence of supersteps, each of which contains:

parallel computation, in which processes independently perform 
local computations and produce some values;

communication, in which processes exchange data;

barrier synchronisation, during which processes wait until every process finishes

https://en.wikipedia.org/wiki/Bulk_synchronous_parallel

Remember Tail Recursion! -- https://gist.github.com/sgoyal1012/98ac2d289ddef3c1b3e4c4bb9a6093c6#17-tail-recursion

  • The recursive call should be the last thing that the function does (no operation after!)
  • The assignment tries different choices for selecting the initial means and compares them

Week 4: Data Structures for parallel computing

4.1 - Implementing Combiners
  • Transformer methods --> like filter, map etc (create new collection)
  • Scala Builder for sequential transformer operations
  • For Parallel, we use Combiner (a Builder with a combine) (combine this with that)
    • Combine is different for different collections (for Set and Map it is a union)
    • Combine MUST be efficient!! Must execute in O(log n + log m) time
    • To gain any speedup, the operation must be logarithmic!!
  • Combiner
    • Gives a bad implementation of Combiner --> O(n)
    • Arrays CANNOT be combined efficiently --> as they are contiguous
    • Compares runtime for various Data Structures.
    • So... implementation of a combiner is NOT trivial!
4.2 Parallel Two-phase Construction
  • Construct data structures in parallel using the two-phase construction (Use an intermediate data structure)
    • The intermediate data structure must have an efficient combine method, an efficient += method and can convert into the resulting data structure in O(n/P) time
    • Many intermediate data structures are created, then combined, and then converted in parallel into the final data structure
    • Uses a nested array buffer -- and then just copies the reference!
    • The result method to copy back into the array in a parallel fashion
    • Can extend to other data structures, need to choose an efficient intermediate data structure
      • Intermediate data structures MUST be non-overlapping
4.3 Conc-tree Data Structure
  • Conc-tree a parallel counterpart of Cons List
    • Lists are built for sequential computations, trees can be done in parallel.
    • Trees are ONLY good for parallelism if they are balanced
      • Enter Conc trees! Contains level and size info at each node
      • An invariant -- the level difference between siblings cannot be greater than one
      • <> to Concat trees -- Concat making sure that the tree balance structure is maintained
      • Goes over cases of concating trees in various unbalanced situations.
    • https://en.wikipedia.org/wiki/Conc-tree_list
4.4 Amortized, Constant-time Append Operation
  • += method should be O(1)
    • Adds the Append node type
      • Relaxes the balanced constraint
      • Make sure that there are NO more than log N Append nodes
      • Draws a parallel between counting numbers in binary representation and the append nodes
4.5 Conc Tree combiners
  • Conc buffers -- introduces the Chunk node and the expand method

Course 4. Big Data Analysis with Scala and Spark

Week 1: From Parallel to distributed

1.1 Logistics
  • Goes over logistics
  • Why scala? Why spark?
    • Because the data science done in tools like MATLAB only works while the data is small (fits in memory)
    • Because it's the shizz, yo
    • Works a lot like Scala
    • Modeled like Scala collections (map, flatMap) -- Hadoop needs a lot more boilerplate. Uses the Databricks Community Edition!
1.2: Data-Parallel to Distributed Data-Parallel
  • In the previous course, we did shared-memory data parallelism (jellybeans in a jar!)

    • Different processors operate on different chunks of data, but same machine
    • Data is SPLIT into different nodes, on which nodes operate in parallel.
      • Now MUST worry about network latency!
    • A lot carries over between shared memory and distributed memory --> operations should be associative!
    • Again, reminding you about LATENCY!!
  • My dear RDD

1.3 Latency
  • Issues while going from shared to distributed
    • Partial Failure - a subset might fail!!
    • Latency -- Network communication causes latency -- LATENCY WILL NEVER EVER GO AWAY
      • Important latency numbers -- https://gist.github.com/hellerbarde/2843375
      • Reading 1 MB from memory vs reading 1 MB from disk -- disk is many times slower!!
      • Memory < Disk < Network (Latency order)
      • Multiplies by billion to get a human sense
  • Goes over hadoop and mapreduce --> Mapreduce/Hadoop was great because of fault tolerance
    • Failure probability is high (for a node to fail..)
  • Why spark?
    • Because of fault tolerance, hadoop/mapreduce shuffles its data and writes to disk (EXPENSIVE!)
    • Spark strategy for handling latency-- all data is immutable and everything is in memory!
    • Just replay the transformation chain!
    • 100 times faster -- does stuff in memory
    • Minimizes network traffic
  • Hadoop vs Spark on a benchmark -- logistic regression roughly 100x faster (100 s vs 1 s)!

1.4 Basics of Spark's RDDs

  • RDDs seem like an immutable collection, have higher order functions like map, filter etc

    • Heavy use of higher order functions
    • Similar structure -- like MAGIC!
  • How to create a RDD?

    • Transform a RDD
    • or create one.. using SparkSession, SparkContext
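
A minimal sketch of both ways (app name, master and file path are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val nums  = sc.parallelize(List(1, 2, 3, 4))    // RDD from a local collection
    val lines = sc.textFile("path/to/file.txt")     // RDD from a file (hypothetical path)
    val doubled = nums.map(_ * 2)                   // transforming an existing RDD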

1.5 RDDs: Transformation and Actions

  • Transformers create new collections (not single values!) -- Accessors return single values as results
  • Spark defines Transformations and Actions
    • Transformations return RDD
    • Actions return something that is NOT an RDD.. e.g. write to HDFS
    • VERY IMPORTANT -- Transformations are lazy and Actions are eager
      • Laziness reduces network communication by deferring work until an action actually needs the result!
      • NOTHING HAPPENS when calling a map -- add an action to get a result!!
    • Eager actions
      • collect, count, take, reduce etc. etc.
      • LOOK AT THE RETURN TYPE to know if it's an action vs a transformation!
    • Stage
      • Something that we will do in the future sometime
    • Spark can optimize stuff...
      • eg. xs.filter(p).take(10) can stop as soon as it has 10! -- MIND BLOWN
    • Two RDD transformations (union, cartesian etc)
1.6 Evaluation in Spark: Unlike Scala Collections!
  • Why is spark good?

    • Lazy and in memory!
    • Most data science stuff requires iterations (map + reduce)
      • In hadoop, there is a read and write for intermediate data. (so a lot of time is spent!)
      • Spark loads everything in memory (saves so much time!)
      • Logistic regression is a hello world of the iterative data algorithm world
  • Gives an example of logistic regression

    • RDDs are recomputed on each action -- unnecessary to do it every time
    • persist -- after an RDD is evaluated, can reuse it -- VERY IMPORTANT
    • Suppose you do two operations --- can just persist -- such as take followed by count
  • Persistence comes in various flavors

    • Can keep both on memory and disk (as regular/serialized java objects)
    • cache vs persist -- cache is memory only and persist can take multiple arguments (eg. MEMORY_ONLY, MEMORY_ONLY_SER)
  • Different than collections!

    • Don't wait on the result (don't be EAGER)
    • Don't redo the same computations (use caching!)
  • Laziness is a good thing!!

    • Spark can analyze and optimize (if you just need 10, it will optimize to only get 10) -- as soon as it has 10, it just quits!
    • Spark can fuse computations and do less work
      • For eg. map and filter are two passes, but it just does it once!! (MIND BLOWN!)
1.6 Cluster Topology Matters!
  • There is a master-worker topology
    • master is the driver program (The commander, we talk to this)
    • executors are workers (they do the work)
    • Cluster manager is YARN or MESOS
  • Driver is the brain
    • Contains the main, transformations etc
  • Steps in a spark program
    • Spark SENDS the program to the executors (think about that for a moment!)
    • Driver says to executor -- take this function and data, do the work, and return the result
    • foreach is an action executed on the executors -- so nothing is printed back on the driver!
    • take gives it back to the driver program (sent back to the driver)
  • Summary
  • Think where your code is executed
    • Transformations are stored up (lazily) on the driver, and then on an action the executors run them.

Week 2: Reduction Operations & Distributed Key-value pairs

2.1 Reduction Operations
  • How are reduction operations (actions) distributed?
    • Methods like fold, aggregate etc walk through a collection and combine to get a result
    • Remember that foldLeft is NOT parallelizable
      • Because of TYPES!!
    • fold returns the same type
    • aggregate combines the two
  • On RDDs -- foldLeft and foldRight do NOT even exist -- they are not parallelizable! -- You must use aggregate!!
  • Why not have a serial implementation? It would not make sense synchronization-wise..
  • aggregate is desirable as you go from a higher data structure (Case class) to a simpler one (like Int)
2.2 Pair RDDs
  • Key-value pairs are preferred in the map-reduce world!
    • Project down onto key-value pairs
    • PairRDDs have special operations -- special methods -- groupByKey, reduceByKey, join etc.
2.3 Transformations and Actions on Pair RDDs
  • Actions and Transformations on PairRDD
    • groupBy example
    • Spark's groupByKey groups the values according to the keys they already have (no function argument required)
    • reduceByKey reduces the values for the same keys.
      • Note that reduceByKey is a transformation, unlike the action reduce in scala collections
    • mapValues only applies functions to the values
    • countByKey just counts the number of elements by key
    • The transformation keys is NOT an action
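
A tiny pair-RDD sketch (made-up data, sc from the earlier sketch):

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))   // RDD[(String, Int)]

    val grouped = pairs.groupByKey()         // RDD[(String, Iterable[Int])] -- transformation
    val summed  = pairs.reduceByKey(_ + _)   // RDD[(String, Int)] -- transformation, NOT an action
    val counts  = pairs.countByKey()         // Map[String, Long] -- action, runs eagerly
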
2.4 Joins
  • Inner Join vs Outer Join
    • The difference shows up when some keys do not exist in both RDDs
    • Gives example of train and location data
    • Inner join only takes the keys that intersect
      • It is commutative
    • Joins ARE transformations~~
    • Outer join --> takes the union
      • Returns the Option type (can be None or Some)
      • Choice of left vs right depends on whose keys are more important to us

Week 3: Reduction Operations & Distributed Key-value pairs

3.1 Shuffling: What it is and why it's important
  • groupBy and groupByKey --> shuffle data, as data MUST move around to combine values for the same key
  • groupBy returns a ShuffledRDD
  • Need to move around the values so that key has its values together
  • SHUFFLE is expensive!! -- Remember the humanized latency
  • reduceByKey is much more efficient! USE THIS!

    • reduces on the mappers first
    • Reduces the amount of data for network serialization
  • reduceByKey gives almost 3x speed!

3.2 Partitioning
  • How does spark know which key to put where?
  • What is a partition?
    • Data within a RDD is split into partitions.
    • Data in same partition is ALWAYS on the same machine.
    • Can configure the number of partitions
    • Generally number of partitions is equal to number of cores on all executor nodes
  • Can customize a partition ONLY on a pair RDD
  • Hash partition calculates p using the hashvalue -- tries to spread around based on the value
    • Hash partitioning can produce unevenly skewed partitions if we are unlucky
  • Range partitioning -- partitioned according to ordering
    • Can distribute keys unevenly based on ranges
  • How to customize partitioning?
    • VERY IMPORTANT -- partitionBy
    • Using transformations that create partitions
    • MUST persist()!! -- otherwise it is repartitioned every time
    • Otherwise partitioning is re-applied every time, causing shuffles over the network
  • map does not make sense AS the KEYS MAY CHANGE!!!
    • Use mapValues WHEREVER possible!
3.3 Optimizing with partitioners
  • Why partioning?
  • Because it avoids network shuffles!!
  • Range partitioners can further optimize reduceByKey -- 9x speedup over groupByKey!
  • Gives a great example of join always shuffling data
  • At the start of the program, partition the big dataset
  • We want to shuffle the small dataset, not the bigger one
  • A shuffle occurs when the resulting RDD depends on elements from other partitions of the same or another RDD -- watch out for operations that can cause a shuffle!
3.4 Wide vs Narrow Dependencies
  • Not all transformations are created equal! (eg. shuffle operations take MUCH longer)
  • Computations are represented as a lineage graph (DAG)
    • RDDs are depicted as partitions (atomic pieces of dataset), and the dependencies
  • Transformations cause shuffles
    • Narrow dependency -- each partition of the parent is used by at most one partition of the child RDD
    • Wide dependencies -- each partition of the parent may be used by multiple partitions of the child RDD
    • groupByKey has wide dependencies
    • NICE EXAMPLE on a graph
    • operations like join can cause shuffle (depending on whether you were smart to partition or not)
  • Spark's dependencies method!! and toDebugString
  • Lineage graphs are the key to fault tolerance!
    • You can go up the graph to recompute a failed partition!
    • For wide dependencies.. may need to recompute many partitions above! -- MIND BLOWN!

Week 4: SQL, Dataframe and Datasets

4.1 Structured vs Unstructured data
  • Goes over an example of joining two RDDs
    • Compares three different approaches -- cartesian
    • filter before join is much faster than join then filter.
    • Gets the same result! But some methods are wasteful
    • Could Spark be smarter about this, i.e. optimize the code itself?
  • Different types of data
    • Unstructured -- log files, images etc.
    • Semi-structured -- JSON, XML etc.
    • Structured -- Database tables
    • RDD has no idea about the structure of the data, other than what type of object it is, so it can't optimize
    • Given a structure, spark can optimize!
    • Databases are more organized and optimized
    • Spark SQL will optimize!
4.2 Spark SQL
  • SQL is the lingua franca
  • Spark SQl adds relational processing
    • A spark module, a library on top of spark
    • Catalyst a query optimizer
    • Spark SQL is top of spark, operates on RDDs under the hood.
  • Goes over SQL lingo
  • DataFrames REQUIRE some schema! -- DataFrames are untyped~~
    • Hence their transformations are untyped transformations
  • Can create dataframe from an RDD by either inferring schema, or by specifying
  • With case classes, column names are inferred!
  • Goes over creating DF from RDD[Row]
  • On a DataFrame, can call createOrReplaceTempView -- then just SQL away!! Gives the people example (see the sketch below)
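
A sketch of the flow from case class to SQL (Person and the data are made up; spark is the SparkSession from the earlier sketch):

    import spark.implicits._

    case class Person(name: String, age: Int)
    val peopleDF = List(Person("Ana", 29), Person("Bo", 17)).toDF()   // column names inferred

    peopleDF.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
    adults.show()
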
4.3 Dataframe (1)
  • Just running over RDDs
    • Applies relational optimizations from the DB community
    • Are untyped -- damn Rows! The Scala compiler does not type check them
    • Goes over the primitive types and the complex types available (Map, Struct etc)
      • Case classes map to a Struct
    • Dataframe API looks similar to a SQL api
      • groupBy, select etc
    • Working with columns has three ways -->
      • Using $ notation
        • $"age">10
      • Refer the dataframe
      • SQL Queries inside strings
    • GroupBy returns the RelationalGroupedDataset datatype
4.4 Dataframe (2)
  • Handling missing data (Nan and null) (drop, fill, replace)
  • In join MUST specify the columns to join on
  • Goes over a join example
  • The Query optimizer takes care of the optimization for you!
    • 4x faster!!
    • Spark optimizes for you! Gives you optimized RDDs
    • The catalyst can reorder and know which data is not used, so less data around the network!
      • Catalyst reorder partitions, reduces amount of data to shuffle around, removes unwanted partitions
      • Due to being restricted to SQL data types, Tungsten optimizes how to pack data into memory
        • Stores data in a columnar format
  • Limitations
    • Errors happen at runtime!!
    • Limited to same data types
4.5 Datasets!
  • Row is not convenient, need to cast them back using asInstanceOf
    • but it is a headache, can use the printSchema to get the schema
  • BUT we want both type safety and optimizations!
    • Enter Datasets!
    • Dataframe = Dataset[Row] --> Dataframe is a type of a Dataset
    • A combo of RDD and Dataframe
      • Mix and match RDD and Dataframes.
      • REQUIRE structured/semi structure data (must have some sort of schema)
    • Now we can mix and match APIs!
    • Can convert RDDs and DataFrames to Datasets
  • Datasets require TypedColumn and not just Column (as datasets are typed!)
    • Has both typed and untyped transformations
  • Can call map on Dataframe -- gives a dataset; but need to typecast manually!
  • Operations on RDD are not all available on Datasets (and also different sometimes!)
    • groupByKey returns the dreaded KeyValueGroupedDataset
      • Two set process to get back a dataset
  • Pay attention to the TypedColumn!
  • KeyValueGroupedDataset --> mapGroups and flatMapGroups
    • But where the hell is reduceByKey?! (Datasets do NOT have it!)
    • mapGroups requires a full shuffle!! (Do NOT use them!)
    • Docs suggest an Aggregator.
      • has three types [-IN, BUF, OUT]
      • Need to define the functions zero, reduce, merge and finish
      • Also need encoders!! --> Encoders.STRING (see the sketch at the end of this section)
  • Compares Dataframes vs RDDs vs Datasets
  • RDD is the core abstraction ! Others are built on it
  • Dataframe is the best optimization! (but untyped!)
  • Not everything on a Dataset can be optimized, e.g. a functional lambda filter cannot be optimized
  • Can optimize relational queries better, as it knows the columns required
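
The promised Aggregator sketch (Purchase and the grouping are made up, following the zero/reduce/merge/finish shape mentioned above):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Purchase(customer: String, amount: Double)

    val totalAmount = new Aggregator[Purchase, Double, Double] {
      def zero: Double = 0.0                                     // neutral element
      def reduce(buf: Double, p: Purchase): Double = buf + p.amount
      def merge(b1: Double, b2: Double): Double = b1 + b2
      def finish(buf: Double): Double = buf
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble  // encoders for BUF and OUT
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }.toColumn

    // usage sketch, assuming purchases: Dataset[Purchase]
    // purchases.groupByKey(_.customer).agg(totalAmount)
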