Scala Course

Best links

https://gist.github.com/sgoyal1012/98ac2d289ddef3c1b3e4c4bb9a6093c6#file-best-links-md

Course 1. Functional Programming Principles in Scala

Week 1 -- Getting Started

1.1 Programming Paradigms

  • Functional programming is a paradigm
  • Imperative programming
    • Java!
    • Von Neumann Computer
    • One needs to conceptualize data structures word by word
    • Mathematical theories vs programming
    • Le Voila! Functional programming
  • Functional programming can be seen as a calculator -- use the REPL
    • Restricted (no mutable variables, eg. Pure LISP) vs wider (focus on functions, eg. Scala)
    • Functions are FIRST CLASS CITIZENS
  • Why functional programming?
  • Good for exploiting parallelism

1.2 Elements of Programming

1.3 Evaluation Strategies

  • CBV and CBN evaluation/termination etc
  • Scala normally uses CBV, as it is usually more performant and plays nicer with imperative programming

1.4 Scala if-else

By-name parameters are only evaluated when used. They are in contrast to by-value parameters. To make a parameter called by-name, simply prepend => to its type.
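
A minimal sketch of the difference, following the lecture's constOne/loop style of example:

    def byValue(x: Int): Int = x + x      // x is evaluated once, before the call
    def byName(x: => Int): Int = x + x    // x is evaluated each time it is used, and only if used

    def loop: Int = loop                  // a non-terminating expression

    def constOne(x: Int, y: => Int) = 1
    // constOne(1, loop) terminates because y is never used;
    // constOne(loop, 1) does not, because the by-value argument is evaluated first.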

1.5 Square root using Newton's method

  • Recursive functions NEED an explicit return type (the compiler cannot infer it); for other functions it is optional (Duh)

1.6 Blocks and Lexical Scope

  • Nested functions -- put auxiliary functions inside the function sqrt
  • Blocks

1.7 Tail Recursion

  • Mostly solved by storing an accumulator or intermediate result.
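
A minimal sketch of the accumulator trick (factorial, assuming Int is big enough for the inputs here):

    import scala.annotation.tailrec

    def factorial(n: Int): Int = {
      @tailrec
      def loop(n: Int, acc: Int): Int =
        if (n == 0) acc else loop(n - 1, acc * n)   // the recursive call is in tail position
      loop(n, 1)
    }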

Assignment

  • Stable identifiers!!

Week 2 - Higher Order Functions

2.1 Higher Order Functions

  • Functions are First Class Values
  • Higher order function operates on other functions!
    • A function becomes a parameter of another function
    • Anonymous Functions -- passing named functions around can get tedious; to avoid that, A FUNCTION HAS NO NAME.
      (x:Int) => x*x*x
      
    • It is Syntactic Sugar: looks neater, but does not add expressive power.

2.2 Currying

  • A function returns another function!
  • Eg. sum(cube)(1,10) applies sum to cube, and then the resulting function is applied to (1,10). Function application associates to the left.
  • Just combine the two functions and write them one after the other!
  • Instead of passing parameter lists, we pass anonymous functions one after the other -- CURRYING
  • Goes over an example of currying using the product function ***Good example -- https://docs.scala-lang.org/tour/currying.html *** (see the sketch below)
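
A small sketch of the curried sum from the lecture, to make the "one parameter list after the other" idea concrete:

    def sum(f: Int => Int)(a: Int, b: Int): Int =
      if (a > b) 0 else f(a) + sum(f)(a + 1, b)

    def cube(x: Int): Int = x * x * x

    sum(cube)(1, 10)   // sum applied to cube, and the resulting function applied to (1, 10)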

2.3 Fixed points of functions

  • A value x with f(x) = x is a fixed point of the function f
  • SQRT is a special case of fixed point.
  • Average Damping
  • Abstractions can be combined into higher order functions to create new abstractions.

2.4 Scala Syntax Summary

2.5 Functions and data

  • Define classes
  • Scala keeps types and values in different namespaces
  • toString needs the override modifier!

2.6 More fun with rationals

2.7 Evaluation and operators

Assignment - Functional sets

Represent sets as functions.

  /**
   * Returns the set of the one given element.
   */
  def singletonSet(elem: Int): Set = {
    x: Int => x.equals(elem)
  }

Week 3 - Data and Abstraction

3.1 -- Class hierarchies

  • Dynamic binding/Late binding -- https://stackoverflow.com/questions/19017258/static-vs-dynamic-binding-in-java
  • Abstract Class
    • Permissible to have unimplemented methods if the class is abstract; otherwise the compiler complains that method XXX is not defined!
    • Gives an example of implementing set as Binary Trees. (Persistent data structures, old data structure does not go away)
    • Subclasses and superclasses -- without an explicit superclass, Object is assumed; it is the root of all classes
    • Subclass can also override the non abstract methods.
  • Objects -- A Singleton for a class!!
  • Introduces main class!
  • Dynamic method dispatch --> Code invoked by a method call depends on the runtime type of the object.

3.2 -- How classes are organized

  • Organized in packages. -- package clause at top of the file.

    • Fully qualified name
    • Some entities (like Int etc are automatically imported)
  • Traits

  • Single inheritance language
  • Essentially like an abstract class.
  • A class can inherit from MANY traits.
    • Can contain fields and concrete methods.
  • Scala class hierarchy -- at the VERY TOP is scala.Any
    • Numeric types are not in a strict subtype hierarchy, but can convert into each other.
    • Any, AnyRef, AnyVal
    • Type Nothing at the bottom
      • Signals abnormal termination
      • Empty collections
    • null --> Null
      • Int CANNOT be null! (null works only for AnyRef types, not AnyVal)

3.3 -- Polymorphism

  • Type parameterization
    • Classes and methods can have type.
    • Immutable lists -- cons lists
    • Definition generalized using a type parameter [T]
    • NOTE -- Defining members as val makes them accessible fields.
  • Functions can also have type parameters.
    • Scala can figure out the type parameter based on the parameters passed!
  • Type parameters do not affect evaluation! -- Type Erasure
    • Only important for the compiler, types erased at runtime.
  • Polymorphism
  • Subtyping
  • Generics
Assignments!

Very nice assignment on TweetSets

Week 4: Types and Pattern Matching

4.1: Objects everywhere

  • Pure object oriented language --> Everything is an object! Is Scala one of these?
  • Pure Booleans
    • true and false becomes Constant objects!
    • Even Int can become a class!
  • Exercise on natural numbers. -- Very good!!!

4.2: Functions as objects?

  • Function types are.. IN FACT objects in Scala: the type A => B is just an abbreviation for the trait Function1[A, B]

  • Anon functions lead to AnonFun

  • Calling a function is f.apply() -- Functions are objects!

Eg.

class AddOne extends Function1[Int, Int] {
  def apply(m: Int): Int = m + 1
}
  • def is NOT an object! When, for example, a def f is used where a function value is expected, it is converted into one -- ETA expansion.

Useful -- https://twitter.github.io/scala_school/basics2.html#apply

4.3: Subtyping and generics

  • Base Type

    • assertAllPos example
    • Conforming to the bound --> S <: T means S is a subtype of T; S >: T means S is a supertype of T
    • Can also mix lower bound and upper bound
    • Given S <: T, then List[S] <: List[T], because Lists are covariant. Covariance -- https://docs.scala-lang.org/tour/variances.html
  • In Java, arrays are covariant, but they do type checking for every array to prevent issues.

    • Why was it done? Because they wanted a sort to work for everything.
  • Barbara Liskov's Substitution Principle

If A<:B, then everything you can do with B, you should be able to do with A.

In scala, Arrays are NOT Covariant (for why, see more below)

4.4: Variance

  • Lists are immutable, arrays are mutable. Immutable types can be covariant, if some other additional conditions are also met.
  • Types can be covariant C[+A], contravariant C[-A], or nonvariant C[A]
  • Typing rules for functions --
For a function, if A2 <: A1 and B1 <: B2, then A1 => B1 <: A2 => B2.

Functions must be contravariant in their argument types and covariant in their result types
Example -->
trait Function1[-T, +U] {
  def apply(x: T): U
} // Variance check is OK because T is contravariant and U is covariant

class Array[+T] {
  def update(x: T): Unit
} // variance check fails: the covariant T appears in a parameter position

MIND BLOWN! --> Function trait (checked in Scala code) becomes

trait Function1[-T, +U] {
 def apply(x: T) : U
}
i.e. return type is covariant, argument type is contravariant.

  • Can you just put + and - wherever you want?! NOOO -- otherwise we could make Array covariant. The Scala compiler checks that there are no problematic uses of variance. Roughly: covariant type parameters may only appear in method results, contravariant type parameters only in method parameters, and nonvariant (invariant) type parameters can appear anywhere.

Example from Piyush 007 -- https://gitlab.criteois.com/p.narang/scala-class/blob/master/src/main/scala/com/criteo/scalaclass/generics/Generics.scala

  • Takes the example of Nil from the Cons List example. Nothing is a subtype of everything.

  • Making a class covariant can be a lot of work! -- The prepend example

Assume the similar Cat, Dog, Animal inheritance tree used earlier, plus the following:

abstract class SmallAnimal extends Animal
case class Mouse(name: String) extends SmallAnimal
Suppose we’re working with functions that accept types of animals, and return the types of food they eat. If we would like a Cat => SmallAnimal (because cats eat small animals), but are given an Animal => Mouse instead, our program will still work. Intuitively an Animal => Mouse will still accept a Cat as an argument, because a Cat is an Animal, and it returns a Mouse, which is also a SmallAnimal. Since we can safely and invisibly substitute the former with the latter, we can say Animal => Mouse is a subtype of Cat => SmallAnimal.

4.5 Decomposition

  • Expr, Number and Sum example; trait expr
  • Too tedious!! Too many methods for each new piece of functionality -- a quadratic blow-up of methods.
  • Maybe use Type Casts? Nooo that sucks! Scala does not recommend it.
  • Define eval in every class -- but that way need to touch all classes if you add a method to the base trait.

4.6 Pattern Matching

  • All previous methods have shortcomings.
  • Case Classes -- automatically add companion objects -- a generalization of the switch statement
    • Can match on multiple expressions
    • Patterns are tried in the order they are written
    • Can match constructors too, e.g. Sum(x1, x2)
    • You can have the match inside the class/object itself, e.g. this match { ... }
  • Is not always good! Analyze the use cases and decide.. The Expression Problem

4.7 Lists

  • Lists are immutable, and recursive
    • Similar to the Cons lists implemented before.
    • All lists are constructed from the empty list Nil using ::, called cons, eg. fruit = "apples" :: ("banana" :: Nil); List(x) == x :: Nil
    • Cons associates to the right -- A :: B :: C is interpreted as A :: (B :: C)
  • Insertion Sort --good example!
Homework -- VERY GOOD HUFFMAN ASSIGNMENT
  • Tuple unpacking -- never forget it again!
def makeOrderedLeafList(freqs: List[(Char, Int)]): List[Leaf] =
  freqs.sortBy(_._2).map {
    case (c, count) => Leaf(c, count)
  }
Write tests this way!!
trait TestTrees {
  val t1 = Fork(Leaf('a',2), Leaf('b',3), List('a','b'), 5)
  val t2 = Fork(Fork(Leaf('a',2), Leaf('b',3), List('a','b'), 5), Leaf('d',4), List('a','b','d'), 9)
}

Week 5-- More Lists

5.1 More Functions on Lists

  • Goes over the functions on lists like drop, tail.. xs(n)
  • Creating new lists --> xs ++ ys. xs updated (n,x)
  • Finding elements --> xs indexOf x, xs contains x
  • Implementation of last, init, reverse and concat in a recursive manner.
  • RemoveAt in terms of take and drop

5.2 Pairs and tuples

  • Merge sort example.
    • Merge function (Compare head elements with each other)
    • splitAt function returns a pair, eg. ("answer", 42) is a pair.
  • Tuple
    • _1 and _2 can be used to access the fields
    • Pattern match in a symmetric way using tuples
case (Nil, ys) => ys
case (xs, Nil) => xs
  • Run case matching on tuples!

5.3 Implicit parameters

  • Can make merge sort with [T]..but then not every type defines a comparator <, so compiler cannot be sure.
  • Can parametrize with a function -- Use a function lt -- put function value last in your parameters..gives compiler an idea of the types of parameters.
  • trait Ordering[T] as a parameter -- leave out the ord in the call after you mark as implicit..the compiler figures out the correct ordering
  • Compiler determines implicits according to the type of arguments, and whether they are in scope at that point. Failing rules, it is an ERROR.
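
A sketch of merge sort with an implicit Ordering[T], roughly as in the lecture (leave out ord at the call site and the compiler finds it):

    def msort[T](xs: List[T])(implicit ord: Ordering[T]): List[T] = {
      val n = xs.length / 2
      if (n == 0) xs
      else {
        def merge(xs: List[T], ys: List[T]): List[T] = (xs, ys) match {
          case (Nil, ys) => ys
          case (xs, Nil) => xs
          case (x :: xs1, y :: ys1) =>
            if (ord.lt(x, y)) x :: merge(xs1, ys) else y :: merge(xs, ys1)
        }
        val (fst, snd) = xs splitAt n
        merge(msort(fst), msort(snd))
      }
    }

    msort(List(3, 1, 2))            // Ordering[Int] supplied implicitly
    msort(List("banana", "apple"))  // Ordering[String] supplied implicitly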

5.4 Higher order functions on lists

  • The map method to transform every element of a list.
  • The filter method to filter out elements based on a condition.
  • Other functions.. such as filterNot, partition, takeWhile, dropWhile, span -- Nice functions!

5.5 Reduction on lists

  • reduceLeft for sum and product.
  • foldLeft takes in an accumulator
  • Implementations of both foldLeft and reduceLeft
  • foldRight and reduceRight --> parenthesis go to the right.
  • foldLeft and foldRight give the same result if the operator is associative and commutative -- good example of when this swap does not work for concat!

5.6 - Reasoning About Concat

  • What does it mean for a functional program to be correct? Eg. of concat, it is associative etc.
  • Structural Induction -- eg. of Natural Induction -- class 8 Maths! (Factorial example)
  • Structural Induction
    • To prove a property P(xs) for all lists xs, show base case, and then prove for x::xs
    • Shows the associative property for concatenation.
    • Induction hypotheses

5.7 - A Larger Equational Proof on Lists

  • Larger/More difficult example
    • reverse -- proving by induction does not work
    • Replace xs.reverse by ys; and prove that.

Week 6: Collections

6.1 - Other Collections

  • Lists' access depends on the index (element at index 0 is faster to get than in the middle)
  • Vector -- up to 32 elements it is a flat array; beyond that it becomes a tree of nodes, each pointing to up to 32 children of 32 elements each.
  • Which is better depends on access patterns: for recursive head/tail-style access List wins, otherwise Vector is much better
    • Does not have ::, but has +: and :+
    • Vectors are also immutable! Goes over insertion of an element. -- create many objects!
  • Seq is the base class --> the common base class for List and Vector
    • At the apex is Iterable
  • Goes over Array and String
  • Range syntax --> 6 to 1 by -2 --> lower bound to upper bound by step
  • Goes over sequence operations
    • exists and forall
    • zip and unzip
    • flatmap
    • sum, max etc for numeric lists
  • Goes over examples

6.2 - Combinatorial Search and For-Expressions

  • Handling nested sequences

    • Eg. of generating pairs of integers < 7 where sum is prime.
    • IndexedSeq ---> is implemented using Vector.
    • To flatten, can use foldRight or flatten.
      • xs flatmap f = (xs map f).flatten
  • Program can be difficult to understand! due to abstraction.

  • Introduces the for expression --> for (s) yield e, where s is a sequence of generators and filters.

  • Expresses the isPrime problem in nested for.
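
The nested-for version of the prime-sum pairs, as a small sketch:

    def isPrime(n: Int): Boolean = (2 until n).forall(n % _ != 0)

    val pairs =
      for {
        i <- 1 until 7
        j <- 1 until i
        if isPrime(i + j)
      } yield (i, j)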

6.3 - Combinatorial Search Example

  • Sets

    • Sets are unordered
    • Do not have duplicate elements.
    • contains method.
  • N Queens problem -- VERY NICE PROBLEM!

    • Recursive solution --> more than one solution!
    • zip is very useful!!!
    • The diagonal check!
  • Maps

    • Applying a map to a missing key throws NoSuchElementException -- the get method returns None instead.
    • Introduces the Option type. --> Some and None --> pattern matching
  • Sorted and GroupBy

    • sorted is natural ordering, sortWith is according to a rule.
  • The polynomial example

    • Concat, ++ superimposes instead of adding.
    • Defines a function by himself, adjust!
    • withDefaultValue!! -- simplifies it.
    • Repeated parameters --> bindings: (Int, Double)*
    • Uses foldLeft -- is more efficient

6.5 - Putting the Pieces Together

  • Mnemonics for a phone number
  • Comparing the lines of code for scripting vs general languages.
  • Inverts the map using for
  • Map can also be used as a function!
  • Writes the encode function.

6.6 - Conclusion

Assignments -- Anagrams
  • The identity function -- awesome!
  • s flatmap wordOccurences

Great assignment !! -- TOO DIFFICULT!

Course 2. Functional Program Design in Scala

Week 1: For Expressions and Monads

Recap

  • Recap on case classes

    • JSON example, tries to represent it using abstract classes and case classes; recursive examples for toString method.
    • By itself, an expression is not typable -- JBinding, String example. Goes over trait Function1
    • We can subclass the function type, for eg. Seq[Elem] extends (Int => Elem)
    • scala MatchError
  • Partial Function -- mind blown!

    • Define a function as Partial Function; isDefinedAt to check
    • WARNING!! -- isDefinedAt only applies to the outermost block of matching; it gives no guarantee for nested matches (see the sketch at the end of this recap).
  • Recap on collections

    • All have map, flatMap, filter, foldLeft, foldRight
    • idealized implementation of map, flatMap and filter on Lists
    • In practice the implementations differ, so that they apply to arbitrary collections and are tail-recursive
    • For expressions --> simplify combinations of core methods map, flatMap and filter -- the Prime example is much simpler!
  • How for translates to map, flatMap

    • The first generator becomes a flatMap, and the translation is applied recursively to the rest of the for-expression
    • Generators can also have Patterns!! JSON example -- translates to withFilter on the pattern.
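
The promised partial-function sketch (made-up values, just to show isDefinedAt and the nested-match caveat):

    val f: PartialFunction[String, String] = { case "ping" => "pong" }
    f.isDefinedAt("ping")   // true
    f.isDefinedAt("pong")   // false

    val g: PartialFunction[List[Int], String] = {
      case Nil => "empty"
      case x :: rest =>
        rest match { case Nil => "single" }   // nested match is NOT covered by isDefinedAt
    }
    g.isDefinedAt(List(1, 2, 3))   // true, yet g(List(1, 2, 3)) throws a MatchError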

Lecture 1.1 - Queries with For

  • for can be considered as querying a database.
  • The book database example.
    • Nice example of yielding results to find authors who have written at least two books.
    • Can use distinct
  • map, flatmap and filter can all be expressed in for.

Lecture 1.2 - Translation of For

  • Goes over translation of for in terms of map
    • withFilter
    • Two generators go to flatMap
    • Every translation steps has one less generator
    • Goes over the isPrime example.
  • As long as you have map, flatMap and filter, can use any datatype for for

Lecture 1.3 - Functional Random Generators

  • for expressions are NOT TIED to collections, all you need is map, flatMap and filter.
  • Generator example.
    • Can use for for it, but need a map and flatmap definition.
    • self example (because there are two generate fns)
    • List generators; trees example.
  • Random generators can be used for unit testing.

Lecture 1.4 - Monads

  • Has two functions flatMap and unit

    • List is a monad
    • andThen for function composition
  • Monad Laws

    • Associativity , Left unit and Right unit
    • Proves them on Option class -- flatMap proof is a bit involved.
  • Significance of laws

    • Thanks to associativity, the two generators can be inlined into one for-expression.
  • The Try type

    • Success and Failure
    • Composing with for
    • Bulletproof example - an expression composed of Try, map and flatMap will never throw a non fatal exception.
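
A minimal sketch of the Monad signature and its laws as stated in the lecture (unit is the constructor of the concrete type, e.g. List(x) or Some(x)):

    trait Monad[T] {
      def flatMap[U](f: T => Monad[U]): Monad[U]
    }

    // Laws, for m a monad value, f and g functions, unit the constructor:
    //   m.flatMap(f).flatMap(g) == m.flatMap(x => f(x).flatMap(g))   // associativity
    //   unit(x).flatMap(f)      == f(x)                              // left unit
    //   m.flatMap(unit)         == m                                 // right unit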

Week 2: Lazy Evaluation

Lecture 2.1 - Structural Induction on Trees

  • Structural induction on Trees
    • IntSet --> contains and incl
    • Three laws for IntSet, for eg. --> (s incl x) contains y = s contains y (for y != x)
    • Three laws completely define IntSet
    • Base case is Empty, induction step is NonEmpty
    • Five cases for the third law

Lecture 2.2 - Streams

  • Prime number example -- calculate every prime number and just take the top two, using the for code.
    • Compute the tail only IF needed.
    • Avoid computing the tail until on demand --> Stream
    • Range code is similar but behavior is completely different -- creates only on demand!
    • :: produces a list; #:: produces a stream (identical to Stream.cons)
  • Implementations are defined in the Stream object
  • The difference is that the tail parameter is call-by-name (remember, prepend => to the type to make a parameter call-by-name!)
  • StreamRange example question

Lecture 2.3 - Lazy evaluation

  • Do things as late as possible, never do them twice
    • Instead of computing the tail every time, can store the result and re-use it --> lazy evaluation vs call-by-name evaluation
    • Haskell is purely lazy
    • By default Scala does strict evaluation, but allows lazy evaluation (lazy val)
  • lazy val vs def --> the same the first time.. but after the first time, a lazy val is not recomputed --> very good example (see the sketch after this list)
  • Stream's tail becomes a lazy val
    • Revisits the isPrime example
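
The lazy val vs def quiz from the lecture, as a sketch (what gets printed?):

    def expr = {
      val x = { print("x"); 1 }        // evaluated right here, prints "x" once
      lazy val y = { print("y"); 2 }   // evaluated on first use only, then cached
      def z = { print("z"); 3 }        // re-evaluated on every use
      z + y + x + z + y + x
    }
    expr   // prints "xzyz": x up front, z twice, y once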

Lecture 2.4 - Computing with Infinite Sequences

  • Since calculation is done ONLY when needed, good to define infinite sequences.
    • Define and then take upto the where you need them.
  • Calculation of prime numbers --> Sieve of Eratosthenes (sketch below)
    • Excellent implementation!!
    • Streams to converge approximation of sqrt
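
A sketch of the sieve on a lazy sequence (the course uses Stream; LazyList is the modern equivalent):

    def from(n: Int): LazyList[Int] = n #:: from(n + 1)

    def sieve(s: LazyList[Int]): LazyList[Int] =
      s.head #:: sieve(s.tail.filter(_ % s.head != 0))

    val primes = sieve(from(2))
    primes.take(10).toList   // List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)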

Lecture 2.5 - Case Study: the Water Pouring Problem

  • Explains the problem, and the methods/classes to represent.

    • Try all paths until you have the destination.. i.e. the glass with the correct amount
    • case class Capacity, trait Move with moves Empty, Fill and Pour
    • def change in the trait to be defined by every case class above.
    • Path history, foldRight
    • Calculate ALL the paths and then filter by the solution (i.e. paths with end state as the glass with a volume)
  • NOT VERY EFFICIENT

    • Do not revisit the states already explored!
  • TIP -- Name everything you can!, do not put all operators in one line and keep scope for future changes!

  • https://alvinalexander.com/scala/how-to-control-visibility-constructor-fields-scala-val-var-private

Assignment: Bloxorz
Nice ASSIGNMENT! Made a silly mistake!!!

Week 3 -- Functions and State

3.1 Functions and state

  • Goes over function rewriting.
  • Rewriting always leads to the same solution.
    • Any order of rewrite leads to the same solution!
  • But the world is made of stateful objects --> an object has state if its behavior changes over time (e.g. a bank account)
  • var is a mutable definition (can be reassigned)
  • Gives the BankAccount example and definition.
  • The Cons Stream's example -- assuming operations on Stream are purely functional, unless you have side effects.
  • Does not have to be a variable change.. FOCUS ON THE BEHAVIOR! (if behavior changes => stateful)

Lecture 3.2 - Identity and Change

  • Referential Transparency
  • Operational equivalence -- two definitions x and y are operationally equivalent if no possible test can distinguish them.
    • Goes over proving the above in the BankAccount example.
  • The substitution model does not HOLD!!

Lecture 3.3 - Loops

  • Loops are NOT essential for imperative programming.
  • Defines while as a function WHILE --> the condition must be passed by name (so it is re-evaluated every time; see the sketch after this list)
  • Also defines REPEAT
  • for loops cannot be represented by just higher order functions.
    • Like for expressions.. but uses foreach
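
A sketch of WHILE as a plain function; both parameters are by-name, so the condition is re-evaluated on every iteration:

    def WHILE(condition: => Boolean)(command: => Unit): Unit =
      if (condition) {
        command
        WHILE(condition)(command)   // tail-recursive, so it runs like a loop
      }

    var i = 0
    WHILE(i < 3) { i += 1 }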

Lecture 3.4 - Extended Example: Discrete Event Simulation

  • Digital Circuits (ECE days are back!!...)
  • Base components -- AND, OR and NOT (with a delay)
  • Creates an half adder using just these basic gates.
  • Defines the classes/functions Wires and gates
  • Uses HalfAdder to define a FullAdder

Lecture 3.5 - Discrete Event Simulation: API and Usage

  • Action --> a function that has side effect (return type Unit)
  • time is simulated, not actual time.
  • Wire class -> getSignal, setSignal, and actions to be run when the signal changes.
  • Goes over the implementation of all the gates.

Lecture 3.5 - Discrete Event Simulation: Implementation and Test

  • Agenda is a list of Events to be performed at a time; they perform actions
  • Action goes into the Agenda.
  • Probe is also defined to get the values of the wire during the simulation
  • Runs a simulation (that has delays of time as well)
  • Goes over OrGateAlt -- an implementation in terms of AND and NOT
  • Losing referential transparency, more confusing

Week 4 -- Timely Effects

Lecture 4.1 - Imperative Event Handling: The Observer Pattern

  • Observer Pattern -- Remember MVC from Objective C days!! (Publisher-Subscriber)

    • Goes over the Publisher trait
    • Shows the example of BankAccount in this pattern.
    • A subscriber Consolidator that subscribes to all the bank accounts. _ Nice example!!
  • Goes over the good and the bad (bad: the return type is Unit, which forces an imperative style, and things get tricky with CONCURRENCY!!)

  • We will study how to go from Imperative Functional Programming to Functional Reactive Programming

Lecture 4.2 - Functional Reactive Programming (FRP)

  • Signals that changed with time.

    • Mouse moved example -- Signal[Position]
  • frp.Signal <--> Scala.React

    • Get the value of the signal at the current time.
    • Get value of a signal in terms of other signals.
    • Introduces Var ---> for signals that can be updated/changed (the update method, which is also defined for Arrays).
  • Important -- Defining a signal in terms of other Signal A = f(Signal B) for example -- the relation is maintained going forward. (forever in the future!)

  • Goes over the BankAccount example.

    • Balance becomes a Signal[Int]
    • A signal cannot be a function of itself!! balance() = balance() * amount IS WRONG!!
      • Instead get the value of balance in a val and then do it
  • IMPORTANT!

  • Diff between b = b + 1 and b() = b() + 1 -- the latter makes NO SENSE, as the first updates a value once, while the second would define a relation that has to hold for all time.

Lecture 4.3 - A simple FRP implementation

  • Every signal has (1) its current value, (2) the current expression defining it, (3) all of its observers that need to be re-evaluated if the signal's value changes.
  • How to know on whose behalf a signal is updated.
  • Need to track callers in a stack-like fashion -- defines a class of stackable variables
  • The type Signal[_] means it can hold signals of any type.
    • NoSignal -- a sentinel signal.
  • Adds the current caller to the signal
    • A signal cannot observe itself (duh) -- Catches the cyclic dependency
    • computeValue() does the propagation
    • NoSignal's computeValue does nothing.
  • Problems with parallelism!!
  • Going to thread local -- Dynamic Variable
  • Pass current value of signal as a implicit parameter

Lecture 4.4 - Latency as an Effect 1

  • Introduces the monad Future (Computations can take time!!!)
    • Reading from memory takes time, and sending a packet to Europe takes time!
    • Converting into a human time format, that makes sense! --> sending a packet to Europe in this scale takes 5 years!
    • Do not have to wait for a computation, Callback -- mind blown!!*

Lecture 4.5 - Latency as an Effect 2

  • A Monad -- Future

    • Handles both the idea that computations can fail, and can also take time.
    • Function onComplete --> think of it as an envelope given to the calculator and asking him to find the result, fill the envelope and return it back to us.
    • Ignoring implicit execution context for now.
    • Function can match to Success or Failure
      • Can have a onComplete that takes two callbacks (one for success/failure)
    • readFromMemory returns Future[Array[Byte]]
    • Future's apply returns a future that computes asynchronously (Future just expects a Body)
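
A tiny sketch of the Future/onComplete flow (the "slow read" is faked; the implicit global execution context stands in for the one ignored above):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    val packet: Future[Array[Byte]] = Future {
      Array[Byte](1, 2, 3)   // stands in for a slow readFromMemory()-style computation
    }

    packet.onComplete {
      case Success(bytes) => println(s"got ${bytes.length} bytes")
      case Failure(error) => println(s"failed: $error")
    }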

    Lecture 4.6 - Combinators on Futures 1

    • Functions on future.. all the usual flatMap etc, along with recoverWith
    • flatMap cleans up the code and makes it nicer -- Very nice examples!
      • zips two futures (does not work well, since zip fails when either of the futures being zipped fails)
    • recover and recoverWith

    Lecture 4.7 - Combinators on Futures 2

    • Writes a better method fallBackTo --> expressed in terms of recoverWith
    • Introduces blocking, the Awaitable trait, but IS NOT RECOMMENDED!!
      • You should NEVER EVER BLOCK in production!!

    Lecture 4.8 - Composing Futures 1

    • for comprehensions (remember brightroll?!!)
    • Defines retry
      • In terms of fallBackTo recursion

    Lecture 4.9 - Implementation of flatMap on Future

    • Implementation of flatMap in terms of CallBack

    Lecture 4.10 - Composing Futures 2

    • Goes over reminding us of foldLeft and foldRight
    • Implements retry in terms of foldLeft and foldRight
    • Sometimes straight recursion is much simpler to understand!

    Conclusion, FIN.

    Assignment
    • Uses Signal's -- FRP

Course 3. Parallel Programming

Week 1: Introduction to Parallel Programming

1.1 What is parallel computing?
  • Many calculations (can be done in parallel), same time
  • Amdahl!!
  • After a while, hit a power wall
  • Just increasing CPU frequency does not help, so added multi-cores!
  • Parallel programming is HARDER than sequential
  • But we need to do it for SPEEDUP (No other reason)
  • Parallel vs Concurrency
  • Parallel -- Uses Parallel hardware for more efficiency (eg. matrix multiplication, computer graphics) - speedup
  • Concurrent -- Improves modularity (eg. async apps such as web servers, databases) -- convenience
  • Both of these can intersect each other
  • Different granularities of Parallelism
    • Bit-level parallelism and Instruction-level parallelism and Task-level parallelism
    • Different types of hardware -- Multi core processors, Symmetric multiprocessors, GPU, FPGA, Computer Clusters
1.2 Parallelism on the JVM
  • What is an operating system? -- Software that manages hardware and software resources, and schedules program execution
  • Process - An instance of a program executing in the OS (a process has a PID) -- the most coarse-grained unit in parallel programming
  • Multitasking -- processes are isolated from each other and each owns its own memory area
  • Each process can have multiple threads (Threads share memory space) -- can be started within the same program, and can exchange information between one another
    • A thread has a program counter and a program stack
    • JVM process starts with main thread -- in a sequential program, only the main thread is used.
    • Goes over the Thread subclass
    • Goes over an example where the outputs can be different (threads are executed in parallel!)
    • To ensure a sequence of statements is executed atomically (in order, without interleaving) -- use the synchronized construct! -- at most one thread can hold a monitor at a time
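
A minimal sketch of synchronized for atomicity (the Account class is just illustrative):

    class Account(private var amount: Int = 0) {
      // at most one thread can hold this object's monitor, so the
      // read-modify-write below cannot interleave with another add
      def add(n: Int): Unit = this.synchronized {
        amount += n
      }
      def balance: Int = this.synchronized { amount }
    }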

1.3 Parallelism on the JVM II

  • Synchronizations can be nested! -- Gives the example of a bank transfer -- locks on both source and transfer account
    • Excellent example when a thread transfers from a1 to a2, and another from a2 to a1 --> DEADLOCK!
  • Resolving deadlocks
  • Acquire locks in the same order!
  • Memory model
  • How threads interact when accessing shared memory
  • Two threads writing to different memory locations --> no need to synchronize!
  • The thread join --> after joining a thread, all of its writes are visible to the joining thread.

1.4 Running Computations in Parallel

  • pNorm example
  • Uses recursion!!
  • MIND BLOWN! --> Need to use call by name! (need to pass unevaluated computations)
  • Memory contention!! -- Speedup takes into account not only cores, but also memory
  • Also you are only as fast as your slowest thread..
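
Why call-by-name matters here, as a sketch: with by-value parameters both computations would already have been evaluated (sequentially!) before parallel ever ran. The course provides the real construct; the body below just evaluates both, for illustration:

    def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
      // a real implementation would evaluate taskA and taskB on different threads
      (taskA, taskB)
    }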

1.5 Estimate pi -- monte carlo

  • Estimate using randomly sampling points to fall within a circle/square

1.6 First class tasks

  • Using the construct task instead of parallel.
    • Defines Task interface that takes in a computation as an argument
  • Breaks the p-Norm implementation into tasks
  • Defines parallel construct in terms of tasks

1.7 How Fast are Parallel Programs?

  • Empirical measurement vs asymptotic analysis (worst case run time)
  • Asymptotic analysis of sequential running time and parallel running time on the sum-segment example
  • Parallel program time depends on the number of resources --> introduces Work (all sequential) and Depth (Infinite parallel)
    • Can never do better than Depth (as that is the case when parallelism is infinite)
  • Amdahl's law returns!! -- the law of parallelism

1.8 Benchmarking Parallel Programs

  • Benchmarking vs Testing
    • Testing is binary, benchmarking gives a continuous value
  • Just timing is not enough -- make sure to take care of warm-up, eliminate outliers, and account for effects such as GC, JIT compilation, and aggressive optimizations
  • Enter ScalaMeter
  • Be wary of GC! -- runtimes can vary across runs
  • JVM may apply further optimizations -- ScalaMeter has withWarmer -- awesome!! -- Nice intro to ScalaMeter; can also measure memory footprint
Assignment -- Nice assignment on blurring in photoshop

Week 2: Basic Task Parallel Algorithms

2.1 Parallel Sorting

  • Uses an intermediate array -- breaks down sorting into smaller array sizes
  • merge and the copy functions
  • parMergeSort vs quickSort -- Uses ScalaMeter

2.2 Data Operations and Parallel Mapping

  • Gives examples of operations on collections (fold, scan, map etc.)
  • List is NOT good (as cannot find the middle easily) --> use Array and Tree
  • The naive parallel implementation (process the first element, cons it onto the processed rest of the list) is still O(n)!!
  • So we look at Array
    • Breaks into parallel implementation -- (1) Writes need to be disjoint (2) Threshold for sequential needs to be large enough to have efficiency!
  • Goes over implementation of Tree
    • leaf stores the num of elements, and nodes contain the elements
  • Compare arrays vs Trees --> in trees less worry about disjointness

2.3 Parallel Fold (Reduce) Operation

  • Goes over differences between foldLeft, foldRight, reduceLeft, reduceRight
  • Parallelism can be done ONLY on associative operations --- as we want to split into parts as we wish
    • Each associative operations can be shown as a tree --> leaves are the values and nodes are the operations.
    • For ordering for a Tree, can use list
    • Tree rotation preserves the result.
  • For arrays reduction, can convert into tree
  • Gives reduce and reduceSeg function. (reduce is done in a sequential fashion)

2.4 Associativity 1

  • Compares with commutative
  • Gives examples of operations that are both associative and commutative
  • Examples where it is associative but NOT commutative
    • String concat
    • Matrix multiplication
  • Example of where it is the opposite case
  • Should NOT LOSE associativity
  • Floating point operations are not necessarily associative! (Great examples!!) -- due to precision considerations
  • Trick to make a function commutative! --> just always pass the lesser argument first (using a comparison function)

2.5 Associativity II

  • Gives an example of associativity for tuples
  • Example of average using two reduces --> one for sum and other for length
    • Gives the case when commutativity also implies associativity (given the rotation condition)
    • Gives two examples of applying this theorem
  • Floating point numbers -- be careful!!
  • Sets example (Closure operation -- A Closure B)

2.6 Scan

  • scan is like a combo of map and fold
  • HOW to PARALLELIZE?!!
    • this value Depends on the previous value!
    • Do more work (store the intermediate results) --> parallelize, totally worth it.
    • Have trees that store intermediate values
    • Think of summing up trees from leaf to the head, storing results in an intermediate fashion as you go up.
    • upsweep and downsweep functions to the rescue
    • Application to arrays --> store the Segment of array in the non-leaf nodes
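
For intuition, the sequential scanLeft (the parallel version produces the same values via upsweep/downsweep):

    List(1, 3, 8).scanLeft(100)(_ + _)   // List(100, 101, 104, 112)
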
Assignment -- Parallel programming

Count change problem -- how to divide stuff if we do not know the workload?!!

Counting change is a canonical example of a task-parallel problem in which partitioning the workload across processors is solution-driven -- to know how to optimally partition the work, we would first need to solve the problem itself.

The synchronization costs of doing this are way too high. Instead, we need to agglomerate parts of the computation. We do this by calling the sequential countChange.
  • Speed up is measured using ScalaMeter -- letting it warmup and GC first

Nice line of sight problem

Week 3: Data-parallelism

3.1 Data Parallel Programming
  • Task parallelization --> a form of parallelization that distributes execution across computing nodes
  • Data parallelization --> distributes data across nodes
  • Parallel for loop
    • Introduces the parallel for loop --> MUST BE CAREFUL of adding .par to the loop
    • Mandelbrot set --> keeps iterating complex numbers, checking whether the absolute value diverges to infinity
      • Goes over the image array in parallel.
      • Compares data parallelism for this problem, vs task parallelism --> Why?!! -- compares workload for different parts of the data
        • Task of balancing workload is managed by the scheduler -- we do not care about equally dividing! it is done for us..
3.2 Data Parallel Operations
  • Some methods like filter can be done in par.
  • But foldLeft CANNOT be done --> great lego example!! -- foldLeft depends on the previous part --> so MUST be sequential
  • But fold can be done--> it is like a reduction tree
3.3 Data Parallel Operations II
  • Uses fold to calculate max using Int.MinValue as the neutral element
  • Rock, paper and scissors example!
    • Play is NOT associative --> it is commutative
  • For fold to work --> the two following conditions must hold to make the operator f a MONOID
    • f(a, f(b,c)) == f(f(a,b),c) -- ASSOCIATIVE
    • f(z,a) == f(a,z) == a -- Like an identity on the neutral element z
    • Commutativity does not matter
  • For fold -- the type of the neutral element must be the same !!!
  • In comes aggregate --> a combination of fold and foldLeft
z   A   z   A   z   A   z   A
 \ /     \ /seqop\ /     \ /    
  B       B       B       B
    \   /  combop   \   /
      B _           _ B
         \ combop  /
              B

https://stackoverflow.com/questions/6928374/example-of-the-scala-aggregate-function
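A small sketch of aggregate computing (sum, count) in one pass (needs the scala-parallel-collections module on Scala 2.13+):

    import scala.collection.parallel.CollectionConverters._

    val xs = List(1, 2, 3, 4).par
    val (sum, count) = xs.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),   // seqop: folds elements within a chunk
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // combop: merges results of chunks
    )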

3.4 Scala parallel operations
  • Scala collections hierarchy

    • Traversable[T] -- Must implement foreach
    • Iterable[T] -- Must have iterator
    • Seq[T] -- Ordered ...
  • Parallel collection hierarchy

  • ParIterable[T] etc
  • GenSeq, GenIterable etc -- can or cannot be run in parallel

  • Compares calling .par on a List vs a Vector -- List is way slower

  • All collections have a parallel counterpart (mutable Par, immutable par etc)

  • Example of calling a function with arguments GenSet that accepts both sequential and parallel implementations.

3.5 Splitters and Combiners
Assignment -- KMeans
K-means is an example of a bulk synchronous parallel algorithm (BSP). BSP algorithms are composed 
from a sequence of supersteps, each of which contains:

parallel computation, in which processes independently perform 
local computations and produce some values;

communication, in which processes exchange data;

barrier synchronisation, during which processes wait until every process finishes

https://en.wikipedia.org/wiki/Bulk_synchronous_parallel

Remember Tail Recursion! -- https://gist.github.com/sgoyal1012/98ac2d289ddef3c1b3e4c4bb9a6093c6#17-tail-recursion

  • The recursive call should be the last thing that the function does (no operation after!)
  • The assignment tries different choices for selecting the initial means and compares them

Week 4: Data Structures for parallel computing

4.1 - Implementing Combiners
  • Transformer methods --> like filter, map etc (create new collection)
  • Scala Builder for sequential transformer operations
  • For Parallel, we use Combiner (a Builder with a combine) (combine this with that)
    • Combine is different for different collections (for Set and Map it is a union)
    • Combine MUST be efficient!! Must execute in O(log n + log m) time
    • To gain any speedup, the operation must be logarithmic!!
  • Combiner
    • Gives a bad implementation of Combiner --> O(n)
    • Arrays CANNOT be combined efficiently --> as they are contiguous
    • Compares runtime for various Data Structures.
    • So... implementation of a combiner is NOT trivial!
4.2 Parallel Two-phase Construction
  • Construct data structures in parallel using the two-phase construction (Use an intermediate data structure)
    • The intermediate data structure must have an efficient combine method, an efficient += method and can convert into the resulting data structure in O(n/P) time
    • Many intermediate data structures are created, then combined, and then converted in parallel into the final data structure
    • Uses a nested array buffer -- and then just copies the reference!
    • The result method to copy back into the array in a parallel fashion
    • Can extend to other data structures, need to choose an efficient intermediate data structure
      • Intermediate data structures MUST be non-overlapping
4.3 Conc-tree Data Structure
  • Conc-tree a parallel counterpart of Cons List
    • Lists are built for sequential computations, trees can be done in parallel.
    • Trees are ONLY good for parallelism if they are balanced
      • Enter Conc trees! Contains level and size info at each node
      • An invariant -- the level difference between siblings cannot be greater than one
      • <> to Concat trees -- Concat making sure that the tree balance structure is maintained
      • Goes over cases of concating trees in various unbalanced situations.
    • https://en.wikipedia.org/wiki/Conc-tree_list
4.4 Amortized, Constant-time Append Operation
  • += method should be O(1)
    • Adds the Append node type
      • Relaxes the balanced constraint
      • Make sure that there are NO more than log N Append nodes
      • Draws a parallel between counting numbers in binary representation and the append nodes
4.5 Conc Tree combiners
  • Conc buffers -- introduces the Chunk node and the expand method

Course 4. Big Data Analysis with Scala and Spark

Week 1: From Parallel to distributed

1.1 Logistics
  • Goes over logistics
  • Why scala? Why spark?
    • Because the data science done in tools like MATLAB only works while the data is small (fits in memory)
    • Because it's the shizz, yo
    • Works a lot like Scala
    • Modeled like Scala collections (map, flatMap) -- Hadoop needs a lot more boilerplate. Uses the Databricks Community Edition!
1.2: Data-Parallel to Distributed Data-Parallel
  • In the previous course, we did shared-memory data parallelism (jellybeans in a jar!)

    • Different processors operate on different chunks of data, but same machine
    • Data is SPLIT into different nodes, on which nodes operate in parallel.
      • Now MUST worry about network latency!
    • A lot carries over between shared memory and distributed memory --> operations should be associative!
    • Again, reminding you about LATENCY!!
  • My dear RDD

1.3 Latency
  • Issues while going from shared to distributed
    • Partial Failure - a subset might fail!!
    • Latency -- Network communication causes latency -- LATENCY WILL NEVER EVER GO AWAY
      • Important latency numbers -- https://gist.github.com/hellerbarde/2843375
      • Reading 1 MB from memory vs reading 1 MB from disk -- disk is many times slower!!
      • Memory < Disk < Network (Latency order)
      • Multiplies by billion to get a human sense
  • Goes over hadoop and mapreduce --> Mapreduce/Hadoop was great because of fault tolerance
    • Failure probability is high (for a node to fail..)
  • Why spark?
    • Because of fault tolerance, hadoop/mapreduce shuffles its data and writes to disk (EXPENSIVE!)
    • Spark strategy for handling latency-- all data is immutable and everything is in memory!
    • Just replay the transformation chain!
    • 100 times faster -- does stuff in memory
    • Minimizes network traffic
  • Hadoop vs Spark on a benchmark -- logistic regression roughly 100x faster (100 s vs 1 s)!

1.4 Basics of Spark's RDDs

  • RDDs seem like an immutable collection, have higher order functions like map, filter etc

    • Heavy use of higher order functions
    • Similar structure -- like MAGIC!
  • How to create a RDD?

    • Transform a RDD
    • or create one.. using SparkSession, SparkContext
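
A minimal sketch of both ways (app name, master and file path are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val nums  = sc.parallelize(List(1, 2, 3, 4))    // RDD from a local collection
    val lines = sc.textFile("path/to/file.txt")     // RDD from a file (hypothetical path)
    val doubled = nums.map(_ * 2)                   // transforming an existing RDD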

1.5 RDDs: Transformation and Actions

  • Transformers create new collections (not single values!) -- Accessors return single values as results
  • Spark defines Transformations and Actions
    • Transformations return RDD
    • Actions return something that is NOT an RDD.. e.g. write to HDFS
    • VERY IMPORTANT -- Transformations are lazy and Actions are eager
      • Laziness reduces network communication by deferring work until an action actually needs the result!
      • NOTHING HAPPENS when calling a map -- add an action to get a result!!
    • Eager actions
      • collect, count, take, reduce etc. etc.
      • LOOK AT THE RETURN TYPE to know if it's an action vs a transformation!
    • Stage
      • Something that we will do in the future sometime
    • Spark can optimize stuff...
      • eg. xs.filter(p).take(10) can stop as soon as it has 10! -- MIND BLOWN
    • Two RDD transformations (union, cartesian etc)
1.6 Evaluation in Spark: Unlike Scala Collections!
  • Why is spark good?

    • Lazy and in memory!
    • Most data science stuff requires iterations (map + reduce)
      • In hadoop, there is a read and write for intermediate data. (so a lot of time is spent!)
      • Spark loads everything in memory (saves so much time!)
      • Logistic regression is a hello world of the iterative data algorithm world
  • Gives an example of logistic regression

    • RDDs are recomputed on each action -- unnecessary to do it every time
    • persist -- after an RDD is evaluated, can reuse it -- VERY IMPORTANT
    • Suppose you do two operations --- can just persist -- such as take followed by count
  • Persistence comes in various flavors

    • Can keep both on memory and disk (as regular/serialized java objects)
    • cache vs persist -- cache is memory only and persist can take multiple arguments (eg. MEMORY_ONLY, MEMORY_ONLY_SER)
  • Different than collections!

    • Don't wait on the result (don't be EAGER)
    • Don't redo the same computations (use caching!)
  • Laziness is a good thing!!

    • Spark can analyze and optimize (if you just need 10, it will optimize to only get 10) -- as soon as it has 10, it just quits!
    • Spark can fuse computations and do less work
      • For eg. map and filter are two passes, but it just does it once!! (MIND BLOWN!)
1.6 Cluster Topology Matters!
  • There is a master-worker topology
    • master is the driver program (The commander, we talk to this)
    • executors are workers (they do the work)
    • Cluster manager is YARN or MESOS
  • Driver is the brain
    • Contains the main, transformations etc
  • Steps in a spark program
    • Spark SENDS the program to the executors (think about that for a moment!)
    • Driver says to executor -- take this function and data, do the work, and return the result
    • foreach is an action executed on the executors -- so nothing is printed back on the driver!
    • take gives it back to the driver program (sent back to the driver)
  • Summary
  • Think where your code is executed
    • Transformations are stored up (lazily) on the driver, and then on an action the executors run them.

Week 2: Reduction Operations & Distributed Key-value pairs

2.1 Reduction Operations
  • How are reduction operations (actions) distributed?
    • Methods like fold, aggregate etc walk through a collection and combine to get a result
    • Remember that foldLeft is NOT parallelizable
      • Because of TYPES!!
    • fold returns the same type
    • aggregate combines the two
  • On RDDs -- foldLeft and foldRight do NOT even exist -- they are not parallelizable! -- You must use aggregate!!
  • Why not have a serial implementation? It would not make sense synchronization-wise..
  • aggregate is desirable as you go from a higher data structure (Case class) to a simpler one (like Int)
2.2 Pair RDDs
  • Key-value pairs are preferred in the map-reduce world!
    • Project down onto key-value pairs
    • PairRDDs have special operations -- special methods -- groupByKey, reduceByKey, join etc.
2.3 Transformations and Actions on Pair RDDs
  • Actions and Transformations on PairRDD
    • groupBy example
    • Spark's groupByKey groups the values according to the keys they already have (no function argument required)
    • reduceByKey reduces the values for the same keys.
      • Note that reduceByKey is a transformation, unlike the action reduce in scala collections
    • mapValues only applies functions to the values
    • countByKey just counts the number of elements by key
    • The transformation keys is NOT an action
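
A tiny pair-RDD sketch (made-up data, sc from the earlier sketch):

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))   // RDD[(String, Int)]

    val grouped = pairs.groupByKey()         // RDD[(String, Iterable[Int])] -- transformation
    val summed  = pairs.reduceByKey(_ + _)   // RDD[(String, Int)] -- transformation, NOT an action
    val counts  = pairs.countByKey()         // Map[String, Long] -- action, runs eagerly
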
2.4 Joins
  • Inner Join vs Outer Join
    • The difference shows up when some keys do not exist in both RDDs
    • Gives example of train and location data
    • Inner join only takes the keys that intersect
      • It is commutative
    • Joins ARE transformations~~
    • Outer join --> takes the union
      • Returns the Option type (can be None or Some)
      • Choice of left vs right depends on whose keys are more important to us

Week 3: Reduction Operations & Distributed Key-value pairs

3.1 Shuffling: What it is and why it's important
  • groupBy and groupByKey --> shuffle data, as data MUST move around to combine values for the same key
  • groupBy returns a ShuffledRDD
  • Need to move around the values so that key has its values together
  • SHUFFLE is expensive!! -- Remember the humanized latency
  • reduceByKey is much more efficient! USE THIS!

    • reduces on the mappers first
    • Reduces the amount of data for network serialization
  • reduceByKey gives almost 3x speed!

3.2 Partitioning
  • How does spark know which key to put where?
  • What is a partition?
    • Data within a RDD is split into partitions.
    • Data in same partition is ALWAYS on the same machine.
    • Can configure the number of partitions
    • Generally number of partitions is equal to number of cores on all executor nodes
  • Can customize a partition ONLY on a pair RDD
  • Hash partition calculates p using the hashvalue -- tries to spread around based on the value
    • Hash partitioning can produce unevenly skewed partitions if we are unlucky
  • Range partitioning -- partitioned according to ordering
    • Can distribute keys unevenly based on ranges
  • How to customize partitioning?
    • VERY IMPORTANT -- partitionBy
    • Using transformations that create partitions
    • MUST persist()!! -- otherwise it is repartitioned every time
    • Otherwise partitioning is re-applied every time, causing shuffles over the network
  • map does not make sense AS the KEYS MAY CHANGE!!!
    • Use mapValues WHEREVER possible!
3.3 Optimizing with partitioners
  • Why partioning?
  • Because it avoids network shuffles!!
  • Range partitioners can further optimize reduceByKey -- 9x speedup over groupByKey!
  • Gives a great example of join always shuffling data
  • At the start of the program, partition the big dataset
  • We want to shuffle the small dataset, not the bigger one
  • A shuffle occurs when the resulting RDD depends on elements from other partitions of the same or another RDD -- watch out for operations that can cause a shuffle!
3.4 Wide vs Narrow Dependencies
  • Not all transformations are created equal! (eg. shuffle operations take MUCH longer)
  • Computations are represented as a lineage graph (DAG)
    • RDDs are depicted as partitions (atomic pieces of dataset), and the dependencies
  • Transformations cause shuffles
    • Narrow dependency -- each partition of the parent is used by at most one partition of the child RDD
    • Wide dependencies -- each partition of the parent may be used by multiple partitions of the child RDD
    • groupByKey has wide dependencies
    • NICE EXAMPLE on a graph
    • operations like join can cause shuffle (depending on whether you were smart to partition or not)
  • Spark's dependencies method!! and toDebugString
  • Lineage graphs are the key to fault tolerance!
    • You can go up the graph to recompute a failed partition!
    • For wide dependencies.. may need to recompute many partitions above! -- MIND BLOWN!

Week 4: SQL, Dataframe and Datasets

4.1 Structured vs Unstructured data
  • Goes over an example of joining two RDDs
    • Compares three different approaches -- cartesian
    • filter before join is much faster than join then filter.
    • Gets the same result! But some methods are wasteful
    • Could Spark be smarter about this, i.e. optimize the code itself?
  • Different types of data
    • Unstructured -- log files, images etc.
    • Semi-structured -- JSON, XML etc.
    • Structured -- Database tables
    • RDD has no idea about the structure of the data, other than what type of object it is, so it can't optimize
    • Given a structure, spark can optimize!
    • Databases are more organized and optimized
    • Spark SQL will optimize!
4.2 Spark SQL
  • SQL is the lingua franca
  • Spark SQl adds relational processing
    • A spark module, a library on top of spark
    • Catalyst a query optimizer
    • Spark SQL is top of spark, operates on RDDs under the hood.
  • Goes over SQL lingo
  • DataFrames REQUIRE some schema! -- DataFrames are untyped~~
    • Hence their transformations are untyped transformations
  • Can create dataframe from an RDD by either inferring schema, or by specifying
  • With case classes, column names are inferred!
  • Goes over creating DF from RDD[Row]
  • On a DataFrame, can call createOrReplaceTempView -- then just SQL away!! Gives the people example (see the sketch below)
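
A sketch of the flow from case class to SQL (Person and the data are made up; spark is the SparkSession from the earlier sketch):

    import spark.implicits._

    case class Person(name: String, age: Int)
    val peopleDF = List(Person("Ana", 29), Person("Bo", 17)).toDF()   // column names inferred

    peopleDF.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
    adults.show()
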
4.3 Dataframe (1)
  • Just running over RDDs
    • Applies relational optimizations from the DB community
    • Are untyped -- damn Rows! The Scala compiler does not type check them
    • Goes over the primitive types and the complex types available (Map, Struct etc)
      • Case classes map to a Struct
    • Dataframe API looks similar to a SQL api
      • groupBy, select etc
    • Working with columns has three ways -->
      • Using $ notation
        • $"age">10
      • Refer the dataframe
      • SQL Queries inside strings
    • GroupBy returns the RelationalGroupedDataset datatype
4.4 Dataframe (2)
  • Handling missing data (Nan and null) (drop, fill, replace)
  • In join MUST specify the columns to join on
  • Goes over a join example
  • The Query optimizer takes care of the optimization for you!
    • 4x faster!!
    • Spark optimizes for you! Gives you optimized RDDs
    • The catalyst can reorder and know which data is not used, so less data around the network!
      • Catalyst reorder partitions, reduces amount of data to shuffle around, removes unwanted partitions
      • Due to being restricted to SQL data types, Tungsten optimizes how to pack data into memory
        • Stores data in a columnar format
  • Limitations
    • Errors happen at runtime!!
    • Limited to same data types
4.5 Datasets!
  • Row is not convenient, need to cast them back using asInstanceOf
    • but it is a headache, can use the printSchema to get the schema
  • BUT we want both type safety and optimizations!
    • Enter Datasets!
    • Dataframe = Dataset[Row] --> Dataframe is a type of a Dataset
    • A combo of RDD and Dataframe
      • Mix and match RDD and Dataframes.
      • REQUIRE structured/semi structure data (must have some sort of schema)
    • Now we can mix and match APIs!
    • Can convert RDDs and DataFrames to Datasets
  • Datasets require TypedColumn and not just Column (as datasets are typed!)
    • Has both typed and untyped transformations
  • Can call map on Dataframe -- gives a dataset; but need to typecast manually!
  • Operations on RDD are not all available on Datasets (and also different sometimes!)
    • groupByKey returns the dreaded KeyValueGroupedDataset
      • Two set process to get back a dataset
  • Pay attention to the TypedColumn!
  • KeyValueGroupedDataset --> mapGroups and flatMapGroups
    • But where the hell is reduceByKey?! (Datasets do NOT have it!)
    • mapGroups requires a full shuffle!! (Do NOT use them!)
    • Docs suggest an Aggregator.
      • has three types [-IN, BUF, OUT]
      • Need to define the functions zero, reduce, merge and finish
      • Also need encoders!! --> Encoders.STRING (see the sketch at the end of this section)
  • Compares Dataframes vs RDDs vs Datasets
  • RDD is the core abstraction ! Others are built on it
  • Dataframe is the best optimization! (but untyped!)
  • Not everything on a Dataset can be optimized, e.g. a functional lambda filter cannot be optimized
  • Can optimize relational queries better, as it knows the columns required
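
The promised Aggregator sketch (Purchase and the grouping are made up, following the zero/reduce/merge/finish shape mentioned above):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Purchase(customer: String, amount: Double)

    val totalAmount = new Aggregator[Purchase, Double, Double] {
      def zero: Double = 0.0                                     // neutral element
      def reduce(buf: Double, p: Purchase): Double = buf + p.amount
      def merge(b1: Double, b2: Double): Double = b1 + b2
      def finish(buf: Double): Double = buf
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble  // encoders for BUF and OUT
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }.toColumn

    // usage sketch, assuming purchases: Dataset[Purchase]
    // purchases.groupByKey(_.customer).agg(totalAmount)
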