Below are notes taken for the "Scala Essential Training for Data Science": https://www.lynda.com/Scala-tutorials/Scala-Essential-Training-Data-Science/559182-2.html
In this course we will need to install Scala, Postgres, and Spark. The Lynda course we will be following along with provides manual installation instructions that do not use package management tools. To simplify the installation directions, I encourage using a command line based package manager: on MacOS there is Homebrew (https://brew.sh/), on Windows there is Chocolatey (https://chocolatey.org/), and on Linux distributions there are SDKMAN! (https://sdkman.io/) and, on Debian based systems, APT (https://wiki.debian.org/AptCLI).
For Linux users, I will provide SDKMAN! installation instructions in some cases and APT instructions in others, depending on which is more straightforward. You may also need to use sudo. MacOS and Windows users may need to provide their system password to install certain packages.
If you are comfortable taking another approach to installing the software used in this course, feel free to do so, but know that I may be unable to support issues encountered with your approach to installation (I primarily use MacOS).
Begin by installing Scala. The Lynda course uses version 2.11.11, which is not the most recent. Since there have been some slight changes to components that we will learn in this course, you should install the same version (at least the same major.minor version) as used by the instructor. To install, run the appropriate command for your system:
$ brew install scala@2.11 ## This should install 2.11.12 ## MacOS with https://brew.sh/ installed
> choco install scala --version=2.11.4 ## Windows with https://chocolatey.org/ installed
$ sdk install scala 2.11.11 ## Linux distributions with https://sdkman.io/ installed
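Afterwards, you can confirm which version was installed (the output shown here is just an example of what to expect):
$ scala -version ## e.g. prints "Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL"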
Scala depends on the Java Standard Edition Platform. Most package managers will resolve dependencies and install those prerequisites for you. For instance, if you do not already have a JDK, Homebrew will install OpenJDK@13 when you use it to install scala@2.11.
- Data Types (Section 1.3)
- Arrays, Vectors, and Ranges (Section 1.5)
- Maps (Section 1.6)
- Expressions (Section 1.7)
- Functions (Section 1.8)
- Objects and Classes (Section 1.9)
Scala provides a simple high-level abstraction for processing across multiple cores or on hyper-threaded processors via its standard library implementation of "parallel collections". The goal of Scala's parallel collections is to make parallelism easy to bring into more code by reusing the well-understood abstractions of sequential collections, such as the array, vector, and range types discussed in the previous section. Scala's parallel collection types include ParArray, ParVector, ParRange, ParHashMap, and ParSet.
You can read more about parallel collections in the "Parallel Collections Overview" section of the Scala documentation.
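As a minimal sketch in the REPL (not from the course), a parallel collection can be created directly, or by converting a sequential collection with its par method:
scala> import scala.collection.parallel.immutable.ParVector
import scala.collection.parallel.immutable.ParVector
scala> val direct = ParVector(1, 2, 3)
direct: scala.collection.parallel.immutable.ParVector[Int] = ParVector(1, 2, 3)
scala> val converted = Vector(1, 2, 3).par
converted: scala.collection.parallel.immutable.ParVector[Int] = ParVector(1, 2, 3)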
Only use parallel collections when the collection contains thousands or tens of thousands of elements. Making a small collection parallel when it need only be sequential incurs unnecessary (albeit fairly minor) overhead.
When creating parallel collections from sequential ones, the conversion may require a deep copy, so also keep your memory quota in mind when defining parallel collections.
Conceptually, Scala’s parallel collections framework parallelizes an operation on a parallel collection by recursively “splitting” a given collection, applying an operation on each partition of the collection in parallel, and re-“combining” all of the results that were completed in parallel.
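One way to see this split/apply/combine structure explicitly (a sketch, not from the course material) is the aggregate method: its first function folds elements within a partition, and its second combines the per-partition results:
scala> val nums = (1 to 1000).toVector.par
nums: scala.collection.parallel.immutable.ParVector[Int] = ParVector(1, 2, 3,…
scala> nums.aggregate(0)(_ + _, _ + _)
res0: Int = 500500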
The concurrent, “out-of-order” semantics of parallel collections lead to two implications:
- Side-effecting operations can lead to non-determinism
- Non-associative operations lead to non-determinism
By "out-of-order" the authors mean the temporal order of operations: different threads operating on partitions of the collection complete their pieces over different durations, so the order in which operations finish is unpredictable. They do not mean a spatial mis-ordering of the results. A parallel collection broken into partitions A, B, C, in that order, will be reassembled in that same order, not some other arbitrary order like B, C, A.
From Scala's "Parallel Collections Overview" documentation. See Wikipedia for more information on non-deterministic algorithms.
Avoid applying procedures with side effects. An example is using an accessor method like foreach to increment a var declared outside of the closure passed to foreach.
scala> var sum = 0
sum: Int = 0
scala> val list = (1 to 1000).toList.par
list: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3,…
scala> list.foreach(sum += _); sum
res01: Int = 467766
scala> var sum = 0
sum: Int = 0
scala> list.foreach(sum += _); sum
res02: Int = 457073
scala> var sum = 0
sum: Int = 0
scala> list.foreach(sum += _); sum
res03: Int = 468520
Here, summing with foreach over the collection yields a different value each time. This is caused by a data race from concurrent read/write operations on the sum variable, a consequence of splitting the parallel collection and running foreach across multiple cores or threads, illustrated here:
ThreadA: read value in sum, sum = 0 value in sum: 0
ThreadB: read value in sum, sum = 0 value in sum: 0
ThreadA: increment sum by 760, write sum = 760 value in sum: 760
ThreadB: increment sum by 12, write sum = 12 value in sum: 12
Contrast this with a sequential collection, where the result in sum is repeatable and accurate:
scala> var sum = 0
sum: Int = 0
scala> val list = (1 to 1000)
list: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3,…
scala> list.foreach(sum += _); sum
res11: Int = 500500
scala> var sum = 0
sum: Int = 0
scala> list.foreach(sum += _); sum
res12: Int = 500500
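A side-effect-free way to get the same deterministic result from the parallel collection itself is to use a built-in method like sum, rather than mutating external state. A sketch:
scala> val plist = (1 to 1000).toList.par
plist: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3,…
scala> plist.sum
res0: Int = 500500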
Also avoid non-associative operations on parallel collections.
Example of associative vs. non-associative operations:
Subtraction is non-associative:
(1 − 2) − 3 = −4
1 − (2 − 3) = 2
whereas addition is associative:
(1 + 2) + 3 = 6
1 + (2 + 3) = 6
Because a parallel collection is split into partitions that are operated on and recombined in an arbitrary grouping, a function applied across it (for example via reduce) must be associative; with a non-associative operation the result depends on how the collection happened to be partitioned, so it cannot be relied on.
We can see the impact of a non-associative operation on a parallel collection in Scala by using subtraction with the reduce method on a ParVector:
scala> val list = (1 to 1000).toList.par
list: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3,…
scala> list.reduce(_-_)
res01: Int = -67860
scala> list.reduce(_-_)
res02: Int = 2350
scala> list.reduce(_-_)
res03: Int = -234948
While reduce with an associative operation like addition on the same ParVector works just fine:
scala> list.reduce(_+_)
res18: Int = 500500
scala> list.reduce(_+_)
res19: Int = 500500
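If you genuinely need an order-dependent, non-associative computation, one option (a sketch, not from the course) is to convert back to a sequential collection with the seq method first, which makes the left-to-right result repeatable:
scala> list.seq.reduce(_-_)
res20: Int = -500498
scala> list.seq.reduce(_-_)
res21: Int = -500498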
- Creating Parallel Collections (Section 2.2)
- Mapping Functions Over Parallel Collections (Section 2.3)
- Filtering Parallel Collections (Section 2.4)
Using SQL in Scala first requires a SQL RDBMS as the data source; in the course we run PostgreSQL, but any RDBMS that you may have already installed on your computer should work. Installing Postgres is straightforward. The video gives installation instructions, but you may find it easier to run one of the following commands (provided you have installed a package manager: Homebrew on MacOS, APT on Debian-like Linux, or Chocolatey on Windows).
$ brew install postgresql ## MacOS with https://brew.sh/ installed
$ apt-get install postgresql ## Debian-like Linux distributions
> choco install postgresql ## Windows with https://chocolatey.org/ installed
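After installing, make sure the server is actually running before trying to connect; for example, on MacOS with Homebrew you can start it and open a quick psql session (commands differ on other systems):
$ brew services start postgresql ## start the server now and on login
$ psql postgres ## connect to the default postgres database as a quick check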
You will also need the PostgreSQL JDBC driver, a step the video lectures appear to skip. The driver is a binary jar file, downloadable from https://jdbc.postgresql.org/ by navigating to Downloads. I've confirmed that the current version as of this writing works.
When you invoke scala, you need to pass the path to the driver jar file after the -classpath flag, for example: $ scala -classpath ~/Downloads/postgresql-42.2.10.jar. If you use a different SQL database management system, like MySQL or SQLite, you will need that system's JDBC driver instead.
It is worth mentioning that executing queries on your SQL database through the JDBC driver connection returns ResultSet objects, each of which maintains its own cursor. Methods of the ResultSet class allow you to interact with the cursor. A ResultSet cursor is initially positioned before the first row; the first call to the next method makes the first row the current row, the second call makes the second row the current row, and so on. More details on the native Java SQL API are available on Oracle's Java specification documentation website: https://docs.oracle.com/javase/8/docs/api/java/sql/package-summary.html
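To make the cursor behavior concrete, here is a minimal sketch of connecting and iterating over rows; the connection URL, database name, and credentials are placeholders you will need to adjust for your setup:
import java.sql.DriverManager

// Placeholder connection details for a local PostgreSQL instance.
val url = "jdbc:postgresql://localhost:5432/testdb"
val connection = DriverManager.getConnection(url, "user", "password")
try {
  val statement = connection.createStatement()
  val results = statement.executeQuery("SELECT 1 AS answer")
  // The cursor starts before the first row; next() advances it one row
  // at a time and returns false once the rows are exhausted.
  while (results.next()) {
    println(results.getInt("answer"))
  }
} finally {
  connection.close()
}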
- Loading Data into PostgreSQL (Section 3.2)
- Connecting to PostgreSQL (Section 3.3)
- Querying with SQL strings (Section 3.4)
- Querying with Prepared Statements (Section 3.5)
- Getting Started with Spark RDDs (Section 4.3)
- Mapping Functions Over RDDs (Section 4.4)
- Statistics Over RDDs (Section 4.5)
- Creating DataFrames (Section 5.1)
- Grouping and Filtering on DataFrames (Section 5.2)
- Joining DataFrames (Section 5.3)
- Working with JSON Files (Section 5.4)