Jacek Laskowski jaceklaskowski

## spark-day4.md

      
Day 4 (last active June 15, 2017 11:56)

Exercise

Develop a Spark standalone application (using IntelliJ IDEA) with Spark MLlib and LogisticRegression to classify emails.
Think about the command line and which parameters you'd like to accept for various use cases.
TIP: Use scopt to parse the command-line arguments.

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
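
One way to start on the "what parameters" question is to write the configuration down as a case class before touching Spark. The sketch below is only an illustration: the option names (`--mode`, `--input`, `--model`, `--max-iter`) and the `EmailClassifierCli` object are hypothetical and not part of the exercise, and the hand-written `parse` stands in for scopt only to keep the example dependency-free.

```scala
object EmailClassifierCli {
  // Hypothetical parameters; none of these names come from the exercise itself
  final case class Config(
    mode: String = "train",   // "train" or "predict"
    input: String = "",       // path to the emails dataset (required)
    model: String = "model",  // where to save or load the fitted model
    maxIter: Int = 10)        // iteration limit to pass to LogisticRegression

  // Walks the arguments pair by pair; returns an error message on the first problem
  def parse(args: List[String], config: Config = Config()): Either[String, Config] =
    args match {
      case Nil =>
        if (config.input.isEmpty) Left("--input is required") else Right(config)
      case "--mode" :: m :: rest if Set("train", "predict")(m) =>
        parse(rest, config.copy(mode = m))
      case "--input" :: path :: rest =>
        parse(rest, config.copy(input = path))
      case "--model" :: path :: rest =>
        parse(rest, config.copy(model = path))
      case "--max-iter" :: n :: rest =>
        scala.util.Try(n.toInt).toOption match {
          case Some(i) if i > 0 => parse(rest, config.copy(maxIter = i))
          case _ => Left(s"--max-iter expects a positive integer, got: $n")
        }
      case unknown :: _ =>
        Left(s"Unknown or malformed option: $unknown")
    }
}
```

Calling `EmailClassifierCli.parse(args.toList)` from `main` gives either a ready `Config` or a message to print before exiting. With scopt, the same `Config` would be filled in by the parser's option definitions instead of the pattern match.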


## notes.md

      
notes (last active June 14, 2017 21:47)

IDEA:

Breaks on demand, given the number of exercises
A "break man" who says when we should have one

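import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Read each file as one string so that multi-line JSON stays in a single record
val wholeJsonRDD = sc.wholeTextFiles("input.json").map(_._2)
val mySchema = new StructType().add($"n".int)
wholeJsonRDD.toDF.withColumn("json", from_json($"value", mySchema)).show(truncate = false)

val jsonDF = spark.read.json("output.json")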


## spark-exercises.md

      
Spark Exercises (last active June 26, 2022 12:04)

Exercise 1

Union only those rows from the large table whose keys appear in the small (left) table, i.e. union two DataFrames but keep only the rows whose key exists in the small table.
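
To pin down the expected result before writing any Spark code, the semantics can be sketched with plain Scala collections. This is one reading of the exercise, not a reference solution: `Row`, `unionMatching`, and the `key`/`value` fields are made-up names, and in-memory sequences stand in for DataFrames. In Spark, the filtering step would typically be a left-semi join against the small table's keys, followed by a union.

```scala
object SemiJoinUnion {
  // Hypothetical row shape; real tables would have their own columns
  final case class Row(key: Int, value: String)

  /** Keeps every row of `small` and appends only those rows of `large`
    * whose key also occurs in `small`.
    */
  def unionMatching(small: Seq[Row], large: Seq[Row]): Seq[Row] = {
    // The small table's keys fit in memory, so a Set lookup is enough here
    val keys = small.map(_.key).toSet
    small ++ large.filter(row => keys.contains(row.key))
  }
}
```

For example, with keys 1 and 2 in the small table, a large-table row with key 3 is dropped while rows with keys 1 and 2 are appended after the small table's rows.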
Exercise 2

Aggregation on an array of nested JSON, i.e. how to sum the quantities across all lines for a given order (which would give 1 + 3 = 4 for the sample dataset below):
{


## blockchain.md

      
Blockchains, Cryptoeconomics, Ethereum, Litecoin, Bitcoin, IOTA (last active March 14, 2018 15:17)

Moved the notes to https://github.com/jaceklaskowski/blockchain-notes repository. See you there.
I'm going to remove the gist after October 7th.

  
## parquet.md

      
Parquet (last active December 26, 2017 19:16)

Parquet

Introduction


Stores schema information along with the data
Columnar storage/file format

"reference file format on Hadoop HDFS"
"read-optimized view of data"


excellent for local file storage on HDFS (instead of external databases).
writing very large datasets to disk


## scala-something.md

      
Scala SOMETHING (created February 24, 2017 22:42)

Notes


Combinators == building blocks
functions and higher order functions
composition
immutable == fewer moving parts to worry about
a routine job == boilerplate == boring stuff
a context (so the job varies)
abstracting away == happening behind the scenes == cutting down repetitive code == eliminating boilerplate
"The code becomes small, succinct, and more readable"
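
The notes above can be illustrated with a short sketch. All names here (`Combinators`, `trim`, `lower`, `words`, `tokenize`, `countWhere`) are invented for the example; the point is only to show small functions used as building blocks, glued together by composition, and a higher-order function that keeps the routine part in one place while the context is passed in.

```scala
object Combinators {
  // Building blocks: small, immutable function values
  val trim: String => String = _.trim
  val lower: String => String = _.toLowerCase
  val words: String => List[String] = _.split("\\s+").toList.filter(_.nonEmpty)

  // Composition: a larger function assembled from the blocks, with no new logic
  val tokenize: String => List[String] = trim andThen lower andThen words

  // Higher-order function: walking the list and counting is the routine job,
  // written once; the predicate `p` is the context that varies between callers
  def countWhere[A](items: List[A])(p: A => Boolean): Int = items.count(p)
}
```

For instance, `Combinators.countWhere(Combinators.tokenize(text))(_.length > 3)` counts the longer words in `text` without any explicit loop at the call site.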


## sparkathon-agenda.md

      
Sparkathon in Warsaw - Development Activities (last active October 10, 2017 20:28)

Spark-a-thon: Development Activities

Spark Structured Streaming


 Developing custom Source to handle formats like XML

 Publish the streaming source / format to https://spark-packages.org/
 As a follow-up, answer "How to read streaming data in XML format from Kafka?"


 Explore what's stored in checkpointLocation
 How is streaming groupBy different with Append and Complete output modes?
 Answering the question on StackOverflow — How to count items per time window?


## dcos.md

      
Introduction to DC/OS (last active June 30, 2018 12:13)

DC/OS


The latest documentation is at https://dcos.io/docs/latest.

Mesosphere DC/OS 1.9


Announcing Mesosphere DC/OS 1.9
DC/OS 1.9.0 Release Candidate 2
Bringing Production-Proven Data Services to DC/OS 1.9 with our Partners


DC/OS provides one-click installation of data services such as databases, message queues, and analytics engines, on par with cloud providers such as Amazon Web Services.


## spark-AlreadyExistsException.md

      
ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists) (created August 17, 2016 02:31)

scala> Seq(A(4)).toDS
16/08/16 19:26:26 ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy14.create_database(Unknown Source)


## exercise-meetup.md

      
Exercise for meetup today (last active August 3, 2016 20:01)

Create a new Scala/sbt project in IntelliJ IDEA


Project name: handleAllocatedContainers


Create a test (ScalaTest) for our exercise


Hint: See https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L448-L470
Input: sequence of elements
3 predicates to produce 4 sets <-- think about any number of predicates
The test will fail!!!


The exercise is to make the test pass.
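
For orientation, here is a minimal sketch of one possible shape of a solution, generalized to any number of predicates. It assumes "first matching predicate wins", so N predicates yield N + 1 groups, with the last group holding whatever matched nothing. The names `PredicateSplit` and `splitBy` are made up for this example and do not come from the Spark code referenced in the hint.

```scala
object PredicateSplit {
  /** Splits `input` into `predicates.size + 1` groups.
    * Each element lands in the group of the first predicate it satisfies;
    * the final group holds the elements that no predicate accepted.
    */
  def splitBy[A](input: Seq[A], predicates: Seq[A => Boolean]): Seq[Seq[A]] = {
    // Carry the not-yet-matched elements from one predicate to the next
    val (groups, remaining) =
      predicates.foldLeft((Vector.empty[Seq[A]], input)) {
        case ((acc, rest), p) =>
          val (matched, unmatched) = rest.partition(p)
          (acc :+ matched, unmatched)
      }
    groups :+ remaining
  }
}
```

With three predicates this returns four groups, which is the shape the test in the exercise is expected to check; a ScalaTest case could compare the result against hand-written expected groups.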