Michael Reynolds reynoldsm88

  • Two Six Labs
  • New York City
import java.util.HashMap;
import java.util.Map;

public class MySuperClass {
    private Map<String, String> values;

    // instance initializer block in the parent class
    {
        values = new HashMap<String, String>();
        values.put( "value", "super" );
    }
}

Spark internals through code

Nothing gives you more detail about Spark internals than actually reading its source code. In addition, you get to learn many design techniques and improve your Scala coding skills. These are the random notes I take while reading the Spark code. The best way to follow them is to load the Spark source into an IDE, e.g. IntelliJ, and navigate the code alongside the notes.

Genesis - creation of a Spark cluster

The scripts for creating a standalone Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you can see that the two scripts are nearly identical except for the value of the $CLASS variable. For start-master.sh the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.

# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
CLASS="org.apache.spark.deploy.worker.Worker"
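
For context, once the Master and Worker daemons started by these scripts are up, an application attaches to the standalone cluster through the master's spark:// URL. A minimal sketch follows; the host name and app name are placeholders (not from these notes), and 7077 is the default standalone master port.

import org.apache.spark.{SparkConf, SparkContext}

// Attach a small application to the standalone cluster started above.
object ConnectToStandalone {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("genesis-example")                  // placeholder app name
      .setMaster("spark://spark-master-host:7077")    // placeholder host, default port
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())            // trivial job to confirm the cluster works
    sc.stop()
  }
}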
@reynoldsm88
reynoldsm88 / build.sbt
Created June 22, 2017 18:30 — forked from seratch/build.sbt
Scala School - Testing with specs2 examples
organization := "net.seratch"
name := "sandbox"
version := "0.1"
scalaVersion := "2.9.1"
libraryDependencies ++= Seq(
  "junit" % "junit" % "4.9" withSources()
)
@reynoldsm88
reynoldsm88 / gist:94e7244554f3d59877fec10eb26a5f59
Created September 6, 2017 21:03
Separate certain modules into different profiles
<build>
  <defaultGoal>install</defaultGoal>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>${java.version}</source>
        <target>${java.version}</target>
      </configuration>
    </plugin>
@reynoldsm88
reynoldsm88 / get_job_status.sh
Created December 4, 2017 21:38 — forked from arturmkrtchyan/get_job_status.sh
Apache Spark Hidden REST API
curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000
<pluginManagement>
  <plugins>
    <!-- This plugin's configuration is used to store Eclipse m2e settings only.
         It has no influence on the Maven build itself. -->
    <plugin>
      <groupId>org.eclipse.m2e</groupId>
      <artifactId>lifecycle-mapping</artifactId>
      <version>1.0.0</version>
      <configuration>
        <lifecycleMappingMetadata>
# find the pod created by the deployment config (the <xyz> placeholder is the pod name prefix)
POD=$(oc get pod | grep <xyz> | awk '{print $1}')
# attach a 1G persistent volume claim to the deployment config, mounted at /remote/data
oc set volume <DC> --add --claim-name=<name> --type pvc --claim-size=1G --mount-path /remote/data
# roll out the latest deployment so the pod picks up the new volume
oc rollout latest <DC>
# copy local data into the mounted volume on the pod
oc rsync /local/data/ $POD:/remote/data
package org.apache.spark.countSerDe
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
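
The imports above point at a UserDefinedAggregateFunction. As a rough sketch of that API, reusing those imports, a count-style UDAF could look like the following; the class name and schemas are illustrative, not the gist's actual countSerDe implementation.

// Counts input rows: the buffer holds a single Long that each update increments.
class CountUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add( "value", StringType )
  override def bufferSchema: StructType = new StructType().add( "count", LongType )
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize( buffer: MutableAggregationBuffer ): Unit = { buffer( 0 ) = 0L }
  override def update( buffer: MutableAggregationBuffer, input: Row ): Unit = {
    buffer( 0 ) = buffer.getLong( 0 ) + 1L
  }
  override def merge( buffer1: MutableAggregationBuffer, buffer2: Row ): Unit = {
    buffer1( 0 ) = buffer1.getLong( 0 ) + buffer2.getLong( 0 )
  }
  override def evaluate( buffer: Row ): Any = buffer.getLong( 0 )
}

A UDAF like this would typically be registered with spark.udf.register and then used in SQL or DataFrame aggregations.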
lazy val root = ( project in file( "." ) ).settings(
  libraryDependencies ++= elastic4s
    ++ scalaTest
    ++ betterFiles
    ++ commonsTestBase,
  excludeDependencies ++= Seq( ExclusionRule( "org.slf4j", "slf4j-log4j12" ) )
)
lazy val root = ( project in file( "." ) ).settings(
  libraryDependencies ++= clulabProcessors
    ++ kafka
    ++ logging
    ++ scalaTest
    ++ embeddedKafka
    ++ scalaMock,
  // the mess below is to resolve conflicting versions of various dependencies
  excludeDependencies ++= Seq(
    ExclusionRule( "org.slf4j", "slf4j-log4j12" ),
    ExclusionRule( "javax.ws.rs", "javax.ws.rs-api" ), // out of date because of oracle jee debacle
    ExclusionRule(