@tgkprog
Last active January 5, 2017 10:56
Download and run Apache Zeppelin
a1,a2,c1,c2
1,2,3Aa,4tt
2,22,222Bbkumar,21
// * You can copy and paste this into a new Zeppelin notebook at http://localhost:8080/, assuming Zeppelin is installed and running.
// * Your example should take 6 parameters so that it can test 4 transformations, including one on a date. This example does not include the date transformation.
// * Parse a date using DateFormat and compare it for equality against a date column from the file.
val start = System.currentTimeMillis()
import scala.util.matching.Regex
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Column // DataFrame column type (catalog.Column is table metadata, not a DataFrame column)
// Replace every match of reg in orig with rplc.
def doRegReplace(orig: String, reg: Regex, rplc: String): String = {
  reg.replaceAllIn(orig, rplc)
}
println("--- 1" )
val pathOnServer = "/u/data/s2big.csv"// "/u/data/s2.csv"
val inColData = spark.read.option("header", "true").format("csv").option("inferSchema", "true").option("nullValue", null).load(pathOnServer).cache()
val val1 = z.input("val1", "2").toString().toInt
val val2 = z.input("val2", "Other info").toString()
val str1 = z.input("str1", "A|B|E|a|o").toString()
val str2 = z.input("str2", "X").toString()
val sdf = new java.text.SimpleDateFormat("yyyy-MM-dd") // MM = month; lowercase mm would mean minutes
val date1s = z.input("date1", "2016-12-04").toString()
val date1 = sdf.parse(date1s)
println("--- 2 date:" + date1 + "." )
println("--- val2:" + val2 + "." )
println("--- str1:" + str1 + "." )
println("--- str2:" + str2 + "." )
var outColData = inColData.withColumn("a2", inColData("a1") * val1)
val newCol = "c3"
val onCol = "c1"
val idx = 1
val re = str1.r
val rpl = str2
println("new c :" + newCol + ", on col :" + onCol + "." + ", value :" + re)
//re.replaceAllIn(inColData(onCol).toString()
val doRegReplace_udf = udf(doRegReplace(_: String, re, rpl))
outColData = outColData.withColumn(
  newCol, doRegReplace_udf(inColData(onCol)))
println("---data Final---" + idx + val2 + ":")
outColData.collect().foreach(println)
val end = System.currentTimeMillis()
println("Done Took :" + ((end - start)/ 1000d) + " seconds. [ total " + (end - start) + " millis]\n")
println("---Done---")
1. Download Apache Zeppelin: http://zeppelin.apache.org/
Download the Apache Zeppelin 0.6.2 release and unzip it:
* http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz
2. Install JDK 8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
3. Install Scala (https://www.scala-lang.org/download/2.12.1.html) and Spark (http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz).
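4. Environment
On Ubuntu, add these variables to ~/.bashrc (or make sure they are already set), adjusting the paths to where you installed each tool:
# JAVA_HOME might already be defined elsewhere; run echo $JAVA_HOME to check before adding it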
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SCALA_HOME=/a/scala/lang/scala-2.11.8
export SBT_HOME=/a/scala/sbt/sbt13
export HADOOP_HOME=/a/big/hadoop/hadoop-2.7.3
export SPARK_HOME=/a/big/spark/spark-2.0.1-bin-hadoop2.7
export PATH=$GRADLE_HOME/bin:$PATH:/opt/android-studio/bin:$SCALA_HOME/bin:$SBT_HOME/bin:$SPARK_HOME/bin
# for fast scripting, optional
export zep=/a/big/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh
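A quick way to confirm that the variables above took effect after reloading ~/.bashrc is a small check like the one below. It is only a sketch: check_home_vars is a helper name made up for this note, not something shipped with Zeppelin or Spark, and all it verifies is that each variable is set and points at an existing directory.

```shell
# check_home_vars is a hypothetical helper (not part of Zeppelin or Spark).
# For each variable NAME given, report whether it is set and names a directory.
check_home_vars() {
  local name value rc
  rc=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      rc=1
    elif [ ! -d "$value" ]; then
      echo "$name points at a missing directory: $value"
      rc=1
    else
      echo "$name ok: $value"
    fi
  done
  return "$rc"
}

# Check the variables this setup relies on; print a hint if any check fails.
check_home_vars JAVA_HOME SCALA_HOME SPARK_HOME || echo "fix the exports in ~/.bashrc, then run: source ~/.bashrc"
```

The function returns non-zero when any variable fails the check, so it can also be used in an if statement before starting Zeppelin.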
5. Run Zeppelin
On Linux, from the unzipped Zeppelin directory:
bin/zeppelin-daemon.sh start
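Startup can take a few seconds, so http://localhost:8080/ may not respond right away. The sketch below polls a URL with curl until it answers. wait_for_url is a made-up helper name, and the retry count and delay are arbitrary choices, not Zeppelin settings. It assumes curl is installed.

```shell
# wait_for_url is a hypothetical helper: poll URL until it answers or the tries run out.
# Usage: wait_for_url URL [TRIES] [DELAY_SECONDS]
wait_for_url() {
  local url tries delay i
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$tries" ]; do
    # -s silent, -f fail on HTTP errors, -o discard the body, --max-time cap each try
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no answer from $url after $tries tries"
  return 1
}

# After bin/zeppelin-daemon.sh start, for example:
#   wait_for_url http://localhost:8080/ 30 2
```

If the function reports no answer, the files in Zeppelin's logs directory are the first place to look.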
6. In a browser, go to
http://localhost:8080/
Make a new notebook and paste the source from the gist into the text area: https://gist.github.com/tgkprog/5ff218efcda3f3ec2114581309544461
Change the path in
val pathOnServer = "..."
to the local path of your CSV file. The path in the example is an Ubuntu path. On Windows this is untested, but one of these may work:
val pathOnServer = "/a/data2.csv"
Or
val pathOnServer = "c:/a/data2.csv"
Try both and see which one works.
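If you want a tiny file to point pathOnServer at while testing, something like the following writes the three sample rows from this gist to a temporary location and prints the matching notebook line. write_sample_csv and path_line are names invented here, and the comma delimiter is an assumption: it is the Spark CSV reader's default, but the gist page only shows the rows as a table.

```shell
# write_sample_csv is a hypothetical helper: write the gist's sample rows to PATH.
# The comma delimiter is an assumption (Spark's CSV reader default).
write_sample_csv() {
  cat > "$1" <<'EOF'
a1,a2,c1,c2
1,2,3Aa,4tt
2,22,222Bbkumar,21
EOF
}

# path_line is a hypothetical helper: print the notebook line for a readable CSV,
# or fail with a message on stderr if the file cannot be read.
path_line() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 1
  fi
  printf 'val pathOnServer = "%s"\n' "$1"
}

# Write the sample next to other temporary files and show the line to paste.
sample="${TMPDIR:-/tmp}/data2.csv"
write_sample_csv "$sample" && path_line "$sample"
```

Paste the printed line over the existing val pathOnServer line in the notebook paragraph.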
7. Then name the notebook after your user name, or
change one of the printlns to include your Freelancer/Fiverr user name.
Run the paragraph, take 1-2 screenshots, save them as PNG from Paint or
another program, and upload them to the chat.
* This gist includes data2.csv (sample data) and Zeppelin-notebook-sample.txt, which can be used directly once you change the file path.