Adrian Chang (adrian-chang)
@adrian-chang
adrian-chang / REMOTE_TRAINING.DockerFile
Created February 12, 2020 21:57
SageMaker Training DockerFile
FROM python:3.8.1-buster as python-base
# Keep Python output unbuffered and skip writing .pyc files
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
# "sklearn" is only a PyPI shim; install scikit-learn directly
RUN pip install scikit-learn
# Bundle the training code and run it from that directory
COPY . /opt/code
WORKDIR /opt/code
ENTRYPOINT ["python", "main.py"]
@adrian-chang
adrian-chang / LOCAL_TRAINING.DockerFile
Created February 12, 2020 21:15
AWS SageMaker Simple Local Training DockerFile
FROM python:3.8.1-buster as python-base
# Keep Python output unbuffered and skip writing .pyc files
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
# "sklearn" is only a PyPI shim; install scikit-learn directly
RUN pip install scikit-learn
# main.py is not baked into this image; it is expected to be provided
# (for example mounted) in the working directory when run locally
ENTRYPOINT ["python", "main.py"]
students.join(majors, Seq("student_id"), "full").show()
+----------+------------+----------------+
|student_id|student_name|           major|
+----------+------------+----------------+
|         1|        John|            null|
|         3|        Mary|         History|
|         4|        Jane|            null|
|         2|        Bill|Computer Science|
+----------+------------+----------------+
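For reference, a minimal sketch of how the students and majors DataFrames above could be built; the values are read from the output, and the spark-shell implicits (a SparkSession named spark) are assumed:

import spark.implicits._

val students = Seq(
  (1, "John"), (2, "Bill"), (3, "Mary"), (4, "Jane")
).toDF("student_id", "student_name")

val majors = Seq(
  (2, "Computer Science"), (3, "History")
).toDF("student_id", "major")

// "full" keeps every student_id from either side; unmatched columns come back null
students.join(majors, Seq("student_id"), "full").show()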
students.join(colleges, Seq("student_id"), "right").show()
+----------+------------+--------------------+
|student_id|student_name|        college_name|
+----------+------------+--------------------+
|         1|        John|             Harvard|
|         1|        John|            Stanford|
|         3|        Mary| University of Texas|
|         3|        Mary|            Columbia|
|         4|        Jane|University of Was...|
+----------+------------+--------------------+
students.join(colleges, Seq("student_id"), "left").show()
+----------+------------+--------------------+
|student_id|student_name|        college_name|
+----------+------------+--------------------+
|         1|        John|            Stanford|
|         1|        John|             Harvard|
|         2|        Bill|                null|
|         3|        Mary|            Columbia|
|         3|        Mary| University of Texas|
+----------+------------+--------------------+
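The longest college name above is cut off by show()'s default 20-character column width; passing false disables the truncation:

students.join(colleges, Seq("student_id"), "left").show(false)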
@adrian-chang
adrian-chang / udfExample.scala
Last active April 4, 2018 19:17
Simple UDF Example
import org.apache.spark.sql.functions._

// UDF that subtracts 10 from a column value
val multiUDF = udf((value: Double) => {
  value - 10
})

val scoresDF = sc.parallelize(
  Array(("Fred", 82.0), ("Fred", 90.0), ("Fred", 12.0))
).toDF("key", "value")

// Apply the UDF to the "value" column
scoresDF.withColumn("adjusted", multiUDF(col("value"))).show()
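As a follow-up sketch, the same logic can also be exposed to Spark SQL by registering it on the session; the function name "minusTen" and the temp view name are illustrative, and a SparkSession named spark is assumed:

// Register the function for use from SQL, then query the DataFrame as a temp view
spark.udf.register("minusTen", (value: Double) => value - 10)
scoresDF.createOrReplaceTempView("scores")
spark.sql("SELECT key, minusTen(value) AS adjusted FROM scores").show()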
@adrian-chang
adrian-chang / groupByRDDBasicExample.scala
Last active April 4, 2018 19:17
Group by RDD Basic
// Pair RDD with three values for the same key
val partition = sc.parallelize(Seq(
  ("1234", 1),
  ("1234", 1),
  ("1234", 1)
))
// reduceByKey sums the values per key
val result = partition.reduceByKey(_ + _)
// ("1234", 3)
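For comparison, a sketch of the same aggregation written with groupByKey; reduceByKey is usually preferred because it combines values on each partition before the shuffle:

// Group all values per key, then sum each group
val grouped = partition.groupByKey().mapValues(_.sum)
// ("1234", 3)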
@adrian-chang
adrian-chang / groupByDataframeBasicExample.scala
Last active April 4, 2018 19:17
Group by Dataframe Basic
import org.apache.spark.sql.functions._

val partition = sc.parallelize(Seq(
  ("1234", 1),
  ("1234", 1),
  ("1234", 1)
)).toDF("key", "value")

// Group by key and sum the values
partition.groupBy("key").agg(sum("value"))
// ("1234", 3)
val scoresRDD = sc.parallelize(
  Array(("Fred", 82.0), ("Fred", 90.0), ("Fred", 12.0))
)

// combineByKey building blocks: start a list from the first score,
// add a score to an existing list, and merge two partial lists
val createScoreCombiner = (score: Double) => List(score)
val scoreCombiner = (collector: List[Double], score: Double) => score :: collector
val scoreMerger = (left: List[Double], right: List[Double]) => left ::: right

val combinedScores = scoresRDD.combineByKey(createScoreCombiner, scoreCombiner, scoreMerger)
// ("Fred", List(...)) with the three scores in partition-dependent order
val scoresDF = sc.parallelize(
  Array(("Fred", 82.0), ("Fred", 90.0), ("Fred", 12.0))
).toDF("key", "value")

// The DataFrame equivalent: collect all values for each key into a list
val scores = scoresDF.groupBy("key").agg(collect_list("value"))
// ("Fred", List(82.0, 90.0, 12.0))
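Note that collect_list gives no ordering guarantee across partitions; if a stable order is wanted, one option (a sketch, using the functions import added above) is to sort the collected list:

val sortedScores = scoresDF.groupBy("key").agg(sort_array(collect_list("value")))
// ("Fred", List(12.0, 82.0, 90.0))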