@evbruno, created November 22, 2016
Flight Delay computing with Spark
// ref: https://github.com/evbruno/flight_delay_akka_streams
// ref: https://github.com/evbruno/flight_delay_java8
// 0 : load
val flightText = sc.textFile("/tmp/2008.csv").cache
case class FlightEvent(
  year: String,
  month: String,
  dayOfMonth: String,
  dayOfWeek: String,
  uniqueCarrier: String,
  flightNum: String,
  arrDelayMins: Int)
val flights = flightText
  .map(_.split(","))
  .filter(c => c(0) != "Year")   // skip the CSV header row
  .map { s =>
    FlightEvent(
      year          = s(0),
      month         = s(1),
      dayOfMonth    = s(2),
      dayOfWeek     = s(3),
      uniqueCarrier = s(8),
      flightNum     = s(9),
      arrDelayMins  = scala.util.Try(s(14).toInt).getOrElse(0)  // non-numeric delays (e.g. "NA") default to 0
    )
  }
  .cache
val delayedFlights = flights.filter(_.arrDelayMins > 0)
delayedFlights.count // res3: Long = 2979504
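// Sanity check (sketch, assuming /tmp/2008.csv is the airline on-time performance file
// used by the repos referenced above): peek at a few parsed events to confirm the
// column indices line up before aggregating.
delayedFlights.take(5).foreach(println)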
// 1a : SQL
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val newDF = delayedFlights.toDF
newDF.registerTempTable("flights")
val df = sqlContext.sql("select uniqueCarrier, count(1), avg(arrDelayMins) from flights group by uniqueCarrier")
df.show
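// Optional sketch against the same temp table: alias the aggregates and order by average
// delay so the output columns have readable names. The variable and alias names here
// (byCarrier, totalDelayed, avgDelayMins) are illustrative, not from the original gist.
val byCarrier = sqlContext.sql(
  """select uniqueCarrier,
    |       count(1) as totalDelayed,
    |       avg(arrDelayMins) as avgDelayMins
    |  from flights
    | group by uniqueCarrier
    | order by avgDelayMins desc""".stripMargin)
byCarrier.show(21)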
// 1b : DataFrames / RDDs
newDF.printSchema
newDF.show(10)
newDF.groupBy("uniqueCarrier").show(21)
newDF.groupBy("uniqueCarrier").avg("uniqueCarrier").show(21)