Using what we learned at SparkCamp to play with some of Leeds City Council's 2014 road traffic accident data
// Download 2014 Road traffic accidents from Leeds City Council
// http://data.gov.uk/dataset/road-traffic-accidents/resource/4882af59-27c5-4148-9624-6cffee36c688
// Remove the header line in an editor first, because we've not figured out how to do this in Spark yet
// (one possible in-Spark approach is sketched just after the load below)
// Load the data as an RDD of lines
val data = sc.textFile("/home/akaerast/downloads/accidents2014.csv")
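// A minimal sketch of dropping the header inside Spark itself, assuming the
// header is the first line of the file (i.e. the first line of partition 0);
// the rest of this session keeps using `data`, with the header pre-stripped:
val dataNoHeader = data.mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)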
// Describe each record as a case class
case class Accident(reference: String,
                    easting: String,
                    northing: String,
                    vehicles: Int,
                    casualties: Int,
                    date: String,
                    time: String,
                    roadClass: Int,
                    roadSurface: Int,
                    lightingConditions: Int,
                    weatherConditions: Int,
                    casualtyClass: Int,
                    casualtySeverity: Int,
                    casualtySex: Int,
                    casualtyAge: Int,
                    vehicleType: Int)
// Split each line into an array
val rows = data.map(line => line.split(","))
// map each row into the Accident class
val accidents = rows.map(row => Accident(
  row(0), row(1), row(2),
  row(3).toInt, row(4).toInt,
  row(5), row(6),
  row(7).toInt, row(8).toInt, row(9).toInt, row(10).toInt,
  row(11).toInt, row(12).toInt, row(13).toInt, row(14).toInt, row(15).toInt))
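// A defensive sketch, assuming every valid record has exactly 16
// comma-separated fields (note that a plain split(",") would also
// mis-handle any quoted fields containing commas):
val cleanRows = rows.filter(_.length == 16)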
// cache the accidents in memory if possible (lazy: the cache fills on the first action)
accidents.cache()
// count the number of accidents with more than one casualty (921)
accidents.filter(accident => accident.casualties > 1).count()
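// A related sketch: countByValue() tallies the full distribution of
// casualty counts per accident in a single action (assumes the distinct
// counts are few enough to return to the driver as a Map):
accidents.map(_.casualties).countByValue()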
// sum the total number of casualties (4065)
val casualties = accidents.map(a => a.casualties)
val totalCasualties = casualties.reduce((a, b) => a + b)
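// Equivalently, Spark's implicit numeric-RDD helpers expose sum() directly:
accidents.map(_.casualties).sum()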
// Great, now let's do some stats
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// Take all the numeric values and convert them to doubles
val observations = rows.map(row => Vectors.dense(row(3).toDouble, row(4).toDouble, row(7).toDouble, row(8).toDouble, row(9).toDouble, row(10).toDouble, row(11).toDouble, row(12).toDouble, row(13).toDouble, row(14).toDouble, row(15).toDouble))
// And then summarise the data
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)
println(summary.variance)
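// The summary also exposes count, max, min, numNonzeros, etc.:
println(summary.count)
println(summary.max)
println(summary.min)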
import org.apache.spark.rdd.RDD
// generate separate RDDs for the columns we care about
val vehicles: RDD[Double] = accidents.map(accident => accident.vehicles.toDouble)
val casualties: RDD[Double] = accidents.map(accident => accident.casualties.toDouble)
val casualtySex: RDD[Double] = accidents.map(accident => accident.casualtySex.toDouble)
// A 0.25 correlation between the number of vehicles and the number of casualties
val vehicleCasualtyCorrelation: Double = Statistics.corr(vehicles, casualties, "pearson")
// A 0.01 correlation between the number of vehicles and the sex of the casualty
val vehicleSexCorrelation: Double = Statistics.corr(vehicles, casualtySex, "pearson")
// A 0.08 correlation between the number of casualties and the sex of the casualty
val casualtiesSexCorrelation: Double = Statistics.corr(casualties, casualtySex, "pearson")
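// A sketch of getting all pairwise correlations at once from the Vector RDD
// built earlier, rather than one column pair at a time: Statistics.corr on an
// RDD[Vector] returns the full correlation matrix
import org.apache.spark.mllib.linalg.Matrix
val correlMatrix: Matrix = Statistics.corr(observations, "pearson")
println(correlMatrix.toString)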