/*
spark-shell --master yarn \
--conf spark.ui.port=12345 \
--num-executors 6 \
--executor-cores 2 \
--executor-memory 2G
*/
// Solution using Core API
val crimeData = sc.textFile("/public/crime/csv")
val header = crimeData.first
val crimeDataWithoutHeader = crimeData.filter(criminalRecord => criminalRecord != header)

// The regex splits on commas only when they fall outside double-quoted fields;
// the -1 limit keeps trailing empty columns so field indexes stay stable.
val crimeCountForResidence = sc.parallelize(crimeDataWithoutHeader.
  filter(rec => rec.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)(7) == "RESIDENCE").
  map(rec => (rec.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)(5), 1)).
  reduceByKey((total, value) => total + value).
  map(rec => (rec._2, rec._1)).
  sortByKey(false).
  take(3))

// Swap the tuple back to (crime_type, count) order and write out as JSON.
crimeCountForResidence.
  map(rec => (rec._2, rec._1)).
  toDF("crime_type", "crime_count").
  write.json("user/dgadiraju/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")
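To make the quote-aware split concrete, here is a small standalone sketch. The record below is a made-up example that mirrors the Chicago crime CSV layout assumed by the solution (field 5 = primary crime type, field 7 = location description); the quoted description contains a comma that a naive `split(",")` would break on.

```scala
// Hypothetical record: the quoted field in position 6 contains an embedded comma.
val rec = "10000092,HY190059,03/18/2015 07:44:00 PM,047XX W OHIO ST,041A,BATTERY,\"AGGRAVATED: HANDGUN, NO INJURY\",RESIDENCE"

// Split on commas that sit outside double-quoted fields, keeping empty columns.
val fields = rec.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)

// The embedded comma is not treated as a delimiter, so the indexes line up:
// fields(5) == "BATTERY", fields(7) == "RESIDENCE"
```

The lookahead `(?=(?:[^"]*"[^"]*")*[^"]*$)` only matches a comma when the remainder of the line contains an even number of double quotes, i.e. when the comma is not inside an open quoted field.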
@Karthik-NS

Hi,
can you please explain the purpose of -1 in the split function?
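For context: in Java and Scala, `String.split(regex, limit)` with a negative limit applies the pattern as many times as possible and keeps trailing empty strings, whereas the default (limit 0) silently drops them. A quick sketch of the difference:

```scala
// A row that ends in two empty columns.
val row = "foo,bar,,"

// Default split drops trailing empty strings: only 2 fields survive.
val dropped = row.split(",")
// dropped.length == 2

// With limit -1, all columns are preserved, empty or not.
val kept = row.split(",", -1)
// kept.length == 4
```

In the solution above this matters because the code indexes fields by position (`(5)` and `(7)`); dropping trailing empties on records with blank final columns would shift or remove those indexes.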
