vikas-gonti / Solutions
Last active September 14, 2019 17:53
Solution to 20 questions at http://nn02.itversity.com/cca175/
: 'Problem 1
Connect to the MySQL database on the itversity labs using sqoop and import all of the data from the orders table into HDFS.
Output Requirements:
- Place the customer files in the HDFS directory /user/yourusername/problem1/solution/
- Replace yourusername with your OS user name
- Use a text format with comma as the columnar delimiter
- Load every order record completely'
sqoop import \
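  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table orders \
  --target-dir /user/yourusername/problem1/solution \
  --fields-terminated-by ',' \
  --as-textfile
# The flags above complete the truncated command; the connection details are an
# assumption, reused from the orders import further down this page.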
/*spark-shell --master yarn \
--conf spark.ui.port=12345 \
--num-executors 6 \
--executor-cores 2 \
--executor-memory 2G \
--packages com.databricks:spark-avro_2.10:2.0.1
*/
import com.databricks.spark.avro._
val orderDF = sqlContext.read.avro("/user/cloudera/problem1/orders")
vikas-gonti / CrimeDataTop3.scala
Created June 30, 2018 20:32
Get top 3 crime types based on number of incidents in RESIDENCE area
/*
Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - “,” (split with the regex split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1), as some fields contain commas and are enclosed in double quotes)
Get top 3 crime types based on number of incidents in RESIDENCE area using “Location Description”
Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA
Output Fields: Crime Type, Number of Incidents
Output File Format: JSON
Output Delimiter: N/A
Output Compression: No
*/
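// A minimal sketch of a solution (the gist body is truncated here), assuming a
// spark-shell session with sc/sqlContext and the crime data under /public/crime/csv:
import sqlContext.implicits._

val crimeData = sc.textFile("/public/crime/csv")
val header = crimeData.first
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
// Count incidents per Primary Type (index 5) where Location Description (index 7) is RESIDENCE
val top3 = crimeData.filter(_ != header).
  map(_.split(regex, -1)).
  filter(cols => cols(7) == "RESIDENCE").
  map(cols => (cols(5), 1)).
  reduceByKey(_ + _).
  sortBy(t => -t._2).
  take(3)
sc.parallelize(top3, 1).
  toDF("crime_type", "no_of_incidents").
  write.json("/user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")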
vikas-gonti / InactiveCustomers.scala
Created June 30, 2018 19:45
Get inactive customers using left outer join between orders and customers.
/*
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--target-dir data/retail_db/orders \
--as-textfile \
--num-mappers 1
*/
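// A minimal sketch of the join (the original code is truncated in this gist),
// assuming customers was sqooped the same way to data/retail_db/customers:
val orders = sc.textFile("data/retail_db/orders").
  map(rec => (rec.split(",")(2).toInt, 1))        // (order_customer_id, 1)
val customers = sc.textFile("data/retail_db/customers").
  map(rec => {
    val cols = rec.split(",")
    (cols(0).toInt, cols(1) + " " + cols(2))      // (customer_id, "fname lname")
  })
// Customers with no matching order come back as None on the right side
val inactiveCustomers = customers.leftOuterJoin(orders).
  filter(rec => rec._2._2.isEmpty).
  map(rec => rec._2._1)
inactiveCustomers.take(10).foreach(println)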
vikas-gonti / WordCount.scala
Created June 30, 2018 19:24
Develop word count program
/*
Data is available in HDFS /public/randomtextwriter
Get word count for the input data using space as delimiter (for each word, we need to get how many times it is repeated in the entire input data set)
Number of executors should be 10
Executor memory should be 3 GB
Executor cores should be 20 in total (2 per executor)
Number of output files should be 8
Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro_2.10, version -> 2.0.1
Target Directory: /user/<YOUR_USER_ID>/solutions/solution05/wordcount
Target File Format: Avro
*/
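// A minimal sketch (the solution body is truncated in this gist); launch as:
// spark-shell --master yarn --num-executors 10 --executor-memory 3G \
//   --executor-cores 2 --packages com.databricks:spark-avro_2.10:2.0.1
// Column names below are arbitrary illustrations.
import com.databricks.spark.avro._
import sqlContext.implicits._

val wordCount = sc.textFile("/public/randomtextwriter").
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _, 8)                 // 8 partitions -> 8 output files
wordCount.toDF("word", "count").
  write.avro("/user/<YOUR_USER_ID>/solutions/solution05/wordcount")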
vikas-gonti / CrimeData.scala
Created June 30, 2018 19:16
Get monthly crime count by type
/* Data is available in HDFS file system under /public/crime/csv
You can check properties of files using hadoop fs -ls -h /public/crime/csv
Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - “,”
Get monthly count of primary crime type, sorted by month in ascending and number of crimes per type in descending order
Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month
Output File Format: TEXT
Output Columns: Month in YYYYMM format, crime count, crime type
Output Delimiter: \t (tab delimited)
*/
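// A minimal sketch (the solution body is truncated in this gist). Date (index 2)
// looks like MM/DD/YYYY hh:mm:ss, so YYYYMM is built from substrings:
val crimeData = sc.textFile("/public/crime/csv")
val header = crimeData.first
val crimeCountByMonth = crimeData.filter(_ != header).
  map(rec => {
    val cols = rec.split(",")
    val month = cols(2).substring(6, 10) + cols(2).substring(0, 2)  // YYYYMM
    ((month.toInt, cols(5)), 1)                   // ((month, primary type), 1)
  }).
  reduceByKey(_ + _).
  map { case ((month, crimeType), count) =>
    ((month, -count), month + "\t" + count + "\t" + crimeType)
  }.
  sortByKey().                                    // month ascending, count descending
  map(_._2)
crimeCountByMonth.saveAsTextFile("/user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month")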
vikas-gonti / NYSEData.Scala
Last active June 27, 2018 01:45
Convert nyse data to parquet
// Convert nyse data (multiple files) to parquet (4 output files)
/* Data is available in local file system under /data/nyse (ls -ltr /data/nyse)
Fields (stockticker:string, transactiondate:string, openprice:float, highprice:float, lowprice:float, closeprice:float, volume:bigint)
Convert file format to parquet
Save it to /user/<YOUR_USER_ID>/nyse_parquet */
/*spark-shell --master yarn \
--conf spark.ui.port=12456 \
--num-executors 4*/
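// A minimal sketch (the conversion code is truncated in this gist), assuming the
// local files were first copied to HDFS, e.g.
//   hadoop fs -put /data/nyse /user/<YOUR_USER_ID>/nyse
import sqlContext.implicits._

val nyse = sc.textFile("/user/<YOUR_USER_ID>/nyse").
  map(rec => {
    val c = rec.split(",")
    (c(0), c(1), c(2).toFloat, c(3).toFloat, c(4).toFloat, c(5).toFloat, c(6).toLong)
  }).
  toDF("stockticker", "transactiondate", "openprice", "highprice",
       "lowprice", "closeprice", "volume")
nyse.coalesce(4).write.parquet("/user/<YOUR_USER_ID>/nyse_parquet")  // 4 output files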
vikas-gonti / IMDBData.scala
Created June 27, 2018 01:08
IMDB Data Top 50 ratings
// user/gontiv/solutions/imdbdata/title.akas.tsv
// user/gontiv/solutions/imdbdata/title.ratings.tsv
/*
spark-shell --master yarn \
--conf spark.ui.port=12456 \
--num-executors 10 \
--executor-memory 3G \
--executor-cores 2 \
--packages com.databricks:spark-avro_2.10:2.0.1
*/
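// A minimal sketch of a top-50 query (the gist is truncated here), assuming the
// standard IMDb TSV layouts: title.ratings.tsv (tconst, averageRating, numVotes)
// and title.akas.tsv (titleId, ordering, title, region, ...), each with a header:
val ratingsRaw = sc.textFile("user/gontiv/solutions/imdbdata/title.ratings.tsv")
val ratingsHeader = ratingsRaw.first
val ratings = ratingsRaw.filter(_ != ratingsHeader).
  map(rec => {
    val c = rec.split("\t")
    (c(0), c(1).toFloat)                          // (tconst, averageRating)
  })

val akasRaw = sc.textFile("user/gontiv/solutions/imdbdata/title.akas.tsv")
val akasHeader = akasRaw.first
val titles = akasRaw.filter(_ != akasHeader).
  map(rec => {
    val c = rec.split("\t")
    (c(0), c(2))                                  // (titleId, title)
  })

// Join on the title id and take the 50 highest-rated titles
val top50 = titles.join(ratings).
  map { case (id, (title, rating)) => (rating, (id, title)) }.
  sortByKey(ascending = false).
  take(50)
top50.foreach(println)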