vikas-gonti / Solutions
Last active September 14, 2019 17:53
Solution to 20 questions at http://nn02.itversity.com/cca175/
: 'Problem 1
Connect to the MySQL database on the itversity labs using sqoop and import all of the data from the orders table into HDFS.
Output Requirements:
- Place the customer files in the HDFS directory /user/yourusername/problem1/solution/
- Replace yourusername with your OS user name
- Use a text format with comma as the columnar delimiter
- Load every order record completely'
sqoop import \
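  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table orders \
  --target-dir /user/yourusername/problem1/solution \
  --fields-terminated-by ',' \
  --as-textfile
# The flags above complete the truncated command; the connection details are an
# assumption, reused from the orders import further down this page.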
/*spark-shell --master yarn \
--conf spark.ui.port=12345 \
--num-executors 6 \
--executor-cores 2 \
--executor-memory 2G \
--packages com.databricks:spark-avro_2.10:2.0.1
*/
import com.databricks.spark.avro._
val orderDF = sqlContext.read.avro("/user/cloudera/problem1/orders")
vikas-gonti / CrimeDataTop3.scala
Created June 30, 2018 20:32
Get top 3 crime types based on number of incidents in RESIDENCE area
/*
Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - “,” (split with the regex split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1), as some fields contain commas and are enclosed in double quotes)
Get top 3 crime types based on number of incidents in RESIDENCE area using “Location Description”
Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA
Output Fields: Crime Type, Number of Incidents
Output File Format: JSON
Output Delimiter: N/A
Output Compression: No
*/
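// A minimal sketch of a solution (the gist body is truncated here), assuming a
// spark-shell session with sc/sqlContext and the crime data under /public/crime/csv:
import sqlContext.implicits._

val crimeData = sc.textFile("/public/crime/csv")
val header = crimeData.first
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
// Count incidents per Primary Type (index 5) where Location Description (index 7) is RESIDENCE
val top3 = crimeData.filter(_ != header).
  map(_.split(regex, -1)).
  filter(cols => cols(7) == "RESIDENCE").
  map(cols => (cols(5), 1)).
  reduceByKey(_ + _).
  sortBy(t => -t._2).
  take(3)
sc.parallelize(top3, 1).
  toDF("crime_type", "no_of_incidents").
  write.json("/user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")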
vikas-gonti / InactiveCustomers.scala
Created June 30, 2018 19:45
Get inactive customers using left outer join between orders and customers.
/*
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--target-dir data/retail_db/orders \
--as-textfile \
--num-mappers 1
*/
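// A minimal sketch of the join (the original code is truncated in this gist),
// assuming customers was sqooped the same way to data/retail_db/customers:
val orders = sc.textFile("data/retail_db/orders").
  map(rec => (rec.split(",")(2).toInt, 1))        // (order_customer_id, 1)
val customers = sc.textFile("data/retail_db/customers").
  map(rec => {
    val cols = rec.split(",")
    (cols(0).toInt, cols(1) + " " + cols(2))      // (customer_id, "fname lname")
  })
// Customers with no matching order come back as None on the right side
val inactiveCustomers = customers.leftOuterJoin(orders).
  filter(rec => rec._2._2.isEmpty).
  map(rec => rec._2._1)
inactiveCustomers.take(10).foreach(println)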
vikas-gonti / WordCount.scala
Created June 30, 2018 19:24
Develop word count program
/*
Data is available in HDFS /public/randomtextwriter
Get word count for the input data using space as delimiter (for each word, we need to get how many times it is repeated in the entire input data set)
Number of executors should be 10
Executor memory should be 3 GB
Executor cores should be 20 in total (2 per executor)
Number of output files should be 8
Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro_2.10, version -> 2.0.1
Target Directory: /user/<YOUR_USER_ID>/solutions/solution05/wordcount
Target File Format: Avro
*/
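// A minimal sketch (the solution body is truncated in this gist); launch as:
// spark-shell --master yarn --num-executors 10 --executor-memory 3G \
//   --executor-cores 2 --packages com.databricks:spark-avro_2.10:2.0.1
// Column names below are arbitrary illustrations.
import com.databricks.spark.avro._
import sqlContext.implicits._

val wordCount = sc.textFile("/public/randomtextwriter").
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _, 8)                 // 8 partitions -> 8 output files
wordCount.toDF("word", "count").
  write.avro("/user/<YOUR_USER_ID>/solutions/solution05/wordcount")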
vikas-gonti / CrimeData.scala
Created June 30, 2018 19:16
Get monthly crime count by type
/* Data is available in HDFS file system under /public/crime/csv
You can check properties of files using hadoop fs -ls -h /public/crime/csv
Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - “,”
Get monthly count of primary crime type, sorted by month in ascending and number of crimes per type in descending order
Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month
Output File Format: TEXT
Output Columns: Month in YYYYMM format, crime count, crime type
Output Delimiter: \t (tab delimited)
*/
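// A minimal sketch (the solution body is truncated in this gist). Date (index 2)
// looks like MM/DD/YYYY hh:mm:ss, so YYYYMM is built from substrings:
val crimeData = sc.textFile("/public/crime/csv")
val header = crimeData.first
val crimeCountByMonth = crimeData.filter(_ != header).
  map(rec => {
    val cols = rec.split(",")
    val month = cols(2).substring(6, 10) + cols(2).substring(0, 2)  // YYYYMM
    ((month.toInt, cols(5)), 1)                   // ((month, primary type), 1)
  }).
  reduceByKey(_ + _).
  map { case ((month, crimeType), count) =>
    ((month, -count), month + "\t" + count + "\t" + crimeType)
  }.
  sortByKey().                                    // month ascending, count descending
  map(_._2)
crimeCountByMonth.saveAsTextFile("/user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month")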
vikas-gonti / NYSEData.Scala
Last active June 27, 2018 01:45
Convert nyse data to parquet
// Convert nyse data (multiple files) to parquet (4 output files)
/* Data is available in local file system under /data/nyse (ls -ltr /data/nyse)
Fields (stockticker:string, transactiondate:string, openprice:float, highprice:float, lowprice:float, closeprice:float, volume:bigint)
Convert file format to parquet
Save it to /user/<YOUR_USER_ID>/nyse_parquet */
/*spark-shell --master yarn \
--conf spark.ui.port=12456 \
--num-executors 4*/
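// A minimal sketch (the conversion code is truncated in this gist), assuming the
// local files were first copied to HDFS, e.g.
//   hadoop fs -put /data/nyse /user/<YOUR_USER_ID>/nyse
import sqlContext.implicits._

val nyse = sc.textFile("/user/<YOUR_USER_ID>/nyse").
  map(rec => {
    val c = rec.split(",")
    (c(0), c(1), c(2).toFloat, c(3).toFloat, c(4).toFloat, c(5).toFloat, c(6).toLong)
  }).
  toDF("stockticker", "transactiondate", "openprice", "highprice",
       "lowprice", "closeprice", "volume")
nyse.coalesce(4).write.parquet("/user/<YOUR_USER_ID>/nyse_parquet")  // 4 output files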
vikas-gonti / IMDBData.scala
Created June 27, 2018 01:08
IMDB Data Top 50 ratings
// user/gontiv/solutions/imdbdata/title.akas.tsv
// user/gontiv/solutions/imdbdata/title.ratings.tsv
/*
spark-shell --master yarn \
--conf spark.ui.port=12456 \
--num-executors 10 \
--executor-memory 3G \
--executor-cores 2 \
--packages com.databricks:spark-avro_2.10:2.0.1
*/
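// A minimal sketch of a top-50 query (the gist is truncated here), assuming the
// standard IMDb TSV layouts: title.ratings.tsv (tconst, averageRating, numVotes)
// and title.akas.tsv (titleId, ordering, title, region, ...), each with a header:
val ratingsRaw = sc.textFile("user/gontiv/solutions/imdbdata/title.ratings.tsv")
val ratingsHeader = ratingsRaw.first
val ratings = ratingsRaw.filter(_ != ratingsHeader).
  map(rec => {
    val c = rec.split("\t")
    (c(0), c(1).toFloat)                          // (tconst, averageRating)
  })

val akasRaw = sc.textFile("user/gontiv/solutions/imdbdata/title.akas.tsv")
val akasHeader = akasRaw.first
val titles = akasRaw.filter(_ != akasHeader).
  map(rec => {
    val c = rec.split("\t")
    (c(0), c(2))                                  // (titleId, title)
  })

// Join on the title id and take the 50 highest-rated titles
val top50 = titles.join(ratings).
  map { case (id, (title, rating)) => (rating, (id, title)) }.
  sortByKey(ascending = false).
  take(50)
top50.foreach(println)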