: 'Problem 1
Connect to the MySQL database on the ITVersity labs using Sqoop and import all of the data from the orders table into HDFS.
Output Requirements:
Place the order files in the HDFS directory /user/yourusername/problem1/solution/
Replace yourusername with your OS user name
Use text file format with a comma as the field delimiter
Load every order record completely'
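# The command below is a completion sketch: the flags after "sqoop import" are an
# addition, and the connection details are assumed to match the retail_db example
# later in this file. Replace yourusername with your OS user name as noted above.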
sqoop import \ |
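  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table orders \
  --target-dir /user/yourusername/problem1/solution \
  --fields-terminated-by ',' \
  --as-textfile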
/*
spark-shell --master yarn \
  --conf spark.ui.port=12345 \
  --num-executors 6 \
  --executor-cores 2 \
  --executor-memory 2G \
  --packages com.databricks:spark-avro_2.10:2.0.1
*/
import com.databricks.spark.avro._

// Read the orders data stored as Avro (note the leading slash on the HDFS path)
val orderDF = sqlContext.read.avro("/user/cloudera/problem1/orders")
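// A quick sanity check of the load (an addition, not part of the original snippet):
orderDF.printSchema()
orderDF.show(5)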
/*
Structure of data: (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - "," (use the regex split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1) while splitting, as some fields contain commas and are enclosed in double quotes)
Get the top 3 crime types based on the number of incidents at RESIDENCE locations, using the "Location Description" field
Store the result in the HDFS path /user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA
Output Fields: Crime Type, Number of Incidents
Output File Format: JSON
Output Delimiter: N/A
Output Compression: No
*/
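// A minimal solution sketch, not from the original gist. It assumes the crime data
// sits under /public/crime/csv (as in the other problems in this file) and that each
// file starts with a header line; Location Description is field 7, Primary Type field 5.
import sqlContext.implicits._

val crimes = sc.textFile("/public/crime/csv")
val crimesHeader = crimes.first
val splitRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
val residenceCounts = crimes.
  filter(rec => rec != crimesHeader && rec.split(splitRegex, -1)(7) == "RESIDENCE").
  map(rec => (rec.split(splitRegex, -1)(5), 1)).
  reduceByKey(_ + _)
val top3DF = sc.parallelize(residenceCounts.sortBy(-_._2).take(3)).
  toDF("crime_type", "number_of_incidents")
top3DF.write.json("/user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")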
/*
sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table orders \
  --target-dir data/retail_db/orders \
  --as-textfile \
  --num-mappers 1
*/
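// Not part of the original: the relative --target-dir above resolves under the
// user's HDFS home directory, so the result can be verified with
//   hadoop fs -ls data/retail_db/orders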
/*
Data is available in HDFS under /public/randomtextwriter
Get the word count for the input data using space as the delimiter (for each word, get how many times it is repeated in the entire input data set)
Number of executors should be 10
Executor memory should be 3 GB
Executor cores should be 20 in total (2 per executor)
Number of output files should be 8
Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro_2.10, version -> 2.0.1
Target Directory: /user/<YOUR_USER_ID>/solutions/solution05/wordcount
Target File Format: Avro
*/
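// A minimal solution sketch, not part of the original. It assumes spark-shell was
// launched with --packages com.databricks:spark-avro_2.10:2.0.1 per the dependency
// details above; reduceByKey with 8 partitions yields the 8 required output files.
import com.databricks.spark.avro._
import sqlContext.implicits._

val wordCounts = sc.textFile("/public/randomtextwriter").
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _, 8).
  toDF("word", "count")
wordCounts.write.avro("/user/<YOUR_USER_ID>/solutions/solution05/wordcount")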
/*
Data is available in HDFS under /public/crime/csv
You can check the properties of the files using hadoop fs -ls -h /public/crime/csv
Structure of data: (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
File format - text file
Delimiter - ","
Get the monthly count of each primary crime type, sorted by month in ascending order and by the number of crimes per type in descending order
Store the result in the HDFS path /user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month
Output File Format: TEXT
Output Columns: Month in YYYYMM format, crime count, crime type
Output Delimiter: \t (tab delimited)
*/
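// A minimal solution sketch, not part of the original. It assumes the Date field
// (index 2) is formatted like "01/15/2016 10:30:00 PM", so YYYYMM can be built from
// substrings, and that each file starts with a header line.
val crimeData = sc.textFile("/public/crime/csv")
val crimeHeader = crimeData.first
val monthlyCounts = crimeData.filter(_ != crimeHeader).
  map { rec =>
    val fields = rec.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
    val month = fields(2).substring(6, 10) + fields(2).substring(0, 2) // YYYYMM
    ((month, fields(5)), 1)                                           // key: (month, primary type)
  }.
  reduceByKey(_ + _)
val sortedLines = monthlyCounts.
  map { case ((month, crimeType), count) => ((month.toInt, -count), month + "\t" + count + "\t" + crimeType) }.
  sortByKey().
  map(_._2)
sortedLines.saveAsTextFile("/user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month")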
// Convert NYSE data (multiple files) to Parquet (4 output files)
/*
Data is available in the local file system under /data/nyse (ls -ltr /data/nyse)
Fields: (stockticker:string, transactiondate:string, openprice:float, highprice:float, lowprice:float, closeprice:float, volume:bigint)
Convert the file format to Parquet
Save it to /user/<YOUR_USER_ID>/nyse_parquet
*/
/*
spark-shell --master yarn \
  --conf spark.ui.port=12456 \
  --num-executors 4
*/
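// A minimal solution sketch, not part of the original. Since /data/nyse is on the
// local file system, it is read with the file:// scheme (assumption: the path is
// reachable from the executors; otherwise copy it to HDFS first with hadoop fs -put).
// coalesce(4) produces the required 4 output files; types follow the field list above.
import sqlContext.implicits._

val nyseDF = sc.textFile("file:///data/nyse").
  map { rec =>
    val f = rec.split(",")
    (f(0), f(1), f(2).toFloat, f(3).toFloat, f(4).toFloat, f(5).toFloat, f(6).toLong)
  }.
  toDF("stockticker", "transactiondate", "openprice", "highprice", "lowprice", "closeprice", "volume")
nyseDF.coalesce(4).write.parquet("/user/<YOUR_USER_ID>/nyse_parquet")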
// /user/gontiv/solutions/imdbdata/title.akas.tsv
// /user/gontiv/solutions/imdbdata/title.ratings.tsv
/*
spark-shell --master yarn \
  --conf spark.ui.port=12456 \
  --num-executors 10 \
  --executor-memory 3G \
  --executor-cores 2 \
  --packages com.databricks:spark-avro_2.10:2.0.1
*/
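// The original snippet ends after the launch command; a minimal sketch for loading
// the two TSV files follows (assumption: both are tab-delimited with header rows).
val akas = sc.textFile("/user/gontiv/solutions/imdbdata/title.akas.tsv")
val akasHeader = akas.first
val akasRows = akas.filter(_ != akasHeader).map(_.split("\t", -1))

val ratings = sc.textFile("/user/gontiv/solutions/imdbdata/title.ratings.tsv")
val ratingsHeader = ratings.first
val ratingsRows = ratings.filter(_ != ratingsHeader).map(_.split("\t", -1))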