+++
title = "training"
author = ["Marco Ribeiro"]
draft = false
+++

CCA Spark and Hadoop Developer Exam (CCA175) {#cca-spark-and-hadoop-developer-exam--cca175}

Required Skills {#required-skills}

Data Ingest {#data-ingest}

The skills to transfer data between external systems and your cluster. This includes the following:

  • Import data from a MySQL database into HDFS using Sqoop

  • Export data to a MySQL database from HDFS using Sqoop

  • Change the delimiter and file format of data during import using Sqoop

  • Ingest real-time and near-real-time streaming data into HDFS

  • Process streaming data as it is loaded onto the cluster

  • Load data into and out of HDFS using the Hadoop File System commands
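
The Sqoop and hadoop fs items above are exercised from the command line. For the two streaming items, a minimal PySpark Streaming sketch (assuming an illustrative socket source and output prefix, not any particular exam setup) might look like this:

```python
# Read lines from a socket source and persist each micro-batch to HDFS.
# Host, port and the output prefix are illustrative placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-ingest-sketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

lines = ssc.socketTextStream("gateway-host", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Each batch is written as a new directory under the given prefix in HDFS.
counts.saveAsTextFiles("/user/YOUR_USER_ID/streaming/wordcounts")

ssc.start()
ssc.awaitTermination()
```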

Transform, Stage, and Store {#transform-stage-and-store}

Convert a set of data values in a given format stored in HDFS into new data values or a new data format and write them into HDFS.

  • Load RDD data from HDFS for use in Spark applications

  • Write the results from an RDD back into HDFS using Spark

  • Read and write files in a variety of file formats

  • Perform standard extract, transform, load (ETL) processes on data
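
A minimal PySpark sketch of this load/transform/store cycle; the input path, delimiter and column position are assumptions for illustration only:

```python
# Load delimited text from HDFS, filter it, and write it back with a new
# delimiter. Paths and column indexes are illustrative, not exam data.
from pyspark import SparkContext

sc = SparkContext(appName="etl-sketch")

raw = sc.textFile("/public/retail_db/orders")        # load RDD data from HDFS
parsed = raw.map(lambda line: line.split(","))       # split on the source delimiter
completed = parsed.filter(lambda cols: cols[3] == "COMPLETE")

# Re-serialize as tab-delimited text and write the results back into HDFS.
completed.map(lambda cols: "\t".join(cols)) \
         .saveAsTextFile("/user/YOUR_USER_ID/orders_complete_tsv")
```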

Data Analysis {#data-analysis}

Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by using queries against loaded data.

  • Use metastore tables as an input source or an output sink for Spark applications

  • Understand the fundamentals of querying datasets in Spark

  • Filter data using Spark

  • Write queries that calculate aggregate statistics

  • Join disparate datasets using Spark

  • Produce ranked or sorted data
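
A hedged Spark SQL sketch covering filter, join, aggregate and sort, assuming hypothetical metastore tables named orders and order_items with the usual retail_db columns:

```python
# Query metastore tables through HiveContext, aggregate, and write the
# report back to an (illustrative) metastore table as an output sink.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="analysis-sketch")
sqlContext = HiveContext(sc)  # gives access to Hive metastore tables

daily_revenue = sqlContext.sql("""
    SELECT o.order_date,
           round(sum(oi.order_item_subtotal), 2) AS revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    GROUP BY o.order_date
    ORDER BY o.order_date
""")

daily_revenue.write.saveAsTable("daily_revenue_report")
```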

Configuration {#configuration}

This is a practical exam and the candidate should be familiar with all aspects of generating a result, not just writing code.

  • Supply command-line options to change your application configuration, such as increasing available memory
Several arguments can be passed when submitting Spark applications or launching spark-shell/pyspark:

```sh
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
```

The most important parameters are:

  • --num-executors
  • --executor-cores
  • --executor-memory
  • --total-executor-cores

Here are a few important links:

<http://spark.apache.org/docs/latest/submitting-applications.html>
<http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/>

Spark 1 and Spark 2 {#spark-1-and-spark-2}

Your exam cluster runs CDH 5.15, which ships with Spark 1.6. An additional package has been installed to provide Spark 2.3. Candidates should know how to run two different versions of Spark before taking the exam. Instructions are found here: <https://www.cloudera.com/documentation/spark2/latest/topics/spark%5Frunning%5Fapps.html>

Exercises [0/16] {#exercises}

Data Repo {#data-repo}

TODO Exercise 01 - Get monthly crime count by type {#exercise-01-get-monthly-crime-count-by-type}

  • Details - Duration: 40 minutes
  • Data set [URL](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2)
  • Choose the language of your choice: Python or Scala
  • Data is available in HDFS under /public/crime/csv
  • You can check the properties of the files using hadoop fs -ls -h /public/crime/csv
  • Structure of data: (ID, Case Number, Date, Block, IUCR, Primary Type, Description, Location Description, Arrest, Domestic, Beat, District, Ward, Community Area, FBI Code, X Coordinate, Y Coordinate, Year, Updated On, Latitude, Longitude, Location)
  • File format: text file
  • Delimiter: ","
  • Get the monthly count of each primary crime type, sorted by month in ascending order and by the number of crimes per type in descending order
  • Store the result in the HDFS path /user/<YOUR\_USER\_ID>/solutions/solution01/crimes\_by\_type\_by\_month
  • Output File Format: TEXT
  • Output Columns: Month in YYYYMM format, crime count, crime type
  • Output Delimiter: \t (tab delimited)
  • Output Compression: gzip
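
One possible PySpark take on Exercise 01 follows. It is a sketch, not a model answer: it splits naively on commas (see Exercise 03 for a quote-aware split) and assumes the Date field is in MM/DD/YYYY hh:mm:ss form, as in the Chicago crime CSV.

```python
# Sketch for Exercise 01: monthly count per primary crime type, written
# as gzip-compressed, tab-delimited text. Paths follow the exercise text.
from pyspark import SparkContext

sc = SparkContext(appName="crimes-by-type-by-month")

crimes = sc.textFile("/public/crime/csv")
header = crimes.first()
records = crimes.filter(lambda line: line != header) \
                .map(lambda line: line.split(","))  # naive split; see Exercise 03

# Date is column 2 (e.g. "01/15/2017 11:30:00 PM"); Primary Type is column 5.
counts = records.map(lambda cols: ((cols[2][6:10] + cols[2][0:2], cols[5]), 1)) \
                .reduceByKey(lambda a, b: a + b)

# Sort by month ascending, then by count descending.
sorted_counts = counts.map(lambda kv: ((kv[0][0], -kv[1]), kv)) \
                      .sortByKey() \
                      .map(lambda pair: pair[1])

sorted_counts.map(lambda kv: "\t".join([kv[0][0], str(kv[1]), kv[0][1]])) \
    .saveAsTextFile("/user/YOUR_USER_ID/solutions/solution01/crimes_by_type_by_month",
                    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```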

TODO Exercise 02 - Get details of inactive customers {#exercise-02-get-details-of-inactive-customers}

  • Details - Duration: 15 to 20 minutes
  • Data is available in the local file system under /data/retail\_db
  • Source directories: /data/retail\_db/orders and /data/retail\_db/customers
  • Source delimiter: comma (",")
  • Source Columns - orders: order\_id, order\_date, order\_customer\_id, order\_status
  • Source Columns - customers: customer\_id, customer\_fname, customer\_lname and many more
  • Get the customers who have not placed any orders, sorted by customer\_lname and then customer\_fname
  • Target Columns: customer\_lname, customer\_fname
  • Number of files: 1
  • Target Directory: /user/<YOUR\_USER\_ID>/solutions/solutions02/inactive\_customers
  • Target File Format: TEXT
  • Target Delimiter: comma (", ")
  • Compression: N/A
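
A hedged PySpark sketch for Exercise 02, assuming the local /data/retail\_db files have first been copied into HDFS (for example with hadoop fs -put) under the illustrative paths shown:

```python
# Sketch for Exercise 02: customers with no orders, written as a single
# comma-delimited text file. Input paths assume the local retail_db data
# was copied into HDFS beforehand.
from pyspark import SparkContext

sc = SparkContext(appName="inactive-customers")

orders = sc.textFile("/user/YOUR_USER_ID/retail_db/orders") \
           .map(lambda line: line.split(","))
customers = sc.textFile("/user/YOUR_USER_ID/retail_db/customers") \
              .map(lambda line: line.split(","))

# Key both datasets by customer id, then keep customers with no matching order.
orders_by_cust = orders.map(lambda o: (o[2], 1))
cust_by_id = customers.map(lambda c: (c[0], (c[2], c[1])))  # (lname, fname)

inactive = cust_by_id.leftOuterJoin(orders_by_cust) \
                     .filter(lambda kv: kv[1][1] is None) \
                     .map(lambda kv: kv[1][0])

inactive.sortBy(lambda name: name) \
        .map(lambda name: ", ".join(name)) \
        .coalesce(1) \
        .saveAsTextFile("/user/YOUR_USER_ID/solutions/solutions02/inactive_customers")
```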

TODO Exercise 03 - Get top 3 crime types based on number of incidents in RESIDENCE area {#exercise-03-get-top-3-crime-types-based-on-number-of-incidents-in-residence-area}

  • Details - Duration: 15 to 20 minutes
  • Data is available in HDFS under /public/crime/csv
  • Structure of data: (ID, Case Number, Date, Block, IUCR, Primary Type, Description, Location Description, Arrest, Domestic, Beat, District, Ward, Community Area, FBI Code, X Coordinate, Y Coordinate, Year, Updated On, Latitude, Longitude, Location)
  • File format: text file
  • Delimiter: "," (use a regex while splitting, e.g. split(",(?=(?:[^\"]\*\"[^\"]\*\")\*[^\"]\*$)", -1), as some fields contain commas and are enclosed in double quotes)
  • Get the top 3 crime types based on the number of incidents in the RESIDENCE area, using "Location Description"
  • Store the result in the HDFS path /user/<YOUR\_USER\_ID>/solutions/solution03/RESIDENCE\_AREA\_CRIMINAL\_TYPE\_DATA
  • Output Fields: Crime Type, Number of Incidents
  • Output File Format: JSON
  • Output Delimiter: N/A
  • Output Compression: No
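
A possible PySpark sketch for Exercise 03, using a quote-aware regex split as suggested above; column positions follow the listed structure and may need checking:

```python
# Sketch for Exercise 03: top 3 crime types in RESIDENCE areas, saved as JSON.
import re

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="residence-crime-types")
sqlContext = SQLContext(sc)

# Split on commas that are not inside double quotes.
split_re = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

crimes = sc.textFile("/public/crime/csv")
header = crimes.first()
records = crimes.filter(lambda line: line != header) \
                .map(lambda line: split_re.split(line))

# Primary Type is column 5, Location Description is column 7.
residence = records.filter(lambda cols: cols[7] == "RESIDENCE")
counts = residence.map(lambda cols: (cols[5], 1)).reduceByKey(lambda a, b: a + b)

top3 = counts.takeOrdered(3, key=lambda kv: -kv[1])  # list of (type, count)

sqlContext.createDataFrame(top3, ["crime_type", "number_of_incidents"]) \
    .write.json("/user/YOUR_USER_ID/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")
```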

TODO Exercise 04 - Convert nyse data to parquet {#exercise-04-convert-nyse-data-to-parquet}

  • Details - Duration: 10 minutes
  • Data is available in the local file system under /data/nyse (ls -ltr /data/nyse)
  • Fields: (stockticker:string, transactiondate:string, openprice:float, highprice:float, lowprice:float, closeprice:float, volume:bigint)
  • Convert the file format to parquet
  • Save it to /user/<YOUR\_USER\_ID>/nyse\_parquet
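
A short PySpark sketch for Exercise 04, assuming the local /data/nyse files have been copied into HDFS under an illustrative path:

```python
# Sketch for Exercise 04: convert comma-delimited NYSE data to Parquet.
# Assumes the local /data/nyse files were first copied into HDFS.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="nyse-to-parquet")
sqlContext = SQLContext(sc)

nyse = sc.textFile("/user/YOUR_USER_ID/nyse") \
         .map(lambda line: line.split(",")) \
         .map(lambda c: Row(stockticker=c[0], transactiondate=c[1],
                            openprice=float(c[2]), highprice=float(c[3]),
                            lowprice=float(c[4]), closeprice=float(c[5]),
                            volume=int(c[6])))

sqlContext.createDataFrame(nyse).write.parquet("/user/YOUR_USER_ID/nyse_parquet")
```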

TODO Exercise 05 - Develop word count program {#exercise-05-develop-word-count-program}

  • Details - Duration: 20 minutes
  • Data is available in HDFS under /public/randomtextwriter
  • Get the word count for the input data using space as the delimiter (for each word, we need to get how many times it is repeated in the entire input data set)
  • Number of executors should be 10
  • Executor memory should be 3 GB
  • Executor cores should be 20 in total (2 per executor)
  • Number of output files should be 8
  • Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro\_2.10, version -> 2.0.1
  • Target Directory: /user/<YOUR\_USER\_ID>/solutions/solution05/wordcount
  • Target File Format: Avro
  • Target fields: word, count
  • Compression: N/A or default
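
A hedged PySpark sketch for Exercise 05; the submit command in the comment shows one way to satisfy the executor requirements, and the Avro output relies on the spark-avro package listed above:

```python
# Sketch for Exercise 05: word count written as Avro through spark-avro.
# Submitted, for example, with:
#   spark-submit --master yarn --num-executors 10 --executor-memory 3G \
#     --executor-cores 2 --packages com.databricks:spark-avro_2.10:2.0.1 wordcount.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="wordcount-avro")
sqlContext = SQLContext(sc)

counts = sc.textFile("/public/randomtextwriter") \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda a, b: a + b, 8)  # 8 partitions -> 8 output files

sqlContext.createDataFrame(counts, ["word", "count"]) \
    .write.format("com.databricks.spark.avro") \
    .save("/user/YOUR_USER_ID/solutions/solution05/wordcount")
```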

TODO Exercise 06 - Get details of top 5 customers by revenue for each month {#exercise-06-get-details-of-top-5-customers-by-revenue-for-each-month}

  • Details - Duration: 20 to 30 minutes
  • Tables should be in the Hive database <YOUR\_USER\_ID>\_retail\_db\_txt: orders, order\_items, customers
  • The time to create the database and tables need not be counted. Make sure to go back to the Spark SQL module and create the tables and load the data
  • Get the details of the top 5 customers by revenue for each month
  • We need all the details of the customer along with the month and the revenue per month
  • Data needs to be sorted by month in ascending order and by revenue per month in descending order
  • Create the table top5\_customers\_per\_month in <YOUR\_USER\_ID>\_retail\_db\_txt
  • Insert the output into the newly created table
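
One possible Spark SQL approach to Exercise 06, shown as a hedged PySpark sketch; the column names follow the usual retail\_db schema (adjust as needed) and only a few representative customer columns are selected:

```python
# Sketch for Exercise 06: top 5 customers by revenue for each month, using
# a rank() window function over the Hive tables created beforehand.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="top5-customers-per-month")
sqlContext = HiveContext(sc)
sqlContext.sql("USE YOUR_USER_ID_retail_db_txt")

top5 = sqlContext.sql("""
    SELECT order_month, customer_id, customer_fname, customer_lname, revenue
    FROM (
        SELECT substr(o.order_date, 1, 7) AS order_month,
               c.customer_id, c.customer_fname, c.customer_lname,
               round(sum(oi.order_item_subtotal), 2) AS revenue,
               rank() OVER (PARTITION BY substr(o.order_date, 1, 7)
                            ORDER BY sum(oi.order_item_subtotal) DESC) AS rnk
        FROM orders o
        JOIN order_items oi ON o.order_id = oi.order_item_order_id
        JOIN customers c    ON o.order_customer_id = c.customer_id
        GROUP BY substr(o.order_date, 1, 7),
                 c.customer_id, c.customer_fname, c.customer_lname
    ) ranked
    WHERE rnk <= 5
    ORDER BY order_month, revenue DESC
""")

# Materialize the result as a new table in the current database.
top5.registerTempTable("top5_tmp")
sqlContext.sql("CREATE TABLE top5_customers_per_month AS SELECT * FROM top5_tmp")
```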

TODO Problem with Solution: Find the top 50 rated and voted movies with more than 100000 votes {#problem-with-solution-find-the-top-50-rated-and-voted-movies-with-more-than-100000-votes}

Download the data from https://datasets.imdbws.com/ (title.ratings.tsv.gz and title.akas.tsv.gz). Find the top 50 rated movies with more than 100000 votes.

Storage details:

  • Columns: titleId, title, region, language, averageRating, numVotes
  • Store the result at: home/cloudera/workspace/movies/
  • Store the result in the following formats:
    a. Text file, columns separated with tab "\t", compression: BZip2Codec
    b. Sequence file, compression: BZip2Codec
    c. JSON file, compression: BZip2Codec
    d. Parquet, compression: uncompressed
    e. ORC file
    f. Avro file, compression: uncompressed

Use the following methods:

  • Method 1: Use RDD
  • Method 2: Use DF
  • Method 3: Use SQL query

Solution
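
For reference, a hedged PySpark sketch of the DataFrame method only, assuming Spark 2's built-in CSV reader and that the two .tsv.gz files sit in HDFS under an illustrative movies/ directory; only two of the requested output formats are shown:

```python
# Sketch (DataFrame method only): top 50 rated titles with more than
# 100000 votes, written as bzip2 text and uncompressed Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("imdb-top50").getOrCreate()

ratings = spark.read.csv("movies/title.ratings.tsv.gz", sep="\t", header=True)
akas = spark.read.csv("movies/title.akas.tsv.gz", sep="\t", header=True)

top50 = ratings.filter(col("numVotes").cast("int") > 100000) \
    .join(akas, ratings.tconst == akas.titleId) \
    .select("titleId", "title", "region", "language", "averageRating", "numVotes") \
    .orderBy(col("averageRating").cast("float").desc()) \
    .limit(50)

# a. Tab-delimited text with BZip2 compression.
top50.write.csv("movies/top50_rated_text", sep="\t", compression="bzip2")

# d. Uncompressed Parquet.
top50.write.parquet("movies/top50_rated_parquet", compression="uncompressed")
```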

TODO Problem with solution: Find the top 50 voted movies {#problem-with-solution-find-the-top-50-voted-movies}

Download the data from https://datasets.imdbws.com/ (title.ratings.tsv.gz and title.akas.tsv.gz). Find the top 50 voted movies.

Storage details:

  • Columns: titleId, title, region, language, averageRating, numVotes
  • Store the result at: home/cloudera/workspace/movies/
  • Store the result in the following formats:
    a. Text file, columns separated with tab "\t", compression: BZip2Codec
    b. Sequence file, compression: BZip2Codec
    c. JSON file, compression: BZip2Codec
    d. Parquet, compression: uncompressed
    e. ORC file
    f. Avro file, compression: uncompressed

Use the following methods:

  • Method 1: Use RDD
  • Method 2: Use DF
  • Method 3: Use SQL query

Solution

TODO Problem with Solution: Find the top 50 rated movies with more than 100000 votes {#problem-with-solution-find-the-top-50-rated-movies-with-more-than-100000-votes}

Download the data from https://datasets.imdbws.com/ (title.ratings.tsv.gz and title.akas.tsv.gz). Find the top 50 rated movies with more than 100000 votes and the top 50 voted movies.

Storage details:

  • Columns: titleId, title, region, language, averageRating, numVotes
  • Store the result at: home/cloudera/workspace/movies/hive
  • Store the result in the following formats:
    a. Text file, columns separated with tab "\t"
    b. Sequence file
    c. RC file
    d. Parquet
    e. ORC file, compression: SNAPPY
    f. Avro file

Use Hive to load the data and write the output to the required location.

Solution: <http://datamazic.blogspot.ca/2018/02/problem-find-top-rated-movie-using-hive.html>

TODO Problem 1 {#problem-1}

  • Using Sqoop, import the orders table into HDFS under /user/cloudera/problem1/orders. Files should be loaded as Avro and use Snappy compression.
  • Using Sqoop, import the order_items table into HDFS under /user/cloudera/problem1/order-items. Files should be loaded as Avro and use Snappy compression.
  • Using Spark Scala, load the data at /user/cloudera/problem1/orders and /user/cloudera/problem1/order-items as dataframes.
  • Expected intermediate result: Order_Date, Order_status, total_orders, total_amount. In plain English: find the total orders and total amount per status per day. The result should be sorted by order date in descending order, order status in ascending order, total amount in descending order and total orders in ascending order.
  • Aggregation should be done using the methods below; sorting can be done using a dataframe or RDD. Perform the aggregation in each of the following ways (a PySpark analogue of step a follows this list):
    a. Just by using the Data Frames API - here order_date should be in YYYY-MM-DD format
    b. Using Spark SQL - here order_date should be in YYYY-MM-DD format
    c. By using the combineByKey function on RDDs - no need to format order_date or total_amount
  • Store the result as Parquet files in HDFS using gzip compression under the folders /user/cloudera/problem1/result4a-gzip, /user/cloudera/problem1/result4b-gzip and /user/cloudera/problem1/result4c-gzip
  • Store the result as Parquet files in HDFS using Snappy compression under the folders /user/cloudera/problem1/result4a-snappy, /user/cloudera/problem1/result4b-snappy and /user/cloudera/problem1/result4c-snappy
  • Store the result as CSV files in HDFS using no compression under the folders /user/cloudera/problem1/result4a-csv, /user/cloudera/problem1/result4b-csv and /user/cloudera/problem1/result4c-csv
  • Create a MySQL table named result and load the data from /user/cloudera/problem1/result4a-csv into it
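
The problem asks for Scala; as a rough analogue, here is a hedged PySpark sketch of step (a) only, assuming the Sqoop-imported Avro files are readable through the spark-avro package and that order_date arrives as epoch milliseconds (adjust if it is already a date string):

```python
# Partial sketch of Problem 1, step (a) only: Data Frame aggregation over the
# Sqoop-imported Avro files, then a gzip-compressed Parquet write.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, countDistinct, from_unixtime, to_date
from pyspark.sql.functions import sum as sum_

sc = SparkContext(appName="problem1-step-a")
sqlContext = SQLContext(sc)

orders = sqlContext.read.format("com.databricks.spark.avro") \
    .load("/user/cloudera/problem1/orders")
order_items = sqlContext.read.format("com.databricks.spark.avro") \
    .load("/user/cloudera/problem1/order-items")

joined = orders.join(order_items,
                     orders.order_id == order_items.order_item_order_id)

# order_date is assumed to be epoch milliseconds after the Avro import;
# convert it to a YYYY-MM-DD date before grouping.
result = joined.groupBy(
        to_date(from_unixtime(col("order_date") / 1000)).alias("order_date"),
        col("order_status")) \
    .agg(countDistinct(col("order_id")).alias("total_orders"),
         sum_(col("order_item_subtotal")).alias("total_amount")) \
    .orderBy(col("order_date").desc(), col("order_status"),
             col("total_amount").desc(), col("total_orders"))

sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
result.write.parquet("/user/cloudera/problem1/result4a-gzip")
```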

TODO Problem 2 {#problem-2}

TODO Problem 3 {#problem-3}

TODO Problem 4 {#problem-4}

TODO Problem 5 {#problem-5}

TODO Problem 6 {#problem-6}

TODO Problem 7 {#problem-7}


CS246: Mining Massive Data Sets {#cs246-mining-massive-data-sets}

Mining Massive Datasets - Coursera MOOC videos

DSE230x - Big Data Analytics Using Spark {#dse230x-big-data-analytics-using-spark}

Google Professional Data Engineer {#google-professional-data-engineer}

AWS Certified Big Data - Specialty {#aws-certified-big-data-specialty}

Intro to Hadoop and MapReduce {#intro-do-hadoop-and-mapreduce}

Datasets {#datasets}

Dataset of problem set 1 (Purchases) {#dataset-of-problem-set-1--purchases}

Dataset of problem set 2 (Access Logs) {#dataset-of-problem-set-2--access-logs}

Dataset of the final project (Forum Data) {#dataset-of-the-final-project--forum-data}

Solution {#solution}

https://jefflirion.github.io/udacity/Intro%5Fto%5FHadoop%5Fand%5FMapReduce/Problem%5FSet4.html

Bash Alias {#bash-alias}

https://gist.github.com/LouisKottmann/7803450

```sh
run_mapreduce() {
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper $1 -reducer $2 -file $1 -file $2 -input $3 -output $4
}

alias hs=run_mapreduce

run_mapcombinereduce() {
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper $1 -reducer $2 -combiner $2 -file $1 -file $2 -input $3 -output $4
}

alias hsc=run_mapcombinereduce

# Usage: hs mapper.py reducer.py input_1 tutorial_out
```

Mapper {#mapper}

```python
import sys

def mapper():
    # Read standard input line by line.
    for line in sys.stdin:
        # Strip off extra whitespace, split on tab and put the data in a list.
        data = line.strip().split("\t")

        if len(data) != 6:
            # Something has gone wrong. Skip this line.
            continue

        date, time, store, item, cost, payment = data

        # Now print out the data that will be passed to the reducer.
        print("{0}\t{1}".format(store, cost))

mapper()
```

Mapper with regex {#mapper-com-regex}

```python
import sys
import re

def mapper():
    # Each line is in the NCSA Common Log Format.
    # For each line, the mapper writes the requested URL and a count of 1
    # to the standard output, separated by a tab character.
    for line in sys.stdin:
        matches = re.findall(r'\"(.+?)\"', line)
        if matches:
            request = matches[0]
            url = request.split()[1]
            print("{0}\t{1}".format(url, 1))

mapper()
```

Reducer {#reducer}

https://gist.github.com/chenghan/7456549

```python
import sys

salesTotal = 0
oldKey = None

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        # Something has gone wrong. Skip this line.
        continue

    thisKey, thisSale = data
    if oldKey and oldKey != thisKey:
        # The key changed: emit the total for the previous key and reset.
        print("{0}\t{1}".format(oldKey, salesTotal))
        salesTotal = 0

    oldKey = thisKey
    salesTotal += float(thisSale)

if oldKey is not None:
    # Emit the total for the last key.
    print("{0}\t{1}".format(oldKey, salesTotal))
```

Running code locally {#running-code-locally}

```sh
cat test_in.txt | python mapper.py | sort | python reducer.py
```

Hadoop Commands {#hadoop-commands}

```sh
# create a small sample
head -200 purchases.txt > samples.txt

# create a sample from a gzip file
gzip -cd ../access_log.gz | head > samples.txt

# test code locally
cat testfile | ./mapper.py | sort | ./reducer.py

# run a job
hs mapper.py reducer.py input_folder output_folder

# view the results
hadoop fs -cat output_folder/part-00000 | less

# retrieve the results
hadoop fs -get output_folder/part-00000 results.txt

# delete a folder
hadoop fs -rm -r delete_folder

# run mapreduce directly (without the alias)
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper ./sales_per_store_mapper.py -reducer ./total_per_key_reducer.py -file ./base.py -file ./sales_per_store_mapper.py -file ./total_per_key_reducer.py -input myinput -output sbbs1
```

Test Mapper on Python {#test-mapper-on-python}

```python
import sys

def mapper():
    for line in sys.stdin:
        data = line.strip().split("\t")

        if len(data) != 6:
            continue

        date, time, store, item, cost, payment = data

        print("{0}\t{1}".format(store, cost))

test_text = """2013-10-09\t13:22\tMiami\tBoots\t99.95\tVisa
2013-10-09\t13:22\tNew York\tDVD\t9.50\tMasterCard
2013-10-09 13:22:59 I/O Error
^d8x28orz28zoijzu1z1zp1OHH3du3ixwcz114<f
1\t2\t3"""

def main():
    # Redirect stdin to the test data, run the mapper, then restore stdin.
    from io import StringIO
    sys.stdin = StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()
```

Datasets {#datasets}

Retail DB Dataset {#retail-db-dataset}
