Hands-on assignments - Spark on EMR

# Preamble

If you are reading this, congratulations! You have already set up your first cluster on EMR with Spark installed. Today, we will get hands-on with several features of Spark. Mainly, we will focus on pipelining transformations and actions.

## Shakespeare

  • There is a text file, Shakespeare_all.txt, in the S3 bucket datateched. I want you to copy it from S3 to your local system. To do this, you will need to use the AWS CLI. This should already be set up on your machine, since you all installed ipython and other libraries (yay!)

  •      aws s3 cp s3://datateched/<filename> <local destination path>
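
  • For example, using the file name from above and the current working directory as the destination (the destination path here is just an assumption; any local path works):

       aws s3 cp s3://datateched/Shakespeare_all.txt ./Shakespeare_all.txt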
    
  • After the command above runs correctly, you should see the file in the local directory. Spark can work with text files on the local file system as well; however, the file then has to be present at the same path on every machine in the cluster. We want to test Spark with HDFS, so we will copy this file to HDFS.

  • HDFS is the Hadoop Distributed File System. It has a command-line shell; you can see its commands by typing

  • hdfs dfs

  • That should bring up all the commands you can run. As you can see, it looks and feels just like a normal Linux file system! Internally, it takes care of a lot of the fault-tolerance details. The command to put the text file you copied from S3 into HDFS is "-put". I will let you find out the complete command to put the file into /tmp/Shakespeare.txt :)
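
  • Once you have put the file into HDFS, a quick way to confirm that it landed (assuming you chose /tmp/Shakespeare.txt as the destination) is to list the directory:

       hdfs dfs -ls /tmp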

## RDDs!

  • Type pyspark in the terminal. pyspark is a Python shell that interacts with the Spark core APIs; in other words, you write Python code to use Spark (Spark itself is written in Scala). Today, we will get familiar with RDDs, transformations, and actions using pyspark.
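
  • As a quick sanity check once the shell is up, type sc at the ">>>" prompt; it should show a SparkContext object, and sc.version prints the Spark version:

       >>> sc          # should print a SparkContext object
       >>> sc.version  # prints the Spark version string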

  • Note: Sometimes, when you start pyspark, it keeps printing log output and then stops without showing the ">>>" prompt. In that case, just press Enter and the ">>>" prompt should appear. This happens occasionally due to buffering issues.

  • The first step is to load all works of Shakespeare into a base RDD. The file is a text file, and the pyspark shell gives you a SparkContext instance named sc, which is the entry point to Spark. So, let's load that text file into a base RDD.

  • shakespeare = sc.textFile("/tmp/Shakespeare.txt")
    
  • This will create a base RDD. In plain English, this will divide the file into blocks and, depending on how many data nodes you have in your cluster (number of data nodes = total number of nodes - number of master nodes), point the machines to those blocks. The thing to note is that Spark will not actually do anything in this step: it will not split the file and copy it to those machines yet. Spark is lazy, which means it only does the work when you ask for a result. For example, map, flatMap, and filter are all transformations, and after calling them you still haven't asked for a result. When you then call an action such as count(), Spark realizes it has to run all of those transformations and starts from the beginning! This is great because no intermediate results need to be stored between map and flatMap, or between flatMap and filter. Also, all of the operators are run together on the executor nodes and only the final result is returned to the driver. These details make Spark faster than conventional MapReduce.
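
  • To see this laziness in action, here is a minimal sketch (the word 'hamlet' and the variable names are just illustrative): the two transformation lines return immediately because nothing has been computed yet, and only the count() at the end makes Spark read the file and run the whole pipeline.

       lines = sc.textFile("/tmp/Shakespeare.txt")                    #transformation: nothing is read yet
       hamlet_lines = lines.filter(lambda l: 'hamlet' in l.lower())   #another transformation, still lazy
       hamlet_lines.count()                                           #action: now the file is read and the pipeline runs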

  • Q1: Find out the number of partitions that shakespeare has!

  • Q2: What is the command to set the number of partitions you want yourself?

  • Now that you have your base RDD, you can run transformations on it. The first task is to read each line and break it into words. The transformation for this in Spark is called flatMap. The difference between flatMap and map is that flatMap takes an RDD of 'N' objects, can turn each object into multiple objects, and flattens all of the results into a single RDD. map, on the other hand, does not create a collection of collections; it transforms each of the 'N' elements into exactly one output element. For example:

       numbers = sc.parallelize([1, 2, 3, 4, 5]) #parallelize takes a dataset and distributes it as an RDD.
       squares = numbers.map(lambda l: l**2) #this squares each number in numbers
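
  • To make the contrast concrete, here is a small sketch on made-up data: map keeps one output element per input element (here, a list of words per sentence), while flatMap flattens those lists into a single RDD of words.

       sentences = sc.parallelize(["to be or not to be", "et tu brute"])
       as_lists = sentences.map(lambda s: s.split())     #RDD of 2 elements, each a list of words
       as_words = sentences.flatMap(lambda s: s.split()) #RDD of 9 elements, one word each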
    
  • Q3: Use flatMap to create a new RDD that takes the Shakespeare RDD and converts each line into a list of words

  • Q4: Find out what the first 5 elements of this new RDD look like. The command can be found in the documentation linked to above!

  • Q5: Which type of RDD is created when you use flatMap?

  • Q6: Create another RDD that first does flatMap and then filter to look for a word of your choice in Shakespeare. You can look for multiple words by using something of the form

                  'anthony' in line or 'julius' in line
    
  • Q7: Print the count of elements that match your condition. How many do you see for the different cases?

  • Q8: Find the count of each word in all works of Shakespeare. Which transformations would you need? Check documentation!

  • Q9: Can you tell me the top 10 words based on count? Which transformation did you use?

  • Q10: Considering you are running so many queries on the same flatMap RDD, what should you do to make the queries run faster? Can you . . . . store it somehow?
