
Tutorial Overview

In this tutorial you will gain a working knowledge of Pig through hands-on experience creating Pig scripts to carry out essential data operations and tasks.

We will first read in two data files that contain New York Stock Exchange dividend prices and stock prices, and then use these files to perform a number of Pig operations including:

  • Define a relation with and without schema
  • Define a new relation from an existing relation
  • Select specific columns from within a relation
  • Join two relations
  • Sort the data using ‘ORDER BY’
  • Filter and group the data using ‘GROUP BY’

This tutorial was derived from one of the lab problems in the Hortonworks Developer training class. The developer training class covers uses of the tools in the Hortonworks Data Platform and how to develop applications and projects using the Hortonworks Data Platform. You can find more information about the course at Hadoop Training for Developers.

What is Pig?

Pig is a high-level scripting language used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Pig is complete, so you can perform all required data manipulations in Apache Hadoop with Pig alone. Through its User Defined Functions (UDF) facility, Pig can invoke code written in many languages, such as JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
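For example, a Java UDF can be registered and invoked from Pig Latin like this (a minimal sketch; the jar name and class are hypothetical placeholders, not part of this tutorial):

    -- Register a jar containing a user-defined function (hypothetical jar)
    REGISTER myudfs.jar;
    -- Bind a short alias to the UDF's fully qualified class name (hypothetical class)
    DEFINE TO_UPPER com.example.udf.ToUpper();
    A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',')
        AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
    -- Invoke the UDF just like a built-in function
    B = FOREACH A GENERATE TO_UPPER(symbol);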

Pig works with data from many sources, including structured and unstructured data, and can store the results in the Hadoop Distributed File System (HDFS).

Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
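If you are curious what those jobs look like, Pig's EXPLAIN command prints the logical, physical and MapReduce execution plans for a relation, for example:

    STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',');
    -- Show how Pig will translate this relation into MapReduce jobs
    EXPLAIN STOCK_A;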

Download the Data

You’ll need sample data for this tutorial. The data set you will be using is stock ticker data from the New York Stock Exchange from the years 2000-2001. Download this sample data from the following location:

https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip

The file is about 11 megabytes, and might take a few minutes to download.

Open the folder infochimps_dataset_4778_download_16677 > NYSE and locate the two data files that you will be using for this tutorial:

  • NYSE_daily_prices_A.csv
  • NYSE_dividends_A.csv

Step 1: Upload the data files

Select the HDFS Files view from the off-canvas menu at the top (the views menu). The HDFS Files view allows you to browse the Hortonworks Data Platform (HDP) file store. The HDP file system is separate from the local file system.

Navigate to /user/admin, then click Upload and Browse, which brings up a dialog box where you can select the NYSE_daily_prices_A.csv file from your computer. Upload the NYSE_dividends_A.csv file in the same way. When finished, notice that both files are now in HDFS.

Step 2: Create Your Script

Open the Pig interface by clicking the Pig Button in the views menu.

On the left we can choose between our saved Pig Scripts, UDFs and the Pig Jobs executed in the past. To the right of this menu bar we see our saved Pig Scripts.

Click the “New Script” button, enter “Pig-Dividend” as the title of your script, and leave the location path empty.

Below you can find an overview of the functionality the Pig interface makes available. A special feature of the interface is the Pig helper at the top left of the composition area, which provides templates for Pig statements, functions, I/O statements, HCatLoader() and Python user-defined functions.

Step 3: Define a relation

In this step, you will create a script to load the data and define a relation.

  • On line 1 define a relation named STOCK_A that represents the NYSE stocks that start with the letter “A”
  • On line 2 use the DESCRIBE command to view the STOCK_A relation

The completed code will look like:

STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(','); 
DESCRIBE STOCK_A; 

Step 4: Save and Execute the Script

Click the Save button to save your changes to the script, then click Execute to run it. This action creates one or more MapReduce jobs. After a moment the script starts and the page changes, giving you the opportunity to kill the job if you want to stop it.

Next to the Kill job button is a progress bar with a text field above that shows the job’s status.

When the job completes, check the results in the green box. You can also download the results to your system by clicking the download icon. Notice that STOCK_A does not have a schema, because we did not define one when loading the data.
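If you are following along, the output of the DESCRIBE command should contain a line similar to this (the expected message when no schema was declared):

    Schema for STOCK_A unknown.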

Step 5: Define a Relation with a Schema

Let’s use the above code but this time with a schema. Modify line 1 of your script and add the following AS clause to define a schema for the daily stock price data. The complete code will be:

    STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') 
    AS (exchange:chararray, symbol:chararray, date:chararray,                 
    open:float, high:float, low:float, close:float, volume:int, adj_close:float); 
    DESCRIBE STOCK_A; 

Save and execute the script again. This time you should see the schema for the STOCK_A relation:
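The DESCRIBE output should look roughly like the following, listing each field and its type:

    STOCK_A: {exchange: chararray,symbol: chararray,date: chararray,open: float,high: float,low: float,close: float,volume: int,adj_close: float}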

Step 6: Define a new relation from an existing relation

You can define a new relation based on an existing one. For example, define the following B relation, which is a collection of 100 entries (arbitrarily selected) from the STOCK_A relation.

Add the following line to the end of your code:

    B = LIMIT STOCK_A 100; 
    DESCRIBE B; 

Save and execute the code. Notice B has the same schema as STOCK_A, because B is a subset of the STOCK_A relation.

Step 7: View the Data

To view the data of a relation, use the DUMP command.

Add the following DUMP command to your Pig script, then save and execute it again:

    DUMP B;

The command requires a MapReduce job to execute, so you will need to wait a minute or two for the job to complete. The output should be 100 entries from NYSE_daily_prices_A.csv (not necessarily the same 100 each time, because the entries are arbitrarily chosen):
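Each line of the output is a tuple whose fields follow the LOAD order; the values below are placeholders, not real rows from the file:

    (NYSE,<symbol>,<date>,<open>,<high>,<low>,<close>,<volume>,<adj_close>)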

Step 8: Select specific columns from a relation

Delete the DESCRIBE STOCK_A, DESCRIBE B and DUMP B commands from your Pig script; you will no longer need them.

One of the key uses of Pig is data transformation. You can define a new relation based on the fields of an existing relation using the FOREACH command. Define a new relation C, which will contain only the symbol, date and close fields from relation B.

Now the complete code is:

    STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') 
    AS (exchange:chararray, symbol:chararray, date:chararray, open:float, 
    high:float, low:float, close:float, volume:int, adj_close:float); 
    
    B = LIMIT STOCK_A 100; 
    C = FOREACH B GENERATE symbol, date, close; 
    DESCRIBE C; 

Save and execute the script and your output will look like the following:
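Since the script ends with DESCRIBE C, the output should show a schema containing only the three projected fields, roughly:

    C: {symbol: chararray,date: chararray,close: float}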

Step 9: Store relation data into an HDFS file

In this step, you will use the STORE command to output a relation into a new file in HDFS. Enter the following command to output the C relation to a folder named output/C (then save and execute):

    STORE C INTO 'output/C'; 

Again, this requires a MapReduce job (just like the DUMP command), so you will need to wait a minute for the job to complete.

Once the job is finished, go to the HDFS Files view and look for a newly created folder called “output” under /user/admin.

Click on the “output” folder. You will find a subfolder named “C”.

Click on the “C” folder. You will see an output file called “part-r-00000”.

Click on the file “part-r-00000” to download it.
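If you prefer the command line, the same file can be inspected from Pig's Grunt shell with an HDFS shell command (a sketch, assuming your working directory is /user/admin):

    grunt> fs -cat output/C/part-r-00000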

Step 10: Perform a join between two relations

In this step, you will perform a join on two NYSE data sets: the daily prices and the dividends. Dividend prices are reported quarterly, while stock prices are reported daily.

You have already defined a relation for the stocks named STOCK_A. Create a new Pig script named “Pig-Join”. Then define a new relation named DIV_A that represents the dividends for stocks that start with “A”, join STOCK_A and DIV_A by both symbol and date, and describe the schema of the new relation C.

The complete code will be:

    STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') 
        AS (exchange:chararray, symbol:chararray, date:chararray,
        open:float, high:float, low:float, close:float, volume:int, adj_close:float); 
    DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',') 
        AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); 
    C = JOIN STOCK_A BY (symbol, date), DIV_A BY (symbol, date); 
    DESCRIBE C; 

Save the script and execute it. Notice C contains all the fields of both STOCK_A and DIV_A. You can use the DUMP command to see the data stored in the relation C.
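For reference, DESCRIBE C should report the combined schema, with each field prefixed by the relation it came from, roughly:

    C: {STOCK_A::exchange: chararray,STOCK_A::symbol: chararray,STOCK_A::date: chararray,STOCK_A::open: float,STOCK_A::high: float,STOCK_A::low: float,STOCK_A::close: float,STOCK_A::volume: int,STOCK_A::adj_close: float,DIV_A::exchange: chararray,DIV_A::symbol: chararray,DIV_A::date: chararray,DIV_A::dividend: float}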

Step 11: Sort the data using “ORDER BY”

Use the ORDER BY command to sort a relation by one or more of its fields. Create a new Pig script named “Pig-sort” and enter the following commands to sort the dividends by symbol then date in ascending order:

    DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',')
        AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); 
    B = ORDER DIV_A BY symbol, date ASC; 
    DUMP B; 

Save and execute the script. Your output should be sorted by symbol and then by date.
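The DUMP output lists one tuple per dividend record, with all rows for a symbol appearing together in ascending date order; the values below are placeholders, not real rows from the file:

    (NYSE,<symbol>,<date>,<dividend>)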

Step 12: Filter and Group the data using “GROUP BY”

The GROUP command allows you to group a relation by one of its fields. Create a new Pig script named “Pig-group”. Then enter the following commands, which filter the DIV_A relation down to the “AZZ” stock and group it by dividend price.

    DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',') 
        AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); 
    B = FILTER DIV_A BY symbol=='AZZ'; 
    C = GROUP B BY dividend; 
    DESCRIBE C; 
    DUMP C; 

Save and execute. Notice that the data for stock symbol “AZZ” is grouped together for each dividend.
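For reference, the DESCRIBE C output should show that each group key (a dividend value) maps to a bag of the B tuples that share it, roughly:

    C: {group: float,B: {(exchange: chararray,symbol: chararray,date: chararray,dividend: float)}}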

Congratulations! You have successfully completed the tutorial and are well on your way to pigging out on Big Data.
