Overview

In this tutorial we will walk through the process of

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model

Prerequisites:

  • A working HDP cluster
    • The easiest way to get an HDP cluster is to download the Hortonworks Sandbox, or
    • Create a cluster using Azure HDInsight Service, or
    • Create your own HDP for Windows cluster
  • The NYSE stockticker data
  • Microsoft Excel 2013 with PowerPivot and Power View enabled on a Windows system
  • Microsoft Power Query for Excel

The Data

The ZIP file you downloaded above with the NYSE stock ticker data contains two folders, nyse_prices and nyse_dividends.

These folders contain numerous comma-separated value (CSV) files representing end-of-day stock ticker and dividend data for all stock symbols listed on the NYSE over a period of 10 years.

Staging the data on HDFS

The first step is to stage the data in HDFS. If you are using HDInsight Service or another cloud provider, you will have to stage the data in that provider's storage; for Azure HDInsight Service, that is Azure Blob Storage.

For the rest of the tutorial we are going to assume we have a local cluster like the Hortonworks Sandbox to work with.

We will use the HDFS Files view of Ambari to stage the data. 

Navigate to /user/admin.

Create two new directories, nyse/nyse_prices and nyse/nyse_dividends.

Go to nyse_prices and upload some of the files from the ZIP file we previously downloaded, then repeat this for nyse_dividends.

You could also use the HDFS command line, among other approaches, to achieve the same result; this is probably the less time-consuming option. The following commands load the data into HDFS (a consolidated sketch follows the list below).

  • Open a terminal:
    • e.g. ssh root@127.0.0.1

It will ask you for your password. The default password is hadoop.

  • To upload the data
    • hadoop fs -put nyse_prices /user/admin/nyse
    • hadoop fs -put nyse_dividends /user/admin/nyse
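
Putting the upload steps together, a minimal session might look like the sketch below. It assumes the unzipped nyse_prices and nyse_dividends folders are in the current directory on the machine you SSH'd into; the final ls is only a sanity check that the files landed where Hive will expect them.

  # create the target directory (skip if you already created it in Ambari)
  hadoop fs -mkdir -p /user/admin/nyse

  # upload both folders
  hadoop fs -put nyse_prices /user/admin/nyse
  hadoop fs -put nyse_dividends /user/admin/nyse

  # sanity check: list a few of the uploaded files
  hadoop fs -ls /user/admin/nyse/nyse_prices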

Note: if you need to change permissions

  • log in as the HDFS superuser
    • su hdfs
  • and change the permissions with
    • hdfs dfs -chmod 777 /user/admin/nyse
  • or change the owner of the folder
    • hadoop fs -chown -R admin /user/admin/nyse

Creating a Hive schema on the raw data:

We will use the Hive View of Ambari to execute the DDL queries:

Use the DDL statement below to create the table ‘price_data’:

  create external table price_data (stock_exchange string, symbol string, trade_date string, 
                                      open float, high float, low float, close float, volume int, adj_close float) 
  row format delimited 
  fields terminated by ',' 
  stored as textfile 
  location '/user/admin/nyse/nyse_prices';

To verify that it succeeded, run the sample query by clicking on the icon next to the table name on the left.
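
If you prefer to type the query yourself, a minimal sanity check like the following (not prescribed by the walkthrough, just an illustrative check) should return a handful of raw ticker rows:

  select * from price_data limit 10;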

Then use the DDL statement below to create the table ‘dividends_data’:

  create external table dividends_data (stock_exchange string, symbol string, trade_date string, dividend float) 
  row format delimited 
  fields terminated by ',' 
  stored as textfile 
  location '/user/admin/nyse/nyse_dividends';

To verify that it succeeded, run a query:
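
For example, a quick check along these lines (again, just an illustrative query) should return a few dividend records:

  select * from dividends_data limit 10;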

Aggregating the stocks and dividend data

Notice in the output of the query on the price_data table above that there are many records per stock symbol for every year.

We want to work with a table that summarizes the high, low, average close, and total dividend for every year by stock symbol.

Create another directory under /user/admin/nyse named stock_aggregates.

Here’s the DDL statement to create the table ‘yearly_aggregates’:

  create table yearly_aggregates (symbol string, year string, high float, low float, average_close float, total_dividends float) 
  row format delimited 
  fields terminated by ',' 
  stored as textfile 
  location '/user/admin/nyse/stock_aggregates';

Now let’s populate the table with data using the following query:

  insert overwrite table yearly_aggregates 
  select a.symbol, year(a.trade_date), max(a.high), min(a.low),  avg(a.close), sum(b.dividend) 
  from price_data a 
  left outer join dividends_data b 
  on (a.symbol = b.symbol and a.trade_date = b.trade_date) 
  group by a.symbol, year(a.trade_date);

Let’s run a query to test the table; notice in the output below that we now have just one record per year per symbol:
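
An illustrative check such as the following (ordering by symbol and year so the one-row-per-year grouping is easy to see) is enough:

  select * from yearly_aggregates order by symbol, year limit 20;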

Consuming and shaping the result set from the Hive query using Excel

We will use Excel for this step.

Open Excel with a blank workbook and select the Power Query tab on the ribbon:

Click on “From Other Sources” on the ribbon and select “From Hadoop File (HDFS)”:

Here you should enter the IP address or server name of your HDP cluster's NameNode.

If your cluster needs authentication, you will also need to enter your credentials. For Sandbox, select the anonymous mode.

You will see a listing of all files on the HDFS, which could be quite overwhelming:

Thankfully, Power Query has a nifty capability to filter the listing by clicking on the little triangle on the column name. In this case we filter the Folder Path column with the string ‘aggregate’:

Now click on the Binary link on the Content column:

Now you should see the comma-separated records all in one column, since Excel does not yet know how to parse them:

Right click on the only column and select “Split Column” -> “By Delimiter”:

Select “Comma” as the delimiter and hit “OK”:

Now you should see the data properly structured:

Let’s rename the columns by right-clicking the column names, selecting “Rename”, and giving them names that make more sense:

Now scroll down and you will notice that some values in the “Dividend” column are “\N” instead of a number (“\N” is how Hive writes NULL values in text-format tables; here it comes from the left outer join for years with no dividend records). Let’s fix this by right-clicking and selecting “Replace Values”:

Now the “Dividend” Column should be well-formed:

Let’s change the data type of the columns High, Low, Average, Dividend to Number by right clicking the column and selecting “Change Type”:

Once you are done changing the types, the number fields should appear italicized:

We are done shaping this data, so let’s click “Done”:

In the query settings on the right hand pane you can disable the download of the data, since we will download the data set only after we enhance it with additional attributes from Wikipedia.

Acquiring external data from the Internet using Search

Let’s find out which are the stocks that are in S&P 500. Click on “Online Search” on the Power Query tab of the Excel worksheet:

By hovering over the search result list you can see previews of the datasets:

Click to bring in the dataset titled “S&P 500 Component stocks – list of S&P 500 companies”:

We have to shape this data before we use it. Click on “Filter & Shape” button on right hand pane to bring up the filter dialog:

Remove the columns that are uninteresting for our analysis, such as “SEC filings”, “GICS Sub Industry”, “Date first added”, and “key”, by right-clicking the column names and selecting “Remove Columns”:

Click “Done”

Enhancing the result set acquired from Hadoop with additional attributes using data from Wikipedia

In this section we want to limit the data to only the stocks in the S&P 500, and we also want additional attributes for each stock, such as the company name, industry, and location of the headquarters.

Go to the “Power Query” tab and select “Merge”:

Select the result set you acquired from Hadoop as the first result set and the result set you acquired from the Internet as the second. Also select “Symbol” and “Stock Symbol” as the columns to JOIN on and click OK:

Now let’s expand the NewColumn to its component attributes by clicking on the little T shaped icon beside the column name. Deselect “Ticker Symbol” as we have that attribute already from before:

You should see the new columns like below:

Rename the new columns to something like below:

Scroll down and you will notice that many rows have null values for the new columns.

These rows are null because these stock symbols are not part of S&P 500. So, we need to filter these rows out by right clicking one of the null values and selecting “Text Filters”:

Now the dataset is limited to just the symbols in S&P 500:

Click “Done”:

To load it into the PowerPivot in-memory model, click on the “Load to data Model” link on the right hand pane:

To confirm that it has been loaded to the PowerPivot model, click on the PowerPivot tab on Excel and select Manage:

You should see the following:

Close the PowerPivot Model.

Visualizing the combined model using Power View in Excel

Open the Insert tab of Excel and select Power View to insert a Power View sheet:

You will see a Power View canvas with the default model:

Deselect all the Power View fields except Dividend and Headquarters:

Then ensure the dividend is being averaged and not summed by clicking on the little triangle beside Dividend:

Now click on the Map button on the Power View ribbon and BOOM! You get an interactive map:

Resize and zoom the map control to your taste.

We just walked through an end-to-end data pipeline: we started with raw big data, cleaned, transformed, and aggregated it in the Hortonworks Data Platform (HDP), and then consumed it in Excel for modeling and visualization. The integration was so simple that we barely noticed it.

This was your favorite big data community (Apache Hadoop) at work. Hortonworks and others in the Apache community designed and implemented WebHDFS, an open REST API in Apache Hadoop. Microsoft used that API from Power Query for Excel to make the integration with the Microsoft Business Intelligence platform seamless.

Happy Hadooping!!!
