Skip to content

Instantly share code, notes, and snippets.

@Xenorith
Last active July 19, 2019 19:11
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Xenorith/dc052976b337c9aa43145c556f37110e to your computer and use it in GitHub Desktop.
Save Xenorith/dc052976b337c9aa43145c556f37110e to your computer and use it in GitHub Desktop.

Getting Started with the Alluxio-Presto Sandbox AMI

The Alluxio-Presto sandbox is an Amazon Machine Image offered on Amazon EC2 and features installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox AMI lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and experience the performance benefits of using Alluxio in a big data software stack.

The goal of this guide is to show how Alluxio can improve Presto's query performance by reading through Alluxio to access locally cached data, originally stored in an Amazon S3 bucket.

This is a similar variation of the Alluxio-Presto sandbox in a Docker container.

Outline

Prerequisites

  • An account on AWS
  • Web console access to EC2
  • A terminal window with SSH

Familiarity with EC2 is helpful but not required. The tutorial launches an EC2 instance through the web console assuming the user has a newly created AWS account; experienced users may be able to skip through certain sections.

Note that the launched EC2 instance does not qualify for free usage tier because the instance needs to have sufficient resources to execute the workload. The instance type we will be using, r4.4xlarge, costs about $1 an hour.

Launch an EC2 instance

We will be launching a single EC2 instance to deploy the sandbox AMI. Users who are familiar with EC2 can jump to the next section on exploring Alluxio, after launching an instance with the following key parameters:

  • AMI name: alluxio-presto-sandbox; search for its AMI ID under Community AMIs
  • Instance type: r4.4xlarge
  • Security group: Open inbound ports for 22, 19999, and 8080

Open the web console

Open a web browser and navigate to the EC2 console, logging in with your credentials when prompted. Note the AWS region in the upper right corner.

awsRegions

Click on the region to open a dropdown of available regions. It is recommended to select the region that is geographically closest to your computer.

On the left sidebar, click the Instances link to arrive at the following page: emptyInstancesPage

Click the blue Launch Instance button. This will navigate you to the first of several steps to launch an instance.

Step 1: Choose an Amazon Machine Image (AMI)

In the left sidebar, there are 4 categories: Quick Start, My AMIs, AWS Marketplace, and Community AMIs. Click on Community AMIs and search for the Alluxio-Presto sandbox AMI by typing alluxio-presto-sandbox in the search bar. Exactly one result should appear; click on its blue Select button to proceed. chooseAmi

Note that each AMI has a unique ID in the form ami-xxxxxxxxxxxxxxxxx. This ID differs depending on which AWS region was selected.

Step 2: Choose an Instance Type

Among the large table of instance types and their specifications, scroll down to search for the r4.4xlarge instance type.

Warning: r4.4xlarge is provisioned with the minimum amount of resources needed to run this AMI. Choosing a different instance type may result in unexpected errors.

Select it and click 6. Configure Security Group in horizontal row of steps near the top. Steps 3, 4, and 5 are skipped because we can use their default values. instanceType

Step 6: Configure Security Group

Create a new security group named alluxio-presto-sandbox Port 22 is already added to allow SSH access. Click Add Rule on the bottom left to add another row. Set the port range of the new rule to 8080 to allow access to the Presto web UI. Set the source to be Anywhere.

Repeat the above steps to add another rule to open port 19999 to allow access to the Alluxio web UI. Click the blue Review and Launch button on the bottom right to proceed. securityGroups

Step 7: Review Instance Launch

This page shows an overview of all the configurations set in the previous steps. You can disregard the yellow warning boxes, titled Improve your instances' security and Your instance configuration is not eligible for the free usage tier. reviewInstance

Click the blue Launch button on the bottom right, which will open a pop up regarding key pairs.

Select an existing key pair or create a new key pair

If you have an existing key pair with its corresponding private key file, keep Choose an existing key pair in the first dropdown and select the known key pair from the second dropdown. Click the checkbox to acknowledge you have the private key file.

For new users without an existing key pair, select the second option Create a new key pair and type in a name in the second text box (ex. alluxio-presto-sandbox-keypair). Click Download key pair and your browser will download a private key file with a matching name (ex. alluxio-presto-sandbox-keypair.pem). Keep track of where this file is downloaded (ex. ~/Downloads/alluxio-presto-sandbox-keypair.pem) because it will be used to access your launched instance. keypairs

Click the blue Launch Instances button to finally launch the instance. The browser will display a loading spinner with status messages before arriving at a launched page with a blue View Instances button on the bottom. Click this button to return back to the Instances page.

Important for newer accounts!! Accounts may encounter an instance limit error when attempting to launch an instance of type r4.4xlarge. This is because each account is limited on how many instances can be launched at a time for each type. Newer accounts typically will not be allowed to launch larger instance types; for example, the starting limit value for r4.4xlarge could be 0.

If you see the following error, follow the given URL to request for a limit increase. limitError

Limit values can be found in the Limits page from the left sidebar. limitsPage

SSH into the instance

Navigate to the Instances page if not already open and find the entry for the newly launched instance. Select the row and find the instance's Public DNS; it should start with ec2- and end in .amazonaws.com. This is the hostname used to SSH into the instance. launchedInstance

Open a terminal window and use the ssh command to connect to the launched instance. In the command below, replace:

  • /path/to/privateKeyFile.pem with the path to your private key file
  • ec2-00-00-00-00.compute-1.amazonaws.com with the EC2 instance public DNS
ssh -i /path/to/privateKeyFile.pem ec2-user@ec2-00-00-00-00.compute-1.amazonaws.com

If you see an error labeled WARNING: UNPROTECTED PRIVATE KEY FILE!, the permissions of the private key file needs to be updated. Run chmod 400 /path/to/privateKeyFile.pem, again replacing with the path to your private key file, and retry the SSH command again.

If this is your first time running SSH for this instance, you will be prompted to confirm the authenticity of the host. Type yes to continue.

Once successfully connected, the terminal prompt should start with something similar to [ec2-user@ip-00-00-00-00 ~]$ It is assumed for the remainder of this guide that commands will be run from within the instance.

Service Architecture

The goal of using this AMI is to read data stored in an S3 bucket through Alluxio such that:

  • A public dataset is hosted on an Amazon S3 bucket such as s3://bucket-name/path
  • The public bucket is mounted into the Alluxio filesystem at alluxio:///path/to/mount
  • The Alluxio mount point is referenced in Hive table definitions via CREATE TABLE table_name ... LOCATION 'alluxio:///path/to/mount/table_name';
  • Presto uses the tables from the hive metastore to run queries with SELECT * FROM table_name;

About the Instance Configuration

The Alluxio-Presto Sandbox comes with pre-installed software which can be found in the /opt directory of the instance. Alluxio is installed at /opt/alluxio. The configuration parameters for which are in the /opt/alluxio/conf directory.

The configuration files for the rest of the services can be found within their respective directories in /opt/hadoop, /opt/hive, and /opt/presto. Refer to the programs' specific documentation in order to find out how each is configured.

The services within the instance are all managed by a daemon called supervisord. This daemon is responsible for starting, stopping, and restarting each service.

Control the status of each processes by using the supervisorctl command.

  • supervisorctl status will show the status of each service
  • supervisorctl stop <service_name> will stop a service
  • supervisorctl start <service_name> will start a service
  • supervisorctl restart <service_name> will restart a service
  • supervisorctl help will list all available commands

Logs for each process can be found in /var/log/supervisor

Explore Alluxio

We'll use a combination of the Alluxio web UI at http://EC2_PUBLIC_DNS:19999 and the Alluxio CLI to explore the Alluxio filesystem and cluster status.

The instance comes with an Amazon S3 bucket pre-mounted in Alluxio at the /s3 directory. It contains data for TPC-DS benchmarks at the "scale 100" size factor which amounts to about 100GB of data across multiple tables.

Open the Alluxio web UI at http://EC2_PUBLIC_DNS:19999 to check if the Alluxio master has started successfully. If not, wait a few moments, refresh the page, and it should become available.

You can see the current Alluxio mounts by running alluxio fs mount from within the the instance.

alluxio fs mount
/opt/alluxio/underFSStorage                           on  /    (local, capacity=15.99GB, used=-1B(0%), not read-only, not shared, properties={})
s3a://alluxio-public-http-ufs/tpcds/scale100-parquet  on  /s3  (s3, capacity=-1B, used=-1B, read-only, not shared, properties={aws.secretKey=******, aws.accessKeyId=******})

Run Queries with Presto on Alluxio

In this next section we're going to use Presto and Alluxio to show how Alluxio can massively decrease query times by reading cached data.

This guide focuses on using Presto through the command line; however, you can also use the Presto UI at http://EC2_PUBLIC_DNS:8080 to view the status of your queries.

Open Presto CLI with alluxio schema

From within the instance, launch the Presto CLI:

presto --catalog hive --debug
presto>

Tip: You can exit at any time by typing exit;

The instance comes pre-loaded with tables in Presto. A schema named alluxio has already been defined. The database contains the tables from the TPC-DS benchmark.

presto> show schemas;
       Schema
--------------------
 alluxio
 default
 information_schema
(3 rows)

Use the alluxio schema

presto> use alluxio;
USE
presto:alluxio>

Once you see the prompt presto:alluxio> you will be using the schema. The table definitions can be found in /usr/share/tpcdsData/createAlluxioTpcdsTables.sql.

Run query

We're going to run a query derived from the TPC-DS benchmarks. The full query below can also be found at /usr/share/tpcdsData/prestoQuery.sql.

with ssr as
(select  ss_store_sk as store_sk,
     sum(ss_ext_sales_price) as sales,
     sum(coalesce(sr_return_amt, 0)) as returns,
     sum(ss_net_profit - coalesce(sr_net_loss, 0)) as profit
 from store_sales left outer join store_returns on
     (ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number)
     , promotion
 where ss_promo_sk = p_promo_sk
 group by ss_store_sk),
wsr as
(select  ws_web_site_sk as website_sk,
     sum(ws_ext_sales_price) as sales,
     sum(coalesce(wr_return_amt, 0)) as returns,
     sum(ws_net_profit - coalesce(wr_net_loss, 0)) as profit
 from web_sales left outer join web_returns on
     (ws_item_sk = wr_item_sk and ws_order_number = wr_order_number)
     , promotion
 where ws_promo_sk = p_promo_sk
 group by ws_web_site_sk)
 select channel, sk, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
 from (select 'store channel' as channel, store_sk as sk, sales, returns, profit from ssr
     union all
     select 'web channel' as channel, website_sk as sk, sales, returns, profit from  wsr) x
 group by rollup (channel, sk) order by channel, sk limit 100;

Copy the above query, paste it into the Presto prompt, and hit enter to execute. The query will likely take at least 5 minutes to finish, so it would be a good time to grab a drink from the fridge.

An output table is shown when done; the following is a sample of the first few rows: -> collapse the output

    channel    | sk  |        sales         |       returns        |        profit
---------------+-----+----------------------+----------------------+-----------------------
 store channel |   1 |  2.515118199869996E9 | 1.2543365919999991E8 | -1.1656864820600054E9
 store channel |   2 | 2.5205456842599974E9 |  1.245415812100001E8 | -1.1674391653999991E9
 store channel |   4 | 2.5179203736600065E9 | 1.2488394164999989E8 | -1.1657134956600013E9
 store channel |   7 |  2.511532314150002E9 | 1.2449183458999994E8 | -1.1649937964900005E9
 store channel |   8 | 2.5347543900099897E9 | 1.2612198162999986E8 | -1.1755503999099977E9
 store channel |  10 |      2.51036513155E9 | 1.2575692749000001E8 |  -1.160484859049994E9
 store channel |  13 |  2.513131863500003E9 | 1.2527703745000003E8 | -1.1657150094600005E9
 store channel |  14 |  2.535310933710003E9 |  1.264888773999999E8 | -1.1709913304899998E9
 store channel |  16 | 2.5312767249400077E9 |       1.2517431134E8 |  -1.168813352409999E9
...

Press q to leave the query results.

After quitting the results, the query summary should resemble the following:

Query 20190719_001828_00003_vy8rx, FINISHED, 1 node
http://localhost:8080/ui/query.html?20190719_001828_00003_vy8rx
Splits: 8,164 total, 8,164 done (100.00%)
CPU Time: 1220.3s total,  325K rows/s, 5.17MB/s, 28% active
Per Node: 1.8 parallelism,  574K rows/s, 9.14MB/s
Parallelism: 1.8
Peak Memory: 2.17GB
11:30 [396M rows, 6.16GB] [574K rows/s, 9.14MB/s]

Note: here that in the last line of the output, 11:30 represents the query time in mm:ss format.

<--

Rerunning the query

Because this is the first time we're reading the data, it is pulled from S3 and is then returned through Alluxio. At the same time, the Alluxio worker will be caching the data in memory so that the next time the data is accessed it can be read at memory speed.

Running the query again should be faster since the data is now cached in Alluxio, unlike the first query which read its data from the S3 bucket. Let's run it again!

Tip: You can hit the up arrow while in the Presto CLI to scroll through previously executed commands. Exit the Presto shell with exit;

Take note of the query execution time; did you notice any difference in performance? If you want replicate the same result, you will need to free the data from Alluxio memory. After exiting the Presto shell, run the alluxio free command to free the data.

alluxio fs free /s3

To terminate the SSH session, type exit in the SSH terminal prompt.

Next Steps

You can run other queries from the TPC-DS benchmark here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment