Skip to content

Instantly share code, notes, and snippets.

@Laxman-SM
Forked from dcolucci/presto-howto.md
Created February 28, 2020 11:07
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Laxman-SM/403ad2a57187bc6505f86b2372bcc354 to your computer and use it in GitHub Desktop.
Save Laxman-SM/403ad2a57187bc6505f86b2372bcc354 to your computer and use it in GitHub Desktop.

##Getting Started ####download

  1. download the jar: https://prestodb.io/docs/current/installation/cli.html
  2. move it to some directory and cd to that directory
  3. rename the jar file presto

####launching presto

  1. cd to directory with presto jar file

  2. run

    ./presto --server presto-prd-srv01.aws.conde.io:8889 --catalog hive
  3. you are now in the presto CLI.

##Querying Data ###schemas & tables

  1. Launch the presto CLI and run show schemas;
  2. This will show you the available schemas (can be thought of as top-level data sources) available to query. Of chief interest to us will be the sparrow schema.
  3. run use sparrow; to switch to the sparrow schema.
  4. run show tables; to show the tables from which we may query data. The event_clickstream table is of most interest to us.

###queries Let's query that table! Run

select * from event_clickstream limit 10;

Use the left and right arrow keys to view more columns of the result table in the CLI output.

Important Note: You must always use either a limit clause or filter by partition in your where clause. What's a partition? Keep reading...

###partitions In each schema table, data is divided among files on the server (in this case s3), known as partitions. In the sparrow.event_clickstream table, data is partitioned by event capture date. There is a separate partition for each calendar date.

When running queries that cannot use a limit clause, you'll want to make sure you filter to a specific subset of partitions in your where clause.

To view the partitions in the sparrow.event_clickstream table, for instance, run

show partitions from event_clickstream;

You will see a list of all the partitions. As mentioned above, there is a partition per calendar date. As you can see in the CLI output, the partition name is dt.

###putting it all together To run a non-limited query, you will want to filter on partition in your where clause. Here is an example of how you might do so for a query against the sparrow.event_clickstream table:

select count(*) from event_clickstream where dt = '2016-04-13'

This query would show you a count of all sparrow events that were tracked on 2016-04-13.

##Notes

  • There is generally a 1-day lag of data available for querying through presto. Querying presto can provide us with useful data insights for time periods on the order of magnitude of days; for more granular data insights, we will have to rely on querying presto's data sources directly (e.g., Druid)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment