##Getting Started ####download
- download the jar: https://prestodb.io/docs/current/installation/cli.html
- move it to some directory and
cd
to that directory - rename the jar file
presto
####launching presto
-
cd
to directory withpresto
jar file -
run
./presto --server presto-prd-srv01.aws.conde.io:8889 --catalog hive
-
you are now in the presto CLI.
##Querying Data ###schemas & tables
- Launch the presto CLI and run
show schemas;
- This will show you the available schemas (can be thought of as top-level data sources) available to query. Of chief interest to us will be the
sparrow
schema. - run
use sparrow;
to switch to the sparrow schema. - run
show tables;
to show the tables from which we may query data. Theevent_clickstream
table is of most interest to us.
###queries Let's query that table! Run
select * from event_clickstream limit 10;
Use the left and right arrow keys to view more columns of the result table in the CLI output.
Important Note: You must always use either a limit
clause or filter by partition in your where
clause. What's a partition? Keep reading...
###partitions
In each schema table, data is divided among files on the server (in this case s3), known as partitions. In the sparrow.event_clickstream
table, data is partitioned by event capture date. There is a separate partition for each calendar date.
When running queries that cannot use a limit
clause, you'll want to make sure you filter to a specific subset of partitions in your where
clause.
To view the partitions in the sparrow.event_clickstream
table, for instance, run
show partitions from event_clickstream;
You will see a list of all the partitions. As mentioned above, there is a partition per calendar date. As you can see in the CLI output, the partition name is dt
.
###putting it all together
To run a non-limit
ed query, you will want to filter on partition in your where
clause. Here is an example of how you might do so for a query against the sparrow.event_clickstream
table:
select count(*) from event_clickstream where dt = '2016-04-13'
This query would show you a count of all sparrow events that were tracked on 2016-04-13.
##Notes
- There is generally a 1-day lag of data available for querying through presto. Querying presto can provide us with useful data insights for time periods on the order of magnitude of days; for more granular data insights, we will have to rely on querying presto's data sources directly (e.g., Druid)