Companion steps for blog post on black box dataset for Sloan Digital Sky Survey Data

Setup Instructions for Snowflake Digital Sky Survey Blog Post

What's this about?

Companion blog post

Tools

Table of Contents

Setup: warehouse, database, schema, stage and CLI

Make sure you are using a role with the correct permissions to create and use warehouses and databases. If you're using a trial, the easiest way to do this is with ACCOUNTADMIN:

USE ROLE ACCOUNTADMIN;

Warehouse

You should have a default XS warehouse available in your Snowflake account, even during the free trial. The warehouse provides the compute you need to run queries. If you do not have one in your account, or you'd like to create a new one to use for this exercise, please run the following in a Snowflake worksheet:

/* Create and use an XS warehouse, which will autosuspend after
   5 min (300 seconds) and will autoresume on use. Max/min
   cluster size set to 1 to prevent scaling. */
CREATE WAREHOUSE IF NOT EXISTS sdss_wh WITH 
  WAREHOUSE_SIZE = 'XSMALL' 
  WAREHOUSE_TYPE = 'STANDARD' 
  AUTO_SUSPEND = 300 
  AUTO_RESUME = TRUE 
  MIN_CLUSTER_COUNT = 1 
  MAX_CLUSTER_COUNT = 1 
  SCALING_POLICY = 'STANDARD';

USE WAREHOUSE sdss_wh;

Database, schema, stage

You will need to create a database to store both the table that will house your data and the stage that will receive the data file(s). You can also optionally create a separate schema in the database (see the sketch after the next code block), although for the blog post I'm using the public schema, which is created along with the database.

/* Create and use the database and its public schema */
CREATE DATABASE IF NOT EXISTS sdss;
USE SCHEMA sdss.public;

/* Create the stage */
CREATE OR REPLACE STAGE sdss_stage
   FILE_FORMAT = (TYPE="CSV", FIELD_DELIMITER=",", SKIP_HEADER=1);
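
If you would prefer a separate schema over public, a minimal sketch looks like the following; the schema name sdss_data is just an example, and the rest of the walkthrough assumes the public schema:

/* Optional: create and use a dedicated schema instead of public */
CREATE SCHEMA IF NOT EXISTS sdss.sdss_data;
USE SCHEMA sdss.sdss_data;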

Snowflake documentation for creating stages for file uploads

The SnowSQL Client

You will need the snowsql client in order to upload the data from local to Snowflake.

Snowflake's documentation for installing the snowsql client

Note: if you are using macOS you can use brew.
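
If you have Homebrew installed, the install should be a one-liner. A minimal sketch, assuming the cask is still named snowflake-snowsql (check the install docs above if the name has changed):

# Install the SnowSQL CLI via Homebrew
brew install --cask snowflake-snowsql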

How to connect

The command to connect should be:

snowsql --accountname ACCOUNT.REGION.PROVIDER --username USERNAME --dbname sdss --schemaname public

So if your account name is abcde123 and you are using Snowflake with GCP and in the us-central1 region, then you'd use:

snowsql --accountname abcde123.us-central1.gcp --username USERNAME --dbname sdss --schemaname public

If you're unsure of your provider and region, you can see it in the Snowflake app URL, e.g.:

https://app.snowflake.com/us-central1.gcp/abcde123/

For more information about this, please see Snowflake's documentation about account identifiers.

For your password you can either add it as a parameter or provide it when prompted.

The config file

You can also tell snowsql what account name, username, and password to use via the config file. On macOS / Linux this file is located at ~/.snowsql/config. The relevant section of the config file (the [connections] section) would be set to the following if using the above example account:

[connections]
accountname = abcde123.us-central1.gcp
username = USERNAME
password = PASSWORD

Please see Snowflake's documentation for the config file for more information and parameters.

How to exit

Global vim trauma informs me that I should always specifically call out how to exit:

!quit

Common Setup Errors in SnowSQL

Some setup issues you may encounter with the snowsql CLI tool:

  • You can store your account name, username, and, optionally, your password in the ~/.snowsql/config file. You can also pass them as parameters when you run the tool. You will be prompted for any information you do not provide.
    If you're not sure what your account name is (it is different from the username you use to log in), run the following in a worksheet:
    SELECT CURRENT_ACCOUNT();
  • If you see the following error, where ACCOUNTNAME is the alphanumeric string you provided for your account:
    250001 (08001): Failed to connect to DB. Verify the account name is correct: ACCOUNTNAME.snowflakecomputing.com:443. 000403: 403: HTTP 403: Forbidden
    Then you are using a non-AWS provider and you need to provide a fully qualified account name of the format:
    accountname.region.provider
    • Note that this error can take a long, long time to appear, and snowsql may seem to hang until it does.
  • If you see the following error:
    Failed to initialize log. No logging is enabled: [Errno 13] Permission denied: '/Users/snowsql_rt.log_bootstrap'
    This is because the default file path to the logging file is ../snowsql_rt.log. To fix this on Linux / macOS, open the config file and change the default logging path to ~/.snowsql/snowsql_rt.log, as in the snippet after this list. (Errno 13 Question and Answer on StackOverflow.)
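
For example, after that change the relevant line in ~/.snowsql/config would look roughly like the following (log_file is the option name as I recall it from Snowflake's config documentation, so verify it there):

# Point the SnowSQL runtime log at a path your user can write to
log_file = ~/.snowsql/snowsql_rt.log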

Data Upload

In order to upload the data from local, you need to:

  • Connect to the correct account and database via snowsql as documented above
  • Upload the data to the stage via the PUT command

The syntax for the PUT command is:

PUT file://<full_path_to_file> @<stage_name>

So if the data is in the file star_classification.csv in your Downloads folder and you need to upload it to sdss_stage:

PUT file:///Users/LOCAL_USERNAME/Downloads/star_classification.csv @sdss_stage

Note the triple slash, ///: the file:// scheme is followed by the leading / of the absolute file path.
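
By default, PUT gzip-compresses files as it uploads them, which is why the staged file shows up later with a .gz extension. If you prefer to stage the file uncompressed you can disable this with the AUTO_COMPRESS option, though the rest of the walkthrough keeps the default:

PUT file:///Users/LOCAL_USERNAME/Downloads/star_classification.csv @sdss_stage AUTO_COMPRESS=FALSE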

Creating CREATE in vim

This is a little aside that I created for a couple of reasons: partly for vim practice, but also so I didn't need to copy / enter 18 different column names and risk forgetting or mistyping one of them. To start, we'll copy in the header line only:

head -1 star_classification.csv > create_sdss_table.sql

The file should look like this:

"obj_ID","alpha","delta","u","g","r","i","z","run_ID","rerun_ID","cam_col","field_ID","spec_obj_ID","class","redshift","plate","MJD","fiber_ID"

Change to insert mode (type i) to edit, wrap the CREATE statement around the text, and add the trailing semicolon. The file should now look like the following three lines:

CREATE OR REPLACE TRANSIENT TABLE sdss_staging_table (
"obj_ID","alpha","delta","u","g","r","i","z","run_ID","rerun_ID","cam_col","field_ID","spec_obj_ID","class","redshift","plate","MJD","fiber_ID"
);

Change back to command mode (press esc). In order, we’re going to:

  • Strip the quotes
  • Replace the commas with newlines
  • Indent the body of the create with a tab for readability.

To do this we'll use the s command, which searches and replaces. Note that in a vim substitution the replacement \r inserts a newline, while \n inserts a null character, so we'll use \r when we swap out the commas. s uses regex, so the beginning of a line is denoted by ^ and the end of a line by $. To run the commands correctly, you'll need your cursor to be on the second line, as s without a range only acts on the current line. That means that the steps above, in order, are:

  • :s/"//g -> switch the quote with nothing (deleting it) for all occurrences (g is for global on the line)
  • :s/,/\r/g -> switch the comma with a newline, also globally
  • :2,19s/^/\t/ -> switch beginning of line with tab for lines 2-19

The file should now look like:

CREATE OR REPLACE TRANSIENT TABLE sdss_staging_table (
    obj_ID
    alpha
    delta
    u
    g
    r
    i
    z
    run_ID
    rerun_ID
    cam_col
    field_ID
    spec_obj_ID
    class
    redshift
    plate
    MJD
    fiber_ID
);

Since we don't yet know what we want the different column types to be, we're going to put everything as type VARCHAR. To do the edit in vim:

:2,19s/$/ VARCHAR,/

You will need to delete the trailing comma at the end of line 19, as it is the last column. Once you run the above and delete that comma, the final create statement should look like:

CREATE OR REPLACE TRANSIENT TABLE sdss_staging_table (
    obj_ID VARCHAR,
    alpha VARCHAR,
    delta VARCHAR,
    u VARCHAR,
    g VARCHAR,
    r VARCHAR,
    i VARCHAR,
    z VARCHAR,
    run_ID VARCHAR,
    rerun_ID VARCHAR,
    cam_col VARCHAR,
    field_ID VARCHAR,
    spec_obj_ID VARCHAR,
    class VARCHAR,
    redshift VARCHAR,
    plate VARCHAR,
    MJD VARCHAR,
    fiber_ID VARCHAR
);

The Snowflake Worksheet

If you would like to run the worksheet exactly as I have done, here are all of the commands. Note that the worksheet cannot be run all at once, as the data uploads via SnowSQL happen at specific junctures.

/* Create a small warehouse to use if needed. Note:
// -- Size is xsmall as we won't be doing anything compute heavy
// -- MIN = MAX, so the warehouse cannot scale if a query is
//    more computationally expensive than we realized. */
CREATE WAREHOUSE IF NOT EXISTS sdss_wh WITH 
    WAREHOUSE_SIZE = 'XSMALL' 
    WAREHOUSE_TYPE = 'STANDARD' 
    AUTO_SUSPEND = 300 
    AUTO_RESUME = TRUE 
    MIN_CLUSTER_COUNT = 1 
    MAX_CLUSTER_COUNT = 1 
    SCALING_POLICY = 'STANDARD';
USE WAREHOUSE sdss_wh;

/* Create the database. Using public schema here. */
CREATE DATABASE IF NOT EXISTS sdss;
USE SCHEMA sdss.public;

/* Create stage for subset data only */
CREATE OR REPLACE STAGE sdss_small_stage
    FILE_FORMAT = (TYPE="CSV", FIELD_DELIMITER=",", SKIP_HEADER=1);


/* Load subset data from local at this juncture using SnowSQL */

/* Verify the CSV file is in the stage; it may have been compressed into a .gz */
LIST @sdss_small_stage;

/* Create transient table for subset of the data. Note:
// -- Transient table, so there are no automatic backups of the data.
// -- All fields are text so we can inspect them without accidentally changing the content */
CREATE OR REPLACE TRANSIENT TABLE sdss_small_table (
    obj_ID VARCHAR,
    alpha VARCHAR,
    delta VARCHAR,
    u VARCHAR,
    g VARCHAR,
    r VARCHAR,
    i VARCHAR,
    z VARCHAR,
    run_ID VARCHAR,
    rerun_ID VARCHAR,
    cam_col VARCHAR,
    field_ID VARCHAR,
    spec_obj_ID VARCHAR,
    class VARCHAR,
    redshift VARCHAR,
    plate VARCHAR,
    MJD VARCHAR,
    fiber_ID VARCHAR
);

/* Copy the data in */
COPY INTO sdss_small_table FROM @sdss_small_stage;

/* See the data. */
SELECT * FROM sdss_small_table;


/* ------------------------------------------------------------------------ */


/* As an exercise, create the table from a SELECT statement
// First, note you can select/read columns directly from the stage */
SELECT $1,$2,$3 FROM @sdss_small_stage;

/* Second, note you can cast to a type and rename the column */
SELECT 
    $1::VARCHAR AS obj_id,
    $2::VARCHAR AS alpha,
    $3::VARCHAR AS delta
FROM @sdss_small_stage;


/* So the complete "create from select" statement is: */
CREATE OR REPLACE TRANSIENT TABLE sdss_small_from_select AS SELECT
	$1::VARCHAR AS obj_ID,
	$2::VARCHAR AS alpha,
	$3::VARCHAR AS delta,
	$4::VARCHAR AS u,
	$5::VARCHAR AS g,
	$6::VARCHAR AS r,
	$7::VARCHAR AS i,
	$8::VARCHAR AS z,
	$9::VARCHAR AS run_ID,
	$10::VARCHAR AS rerun_ID,
	$11::VARCHAR AS cam_col,
	$12::VARCHAR AS field_ID,
	$13::VARCHAR AS spec_obj_ID,
	$14::VARCHAR AS class,
	$15::VARCHAR AS redshift,
	$16::VARCHAR AS plate,
	$17::VARCHAR AS MJD,
	$18::VARCHAR AS fiber_ID
FROM @sdss_small_stage;


/* ------------------------------------------------------------------------ */


/* Can also drop / adjust the above to accommodate having the FILE FORMAT separate.
// Not necessary for a one-off exercise, but it would be necessary when you have multiple
// data sources and thus potentially different file formats. Think about perhaps a
// scientific study being run through various universities, hospitals (if medical),
// government agencies, and so forth. Their file formats might all vary slightly for a
// variety of reasons. A sketch of a separate, named file format follows. */
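
/* Illustrative sketch only (not part of the original worksheet): a named file
// format plus a stage that references it. The names sdss_csv_format and
// sdss_named_stage are hypothetical. */
CREATE FILE FORMAT IF NOT EXISTS sdss_csv_format
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1;

CREATE STAGE IF NOT EXISTS sdss_named_stage
    FILE_FORMAT = (FORMAT_NAME = 'sdss_csv_format');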


/* Create staging table for data, adding in three columns for metadata information */
CREATE OR REPLACE TRANSIENT TABLE sdss_staging_table (
    import_file VARCHAR,
    import_row_number INT,
    import_time TIMESTAMP DEFAULT current_timestamp(),
    obj_ID BIGINT,
    alpha NUMBER(20,17),
    delta NUMBER(20,17),
    u NUMBER(10,8),
    g NUMBER(10,8),
    r NUMBER(10,8),
    i NUMBER(10,8),
    z NUMBER(10,8),
    run_ID INT,
    rerun_ID INT,
    cam_col INT,
    field_ID INT,
    spec_obj_ID BIGINT,
    class VARCHAR,
    redshift NUMBER(11,9),
    plate INT,
    MJD INT,
    fiber_ID INT
);

/* Create the stage for the full / "real" data */
CREATE STAGE IF NOT EXISTS sdss_stage FILE_FORMAT = (TYPE="CSV", FIELD_DELIMITER=",", SKIP_HEADER=1);

/* Upload the full file, i.e. the `_orig` file, to the stage via SnowSQL */

/* Verify the file is in the stage */
LIST @sdss_stage;

/* COPY the data via a SELECT command, to set the additional column values: */
COPY INTO sdss_staging_table FROM (SELECT
    metadata$filename,
    metadata$file_row_number,
    current_timestamp(),
    $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18
    FROM @sdss_stage);

/* Did any data make it in? (Spoiler: no) */
SELECT COUNT(*) FROM sdss_staging_table;

/* Checking for errors, spoiler:
// This command will not work with a transform, but if we
// were doing a straightforward copy we could use this. */
COPY INTO sdss_staging_table FROM (SELECT
    metadata$filename,
    metadata$file_row_number,
    current_timestamp(),
    $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18
    FROM @sdss_stage) VALIDATION_MODE='RETURN_ERRORS';
/* Leaving as an exercise to the reader: create a new table without the
   extra metadata columns and do a direct copy with validation mode to find
   additional errors. A rough sketch follows. */
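
/* Rough sketch of that exercise (the table name sdss_validation_table is
// hypothetical): mirror only the file's 18 columns so that a plain COPY,
// with no transform, can use VALIDATION_MODE to report problem rows
// without loading anything. */
CREATE OR REPLACE TRANSIENT TABLE sdss_validation_table (
    obj_ID BIGINT,
    alpha NUMBER(20,17),
    delta NUMBER(20,17),
    u NUMBER(10,8),
    g NUMBER(10,8),
    r NUMBER(10,8),
    i NUMBER(10,8),
    z NUMBER(10,8),
    run_ID INT,
    rerun_ID INT,
    cam_col INT,
    field_ID INT,
    spec_obj_ID BIGINT,
    class VARCHAR,
    redshift NUMBER(11,9),
    plate INT,
    MJD INT,
    fiber_ID INT
);

COPY INTO sdss_validation_table FROM @sdss_stage
    VALIDATION_MODE='RETURN_ERRORS';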
    
    
/* Upload the other CSV files to the stage, run LIST to verify */
LIST @sdss_stage;

/* Copy only the `_edit` file */
COPY INTO sdss_staging_table FROM (SELECT
    metadata$filename,
    metadata$file_row_number,
    current_timestamp(),
    $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18
    FROM @sdss_stage/star_classification_edit.csv.gz);

SELECT * FROM sdss_staging_table LIMIT 25;
SELECT * FROM sdss_staging_table WHERE class='"GALAXY"';
SELECT COUNT(*) FROM sdss_staging_table WHERE class='"GALAXY"';