
Scott Hoover githoov

@githoov
githoov / redshift_metadata.md
Last active September 21, 2022 01:57
Redshift Metadata

Redshift Tables Used

pg_table_def, stl_query, stl_querytext, stl_tr_conflict, stl_explain, stl_alert_event_log, stl_ddltext, stl_scan, stl_save, stl_hashjoin, stl_hash, stl_plan_info, stl_return, and information_schema.table_constraints.

Queries to Extract Features

  • execution time
select datediff(ms, starttime, endtime) as execution_time_in_ms
from stl_query
where query = QUERY_ID;
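  • query text

The full SQL statement lives in stl_querytext in 200-character chunks keyed by sequence, so the pieces have to be stitched back together. A sketch of that, using the same QUERY_ID placeholder as above:

select listagg(text) within group (order by sequence) as query_text
from stl_querytext
where query = QUERY_ID
group by query;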
@githoov
githoov / looker.md
Last active July 17, 2021 18:20
A Quick Walkthrough of Looker and Accessing its APIs

Looker Overview

As a reminder, Looker has three core components: first, it provides a physical modeling layer (LookML), which one uses to abstract SQL generation and define complex transforms; second, Looker consumes the LookML model and generates dialect-specific SQL, which it then issues to the database via JDBC; third, it provides a web-based UI that acts both as a LookML IDE and as the primary means of exploring data.

The LookML Model

Let's consider a simple schema containing two tables, job and warehouse, which we'll use as an example. In Looker, our representation of these tables would look like this:

job definition

- view: job
  sql_table_name: public.job
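To make the second component concrete, the SQL Looker generates for an explore of job joined to warehouse would look roughly like the query below. This is only a sketch: the join key, the warehouse.name dimension, and the job count measure are hypothetical and depend on how the model is fleshed out.

-- hypothetical generated SQL; column names and the join key are illustrative
select
  warehouse.name as "warehouse.name",
  count(distinct job.id) as "job.count"
from public.job as job
left join public.warehouse as warehouse
  on job.warehouse_id = warehouse.id
group by 1
order by 2 desc
limit 500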
@githoov
githoov / gondola.png
Last active March 24, 2018 20:20
gondola
foo
@githoov
githoov / nlp.md
Last active April 20, 2017 13:54
NLP Blog

Introduction

I recently had the opportunity to showcase Snowflake at JOIN, Looker's first user conference. I used my time to highlight a few Snowflake features that I find particularly useful, as someone who does analytics. The presentation demonstrated simultaneous workloads that share the same data as well as analytically intensive SQL patterns against large-scale, semi-structured data.

I thought I'd refactor my presentation as a series of blog entries to share some useful insights and interesting patterns with a broader audience. I'm going to step through how to incorporate sentiment analysis as well as tweet similarity into an interactive model using both Looker and Snowflake. (Note: if you haven't read our previous blog on sentiment analysis using Naïve Bayes, I highly recommend you do so.)

Part 1 - Simultaneous Workloads, Shared Data

N/A

Part 2 - Tweet Similarity

Overview

@githoov
githoov / lambda_example.py
Created December 14, 2016 23:46
Lambda Example
# preliminaries
import re
import json
import logging
import boto3
import urllib
import base64
# fire up logger
logger = logging.getLogger()
/*
create a JavaScript UDF to find regexp matches and throw them into an array
*/
create or replace function regexp_matches(TEXT string, PATTERN string)
returns variant
language javascript
as '
var re = new RegExp(PATTERN, "g");
var res = [];
var m;
while (m = re.exec(TEXT)) {
  res.push(m[0]);
}
return res;
';
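Calling the UDF is then just a matter of passing a string and a pattern. A hypothetical example (the literal and the pattern are made up; backslashes are doubled because Snowflake treats the backslash as an escape character in string literals):

select regexp_matches('call 555-1234 or 555-9876', '\\d{3}-\\d{4}') as matches;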
@githoov
githoov / tweet_simiarity.sql
Last active October 17, 2016 17:36
Tweet Similarity
/*
this is a basic first pass at tweet similarity
using a modified cosine-similarity approach. the
modification in question weights words based on their
rarity across the corpus. that means that more common
words that aren't quite stop words get a lower weight
than less common words—the method is called inverse
document frequency.
*/
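Fleshed out, the query takes roughly the shape below. This is only a sketch, not the gist's actual query: it assumes a hypothetical tweet_words(tweet_id, word) table with one row per word occurrence in each tweet.

-- hypothetical input: tweet_words(tweet_id, word), one row per word occurrence
with totals as (
  select count(distinct tweet_id) as n_tweets
  from tweet_words
),
idf as (
  -- inverse document frequency: rarer words get a higher weight
  select w.word,
         ln(t.n_tweets * 1.0 / count(distinct w.tweet_id)) as weight
  from tweet_words as w
  cross join totals as t
  group by w.word, t.n_tweets
),
vectors as (
  -- idf-weighted term vector for each tweet
  select w.tweet_id, w.word, count(*) * i.weight as score
  from tweet_words as w
  join idf as i on i.word = w.word
  group by w.tweet_id, w.word, i.weight
),
norms as (
  select tweet_id, sqrt(sum(score * score)) as norm
  from vectors
  group by tweet_id
)
select a.tweet_id as tweet_a,
       b.tweet_id as tweet_b,
       sum(a.score * b.score) / (na.norm * nb.norm) as cosine_similarity
from vectors as a
join vectors as b
  on a.word = b.word
 and a.tweet_id < b.tweet_id
join norms as na on na.tweet_id = a.tweet_id
join norms as nb on nb.tweet_id = b.tweet_id
group by a.tweet_id, b.tweet_id, na.norm, nb.norm
order by cosine_similarity desc;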
  1. make sure an inbound rule is set up for ssh on port 22 from anywhere
  2. sudo yum update
  3. sudo yum install -y aws-kinesis-agent
  4. vi /etc/aws-kinesis/agent.json and enter the following:
    {
      "cloudwatch.emitMetrics": true,
      "kinesis.endpoint": "",
      "firehose.endpoint": "https://firehose.us-east-1.amazonaws.com",
    
@githoov
githoov / generate.r
Last active September 1, 2016 17:22
R Script to Create a Survival Plot and to Generate a Sample Data Set
# preliminaries
library("ggplot2")
library("zoo")
set.seed(111)
# generate plot of survival curve
x <- sort(dexp(seq(0, 1, 0.01)), decreasing = TRUE)
ggplot(data.frame(x = c(0, 5)), aes(x)) +
  stat_function(fun = dexp, args = list(rate = 1)) +
  scale_x_continuous(labels = c(expression(t["0"], t["1"], t["2"], t["3"], t["4"], t["5"]))) +
  labs(x = "Time", y = expression(P(T > t["i"])), title = "Survival Function")
# simulate subscription data
#!/usr/bin/env ruby
require 'json'
require 'optparse'
@file_path = ''
@separator = ''
@table_name = ''
OptionParser.new do |opts|