Skip to content

Instantly share code, notes, and snippets.

@bstancil
bstancil / Finding User Sessions with SQL
Last active October 18, 2023 16:07
These queries let you define find user sessions against event data logged to Segment SQL, Snowplow, or Google BigQuery.
-- These queries let you define find user sessions against event data
-- logged to Segment SQL, Snowplow, or Google BigQuery.
-- For more details, see the full post:
-- LINK
--- SEGMENT SQL
-- Finding the start of every session
SELECT *
FROM (
SIMPLE VERSION
## Snippet
SELECT rr.id,
rr.created_at,
rr.report_id,
rr.account_id,
rr.executed_by_id,
rr.data_source_id,
<link href="https://cdn.rawgit.com/jaz303/tipsy/master/src/stylesheets/tipsy.css" rel="stylesheet" type="text/css">
<style>
#wrapper {
font-family: Helvetica, Arial, sans-serif;
font-size: 12px;
width: 900px;
margin: 0 auto;
padding-bottom: 10px;
}
@bstancil
bstancil / Schema.R
Last active October 29, 2015 18:15
This is a gist
V(g)$label.cex <- 3 * (0.06125 * V(g)$degree / max(V(g)$degree) + .2)
V(g)$label.color <- rgb(0, 0, .2, .49 * V(g)$degree / max(V(g)$degree) + .5)
V(g)$frame.color <- rgb(0, 0, .2, .39 * V(g)$degree / max(V(g)$degree) + .6)
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
E(g)$color <- rgb((colorRamp(c("blue", "yellow", "red"))(E(g)$weight/max(E(g)$weight)))/255)
E(g)$width <- egam
plot(g, layout=layout_on_sphere(g), vert
import csv
import requests
import nltk, string
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):

A few thoughts on creating a "universal user ID":

Unfortuantely, this isn't a trival task. There are some general ways that you can get close, but Segment's data is rarely clean enough that you can create a perfect mapping.

Ultimately, the problem stems from the fact that Segment's two basic ways of tracking identity - anonymous ids and user ids - overlap in both directions. Anonymous ids, which are basically browser cookies, can map to multiple user ids if multiple people log in from the same browser. And user ids can map to multiple anonymous ids if a user uses multiple browers.

This means that there are some cases when you can't conclusively say which user an anonymous id represents. There are ways to make best guesses, but it's not certain.

First, I create a big table of all the user_ids and anonymous_ids that we've tracked. How you do this exactly depends a little bit on your implementation fo Segment, but we do something like this:

Using Bridge with Multiple-Orgs

Currently Bridge can only act on behalf of a single Organization at a time, however it is possible to run multiple instances of Bridge on a server. To do so, we'll have to make some modifications to the system configuration.

Assumptions:

  • Linux/Ubuntu 14.04
  • Organization #1: RobotOrg
  • Organization #2: HumanOrg

Step 1 - Select RobotOrg in Mode and follow the normal Add-Data flow for a Data Source on a private network.

<!--Load the AJAX API-->
<script type="text/javascript" src="https://www.google.com/jsapi"></script>
<script type="text/javascript">
// Get the HTML body before calling google.load
var HTML = document.body.innerHTML;
// Load the Visualization API and the controls package.
google.load('visualization', '1.0', {'packages':['controls']});
SELECT *,
MIN(context_campaign_medium) OVER (PARTITION BY user_id ORDER BY occurred_at) AS first_channel,
SUM(new_session) OVER (ORDER BY user_id, occurred_at) AS global_session_id,
SUM(new_session) OVER (PARTITION BY user_id ORDER BY occurred_at) AS user_session_id
FROM (
SELECT *,
CASE WHEN EXTRACT('EPOCH' FROM occurred_at)
- EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session