Skip to content

Instantly share code, notes, and snippets.

@saicitus
Created November 1, 2016 22:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save saicitus/2fda6975609e7e9c963a32759968a0a4 to your computer and use it in GitHub Desktop.
Save saicitus/2fda6975609e7e9c963a32759968a0a4 to your computer and use it in GitHub Desktop.

session_analytics

This PostgreSQL extension provides some functions for querying hstore arrays.

Functions

  • count_elements(elements hstore[], pattern text). returns the number of elements that match the given pattern.
  • multi_count_elements(elements hstore[], patterns text[]). returns an array of integers which contains number of elements that match the corresponding pattern. This is equivalent of multiple calls of count_elements, but is usually faster.
  • sum(int[]). aggregate function for summing vectors. This is useful for aggregating result of multi_count_elements.
  • filter_elements(elements hstore[], pattern text). returns the subsequence of elements that match the given pattern.
  • contains_elements(elements hstore[], patterns text[]). returns 1 if we can find a subsequence of elements such that they match the corresponding patterns. Otherwise, it returns 0.
  • funnel_events(elements hstore[], patterns text[]). returns an array with length equal to length of patterns. if patterns[1..i] matches a subesequence of patterns, then result[i] is 1, otherwise result[i] is 0.
  • group_elements(elements hstore[], fields text[]). groups the elements by the value of the given fields, and returns a row per each group. Each group is tuple (keys text[], elements hstore[]).
  • group_over_time(elements hstore[], fields text[], time_field text, truncation_unit text, client_offset interval). this function is similar to group_elements, but also adds the value of the given time field to group keys. The time field is shifted by the client offset and truncated by the given unit before being add to the group keys.
  • filter_groups(group group_elements_result, pattern text). filters out the given group if it doesn't match the given pattern.

Deprecated Functions

  • cohort_over_time(elements hstore[], fields text[], time_field text, truncation_unit text, client_offset interval, patterns text[], min_sizes int[]). similar to group_over_time, but filters out the groups who doesn't have minimum count of elements matching the given patterns. This function now can be replaced with a nested call to filter_groups and group_over_time.

Pattern Language

The pattern language below is used to match hstore objects using field based expressions that can be combined using AND (",") and OR ("|") operators to create complex expressions:

  • field_name => value. Is value of field_name equal to the given value?
  • field_name =>not value. Is value of field_name not equal to the given value?
  • field_name =>gt value, field_name =>gte value, field_name =>lt value, field_name =>lte value. int64 comparison between value of field_name and given value.
  • field_name =>contains value. Is given value substring of the value of field_name?
  • field_name =>not_contains value. Opposite of =>contains.
  • field_name =>regex pattern. Does value of field_name match the given regex pattern?
  • field_name =>like pattern. Does value of field_name match the given like pattern?
  • field_name =>ilike pattern. Similar to =>like, but case insensitive.
  • A , B. Does both A and B hold?
  • A | B. Does A or B hold?
  • (A). Parentheses for arbitrary mixing AND and OR operators.
  • values can be escaped and quoted.

The above pattern language is extended for filter_groups to support aggregate functions:

  • You can use function(pattern) in place of field_name in any of the above patterns.
  • count(pattern). number of elements matching the pattern.
  • sum(pattern;field_name). sum of field_name in elements matching the pattern.
  • max(pattern;field_name). max of field_name in elements matching the pattern.
  • min(pattern;field_name). min of field_name in elements matching the pattern.
  • avg(pattern;field_name). integer part of average of field_name in elements matching the pattern.

Examples

These examples are based on a database where you have a schema like the following:

┌─────────────┬──────────┬───────────┐
│   Column    │   Type   │ Modifiers │
├─────────────┼──────────┼───────────┤
│ cust_id     │ bigint   │           │
│ cust_events │ hstore[] │           │
└─────────────┴──────────┴───────────┘

You can think of element in the cust_events array as representing a single event that a customer generated. An example event might look like:

{
    client      => iphone-32.2.1a,
    page_class  => message,
    action      => send,
    time        => 1398363313
}

Count occurences of certain types of events

Here is an example where you are counting the total number of message page views:

SELECT
    sum(count_elements(cust_events, 'action=>view,page_class=>message'))
FROM
    cust_events;

Count occurences of certain types of events grouped by additional event metadata

Here is an example of counting the types of clients cucstomers use to view or list messages. Note the use of the '|' and ',' operators to do OR and AND operations and also the use of '(...)' to group expressions.

SELECT
    count(*),
    unnest(
        filter_elements(cust_events, '(action=>view,page_class=>message)|(action=>list,page_class=>message)')
    )->'client' as client,
FROM
    cust_events
GROUP BY
    client;

Another way to ask the same question illustrating how filters can be chained:

SELECT
    count(*),
    unnest(
        filter_elements(
            filter_elements(cust_events, 'page_class=>message'),
            'action=>list|action=>view'
        )
    )->'client' as client,
FROM
    cust_events
GROUP BY
    client;

Count customers who performaned certain actions over their lifetime

Here's an example illustrating the use of the count_elements function to count the number of users who repeatedly viewed a message:

SELECT
    count(*)
FROM
    cust_events
WHERE
    count_elements(cust_events, 'action=>view,page_class=>message') > 2;

Listing customers who performed a sequence of actions

Say you run a retail application and want to list users who searched for a product, viewed it, and went on to purchase:

SELECT
    user_id
FROM
    cust_events
WHERE
    contains_elements(cust_events, 
        ARRAY['action=>search', 'action=>view,page_class=>product', 'action=>purchase']
    );

Generating funnel queries

Now imagine an application similar to the one above but instead of a list of users you want to know the counts of users that moved through various stages of your funnel:

SELECT
    sum(contains_elements(cust_events, ARRAY['action=>search']) as search,
    sum(contains_elements(cust_events, ARRAY['action=>search', 
                                             'action=>view,page_class=>product']) as view,
    sum(contains_elements(cust_events, ARRAY['action=>search', 
                                             'action=>view,page_class=>product', 
                                             'action=>purchase']) as purchase
FROM
    cust_events;

Adding New Operators

To add new operators:

  1. add the operator to the TOKENIZATION_REGEX,
  2. add it to the list of token string pound-definitions,
  3. add it to the OperatorCode enum,
  4. add it to the OperatorTokens mapping,
  5. For string operators, modify StringOperator()/StringOperatorResult() functions.
  6. For numeric operators, modify NumericOperator()/NumericOperatorResult() functions.

Copyright

Copyright 2014-2016 Citus Data.

Licensed under the Apache License, Version 2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment