
Custom unstructured event and context functionality: draft specification

0. Introduction

This draft specification covers the enrichment and storage processes for:

  1. Custom unstructured events
  2. Custom unstructured context

Custom unstructured events are well documented as part of the Snowplow Tracker Protocol. Custom unstructured context is less well documented - essentially it looks like this:

  • Any existing event (e.g. trackPageView), including unstructured events, can have a custom context attached to it by way of two additional arguments to the tracking function
  • The additional arguments are the name of the context followed by the context properties (a JSON)
  • This allows you to extend an existing event type with custom properties, without having to create a new custom unstructured event type

Example:

_snaq.push(['trackPageView', 'Product: The Last of Us', 'product-page', {
  type: 'game',
  platform: 'ps3',
  sku: 'ps3-2312-13x'
}]);

1. Initial support

It looks like Amazon has removed their limitations on Redshift table widths. This should make it relatively straightforward for us to add support for the "raw" forms of custom unstructured events and custom unstructured context into our table definitions. The idea here is that we will perform very lightweight validation of the incoming event/context JSONs and then write them into dedicated columns in the Redshift and Postgres table definitions.
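As a rough illustration of what this might look like in Redshift, the sketch below adds two wide varchar columns to atomic.events to hold the raw JSONs. The column names and widths are assumptions for illustration only, not part of this specification:

-- Hypothetical sketch: column names and widths are assumptions, not final
ALTER TABLE atomic.events ADD COLUMN unstruct_event varchar(10000);
ALTER TABLE atomic.events ADD COLUMN context varchar(10000);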

We have assigned the tickets for this work to a dedicated milestone for scheduling: Initial unstructured events & context.

2. Schema-driven extraction into dedicated tables

In a subsequent phase of work, we want to support schema-driven extraction of custom unstructured events and context into dedicated tables in Redshift and Postgres.

Schema design

We will aim to express the schemas using a YAML format added into the EmrEtlRunner's config.yml:

:enrichments:
  ..
  event_definitions:
    - in_name: Add to Basket
      out_name: add_to_baskets
      fields:
      - in:
          name: int-sku
          datatype: string
          required: true
        out:
          name: internal_sku # Column rename
          max_length: 10 # Silent truncate
      - in:
          name: quantity
          datatype: int
          required: false # Or can leave out
        out:
          type: bigint # Narrowing the type
      - in:
          name: sizes
          datatype: list<string>
        # out the same as in
      - in:
          name: available_since
          datatype: dt
        out:
          datatype: timestamp # Widening the type. Default so can leave out
      - ...
    - name: Save Game
      tag: ...
  context_definitions:
    table_prefix: ctx_
    - in_name: Product Page
      out_name: product_pages
      fields:
      - in:
          name: type
          datatype: string
          required: true
        out:
          name: product_type
          max_length: 25
      - ...

Some ideas for the future:

  • Contracts on input values e.g. non-empty string, or positive integer
  • Ability to do dynamic field-level mappings (or validations) in JavaScript
  • Write a linter to check the format (and make sure e.g. in datatype and out datatype are compatible)

Processing in Enrichment phase

During the Enrichment phase:

  • Each unstructured event's name is looked up against the map of defined unstructured events
  • If a definition is not found, no further steps are taken (the raw JSON properties are simply stored in the Redshift table as before)
  • If a definition is found, the definition is parsed and used to programmatically unmarshal the raw JSON into the predefined fields
  • If the unmarshalling fails, then the error is written out into the bad bucket

The exact same process happens for custom unstructured context.

Storage

The Enrichment phase should write out each type of unstructured event and unstructured context into its own dedicated set of flat files. These flat files should be partitioned into folders based on the event/context out_name.

So where before we wrote out all events like so:

<out bucket>/events/part-0000x

Now we will write out like so:

<out bucket>/good/core/part-0000x
<out bucket>/good/events/add_to_baskets/part-0000x
<out bucket>/good/events/save_games/part-0000x
<out bucket>/good/events/.../part-0000x
<out bucket>/good/context/product_pages/part-0000x
<out bucket>/good/context/editorial_pages/part-0000x
<out bucket>/good/context/.../part-0000x

We will update the StorageLoader to programmatically load each folder into the corresponding table - for example, the rows in context/editorial_pages will end up in a table called atomic.ctx_editorial_pages. For now, the Snowplow user will have to construct these child tables manually and install them into their Redshift cluster.
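As a sketch only, the StorageLoader's load for one such folder might look like the COPY below. The bucket path and credentials are placeholders, and the delimiter and other options are assumptions rather than decided behaviour:

-- Sketch only: path and credentials are placeholders; options are assumptions
COPY atomic.ctx_product_pages
FROM 's3://<out bucket>/good/context/product_pages/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '\t';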

In the case of arrays of values, these would be written out into a further child table, like so:

<out bucket>/good/events/add_to_baskets/part-0000x
<out bucket>/good/events/add_to_baskets_sizes/part-0000x

Here is an example:

CREATE TABLE atomic.evt_add_to_baskets (
  event_id varchar(16),
  sku varchar(10),
  category varchar(10),
  brand varchar(20),
  price dec(18,2),
  size_id varchar(16), -- This is a key to a separate sizes table generated by the ETL process - see below
  available_since date,
  primary key(event_id)
)
distkey(event_id);

Note that in the case of the array field type, Snowplow would automatically define a separate table:

CREATE TABLE atomic.evt_add_to_baskets_sizes ( -- Name is the parent table name concatenated with the field name
  size_id varchar(16),
  size varchar(5),
  primary key(size_id)
)
distkey(size_id);

Note that Snowplow would generate a new size_id value for each new event.
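For example, assuming the table and column names sketched above, an event's sizes could be retrieved with a join like this:

-- Sketch only: table and column names follow the examples above
SELECT e.event_id, s.size
FROM atomic.evt_add_to_baskets e
JOIN atomic.evt_add_to_baskets_sizes s
ON e.size_id = s.size_id;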

The format of the current "fat" core atomic.events table will be unchanged for now.

Some ideas for the future:

  • Auto-generate the Redshift DDL from the YAML
  • Create new first-class Snowplow event types (e.g. link clicks) using the above functionality
  • Migrate existing first-class Snowplow event types (e.g. ecommerce tracking) to use the above functionality

Analysis

Each child event or context table will include the (parent) event_id to make a join back to the core events table possible.

Given that we will frequently be joining the core atomic.events table to these child tables, we will update the atomic.events table to specify the event_id join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
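In Redshift DDL terms, the relevant clauses on atomic.events would look something like the sketch below; the full column list is omitted and unchanged:

-- Sketch only: existing columns omitted; shows just the join-key clauses
CREATE TABLE atomic.events (
  event_id varchar(16),
  -- ... all other existing columns unchanged ...
  primary key(event_id)
)
distkey(event_id)
sortkey(event_id);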

Analysts would be able to query the data simply e.g. with:

SELECT *
FROM atomic.events e
LEFT JOIN atomic.evt_add_to_baskets atb
ON e.event_id = atb.event_id;

Note that because we are joining on a distribution key and sort key, query performance should be as fast as if all the data were in a single table.
