Skip to content

Instantly share code, notes, and snippets.

@yalisassoon
Last active December 29, 2015 17:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save yalisassoon/7704998 to your computer and use it in GitHub Desktop.
Save yalisassoon/7704998 to your computer and use it in GitHub Desktop.
Initial specification for Snowplow custom events functionality

Custom event functionality: initial specification

1. Defining custom events

Snowplow users specifies their own custom event dictionary as a YAML. This is a list of all the different event types in their own custom dictionary:

- event_type: product_view
- event_type: add_to_basket
- event_type: submit_transaction
- event_type: transaction_completed

For each event type the user specifies the complete set of fields that will be captured:

- event_type: product_view
  fields:
  - name: sku
    type: string
    max_length: 10
    required: true
  
  - name: category
    type: string
    max_length: 10
    
  - name: brand
    type: string
    max_length: 20
    
  - name: price
    type: number
    max_length: 16
    decimals: 2
    
  - name: sizes
    type: array
      type: string
      max_length: 5
    
  - name: available_since
    type: date

The Snowplow user can then track the events defined in their event dictionary using the trackUnstructEvent method e.g. via the Javascript tracker:

_snaq.push('trackUnstructEvent', 'product_view'
  {
    sku: 'ASO01043',
    category: 'Dresses',
    brand: 'ACME',
    price: 49.95,
    sizes: ['xs', 's', 'l', 'xl', 'xxl'],
    available_since$dt: new Date(2013,3,7)
  }
]);

By default, all Snowplow users will have a general atomic.events table in Redshift / PostgreSQL. This will include only the following fields, all of which are required:

Field name Field type Description
event_id string Unique ID assigned to each event by enrichment process
event_type string The name of the event e.g. 'product_view'
context_type string The context an event takes place in e.g. 'web'. This is explored further in the next section.
collector_tstamp timestamp The time that the event was recorded

In addition, Snowplow will create a separate table for every event defined in the YAML event dictionary. This will include all the fields specified in the YAML event definition, and the event ID e.g.

CREATE TABLE atomic.product_views (
  event_id varchar(16) 
  sku varchar(10),
  category varchar(10),
  brand varchar(20),
  price dec(18,2),
  sizes_id varchar(16), -- This is a key to a separate sizes table generated by the ETL process - see below
  available_since date,
  primary key(event_id),
  distkey(event_id)
);

Note that in the case of the array field type, Snowplow would automatically define a separate table is defined:

CREATE TABLE atomic.product_views_sizes ( --name is concatenated of the parent table and the field name
  size_id varchar(16),
  size varchar(5),
  primary key(size_id),
  distribution key(size_id)
);

Note that Snowplow would generate a new size_id value for each new event.

The Snowplow Enrichment process would then unpick custom unstructured events with event_type = 'product_view' and use the name / value pairs to populate the values in the three tables:

  1. atomic.events
  2. atomic.product_views
  3. atomic.product_views_sizes

Analysts would be able to query the data simply e.g.

SELECT *
FROM atomic.events e
LEFT JOIN atomic.product_views pv
ON e.event_id = pv.event_id

Note that because we are joining on a distribution key, query performance should be as fast as if all the data is was on a single table.

2. Capturing context

Each event has a context. It should be possible for users to define different context types, as YAML files, and define the different fields that can be captured in different context types.

For example:

- context_type: web
- context_type: mobile

For each context type, it should be possible to define the fields that can be captured:

- context_type: web
  fields:
  - name: page_url
    type: string
    max_length: 1000
  
  - name: page_title
    type: string
    max_length: 1000
    
  - name: refr_url
    type: string
    max_length: 1000
    
  - name: doc_height
    type: number
    decimals: 0
    max_length: length: 6
...
  

Context would be passed in as its own JSON. Different context types would be unpicked into different Redshift tables (so you'd have a table for web contexts, a table for mobile contexts, for example). All joins would still be done on event_id.

NOTE: it is not clear in this model how you'd handle enrichments, for example:

  • Splitting up refr_url and page_url into constituent components
  • Categorising referers into search / social etc.
  • Classifying paid marketing sources based on URL query string parameters
  • Inferring geo location from IP addresses

Potentially the above could be configured so that an enrichment has a YAML that defines the input set of fields, the enrichment module itself, and the output set of fields?

3. Practical considerations

Clearly we only want to roll out the basic custom event functionality at this stage: context is a much lower priority, but something to at least consider as we architect a solution...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment