Snowplow users specifies their own custom event dictionary as a YAML. This is a list of all the different event types in their own custom dictionary:
- event_type: product_view
- event_type: add_to_basket
- event_type: submit_transaction
- event_type: transaction_completed
For each event type the user specifies the complete set of fields that will be captured:
- event_type: product_view
fields:
- name: sku
type: string
max_length: 10
required: true
- name: category
type: string
max_length: 10
- name: brand
type: string
max_length: 20
- name: price
type: number
max_length: 16
decimals: 2
- name: sizes
type: array
type: string
max_length: 5
- name: available_since
type: date
The Snowplow user can then track the events defined in their event dictionary using the trackUnstructEvent method e.g. via the Javascript tracker:
_snaq.push('trackUnstructEvent', 'product_view'
{
sku: 'ASO01043',
category: 'Dresses',
brand: 'ACME',
price: 49.95,
sizes: ['xs', 's', 'l', 'xl', 'xxl'],
available_since$dt: new Date(2013,3,7)
}
]);
By default, all Snowplow users will have a general atomic.events
table in Redshift / PostgreSQL. This will include only the following fields, all of which are required:
Field name | Field type | Description |
---|---|---|
event_id |
string | Unique ID assigned to each event by enrichment process |
event_type |
string | The name of the event e.g. 'product_view' |
context_type |
string | The context an event takes place in e.g. 'web'. This is explored further in the next section. |
collector_tstamp |
timestamp | The time that the event was recorded |
In addition, Snowplow will create a separate table for every event defined in the YAML event dictionary. This will include all the fields specified in the YAML event definition, and the event ID e.g.
CREATE TABLE atomic.product_views (
event_id varchar(16)
sku varchar(10),
category varchar(10),
brand varchar(20),
price dec(18,2),
sizes_id varchar(16), -- This is a key to a separate sizes table generated by the ETL process - see below
available_since date,
primary key(event_id),
distkey(event_id)
);
Note that in the case of the array field type, Snowplow would automatically define a separate table is defined:
CREATE TABLE atomic.product_views_sizes ( --name is concatenated of the parent table and the field name
size_id varchar(16),
size varchar(5),
primary key(size_id),
distribution key(size_id)
);
Note that Snowplow would generate a new size_id
value for each new event.
The Snowplow Enrichment process would then unpick custom unstructured events with event_type = 'product_view'
and use the name / value pairs to populate the values in the three tables:
atomic.events
atomic.product_views
atomic.product_views_sizes
Analysts would be able to query the data simply e.g.
SELECT *
FROM atomic.events e
LEFT JOIN atomic.product_views pv
ON e.event_id = pv.event_id
Note that because we are joining on a distribution key, query performance should be as fast as if all the data is was on a single table.
Each event has a context. It should be possible for users to define different context types, as YAML files, and define the different fields that can be captured in different context types.
For example:
- context_type: web
- context_type: mobile
For each context type, it should be possible to define the fields that can be captured:
- context_type: web
fields:
- name: page_url
type: string
max_length: 1000
- name: page_title
type: string
max_length: 1000
- name: refr_url
type: string
max_length: 1000
- name: doc_height
type: number
decimals: 0
max_length: length: 6
...
Context would be passed in as its own JSON. Different context types would be unpicked into different Redshift tables (so you'd have a table for web contexts, a table for mobile contexts, for example). All joins would still be done on event_id
.
NOTE: it is not clear in this model how you'd handle enrichments, for example:
- Splitting up
refr_url
andpage_url
into constituent components - Categorising referers into search / social etc.
- Classifying paid marketing sources based on URL query string parameters
- Inferring geo location from IP addresses
Potentially the above could be configured so that an enrichment has a YAML that defines the input set of fields, the enrichment module itself, and the output set of fields?
Clearly we only want to roll out the basic custom event functionality at this stage: context is a much lower priority, but something to at least consider as we architect a solution...