yalisassoon/custom-event-functionality-spec.md

## custom-event-functionality-spec.md

      
    Raw
  

              custom-event-functionality-spec.md
            
          
    Custom event functionality: initial specification

1. Defining custom events

Snowplow users specifies their own custom event dictionary as a YAML. This is a list of all the different event types in their own custom dictionary:
- event_type: product_view
- event_type: add_to_basket
- event_type: submit_transaction
- event_type: transaction_completed
For each event type the user specifies the complete set of fields that will be captured:
- event_type: product_view
  fields:
  - name: sku
    type: string
    max_length: 10
    required: true
  
  - name: category
    type: string
    max_length: 10
    
  - name: brand
    type: string
    max_length: 20
    
  - name: price
    type: number
    max_length: 16
    decimals: 2
    
  - name: sizes
    type: array
      type: string
      max_length: 5
    
  - name: available_since
    type: date
The Snowplow user can then track the events defined in their event dictionary using the trackUnstructEvent method e.g. via the Javascript tracker:
_snaq.push('trackUnstructEvent', 'product_view'
  {
    sku: 'ASO01043',
    category: 'Dresses',
    brand: 'ACME',
    price: 49.95,
    sizes: ['xs', 's', 'l', 'xl', 'xxl'],
    available_since$dt: new Date(2013,3,7)
  }
]);
By default, all Snowplow users will have a general atomic.events table in Redshift / PostgreSQL. This will include only the following fields, all of which are required:


Field name
Field type
Description


event_id
string
Unique ID assigned to each event by enrichment process


event_type
string
The name of the event e.g. 'product_view'


context_type
string
The context an event takes place in e.g. 'web'. This is explored further in the next section.


collector_tstamp
timestamp
The time that the event was recorded


In addition, Snowplow will create a separate table for every event defined in the YAML event dictionary. This will include all the fields specified in the YAML event definition, and the event ID e.g.
CREATE TABLE atomic.product_views (
  event_id varchar(16) 
  sku varchar(10),
  category varchar(10),
  brand varchar(20),
  price dec(18,2),
  sizes_id varchar(16), -- This is a key to a separate sizes table generated by the ETL process - see below
  available_since date,
  primary key(event_id),
  distkey(event_id)
);
Note that in the case of the array field type, Snowplow would automatically define a separate table is defined:
CREATE TABLE atomic.product_views_sizes ( --name is concatenated of the parent table and the field name
  size_id varchar(16),
  size varchar(5),
  primary key(size_id),
  distribution key(size_id)
);
Note that Snowplow would generate a new size_id value for each new event.
The Snowplow Enrichment process would then unpick custom unstructured events with event_type = 'product_view' and use the name / value pairs to populate the values in the three tables:

atomic.events
atomic.product_views
atomic.product_views_sizes

Analysts would be able to query the data simply e.g.
SELECT *
FROM atomic.events e
LEFT JOIN atomic.product_views pv
ON e.event_id = pv.event_id
Note that because we are joining on a distribution key, query performance should be as fast as if all the data is was on a single table.
2. Capturing context

Each event has a context. It should be possible for users to define different context types, as YAML files, and define the different fields that can be captured in different context types.
For example:
- context_type: web
- context_type: mobile
For each context type, it should be possible to define the fields that can be captured:
- context_type: web
  fields:
  - name: page_url
    type: string
    max_length: 1000
  
  - name: page_title
    type: string
    max_length: 1000
    
  - name: refr_url
    type: string
    max_length: 1000
    
  - name: doc_height
    type: number
    decimals: 0
    max_length: length: 6
...
  
Context would be passed in as its own JSON. Different context types would be unpicked into different Redshift tables (so you'd have a table for web contexts, a table for mobile contexts, for example). All joins would still be done on event_id.
NOTE: it is not clear in this model how you'd handle enrichments, for example:

Splitting up refr_url and page_url into constituent components
Categorising referers into search / social etc.
Classifying paid marketing sources based on URL query string parameters
Inferring geo location from IP addresses

Potentially the above could be configured so that an enrichment has a YAML that defines the input set of fields, the enrichment module itself, and the output set of fields?
3. Practical considerations

Clearly we only want to roll out the basic custom event functionality at this stage: context is a much lower priority, but something to at least consider as we architect a solution...
Field name	Field type	Description
`event_id`	string	Unique ID assigned to each event by enrichment process
`event_type`	string	The name of the event e.g. 'product_view'
`context_type`	string	The context an event takes place in e.g. 'web'. This is explored further in the next section.
`collector_tstamp`	timestamp	The time that the event was recorded