Alexander Dean alexanderdean
/*
* Copyright (c) 2015 Tim Harper.
*/
import sbt._
import Keys._
import xerial.sbt.Pack._
object SamzaTasks {
// Request body expected to validate against this JSON Schema
private val PayloadDataSchema =
SchemaCriterion("com.snowplowanalytics.snowplow", "payload_data", "jsonschema", 1, 0)
// Check JSON is a payload_data version 1-0-*, and verify it against the schema
val body: ValidatedNel[JsonNode] = bodyNode.verifySchemaAndValidate(schemaCriterion)
input_lines = LOAD '$INPUT' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
{
"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
"description": "Schema for a video_played event",
"self": {
"vendor": "com.channel2.vod",
"name": "video_played",
"format": "jsonschema",
"version": "1-0-0"
},
"type": "object",
{
"schema": "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
"data": {
"length": 213,
"id": "hY7gQrO"
}
}
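The `schema` field in the instance above packs the vendor, name, format and version into a single `iglu:` URI, and the earlier Scala comment describes accepting any `payload_data` version `1-0-*`. A minimal sketch of both ideas follows. It assumes the URI shape `iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>`; `SchemaKey`, `parse_schema_uri` and `matches_criterion` are hypothetical helper names made up for this illustration and are not part of any Snowplow or Iglu library.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """Hypothetical container for the four parts of an iglu: URI."""
    vendor: str
    name: str
    format: str
    version: str


# Assumed URI shape: iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
_IGLU_URI = re.compile(
    r"iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/(?P<version>\d+-\d+-\d+)"
)


def parse_schema_uri(uri: str) -> SchemaKey:
    """Split an iglu: URI into its parts, or raise ValueError."""
    match = _IGLU_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a recognisable iglu URI: {uri!r}")
    return SchemaKey(**match.groupdict())


def matches_criterion(key: SchemaKey, vendor: str, name: str, fmt: str,
                      model: int, revision: int) -> bool:
    """True if key has the given vendor/name/format and version model-revision-*."""
    key_model, key_revision, _addition = (int(p) for p in key.version.split("-"))
    return (
        (key.vendor, key.name, key.format) == (vendor, name, fmt)
        and key_model == model
        and key_revision == revision
    )


key = parse_schema_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
ok = matches_criterion(key, "com.channel2.vod", "video_played", "jsonschema", 1, 0)
```

Here the addition component (the last number of the version) is ignored on purpose, so `1-0-0` and `1-0-3` would both satisfy a `1-0-*` criterion under this sketch.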
alexanderdean / video_played.json
Last active August 29, 2015 14:02
Example JSON Schema for a video_played.json
{
"$schema": "http://json-schema.org/schema#",
"description": "Schema for a video_played event",
"type": "object",
"properties": {
"length": {
"type": "number"
},
"id": {
"type": "string"
/**
* Loader for Thrift SnowplowRawEvent objects which
* are inbound as a simple Byte Array.
*/
object ThriftByteArrayLoader extends CollectorLoader[Array[Byte]] {
private val thriftDeserializer = new TDeserializer
/**
* Converts the source string into a MaybeCanonicalInput.
---
# ^^^ YAML documents must begin with the document separator "---"
#
#### Example docblock, I like to put a descriptive comment at the top of my
#### playbooks.
#
# Overview: Playbook to bootstrap a new host for configuration management.
# Applies to: production
# Description:
# Ensures that a host is configured for management with Ansible.
alexanderdean / redshift-bug
Created December 17, 2013 17:13
Redshift bug when working with JSONs and UNIONs
-- 1. Setup
DROP table bug_table cascade;
CREATE TABLE bug_table (
some_json varchar(200),
some_flag boolean
);
CREATE VIEW bug_view_1 AS

Custom unstructured event and context functionality: draft specification

0. Introduction

This draft specification covers the enrichment and storage processes for:

  1. Custom unstructured events
  2. Custom unstructured context

Custom unstructured events are well documented as part of the Snowplow Tracker Protocol. Custom unstructured context is less well documented; essentially, it looks like this: