breedx-splk/event_considerations.md

## event_considerations.md

      
This is an attempt to summarize some of the discussion from last Friday's Event WG meeting and to argue for the existence of metadata and for keeping it separate. It also provides a set of possible design choices for how to model events.
Event model

Background

We know that events will be implemented with logs as the transport. That's decided and, for the purposes of this discussion, is purely an implementation detail.
Events are a wad of information about something that happened at a point in time. The core content of an event is usually structured and is presented in a certain shape. We can call this shape a data model.
Not all events originate from otel components. For events that do originate from otel components, there is a strong need to have event-specific data models defined via otel semantic conventions. In other words, there is a need to define the constituent parts of a specific type of event and to document the agreed upon semantics. This allows unrelated otel components (mobile app, web app, backend service) to generate the same events in the same way. This also allows receivers of these events to interpret event contents consistently, regardless of the origination source. A schema is a common tool used to define data models.
For events not originating from otel components, otel does not control the data model, schema, or semantics, and so the event content can be viewed as opaque and passed on unmodified. In other cases, to be discussed at a later time, externally sourced events need to be translated into well-defined otel events, and these transformations can be documented elsewhere (similar to how export rules are already defined for traces).
Metadata

What is metadata?

Metadata is data about other data. It provides additional context about some entity without itself being a first-order component of said entity.
What are some examples of real-world metadata?

If the entity is a car, some of its fundamental inherent characteristics probably include things like manufacturer name and model, the number of wheels, the type of engine, and what side the steering wheel is on. Metadata about a car might include things like country of manufacture, name of the designer, and a set of region-specific safety ratings.
You could imagine many other pieces of data that are related to the car but are not actually part of the car. The car exists and functions just fine without its metadata, but a system that processes cars might need some of that metadata to make useful decisions. An import tariff system might need to know the origin, and a recommendation engine for parents of young children might use the safety ratings. Not every system will care about all the metadata -- it is often largely supplemental.
What about an example event? Let's make up an event for popping a balloon. We decide that the important things about a balloon popping are:

time it happened
balloon diameter
balloon age (in seconds)
balloon surface material
other item that touched the balloon

Because this example is contrived, you could also think of other things to include, but for now let's assume that the balloon committee agreed that the above is the important stuff for most purposes. What about metadata? Surely, there could be some supplemental information about this event, like:

who popped it (optional, of course)
whether anybody heard it (boolean)
where the balloon was purchased
how much it cost
ambient temperature
balloon color

The supplemental data can be useful to processing systems, but is not itself important to the fundamental definition of the BalloonPopped event. Some systems generating BalloonPopped events may not have some information available and might omit it.
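To make the split concrete, here is a minimal Python sketch of the BalloonPopped example. The class names, field names, units, and sample values are all made up for illustration; nothing here is an otel API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class BalloonPopped:
    """Core fields the (fictional) balloon committee agreed on."""
    timestamp: float        # time it happened, seconds since the epoch
    diameter_cm: float      # balloon diameter (the unit is an assumption)
    age_seconds: int        # balloon age (in seconds)
    surface_material: str   # balloon surface material
    touched_by: str         # other item that touched the balloon


@dataclass
class BalloonPoppedRecord:
    """The event plus whatever supplemental metadata the producer had."""
    event: BalloonPopped
    meta: Dict[str, Any] = field(default_factory=dict)


full = BalloonPoppedRecord(
    event=BalloonPopped(1700000000.0, 25.0, 3600, "latex", "pin"),
    meta={"heard": True, "color": "red", "ambient_temp_c": 21.5},
)

# A producer with no supplemental information simply omits it.
bare = BalloonPoppedRecord(
    event=BalloonPopped(1700000100.0, 30.0, 120, "mylar", "cactus"),
)
```

The event type is complete on its own, and the metadata is an open-ended bag that may be empty.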
Why do we need metadata?

Metadata gives supplemental data about other data. As mentioned above, it provides additional context about a thing that happened, without being a first-class part of that thing.
Why should we keep metadata separate?

A well-behaved data model has boundaries. It defines precisely what things are part of an entity, and anything not included in the definition is implicitly not part of the entity. A system that is processing an event instance with a well-defined schema knows what can be found inside of it, and it should not expect to find other non-eventy things in that event instance. This also prevents surprises: when handling an entity with a known schema, the handler does not need to try to process unexpected fields or work around unknown types.
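As a rough sketch of what "boundaries" could mean in code, the check below rejects any instance that carries fields outside its schema. The schema representation (field name mapped to a Python type), the field names, and the `validate` helper are assumptions for illustration, reusing the balloon example from earlier.

```python
from typing import Any, Dict

# Hypothetical schema for the BalloonPopped event: field name -> required type.
BALLOON_POPPED_SCHEMA: Dict[str, type] = {
    "time": float,
    "diameter": float,
    "age_seconds": int,
    "surface_material": str,
    "touched_by": str,
}


def validate(instance: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Raise ValueError unless instance has exactly the schema's fields and types."""
    unexpected = sorted(set(instance) - set(schema))
    if unexpected:
        raise ValueError(f"unexpected fields: {unexpected}")
    missing = sorted(set(schema) - set(instance))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, expected in schema.items():
        if not isinstance(instance[key], expected):
            raise ValueError(f"{key}: expected {expected.__name__}")


good = {
    "time": 1700000000.0,
    "diameter": 25.0,
    "age_seconds": 3600,
    "surface_material": "latex",
    "touched_by": "pin",
}
validate(good, BALLOON_POPPED_SCHEMA)  # passes silently
```

Adding a supplemental field such as a color to `good` makes `validate` raise, which is the point: anything not in the definition is not part of the event.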
A data model that is too fluid or relaxed just becomes mud. It doesn't contain enough structure to hold itself together. A well-designed data model is easy for a human user to glance at and instantly internalize into a mental model. A poorly defined muddy model doesn't provide this same value. It also requires overly-specific knowledge to be built into any processing code. Take this poor design for example:
{
  field1: "Smith",
  field2: "Jimmy",
  field3: "123 Maple Ln.",
  field4: "Anytown, US"
}
compared with
{
  surname: "Smith",
  given_name: "Jimmy",
  street_address: "123 Maple Ln.",
  city_state: "Anytown, US"
}
I know which one I'd rather deal with! The first object model is poorly defined -- it doesn't provide enough information about what each field is used for. A processor would need to have specific knowledge about field order and semantics.
While this is not yet specifically about metadata, I hope it helps to demonstrate that weakly defined data models and weakly defined APIs are harder to work with.
Separation of concerns is a well established design principle in software development. An object should be one thing. A utility should perform one main function. This principle fosters smaller, more modular software components, which typically makes modifications easier, safer, and smaller. I would argue that the same can apply to data modeling:
{
  name: {
    first: "Jimmy",
    last: "Smith"
  },
  address: { 
    street_ordinal: "123",
    street_name: "Maple Ln.",
    city: "Anytown",
    state: "US"    
  }
}
By splitting the original data model above into smaller, single-purpose parts and then grouping them, we have improved two areas of this design:

address and name have been separated. A glance at the top level clearly communicates that a person has a name and an address
the address is further decomposed into its constituent parts

A system that only deals with names can clearly ignore the address, and a system that aggregates surnames need only look within the name. Future changes to the address portion of the schema should have no impact on the name fields. This decomposition helps achieve this.
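A small Python sketch of the surname aggregator mentioned above, assuming people arrive as plain dictionaries in the decomposed shape. The `count_surnames` helper and the second person (with a differently shaped address) are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_surnames(people: Iterable[Dict[str, Any]]) -> Counter:
    """Count surnames, reading only the `name` group of each person."""
    return Counter(person["name"]["last"] for person in people)


people = [
    {
        "name": {"first": "Jimmy", "last": "Smith"},
        "address": {
            "street_ordinal": "123",
            "street_name": "Maple Ln.",
            "city": "Anytown",
            "state": "US",
        },
    },
    {
        "name": {"first": "Ann", "last": "Smith"},
        # A hypothetical future address shape; the aggregator never looks at it.
        "address": {"po_box": "42", "city": "Anytown", "state": "US"},
    },
]

surname_counts = count_surnames(people)
```

The function keeps working when the address group changes shape, because it never touches anything outside `name`.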
Let's take another example with metadata:
{
  job_title: "Regional Manager",
  surname: "Smith",
  favorite_color: "salmon",
  given_name: "Jimmy",
  lawn_size: 142,
  street_address: "123 Maple Ln.",
  city_state: "Anytown, US",
  places_lived: 7,
  lawn_size_units: "m2"
}
This big bag of fields contains a blend of core Person data and some supplemental metadata about that person and their context. It has the benefit of being flat, but a user looking at the instance cannot readily discern which fields are fundamentally important to the definition of the Person. A schema for Person would either need to include everything presented here, or have a muddy default fallback of "everything not in the schema is supplemental metadata".
We can improve this by grouping the metadata into its own entity and splitting it out:
{
  surname: "Smith",
  given_name: "Jimmy",
  street_address: "123 Maple Ln.",
  city_state: "Anytown, US",
  meta: {
    favorite_color: "salmon",
    job_title: "Regional Manager",  
    places_lived: 7,
    lawn_size: 142,
    lawn_size_units: "m2"
  }
}
By grouping the metadata, we have created a clear separation between what is a fundamentally important top level characteristic of a Person and what other data might be metadata. To take it one step further, we can leverage the decomposition from before and also apply it to the metadata:
{
  name: {
    first: "Jimmy",
    last: "Smith"
  },
  address: { 
    street: {
      ordinal: "123",
      name: "Maple Ln."
    },
    city: "Anytown",
    state: "US"    
  },
  meta: {
    favorite_color: "salmon",
    job_title: "Regional Manager",  
    places_lived: 7,
    lawn: {
      size: 142,
      size_units: "m2"
    }
  }
}
The remaining problem is a little academic, but still worth covering. As it stands right now, the metadata itself is a component of the thing it describes. It is on the same level as name and address, and thus, the metadata has two responsibilities -- it is part of a person and it describes a person. This is weirdly self-referencing and violates the separation of concerns principle. Taking it all the way:
{
  person: {
    name: {
      first: "Jimmy",
      last: "Smith"
    },
    address: { 
      street: {
        ordinal: "123",
        name: "Maple Ln."
      },
      city: "Anytown",
      state: "US"    
    }
  },
  meta: {
    favorite_color: "salmon",
    job_title: "Regional Manager",  
    places_lived: 7,
    lawn: {
      size: 142,
      size_units: "m2"
    }
  }
}
We now have an entity that contains both a person and the metadata about that person. Each can be easily discerned by a human eye, and can be handled separately. Schema modifications to one area of the Person (name) do not necessarily impact the schema of another area (address).
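For what it's worth, here is one way the flat bag of fields from earlier could be regrouped into this person/meta envelope. It is only a sketch: the `CORE_FIELDS` and `META_FIELDS` sets and the `to_envelope` helper are assumptions, and it does not perform the further decomposition into name, address, and lawn.

```python
from typing import Any, Dict

# Assumed classification of the flat fields from the earlier example.
CORE_FIELDS = {"surname", "given_name", "street_address", "city_state"}
META_FIELDS = {
    "favorite_color", "job_title", "places_lived", "lawn_size", "lawn_size_units",
}


def to_envelope(flat: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split a flat bag into {"person": ..., "meta": ...}, rejecting unknown keys."""
    unknown = sorted(set(flat) - CORE_FIELDS - META_FIELDS)
    if unknown:
        # No muddy fallback: a field must be declared as core or as metadata.
        raise ValueError(f"unclassified fields: {unknown}")
    return {
        "person": {k: v for k, v in flat.items() if k in CORE_FIELDS},
        "meta": {k: v for k, v in flat.items() if k in META_FIELDS},
    }


flat = {
    "job_title": "Regional Manager",
    "surname": "Smith",
    "favorite_color": "salmon",
    "given_name": "Jimmy",
    "lawn_size": 142,
    "street_address": "123 Maple Ln.",
    "city_state": "Anytown, US",
    "places_lived": 7,
    "lawn_size_units": "m2",
}
envelope = to_envelope(flat)
```

Once the two parts are separate, a handler can be given `envelope["person"]` or `envelope["meta"]` alone.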
Event modeling options

With logs as the underlying transport, what options are we currently considering for modeling events?
Given the following high-level structure of an event, we have several options for how to model it:
[ timestamp ] [ severity ] [ name ] [ body ] [ attributes ]
For the purposes of this discussion, timestamp and severity are not relevant, so let's drop them, which leaves us with just:
[ name ] [ body ] [ attributes ]
Where name is a single namespaced/prefixed string attribute.
For the following options, we will use a contrived, yet un-spec'd event called otel.button.click that has the following definition:

x coordinate
y coordinate
button name

In some options, the x and y will be split into a point entity.
Other representative (contrived, unspec'd) metadata will include session.id, client.active-screen-name, and client.display.dimensions, which has width and height. JSON is only used here for demonstration purposes.
Option 1

Event body and metadata blended and flattened into attributes:
{
  "body": null,
  "attributes": {
    "client.display.dimensions.width": 2048,
    "event.name": "otel.button.click",
    "otel.button.click.x": 122,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "otel.button.click.y": 9,
    "client.active-screen-name": "status",
    "otel.button.name": "beepboop",
    "client.display.dimensions.height": 1024
  }
}
Pros:

nice and flat

Cons:

logs body is unused
duplicated data in attribute key names
hard to discern event content from metadata (muddy)
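To illustrate the last con, here is a sketch of what a consumer of Option 1 might have to do to recover the event content. The prefix heuristic and the `split_option_1` helper are assumptions; nothing in this proposal defines such a rule.

```python
from typing import Any, Dict, Tuple


def split_option_1(attributes: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Guess which flattened attributes are event content, using the event name as a key prefix."""
    prefix = attributes["event.name"] + "."
    content: Dict[str, Any] = {}
    meta: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key == "event.name":
            continue
        if key.startswith(prefix):
            content[key[len(prefix):]] = value  # strip the repeated prefix
        else:
            meta[key] = value
    return content, meta


attributes = {
    "client.display.dimensions.width": 2048,
    "event.name": "otel.button.click",
    "otel.button.click.x": 122,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "otel.button.click.y": 9,
    "client.active-screen-name": "status",
    "otel.button.name": "beepboop",
    "client.display.dimensions.height": 1024,
}
content, meta = split_option_1(attributes)
```

With the keys exactly as written in the example, `content` ends up holding only `x` and `y`, while `otel.button.name` lands in `meta` because it does not share the `otel.button.click.` prefix. The consumer needs event-specific knowledge to get this right.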

Option 2

Event body in attributes (event.data), metadata also in attributes.
Event body is also just a bag of flat namespaced attributes.
{
  "body": null,
  "attributes": {
    "event.name": "otel.button.click",
    "event.data": {
      "otel.button.click.x": 122,
      "otel.button.click.y": 9,
      "otel.button.name": "beepboop"
    },
    "client.display.dimensions.width": 2048,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "client.active-screen-name": "status",
    "client.display.dimensions.height": 1024
  }
}
Pros:

event body separate from other metadata

Cons:

logs body is unused
duplicated data in attribute key names

Option 3

Event body in attributes (event.data), metadata also in attributes.
Event body has a well-defined schema with terse fields.
{
  "body": null,
  "attributes": {
    "event.name": "otel.button.click",
    "event.data": {
      "point": {
        "x": 122,
        "y": 9
      },
      "button": "beepboop"
    },
    "client.display.dimensions.width": 2048,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "client.active-screen-name": "status",
    "client.display.dimensions.height": 1024
  }
}
Pros:

event body separate from other metadata
terse fields in event body

Cons:

logs body is unused

Option 4

Event body in logs body, metadata in attributes. Event body is also just a bag of flat namespaced attributes.
{
  "body": {
    "otel.button.click.x": 122,
    "otel.button.click.y": 9,
    "otel.button.name": "beepboop"
  },
  "attributes": {
    "event.name": "otel.button.click",
    "client.display.dimensions.width": 2048,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "client.active-screen-name": "status",
    "client.display.dimensions.height": 1024
  }
}
Pros:

nice and flat
event body separate from other metadata
logs body is used

Cons:

duplicated data in attribute key names

Option 5

Event body in logs body, metadata in attributes. Event body has a well-defined schema with terse fields.
This is effectively the model I thought we had arrived at 3 weeks ago.
{
  "body": {
    "point": {
      "x": 122,
      "y": 9
    },
    "button": "beepboop"
  },
  "attributes": {
    "event.name": "otel.button.click",
    "client.display.dimensions.width": 2048,
    "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
    "client.active-screen-name": "status",
    "client.display.dimensions.height": 1024
  }
}
Pros:

event body separate from other metadata
body is used
terse fields in event body

Cons:

The 3 concerns that Jack has listed here
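For comparison with the flattened options, here is a sketch of a consumer reading an Option 5 record. The `read_button_click` helper and the dictionary representation of a log record are assumptions for illustration, not an otel SDK API.

```python
from typing import Any, Dict, Optional, Tuple

record: Dict[str, Any] = {
    "body": {
        "point": {"x": 122, "y": 9},
        "button": "beepboop",
    },
    "attributes": {
        "event.name": "otel.button.click",
        "client.display.dimensions.width": 2048,
        "session.id": "44b83fc0c38f3eb2191fd7fbb6ca1ae00e124c9f",
        "client.active-screen-name": "status",
        "client.display.dimensions.height": 1024,
    },
}


def read_button_click(rec: Dict[str, Any]) -> Optional[Tuple[int, int, str]]:
    """Return (x, y, button) for an otel.button.click record, or None for anything else."""
    if rec["attributes"].get("event.name") != "otel.button.click":
        return None
    body = rec["body"]  # the body holds only event content, so no filtering is needed
    return body["point"]["x"], body["point"]["y"], body["button"]


click = read_button_click(record)
```

The handler looks at attributes only to identify the event type, and then reads the payload from the body without any knowledge of which metadata keys happen to be present.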

Option 6

?