Last active September 23, 2016 18:22
Metrics System design

UI Design

The ability to choose from a multitude of related entities by domain space. For example:

  • In the last-mile space, routes consist of drivers and specific packages.
  • In middle mile and long haul, specific packages are not nearly as important as capacity utilization.
  • In sortation, what counts is knowing when you had a bad sort or an unsortable, as well as who did it. In this model, a route is tangential.
  • To support this, the service must provide a catalog of data schemas representing each of the domain spaces, e.g.:

  • GET /schemas/ would return all schemata and can be presented as a pull-down menu.
  • GET /schemas/sortation would return a JSON-schema defining the data payload presented for sortation data. This data can be presented as a tree view of selectable fields. Related fields could also be selected and joined appropriately; for example, a reference to the last-mile schema could be included and joined.
  • GET /schemas/last-mile would return a JSON-schema defining the data payload presented for last-mile-pertinent data.

The examples go on. This data can be selected and aggregated using a complex event processor (CEP), which allows for joins and aggregations by windows (time or count).
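A minimal sketch of the schema catalog described above, in Python. The endpoint paths (/schemas, /schemas/sortation) come from this design; the schema contents and field names are hypothetical placeholders.

```python
# In-memory stand-in for the schema catalog. Field names are illustrative,
# not a real sortation schema.
SCHEMAS = {
    "sortation": {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "type": "object",
        "properties": {
            "sorterId": {"type": "string"},
            "outcome": {"type": "string",
                        "enum": ["sorted", "bad-sort", "unsortable"]},
            "lastMile": {"$ref": "/schemas/last-mile"},  # cross-domain join point
        },
    },
}

def list_schemas():
    """GET /schemas/ -> schema names for a pull-down menu."""
    return sorted(SCHEMAS)

def field_paths(name):
    """Flatten a schema's properties into dotted paths for a tree view."""
    def walk(props, prefix=""):
        for key, spec in props.items():
            path = f"{prefix}{key}"
            yield path
            yield from walk(spec.get("properties", {}), path + ".")
    return list(walk(SCHEMAS[name]["properties"]))
```

A UI would render `list_schemas()` as the menu and `field_paths(...)` as the selectable tree, with `$ref` entries marking where a join to another schema is possible.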

Service design

  • Schemas can be designed in code and discovered, or implemented and supplied as a classic resource.
    • A code-based design means that the behavior of the system can never differ from its schema.
    • A classic resource is much more accessible to a business developer designing queries, but could drift out of sync with functionality. Read: it doesn't require a developer.

Creating queries

  • Once fields from various related schemas are selected, space in a query-specific data store is created.
  • For example, if the query does several aggregations, then a column family in Cassandra could be created specifically to house data related to this query.
  • If non-aggregate or flat data is required, a classic RDBMS table can be created with only the relevant fields.
  • A "query" is formed from the selections, persisted, and supplied to the processing engine, which is configured to output data for this query to the query-specific data store above. The data flow is both transformative and selective:

select ..., c.baz as BIGC from Alpha a, Beta b, Charlie c

  • As data passes into the system, it is matched against existing queries, which emit data to the pre-defined data store.
  • Queries have expiration dates and may be renewed so as to limit resource consumption for forgotten-about queries.

Collectively, these could be thought of as something akin to a bento box: tidy, complete, and just for you. Perhaps a "q-box".
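A hypothetical "q-box" definition, bundling what the bullets above describe: the selected fields, the persisted query, the target store, and an expiration date. All field and store names here are illustrative.

```python
# Sketch of a q-box record. Names and defaults are assumptions, not a spec.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class QBox:
    name: str
    fields: list        # selected schema fields, e.g. ["sortation.outcome"]
    query: str          # the persisted CEP/SQL query text
    store: str          # query-specific data store, e.g. a Cassandra column family
    expires: datetime   # forgotten q-boxes are reclaimed after this

    def renew(self, days=30):
        """Push the expiration out; called when the owner still cares."""
        self.expires = datetime.utcnow() + timedelta(days=days)

    def expired(self, now=None):
        return (now or datetime.utcnow()) >= self.expires
```

A reaper job could periodically drop the data store and unregister the query for any q-box where `expired()` is true, limiting resource consumption as described above.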

Service interaction

  • Any instance of a service may take any "q-box" definition, persist the query, create the data store and begin listening on all available data channels.
  • Other instances can be made aware of the new "q-box" either through a distributed cache such as Hazelcast or a queue-topic exchange arrangement.
  • As q-box definitions are, by their nature, straightforward, any distributed data store can work for persistence.
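The propagation step above can be sketched with an in-memory stand-in for Hazelcast or a queue/topic broker; each subscribed instance persists the definition and would then begin listening on its data channels. The instance and q-box names are hypothetical.

```python
# Toy topic exchange: publish a q-box definition to every subscribed instance.
class Topic:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, qbox_definition):
        for handler in self.subscribers:
            handler(qbox_definition)

# Two hypothetical service instances, each with its own local persistence.
instance_a, instance_b = {}, {}
topic = Topic()
topic.subscribe(lambda d: instance_a.update({d["name"]: d}))
topic.subscribe(lambda d: instance_b.update({d["name"]: d}))
topic.publish({"name": "late-sorts", "query": "...", "store": "late_sorts_cf"})
```

With a real broker the handlers would also create the data store and attach to the data channels, per the first bullet above.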

How to actually get data: Live data

  • All new services must register their schema at

POST /schemas/:subtype

  • For safety's sake, the system should discard messages for which it has no conforming schema registered.

  • Then, creates and updates of entities should be published into a stream of data (webhooks and/or queues) so the CEP engine can match them against queries. The use of queues makes this naturally scalable; however, depending on the solution chosen, this may be unnecessary.

Historical data

A dedicated output for all streams must be to archive data. Conjuring up data then becomes a matter of select * from EntityArchive where timestamp > '2016-09-22T18:21:21Z' and streaming it through the CEP processor. The archive source should be cheap and unstructured because queries will not be performed against it directly; a simple key-value store would be enough. However, some bet-hedging might be desired, and since all data will almost certainly be JSON, Elasticsearch can serve nicely.
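The replay step can be sketched as: pull archived entities newer than a timestamp and stream them through the same matcher the live path uses. The archive is an in-memory list here; the design above calls for a cheap key-value store, and the records are made up.

```python
# Replay archived records newer than `since` through a query matcher.
def replay(archive, since, matcher):
    for record in archive:
        if record["timestamp"] > since:
            yield matcher(record)

archive = [
    {"timestamp": "2016-09-21T00:00:00Z", "outcome": "sorted"},
    {"timestamp": "2016-09-22T19:00:00Z", "outcome": "bad-sort"},
]
# ISO-8601 UTC strings at fixed precision compare correctly as plain strings.
bad_sorts = list(replay(archive, "2016-09-22T18:21:21Z",
                        lambda r: r["outcome"]))
```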

On-demand data

Though this is supportable, ad-hoc querying of terabytes of data on the spot (with no prior indexing or performance tuning) should be discouraged. It would require keeping processing capacity on hand, or archiving to Redshift with its subsequent costs. A thorough analysis of solutions should be undertaken. If this is desired, then the archive store should be a graph database, as this offers the best performance for relational searching across sparse data.

Systems affected

  • Elbrox - joining alarms with any entity alarmed upon
  • Planner - Because it's so important
  • worldview - update to package/vehicle status. When drivers are added to vehicles
  • AntFarm - Not much: It's meant to be more of a query store than a business decider
  • Earp - Definitely: What is getting delivered to where and to whom.
  • Planseeker/worldseeker/jobseeker data stores may need to be re-thought
  • Matrix - will need to offer extensive schema discovery and query-building tools via drag-and-drop. Extensive work here
  • SomeNewService - To accept/offer schema for CEP processing and configuration
  • Lockbox - no effect
  • cromag - Questionable.

This is a system-wide necessity and will need work on a per-system basis.

For references:

  • Microsoft StreamInsight, and a video of StreamInsight
  • Amazon Kinesis + Data Pipelines
  • Esper, for just the CEP
