@aeilers · Last active December 11, 2019 07:19

Hive Pattern/Stack

The Hive Pattern is a theory for redefining enterprise-level architecture, based on Command Query Responsibility Segregation (CQRS), Event Sourcing (ES), unified transaction logs, and microservices patterns.

Goals of the Hive Pattern:

  • define a horizontally scalable, high-performing, fault-tolerant solution for enterprise-level architecture
  • remove boilerplate application logic so engineers can focus on solving enterprise challenges
  • improve operational efficiency through specialization and micro implementations
  • minimize the depth of n-tier architecture

The Hive Stack is an open source, enterprise application stack in the spirit of LAMP, MEAN, and other open source application stacks that proves the Hive Pattern. The intent is to provide a standardized set of services and storage solutions as a base set of tools to solve enterprise challenges.

Disclaimer: There is no relationship between the Hive Pattern/Stack and Apache Hive. However, as Apache Hive is a Data Warehouse solution, presumably it could be wired up as a Consumer service for your enterprise solution.

Background

A few years back I was fortunate enough to be exposed to some massive architectures and implementations at an enterprise scale. It sparked my curiosity about full stack universal applications at an enterprise level. One of the things that bothered me about that architecture was how much of a mess it was to traverse. It was like getting a plate of spaghetti and trying to make sense of all the crisscrossing noodles to see which meatballs they touched. Over time, that plate of spaghetti gets piled higher as more meatballs (services) are added to meet the needs of the business.

Another problem with that implementation was the lack of automated elasticity for the supporting teams. A great deal of work went into calculating exact capacity requirements for normal and peak operations throughout the year. Engineers were essentially making educated guesses at these requirements, then constantly monitoring, tweaking, and manually adjusting them to meet the actual business needs at the time.

Both of these situations had one thing in common: time. Choosing to race against time rather than make sound architectural decisions only adds to the mess. Instead of racing it, we can turn time to our advantage through Event Sourcing. Event Sourcing allows us to track events over time and constantly adjust to business needs as they change. It is our proverbial time machine: we can replay events that occurred in the past to make more intelligent decisions about the future.
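The idea can be sketched in a few lines of JavaScript: an aggregate's current state is simply a left fold of the events recorded for it over time. The event types and account fields below are hypothetical, chosen only to illustrate replay:

```javascript
// Minimal event-sourcing sketch: state is rebuilt by replaying events in order.
// Event names and payload shapes are illustrative, not part of the Hive Stack.
const events = [
  { type: 'AccountOpened', payload: { owner: 'alice', balance: 0 } },
  { type: 'FundsDeposited', payload: { amount: 100 } },
  { type: 'FundsWithdrawn', payload: { amount: 40 } }
]

// Replay applies each event to the accumulated state, our "time machine".
function replay (events, state = {}) {
  return events.reduce((current, event) => {
    switch (event.type) {
      case 'AccountOpened':
        return { ...event.payload }
      case 'FundsDeposited':
        return { ...current, balance: current.balance + event.payload.amount }
      case 'FundsWithdrawn':
        return { ...current, balance: current.balance - event.payload.amount }
      default:
        return current
    }
  }, state)
}

const account = replay(events)
// account → { owner: 'alice', balance: 60 }
```

Because the fold can stop at any index, the same function also answers "what did this account look like last Tuesday?" by replaying a prefix of the log.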

Event Sourcing only gets us part of the way there. To address the spaghetti architecture, we can employ unified transaction logs where multiple inputs/outputs can be wired to a common event log that can scale itself accordingly. This leads us towards a combination of specialized workers that work together to make a high performing, scalable solution.

Why the name?

As it was explained to me somewhat recently, a bee hive is composed of a large group of individuals working in highly specialized roles for the betterment of the hive. That naturally occurring phenomenon is essentially the linchpin of the Hive Pattern. Orchestrating a group of specialized workers, in terms of services and storage mechanisms, allows each of them to do what they do best without sacrificing functionality by fulfilling multiple responsibilities.

The use of the names Producer, Consumer, and Stream Processor may suggest an explicit tie to Apache Kafka. While the Hive Stack is built with Kafka, the Hive Pattern uses these terms to tie in specific concepts that Kafka's architecture overtly suggests. The pattern could just as easily be applied to other unified transaction log systems such as Logstash. With that in mind, a producer could just as easily be referred to as a publisher, and a consumer as a subscriber.

High-Level Description

At the most basic level, it is essentially a system where multiple producers feed data through a unified transaction log to multiple consumers. Stream processors represent an evolution of the Extract Transform Load (ETL) pattern to provide support for validation and complicated command routing to support the concept of a CQRS saga for more complicated domain models. Here are some diagrams to help illustrate the pattern:

High-Level Diagram

hive pattern/stack high-level diagram

Sequence Diagram

hive pattern/stack sequence diagram

The Hive Stack leverages Node.js LTS for Producers, Consumers, and Stream Processors because of its non-blocking, event-driven capabilities and its small footprint. This allows each of the microservices to scale while using fewer resources than its counterparts. Because JavaScript is a universal language, it also allows for more code reuse when implementing the Web and Native applications that interact with the Hive Stack.

The security of these endpoints should be addressed with SSL, authorization headers, and/or proxies that provide CSRF protection, both to ensure the source of incoming data is valid and to validate the data itself.

In-Depth Breakdown

Below is an in-depth description for each of the major components in the high-level diagram above.

Producers

Producers represent a simpler implementation where domain Value Objects can be passed through to the log directly with minimal validation. Since Value Objects have no unique identity, they are essentially immutable and can be treated as such. Therefore, this type of validation is superficial and can easily be handled by the Value Object's schema definition. Examples of this type of implementation would include streams of analytics data for user tracking or geolocation data for real-time position tracking.
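A minimal sketch of this superficial, schema-driven validation might look like the following. The schema shape and field names are illustrative assumptions, not the Hive Stack's actual API:

```javascript
// Superficial Value Object validation: each schema entry is a predicate the
// corresponding field must satisfy before the producer passes it to the log.
// The geolocation fields here are hypothetical examples.
const geoEventSchema = {
  latitude: value => typeof value === 'number' && value >= -90 && value <= 90,
  longitude: value => typeof value === 'number' && value >= -180 && value <= 180,
  timestamp: value => Number.isInteger(value) && value > 0
}

// A Value Object is valid when every schema predicate passes.
function isValid (schema, valueObject) {
  return Object.keys(schema).every(key => schema[key](valueObject[key]))
}

const good = isValid(geoEventSchema, { latitude: 45.5, longitude: -122.6, timestamp: 1514764800000 }) // true
const bad = isValid(geoEventSchema, { latitude: 200, longitude: 0, timestamp: 1 }) // false: latitude out of range
```

Since the Value Object carries no identity and is never mutated, this check is the only gate needed before producing it to the transaction log.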

Consumers

Consumers handle the Query responsibilities in the CQRS pattern. They are responsible for translating single or multiple event streams into denormalized formats that can be queried by user applications. Since all of the data has been validated before it is logged, Consumers are freed from that requirement and can focus on translating and serving data.
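As a sketch, a Consumer's translation step is a fold of the event stream into a query-friendly read model. The event type, payload fields, and read-model shape here are hypothetical:

```javascript
// A consumer denormalizes a stream of events into a read model keyed by user,
// so queries like "which pages has u1 seen?" need no joins or replays.
function denormalize (readModel, event) {
  const { userId, page, at } = event.payload
  const views = readModel[userId] || { userId, pages: [], lastSeen: 0 }
  return {
    ...readModel,
    [userId]: {
      ...views,
      pages: views.pages.includes(page) ? views.pages : [...views.pages, page],
      lastSeen: Math.max(views.lastSeen, at)
    }
  }
}

const stream = [
  { type: 'PageViewed', payload: { userId: 'u1', page: '/home', at: 1 } },
  { type: 'PageViewed', payload: { userId: 'u1', page: '/docs', at: 2 } },
  { type: 'PageViewed', payload: { userId: 'u2', page: '/home', at: 3 } }
]
const readModel = stream.reduce(denormalize, {})
// readModel.u1.pages → ['/home', '/docs']
```

In the Hive Stack the resulting read model would land in MongoDB rather than an in-memory object, but the translation logic is the same.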

The Hive Stack leverages MongoDB as a storage solution for these microservices because of its rich data modeling and querying capabilities. Apache Cassandra also seems like a viable solution for this requirement because of its ability to handle high-volume reads/writes and a rich query language.

Stream Processors

Stream Processors are multi-faceted in their responsibilities. By default, they handle the Command responsibilities in the CQRS pattern. Therefore, they are integrated with the domain layer: they take commands and fetch existing aggregate data to pass to the domain layer for business-specific logic and validation. Once validated, the Stream Processor passes the returned event to the log and stores the updated snapshot of the aggregate in the caching layer. Depending on the needs of the domain model, the Stream Processor also allows for transactional consistency if required. Essentially, this makes it a Stream Producer, as it performs more work than the Producer above, but for similar tasks.

To follow a more semantic API design, POSTs should be used for Create aggregate commands, PUTs for Update-like aggregate commands, and DELETEs for Delete-like aggregate commands. This is possible, but not necessarily required, since the HTTP specification allows message bodies for these verbs. One could argue that since commands/events are immutable, these should all be POSTs. But remember that the commands/events are there to dictate changes to the aggregates, which are certainly not immutable.
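The default command-handling flow can be sketched with in-memory stand-ins for the caching layer and the log. The domain model, command, and event names below are illustrative assumptions, not the Hive Stack's framework:

```javascript
// In-memory stand-ins: Redis plays the snapshot cache and Kafka the log
// in the actual Hive Stack.
const cache = new Map() // aggregate snapshots
const log = [] // unified transaction log

function handleCommand (command) {
  // 1. Fetch the existing aggregate snapshot (or a fresh one).
  const snapshot = cache.get(command.aggregateId) || { id: command.aggregateId, total: 0 }

  // 2. Business-specific validation belongs to the domain layer.
  if (command.type === 'AddItem' && command.payload.price < 0) {
    throw new Error('price must be non-negative')
  }

  // 3. Emit the validated event and update the snapshot.
  const event = { type: 'ItemAdded', aggregateId: command.aggregateId, payload: command.payload }
  const updated = { ...snapshot, total: snapshot.total + command.payload.price }

  log.push(event) // append to the transaction log
  cache.set(command.aggregateId, updated) // store the new snapshot
  return updated
}

handleCommand({ type: 'AddItem', aggregateId: 'cart-1', payload: { price: 25 } })
handleCommand({ type: 'AddItem', aggregateId: 'cart-1', payload: { price: 15 } })
// cache.get('cart-1').total → 40, log.length → 2
```

Note the ordering: the event reaches the log only after domain validation, which is what frees the Consumers downstream from revalidating it.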

The second role of the Stream Processor is to rebuild the caching layer from the transaction log. This is valuable when standing up new environments for reasons like A/B testing, debugging, and deploying geolocated instances of the application stack. Essentially, this makes it a Stream Consumer, as it performs the specific task of rebuilding the cache as opposed to the translations and queries of the Consumer above. Typically, these are short-lived implementations and not used nearly as often as the default Stream Processor described above.

The third role of the Stream Processor is the most complex and likely the least used. For more complex domain models, a saga (or process manager) is sometimes required. A saga's job is to manage the complexities of inter-aggregate communication should the need arise. Since a Stream Processor is able to both read events from the logs and write to the logs (defined separately above), it is able to issue commands to the domain layer based on the events flowing from one aggregate to another.

The Hive Stack leverages Redis for a caching layer due to its high availability, distribution, and performance capabilities. Also, it employs the Redlock algorithm to provide transactional consistency and manage concurrency. Riak also seems like a viable solution for this requirement as it is a similar product that also provides strong consistency concepts.

Transaction Log

The unified transaction log is the centralized storage solution that forms the foundation of the pattern. Think of it as the nucleus of an atom; the electrons revolving around this nucleus are the different microservice types described above. The transaction log's job is to handle multiple inputs/outputs to each of these microservice types while providing a persistent storage layer. Events are stored here once they have been validated by their producers and are read from here by their consumers. This layer must be able to scale using distributed storage techniques to meet the volume demands of the microservices it supports.

Multiple log streams can and should be defined in this layer such that it can act as a messaging bus for more complex domain models. CQRS sagas are able to read event output to translate into other commands to support these requirements. Event playback to bring an aggregate root to its current state can also be achieved by listening to the output of these logs.

The Hive Stack leverages Apache Kafka to achieve these requirements. It is built to persist and scale to the needs of the microservices it supports. As mentioned previously, Logstash seems to be another viable solution for this requirement.

Break with Tradition

As you may have noticed by this point, there are differences from traditional implementations of the CQRS/ES pattern, each for specific reasons. If we didn't break with tradition from time to time, we would still be writing assembly language on punch cards. Each of these differences is discussed below.

Unified Transaction Logs

Traditionally, Event Sourcing and Messaging are handled in separate layers of the application. The event log is really only used on the Command side of the CQRS pattern, and a multitude of message buses is then implemented to handle communication between these Command services and their Query counterparts. This quickly becomes a complicated mess. To address the spaghetti this eventually creates, the Hive Pattern implements a unified transaction log instead. This specialization allows it to scale horizontally to meet the needs of the microservices without adding the complexity of separate event stores and a possible multitude of message buses.

Event Sourcing Cache

The Hive Pattern skips the traditional approach of replaying an aggregate's full sequence of events in favor of storing a snapshot of the aggregate in a caching layer. This way, the transaction logs remain unburdened by the additional queries needed to support that requirement. It has the added benefit of storage and performance optimizations that help when scaling up your application, even when the need is immediate. This is no different than adding a caching layer on top of a traditional CRUD database, and it is implemented in a similar fashion.
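A minimal sketch of the snapshot optimization, assuming events carry a version number: start from the cached snapshot and apply only the events recorded after it. The event types and fields are hypothetical:

```javascript
// Snapshot sketch: the cache holds the aggregate state as of version 2, so
// only events with a higher version need to be replayed on top of it.
const snapshot = { version: 2, state: { balance: 60 } }
const newerEvents = [
  { version: 3, type: 'FundsDeposited', payload: { amount: 10 } },
  { version: 4, type: 'FundsWithdrawn', payload: { amount: 5 } }
]

function applyFromSnapshot (snapshot, events) {
  return events
    .filter(event => event.version > snapshot.version) // skip already-applied events
    .reduce((state, event) => {
      const sign = event.type === 'FundsDeposited' ? 1 : -1
      return { ...state, balance: state.balance + sign * event.payload.amount }
    }, snapshot.state)
}

const current = applyFromSnapshot(snapshot, newerEvents)
// current.balance → 65
```

The full log is still the source of truth; the snapshot is a disposable optimization that can always be rebuilt by the Stream Consumer role described above.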

Aggregates and Denormalization

Aggregates in the CQRS definition are tasked with translating the resulting event of a command into the data changes it specifies for further validation. They are also responsible for event playback to bring an aggregate up to its current state. Because they are able to perform event translations, they are also best suited to handle denormalization on the consumer side of the transaction log. To be clear, separate denormalizers should be written that extend the Aggregate class while reusing existing Schema definitions. The denormalizers leverage the same techniques as their aggregate root counterparts.
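As a sketch, a denormalizer can extend an aggregate's event-application logic to add query-oriented fields. The `Aggregate` base class below is a hypothetical stand-in, not the Hive Stack's actual framework class:

```javascript
// Hypothetical base class: translates events into data changes and replays
// them to reach current state, as aggregates do on the command side.
class Aggregate {
  constructor (state = {}) { this.state = state }
  applyEvent (event) {
    this.state = { ...this.state, ...event.payload }
    return this
  }
  replay (events) {
    events.forEach(event => this.applyEvent(event))
    return this
  }
}

// A denormalizer reuses the same translation/replay machinery but derives
// extra fields the read side wants to query directly.
class UserDenormalizer extends Aggregate {
  applyEvent (event) {
    super.applyEvent(event)
    this.state.displayName = `${this.state.firstName || ''} ${this.state.lastName || ''}`.trim()
    return this
  }
}

const view = new UserDenormalizer().replay([
  { type: 'UserCreated', payload: { firstName: 'Ada' } },
  { type: 'UserRenamed', payload: { lastName: 'Lovelace' } }
])
// view.state.displayName → 'Ada Lovelace'
```

Because the denormalizer inherits the aggregate's playback behavior, the read model stays consistent with however the command side interprets the same events.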

Summary

In conclusion, the Hive Pattern is a theory for redefining enterprise-level architecture based on Command Query Responsibility Segregation (CQRS), Event Sourcing (ES), unified transaction logs, and microservices patterns.

The Hive Stack is an open source, enterprise application stack in the spirit of LAMP, MEAN, and other open source application stacks that proves the Hive Pattern. The intent is to provide a standardized set of services and storage solutions as a base set of tools to solve enterprise challenges.

Together, their intent is to challenge the status quo and possibly improve on the current state of enterprise architecture and engineering.

Hive Stack Components

Below are links to each of the basic building blocks of the Hive Stack implementation. Each links to a starting point: a Docker image that can be inherited so you can add your own domain application layer. If a more advanced solution is required, each of the Docker images links to a starter kit that provides a jumping-off point to fully meet your needs.

The only thing that is not provided is a proxy implementation. All you need to do to see the Hive Stack fully functional is to implement a proxy for layer 4 or 7 load balancing to each of the microservices below. Each of the Docker images and their matching starter kits have configuration information for running them with docker-compose, which is recommended for local development.

Docker images:

Starter kits:

Frameworks:

Try It Out

If you're interested in how it works, try it out and see for yourself. If you find a bug, let me know! I'll be continuing my work with the stack and it will be changing over time, so check back often!

Test:
