Exercise Comments

Overview

In general, this pipeline is a set of immutable append-only logs, processors, and stores. As any level 1 system context diagram ought to be, it is technology agnostic and could be implemented with a number of technologies, but I admit to having Apache Kafka and Apache Spark Streaming in mind as I designed this. In order to understand the value of this approach, I recommend reading Martin Kleppmann's Turning the database inside-out. I strongly urge you to read that article before continuing to evaluate the diagram or reading these comments.

The purpose of the pipeline is to end up with always-up-to-date stores of data that can be queried performantly at scale. Source files stream through the pipeline and cause streaming updates to what can be thought of as "materialized views", whose implementation and technology can be chosen based on the query characteristics. For example, an Elasticsearch index can be updated with a new source report to allow full-text search. An OLAP cube can be updated to allow pivot, zoom, and slice-and-dice of data on demand. An OLTP datastore can be updated with pre-computed answers to known queries like 'Top 10 Products by Store', etc. Caching should be assumed throughout; it is not shown in order to keep the diagram clean and understandable.
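To make the "materialized view" idea concrete, here is a minimal sketch assuming Kafka via the kafka-python package; the topic, field names, and aggregation are hypothetical, not part of the diagram. A view builder tails a purchases Log and keeps the pre-computed 'Top 10 Products by Store' answer current as facts arrive:

```python
import json
from collections import Counter, defaultdict

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "purchases",                        # hypothetical model Log topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

sales = defaultdict(Counter)            # store_id -> product_id -> quantity

for record in consumer:
    purchase = record.value
    sales[purchase["store_id"]][purchase["product_id"]] += purchase["quantity"]
    # Re-materialize the pre-computed answer for the affected store only.
    top_10 = sales[purchase["store_id"]].most_common(10)
    # ...write top_10 to the OLTP store, keyed by store_id...
```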

This architecture was designed with a number of goals in mind:

  1. Requirements: Meet the requirements laid out in the exercise.
  2. Evolution: Design a system of interacting subsystems that have well-defined boundaries between them. The boundaries are intentionally kept vague so as not to impinge upon each subsystem's design. The characteristics of the exchanged data should drive what kind of boundary to use: Kafka pub/sub, REST, message queues, etc. It is the schemas that should be designed, versioned, and controlled at the architecture level, not the transport mechanisms. For the most part, however, the boundary is a Log.
  3. Reprocessing: Differentiate between an incoming source and a request to process (or reprocess) that source.
  4. Scale: Allow components to stream the ingestion of data through the pipeline into individual stores that are specialized to answer differing kinds of queries performantly at scale. Again, see the article referenced above for the underlying idea.
  5. Deployment: When processors start up, they are free to rebuild any needed internal state, or cache, from the appropriate Log (see the sketch after this list).
  6. Controlled Input: What we expose to the outside world is controlled, secured, logged, load-balanced, etc., and provides a layer of indirection between external components and the internal system.
  7. Operations: Logging, monitoring, and alerting must be first-class citizens in any system design, not bolted on later.
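As a minimal sketch of goal 5, again assuming Kafka via kafka-python (topic and field names hypothetical): a processor that boots, replays its Log from the beginning to rebuild local state, and then keeps consuming for streaming updates.

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "source-log",                       # hypothetical Log topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",       # no saved offsets -> start at offset 0
    enable_auto_commit=False,           # never save offsets: rebuild each boot
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

state = {}                              # internal state / cache being rebuilt

for record in consumer:                 # replays history, then tails the log
    event = record.value
    state[event["source_id"]] = event   # fold each event into local state
```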

Tradeoffs

Architecture is fundamentally the practice of making and communicating trade-off decisions. Inputs to the architect are goals, resources, and limitations. An architect's outputs are diagrams whose components and connections reflect trade-offs made to meet goals with the available resources, within the set limitations. Good software architects understand that not all goals get explicitly stated, that architecture and design are different activities, and that pragmatic design choices do sometimes affect the architecture. That being said, here are the down-side trade-offs that result from the architecture I have provided:

  1. In order to respond to queries performantly at scale, I have created N views that duplicate each other's data. I am increasing the amount of physical data storage required in exchange for performance at scale. This is a very common tradeoff, made by engineers at companies such as LinkedIn, Facebook, and Netflix.
  2. Because there are multiple views of our data, we require a gateway to keep client-side queries simple and handle cross-cutting concerns on behalf of the views: security, logging, metrics, etc.
  3. I have broken processing into a number of sub-pipelines rather than one large pipeline. This increases complexity, but in exchange we get natural boundaries that also serve as intermediate stores from which the status of the ingestion system can be deduced.
  4. Each subsystem uses a separate store for its own internal state instead of one large database. This increases complexity, but in exchange we get a number of benefits: a) each subsystem is decoupled from the others, allowing each to evolve independently; b) each datastore is relatively smaller and so easier to scale than a single larger datastore; and c) the storage technology can be chosen independently per subsystem to achieve the best performance for its data and query characteristics.
  5. Complete rebuilds of views when new builders/processors start up can take a long time. In exchange, we get small, fast, streaming updates while the builders/processors are running that can handle a large volume of incoming data. And we can mitigate the longer complete rebuild times by allowing new versions of view builders to build new versions of views and using the web gateway to deliberately "cut over" to the new versions when they are ready (sketched below).
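A minimal sketch of the cut-over idea (all names hypothetical): the web gateway resolves each logical view to its currently active physical version, so a new view builder can populate purchases_v2 in the background and the mapping is flipped once the rebuild catches up.

```python
# Logical view name -> currently active physical view.
ACTIVE_VIEWS = {"purchases": "purchases_v1"}

def resolve(view_name: str) -> str:
    """Return the physical view the gateway should query right now."""
    return ACTIVE_VIEWS[view_name]

def cut_over(view_name: str, new_version: str) -> None:
    """Deliberately flip a logical view to a freshly rebuilt version."""
    ACTIVE_VIEWS[view_name] = new_version   # e.g. "purchases_v2"
```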

Design in the exercise description

Certain aspects of the exercise description are really design concerns. For example, "Updates are transactional" and "Foreign Keys on the Purchases table should be enforced at insertion" presuppose a transactional data store and table-based storage technology. A level 1 system context diagram should be as technology agnostic as possible. My use of logs, processors, and view builders is a common architectural pattern that can be implemented by a number of technologies, which should be chosen during the design phase, at the Last Responsible Moment.

Moreover, an architect who attempts to make every technology choice will of necessity become an overwhelmed micromanager. Certainly, the architect should participate in and guide design decisions, but an architect must keep their eye on boundaries, continually oversee the evolution of the architecture, communicate between teams, continually revisit and review tradeoff decisions, and most of all ensure that the goals of the architecture are being met by its implementation.

For these reasons, I have assumed that these design details convey information about the nature of the system rather than impose a constraint that 'there must be a relational database'.

Insider Knowledge

Finally, as an existing employee with extensive knowledge of the system, I have attempted not to "cheat". The metaphor of this exercise is obvious to me, but I have not included aspects that only I know will exist, such as our integration team's manual intervention in the addition and removal of "stores" and/or "products". I expect that this will become an interview question: how would you allow for manual intervention in your architecture? So if you see a decision that makes you think "John should know better", consider that I may have treated that as "insider knowledge" and left it out on purpose.

Legend

  • Cloud Icon - External Source
  • Rounded Rectangle - Subsystem, comprising one or more sub-components
  • Rounded Cylinder - Immutable Append-Only Log
  • Squashed Rectangle - Data Store
  • Arrow - Flow of data (does not imply a mechanism: REST, eventing, etc.)

Components

Source Gateway

This subsystem is responsible for:

  • receiving source input from incoming source files (XLS or PDF), placing them in an object store such as HDFS, S3, or Ceph, and then publishing a record to the source log (see the sketch below).
  • managing the object store. For example, moving older source files to cold storage (e.g., Amazon Glacier), tape archive, or other cheap long-term storage.

Remember that this component is a subsystem and may be composed of multiple processes. My expectation is that there may be a cluster of processes for handling each kind of input (FTP, Email, HTTP, etc), and perhaps a cron-like process that periodically performs archiving tasks. Keep in mind that this vision is getting into design. A good version 1 may be a single process... I simply wanted to give some insight into how it is possible for this subsystem to scale.
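Here is a minimal sketch of the gateway's first responsibility, assuming S3 via boto3 and Kafka via kafka-python; the bucket, topic, and field names are hypothetical.

```python
import json
import uuid

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def receive_source(local_path: str, input_channel: str) -> str:
    """Store the raw source file, then announce it on the source log."""
    source_id = str(uuid.uuid4())
    key = f"incoming/{source_id}"
    s3.upload_file(local_path, "source-files", key)   # durable raw copy
    producer.send("source-log", {
        "source_id": source_id,
        "object_key": key,
        "input": input_channel,                        # FTP, email, HTTP, ...
    })
    return source_id
```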

Source Manager

This subsystem is responsible for:

  • deriving the processing state of each individual source file by subscribing to the various Logs
  • providing an interface that answers queries about source processing status
  • providing an interface that allows source files to be re-processed

The kind of store used by the source manager is driven by the need for performant queries. Load characteristics will be read-heavy, and partitioning will likely be required. A first assumption is monthly partitions, based on an upper limit of 50,000 source files per day: double that to allow for growth and multiply by 30 days for a month, leading to ~3M records per partition, which should keep queries performant if correctly indexed. A relational database is not necessarily assumed. Again, this is getting into detailed design, but I think it's important to show that this component is designed to be scaled as needed without all of the work needing to be done up front.
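As a minimal sketch of deriving status from the Logs, again assuming kafka-python (topic names, field names, and states are hypothetical): the manager folds every event it sees into a per-source status map, which then backs its query interface.

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "source-log", "processing-requests", "processing-status",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

status = {}  # source_id -> latest known processing state

for record in consumer:
    event = record.value
    if record.topic == "source-log":
        status[event["source_id"]] = "received"
    elif record.topic == "processing-requests":
        status[event["source_id"]] = "processing"
    else:  # "processing-status"
        status[event["source_id"]] = event["state"]   # e.g. "done", "failed"
```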

Source Preprocess

This subsystem is responsible for:

  • acting as the entry point for the data ingestion pipeline. Requests for re-processing start here too. As a pre-processor, this subsystem performs tasks such as parsing, cleaning, correcting, identifying, and enriching.

Let me note again that this component is a subsystem and can begin life as a single process and grow to become a set of multiple processes, each with multiple instances. In cases like these, it's a good idea to understand how these subsystems would be scaled out so that the simple first implementations do not make later scaling difficult. Once a source is ready to have its data processed, the subsystem publishes a source processing request (sketched below).
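A minimal sketch of that hand-off (names hypothetical): once a source file is parsed, cleaned, and enriched, the pre-processor publishes a processing request that carries its own request id, distinct from the source id; re-processing simply mints a new request id for the same source.

```python
import json
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def request_processing(source_id: str, object_key: str) -> str:
    """Publish a (re)processing request for an already pre-processed source."""
    request_id = str(uuid.uuid4())      # new id per request, even for reruns
    producer.send("processing-requests", {
        "request_id": request_id,
        "source_id": source_id,
        "object_key": object_key,       # cleaned file in the object store
    })
    return request_id
```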

Source Processors

This subsystem is responsible for:

  • taking the pre-processed source files and turning them into a set of facts that can be appended to the appropriate Log(s).
  • reporting the status of the processing
  • breaking the concept of a source file into our model(s): stores, products, and purchases.

It's important to note that a view builder may be interested in duplicated entries among source files. Therefore, the source processors do not attempt to deduplicate, and indeed this is one of the reasons for introducing the concept of a process request as separate from the source id: there are no true duplicates any longer. A row came from a certain request to process a certain source file from a certain input. This takes care of the requirement that "source files may contain data from a previous batch". True duplicates within one source file should be removed during pre-processing. A sketch of a fact shaped this way follows.
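In this minimal sketch (field names hypothetical), every fact carries the request id and source id, so rows that would otherwise look like duplicates remain distinguishable.

```python
def row_to_fact(request: dict, row: dict) -> dict:
    """Turn one parsed row into an appendable fact with full provenance."""
    return {
        "request_id": request["request_id"],   # which processing run
        "source_id": request["source_id"],     # which source file
        "row_number": row["row_number"],       # position assigned upstream
        "store_id": row["store_id"],
        "product_id": row["product_id"],
        "quantity": row["quantity"],
    }

# Each fact is then appended to the appropriate model Log, e.g.:
#   producer.send("purchases", row_to_fact(request, row))
```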

View Builders

As new entries are appended to the model Logs, a set of view builders can determine whether the new log entry requires any of the views they build to be updated. Each view builder can determine how to update its view based on the technologies involved. For example, a builder for an OLAP cube may decide to rebuild the cube nightly. The builder for a key-value store view of all purchases may update immediately. The builder of an OLTP database view may use batch inserts. Each view builder also determines how to handle duplicate log entries; one approach is sketched below.
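One possible deduplication policy, as a minimal sketch (assuming facts shaped as in the earlier sketches): derive a deterministic key from each fact so that a re-delivered log entry overwrites rather than duplicates.

```python
def fact_key(fact: dict) -> str:
    # request_id plus the row's position identifies exactly one appended
    # fact; "row_number" is assumed to be assigned during pre-processing.
    return f'{fact["request_id"]}:{fact["row_number"]}'

view = {}  # stand-in for the builder's store (ES doc id, primary key, ...)

def apply(fact: dict) -> None:
    view[fact_key(fact)] = fact   # idempotent upsert: duplicates collapse
```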

Web Gateway

Follows the gateway pattern: it simplifies client-side queries for its pages, collapses potentially many queries (each with its own network overhead) into one larger query, and consolidates security, caching, logging, request metrics, etc. A sketch of the fan-out follows.
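In this minimal sketch (URLs and view names hypothetical, assuming the requests package), one client call fans out to several specialized views and returns a single consolidated response.

```python
import requests

VIEW_URLS = {
    "search": "http://search-view/purchases",    # e.g. Elasticsearch-backed
    "top_10": "http://olap-view/top-products",   # e.g. cube-backed
}

def store_dashboard(store_id: str) -> dict:
    """Collapse several view queries into one response for the client."""
    # Cross-cutting concerns (authn, caching, metrics, logging) hook in here.
    return {
        name: requests.get(url, params={"store_id": store_id}, timeout=5).json()
        for name, url in VIEW_URLS.items()
    }
```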
