
@jondot
Created December 1, 2012 17:28
Tracking Your Business

You've built (or are maintaining) a product with many services across different machines at the backend, all orchestrating together to implement one or more business processes. How are you tracking this system?

In general: how can we provide visibility into linear-pipeline distributed systems, where a series of processing stages is arranged in succession to perform a specific business function over a data stream (i.e. a transaction), across several machines?

A simple, somewhat crude example of a cross-system transaction is an order preparation system in real life, say in an electronics factory. In such a workflow, an order entering the processing pipeline goes through each stage defined by the manufacturing floor manager: "planning, provisioning, packing, shipping".

Taking this a bit closer to the Web, we can easily see instances of such transactions, even if we are not always aware we've implemented them that way. A background job is a pipeline, or a transaction, of one process. A user ordering an item from your online store is another example, where multiple stages are involved (perhaps some of them even handled by third parties such as PayPal or Stripe).

So how can you track these at the infrastructure level? Namely, how would you:

  • Gain better visibility into an entire such process, which may start at machine A with service X and end a few machines and services later at machine B with service Z.
  • Measure the overall performance of such a process across all of the players in your architecture, at each step of the way.

Tracking Data

You may have bumped into this before. Referring back to the manufacturing example, an item gets a "ticket" slapped onto it when it first becomes an actual entity in the factory. This ticket is then used to record each person who handled the item, and when.

Looking back at a distributed system implementing a pipeline: if the data handed from process to process is such that you can tack on additional properties (that is, it is persisted after each step, and persisting it doesn't cost much), then you may be in luck. In such a scenario it is common to include tracking metadata within the object itself, stamping it with relevant trace information (such as a timestamp) at each process, for the lifetime of the object and the length of the pipeline.

At the end of the entire business process, a given object will show you where it's been and when. This idea is easy to implement and provides excellent forensic ability: you can investigate your pipeline's behavior at each step, just by looking at the object itself.
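As a sketch, in-band tracking can be as little as an array of stamps carried inside the object. The `:trace` field and the stage names here are illustrative, not part of any real system:

```ruby
require 'time'

# A minimal sketch of in-band tracking: each stage stamps the payload
# with its name and a timestamp before handing it on.
def stamp(payload, stage)
  (payload[:trace] ||= []) << { stage: stage, at: Time.now.utc.iso8601 }
  payload
end

order = { id: 42 }
%w[planning provisioning packing shipping].each { |stage| stamp(order, stage) }

order[:trace].map { |t| t[:stage] }
# => ["planning", "provisioning", "packing", "shipping"]
```

At the end of the pipeline, `order[:trace]` holds the full history of who touched the item and when.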

External Tracking

You may also be aware of systems in factories, or even physical shops, where operators feed an item ID, their signature, and a timestamp into a thin terminal to indicate they have processed the item at their station.

Keeping that in mind, the solution I want to discuss here involves an external service, to which the pipeline simply announces progress at each step as the item being processed moves along.

If you're originally coming from the enterprise world, you've probably already recognized this as something similar to what would be called a BAM (Business Activity Monitoring) system.

And even if you're not familiar with enterprisey solutions, you may have heard of this concept taken to a much lower, infrastructural level: Google's Dapper and, not very long ago, Twitter's Zipkin. These systems offer extremely detailed information about linear and tree-based transactions, and show you an immense breakdown of the processes within your code.

Current Solutions

Although I could have used an off-the-shelf enterprise BAM product, I really didn't want the world of pain you get when integrating an enterprise product with an agile, lightweight, startup-like infrastructure.

And using a system like Zipkin for higher-level, much less granular business processes didn't feel right either. Since I'd had such a system figured out in the back of my mind for a while now, I built it over the last weekend.

Using Roundtrip

So after this overview of the problem, the solution may already be clear to you.

Our distributed cloud may generate a ton of business workflows and transactions over many (or few) machines; the point is that a transaction or workflow starts at a certain machine, passes through one or more, and ends up at some other (or the same) machine.

We need a way to keep track of when a transaction starts and when it ends. A bonus would be the ability to track stages of the transaction that happen before it ends; let's call those checkpoints. That, basically, is what Roundtrip is.

Roundtrip will store the tracking data about your transactions: start, end, and any number of checkpoints, and will provide metrics as a bonus.

Roundtrip supports pluggable backends (currently Redis), metric aggregators (currently StatsD), and APIs (currently HTTP and command line). The plan is to support at least UDP and 0mq as APIs for extremely performance-sensitive systems, although the HTTP API is already pretty good.

Here's a short breeze through using Roundtrip:

In the near future there will be language-specific drivers, so that you just plug in the right driver (be it a Ruby gem, Python egg, or Java jar) and make calls to Roundtrip as if it were part of your code. For now, you'll have to make simple RESTful HTTP calls the way you usually do (Ruby HTTParty, node.js mikeal/request, etc.).
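For example, a thin client over Ruby's standard library might look like this. It's a sketch, not an official driver: the endpoint paths mirror the curl calls below, and the default host and port are assumptions.

```ruby
require 'net/http'
require 'json'

# A hedged sketch of a minimal Roundtrip client using Net::HTTP.
class RoundtripClient
  def initialize(host = 'localhost', port = 9292)
    @http = Net::HTTP.new(host, port)
  end

  # POST /:route/trips -> returns the new trip's id
  def start(route)
    JSON.parse(@http.post("/#{route}/trips", '').body)['id']
  end

  # PATCH /trips/:id with a checkpoint postback parameter
  def checkpoint(trip_id, name)
    @http.patch("/trips/#{trip_id}", "checkpoint=#{name}")
  end

  # DELETE /trips/:id -> returns the full trip data
  def finish(trip_id)
    JSON.parse(@http.delete("/trips/#{trip_id}").body)
  end
end
```

With something like this in place, a transaction reads as `id = client.start('invoicing')`, a few `client.checkpoint(id, 'generated.pdf')` calls, and a final `client.finish(id)`.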

I'm using curl here just for experimentation; again, in your code you should use the HTTP library of your choice to make the calls.

Transaction lifecycle

Create a new trip. At the point where you start your transaction within your code, issue the following HTTP call. You get back a trip ID, which you carry along the workflow / transaction so that you can end it and place checkpoints on it.

curl -XPOST http://localhost:9292/invoicing/trips
{"id":"cf1999e8bfbd37963b1f92c527a8748e","route":"invoicing","started_at":"2012-11-30T18:23:23.814014+02:00"}

Add as many checkpoints as you like. Make sure to provide checkpoint as a postback parameter. Yes, we're using PATCH, and yes, it's the correct, semantic RESTful use of it :).

curl -XPATCH -dcheckpoint=generated.pdf http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e
{"ok":true}
curl -XPATCH -dcheckpoint=emailed.customer http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e
{"ok":true}

End your transaction. You get back a bag of data representing the trip your transaction made. With this you can update whatever system you have (be it metrics, analytics, health, etc.).

curl -XDELETE http://localhost:9292/trips/cf1999e8bfbd37963b1f92c527a8748e
{"id":"cf1999e8bfbd37963b1f92c527a8748e","route":"invoicing","started_at":"2012-11-30T18:54:20.098477+02:00","checkpoints":[["generated.pdf","2012-11-30T19:08:26.138140+02:00"],
["emailed.customer","2012-11-30T19:12:41.332270+02:00"]]}