## Instrumentation and Observability in Distributed Systems - Part 1

#### Abstract

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another to achieve a common goal.

Quite often it is complex to work with distributed systems, and it is even more complicated to measure and observe behaviour across all of their components. Measuring metrics such as memory, CPU, I/O and network plays a significant role in observability, which in turn is the basis for scaling distributed systems. One also has to understand the correlation among the services, the components, the hosts and the underlying hardware.

In distributed systems, services are hosted across multiple nodes; the reliability of a service depends on the performance of these hosts, on the services themselves, and on the coordination between the services. We present here a naive approach: a systematic way of designing observability and instrumentation in distributed systems.

### Theory

Here is a structural layout of how I see the metrics that are responsible for the functioning of a service and an application.

```mermaid
flowchart TD
  Memory & Disk & IO --> Host;
  Network --> Host;
  Services --> Host;
  Services <--> Application;
  SoftwareClients --> Application;
  Business --> Application;
```

For example, in the context of the Hadoop ecosystem, ZooKeeper is used as a coordinator or observer to determine the master in a master-slave architecture. ZooKeeper needs to keep per-transaction latency within roughly 5 to 30 ms. The components that use ZooKeeper need to advertise their status, and the workers/clients use the status in ZooKeeper to determine the primary host.

To identify this kind of scenario, one needs to be aware of disk latency, RPC and Thrift calls, etc., for services that are journal driven.
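As a rough illustration, here is a minimal sketch of probing journal-disk latency by timing a small write plus fsync, which is roughly what a journal append costs. The directory path and the 30 ms budget are illustrative assumptions, not values prescribed by ZooKeeper.

```python
import os
import time

# Illustrative assumptions: JOURNAL_DIR and the 30 ms budget are examples,
# not values prescribed by ZooKeeper itself.
JOURNAL_DIR = "/var/lib/zookeeper"   # hypothetical journal/dataDir path
LATENCY_BUDGET_MS = 30.0

def fsync_latency_ms(directory: str, payload: bytes = b"x" * 4096) -> float:
    """Time a small write followed by fsync, roughly what a journal append costs."""
    probe = os.path.join(directory, ".latency_probe")
    start = time.perf_counter()
    with open(probe, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    os.remove(probe)
    return elapsed_ms

if __name__ == "__main__":
    latency = fsync_latency_ms(JOURNAL_DIR)
    status = "OK" if latency <= LATENCY_BUDGET_MS else "DEGRADED"
    print(f"journal fsync latency: {latency:.2f} ms ({status})")
```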


Is there an off-the-shelf tool that can detect degradation and prevent a service from degrading?

The answer is no; you have to design your own. Make your code observable and feed the metrics back into the component that operates on the services, with the goals of autoscaling, preventing unplanned failover, and offering high throughput and low latency even at high volumes.


We can approach the problem of observability using different methods. I once again refer to the USE method, popularized by Brendan Gregg.

It can be summarized as

    FOR EVERY RESOURCE, CHECK UTILIZATION, SATURATION, AND ERRORS.
  Utilization: the average time that the resource was busy servicing work.
  Saturation: the degree to which the resource has extra work which
  it can't service, often queued.
  Errors: the count of error events per second.

We take this as a basis to determine the bottlenecks for any service, system or host. I will cover this topic in more detail in later sections.
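As a rough illustration of the USE method at the host level, here is a minimal sketch using the Python psutil package. The saturation proxies (load average per core, swap usage) and the choice of NIC error counters are assumptions made for the example.

```python
import psutil

def use_snapshot():
    """Collect a rough Utilization / Saturation / Errors snapshot for CPU, memory and network."""
    cpu_util = psutil.cpu_percent(interval=1)   # utilization: % of time the CPU was busy
    load1, _, _ = psutil.getloadavg()           # saturation proxy: runnable tasks vs. cores
    cpu_sat = load1 / psutil.cpu_count()

    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    mem_util = mem.percent                      # utilization: % of physical memory in use
    mem_sat = swap.percent                      # saturation proxy: pressure spilling into swap

    net = psutil.net_io_counters()
    net_errors = net.errin + net.errout         # errors: cumulative NIC error events

    return {
        "cpu": {"utilization_pct": cpu_util, "saturation": cpu_sat},
        "memory": {"utilization_pct": mem_util, "saturation_pct": mem_sat},
        "network": {"errors_total": net_errors},
    }

if __name__ == "__main__":
    for resource, values in use_snapshot().items():
        print(resource, values)
```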

When there is no automated resolution, the operator needs to intervene, debug the component, find the root cause of the issue and handle the support events. They will resort to an operations handbook written by the component's creator. If the software doesn't provide an operations handbook, the support person has to go through the logs of all the components involved in the failure event, remediate each one of them, and act by building a decision tree. The impact of the issue has to be determined from the component/service impact matrix. This process is called impact analysis; it determines the service-level effects and points to where the issue originated. This is a tedious process that deserves the utmost care, and post-remediation there will be a root cause analysis.

Traditional Analysis:

```mermaid
flowchart TD
  User --> | ReportsSlowness | S[Support];
  S --> A{ChecksAlerts};
  A --> |Yes| B[AlertFound]
  A --> |No| C[AlertNotFound]
  B --> E{CanAutoRemediate};
  E --> |Yes| F[Remediate];
  E --> |No| I;
  C --> |Impact Analysis| I[ImpactAnalysis];
  I --> F;
```

Root cause analysis (RCA) needs to be automated, or at least drafted, for repeated issues; otherwise the toil piles up, as the same set of employees may not be the ones who address the case the next time.

So there is a great need for a framework here: a way to codify an RCA, converting a manual finding into an automated analysis. (A small sketch of such a codified rule follows the flowchart below.)

```mermaid
flowchart LR
  Collect --> Logs & Metrics;
  Logs & Metrics <--> Monitor;
  Monitor --> CreateSignals;
  CreateSignals --> Remediate;
```
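As a rough illustration of the CreateSignals and Remediate steps above, here is a minimal sketch of how a manual runbook finding could be codified into signal-to-remediation rules. The signal names, thresholds and remediation actions are hypothetical placeholders, not a prescribed framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signal names and remediation actions, shown only to illustrate
# how a manual runbook step can be turned into code.

@dataclass
class Signal:
    name: str
    value: float
    threshold: float

    def breached(self) -> bool:
        return self.value > self.threshold

def restart_service() -> str:
    return "restarted service"   # placeholder for a real remediation step

def rotate_logs() -> str:
    return "rotated logs"        # placeholder for a real remediation step

REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "heap_used_pct": restart_service,
    "disk_used_pct": rotate_logs,
}

def create_and_remediate(signals: List[Signal]) -> None:
    """Monitor -> CreateSignals -> Remediate, as in the flowchart above."""
    for signal in signals:
        if signal.breached() and signal.name in REMEDIATIONS:
            action = REMEDIATIONS[signal.name]()
            print(f"signal {signal.name}={signal.value} breached {signal.threshold}: {action}")

if __name__ == "__main__":
    create_and_remediate([
        Signal("heap_used_pct", 92.0, 85.0),
        Signal("disk_used_pct", 40.0, 80.0),
    ])
```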

#### Building Blocks

The fundamental change that every host needs here is to have the necessary metric exporters on the different host processes and on the host itself. But one needs to know what has to be exported.

There are a bunch of metric exporters to choose from, such as:

  1. Node exporters.
  2. Metricbeat, which covers the system beat events.
  3. Packetbeat, to understand what's going through the network.
     I would not recommend any packet sniffing, but rather measuring the throughput.
  4. Your choice of any software that can export a configurable, predefined set of
     metrics (a minimal custom exporter sketch follows this list).
  5. Alternatively, I suggest having a look at Brendan Gregg's [BPF
     Performance Tools](http://www.brendangregg.com/bpf-performance-tools-book.html) if you are Linux savvy.
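
As a concrete example of option 4, here is a minimal sketch of a custom exporter using the Python prometheus_client and psutil packages. The metric names, the 15-second interval and port 8000 are illustrative choices, not a standard.

```python
import time

import psutil
from prometheus_client import Gauge, start_http_server

# Metric names and port 8000 are illustrative choices, not a standard.
MEM_USED_PCT = Gauge("host_memory_used_percent", "Physical memory in use (%)")
CPU_USED_PCT = Gauge("host_cpu_used_percent", "CPU busy time (%)")
DISK_USED_PCT = Gauge("host_root_disk_used_percent", "Root filesystem usage (%)")

def collect():
    MEM_USED_PCT.set(psutil.virtual_memory().percent)
    CPU_USED_PCT.set(psutil.cpu_percent(interval=1))
    DISK_USED_PCT.set(psutil.disk_usage("/").percent)

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a Prometheus-style scraper
    while True:
        collect()
        time.sleep(15)
```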

```mermaid
graph TD
  HostMetrics --> Host;
  ServiceMetrics --> Service;
  Logs --> Service;
  Services --> Hosts;
```

(The list below is just an example, taking Java processes into consideration; a process-level sketch follows the list.)

  1. Heap memory.
  2. Native threads and their equivalent usage of CPU cores.
  3. TCP connections, in and out, and their effect on CPU cores.
  4. Native threads and their equivalent usage of open files.
  5. Maximum open processes by an application.
  6. Stack size tuning: if GC is not properly tuned and large buffers of data
     are written to the stack, you will see out-of-memory errors because the
     stack cannot be extended with any more memory. This depends on the nature
     of the application and whether its memory management is poorly written.
  7. Control checks for disk utilization and log rolling.
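
As a rough, process-level approximation of the items above, here is a sketch using the Python psutil package. It covers the OS-side view (threads, open files, TCP connections, resident memory); JVM-internal metrics such as heap usage and GC behaviour would come from JMX or jstat and are not shown here. The lookup of the Java process by name is an assumption.

```python
import psutil

def java_process_snapshot(name_hint: str = "java"):
    """Process-level view of a JVM: memory, threads, open files, TCP connections."""
    for proc in psutil.process_iter(["pid", "name"]):
        if name_hint in (proc.info["name"] or ""):
            # TCP/UDP sockets owned by this PID (may need elevated privileges on some OSes).
            inet = [c for c in psutil.net_connections(kind="inet") if c.pid == proc.pid]
            with proc.oneshot():
                return {
                    "pid": proc.pid,
                    "rss_bytes": proc.memory_info().rss,   # resident memory (not JVM heap)
                    "num_threads": proc.num_threads(),     # native threads backing JVM threads
                    "open_files": len(proc.open_files()),  # counts against the nofile ulimit
                    "tcp_connections": len(inet),          # sockets in and out
                    "cpu_percent": proc.cpu_percent(interval=1),
                }
    return None

if __name__ == "__main__":
    print(java_process_snapshot())
```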

The service needs to be configured with proper soft and hard ulimits. As system operators we all know this, but what we also need is continuous monitoring of these ulimits on resource-intensive systems. You may have seen issues where a service goes into a hung state because of some underlying background process on the container or virtual machine. Unless the service exposes observability on its processes and the systems produce machine-level metrics that can be correlated with it, the service will not be able to identify this.
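As a rough illustration of continuously monitoring ulimits, here is a minimal, Linux-only sketch using the Python psutil package to compare a process's open file descriptors against its nofile soft and hard limits. The 80% warning threshold and the 30-second interval are assumptions.

```python
import sys
import time

import psutil

WARN_FRACTION = 0.8   # assumption: warn when 80% of the soft limit is consumed

def watch_nofile(pid: int, interval_s: int = 30):
    """Continuously compare a process's open file descriptors to its nofile ulimit (Linux only)."""
    proc = psutil.Process(pid)
    while True:
        soft, hard = proc.rlimit(psutil.RLIMIT_NOFILE)   # soft and hard `ulimit -n`
        used = proc.num_fds()
        if used >= soft * WARN_FRACTION:
            print(f"WARN pid={pid}: {used} fds of soft limit {soft} (hard {hard})")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_nofile(int(sys.argv[1]))
```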

```mermaid
graph TD
  MemoryUsed --> Memory;
  HeapMemory --> Memory;
  VirtualMemory --> Memory;
```

  How do we understand all of these metrics across hosts,
  services and networks?
  Who would want to use these metrics?
  How do we derive value out of them?

Operators would need this kind of metrics data to see whether a threshold has been breached, so that they can generate alerts and remediate accordingly for a host running these services. Performance engineers would analyse the same metrics data and derive common patterns.

### Feedback-based Systems or Services

A service can be impacted by an abnormality in any of the metrics from one of the layers described above. Even in these days of modern infrastructure, services fail over to another machine because proactive monitoring is not enabled; with proactive monitoring and the rightful remediation of the service, we could achieve zero downtime and eliminate the need for failover. An abnormality in any hardware device has a similar impact.

A service should have constant feedback from the metrics and should be able to self-adjust to the outside context of its clients. A service should also have a control system that can self-calibrate against external factors. One needs to code that intelligence into the service through meters and signals fed back into the system; a sketch of such a loop follows the diagram below.

```mermaid
sequenceDiagram
  participant client
  participant ServiceController
  participant container
  participant Application
  participant controlsystem
  client->>ServiceController: 1. Requests access to service endpoint
  ServiceController->>container: 2. Gives access to resources
  container->>controlsystem: 3. Informs request metrics
  loop recalibrate
    controlsystem->>controlsystem: Intelligent recalibration
  end
  Note right of controlsystem: Analyzing all requests,<br/> failure possibility predicted,<br/> increase in latency.
  controlsystem->>container: 4. Sends response metrics
  loop feedback
    container->>container: Scale up, latency increased.
    container->>container: Scale up, throughput decreased <br/> at high volume
  end
  container->>ServiceController: 5. Containers increased.
  ServiceController->>client: 6. Returns response
```
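
As a rough illustration of the feedback loop in the diagram above, here is a minimal sketch of a control loop that watches latency samples and requests more containers when the 95th percentile exceeds a target. The latency target, the get_latency_samples() source and the scale_up() hook are hypothetical placeholders for whatever your metrics pipeline and orchestrator expose.

```python
import statistics
import time
from typing import Callable, List

# The latency target, sampling hook and scaling hook below are hypothetical
# placeholders, not a specific product API.
LATENCY_TARGET_MS = 200.0

def control_loop(get_latency_samples: Callable[[], List[float]],
                 scale_up: Callable[[int], None],
                 interval_s: int = 30):
    """Recalibrate on each tick: if p95 latency exceeds the target, request more containers."""
    while True:
        samples = get_latency_samples()
        if samples:
            p95 = statistics.quantiles(samples, n=20)[18]   # 95th percentile
            if p95 > LATENCY_TARGET_MS:
                scale_up(1)                                  # feedback: add one container
                print(f"p95={p95:.0f} ms > {LATENCY_TARGET_MS} ms, scaling up")
        time.sleep(interval_s)

if __name__ == "__main__":
    # Toy usage: fake latency samples and a scale_up hook that only logs.
    import random
    control_loop(
        get_latency_samples=lambda: [random.gauss(220, 40) for _ in range(50)],
        scale_up=lambda n: print(f"(pretend) requested {n} extra container(s)"),
        interval_s=2,
    )
```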

In the present context of distributed systems, no such control systems are built in. Adjustment happens only when a service fails over or when load is distributed between multiple hosts. We don't tend to measure precisely the infrastructure we are using for these services or what it costs.

The majority of cloud providers give these options, but at a cost. Beyond the cloud-native services, it is detailed monitoring, proactive feedback and self-remediation that make a service competent enough to serve a large volume of clients. However, if you host an application in the cloud and want it to scale by itself without proper metric feedback, you will either be too eager to spin up more instances or burn more budget running your services.
