@srinathperera
Created May 21, 2020 04:33

X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software

Numerous studies have reported that configuration and similar human errors are the largest source of errors in deployed systems [10, 11, 24, 25, 30, 32, 34, 58].

Profilers tell you what is slow; root-cause diagnosis tells you why.

What

X-ray currently supports four metrics: execution latency, CPU utilization, file system usage, and network use (the user chooses one to analyze).

Related work

Aguilera et al. [1] infer causal paths between application components and attribute delays to specific nodes.

Pinpoint [15, 16] traces communication between middleware components to infer which components cause faults and the causal paths that link black-box components.

Magpie [7] extracts the component control flow and resource consumption of each request to build a workload model for performance prediction. Even though Magpie provides detailed performance information to manually infer root causes, it still does not automatically diagnose why the observed performance anomalies occur.

Tuning: Many systems [14, 20, 60, 61] tune performance by injecting artificial traffic and using machine learning to correlate observed performance with specific configuration options.

Spectroscope [46] diagnoses performance changes by comparing request flows between two executions of the same workload.

configs: Several systems are holistic or address the third step (fixing the problem). PeerPressure [54] and Strider [55] compare Windows registry state on different machines. They rely on the most common configuration states being correct since they cannot infer why a particular configuration fails. Chronus [56] compares configuration states of the same computer across time. AutoBash [50] allows users to safely try many potential configuration fixes.

How?

Uses code instrumentation: compares the performance of similar requests and identifies where their execution paths diverge.

The only online activities are gathering performance data and logging system calls, synchronization operations and known data races. X-ray records timestamps at the entry and exit of system calls and synchronization operations.
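As a rough illustration of entry/exit timestamping, here is a minimal Python sketch. It is only an analogy: X-ray instruments system calls and synchronization operations in the running program, while this sketch wraps an ordinary Python call. The names `trace` and `timed` are hypothetical.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory log of (operation name, entry_ns, exit_ns) tuples.
trace = []

@contextmanager
def timed(name):
    """Record a timestamp at entry and at exit of the wrapped operation."""
    entry = time.monotonic_ns()
    try:
        yield
    finally:
        trace.append((name, entry, time.monotonic_ns()))

# Stand-in for a blocking system call.
with timed("sleep"):
    time.sleep(0.01)

name, entry_ns, exit_ns = trace[0]
latency_ns = exit_ns - entry_ns
```

Only the two timestamps are captured while the operation runs; anything more expensive is left for later analysis, which mirrors the online/offline split described in these notes.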

The online phase collects data; the offline phase performs the analysis later. X-ray uses deterministic replay so that the analysis does not perturb the program's performance.

X-ray introduces the technique of performance summarization. This technique first attributes performance costs to very fine-grained events, namely user-level instructions and system calls executed by the application.

Then, it uses dynamic information flow analysis to associate each such event with a ranked list of probable root causes. Finally, it summarizes the cost of each root cause over all events by adding the products of the per-event cost and an estimate of the likelihood that the event was caused by the root cause in question. The result is a list of root causes ordered by performance costs.
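The summation described above can be sketched in a few lines of Python. This is a minimal sketch, not X-ray's implementation: the event format, the `summarize` function, and the configuration option names are all made up for illustration.

```python
from collections import defaultdict

def summarize(events):
    """Rank root causes by total attributed cost.

    events: iterable of (cost, {root_cause: likelihood}) pairs, where
    likelihood estimates how strongly the event depends on that cause.
    """
    totals = defaultdict(float)
    for cost, causes in events:
        for cause, likelihood in causes.items():
            # Per-event cost weighted by the likelihood of this cause.
            totals[cause] += cost * likelihood
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical events; costs could be latency in milliseconds.
events = [
    (8.0, {"max_connections": 0.5, "log_level": 0.25}),
    (2.0, {"log_level": 1.0}),
    (1.0, {"max_connections": 1.0}),
]

ranking = summarize(events)
# max_connections: 8.0*0.5 + 1.0*1.0 = 5.0
# log_level:       8.0*0.25 + 2.0*1.0 = 4.0
```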

Uses the ConfAid [6] taint-tracking framework to find potential root causes. Rather than track taint as a binary value, ConfAid associates a weight with each taint identifier that represents the strength of the causal relationship between the tainted value and the root cause. When ConfAid observes the failure event (e.g., a bad output), it outputs all root causes on which the current program control flow depends, ordered by the weight of that dependence.
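A minimal sketch of weighted (non-binary) taint, under stated assumptions. The notes do not give ConfAid's actual weighting rules, so the two rules here are invented for illustration: merged taints keep the strongest weight per root cause, and a dependence that passes through a branch is halved. The helper names (`merge`, `through_branch`, `rank`) and option names are hypothetical.

```python
def merge(*taints):
    """Union taint sets, keeping the strongest weight per root cause."""
    out = {}
    for taint in taints:
        for cause, weight in taint.items():
            out[cause] = max(out.get(cause, 0.0), weight)
    return out

def through_branch(taint, decay=0.5):
    """Assumed rule: a control-flow dependence is weaker than a data-flow one."""
    return {cause: weight * decay for cause, weight in taint.items()}

def rank(taint):
    """Root causes ordered by the weight of the dependence, strongest first."""
    return sorted(taint, key=taint.get, reverse=True)

# Values read directly from two hypothetical configuration options.
timeout = {"timeout_opt": 1.0}
retries = {"retry_opt": 1.0}

# A value computed from `timeout` inside a branch that tests `retries`.
value_taint = merge(timeout, through_branch(retries))
causes = rank(value_taint)
```

Here `value_taint` is `{"timeout_opt": 1.0, "retry_opt": 0.5}`, so `timeout_opt` is reported first.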

Taints

  1. Add a taint when a request is received and propagate it through the thread and any IPC; this can't track multithreaded apps.

  2. Add taints to memory locations; this is the default but slower.

One of the most important insights that led to the design of X-ray is that the marginal effort of determining the root cause of all or many events in a program execution is not substantially greater than the effort of determining the root cause of a single event. Because a taint tracking system does not know a-priori which intermediate values will be needed to calculate the taint of an output, it must calculate taints for all intermediate values.

Summarizations - The latency of each system call and synchronization operation is recorded during online execution. X-ray attributes the remaining latency to user-level instructions.
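The attribution above amounts to a subtraction. In the following minimal sketch, the even split of the remaining latency across user-level instructions is an assumption, as are the function name `attribute_latency` and the sample numbers.

```python
def attribute_latency(total, recorded, user_instruction_count):
    """Split a request's latency into recorded and user-level parts.

    total: end-to-end latency of the request.
    recorded: {operation: latency} for system calls and synchronization
        operations measured online.
    Returns (remaining user-level latency, latency per user instruction).
    """
    remaining = total - sum(recorded.values())
    if remaining < 0:
        raise ValueError("recorded latency exceeds total latency")
    # Assumption: every user-level instruction gets an equal share.
    per_instruction = (
        remaining / user_instruction_count if user_instruction_count else 0.0
    )
    return remaining, per_instruction

# Hypothetical measurements in microseconds.
remaining, per_instruction = attribute_latency(
    total=1000.0,
    recorded={"read": 300.0, "futex_wait": 200.0},
    user_instruction_count=1000,
)
```

With these numbers, 500.0 microseconds remain for user-level code, or 0.5 per instruction.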

Simple summarization - X-ray calculates the total cost for each root cause by summing the per-block costs for that cause over all basic blocks within the analysis scope.

Differential - The cost of a divergence is the difference between the performance cost of all basic blocks on the divergent path taken by the first execution and the cost of all blocks on the path taken by the second execution.
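A minimal sketch of the divergence cost, assuming each execution is a list of `(block_id, cost)` pairs. For simplicity it treats the common prefix and common suffix of block ids as the shared parts of the two paths; how X-ray actually locates divergence and merge points is not covered in these notes. All names and costs are hypothetical.

```python
def divergent_segments(first, second):
    """Strip the shared prefix and suffix, leaving the divergent paths."""
    limit = min(len(first), len(second))
    start = 0
    while start < limit and first[start][0] == second[start][0]:
        start += 1
    end = 0
    while end < limit - start and first[-1 - end][0] == second[-1 - end][0]:
        end += 1
    return first[start:len(first) - end], second[start:len(second) - end]

def divergence_cost(first, second):
    """Cost of the first execution's divergent path minus the second's."""
    seg_first, seg_second = divergent_segments(first, second)
    return sum(c for _, c in seg_first) - sum(c for _, c in seg_second)

# Two hypothetical executions of similar requests.
slow = [("parse", 1.0), ("dns_lookup", 6.0), ("send", 2.0)]
fast = [("parse", 1.0), ("cache_hit", 0.5), ("send", 2.0)]

cost = divergence_cost(slow, fast)  # 6.0 - 0.5
```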

Multiple - With more than two executions, the cost of the shortest path is computed and each path is compared against it.
