Cores that don’t count: NOTES

In this paper, the authors describe "often silent" corrupt execution errors (CEEs) that cause ephemeral computational errors for a class of computations. They observe that small (valid) code changes that make heavy use of rarely used instructions can lead to large shifts in reliability (due to manufacturing defects). These defects are not detected during manufacturing tests and cannot always be mitigated by microcode updates. Such cores can give incorrect results for some inputs, and the errors can be obscured by undiagnosed software bugs. When a core develops such behavior, the authors term it mercurial. They observe on the order of a few mercurial cores per several thousand machines. The rate seen by their automatic detector is gradually increasing, but they do not know whether this reflects a change in the underlying rate.

CEEs may be detected nearly immediately by self-checks, exceptions, or segfaults. In other cases, wrong answers are detected early, detected too late in the computation, or never detected. Bad metadata can cause the loss of an entire filesystem, and a corrupted encryption key can render large amounts of data permanently inaccessible. It has long been known that storage devices and networks can corrupt data, a phenomenon known as silent data corruption (SDC). In addition to random alpha particles, cosmic rays, and overclocking, CEEs are another cause of SDC.
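For contrast, the classic end-to-end defense against SDC on the storage and network path is to checksum data at the producer and verify it at the consumer. A minimal sketch of that pattern (the function names are illustrative, not from the paper):

```python
import zlib

def attach_checksum(data: bytes) -> tuple[bytes, int]:
    # Producer side: compute a CRC32 over the payload before it leaves
    # the application.
    return data, zlib.crc32(data)

def verify_checksum(data: bytes, crc: int) -> bytes:
    # Consumer side: recompute and compare; a mismatch signals that the
    # bytes were corrupted somewhere in between.
    if zlib.crc32(data) != crc:
        raise IOError("CRC mismatch: silent data corruption detected")
    return data
```

Note that such a check only protects the bytes between the two checksum computations; a CEE that miscomputes the data before the checksum is attached slips through undetected.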

The authors observe the following failures caused by CEEs:

  • Violation of lock semantics causing data corruption and crashes.
  • Data corruption exhibited by various load, store, vector, and coherence operations.
  • AES miscomputation causing data corruption.
  • Corruption affecting garbage collection, causing live data to be lost.
  • Database index corruption, leading some queries to return incorrect results.
  • Repeated bit-flips in strings at a particular bit position.
  • Corruption of kernel state, resulting in process and kernel crashes.

CEEs mostly appear non-deterministically, at a variable rate. In a multi-core processor, typically just one core fails. Faulty cores fail repeatedly and intermittently, and often get worse with time (aging). These issues appear to be an industry-wide problem, not specific to any vendor, but they differ across CPU products and can be highly dependent on frequency, voltage, and temperature. Some mercurial-core CEE rates are strongly frequency-sensitive; some are not. With Dynamic Voltage and Frequency Scaling (DVFS), lowering the frequency sometimes (surprisingly) increases the failure rate. Because of limited knowledge of the underlying hardware, and no access to hardware-supported test structures, the authors cannot infer much about root causes.

The authors believe mercurial cores are being detected now because of:

  • Large server fleets.
  • Increased attention to reliability.
  • Improvements in software development that minimize software bugs.
  • Steady increase in CPU scale and complexity.
  • Smaller silicon feature sizes, and smaller margins of error.
  • Limits of CMOS scaling and the risk of post-burn-in (latent) failures.
  • Novel techniques, such as stacked layers, add complexity and manufacturing risk.
  • Complex microarchitectures: CPUs increasingly resemble sets of discrete accelerators sharing a register file.

The authors discuss the need for a systematic way of developing tests to detect CEEs and for quarantining mercurial cores. Testing becomes part of the full lifecycle of a CPU, not just an issue for vendors or for burn-in testing. Naively, detecting CEEs implies double work, and automatic correction implies triple work (using majority voting and deterministic replay, which relies on the assumption that the voting mechanism itself is reliable). Some systems run pairs of cores in "lockstep" to detect when one fails.
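A minimal sketch of the triple-work correction idea, using replicated execution and a majority vote (the helper name is hypothetical; it assumes the computation is deterministic and, as the authors caution, that the voting code itself runs reliably):

```python
from collections import Counter

def majority_vote(compute, args, replicas=3):
    # Run the computation several times and keep the most common
    # result; results must be hashable for the vote to work.
    results = [compute(*args) for _ in range(replicas)]
    winner, count = Counter(results).most_common(1)[0]
    if count <= replicas // 2:
        raise RuntimeError("no majority result: possible corrupt execution")
    return winner
```

In practice the replicas would need to run on distinct cores, since a mercurial core that fails repeatedly may simply reproduce the same wrong answer.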

The authors suggest system support for efficient checkpointing to recover from a failed computation. In addition, they recommend that correctness is often best checked at the endpoints rather than in lower-level infrastructure, and that application-specific detection methods be used to decide whether to continue past a checkpoint or to retry. Application-level screening can be more focused, more easily fine-tuned, and can enable application-level mitigations. Isolating a specific core can be challenging because the scheduler assumes that all machines of a given type have identical resources. One might identify a set of tasks that can run speculatively on a given mercurial core; if the core fails, the tasks can be re-executed on a different core. Perhaps compilers could detect blocks of code whose correct execution is especially critical, and automatically replicate just those computations.
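A sketch of the checkpoint-plus-application-level-screen idea; `step`, `check`, and the pickle-based checkpoint are hypothetical stand-ins, not mechanisms from the paper:

```python
import pickle

def run_with_checkpoints(steps, state, check, path="checkpoint.pkl"):
    # Advance `state` one step at a time, committing a checkpoint only
    # after an application-specific `check` accepts the new state.
    for step in steps:
        candidate = step(state)
        if not check(candidate):
            candidate = step(state)  # retry; ideally on a different core
            if not check(candidate):
                raise RuntimeError("step failed screening twice: suspect CEE")
        state = candidate
        with open(path, "wb") as f:
            pickle.dump(state, f)  # durable point to roll back to
    return state
```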

Many of Google's applications already check for SDCs and can also detect CEEs at minimal extra cost. Other systems execute the same update logic in parallel at several replicas, to avoid network dependencies and for fail-stop resilience; these dual computations can be exploited to detect CEEs. Google also uses self-screening mechanisms in some cryptographic applications.
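The paper does not describe these self-screening mechanisms in detail, but a common pattern in cryptographic code is to round-trip a result before releasing it. A sketch using the third-party `cryptography` package (the function itself is hypothetical):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_with_selfcheck(key: bytes, plaintext: bytes):
    # Encrypt, then immediately decrypt and compare, so a miscomputed
    # ciphertext is caught before it is stored or transmitted.
    nonce = os.urandom(12)
    aead = AESGCM(key)
    ciphertext = aead.encrypt(nonce, plaintext, None)
    # A corrupted ciphertext will usually fail GCM tag verification
    # (decrypt raises); the comparison catches anything that slips past.
    if aead.decrypt(nonce, ciphertext, None) != plaintext:
        raise RuntimeError("AES round-trip mismatch: possible CEE")
    return nonce, ciphertext
```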

They discuss that building resilience to tolerate CEEs involves a trade-off between performance and hardware reliability. Hardware should be designed for test, so that cores with subtle manufacturing defects can be detected, and these test features should be exposed to end users. In addition, critical functional units should be conservatively designed, trading some extra area and power for reliability. This might still be much more efficient than replicating computations in software.

They also suggest that hyperscalers could isolate mercurial-core servers from their fleets and make them available to researchers, and that fault injectors could be developed for testing software resilience on real hardware. CEEs are likely to occur in other hardware as well, such as GPUs, ML accelerators, P4 switches, and NICs; there may be novel challenges in detecting and mitigating CEEs in non-CPU settings.
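A software-only toy version of such a fault injector (the authors envision injectors backed by real faulty hardware; everything below is a hypothetical illustration):

```python
import random

def with_bit_flips(fn, flip_prob=1e-3, seed=42):
    # Wrap an int-returning function so that, with small probability,
    # one random bit of its result is flipped, emulating a mercurial
    # core for software-resilience testing.
    rng = random.Random(seed)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if rng.random() < flip_prob:
            result ^= 1 << rng.randrange(max(result.bit_length(), 1))
        return result
    return wrapper
```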
