ogxd/ExceptionHandlingRFC.md

## ExceptionHandlingRFC.md

      
The Problem

What are we trying to solve?

Exception handling in .NET has several pitfalls:
Performance Overhead & Resiliency

Exception handling in .NET can be expensive, especially when exceptions are thrown frequently. Throwing an exception involves capturing a stack trace and unwinding the stack, both of which are costly. If an application relies heavily on exceptions for control flow, performance can degrade significantly.

  Cost of exceptions analysis
We often hear that throwing exceptions is expensive, so here are some numbers.
This benchmark consists of a loop in which we throw and catch a fixed number of exceptions.  Given the number of iterations and the total time, we can estimate on average how much time throwing an exception takes on a single CPU core.
The first thing we notice is that, on the hardware used, throwing a single exception takes 21432ns or 0.021ms. That’s huge. At 10k exceptions thrown per second, that represents 214ms of CPU time, or 21.4% of CPU usage. For comparison, allocating the exceptions but not throwing them takes 0.024% CPU usage instead.
Then, we run the same loop but on several cores (up to the maximum number of physical cores available). In theory, if there wasn’t any form of thread contention associated with exceptions, we would see a flat line: throwing a single exception on a single thread should always take the same time, independently from the number of threads running the benchmark.
This is however not what we observe: as the number of threads increases, the time taken for throwing a single exception increases linearly. This implies that there is indeed some form of thread contention happening when throwing exceptions.
Here is another way to visualize it:
Let’s say we have a given multithreaded application throwing 1000 exceptions per second in total. Given the measurements done above, we can estimate the total CPU usage. If the application were to run on a 10 threads machine, this would occupy about 1.5% of the CPU usage. Now, if it were to run on a 20 threads machine, we observe that it would still occupy 1.5% of the CPU usage.
That’s bad because it means that if we opt for more powerful machines with more cores to handle more load, the number of exceptions we can throw does not scale with them. At some point, exceptions will inevitably become a bottleneck.
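The single-threaded measurement described above can be approximated with a loop like the one below. This is a minimal sketch and not the benchmark behind the numbers in this RFC: the `ThrowBenchmark` class, its method names, and the iteration count are assumptions made for illustration, and a serious measurement would use a proper harness with warm-up and repeated runs.

```csharp
using System;
using System.Diagnostics;

public static class ThrowBenchmark
{
    // Throws and catches `iterations` exceptions, returns the average cost per exception in nanoseconds.
    public static double MeasureThrowNs(int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            try
            {
                throw new InvalidOperationException("benchmark");
            }
            catch (InvalidOperationException)
            {
                // Swallowed on purpose: only the throw/catch cost is measured.
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds * 1_000_000 / iterations;
    }

    // Same loop, but the exception is only allocated and never thrown.
    public static double MeasureAllocateNs(int iterations)
    {
        long sink = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var exception = new InvalidOperationException("benchmark");
            sink += exception.Message.Length; // keeps the allocation observable
        }
        stopwatch.Stop();
        GC.KeepAlive(sink);
        return stopwatch.Elapsed.TotalMilliseconds * 1_000_000 / iterations;
    }

    public static void Main()
    {
        const int iterations = 10_000;
        Console.WriteLine($"throw + catch: {MeasureThrowNs(iterations):F0} ns per exception");
        Console.WriteLine($"allocate only: {MeasureAllocateNs(iterations):F0} ns per exception");
    }
}
```

To reproduce the multi-core observation, the same `MeasureThrowNs` loop would be started on several threads at once and the per-exception time compared against the single-threaded figure.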

https://smartadserver.atlassian.net/browse/CB-2823 was about trying to reduce the overhead of throwing exceptions, but it did not lead to significant improvements.
This has an unfortunate consequence on system resiliency, as the overhead is likely to grow as a process gets overloaded (since we are likely to have more TaskCanceledExceptions, for instance), after which we observe a threshold effect where the process becomes completely unstable.
An Exception is just an object. The overhead happens when an exception is thrown. Simply allocating a new exception does not lead to any significant overhead.
Async and Exceptions

In short, async is designed in a way so that awaiting a Task that has thrown an exception results in the exception being rethrown internally. This is by design in .NET.
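One way to observe this rethrow behavior is to count first-chance exception notifications while a single failure travels up a chain of awaits. The sketch below is illustrative only: the `AsyncRethrowDemo` class and its method names are hypothetical, and the exact count depends on the runtime and on the depth of the chain, so no specific number is claimed here.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncRethrowDemo
{
    private static int _firstChanceCount;

    // The exception is thrown once here and stored in the returned Task.
    private static async Task<int> InnerAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("boom");
    }

    // Each await on a faulted Task rethrows the stored exception.
    private static async Task<int> MiddleAsync() => await InnerAsync();
    private static async Task<int> OuterAsync() => await MiddleAsync();

    // Returns how many first-chance notifications were raised for one logical failure.
    public static async Task<int> CountThrowsAsync()
    {
        Interlocked.Exchange(ref _firstChanceCount, 0);
        EventHandler<FirstChanceExceptionEventArgs> handler =
            (sender, args) => Interlocked.Increment(ref _firstChanceCount);
        AppDomain.CurrentDomain.FirstChanceException += handler;
        try
        {
            await OuterAsync();
        }
        catch (InvalidOperationException)
        {
            // Handled at the top of the chain.
        }
        finally
        {
            AppDomain.CurrentDomain.FirstChanceException -= handler;
        }
        return Volatile.Read(ref _firstChanceCount);
    }

    public static async Task Main()
    {
        int count = await CountThrowsAsync();
        Console.WriteLine($"First-chance notifications for a single failure: {count}");
    }
}
```

If the description above holds, the reported count should grow with the number of async layers between the throw site and the catch site, even though only one failure occurred.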
Polly Enters the Chat

While Polly is the most widely adopted resiliency framework in .NET, and even backed by Microsoft, the resiliency policies are based on asynchronous delegates. This results in exceptions being rethrown for every asynchronous delegate in the policy, further amplifying this threshold effect. This is counter-productive as using the resiliency framework may result in a less resilient application.
Lack of Explicitness

Unlike some languages that use explicit exception declarations (like Java's checked exceptions), .NET does not require methods to declare which exceptions they might throw. This lack of explicitness means developers must rely on documentation or source code analysis to understand the potential exceptions. This is cumbersome, error-prone, and time-consuming.
For these reasons, exception handling is in practice done defensively: assuming that a call can fail, try-catch blocks are used. This is especially true for IO operations, network requests, and any interaction with external systems, but also for components developed by others within the company where coding practices can be very heterogeneous.
This often leads to excessive use of try-catch blocks, which in turn can lead to:


Code that is harder to read and maintain
Exception handling policy is heterogeneous across the codebase, the logical flow is unclear, and defensive try-catch blocks often lead to cluttered code with unnecessary safety branches.


A risk of silent failure
If exceptions are caught but not properly handled or logged, the program will continue to process a request in an erroneous state without any indication of a problem. This makes debugging and maintenance more difficult.


The lack of explicitness also leads to the opposite where the handling is insufficient, which can lead to:

Exceptions lose context of their failure
When an exception is finally handled far from where it was thrown (deeper in the call stack), it is harder to handle properly because the context is lost, which makes assessing its relevance and debugging more difficult.

Why should we solve it?

As a backend engineering team maintaining a system handling hundreds of billions of requests daily, performance and resiliency are key. Proper exception handling is mandatory to achieve this goal.
Another important aspect is code maintenance and quality. Having a unified and clear exception handling policy will simplify and unify the codebase and make the engine behavior more predictable.
Goals and non-goals


✅ Come up with a pattern that addresses the performance/resiliency issues stated above
✅ Simplify exception handling through a more explicit approach
❌ Getting rid of exceptions. The issue does not lie in exceptions themselves, but in the overhead of throwing them and in how exception handling is done in vanilla .NET.
❌ Silencing errors by logging fewer of them. The point is to have more meaningful errors, not to silence them all.
❌ Changing the way we handle exceptions across the whole codebase in a single pass. That would be a huge amount of work. The point is rather to address the main paths and make it clear what the policy and pattern should be for future development and evolutions.

Definition of success

Removed/mitigated the exception throwing resiliency threshold, for an overall more resilient backend.
This can be measured by monitoring spikes of exceptions thrown: app_monitoring_total{ category="General/Exceptions (count/s)" }
The exception-handling policy is clear and homogeneous.
This is a qualitative measurement, but the success factor can be decomposed into a few points (documented? adoption? …)
Fewer decontextualized errors logged overall.
This can be measured as the “normal“ residual number of error/fatal logs.
An error log isn’t necessarily linked to an Exception. It may seem obvious, but it's important to keep that in mind.
The solution

It is a challenging topic to address, as try-catch exception handling is the de facto (and only built-in) way in .NET. The pitfalls mentioned above are unknown or underestimated in the .NET community, and very few resources online seem to mention or tackle them.
Yet, outside of the .NET prism, the explicitness pitfall alone is well acknowledged. The most recent and trendy languages (Go, Rust, Kotlin, …) all have in common a more explicit approach to error handling.
One widespread approach is the use of the Result monad (see the built-in Result type of Kotlin or Rust for instance). Unfortunately, this type is not part of the .NET BCL.
The Result Pattern

This monad is well-known in computer science nowadays, so let’s skip what it is and how it works and directly introduce it in the context of .NET and more specifically exception handling.
Let’s say we define this simple structure:
public readonly record struct Result<T>
{
    public bool IsSuccessful { get; init; }
    public T Value { get; init; }
    public Exception Exception { get; init; }
}
We could introduce many more utility methods and safeguards, but this is not the point here, so we keep it simple for the demonstration.
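Later snippets in this document call helpers such as `Result.Success`, `Result.Fail`, and `ToResult()` on a `Result<T>`, none of which are defined here. The sketch below shows one way they could be written. It is an assumption, not the actual implementation behind this RFC: the non-generic `Result` companion type, the factory signatures, and the `Demo.ParsePort` example are all hypothetical.

```csharp
#nullable enable
using System;

// Hypothetical non-generic companion: carries success/failure without a value
// and hosts the factory methods.
public readonly record struct Result
{
    public bool IsSuccessful { get; init; }
    public Exception? Exception { get; init; }

    public static Result Success() => new Result { IsSuccessful = true };
    public static Result Fail(Exception exception) => new Result { Exception = exception };

    public static Result<T> Success<T>(T value) => new Result<T> { IsSuccessful = true, Value = value };
    public static Result<T> Fail<T>(Exception exception) => new Result<T> { Exception = exception };
}

public readonly record struct Result<T>
{
    public bool IsSuccessful { get; init; }
    public T? Value { get; init; }
    public Exception? Exception { get; init; }

    // Drops the value and keeps only the success/failure information.
    public Result ToResult() => IsSuccessful ? Result.Success() : Result.Fail(Exception!);
}

public static class Demo
{
    // Returns a failure instead of throwing when the text is not a number.
    public static Result<int> ParsePort(string text) =>
        int.TryParse(text, out int port)
            ? Result.Success(port)
            : Result.Fail<int>(new FormatException($"'{text}' is not a valid port"));

    public static void Main()
    {
        Console.WriteLine(ParsePort("8080").IsSuccessful);      // True
        Console.WriteLine(ParsePort("http").Exception?.Message); // 'http' is not a valid port
    }
}
```

Keeping the factories on the non-generic type lets the compiler infer `T` for successes (`Result.Success(port)`), while failures need it spelled out (`Result.Fail<int>(...)`) because an exception alone says nothing about the value type.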
Let’s say we have a method A (caller) that internally calls method B (callee).
As B may throw an exception, we would like to handle it in A. The normal .NET way would consist of checking the B documentation or source code, and then wrapping B in a try-catch, with proper branches depending on exceptions expected to be thrown, like so:
public string A(...) {
  ...
  try {
    return B(...);
  } catch (SomeException) {
    ...
  } catch (SomeOtherException) {
    ...
  }
  ...
  return fallbackValue;
}
Now, let’s see what it would look like if B returned a Result instead:
public string A(...) {
  ...
  Result<string> result = B(...);
  if (result.IsSuccessful)
    return result.Value;
      
  switch (result.Exception) {
    case SomeException:
      ...
      break;
    case SomeOtherException:
      ...
      break;
  }
  ...
  return fallbackValue;
}
Or a more compact approach using pattern matching, similar to Rust’s match philosophy:
public string A() {
  ...
  switch (B()) {
    case { IsSuccessful: true } result:
      return result.Value;
    case { Exception: SomeException }:
      ...
      break;
    case { Exception: SomeOtherException }:
      ...
      break;
  }
  ...
  return fallbackValue;
}
So, what are the benefits?
This is still very readable, as much as the try-catch version is.
This is also very versatile thanks to pattern matching and extensibility possibilities with the Result type.
The code is one level more explicit. Why? Because with B returning a Result, you know it can fail (without throwing), so your exception handling is not defensive anymore. This by itself can help get rid of a lot of confusion and unnecessary boilerplate code in a lot of places.
The exception remains contextualized. Let me explain. In the try-catch version, we may think we handle all possible exceptions because we have read B's documentation or source code, but we have no guarantee that this is the case: the documentation might be outdated or incorrect, or B might internally call C, which might itself throw another unexpected exception (for the same reasons). As a result, because you think you are handling all possible exceptions, you might document that A does not throw while in reality it can.
With the Result approach, the failure state is embedded within the Result. So in the example above, as a user of A, you are assured that A won't throw. On the other hand, if you believe something unexpected might happen in C, you can add a branch to handle the remaining Exceptions returned by B and decide whether A should return a Result or silence the error and keep returning a string.
But what if I can’t change B?

If B belongs to a codebase you don't own, or you don't want to touch it, you can simply use an extension method to switch from the try-catch pattern to the Result pattern, like so:
public static class ResultExtensions {
  public static Result<T> ToResult<T>(this Func<T> func) {
    try {
      return new Result<T>{ IsSuccessful = true, Value = func() };
    } catch (Exception e) {
      return new Result<T>{ IsSuccessful = false, Exception = e };
    }
  }
}
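This is the only try-catch block you would need, after which the whole exception handling flow is Result-based and throw-free.
A's implementation would then become:
public string A() {
  ...
  // Extension methods cannot be called on a method group, so B is converted to a delegate first
  switch (((Func<string>)B).ToResult()) {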
    case { IsSuccessful: true } result:
      return result.Value;
    case { Exception: SomeException }:
      ...
      break;
    case { Exception: SomeOtherException }:
      ...
      break;
  }
  ...
  return fallbackValue;
}
What About Asynchronous Code?

Let’s now say that A and B are asynchronous, both returning a Task.
The pattern applies the same way for the most part, but there is a twist:
In asynchronous methods, a Task represents the ongoing work. When an exception occurs within a Task, it is captured and stored within the Task object itself. This is different from synchronous code, where exceptions are thrown immediately at the point of failure.
When the Task is awaited (with the await keyword), it essentially unwraps the result of that Task. If the Task has been completed successfully, await retrieves the result. However, if the task is faulted (due to an exception), the await expression rethrows the exception so that it can be handled in the context where the task was awaited. This is important for “classic“ exception handling, as it allows exceptions to be propagated back to the calling code.
To apply the same pattern as for synchronous code, B must first return a Task<Result>, which implies that the success/failure information is now carried by the Result, and no longer by the Task.
Here is what A would then look like:
public async Task<string> A() {
  ...
  switch (await B()) {
    case { IsSuccessful: true } result:
      return result.Value;
    case { Exception: SomeException }:
      ...
      break;
    case { Exception: SomeOtherException }:
      ...
      break;
  }
  ...
  return fallbackValue;
}
The benefits of transforming the return type of a method like B (with possible failure outcomes) from a Task to a Task<Result> are the same as in the context of synchronous code:
This is still very readable, as much as the try-catch version is.
This is also very versatile thanks to pattern matching and extensibility possibilities with the Result type.
The code is one level more explicit. Why? Because with B returning a Task<Result>, you know it can fail, so your exception handling is not defensive anymore. As the failure information (if any) is now carried by the Result, you can now as a consumer of that method safely await the Task without fear of an exception being thrown.
The exception remains contextualized.
BONUS: It is also safer! One of the pitfalls of async/await is non-async methods returning a Task. While this saves a state machine stage and the allocation of an additional Task, it can be dangerous in some contexts, such as inside a try-catch block: the exception is rethrown when the Task is awaited, which happens outside the try-catch scope (it happened to me once, in production… 🥲).
public Task<string> A(...) {
  ...
  try {
    return B(...); // Since B is not awaited, try-catch won't behave as expected
  } catch (SomeException) {
    ...
  } catch (SomeOtherException) {
    ...
  }
  ...
  return Task.FromResult(fallbackValue);
}
This can’t happen when B returns a Task<Result>, because you have to await it first to then be able to handle the exception.
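The pitfall can be reproduced in isolation. The following sketch is a hypothetical, self-contained example (the `ElidedAwaitDemo` class and its method names are not from the original codebase) that contrasts a non-async wrapper returning the Task directly with an async wrapper that awaits it.

```csharp
using System;
using System.Threading.Tasks;

public static class ElidedAwaitDemo
{
    // Always fails asynchronously: the exception is stored in the returned Task.
    private static async Task<string> B()
    {
        await Task.Yield();
        throw new InvalidOperationException("B failed");
    }

    // Not async: the Task is returned as-is, so the catch block never sees B's failure.
    public static Task<string> WithoutAwait()
    {
        try
        {
            return B();
        }
        catch (InvalidOperationException)
        {
            return Task.FromResult("fallback");
        }
    }

    // Async: the await happens inside the try block, so the catch block applies.
    public static async Task<string> WithAwait()
    {
        try
        {
            return await B();
        }
        catch (InvalidOperationException)
        {
            return "fallback";
        }
    }

    public static async Task Main()
    {
        Console.WriteLine(await WithAwait()); // prints "fallback"
        try
        {
            Console.WriteLine(await WithoutAwait());
        }
        catch (InvalidOperationException e)
        {
            Console.WriteLine($"escaped: {e.Message}"); // prints "escaped: B failed"
        }
    }
}
```

In `WithoutAwait`, the try block has already been exited by the time the Task faults, so the failure only surfaces at the caller's await.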
But what if I can’t change B?

B may belong to a codebase you don’t own or that you can’t change. That’s fine. In fact, we rarely implement client code ourselves for asynchronous I/O (HttpClient, SignalR, gRPC, …) and so we inevitably have to consume asynchronous methods (returning a Task or equivalent) that can throw (a common one is OperationCanceledException with the CancellationToken).
In such cases, we could derive an async version of the ToResult method presented above, awaiting the Task within a try-catch block and storing the thrown exception (if any) in the Result. This is a simple approach, but it implies the allocation of an additional Task, one more state machine stage, and more importantly it still causes any exception to be rethrown once because of the await keyword.
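For reference, the simple await-based approach described above could look like the sketch below. The method name `ToAsyncResultSimple` is hypothetical, and the `Result<T>` struct is repeated (with nullable annotations added) only to keep the example self-contained.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

public readonly record struct Result<T>
{
    public bool IsSuccessful { get; init; }
    public T? Value { get; init; }
    public Exception? Exception { get; init; }
}

public static class SimpleAsyncResultExtensions
{
    // Awaits the task inside a try-catch and stores the outcome in a Result.
    public static async Task<Result<T>> ToAsyncResultSimple<T>(this Task<T> task)
    {
        try
        {
            T value = await task.ConfigureAwait(false);
            return new Result<T> { IsSuccessful = true, Value = value };
        }
        catch (Exception e)
        {
            // The await above rethrows the stored exception once before it lands here.
            return new Result<T> { IsSuccessful = false, Exception = e };
        }
    }
}

public static class Demo
{
    public static async Task Main()
    {
        Result<int> ok = await Task.FromResult(42).ToAsyncResultSimple();
        Result<int> failed = await Task.FromException<int>(new TimeoutException("too slow")).ToAsyncResultSimple();

        Console.WriteLine(ok.Value);                  // 42
        Console.WriteLine(failed.Exception?.Message); // too slow
    }
}
```

Callers no longer see a throw, but one rethrow still happens inside the wrapper for every faulted Task.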
It turns out we can make our own custom awaiter to transform the Task into a Task<Result> without using the await at all!
// Requires: using System.Runtime.CompilerServices; (for TaskAwaiter<T>)
public static Task<Result<T>> ToAsyncResult<T>(this Task<T> task) {
  return new ToAsyncResultAwaiter<T>(task).TaskResult;
}

private struct ToAsyncResultAwaiter<T>
{
  private readonly Task<T> _task;
  private readonly TaskCompletionSource<Result<T>> _tcs;

  public Task<Result<T>> TaskResult => _tcs.Task;

  internal ToAsyncResultAwaiter(Task<T> task) {
    _task = task;
    _tcs = new TaskCompletionSource<Result<T>>();

    // Setup task completed callback
    TaskAwaiter<T> awaiter = _task.GetAwaiter();
    awaiter.OnCompleted(OnTaskCompleted);
  }

  // Result.Success, Result.Fail and Unwrap() are helpers that are not shown in this document
  private void OnTaskCompleted() {
    if (_task.IsCompletedSuccessfully) {
      _tcs.TrySetResult(Result.Success<T>(_task.Result));
    } else {
      _tcs.TrySetResult(Result.Fail<T>(_task.IsCanceled ? new TaskCanceledException() : _task.Exception!.Unwrap()));
    }
  }
}
Using this extension method, here is what our asynchronous method A would look like:
public async Task<string> A() {
  ...
  switch (await B().ToAsyncResult()) {
    case { IsSuccessful: true } result:
      return result.Value;
    case { Exception: SomeException }:
      ...
      break;
    case { Exception: SomeOtherException }:
      ...
      break;
  }
  ...
  return fallbackValue;
}
Isn’t that awesome?
Fully Explicit Exception Handling?
By using the Result pattern for exception handling, we have covered most of the pitfalls of exception handling in .NET. However, while it is now explicit whether a method can fail, it is still not explicit what kind of failure can occur.
This can be addressed by having the Result explicitly expose the possible failures. Ideally, this should be also explicit for static analysis, because having to run the program to dynamically find out what the failures could be would be impractical.
Since there are no discriminated unions in .NET, we can work around this with generics. Unfortunately, this would imply some limitations:
Covariant return types (C#) are not supported for interfaces. For this reason, if for instance SqlRepository and RedisRepository both implement IRepository, all GetAsync return types are the same. Both implementations may have different types of failure, and so it turns out not so great in the end.
Converting a Task to a Task<Result<T, Exception1, Exception2, ...>> is no longer entirely statically checked and safe: it implies "trust", where possible failures indicated in the Result are entirely based on the documentation/source code of the wrapped method, with possibility for human error.
For these reasons, full explicitness is not in the scope of this RFC.
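For illustration only, the generics workaround mentioned above could be sketched as follows. Everything in this block is hypothetical (the three-parameter `Result<T, TError1, TError2>` type, its `Match` method, and the `ParseByte` example), and it inherits the limitations listed above.

```csharp
#nullable enable
using System;
using ByteResult = Result<int, System.FormatException, System.OverflowException>;

// Hypothetical type: a Result whose failure is statically declared as TError1 or TError2.
public readonly struct Result<T, TError1, TError2>
    where TError1 : Exception
    where TError2 : Exception
{
    private readonly T? _value;
    private readonly Exception? _error;

    private Result(T? value, Exception? error)
    {
        _value = value;
        _error = error;
    }

    public static Result<T, TError1, TError2> Success(T value) => new(value, null);
    public static Result<T, TError1, TError2> Fail(TError1 error) => new(default, error);
    public static Result<T, TError1, TError2> Fail(TError2 error) => new(default, error);

    // Forces the caller to provide a branch for every declared outcome.
    public TResult Match<TResult>(
        Func<T, TResult> onSuccess,
        Func<TError1, TResult> onError1,
        Func<TError2, TResult> onError2) =>
        _error switch
        {
            null => onSuccess(_value!),
            TError1 e => onError1(e),
            TError2 e => onError2(e),
            _ => throw new InvalidOperationException("Undeclared failure type")
        };
}

public static class Demo
{
    public static ByteResult ParseByte(string text)
    {
        if (!int.TryParse(text, out int number))
            return ByteResult.Fail(new FormatException($"'{text}' is not a number"));
        if (number < 0 || number > 255)
            return ByteResult.Fail(new OverflowException($"{number} does not fit in a byte"));
        return ByteResult.Success(number);
    }

    public static void Main()
    {
        string Describe(ByteResult r) => r.Match(v => $"ok {v}", e => "bad format", e => "out of range");
        Console.WriteLine(Describe(ParseByte("42")));  // ok 42
        Console.WriteLine(Describe(ParseByte("abc"))); // bad format
        Console.WriteLine(Describe(ParseByte("300"))); // out of range
    }
}
```

The failure types become part of the signature, which is what makes the interface limitation above bite: every implementation of a shared interface method would have to declare the same error types.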
Exception Handling Policies with Result

When/where to handle exceptions?

Only handle exceptions if you have a means to continue the execution.
Given that CheckRequest, GetGeolocation, and GetUserId may throw, in which case should we handle exceptions in this place?
void ProcessRequest(Request request) {

  bool isRequestValid = CheckRequest(request);
  if (!isRequestValid) {
    throw new InvalidRequestException();
  }
  
  Geolocation geolocation = GetGeolocation(request); // ❌ Not handled
  Guid userId = GetUserId(request); // ❌ Not handled

  // do stuff...
}
Here is what it would look like if we follow the rule of thumb above. As you can see, we handle the GetGeolocation and GetUserId cases, because they have a fallback. Given that exceptions are handled here, it is worth logging at this stage, with an appropriate level depending on the criticality of the failing component and the context.
On the other hand, CheckRequest must succeed, so the execution of ProcessRequest must stop if it throws. For this reason, we don't handle exceptions at this stage and let them bubble up. The exception may be handled at some point and a log may be emitted, but at a higher level.
void ProcessRequest(Request request) {
  
  bool isRequestValid = CheckRequest(request);
  if (!isRequestValid) {
    throw new InvalidRequestException();
  }
  
  Geolocation geolocation;
  try {
    geolocation = GetGeolocation(request);
  } catch(Exception e) {
    logger.LogWarning(e, "...");
    geolocation = GetFallbackGeolocation(request);
  }

  Guid userId;
  try {
    userId = GetUserId(request);
  } catch(Exception e) {
    logger.LogError(e, "...");
    userId = Guid.Empty;
  }

  // do stuff...
}
Let’s complicate things a bit with asynchronicity and cancellation. A token is passed to ProcessRequestAsync, meaning the asynchronous execution is expected to stop when the token is canceled.
Because the token cancellation is the caller's responsibility and is not set within ProcessRequestAsync, it is not ProcessRequestAsync's responsibility to handle a possible OperationCanceledException caused by token cancellation. For this reason, we must not handle such exceptions and must let them bubble up to the callers.
async Task ProcessRequestAsync(Request request, CancellationToken token) {

  bool isRequestValid = await CheckRequestAsync(request, token);
  if (!isRequestValid) {
    throw new InvalidRequestException();
  }
  
  Geolocation geolocation;
  try {
    geolocation = await GetGeolocationAsync(request, token);
  } catch(Exception e) when (e is not OperationCanceledException) {
    logger.LogWarning(e, "...");
    geolocation = GetFallbackGeolocation(request);
  }

  Guid userId;
  try {
    userId = await GetUserIdAsync(request, token);
  } catch(Exception e) when (e is not OperationCanceledException) {
    logger.LogError(e, "...");
    userId = Guid.Empty;
  }

  // do stuff...
}
Applying the rule using the Result pattern

Let’s see what it would look like using the Result pattern. The implementation can vary: we can use if statements, switch statements or expressions, or even extension methods, depending on the context and what is most readable.
Here is an example using switch statements and some pattern matching:
async Task<Result> ProcessRequestAsync(Request request, CancellationToken token) {

  switch (await CheckRequestAsync(request, token)) {
    case { IsSuccessful: false } result:
      return result.ToResult();
    case { IsSuccessful: true, Value: false }:
      return Result.Fail(new InvalidRequestException());
  }

  Geolocation geolocation;
  switch (await GetGeolocationAsync(request, token)) {
    case { IsSuccessful: true } result:
      geolocation = result.Value;
      break;
    case { Exception: OperationCanceledException } result:
      return result.ToResult();
    default:
      //logger.LogWarning(result.Exception, "...");
      geolocation = GetFallbackGeolocation(request);
      break;
  }

  Guid userId;
  switch (await GetUserIdAsync(request, token)) {
    case { IsSuccessful: true } result:
      userId = result.Value;
      break;
    case { Exception: OperationCanceledException } result:
      return result.ToResult();
    default:
      //logger.LogWarning(result.Exception, "...");
      userId = Guid.Empty;
      break;
  }

  // do stuff...

  return Result.Success();
}
In this example, not a single exception is thrown, as every failure is carried by returned Result objects.
You can see how this example performs in a few scenarios compared to the try/throw/catch pattern in this benchmark.
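The extension-method style mentioned at the start of this section could look like the sketch below for the "fallback" cases. The `ValueOr` helper is hypothetical, `Console.WriteLine` stands in for the logger, and the `Result<T>` struct is repeated (with nullable annotations added) only to keep the example self-contained.

```csharp
#nullable enable
using System;

public readonly record struct Result<T>
{
    public bool IsSuccessful { get; init; }
    public T? Value { get; init; }
    public Exception? Exception { get; init; }
}

public static class ResultFallbackExtensions
{
    // Returns the value on success. On failure, reports the exception (if a callback
    // is given) and returns the fallback. Nothing is thrown in either case.
    public static T ValueOr<T>(this Result<T> result, T fallback, Action<Exception>? onFailure = null)
    {
        if (result.IsSuccessful)
            return result.Value!;

        if (result.Exception is not null)
            onFailure?.Invoke(result.Exception);
        return fallback;
    }
}

public static class Demo
{
    public static void Main()
    {
        var failed = new Result<Guid> { IsSuccessful = false, Exception = new TimeoutException("user service timed out") };

        Guid userId = failed.ValueOr(Guid.Empty, e => Console.WriteLine($"warning: {e.Message}"));
        Console.WriteLine(userId == Guid.Empty); // True
    }
}
```

A helper like this only fits failures that have a fallback. Failures that must stop the request, such as OperationCanceledException, still need an explicit branch that returns the failed Result to the caller.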