zah/Towards a unified error handling in Nim.md Secret

Towards a unified error handling in Nim

Recommended reading:
http://joeduffyblog.com/2016/02/07/the-error-model/
A common vocabulary for errors:

Before we start, let's establish some common vocabulary for talking about errors. You see, not all errors are created equal. We can broadly classify them in three categories:


Recoverable errors
The key about these errors is that they represent situations we have planned for. A user may have entered incorrect (or even malicious) data, a network connection may be interrupted, or an important data file may be missing. With such errors, the developer must decide what the policy of the software should be, because they are expected to arise even in a perfectly implemented program.


Detected bugs in the code
These errors represent situations that are never supposed to happen in the code - an out-of-bounds array access, a null pointer dereference, an invalid input value passed to a function. When such errors are detected, it's not clear how to handle them. The program must have entered an invalid state somehow, and proceeding further just obscures the root cause of the problem and creates a risk of secondary errors down the line. In a development environment, it's always best if the program fails fast and provides as much captured context to the developer as possible (e.g. a stack trace, a memory dump, etc). In software dealing with financial transactions, where correctness is the most important characteristic, restarting the entire process may still be the most appropriate response even in release builds.
We should note one exception: if the language provides strong enough memory-safety guarantees (as perhaps Nim does in some situations), and the started operation works on well-isolated state and has a goal whose failure can be clearly communicated to the user (e.g. unzipping a zip file failed, a request died with an internal server error, etc), then we can perhaps treat the detected bugs as something that the software as a whole should be able to recover from. We'll call this the Abort Task Scenario. (The bug should still be recorded with as much context as possible, though.)


Catastrophic system failures
These errors represent situations where it's simply not reasonable for us to recover - we have run out of memory, a hardware device appears to malfunction, etc. While these situations are not bugs, almost everything we've said about bugs applies here as well. The code should not attempt to handle the problem, and the more context information we are able to gather, the better.


How are errors handled anyway?

While there are many established practices for handling errors, we can easily generalize them in the following way:


Recoverable errors are signaled with either error results or exceptions. They are usually propagated only a few levels up the stack (typically just one), suitable error handling code is executed, and the program execution continues.


Non-recoverable errors lead to a process-level or task-level panic.
Allocated resources can be either automatically released by a process or task supervisor (the OS or the language run-time) or by "unwinding the stack" (running user-defined clean up code, destructors, finally blocks, etc).


One notable exception is APIs inspired by the condition/restart system in Lisp. Such APIs can be given an object that decides how an error should be handled before it's raised (e.g. a failure to load a data file may be handled by launching a file dialog for selecting a different file, all without raising an error).
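
The condition/restart idea can be approximated in Nim by passing a policy callback into the operation. The following is only a minimal sketch: `MissingFilePolicy`, `loadConfig` and the in-memory `files` table (standing in for the file system) are hypothetical names introduced for illustration.

```nim
import std/tables

type
  # Hypothetical policy: consulted *before* any error is raised.
  MissingFilePolicy = proc (path: string): string {.closure.}

proc loadConfig(files: Table[string, string], path: string,
                onMissing: MissingFilePolicy): string =
  ## `files` stands in for the file system to keep the sketch self-contained.
  var current = path
  while current notin files:
    # Ask the caller-supplied policy for a replacement path instead of raising.
    let replacement = onMissing(current)
    if replacement.len == 0:
      # The policy gave up, so the error is finally raised.
      raise newException(IOError, "no usable file for " & path)
    current = replacement
  result = files[current]

let fs = {"default.cfg": "verbose=off"}.toTable
# The policy here always falls back to a known-good file; a GUI program
# might open a file dialog at this point instead.
let cfg = loadConfig(fs, "user.cfg",
                     proc (path: string): string = "default.cfg")
```

The caller decides the recovery strategy, while the code that detects the problem stays in control of the retry loop.
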

If error handling can be summarized so easily, what are the differences between the various error handling practices? In the following paragraphs, we'll see that these practices disagree only on several specific design choices. Finally, we'll see how we can define a common framework for dealing with all of them in Nim in a unified and convenient way.
Let's welcome to the stage our contenders: abandonment (just terminating the process or task without running any user code), error codes, fancier error result types such as Option, Either, Result, etc, and checked and unchecked exceptions.

Are the error results part of the API signature?

This is true for error results and checked exceptions. When a new error result type is added, clients of the API must adapt their code. This was one of the reasons why checked exceptions didn't find their way into C#, for example, but it seems hard to argue that adding a new recoverable error shouldn't break the API contract.
Another complaint against checked exceptions is that they lead to an explosion of the possible error results when the error handling code is further distanced from the original source of the error, but the strong equivalence with error results points us to the right solution - remap the error types as they travel up the stack (exceptions may support this even better if they are allowed to form a chain).
Nim is well-positioned to address this requirement with enum error codes (which must be handled in an exhaustive case statement), with vararg generic types (that can express a multi-cased Either or ErrorResult type) or with its support for precise exception tracking.
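
As a small illustration of the enum option, here is a sketch in which the error code is part of the signature and the caller handles it with an exhaustive case statement. `ParseError`, `parseDigit` and `describe` are made-up names, not part of any proposed API.

```nim
type
  ParseError = enum
    peNone, peEmpty, peNotANumber

proc parseDigit(s: string, value: var int): ParseError =
  ## Parses a single decimal digit and reports failures through the result.
  if s.len == 0: return peEmpty
  if s.len != 1 or s[0] notin {'0'..'9'}: return peNotANumber
  value = ord(s[0]) - ord('0')
  peNone

proc describe(s: string): string =
  var v: int
  # A `case` over an enum must cover every value (or use `else`), so adding
  # a new member to ParseError makes this code fail to compile until the
  # new error is handled.
  case parseDigit(s, v)
  of peNone: "digit " & $v
  of peEmpty: "empty input"
  of peNotANumber: "not a digit"
```
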
Is there an implicit mechanism for propagating non-recoverable errors up the stack?

While we can make a strong case that the recoverable errors should be part of the API signature, this is much harder to argue about the non-recoverable errors. After all, these errors are not planned for and there shouldn't be any code concerned with a specific error type. Our only goal is to record as much context information as possible and then to safely release any obtained resources (either by unwinding the stack or by directly jumping to a place where a process or task supervisor will be able to restore everything to a normal state). Using error results seems inappropriate for achieving both of these goals. Unchecked exceptions and abandonment handle this better.
Is the client code required to handle the errors?

This is usually pointed out as a strong argument in favor of using error results. The robustness of the software is increased, because all error handling is clearly visible during code review. Because Nim rejects code that silently ignores a returned value (the value must be used or explicitly dropped with the discard keyword), the client code is required to deal with error results at the call-site. We'll see that by applying a little trick - wrapping the result in a distinct type - we can easily achieve the same for checked exceptions as well.
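
A short sketch of what this looks like at the call-site. `SendError`, `send`, `notify` and `fireAndForget` are hypothetical names used only for this example.

```nim
type
  SendError = enum
    SendOk, PeerDisconnected

proc send(msg: string): SendError =
  ## Toy stand-in for a network call: an empty message simulates a failure.
  if msg.len == 0: PeerDisconnected else: SendOk

proc notify(msg: string): string =
  # send(msg)   # would be rejected by the compiler: the SendError result
  #             # is neither used nor discarded
  case send(msg)
  of SendOk: result = "sent"
  of PeerDisconnected: result = "peer disconnected"

proc fireAndForget(msg: string) =
  # Ignoring the error is still possible, but only with an explicit
  # `discard` that stands out in code review.
  discard send(msg)
```
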
Are the error results rich objects?

This is true for exceptions and fancier error result types. Starting with an enum error code may make it harder to switch to a richer result type later.
Can we sub-divide the error types in categories?

Nim is well-positioned to express this in many ways - with enum set types, with a hierarchy of exception types or with arbitrary type traits.
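
Two of these options can be sketched as follows, with the same split between retryable and fatal failures expressed once as enum sets and once as an exception hierarchy. All type and proc names below are invented for the example.

```nim
type
  NetError = enum
    neTimeout, neRefused, neBadCertificate, neProtocolViolation

const
  # Categories expressed as sets of enum values.
  retryableErrors = {neTimeout, neRefused}
  fatalErrors = {neBadCertificate, neProtocolViolation}

type
  # The same categories expressed as an exception hierarchy.
  NetworkError = object of CatchableError
  RetryableError = object of NetworkError
  TimeoutError = object of RetryableError

proc shouldRetry(e: NetError): bool = e in retryableErrors

proc classify(action: proc ()): string =
  try:
    action()
    result = "ok"
  except RetryableError:
    # Matches RetryableError and everything derived from it.
    result = "retry"
  except NetworkError:
    result = "give up"

proc slowPeer() = raise newException(TimeoutError, "no answer")
proc brokenPeer() = raise newException(NetworkError, "bad handshake")
proc goodPeer() = discard
```
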
What are the run-time costs of the success and failure paths?

This is another strong reason why checked exceptions are unpopular. Raising an exception is inherently costly because usually at least a stack trace is collected. Modern compilers try hard to make the success path as fast as possible (faster than checking error codes), but this comes at the expense of the failure path. All of this forces us to use error results if the recoverable errors are frequent enough.
Can we compose our functions in a concise and natural way?

Error codes prevent us from chaining our expressions in a natural way (foo(bar()).baz()). Monadic values such as Option or Either are able to handle this by introducing some run-time cost. Exceptions offer the best performance when the errors are rare.
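
To make the monadic variant concrete, here is a sketch that chains fallible steps using `Option` from the standard `std/options` module. The procs `lookup`, `parsePort`, `label` and `describePort` are hypothetical.

```nim
import std/[options, strutils]

proc lookup(env: seq[(string, string)], key: string): Option[string] =
  for pair in env:
    if pair[0] == key:
      return some(pair[1])
  # Falling through leaves `result` as none(string).

proc parsePort(s: string): Option[int] =
  try:
    let n = parseInt(s)
    if n in 1..65535:
      result = some(n)
  except ValueError:
    discard  # `result` stays none(int)

proc label(port: int): string = "port " & $port

proc describePort(env: seq[(string, string)]): string =
  # Each step runs only if the previous one produced a value; the first
  # failure short-circuits the rest of the chain.
  env.lookup("PORT").flatMap(parsePort).map(label).get("default port")
```

The chain reads naturally, but every step pays for wrapping and checking the `Option`, which is the run-time cost mentioned above.
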
One scheme to rule them all

On a fundamental level, all error handling schemes offer the same thing - you try to execute a certain operation and this will either result in a success or in a particular type of failure. The various schemes are nothing but specific run-time mechanisms and calling conventions for communicating the error results. Can we devise a single syntax able to work with all schemes while implementing the following requirements?

All recoverable errors will be handled, without an easy way to miss them.
All unrecoverable errors will be automatically propagated up the stack with minimum fuss in the code.

As it turns out, in Nim, the answer is Yes!
The handleErrors construct:

handleErrors is similar to a try expression or a try statement:
var res = handleErrors foo(bar())
          except RecoverableError1, RecoverableError2: alternativeValue()
handleErrors:
  peer.sendMessage(...)
except PeerDisconnected as e:
  info "peer disconnected", ip = e.ip
  attemptReconnect()
The difference is that with handleErrors you are not merely trying, you promise to handle all the possible recoverable errors! (unless you decide to use an else clause, just like in a case statement). Please note that the name handleErrors was chosen just to make this proposal easier to understand. The final name of this construct may be something shorter such as rescue or check, which may be further abbreviated to chk or just ch (the except clauses can then use of instead).
So, what are the recoverable errors? handleErrors adapts its behavior depending on the result type of the expression given to it. It can recognize various calling schemes such as:

proc foo(): (Error, Result)
proc foo(output: var Result): Error
proc foo(): Result {.raises: [Error].}
proc foo(): Either[Result, Error]
... and so on

It treats the values of enums just like error types (so our first guideline is to name your error codes with the same naming convention you use for exceptions). It knows how to extract the result value regardless of the scheme, and this is the value returned by the handleErrors expression. It may even know how to turn chained calls involving procs with error codes into a block of code checking each invocation in turn.
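
For reference, the four calling schemes can be written out as concrete Nim procs that all report the same failures. This is only a sketch: `ParseErr`, `Either` and the `parse*` procs are made up for the example, and overflow handling is left out.

```nim
import std/strutils

type
  ParseErr = enum
    NoError, EmptyInput, NotANumber
  # A minimal Either-like type for the sketch.
  Either[R, E] = object
    case ok: bool
    of true: value: R
    of false: error: E

proc check(s: string): ParseErr =
  if s.len == 0: EmptyInput
  elif not s.allCharsInSet(Digits): NotANumber
  else: NoError

# 1. (Error, Result) tuple
proc parseTuple(s: string): (ParseErr, int) =
  let e = check(s)
  if e == NoError: (e, parseInt(s)) else: (e, 0)

# 2. Error code plus a `var` output parameter
proc parseOut(s: string, output: var int): ParseErr =
  result = check(s)
  if result == NoError: output = parseInt(s)

# 3. Exception listed in the raises pragma
proc parseRaising(s: string): int {.raises: [ValueError].} =
  parseInt(s)

# 4. Either-like result
proc parseEither(s: string): Either[int, ParseErr] =
  let e = check(s)
  if e == NoError: Either[int, ParseErr](ok: true, value: parseInt(s))
  else: Either[int, ParseErr](ok: false, error: e)
```

A construct such as handleErrors would have to recognize each of these shapes and extract the `int` from all of them in the same way.
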
Ultimately, handleErrors should allow you to carry out certain refactorings without changing the client code:

Turning an error code into a richer error result
Switching between error results and checked exceptions

What about the non-recoverable errors?

These should be mostly failed asserts and a few specific Nim exceptions such as IndexError, FieldError, OutOfMemError, etc. To get these through the exception tracking mechanism of Nim, all of our raises lists will implicitly feature these exceptions by default. handleErrors won't require handling them. We may also introduce a compile-time option in Nim for controlling how certain exceptions are handled before they are raised.
How are the recoverable exceptions specified in the proc signature?

We can introduce a custom pragma for this:
proc p(): Result {.errors: FooError, BarError.}
The Nim exception tracking mechanism will guarantee that no other exceptions may be raised by p(). To enforce the checking of the errors at the call-site, the custom pragma can wrap the result type in a distinct type that will be useless at the call-site unless unpacked with the handleErrors construct.
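
The distinct-type trick can be sketched by hand, without the proposed pragma. Everything below is an assumption made for illustration: `UncheckedInt` plays the role of the wrapper the pragma would generate, and `unpack` stands in for what handleErrors would do once the except clauses have been checked.

```nim
import std/strutils

type
  # Hypothetical wrapper that an `.errors.` pragma could generate.
  UncheckedInt = distinct int

proc parsePort(s: string): UncheckedInt {.raises: [ValueError].} =
  # The raises list is verified by Nim's exception tracking.
  UncheckedInt(parseInt(s))

# A distinct type inherits no operations from its base type, so the wrapped
# value is useless until it is unpacked.
doAssert not compiles(parsePort("80") + 1)

template unpack(x: UncheckedInt): int =
  ## Stand-in for the unwrapping done by handleErrors.
  int(x)

proc portOrDefault(s: string): int =
  try:
    result = unpack(parsePort(s))
  except ValueError:
    result = 8080
```
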
But isn't this just turning exceptions into error codes? What are the advantages?

The main advantage of the handleErrors construct is that it still supports chaining your calls in a natural way and potentially having larger blocks of code with simplified error handling. As suggested, eventually we may teach handleErrors how to do the same trick even for procs using error codes.
Why do we need a separate .errors: pragma? Why don't we use .raises: directly?

The problem is that the final raises list must also include all possible non-recoverable errors. The handleErrors construct should not require you to handle those. They should be implicitly added to all raises lists and implicitly propagated in handleErrors.

But exceptions just don't work properly in async code. We have to stick to error codes.

Yuriy is working on fixing this:
https://github.com/yglukhov/Nim/commits/yield-in-try
How do we discriminate between error enums and regular enum results?

There are many ways to achieve this in Nim. We can use a pragma attached to the type or a simple type trait such as:
template isErrorType(x: type MyError): bool = true
A concept may use the above as a predicate.
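
One possible way to consume such a trait, sketched with `when compiles` rather than a concept. The helper `isError`, the two enums and `describe` are hypothetical names.

```nim
type
  FileError = enum
    FileNotFound, PermissionDenied
  Color = enum
    Red, Green

# Opt-in marker as in the text: only FileError gets an overload.
template isErrorType(x: type FileError): bool = true

# Hypothetical helper: yields false for any type without a marker overload.
template isError(T: typedesc): bool =
  when compiles(isErrorType(T)): isErrorType(T) else: false

proc describe[T: enum](x: T): string =
  # The branch is selected at compile time for each instantiation.
  when isError(T): "error: " & $x
  else: "value: " & $x
```
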
Are there any other features?

It may be useful to introduce an additional construct similar to handleErrors, but wrapping the result in an Either type:
var r = wrapErrors foo(bar())
if r.successful:
  echo r.value
else:
  echo r.error
It may also be useful to introduce some short-cuts for remapping the error types with the same syntax regardless of the combination of calling schemes being used.
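
A bare-bones approximation of wrapErrors for the exception-based scheme only could look like the sketch below. The `Wrapped` type is an assumption, and it keeps just the error message, whereas a real design would probably preserve the error value itself.

```nim
import std/strutils

type
  Wrapped[T] = object
    successful: bool
    value: T
    error: string  # simplification: only the message is kept

template wrapErrors(body: untyped): untyped =
  ## Evaluates `body` and captures any catchable exception instead of
  ## letting it propagate.
  var r: Wrapped[typeof(body)]
  try:
    r.value = body
    r.successful = true
  except CatchableError:
    r.error = getCurrentExceptionMsg()
  r

let good = wrapErrors(parseInt("42"))
let bad = wrapErrors(parseInt("forty-two"))
```
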
Isn't all of this too complicated to implement?

It may be easy to start with a bare-bones version of handleErrors. We can expand its capabilities over time. The exception tracking mechanism of Nim is not fully developed and tested, so some issues are to be expected.
handleErrors is too long of a name. I'd prefer something shorter.

Suggestions are welcome. One alternative is rescue/except (stolen from Ruby) or the silly er/of.
Appendix: aborting tasks and the need for a resource supervisor

In some situations you may be tempted to treat a detected bug as a recoverable error. You may argue that the overall robustness of the software will be increased if certain errors are just reported to the user and the app execution continues as normal. Such situations can be handled in several ways:

If there is a complicated task that has a clearly defined end goal, it may be delegated to an external process. If the process fails due to a bug, the error can be reported to the user.
The UI process may be separated from a back-end process that may be restarted at will.
A clean restart may be cheap if we frequently save the important state in a persistent storage.

When none of these is an option, there are two possible strategies:


Install a top-level handler for all non-recoverable exceptions in the context where the fallible task starts and let the stack unwinding attempt to free the resources. It's better if all of the state associated with the bug can be destroyed at this point, because it's notoriously hard to write stack unwinding code leaving the program in a normal state (without any half committed transactions).


Alternatively, have the fallible task allocate resources only with certain APIs that register the resources in a place where they can be freed if a failure is detected (Nim's per-thread GC heap is an example of this).
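
The two strategies can be combined in a small sketch: a top-level handler around the task plus a registry of cleanup actions that is emptied whether or not the task fails. `TaskScope`, `register`, `releaseAll` and `runTask` are hypothetical names, and whether bug-type exceptions can be caught at all depends on the compiler version and settings.

```nim
type
  Cleanup = proc () {.closure.}
  TaskScope = object
    cleanups: seq[Cleanup]

proc register(scope: var TaskScope, cleanup: Cleanup) =
  scope.cleanups.add cleanup

proc releaseAll(scope: var TaskScope) =
  # Release in reverse order of acquisition.
  for i in countdown(scope.cleanups.high, 0):
    scope.cleanups[i]()
  scope.cleanups.setLen(0)

proc runTask(task: proc (scope: var TaskScope)): bool =
  ## Top-level handler: returns false if the task was aborted.
  var scope: TaskScope
  try:
    task(scope)
    result = true
  except Exception:
    # A real handler would record as much context as possible here.
    result = false
  finally:
    scope.releaseAll()

var events: seq[string]

proc failingTask(scope: var TaskScope) =
  events.add("open A")
  scope.register(proc () = events.add("close A"))
  events.add("open B")
  scope.register(proc () = events.add("close B"))
  raise newException(ValueError, "bug detected")

let completed = runTask(failingTask)
```
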


Comments


missing raise tracking
Jacek: This proposal contains a construct for handling errors - missing is the other side of the coin where the compiler enforces correct {.error.} annotations - without these, the programmer is incentivised to skip error annotations altogether, either by mistake, omission or laziness
Zahary: Any proposal can be invalidated by imagining a lazy or incompetent programmer (i.e. the programmer can forget to insert the checks that detect the error conditions in the first place). What will follow this initial proposal is a specific set of guidelines and design principles that will be shared by the team and enforced in code reviews - there is no substitute for that.


use of experimental / untested nim features
Jacek: The idea to use yet another underdeveloped Nim feature that's not used anywhere else looks like it may set the project back by weeks - we've tried this with async, static, C++ support, yield-in-try and several other features - ttmath/C++ alone cost us several weeks. Yet, the Nim code base contains thousands of examples/unit tests for more simple features making them more well tested and well used - is there any possibility to stick with these? Are we prepared to make that investment now, given our desire to get something working out by end of June?
Zahary: It's important that we agree on our long-term vision for the project. Error handling remains the last big question mark in the design principles and the lack of guidelines so far results in everyone adopting a different set of practices. We have many months before our code will be in the hands of actual users and just like static typing can be gradually introduced in a project in order to find some latent bugs, the compiler-assisted analysis described here can be gradually introduced in a code-base to increase its robustness. Not using the best tool for the job is obviously not a winning strategy in the long-term.


explicit propagation for recoverable errors
Jacek: We're in full control of the source code, and do not have to worry about breaking API of the libraries we develop (except for the inconvenience caused by using multiple git repos and broken version/dependency handling) - it is however of paramount importance that we write robust software - we can increase robustness with a policy where recoverable errors are explicitly handled with discard-like visibility, and the compiler enforces their correct specification, both on the raising side and on the handling side - given the analysis that recoverable errors typically are handled at propagation depths, it makes sense to make each step explicit, per function (so allow chaining, but not implicit propagation across function borders) - this would need compiler checking and enforcement of correct specs
Zahary: I tried hard to understand this comment, but I wasn't able to grasp what the point was. As far as I can tell, there is no disagreement with anything that I've outlined in the proposal.


tuples
Jacek: (Error, Result) tuples should probably be avoided - given the ease of creating an Either-like type, this signature is redundant and promotes smelly code
Zahary: I've included all these different styles of APIs in my examples, because such APIs may be provided by third-party packages that we don't control.


strong enough memory-safety guarantees
Jacek: This is not Nim, in its default release mode compile - if it's not default, it will eventually be forgotten - would not make any such assumptions. same goes for overflow checking etc, unless we disallow the use of -d:release and handle optimizations with some other, nimbus-specific flag that explicitly must be added - as a side note, some of this could be improved at lower cost with more modern protection features like the various sanitizers in clang/gcc - don't remember seeing any support for these in nim however (??), or analysis of which of them make sense for nim
Zahary: Nim supports controlling all of the safety checks individually. It's perfectly fine to leave most of the checks in release too and this is probably something that we'll end up doing.


fail-fast
Jacek: happy with this for non-recoverable - the alternative is to write strictly transactional software where each operation/task can be rolled back completely - hard without the right language primitives / library support


signatures
Jacek: as a general note, making non-recoverable errors part of the signature is not feasible in Nim, given how common compile options change code semantics / which non-recoverable errors are raised - eg debug vs release
Zahary: I have explicitly stated that making the non-recoverable errors part of the signature is not a good idea. And actually, this would be one of my critiques towards some specific current APIs based on error codes. Quite often, we communicate non-recoverable errors with error codes.


api error handling signatures
Jacek: While being good for legacy code, the availability of choice in which signature to use to signal the errors of an api (Either, {.error.} or var output) leaves us without one obvious way to encode which recoverable errors an API raises - it would decrease complexity to unify behind one API style in all our code - if they are handled in a uniform way on the catching side, there's no inherent advantage to having a choice on the raising side. an additional, uglier macro to handle the smelly styles (so handleErrors convert f() for the unsanctioned API styles) could promote this over time
Zahary: I've discussed the trade-offs on the raising side, but perhaps not in enough detail. Besides the performance trade-off which is mentioned in some of the paragraphs, exceptions sometimes help with simplifying the code in the "private" portions of a module. If your public API has well-defined errors, you are allowed some leeway in all the helper procs you introduce to get the job done. I think some good examples of this are the RLP libraries and Ryan's JSON-RPC marshaling code. The same code using error codes would have been clumsier to implement and less performant.


refactoring error handling
Jacek: in a typed language with good unit tests, refactoring is cheap - wouldn't worry too much about which style is chosen, as long as it's enforced by the compiler at every step