Error Handling in Rust

A guide to error handling in Rust. The first half introduces Rust's language features and libraries for error handling. The second half should help you make good error handling code in your Rust programs.

Rust errors

Error handling gets special treatment in the language because when reading and writing code it is convenient to understand the happy path and the error path separately. Since errors are exceptional, we usually want to read code assuming errors aren't occurring, so making error handling very low-overhead is useful. Furthermore, by supporting error handling in the language and standard library, different libraries can handle errors without requiring loads of boilerplate for translating errors at library boundaries.

Rust has two ways to represent errors: using the Result type and by panicking. The former is much more common and usually preferred.

The Result type is a regular enum with Ok(T) and Err(E) variants, the former signalling normal program behaviour and the latter signalling an error. Both variants contain regular Rust values and the programmer can freely choose the types for both. There is essentially no special mechanism here. A function returns just a regular value which can indicate either success or an error; the callers of a function must check which. Propagating an error simply means returning an error when an error is found. Rust does provide some special constructs to make this easy.

Panicking is a special mechanism for 'immediately' stopping progress in a controlled manner. It is triggered by macros like panic! and functions like unwrap. It can also be triggered implicitly, e.g., when arithmetic overflows in a debug build. Panicking is usually not handled by the programmer and terminates the current thread.

The facilities for errors in the language and standard library are incomplete. There are several crates which can help with error handling and you'll probably want to use one.

Result and Error

Covers the Result type and using it for error handling, the ? operator, the Error trait, the Try trait, and other parts of Rust's machinery for handling errors as values.

Panic

Covers panicking, the panic macros, and panicking functions.

Non-Rust errors

How to deal with errors which originate outside of Rust, primarily when interoperating with other languages.

Testing

How to test code which uses Result or panics.

Result and Error

Using a Result is the primary way of handling errors in Rust. A Result is a generic enum in the standard library with two variants, Ok and Err, which indicate correct execution and incorrect execution, respectively. Both Result and its two variants are in the prelude, so you don't need to explicitly import them and you don't have to write Result::Ok as you would for most enums.

Both variants take a single argument which can be any type. You'll usually see Result<T, E> where T is the type of value returned in Ok and E is the error type returned in Err. It's fairly common for modules to use a single error type for the whole module and to define an alias like pub type Result<T> = std::result::Result<T, MyError>;. In that case, you'll see Result<T> as the type, where MyError is an implicit error type.
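As a minimal sketch of the alias pattern described above (MyError's variants and the parse_port function are hypothetical names invented for illustration):

```rust
// Hypothetical module-wide error type, invented for this sketch.
#[derive(Debug, PartialEq)]
pub enum MyError {
    Empty,
    NotANumber(String),
}

// Module-local alias: signatures say `Result<T>` and `MyError` is implied.
pub type Result<T> = std::result::Result<T, MyError>;

// Parses a port number, reporting failures with the module's error type.
pub fn parse_port(s: &str) -> Result<u16> {
    if s.is_empty() {
        return Err(MyError::Empty);
    }
    s.parse::<u16>()
        .map_err(|_| MyError::NotANumber(s.to_string()))
}
```

With this in place, parse_port("8080") evaluates to Ok(8080) and parse_port("") to Err(MyError::Empty). Note that the alias shadows the prelude's Result within the module, which is why the full path std::result::Result is used on its right-hand side.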

Creating a Result value is as simple as using the variant as a constructor, just like other enums. E.g., for a Result<i32, i32> (a Result with an i32 payload and an i32 error code), you would use Ok(42) and Err(2) to create Result values.

When receiving a Result object, you can address the ok and error cases by using a match expression, e.g.,

fn foo(r: Result<String, MyError>) {
    match r {
        Ok(s) => println!("foo got a string: {s}"),
        Err(e) => println!("an error occurred: {e}"),
    }
}

There are more ergonomic ways to handle Results too. There are a whole load of combinator methods on Result. See the docs for details. There are way too many to cover here, but as an example, map takes a function and applies it to the payload of the result if the result is Ok and does nothing if it is Err, e.g.,

fn foo(r: Result<i32, MyError>) -> Result<String, MyError> {
    r.map(|i| i.to_string())
}

The ? operator

Applying ? to a result will either unwrap the payload, if the result is an Ok, or immediately return the error if it is an Err. E.g.,

fn foo(r: Result<i32, MyError>) -> Result<String, MyError> {
    let i = r?; // Unwraps Ok, returns an Err
    Ok(i.to_string())
}

The above code does the same as the previous example. In this case, the map code is more idiomatic, but if we had a lot of work to do with i, then using ? would be better. Note that the type of i is i32 in both examples.

The ? operator can be used with other types too, notably Option. You can also use it with your own types, see the discussion of the Try trait, below.

? does not have to just return the error type directly. It can convert the error type into another by calling From::from. So if you have two error types: Error1 and Error2 and Error2 implements From<Error1>, then the following will work (note the different error types in the signature):

fn foo(r: Result<i32, Error1>) -> Result<String, Error2> {
    let i = r?; // On Err, converts the Error1 into an Error2 via From::from
    // ...
    Ok(i.to_string())
}

This implicit conversion is relied on for several patterns of error handling we'll see in later chapters.
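To make the conversion concrete, here is a small self-contained sketch. ConfigError and double are hypothetical names; the standard library's ParseIntError plays the role of Error1 and ConfigError the role of Error2.

```rust
use std::num::ParseIntError;

// Hypothetical higher-level error type (the `Error2` of the text above).
#[derive(Debug)]
enum ConfigError {
    BadNumber(ParseIntError),
}

// This impl is what allows `?` to convert a ParseIntError automatically.
impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> Self {
        ConfigError::BadNumber(e)
    }
}

fn double(s: &str) -> Result<i32, ConfigError> {
    // `parse` returns Result<i32, ParseIntError>; on Err, `?` calls
    // `From::from` to build a ConfigError and returns it.
    let n: i32 = s.trim().parse()?;
    Ok(n * 2)
}
```

Calling double("21") yields Ok(42), while double("x") yields an Err holding ConfigError::BadNumber.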

Try blocks

Try blocks are an unstable language feature (#![feature(try_blocks)]). The syntax is try { ... } where the block contains code like any other block. The difference compared to a regular block, is that a try block introduces a scope for the question mark operator. If you use ? inside a try block, it will propagate the error to the result of the block immediately, rather than returning the error from the current function.

For example,

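#![feature(try_blocks)] // nightly only

use std::num::ParseIntError;

fn main() {
    let result: Result<i32, ParseIntError> = try {
        "1".parse::<i32>()?
            + "foo".parse::<i32>()?
            + "3".parse::<i32>()?
    };
    // `result` now holds the error from parsing "foo".
}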

At runtime, execution will stop after the second parse call and the value of result will be the error from that call. Execution will continue after the let statement, without returning.
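Since try blocks are unstable, it may help to see a rough stable approximation. The sketch below swaps in a different technique, an immediately invoked closure, to give ? its own scope; it is not how try blocks are implemented, and parse_sum is a hypothetical name.

```rust
use std::num::ParseIntError;

fn parse_sum() -> (Result<i32, ParseIntError>, &'static str) {
    // The closure gives `?` its own scope: an error returns from the
    // closure, not from `parse_sum`.
    let result = (|| -> Result<i32, ParseIntError> {
        Ok("1".parse::<i32>()? + "foo".parse::<i32>()? + "3".parse::<i32>()?)
    })();

    // Execution continues here even though the second parse failed.
    (result, "still running")
}
```

As with the try block, the third parse is never attempted, and the function carries on after the let statement.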

Option

Option is similar to Result in that it is a very common, two-variant enum type with a payload-carrying variant, Some (compare with Result::Ok). The difference is that Option::None (compare with Result::Err) does not carry a payload. It is, therefore, analogous to Result<T, ()> and there are many methods for converting between Option and Result. ? works with Option just like Result.

The intended meaning of the types, however, is different. Result represents a value which has either been correctly computed or where an error occurred in its computation. Option represents a value which may or may not be present. Generally, Option should not be used for error handling. You may use or come across types like Result<Option<i32>, MyError>. This type might be returned where the computation is fallible, and if it succeeds it will return either an i32 or nothing (but importantly, returning nothing is not an error).
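The following sketch illustrates both ideas under some assumptions: a string-to-string map stands in for a data source, and lookup and require are hypothetical helpers. lookup treats a missing key as a normal outcome, while require converts the Option into a Result because it considers absence an error.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;

// A missing key is not an error (Ok(None)), but a malformed value is (Err).
fn lookup(map: &HashMap<&str, &str>, key: &str) -> Result<Option<i32>, ParseIntError> {
    match map.get(key) {
        None => Ok(None),
        Some(raw) => Ok(Some(raw.parse::<i32>()?)),
    }
}

// Here absence *is* an error, so the Option is converted with `ok_or_else`.
fn require(map: &HashMap<&str, &str>, key: &str) -> Result<i32, String> {
    let raw = map
        .get(key)
        .ok_or_else(|| format!("missing key: {}", key))?;
    raw.parse::<i32>().map_err(|e| e.to_string())
}
```

Going the other way, Result::ok discards the error and produces an Option.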

The Error trait

The error trait, std::error::Error, is a trait for error types. Implementing it is not a hard requirement: you can use a type in a Result and with ? without implementing Error. However, it has some useful functionality, and it means you can use dynamic error handling using dyn Error types. Generally you should implement Error for your error types (there are no required methods, so doing so is easy); most error libraries will do this for you.

Provided functionality

Display is a super-trait of Error which means that you can always convert an error into a user-facing string. Most error libraries let you derive the Display impl using an attribute and a custom format string.

The Error trait has an experimental mechanism for attaching and retrieving arbitrary data to/from errors. This mechanism is type-driven: you provide and request data based on its type. You use request methods (e.g., request_ref) to request data and the provide method to provide data, either by value or by reference. You can use this mechanism to attach extra context to your errors in a uniform way.

A common usage of this mechanism is for backtraces. A backtrace is a record of the call stack when an error occurs. It has type Backtrace which has various methods for iterating over the backtrace and capturing the backtrace when an error occurs. To get a backtrace, use the Error::request_ref method, e.g., if let Some(trace) = err.request_ref::<Backtrace>() { ... }.

For an error to support backtraces, it must capture a backtrace when the error occurs and store it as a field. The error must then override the Error::provide method to provide the backtrace if it is requested. For more details on this mechanism see the docs of the Provider trait which is used behind the scenes to implement it. (Note that there used to be a specific method for getting a backtrace from an error and this has been replaced by the generic mechanism described here).

Error::source is a way to access a lower-level cause of an error. For example, if your error type is an enum MyError with a variant Io(std::io::Error), then you could implement source to return the nested io::Error. With deep nesting, you can imagine a chain of these source errors, and Error provides a sources method [1] to get an iterator over this chain of source errors.

[1] This and some other methods are implemented on dyn Error, rather than in the trait itself. That makes these methods usable on trait objects (which wouldn't be possible otherwise due to generics, etc.), but means they are only usable on trait objects. That reflects the expected use of these methods with the dynamic style of error handling (see the following section).
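As a sketch of implementing Display, Error, and source by hand using only stable features: the Empty variant and the chain_messages helper are hypothetical, and the helper walks the chain with a plain loop over source rather than relying on the sources method.

```rust
use std::error::Error;
use std::fmt;
use std::io;

#[derive(Debug)]
enum MyError {
    Io(io::Error),
    Empty, // hypothetical second variant with no underlying cause
}

// Display is a super-trait of Error, so it must be implemented too.
impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MyError::Io(_) => write!(f, "I/O failure"),
            MyError::Empty => write!(f, "input was empty"),
        }
    }
}

impl Error for MyError {
    // Expose the nested io::Error as the lower-level cause.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            MyError::Io(e) => Some(e),
            MyError::Empty => None,
        }
    }
}

// Collects the message of an error and of every error in its source chain.
fn chain_messages(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut out = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        out.push(cause.to_string());
        current = cause.source();
    }
    out
}
```

For a MyError::Io wrapping an io::Error created with the message "no such file", chain_messages returns the two messages "I/O failure" and "no such file", outermost first.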

Dynamic error handling

When using Result you can specify the concrete error type, or you can use a trait object, e.g., Result<T, Box<dyn Error>>. We'll talk about the pros and cons of these approaches in later chapters on designing error handling; for now we'll just explain how it works.

To make this work, you implement Error for your concrete error types, and ensure they don't have any borrowed data (i.e., they have a 'static bound). For ease of use, you'll want to provide a constructor which returns the abstract type (e.g., Box<dyn Error>) rather than the concrete type. Creating and propagating errors works the same way as using concrete error types.

Handling errors might be possible using the abstract type only (using the Display impl, source and sources methods, and any other context), or you can downcast the error trait object to a concrete type (using one of the downcast methods). Usually, there are many possibilities for the concrete type of the error. You can either try downcasting to each possible type (the methods return Option or Result, which facilitates this) or use the is method to test for the concrete type. This technique is an alternative to the common match destructuring of concrete errors.
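A minimal sketch of the dynamic style, assuming a hypothetical NotFound error type and find function. Errors are created as Box<dyn Error> and later inspected with downcast_ref.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical concrete error type; it owns its data, so it is 'static.
#[derive(Debug)]
struct NotFound(String);

impl fmt::Display for NotFound {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} not found", self.0)
    }
}

impl Error for NotFound {}

fn find(name: &str) -> Result<u32, Box<dyn Error>> {
    if name.is_empty() {
        // String slices convert into Box<dyn Error> via a From impl in std.
        return Err("empty name".into());
    }
    if name == "answer" {
        return Ok(42);
    }
    Err(Box::new(NotFound(name.to_string())))
}

// Handles an abstract error, downcasting where the concrete type matters.
fn describe(err: &(dyn Error + 'static)) -> String {
    if let Some(NotFound(name)) = err.downcast_ref::<NotFound>() {
        format!("missing: {}", name)
    } else {
        format!("other: {}", err)
    }
}
```

A caller holding a Box<dyn Error> named e can pass &*e to describe, or call e.is::<NotFound>() when only a yes/no answer about the concrete type is needed.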

Evolution of the Error trait

Error handling in general, and the Error trait in particular, have been evolving for a long time and are still in flux. Much of what is described above is nightly only; many unstable features have changed and several stable ones have been deprecated. If you're targeting stable Rust, it is best to mostly avoid using the Error trait and instead use an ecosystem alternative like Anyhow. You should still implement Error though, since the trait itself is stable and this lets users of your code choose the dynamic path if they want. You should also be conscious when reading docs/StackOverflow/blog posts that things may have changed.

The Try trait

The ? operator and try blocks, and their semantics, are not hard-wired to Result and Option. They are tied to the Try trait, which means they can work with any type, including a custom alternative to Result. We don't recommend using your own custom Result; we have in fact never found a good use case (even when implementing an error handling library or in other advanced cases) and it will make your code much less compatible with the ecosystem. Probably, the reason you might want to implement the Try trait is for non-error-related types for which you want to support ergonomic short-circuiting behaviour using ? (e.g., Poll; although because of the implied semantics of error handling around ?, this might also be a bad idea). Anyway, it's a fairly complex trait and we're not going to dive into it here; see the docs if you're interested.

Deprecated stuff

There is a deprecated macro, try, which does basically the same thing as ?. It only works with Result and in editions since 2018, you have to use r#try syntax to name it. It is deprecated and there is no reason to use it rather than ?; you might come across it in old code though.

There are some deprecated methods on the Error trait. Again, there is no reason to use these, but you might see them in older code. Error::description has been replaced by using a Display impl. Error::cause has been replaced by Error::source, which has an additional 'static bound on the return type (which allows it to be downcast to a concrete error type).

Panic

Panicking is a mechanism for crashing a thread in an orderly way. Unlike Result-based errors or exceptions in other languages, panics are not intended to be caught or handled.

By default, panicking terminates the current thread by unwinding the stack, executing all destructors as it goes. This means that the program can be left in a consistent state and the rest of the program can carry on executing.
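The sketch below shows this behaviour in the default unwind-on-panic configuration: a spawned thread panics, and the spawning thread observes the panic through join and carries on. run_worker is a hypothetical helper.

```rust
use std::thread;

// Runs a worker thread and reports either its result or its panic message.
fn run_worker(should_panic: bool) -> Result<i32, String> {
    let handle = thread::spawn(move || {
        if should_panic {
            panic!("worker failed");
        }
        7
    });

    // `join` returns Err carrying the panic payload if the thread panicked.
    handle.join().map_err(|payload| {
        payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .unwrap_or_else(|| "unknown panic".to_string())
    })
}
```

The default panic hook still prints the panic message to stderr, but only the worker thread is terminated; the caller gets an ordinary Result back.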

You can also configure your program (by adding panic = 'abort' to the appropriate profile in your Cargo.toml) to abort on panic. In this case, the whole program exits immediately when a panic occurs. Using abort-on-panic will make your program slightly more performant (because stack unwinding doesn't need to be considered, similar to using C++ without exceptions). However, it can make your program less robust (a single thread cannot crash without crashing the whole program) and it means destructors won't be run on panic.

Even in the default unwind-on-panic configuration, causing a panic while the thread is already panicking will cause the whole program to abort. You must therefore be very careful that destructors cannot panic under any circumstance. You can check if the current thread is panicking by using the std::thread::panicking function.

When a panic occurs, a panic hook function is called. By default, this prints a message and possibly a backtrace to stderr, but it can be customised. See the docs for std::panic for more information. The panic hook is called whether the panic will unwind or abort.

In a no-std crate, you'll need to set your own panic handler. Use the #[panic_handler] attribute and see the docs for core::panicking for more info.

For more details on how panicking is implemented, see this blog post; for more on if and when you should panic (particularly using unwrap), see this one.

Triggering panics

Panics can be triggered in all sorts of ways. The most obvious way is using the panic! macro, which unconditionally starts a panic when it is encountered. The todo!, unimplemented!, and unreachable! macros all do the same thing, but with different implications for human readers (the unreachable_unchecked function does not panic; it is an indication to the compiler that its location is never reachable, and encountering it is undefined behaviour). panic_any is a function similar to the panic! macro which lets you panic with an arbitrary object, rather than a string message.

The assert and debug_assert macros panic if their condition is not met. E.g., debug_assert_eq panics (only in debug builds) if its arguments are not equal.

The functions unwrap and expect on Result or Option panic if the receiver is Err or None (respectively). They can be thought of as a bridge between the worlds of errors using Result and panicking.

There are many places where panicking is implicit. In debug builds, arithmetic overflow/underflow causes a panic. Indexing (e.g., some_slice[42]) out of bounds (or for a key which does not exist) causes a panic. In general, since panicking is not part of the type system or annotated on function signatures, you must assume that any function call might cause a panic, unless you can prove otherwise.

Programming with panics

You should not use panicking for general error handling. If you follow this advice, then most programmers won't have to worry too much about panics; in general, it just works. Specifically, you should only use panics for things which (in theory) can't happen. That is, every time a panic does happen, it is a bug in the program. You'll need to take some care to avoid unwrapping things unless you're really sure it's impossible to be Err/None. Likewise, if you're unsure that an index or key is in-bounds, use get rather than indexing.
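For instance, a function which cannot be sure of its inputs might avoid the implicit panics like this (element_sum is a hypothetical example; checked_add is shown as one way to sidestep overflow panics):

```rust
// Adds two elements of a slice without any implicit panics.
fn element_sum(values: &[u32], i: usize, j: usize) -> Option<u32> {
    // `get` returns None rather than panicking when the index is out of bounds.
    let a = values.get(i)?;
    let b = values.get(j)?;
    // `checked_add` returns None on overflow instead of panicking (debug
    // builds) or wrapping (release builds).
    a.checked_add(*b)
}
```

Here element_sum(&[1, 2, 3], 0, 2) is Some(4), while an out-of-range index or an overflowing sum gives None, which the caller can turn into a proper error.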

One thing you might have to consider is the state of other threads. Panicking terminates a thread so you don't need to worry about data which is only referenced from the panicked thread. But data which is referenced from multiple threads could be corrupted (note that this data must be Sync). You can use destructors to ensure that shared data is left in a consistent state since these will be run on panic. However, you cannot rely on destructors being run in all circumstances (e.g., consider an object referred to by an Arc from another thread).

An example of a feature for ensuring consistency in the face of a panicking thread is mutex lock poisoning. When you lock a mutex, you get a Result which will be an Err if another thread panicked while the lock was held. If you unwrap this result, then you essentially propagate the panic to the current thread (which is a fine approach). Alternatively, you could try to recover by fixing any inconsistent state.
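A small sketch of lock poisoning and one possible recovery strategy. poison_and_recover is a hypothetical function, and resetting the value to 0 stands in for whatever "fixing inconsistent state" means in a real program.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Returns whether the mutex was poisoned and the value after recovery.
fn poison_and_recover() -> (bool, i32) {
    let data = Arc::new(Mutex::new(0));
    let worker_data = Arc::clone(&data);

    // The worker panics while holding the lock, which poisons the mutex.
    let _ = thread::spawn(move || {
        let mut guard = worker_data.lock().unwrap();
        *guard = 1; // pretend this is a half-finished update
        panic!("worker panicked while holding the lock");
    })
    .join();

    let was_poisoned = data.is_poisoned();
    let value = match data.lock() {
        Ok(guard) => *guard,
        // Recover: take the guard out of the error and repair the state.
        Err(poisoned) => {
            let mut guard = poisoned.into_inner();
            *guard = 0;
            *guard
        }
    };
    (was_poisoned, value)
}
```

Calling unwrap on the result of lock instead of matching on it would propagate the panic to the current thread, as described above.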

If you are writing unsafe code, then you must ensure that your code is exception safe. This means that any safety invariants your code relies on are re-established at any point where the program might panic. The UnwindSafe trait is used to mark types as being unwind-safe, that is the type can be used after a panic has occurred and the unwinding has been caught. It is used with catch_unwind, which is used to catch panics at FFI boundaries (see below). However, the concept has not gained a wide following in the Rust community, so it is probably unwise to rely on it.

You might want to ensure that your program (or parts of your program) cannot panic. However, this is generally not a good idea. If you don't want to have unwinding, then you can abort on panic. However, wanting your program to be panic free is essentially wanting your program to be bug free (at least for several classes of bugs), which is a big ask! It is usually better for a bug to cause a panic rather than to cause incorrect execution. In some circumstances, you might be able to check every arithmetic and indexing operation, handle every error, and never unwrap an Option, but for most programs, this is more of a burden than it is worthwhile. There are some crates (dont-panic and no_panic) which can help with this, but they rely on causing a linking error, so only work in some circumstances, can depend on specific optimisations, and can't do anything smart like prove that a panic trigger is unreachable.

Is panicking safe?

Given that panicking feels like a crash and crashes are often exploitable, it is often asked if panicking is safe. There are many levels to this question! Importantly, panicking cannot cause memory unsafety, so panicking is safe in the sense of Rust's unsafe keyword. Similarly, panicking is never exploitable. Since panicking runs destructors, panicking is also fairly safe in the colloquial sense that it can't leave your program in an inconsistent state (although you still have to take care to avoid bugs here, as mentioned above). However, if panicking causes your whole program to crash (due to a double panic, propagating panics to other threads, or abort-on-panic), then this can be a poor user experience.

Interop and catching panics

Panics must not cross the FFI boundary. That means that you must catch Rust panics using catch_unwind. See the interop chapter for details.

Backtraces

When a panic happens, Rust can print a backtrace of the stack frames which led to the panic (i.e., all the functions which were being called when the panic happened). To show the backtrace, set the RUST_BACKTRACE environment variable to 1 (or full if you want a verbose backtrace) when you run your program. You can prevent backtraces being printed (or change their behaviour) by setting your own panic hook.

You can capture and manipulate backtraces programmatically using std::backtrace.

Non-Rust errors

If you interoperate with other languages or interact with the operating system in certain ways (usually this means avoiding Rust's existing abstractions in the standard library or crates), then you will need to deal with errors which are not part of Rust's error system. These fall into two categories: error values and exceptions. In both cases, you must convert these errors into Rust errors. This chapter will explain how, as well as going in the other direction - handling Rust errors at the FFI boundary.

C errors

C has no standardized way to handle errors. Different libraries will indicate errors in different ways, e.g., by returning an error number, null pointer, sentinel value, etc. They may also set a global variable to give more error context. At the FFI boundary you must convert these different kinds of errors into Rust errors.

Generally, for FFI you will have a -sys crate which has simple bindings for foreign functions in Rust. This crate will have the native error handling. Then you will have a crate which wraps these function bindings in a Rust library which provides a safe, idiomatic Rust API. It is in this crate that errors should be translated into Rust errors, usually Results. Exactly how you do this depends on the level of abstraction your crate is targeting. You may have a direct representation of the C error in your Rust error, e.g., return Result<T, ErrNo> where ErrNo is a newtype wrapping the integer type or an alias of it. At the other extreme, you might use an error type like any other Rust crate. In between, you might have a typical Rust error type and embed a representation of the underlying C error inside the Rust error. Essentially, errors are 'just data' in both C and Rust, so translating errors at the FFI boundary is similar to handling any other data at the boundary.
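As a sketch of the most direct translation, the ErrNo newtype. Everything here is hypothetical: sys_open_device is a stand-in, written in Rust, for a binding that would normally come from a -sys crate, and the convention it follows (non-negative handle on success, negative error number on failure) is just one of the many C conventions mentioned above.

```rust
use std::error::Error;
use std::fmt;

// Newtype wrapping the raw C error number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ErrNo(i32);

impl fmt::Display for ErrNo {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "C call failed with error number {}", self.0)
    }
}

impl Error for ErrNo {}

// Stand-in for a foreign function from a hypothetical `-sys` crate: returns
// a non-negative handle on success or a negative error number on failure.
extern "C" fn sys_open_device(id: i32) -> i32 {
    if id >= 0 { id + 100 } else { -22 }
}

// Safe wrapper: translates the C convention into a Result.
fn open_device(id: i32) -> Result<u32, ErrNo> {
    let ret = sys_open_device(id);
    if ret < 0 {
        Err(ErrNo(-ret))
    } else {
        Ok(ret as u32)
    }
}
```

A higher-level crate might instead map particular error numbers onto variants of an ordinary Rust error enum, optionally keeping the ErrNo as a field.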

OS errors

You might interact with the operating system via the standard library or a crate, at the level of an API or syscalls. In these cases you might need direct access to the operating system's errors. In the standard library, these are available from io::Error.

You can use std::io::Error::raw_os_error and std::io::Error::from_raw_os_error to convert between a Rust error and an operating system's error number. Some operating system operations do not return an error, but instead return some indication of an error (such as a null pointer) and make more information about the error available in some other way. std::io::Error::last_os_error provides a way to access that information as a Rust error. You must be careful to use it immediately after producing the error; other standard library functions or accessing the operating system outside of the standard library might change or reset the error information.
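A brief sketch of the round trip between an error number and io::Error. describe_os_error is a hypothetical helper; how a given number maps to an ErrorKind is platform-dependent.

```rust
use std::io;

// Builds an io::Error from a raw OS error number and inspects it.
fn describe_os_error(code: i32) -> (io::ErrorKind, Option<i32>) {
    let err = io::Error::from_raw_os_error(code);
    // `kind` gives the portable classification; `raw_os_error` gives back
    // the original number.
    (err.kind(), err.raw_os_error())
}

// Errors which did not originate in the OS have no raw error number.
fn custom_error_code() -> Option<i32> {
    io::Error::new(io::ErrorKind::Other, "not from the OS").raw_os_error()
}
```

After a raw system call that signals failure out-of-band, io::Error::last_os_error() would be called in the same way, immediately after the failing call.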

Exceptions and panics

Languages often provide mechanisms for error handling which are not simple return values. Exceptions in many languages and Rust panics are examples. These mechanisms need special handling at the FFI boundary. In general, anything which involves stack unwinding or non-local control flow (that is, where execution jumps from one function to another) must not cross the FFI boundary. That means that on the Rust side, panics must not unwind through a foreign function (i.e., panics must be caught); if they do, the process will abort. For other languages, exceptions and similar must not jump into or 'over' (on the stack) Rust code; this will cause undefined behaviour.

If Rust code might panic (and most Rust code can) and is called by foreign code, then you should catch any panics using the catch_unwind function. You may also find std::panic::resume_unwind and std::thread::panicking useful.
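A minimal sketch of catching a panic before it reaches foreign code. The names divide and divide_ffi, and the convention of returning 0 for success and -1 for a caught panic, are assumptions for illustration; a real library would also need an export attribute such as no_mangle and its own way of reporting failure.

```rust
use std::panic;

// Hypothetical Rust function which may panic (division by zero panics).
fn divide(a: i32, b: i32) -> i32 {
    a / b
}

/// Entry point intended to be called from foreign code. Returns 0 on success
/// (with the quotient written through `out`) and -1 if a panic was caught.
///
/// # Safety
/// `out` must be a valid, writable pointer to an `i32`.
pub unsafe extern "C" fn divide_ffi(a: i32, b: i32, out: *mut i32) -> i32 {
    // Any panic inside the closure is caught here rather than unwinding
    // into the foreign caller.
    match panic::catch_unwind(|| divide(a, b)) {
        Ok(q) => {
            unsafe { *out = q };
            0
        }
        Err(_) => -1,
    }
}
```

Note that catch_unwind only helps in the unwind-on-panic configuration; with abort-on-panic the process exits before it can return.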

There is work in progress to permit unwinding from Rust into C code using the c_unwind (and other _unwind) APIs. However, at the time of writing this work is not fully implemented and is unstable. See RFC 2945 for more details.

Testing

This chapter will cover the intersection of error handling and testing. We'll cover how to handle errors which occur in tests, both where an error should indicate test failure and where we want to test that an error does occur.

You should ensure that your project has good test coverage of error paths as well as testing the success path. This might mean that errors are propagated correctly, that your code recovers from errors in the right way, or that errors are reported in a certain way. It is particularly important to test error recovery, since this can lead to program states (often many complex states) which are not encountered in normal execution and might occur only rarely when the program is used.

Handling errors in tests

In this section we'll assume a test should pass if there is no error and the test should fail if there is any error.

A basic Rust unit test fails if the function panics and passes if it does not. So, if you use only panics for error handling, then things are very easy! However, if (more likely) you have at least some Results to handle, you have to do a bit more work. The traditional approach is to panic if there is an error, using unwrap. Since 2018, you can return a Result from a test. In this case the test fails if the test returns an error or panics, so you can use the ? operator to fail a test on error. To do this simply add -> Result<(), E> (for any error type E which implements Debug) to the signature of the test.
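The two styles side by side, as a sketch (parse_and_double is a hypothetical function under test):

```rust
use std::num::ParseIntError;

// Hypothetical function under test.
fn parse_and_double(s: &str) -> Result<i32, ParseIntError> {
    Ok(s.parse::<i32>()? * 2)
}

#[cfg(test)]
mod result_tests {
    use super::*;

    // Traditional style: panic on error via `unwrap`.
    #[test]
    fn doubles_with_unwrap() {
        assert_eq!(parse_and_double("4").unwrap(), 8);
    }

    // Result-returning style: `?` fails the test if an error is returned.
    #[test]
    fn doubles_with_question_mark() -> Result<(), ParseIntError> {
        assert_eq!(parse_and_double("4")?, 8);
        Ok(())
    }
}
```

When a Result-returning test fails, the test harness prints the error using its Debug impl.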

Testing the kind of errors

To properly test your program, you should test the errors which your functions throw. Exactly what to test is an interesting question: for internal functions you should test what is needed, this might mean just that any error is thrown (for certain inputs). If you rely on certain errors being thrown (e.g., for recovery) then you should test that a specific error is thrown or that the error contains correct information. Likewise, for testing public functions, you should test what your API guarantees (do you guarantee certain error types under some circumstances? Or only that an error is thrown?).

To require that a test panics, you can use the #[should_panic] attribute. You can use an expected parameter to test for a particular message when panicking, e.g., #[should_panic(expected = "internal compiler error")].
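A short sketch of a should_panic test. checked_index is a hypothetical function; the expected string only needs to be a substring of the panic message.

```rust
// Hypothetical function which panics with a descriptive message.
fn checked_index(values: &[i32], i: usize) -> i32 {
    assert!(i < values.len(), "index {} out of range", i);
    values[i]
}

#[cfg(test)]
mod panic_tests {
    use super::*;

    // Passes only if the body panics with a message containing "out of range".
    #[test]
    #[should_panic(expected = "out of range")]
    fn rejects_bad_index() {
        checked_index(&[1, 2, 3], 7);
    }
}
```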

You could use #[should_panic] with unwrap to test that an error is thrown, but it is a bad idea: you could easily pass the test by some other part of the code panicking before the error should occur.

The better way to test for the presence of an error is to assert Result::is_err is true. If there are more properties of the error to check, then you can use Result::is_err_and for simple checks or Result::unwrap_err where there is more code. For example,

#[test]
fn test_for_errors() {
    assert!(should_throw_any_error().is_err());
    assert!(should_throw_not_found().is_err_and(|e| e.kind() == ErrorKind::NotFound));

    let e = should_throw_error_foo().unwrap_err();
    assert_eq!(e.kind(), ErrorKind::Foo);
    assert_eq!(e.err_no(), 42);
    assert_eq!(e.inner().kind(), ErrorKind::NotFound);
}

Exactly how to test properties of the error thrown will depend on what kind of error types you are using and what guarantees you are testing. You may be able to call methods on the error and assert properties of the results (as shown in the example above). If you want to test the type of the error then you'll need to either downcast (if using trait object error types) or match (or use if let; if using enum error types). In the latter case, the assert_matches macro can be useful - you can test that an error is thrown and the exact type of the error in one go, e.g., assert_matches!(should_throw_not_found(), Err(MyError::NotFound));.

Testing error recovery and handling

It is important to test the error handling/recovery paths. However, with idiomatic error handling this is not trivial, because it is not idiomatic to pass Results into a function; instead, the caller tests the result and only passes in a value if the result is Ok. This means that the only way to inject errors when testing is to have functions return an error. Unless your functions take a closure, the best way to do this is by using mock objects.

Mocking in Rust is usually ad hoc and lightweight. The common pattern is that you have a generic function or function which takes a trait object and you pass a custom mock object which implements the required trait bounds. For example,

// Function to test
fn process_something(thing: impl Thing) -> Result<i32, MyError> {
    // ...
}

trait Thing {
    fn process(&self, processor: Proc) -> Result<f32, MyError>;
}

#[cfg(test)]
mod test {
    use super::*;

    // Mock object
    struct AlwaysZero;

    impl Thing for AlwaysZero {
        fn process(&self, processor: Proc) -> Result<f32, MyError> {
            Ok(0.0)
        }
    }

    // Mock object
    struct AlwaysErr;

    impl Thing for AlwaysErr {
        fn process(&self, processor: Proc) -> Result<f32, MyError> {
            Err(MyError::IoError)
        }
    }

    #[test]
    fn process_zero() -> Result<(), MyError> {
        assert_eq!(process_something(AlwaysZero)?, 0);
        Ok(())
    }

    #[test]
    fn process_err() -> Result<(), MyError> {
        // Note that we test the specific error returned but not the contents of
        // that error. This should match the guarantees/expectations of `process_something`.
        assert_matches!(process_something(AlwaysErr), Err(MyError::ProcessError(_)));
        Ok(())
    }
}

For more sophisticated mocking, there are several mocking libraries available. Mockall is the most popular. These can save a lot of boilerplate, especially when you need mock objects to do more than always return a single error.

Testing error reporting

For applications with sophisticated error reporting (e.g., a compiler), you'll want to test the error reporting output, ensuring that the expected messages are produced along with other data such as the input which caused the error, information to locate that input, error numbers, etc.

How to test this will vary by application: you might be able to unit test the output or you might prefer to use integration tests. You'll want to implement some kind of framework to help these tests so that you're not doing loads of repetitive string comparisons. It's important not to test too much or your test suite will be fragile. One example of such tests is rustc's UI tests.

Benchmarking

If errors can occur in real life, then you should probably consider errors when benchmarking. You might want to separately benchmark what happens in the pure success and the error cases. You might also want a benchmark where errors occur randomly with realistic (or greater than realistic) frequency. This will require some engineering which is similar to mocking, but does the real thing in some cases and mocks an error occasionally.

The errors ecosystem

Good crates

The following crates are commonly used and recommended.

ThisError

A powerful derive(Error) macro. Implements Error and Display for your error types, including convenient syntax for source errors, backtraces, and conversions.

Anyhow

A crate for working with trait object error types. Extends the features of the standard library's Error trait. However, most of Anyhow's features have been added to std, so you might not need Anyhow any more (it does have the advantage that it doesn't require unstable features, so can be used with a stable toolchain).

Snafu

Supports deriving error handling functionality for error types (including the single struct style, not just enum style errors), macros for throwing errors, and using strings as errors.

Error Stack

An alternative and extension to thiserror, error-stack helps define your error types, but also adds support for better stacks of errors with arbitrary added attachments and information about how the error is produced and propagated. It also has sophisticated error reporting and formatting.

Error reporting crates

These crates are for reporting errors to the user in pretty ways. They are particularly useful for reporting errors in input text (e.g., source code) and are somewhat inspired by rustc's error reporting style.

Eyre is a fork of Anyhow with enhanced reporting facilities. The following crates are just for reporting and work with other error libraries if required:

Historic crates

As Rust's error handling story has evolved, many error handling libraries have come and gone. The following were influential in their time, but there are now better alternatives (sometimes including just the support in std). You might still see these crates used in historic documentation, but we wouldn't recommend using them any more.

Error design

How should you handle errors in your program? This chapter aims to answer that question. The first section will help you categorise and characterise errors. The second section will help you design error handling at a high level (issues like whether to recover from an error, when to use different kinds of errors, etc.), the third section is specifically about designing error types in your program.

Thinking about errors

Error handling

Error type design

Case studies

Thinking about errors

An error is an event where something goes wrong in a program. However, there are many different ways things can go wrong, and many different kinds of 'wrong'. The details of an error event dictate how it should be represented and handled. In this section, we'll cover many different ways to categorise and characterise errors.

I'm not sure if all the terminology I use here is standard; I've certainly made some of it up.

Internal vs external errors

Consider a JSON parser. It opens a file, reads the file, and attempts to parse the data as JSON into a data structure. There are many things which could go wrong. The user might supply a filename which does not exist on disk, in which case trying to open the file will fail. If the filename is correct, when the program attempts to read the file, there might be an operating system error which causes the read to fail. If reading succeeds, the JSON might be malformed, so it cannot be parsed correctly. If the file is correct, but there is a bug in the program, some arithmetic operation may overflow.

It is useful to categorise these errors as 'external', where the input is incorrect, and 'internal', where the input is correct but something unexpected happens.

An incorrect filename or malformed JSON data are external errors. These are errors which the program should expect to happen sometimes.

Failure to read the file and arithmetic overflow are internal errors. These are not due to user input, but due to a problem with the program's environment or the program itself. These errors are unexpected, to varying extents. Bugs in the program are a kind of internal error; these may produce an error which the program itself can detect, or may be a silent error.

In this example, the input was user input, but in general, an external error might be caused by input which comes from another part of the program. From the perspective of a library, malformed input from the client program would cause an external error.

Note that although I've said some errors are expected/unexpected there is not a perfect correlation with internal/external. A library which states that the client is responsible for data integrity might treat malformed input as an unexpected external error. A failed network connection on a mobile device is an expected internal error.

The split between internal and external is not clean, since what constitutes a program is often not clearly defined. Consider a distributed system (or even a system with multiple processes): errors caused by other nodes/processes (or due to interactions of nodes, such as a transaction error) might be considered internal from the perspective of the whole system and external from the perspective of a single node/process.

One can also extend this thinking within a program, making a distinction between errors which occur and are handled in the current module, and errors which occur in one module and are handled in a different one.

Expected/unexpected spectrum of errors

Some errors happen in the normal operation of a program and some are very rare. I believe treating this as a binary is incorrect and we should think about a spectrum of likelihood of an error occurring.

At one extreme, some errors are impossible, modulo cosmic rays, CPU bugs, etc. Some errors are possible, but should never happen and will only happen if there is a bug in the program. Then there are errors which are expected to happen sometimes, with different frequencies, e.g., file system errors are fairly unlikely to happen, network errors are much more likely, but still rare in a server, but common-ish on a mobile device. At the other extreme, one can write code which always triggers an error condition (and arguably this is not a bug if the error is handled, but it may be considered poor programming practice).

Recoverable vs unrecoverable

Can a program recover when an error occurs? Documentation for Rust often starts with a split between recoverable and unrecoverable errors. However, I think this is a more complicated question. Whether a program can recover from an error depends on what we mean by recovering (and what we mean by 'program' - is it recovery if an observer process restarts the erroring process?), it may depend on the program state before and after the error, it may depend on where we try to recover (it might be impossible to recover where the error occurs, but possible further up the call stack, or vice versa), it may depend on what context is available with the error. Even if it is technically possible to recover, if that would take an unreasonable amount of effort or risk, then it might be better to consider the error as unrecoverable.

There is also the question of intention. An error might be technically recoverable, but a library might want to treat it as unrecoverable for some reason.

So, I think there is a whole spectrum of recoverability which we can consider for errors and the context where the error occurs.

Multiple errors

Errors combine in all sorts of ways:

  • One error might cause another error as it is propagated or during recovery.
  • An error might be deliberately converted into a different kind of error.
  • An error might re-occur after recovery or, even after a 'successful' recovery, might cause a different error.
  • Errors might be collatable, i.e., after an error occurs and execution continues, other errors are collected and presented to the user all at once.
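The last case, collating errors, can be sketched as follows. `parse_all` is a hypothetical helper which keeps going after a failure and records the index of each bad input, so that every problem can be presented to the user in one go.

```rust
use std::num::ParseIntError;

/// Parses every input, continuing past failures so that all errors can be
/// reported together rather than one at a time.
pub fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, Vec<(usize, ParseIntError)>> {
    let mut values = Vec::new();
    let mut errors = Vec::new();
    for (index, input) in inputs.iter().enumerate() {
        match input.parse::<i32>() {
            Ok(value) => values.push(value),
            // Record where the error occurred and carry on.
            Err(error) => errors.push((index, error)),
        }
    }
    if errors.is_empty() {
        Ok(values)
    } else {
        Err(errors)
    }
}
```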

Execution environment

This one is a bit different to the others. It is important to consider how a program is executing when thinking about errors. We might want to treat errors differently when testing or debugging code. We might also treat errors differently if our program is embedded in another (either as a library, a plugin, or some kind of application-in-an-application-server scenario). Likewise, if our program is the OS, errors will be treated differently than if it is an application running on top of the OS. A program might also be inside some kind of VM or container with its own error handling facilities, or it might have some kind of supervisor or observer program.

Error handling

This chapter will discuss what to actually do with your errors.

When your code encounters an error, you have a few choices:

  • recover from the error,
  • propagate the error,
  • crash.

Recovery is the most interesting option and most of this section will be spent on it. If you don't recover, then you'll probably propagate the error. This is pretty straightforward; the only choice is whether to convert the error into a different one or propagate it as is. We'll discuss this more in the following chapter on error types.

Crashing can mean panicking the thread or aborting the process. The former might cause the latter, but that doesn't usually matter. Whether you abort or panic depends on the architecture of your program and the severity of the error. If the program is designed to cope with panicking threads, and the blast radius of the error is limited to a single thread, then panicking is viable and probably the better option. If your program cannot handle threads panicking, or if the error means that multiple threads will be in an inconsistent state, then aborting is probably better.
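As a minimal sketch of a program designed to cope with a panicking thread: the standard library's `JoinHandle::join` returns `Err` if the thread panicked, so a supervising thread can notice the failure and continue. `run_job` is a hypothetical helper, and the sketch assumes the default unwinding panic strategy (with `panic = "abort"` the whole process would terminate instead).

```rust
use std::thread;

/// Runs `job` on its own thread. If the job panics, only that thread is
/// terminated; the caller observes the panic as `None` and keeps running.
pub fn run_job<F>(job: F) -> Option<u32>
where
    F: FnOnce() -> u32 + Send + 'static,
{
    // `join` yields `Err` when the spawned thread panicked.
    thread::spawn(job).join().ok()
}
```

This only limits the blast radius if the job shares no state with other threads which the panic could leave inconsistent.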

Crashing may be a kind of recovery (if just a thread crashes and the program continues, or if the program runs in an environment where crashing is expected and the program can be restarted), but usually it is the least desirable response to an error. Even for prototype programs, you are probably better off propagating errors to the top level rather than crashing, since it is easier to convert that to a proper error handling system (and prototypes turn into production code more often than anyone would like).

Error handling strategy

Your project needs an error handling strategy. I consider this an architectural issue which should be decided in the early stages of planning, since it affects the API and implementation designs and can be difficult to change later in development.

Your strategy should cover:

  • expectations for robustness (for example, how high a priority is it that errors do not surface to the user? Should correctness, performance, or robustness be prioritised? What situations can we assume will never happen?),
  • assumptions about the environment in which the code will run (in particular, can errors or crashes be handled externally? Is the program run in a container or VM, or otherwise supervised? Can we tolerate restarting the program? How often?),
  • whether errors need to be propagated or reported to other components of the system,
  • requirements for logging, telemetry, or other reporting,
  • who should handle errors (e.g., for a library, which kinds of error should be handled internally and which should the client handle),
  • how error/recovery states will be tested.

Whether your project is a library or an application, and if an application whether it stands alone or is a component in a larger system, will have a huge effect on your error handling strategy. Most obviously, if you are a stand-alone application then there is nobody else to handle errors! In an application you will have more certainty about the requirements of the system, whereas for a library you will likely need to provide for more flexibility.

Once you have identified the requirements of your strategy, they will inform how you represent errors as types (discussed below and in the next chapter), where and how you will recover from errors, and the information your error types must carry (discussed in the following sections).

Result or panic?

Most programs should use Result for most of their error handling.

You might think that for a very quick prototype or script, you don't need error handling and unwrapping and panicking are fine. However, using ? is usually more ergonomic than unwrap and these tiny programs have a habit of growing into larger ones, and converting panicking code into code with proper error handling is a huge chore. Much better to have a very simple error type (even just a string); changing the error type is much easier.

You should probably use panics to surface bugs, i.e., in situations which should be impossible. This is easier said than done. For example, consider an integer arithmetic overflow. This might be impossible, but it might also be possible given enough time or the right combination of user input. So even classes of error which usually cause panics are best represented as a panic only in some circumstances.

When designing an API, either public or private, it is generally better for a function to return a Result and let the caller decide whether to panic, rather than always panicking on an error. It is very easy to convert an error value into a panic, but the reverse is more difficult and loses information.
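A small sketch of that division of responsibility. All of the names here are hypothetical; the point is only that the fallible function reports overflow and invalid input as values, and each caller picks its own policy.

```rust
/// Returns an error rather than panicking, so the caller chooses the policy.
pub fn checked_percent(part: u32, whole: u32) -> Result<u32, String> {
    if whole == 0 {
        return Err("whole must be non-zero".to_owned());
    }
    part.checked_mul(100)
        .map(|scaled| scaled / whole)
        .ok_or_else(|| format!("overflow computing {part} * 100"))
}

/// A caller which considers failure to be a bug converts the error into a panic.
pub fn percent_or_panic(part: u32, whole: u32) -> u32 {
    checked_percent(part, whole).expect("caller guarantees valid input")
}

/// A caller which can recover substitutes a default value instead.
pub fn percent_or_zero(part: u32, whole: u32) -> u32 {
    checked_percent(part, whole).unwrap_or(0)
}
```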

For how to design the error types carried by Result, see the next chapter.

Recovery

Error recovery is tightly linked to your program and to the individual error. You can't recover from a network error in the same way as a user input error (at least not all the time), and different projects (or parts of projects) will handle these errors in different ways. However, there are some general strategies to consider, and there are some ways in which recovery intersects with Rust's error types.

Techniques for recovery

Possible recovery techniques are:

  • stop,
  • retry,
  • retry with new input,
  • ignore,
  • fallback.

Most of these may be accompanied by some change in program state. E.g., if you are retrying, you may increment a retry counter so that after some number of failed retries, a different technique is tried. Some programs might have much more complex 'recovery modes'.

The recovery technique is likely combined with some logging action or reporting to the user. These actions require additional information in the error type which is not needed directly for recovery. I think it is useful to think of logging or reporting as part of error recovery, rather than an alternative, since it is your code which handles the error, rather than throwing it for client code to deal with.

Stop

If an error occurs while performing an action, you can simply stop doing the action. This might require 'undo'ing some actions to ensure the program is in a consistent state. Sometimes this just means calling destructors, but it might mean removing data written to disk, etc. Failing to execute an action should probably be reported in some way, either back to the user or via an API to another program or part of the program.

An extreme form of stopping is crashing. You could terminate the current thread (if your program is designed to handle this happening), or the whole process. This is usually not a very responsible thing to do, but in some environments it is fine (e.g., if the process is supervised by another process or if the process is part of a distributed system where nodes can recover). Assuming that the system can recover from such a crash, you need to consider how it will recover - if the process runs with the same input and tries the same action, will it always get the same result (i.e., crash again)?

Retry

If the error is due to a temporary fault or a fallible actor (such as a human user), simply retrying the action which caused the error might succeed. In other cases, you can retry with different input (e.g., by asking the user for more input, or by using a longer timeout, or asking a different server).
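A minimal sketch of retrying with a retry counter. `retry` is a hypothetical helper; it passes the attempt number to the action so that the action can vary its input between attempts (e.g., use a longer timeout), and it gives up with the last error once the limit is reached so that the caller can try a different technique.

```rust
/// Calls `action` until it succeeds or `max_attempts` attempts have failed,
/// returning the last error in the latter case. The action always runs at
/// least once. The attempt number (starting at 1) is passed to the action.
pub fn retry<T, E>(
    max_attempts: u32,
    mut action: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match action(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => attempt += 1,
        }
    }
}
```

A real implementation would probably also wait between attempts and only retry errors which are known to be temporary.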

Ignore

Some errors can simply be ignored. This might happen when the action being attempted is optional in some way, or where a reasonable default value can be used instead of the result of the action.

Fallback

Sometimes there is an alternate path to the path which caused an error and your program can fall back to using that path. This might be as simple as using a default value instead of trying to find a better value, or it might be using an alternate library or alternate hardware (e.g., falling back to an older library if the newer library is not present on a system, or using CPU rendering rather than dedicated graphics hardware).
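Falling back can be expressed directly with the combinators on Result. In this sketch (the function and its parameters are hypothetical), the preferred path is tried first, then the alternate path, and finally a default value is used if both fail.

```rust
/// Tries the preferred source first, falls back to the alternate source if
/// that fails, and finally uses a default value if both fail.
pub fn load_with_fallback<E>(
    preferred: impl FnOnce() -> Result<String, E>,
    alternate: impl FnOnce() -> Result<String, E>,
    default: &str,
) -> String {
    preferred()
        // The alternate path only runs if the preferred path failed.
        .or_else(|_| alternate())
        .unwrap_or_else(|_| default.to_owned())
}
```

Note that this discards both errors; in practice you would probably log them before falling back.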

Where to recover

You may choose to recover near to where the error occurs, or throw the error up the call stack and recover at some distance from the error. Both approaches have advantages and there is no best choice.

Close to the error you are likely to have the most specific context and therefore the most information for recovery. There is also the minimum of disruption to program state. On the other hand, by throwing the error and dealing with it further from the error site, you might have more options, for example you might be able to interact with the user in a way which is impossible deep in library code. The correct recovery approach may also be clearer, especially if the error occurs in a library. Finally, if you centralize error handling to some extent, this may allow more code reuse.

Information for recovery

If you always recover close to the error site, then your error types only need to carry information for logging/reporting (in the event that recovery fails). If you handle errors further from the error site, then your error type must carry enough information for recovery.

As errors propagate through the stack, you may want to add information to your error. This is fairly easy using the Result::map_err method, but you will need to account for this when designing your error types. Likewise, you may wish to remove information by making the error more abstract (especially at module or crate boundaries). This is inevitably a trade-off between providing enough information to recover and not providing so much that you expose the internals of your module or make the API too complex.

There is also the question of how to store information in your errors. For most code, I recommend storing information structurally as data and avoiding pre-formatting data into message strings or summarizing data. Storing information as structural data lets your code's client decide how to log or report errors, enables maximal flexibility in recovery, allows localization and other customisation, and permits structured logging.
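A minimal sketch of both points, using hypothetical names. `parse_field` uses `map_err` to add context as the error propagates, and `FieldError` keeps that context as data; the `Display` impl is just one possible rendering, chosen at reporting time rather than at the error site.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Carries structured data; no message is formatted at the error site.
#[derive(Debug, PartialEq)]
pub struct FieldError {
    pub field: &'static str,
    pub line: usize,
    pub cause: ParseIntError,
}

// One possible rendering. A client could instead read the fields directly to
// localize the message or to emit a structured log record.
impl fmt::Display for FieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "line {}: invalid value for `{}`: {}",
            self.line, self.field, self.cause
        )
    }
}

/// Adds context (the field name and line number) to the underlying error.
pub fn parse_field(field: &'static str, line: usize, raw: &str) -> Result<u32, FieldError> {
    raw.trim()
        .parse::<u32>()
        .map_err(|cause| FieldError { field, line, cause })
}
```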

Logging

Logging is useful for debugging and basically essential for running services in production. You will probably have a bunch of constraints around logging because of how it must integrate (technically and culturally) with other systems.

How you log should follow from your logging strategy - how you decide what to log and how to log it. That is beyond the scope of this doc. I'll cover some options, but which is appropriate will depend on your logging strategy, rather than anything I could enumerate here.

One big choice is whether to log errors where they occur or in some central-ish point or points. The former is easier since it puts less pressure on the errors to carry context. The latter makes it easier to log only unhandled (at some level) errors, and it makes it easier to change the way errors are logged. This will somewhat depend on whether you want to log all errors or only uncaught ones. In the latter case, you'll probably want to do some logging in error recovery too since this is an area where bugs (or performance issues) are more likely to be triggered.

Another choice is whether to keep data in errors structurally or pre-formatted. The latter is easier and more flexible in the sense that the data stored can be changed over time easily. However, it is less flexible in that there is only one choice for logging. In the former case, logging in different places or at different times can log different information. The information can also be formatted differently, internationalized, translated, or used to create derived information. Keeping data structurally is therefore the better choice in large or long-lived systems.

Structured logging takes the structured approach further by logging data in a structured way, rather than logging a formatted string. It is usually superior in all but the simplest scenarios.

Tracing is an approach to logging which relates log events by program flow across time. This is often much more useful than logging only point-in-time events without context. Generally, errors don't need to be aware of whether you do simple logging or tracing. However, if you're logging errors centrally, you may need to add some tracing information to errors so that when they are logged there is the relevant context for the tracing library.

Libraries for logging

  • tracing is the best choice for full tracing and structured logging. It is maintained by the Tokio project, but does not require using Tokio. There is an ecosystem of integration crates and other libraries.
  • log is part of the Rust project and an old standard of Rust logging. It requires a logger to actually do the logging and is therefore independent of how logs are emitted. It supports simple structured logging, but not tracing.
  • env_logger is the most commonly used logger (used with log, above). It is simple and minimal and best suited for small projects, prototypes, etc.
  • log4rs and fern are more full-featured but complex loggers (which again integrate with log).
  • slog was the most popular choice for structured logging and is an alternative to log. However, it does not seem to be maintained any longer and is probably not a good choice for new systems.

I would recommend using tracing if you need production-strength logging/tracing, or log and env_logger if you need simple logging.

Error type design

You can use any type as an error in Rust. The E in Result<T, E> has no bounds and can be anything. At the highest level there are two choices: use a concrete error type or use a trait object (e.g., Box<dyn Error>). In the latter case, you'll still need concrete error types to use for values, you just won't expose them as part of your API. When designing the concrete error types, there are two common approaches: create a taxonomy of error types primarily using enums, or use a single struct which represents all the errors in a module or modules.

There are also some less common choices for the concrete error types. You can use an integer type, which can be useful around FFI where you have an error number from C code. You should probably convert the error number into a more idiomatic error type before passing it on, but it can be a useful technique as an intermediate form between C and Rust types, or for some very specialised cases.
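A sketch of converting an error number into a more idiomatic type. The error numbers and `LibError` are hypothetical (they do not correspond to any particular C library), and the FFI call itself is left out; `check` would be applied to the raw return code.

```rust
/// Hypothetical error numbers, as might be returned by a C library.
const E_NOT_FOUND: i32 = 2;
const E_DENIED: i32 = 13;

#[derive(Debug, PartialEq)]
pub enum LibError {
    NotFound,
    PermissionDenied,
    /// Preserves any code which this wrapper does not recognise.
    Unknown(i32),
}

/// Converts a C-style return code (0 meaning success) into a `Result`.
pub fn check(code: i32) -> Result<(), LibError> {
    match code {
        0 => Ok(()),
        E_NOT_FOUND => Err(LibError::NotFound),
        E_DENIED => Err(LibError::PermissionDenied),
        other => Err(LibError::Unknown(other)),
    }
}
```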

You can also use a string error message as your error type. This is rarely suitable for production code but is useful when prototyping because converting code from one kind of error type to another is much easier than converting panicking code to using Results (it also lets you use ? which can clean up your code nicely). You'll probably want to include type Result<T> = std::result::Result<T, &'static str>; to make this technique more usable. Alternatively, you can use string messages as the concrete types underlying trait object error types. Anyhow provides the anyhow! and bail! macros to make this easy.
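A small sketch of the string-error technique with the type alias. The `prototype` module and its functions are hypothetical; the alias is placed in a module here only so that it does not shadow the standard `Result` elsewhere.

```rust
mod prototype {
    /// Module-wide alias so that signatures stay short while prototyping.
    pub type Result<T> = std::result::Result<T, &'static str>;

    fn parse_number(text: &str) -> Result<i64> {
        text.trim()
            .parse::<i64>()
            .map_err(|_| "expected an integer")
    }

    /// Parses input such as "3, 4" and returns the sum of the two numbers.
    pub fn sum_pair(input: &str) -> Result<i64> {
        let (left, right) = input.split_once(',').ok_or("expected `a, b`")?;
        // `?` keeps the happy path readable; no unwrapping is needed.
        Ok(parse_number(left)? + parse_number(right)?)
    }
}
```

Moving to a richer error type later mostly means changing the alias and the places where errors are created; the code using ? is unaffected.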

For the rest of this section, we'll describe the two styles for concrete error types, then using trait objects as error types. We'll finish with some discussion on how to choose which technique to use, and some general tips for writing better error types.

Enum style

In this style, you define a taxonomy of errors with a high level of detail and specialization, in particular, different kinds of errors have different data. Concretely, this means primarily using enums with a variant for each kind of error and the data specific to that kind of error embedded in the variant. E.g.,

// One variant for each kind of error; each variant embeds the data specific
// to that kind of error.
pub enum MyError {
    Network(IoError, Request),
    Malformed(MalformedRespError),
    MissingHeader(String),
    UnknownField(String),
}

pub struct MalformedRespError {
    // ...
}

(Note that I think this is a poorly designed error; I'll give better designs below).

The advantages of this approach are:

  • there is maximum information available, this is especially useful for error recovery,
  • it's easy to extend with new errors,
  • it's a common approach and is conceptually simple, so it is easy to understand,
  • by using more enums or fewer enums with more variants, you can adjust the amount of static vs dynamic typing,
  • throwing, rethrowing (especially at API boundaries when making errors strictly more abstract), and catching errors is ergonomic,
  • it works well with error libraries like thiserror.

There are two ways to approach the design: based on how the errors will arise, or based on how the errors will be handled. I think the former is easier and more future-proof (how can you predict all the ways an error might be handled?).

There is a lot of scope for different design decisions: how to categorise errors, how much detail to include, how to structure the errors, how to handle back/forward compatibility, etc. Concretely, how many enums should you use and how many variants? Should you nest errors? How should you name the enums and variants?

If designing based on how errors arise, you want one enum for each class of errors and one variant for each specific kind of error (obviously, there's a lot of space here for what constitutes a class or specific kind of error). More practically, a function can only return a single error type, so all errors which can be returned from a function must belong to a single enum. Since errors can likely be returned from multiple functions, you'll end up with a set of errors returned from a set of functions which have the same error type. It is possible, but not necessary, for that set of functions to match with module boundaries. So one error type per module is sometimes OK, but I wouldn't recommend it as a general rule.

Furthermore, at different levels of abstraction, the right level of detail for errors will change (we'll discuss this a bit more below). So, the right level of detail for error enums deep inside your implementation is probably different to the right level in the API.

For example,

// Public error type, used at the API boundary. Every error carries the
// request which caused it.
pub struct MessageError {
    kind: MessageErrorKind,
    request: Request,
}

pub enum MessageErrorKind {
    Network(IoError),
    Parse(Option<ProtocolVersion>),
}

// Internal error types, used within the implementation.
enum NetworkError {
    NotConnected,
    ConnectionReset,
    // ...
}

enum ParseError {
    Malformed(MalformedRespError),
    MissingHeader(String),
    UnknownField(String),
}

struct MalformedRespError {
    // ...
}

In this example (over-engineered for such a small example, but imagine it as part of a larger program), NetworkError and ParseError are used in the implementation and are designed to contain maximal information for recovery. At the API boundary (if we can't recover), these are converted into a MessageError. This is a struct since all errors will contain the request which caused the error. However, it still follows the 'enum taxonomy' approach rather than the 'single struct' approach since the error kinds hold different data. There should be enough information here for users of the API to recover or to report errors to the user.

If designing based on how an error might be handled, you will want a different variant for each different recovery strategy. You might want a different enum for each recovery point, or you might just want a single enum per module or crate. For example,

use std::error::Error;

// Errors are grouped by how they might be handled, rather than by how they arise.
pub enum Retryable {
    Network(IoError, Request),
    Malformed(MalformedRespError),
    NotRetryable(Box<dyn Error>),
}

pub enum Reportable {
    MissingHeader(String),
    UnknownField(String),
    RetryFailed(Retryable),
    NotReportable(Box<dyn Error>),
}

pub struct MalformedRespError {
    // ...
}

Nested errors

One thing to be aware of in this approach is nesting of errors. It is fairly common to have variants which simply contain a source error, with a simple From impl to convert one error to another. This is made ultra-ergonomic by the ? operator and attributes in the error handling libraries such as thiserror. I believe this idiom is very overused, to the point where you should consider it an anti-pattern unless you can prove otherwise for the specific case.

In most cases where you create an error out of another error, there is additional context that is likely to be useful when recovering or debugging. If the error is simply nested, then this extra context is missed. Sometimes, if it seems like there is no extra useful context, then it might be a signal that the source error should be a variant of the other error, rather than nested in it. Alternatively, it might be that the level of abstraction should be changed.

For example, you might have a std::io::Error nested in your enum as MyError::Io(std::io::Error). It might be that there is more context to include, such as the action, filename, or request which triggered the IO error. In this case, std::io::Error comes from another crate, so it cannot be inlined. It might be that the level of abstraction should change, e.g., if the IO error occurred when loading a config file from disk, it could be converted into a MyError::Config(PathBuf).

To add more context to an error, use .map_err(|e| ...)? rather than just ?. To change the abstraction level, you might be able to use a custom From impl, or you might need to use map_err with an explicit conversion method.

Often, this pattern occurs when primarily considering logging errors rather than using errors for recovery. IMO, it is usually better to have fine-grained logging where the error occurs, rather than relying on logging errors at some central point. See the error handling section for more discussion.

Another issue of this approach is that it can lead to cycles of error types, e.g.,

use std::error::Error;

// An A can contain a B ...
pub enum A {
    // ...
    B(Box<B>),
}

// ... and a B can contain an A (or any other error, via Other).
pub enum B {
    Foo,
    A(A),
    Other(Box<dyn Error>),
    // ...
}

Here you might have a B inside an A inside a B or vice versa, or more deeply nested. When it comes to error recovery, and you pattern match on a B to find a B::Foo, you might miss some instances because the B::Foo is hidden inside a B::A variant. It could also be hidden inside the B::Other variant, either as a B or an A. This will make error recovery messy where it shouldn't be.

The solution is to aggressively normalize errors in the From impls. However, the automatically generated impls won't do this for you, so that means writing them by hand.
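A sketch of what such hand-written, normalizing impls might look like for simplified versions of the two enums above (the `Timeout` variant is added for illustration and the `Other` variant is omitted). When converting, a wrapped error of the target type is unwrapped rather than nested one level deeper.

```rust
#[derive(Debug, PartialEq)]
pub enum A {
    Timeout,
    B(Box<B>),
}

#[derive(Debug, PartialEq)]
pub enum B {
    Foo,
    A(A),
}

// Hand-written so that a `B` which was wrapped in an `A` is unwrapped again,
// instead of producing `B::A(A::B(..))`.
impl From<A> for B {
    fn from(error: A) -> B {
        match error {
            A::B(inner) => *inner,
            other => B::A(other),
        }
    }
}

// The same normalization in the other direction.
impl From<B> for A {
    fn from(error: B) -> A {
        match error {
            B::A(inner) => inner,
            other => A::B(Box::new(other)),
        }
    }
}
```

As long as errors are only wrapped via these impls, matching on B::Foo finds every instance, because a B::Foo can never be hidden inside a B::A.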

A corollary to this thinking on nested errors is that Error::source is not a very useful method (IMO, it is only useful for logging errors and debugging).

Single struct style

In this style, you use a single error struct. Each error stores mostly the same data and this is stored separately to the kind of error. Though you will likely still use an enum to represent the kind of error, this will likely be a C-like enum without embedded data. E.g.,

use std::error::Error;

// Every error stores the same data, separately from the kind of error.
pub struct MyError {
    kind: ErrorKind,
    source: Box<dyn Error>,
    message: Option<String>,
    request: Option<Request>,
}

// A C-like enum: the variants carry no data of their own.
pub enum ErrorKind {
    Network,
    MalformedResponse,
    MissingHeader,
    UnknownField,
    // This lets us handle any kind of error without adding new error kinds.
    Other,
}

This style is less common in the ecosystem. A prominent example is std::io::Error; note that this error type is further optimised by using a packed representation on some platforms.

The advantages of the single struct style are:

  • it scales better when there are lots of errors (there is not a linear relationship between the number of errors and the number of error types),
  • logging errors is simple,
  • catching errors with downcasting is not too bad, and familiar from the trait object approach to error handling,
  • it is easily customisable at the error site - you can add custom context without adding (or changing) an error kind,
  • it requires less up-front design.

You have a choice about what information to include in your error struct. Some generic information you might like to include is a backtrace for the error site, an optional source error, some error code or other domain-specific way to identify the error, and/or some representation of the program state (this last one is very project-specific). If you include a source error, then you might face similar issues to those described in the previous section on error nesting.

You might also include an error message created at the error site. This is convenient and flexible, but has limitations in terms of reporting - it means the message cannot be customized (e.g., a GUI might want to format an error message very differently to a CLI) or localized.

You will probably want to include the kind of error and use an ErrorKind enum for this. Error recovery is mostly impossible otherwise. You then have to decide how detailed to make the error kinds. More detail can help recovery but means more variants to keep track of and too much detail may be of little use (especially if your error struct has a pre-formatted message and/or other information).
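To make the role of the error kind in recovery a little more concrete, here is a minimal sketch using a cut-down version of the struct above. The kind accessor, the constructor, and the should_retry policy are my own assumptions for illustration; your error struct and recovery logic will differ.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    Network,
    MalformedResponse,
    Other,
}

#[derive(Debug)]
pub struct MyError {
    kind: ErrorKind,
    message: Option<String>,
}

impl MyError {
    // Hypothetical constructor, used at the error site.
    pub fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        MyError { kind, message: Some(message.into()) }
    }

    // Exposing the kind is what makes recovery possible for callers.
    pub fn kind(&self) -> ErrorKind {
        self.kind
    }
}

impl fmt::Display for MyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.message {
            Some(msg) => write!(f, "{:?}: {}", self.kind, msg),
            None => write!(f, "{:?}", self.kind),
        }
    }
}

impl Error for MyError {}

// Hypothetical recovery policy: only network errors are worth retrying.
pub fn should_retry(e: &MyError) -> bool {
    matches!(e.kind(), ErrorKind::Network)
}

fn main() {
    let e = MyError::new(ErrorKind::Network, "connection reset");
    assert!(should_retry(&e));
    println!("{}", e);
}
```

Note that the recovery code only looks at the kind; the pre-formatted message is used purely for reporting.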

Trait objects

You may choose to use a trait object as your error type; Box<dyn Error> is common. In this case, you still need concrete error types and the design decisions discussed above still apply. The difference is that at some point in your code, you convert the concrete error types into the trait object type. You could do this at the API boundary of your crate or at the source of the error or anywhere in between.
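Here is a small sketch of converting at the API boundary. The names (EmptyInput, first_word, first_number) are hypothetical. The internal function returns a concrete error, and the ? operator in the public function converts both that error and the standard library's ParseIntError into Box<dyn Error>.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical concrete error type used inside the crate.
#[derive(Debug)]
pub struct EmptyInput;

impl fmt::Display for EmptyInput {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "input was empty")
    }
}

impl Error for EmptyInput {}

// Internal function: returns a concrete error.
fn first_word(s: &str) -> Result<&str, EmptyInput> {
    s.split_whitespace().next().ok_or(EmptyInput)
}

// API function: `?` converts each concrete error into the trait object.
pub fn first_number(s: &str) -> Result<i64, Box<dyn Error>> {
    let word = first_word(s)?;
    let n = word.parse::<i64>()?;
    Ok(n)
}

fn main() {
    assert_eq!(first_number("42 apples").unwrap(), 42);
    assert!(first_number("").is_err());
    assert!(first_number("apples").is_err());
}
```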

There are some advantages of using a trait object error type:

  • assuming you are using different error types for your module internals and your API, it saves you having a second set of error types for the API,
  • it is statically much simpler to have a single type rather than a complex taxonomy of types,
  • assuming users will not do 'deep' recovery, using the errors can be simpler too,
    • in particular, logging is more uniform and simpler,
  • 'dynamic typing' of errors is more flexible and makes backwards compatibility and stability easier.

The error type

Since dyn Trait types do not have a known size, they must be wrapped in a pointer. Most cases can use Box since that is most convenient. For systems with no allocator, or where micro-optimization of the error path is required, you can dedicate some static memory to holding an error and use a &'static reference.
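A minimal sketch of the static memory approach, assuming a hypothetical BufferFull error and push function. The error value lives in a static, so returning it does not allocate. This example uses std for simplicity; in a no-std crate you would import the trait from core instead.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical error type with no data, so it can live in a static.
#[derive(Debug)]
pub struct BufferFull;

impl fmt::Display for BufferFull {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("buffer full")
    }
}

impl Error for BufferFull {}

// The single, statically allocated instance of the error.
static BUFFER_FULL: BufferFull = BufferFull;

// Returns the new length, or a reference to the static error if there is
// no room. No heap allocation happens on either path.
pub fn push(len: usize, capacity: usize) -> Result<usize, &'static dyn Error> {
    if len >= capacity {
        return Err(&BUFFER_FULL);
    }
    Ok(len + 1)
}

fn main() {
    assert_eq!(push(0, 2).unwrap(), 1);
    assert!(push(2, 2).is_err());
}
```

The obvious limitation is that a static error cannot carry per-error data without some form of interior mutability.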

The most common trait to use for the trait object is the Error trait from the standard library. This is part of core (as opposed to std), so it can be used in no-std environments. An alternative is to use the Error type from Anyhow, an error handling crate (note that anyhow::Error is a struct which wraps a trait object, rather than a trait). This has the advantage that you don't need to use unstable features of the standard library. It has the same features as the standard library trait, but with a slightly different API.

You could use your own error trait type. For ease of working with other code, you probably want it to have std::error::Error as a super-trait. However, I'm not aware of any advantage of doing this over using concrete error types.

Exposing additional context

As described in the Result and Error section, you can dynamically provide additional context from an Error object. This should be well-documented in your API docs, including whether users can rely on the stability of the additional context.

Useful context to provide may include:

  • a backtrace,
  • pre-formatted messages as strings for logging or reporting to users (though note that this precludes localization or other customisation of the message),
  • error numbers or other domain-specific information.

If using dynamic trait object errors, the errors are only useful for coarse-grained recovery, logging, or for direct reporting to users. If you want to enable fine-grained recovery, you are probably better off using concrete error types, rather than adding loads of context to error trait objects.
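As an illustration of coarse-grained recovery from a trait object error, here is a sketch using downcasting. The load and load_or_default functions are hypothetical. The caller recovers from exactly one situation (a missing file) and propagates everything else unchanged.

```rust
use std::error::Error;
use std::io;

// Hypothetical operation which returns a trait object error.
fn load(path: &str) -> Result<String, Box<dyn Error>> {
    if path.is_empty() {
        let e = io::Error::new(io::ErrorKind::NotFound, "no path given");
        return Err(e.into());
    }
    if path == "bad" {
        // Box<dyn Error> can be created directly from a string.
        return Err("malformed contents".into());
    }
    Ok(format!("contents of {}", path))
}

// Coarse-grained recovery: fall back to an empty string if the file is
// missing, propagate any other error.
fn load_or_default(path: &str) -> Result<String, Box<dyn Error>> {
    match load(path) {
        Ok(s) => Ok(s),
        Err(e) => {
            let missing = matches!(
                e.downcast_ref::<io::Error>(),
                Some(io_err) if io_err.kind() == io::ErrorKind::NotFound
            );
            if missing { Ok(String::new()) } else { Err(e) }
        }
    }
}

fn main() {
    assert_eq!(load_or_default("").unwrap(), "");
    assert_eq!(load_or_default("a.txt").unwrap(), "contents of a.txt");
    assert!(load_or_default("bad").is_err());
}
```

Anything more fine-grained than this quickly becomes a chain of downcasts, which is a sign that concrete error types would serve better.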

Eliminating concrete error types

You can't really eliminate all concrete error types; due to Rust's nature, you have to have a concrete type which is abstracted into the trait object. However, you can avoid having to define your own concrete types in every module. While this is a reasonable approach, I think it is probably only the best approach in some very limited circumstances.

How to do this depends on how much information you need to get from the Error trait object. If you want to provide some additional context, you'll need somewhere to store that context or the data required to create it. Some approaches:

  • use an error number (since the integer types don't implement Error, you will need to wrap the number in a newtype struct, e.g., pub struct ErrNum(u32);),
  • use a string (a newtype works here too, although Box<dyn Error> can also be created directly from a String or &str; if you use Anyhow you can make use of the anyhow!/bail! macros),
  • use the single struct style described above with a very simple structure.
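A sketch of the error number approach from the list above. ErrNum is the newtype suggested in the text; check_len, err_num, and the meanings assigned to the numbers are assumptions made up for this example.

```rust
use std::error::Error;
use std::fmt;

// Newtype wrapper, needed because u32 does not implement Error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ErrNum(pub u32);

impl fmt::Display for ErrNum {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error number {}", self.0)
    }
}

impl Error for ErrNum {}

// Hypothetical function: 1 means "empty", 2 means "too long".
pub fn check_len(len: usize) -> Result<(), Box<dyn Error>> {
    if len == 0 {
        return Err(Box::new(ErrNum(1)));
    }
    if len > 10 {
        return Err(ErrNum(2).into());
    }
    Ok(())
}

// Callers get the number back out of the trait object by downcasting.
pub fn err_num(e: &(dyn Error + 'static)) -> Option<u32> {
    e.downcast_ref::<ErrNum>().map(|n| n.0)
}

fn main() {
    let e = check_len(0).unwrap_err();
    assert_eq!(err_num(&*e), Some(1));
    assert!(check_len(5).is_ok());
}
```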

Picking an approach

First of all, be aware that you are not making one choice per project. Different modules within your project can take different approaches, and you may want to take a different approach with the errors in your API vs those in the implementation. As well as different approaches in different parts of your project, you can mix approaches in one place, e.g., using concrete errors for some functions and trait object errors for others, or using a mix of enum style and single struct style errors.

Ultimately the style of error type which is best depends on what kind of error handling you'll do with them (or expect your users to do with them, if you're writing a library). For error recovery at (or very close to) the error site, the error type doesn't matter much. If you are going to do fine-grained recovery at some distance from where the error is thrown (e.g., in a different module or crate), then the enum style is probably best. If you have a lot of errors in a library (and/or all errors are rather similar), or you expect the user to only do coarse-grained recovery (e.g., retry or communicate failure to the user), then the single struct style might be better. Where there can be effectively no recovery (or recovery is rare), and the intention is only to log the errors (or report the errors to a developer in some other way), then trait objects might be best.

Common advice is to use concrete error types for libraries (because they give the user maximum flexibility) and trait object errors in applications (for simplicity). However, I think this is an over-simplification. The advice for libraries is good unless you are sure the user won't want to recover from errors (and note that recovery is possible using downcasting, it's just easier if there is more information). For applications, it might make sense to use concrete types to make recovery easier. For many applications, you're more likely to be able to recover from locally produced errors rather than those from upstream libraries. Being closer to the user also makes interactive recovery easier.

For large, sophisticated applications, you probably want to use the enum style of concrete error, since this permits localisation (and other customisation) of error messages and helps with the modularity of APIs.

General advice

Naming

Avoid over-using Error in names; it is easy to end up with it repeating endlessly. You might want to use it in your type names (unless it is clear from context), but you almost certainly don't need it in variant names. E.g., MyError::Io is better than MyError::IoError.

It is fine to overload the standard library names like Result and Error. You can always use an explicit prefix to name the std ones, but you'll rarely need to.
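A short sketch of both naming points, assuming a hypothetical config module. The module defines its own Error and Result, the variants do not repeat the word Error, and the std names are reached with an explicit prefix in the few places they are needed.

```rust
pub mod config {
    use std::fmt;

    // Named Error rather than ConfigError: the module path gives the context.
    #[derive(Debug)]
    pub enum Error {
        Io(std::io::Error), // rather than IoError
        Parse(String),      // rather than ParseError
    }

    // Overloads the std name within this module.
    pub type Result<T> = std::result::Result<T, Error>;

    impl fmt::Display for Error {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                Error::Io(e) => write!(f, "IO error: {}", e),
                Error::Parse(s) => write!(f, "could not parse {:?}", s),
            }
        }
    }

    // The std trait is named with an explicit prefix.
    impl std::error::Error for Error {}

    // Hypothetical function using the module's own Result.
    pub fn parse_port(s: &str) -> Result<u16> {
        s.parse::<u16>().map_err(|_| Error::Parse(s.to_owned()))
    }
}

fn main() {
    assert_eq!(config::parse_port("8080").unwrap(), 8080);
    assert!(matches!(config::parse_port("http"), Err(config::Error::Parse(_))));
}
```

From outside the module, the types read as config::Error and config::Result, which is usually clear enough.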

Avoid having an errors module in your crate. Keep your errors in the module they serve and consider having different errors in different modules. Using an errors module encourages thinking of errors as a per-project thing, rather than tailoring their level of detail to what is most appropriate.

Stability

Your errors are part of your API and therefore must be considered when thinking about stability (e.g., how to bump the semver version). If you have dynamically typed aspects to your errors, then you need to decide (and document) whether these are part of your API or not (e.g., if you use a trait object error and permit downcasting to a concrete type, can the user rely on the concrete type remaining the same? What about provided additional context?).

For error enums, you probably should mark all your enums as #[non_exhaustive] so that you can add variants backwards compatibly.
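For example (FetchError and describe are hypothetical names), a non-exhaustive error enum and a match over it might look like the sketch below. The attribute only affects other crates: they are required to include the wildcard arm, which is what lets you add variants later without breaking them. Inside the defining crate the wildcard is unnecessary, hence the allow attribute in this single-file example.

```rust
// Downstream crates cannot match this enum exhaustively, so adding a
// variant later is not a breaking change.
#[derive(Debug)]
#[non_exhaustive]
pub enum FetchError {
    Timeout,
    NotFound,
}

// Written the way code in a downstream crate would have to be written.
#[allow(unreachable_patterns)]
pub fn describe(e: &FetchError) -> &'static str {
    match e {
        FetchError::Timeout => "timed out",
        FetchError::NotFound => "not found",
        // Mandatory in other crates; covers variants added in the future.
        _ => "unknown error",
    }
}

fn main() {
    assert_eq!(describe(&FetchError::Timeout), "timed out");
    assert_eq!(describe(&FetchError::NotFound), "not found");
}
```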

Converting errors at API boundaries

It often makes sense to have different error types in your API to those in the implementation. That means that the latter must be converted to the former at API boundaries. Fortunately, this is usually ergonomic in Rust due to the implicit conversion in the ? operator. You'll just need to implement the From trait for the types used in the API (or Into for the concrete types, if that is not possible).
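A minimal sketch of this kind of boundary conversion, with hypothetical names throughout (ParseError as the detailed internal error, ConfigError as the less detailed API error). The From impl is written once, and the ? operator in the public function applies it implicitly.

```rust
use std::fmt;
use std::num::ParseIntError;

// Detailed internal error (hypothetical); not exposed in the API.
#[derive(Debug)]
#[allow(dead_code)] // the payload is only ever inspected via Debug
enum ParseError {
    MissingPrefix,
    BadNumber(ParseIntError),
}

// Less detailed error exposed in the API (hypothetical).
#[derive(Debug, PartialEq, Eq)]
#[non_exhaustive]
pub enum ConfigError {
    MissingPort,
    InvalidPort,
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingPort => write!(f, "no port specified"),
            ConfigError::InvalidPort => write!(f, "port is not valid"),
        }
    }
}

impl std::error::Error for ConfigError {}

// The conversion used by `?` at the API boundary. Internal detail (the
// ParseIntError) is deliberately dropped rather than exposed.
impl From<ParseError> for ConfigError {
    fn from(e: ParseError) -> Self {
        match e {
            ParseError::MissingPrefix => ConfigError::MissingPort,
            ParseError::BadNumber(_) => ConfigError::InvalidPort,
        }
    }
}

// Internal function using the internal error type.
fn parse_port(input: &str) -> Result<u16, ParseError> {
    let value = input.strip_prefix("port=").ok_or(ParseError::MissingPrefix)?;
    value.parse::<u16>().map_err(ParseError::BadNumber)
}

// API function: `?` converts ParseError into ConfigError.
pub fn load_port(input: &str) -> Result<u16, ConfigError> {
    let port = parse_port(input)?;
    Ok(port)
}

fn main() {
    assert_eq!(load_port("port=8080"), Ok(8080));
    assert_eq!(load_port("port=abc"), Err(ConfigError::InvalidPort));
    assert_eq!(load_port("host=x"), Err(ConfigError::MissingPort));
}
```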

If you'll be converting error types at boundaries, you'll be converting to a more abstract (or at least less detailed) representation. That might mean converting to a trait object, or it might be a more suitable set of concrete types. If you're using trait objects internally, you probably don't need to convert, although you might change the trait used. As discussed above, you should choose the types most appropriate to your API. In particular, you should consider what kind of error recovery is possible (or reasonable) outside an abstraction boundary.

Should you use the internal error as a 'source' error of the API error? Probably not; after all, the internal error is part of your internals and not the API definition, and therefore you probably don't want to expose it. If the expectation is that the only kind of error handling that will be done is to log the errors for debugging or crash reporting, etc., then it might make sense to include the source error. In this case, you should document that the source error is not part of the stable API and should not be used for error handling or relied upon (be prepared for users to rely on it anyway).

Resources

Blog posts