On Error Handling

Error handling is something we all have to do. Error handling is hard, and there are just so many times we think we know that some piece of code just cannot fail. So why check for errors there?

Well …

Your chdir() could fail because the target was deleted, causing your program to dump files somewhere in the filesystem. Your log2() could fail because your input was negative for some reason, summoning nasal demons as it fails. Your spacecraft guidance system could crash, causing your spacecraft to crash and burn as well. (This has happened.) All that just because error handling is hard, or expensive, or <insert other reason here>.

All because error handling sucks. But why is that so?


State of the error

If your background is in C, Perl, some parts of PHP, or basically anything else that tries to be like-C-but-different, you will be familiar with the humble errno and its friend return -1. If your background is Java, Python or other parts of PHP, you will know exceptions. If your background is C with external libraries, you will probably be familiar with functions that return error codes. If your background is basically anything except Haskell, you will be familiar with NULL or FALSE or undefined or whatever else your language may call it as the canonical value for “well, that didn’t work out so well, did it?”

Let’s look at each one of these mechanisms and try to find out why they make error handling harder than it should be. The discussion will focus on C and C++, because these two languages are widely used and account for a truly horrifying number of CVEs (so some discussion on how to avoid errors may be in order). In particular we will assume that the purpose of error handling is not only to handle unexpected conditions gracefully, but also to prove that some parts of the program handle all unexpected conditions gracefully without croaking.

Sentinel values

Or, as they are more commonly known, “returns -1 and sets errno appropriately”. But it’s not just the pairing of -1 and errno: we call any value that could be a valid result of a function but isn’t (by convention) a sentinel value. Programming in C (and many languages derived from C) is full of sentinel values:

  • C strings are terminated with a NUL byte, which is a valid character, but by convention is used as the string terminator because “nobody needs a string that contains NUL”
  • malloc() and friends return NULL pointers if they can’t allocate memory, because address 0 is always invalid (obviously (hello microcontrollers!))
  • the str* and mem* family of C functions often return NULL on failure (if strstr cannot find a substring in a string, for example)
  • almost any C function that involves a file descriptor returns -1 on failure

The common theme for all of these is that an error is also a value. If you ever call open() and it fails, you will get -1 as a result. That is fine as far as signalling goes, but you can happily use that -1 later for reading or writing or unlinking or whatever you may wish. You don’t have to handle the error, you can just assume that the operation has succeeded. In C, this is an especially bad problem because it is just so easy to forget error checking and not notice. Call lseek(fd, 0, SEEK_END) to get the length of some file. That never fails if fd is valid, right? Not quite. On 32-bit systems, lseek might conceivably fail for files larger than 2 GB. In that case, lseek returns -1. If you use the return value of lseek to allocate memory (because lseek never fails, right?), your allocator will almost certainly say NULL.
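Here is a minimal sketch of the check that is so easy to skip, assuming a POSIX system. The names file_length and unchecked_length are made up for this example; only lseek itself comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: the length of the file behind fd, or nothing if
// lseek reported failure through its sentinel value. Note that this
// leaves the file offset at the end of the file.
std::optional<std::size_t> file_length(int fd) {
    const off_t end = lseek(fd, 0, SEEK_END);
    if (end == static_cast<off_t>(-1))
        return std::nullopt;  // errno holds the reason (EBADF, EOVERFLOW, ...)
    return static_cast<std::size_t>(end);
}

// What happens without the check: -1 silently becomes the largest
// possible size, which is what the allocator would then be asked for.
std::size_t unchecked_length(off_t lseek_result) {
    return static_cast<std::size_t>(lseek_result);
}
```

Nothing in the type of lseek's result forces the first version over the second. Both compile without complaint.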

Which leads us to the next kind of error indicator.

Almost-sentinel values

Like sentinel values, almost-sentinel values are error signals that have to be checked. Unlike sentinel values, almost-sentinel values are also valid results in their own right.

Examples of almost-sentinel values are legion among the f* family of IO functions in the C library. The functions always return something that makes sense as a return value. For example, fread will always return the number of items read. If an error occurs, fread might have read nothing yet, and thus return 0. If end of file occurs, fread will also return 0.

Not all is lost though, because the f* family also contains feof to check whether a file has been read to end, and ferror to check whether an error has happened. All you need to do to make your code safe is check one (or both) if you get a result you didn’t expect. Easy, right?
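As a rough sketch, a read loop that takes this advice could look like the following. The names read_status and read_all are invented for the example, and the buffer size is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

enum class read_status { ok, io_error };

// Hypothetical helper: append everything left in stream to out.
read_status read_all(std::FILE* stream, std::string& out) {
    char buffer[4096];
    for (;;) {
        const std::size_t n = std::fread(buffer, 1, sizeof buffer, stream);
        out.append(buffer, n);  // a short count still carries valid data
        if (n == sizeof buffer)
            continue;
        // Short count: the return value alone cannot say why.
        if (std::ferror(stream))
            return read_status::io_error;
        if (std::feof(stream))
            return read_status::ok;
    }
}
```

The return value of fread tells the loop how much data it got, and nothing else. The reason for a short read has to be asked for separately, every time.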

Enter readdir.

Take this excerpt from readdir(3):

If the end of the directory stream is reached, NULL is returned and errno is not changed. If an error occurs, NULL is returned and errno is set appropriately. To distinguish end of stream from an error, set errno to zero before calling readdir() and then check the value of errno if NULL is returned.

To use readdir correctly, you must always clear errno before you call it.
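Spelled out as code, that protocol might look like this sketch. It assumes a POSIX system, and count_entries is a made-up helper, not something from a library.

```cpp
#include <cassert>
#include <cerrno>
#include <dirent.h>

// Hypothetical helper: number of entries in the directory at path,
// or -1 with errno set if opening or reading the directory failed.
long count_entries(const char* path) {
    DIR* dir = opendir(path);
    if (dir == nullptr)
        return -1;

    long count = 0;
    for (;;) {
        errno = 0;  // must be cleared before every single call
        const struct dirent* entry = readdir(dir);
        if (entry == nullptr)
            break;  // end of stream or error, errno decides which
        ++count;
    }

    const int saved = errno;  // closedir may overwrite errno
    closedir(dir);
    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return count;
}
```

Leave out the errno = 0 line and the function still compiles and usually still works, which is exactly the problem.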

(Fortunately, readdir_r has none of these problems, and it is even thread-safe where readdir is not required to be. Unfortunately, readdir_r has its own problems and has been deprecated by glibc in favour of thread-safe readdir.)

Error code returns

So far we’ve only looked at the C standard library. There are, however, multitudes of other C libraries out there, for everything the C standard library doesn’t do (and everything it does do, of course). Not all of these libraries can use the errno facility the C library provides. Faced with this problem, libraries usually do one of two things:

  1. use sentinel values and an errno-like symbol or function (the mysql C connector does this, for example)
  2. return the error code directly and place results in out parameters (gnutls and others)

Since the first of these two is a case of sentinel values, it won’t be discussed further. The second is a lot more interesting and warrants closer inspection. Not because it allows for more and more precise error indications, but because it decouples error signalling from the actual values produced by any given operation.

This little change is so important that it bears repeating: error code returns decouple error indications from result values. Error indicators and result values are distinct types. This type distinction needn’t be enforced at the language level to be significant. The distinction allows for statements like x is valid if and only if err == 0, which in turn allows us to communicate this information to the compiler. A compiler using information like this could (for example) warn about uses of possibly uninitialized values for x if it cannot prove that the use occurs in a codepath that ensures that err == 0.
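A small sketch of the convention, with an invented function parse_port that is not taken from any of the libraries mentioned: the return value carries only the error indication, and the result travels through an out parameter that is written only on success.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical function: returns 0 on success or an errno-style code
// on failure. *out is valid if and only if the return value is 0.
int parse_port(const char* text, unsigned short* out) {
    char* end = nullptr;
    errno = 0;
    const long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return EINVAL;  // no digits at all, or trailing garbage
    if (errno == ERANGE || value < 0 || value > 65535)
        return ERANGE;  // a number, but not a port number
    *out = static_cast<unsigned short>(value);
    return 0;
}
```

A caller that writes `unsigned short port; if (parse_port(arg, &port) == 0) use(port);` only touches port on the path where the error code is 0, and that is the property a checker could verify.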

Unfortunately, compilers do not actually do this. Runtimes do this. They call it by another name entirely though.

Exceptions

Languages that call themselves “object oriented” usually come with a system they call exceptions. These exceptions are used to signal errors without the need for error codes, thus removing a lot of perceived clutter from code. “Removing clutter” is actually a fitting description, since a lot of code does not care which error occurred, only whether an error occurred, and if so, it usually wants to just notify another thing of the error and stop running. Exceptions allow for this by making error signalling a piece of control flow: if an error happens, just tell the runtime about the details of the error by throwing an exception. The runtime will then clean up your partial state (as best it knows how) and find someone interested in the error (or crash your program, if nobody is).

This makes exceptions a curious mixture of error signalling, error handling, and control flow. It makes exception-based error handling systems very easy to use, because forgetting error handling will not silently produce incorrect results; it will produce no results at all. If an exception is thrown that the program does not know how to handle, the program merely crashes.

What exception systems save in clutter where the exact type of error is not important, they add in clutter where it is. Take, for example, SocketException in Java. For every kind of error a program may want to handle, Java has a different type of exception. Python tries to steer a middle course by putting an error number into an exception, allowing you to catch just one exception for a number of different errors.

Contrast this with C++: where in Java (especially Java, but also Python and the others) every error condition is an exception, in C++ many error conditions are not. C++ inherits much of C, and thus also inherits the error number system, which is a large part of the reason for this. C++ also has exceptions, and some parts of the STL use them for most of their error signalling. Some parts don’t, though. In fact, the filesystem library added in C++17 (by number of operations possibly the largest single part of the STL that throws exceptions) has non-throwing overloads for everything.
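To make the two flavours concrete, here is a sketch around std::filesystem::file_size, which comes in both a throwing and a non-throwing overload. The two wrapper functions are invented for the example, and a C++17 compiler is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Throwing overload: failure leaves file_size as fs::filesystem_error.
bool size_via_exception(const fs::path& p, std::uintmax_t& size) {
    try {
        size = fs::file_size(p);
        return true;
    } catch (const fs::filesystem_error&) {
        return false;
    }
}

// Non-throwing overload: failure is reported through the error_code.
bool size_via_error_code(const fs::path& p, std::uintmax_t& size) noexcept {
    std::error_code ec;
    const std::uintmax_t result = fs::file_size(p, ec);
    if (ec)
        return false;
    size = result;
    return true;
}
```

Only the second wrapper can honestly be marked noexcept without a try block around its body.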

Upon closer inspection, almost all other instances of exceptions thrown in the STL do not merely signal some kind of error, they signal an invariant violation. Regex contains invalid character classes? Throw regex_error. Try to access a vector element beyond the end of the vector? Throw out_of_range. Call a function object that does not contain a function? Throw bad_function_call. Extract something from a stream that is at EOF? Set a bit to indicate failure.

Set a bit to indicate failure.

In C++, any error that can be handled reasonably is not thrown as an exception. Exceptions are reserved for those cases where a piece of code cannot execute because it would violate a program invariant, or because such a violation has already happened. In Java they do that too, among all other kinds of error signalling. But in C++, exceptions are primarily a tool to communicate invariant violations.
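The split is easy to observe with two standard library facilities mentioned above. This is only a sketch, and the two function names are invented for it.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <vector>

// Invariant violation: indexing past the end with at() throws.
bool at_past_the_end_throws() {
    const std::vector<int> v{1, 2, 3};
    try {
        (void)v.at(3);
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}

// Ordinary failure: extracting from an exhausted stream throws nothing
// (with the default exception mask) and only sets state bits.
bool extraction_at_eof_sets_failbit() {
    std::istringstream in("");
    int value = 0;
    in >> value;
    return in.fail() && in.eof();
}
```

Both functions return true. The first failure has to be caught, the second has to be asked about.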

Why does this make exceptions a bad error handling strategy though?

Well, it doesn’t. It does make them less-than-ideal if you need information about what can fail and what can’t though. If your program is one of those lucky few that can just print some error information and die when an error occurs, by all means use exceptions. If you want to prove that certain parts of your code cannot ever crash because you forgot to handle an error condition, do not use exceptions for those parts.

If you shouldn’t use error numbers, or error code returns, or exceptions, what should you use?

It’s two values, right?

Any operation that can fail does not just return an error when it fails and a value when it doesn’t. It returns an error if and only if it fails, and a value if and only if it doesn’t. Effectively it returns two values of distinct types, because it cannot return a single value that can cover both types—error and proper return value.

Except that it can. Sort of. C++ has union types that have exactly these semantics: each instance of a union contains a single value of exactly one of the contained types. Add a little indicator like so, and you’re done, right?

struct error_or {
	bool has_value;
	union {
		something_t value;
		error_t error;
	};
};

Almost, but not quite. You can still access the wrong member of your struct, just like you could ignore the error. Only now the results can be much worse, because the error and the value you access share a memory location, thus invoking completely undefined behaviour if you ignore the error and always read value from the struct.
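To see how little the indicator protects, here is the struct again with concrete member types filled in. The types int and std::errc, and the functions parse_digit and digit_or_minus_one, are assumptions made purely for this sketch.

```cpp
#include <cassert>
#include <system_error>

// The struct from above with concrete, trivially copyable members.
struct error_or_int {
    bool has_value;
    union {
        int value;
        std::errc error;
    };
};

// Hypothetical operation that can fail.
error_or_int parse_digit(char c) {
    error_or_int result;
    if (c >= '0' && c <= '9') {
        result.has_value = true;
        result.value = c - '0';
    } else {
        result.has_value = false;
        result.error = std::errc::invalid_argument;
    }
    return result;
}

// Correct use depends entirely on the caller remembering the check.
int digit_or_minus_one(char c) {
    const error_or_int r = parse_digit(c);
    return r.has_value ? r.value : -1;
    // `return r.value;` would compile just as well, and would read an
    // inactive union member whenever parse_digit failed.
}
```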

Enter Sum types

What we really want is some type that just does not let you access the wrong member. If your operation yielded an error, you want the compiler to just not let you access the value. Similarly, if the operation succeeded, you want the compiler to just not let you access the error.

This is not actually a new concept: functional languages have been doing this since the olden days. In these functional languages, a type that contains one value of one of many types is called a sum type. Any language inspired by functional languages usually has sum types as well. Rust, being inspired by functional programming languages, has them and calls them enum types. C has them, they call them unions. C++17 has a different formulation, here they are called variants.
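A short sketch of the C++17 formulation, using std::variant with an int and a std::error_code as the two alternatives. The alias int_or_error and both functions are invented for the example.

```cpp
#include <cassert>
#include <system_error>
#include <variant>

using int_or_error = std::variant<int, std::error_code>;

// Hypothetical operation that can fail: halves even numbers only.
int_or_error halve_even(int n) {
    if (n % 2 != 0)
        return std::make_error_code(std::errc::invalid_argument);
    return n / 2;
}

// std::get checks the active alternative and throws
// std::bad_variant_access instead of reading the wrong member.
bool wrong_access_throws(const int_or_error& v) {
    try {
        (void)std::get<int>(v);
        return false;
    } catch (const std::bad_variant_access&) {
        return true;
    }
}
```

Unlike the raw union, the variant keeps track of its own indicator. The check still happens at run time though, and a forgotten check turns into an exception rather than a compile error.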

What sets the functional languages and Rust apart from C and C++ in this list, though, is that the functional languages and Rust have matching constructs, whereas C and C++ don’t. Take the example of Rust, and the Result type there:

let res : Result<i32, i32> = ...;

match res {
	Ok(i) => i,
	Err(e) => -e,
}

The compiler will generate code that will execute the Ok branch of the match if res contains a value and the Err branch otherwise. And there is no other way to get at the contents of a Result instance. Just match. (Rust does provide helper methods that, for example, either return the value or panic and crash the program if the object does not contain a value.)

This matching is immensely powerful. Unfortunately, C++ has no matching feature built into the language, and C is right out. We can however use features of modern C++ to emulate matching: we encapsulate the union we need to hold the values, together with the indicator of which part of the union is valid, in a new type, and give this type a dispatch operation that takes a number of callable objects and just calls the right one. We might also not want to call the operation dispatch, but something else, something that describes not what it does, but what it can be used to do. Since the Rust example above takes one value, executes one of two branches, and returns one value, we may want to call it … reduce, because it reduces two branches to one?

error_or<unsigned char> e = ...;

// return the value on success, or -1 on error
return e.reduce(
	[] (int value) noexcept { return value; },
	[] (std::error_code error) noexcept { return -1; });

// as opposed to ...
if (e)
	return e.value();
else
	return -1;

The error_or class demonstrated here is available in the nu utilities library. Documentation is available at the nu documentation site.

The first variant (the call to reduce) has one decisive advantage over the second variant: the first variant can guarantee that no exceptions will be thrown by the execution of the code, while the second variant cannot. This is very useful if you use noexceptness as a guarantee that the program does not crash due to faulty error handling. It is more useful still if you use a compiler plugin to check that you haven’t messed up your noexcept.
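For illustration, here is one way such a type could be put together on top of std::variant. This is a sketch under my own assumptions and not the implementation from the nu library; the name result_sketch is invented to keep the two apart.

```cpp
#include <cassert>
#include <system_error>
#include <utility>
#include <variant>

// Illustrative only: holds either a T or a std::error_code and hands
// the contents to exactly one of two callables.
template <typename T>
class result_sketch {
public:
    result_sketch(T value) : state_(std::in_place_index<0>, std::move(value)) {}
    result_sketch(std::error_code error) : state_(std::in_place_index<1>, error) {}

    // Calls on_value if a value is held and on_error otherwise. Both
    // callables must return the same type. std::get_if never throws,
    // so reduce itself adds no exception path of its own.
    template <typename OnValue, typename OnError>
    auto reduce(OnValue&& on_value, OnError&& on_error) const {
        if (const T* value = std::get_if<0>(&state_))
            return std::forward<OnValue>(on_value)(*value);
        return std::forward<OnError>(on_error)(*std::get_if<1>(&state_));
    }

private:
    std::variant<T, std::error_code> state_;  // no public accessor on purpose
};
```

Because the variant is private and there is no value() member, the only way to the contents leads through reduce, and reduce insists on a callable for each case.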

Making error_or more useful

This interface is a bit lacking if you want to chain multiple operations together. Since the reduce operation above turns two values into one, chaining operations requires the branches of reduce to return error_or instances as well.

error_or<unsigned char> e = ...;

return e
	.reduce(
		[] (int value) { return some_operation(value); },
		[] (std::error_code error) { return error_or<int>(error); })
	.reduce(
		[] (double value) { return some_other_operation(value); },
		[] (std::error_code error) { return error_or<int>(error); })
	.reduce(
		[] (int value) { return double(value); },
		[] (std::error_code err) { return NAN; });

The first two operations applied to e in this example return another error_or instance, once with a different value type. Both operations just forward the error if e didn’t contain a value.

This pattern of calling one operation that may fail, applying another operation that may fail on the result, and applying another operation that may fail on the result … is rather common. For this reason, error_or also provides an operator for exactly this purpose.

error_or<unsigned char> e = ...;

// apply some_operation to value in e and return the result, or
// cast the error in e to the result type of some_operation.
// then do the same with some_other_operation
return (e
	% some_operation
	% some_other_operation)
	.reduce(
		[] (int value) { return double(value); },
		[] (std::error_code err) { return NAN; });
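How such a chaining operator might work underneath can be sketched in a few lines. Everything here is an assumption for illustration: the outcome type, the operator and the two operations require_even and halve are invented, and nothing is taken from the nu library.

```cpp
#include <cassert>
#include <system_error>
#include <type_traits>
#include <utility>
#include <variant>

// Illustrative stand-in for an error_or-like type.
template <typename T>
class outcome {
public:
    outcome(T value) : state_(std::in_place_index<0>, std::move(value)) {}
    outcome(std::error_code error) : state_(std::in_place_index<1>, error) {}

    template <typename OnValue, typename OnError>
    auto reduce(OnValue&& on_value, OnError&& on_error) const {
        if (const T* value = std::get_if<0>(&state_))
            return on_value(*value);
        return on_error(*std::get_if<1>(&state_));
    }

private:
    std::variant<T, std::error_code> state_;
};

// Chaining: run next on the value, or forward the error converted to
// whatever outcome type next would have produced.
template <typename T, typename F>
auto operator%(const outcome<T>& in, F&& next) {
    using result_type = std::invoke_result_t<F&, const T&>;
    return in.reduce(
        [&](const T& value) -> result_type { return next(value); },
        [](std::error_code error) -> result_type { return error; });
}

// Hypothetical fallible operations used to exercise the operator.
outcome<int> require_even(int n) {
    if (n % 2 != 0)
        return std::make_error_code(std::errc::invalid_argument);
    return n;
}

outcome<double> halve(int n) { return n / 2.0; }
```

With these definitions, `outcome<int>(8) % require_even % halve` ends up holding 4.0, while starting from 7 carries the invalid_argument error through halve untouched.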

Another common thing to do is to extract a part of the result to reuse in later processing. To make this a little easier, error_or provides another operator to apply a transformation only on the contained value, not on the whole error_or object.

error_or<unsigned char> e = ...;

// transform the contained value of e into a vector,
// creating an error_or<std::vector<double>>
// then call some_other_operation_again on the resulting object.
return (e
	/ [] (double d) { return std::vector<double>(1, d); }
	% some_other_operation_again)
	.reduce(
		[] (int value) { return double(value); },
		[] (std::error_code err) { return NAN; });

Easy, right?