Pitfalls in C and C++: Unsigned types

Part of an occasional series of posts discussing sometimes maddening aspects of sound and music research work and software development.

Despite the antiquity of the C and C++ programming languages and the relative difficulty of writing reliable code in either, audio researchers sometimes have a need to write or port code into these languages—either for speed, or for compatibility with a wider ecosystem such as Vamp plugins.

Here we talk about one of the common pitfalls in this process: unsigned types.

C and C++ are unusual amongst languages nowadays in making a distinction between signed and unsigned integers. An int is signed by default, meaning it can represent both positive and negative values. An unsigned is an integer that can never be negative. If you take an unsigned 0 and subtract 1 from it, the result wraps around, leaving a very large number (2^32-1 with the typical 32-bit integer size).
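As a minimal sketch of that wrap-around (the function names here are made up for illustration), compare the same subtraction done in each type:

```cpp
#include <climits>

// Subtract one using unsigned arithmetic. Unsigned arithmetic is modular,
// so decrementing zero wraps around to the largest representable value.
unsigned decrement_unsigned(unsigned n) {
    return n - 1;
}

// The same operation on a signed int simply goes negative.
int decrement_signed(int n) {
    return n - 1;
}
```

Calling decrement_unsigned(0) yields UINT_MAX, which is 4294967295 where unsigned is 32 bits wide, while decrement_signed(0) yields -1.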

Intuitively the two types seem to map fairly reasonably to mathematical notions of integers and natural numbers, leading many programmers to choose unsigned types for any values that "feel" like they should never be negative, such as loop indices or measurements of size. Unfortunately, this is not a reliable intuition.

When should I use unsigned?

Never. There, that was easy—can we go home now?

Of course, it's not quite that simple.

You should use unsigned values whenever you are dealing with bit values, i.e. direct representations of the contents of memory; or when doing manipulations such as bit masking or shifting on data, for example when writing low-level code to read binary file formats such as audio files; or if you happen to be doing work such as embedded programming where type sizes and alignments really matter.
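To illustrate the kind of low-level code where unsigned types do belong, here is a hedged sketch of assembling a 32-bit little-endian field from raw bytes, as you might when parsing a chunk size in an audio file header. The helper name read_u32_le is hypothetical and not taken from any particular library.

```cpp
#include <cstdint>

// Assemble a 32-bit little-endian value from four raw bytes.
// Each byte is widened to an unsigned 32-bit type before shifting, so the
// shift by 24 can never spill into the sign bit of a promoted int.
std::uint32_t read_u32_le(const unsigned char *bytes) {
    return  static_cast<std::uint32_t>(bytes[0])
         | (static_cast<std::uint32_t>(bytes[1]) << 8)
         | (static_cast<std::uint32_t>(bytes[2]) << 16)
         | (static_cast<std::uint32_t>(bytes[3]) << 24);
}
```

Here the value really is a bit pattern rather than a count, so modular unsigned behaviour is what you want.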

But stick to plain signed integers otherwise. You'll avoid a whole class of common problems.

Why?

The single most common cause of errors in C and C++ audio analysis code I get to review is arithmetic underflow on unsigned int.

Here's a typical example.

    for (unsigned i = 0; i < nlengths; i++) {
        unsigned candidates = total - lengths[i] + 1;
        for (unsigned j = 0; j < candidates; j++) {
            // use j as an array index
        }
    }

When I ran this code on a very short audio file, total came out as zero and lengths[i] as 28. The calculation of candidates then underflowed, coming out somewhere in the region of four billion.

But if you simply replace unsigned by int throughout, including in the definitions of the lengths array and total count, the code magically becomes correct.
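A sketch of the signed version, wrapped in a hypothetical function (count_candidates is my name for it, and the body of the inner loop is reduced to a counter) so that the effect is easy to check:

```cpp
#include <vector>

// Count how many inner-loop iterations the loop from the text performs
// when every quantity is a plain int.
long count_candidates(int total, const std::vector<int> &lengths) {
    long visited = 0;
    int nlengths = static_cast<int>(lengths.size());
    for (int i = 0; i < nlengths; i++) {
        int candidates = total - lengths[i] + 1; // negative if total is too small
        for (int j = 0; j < candidates; j++) {
            ++visited; // j would be used as an array index here
        }
    }
    return visited;
}
```

With total at 0 and a single length of 28, candidates is -27, the inner loop body never runs, and the function returns 0.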

You could say that the programmer should simply have checked that total was big enough before doing the calculation. Well, yes, but they didn't, and the programmer of this example was far from stupid or inexperienced. This kind of thing happens all the time: I've done it myself and I'm sure most other C and C++ programmers have too. (Though some of them probably don't know it yet!)

And there is no advantage to using unsigneds here. There's nothing about a 32-bit unsigned integer type that makes it intrinsically appropriate for representing loop indices or counts of things. It doesn't model the behaviour of natural numbers any better than a signed int does. Natural numbers don't wrap around!

Things would be different if int and unsigned were truly separate types, so that the extra information in the type name actually helped the compiler ensure that the types were right. But that isn't the case: if you write a function with a signature like

    void f(unsigned int count) { }

and call it with

    f(-1);

the compiler will, with default settings, just pass some gigantic number into the function without even a warning. So you can't use unsigned to enforce any useful restrictions either.
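The following sketch makes the point concrete. Both function names are hypothetical, and the checked variant is only one possible way to validate a count; it is shown to contrast what each signature lets the callee see.

```cpp
#include <climits>
#include <stdexcept>

// Declared as taking an unsigned count. A caller's -1 is silently
// converted, so the callee only ever sees a huge positive value.
unsigned take_count(unsigned count) {
    return count;
}

// With a signed parameter, the callee can at least detect the mistake.
int take_count_checked(int count) {
    if (count < 0) {
        throw std::invalid_argument("count must not be negative");
    }
    return count;
}
```

Here take_count(-1) returns UINT_MAX, whereas take_count_checked(-1) throws, because the negative value survives long enough to be tested.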

Common counterarguments

“You have to deal with unsigneds anyway, because library functions use them.”

This is probably the strongest counterargument. Every time you call std::vector::size(), you get an unsigned value back. Code like

    if (i < v.size()-1) {

will underflow if v is empty. So I'm afraid you can't just forget about the question, even if you never use unsigned yourself.
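A small sketch of that failure and one way around it, assuming a vector of float samples; has_next_buggy and has_next_safe are names invented for this example.

```cpp
#include <cstddef>
#include <vector>

// The pattern from the text. For an empty vector, v.size() - 1 wraps to
// the largest size_t value, so the test passes for practically any i.
bool has_next_buggy(const std::vector<float> &v, std::size_t i) {
    return i < v.size() - 1;
}

// Rearranged so that no subtraction takes place. Assumes i + 1 does not
// itself wrap, which holds for any realistic index.
bool has_next_safe(const std::vector<float> &v, std::size_t i) {
    return i + 1 < v.size();
}
```

For an empty vector and i equal to 0, the first version claims there is a next element and the second correctly says there is not. For non-empty vectors the two agree.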

“If the standard libraries use unsigned, doesn't that mean it's good practice?”

There are plenty of situations in which using unsigned is good practice: much embedded or low-level software development, for example. The standard library doesn't guess what you're up to; instead it just uses the type most closely aligned with the category of operation it's doing. And when it's dealing with sizes and allocations, that category is "memory", which means it uses unsigned values.

Note that there are also well-designed libraries that avoid unsigned types for sizes and indices (for example Qt4). And of course, many languages more recent than C have been designed without unsigned types at all.

“Unsigned has a bigger range.”

Only a bit bigger. If that's an issue, you probably need a 64-bit type anyway.

“Unsigned overflow and arithmetic are better defined.”

True. In particular, signed overflow and underflow are completely undefined in C (INT_MAX + 1 could be anything); division and remainder on negative numbers were implementation-defined until C99 and C++11; and bit shifting of negative signed integers is implementation-defined or undefined, depending on the direction of the shift and the language version. My contention is that these limitations cause fewer problems in practice than unintended unsigned underflow. A large fraction of the code I see in day-to-day work could be made more reliable by globally replacing “unsigned” or “unsigned int” with “int” with no side-effects at all.

“Unsigneds are faster.”

This idea is explored, for example, in an article aimed at embedded software developers. In a workstation environment, however, the opposite is often true. Precisely because signed overflow is undefined, compilers are allowed to make optimisations that assume that signed integers cannot overflow, and this enables a whole class of loop and vector optimisations that simply aren't possible with unsigned integers. Not that it'll make much difference to your code either way, but it's a nice thought.

Summary

Avoid using unsigned ints in C and C++ unless you're dealing with raw memory contents or performing bit manipulations like shifting or masking.
Turn on compiler warnings, so that the compiler can tell you if you're mixing signeds with unsigned values from library code: resolve those cases by casting the unsigneds to signed integers, not vice versa.
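As a sketch of the second point, here is a loop over neighbouring samples in which the unsigned result of size() is cast to int once, up front. The function name sum_of_steps is hypothetical, and the code assumes the vector holds fewer than INT_MAX elements. Warnings such as -Wsign-compare in GCC and Clang are the sort that flag the mixed comparison you would get without the cast.

```cpp
#include <vector>

// Sum the differences between neighbouring samples using a signed index.
double sum_of_steps(const std::vector<double> &v) {
    int n = static_cast<int>(v.size()); // cast the unsigned size to signed
    double total = 0.0;
    for (int i = 0; i < n - 1; i++) {   // n - 1 is -1 for an empty vector
        total += v[i + 1] - v[i];
    }
    return total;
}
```

Because n is signed, an empty vector gives a loop bound of -1 and the body is skipped, instead of running off the end of the data.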

—Chris Cannam
