size_t and Other Assorted C Data Types

Posted on Sep 12, 2021

This is sort of a rant? Well, it is to the extent one can seriously rant about a data type in C. Maybe not a rant, but a collection of thoughts, experiences, and caveats about a data type that, in my view, is a double-edged sword like no other in C: size_t.

“Ariadna…?”

“Yes, sweetie?”

“But isn’t size_t a data type that guarantees storing sizes of memory objects in a way that is coherent with your architecture?”

“Yup”

“So, where’s the issue?”

“Let me write a whole blog post about it!”

The Origin

First of all, let’s get things straight. size_t is a standard data type, defined in stddef.h. It’s as much a part of C as your int, char, float, etc. This is important to remember because, yes, originally only the compiler-native data types were in K&R C, but with the ANSI standardization and the later ISO standardization of C, some new data types were defined in headers, usually implemented as some form of typedef. C99, for instance, introduced the absolutely wonderful fixed-width data types that you should be using: int64_t, uint32_t, uint64_t, and friends. size_t is actually older than that: it came in with the ANSI standard (C89), together with stddef.h itself. It may have been available in some set of extensions even before that, but I’m strictly speaking about the standards here.

OK, why do we need size_t? Because of CPU architectures. I’m not aware of C ever being used on 8-bit computers in their day,1 but C has witnessed the progression from 16-bit to 64-bit CPUs on the desktop… The DEC PDP-11, which the first C compiler and UNIX were written on,2 was a 16-bit computer way back in the early 70s.

This had a couple of funny repercussions during the 32-bit era which were happily fixed when 64 bits became dominant. You should be aware (and if you’re not, I’m telling you now) that the compiler-native data types in C do not have a fixed size in bytes, not even within the same platform. From a standards POV, not even char is defined as “8 bits,” as people blindly love to assume. There’s a reason for this; I’ll explain later. So, for example, int is just a minimum of 16 bits, exactly like short, and long is defined as just a minimum of 32 bits. And for some reason, during the 32-bit era, compilers like GCC decided that int and long int were both going to be 32 bits… limits.h still shows this, by the way, if you wanna check my sources.

And nowadays, GCC defines int as 32 bits and long int as 64 bits if you’re on a 64-bit CPU. That might seem strange, as we tend to see int as the “native” size of a computer… but that’s just because it’s way easier to teach people that C types have some “meaning” like data types in other languages do. The dire truth is… all there is in C is numerical data types with different platform-dependent storage sizes.
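
If you want to check this on your own machine, a tiny sketch like this will do; the exact numbers depend on your platform and compiler, I’m only assuming a hosted C99 environment:

#include <stdio.h>

int
main(void)
{
	/* The exact numbers are platform- and compiler-dependent; on a
	 * typical 64-bit GCC/Linux setup this prints 2, 4 and 8, but the
	 * standard only promises the minimums discussed above. */
	printf("short:    %zu bytes\n", sizeof(short));
	printf("int:      %zu bytes\n", sizeof(int));
	printf("long int: %zu bytes\n", sizeof(long int));

	return 0;
}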

These fuzzy definitions of data types are one of C’s strengths. Imagine you wrote a program for a 32-bit platform but you want it ported to a 64-bit platform. Unless your program needs precise sizes for its variables, chances are very high that you’ll be just fine recompiling the whole thing without tweaking (almost) anything. Porting to a higher-bit platform, which is what usually happens because of progress, is almost trivial. Porting to a lower-bit platform might need some caution, but it’s also easy. It’s just the critical cases, like your variables actually hitting the data type’s upper limit, that will require your attention when doing this. Everything else will just work.

Of course, there are times when you need a 64-bit or an 8-bit integer type. For that, the types in stdint.h are your friends.
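
A quick sketch of those guaranteed-width types in action; the casts in the printf() call are just the lazy portable route, inttypes.h has dedicated PRIu8/PRId64 macros if you prefer:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint8_t small = 0xFF;    /* exactly 8 bits, guaranteed  */
	int64_t big = INT64_MIN; /* exactly 64 bits, guaranteed */

	/* long long is at least 64 bits, so the cast below is safe. */
	printf("small = %u, big = %lld\n", (unsigned)small, (long long)big);

	return 0;
}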

OK, now back to size_t. This data type is defined in stddef.h and is specified as the integer type of the result of the sizeof operator, meaning it must be able to hold the size of whatever addressable object you might store in memory. Don’t even bother tracing the real type definition in glibc, because it’s a mess (oh, how strange in a GNU project!) In MUSL it’s easier: it’s defined as an unsigned _Addr (and _Addr is just a long) in bits/alltypes.h, which gets included into stddef.h. Which makes sense, because you want 64 bits on a 64-bit system and 32 bits on a 32-bit system, right? And it’s unsigned because… negative sizes don’t make any sense.3
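
You can see that correspondence for yourself with a minimal sketch like this one; note that size_t matching the pointer width is what you’ll observe on typical platforms, not an ironclad guarantee of the standard:

#include <stddef.h>
#include <stdio.h>

int
main(void)
{
	/* On typical platforms size_t is as wide as a pointer: both
	 * lines print 8 on a 64-bit system and 4 on a 32-bit one. */
	printf("sizeof(size_t) = %zu\n", sizeof(size_t));
	printf("sizeof(void *) = %zu\n", sizeof(void *));

	return 0;
}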

This makes a ton of sense. So the plan is to use size_t every time you want to represent the size of an object. Some of the functions in the C Standard Library use size_t for exactly this. Most notably, the I/O functions that take a number of bytes to be read or written (e.g. snprintf(), fread(), etc.) and some of the string-related functions like strncpy(), memset(),4 etc. It’s also the type returned by strlen(), for example.

If so, what’s the problem?

int Vs. size_t

The natural type choice for everything is int by default. For example, counters in loops. You could use other data types… yes… you can use a char if you know the range of your counter will fit in it… And on the other hand, using pointers as iterators in loops is a very, very well-known optimization in some circumstances. Yet for your usual loop, int i; is probably what you’re gonna use.

Now let’s take a look at this utterly pointless minimal example:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int i;
	size_t len;
	const char txt[] = "Oh, this is pointless!";

	for (i = 0, len = strlen(txt); i < len; ++i)
		putchar(txt[i]);

	putchar('\n');

	return 0;
}

If we compile this the way I usually compile things, because it’s the best way to compile things in C… this is what we get:

$ gcc -std=c99 -Wall -Wextra -Wpedantic -D_POSIX_C_SOURCE=200809L -o sizet sizet.c
sizet.c: In function 'main':
sizet.c:12:42: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
   12 |         for (i = 0, len = strlen(txt); i < len; ++i)
      |                                          ^

The check for comparisons between integers of different signedness is very useful. Please don’t blame the compiler for it: it catches nasty bugs that can be hard to debug otherwise. Comparing two data types of different signedness is dangerous because… well… programming is always a matter of trade-offs, isn’t it? Unsigned variables give you a wider range of positive values (one extra bit of magnitude), at the cost of losing the sign. So, when set to a negative value, unsigned variables just silently wrap around… and when you compare an unsigned variable against a signed one, the signed operand gets converted to unsigned, so you might get totally skewed results if it happens to be negative.
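
Here’s a minimal sketch of that skew in action; compile it with -Wsign-compare and the compiler will rightly complain, but it builds and runs all the same:

#include <stdio.h>

int
main(void)
{
	int n = -1;
	unsigned int u = 1;

	/* The usual arithmetic conversions turn n into an unsigned int,
	 * so -1 silently becomes UINT_MAX and the comparison goes the
	 * "wrong" way. */
	if (n < u)
		puts("what you'd expect");
	else
		puts("surprise: -1 is NOT less than 1 here");

	return 0;
}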

So how do we fix the code above? A couple of alternatives:

  1. Change i to size_t.
  2. Change len to int, despite strlen() returning size_t.
  3. Not change the declarations, but cast len down to int at the comparison.

Let’s discard the obviously harmful one, i.e. number 3. Type casting is something that should only be used if you really know what you’re doing and why. Now, be aware that number 2 isn’t much different, but at least you’re forcing a consistent use of the variable by making it an int right from the start. Number 1 seems the most direct approach, and it also happens to store the result of strlen() in the type it’s meant for.
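
For the record, this is what option number 1 looks like applied to the example (same program, just with i declared as a size_t):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	size_t i, len;
	const char txt[] = "Oh, this is pointless!";

	/* Both sides of i < len are size_t now, so -Wsign-compare has
	 * nothing to complain about. */
	for (i = 0, len = strlen(txt); i < len; ++i)
		putchar(txt[i]);

	putchar('\n');

	return 0;
}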

All three work fine. The compiler will be happy if you apply any of the three. And the code will work in all three cases. What’s your choice? In a trivial example like this it’s hard to choose one… because there are no further implications to the choice…

But imagine the text was user input… What do we do? That user input might be bigger than what int can hold… Then we must go Route 1, because who knows… But what if we need a way to return an error condition, and we decide that the subroutine will return either the number of characters read or a negative number in case of an error? This is a typical interface in C… In that case, your subroutine must return int so it can hand back either a negative number or i itself when it’s done iterating through the string… But then you’re halving the storage capacity of i… What if the size overflows i?
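
A minimal sketch of that typical interface; consume_text() and its -1-on-error convention are made up here for illustration:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical interface: returns the number of characters consumed,
 * or -1 on error. Because the return type is int, any count above
 * INT_MAX simply can't be represented -- exactly the tension described
 * above. */
static int
consume_text(const char *txt, size_t len)
{
	size_t i;

	if (txt == NULL)
		return -1;

	for (i = 0; i < len; ++i)
		putchar(txt[i]);

	return (int)i; /* silently wrong if i > INT_MAX */
}

int
main(void)
{
	const char txt[] = "pretend this came from user input\n";

	if (consume_text(txt, sizeof(txt) - 1) < 0)
		return 1;

	return 0;
}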

No, please, ssize_t isn’t the solution, even if it’s a standard (POSIX) type. It still halves the storage… it’s just that each half will probably be bigger than with int. You still get the signedness issue, plus other problems: printing it with %zd (i.e., the length modifier for size_t applied to a signed conversion), although it usually works, isn’t guaranteed by the C standard, because ssize_t isn’t required to be the signed counterpart of size_t. Manual page system_data_types(7) even tells you to cast any ssize_t variable to intmax_t first if you’re planning to print it via printf(). So portability issues might arise from using ssize_t if you’re not careful…
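
Following that advice, printing an ssize_t portably looks something like this (a minimal sketch, assuming a POSIX system for ssize_t itself):

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h> /* ssize_t (POSIX) */

int
main(void)
{
	ssize_t n = -1; /* e.g. an error return from read() */

	/* Go through intmax_t and %jd instead of relying on %zd, which
	 * the C standard doesn't strictly promise to fit ssize_t. */
	printf("n = %jd\n", (intmax_t)n);

	return 0;
}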

I just can’t tell you how many times I’ve banged my head against my desk trying to decide whether a variable or a return type should be int or size_t in my projects. It’s one of those decisions you make… almost out of instinct? And then the compiler bites back with some warning… or worse, under some conditions the compiler isn’t able to detect the problem and you find out there’s an underflow… quite a few commits later!

Is this a flaw in C99? No. This is just how computers work. And when you’re using C, you’re exposed to this kind of problem. Signedness and unsignedness are a challenge at the hardware level. A computer can’t tell whether a number like 0b11011010 is 218 or -38 in decimal without context. Is the most significant bit a sign bit or just part of the number? In fact, the way computers operate is by assuming the programmer knows how they want the number to be interpreted.
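
A minimal sketch of that ambiguity: the same bit pattern read through two different types (the signed result assumes the usual two’s-complement representation):

#include <stdio.h>

int
main(void)
{
	unsigned char raw = 0xDA; /* 0b11011010 */

	/* Same bit pattern, two readings: it's the type, not the
	 * hardware, that decides what the top bit means. */
	printf("as unsigned: %u\n", (unsigned)raw);         /* 218 */
	printf("as signed:   %d\n", (int)(signed char)raw); /* -38 */

	return 0;
}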

That’s why you can do very, very weird tricks with char variables that go way beyond 0x7F (127) to encode UTF-8: even if the bytes will be seen as “negative” (e.g. 0xC0, which will be treated as -64 if you store it in a signed char), it works anyway. Yup, I’m learning to encode UTF-8 from scratch! (Not that hard, to be honest.)
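
For the curious, here’s a minimal sketch of the two-byte case, hardcoding U+00F1 (ñ) for illustration:

#include <stdio.h>

int
main(void)
{
	/* Encode U+00F1 as two UTF-8 bytes: 110xxxxx 10xxxxxx. Both
	 * bytes land above 0x7F, so through a signed char lens they'd
	 * look "negative"; the bit patterns still come out right. */
	unsigned int cp = 0x00F1;
	char utf8[3];

	utf8[0] = (char)(0xC0 | (cp >> 6));   /* 0xC3 */
	utf8[1] = (char)(0x80 | (cp & 0x3F)); /* 0xB1 */
	utf8[2] = '\0';

	fputs(utf8, stdout);
	putchar('\n');

	return 0;
}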

As Some Kind of Conclusion

I feel like size_t shows how careful you must be when designing your interfaces and giving meaning to C data types in the context of your own code… because they lack any meaning by design. They’re meant to be blank templates you mold into what you need them to be. They aren’t anything like C++ STL types, which come with a whole bunch of associated methods that give them some meaning beyond being a “bunch of bytes, minimum X, maximum Y.”

In code, meaning comes from all the interactions within the code you’re writing. You might be cleverly using size_t in some way that makes total sense because you’ve spiced it up with some routine that enhances the type and gently pushes it toward what you want it to be. Of course, this is super well-known when it comes to user-defined types, but I wanted to stress that it happens with primitive data types as well.

And sometimes there are silly, innocent mistakes. Like the Linux syscall read() returning an ssize_t and me thinking it returned a plain ol' int, because I just translated the register order from the x64 ABI in ASM into C and never bothered to check what was meant to be stored in the rax register after return. Not a big deal if you know the dimensions involved won’t cause you any trouble, but as soon as you learn this fact, like I did just some hours ago, what do you do? Do you change all the variables where you store the return value of read() to be strictly API-correct? Or do you just trust that the context surrounding those couple of syscalls makes it highly unlikely to hit an overflow?
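
For reference, the strictly API-correct version looks something like this (a minimal sketch, assuming a POSIX system):

#include <stdio.h>
#include <unistd.h> /* read(), ssize_t (POSIX) */

int
main(void)
{
	char buf[256];
	ssize_t n; /* not int: read() reports errors as -1 and may
	            * legitimately return counts an int can't hold */

	while ((n = read(0, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, (size_t)n, stdout);

	return n < 0 ? 1 : 0;
}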

These are the kinds of challenges that make me love working at the level of C. This is just how my brain is built. I find these little details to be so fascinating, and at the core of how these things we call computers work… And that’s how you end up writing a long post about just one data type in C! 🤣


  1. I have heard, though, of people using C nowadays in their retrocomputing projects. ↩︎

  2. Actually, UNIX started on a PDP-7, but that platform quickly proved to be too limiting, so our heroes Thompson and Ritchie switched to a PDP-11… and the wish to write UNIX in something higher-level than cumbersome machine language is what led Ritchie to create C! ↩︎

  3. However, the totally baffling signed ssize_t exists, and I don’t even know why. ↩︎

  4. Please, I never do this, but this time I must: avoid strncpy() like the plague, as it doesn’t guarantee null-termination of the target string. OpenBSD’s strlcpy() is way safer… and you can easily grab it from almost anywhere (like any of my projects that use it) and slap it into your code without any hassle, both because it’s a tiny module and because of its licensing. I only cite strncpy() because it’s in the C Standard Library and serves the purpose of illustration. ↩︎