How to Introduce A Critical Bug In Your Code

Posted on Mar 25, 2021

So, let me tell you folks a story about how to introduce a critical, total wrecking bug in just one line and, of course, releasing it as a bugfix release itself! Enter schain 2.0.3, the worst possible release EVER… featuring myself as lead developer! \o/

OK, this is not meant as a self-inflicted punishment session, but as a very practical reminder of how not to do things when you’re coding. I hope this serves you, my dear reader, as well as myself.

The bug consisted in schain not being able to read any file whatsoever, greeting the poor user (very probably, just me) with this awful message printed to stderr:

$ schain -v
schain 2.0.3
$ schain -c 0 this-will-fail
2021-03-25: 0 day(s).
$ schain this-will-fail 
this-will-fail: Not an schain file.

In any other version of schain, including of course, 2.0.4, this is the expected behavior:

$ schain -v
schain 2.0.4
$ schain this-will-fail 
2021-03-25: 0 day(s).

I don’t think this is the place to get into detail on what schain is meant to do and be, all you need to know here is that it is an habit tracker. It opens files in a specific format and that the $file: Not an schain file. error message should only be shown if the file cannot be parsed by the program. As you can see, I did something massively wrong somewhere in the code and… well… schain reported that it couldn’t open any file at all.

What was going on there? Let’s get into debugging! (and GDB!)1

Debugging read_file()

When I encounter a bug like this, where we see the program printing an output that isn’t valid in the executed context but is valid in another (namely, an unparseable file in this case), the way to go in my opinion is a easy as setting a breakpoint in the procedure that prints that said output. I know, this is no revolutionary technique, but, hey, it works… and maybe someone who’s reading this might benefit from this?2

So, I assume you know how to compile debugging symbols, so let’s first locate the Not an schain file. string in schain’s source code. An easy search in vis3 inside schain.c4 showed this:

static void
read_file(char *date, unsigned int *num, const char *path)
{
	FILE *fp;
	char *ptr, *endptr;
	char buf[BUF_MAX_SIZE];

	if ((fp = fopen(path, "r")) == NULL)
		die("%s: %s.", path, strerror(errno));
	fgets(buf, BUF_MAX_SIZE, fp);
	fclose(fp);

	ptr = strtok(buf, "\t");
	if (ptr == NULL)
		goto parsedie;

	strncpy(date, ptr, DATE_SIZE);

	ptr = strtok(NULL, "\n");
	if (ptr == NULL)
		goto parsedie;

	*num = strtol(ptr, &endptr, 10);
	if (endptr[0] != '\0')
		goto parsedie;

parsedie:
	die("%s: Not an schain file.", path);
}

Have you spotted the bug already? It’s actually a very easy one if you’re experienced… or if you remember a similar thing happening to a very important piece of software some years ago… But let’s pretend we haven’t found it yet… How do we proceed next?

So, we know that Not an schain file. is meant to be printed when a file cannot be parsed by the subroutine… in other words, when ptr is set to NULL because strtok() hasn’t found the next token it’s been provided.

The first thing to do is to test if the correct behavior is actually working… Why? Because we need to isolate what is really going wrong with this procedure… So, let’s feed an invalid file to schain 2.0.3 on purpose! Will it choke? Will it try to interpret it as if it was valid? Excitement in this adventure!

OK, I may be a very weird girl (well, I am), but only I thing I could think of as a file that would obvioulsy trigger a true parse error was… schain’s compiled binary itself! Yeah, weird choice, I know, but it was easy to access and already there in the same directory, so there we go. Unless schain magically had acquired readelf capabilities overnight, I expect it not to like it:

$ ./schain -v
schain 2.0.3
$ ./schain schain
schain: Not an schain file.

Awesome, read_file() is working correctly as expected when it gets an invalid file. Then the problem is only with valid schain files!

At this point, the usual suspect is the parser itself. Is it somehow not tokenizing anything right for some reason? Now it’s time to inspect how variables are behaving during execution.

We fire gdb schain (remember, schain has been compiled with debugging symbols!) and set a breakpoint (b) at read_file():5

$ gdb schain 
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from schain...
(gdb) b read_file
Breakpoint 1 at 0x169e: file schain.c, line 147.

Now, we run schain using the same (valid) file that I used in the first examples above. Execution is immediately stopped when we hit read_file():

(gdb) run this-will-fail 
Starting program: /home/ari/prog/schain/schain this-will-fail

Breakpoint 1, read_file (date=0x7fffffffe7c2 "", num=0x7fffffffe79c, path=0x7fffffffeb9d "this-will-fail") at schain.c:147
147	{

OK, there many ways to do this, like manually stepping execution line by line (a way I use quite often, to be honest)… but, let me be a fancy girl, and use a hardware watchpoint, which is just a fancy way of telling your debugger to watch out if the value of a variable is changed.

(gdb) watch ptr
Hardware watchpoint 2: ptr
(gdb) cont
Continuing.

cont stands for continue… GDB (and LLDB for that manner) allows for abridged versions of their commands! So now we’re telling the debugger to resume execution until something happens: be it catching a signal or hitting the next breakpoint or, as you read below, hitting the watchpoint we had set up for ptr. The latter is what happens in this case:

Hardware watchpoint 2: ptr

Old value = 0x0
New value = 0x7fffffffe720 "2021-03-25"
read_file (date=0x7fffffffe7c2 "", num=0x7fffffffe79c, path=0x7fffffffeb9d "this-will-fail") at schain.c:158
158		if (ptr == NULL)

OK, isn’t this interesting, huh? The first call to strtok() is indeed working! ptr is set to a value (a memory address, because it’s a pointer) and GDB is kind enough to print out strings when pointers refer to them… and we see that the date is being parsed correctly!

Is maybe the parser choking at the second field of our TSV file? Let’s figure that out! So we tell GDB to resume running schain:

(gdb) cont
Continuing.

Hardware watchpoint 2: ptr

Old value = 0x7fffffffe720 "2021-03-25"
New value = 0x7fffffffe72b "0"
read_file (date=0x7fffffffe7c2 "2021-03-25", num=0x7fffffffe79c, path=0x7fffffffeb9d "this-will-fail") at schain.c:164
164		if (ptr == NULL)
(gdb) cont
Continuing.
this-will-fail: Not an schain file.
[Inferior 1 (process 332193) exited with code 01]
(gdb) 

OH, strtok() succeeds again! ptr now points to a string "0" (yep, that’s what the file actually contains, a 0)… So… the parser is not the problem? Why are we hitting the die() statement?

What’s going on!?

git is everyone’s friend

I know it’s easy to tell what that bug is, but I’m trying here to put myself in the shoes of someone who’s starting out programming or debugging C code… I mean, I’ve been there. If I didn’t catch the bug immediately and made it get into a stable release, well… you’ll read in a moment why that was… and timing actually played a very bad part in all of this.

However, let’s keep solving this mystery, OK? C’mon, this is what makes coding so much fun (in a frustrating way sometimes, I know).

I recalled that schain 2.0.2 didn’t have this bug. I mean, I use this software almost6 every day. So, the conclusion was clear: the nasty bug got introduced during the development of schain 2.0.3…

Without getting into details that can be easily worked out by looking at the repo itself, the release of schain 2.0.3 dealt with fixing what happened when schain was fed a directory, which is an invalid type of file for schain. At least in schain 2.0.2 (I haven’t researched if this also happened in earlier versions), invalid files crashed schain with a SEGFAULT because that very same ptr variable got dereferenced when set as NULL.

In fact, the "%s: Not an schain file." string was one of the changes that helped in fixing that bug. I had tested that it worked for the intended scenario. So, we must look for the commit that introduced that string.

This is a task that I actually did in a different way, but for this post I wanted to showcase a couple of git’s most powerful tools it includes.

Let’s start with the infamously Torvaldsian git blame. The blame command tells you which commit every line in a source file was last modified. We already know that this is related to read_file() in schain.c, so…

$ git blame -L '/^read_file/,/^}/' schain.c
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 146) read_file(char *date, unsigned int *num, const char *path)
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 147) {
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 148)   FILE *fp;
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 149)   char *ptr, *endptr;
fcd9ed3d (Ariadna Vigo 2020-11-02 09:32:44 +0100 150)   char buf[BUF_MAX_SIZE];
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 151) 
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 152)   if ((fp = fopen(path, "r")) == NULL)
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 153)           die("%s: %s.", path, strerror(errno));
fcd9ed3d (Ariadna Vigo 2020-11-02 09:32:44 +0100 154)   fgets(buf, BUF_MAX_SIZE, fp);
fcd9ed3d (Ariadna Vigo 2020-11-02 09:32:44 +0100 155)   fclose(fp);
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 156) 
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 157)   ptr = strtok(buf, "\t");
ad72d0d6 (Ariadna Vigo 2021-03-23 22:38:42 +0100 158)   if (ptr == NULL)
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 159)           goto parsedie;
ad72d0d6 (Ariadna Vigo 2021-03-23 22:38:42 +0100 160) 
ad72d0d6 (Ariadna Vigo 2021-03-23 22:38:42 +0100 161)   strncpy(date, ptr, DATE_SIZE);
8c668c71 (Ariadna Vigo 2021-03-17 02:03:26 +0100 162) 
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 163)   ptr = strtok(NULL, "\n");
ad72d0d6 (Ariadna Vigo 2021-03-23 22:38:42 +0100 164)   if (ptr == NULL)
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 165)           goto parsedie;
ad72d0d6 (Ariadna Vigo 2021-03-23 22:38:42 +0100 166) 
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 167)   *num = strtol(ptr, &endptr, 10);
12d364e7 (Ariadna Vigo 2021-02-26 21:12:23 +0100 168)   if (endptr[0] != '\0')
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 169)           goto parsedie;
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 170) 
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 171) parsedie:
fbb573ec (Ariadna Vigo 2021-03-23 23:30:06 +0100 172)   die("%s: Not an schain file.", path);
^8c270ac (Ariadna Vigo 2020-10-06 23:02:22 +0200 173) }

I know, regexp’s can seem daunting (they’re hard… let’s face it), but the ones we’re passing to the -L option are super simple and benefit from the BSD/Plan9 kind of coding style that I’ve learned to use and appreciate… I always start the definition of a procedure in a newline after declaring its type. So we can easily search for any procedure by looking for the range defined by /^procedure/,/^}/… In non-esoteric language, “search for a line that starts with procedure and walk through until you get to a line that starts with a } (which by formatting will be the one closing the subroutine… all other closing brackets have one or more \t characters in front.)

See how all references to the parsedie label were introduced by commit fbb573ec? We’re closing in! So… let’s take a look at the patch that commit applied to the source! To see the diff for a specific commit against its parent(s), this is the way to do it!

$ git diff fbb573ec^! schain.c

OK, so the changes relevant to our little mystery are these:

diff --git a/schain.c b/schain.c
index 169833e..9142107 100644
--- a/schain.c
+++ b/schain.c
@@ -142,7 +142,7 @@ is_date(const char *str)
 		return 0;
 }
 
-static int
+static void
 read_file(char *date, unsigned int *num, const char *path)
 {
 	FILE *fp;
@@ -150,39 +150,39 @@ read_file(char *date, unsigned int *num, const char *path)
 	char buf[BUF_MAX_SIZE];
 
 	if ((fp = fopen(path, "r")) == NULL)
-		return -1;
+		die("%s: %s.", path, strerror(errno));
 	fgets(buf, BUF_MAX_SIZE, fp);
 	fclose(fp);
 
 	ptr = strtok(buf, "\t");
 	if (ptr == NULL)
-		return -1;
+		goto parsedie;
 
 	strncpy(date, ptr, DATE_SIZE);
 
 	ptr = strtok(NULL, "\n");
 	if (ptr == NULL)
-		return -1;
+		goto parsedie;
 
 	*num = strtol(ptr, &endptr, 10);
 	if (endptr[0] != '\0')
-		return -1;
+		goto parsedie;
 
-	return 0;
+parsedie:
+	die("%s: Not an schain file.", path);
 }

Voilà! Do you see it now? It’s blatantly obvious now, isn’t it?

So, read_file() had previously been a procedure that returned an integer that acted as a sort of catch-all error code that was handled by the caller… badly. It was a very poor way to deal with that, so the fix was to make read_file() handle errors itself as a grown adult should and become a selfish void non-returning procedure. All return -1; statements were correctly rewritten as goto parsedie;7… But return 0;… ehm… well… ohnoes…

A label is just that, a label… not an embedded subroutine of any sorts like some languages do support. It doesn’t create any lambda closure… it’s just… like a line number in BASIC: a tag with no effect over flow control unless when you goto it. When the parsing algorithm succeeded, it then continued straingth into the die() call…

Yeah, I picture it like my code happily prancing to its literal fatal death().

Just for completeness sake, here’s the diff that fixes the bug:

diff --git a/schain.c b/schain.c
index 9142107..bee44d5 100644
--- a/schain.c
+++ b/schain.c
@@ -168,6 +168,8 @@ read_file(char *date, unsigned int *num, const char *path)
        if (endptr[0] != '\0')
                goto parsedie;
 
+       return;
+
 parsedie:
        die("%s: Not an schain file.", path);
 }

And that, folks, wraps it for today’s very long post… Ehm… nope, it doesn’t. It’s me, Ariadna: of course is not over! Permission granted to hate me! (I don’t care!)

Commit to the code you push!

This all happened because I rushed to fix, commit, and push a minor bug. I just had tested that that fix worked, but I didn’t even bother to test whether it had affected normal operations of schain.

Also, the mere coincidence that I had forgotten to check into my habit trackers that day paved the way for me not immediately noticing that something was awfully wrong. I noticed the bug only the next morning because I remembered to update my trackers… only to see my poor thing schain fail miserably…

And I had literally 5 minutes left to join a Zoom meeting. So the fix had to wait until afternoon.

I guess schain has no users out there besides me. Seriously, I only know one friend of mine who uses minitimer and some other fellow developers who have checked out my code, but don’t think they use any of it. (If you do, say hi!) That’s perfectly fine! My code is… well… amateur, but in the best sense of that word. It’s not pro code and I know that.

However, if you’re developing in the open… hey… I know I’m not Red Hat or Debian, organizations whose software is used by millions of users… but my code may potentially be used by someone… Without acting paranoid, I do believe that I should act as if there were a couple of (potential) users of code I publish advertised to be stable.

And you know what? I wrote and released that horrible code after a bad day, at night… maybe not that late, having a look at the timestamps… but hey, I was tired. Really tired. And I just went out and push the code.

The lesson? Commit to the code you push… puns intended. That means that if you’re not in a good place or you need to rest, leave that development branch unmerged,8 go to bed… next day you’ll see things way better.

In fact, this very same post sits as a Hugo draft in its own git branch right now. Now I’ll commit the draft, make me some dinner, rest my eyes and brains, remove my makeup, and follow through my night personal hygiene routine… before cleaning this post a bit up and pushing it up into the wide worldly web…

Sounds like a plan, doesn’t it?


  1. Actually, I sort of like LLDB more and I know it’s fully compatible with GCC’s GDB symbol table format, but… I guess I’m just too used to GDB to switch to LLDB… Oh, the laziness! ↩︎

  2. Hm… awesome, I write a blog and have no clue of who my target audience is! ↩︎

  3. Well, grep’ing would also be fine… but well… I knew I was going to edit this later, you know… ↩︎

  4. One of the key things I always do in my projects is to do all output facing the user (be it to stdout or stderr) in the main module. All other modules are thought as “backends” or “libraries” of some sort. Again, this not revolutionary: it’s good practice known since forever. ↩︎

  5. Gotta love how verbose GNU software is. Aren’t they cute… (my goodness, what happened to the Rule of Silence?) OK, back to work. ↩︎

  6. This has something to do with why I didn’t catch this bug earlier… ↩︎

  7. And yes, gotos do have valid uses like this one! In this case, to avoid repeating exactly the same call. ↩︎

  8. Fun fact is that I might have worked on the fatal commit directly on the master branch… I sometimes do that out of a very bad habit, albeit some people defend that approach. ↩︎