No Smart Conditionals in Shell

Posted on Jan 14, 2022

Oh girl, this one is gonna be a ride! It’s been quite some time since I last posted a more technical post on here. It’s about time to tackle some POSIX shell stuff, because… it matters… a lot, I might say. Today’s topic is conditionals and, most importantly, how not to use “smart” shortcuts in your code.

First a Primer on Conditions in POSIX Shell

So, let’s say you’ve never coded in POSIX shell[1] before. How do you write a conditional? An if statement is what you’re thinking of, coming from other programming languages, right? Yeah, that’s right, if is a statement you find in POSIX sh:

if CONDITION1; then
	...
elif CONDITION2; then
	...
else
	...
fi

Of course elif (else if) and else are optional, as you might know from other languages. So far, everything looks super normal, right?

Nope. Shell if has a very strange peculiarity in it… You know in C you just write an if block like this, right?

if (variable == '1') {
	...
}

In C there are comparison operators available for you, such as ==, <, >, <=, >=, !=. These are part of the C language itself and this is the case in almost every single programming language in existence.

But not in POSIX shell. Nope. POSIX shell has no native comparison operators. This fails with an error:

$ cat cond.sh
#!/bin/sh

var='oh'

if "$var" == 'OH'; then
	echo 'Ah!'
fi
$ ./cond.sh
./cond.sh: line 5: oh: command not found

Why is shell trying to run the contents of the $var variable? What’s going on here? And no, don’t try different operators, or using = instead of == (as if this were BASIC), because, honey, that’s not the problem. The problem is that I actually misled you (Sorry! It was for educational purposes!). The real syntax of if, as defined by the POSIX specification, looks like this:

if COMMAND; then
    ...
fi

OK, the specs actually don’t call the element after the literal if a command, but a compound-list, which is a fancy, formal grammar way they use to refer to a set of executable statements.
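To make that concrete, here’s a minimal sketch where if is handed plain commands and nothing that looks like a comparison. which_branch is a hypothetical helper made up for this illustration; it runs its arguments as a command and reports which clause got executed.

```shell
#!/bin/sh

# which_branch (hypothetical helper): run the arguments as a command under
# `if` and print which clause was taken. Only the exit status matters.
which_branch() {
	if "$@"; then
		echo 'then'
	else
		echo 'else'
	fi
}

which_branch true               # prints "then" (exit status 0)
which_branch false              # prints "else" (exit status 1)
which_branch sh -c 'exit 3'     # prints "else" (any non-zero status is false)
```

Any command works here, because all if cares about is whether the exit status is zero.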

So, shell if doesn’t really expect a comparison expression… it actually expects a command like any other one you would use throughout the script. Wanna see an example of my own? I’m full of surprises, you know that… and sometimes, but only sometimes, those surprises come in the form of a shell scripting snippet…[2]

if tasklst=$(NO_COLOR=1 cras 2>/dev/null | grep '#'); then
	task=$(printf '[Exit]\n%s' "$tasklst" | bemenu -p 'Tasks' -l 10 -i)
else
	exit 1
fi

This comes from a script I use to integrate cras into my window manager… It’s called cras-menu and it’s part of a non-project repo I call arisys where I keep some of the scripts that make my desktop work wonders!

Back to the snippet. I’ll save you lots of time by telling you to read all that is between $(...) as “run this and return me the contents of standard output as the value of the expression.” So, the crazy pipeline on the if line is run, it sends some output to standard output, and that string is stored into $tasklst. Yep, that = there is your regular assignment operator, not a comparison operator. Unlike in many languages, though, the assignment doesn’t hand the assigned value back to if: what if gets to see is the exit status of the command substitution, that is, of the pipeline’s last command, grep, which succeeds only when it matched at least one line.
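Here’s a small sketch of that behavior, using printf as a stand-in for cras (an assumption of mine, so it runs anywhere). In an assignment fed by $(...), the status if branches on is the one left behind by the command substitution, which here is grep’s.

```shell
#!/bin/sh

# grep finds a line containing '#': the substitution exits 0, so the
# assignment "succeeds" and the then clause runs.
if found=$(printf 'a\n#b\n' | grep '#'); then
	echo "matched: $found"          # prints "matched: #b"
else
	echo 'no match'
fi

# grep finds nothing: the substitution exits 1, so the else clause runs.
# The assignment still happened; the variable is just empty.
if missing=$(printf 'a\nb\n' | grep '#'); then
	echo "matched: $missing"
else
	echo "no match, value is '$missing'"    # prints "no match, value is ''"
fi
```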

OK, yeah, but… how do I actually compare two things in shell? There must be a way, right? POSIX shell is a Turing-complete language… It must be able to do conditional jumps by definition!

You do it with a command, duh!

/bin/[

The command is test(1). You’ll rarely encounter it invoked as that, though… but as its alias, [.

I’ll do a bit of a spoiler here. If you’ve ever looked at shell scripts, you’ve probably seen brackets everywhere:

if [ "$var" = 'OH' ]; then
	...
fi

The easy way to teach conditionals in shell is to tell the student that the brackets are part of if’s syntax. The truth is that they’re not, as I showed before. [ is an alias specified for test(1) by the POSIX specs themselves. How this aliasing is done on your system may vary, but as far as I can tell, on my current Arch system, using GNU coreutils, both are different binaries, as shown by the different inode numbers and the mismatching MD5 hashes. However, they do have exactly the same size? Not sure what is going on here.

[ari@arch ~]$ ls -il /bin/test
930478 -rwxr-xr-x 1 root root 59552 sep 29 15:56 /bin/test
[ari@arch ~]$ ls -il /bin/[
930393 -rwxr-xr-x 1 root root 59552 sep 29 15:56 '/bin/['
[ari@arch ~]$ md5sum /bin/test
fb3ec5d358acf44072bc6f6b9d537826  /bin/test
[ari@arch ~]$ md5sum /bin/[
a36bf5b09f3b39cf283d0e3c4921a410  /bin/[

On OpenBSD the two files share the same inode number, so they’re definitely the same binary, but one of the two paths is a hard link (not a symlink, which I guess would bring in a small performance hit):

[ari@aribsd ~]$ ls -il /bin/test
181441 -r-xr-xr-x  2 root  bin  123320 Sep 30 22:01 /bin/test
[ari@aribsd ~]$ ls -il /bin/[
181441 -r-xr-xr-x  2 root  bin  123320 Sep 30 22:01 /bin/[

The main point here is that for all intents and purposes, [ is just a command with a funny syntax. It is funny because it forces you to always end it with a ] which doesn’t serve any actual purpose other than to give the illusion that you’re somehow writing a bracketed expression. The illusion, though, is somewhat broken when you see that some of [’s operators look like regular command-line options, like -f to check whether something is a regular file or not:

$ cat cond.sh
#!/bin/sh

var='oh'

if [ -f '/bin/test' ]; then
	echo '/bin/test exists!'
else
	echo '/bin/test does not exist... weird 0_0'
fi
$ ./cond.sh
/bin/test exists!

You could also use test(1) directly. In fact, let’s demonstrate this!

$ cat cond.sh
#!/bin/sh

var='oh'

if test -f '/bin/test'; then
	echo '/bin/test exists!'
else
	echo '/bin/test does not exist... weird 0_0'
fi
$ ./cond.sh
/bin/test exists!
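Since both spellings are ordinary commands, you can also run them bare, without any if around them, and look at the exit status yourself. A minimal sketch (the st_* variable names are just placeholders for this illustration):

```shell
#!/bin/sh

# Equal strings: both spellings exit with status 0 (true).
test 'oh' = 'oh'; st_test_eq=$?
[ 'oh' = 'oh' ];  st_bracket_eq=$?

# Different strings: both spellings exit with status 1 (false).
test 'oh' = 'OH'; st_test_ne=$?
[ 'oh' = 'OH' ];  st_bracket_ne=$?

echo "test, equal:     $st_test_eq"       # prints 0
echo "[,    equal:     $st_bracket_eq"    # prints 0
echo "test, different: $st_test_ne"       # prints 1
echo "[,    different: $st_bracket_ne"    # prints 1
```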

The main takeaway here is that in POSIX shell, no matter whether you use one or the other syntax of test/[, the way you test for a condition in it is via a command, which exists as an actual binary like any other binary on your system (even if your shell also ships a builtin version of it for speed), with its own exit status and all the shebang (no pun intended… I guess?). If this is clear to you now, honey, then I can move on to the dangerous waters of Boolean operators!

The Bad A && B || C Idiom

This is the real reason why I’m writing this super long post about conditionals in POSIX shell. This is something I actually learned from using shellcheck, which I first thought was exaggerating a bit about this… but then I read the specs.

A super, super common “hack” in POSIX shell is to reduce if-else clause conditionals to A && B || C, where A, B, C are commands. The idea is simple but relies on short-circuit evaluation, an optimization whereby the shell stops executing the operands of Boolean operators as soon as the truth or falsehood of the expression as a whole is known. This is not how logical operators work in propositional logic and I won’t go down this rabbit hole, because it’d turn this post into a thesis.

So, if the shell sees A && B || C, it will do this:

  1. Group this, per left associativity, as (A && B) || C.
  2. Check if A is true (this means run it, remember [ is a command).
  3. If A is true, then, we need to know whether B is true or not, so we run it.
  4. If A is false, we know that (A && B) is false, so we don’t run B. We jump to the logical or, so we run C.
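The steps above can be traced with a small sketch. run is a hypothetical helper of mine: it appends its label to $trace and then exits with whatever status you give it (0 for true, anything else for false), so afterwards $trace tells you which commands were executed.

```shell
#!/bin/sh

# run LABEL STATUS (hypothetical helper): record LABEL, then exit with STATUS.
run() {
	trace="$trace$1"
	return "$2"
}

# A is true: B is run, and since B is true too, C is skipped.
trace=''
run A 0 && run B 0 || run C 0
both_true=$trace

# A is false: B is skipped and C is run.
trace=''
run A 1 && run B 0 || run C 0
a_false=$trace

echo "A true, B true: $both_true"    # prints "A true, B true: AB"
echo "A false:        $a_false"      # prints "A false:        AC"
```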

That’s why people use this as an equivalent to if A; then B; else C; fi, as the execution of B depends on A being true… and if A is false, then C is run, skipping B because of the optimization mentioned earlier…

But here comes the caveat.

If A is true, B is run. What if B returns false? Well, (A && B) becomes false, so C must be run for the logical or to be resolved! Then you’ve suddenly got all three commands being run, unlike what you would get from a proper conditional:

$ [ "1" ] && echo 'OH' > /dev/full || echo 'No'
bash: echo: write error: No space left on device
No

Don’t be fooled by the write error! It actually shows both calls to echo have run, even though the condition (i.e. [ "1" ]) is true. The huge problem here is that you’re relying on exit values to make things sort of work under some conditions, but that same reliance is what makes this idiom so fragile. OK, yeah, exit codes are what [ is all about by definition, but why should they change the behavior of the then and else clauses?

A proper if-conditional, on the other hand, behaves like this:

$ if [ "1" ]; then echo 'OH' > /dev/full; else echo 'No'; fi
bash: echo: write error: No space left on device

Which is what you’d have expected in the first place, isn’t it? Only the then clause is executed because the condition is true. Exit codes of the calls to echo are ignored; they have no say in how the execution flow will go.
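If you don’t have a /dev/full around, the same difference can be shown with a failing B of your own. This is a sketch using a hypothetical run helper that records its label in $trace and exits with the status it is given:

```shell
#!/bin/sh

# run LABEL STATUS (hypothetical helper): record LABEL, then exit with STATUS.
run() {
	trace="$trace$1"
	return "$2"
}

# The idiom: A succeeds, B fails, so C ends up running as well.
trace=''
run A 0 && run B 1 || run C 0
idiom_trace=$trace

# The proper conditional: B's failure has no say over the else clause.
trace=''
if run A 0; then run B 1; else run C 0; fi
if_trace=$trace

echo "A && B || C: $idiom_trace"    # prints "A && B || C: ABC"
echo "if/else:     $if_trace"       # prints "if/else:     AB"
```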

The thing is that conditionals do something that no hack consisting of logical operators can do: branching. Your conditional statements in all languages specify jumps so that your then and else clauses become mutually exclusive. Not getting into Assembly in this post (it’s getting way too long already), but the way your computer actually does this is via instructions that are literal gotos! An idiom like A && B || C just doesn’t do that.

Funny thing is that C has a (quite limited) shorthand conditional that actually does what people expect from A && B || C: the so-called ternary operator, i.e. A ? B : C. I wouldn’t be surprised at all if the infamous shell idiom was born as a poor imitation of C’s ternary operator… But again, the ternary operator behaves as an actual conditional, such that evaluating B is guaranteed to be mutually exclusive to evaluating C.

As a Conclusion to This Long Post

My point with this discussion is that… well… early optimizations, especially linguistic ones, are the root of all evils. I’ve also used these “smart” shorthand conditionals in shell for some time and I’ve never run into any bugs because of them. But, as soon as you learn there is unexpected behavior that could arise from using them, the sane thing to do is to stop using them…

I’ve even stopped using the safe-by-definition shorthands A && B and A || B. Yeah, these work as intended, but in the end… you know… logical operators are logical operators, and the language provides a conditional construct that is guaranteed, by specification, to work as intended.

And I feel like the “linguistic” benefit of writing “less” code with these && || hacks is… dubious? You can write an if-conditional in one line if you want to… and everyone will immediately know (even people who don’t know POSIX shell scripting, but do know other languages) what is going on there.

To me this is a classic case of, yeah, you can use this if you really know how, but it’s not worth the pitfalls it can make you fall into.

So, please, next time you’re writing a shell script… please use if (and shellcheck… trust me, it’ll save your life more than once!)


  1. I don’t care at all about Bash and never will. As far as I can tell, POSIX sh and Bash don’t differ in anything I’m referring to in this post, except, well, that in Bash you also get the Bash-specific [[ ... ]] syntax… which, by the way, I never understood why Bash introduced. Anyways, my point is that I think this post also applies to Bash, but don’t take my word for it; all I know is POSIX sh. ↩︎

  2. And in other occasions, my surprises might be a bit more interesting… 😏 ↩︎