What’s New in POSIX 2024 – XCU

· im tosti



In the 1950s, computers did not really interoperate. ARPANET had not yet happened (that would become a thing in the 60s), and every operating system was typically tied to the hardware it was meant to run on. Most communication actually happened over telephone, and no company was more present in that space than the Bell System. Unfortunately, the way it was so present was through exclusive supply contracts (with its subsidiary Western Electric) and a vast array of patents that it would refuse to license to competitors. So it got an antitrust suit aimed at it, which after seven years of litigation culminated in the 1956 consent decree. The Bell System was obliged to license all of its patents royalty-free, and barred from entering any industry other than telecommunications. So they made Unix.

Unix was unique, because the focus was on the software (since Bell couldn’t compete in this space anyway, as per the above). An evolution of Multics, it was developed on a PDP-7 (by cross-compiling). They then ported a compiler-compiler to it, leading to the development of B. Once their internal needs outgrew the PDP-7, it got ported to the PDP-11, and gained full typesetting capabilities. As it gained traction internally, when Bell acquired other PDP-11s, instead of running DEC’s own OS for the machine, they simply ran Unix on it. This led to the OS being rewritten in C, a higher-level (comparatively, of course) language, which enabled porting it to other machines (like the Interdata 7/32 and 8/32). Interest grew, and Bell (not being allowed to turn Unix into a product) simply shipped it at the manufacturing cost of the media. Notably, ARPANET used it (see: RFC 681).

By the early 1980s, Unix had become a universal operating system, used on virtually every serious machine. Then AT&T got hit by an antitrust suit again. The exact details matter less here, but the settlement freed it from the old restrictions. System V immediately turned into a product, almost killing it. Around that time, the GNU project was created, and the BSD project at Berkeley took on a life of its own. Having grown accustomed to interoperability (since up until that point, there was only really one serious Unix), several standardization attempts emerged. The System V Interface Definition was the AT&T one, Europe created the X/Open consortium of Single UNIX Specification fame, and the IEEE put out POSIX. These latter two would eventually merge and become equivalent, developed by the Austin Group, defining the only interface said to be universally interoperable on the OS level that we have to this day.

As of the previous release of POSIX, the Austin Group gained more control over the specification, moving it to a more working-group-oriented model, and got to work making the POSIX specification more modern. POSIX 2024 is the first release that bears the fruits of this labor, and as such, the changes made in it are particularly interesting, as they will define the direction of the specification going forward. This is what this article is about!

Well, mostly. POSIX is composed of several sections. Notably XBD (Base Definitions, which covers things like what a file is, how regular expressions work, etc), XSH (System Interfaces, the C API that defines POSIX’s internals), and XCU (which defines the shell command language, and the standard utilities available for the system). There’s also XRAT, which explains the rationale of the authors, but it’s less relevant for our purposes today. XBD and XRAT are both interesting as context, but XSH and XCU are the real meat of the specification. This article will focus on the XCU section, in particular the utilities part of that section. If you’re more interested in the XSH section, there’s an excellent summary page by sortix’s Jonas Termansen that you can read here.

Highlights #

Handling of Filenames in Shell #

One of the most common errors in shell scripts when working with files tends to be the presumption that the newline character (\n) will not be present in the filename. Consider, for example, wanting to do some processing of files in a directory, processing the most recently modified ones first, with some custom break condition. The most common (naive) way of implementing this looks like this1:

ls -t | while read -r f; do
	# if my condition; then break; fi
	# do something with $f
done

After all, read(1p) reads logical lines from stdin into a variable, and ls(1p) outputs one entry per line. The problem is that pathnames2 (as per section 3.254 of POSIX 2024) are just strings (meaning they can contain any bytes except the NUL character), meaning it’s incorrect to even treat one as a character string, let alone something you can put in a newline-separated form. As such, the correct solution, historically, has been to loop over the files in some other way (such as pathname expansion, whose results aren’t subject to field splitting, or using find(1p)), then sort them, then operate on the sorted results. This question is probably one of the most talked about in shell. POSIX 2024 addresses this issue in two ways.
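For contrast, here is a minimal sketch of the glob-based approach (without the sort-by-mtime part, which still requires find(1p) or similar; the scratch directory and filenames are made up for illustration):

```shell
# Newline-safe iteration via pathname expansion: glob results are
# never field-split, so each $f is one pathname, whatever bytes it holds.
dir=$(mktemp -d)
: > "$dir/a.txt"
: > "$dir/b
c.txt"                       # a filename containing a literal newline

n=0
for f in "$dir"/*; do
	[ -e "$f" ] || continue  # skip the literal pattern when nothing matches
	n=$((n + 1))
done
echo "$n"                    # => 2: both files seen, newline intact

rm -rf "$dir"
```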

The Null Option #

find(1p) now supports the -print0 primary, which makes find use the NUL character as a separator. To go along with it, xargs(1p) now supports the -0 argument, which reads arguments expecting them to be separated with NUL characters. Finally, for (most) other use-cases, read(1p) now supports the -d (delimiter) argument, where -d '' means the NUL character is the delimiter. This is a non-ideal resolution though. Previous POSIX releases have considered -print0 before, but never ended up adopting it, because using a NUL terminator meant that any utility that would need to process that output would need a new option to parse that type of output.
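A sketch of how these options fit together, assuming implementations that already ship them (e.g. GNU or BSD find/xargs, and a shell whose read supports -d, such as bash; the scratch filenames are made up for illustration):

```shell
# Scratch directory containing a filename with an embedded newline.
dir=$(mktemp -d)
: > "$dir/plain.txt"
: > "$dir/with
newline.txt"

# find -print0 emits NUL-terminated pathnames; xargs -0 reads them back,
# so the embedded newline survives as part of a single argument.
nargs=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)
echo "$nargs"                # => 2: exactly two arguments arrived intact

# read -d '' consumes one NUL-delimited record at a time.
find "$dir" -type f -print0 |
while IFS= read -r -d '' f; do
	: # do something with "$f"
done

rm -rf "$dir"
```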

More precisely, this approach does not resolve our original problem. xargs(1p) can’t sort, so we still have to handle that logic separately, unless sort(1p) also grows this support, and the same goes for every other utility in the pipeline. This problem recurs with every other type of use-case. Importantly, it breaks the interoperability that POSIX was made to uphold.

Thankfully, there is the second way that they’re fixing this issue.

The Nuclear Option #

We’ve established that, yes, pathnames can include newlines. We have not established why they can do that. After some deliberation, the Austin Group could not find a single use-case for newlines in pathnames besides breaking naive scripts. Wouldn’t it be nice if the naive scripts were just correct now? Ok, that might be a bit much all at once. We’re heading there though! A bunch of C functions3 are now encouraged to report EILSEQ if the last component of a pathname to a file they are to create contains a newline (put differently, they’re to error out instead of creating a filename that contains a newline).

As for the utilities, the following are now encouraged to error out if they would create a filename containing a newline, and/or to error out if they are about to print a pathname containing a newline in a context where newlines may be used as a separator: admin(1p), ar(1p), awk(1p), basename(1p), cd(1p), cksum(1p), cmp(1p), command(1p), compress(1p), cp(1p), csplit(1p), ctags(1p), cxref(1p), dd(1p), df(1p), diff(1p), dirname(1p), du(1p), ed(1p), ex(1p), file(1p), find(1p), fuser(1p), get(1p), grep(1p), hash(1p), head(1p), ipcs(1p), link(1p), ln(1p), localedef(1p), ls(1p), m4(1p), mailx(1p), make(1p), man(1p), mkdir(1p), mkfifo(1p), mv(1p), nm(1p), patch(1p), pax(1p), prs(1p), pwd(1p), rm(1p), rmdel(1p), sact(1p), sccs(1p), sh(1p), sort(1p), split(1p), tee(1p), touch(1p), type(1p), uncompress(1p), unget(1p), uniq(1p), uucp(1p), uudecode(1p), val(1p), vi(1p), wc(1p), what(1p), yacc(1p), and zcat(1p).

Furthermore, sh(1p) talks about future direction, which may require the above to be treated as errors, and pr(1p) has a new section talking about “problematic pathnames” (since, for its use-case, tabs and vertical tabs are also problem-causing).

This is a much better solution, even in its current form. Unless your threat model includes attackers targeting you in particular (which, for example, immediately excludes all “home use” scripts), you can reasonably expect people to be discouraged from creating newline-containing filenames, where before it might have been perceived as a “clever hack”. You can’t yet rely on your system enforcing that such files don’t exist, but this is a major step in that direction.

TL;DR #

While code like ls -t | while read -r f isn’t strictly correct yet, it’s likely to become strictly correct eventually. It’s also much more reasonable to opt into this early, unless you’re writing software with security requirements, are deleting files based on inputs, or similar.

Modern C #

C has come a pretty long way in the last half-century, but for most intents and purposes, we haven’t been able to really benefit from it. Did you know that since c11 we actually have built-in Unicode support via <uchar.h> (ISO/IEC TR 19769:2004)? Most modern programs can’t actually utilize this, because they target c89 (often incorrectly called “ANSI C”) or (if you’re lucky) c99. Why does this happen?

Well, when you’re building a new C program, you must decide what version of C to target. Target something too new, and no one will be able to build it. An example of this is btop++, which targeted some newer C++ features (notably <ranges>) that LLVM simply did not support at the time of its publishing: libc++ didn’t have them yet (at least not in a stable form available on most distributions), and you couldn’t use gcc’s libstdc++ because its <ranges> implementation depended on concepts (which LLVM also did not have yet).

As such, what you do is look at the platforms you want your program to run on, and try to figure out what the least common denominator would be. It just so happens that for the longest time, that denominator was c89. For a little while now, it’s been c99. As for why that is, POSIX is a large part of the reason. You see, up until POSIX 2024, POSIX required that a c89 compiler be present on the system. If you have c89 you’re compliant, and if you do not, you are not. Most operating systems try to be POSIX compliant, and so it becomes a typical expectation (so you don’t have to worry about not having C at all, something other languages do have to worry about). This broad presumption of availability also pushes embedded developers to provide something along those lines as well (setting the expectation of expectations), so most microcontrollers will have a c89 (or again, recently, c99) toolchain available for them.

In short, application authors will tend not to target something until it’s fairly common, unless there’s a disproportionate advantage for their specific use-case (such as with c99 over c89). What’s fairly common is strongly influenced by what is pseudo-guaranteed by the only portable standard we have.

Anyway, POSIX 2024 now requires c17, and does not require c89. Furthermore, the rationale mentions that future editions will not require c17, but will simply require whatever C specification version is the most modern and already implemented by major toolchains. So going forward, it’ll be much easier to justify using actually modern C for your new projects, and we can expect more and more embedded tools to provide modern C versions (something we’re already seeing, especially on microcontrollers that are based on ARM or RISC-V).

Limits & Cooperation #

Operating systems impose limits (often arbitrary) on what runs inside of them, and your applications (and scripts, and interactive usage) may also want to impose some limits and cooperation on what you run. As such, it’s important that you be able to interact with these limits. This is what the nice(1p), renice(1p), and ulimit(1p) utilities are meant to do.

Unfortunately, renice(1p) only worked in absolutes, and ulimit(1p) only let you set a maximum write size for files (and didn’t differentiate between hard and soft limits), and was only available as part of the XSI extension.

With POSIX 2024, ulimit(1p) now supports reporting hard and soft limits and defines how those are used and interact. Additionally, limits for the core image size, data segment size, number of open file descriptors, stack size, CPU time4, and address space now exist. This means that you can now (or rather, in the near future) reasonably rely on those existing and actually make use of them in portable scripts. renice(1p) is also updated to support the -n option (just like nice(1p)) to change the niceness value relatively.
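A quick sketch of what this looks like in practice, assuming a shell that already implements the new options (recent bash does; the 256 value is arbitrary):

```shell
# Report the soft (-S) and hard (-H) limits for open file descriptors.
ulimit -S -n
ulimit -H -n

# Soft limits can be lowered (and raised back, up to the hard limit)
# freely; doing it inside a command substitution keeps the change in
# that subshell, so it doesn't leak into the current session.
softfd=$(ulimit -S -n 256; ulimit -S -n)
echo "$softfd"               # => 256

# renice -n adjusts niceness relative to the current value, mirroring
# nice -n (demonstrated here on the current shell's own PID).
renice -n 1 -p $$
```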

Finally, we get a new utility: timeout(1p). A lot of tools over the years have added options to handle their own timeouts (curl(1) in particular comes to mind, having several different types of timeouts for various use-cases), but with timeout(1p) you don’t need those (except for the added flexibility) anymore. It even handles child processes (in several implementation-defined ways) and (importantly) lets you customize the signal and send a follow-up SIGKILL after a secondary timeout.
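A small sketch (the 124 exit status on timeout matches both the specification and existing implementations like GNU coreutils; the final command line is illustrative only):

```shell
# Kill `sleep 10` after one second. When the time limit is reached,
# timeout sends SIGTERM and exits with status 124.
timeout 1 sleep 10
echo "exit: $?"              # => exit: 124

# Customizing the signal plus a SIGKILL backstop: send SIGTERM at the
# 5-second mark, then SIGKILL 2 seconds later if the process lingers.
# (Not run here; some-stubborn-command is a placeholder.)
# timeout -s TERM -k 2 5 some-stubborn-command
```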

Makefiles #

make(1p) remains the default build system to this day. Or at least sort of. Most people tend to write large scripts that wrap around make for various reasons, but in the end they will tend to produce a Makefile (though ninja has been gaining a lot of traction). Let’s take a look at a typical example to explain what the improvements are.

Our use-case is simple: we have a bunch of .c files in ./. We want to compile and link them together. We also have a dependency (let’s say it’s libcurl) that requires some additional CFLAGS and LDFLAGS, which we query using pkg-config (well, I’m going to use pkgconf, it’s compatible). Importantly, we’re lazy in that we don’t want to specify every .c file in the directory in our Makefile. We also want to be able to clean our .o files without resorting to something like git clean -fx (that might clean some temporary artifacts that we do want to keep). With GNU Make, that might look something like so:

SRC := $(wildcard *.c)
OBJ := $(SRC:.c=.o)

CC ?= cc

CFLAGS  ?= -Os -pipe
LDFLAGS ?= -Wl,-O2
LIBS    ?=

PKGCONF ?= pkgconf

CURLC != $(PKGCONF) --cflags libcurl
CURLL != $(PKGCONF) --libs   libcurl

CFLAGS  := $(CFLAGS)  $(CURLC)
LDFLAGS := $(LDFLAGS) $(CURLL)

myprog: $(OBJ)
	$(CC) -o $@ $(LDFLAGS) $(LIBS) $^

.PHONY: clean
clean:
	rm -f $(OBJ) myprog

This will not work on anything but GNU Make. MacOS make5 won’t be happy with the != used for CURLC, while bsdmake and bmake won’t be happy with $^. POSIX make would be unhappy with the :=, wildcard, and .PHONY. Similarly, if we targeted bmake initially, the result would not properly run on gmake, and so on. The various implementations are mutually incompatible in diverging ways, since the POSIX specification lacked critical features required for writing such small (and the vast majority of Makefiles should be this small) Makefiles.

While there’s still no good solution for the $(wildcard *.c) portion of this6, the following, annotated with comments for changes, should now work in strict POSIX compatibility7:

# .POSIX: is meant to make the Make implementation behave as though
# it is standard POSIX-make, since there may be conflicts
.POSIX:
# we use ::= here, since POSIX does not define :=.
# we also strictly enumerate the sources
SRC ::= one.c two.c
OBJ ::= $(SRC:.c=.o)

CC ?= cc

CFLAGS  ?= -Os -pipe
LDFLAGS ?= -Wl,-O2
LIBS    ?=

PKGCONF ?= pkgconf

CURLC != $(PKGCONF) --cflags libcurl
CURLL != $(PKGCONF) --libs   libcurl

# ditto re: ::=
CFLAGS  ::= $(CFLAGS)  $(CURLC)
LDFLAGS ::= $(LDFLAGS) $(CURLL)

myprog: $(OBJ)
	$(CC) -o $@ $(LDFLAGS) $(LIBS) $^

.PHONY: clean
clean:
	rm -f $(OBJ) myprog

That’s very few changes! Importantly, gmake can already handle this, meaning that by targeting this feature set you are strictly improving compatibility.

To be very specific, POSIX 2024 added support for the $^ and $+ internal macros, ::=, :::=, !=, ?=, and += macro assignment forms, silent includes via -include, .NOTPARALLEL, .PHONY, and .WAIT special targets (of which I did not cover the parallelism ones, as those will typically be mostly useful to meta build systems), and other less important changes that will be listed out in full below.

Logging #

Our computers have more and more cores. In early 2017 (when the previous version of the standard was being finalized), most consumer-grade hardware still maxed out at 4 cores (likely with SMT). This was also the segment of the market most likely to have background batch processing done in shell (as more enterprise-grade uses tend to be written in a programming language that can integrate with their numerous external APIs). As such, while background processes were certainly common, it wasn’t as much of a common expectation that one might be doing some major processing (e.g. video re-encoding) in the background while performing other tasks. Of course, right after that point, in March 2017, the first generation of AMD Ryzen CPUs dropped on the scene and put processors with as many as 16 threads into the hands of consumers at more than reasonable prices. Today, in 2024, it’s difficult to buy a new workstation CPU with fewer than 12 threads, making the above-mentioned scenarios all the more common.

The original specification of logger(1p) was written with a fairly uncommon, albeit necessary, use-case in mind. Today, such use-cases are much more common, and could be even more common if logging were easier to do correctly8. The original specification basically said that logger(1p) takes arguments like echo, but instead of outputting the text to stdout, it outputs it to syslog. It also meant that logging the output of commands was unduly complicated.

In POSIX 2024, logger(1p) becomes a more fully-qualified command, with arguments and stdin interpretation. Notably, if there are no non-option arguments, logger(1p) will read the contents to log from stdin. It is also possible to ask for the contents of a specific file to be logged using -f. Additionally, the syslog priority can be specified with -p, the PID of the logger process can be attached to each message using -i, and a syslog tag string can be set using -t. Also of importance, every non-empty line in the input or file shall be logged as a separate message, which means that the -i argument can be used to perform bulk logging where you can differentiate between individual runs.
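As a sketch (assuming a logger that already implements these options and a syslog daemon that is listening; the demo-build tag and messages are made up):

```shell
# Log two build steps as separate syslog messages: -t tags them,
# -i attaches the logger PID so runs can be told apart afterwards,
# and -p sets the facility.priority pair.
printf 'step one done\nstep two done\n' | logger -i -t demo-build -p user.info
```

On a typical Linux system the results would land wherever user.info messages are routed (e.g. the journal or /var/log/syslog), one entry per non-empty input line.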

Internationalization #

Different people speak different languages, and it’s important to be able to translate your program for them. When you’re writing a C program or something along those lines, you can always reach for a library that you link against (such as GNU intl). When you’re writing a shell script, however, your options tend to be far more limited, since you can’t distribute a library alongside the script very easily. Wouldn’t it be helpful if the standard everyone follows to various degrees actually settled on whatever interface was the most used in practice? Anyway, POSIX 2024 has adopted the gettext suite à la GNU, both as a system interface (gettext(3p) and co) and in the CLI (gettext(1p), ngettext(1p), xgettext(1p), msgfmt(1p)).

Since the target audience for this article is primarily shell people and advanced end-users, I’ll quickly go over the utilization in a shell context. If you’re already familiar with the basics of GNU’s implementation, you can skip the rest of this section! Translations are organized by message IDs (msgid) which can then be turned into arbitrary message strings (msgstr). These are encoded in a Portable Object file (.po), which you compile into a Machine Object file (.mo) using msgfmt(1p). You then place them on your system in such a way that the gettext(1p) utilities will be able to find them, and the typical LANG/LANGUAGE/LC_ALL/LC_MESSAGES mechanism will get you the correct translations.

For the purposes of this minimal example, we’re going to write a very small program that talks about pets. The program will either print you like cats, you like dogs, or you have %d pets (where the %d will be used for printf output). We’ll also demonstrate how special-case plural forms work. We’ll be making a French and an English translation, and everything will be done relative to a directory of your choosing, that I will refer to as $PWD or . interchangeably. We’ll start by writing our two annotated .po files.

# ./en/pets.po : the filename and location is arbitrary
# an empty msgid and msgstr signal the header
# different languages deal with plural forms differently
# English only has a special case for "one"
# the `plural=` section is a C-like conditional expression
msgid ""
msgstr ""
"Content-Type: text/plain; charset=utf-8\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"
"Language: en\n"

# the IDs here are identical to the messages
# since English is the source language for us
msgid "you like cats"
msgstr "you like cats"

msgid "you like dogs"
msgstr "you like dogs"

# note that if a translation isn't found
# the msgid is used as is
msgid "you have a pet"
msgid_plural "you have %d pets"
msgstr[0] "you have a pet"
msgstr[1] "you have %d pets"
# ./fr/pets.po
# we'll have a special case for 0 (aucun) and 1 (un)
# then the rest will be general case plural
msgid ""
msgstr ""
"Content-Type: text/plain; charset=utf-8\n"
"Plural-Forms: nplurals=3; plural=(n==0)?0: (n==1)?1: 2\n"

msgid "you like cats"
msgstr "vous aimez les chats"

msgid "you like dogs"
msgstr "vous aimez les chiens"

# I translated "pet" as "little companion"
# as there's no satisfactory direct translation
msgid "you have a pet"
msgid_plural "you have %d pets"
msgstr[0] "vous n'avez pas de petits compagnons"
msgstr[1] "vous avez un petit compagnon"
msgstr[2] "vous avez %d petits compagnons"

These files aren’t usable as-is. We need to compile them into .mo files. We’ll start by compiling them in the same directory: msgfmt en/pets.po -o en/pets.mo; msgfmt fr/pets.po -o fr/pets.mo. We now need to place them in a location that gettext(1p) and co. will be able to find them in. For those specific commands there are numerous special cases, and we’ll take advantage of those via TEXTDOMAINDIR. Under $TEXTDOMAINDIR, the system will look for your locale (as determined by $LC_MESSAGES), followed by an LC_MESSAGES directory, then your text domain. For convenience, we’ll make some symlinks: ln -s . en/LC_MESSAGES; ln -s . fr/LC_MESSAGES. We can now demonstrate the messages manually!

export TEXTDOMAINDIR=$PWD

# the system will access $TEXTDOMAINDIR/${locales…}/$TEXTDOMAIN.mo
# you can avoid setting $TEXTDOMAIN if you specify it on the CLI
export TEXTDOMAIN=pets

# we can now translate simple messages!
LC_MESSAGES=fr gettext -s 'you like cats'
# => "vous aimez les chats"
LC_MESSAGES=en gettext -s 'you like dogs'
# => "you like dogs"
# if you try to access a translation that doesn't exist,
# it will simply print the ID, thus why it needs to be representative
LC_MESSAGES=it gettext -s 'you like cats'
# => "you like cats"

# for plural forms, you use ngettext(1p)
# because we probably also want to show the real number,
# ngettext can output printf-compatible format strings
# so we'll write a wrapper
# $1: locale; $2: msgid; $3: msgid_plural; $4: quantity
plural() {
	printf "$(LC_MESSAGES="$1" ngettext "$2" "$3" "$4")\n" "$4"
}
# we can now demonstrate how the translation system adapts to plurals
for i in $(seq 0 2); do
	plural en 'you have a pet' 'you have %d pets' $i
done
# => you have 0 pets
# => you have a pet
# => you have 2 pets

# in French, we had a special case for 0, let's see it in action:
for i in $(seq 0 2); do
	plural fr 'you have a pet' 'you have %d pets' $i
done
# => vous n'avez pas de petits compagnons
# => vous avez un petit compagnon
# => vous avez 2 petits compagnons

# if you try to access a translation that doesn't exist,
# the system will follow typical English rules, as above:
for i in $(seq 0 2); do
	plural it 'you have a pet' 'you have %d pets' $i
done
# => you have 0 pets
# => you have a pet
# => you have 2 pets

In short, you can now rely on GNU-style gettext and ngettext utilities to be present, and write your script with the presumption that they are there. If the translation files are not installed, the message ID will be used (intelligently, in the case of plural forms), so you don’t need to worry about the possibility of them not being installed.

Minor Changes #

These are changes that are relatively minor, but I still think deserve a spotlight.

Changes Index #

If you’re here early, hi! I’ve been working on this piece (I have a good chunk of an MB in plaintext notes) since the middle of the summer. Instead of letting it continue to drag on, I decided to radically reduce the scope to just the highlights, and only the XCU Utilities section. Thanks to sortix (linked above), I feel like I can stick with the latter, but I still plan to actually write out the full changes index here, as well as go over the Shell Command Language changes. It’s just going to take a long time, since I’m not interested in simply dumping out the change notifications, but rather in explaining every change being made (albeit not as completely as in the highlights section). I will update this page in-place and post a second announcement when this section is complete. Don’t expect it any time soon though (probably not until early 2025).


  1. Ok yes, it actually looks more like for f in $(ls -t), but that’s bad for other reasons, and is a mistake under more circumstances than just this case. ↩︎

  2. Note that this doesn’t apply to pathnames that are only composed of the portable filename character set (meaning that every level in the pathname is a portable filename). The portable filename character set notably does not include newlines (or spaces, for that matter). ↩︎

  3. bind(3p), link(3p), linkat(3p), mkdir(3p), mkdirat(3p), mkdtemp(3p), mkfifo(3p), mkfifoat(3p), mknod(3p), mknodat(3p), open(3p) and openat(3p) (but only for new files via O_CREAT), rename(3p), renameat(3p), symlink(3p), and symlinkat(3p). ↩︎

  4. In this context it’s talking about total execution time. Let’s say you set it to 10 cpu seconds and ran a CPU stress test. After it used 10 cpu seconds, the process would be killed due to exceeding the limit. ↩︎

  5. That is to say, GNU Make from 2006. ↩︎

  6. the closest you can get is a SRC != ls *.c, which can be problematic in many ways. In this example I just enumerated them, but you can still do the above, it’s just not safe due to whitespace. Note that many build systems (like meson) that compile down to Makefiles or ninja force you to enumerate your source files anyway. ↩︎

  7. At the time of testing, none of them have done it yet (except maybe gmake, I can’t tell and I don’t want to read GNU sources), as Make implementations tend to move fairly slowly. In part due to how old it is, and in part due to the format being quite funny to implement. ↩︎

  8. You can make wrappers around the old logger, and you can use your rc manager to log stdin/stdout. These are however much less ergonomic than dropping something that’s log-enabled into the background using ^z and bg. ↩︎