The char32_t problem (GNU libunistring)

Appendix B The `char32_t` problem

In response to the wchar_t mess described in the previous section, ISO C 11 introduces two new types: char32_t and char16_t.

char32_t is a type like wchar_t, with the added guarantee that it is 32 bits wide. So, it is a type that is appropriate for encoding a Unicode character. It is meant to resolve the problems of the 16-bit wide wchar_t on AIX and Windows platforms, and allow a saner programming model for wide character strings across all platforms.

char16_t is a type like wchar_t, with the added guarantee that it is 16 bits wide. It is meant to allow porting programs that use the broken wide character strings programming model from Windows to all platforms. Of course, no one needs this.

These types are accompanied by a syntax for defining wide string literals with these element types: u"..." for char16_t and U"..." for char32_t.
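As a small illustration of the literal syntax, the following sketch declares the same character in both forms and counts the elements of each string. It assumes that the compiler encodes u"..." as UTF-16 and U"..." as UTF-32 (this is the case for GCC and Clang, and C11 signals it through the __STDC_UTF_16__ and __STDC_UTF_32__ macros). The names smiley16, smiley32, sketch_len16 and sketch_len32 are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <uchar.h>

/* U+1F600 lies outside the Basic Multilingual Plane. */
static const char16_t smiley16[] = u"\U0001F600";  /* char16_t elements */
static const char32_t smiley32[] = U"\U0001F600";  /* char32_t elements */

/* Hypothetical helper: number of char16_t elements before the terminator. */
size_t sketch_len16(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: number of char32_t elements before the terminator. */
size_t sketch_len32(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Under the UTF-16/UTF-32 assumption, sketch_len16(smiley16) is 2 (a surrogate pair) while sketch_len32(smiley32) is 1, which is the practical difference between the two element types.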

So far, so good. What the ISO C designers forgot is to provide standardized C library functions that operate on these wide character strings. They standardized only the most basic functions, mbrtoc32 and c32rtomb, which are analogous to mbrtowc and wcrtomb, respectively. For the rest, GNU gnulib (https://www.gnu.org/software/gnulib/) provides the following functions:

Functions for converting an entire string: mbstoc32s – like mbstowcs, c32stombs – like wcstombs.
Functions for testing the properties of a 32-bit wide character: c32isalnum, c32isalpha, etc. – like iswalnum, iswalpha, etc.
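To show how little the standard itself offers, here is a minimal sketch of a whole-string conversion built only on the standardized mbrtoc32. It is not the gnulib implementation of mbstoc32s and does not share its interface; the function name sketch_mbs_to_c32 and its parameters are hypothetical. The input is interpreted in the encoding of the current locale.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Hypothetical helper, not part of ISO C or gnulib.
   Converts the NUL-terminated multibyte string SRC into at most DESTLEN
   char32_t elements stored in DEST.  No terminator is written.
   Returns the number of elements stored, or (size_t)-1 if SRC contains
   an invalid or incomplete multibyte sequence.  */
size_t sketch_mbs_to_c32(char32_t *dest, const char *src, size_t destlen)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);           /* initial conversion state */

    size_t remaining = strlen(src) + 1;        /* include the terminating NUL */
    size_t count = 0;

    while (count < destlen) {
        char32_t c;
        size_t ret = mbrtoc32(&c, src, remaining, &state);

        if (ret == 0)                          /* reached the terminating NUL */
            break;
        if (ret == (size_t)-1 || ret == (size_t)-2)
            return (size_t)-1;                 /* invalid or incomplete input */

        dest[count++] = c;
        if (ret != (size_t)-3) {               /* -3: no input bytes consumed */
            src += ret;
            remaining -= ret;
        }
    }
    return count;
}
```

A caller would typically invoke setlocale(LC_ALL, "") first so that the multibyte input is decoded according to the user's locale; without that call the "C" locale applies.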

Still, this API has two problems:

The char32_t encoding is locale dependent and undocumented. This means that if you want to know any property of a char32_t character other than those defined by <wctype.h> (for example, whether it is a dash, a currency symbol, a paragraph separator, or similar), you have to convert it to char * encoding first, using the function c32rtomb.
Even on platforms where wchar_t is 32 bits wide, the char32_t encoding may be different from the wchar_t encoding.
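The detour through char * encoding described in the first item can be sketched with the standard function c32rtomb. The helper name sketch_c32_to_mb is hypothetical, and the result is expressed in the encoding of the current locale, so a character that the locale cannot represent yields an error rather than a string.

```c
#include <limits.h>  /* MB_LEN_MAX */
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Hypothetical helper, not part of ISO C or gnulib.
   Writes the multibyte representation of C into BUF, which must have room
   for at least MB_LEN_MAX + 1 bytes, and NUL-terminates it.
   Returns the number of bytes written before the terminator, or
   (size_t)-1 if C cannot be represented in the current locale.  */
size_t sketch_c32_to_mb(char *buf, char32_t c)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    size_t n = c32rtomb(buf, c, &state);
    if (n == (size_t)-1)
        return n;                      /* not representable in this locale */

    buf[n] = '\0';
    return n;
}
```

Only after this step can the character be handed to code that understands the locale's char * encoding; nothing in the standard lets a program inspect the char32_t value directly in a portable way.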
