Grapheme cluster breaks in a string (GNU libunistring)

10.1 Grapheme cluster breaks in a string

The following functions find a single boundary between grapheme clusters in a string.

Function: const uint8_t * u8_grapheme_next (const uint8_t *s, const uint8_t *end)

Function: const uint16_t * u16_grapheme_next (const uint16_t *s, const uint16_t *end)

Function: const uint32_t * u32_grapheme_next (const uint32_t *s, const uint32_t *end)

Returns the start of the next grapheme cluster following s, or end if no grapheme cluster break is encountered before it. Returns NULL if and only if s == end.

Note that these functions do not handle the case where a character outside of the range between s and end is needed to determine the boundary. This is the case in particular with syllables in Indic scripts and with emoji. Use the _grapheme_breaks functions in such cases.

Function: const uint8_t * u8_grapheme_prev (const uint8_t *s, const uint8_t *start)

Function: const uint16_t * u16_grapheme_prev (const uint16_t *s, const uint16_t *start)

Function: const uint32_t * u32_grapheme_prev (const uint32_t *s, const uint32_t *start)

Returns the start of the grapheme cluster preceding s, or start if no grapheme cluster break is encountered before it. Returns NULL if and only if s == start.

Note that these functions do not handle the case where a character outside of the range between start and s is needed to determine the boundary. This is the case in particular with syllables in Indic scripts and with emoji. Use the _grapheme_breaks functions in such cases.

Note also that these functions work only on well-formed Unicode strings.

The following functions determine all of the grapheme cluster boundaries in a string.

Function: void u8_grapheme_breaks (const uint8_t *s, size_t n, char *p)

Function: void u16_grapheme_breaks (const uint16_t *s, size_t n, char *p)

Function: void u32_grapheme_breaks (const uint32_t *s, size_t n, char *p)

Function: void ulc_grapheme_breaks (const char *s, size_t n, char *p)

Function: void uc_grapheme_breaks (const ucs4_t *s, size_t n, char *p)

Determines the grapheme cluster break points in s, an array of n units, and stores the result at p[0..n-1].

p[i] = 1 means that there is a grapheme cluster boundary between s[i-1] and s[i].
p[i] = 0 means that s[i-1] and s[i] are part of the same grapheme cluster.

p[0] is always set to 1, because there is always a grapheme cluster break at the start of the text.

In addition to the above variants for UTF-8, UTF-16, and UTF-32 strings, <unigbrk.h> provides another variant: uc_grapheme_breaks.

It is similar to u32_grapheme_breaks, but it accepts arbitrary characters, including ones that may not be represented in UTF-32, such as control characters.