GNU Smalltalk User’s Guide: Locales

Different countries and cultures have varying conventions for how to communicate. These conventions range from very simple ones, such as the format for representing dates and times, to very complex ones, such as the language spoken. Provided that programs are written to obey the user's choice of conventions, they will follow the conventions the user prefers. GNU Smalltalk provides two packages to help you do so. The I18N package covers both internationalization and multilingualization; the lighter-weight Iconv package covers only the latter, as it is a prerequisite for correct internationalization.

Multilingualizing software means programming it to support languages from every part of the world. In particular, this includes understanding multi-byte character sets (such as UTF-8) and Unicode characters whose code point (the equivalent of the ASCII value) is above 127. To this end, GNU Smalltalk provides the UnicodeString class, which stores its data as 32-bit Unicode values. In addition, Character provides support for all of the more than one million code points available in Unicode.

Loading the I18N package improves this support through the EncodedStream class, which interprets and transcodes non-ASCII Unicode characters. This support is mostly transparent, because the base classes Character, UnicodeCharacter and UnicodeString are enhanced to use it. Sending asString or printString to an instance of Character or UnicodeString will convert Unicode characters so that they are printed correctly in the current locale. For example, ‘$<279> printNl’ will print a small Latin letter ‘e’ with a dot above when the I18N package is loaded.

Dually, you can convert String or ByteArray objects to Unicode with a single method call. If the current locale’s encoding is UTF-8, ‘#[196 151] asUnicodeString’ will return a Unicode string with the same character as above, the small Latin letter ‘e’ with a dot above.
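The two directions described above can be tried together in a short session. This is only a sketch: it assumes that the I18N package is installed, that it can be loaded with PackageLoader, and that the current locale's encoding is UTF-8. What is actually printed depends on the locale and terminal, so no output is shown.

```smalltalk
"Load the I18N package (assumed to be installed)."
PackageLoader fileInPackage: 'I18N'.

"Unicode to locale encoding: print the character with code point 279,
 the small Latin letter 'e' with a dot above."
(Character codePoint: 279) printNl.

"Locale encoding to Unicode: 196 151 is the UTF-8 encoding of code
 point 279.  This assumes that the current locale uses UTF-8."
#[196 151] asUnicodeString printNl.
```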

The implementation of multilingualization support is not yet complete. For example, methods such as asLowercase, asUppercase and isLetter do not yet recognize Unicode characters.

You need to exercise some care, or your program will be buggy when Unicode characters are used. In particular, Characters must not be compared with == and should be printed on a Stream with display: rather than nextPut:.
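As a minimal sketch of these two rules, the following compares two characters by value and writes one of them to a stream with display:. The variable names are only illustrative, and the exact contents of the stream depend on whether I18N is loaded and on the current locale.

```smalltalk
| a b stream |
a := Character codePoint: 279.
b := Character codePoint: 279.

"Compare by value with =.  Identity comparison with == is not
 reliable for characters outside the 0-127 range."
(a = b) printNl.

"Write the character with display: rather than nextPut:, so that it
 can be converted for the stream instead of being stored as a raw
 element."
stream := WriteStream on: String new.
stream display: a.
stream contents printNl.
```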

Also, Characters need to be created with the class method codePoint: if you are referring to their Unicode value; codePoint: is also the only method to create characters that is accepted by the ANSI Standard for Smalltalk. The method value:, instead, should be used if you are referring to a byte in a particular encoding. This subtle difference means that, for example, the last two of the following examples will fail:

    "Correct.  Use #value: with Strings, #codePoint: with UnicodeString."
    String with: (Character value: 65)
    String with: (Character value: 128)
    UnicodeString with: (Character codePoint: 65)
    UnicodeString with: (Character codePoint: 128)

    "Correct.  Only works for characters in the 0-127 range, which may
     be considered as defensive programming."
    String with: (Character codePoint: 65)

    "Dubious, and only works for characters in the 0-127 range.  With
     UnicodeString, probably you always want #codePoint:."
    UnicodeString with: (Character value: 65)

    "Fails, we try to use a high character in a String"
    String with: (Character codePoint: 128)

    "Fails, we try to use an encoding in a Unicode string"
    UnicodeString with: (Character value: 128)
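If a program cannot rule out such a combination in advance, one defensive option is to guard the construction. The sketch below assumes that the failure surfaces as an exception that is a kind of Error; the text above does not say which exception class is raised, so the handler is kept deliberately generic.

```smalltalk
| result |
"Try to build a String from a character above 127 that was created
 with #codePoint:, and answer nil if the attempt signals an Error
 (an assumption about how the failure is reported)."
result := [String with: (Character codePoint: 128)]
    on: Error
    do: [:e | e return: nil].
result printNl.
```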

Internationalizing software, instead, means programming it to adapt to the user’s preferred conventions. These conventions can get pretty complex; for example, the user might specify the locale ‘espana-castellano’ for most purposes, but specify the locale ‘usa-english’ for currency formatting. This might make sense if the user is a Spanish-speaking American, working in Spanish, but representing monetary amounts in US dollars. You can see that this system is simple but, at the same time, very complete. This manual, however, is not the right place for a thorough discussion of how a user would set up their system for these conventions; for more information, refer to your operating system’s manual or to the GNU C library’s manual.

GNU Smalltalk inherits from ISO C the concept of a locale, that is, a collection of conventions, one convention for each purpose. It maps each of these purposes to a Smalltalk class defined by the I18N package; these classes form a small hierarchy with class Locale as its root.

Basic usage of the I18N package involves a single selector, the question mark (?), which is a rarely used yet valid character for a Smalltalk binary message. The meaning of the question mark selector is “How do you say … under your convention?”. You can send ? either to a specific instance of a subclass of Locale or to the class itself; in the latter case, the rules for the default locale (which is specified via environment variables) apply. You might say, for example, LcTime ? Date today or germanMonetaryLocale ? account balance. This syntax can be confusing at first, but turns out to be convenient because of its consistency and overall simplicity.
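A short sketch of the class-side form follows. It assumes that the I18N package is installed and that its classes are reachable through a namespace named I18N (hence the qualified name I18N.LcTime); if the classes are visible unqualified in your image, LcTime alone will do. The formatted text depends on the default locale, so no output is shown.

```smalltalk
"Load the I18N package (assumed to be installed)."
PackageLoader fileInPackage: 'I18N'.

"Ask the default time locale how it writes today's date.  The I18N.
 namespace prefix is an assumption."
(I18N.LcTime ? Date today) displayNl.
```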

These two packages provide much more functionality, including more advanced formatting options, support for Unicode, and conversion to and from several character sets. For more information, refer to Multilingual and international support with Iconv and I18N in the GNU Smalltalk Library Reference.

As an aside, the representation of locales that the package uses is exactly the same as the C library’s, which has many advantages: the burden of maintaining locale data is removed from GNU Smalltalk’s maintainers; the need to keep two copies of the same data is removed from GNU Smalltalk’s users; and finally, uniformity of the conventions assumed by different internationalized programs is guaranteed to the end user.

In addition, the representation of translated strings is the standard MO file format adopted by the GNU gettext library.

3.4 Internationalization and localization support
