Guile Reference Manual: Character Encoding of Source Files

Warning: This is the manual of the legacy Guile 2.0 series. You may want to read the manual of the current stable series instead.

6.17.8 Character Encoding of Source Files

Scheme source code files are usually encoded in ASCII or UTF-8, but the built-in reader can interpret other character encodings as well. When Guile loads Scheme source code, it uses the file-encoding procedure (described below) to try to guess the encoding of the file. In the absence of any hints, UTF-8 is assumed. One way to provide a hint about the encoding of a source file is to place a coding declaration in the top 500 characters of the file.

A coding declaration has the form coding: XXXXXX, where XXXXXX is the name of a character encoding in which the source code file has been encoded. The coding declaration must appear in a scheme comment. It can either be a semicolon-initiated comment, or the first block #! comment in the file.

The name of the character encoding in the coding declaration is typically lower case and containing only letters, numbers, and hyphens, as recognized by set-port-encoding! (see set-port-encoding!). Common examples of character encoding names are utf-8 and iso-8859-1, as defined by IANA. Thus, the coding declaration is mostly compatible with Emacs.

However, there are some differences in encoding names recognized by Emacs and encoding names defined by IANA, the latter being essentially a subset of the former. For instance, latin-1 is a valid encoding name for Emacs, but it’s not according to the IANA standard, which Guile follows; instead, you should use iso-8859-1, which is both understood by Emacs and dubbed by IANA (IANA writes it uppercase but Emacs wants it lowercase and Guile is case insensitive.)

For source code, only a subset of all possible character encodings can be interpreted by the built-in source code reader. Only those character encodings in which ASCII text appears unmodified can be used. This includes UTF-8 and ISO-8859-1 through ISO-8859-15. The multi-byte character encodings UTF-16 and UTF-32 may not be used because they are not compatible with ASCII.

There might be a scenario in which one would want to read non-ASCII code from a port, such as with the function read, instead of with load. If the port’s character encoding is the same as the encoding of the code to be read by the port, not other special handling is necessary. The port will automatically do the character encoding conversion. The functions setlocale or by set-port-encoding! are used to set port encodings (see Ports).

If a port is used to read code of unknown character encoding, it can accomplish this in three steps. First, the character encoding of the port should be set to ISO-8859-1 using set-port-encoding!. Then, the procedure file-encoding, described below, is used to scan for a coding declaration when reading from the port. As a side effect, it rewinds the port after its scan is complete. After that, the port’s character encoding should be set to the encoding returned by file-encoding, if any, again by using set-port-encoding!. Then the code can be read as normal.

Alternatively, one can use the #:guess-encoding keyword argument of open-file and related procedures. See File Ports.

Scheme Procedure: file-encoding port

C Function: scm_file_encoding (port)

Attempt to scan the first few hundred bytes from the port for hints about its character encoding. Return a string containing the encoding name or #f if the encoding cannot be determined. The port is rewound.

Currently, the only supported method is to look for an Emacs-like character coding declaration (see how Emacs recognizes file encoding in The GNU Emacs Reference Manual). The coding declaration is of the form coding: XXXXX and must appear in a Scheme comment. Additional heuristics may be added in the future.