When creating a Scheme string from a C string or when converting a Scheme string to a C string, the concept of character encoding becomes important.
In C, a string is just a sequence of bytes, and the character encoding describes the relation between these bytes and the actual characters that make up the string. For Scheme strings, character encoding is not an issue (most of the time), since in Scheme you usually treat strings as character sequences, not byte sequences.
Converting to C and converting from C each have their own challenges.
When converting from C to Scheme, it is important that the sequence of bytes in the C string be valid with respect to its encoding. ASCII strings, for example, can’t have any bytes greater than 127. An ASCII byte greater than 127 is considered ill-formed and cannot be converted into a Scheme character.
Problems can occur in the reverse operation as well. Not all character encodings can hold all possible Scheme characters. Some encodings, such as ASCII, can only describe a small subset of all possible characters. So, when converting to C, one must first decide what to do with Scheme characters that can’t be represented in the C string.
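Both failure modes can be illustrated without calling Guile at all. The following is a minimal sketch in plain C; the helper names `bytes_are_ascii` and `code_point_fits_latin1` are hypothetical and only model the checks a converter has to make, not how Guile performs them internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (C to Scheme direction): a byte sequence is
   well-formed ASCII only if every byte is in the range 0..127.  */
int
bytes_are_ascii (const char *bytes, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if ((unsigned char) bytes[i] > 127)
      return 0;   /* ill-formed: cannot become a Scheme character */
  return 1;
}

/* Hypothetical helper (Scheme to C direction): ISO-8859-1 can only
   represent the code points 0..255, so anything above that needs a
   fallback strategy.  */
int
code_point_fits_latin1 (uint32_t code_point)
{
  return code_point <= 0xFF;
}
```

For instance, the single byte `0xE9` is rejected by `bytes_are_ascii`, and the code point U+4E2D is rejected by `code_point_fits_latin1`.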
Converting a Scheme string to a C string will often allocate fresh memory to hold the result. You must take care that this memory is properly freed eventually. In many cases, this can be achieved by using scm_dynwind_free inside an appropriate dynwind context; see Dynamic Wind.
SCM scm_from_locale_string (const char *str)
SCM scm_from_locale_stringn (const char *str, size_t len)
Creates a new Scheme string that has the same contents as str when interpreted in the character encoding of the current locale.
For scm_from_locale_string, str must be null-terminated.
For scm_from_locale_stringn, len specifies the length of str in bytes, and str does not need to be null-terminated. If len is (size_t)-1, then str does need to be null-terminated and the real length will be found with strlen.
If the C string is ill-formed, an error will be raised.
Note that these functions should not be used to convert C string constants, because there is no guarantee that the current locale will match that of the execution character set, used for string and character constants. Most modern C compilers use UTF-8 by default, so to convert C string constants we recommend scm_from_utf8_string.
SCM scm_take_locale_string (char *str)
SCM scm_take_locale_stringn (char *str, size_t len)
Like scm_from_locale_string and scm_from_locale_stringn, respectively, but also frees str with free eventually. Thus, you can use these functions when you would free str anyway immediately after creating the Scheme string. In certain cases, Guile can then use str directly as its internal representation.
char *scm_to_locale_string (SCM str)
char *scm_to_locale_stringn (SCM str, size_t *lenp)
Returns a C string with the same contents as str in the character encoding of the current locale. The C string must be freed with free eventually, for example by using scm_dynwind_free; see Dynamic Wind.
For scm_to_locale_string, the returned string is null-terminated and an error is signaled when str contains #\nul characters.
For scm_to_locale_stringn and lenp not NULL, str might contain #\nul characters and the length of the returned string in bytes is stored in *lenp. The returned string will not be null-terminated in this case. If lenp is NULL, scm_to_locale_stringn behaves like scm_to_locale_string.
If a character in str cannot be represented in the character encoding of the current locale, the default port conversion strategy is used. See Ports, for more on conversion strategies.
If the conversion strategy is error, an error will be raised. If it is substitute, a replacement character, such as a question mark, will be inserted in its place. If it is escape, a hex escape will be inserted in its place.
size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)
Puts str as a C string in the current locale encoding into the memory pointed to by buf. The buffer at buf has room for max_len bytes and scm_to_locale_stringbuf will never store more than that. No terminating '\0' will be stored.
The return value of scm_to_locale_stringbuf is the number of bytes that are needed for all of str, regardless of whether buf was large enough to hold them. Thus, when the return value is larger than max_len, only max_len bytes have been stored and you probably need to try again with a larger buffer.
For most situations, string conversion should occur using the current locale, such as with the functions above. But there may be cases where one wants to convert strings from a character encoding other than the locale’s character encoding. For these cases, the lower-level functions scm_to_stringn and scm_from_stringn are provided. These functions should seldom be necessary if one is properly using locales.
scm_t_string_failed_conversion_handler is an enumerated type that can take one of three values: SCM_FAILED_CONVERSION_ERROR, SCM_FAILED_CONVERSION_QUESTION_MARK, and SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE. They are used to indicate a strategy for handling characters that cannot be converted to or from a given character encoding. SCM_FAILED_CONVERSION_ERROR indicates that a conversion should throw an error if some characters cannot be converted. SCM_FAILED_CONVERSION_QUESTION_MARK indicates that a conversion should replace unconvertible characters with the question mark character. SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE requests that a conversion replace each unconvertible character with an escape sequence.
While all three strategies apply when converting Scheme strings to C, only SCM_FAILED_CONVERSION_ERROR and SCM_FAILED_CONVERSION_QUESTION_MARK can be used when converting C strings to Scheme.
char *scm_to_stringn (SCM str, size_t *lenp, const char *encoding, scm_t_string_failed_conversion_handler handler)
This function returns a newly allocated C string from the Guile string str. The length of the returned string in bytes will be returned in lenp. The character encoding of the C string is passed as the ASCII, null-terminated C string encoding. The handler parameter gives a strategy for dealing with characters that cannot be converted into encoding.
If lenp is NULL, this function will return a null-terminated C string. It will throw an error if the string contains a null character.
The Scheme interface to this function is string->bytevector, from the ice-9 iconv module. See Representing Strings as Bytes.
SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)
This function returns a Scheme string from the C string str. The length in bytes of the C string is input as len. The encoding of the C string is passed as the ASCII, null-terminated C string encoding. The handler parameter suggests a strategy for dealing with unconvertible characters.
The Scheme interface to this function is bytevector->string. See Representing Strings as Bytes.
The following conversion functions are provided as a convenience for the most commonly used encodings.
SCM scm_from_latin1_string (const char *str)
SCM scm_from_utf8_string (const char *str)
SCM scm_from_utf32_string (const scm_t_wchar *str)
Return a Scheme string from the null-terminated C string str, which is ISO-8859-1-, UTF-8-, or UTF-32-encoded. These functions should be used to convert hard-coded C string constants into Scheme strings.
SCM scm_from_latin1_stringn (const char *str, size_t len)
SCM scm_from_utf8_stringn (const char *str, size_t len)
SCM scm_from_utf32_stringn (const scm_t_wchar *str, size_t len)
Return a Scheme string from C string str, which is ISO-8859-1-, UTF-8-, or UTF-32-encoded, of length len. len is the number of bytes pointed to by str for scm_from_latin1_stringn and scm_from_utf8_stringn; it is the number of elements (code points) in str in the case of scm_from_utf32_stringn.
char *scm_to_latin1_stringn (SCM str, size_t *lenp)
char *scm_to_utf8_stringn (SCM str, size_t *lenp)
scm_t_wchar *scm_to_utf32_stringn (SCM str, size_t *lenp)
Return a newly allocated, ISO-8859-1-, UTF-8-, or UTF-32-encoded C string from Scheme string str. An error is thrown when str cannot be converted to the specified encoding. If lenp is NULL, the returned C string will be null-terminated, and an error will be thrown if the C string would otherwise contain null characters. If lenp is not NULL, the string is not null-terminated, and the length of the returned string is returned in lenp. The length returned is the number of bytes for scm_to_latin1_stringn and scm_to_utf8_stringn; it is the number of elements (code points) for scm_to_utf32_stringn.
Occasionally, when dealing with the implementation details of a port, you need to encode and decode strings according to the encoding and conversion strategy of that port. There are some convenience functions for that purpose as well.
SCM scm_from_port_string (const char *str, SCM port)
SCM scm_from_port_stringn (const char *str, size_t len, SCM port)
char *scm_to_port_string (SCM str, SCM port)
char *scm_to_port_stringn (SCM str, size_t *lenp, SCM port)
Like scm_from_stringn and friends, except they take their encoding and conversion strategy from a given port object.