I have a true love/hate relationship with unions.
That’s the thing about unions: the compiler will arrange things so they can accommodate both love and hate.
The extension API defines a number of simple types and structures for general-purpose use. Additional, more specialized, data structures are introduced in subsequent sections, together with the functions that use them.
The general-purpose types and structures are as follows:
typedef void *awk_ext_id_t;
A value of this type is received from gawk when an extension is loaded. That value must then be passed back to gawk as the first parameter of each API function.
#define awk_const …
This macro expands to ‘const’ when compiling an extension, and to nothing when compiling gawk itself. This makes certain fields in the API data structures unwritable from extension code, while allowing gawk to use them as it needs to.
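The effect of the macro can be sketched as follows. This is a minimal illustration, not the text of the real header: the GAWK guard macro is an assumption about how the two builds are told apart, and demo_info_t and demo_major() are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GAWK is defined only while building gawk itself. */
#ifdef GAWK
#define awk_const
#else
#define awk_const const
#endif

/* Hypothetical structure, for illustration only. */
typedef struct demo_info {
    awk_const int major_version;   /* read-only in extension code */
    awk_const int minor_version;
} demo_info_t;

/* Reading an awk_const field is always allowed. */
int demo_major(const demo_info_t *info)
{
    assert(info != NULL);
    return info->major_version;
}

/*
 * Writing to the field is rejected when awk_const expands to const:
 *
 *     info->major_version = 4;    (compile-time error in an extension)
 */
```

An extension can therefore inspect such fields freely, while an accidental assignment is caught by the compiler instead of corrupting state that gawk maintains.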
typedef enum awk_bool {
    awk_false = 0,
    awk_true
} awk_bool_t;
A simple Boolean type.
typedef struct awk_string {
    char *str;      /* data */
    size_t len;     /* length thereof, in chars */
} awk_string_t;
This represents a mutable string. gawk owns the memory pointed to if it supplied the value. Otherwise, it takes ownership of the memory pointed to. Such memory must come from calling one of the gawk_malloc(), gawk_calloc(), or gawk_realloc() functions!
As mentioned earlier, strings are maintained using the current multibyte encoding.
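The ownership rule can be sketched like this. The example copies data into freshly allocated storage before filling in an awk_string_t. To keep it compilable without gawkapi.h, the type is redeclared locally and demo_alloc() is a hypothetical stand-in for gawk_malloc(); a real extension must use the gawk allocation functions named above. Allocating one extra byte for a trailing NUL is a defensive choice made in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Local copy of the type shown above, so the sketch compiles on its own. */
typedef struct awk_string {
    char *str;      /* data */
    size_t len;     /* length thereof, in chars */
} awk_string_t;

/* Stand-in allocator (assumption): a real extension calls gawk_malloc(). */
void *demo_alloc(size_t size)
{
    return malloc(size);
}

/*
 * Copy len bytes from src into newly allocated storage and describe
 * them in *out. Returns 1 on success, 0 if allocation fails.
 */
int demo_make_string(const char *src, size_t len, awk_string_t *out)
{
    char *copy;

    assert(src != NULL && out != NULL);
    copy = demo_alloc(len + 1);     /* extra byte for a trailing NUL */
    if (copy == NULL)
        return 0;
    memcpy(copy, src, len);
    copy[len] = '\0';               /* convenience only; len is authoritative */
    out->str = copy;
    out->len = len;
    return 1;
}
```

Once such a value has been handed to gawk, the extension should no longer free or modify the storage, because gawk has taken ownership of it.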
typedef enum {
    AWK_UNDEFINED,
    AWK_NUMBER,
    AWK_STRING,
    AWK_REGEX,
    AWK_STRNUM,
    AWK_ARRAY,
    AWK_SCALAR,         /* opaque access to a variable */
    AWK_VALUE_COOKIE,   /* for updating a previously created value */
    AWK_BOOL
} awk_valtype_t;
This enum indicates the type of a value. It is used in the following struct.
typedef struct awk_value {
    awk_valtype_t val_type;
    union {
        awk_string_t       s;
        awk_number_t       n;
        awk_array_t        a;
        awk_scalar_t       scl;
        awk_value_cookie_t vc;
        awk_bool_t         b;
    } u;
} awk_value_t;
An “awk value.” The val_type member indicates what kind of value the union holds, and each member is of the appropriate type.
#define str_value u.s
#define strnum_value str_value
#define regex_value str_value
#define num_value u.n.d
#define num_type u.n.type
#define num_ptr u.n.ptr
#define array_cookie u.a
#define scalar_cookie u.scl
#define value_cookie u.vc
#define bool_value u.b
Using these macros makes accessing the fields of the awk_value_t more readable.
enum AWK_NUMBER_TYPE {
    AWK_NUMBER_TYPE_DOUBLE,
    AWK_NUMBER_TYPE_MPFR,
    AWK_NUMBER_TYPE_MPZ
};
This enum is used in the following structure for defining the type of numeric value that is being worked with. It is declared at the top level of the file so that it works correctly for C++ as well as for C.
typedef struct awk_number {
    double d;
    enum AWK_NUMBER_TYPE type;
    void *ptr;
} awk_number_t;
This represents a numeric value. Internally, gawk stores every number as either a C double, a GMP integer, or an MPFR arbitrary-precision floating-point value. In order to allow extensions to also support GMP and MPFR values, numeric values are passed in this structure.
The double-precision d element is always populated in data received from gawk. In addition, by examining the type member, an extension can determine whether the ptr member points to a GMP integer (type mpz_ptr) or to an MPFR floating-point value (type mpfr_ptr), and cast it appropriately.
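A minimal sketch of that check follows. It only decides whether the ptr member is meaningful; the actual cast to the GMP or MPFR pointer type is left as a comment so that the example compiles without those libraries. The declarations are local copies of the ones above, and demo_precise_ptr() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the declarations above, so the sketch stands alone. */
enum AWK_NUMBER_TYPE {
    AWK_NUMBER_TYPE_DOUBLE,
    AWK_NUMBER_TYPE_MPFR,
    AWK_NUMBER_TYPE_MPZ
};

typedef struct awk_number {
    double d;
    enum AWK_NUMBER_TYPE type;
    void *ptr;
} awk_number_t;

/*
 * Return the arbitrary-precision payload if there is one, or NULL
 * when only the double member d carries the value.
 */
const void *demo_precise_ptr(const awk_number_t *n)
{
    assert(n != NULL);
    switch (n->type) {
    case AWK_NUMBER_TYPE_MPFR:  /* caller would cast to the MPFR pointer type */
    case AWK_NUMBER_TYPE_MPZ:   /* caller would cast to the GMP integer type */
        return n->ptr;
    case AWK_NUMBER_TYPE_DOUBLE:
    default:
        return NULL;
    }
}
```

An extension that does not care about arbitrary precision can ignore type and ptr altogether and read d, which is always filled in for data that comes from gawk.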
CAUTION: Any MPFR or MPZ values that you create and pass to gawk to save are copied. This means you are responsible for releasing the storage once you’re done with it. See the sample intdiv extension for some example code.
typedef void *awk_scalar_t;
Scalars can be represented as an opaque type. These values are obtained from gawk and then passed back into it. This is discussed in a general fashion in the text following this list, and in more detail in Variable Access and Update by Cookie.
typedef void *awk_value_cookie_t;
A “value cookie” is an opaque type representing a cached value. This is also discussed in a general fashion in the text following this list, and in more detail in Creating and Using Cached Values.
Scalar values in awk are numbers, strings, strnums, or typed regexps. The awk_value_t struct represents values. The val_type member indicates what is in the union.
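Reading a value therefore starts with a test of val_type. The following sketch shows one way to do that with the accessor macros listed earlier. The declarations are local copies of the ones in this section so that the example compiles without gawkapi.h. The awk_array_t typedef is an assumption, because that type is not defined in this section, and the demo_ function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of this section's declarations. */
typedef enum awk_bool { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_string { char *str; size_t len; } awk_string_t;
enum AWK_NUMBER_TYPE {
    AWK_NUMBER_TYPE_DOUBLE, AWK_NUMBER_TYPE_MPFR, AWK_NUMBER_TYPE_MPZ
};
typedef struct awk_number {
    double d;
    enum AWK_NUMBER_TYPE type;
    void *ptr;
} awk_number_t;
typedef void *awk_array_t;      /* assumption: opaque pointer */
typedef void *awk_scalar_t;
typedef void *awk_value_cookie_t;
typedef enum {
    AWK_UNDEFINED, AWK_NUMBER, AWK_STRING, AWK_REGEX, AWK_STRNUM,
    AWK_ARRAY, AWK_SCALAR, AWK_VALUE_COOKIE, AWK_BOOL
} awk_valtype_t;
typedef struct awk_value {
    awk_valtype_t val_type;
    union {
        awk_string_t s;
        awk_number_t n;
        awk_array_t a;
        awk_scalar_t scl;
        awk_value_cookie_t vc;
        awk_bool_t b;
    } u;
} awk_value_t;
#define str_value    u.s
#define strnum_value str_value
#define regex_value  str_value
#define num_value    u.n.d

/* Length of the text for string-like values, 0 for everything else. */
size_t demo_text_length(const awk_value_t *v)
{
    assert(v != NULL);
    switch (v->val_type) {
    case AWK_STRING:
    case AWK_STRNUM:
    case AWK_REGEX:
        return v->str_value.len;
    default:
        return 0;
    }
}

/* Numeric payload, or fallback if the value is not a number. */
double demo_number_or(const awk_value_t *v, double fallback)
{
    assert(v != NULL);
    return v->val_type == AWK_NUMBER ? v->num_value : fallback;
}
```

Both helpers consult val_type before touching the union, so they never read a member that was not the one most recently stored.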
Representing numbers is easy: the API uses a C double. Strings require more work. Because gawk allows embedded NUL bytes in string values, a string must be represented as a pair containing a data pointer and length. This is the awk_string_t type.
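The reason for carrying an explicit length can be seen in a small sketch. The hypothetical helper demo_c_length() reports how much of a string value a routine that stops at the first NUL byte would see, which may be less than the len member. The type is a local copy of awk_string_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the type described above. */
typedef struct awk_string {
    char *str;      /* data */
    size_t len;     /* length thereof, in chars */
} awk_string_t;

/*
 * Number of bytes visible to code that treats the data as a
 * NUL-terminated C string: everything before the first NUL byte.
 */
size_t demo_c_length(const awk_string_t *s)
{
    const char *nul;

    assert(s != NULL && s->str != NULL);
    nul = memchr(s->str, '\0', s->len);
    return nul != NULL ? (size_t) (nul - s->str) : s->len;
}
```

For the five bytes 'a', 'b', NUL, 'c', 'd', the len member is 5 while demo_c_length() yields 2, so extension code should loop or copy using len rather than functions such as strlen().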
A strnum (numeric string) value is represented as a string and consists of user input data that appears to be numeric. When an extension creates a strnum value, the result is a string flagged as user input. Subsequent parsing by gawk then determines whether it looks like a number and should be treated as a strnum, or as a regular string.
This is useful in cases where an extension function would like to do something comparable to the split() function, which sets the strnum attribute on the array elements it creates. For example, an extension that implements CSV splitting would want to use this feature. This is also useful for a function that retrieves a data item from a database. The PostgreSQL PQgetvalue() function, for example, returns a string that may be numeric or textual depending on the contents.
Typed regexp values (see Strongly Typed Regexp Constants) are not of much use to extension functions. Extension functions can tell that they’ve received them, and create them for scalar values. Otherwise, they can examine the text of the regexp through regex_value.str and regex_value.len.
Identifiers (i.e., the names of global variables) can be associated with either scalar values or with arrays. In addition, gawk provides true arrays of arrays, where any given array element can itself be an array. Discussion of arrays is delayed until Array Manipulation.
The various macros listed earlier make it easier to use the elements of the union as if they were fields in a struct; this is a common coding practice in C. Such code is easier to write and to read, but it remains your responsibility to make sure that the val_type member correctly reflects the type of the value in the awk_value_t struct.
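One way to meet that responsibility is to fill in val_type and the union together in a single helper, so the two can never disagree. The sketch below assumes local copies of this section's declarations. demo_set_number() and demo_set_text() are hypothetical names used only for illustration, and the awk_array_t typedef is an assumption because that type is not defined in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of this section's declarations. */
typedef enum awk_bool { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_string { char *str; size_t len; } awk_string_t;
enum AWK_NUMBER_TYPE {
    AWK_NUMBER_TYPE_DOUBLE, AWK_NUMBER_TYPE_MPFR, AWK_NUMBER_TYPE_MPZ
};
typedef struct awk_number {
    double d;
    enum AWK_NUMBER_TYPE type;
    void *ptr;
} awk_number_t;
typedef void *awk_array_t;      /* assumption: opaque pointer */
typedef void *awk_scalar_t;
typedef void *awk_value_cookie_t;
typedef enum {
    AWK_UNDEFINED, AWK_NUMBER, AWK_STRING, AWK_REGEX, AWK_STRNUM,
    AWK_ARRAY, AWK_SCALAR, AWK_VALUE_COOKIE, AWK_BOOL
} awk_valtype_t;
typedef struct awk_value {
    awk_valtype_t val_type;
    union {
        awk_string_t s;
        awk_number_t n;
        awk_array_t a;
        awk_scalar_t scl;
        awk_value_cookie_t vc;
        awk_bool_t b;
    } u;
} awk_value_t;
#define str_value    u.s
#define strnum_value str_value
#define num_value    u.n.d
#define num_type     u.n.type
#define num_ptr      u.n.ptr

/* Store a C double and mark the value as a number in one step. */
awk_value_t *demo_set_number(double num, awk_value_t *result)
{
    assert(result != NULL);
    result->val_type = AWK_NUMBER;
    result->num_value = num;
    result->num_type = AWK_NUMBER_TYPE_DOUBLE;
    result->num_ptr = NULL;
    return result;
}

/* Store text as a string, strnum, or regex; kind selects which. */
awk_value_t *demo_set_text(char *text, size_t len, awk_valtype_t kind,
                           awk_value_t *result)
{
    assert(result != NULL);
    assert(kind == AWK_STRING || kind == AWK_STRNUM || kind == AWK_REGEX);
    result->val_type = kind;
    result->str_value.str = text;
    result->str_value.len = len;
    return result;
}
```

Passing AWK_STRNUM as kind corresponds to the user-input case described earlier: the data is stored as text, and gawk later decides whether it looks numeric.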
Conceptually, the first three members of the union (string, number, and array) are all that is needed for working with awk values. However, because the API provides routines for accessing and changing the value of a global scalar variable only by using the variable’s name, there is a performance penalty: gawk must find the variable each time it is accessed or changed. This turns out to be a real issue, not just a theoretical one.
Thus, if you know that your extension will spend considerable time reading and/or changing the value of one or more scalar variables, you can obtain a scalar cookie object for that variable, and then use the cookie for getting the variable’s value or for changing the variable’s value.
The awk_scalar_t type holds a scalar cookie, and the scalar_cookie macro provides access to the value of that type in the awk_value_t struct. Given a scalar cookie, gawk can directly retrieve or modify the value, as required, without having to find it first.
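The difference between access by name and access by cookie can be illustrated with a toy model. This is not how gawk stores its variables, and none of the names below belong to the extension API. The sketch only shows why a handle obtained once is cheaper than a search repeated on every access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy symbol table; purely illustrative, not gawk's implementation. */
typedef struct demo_var {
    const char *name;
    double value;
} demo_var_t;

typedef void *demo_cookie_t;    /* opaque to callers, like awk_scalar_t */

#define DEMO_NVARS 3
demo_var_t demo_table[DEMO_NVARS] = {
    { "NR", 0.0 }, { "count", 0.0 }, { "total", 0.0 }
};

/* Slow path: a search by name, needed on every access. */
demo_var_t *demo_find(const char *name)
{
    size_t i;

    for (i = 0; i < DEMO_NVARS; i++)
        if (strcmp(demo_table[i].name, name) == 0)
            return &demo_table[i];
    return NULL;
}

/* Update by name; returns 1 on success, 0 if there is no such variable. */
int demo_update_by_name(const char *name, double value)
{
    demo_var_t *var = demo_find(name);

    if (var == NULL)
        return 0;
    var->value = value;
    return 1;
}

/* Search once, then hand back an opaque cookie for later use. */
demo_cookie_t demo_get_cookie(const char *name)
{
    return demo_find(name);
}

/* Fast path: the cookie leads straight to the variable. */
void demo_update_by_cookie(demo_cookie_t cookie, double value)
{
    assert(cookie != NULL);
    ((demo_var_t *) cookie)->value = value;
}

double demo_read_by_cookie(demo_cookie_t cookie)
{
    assert(cookie != NULL);
    return ((demo_var_t *) cookie)->value;
}
```

An extension that updates the same variable inside a loop would call the equivalent of demo_get_cookie() once, before the loop, and use only the cookie inside it.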
The awk_value_cookie_t type and value_cookie macro are similar. If you know that you wish to use the same numeric or string value for one or more variables, you can create the value once, retaining a value cookie for it, and then pass in that value cookie whenever you wish to set the value of a variable. This saves storage space within the running gawk process and reduces the time needed to create the value.
See the “cookie” entry in the Jargon file for a definition of cookie, and the “magic cookie” entry in the Jargon file for a nice example. See also the entry for “Cookie” in the Glossary.