By default, gawk reads text files as its input. It uses the value
of RS to find the end of an input record, and then uses FS
(or FIELDWIDTHS or FPAT) to split it into fields (see Reading Input Files).
Additionally, it sets the value of RT (see Predefined Variables).
If you want, you can provide your own custom input parser. An input
parser’s job is to return a record to the gawk record-processing
code, along with indicators for the value and length of the data to be
used for RT, if any.
To provide an input parser, you must first provide two functions (where XXX is a prefix name for your extension):
awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);
This function examines the information available in iobuf
(which we discuss shortly). Based on the information there, it
decides if the input parser should be used for this file.
If so, it should return true. Otherwise, it should return false.
It should not change any state (variable values, etc.) within gawk.
awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);
When gawk
decides to hand control of the file over to the
input parser, it calls this function. This function in turn must fill
in certain fields in the awk_input_buf_t
structure and ensure
that certain conditions are true. It should then return true. If an
error of some kind occurs, it should not fill in any fields and should
return false; then gawk
will not use the input parser.
The details are presented shortly.
Your extension should package these functions inside an
awk_input_parser_t, which looks like this:
typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;
The fields are:
const char *name;
The name of the input parser. This is a regular C string.
awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
A pointer to your XXX_can_take_file()
function.
awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
A pointer to your XXX_take_control_of()
function.
awk_const struct awk_input_parser *awk_const next;
This is for use by gawk; therefore it is marked awk_const
so that the extension cannot modify it.
The steps are as follows:
1. Create a static awk_input_parser_t variable and initialize it
appropriately.
2. When your extension is loaded, register your input parser with
gawk using the register_input_parser() API function
(described later in this section).
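The two steps can be sketched as follows. This is only an illustration under stated assumptions: the type declarations are stand-ins copied from this section so that the code compiles outside gawk (a real extension would include gawkapi.h instead), the stand-in register_input_parser() merely remembers the pointer, and the demo_ names are hypothetical.

```c
#include <stddef.h>

/* Stand-in declarations mirroring this section, so the sketch compiles
   outside gawk. A real extension would #include "gawkapi.h" instead. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define awk_const const
typedef struct awk_input awk_input_buf_t;   /* left incomplete here */

typedef struct awk_input_parser {
    const char *name;
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;
} awk_input_parser_t;

/* Stand-in for register_input_parser(): it only remembers the pointer,
   which is an assumption made for this sketch, not gawk's behavior. */
static awk_input_parser_t *registered_parser = NULL;

static void register_input_parser(awk_input_parser_t *input_parser)
{
    registered_parser = input_parser;
}

/* Hypothetical parser functions using the prefix "demo". */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;   /* placeholder: never claims a file */
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;   /* placeholder: never takes control */
}

/* Step 1: a static awk_input_parser_t, initialized appropriately. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL                /* next: reserved for gawk */
};

/* Step 2: register the parser; a real extension would do this from
   its initialization code when it is loaded. */
static void demo_init(void)
{
    register_input_parser(&demo_parser);
}
```

The parser variable is static because gawk keeps the pointer it is given, so the structure has to outlive the registration call.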
An awk_input_buf_t looks like this:
typedef struct awk_input {
    const char *name;   /* filename */
    int fd;             /* file descriptor */
#define INVALID_HANDLE (-1)
    void *opaque;       /* private data for input parsers */
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)(int, void *, size_t);
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;   /* stat buf */
} awk_input_buf_t;
The fields can be divided into two categories: those for use (initially,
at least) by XXX_can_take_file()
, and those for use by
XXX_take_control_of()
. The first group of fields and their uses
are as follows:
const char *name;
The name of the file.
int fd;
A file descriptor for the file. gawk
attempts to open
the file for reading using the open()
system call. If it was
able to open the file, then fd
will not be equal to
INVALID_HANDLE
. Otherwise, it will.
An extension can decide that it doesn’t want to use the open file descriptor
provided by gawk. In such a case it can close the file and
set fd to INVALID_HANDLE, or it can leave it alone and
keep its own file descriptor in private data pointed to by the
opaque pointer (see further in this list). In any case, if
the file descriptor is valid, it should not just overwrite the
value with something else; doing so would cause a resource leak.
struct stat sbuf;
If the file descriptor is valid, then gawk
will have filled
in this structure via a call to the fstat()
system call.
Otherwise, if the lstat()
system call is available, it will
use that. If lstat()
is not available, then it uses stat()
.
Getting the file’s information allows extensions to check the type of
the file even if it could not be opened. This occurs, for example,
on Windows systems when trying to use open()
on a directory.
If gawk was not able to get the file information, then
sbuf will be zeroed out. In particular, extension code
can check if ‘sbuf.st_mode == 0’. If that’s true, then there
is no information in sbuf.
The XXX_can_take_file()
function should examine these
fields and decide if the input parser should be used for the file.
The decision can be made based upon gawk
state (the value
of a variable defined previously by the extension and set by
awk
code), the name of the
file, whether or not the file descriptor is valid, the information
in the struct stat
, or any combination of these factors.
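As an illustration, here is a minimal sketch of an XXX_can_take_file() function that bases its decision only on the struct stat information, accepting directories and nothing else. The structure is a trimmed stand-in containing just the first group of fields, and the prefix dirdemo is hypothetical; a real extension would use the declarations from gawkapi.h.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-ins for this sketch only; a real extension would
   #include "gawkapi.h" instead of declaring these. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;   /* filename */
    int fd;             /* file descriptor, or INVALID_HANDLE */
    void *opaque;       /* private data for input parsers */
    struct stat sbuf;   /* zeroed if gawk could not get the info */
} awk_input_buf_t;

/* Decide whether this (hypothetical) parser wants the file.
   It only inspects iobuf; it changes no state anywhere. */
static awk_bool_t dirdemo_can_take_file(const awk_input_buf_t *iobuf)
{
    if (iobuf == NULL)
        return awk_false;

    /* A zero st_mode means there is no information in sbuf. */
    if (iobuf->sbuf.st_mode == 0)
        return awk_false;

    /* The decision does not depend on fd: the file type can be
       checked even when the file could not be opened. */
    return S_ISDIR(iobuf->sbuf.st_mode) ? awk_true : awk_false;
}
```

Because the parameter is a pointer to a const structure, the compiler helps enforce the rule that this function must not modify anything.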
Once XXX_can_take_file() has returned true, and gawk
has decided to use your input parser, it calls
XXX_take_control_of(). That function then fills
either the get_record field or the read_func field in
the awk_input_buf_t. It must also ensure that fd is not
set to INVALID_HANDLE. The following list describes the fields that
may be filled by XXX_take_control_of():
void *opaque;
This is used to hold any state information needed by the input parser
for this file. It is “opaque” to gawk
. The input parser
is not required to use this pointer.
int (*get_record)(char **out,
struct awk_input *iobuf,
int *errcode,
char **rt_start,
size_t *rt_len,
const awk_fieldwidth_info_t **field_width);
This function pointer should point to a function that creates the input records. Said function is the core of the input parser. Its behavior is described in the text following this list.
ssize_t (*read_func)(int, void *, size_t);
This function pointer should point to a function that has the
same behavior as the standard POSIX read()
system call.
It is an alternative to the get_record
pointer. Its behavior
is also described in the text following this list.
void (*close_func)(struct awk_input *iobuf);
This function pointer should point to a function that does
the “teardown.” It should release any resources allocated by
XXX_take_control_of()
. It may also close the file. If it
does so, it should set the fd
field to INVALID_HANDLE
.
If fd
is still not INVALID_HANDLE
after the call to this
function, gawk
calls the regular close()
system call.
Having a “teardown” function is optional. If your input parser does
not need it, do not set this field. Then, gawk
calls the
regular close()
system call on the file descriptor, so it should
be valid.
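The following sketch shows one way an XXX_take_control_of() function and its matching “teardown” function might fit together. It rests on several assumptions: the declarations are stand-ins copied from this section, the demo_ names and the demo_state_t structure are hypothetical, and the record function is only a placeholder that reports end-of-file.

```c
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Stand-ins for this sketch; a real extension uses "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t; /* unused here */
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf, int *errcode,
                      char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)(int, void *, size_t);
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical per-file state kept behind the opaque pointer. */
typedef struct demo_state {
    char *buf;
    size_t size;
} demo_state_t;

/* Placeholder record function: always reports end-of-file. */
static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) rt_len; (void) field_width;
    return EOF;
}

/* Teardown: release what take_control_of allocated, close the file,
   and set fd to INVALID_HANDLE so that gawk does not close it again. */
static void demo_close(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    if (iobuf == NULL || iobuf->opaque == NULL)
        return;
    st = iobuf->opaque;
    free(st->buf);
    free(st);
    iobuf->opaque = NULL;
    if (iobuf->fd != INVALID_HANDLE) {
        close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;
    }
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;       /* this sketch needs gawk's open fd */
    if ((st = malloc(sizeof(*st))) == NULL)
        return awk_false;       /* error: no fields filled in */
    st->size = 4096;
    if ((st->buf = malloc(st->size)) == NULL) {
        free(st);
        return awk_false;
    }
    iobuf->opaque = st;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

Refusing a file whose descriptor is invalid is a choice made for this sketch; a parser that opens the file by other means would instead store a valid descriptor in fd before returning true.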
The XXX_get_record()
function does the work of creating
input records. The parameters are as follows:
char **out
This is a pointer to a char *
variable that is set to point
to the record. gawk
makes its own copy of the data, so
your extension must manage this storage.
struct awk_input *iobuf
This is the awk_input_buf_t
for the file. Two of its fields should
be used by your extension: fd
for reading data, and opaque
for managing any private state.
int *errcode
If an error occurs, *errcode
should be set to an appropriate
code from <errno.h>
.
char **rt_start
size_t *rt_len
If the concept of a “record terminator” makes sense, then
*rt_start
should be set to point to the data to be used for
RT
, and *rt_len
should be set to the length of the
data. Otherwise, *rt_len
should be set to zero.
Here too, gawk
makes its own copy of this data, so your
extension must manage this storage.
const awk_fieldwidth_info_t **field_width
If field_width is not NULL, then *field_width will be initialized
to NULL, and the function may set it to point to a structure
supplying field width information to override the default
field parsing mechanism. Note that this structure will not
be copied by gawk; it must persist at least until the next call
to get_record or close_func. Note also that field_width is
NULL when getline is assigning the results to a variable, thus
field parsing is not needed.
If the parser sets *field_width, then gawk uses this layout to
parse the input record, and the PROCINFO["FS"] value will be "API"
while this record is active in $0.
The awk_fieldwidth_info_t data structure is described below.
The return value is the length of the buffer pointed to by
*out, or EOF if end-of-file was reached or an error occurred.
It is guaranteed that errcode
is a valid pointer, so there is no
need to test for a NULL
value. gawk
sets *errcode
to zero, so there is no need to set it unless an error occurs.
If an error does occur, the function should return EOF
and set
*errcode
to a value greater than zero. In that case, if *errcode
does not equal zero, gawk
automatically updates
the ERRNO
variable based on the value of *errcode
.
(In general, setting ‘*errcode = errno’ should do the right thing.)
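To make the calling convention concrete, here is a minimal sketch of a record function that returns newline-terminated records, reporting the newline as the data for RT. It is an illustration only: the structure is a trimmed stand-in with just the two fields the function uses, line_state_t and the line_ prefix are hypothetical, and the state is assumed to have been set up beforehand with an allocated buffer of nonzero size.

```c
#include <errno.h>
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Trimmed stand-ins for this sketch; a real extension uses "gawkapi.h". */
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;
typedef struct awk_input {
    int fd;             /* for reading data */
    void *opaque;       /* private state */
} awk_input_buf_t;

/* Hypothetical private state; buf must be allocated and cap > 0. */
typedef struct line_state {
    char *buf;
    size_t cap;         /* allocated size */
    size_t len;         /* bytes of valid data in buf */
    size_t pos;         /* start of the next unread record */
} line_state_t;

static int line_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    line_state_t *st = iobuf->opaque;

    (void) field_width;     /* keep the default field parsing */
    for (;;) {
        char *start = st->buf + st->pos;
        char *nl = memchr(start, '\n', st->len - st->pos);
        ssize_t n;

        if (nl != NULL) {   /* a complete record is buffered */
            *out = start;
            *rt_start = nl; /* RT is the newline itself */
            *rt_len = 1;
            st->pos = (size_t) (nl - st->buf) + 1;
            return (int) (nl - start);
        }
        /* Move the partial record to the front; grow the buffer if full. */
        memmove(st->buf, start, st->len - st->pos);
        st->len -= st->pos;
        st->pos = 0;
        if (st->len == st->cap) {
            char *bigger = realloc(st->buf, st->cap * 2);
            if (bigger == NULL) {
                *errcode = ENOMEM;
                return EOF;
            }
            st->buf = bigger;
            st->cap *= 2;
        }
        n = read(iobuf->fd, st->buf + st->len, st->cap - st->len);
        if (n < 0) {
            *errcode = errno;   /* gawk updates ERRNO from this */
            return EOF;
        }
        if (n == 0) {           /* end-of-file */
            if (st->len == 0)
                return EOF;
            *out = st->buf;     /* final record without a terminator */
            *rt_len = 0;
            n = (ssize_t) st->len;
            st->len = 0;
            return (int) n;
        }
        st->len += (size_t) n;
    }
}
```

The pointers stored in *out and *rt_start refer to the parser's own buffer and stay valid only until the next call, which is sufficient because gawk makes its own copy of the data.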
As an alternative to supplying a function that returns an input record,
you may instead supply a function that simply reads bytes, and let
gawk
parse the data into records. If you do so, the data
should be returned in the multibyte encoding of the current locale.
Such a function should follow the same behavior as the read()
system call, and you fill in the read_func
pointer with its
address in the awk_input_buf_t
structure.
By default, gawk
sets the read_func
pointer to
point to the read()
system call. So your extension need not
set this field explicitly.
NOTE: You must choose one method or the other: either a function that returns a record, or one that returns raw data. In particular, if you supply a function to get a record,
gawk
will call it, and will never call the raw read function.
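For the raw-data alternative, a minimal sketch might look like the following: a function with the same behavior as read() that simply retries when the call is interrupted by a signal, installed by a hypothetical XXX_take_control_of() function. The structure is a trimmed stand-in with only the fields used here, and the names retry_read and rawdemo_take_control_of are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>

/* Trimmed stand-ins for this sketch; a real extension uses "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    int fd;
    ssize_t (*read_func)(int, void *, size_t);
} awk_input_buf_t;

/* Same contract as read(): returns the number of bytes read, zero at
   end-of-file, or -1 with errno set. The only difference is that an
   interrupted call (EINTR) is retried instead of being reported. */
static ssize_t retry_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Install the raw read function and leave get_record alone, so that
   gawk itself parses the bytes into records. */
static awk_bool_t rawdemo_take_control_of(awk_input_buf_t *iobuf)
{
    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;
    iobuf->read_func = retry_read;
    return awk_true;
}
```

With this approach RS, FS, and RT keep working as usual, since gawk still does the record and field splitting.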
gawk
ships with a sample extension that reads directories,
returning records for each entry in a directory (see Reading Directories). You may wish to use that code as a guide for writing
your own input parser.
When writing an input parser, you should think about (and document)
how it is expected to interact with awk
code. You may want
it to always be called, and to take effect as appropriate (as the
readdir
extension does). Or you may want it to take effect
based upon the value of an awk
variable, as the XML extension
from the gawkextlib
project does (see The gawkextlib
Project).
In the latter case, code in a BEGINFILE
rule
can look at FILENAME
and ERRNO
to decide whether or
not to activate your input parser (see The BEGINFILE
and ENDFILE
Special Patterns).
If you would like to override the default field parsing mechanism for a given
record, then you must populate an awk_fieldwidth_info_t
structure,
which looks like this:
typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;
The fields are:
awk_bool_t use_chars;
Set this to awk_true
if the field lengths are specified in terms
of potentially multi-byte characters, and set it to awk_false
if
the lengths are in terms of bytes.
Performance will be better if the values are supplied in
terms of bytes.
size_t nf;
Set this to the number of fields in the input record, i.e., NF.
struct awk_field_info fields[nf];
This is a variable-length array whose actual dimension should be nf.
For each field, the skip element should be set to the number
of characters or bytes, as controlled by the use_chars flag,
to skip before the start of this field. The len element provides
the length of the field. The values in fields[0] provide the information
for $1, and so on through the fields[nf-1] element containing the
information for $NF.
A convenience macro awk_fieldwidth_info_size(numfields)
is provided to
calculate the appropriate size of a variable-length
awk_fieldwidth_info_t
structure containing numfields
fields. This can
be used as an argument to malloc()
or in a union to allocate space
statically. Please refer to the readdir_test
sample extension for an
example.
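As a small illustration, the following sketch allocates and fills such a structure with malloc(). Several things here are assumptions: the macro definition is a stand-in written for this example (a real extension gets the macro from gawkapi.h), the helpers make_layout() and field_offset() are hypothetical, and skip is taken to be counted from the end of the previous field.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins mirroring this section; a real extension uses "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;

/* Assumed definition of the convenience macro, for this sketch only:
   one element is already part of the structure, so add nf - 1 more. */
#define awk_fieldwidth_info_size(numfields) \
    (sizeof(awk_fieldwidth_info_t) + \
     (((numfields) - 1) * sizeof(struct awk_field_info)))

/* Build a byte-based layout for nf fields (nf >= 1) from parallel
   arrays of skip and length values. The caller frees the result, but
   only after gawk no longer needs it. */
static awk_fieldwidth_info_t *make_layout(size_t nf, const size_t *skips,
                                          const size_t *lens)
{
    awk_fieldwidth_info_t *fw = malloc(awk_fieldwidth_info_size(nf));
    size_t i;

    if (fw == NULL)
        return NULL;
    fw->use_chars = awk_false;      /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = skips[i];
        fw->fields[i].len = lens[i];
    }
    return fw;
}

/* Offset of field i (zero-based) within the record, assuming each
   skip is measured from the end of the previous field. */
static size_t field_offset(const awk_fieldwidth_info_t *fw, size_t i)
{
    size_t off = 0, j;

    for (j = 0; j <= i; j++) {
        off += fw->fields[j].skip;
        if (j < i)
            off += fw->fields[j].len;
    }
    return off;
}
```

For the record "2024  42 ok", skip values of 0, 2, and 1 with lengths of 4, 2, and 2 would describe three fields; under the assumption above, the second field starts at byte offset 6 and the third at byte offset 9.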
You register your input parser with the following function:
void register_input_parser(awk_input_parser_t *input_parser);
Register the input parser pointed to by input_parser with gawk.