gawk
When using gawk, the value of RS is not limited to a one-character
string. If it contains more than one character, it is treated as a
regular expression (see Regular Expressions). (c.e.) In general, each
record ends at the next string that matches the regular expression; the
next record starts at the end of the matching string. This general rule
is actually at work in the usual case, where RS contains just a newline:
a record ends at the beginning of the next matching string (the next
newline in the input), and the following record starts just after the
end of this string (at the first character of the following line). The
newline, because it matches RS, is not part of either record.
When RS is a single character, RT contains the same single character.
However, when RS is a regular expression, RT contains the actual input
text that matched the regular expression.
If the input file ends without any text matching RS, gawk sets RT to
the null string.
The following example illustrates both of these features.
It sets RS equal to a regular expression that matches either a newline
or a series of one or more uppercase letters with optional leading
and/or trailing whitespace:

$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0,"and RT = [" RT "]" }'
-| Record = record 1 and RT = [ AAAA ]
-| Record = record 2 and RT = [ BBBB ]
-| Record = record 3 and RT = [
-| ]

The square brackets delineate the contents of RT, letting you see the
leading and trailing whitespace. The final value of RT is a newline.
See A Simple Stream Editor for a more useful example of RS as a regexp
and RT.
If you set RS to a regular expression that allows optional trailing
text, such as ‘RS = "abc(XYZ)?"’, it is possible, due to implementation
constraints, that gawk may match the leading part of the regular
expression, but not the trailing part, particularly if the input text
that could match the trailing part is fairly long. gawk attempts to
avoid this problem, but currently, there’s no guarantee that this will
never happen.
Caveats When Using Regular Expressions for RS

Remember that in

Record splitting with regular expressions works differently than
regexp matching with the
The use of RS as a regular expression and the RT variable are gawk
extensions; they are not available in compatibility mode
(see Command-Line Options).
In compatibility mode, only the first character of the value of RS
determines the end of the record.
mawk has allowed RS to be a regexp for decades.
As of October, 2019, BWK awk also supports it. Neither version supplies
RT, however.
RS = "\0" Is Not Portable

There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give RS a value
that does not occur in the input file.

You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good
value to use for RS:

BEGIN { RS = "\0" } # whole file becomes one record?

Almost all other

It happens that recent versions of

See Reading a Whole File at Once for an interesting way to read
whole files. If you are using