gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
section and are specific to gawk; they are not available in other awk
implementations.
Most of the additional operators deal with word matching.
For our purposes, a word is a sequence of one or more letters, digits,
or underscores (‘_’):
\s
Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.
\S
Matches any character that is not a space, as defined by the current locale. Think of it as shorthand for ‘[^[:space:]]’.
\w
Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
\W
Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
\<
Matches the empty string at the beginning of a word.
For example, /\<away/
matches ‘away’ but not
‘stowaway’.
\>
Matches the empty string at the end of a word.
For example, /stow\>/
matches ‘stow’ but not ‘stowaway’.
\y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, ‘\yballs?\y’ matches either ‘ball’ or ‘balls’, as a separate word.
\B
Matches the empty string that occurs between two
word-constituent characters. For example,
/\Brat\B/
matches ‘crate’, but it does not match ‘dirty rat’.
‘\B’ is essentially the opposite of ‘\y’.
Another way to think of this is that ‘\B’ matches the empty string
provided it’s not at the edge of a word.
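As a rough illustration of the word-matching operators above, the following shell sketch feeds a few made-up sample strings to gawk. It assumes GNU awk is installed under the name gawk; the inputs mirror the examples in the list and are illustrative only.

```shell
# Assumption: GNU awk is installed as "gawk"; the sample inputs are illustrative.

# \< anchors at the start of a word: prints "away" but not "stowaway".
printf 'away\nstowaway\n' | gawk '/\<away/'

# \> anchors at the end of a word: prints "stow" but not "stowaway".
printf 'stow\nstowaway\n' | gawk '/stow\>/'

# \y matches at either edge of a word: prints "ball" and "balls" only.
printf 'ball\nballs\nfootball\nballast\n' | gawk '/\yballs?\y/'

# \B matches only between two word-constituent characters: prints "crate".
printf 'crate\ndirty rat\n' | gawk '/\Brat\B/'

# \s and \w behave like [[:space:]] and [[:alnum:]_].
printf 'a \t b\n' | gawk '{ gsub(/\s+/, "_"); print }'        # a_b
echo 'foo_bar-baz 42' | gawk '{ gsub(/\w+/, "X"); print }'    # X-X X
```

Note that these operators are written with a single backslash only inside regexp constants (‘/.../’); the examples deliberately avoid string-form regexps.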
There are two other operators that work on buffers. In Emacs, a
buffer is, naturally, an Emacs buffer.
Other GNU programs, including gawk,
consider the entire string to match as the buffer.
The operators are:
\`
Matches the empty string at the beginning of a buffer (string).
\'
Matches the empty string at the end of a buffer (string).
Because ‘^’ and ‘$’ always work in terms of the beginning
and end of strings, these operators don’t add any new capabilities
for awk. They are provided for compatibility with other
GNU software.
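The equivalence between the buffer operators and ‘^’ and ‘$’ can be checked with a short sketch. It assumes gawk is on the PATH. The program is written to a temporary file (created with mktemp) only because the ‘\'’ operator contains a single quote, which is awkward inside a single-quoted shell argument. The test string is made up.

```shell
# Assumption: GNU awk is installed as "gawk". The program goes into a
# temporary file because \' is awkward to quote on the command line.
prog=$(mktemp)
cat > "$prog" <<'EOF'
BEGIN {
    s = "foo bar"
    # Each line prints a buffer-operator result next to its anchor twin.
    printf "%d %d\n", (s ~ /\`foo/), (s ~ /^foo/)
    printf "%d %d\n", (s ~ /bar\'/), (s ~ /bar$/)
    printf "%d %d\n", (s ~ /\`bar/), (s ~ /^bar/)
}
EOF
gawk -f "$prog"
rm -f "$prog"
```

Each output line should show the same value twice (‘1 1’, ‘1 1’, then ‘0 0’), since the string being matched is the whole buffer.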
In other GNU software, the word-boundary operator is ‘\b’. However,
that conflicts with the awk
language’s definition of ‘\b’
as backspace, so gawk
uses a different letter.
An alternative method would have been to require two backslashes in the
GNU operators, but this was deemed too confusing. The current
method of using ‘\y’ for the GNU ‘\b’ appears to be the
lesser of two evils.
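A small sketch of the difference, assuming gawk is on the PATH and using made-up input: ‘\y’ acts as the word-boundary operator, while ‘\b’ inside a regexp constant stands for a literal backspace character.

```shell
# Assumption: GNU awk is installed as "gawk"; inputs are illustrative.

# \y is the word-boundary operator in gawk.
echo 'a ball' | gawk '/\yball\y/ { print "boundary match" }'

# \b is a backspace character, so this prints nothing for ordinary text...
echo 'a ball' | gawk '/\bball\b/ { print "unexpected match" }'

# ...and matches only when literal backspaces surround the word.
printf 'a \bball\b\n' | gawk '/\bball\b/ { print "backspace match" }'
```

A pattern ported from grep or Perl that uses ‘\b’ for word boundaries will therefore silently fail to match in gawk unless ‘\b’ is changed to ‘\y’.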
Backreferences Are Not Supported
In POSIX Basic Regular Expressions (BREs), you can specify what are
called backreferences in regular expressions. For instance, in […]
This construct is not supported in POSIX Extended Regular Expressions
(EREs) such as are used in awk.
This is true even though the underlying regexp matching engine(s) used
by […]
We are told that BusyBox […]
The various command-line options
(see Command-Line Options)
control how gawk
interprets characters in regexps:
No options
In the default case, gawk provides all the facilities of POSIX regexps and the previously described GNU regexp operators.
--posix
Match only POSIX regexps; the GNU operators are not special (e.g., ‘\w’ matches a literal ‘w’). Interval expressions are allowed.
--traditional
Match traditional Unix awk regexps. The GNU operators are not special. Because BWK awk supports them, the POSIX character classes (‘[[:alnum:]]’, etc.) are available. So too, interval expressions are allowed. Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
--re-interval
This option remains for backwards compatibility but no longer has any real effect.
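The effect of these modes on a GNU operator can be sketched as follows. This assumes gawk is on the PATH and uses one-character inputs chosen for illustration. The values in the comments follow from the descriptions above rather than from any particular gawk release.

```shell
# Assumption: GNU awk is installed as "gawk". Stderr is discarded because
# some gawk versions may warn about escapes they do not treat as operators.

# Default: \w is a GNU operator, so it matches the letter "a".
echo 'a' | gawk '{ print ($0 ~ /\w/) }' 2>/dev/null                  # 1

# POSIX mode: \w is not special; it matches only a literal "w".
echo 'a' | gawk --posix '{ print ($0 ~ /\w/) }' 2>/dev/null          # 0
echo 'w' | gawk --posix '{ print ($0 ~ /\w/) }' 2>/dev/null          # 1

# Traditional mode: the GNU operators are likewise not special.
echo 'a' | gawk --traditional '{ print ($0 ~ /\w/) }' 2>/dev/null    # 0
```

Scripts that must behave the same in every mode can spell the class out as ‘[[:alnum:]_]’ instead of relying on ‘\w’.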