The grep family of programs searches files for patterns.
These programs have an unusual history.
Initially there was grep (Global Regular Expression Print),
which used what are now called Basic Regular Expressions (BREs).
Later there was egrep (Extended grep), which used
what are now called Extended Regular Expressions (EREs). (These are almost
identical to those available in awk; see Regular Expressions.)
There was also fgrep (Fast grep), which searched
for matches of one or more fixed strings.
POSIX chose to combine these three programs into one, simply named
grep. On a POSIX system, grep’s default behavior
is to search using BREs. You use -E to specify the use
of EREs, and -F to specify searching for fixed strings.
In practice, systems continue to come with separate egrep
and fgrep utilities, for backwards compatibility. This
section provides an awk implementation of egrep,
which supports all of the POSIX-mandated options.
You invoke it as follows:
    egrep [options] 'pattern' files ...
The pattern is a regular expression. In typical usage, the regular
expression is quoted to prevent the shell from expanding any of the
special characters as file name wildcards. Normally, egrep
prints the lines that matched. If multiple file names are provided on
the command line, each output line is preceded by the name of the file
and a colon.
The options to egrep are as follows:
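The quoting and file-name-prefix behavior can be tried out with a small shell session. This is only a sketch: it uses the system's grep -E as a stand-in for the awk program, and the sample files a.txt and b.txt are hypothetical files created just for the illustration.

```shell
#!/bin/sh
# Hypothetical sample files, created in a scratch directory (assumption:
# mktemp -d is available, as it is on most current systems).
dir=$(mktemp -d)
printf 'apple pie\nbanana split\n' > "$dir/a.txt"
printf 'cherry tart\napple crumble\n' > "$dir/b.txt"

# The quotes keep the shell from treating '|' as a pipe.
# One file: matching lines are printed as they are.
one_file=$(cd "$dir" && grep -E 'apple|cherry' a.txt)

# Several files: each matching line is preceded by "name:".
two_files=$(cd "$dir" && grep -E 'apple|cherry' a.txt b.txt)

printf '%s\n' "$one_file"     # apple pie
printf '%s\n' "$two_files"    # a.txt:apple pie
                              # b.txt:cherry tart
                              # b.txt:apple crumble
rm -rf "$dir"
```

Without the quotes, the shell would split the command at the ‘|’ and try to run a program named cherry.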
-c
Print a count of the lines that matched the pattern, instead of the lines themselves.
-e pattern
Use pattern as the regexp to match. The purpose of the -e option is to allow patterns that start with a ‘-’.
-i
Ignore case distinctions in both the pattern and the input data.
-l
Only print (list) the names of the files that matched, not the lines that matched.
-n
Add the line number within the file to each line of output.
-q
Be quiet. No output is produced and the exit value indicates whether the pattern was matched.
-s
Be silent. Do not print error messages for files that could not be opened.
-v
Invert the sense of the test. egrep prints the lines that do not match the pattern and exits successfully if the pattern is not matched.
-x
Match the entire input line in order to consider the match as having succeeded.
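The effect of several of these options can be checked against a tiny input. The following is a sketch that again assumes the system's grep -E behaves as described here; the file words.txt and its contents are made up for the example.

```shell
#!/bin/sh
# Hypothetical three-line input file in a scratch directory.
dir=$(mktemp -d)
f="$dir/words.txt"
printf 'Alpha\nbeta\nalphabet\n' > "$f"

count=$(grep -E -c 'alpha' "$f")       # 1: only "alphabet" matches
nocase=$(grep -E -c -i 'alpha' "$f")   # 2: "Alpha" now matches as well
names=$(grep -E -l 'beta' "$f")        # the file name, not the line
inverted=$(grep -E -v 'alpha' "$f")    # "Alpha" and "beta"
whole=$(grep -E -i -x 'alpha' "$f")    # "Alpha": the whole line must match

# -q prints nothing; only the exit status reports the result.
status=0
grep -E -q 'gamma' "$f" || status=$?   # 1: no line matched

printf '%s\n' "$count" "$nocase" "$names" "$inverted" "$whole" "$status"
rm -rf "$dir"
```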
This version uses the getopt() library function
(see Processing Command-Line Options) and gawk’s
BEGINFILE and ENDFILE special patterns
(see The BEGINFILE and ENDFILE Special Patterns).

The program begins with descriptive comments and then a BEGIN rule
that processes the command-line arguments with getopt(). The -i
(ignore case) option is particularly easy with gawk; we just use the
IGNORECASE predefined variable
(see Predefined Variables):
    # egrep.awk --- simulate egrep in awk
    #
    # Options:
    #    -c    count of lines
    #    -e    argument is pattern
    #    -i    ignore case
    #    -l    print filenames only
    #    -n    add line number to output
    #    -q    quiet - use exit value
    #    -s    silent - don't print errors
    #    -v    invert test, success if no match
    #    -x    the entire line must match
    #
    # Requires getopt library function
    # Uses IGNORECASE, BEGINFILE and ENDFILE
    # Invoke using gawk -f egrep.awk -- options ...

    BEGIN {
        while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) {
            if (c == "c")
                count_only++
            else if (c == "e")
                pattern = Optarg
            else if (c == "i")
                IGNORECASE = 1
            else if (c == "l")
                filenames_only++
            else if (c == "n")
                line_numbers++
            else if (c == "q")
                no_print++
            else if (c == "s")
                no_errors++
            else if (c == "v")
                invert++
            else if (c == "x")
                full_line++
            else
                usage()
        }
Note the comment about invocation: Because several of the options overlap
with gawk’s, a -- is needed to tell gawk to stop looking for options.

Next comes the code that handles the egrep-specific behavior.
egrep uses the first nonoption on the command line
if no pattern is supplied with -e.
If the pattern is empty, that means no pattern was supplied, so it’s
necessary to print an error message and exit.
The command-line arguments up to ARGV[Optind]
are cleared, so that gawk won’t try to process them as files. If no
files are specified, the standard input is used, and if multiple files are
specified, we make sure to note this so that the file names can precede the
matched lines in the output:
        if (pattern == "")
            pattern = ARGV[Optind++]

        if (pattern == "")
            usage()

        for (i = 1; i < Optind; i++)
            ARGV[i] = ""

        if (Optind >= ARGC) {
            ARGV[1] = "-"
            ARGC = 2
        } else if (ARGC - Optind > 1)
            do_filenames++
    }
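The trick of clearing ARGV elements can be seen in isolation. The following sketch does not use getopt(); it simply assumes that the first argument is the pattern, and the file data.txt is a hypothetical file created for the demonstration. It relies only on the fact that awk skips empty ARGV elements when it looks for input files.

```shell
#!/bin/sh
# Hypothetical one-line data file in a scratch directory.
dir=$(mktemp -d)
printf 'from data file\n' > "$dir/data.txt"

# ARGV[1] holds the pattern here, not a file name. Setting it to ""
# keeps awk from trying to open a file named "data".
cleared=$(awk 'BEGIN { pattern = ARGV[1]; ARGV[1] = "" }
               $0 ~ pattern { print }' 'data' "$dir/data.txt")

printf '%s\n' "$cleared"    # from data file
rm -rf "$dir"
```

If the assignment to ARGV[1] were left out, awk would attempt to read a file named data before reaching the real input file.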
The BEGINFILE rule executes
when each new file is processed. In this case, it is fairly simple; it
initializes a variable fcount to zero. fcount tracks
how many lines in the current file matched the pattern.

Here also is where we implement the -s option. We check
if ERRNO has been set, and if -s was supplied.
In that case, it’s necessary to move on to the next file. Otherwise
gawk would exit with an error:
    BEGINFILE {
        fcount = 0
        if (ERRNO && no_errors)
            nextfile
    }
The ENDFILE rule executes after each file has been processed.
It affects the output only when the user wants a count of the number of lines that
matched. no_print is true only if the exit status is desired.
count_only is true if line counts are desired. egrep
therefore only prints line counts if printing and counting are enabled.
The output format must be adjusted depending upon the number of files to
process. Finally, fcount is added to total, so that we
know the total number of lines that matched the pattern:
    ENDFILE {
        if (! no_print && count_only) {
            if (do_filenames)
                print FILENAME ":" fcount
            else
                print fcount
        }

        total += fcount
    }
The following rule does most of the work of matching lines. The variable
matches is true (non-zero) if the line matched the pattern.
If the user specified that the entire line must match (with -x),
the code checks this condition by looking at the values of
RSTART and RLENGTH. If those indicate that the match
is not over the full line, matches is set to zero (false).

If the user
wants lines that did not match, we invert the sense of matches
using the ‘!’ operator. We then increment fcount with the value of
matches, which is either one or zero, depending upon a
successful or unsuccessful match. If the line does not match, the
next statement just moves on to the next input line.

We make a number of additional tests, but only if we
are not counting lines. First, if the user only wants the exit status
(no_print is true), then it is enough to know that one
line in this file matched, and we can skip on to the next file with
nextfile. Similarly, if we are only printing file names, we can
print the file name, and then skip to the next file with nextfile.
Finally, each line is printed, preceded when necessary by the file name
and the line number, each followed by a colon:
    {
        # match() returns the position of the match; reduce it to 1 or 0
        matches = (match($0, pattern) != 0)
        if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
            matches = 0

        if (invert)
            matches = ! matches

        fcount += matches    # 1 or 0

        if (! matches)
            next

        if (! count_only) {
            if (no_print)
                nextfile

            if (filenames_only) {
                print FILENAME
                nextfile
            }

            if (do_filenames) {
                if (line_numbers)
                    print FILENAME ":" FNR ":" $0
                else
                    print FILENAME ":" $0
            } else if (line_numbers)
                print FNR ":" $0    # -n with a single input file
            else
                print
        }
    }
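The behavior of match(), RSTART, and RLENGTH that this rule depends on can be checked from the shell. The following is a minimal sketch using a fixed regexp constant, /abc/, and made-up input lines instead of the program's pattern variable; it should work with any POSIX awk.

```shell
#!/bin/sh
# match() returns the starting position of the match (0 if there is none),
# so its value is not limited to 1 or 0.
positions=$(printf 'abc\nxxabc\nxyz\n' | awk '{ print match($0, /abc/) }')

# The full-line test used for -x: the match must start at the first
# character and be as long as the whole line.
full=$(printf 'abc\nxxabc\nabcd\n' |
       awk 'match($0, /abc/) && RSTART == 1 && RLENGTH == length($0)')

printf '%s\n' "$positions"   # 1, 3, and 0 on separate lines
printf '%s\n' "$full"        # abc
```

Because the return value is a position, it has to be reduced to one or zero before it can be added to a counter of matching lines.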
The END rule takes care of producing the correct exit status. If
there are no matches, the exit status is one; otherwise, it is zero:
    END {
        exit (total == 0)
    }
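The exit status convention is easy to observe from the shell. In this sketch, status_of is a hypothetical helper written for the illustration; it counts matching lines from standard input and ends with the same END rule, without any of the option handling.

```shell
#!/bin/sh
# status_of PATTERN: read standard input, exit 0 if any line matches
# PATTERN, and exit 1 otherwise. (Hypothetical helper, for illustration.)
status_of() {
    awk -v pattern="$1" '$0 ~ pattern { total++ }
                         END { exit (total == 0) }'
}

found=0
printf 'one\ntwo\n' | status_of 'tw' || found=$?      # a line matched

missing=0
printf 'one\ntwo\n' | status_of 'zz' || missing=$?    # nothing matched

printf '%s %s\n' "$found" "$missing"    # 0 1
```

The comparison total == 0 evaluates to one when nothing matched and to zero otherwise, which is exactly the status that the shell treats as failure and success.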
The usage() function prints a usage message in case of invalid options,
and then exits:
    function usage()
    {
        print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
        print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
        exit 1
    }