11.2.2 Searching for Regular Expressions in Files

The grep family of programs searches files for patterns. These programs have an unusual history. Initially there was grep (Global Regular Expression Print), which used what are now called Basic Regular Expressions (BREs). Later came egrep (Extended grep), which used what are now called Extended Regular Expressions (EREs). (These are almost identical to those available in awk; see Regular Expressions.) There was also fgrep (Fast grep), which searched for matches of one or more fixed strings.

POSIX chose to combine these three programs into one, simply named grep. On a POSIX system, grep’s default behavior is to search using BREs. You use -E to specify the use of EREs, and -F to specify searching for fixed strings.
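The difference between the three modes is easiest to see with a pattern containing a character that is special only in EREs. The following is a minimal sketch, assuming a POSIX-conforming grep is available on the search path; the sample data and the variable names are purely illustrative:

```shell
# Three sample lines; only the first contains a literal "+".
sample='a+b
aab
ab'

# BRE (default): "+" is an ordinary character, so only "a+b" matches.
bre_out=$(printf '%s\n' "$sample" | grep 'a+b')

# ERE (-E): "+" means "one or more", so "aab" and "ab" match.
ere_out=$(printf '%s\n' "$sample" | grep -E 'a+b')

# Fixed string (-F): no character is special, so only "a+b" matches.
fixed_out=$(printf '%s\n' "$sample" | grep -F 'a+b')

printf 'BRE:\n%s\n' "$bre_out"
printf 'ERE:\n%s\n' "$ere_out"
printf 'Fixed:\n%s\n' "$fixed_out"
```

Note that the ERE does not match the line a+b itself, because there the a is followed by a plus sign rather than by a b.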

In practice, systems continue to come with separate egrep and fgrep utilities, for backwards compatibility. This section provides an awk implementation of egrep, which supports most of the POSIX-mandated options (all but -f, which reads patterns from a file). You invoke it as follows:

egrep [options] 'pattern' files ...

The pattern is a regular expression. In typical usage, the regular expression is quoted to prevent the shell from expanding any of the special characters as file name wildcards. Normally, egrep prints the lines that matched. If multiple file names are provided on the command line, each output line is preceded by the name of the file and a colon.

The options to egrep are as follows:

-c

Print a count of the lines that matched the pattern, instead of the lines themselves.

-e pattern

Use pattern as the regexp to match. The purpose of the -e option is to allow patterns that start with a ‘-’.

-i

Ignore case distinctions in both the pattern and the input data.

-l

Only print (list) the names of the files that matched, not the lines that matched.

-n

Precede each output line with its line number within its input file.

-q

Be quiet. No output is produced and the exit value indicates whether the pattern was matched.

-s

Be silent. Do not print error messages for files that could not be opened.

-v

Invert the sense of the test. egrep prints the lines that do not match the pattern and exits successfully if the pattern is not matched.

-x

Match the entire input line in order to consider the match as having succeeded.
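Because these options are the ones POSIX specifies for grep, their effect can be tried out with the system's own grep -E before reading the awk version. The following is a rough sketch under that assumption; the sample data and variable names are illustrative only:

```shell
# Sample input (illustrative).
fruit='apple
Apple pie
banana'

# -c: count matching lines instead of printing them.
count=$(printf '%s\n' "$fruit" | grep -E -c 'apple')

# -i: ignore case, so "Apple pie" now matches too.
count_nocase=$(printf '%s\n' "$fruit" | grep -E -i -c 'apple')

# -x: the whole line must match, which excludes "Apple pie".
whole=$(printf '%s\n' "$fruit" | grep -E -i -x 'apple')

# -v: print the lines that do NOT match.
inverted=$(printf '%s\n' "$fruit" | grep -E -v 'an+a')

# -q: no output; only the exit status reports the result.
if printf '%s\n' "$fruit" | grep -E -q 'cherry'; then
    found=yes
else
    found=no
fi

printf '%s %s %s %s\n' "$count" "$count_nocase" "$whole" "$found"
printf '%s\n' "$inverted"
```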

This version uses the getopt() library function (see Processing Command-Line Options) and gawk’s BEGINFILE and ENDFILE special patterns (see The BEGINFILE and ENDFILE Special Patterns).

The program begins with descriptive comments and then a BEGIN rule that processes the command-line arguments with getopt(). The -i (ignore case) option is particularly easy with gawk; we just use the IGNORECASE predefined variable (see Predefined Variables):

# egrep.awk --- simulate egrep in awk
#
# Options:
#    -c    count of lines
#    -e    argument is pattern
#    -i    ignore case
#    -l    print filenames only
#    -n    add line number to output
#    -q    quiet - use exit value
#    -s    silent - don't print errors
#    -v    invert test, success if no match
#    -x    the entire line must match
#
# Requires getopt library function
# Uses IGNORECASE, BEGINFILE and ENDFILE
# Invoke using gawk -f egrep.awk -- options ...

BEGIN {
    while ((c = getopt(ARGC, ARGV, "ce:ilnqsvx")) != -1) {
        if (c == "c")
            count_only++
        else if (c == "e")
            pattern = Optarg
        else if (c == "i")
            IGNORECASE = 1
        else if (c == "l")
            filenames_only++
        else if (c == "n")
            line_numbers++
        else if (c == "q")
            no_print++
        else if (c == "s")
            no_errors++
        else if (c == "v")
            invert++
        else if (c == "x")
            full_line++
        else
            usage()
    }

Note the comment about invocation: because several of the options overlap with gawk’s, a -- is needed to tell gawk to stop looking for its own options.

Next comes the code that handles the egrep-specific behavior. egrep uses the first nonoption on the command line if no pattern is supplied with -e. If the pattern is empty, that means no pattern was supplied, so it’s necessary to print an error message and exit. The command-line arguments up to ARGV[Optind] are cleared, so that gawk won’t try to process them as files. If no files are specified, the standard input is used, and if multiple files are specified, we make sure to note this so that the file names can precede the matched lines in the output:

    if (pattern == "")
        pattern = ARGV[Optind++]

    if (pattern == "")
        usage()

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (Optind >= ARGC) {
        ARGV[1] = "-"
        ARGC = 2
    } else if (ARGC - Optind > 1)
        do_filenames++
}
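Clearing ARGV elements works because awk skips any command-line argument whose value is the empty string when it looks for input files. The following minimal sketch shows just that mechanism, outside of egrep; it assumes a mktemp utility is available, and the argument not-a-file is a made-up stand-in for an option or pattern left in ARGV:

```shell
# Create a small input file (the name comes from mktemp).
input=$(mktemp)
printf 'one\ntwo\n' > "$input"

# "not-a-file" stands in for an option or pattern left in ARGV.
# Clearing ARGV[1] in BEGIN makes awk skip it and read only "$input".
argv_out=$(awk 'BEGIN { ARGV[1] = "" } { print }' not-a-file "$input")

rm -f "$input"
printf '%s\n' "$argv_out"
```

Without the assignment in the BEGIN rule, awk would try to open not-a-file as its first input file and report an error.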

The BEGINFILE rule executes when each new file is processed. In this case, it is fairly simple; it initializes a variable fcount to zero. fcount tracks how many lines in the current file matched the pattern.

Here also is where we implement the -s option. If ERRNO is set, the file could not be opened. When -s was also supplied, the rule moves on to the next file with nextfile, which suppresses the error. Without -s, the rule does nothing more, and gawk reports the failure as a fatal error and exits:

BEGINFILE {
    fcount = 0
    if (ERRNO && no_errors)
        nextfile
}

The ENDFILE rule executes after each file has been processed. It affects the output only when the user wants a count of the number of lines that matched. no_print is true only if the exit status is desired. count_only is true if line counts are desired. egrep therefore only prints line counts if printing and counting are enabled. The output format must be adjusted depending upon the number of files to process. Finally, fcount is added to total, so that we know the total number of lines that matched the pattern:

ENDFILE {
    if (! no_print && count_only) {
        if (do_filenames)
            print FILENAME ":" fcount
        else
            print fcount
    }

    total += fcount
}

The following rule does most of the work of matching lines. The variable matches is true (non-zero) if the line matched the pattern. If the user specified that the entire line must match (with -x), the code checks this condition by looking at the values of RSTART and RLENGTH. If those indicate that the match is not over the full line, matches is set to zero (false).

If the user wants lines that did not match, we invert the sense of matches using the ‘!’ operator. We then increment fcount with the value of matches, which is either one or zero, depending upon a successful or unsuccessful match. If the line does not match, the next statement just moves on to the next input line.

We make a number of additional tests, but only if we are not counting lines. First, if the user only wants the exit status (no_print is true), then it is enough to know that one line in this file matched, and we can skip on to the next file with nextfile. Similarly, if we are only printing file names, we can print the file name, and then skip to the next file with nextfile. Finally, each matching line is printed, preceded as necessary by the file name and a colon and by the line number (for -n) and a colon:

{
    # match() returns the starting position, so normalize it to 1 or 0
    matches = (match($0, pattern) > 0)
    if (matches && full_line && (RSTART != 1 || RLENGTH != length()))
        matches = 0

    if (invert)
        matches = ! matches

    fcount += matches    # 1 or 0

    if (! matches)
        next

    if (! count_only) {
        if (no_print)
            nextfile

        if (filenames_only) {
            print FILENAME
            nextfile
        }

        # -n applies whether or not file names are printed
        if (do_filenames)
            printf "%s:", FILENAME
        if (line_numbers)
            printf "%d:", FNR
        print
    }
}
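The full-line test relies on the fact that match() sets RSTART to the position where the match begins and RLENGTH to its length. The following is a small sketch of that check in isolation, using only POSIX awk features; the pattern fo+ and the three input lines are made up for illustration:

```shell
full_line_out=$(printf 'foo\nfoobar\nxfoo\n' | awk '
{
    # match() returns the starting position; compare with 0 to get 1 or 0.
    any = (match($0, "fo+") > 0)

    # The match covers the whole line only if it starts at position 1
    # and is as long as the line itself.
    full = (any && RSTART == 1 && RLENGTH == length($0))

    print $0 ": any=" any " full=" full
}')

printf '%s\n' "$full_line_out"
```

All three lines contain a match, but only the first one is matched in its entirety.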

The END rule takes care of producing the correct exit status. If there are no matches, the exit status is one; otherwise, it is zero:

END {
    exit (total == 0)
}
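The same exit-status idiom can be tried on its own. In the following sketch, has_match is a hypothetical shell helper (not part of egrep.awk) that wraps a two-rule awk program; it assumes only a POSIX awk:

```shell
# Hypothetical helper: count the lines matching a regexp and exit
# with 0 if at least one matched, or with 1 otherwise.
has_match() {
    awk -v pattern="$1" '
        $0 ~ pattern { total++ }
        END { exit (total == 0) }
    '
}

# Record the exit status of each run without aborting on failure.
printf 'alpha\nbeta\n' | has_match 'et' && status_found=0 || status_found=$?
printf 'alpha\nbeta\n' | has_match 'zz' && status_missing=0 || status_missing=$?

printf 'found: %s, missing: %s\n' "$status_found" "$status_missing"
```

The comparison total == 0 evaluates to one when nothing matched and to zero otherwise, which is exactly the convention that shell conditionals such as if and && expect.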

The usage() function prints a usage message in case of invalid options, and then exits:

function usage()
{
    print("Usage:\tegrep [-cilnqsvx] [-e pat] [files ...]") > "/dev/stderr"
    print("\tegrep [-cilnqsvx] pat [files ...]") > "/dev/stderr"
    exit 1
}