10.3.1 Noting Data File Boundaries

The BEGIN and END rules are each executed exactly once, at the beginning and end of your awk program, respectively (see The BEGIN and END Special Patterns). We (the gawk authors) once had a user who mistakenly thought that the BEGIN rules were executed at the beginning of each data file and the END rules were executed at the end of each data file.

When informed that this was not the case, the user requested that we add new special patterns to gawk, named BEGIN_FILE and END_FILE, that would have the desired behavior. He even supplied us the code to do so.

Adding these special patterns to gawk wasn’t necessary; the job can be done cleanly in awk itself, as illustrated by the following library program. It arranges to call two user-supplied functions, beginfile() and endfile(), at the beginning and end of each data file. Besides solving the problem in only nine(!) lines of code, it does so portably; this works with any implementation of awk:

# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.

FILENAME != _oldfilename {
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}

END { endfile(FILENAME) }

This file must be loaded before the user’s “main” program, so that the rule it supplies is executed first.

This rule relies on awk’s FILENAME variable, which automatically changes for each new data file. The current file name is saved in a private variable, _oldfilename. If FILENAME does not equal _oldfilename, then a new data file is being processed and it is necessary to call endfile() for the old file. Because endfile() should only be called if a file has been processed, the program first checks to make sure that _oldfilename is not the null string. The program then assigns the current file name to _oldfilename and calls beginfile() for the file. Because, like all awk variables, _oldfilename is initialized to the null string, this rule executes correctly even for the first data file.

The program also supplies an END rule to do the final processing for the last file. Because this END rule comes before any END rules supplied in the “main” program, endfile() is called first. Once again, the value of multiple BEGIN and END rules should be clear.
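As a rough illustration of how a "main" program might plug into this hook, here is a minimal sketch. The file name perfile.awk and the variable nrecs are made up for this example and are not part of the library; the only names the library requires are beginfile() and endfile().

```awk
# perfile.awk --- hypothetical main program that counts records per file
#
# Assumes transfile.awk is loaded first, so that its rule
# calls beginfile() before the counting rule below runs.

function beginfile(name)
{
    nrecs = 0        # start a fresh count for each data file
}

function endfile(name)
{
    printf "%s: %d records\n", name, nrecs
}

{ nrecs++ }          # runs for every record, including the first
```

Assuming both files are in the current directory, this could be run as awk -f transfile.awk -f perfile.awk file1 file2, with the library named first so that its rule is executed before the counting rule.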

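If the same data file occurs twice in a row on the command line, FILENAME does not change between the two passes, so endfile() and beginfile() are not called at the end of the first pass and at the beginning of the second pass. The following version solves the problem by testing FNR, which awk resets to one at the first record of every data file: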

# ftrans.awk --- handle datafile transitions
#
# user supplies beginfile() and endfile() functions

FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END { endfile(_filename_) }

Counting Things shows how this library program can be used and how it simplifies writing the main program.
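To see the difference between the two tests used above, the following small sketch may help. The file name show.awk, the variable _prev, and the messages are invented for this demonstration; it only reports which condition holds and is not meant as a replacement for either library file.

```awk
# show.awk --- hypothetical demo comparing the two transition tests

# Test used by transfile.awk: fires only when the name changes
FILENAME != _prev {
    print "name changed:", FILENAME
    _prev = FILENAME
}

# Test used by ftrans.awk: fires at the first record of every file
FNR == 1 {
    print "FNR reset:", FILENAME
}
```

If this is run as awk -f show.awk data data on a nonempty file named data, the first rule fires once and the second rule fires twice, because FNR starts over for the second pass even though FILENAME stays the same.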

So Why Does gawk Have BEGINFILE and ENDFILE?

You are probably wondering, if beginfile() and endfile() functions can do the job, why does gawk have BEGINFILE and ENDFILE patterns?

Good question. Normally, if awk cannot open a file, this causes an immediate fatal error. In this case, there is no way for a user-defined function to deal with the problem, as the mechanism for calling it relies on the file being open and at the first record. Thus, the main reason for BEGINFILE is to give you a “hook” to catch files that cannot be processed. ENDFILE exists for symmetry, and because it provides an easy way to do per-file cleanup processing. For more information, refer to The BEGINFILE and ENDFILE Special Patterns.
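A minimal sketch of such a hook follows. It is gawk-specific and relies on gawk setting ERRNO to a nonempty string in a BEGINFILE rule when the file could not be opened, and on nextfile being allowed there to skip the file; the wording of the messages and the variable nrecs are only illustrative.

```awk
# skipbad.awk --- hypothetical gawk-only sketch, not portable awk

BEGINFILE {
    if (ERRNO != "") {
        # The file could not be opened: report it and move on
        # instead of letting gawk stop with a fatal error.
        printf "skipping %s: %s\n", FILENAME, ERRNO > "/dev/stderr"
        nextfile
    }
    nrecs = 0        # per-file setup for files that did open
}

{ nrecs++ }

ENDFILE {
    # Per-file cleanup; runs after the last record of each file
    printf "%s: %d records\n", FILENAME, nrecs
}
```

Because the check happens before any record is read, this covers the case that the beginfile() approach cannot reach.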