In programs that use arrays, it is often necessary to use a loop that
executes once for each element of an array. In other languages, where
arrays are contiguous and indices are limited to nonnegative integers,
this is easy: all the valid indices can be found by counting from
the lowest index up to the highest. This technique won’t do the job
in awk, because any number or string can be an array index.
So awk has a special kind of for statement for scanning an array:
for (var in array) body
This loop executes body once for each index in array that the program has previously used, with the variable var set to that index.
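As a small illustration of why counting through indices is not enough, the sketch below stores a string index, a positive index, and a negative index, then visits all three with the for statement. The array name seen, its indices, and the shell wrapper function show_indices are made up for this example.

```shell
# Hypothetical example: the array name "seen" and its indices are made up.
show_indices() {
    awk 'BEGIN {
        seen["north"] = 1    # a string index
        seen[10] = 1         # a positive integer index
        seen[-1] = 1         # a negative index; counting up from 0 would miss it

        # Visit every index that has been used. The order is unspecified.
        for (idx in seen)
            print idx
    }'
}

show_indices
```

Each of the three indices is printed exactly once, but the order in which they appear may differ from one awk implementation to another.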
The following program uses this form of the for statement. The
first rule scans the input records and notes which words appear (at
least once) in the input, by storing a 1 in the array used with
the word as the index. The second rule scans the elements of used to
find all the distinct words that appear in the input. It prints each
word that is more than 10 characters long and also prints the number of
such words.
See String-Manipulation Functions for more information on the built-in function length().
    # Record a 1 for each word that is used at least once
    {
        for (i = 1; i <= NF; i++)
            used[$i] = 1
    }

    # Find number of distinct words more than 10 characters long
    END {
        for (x in used) {
            if (length(x) > 10) {
                ++num_long_words
                print x
            }
        }
        print num_long_words, "words longer than 10 characters"
    }
See Generating Word-Usage Counts for a more detailed example of this type.
The order in which elements of the array are accessed by this statement
is determined by the internal arrangement of the array elements within
awk and in standard awk cannot be controlled
or changed. This can lead to problems if new elements are added to
array by statements in the loop body; it is not predictable whether
the for loop will reach them. Similarly, changing var inside
the loop may produce strange results. It is best to avoid such things.
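Because the traversal order cannot be controlled in standard awk, a program that needs its output in a particular order has to impose that order itself. One common approach, sketched here under the assumption that an external sort utility is available, is to pipe the indices to sort from inside the loop. The function name sorted_words and the sample input are hypothetical.

```shell
# Hypothetical helper; the sample input words are made up.
sorted_words() {
    printf '%s\n' 'pear apple' 'fig apple' |
    awk '
    # Record a 1 for each word that is used at least once
    { for (i = 1; i <= NF; i++) used[$i] = 1 }

    # Send the indices to an external sort, since the loop order is unspecified
    END {
        cmd = "LC_ALL=C sort"
        for (x in used)
            print x | cmd
        close(cmd)    # wait for sort to finish and flush its output
    }'
}

sorted_words
```

The loop still visits the indices in whatever order awk chooses; only the final output is ordered, because sort sees all of the lines before printing any of them.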
As a point of information, gawk sets up the list of elements
to be iterated over before the loop starts, and does not change it.
But not all awk versions do so. Consider this program, named
loopcheck.awk:
    BEGIN {
        a["here"] = "here"
        a["is"] = "is"
        a["a"] = "a"
        a["loop"] = "loop"
        for (i in a) {
            j++
            a[j] = j
            print i
        }
    }
Here is what happens when run with gawk (and mawk):
    $ gawk -f loopcheck.awk
    -| here
    -| loop
    -| a
    -| is
Contrast this to BWK awk:
    $ nawk -f loopcheck.awk
    -| loop
    -| here
    -| is
    -| a
    -| 1
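One way to get the same set of visited indices from every awk version is to copy the indices into a second array before changing the first. The following is a minimal sketch of that idea, based on loopcheck.awk; the keys array, the counter n, and the wrapper function snapshot_loop are additions made for illustration and are not part of the original program.

```shell
# Hypothetical variant of loopcheck.awk; the keys array is an addition.
snapshot_loop() {
    awk 'BEGIN {
        a["here"] = "here"
        a["is"] = "is"
        a["a"] = "a"
        a["loop"] = "loop"

        # First pass: copy the existing indices into a separate array
        n = 0
        for (i in a)
            keys[++n] = i

        # Second pass: walk the snapshot, so adding elements to a
        # cannot change which indices are visited
        for (k = 1; k <= n; k++) {
            a[k] = k
            print keys[k]
        }
    }'
}

snapshot_loop
```

With this arrangement only the four original indices are printed, although their order still depends on the awk implementation.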