By default, the order in which a ‘for (indx in array)’ loop
scans an array is not defined; it is generally based upon
the internal implementation of arrays inside awk
.
Often, though, it is desirable to be able to loop over the elements
in a particular order that you, the programmer, choose. gawk
lets you do this.
Using Predefined Array Scanning Orders with gawk
describes how you can assign special,
predefined values to PROCINFO["sorted_in"]
in order to
control the order in which gawk
traverses an array
during a for
loop.
In addition, the value of PROCINFO["sorted_in"]
can be a
function name.85
This lets you traverse an array based on any custom criterion.
The array elements are ordered according to the return value of this
function. The comparison function should be defined with at least
four arguments:
function comp_func(i1, v1, i2, v2) { compare elements 1 and 2 in some fashion return < 0; 0; or > 0 }
Here, i1
and i2
are the indices, and v1
and v2
are the corresponding values of the two elements being compared.
Either v1
or v2
, or both, can be arrays if the array being
traversed contains subarrays as values.
(See Arrays of Arrays for more information about subarrays.)
The three possible return values are interpreted as follows:
comp_func(i1, v1, i2, v2) < 0
Index i1
comes before index i2
during loop traversal.
comp_func(i1, v1, i2, v2) == 0
Indices i1
and i2
come together, but the relative order with respect to each other is undefined.
comp_func(i1, v1, i2, v2) > 0
Index i1
comes after index i2
during loop traversal.
Our first comparison function can be used to scan an array in numerical order of the indices:
function cmp_num_idx(i1, v1, i2, v2) { # numerical index comparison, ascending order return (i1 - i2) }
Our second function traverses an array based on the string order of the element values rather than by indices:
function cmp_str_val(i1, v1, i2, v2) { # string value comparison, ascending order v1 = v1 "" v2 = v2 "" if (v1 < v2) return -1 return (v1 != v2) }
The third comparison function makes all numbers, and numeric strings without any leading or trailing spaces, come out first during loop traversal:
function cmp_num_str_val(i1, v1, i2, v2, n1, n2) { # numbers before string value comparison, ascending order n1 = v1 + 0 n2 = v2 + 0 if (n1 == v1) return (n2 == v2) ? (n1 - n2) : -1 else if (n2 == v2) return 1 return (v1 < v2) ? -1 : (v1 != v2) }
Here is a main program to demonstrate how gawk
behaves using each of the previous functions:
BEGIN { data["one"] = 10 data["two"] = 20 data[10] = "one" data[100] = 100 data[20] = "two" f[1] = "cmp_num_idx" f[2] = "cmp_str_val" f[3] = "cmp_num_str_val" for (i = 1; i <= 3; i++) { printf("Sort function: %s\n", f[i]) PROCINFO["sorted_in"] = f[i] for (j in data) printf("\tdata[%s] = %s\n", j, data[j]) print "" } }
Here are the results when the program is run:
$ gawk -f compdemo.awk -| Sort function: cmp_num_idx Sort by numeric index -| data[two] = 20 -| data[one] = 10 Both strings are numerically zero -| data[10] = one -| data[20] = two -| data[100] = 100 -| -| Sort function: cmp_str_val Sort by element values as strings -| data[one] = 10 -| data[100] = 100 String 100 is less than string 20 -| data[two] = 20 -| data[10] = one -| data[20] = two -| -| Sort function: cmp_num_str_val Sort all numeric values before all strings -| data[one] = 10 -| data[two] = 20 -| data[100] = 100 -| data[10] = one -| data[20] = two
Consider sorting the entries of a GNU/Linux system password file according to login name. The following program sorts records by a specific field position and can be used for this purpose:
# passwd-sort.awk --- simple program to sort by field position # field position is specified by the global variable POS function cmp_field(i1, v1, i2, v2) { # comparison by value, as string, and ascending order return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS]) } { for (i = 1; i <= NF; i++) a[NR][i] = $i }
END { PROCINFO["sorted_in"] = "cmp_field"
if (POS < 1 || POS > NF) POS = 1 for (i in a) { for (j = 1; j <= NF; j++) printf("%s%c", a[i][j], j < NF ? ":" : "") print "" } }
The first field in each entry of the password file is the user’s login name, and the fields are separated by colons. Each record defines a subarray, with each field as an element in the subarray. Running the program produces the following output:
$ gawk -v POS=1 -F: -f sort.awk /etc/passwd -| adm:x:3:4:adm:/var/adm:/sbin/nologin -| apache:x:48:48:Apache:/var/www:/sbin/nologin -| avahi:x:70:70:Avahi daemon:/:/sbin/nologin ...
The comparison should normally always return the same value when given a specific pair of array elements as its arguments. If inconsistent results are returned, then the order is undefined. This behavior can be exploited to introduce random order into otherwise seemingly ordered data:
function cmp_randomize(i1, v1, i2, v2) { # random order (caution: this may never terminate!) return (2 - 4 * rand()) }
As already mentioned, the order of the indices is arbitrary if two elements compare equal. This is usually not a problem, but letting the tied elements come out in arbitrary order can be an issue, especially when comparing item values. The partial ordering of the equal elements may change the next time the array is traversed, if other elements are added to or removed from the array. One way to resolve ties when comparing elements with otherwise equal values is to include the indices in the comparison rules. Note that doing this may make the loop traversal less efficient, so consider it only if necessary. The following comparison functions force a deterministic order, and are based on the fact that the (string) indices of two elements are never equal:
function cmp_numeric(i1, v1, i2, v2) { # numerical value (and index) comparison, descending order return (v1 != v2) ? (v2 - v1) : (i2 - i1) }
function cmp_string(i1, v1, i2, v2) { # string value (and index) comparison, descending order v1 = v1 i1 v2 = v2 i2 return (v1 > v2) ? -1 : (v1 != v2) }
A custom comparison function can often simplify ordered loop traversal, and the sky is really the limit when it comes to designing such a function.
When string comparisons are made during a sort, either for element
values where one or both aren’t numbers, or for element indices
handled as strings, the value of IGNORECASE
(see Predefined Variables) controls whether
the comparisons treat corresponding upper- and lowercase letters as
equivalent or distinct.
Another point to keep in mind is that in the case of subarrays,
the element values can themselves be arrays; a production comparison
function should use the isarray()
function
(see Getting Type Information)
to check for this, and choose a defined sorting order for subarrays.
All sorting based on PROCINFO["sorted_in"]
is disabled in POSIX mode,
because the PROCINFO
array is not special in that case.
As a side note, sorting the array indices before traversing
the array has been reported to add a 15% to 20% overhead to the
execution time of awk
programs. For this reason,
sorted array traversal is not the default.
This is why the predefined sorting orders start with an ‘@’ character, which cannot be part of an identifier.