gawk
¶In most awk
implementations, sorting an array requires writing
a sort()
function. This can be educational for exploring
different sorting algorithms, but usually that’s not the point of the program.
gawk
provides the built-in asort()
and asorti()
functions (see String-Manipulation Functions) for sorting arrays. For example:
populate the array data n = asort(data) for (i = 1; i <= n; i++) do something with data[i]
After the call to asort()
, the array data
is indexed from 1
to some number n, the total number of elements in data
.
(This count is asort()
’s return value.)
data[1]
<= data[2]
<= data[3]
, and so on.
The default comparison is based on the type of the elements
(see Variable Typing and Comparison Expressions).
All numeric values come before all string values,
which in turn come before all subarrays.
An important side effect of calling asort()
is that
the array’s original indices are irrevocably lost.
As this isn’t always desirable, asort()
accepts a
second argument:
populate the array source n = asort(source, dest) for (i = 1; i <= n; i++) do something with dest[i]
In this case, gawk
copies the source
array into the
dest
array and then sorts dest
, destroying its indices.
However, the source
array is not affected.
Often, what’s needed is to sort on the values of the indices
instead of the values of the elements. To do that, use the
asorti()
function. The interface and behavior are identical to
that of asort()
, except that the index values are used for sorting
and become the values of the result array:
{ source[$0] = some_func($0) } END { n = asorti(source, dest) for (i = 1; i <= n; i++) { Work with sorted indices directly: do something with dest[i] ... Access original array via sorted indices: do something with source[dest[i]] } }
So far, so good. Now it starts to get interesting. Both asort()
and asorti()
accept a third string argument to control comparison
of array elements. When we introduced asort()
and asorti()
in String-Manipulation Functions, we ignored this third argument; however,
now is the time to describe how this argument affects these two functions.
Basically, the third argument specifies how the array is to be sorted.
There are two possibilities. As with PROCINFO["sorted_in"]
,
this argument may be one of the predefined names that gawk
provides (see Using Predefined Array Scanning Orders with gawk
), or it may be the name of a
user-defined function (see Controlling Array Traversal).
In the latter case, the function can compare elements in any way it chooses, taking into account just the indices, just the values, or both. This is extremely powerful.
Once the array is sorted, asort()
takes the values in
their final order and uses them to fill in the result array, whereas
asorti()
takes the indices in their final order and uses
them to fill in the result array.
NOTE: Copying array indices and elements isn’t expensive in terms of memory. Internally,
gawk
maintains reference counts to data. For example, whenasort()
copies the first array to the second one, there is only one copy of the original array elements’ data, even though both arrays use the values.
You may use the same array for both the first and second arguments to
asort()
and asorti()
. Doing so only makes sense if you
are also supplying the third argument, since awk
doesn’t
provide a way to pass that third argument without also passing the first
and second ones.
Because IGNORECASE
affects string comparisons, the value
of IGNORECASE
also affects sorting for both asort()
and asorti()
.
Note also that the locale’s sorting order does not
come into play; comparisons are based on character values only.86
The following example demonstrates the use of a comparison function with
asort()
. The comparison function, case_fold_compare()
, maps
both values to lowercase in order to compare them ignoring case.
# case_fold_compare --- compare as strings, ignoring case function case_fold_compare(i1, v1, i2, v2, l, r) { l = tolower(v1)
r = tolower(v2) if (l < r) return -1 else if (l == r) return 0 else return 1 }
And here is the test program for it:
# Test program BEGIN { Letters = "abcdefghijklmnopqrstuvwxyz" \ "ABCDEFGHIJKLMNOPQRSTUVWXYZ" split(Letters, data, "") asort(data, result, "case_fold_compare") j = length(result) for (i = 1; i <= j; i++) { printf("%s", result[i]) if (i % (j/2) == 0) printf("\n") else printf(" ") } }
When run, we get the following:
$ gawk -f case_fold_compare.awk -| A a B b c C D d e E F f g G H h i I J j k K l L M m -| n N O o p P Q q r R S s t T u U V v w W X x y Y z Z
NOTE: “Under the hood,”
gawk
uses the C libraryqsort()
function to manage the sorting.qsort()
can call itself recursively. This means that when you write a comparison function, you should be careful to avoid the use of global variables and arrays; use only local variables and arrays that you declare as additional parameters to the comparison function. Otherwise, you are likely to cause unintentional memory corruption in your global arrays and possibly causegawk
itself to fail.
This
is true because locale-based comparison occurs only when in
POSIX-compatibility mode, and because asort()
and asorti()
are
gawk
extensions, they are not available in that case.