Ordinal Functions (The GNU Awk User’s Guide)

Next: Merging an Array into a String, Previous: The Cliff Random Number Generator, Up: General Programming [Contents][Index]

10.2.5 Translating Between Characters and Numbers ¶

One commercial implementation of awk supplies a built-in function, ord(), which takes a character and returns the numeric value for that character in the machine’s character set. If the string passed to ord() has more than one character, only the first one is used.

The inverse of this function is chr() (from the function of the same name in Pascal), which takes a number and returns the corresponding character. Both functions are written very nicely in awk; there is no real reason to build them into the awk interpreter:

# ord.awk --- do ord and chr

# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_

BEGIN    { _ord_init() }

function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {    # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {        # ebcdic(!)
        low = 0
        high = 255
    }

    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}

Some explanation of the numbers used by _ord_init() is worthwhile. The most prominent character set in use today is ASCII.⁷¹ Although an 8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only defines characters that use the values from 0 to 127.⁷² In the now distant past, at least one minicomputer manufacturer used ASCII, but with mark parity, meaning that the leftmost bit in the byte is always 1. This means that on those systems, characters have numeric values from 128 to 255. Finally, large mainframe systems use the EBCDIC character set, which uses all 256 values. There are other character sets in use on some older systems, but they are not really worth worrying about:

function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}

function chr(c)
{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
}

#### test code ####
# BEGIN {
#    for (;;) {
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    }
# }

An obvious improvement to these functions is to move the code for the _ord_init function into the body of the BEGIN rule. It was written this way initially for ease of development. There is a “test program” in a BEGIN rule, to test the function. It is commented out for production use.

Footnotes

(71)

This is changing; many systems use Unicode, a very large character set that includes ASCII as a subset. On systems with full Unicode support, a character can occupy up to 32 bits, making simple tests such as used here prohibitively expensive.

(72)

ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. If your system uses these extensions, you can simplify _ord_init() to loop from 0 to 255.