A series of awk statements attached to a rule. If the rule’s pattern matches an input record, awk executes the rule’s action. Actions are always enclosed in braces. (See Actions.)
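As a minimal sketch of a rule with an action (the helper name `sum_b_lines` and the sample data are hypothetical, not taken from the manual), the pattern `$1 == "b"` selects records and the braces hold the statements of the action:

```sh
# Hypothetical helper: the name and the sample input are illustrative only.
sum_b_lines() {
    # Pattern: $1 == "b". Action: everything between the braces.
    awk '$1 == "b" { total += $2; print $1, total }'
}

printf 'a 1\nb 2\nb 3\n' | sum_b_lines
```

The first record does not match the pattern, so the action runs only for the two `b` records.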
A programming language originally defined by the U.S. Department of Defense for embedded programming. It was designed to enforce good Software Engineering practices.
awk Assembler
Henry Spencer at the University of Toronto wrote a retargetable assembler completely as sed and awk scripts. It is thousands of lines long, including machine descriptions for several eight-bit microcomputers. It is a good example of a program that would have been better written in another language.
Amazingly Workable Formatter (awf)
Henry Spencer at the University of Toronto wrote a formatter that accepts a large subset of the ‘nroff -ms’ and ‘nroff -man’ formatting commands, using awk and sh.
The regexp metacharacters ‘^’ and ‘$’, which force the match to the beginning or end of the string, respectively.
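A small sketch of anchoring, assuming a hypothetical helper named `whole_line_matches` and made-up input lines: with both ‘^’ and ‘$’ present, the regexp must describe the entire record rather than some part of it.

```sh
# Hypothetical helper: print only records that start with "R" and end with "xp".
whole_line_matches() {
    awk '/^R.*xp$/'
}

printf 'Regexp\nmy Regexp\nRegexps\n' | whole_line_matches
```

Only `Regexp` is printed; without the anchors, all three lines would match.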
The American National Standards Institute. This organization produces many standards, among them the standards for the C and C++ programming languages. These standards often become international standards as well. See also “ISO.”
An argument can be two different things. It can be an option or a file name passed to a command while invoking it from the command line, or it can be something passed to a function inside a program, e.g., inside awk.
In the latter case, an argument can be passed to a function in two ways. Either it is given to the called function by value, i.e., a copy of the value of the variable is made available to the called function, but the original variable cannot be modified by the function itself; or it is given by reference, i.e., a pointer to the variable in question is passed to the function, which can then directly modify it. In awk, scalars are passed by value, and arrays are passed by reference.
See “Pass By Value/Reference.”
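The difference can be sketched as follows; the wrapper `show_argument_passing` and the awk function `modify` are hypothetical names chosen for illustration.

```sh
# Hypothetical demonstration of by-value (scalar) and by-reference (array) passing.
show_argument_passing() {
    awk '
    function modify(scalar, array) {
        scalar = "changed"          # changes only the local copy
        array["key"] = "changed"    # changes the array owned by the caller
    }
    BEGIN {
        s = "original"
        a["key"] = "original"
        modify(s, a)
        print s, a["key"]
    }'
}

show_argument_passing
```

This prints `original changed`: the scalar is untouched, while the array element was modified inside the function.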
A grouping of multiple values under the same name. Most languages just provide sequential arrays. awk provides associative arrays.
A statement in a program that a condition is true at this point in the program. Useful for reasoning about how a program is supposed to behave.
An awk expression that changes the value of some awk variable or data object. An object that you can assign to is called an lvalue. The assigned values are called rvalues. See Assignment Expressions.
Arrays in which the indices may be numbers or strings, not just sequential integers in a fixed range.
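For instance, an array can be indexed directly by the words found in the input. This is only a sketch; the helper name `count_words` and the sample words are assumptions.

```sh
# Hypothetical helper: count occurrences of each word, using the word itself as the index.
count_words() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { print count["foo"] + 0, count["bar"] + 0 }'
}

printf 'foo bar foo\nbar foo\n' | count_words
```

The output is `3 2`. No array size or index range is declared anywhere; elements come into existence when first referenced.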
awk Language
The language in which awk programs are written.
awk Program
An awk program consists of a series of patterns and actions, collectively known as rules. For each input record given to the program, the program’s rules are all processed in turn. awk programs may also contain function definitions.
awk Script
Another name for an awk program.
The GNU version of the standard shell (the Bourne-Again SHell). See also “Bourne Shell.”
Base-two notation, where the digits are 0–1. Since electronic circuitry works “naturally” in base 2 (just think of Off/On), everything inside a computer is calculated using base 2. Each digit represents the presence (or absence) of a power of 2 and is called a bit. So, for example, the base-two number 10101 is the same as decimal 21, ((1 x 16) + (1 x 4) + (1 x 1)).
Since base-two numbers quickly become very long to read and write, they are usually grouped by 3 (i.e., they are read as octal numbers), or by 4 (i.e., they are read as hexadecimal numbers). There is no direct way to insert base 2 numbers in a C program. If need arises, such numbers are usually inserted as octal or hexadecimal numbers. The number of base-two digits that fit into registers used for representing integer numbers in computers is a rough indication of the computing power of the computer itself. Most computers nowadays use 64 bits for representing integer numbers in their registers, but 32-bit, 16-bit and 8-bit registers have been widely used in the past. See Octal and Hexadecimal Numbers.
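The positional arithmetic described above can be sketched in portable awk. The helper name `bin2dec` is hypothetical, and the sketch assumes its argument contains only the digits 0 and 1.

```sh
# Hypothetical helper: convert a string of binary digits to decimal.
bin2dec() {
    awk -v bits="$1" 'BEGIN {
        n = 0
        for (i = 1; i <= length(bits); i++)
            n = n * 2 + substr(bits, i, 1)   # shift left by one bit, then add the next digit
        print n
    }'
}

bin2dec 10101
```

This prints `21`, matching the worked example (16 + 4 + 1).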
Short for “Binary Digit.” All values in computer memory ultimately reduce to binary digits: values that are either zero or one. Groups of bits may be interpreted differently: as integers, floating-point numbers, character data, addresses of other memory objects, or other data. awk lets you work with floating-point numbers and strings. gawk lets you manipulate bit values with the built-in functions described in Bit-Manipulation Functions.
Computers are often defined by how many bits they use to represent integer values. Most current systems are 64-bit systems; 32-bit systems were long typical, and 16-bit systems have essentially disappeared.
Named after the English mathematician Boole. See also “Logical Expression.”
The standard shell (/bin/sh) on Unix and Unix-like systems, originally written by Stephen R. Bourne at Bell Laboratories. Many shells (Bash, ksh, pdksh, zsh) are generally upwardly compatible with the Bourne shell.
The characters ‘{’ and ‘}’. Braces are used in awk for delimiting actions, compound statements, and function bodies.
Inside a regular expression, an expression included in square brackets, meant to designate a single character as belonging to a specified character class. A bracket expression can contain a list of one or more characters, like ‘[abc]’, a range of characters, like ‘[A-Z]’, or a name, delimited by ‘:’, that designates a known set of characters, like ‘[:digit:]’. The form of bracket expression enclosed between ‘:’ is independent of the underlying representation of the characters themselves, which could utilize the ASCII, EBCDIC, or Unicode codesets, depending on the architecture of the computer system, and on localization. See also “Regular Expression.”
The awk language provides built-in functions that perform various numerical, I/O-related, and string computations. Examples are sqrt() (for the square root of a number) and substr() (for a substring of a string). gawk provides functions for timestamp management, bit manipulation, array sorting, type checking, and runtime string translation. (See Built-in Functions.)
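A brief sketch of calling built-in functions, limited to functions available in POSIX awk; the wrapper name `builtin_demo` and the sample arguments are hypothetical.

```sh
# Hypothetical wrapper around a few standard built-in functions.
builtin_demo() {
    awk 'BEGIN {
        print sqrt(16)                   # numeric function
        print substr("glossary", 1, 5)   # string function: 5 characters from position 1
        print length("awk")              # string function: number of characters
    }'
}

builtin_demo
```

The three lines printed are `4`, `gloss`, and `3`.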
ARGC, ARGV, CONVFMT, ENVIRON, FILENAME, FNR, FS, NF, NR, OFMT, OFS, ORS, RLENGTH, RSTART, RS, and SUBSEP are the variables that have special meaning to awk. In addition, ARGIND, BINMODE, ERRNO, FIELDWIDTHS, FPAT, IGNORECASE, LINT, PROCINFO, RT, and TEXTDOMAIN are the variables that have special meaning to gawk. Changing some of them affects awk’s running environment. (See Predefined Variables.)
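As an illustration of how some of these variables interact, the sketch below sets FS and OFS and reads NR and NF. The helper name `summarize_fields` and the input are hypothetical.

```sh
# Hypothetical helper: split input on ":" and join output with "-".
summarize_fields() {
    awk 'BEGIN { FS = ":"; OFS = "-" }
         { print NR, NF, $1 }'   # record number, field count, first field
}

printf 'a:b:c\nd:e\n' | summarize_fields
```

This prints `1-3-a` and `2-2-d`: changing FS alters how records are split, and changing OFS alters how `print` joins its arguments.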
The system programming language that most GNU software is written in. The awk programming language has C-like syntax, and this Web page points out similarities between awk and C when appropriate. In general, gawk attempts to be as similar to the 1990 version of ISO C as makes sense.
The C Shell (csh or its improved version, tcsh) is a Unix shell that was created by Bill Joy in the late 1970s. The C shell was differentiated from other shells by its interactive features and overall style, which looks more like C. The C Shell is not backward compatible with the Bourne Shell, so special attention is required when converting scripts written for other Unix shells to the C shell, especially with regard to the management of shell variables. See also “Bourne Shell.”
A popular object-oriented programming language derived from C.
See “Bracket Expression.”
See “Bracket Expression.”
The set of numeric codes used by a computer system to represent the characters (letters, numbers, punctuation, etc.) of a particular country or place. The most common character set in use today is ASCII (American Standard Code for Information Interchange). Many European countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1). The Unicode character set is increasingly popular and standard, and is particularly widely used on GNU/Linux systems.
A preprocessor for pic that reads descriptions of molecules and produces pic input for drawing them. It was written in awk by Brian Kernighan and Jon Bentley, and is available from http://netlib.org/typesetting/chem.
A relation that is either true or false, such as ‘a < b’. Comparison expressions are used in if, while, do, and for statements, and in patterns to select which input records to process. (See Variable Typing and Comparison Expressions.)
A program that translates human-readable source code into machine-executable object code. The object code is then executed directly by the computer. See also “Interpreter.”
The negation of a bracket expression. All that is not described by a given bracket expression. The symbol ‘^’ precedes the negated bracket expression. E.g.: ‘[^[:digit:]]’ designates whatever character is not a digit. ‘[^bad]’ designates whatever character is not one of the letters ‘b’, ‘a’, or ‘d’. See “Bracket Expression.”
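A short sketch using the ‘[^bad]’ example from the entry above; the helper name `keep_bad_letters` and the sample word are hypothetical.

```sh
# Hypothetical helper: delete every character that is NOT "b", "a", or "d".
keep_bad_letters() {
    awk '{ gsub(/[^bad]/, ""); print }'
}

echo 'abracadabra' | keep_bad_letters
```

This prints `abaadaba`: the letters `r` and `c` match the complemented bracket expression and are removed.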
A series of awk statements, enclosed in curly braces. Compound statements may be nested. (See Control Statements in Actions.)
See “Dynamic Regular Expressions.”
Concatenating two strings means sticking them together, one after another, producing a new string. For example, the string ‘foo’ concatenated with the string ‘bar’ gives the string ‘foobar’. (See String Concatenation.)
An expression using the ‘?:’ ternary operator, such as ‘expr1 ? expr2 : expr3’. The expression expr1 is evaluated; if the result is true, the value of the whole expression is the value of expr2; otherwise the value is expr3. In either case, only one of expr2 and expr3 is evaluated. (See Conditional Expressions.)
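A minimal sketch of the ternary operator, assuming a hypothetical helper `parity` that receives an integer as its first argument:

```sh
# Hypothetical helper: classify an integer as even or odd with ?:.
parity() {
    awk -v n="$1" 'BEGIN { print (n % 2 == 0 ? "even" : "odd") }'
}

parity 3
parity 4
```

The two calls print `odd` and `even`; in each case only the selected branch is evaluated.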
A control statement is an instruction to perform a given operation or a set of operations inside an awk program, if a given condition is true. Control statements are: if, for, while, and do (see Control Statements in Actions).
A peculiar goodie, token, saying or remembrance produced by or presented to a program. (With thanks to Professor Doug McIlroy.)
A subordinate program with which two-way communication is possible.
See “Braces.”
An area in the language where specifications often were (or still are) not clear, leading to unexpected or undesirable behavior. Such areas are marked in this Web page with “(d.c.)” in the text and are indexed under the heading “dark corner.”
A description of awk programs, where you specify the data you are interested in processing, and what to do when that data is seen.
These are numbers and strings of characters. Numbers are converted into strings and vice versa, as needed. (See Conversion of Strings and Numbers.)
The situation in which two communicating processes are each waiting for the other to perform an action.
A program used to help developers remove “bugs” from (de-bug) their programs.
An internal representation of numbers that can have fractional parts. Double precision numbers keep track of more digits than do single precision numbers, but operations on them are sometimes more expensive. This is the way awk stores numeric values. It is the C type double.
A dynamic regular expression is a regular expression written as an ordinary expression. It could be a string constant, such as "foo", but it may also be an expression whose value can vary. (See Using Dynamic Regexps.)
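One way to see the idea is to pass the regexp in as a variable, so that it is not fixed in the program text. The helper name `match_dynamic` and the sample input are assumptions for this sketch.

```sh
# Hypothetical helper: the regexp is the value of the variable "re", supplied at run time.
match_dynamic() {
    awk -v re="$1" '$0 ~ re { print }'
}

printf 'foo\nbar\nfood\n' | match_dynamic '^foo'
```

This prints `foo` and `food`. Calling the same helper with a different argument matches different records without changing the awk program.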
See “Null String.”
A collection of strings, of the form ‘name=val’, that each program has available to it. Users generally place values into the environment in order to provide information to various programs. Typical examples are the environment variables HOME and PATH.
The date used as the “beginning of time” for timestamps. Time values in most systems are represented as seconds since the epoch, with library functions available for converting these values into standard date and time formats.
The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC. See also “GMT” and “UTC.”
A special sequence of characters used for describing nonprinting characters, such as ‘\n’ for newline or ‘\033’ for the ASCII ESC (Escape) character. (See Escape Sequences.)
An additional feature or change to a programming language or utility not defined by that language’s or utility’s standard. gawk has (too) many extensions over POSIX awk.
See “Free Documentation License.”
When awk reads an input record, it splits the record into pieces separated by whitespace (or by a separator regexp that you can change by setting the predefined variable FS). Such pieces are called fields. If the pieces are of fixed length, you can use the built-in variable FIELDWIDTHS to describe their lengths. If you wish to specify the contents of fields instead of the field separator, you can use the predefined variable FPAT to do so. (See Specifying How Fields Are Separated, Reading Fixed-Width Data, and Defining Fields by Content.)
A variable whose truth value indicates the existence or nonexistence of some condition.
Often referred to in mathematical terms as a “rational” or real number, this is just a number that can have a fractional part. See also “Double Precision” and “Single Precision.”
Format strings control the appearance of output in the strftime() and sprintf() functions, and in the printf statement as well. Also, data conversions from numbers to strings are controlled by the format strings contained in the predefined variables CONVFMT and OFMT. (See Format-Control Letters.)
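The following sketch shows a printf format string together with the effect of CONVFMT and OFMT. The wrapper name `format_demo` and the specific format values are illustrative assumptions.

```sh
# Hypothetical wrapper showing printf, CONVFMT, and OFMT.
format_demo() {
    awk 'BEGIN {
        printf "%.2f|%5d|%-5s|\n", 3.14159, 42, "ab"
        x = 3.14159
        CONVFMT = "%.2g"; s = x ""   # number-to-string conversion uses CONVFMT
        OFMT = "%.3f"                # print uses OFMT for non-integer numbers
        print s
        print x
    }'
}

format_demo
```

The same value is rendered as `3.1` when converted to a string by concatenation and as `3.142` when printed as a number.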
Shorthand for FORmula TRANslator, one of the first programming languages available for scientific calculations. It was created by John Backus, and has been available since 1957. It is still in use today.
This document describes the terms under which this Web page is published and may be copied. (See GNU Free Documentation License.)
A nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.
See “Free Software Foundation.”
A part of an awk program that can be invoked from every point of the program, to perform a task. awk has several built-in functions. Users can define their own functions in every part of the program. Functions can be recursive, i.e., they may invoke themselves. See Functions. In gawk, it is also possible to have functions shared among different programs, and included where required using the @include directive (see Including Other Files into Your Program). In gawk, the name of the function that should be invoked can be generated at run time, i.e., dynamically. The gawk extension API provides constructor functions (see Constructor Functions).
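A user-defined recursive function can be sketched as follows; the shell wrapper `factorial` and the awk function `fact` are hypothetical names, and the sketch assumes a small non-negative integer argument.

```sh
# Hypothetical helper: compute n! with a recursive user-defined awk function.
factorial() {
    awk -v n="$1" '
    function fact(k) {
        return (k <= 1) ? 1 : k * fact(k - 1)   # the function invokes itself
    }
    BEGIN { print fact(n) }'
}

factorial 5
```

This prints `120`.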
gawk
The GNU implementation of awk.
This document describes the terms under which gawk and its source code may be distributed. (See GNU General Public License.)
“Greenwich Mean Time.” This is the old term for UTC. It is the time of day used internally for Unix and POSIX systems. See also “Epoch” and “UTC.”
“GNU’s not Unix”. An on-going project of the Free Software Foundation to create a complete, freely distributable, POSIX-compliant computing environment.
A variant of the GNU system using the Linux kernel, instead of the Free Software Foundation’s Hurd kernel. The Linux kernel is a stable, efficient, full-featured clone of Unix that has been ported to a variety of architectures. It is most popular on PC-class systems, but runs well on a variety of other systems too. The Linux kernel source code is available under the terms of the GNU General Public License, which is perhaps its most important aspect.
See “General Public License.”
Base 16 notation, where the digits are 0–9 and A–F, with ‘A’ representing 10, ‘B’ representing 11, and so on, up to ‘F’ for 15. Hexadecimal numbers are written in C using a leading ‘0x’, to indicate their base. Thus, 0x12 is 18 ((1 x 16) + 2). See Octal and Hexadecimal Numbers.
Abbreviation for “Input/Output,” the act of moving data into and/or out of a running program.
A single chunk of data that is read in by awk. Usually, an awk input record consists of one line of text. (See How Input Is Split into Records.)
A whole number, i.e., a number that does not have a fractional part.
The process of writing or modifying a program so that it can use multiple languages without requiring further source code changes.
A program that reads human-readable source code directly, and uses the instructions in it to process data and produce results. awk is typically (but not always) implemented as an interpreter. See also “Compiler.”
A component of a regular expression that lets you specify repeated matches of some part of the regexp. Interval expressions were not originally available in awk programs.
The International Organization for Standardization. This organization produces international standards for many things, including programming languages, such as C and C++. In the computer arena, important standards like those for C, C++, and POSIX become both American national and ISO international standards simultaneously. This Web page refers to Standard C as “ISO C” throughout. See the ISO website for more information about the name of the organization and its language-independent three-letter acronym.
A modern programming language originally developed by Sun Microsystems (now Oracle) supporting Object-Oriented programming. Although usually implemented by compiling to the instructions for a standard virtual machine (the JVM), the language can be compiled to native code.
In the awk language, a keyword is a word that has special meaning. Keywords are reserved and may not be used as variable names. gawk’s keywords are: BEGIN, BEGINFILE, END, ENDFILE, break, case, continue, default, delete, do…while, else, exit, for…in, for, function, func, if, next, nextfile, switch, and while.
The Korn Shell (ksh) is a Unix shell which was developed by David Korn at Bell Laboratories in the early 1980s. The Korn Shell is backward-compatible with the Bourne shell and includes many features of the C shell. See also “Bourne Shell.”
This document describes the terms under which binary library archives or shared objects, and their source code may be distributed.
See “Lesser General Public License.”
See “GNU/Linux.”
The process of providing the data necessary for an internationalized program to work in a particular language.
An expression using the operators for logic, AND, OR, and NOT, written ‘&&’, ‘||’, and ‘!’ in awk. Often called Boolean expressions, after the mathematician who pioneered this kind of mathematical logic.
An expression that can appear on the left side of an assignment operator. In most languages, lvalues can be variables or array elements. In awk, a field designator can also be used as an lvalue.
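Assigning to a field designator can be sketched like this; the helper name `replace_second_field` and the input are hypothetical.

```sh
# Hypothetical helper: use the field designator $2 as an lvalue.
replace_second_field() {
    awk '{ $2 = "X"; print }'
}

echo 'a b c' | replace_second_field
```

This prints `a X c`: the assignment to `$2` changes the field, and the record is rebuilt from its fields before being printed.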
The act of testing a string against a regular expression. If the regexp describes the contents of the string, it is said to match it.
Characters used within a regexp that do not stand for themselves. Instead, they denote regular expression operations, such as repetition, grouping, or alternation.
Nesting is where information is organized in layers, or where objects contain other similar objects. In gawk, the @include directive can be nested. The “natural” nesting of arithmetic and logical operations can be changed using parentheses (see Operator Precedence (How Operators Nest)).
An operation that does nothing.
A string with no characters in it. It is represented explicitly in awk programs by placing two double quote characters next to each other (""). It can appear in input data by having two successive occurrences of the field separator appear next to each other.
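Both forms appear in the sketch below: the input contains two adjacent separators, and the program compares the resulting field against `""`. The helper name `describe_second_field` is hypothetical.

```sh
# Hypothetical helper: report the field count and whether field 2 is the null string.
describe_second_field() {
    awk -F, '{ print NF, ($2 == "" ? "empty" : "non-empty") }'
}

echo 'a,,c' | describe_second_field
```

This prints `3 empty`: the record still has three fields, and the middle one is the null string.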
A numeric-valued data object. Modern awk implementations use double precision floating-point to represent numbers. Ancient awk implementations used single precision floating-point.
Base-eight notation, where the digits are 0–7. Octal numbers are written in C using a leading ‘0’, to indicate their base. Thus, 013 is 11 ((1 x 8) + 3). See Octal and Hexadecimal Numbers.
A single chunk of data that is written out by awk. Usually, an awk output record consists of one or more lines of text. See How Input Is Split into Records.
Patterns tell awk which input records are interesting to which rules.
A pattern is an arbitrary conditional expression against which input is tested. If the condition is satisfied, the pattern is said to match the input record. A typical pattern might compare the input record against a regular expression. (See Pattern Elements.)
An acronym describing what is possibly the most frequent source of computer usage problems. (Problem Exists Between Keyboard And Chair.)
See “Extensions.”
The name for a series of standards that specify a Portable Operating System interface. The “IX” denotes the Unix heritage of these standards. The main standard of interest for awk users is IEEE Standard for Information Technology, Standard 1003.1™-2017 (Revision of IEEE Std 1003.1-2008). The 2018 POSIX standard can be found online at https://pubs.opengroup.org/onlinepubs/9699919799/.
The order in which operations are performed when operators are used without explicit parentheses.
Variables and/or functions that are meant for use exclusively by library functions and not for the main awk program. Special care must be taken when naming such variables and functions. (See Naming Library Function Global Variables.)
A sequence of consecutive lines from the input file(s). A pattern can specify ranges of input lines for awk to process or it can specify single lines. (See Pattern Elements.)
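A range pattern is written as two patterns separated by a comma. In this sketch, the helper name `print_range` and the marker words `start` and `end` are assumptions.

```sh
# Hypothetical helper: print every record from one matching /start/ through one matching /end/.
print_range() {
    awk '/start/, /end/'
}

printf 'one\nstart\ntwo\nend\nthree\n' | print_range
```

This prints `start`, `two`, and `end`; the records before and after the range are skipped.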
See “Input record” and “Output record.”
When a function calls itself, either directly or indirectly. If this is clear, stop, and proceed to the next entry. Otherwise, refer to the entry for “recursion.”
Redirection means performing input from something other than the standard input stream, or performing output to something other than the standard output stream.
You can redirect input to the getline function using the ‘<’, ‘|’, and ‘|&’ operators. You can redirect the output of the print and printf statements to a file or a system command, using the ‘>’, ‘>>’, ‘|’, and ‘|&’ operators. (See Explicit Input with getline, and Redirecting Output of print and printf.)
An internal mechanism in gawk to minimize the amount of memory needed to store the value of string variables. If the value assumed by a variable is used in more than one place, only one copy of the value itself is kept, and the associated reference count is increased when the same value is used by an additional variable, and decreased when the related variable is no longer in use. When the reference count goes to zero, the memory space used to store the value of the variable is freed.
See “Regular Expression.”
A regular expression (“regexp” for short) is a pattern that denotes a set of strings, possibly an infinite set. For example, the regular expression ‘R.*xp’ matches any string starting with the letter ‘R’ and ending with the letters ‘xp’. In awk, regular expressions are used in patterns and in conditional expressions. Regular expressions may contain escape sequences. (See Regular Expressions.)
A regular expression constant is a regular expression written within slashes, such as /foo/. This regular expression is chosen when you write the awk program and cannot be changed during its execution. (See How to Use Regular Expressions.)
See “Metacharacters.”
Rounding the result of an arithmetic operation can be tricky. More than one way of rounding exists, and in gawk it is possible to choose which method should be used in a program. See Setting the Rounding Mode.
A segment of an awk program that specifies how to process single input records. A rule consists of a pattern and an action. awk reads an input record; then, for each rule, if the input record satisfies the rule’s pattern, awk executes the rule’s action. Otherwise, the rule does nothing for that input record.
A value that can appear on the right side of an assignment operator. In awk, essentially every expression has a value. These values are rvalues.
A single value, be it a number or a string. Regular variables are scalars; arrays and functions are not.
In gawk, a list of directories to search for awk program source files. In the shell, a list of directories to search for executable programs.
sed
See “Stream Editor.”
The initial value, or starting point, for a sequence of random numbers.
The command interpreter for Unix and POSIX-compliant systems. The shell works both interactively, and as a programming language for batch files, or shell scripts.
The nature of the awk logical operators ‘&&’ and ‘||’. If the value of the entire expression is determinable from evaluating just the lefthand side of these operators, the righthand side is not evaluated. (See Boolean Expressions.)
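The effect can be observed with a function that records each call. This is only a sketch; the wrapper `short_circuit_demo`, the awk function `bump`, and the counter `calls` are hypothetical names.

```sh
# Hypothetical demonstration: count how often the right-hand side is actually evaluated.
short_circuit_demo() {
    awk '
    function bump() { calls++; return 1 }   # side effect: increments a global counter
    BEGIN {
        r1 = (0 && bump())   # left side is false: bump() is not called
        r2 = (1 || bump())   # left side is true: bump() is not called
        r3 = (1 && bump())   # left side is true: bump() must be called
        print r1, r2, r3, calls
    }'
}

short_circuit_demo
```

This prints `0 1 1 1`: of the three expressions, only the last one evaluates its right-hand side.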
A side effect occurs when an expression has an effect aside from merely producing a value. Assignment expressions, increment and decrement expressions, and function calls have side effects. (See Assignment Expressions.)
An internal representation of numbers that can have fractional parts. Single precision numbers keep track of fewer digits than do double precision numbers, but operations on them are sometimes less expensive in terms of CPU time. This is the type used by some ancient versions of awk to store numeric values. It is the C type float.
The character generated by hitting the space bar on the keyboard.
A file name interpreted internally by gawk, instead of being handed directly to the underlying operating system; for example, /dev/stderr. (See Special File names in gawk.)
An expression inside an awk program in the action part of a pattern–action rule, or inside an awk function. A statement can be a variable assignment, an array operation, a loop, etc.
A program that reads records from an input stream and processes them one or more at a time. This is in contrast with batch programs, which may expect to read their input files in entirety before starting to do anything, as well as with interactive programs which require input from the user.
A datum consisting of a sequence of characters, such as ‘I am a string’. Constant strings are written with double quotes in the awk language and may contain escape sequences. (See Escape Sequences.)
The character generated by hitting the TAB key on the keyboard. It usually expands to up to eight spaces upon output.
A unique name that identifies an application. Used for grouping messages that are translated at runtime into the local language.
A value in the “seconds since the epoch” format used by Unix and POSIX systems. Used for the gawk functions mktime(), strftime(), and systime(). See also “Epoch,” “GMT,” and “UTC.”
A computer operating system originally developed in the early 1970s at AT&T Bell Laboratories. It initially became popular in universities around the world and later moved into commercial environments as a software development system and network server system. There are many commercial versions of Unix, as well as several work-alike systems whose source code is freely available (such as GNU/Linux, NetBSD, FreeBSD, and OpenBSD).
The accepted abbreviation for “Coordinated Universal Time.” This is standard time in Greenwich, England, which is used as a reference time for day and date calculations. See also “Epoch” and “GMT.”
A name for a value. In awk, variables may be either scalars or arrays.
A sequence of space, TAB, or newline characters occurring inside an input record or a string.