It is often useful to be able to send data to a separate program for processing and then read the result. This can always be done with temporary files:
    # Write the data for processing
    tempfile = ("mydata." PROCINFO["pid"])
    while (not done with data)
        print data | ("subprogram > " tempfile)
    close("subprogram > " tempfile)

    # Read the results, remove tempfile when done
    while ((getline newdata < tempfile) > 0)
        process newdata appropriately
    close(tempfile)
    system("rm " tempfile)
This works, but not elegantly. Among other things, it requires that the program be run in a directory that cannot be shared among users; for example, /tmp will not do, as another user might happen to be using a temporary file with the same name.[87]
However, with gawk, it is possible to open a two-way pipe to another process. The second process is termed a coprocess, as it runs in parallel with gawk.
The two-way connection is created using the ‘|&’ operator (borrowed from the Korn shell, ksh):[88]
    do {
        print data |& "subprogram"
        "subprogram" |& getline results
    } while (data left to process)
    close("subprogram")
The first time an I/O operation is executed using the ‘|&’ operator, gawk creates a two-way pipeline to a child process that runs the other program. Output created with print or printf is written to the program’s standard input, and output from the program’s standard output can be read by the gawk program using getline.
As is the case with processes started by ‘|’, the subprogram can be any program, or pipeline of programs, that can be started by the shell.
There are some cautionary items to be aware of:
- As gawk currently stands, the coprocess’s standard error goes to the same place that the parent gawk’s standard error goes. It is not possible to read the child’s standard error separately.
- gawk automatically flushes all output down the pipe to the coprocess. However, if the coprocess does not flush its output, gawk may hang when doing a getline in order to read the coprocess’s results. This could lead to a situation known as deadlock, where each process is waiting for the other one to do something.
It is possible to close just one end of the two-way pipe to a coprocess, by supplying a second argument to the close() function of either "to" or "from" (see Closing Input and Output Redirections). These strings tell gawk to close the end of the pipe that sends data to the coprocess or the end that reads from it, respectively.
This is particularly necessary in order to use the system sort utility as part of a coprocess; sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe. When you have finished writing data to the sort utility, you can close the "to" end of the pipe, and then start reading sorted data via getline.
For example:
    BEGIN {
        command = "LC_ALL=C sort"
        n = split("abcdefghijklmnopqrstuvwxyz", a, "")

        for (i = n; i > 0; i--)
            print a[i] |& command
        close(command, "to")

        while ((command |& getline line) > 0)
            print "got", line
        close(command)
    }
This program writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to sort. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits.
As a side note, the assignment ‘LC_ALL=C’ in the sort command ensures traditional Unix (ASCII) sorting from sort. This is not strictly necessary here, but it’s good to know how to do this.
Be careful when closing the "from" end of a two-way pipe; in this case gawk waits for the child process to exit, which may cause your program to hang. (Thus, this particular feature is of much less use in practice than being able to close the "to" end.)
CAUTION: Normally, it is a fatal error to write to the "to" end of a two-way pipe that has been closed, and it is also a fatal error to read from the "from" end of a two-way pipe that has been closed. You may set PROCINFO["command", "NONFATAL"] to make such operations become nonfatal. If you do so, you then need to check ERRNO after each print, printf, or getline. See Enabling Nonfatal Output, for more information.
You may also use pseudo-ttys (ptys) for two-way communication instead of pipes, if your system supports them. This is done on a per-command basis, by setting a special element in the PROCINFO array (see Built-in Variables That Convey Information), like so:
    command = "sort -nr"           # command, save in convenience variable
    PROCINFO[command, "pty"] = 1   # update PROCINFO
    print ... |& command           # start two-way pipe
    ...
If your system does not have ptys, or if all the system’s ptys are in use, gawk automatically falls back to using regular pipes.
Using ptys usually avoids the buffer deadlock issues described earlier, at some loss in performance. This is because the tty driver buffers and sends data line-by-line. On systems with the stdbuf utility (part of the GNU Coreutils package), you can use that program instead of ptys.
Note also that ptys are not fully transparent. Certain binary control codes, such as Ctrl-d for end-of-file, are interpreted by the tty driver and not passed through.
CAUTION: Finally, coprocesses open up the possibility of deadlock between gawk and the program running in the coprocess. This can occur if you send “too much” data to the coprocess before reading any back; each process is blocked writing data with no one available to read what they’ve already written. There is no workaround for deadlock; careful programming and knowledge of the behavior of the coprocess are required.
The following example, due to Andrew Schorr, demonstrates how using ptys can help deal with buffering deadlocks.
Suppose gawk were unable to add numbers. You could use a coprocess to do it. Here’s an exceedingly simple program written for that purpose:
    $ cat add.c
    #include <stdio.h>

    int
    main(void)
    {
        int x, y;
        while (scanf("%d %d", & x, & y) == 2)
            printf("%d\n", x + y);
        return 0;
    }
    $ cc -O add.c -o add      Compile the program
You could then write an exceedingly simple gawk program to add numbers by passing them to the coprocess:
    $ echo 1 2 |
    > gawk -v cmd=./add '{ print |& cmd; cmd |& getline x; print x }'
And it would deadlock, because add.c fails to call ‘setlinebuf(stdout)’. The add program freezes.
Now try instead:
    $ echo 1 2 |
    > gawk -v cmd=./add 'BEGIN { PROCINFO[cmd, "pty"] = 1 }
    >                  { print |& cmd; cmd |& getline x; print x }'
    -| 3
By using a pty, gawk fools the standard I/O library into thinking it has an interactive session, so it defaults to line buffering. And now, magically, it works!
[87] Michael Brennan suggests the use of rand() to generate unique file names. This is a valid point; nevertheless, temporary files remain more difficult to use than two-way pipes.
[88] This is very different from the same operator in the C shell and in Bash.