Starting with version 5.2, gawk supports persistent memory. This
experimental feature stores the values of all of gawk’s variables,
arrays and user-defined functions in a persistent heap, which resides
in a file in the filesystem. When persistent memory is not in use (the
normal case), gawk’s data resides in ephemeral system memory.
Persistent memory is enabled on certain 64-bit systems supporting the mmap()
and munmap()
system calls. gawk
must be compiled as a
non-PIE (Position Independent Executable) binary, since the persistent
store ends up holding pointers to functions held within the gawk
executable. This also means that to use the persistent memory, you must
use the same gawk
executable from run to run.
You can see if your version of gawk
supports persistent
memory like so:
$ gawk --version
-| GNU Awk 5.2.2, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.1.0, GNU MP 6.2.1)
-| Copyright (C) 1989, 1991-2023 Free Software Foundation.
...
If you see ‘PMA’ followed by a version indicator, then persistent memory is supported.
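In a script, the same check can be automated. The following is a minimal sketch, assuming the ‘PMA’ field appears on the first line of the --version output in the comma-separated form shown above; the has_pma function name is hypothetical.

```shell
# has_pma VERSION_LINE
# Succeed if the version line carries a "PMA" field, as in
# "GNU Awk 5.2.2, API 3.2, PMA Avon 8-g1, ...".
has_pma() {
    case "$1" in
        *", PMA "*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Make sure gawk is on PATH before asking for its version.
if command -v gawk >/dev/null 2>&1 &&
   has_pma "$(gawk --version | head -n 1)"; then
    echo "persistent memory: supported"
else
    echo "persistent memory: not available"
fi
```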
As of this writing, persistent memory has only been tested on GNU/Linux,
Cygwin, Solaris 2.11, Intel architecture macOS systems,
FreeBSD 13.1 and OpenBSD 7.1.
On all others, persistent memory is disabled by default. You can force
it to be enabled by exporting the shell variable
REALLY_USE_PERSIST_MALLOC
with a nonempty value before
running configure
(see Compiling gawk
for Unix-Like Systems).
If you do so and all the tests pass, please let the maintainer know.
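For instance, a build on an untested platform might proceed roughly as follows. This is a sketch only: the value 1 is an arbitrary nonempty string, the commands assume you are in the top-level directory of an unpacked gawk source tree, and make check is assumed to be the target that runs the test suite.

```shell
# Any nonempty value works; 1 is an arbitrary choice.
export REALLY_USE_PERSIST_MALLOC=1

# Only attempt the build when run from a source tree that has configure.
if [ -x ./configure ]; then
    ./configure && make && make check
else
    echo "run this from the top of the gawk source tree" >&2
fi
```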
To use persistent memory, follow these steps:
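Create a new, empty sparse file of the desired size, for example with
the truncate utility:
$ truncate -s 4G data.pma
Optionally, restrict the file’s permissions so that only its owner can
read and write it:
$ chmod 0600 data.pma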
Provide the path to the data file as the value of the GAWK_PERSIST_FILE
environment variable. This is best done by placing the value in the
environment just for the run of gawk, like so:
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
-| 1
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
-| 2
$ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
-| 3
As shown, in subsequent runs using the same data file, the values of
gawk’s variables are preserved. However, gawk’s special variables,
such as NR, are reset upon each run. Only the variables defined by the
program are preserved across runs.
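The whole procedure can be collected into one script. This is a minimal sketch, assuming GNU truncate is available; the counter.pma file name and the skip logic are illustrative choices, not part of gawk. The gawk runs are skipped when gawk lacks PMA support or when the script runs as root, since persistent memory is not allowed for the root user.

```shell
# Hypothetical backing file for this demonstration.
pma_file="${TMPDIR:-/tmp}/counter.$$.pma"

# Create a sparse 4G file that only its owner can read and write.
truncate -s 4G "$pma_file"
chmod 0600 "$pma_file"

# Run the same program three times against the same backing file.
if gawk --version 2>/dev/null | head -n 1 | grep -q 'PMA' &&
   [ "$(id -u)" -ne 0 ]; then
    for run in 1 2 3; do
        GAWK_PERSIST_FILE="$pma_file" gawk 'BEGIN { print ++i }'
    done
else
    echo "skipped: gawk lacks PMA support, or running as root" >&2
fi
```

With PMA support, the loop should print an increasing counter, one value per run. Remove the backing file when it is no longer needed.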
Interestingly, the program that you execute need not be the same from
run to run; the persistent store only maintains the values of variables,
arrays, and user-defined functions, not the totality of gawk’s
internal state. This lets you share data between unrelated programs,
eliminating the need for scripts to communicate via text files.
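As an illustration of sharing data between unrelated programs, the sketch below has one program store an array element and a second, different program read it back. The file name, the array name and the guard that skips the runs are hypothetical; GNU truncate is assumed.

```shell
# Hypothetical backing file shared by both programs.
pma_file="${TMPDIR:-/tmp}/shared.$$.pma"
truncate -s 4G "$pma_file"
chmod 0600 "$pma_file"

ran=no
if gawk --version 2>/dev/null | head -n 1 | grep -q 'PMA' &&
   [ "$(id -u)" -ne 0 ]; then
    # Program 1: a producer that only stores data.
    GAWK_PERSIST_FILE="$pma_file" gawk 'BEGIN { color["apple"] = "red" }'
    # Program 2: a different program that only reads the data back.
    result=$(GAWK_PERSIST_FILE="$pma_file" gawk 'BEGIN { print color["apple"] }')
    ran=yes
    echo "apple is $result"
fi
rm -f "$pma_file"
```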
Terence Kelly, the author of the persistent memory allocator
gawk
uses, provides the following advice about the backing file:
Regarding backing file size, I recommend making it far larger than all
of the data that will ever reside in it, assuming that the file system
supports sparse files. The “pay only for what you use” aspect of sparse
files ensures that the actual storage resource footprint of the backing
file will meet the application’s needs but will be as small as
possible. If the file system does not support sparse files, there’s a
dilemma: Making the backing file too large is wasteful, but making it
too small risks memory exhaustion, i.e., pma_malloc() returns NULL.
But persistent gawk should still work even without sparse files.
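To see the “pay only for what you use” behavior in practice, compare a sparse file’s apparent size with the storage actually allocated to it. The sketch below assumes GNU truncate, GNU du (for the --apparent-size option) and a file system that supports sparse files; the temporary file exists only for the demonstration.

```shell
demo_file=$(mktemp)          # throwaway file for the demonstration
truncate -s 4G "$demo_file"  # sets the size without writing any data

# Both figures are in kibibytes.
apparent_kb=$(du -k --apparent-size "$demo_file" | cut -f1)
allocated_kb=$(du -k "$demo_file" | cut -f1)

echo "apparent size:  ${apparent_kb} KiB"
echo "allocated size: ${allocated_kb} KiB"
rm -f "$demo_file"
```

On a file system with sparse-file support, the allocated size stays far below the apparent size until data is written.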
You can disable the use of the persistent memory allocator in
gawk
with the --disable-pma option to the configure
command at the time that you build gawk
(see Compiling and Installing gawk
on Unix-Like Systems).
You can set the PMA_VERBOSITY
environment variable to a
value between zero and three to control how much debugging
and error information the persistent memory allocator will print.
gawk
sets the default to one. See the support/pma.c
source code to understand what the different verbosity levels are.
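For example, the variable can be exported so that it applies to later gawk runs in the same shell. The sketch below checks a level against the documented range before using it; the valid_pma_verbosity helper is hypothetical, and how the allocator treats out-of-range values is not covered here.

```shell
# valid_pma_verbosity LEVEL
# Succeed only for the documented range, zero through three.
valid_pma_verbosity() {
    case "$1" in
        0|1|2|3) return 0 ;;
        *)       return 1 ;;
    esac
}

level=0   # lowest documented level
if valid_pma_verbosity "$level"; then
    export PMA_VERBOSITY="$level"   # applies to later gawk runs in this shell
else
    echo "PMA_VERBOSITY must be between 0 and 3" >&2
fi
```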
There are a few constraints on the use of persistent memory:
Mixing and matching MPFR mode and regular mode with the same backing
file is not allowed. gawk
detects such a situation and issues
a fatal error message.
If gawk is run by the root user, then
persistent memory is not allowed. This is to avoid the possibility
of private data “leaking” into the backing file and being
recovered later by an attacker.
The backing file accumulates memory leaked by gawk as it runs. Most
notably this is the memory used
to compile your program into an internal form before running it,
which happens each time, but there are other leakages as well.
(For an extreme example of this, see
this thread in the “bug-gawk at gnu.org”
mailing list archives.) It is up to you to use ‘du -sh
pmafile’ occasionally to monitor how full the file is, and
arrange to dump any data you may need before the backing file becomes
full.
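One way to automate this monitoring is to compare the storage allocated to the backing file with its total size. The following is a minimal sketch, assuming GNU du; the pma_usage_percent function name and the 80 percent threshold are hypothetical choices.

```shell
# pma_usage_percent FILE
# Print the allocated storage as a whole-number percentage of the
# file's apparent size.
pma_usage_percent() {
    allocated_kb=$(du -k "$1" | cut -f1)
    apparent_kb=$(du -k --apparent-size "$1" | cut -f1)
    if [ "$apparent_kb" -eq 0 ]; then
        echo 0      # empty file: nothing to measure
    else
        echo $(( allocated_kb * 100 / apparent_kb ))
    fi
}

pma_file=data.pma   # backing file to check
threshold=80        # hypothetical warning level, in percent

if [ -f "$pma_file" ] &&
   [ "$(pma_usage_percent "$pma_file")" -ge "$threshold" ]; then
    echo "warning: $pma_file is at least ${threshold}% full" >&2
fi
```

A check like this could run before each gawk invocation, leaving time to dump needed data while the file still has room.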
Terence Kelly has provided a separate Persistent-Memory gawk
User Manual
document, which is included in the gawk
distribution. It is worth reading.
Here are additional articles and web links that provide more information about
persistent memory and why it’s useful in a scripting language like
gawk.
This is the canonical source for Terence Kelly’s Persistent Memory Allocator (PMA). The latest source code and user manual will always be available at this location. Kelly may be reached directly at any of the following email addresses: “tpkelly AT acm.org”, “tpkelly AT cs.princeton.edu”, or “tpkelly AT eecs.umich.edu”.
Terence Kelly, Zi Fan Tan, Jianan Li, and Haris Volos,
ACM Queue magazine, Vol. 20 No. 2 (March/April 2022),
PDF,
HTML.
This paper explains the design of the PMA
allocator used in persistent gawk.
Zi Fan Tan, Jianan Li, Haris Volos, and Terence Kelly,
Non-Volatile Memory Workshop (NVMW) 2022,
http://nvmw.ucsd.edu/program/.
This paper motivates and describes a research prototype of persistent
gawk
and presents performance evaluations on Intel Optane
non-volatile memory; note that the interface differs slightly.
Terence Kelly, ACM Queue magazine Vol. 17 No. 4 (July/Aug 2019), PDF, HTML. This paper describes simple techniques for persistent memory for C/C++ code on conventional computers that lack non-volatile memory hardware.
Terence Kelly, ACM Queue magazine Vol. 18 No. 2 (March/April 2020), PDF, HTML. This paper describes a simple and robust testbed for testing software against real power failures.
Terence Kelly,
ACM Queue magazine Vol. 19 No. 4 (July/Aug 2021),
PDF,
HTML.
This paper describes a crash-tolerance feature added to GNU DBM
(gdbm).
When Terence Kelly published his papers, his collaborators produced
a prototype integration of PMA with gawk. That version used
a (mandatory!) option --persist=file to specify the file
for storing the persistent heap. If this option is given to gawk,
it produces a fatal error message instructing the user to use the
GAWK_PERSIST_FILE environment variable instead. Apart from this
paragraph, that option is undocumented.
The prototype only supported persistent data; it did not support persistent functions.
As noted earlier, support for persistent memory is experimental. If it
becomes burdensome (meaning there are too many bug reports, or too many
strange differences in behavior from when gawk is run normally), then
the feature will be removed.