C.2.4 Why Generated Files Are Kept In Git

If you look at the gawk source in the Git repository, you will notice that it includes files that are automatically generated by GNU infrastructure tools, such as Makefile.in from Automake and even configure from Autoconf.

This differs from many Free Software projects, which do not store derived files: leaving them out keeps the repository less cluttered and makes it easier to see the substantive changes when comparing versions.

However, there are several reasons why the gawk maintainer likes to have everything in the repository.

First, it is then easy to reproduce any given version completely, without relying upon the availability of other tools (older, likely obsolete, and maybe even impossible to find).

As an extreme example, if you ever even think about trying to compile, oh, say, the V7 awk, you will discover that not only do you have to bootstrap the V7 yacc to do so, but you also need the V7 lex. And the latter is pretty much impossible to bring up on a modern GNU/Linux system.(120)

(Or, let’s say gawk 1.2 required bison whatever-it-was in 1989 and that there was no awkgram.c file in the repository. Is there a guarantee that we could find that bison version? Or that it would build?)

If the repository has all the generated files, then it’s easy to just check them out and build. (Or easier, depending upon how far back we go.)

And that brings us to the second (and stronger) reason why all the files really need to be in Git. It boils down to whom you cater to: the gawk developer(s), or the user who just wants to check out a version and try it out?

The gawk maintainer wants it to be possible for any interested awk user in the world to just clone the repository, check out the branch of interest and build it, without needing the correct version(s) of the autotools.(121) That is the point of the bootstrap.sh file. It touches the various other files in the right order such that

# The canonical incantation for building GNU software:
./bootstrap.sh && ./configure && make

will just work.
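
The idea behind touching files in order can be sketched as follows. Git does not record modification times, so after a checkout a generated file may look older than its input, and make may then try to rerun tools the user does not have. This is a minimal illustration only, not the contents of gawk's actual bootstrap.sh; the function name touch_in_order and the file order in the usage comment are assumptions.

```shell
#!/bin/sh
# Sketch only: NOT the contents of gawk's real bootstrap.sh.
# touch_in_order is a hypothetical helper name.

# Touch each existing file in the order given, pausing between files
# so that every later file gets a strictly newer timestamp.
touch_in_order() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            touch "$f"
            sleep 1
        fi
    done
}

# Hypothetical usage, listing inputs before the files derived from them:
#   touch_in_order aclocal.m4 configure Makefile.in
```

With the timestamps ordered this way, make has no reason to regenerate the checked-in files, and only the compiler and make itself are needed.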

This is extremely important for the master and gawk-X.Y-stable branches.

Further, the gawk maintainer would argue that it’s also important for the gawk developers. When he tried to check out the xgawk branch(122) to build it, he couldn’t. (No ltmain.sh file, and he had no idea how to create it, and that was not the only problem.)

He felt extremely frustrated. With respect to that branch, the maintainer is no different than Jane User who wants to try to build gawk-4.1-stable or master from the repository.

Thus, the maintainer thinks that it’s not just important, but critical, that for any given branch, the above incantation just works.

A third reason to have all the files is that without them, using ‘git bisect’ to try to find the commit that introduced a bug is exceedingly difficult. The maintainer tried to do that on another project that requires running bootstrapping scripts just to create configure and so on; it was really painful. When the repository is self-contained, using git bisect in it is very easy.
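
With a self-contained repository, each bisection step reduces to the build incantation plus a check. The sketch below shows one way this might be automated with ‘git bisect run’, which treats exit status 0 as good, 125 as "skip this commit", and other values from 1 to 127 as bad. The names bisect_step, BUILD_CMD, and CHECK_CMD, and the files bug.awk and input.txt, are hypothetical.

```shell
#!/bin/sh
# Sketch of a per-commit check for `git bisect run`.
# bisect_step, BUILD_CMD and CHECK_CMD are hypothetical names.

# Build command: the incantation from this section (override via environment).
: "${BUILD_CMD:=./bootstrap.sh && ./configure && make}"
# Check command: hypothetical reproducer that exits 0 when the bug is absent.
: "${CHECK_CMD:=./gawk -f bug.awk input.txt}"

bisect_step() {
    # A commit that cannot be built is neither good nor bad: ask git to skip it.
    sh -c "$BUILD_CMD" >/dev/null 2>&1 || return 125
    # Exit status 0 marks the commit good; 1 marks it bad.
    if sh -c "$CHECK_CMD" >/dev/null 2>&1; then
        return 0
    else
        return 1
    fi
}

# Hypothetical usage: save this as bisect-step.sh with the two lines
#   bisect_step
#   exit $?
# appended, then run:
#   git bisect start BAD_COMMIT GOOD_COMMIT
#   git bisect run sh bisect-step.sh
#   git bisect reset
```

Returning 125 for build failures keeps an unbuildable commit from being blamed for the bug.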

What are some of the consequences and/or actions to take?

  1. We don’t mind that there are differing files in the different branches as a result of different versions of the autotools.
    1. It’s the maintainer’s job to merge them and he will deal with it.
    2. He is really good at ‘git diff x y > /tmp/diff1 ; gvim /tmp/diff1’ to remove the diffs that aren’t of interest in order to review code.
  2. It would certainly help if everyone used the same versions of the GNU tools as he does, which in general are the latest released versions of Automake, Autoconf, bison, GNU gettext, and Libtool.

    Installing from source is quite easy. It’s how the maintainer worked for years (and still works). He had /usr/local/bin at the front of his PATH and just did:

    wget https://ftp.gnu.org/gnu/package/package-x.y.z.tar.gz
    tar -xpzvf package-x.y.z.tar.gz
    cd package-x.y.z
    ./configure && make && make check
    make install    # as root
    

    NOTE: Because of the ‘https://’ URL, you may have to supply the --no-check-certificate option to wget to download the file.
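
To compare locally installed tools against the ones the maintainer uses, it can help to list what is first on PATH. The following is a small sketch; the function names tool_version and report_tools are hypothetical, and it assumes each tool accepts the usual GNU --version option.

```shell
#!/bin/sh
# Sketch: report which versions of the GNU tools are first on PATH.
# tool_version and report_tools are hypothetical helper names.

# Print the first line of "TOOL --version", or a note if TOOL is absent.
tool_version() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" --version 2>/dev/null | head -n 1
    else
        echo "$1: not found on PATH"
    fi
}

# The tools named in this section.
report_tools() {
    for tool in autoconf automake bison gettext libtool; do
        tool_version "$tool"
    done
}

# Hypothetical usage:
#   report_tools
```

If /usr/local/bin is at the front of PATH, the versions reported are the ones installed from source rather than those shipped with the system.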

Most of the above was originally written by the maintainer to other gawk developers. It raised the objection from one of the developers “… that anybody pulling down the source from Git is not an end user.”

However, this is not true. There are “power awk users” who can build gawk (using the magic incantation shown previously) but who can’t program in C. Thus, the major branches should be kept buildable all the time.

It was then suggested that there be a cron job to create nightly tarballs of “the source.” Here, the problem is that there are multiple source trees, corresponding to the various branches! So, nightly tarballs aren’t the answer, especially as the repository can go for weeks without any significant change being introduced.

Fortunately, the Git server can meet this need. For any given branch named branchname, use:

wget https://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-branchname.tar.gz

to retrieve a snapshot of the given branch.
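
The URL pattern above can be wrapped in a small helper so that the branch name is the only thing to type. This is a sketch; the function names snapshot_url and fetch_snapshot are hypothetical, and the URL layout is simply the one shown above.

```shell
#!/bin/sh
# Sketch: build the snapshot URL for a branch, following the pattern above.
# snapshot_url and fetch_snapshot are hypothetical helper names.

snapshot_url() {
    printf 'https://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-%s.tar.gz\n' "$1"
}

# Download the snapshot for the named branch into the current directory.
fetch_snapshot() {
    wget "$(snapshot_url "$1")"
}

# Hypothetical usage:
#   snapshot_url master
#   # -> https://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-master.tar.gz
#   fetch_snapshot gawk-4.1-stable
```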


Footnotes

(120)

We tried. It was painful.

(121)

There is one GNU program that is (in our opinion) severely difficult to bootstrap from the Git repository. For example, on the author’s old (but still working) PowerPC Macintosh with macOS 10.5, it was necessary to bootstrap a ton of software, starting with Git itself, in order to try to work with the latest code. It’s not pleasant, and especially on older systems, it’s a big waste of time.

Starting with the latest tarball was no picnic either. The maintainers had dropped the .gz and .bz2 files and distributed only .tar.xz files. It was necessary to bootstrap xz first!

(122)

A branch (since removed) created by one of the other developers that did not include the generated files.