diff Performance Tradeoffs
GNU diff runs quite efficiently; however, in some
circumstances you can cause it to run faster or produce a more compact
set of changes.
One way to improve diff performance is to use hard or
symbolic links to files instead of copies. This improves performance
because diff normally does not need to read two hard or
symbolic links to the same file, since their contents must be
identical. For example, suppose you copy a large directory hierarchy,
make a few changes to the copy, and then often use ‘diff -r’ to
compare the original to the copy. If the original files are
read-only, you can greatly improve performance by creating the copy
using hard or symbolic links (e.g., with GNU ‘cp -lR’ or
‘cp -sR’). Before editing a file in the copy for the first time,
you should break the link and replace it with a regular copy.
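The workflow above can be sketched as follows. This is a minimal illustration, assuming GNU cp and GNU diff are on the PATH; the directory and file names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare a tree against a hard-linked copy (GNU cp assumed).
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory
mkdir -p "$work/orig/sub"
printf 'alpha\nbeta\n' > "$work/orig/a.txt"
printf 'gamma\n' > "$work/orig/sub/b.txt"

# Create the copy as hard links instead of duplicating file contents.
cp -lR "$work/orig" "$work/copy"

# Before the first edit, break the link: make a regular copy and
# rename it over the linked name, so the original stays untouched.
cp "$work/copy/a.txt" "$work/copy/a.txt.tmp"
mv "$work/copy/a.txt.tmp" "$work/copy/a.txt"
printf 'delta\n' >> "$work/copy/a.txt"

# diff exits with status 1 when it finds differences, so capture the
# status instead of letting 'set -e' abort the script.
status=0
diff -r "$work/orig" "$work/copy" > "$work/out.txt" || status=$?
cat "$work/out.txt"
```

Only a.txt is a separate file after the edit; sub/b.txt is still a hard link shared by both trees, so diff normally has no need to read it.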
You can also affect the performance of GNU diff by
giving it options that change the way it compares files.
Performance has more than one dimension. These options improve one
aspect of performance at the cost of another, or they improve
performance in some cases while hurting it in others.
The way that GNU diff determines which lines have
changed always comes up with a near-minimal set of differences.
Usually it is good enough for practical purposes. If the
diff output is large, you might want diff to use a
modified algorithm that sometimes produces a smaller set of
differences. The --minimal (-d) option does this;
however, it can also cause diff to run more slowly than
usual, so it is not the default behavior.
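To see whether --minimal makes a difference for a given pair of files, you can compare the number of changed lines each run reports. The following is only a sketch: the generated sample files are hypothetical, and on many inputs both runs report the same count.

```shell
#!/bin/sh
# Sketch: count changed lines with and without --minimal (GNU diff assumed).
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory

# Hypothetical inputs with many repeated lines.
awk 'BEGIN { for (i = 1; i <= 2000; i++) print "line " (i % 50) }' \
    > "$work/old.txt"
awk 'BEGIN { for (i = 1; i <= 2000; i++) print "line " ((i * 7) % 50) }' \
    > "$work/new.txt"

# diff exits with status 1 when the files differ; capture it explicitly.
default_status=0
diff "$work/old.txt" "$work/new.txt" > "$work/default.diff" || default_status=$?
minimal_status=0
diff --minimal "$work/old.txt" "$work/new.txt" > "$work/minimal.diff" || minimal_status=$?

# In the default output format, changed lines start with '<' or '>'.
default_changed=$(grep -c '^[<>]' "$work/default.diff" || true)
minimal_changed=$(grep -c '^[<>]' "$work/minimal.diff" || true)
echo "default:   $default_changed changed lines"
echo "--minimal: $minimal_changed changed lines"
```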
When the files you are comparing are large and have small groups of
changes scattered throughout them, you can use the
--speed-large-files option to make a different modification to
the algorithm that diff uses. If the input files have a constant
small density of changes, this option speeds up the comparisons without
changing the output. If not, diff might produce a larger set of
differences; however, the output will still be correct.
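As an illustration, the sketch below builds a large file with one changed line in every thousand and compares it with and without --speed-large-files. The file names and sizes are hypothetical; the script only reports whether the two outputs happen to match and makes no claim about timing.

```shell
#!/bin/sh
# Sketch: run diff with and without --speed-large-files (GNU diff assumed).
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory

# 100000 lines, with a small change on every 1000th line of the new file.
awk 'BEGIN { for (i = 1; i <= 100000; i++) print "record " i }' \
    > "$work/old.txt"
awk 'BEGIN {
    for (i = 1; i <= 100000; i++) {
        if (i % 1000 == 0) print "changed " i
        else print "record " i
    }
}' > "$work/new.txt"

# diff exits with status 1 when the files differ; capture it explicitly.
status_default=0
diff "$work/old.txt" "$work/new.txt" > "$work/default.diff" || status_default=$?
status_fast=0
diff --speed-large-files "$work/old.txt" "$work/new.txt" > "$work/fast.diff" || status_fast=$?

# Either output is a correct description of the changes.
if cmp -s "$work/default.diff" "$work/fast.diff"; then
    echo "both runs produced identical output"
else
    echo "outputs differ, but both are still correct"
fi
```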
Normally diff discards the prefix and suffix that are common to
both files before it attempts to find a minimal set of differences.
This makes diff run faster, but occasionally it may produce
non-minimal output. The --horizon-lines=LINES option
prevents diff from discarding the last LINES lines of the
prefix and the first LINES lines of the suffix. This gives
diff further opportunities to find a minimal output.
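A minimal invocation might look like the sketch below, which keeps 100 lines of the common prefix and suffix available to the comparison. The value 100 and the sample files are arbitrary assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: keep 100 horizon lines on each side (GNU diff assumed).
set -eu
work=$(mktemp -d)                       # hypothetical scratch directory

# The two files share a long common prefix and suffix around one edit.
awk 'BEGIN { for (i = 1; i <= 500; i++) print "same " i }' \
    > "$work/old.txt"
awk 'BEGIN {
    for (i = 1; i <= 500; i++) {
        if (i == 250) print "edited"
        else print "same " i
    }
}' > "$work/new.txt"

# diff exits with status 1 when the files differ; capture it explicitly.
status=0
diff --horizon-lines=100 "$work/old.txt" "$work/new.txt" > "$work/out.diff" || status=$?
cat "$work/out.diff"
```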
Suppose a run of changed lines includes a sequence of lines at one end
and there is an identical sequence of lines just outside the other end.
The diff command is free to choose which identical sequence is
included in the hunk. In this case, diff normally shifts the
hunk’s boundaries when this merges adjacent hunks, or shifts a hunk’s
lines towards the end of the file. Merging hunks can make the output
look nicer in some cases.