The split utility splits large text files into smaller pieces. The usage follows the POSIX standard for split and is as follows:
    split [-l count] [-a suffix-len] [file [outname]]
    split -b N[k|m] [-a suffix-len] [file [outname]]
By default, the output files are named xaa, xab, and so on. Each file has 1,000 lines in it, with the likely exception of the last file.
The split program has evolved over time, and the current POSIX version is more complicated than the original Unix version. The options and what they do are as follows:
-a suffix-len
    Use suffix-len characters for the suffix. For example, if suffix-len is four, the output files would range from xaaaa to xzzzz.
-b N[k|m]
    Instead of each file containing a specified number of lines, each file should have (at most) N bytes. Supplying a trailing ‘k’ multiplies N by 1,024, yielding kilobytes. Supplying a trailing ‘m’ multiplies N by 1,048,576 (1,024 * 1,024), yielding megabytes. (This option is mutually exclusive with -l.)
-l count
    Each file should have at most count lines, instead of the default 1,000. (This option is mutually exclusive with -b.)
file
    If supplied, file is the input file to read. Otherwise standard input is processed.
outname
    If supplied, outname is the leading prefix to use for file names, instead of ‘x’.
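As a quick illustration of the -b size arithmetic described above, the following sketch converts an N[k|m] argument into a byte count. The function name size_to_bytes() is hypothetical and exists only for this example; it relies on awk's rule that a string such as "4k" converts to the number formed by its leading digits.

```awk
# size_to_bytes --- convert "N", "Nk", or "Nm" to a byte count
# (hypothetical helper for illustration only)

function size_to_bytes(arg,    n, modifier)
{
    n = arg + 0                              # numeric prefix; a trailing letter is ignored
    modifier = substr(arg, length(arg), 1)   # last character of the argument
    if (modifier == "k")
        n *= 1024
    else if (modifier == "m")
        n *= 1024 * 1024
    return n
}

BEGIN {
    print size_to_bytes("512")    # 512
    print size_to_bytes("4k")     # 4096
    print size_to_bytes("42m")    # 44040192
}
```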
In order to use the -b option, gawk should be invoked with its -b option (see Command-Line Options), or with the environment variable LC_ALL set to ‘C’, so that each input byte is treated as a separate character.
Here is an implementation of split in awk. It uses the getopt() function presented in Processing Command-Line Options.
The program begins with a standard descriptive comment and then a usage() function describing the options. The variable common keeps the function’s lines short so that they look nice on the page:
    # split.awk --- do split in awk
    #
    # Requires getopt() library function.

    function usage(     common)
    {
        common = "[-a suffix-len] [file [outname]]"
        printf("usage: split [-l count]  %s\n", common) > "/dev/stderr"
        printf("       split [-b N[k|m]] %s\n", common) > "/dev/stderr"
        exit 1
    }
Next, in a BEGIN rule we set the default values and parse the arguments. After that we initialize the data structures used to cycle the suffix from ‘aa…’ to ‘zz…’. Finally we set the name of the first output file:
    BEGIN {
        # Set defaults:
        Suffix_length = 2
        Line_count = 1000
        Byte_count = 0
        Outfile = "x"

        parse_arguments()

        init_suffix_data()

        Output = (Outfile compute_suffix())
    }
Parsing the arguments is straightforward. The program follows our convention (see Naming Library Function Global Variables) of having important global variables start with an uppercase letter:
    function parse_arguments(   i, c, l, modifier)
    {
        while ((c = getopt(ARGC, ARGV, "a:b:l:")) != -1) {
            if (c == "a")
                Suffix_length = Optarg + 0
            else if (c == "b") {
                Byte_count = Optarg + 0
                Line_count = 0

                l = length(Optarg)
                modifier = substr(Optarg, l, 1)
                if (modifier == "k")
                    Byte_count *= 1024
                else if (modifier == "m")
                    Byte_count *= 1024 * 1024
            } else if (c == "l") {
                Line_count = Optarg + 0
                Byte_count = 0
            } else
                usage()
        }

        # Clear out options
        for (i = 1; i < Optind; i++)
            ARGV[i] = ""

        # Check for filename
        if (ARGV[Optind]) {
            Optind++

            # Check for different prefix
            if (ARGV[Optind]) {
                Outfile = ARGV[Optind]
                ARGV[Optind] = ""

                if (++Optind < ARGC)
                    usage()
            }
        }
    }
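Managing the file name suffix is interesting. Given a suffix of length three, say, the values go from ‘aaa’, ‘aab’, ‘aac’ and so on, all the way to ‘zzx’, ‘zzy’, and finally ‘zzz’. There are two important aspects to this: the suffixes must be easy to generate, including “rolling over” (for example, going from ‘abz’ to ‘aca’), and the program must detect when the last suffix has been used, so that it can report an error if there is still more input.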
The computation is handled by compute_suffix(). This function is called every time a new file is opened.
The flow here is messy, because we want to generate ‘zzzz’ (say), and use it, and only produce an error after all the file name suffixes have been used up. The logical steps are as follows:
1. Generate the suffix, saving the value in result to return. To do this, the supplementary array Suffix_ind contains one element for each letter in the suffix. Each element ranges from 1 to 26, acting as the index into a string containing all the lowercase letters of the English alphabet. It is initialized by init_suffix_data(). result is built up one letter at a time, using substr() to extract each letter.
2. Prepare the data structures for the next time compute_suffix() is called. To do this, we loop over Suffix_ind, backwards. If the current element is less than 26, it’s incremented and the loop breaks (‘abq’ goes to ‘abr’). Otherwise, the element is reset to one and we move on to the preceding element (‘abz’ to ‘aca’). Thus, the Suffix_ind array is always “one step ahead” of the actual file name suffix to be returned.
3. Check whether we’ve gone past the limit of possible file names. If Reached_last is true, print a message and exit. Otherwise, check if Suffix_ind describes a suffix where all the letters are ‘z’. If so, we’re about to return the final suffix, and we set Reached_last to true so that the next call to compute_suffix() will cause a failure.
Physically, the steps in the function occur in the order 3, 1, 2:
    function compute_suffix(    i, result, letters)
    {
        # Logical step 3
        if (Reached_last) {
            printf("split: too many files!\n") > "/dev/stderr"
            exit 1
        } else if (on_last_file())
            Reached_last = 1    # fail when wrapping after 'zzz'

        # Logical step 1
        result = ""
        letters = "abcdefghijklmnopqrstuvwxyz"
        for (i = 1; i <= Suffix_length; i++)
            result = result substr(letters, Suffix_ind[i], 1)

        # Logical step 2
        for (i = Suffix_length; i >= 1; i--) {
            if (++Suffix_ind[i] > 26) {
                Suffix_ind[i] = 1
            } else
                break
        }

        return result
    }
The Suffix_ind array and Reached_last are initialized by init_suffix_data():
    function init_suffix_data(  i)
    {
        for (i = 1; i <= Suffix_length; i++)
            Suffix_ind[i] = 1

        Reached_last = 0
    }
The function on_last_file() returns true if Suffix_ind describes a suffix where all the letters are ‘z’ by checking that all the elements in the array are equal to 26:
    function on_last_file(  i, on_last)
    {
        on_last = 1
        for (i = 1; i <= Suffix_length; i++) {
            on_last = on_last && (Suffix_ind[i] == 26)
        }

        return on_last
    }
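The rollover behavior can also be seen in isolation with a small string-based sketch. This is a different technique from the one split.awk uses (which keeps indices in Suffix_ind); the function name next_suffix() is hypothetical, and it returns the empty string once every letter is ‘z’.

```awk
# next_suffix --- return the suffix that follows s, or "" after the last one
# (hypothetical helper; split.awk itself uses the Suffix_ind array instead)

function next_suffix(s,    letters, i, pos)
{
    letters = "abcdefghijklmnopqrstuvwxyz"
    for (i = length(s); i >= 1; i--) {
        pos = index(letters, substr(s, i, 1))
        if (pos < 26)    # no carry needed: bump this letter and stop
            return substr(s, 1, i - 1) substr(letters, pos + 1, 1) substr(s, i + 1)
        # this letter was 'z': reset it to 'a' and carry to the left
        s = substr(s, 1, i - 1) "a" substr(s, i + 1)
    }
    return ""    # every letter was 'z'; no suffixes remain
}

BEGIN {
    print next_suffix("abq")    # abr
    print next_suffix("abz")    # aca
    print next_suffix("zzz")    # prints an empty line
}
```

Working on index arrays, as the program does, avoids rebuilding the string on every carry, but the carry logic is the same in both forms.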
The actual work of splitting the input file is done by the next two rules. Since splitting by line count and splitting by byte count are mutually exclusive, we simply use two separate rules, one for when Line_count is greater than zero, and another for when Byte_count is greater than zero.
The variable tcount counts how many lines have been processed so far. When it exceeds Line_count, it’s time to close the previous file and switch to a new one:
    Line_count > 0 {
        if (++tcount > Line_count) {
            close(Output)
            Output = (Outfile compute_suffix())
            tcount = 1
        }
        print > Output
    }
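To see why an input whose length is an exact multiple of Line_count does not produce an empty trailing file, the counting logic can be replayed without any file I/O. The following is a sketch only; files_needed() is a hypothetical name, and it assumes (as in the rule above) that a new file is started only when a line beyond the limit actually arrives.

```awk
# files_needed --- count output files used for nlines input lines
# (hypothetical helper that replays the tcount logic of the line-count rule)

function files_needed(nlines, line_count,    tcount, files, i)
{
    if (nlines == 0)
        return 0                   # nothing is ever printed, so no file is created
    files = 1                      # the first output file is named in BEGIN
    tcount = 0
    for (i = 1; i <= nlines; i++) {
        if (++tcount > line_count) {
            files++                # corresponds to close() plus a new Output name
            tcount = 1
        }
    }
    return files
}

BEGIN {
    print files_needed(2500, 1000)    # 3
    print files_needed(2000, 1000)    # 2
    print files_needed(1, 1000)       # 1
}
```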
The rule for handling bytes is more complicated. Since lines most likely vary in length, the Byte_count boundary may be hit in the middle of an input record. In that case, split has to write enough of the first bytes of the input record to finish up Byte_count bytes, close the file, open a new file, and write the rest of the record to the new file. The logic here does all that:
    Byte_count > 0 {
        # `+ 1' is for the final newline
        if (tcount + length($0) + 1 > Byte_count) { # would overflow
            # compute leading bytes
            leading_bytes = Byte_count - tcount

            # write leading bytes
            printf("%s", substr($0, 1, leading_bytes)) > Output

            # close old file, open new file
            close(Output)
            Output = (Outfile compute_suffix())

            # set up first bytes for new file
            $0 = substr($0, leading_bytes + 1)  # trailing bytes
            tcount = 0
        }

        # write full record or trailing bytes
        tcount += length($0) + 1
        print > Output
    }
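A small worked example may make the leading and trailing byte arithmetic concrete. The sketch below uses a hypothetical function, split_point(), that mirrors the overflow test and the leading_bytes computation; it uses plain ASCII data, so length() counts bytes regardless of locale.

```awk
# split_point --- number of bytes of rec that still fit in the current file
# (hypothetical helper; mirrors the leading_bytes computation in the byte rule)

function split_point(rec, tcount, byte_count)
{
    if (tcount + length(rec) + 1 > byte_count)    # `+ 1' is for the newline
        return byte_count - tcount                # leading bytes for the old file
    return -1                                     # record fits; no split needed
}

BEGIN {
    rec = "abcdefghij"                 # 10 bytes, 11 with the newline
    n = split_point(rec, 95, 100)      # 95 bytes already written, limit of 100
    print n                            # 5
    print substr(rec, 1, n)            # abcde  (finishes the old file)
    print substr(rec, n + 1)           # fghij  (starts the new file)
    print split_point(rec, 50, 100)    # -1
}
```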
Finally, the END rule cleans up by closing the last output file:
    END {
        close(Output)
    }
Using -b twice (once for gawk itself and once for the split program) requires separating gawk’s options from those of the program. For example: ‘gawk -f getopt.awk -f split.awk -b -- -b 42m large-file.txt split-’.