On Unix systems, tar (Tape Archive) is a very popular file format to group a hierarchy of files and directories within a single file, otherwise known as an archive.
When compressed using gzip (tar.gz), it can compete with the infamous ZIP file format.
This format can also be used as a simple read-only file system, since it supports Unix file permissions.
In this article we will see how to implement a basic file archiver utility that supports tar files.
In another article we will see how to compress a tar archive using gzip, which uses the same compression method as ZIP.
A tar archive consists of a series of file entries ending with an end-of-archive entry. [diagram] Directories are also represented by a file entry. The archive is divided into blocks of 512 bytes.
A file entry starts with a header block describing the file, followed by zero or more data blocks that contain the file's content. The space taken by a file in the archive might be bigger than its actual size: the smallest unit of allocation is the block, so the size must be rounded up to the next multiple of the block size.
Unused space in blocks is filled with zeros and the end-of-archive entry is simply two or more consecutive zero-filled blocks.
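The block arithmetic described above can be sketched in a few lines of Python. The names BLOCK_SIZE, padded_size and is_end_of_archive are made up for this illustration and are not taken from the original utility.

```python
BLOCK_SIZE = 512  # every header and data block in a tar archive is 512 bytes


def blocks_needed(size: int) -> int:
    """Number of data blocks needed to store `size` bytes of content."""
    return (size + BLOCK_SIZE - 1) // BLOCK_SIZE


def padded_size(size: int) -> int:
    """Space actually taken by the content once rounded up to whole blocks."""
    return blocks_needed(size) * BLOCK_SIZE


def is_end_of_archive(first: bytes, second: bytes) -> bool:
    """True when two consecutive blocks are entirely zero-filled."""
    zero_block = bytes(BLOCK_SIZE)
    return first == zero_block and second == zero_block
```

For example, a 513-byte file occupies two data blocks (1024 bytes) after its header, while an empty file has no data block at all.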
Note that there are multiple variants of the tar file format.
The GNU tar utility (tar) seems to use the "gnu" format by default (which has a different magic and version), even though the documentation mentions that the default should be "posix". However, both "posix" and "gnu" are based on the "ustar" format (Unix Standard Tar).
To stay out of this mess we will only support the most popular "ustar" format.
The header fits into a single block.
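As a rough sketch of the header layout, here are the ustar fields with their lengths in bytes, in the order they appear in the block. The field names and sizes follow the ustar format; the helper split_header is a hypothetical name used only for this example.

```python
BLOCK_SIZE = 512

# (field name, length in bytes), in the order the fields appear in the header.
USTAR_FIELDS = [
    ("name", 100), ("mode", 8), ("uid", 8), ("gid", 8),
    ("size", 12), ("mtime", 12), ("chksum", 8), ("typeflag", 1),
    ("linkname", 100), ("magic", 6), ("version", 2),
    ("uname", 32), ("gname", 32), ("devmajor", 8), ("devminor", 8),
    ("prefix", 155),
]


def split_header(block: bytes) -> dict[str, bytes]:
    """Cut a 512-byte header block into its raw, still undecoded fields."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a header must be exactly one 512-byte block")
    fields, offset = {}, 0
    for field_name, length in USTAR_FIELDS:
        fields[field_name] = block[offset:offset + length]
        offset += length
    return fields
```

The fields add up to 500 bytes, so the last 12 bytes of the block are padding.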
To check the validity of a file entry, the header contains the magic and version fields. In the case of the "ustar" format, magic holds the string "ustar" followed by a null byte and version holds the two characters "00".
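A minimal sketch of that signature check, assuming the standard ustar offsets (magic at byte 257, version at byte 263). The function name is_ustar_header is hypothetical.

```python
USTAR_MAGIC = b"ustar\0"   # 6 bytes at offset 257
USTAR_VERSION = b"00"      # 2 bytes at offset 263


def is_ustar_header(block: bytes) -> bool:
    """True when the block carries the ustar magic and version."""
    return block[257:263] == USTAR_MAGIC and block[263:265] == USTAR_VERSION
```

A header written in the "gnu" format stores "ustar" followed by two spaces and a null byte in the same eight bytes, so this check rejects it, which is what we want since we only support "ustar".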
The weirdest and most annoying thing in the tar file format is that the numeric fields in the header are stored in octal (base 8), encoded as null-terminated ASCII strings, so each of them must be converted before it can be used.
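One possible way to do the conversion is sketched below. The name parse_octal is an assumption for this example; the sketch also tolerates space padding, which some archivers write around the digits.

```python
def parse_octal(field: bytes) -> int:
    """Decode a numeric header field stored as an octal ASCII string."""
    # Keep what comes before the first null byte, then drop space padding.
    digits = field.split(b"\0", 1)[0].strip(b" ").decode("ascii")
    # An empty field (all nulls or spaces) is treated as zero.
    return int(digits, 8) if digits else 0
```

For example, the field "0000644" followed by a null byte decodes to 420, which is 644 in octal.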
The size field stores the size of the file in octal; it is set to zero for directories and special files.
Each string in the header is made of ASCII characters and ends when \0 is encountered. Note that strings are not always null-terminated, as the last character of a field can also be part of the string.
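A small sketch of reading such a string field, handling both the terminated and the completely filled case. The name parse_string is hypothetical.

```python
def parse_string(field: bytes) -> str:
    """Decode a header string that may or may not be null-terminated."""
    end = field.find(b"\0")
    if end != -1:
        # A terminator was found: ignore it and everything after it.
        field = field[:end]
    # No terminator: the string fills the whole field.
    return field.decode("ascii")
```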
The file path is contained in the name field.
If the path is longer than 100 characters, the prefix field must be used and the full path follows the format [prefix]/[name], for a maximum length of 256 characters (including the /). A prefix is in use if the first character of the prefix field is not \0.
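Rebuilding the full path from the two fields could look like the sketch below. The function full_path and its parameter names are assumptions made for this example; it takes the raw 100-byte name field and the raw 155-byte prefix field.

```python
def _text(field: bytes) -> str:
    """Decode a header string, stopping at the first null byte if any."""
    return field.split(b"\0", 1)[0].decode("ascii")


def full_path(name_field: bytes, prefix_field: bytes) -> str:
    """Join prefix and name as [prefix]/[name] when a prefix is present."""
    name = _text(name_field)
    # The prefix is unused when its first byte is a null byte.
    if prefix_field[:1] in (b"", b"\0"):
        return name
    return f"{_text(prefix_field)}/{name}"
```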
The names of the user and group that own a file are stored in the uname and gname fields respectively. uid and gid contain the corresponding Unix user ID and group ID in octal.
The Unix file permissions are stored in the mode field in octal.
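To illustrate, the sketch below decodes the mode field and renders the nine permission bits the way ls does. Both function names are hypothetical, and only the owner, group and other read/write/execute bits are handled.

```python
def parse_mode(field: bytes) -> int:
    """Decode the octal mode field into an integer."""
    digits = field.split(b"\0", 1)[0].strip(b" ").decode("ascii")
    return int(digits, 8) if digits else 0


def permission_string(mode: int) -> str:
    """Render the lower nine permission bits, e.g. 0o755 as rwxr-xr-x."""
    flags = "rwxrwxrwx"
    return "".join(
        # 0o400 is the owner read bit; each shift moves to the next bit.
        flag if mode & (0o400 >> index) else "-"
        for index, flag in enumerate(flags)
    )
```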
The typeflag field contains the type of file, i.e. whether it is a regular file, a directory or a special file.
Our implementation will only support files and directories.
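The type flag is a single ASCII character. The table below lists the common ustar values as a sketch; the names TYPE_NAMES, entry_kind and is_supported are invented for this example, and the table is not meant to be exhaustive.

```python
# Common ustar type flags; a null byte is an older way to mark a regular file.
TYPE_NAMES = {
    b"0": "file",
    b"\0": "file",
    b"1": "hard link",
    b"2": "symbolic link",
    b"3": "character device",
    b"4": "block device",
    b"5": "directory",
    b"6": "FIFO",
}


def entry_kind(typeflag: bytes) -> str:
    """Human-readable name for a type flag, or 'unknown'."""
    return TYPE_NAMES.get(typeflag, "unknown")


def is_supported(typeflag: bytes) -> bool:
    """Only regular files and directories are handled by a basic utility."""
    return entry_kind(typeflag) in ("file", "directory")
```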
When the file is a hard link or a symbolic link, the linkname field specifies the path of what it is linked to.
Note that the linked-to path is limited to 100 characters and does not use the prefix.
When the file is a device file, the devmajor and devminor fields contain the device number, which is used to associate the device with a device driver in a Unix system. Those fields are in octal.
The date and time at which the file was last modified (modified time) is stored in the mtime field in octal. It is simply a Unix timestamp: the number of seconds elapsed since the Unix epoch (January 1st 1970 at 00:00 UTC).
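Decoding the timestamp might look like this sketch, which uses Python's standard datetime module. The function name parse_mtime is an assumption for this example.

```python
from datetime import datetime, timezone


def parse_mtime(field: bytes) -> datetime:
    """Decode the octal mtime field into a timezone-aware UTC datetime."""
    digits = field.split(b"\0", 1)[0].strip(b" ").decode("ascii")
    seconds = int(digits, 8) if digits else 0
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

A field filled with zeros decodes to the epoch itself, January 1st 1970 at midnight UTC.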
Finally, the chksum field contains the sum of all bytes in the header in octal, computed with the chksum field itself treated as if it were filled with spaces. It can be used to validate the file entry.
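Here is a minimal sketch of the checksum computation and validation, assuming the standard ustar offsets (the chksum field occupies bytes 148 to 155). The function names are hypothetical.

```python
CHKSUM_START, CHKSUM_END = 148, 156  # the 8-byte chksum field


def compute_checksum(header: bytes) -> int:
    """Sum every byte of the header, counting the chksum field as spaces."""
    return (
        sum(header[:CHKSUM_START])
        + 8 * ord(" ")
        + sum(header[CHKSUM_END:512])
    )


def checksum_is_valid(header: bytes) -> bool:
    """Compare the stored octal checksum with the computed one."""
    stored = header[CHKSUM_START:CHKSUM_END].split(b"\0", 1)[0].strip(b" ")
    if not stored:
        return False
    return int(stored.decode("ascii"), 8) == compute_checksum(header)
```

Because the eight spaces always contribute 256 to the sum, a zero-filled block never passes this check, which helps to tell a real header apart from the end-of-archive blocks.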
Listing the files within a tar archive can be done by reading the tar file block by block from the start of the file.
The first header block encountered is the root directory.
When a block is valid (its signature is correct), it is the header of a file entry. For a basic utility we only need to parse the file path and size. The size must be parsed because we need to skip the data blocks to get to the next file entry.
This means that we need to know if a file entry is an actual file or something else like a directory.
The other fields can be parsed similarly.
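The listing loop can be sketched as follows. This is a simplified illustration rather than the original utility: list_entries is a hypothetical name, the stream must be seekable, the loop stops at the first zero-filled block instead of checking for two, and the checksum is not verified.

```python
import io

BLOCK_SIZE = 512


def _octal(field: bytes) -> int:
    digits = field.split(b"\0", 1)[0].strip(b" ").decode("ascii")
    return int(digits, 8) if digits else 0


def _text(field: bytes) -> str:
    return field.split(b"\0", 1)[0].decode("ascii")


def list_entries(stream):
    """Yield (path, size, typeflag) for each file entry of a ustar archive."""
    while True:
        header = stream.read(BLOCK_SIZE)
        # A short read or a zero-filled block marks the end of the archive.
        if len(header) < BLOCK_SIZE or header == bytes(BLOCK_SIZE):
            return
        if header[257:263] != b"ustar\0":
            raise ValueError("not a ustar header")
        name, prefix = _text(header[0:100]), _text(header[345:500])
        path = f"{prefix}/{name}" if prefix else name
        size = _octal(header[124:136])
        yield path, size, header[156:157]
        # Skip the data blocks, rounded up to a whole number of blocks.
        data_length = (size + BLOCK_SIZE - 1) // BLOCK_SIZE * BLOCK_SIZE
        stream.seek(data_length, io.SEEK_CUR)
```

Opening an archive in binary mode and iterating over list_entries(archive) then gives one tuple per entry, directories included.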
On Linux, a tar file can be extracted via the command tar -xvf test.tar -C test/.
Extraction is very similar to listing the files but we must create the files and directories as needed.
The mtime must also be restored; this can be accomplished via the utime syscall. The permissions and the owner and group of each file must be set in the same way.
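As an illustration, the sketch below writes one already-parsed entry to disk, then restores its permissions and modified time with os.chmod and os.utime from Python's standard library. The function extract_entry and its parameters are hypothetical. The check against paths escaping the destination directory is an addition of this sketch, and restoring the owner and group is left out because it normally requires elevated privileges.

```python
import os


def extract_entry(dest_dir, path, typeflag, data, mode, mtime):
    """Create one file or directory under dest_dir and restore its metadata."""
    root = os.path.abspath(dest_dir)
    target = os.path.abspath(os.path.join(root, path))
    # Refuse entries such as "../x" or "/etc/x" that would land outside root.
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"unsafe path in archive: {path!r}")
    if typeflag == b"5":
        os.makedirs(target, exist_ok=True)
    else:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(data)
    os.chmod(target, mode)             # set permissions
    os.utime(target, (mtime, mtime))   # set access and modified times
    return target
```

Note that creating a file inside a directory updates that directory's modified time, so a complete extractor would probably restore directory timestamps only after all their content has been written.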
Creating archives is a little more complicated than extracting them.
The command to create a tar archive is tar --format=ustar -cvf test.tar test/.
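To give an idea of what creation involves, here is a hedged sketch that builds a ustar archive in memory from a dictionary of file names and contents. All function names are invented for this example. It only handles regular files with names of at most 100 characters, and it writes zero for uid and gid and leaves uname and gname empty.

```python
BLOCK_SIZE = 512


def _octal_field(value: int, width: int) -> bytes:
    """Zero-padded octal digits followed by a null terminator."""
    return format(value, f"0{width - 1}o").encode("ascii") + b"\0"


def build_header(name: str, size: int, mode: int, mtime: int) -> bytes:
    encoded = name.encode("ascii")
    if len(encoded) > 100:
        raise ValueError("this sketch does not split long paths into prefix")
    header = bytearray(BLOCK_SIZE)
    header[0:len(encoded)] = encoded
    header[100:108] = _octal_field(mode, 8)
    header[108:116] = _octal_field(0, 8)       # uid
    header[116:124] = _octal_field(0, 8)       # gid
    header[124:136] = _octal_field(size, 12)
    header[136:148] = _octal_field(mtime, 12)
    header[148:156] = b" " * 8                 # chksum counts as spaces
    header[156:157] = b"0"                     # typeflag: regular file
    header[257:263] = b"ustar\0"               # magic
    header[263:265] = b"00"                    # version
    checksum = sum(header)
    header[148:156] = format(checksum, "06o").encode("ascii") + b"\0 "
    return bytes(header)


def build_archive(files: dict, mode: int = 0o644, mtime: int = 0) -> bytes:
    """Concatenate header + padded data for each file, then the end marker."""
    parts = []
    for name, data in files.items():
        padding = -len(data) % BLOCK_SIZE      # zero-fill the last data block
        parts.append(build_header(name, len(data), mode, mtime))
        parts.append(data + bytes(padding))
    parts.append(bytes(2 * BLOCK_SIZE))        # end-of-archive entry
    return b"".join(parts)
```

The checksum has to be computed last, once every other field is in place, which is part of what makes creation a little more involved than reading.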
In this article we have omitted proper error handling to keep the code short. Please see the full source code available on GitHub for more details.