15-year-old flaw in Python potentially threatens 300,000 open source projects

Security researchers from Trellix's Advanced Research Center came across a vulnerability in Python's tarfile module, which is used to read and write tar archives, while carrying out an investigation that was initially unrelated to it. After looking further into the possible impact of this vulnerability, the researchers concluded that hundreds of thousands of open source repositories were vulnerable.

This vulnerability has been present in Python for 15 years. It was reported by Jan Matejek, who was the Python package maintainer for SUSE at the time, and assigned the identifier CVE-2007-4559. But it was never fixed, even though the documentation was updated to say that "it can be dangerous to extract archives from untrusted sources".

A blog post from Trellix explains this vulnerability in detail and shows how easy it is to exploit.

In summary, the vulnerability, which is located in the extract and extractall functions of the tarfile module, allows an attacker to overwrite arbitrary files on the targeted system by adding the sequence ".." to filenames in a tar archive.

A tar archive is a collection of several files together with their metadata, which is used when unpacking the archive. This metadata includes, but is not limited to, the file name, the file size and checksum, and the file owner information recorded when the file was archived. In the Python tarfile module, this information is held by instances of the TarInfo class, one of which is generated for each item in the tar archive. These items can represent many different types of filesystem objects, such as directories, symbolic links and regular files.
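To make the role of TarInfo more concrete, here is a minimal sketch that builds a one-file archive in memory and reads its metadata back. The file name "hello.txt", the owner name "alice" and the summary dictionary are illustrative assumptions, not taken from the Trellix post.

```python
import io
import tarfile

payload = b"hello"

# Build a tiny archive in memory; the name and owner are made up for the demo.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    info.uname = "alice"
    tar.addfile(info, io.BytesIO(payload))

# Read the archive back: each member is exposed as a TarInfo instance.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmembers()[0]
    summary = {
        "name": member.name,          # file name stored in the header
        "size": member.size,          # size in bytes
        "checksum": member.chksum,    # header checksum
        "owner": member.uname,        # owner name recorded at archive time
        "is_file": member.isfile(),
        "is_dir": member.isdir(),
        "is_symlink": member.issym(),
    }

print(summary)
```

Everything in this dictionary comes straight from the archive, which means it is under the control of whoever created the archive.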

In the extract function of the tarfile module, the destination filename is constructed before being passed to the function that extracts and writes the file to the filesystem during unpacking. The code explicitly trusts the information present in the TarInfo object: it joins the path passed to the extraction function with the name stored in the TarInfo object, without any check, which allows an attacker to perform a directory traversal attack.
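The effect of that unchecked join can be illustrated with a short sketch. This is not the CPython source: the destination helper and the /srv/unpack directory are hypothetical, and posixpath is used instead of os.path only so that the result is the same on every platform.

```python
import posixpath


def destination(extract_dir, member_name):
    """Hypothetical helper: join the extraction directory and the member
    name without any validation, then normalise the result to show where
    the file would really end up."""
    return posixpath.normpath(posixpath.join(extract_dir, member_name))


# A well-behaved member name stays inside the extraction directory.
safe = destination("/srv/unpack", "docs/readme.txt")

# A name containing ".." components climbs out of it.
escaped = destination("/srv/unpack", "../../etc/passwd")

print(safe)
print(escaped)
```

The second result no longer starts with /srv/unpack, which is exactly the escape that a directory traversal attack relies on.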

Since the extractall function relies on the extract function, extractall is also vulnerable to the directory traversal attack.
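Because every member goes through the same join, a single bad name anywhere in an archive is enough. The sketch below models this without extracting anything: it walks the members of an in-memory archive and computes where each one would land. The helpers build_archive and planned_destinations, the member names and the /srv/unpack directory are all assumptions made for illustration.

```python
import io
import posixpath
import tarfile


def build_archive(names):
    """Build an in-memory tar whose members carry the given names verbatim."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = b"x"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def planned_destinations(archive, extract_dir):
    """Hypothetical helper: list where each member would be written if its
    name were joined to extract_dir unchecked. Nothing is extracted."""
    with tarfile.open(fileobj=archive, mode="r") as tar:
        return [
            posixpath.normpath(posixpath.join(extract_dir, member.name))
            for member in tar.getmembers()
        ]


plan = planned_destinations(build_archive(["a.txt", "../b.txt"]), "/srv/unpack")
print(plan)
```

Comparing each planned destination with the extraction directory is one simple way to spot members that would escape it before unpacking an untrusted archive.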

A very simple feat to achieve

To exploit this vulnerability, all an attacker has to do is add ".." followed by the operating system's path separator ("/" or "\") to the file name, in order to escape the directory the file is supposed to be extracted to. The tarfile module itself makes this very easy: when a file is added to an archive, a filter can be supplied to inspect and modify the file's metadata before it is written to the tarball.
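As a rough illustration of that filter mechanism, the sketch below uses the filter parameter of TarFile.add to rewrite a member name while the archive is being built. The function name add_traversal_prefix and the file payload.txt are made up for this example, and the archive is kept in memory and only listed, never extracted.

```python
import io
import os
import tarfile
import tempfile


def add_traversal_prefix(tarinfo):
    """Filter passed to TarFile.add: rewrite the stored name before the
    member is written to the archive."""
    tarinfo.name = "../" + tarinfo.name
    return tarinfo


with tempfile.TemporaryDirectory() as workdir:
    # Create a harmless file to archive (hypothetical name and content).
    source = os.path.join(workdir, "payload.txt")
    with open(source, "w") as handle:
        handle.write("demo")

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(source, arcname="payload.txt", filter=add_traversal_prefix)

# List the stored names without extracting anything.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    stored_names = tar.getnames()

print(stored_names)
```

The name recorded in the archive now begins with "../", even though the file on disk was an ordinary payload.txt.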

Thus, as Jan Matejek pointed out 15 years ago, it is possible to create a malicious archive containing, for example, a member named ../../../../../etc/passwd. When a system administrator unpacks such an archive, they unknowingly overwrite the system file /etc/passwd.