[File] Regression in zip-archive detection

Torsten Landschoff torsten at debian.org
Thu Sep 2 21:24:05 UTC 2021


Hello world,

tl;dr: PR/228 did not fully fix the problem. Seems like a change to
Magdir/zip introduced the bug and an update to Magdir/archive was used
to repair it.


First: thanks for your work on file and libmagic. I have been using it
for more than 20 years now.

This week I traced a regression in one of our applications at work back
to libmagic: suddenly zip files are no longer detected as such but are
reported as application/octet-stream.

After a bit of research it turned out that this is caused by our
containers now being based on Debian bullseye instead of buster.
On buster, libmagic (accessed from Python via python-magic) correctly
identifies zip files, both with magic_file and with magic_buffer:

from_file: Zip archive data, at least v2.0 to extract, mime: application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime: application/zip
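
The attached file.py is scrubbed in the archive. As a rough sketch of such
a comparison script (not the original attachment), assuming python-magic's
module-level from_file() and from_buffer() with their mime=True option;
the helper names format_line and identify are made up for illustration:

```python
import sys


def format_line(label, description, mime):
    """Render one result in the 'label: description, mime: type' shape of the logs."""
    return f"{label}: {description}, mime: {mime}"


def identify(path):
    """Ask libmagic about `path` twice: by filename and from an in-memory buffer."""
    # python-magic is a third-party package; it is imported lazily so that
    # this module can still be loaded where it is not installed.
    import magic

    with open(path, "rb") as fh:
        data = fh.read()  # pass the full file content, not just a prefix

    return [
        format_line("from_file",
                    magic.from_file(path),
                    magic.from_file(path, mime=True)),
        format_line("from_buffer",
                    magic.from_buffer(data),
                    magic.from_buffer(data, mime=True)),
    ]


if __name__ == "__main__" and len(sys.argv) == 2:
    for line in identify(sys.argv[1]):
        print(line)
```

Run as "python3 file.py hello.zip". Because both calls see the same bytes,
any difference between the two printed lines comes from libmagic itself.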

Full log from Dockerfile.buster below:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
$ sudo docker build -f Dockerfile.buster --no-cache .
[sudo] password for torsten.landschoff:
Sending build context to Docker daemon  58.37kB
Step 1/6 : FROM debian:buster
  ---> 63652705977d
Step 2/6 : RUN apt-get -qq update && apt-get install -qq --yes file
libmagic1 libmagic-dev python3 python3-pip > /dev/null
  ---> Running in 84f5b40c127e
Removing intermediate container 84f5b40c127e
  ---> 1b6993462ac8
Step 3/6 : RUN python3 -m pip install python-magic
  ---> Running in 85feb98ec89f
Collecting python-magic
   Downloading
https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Installing collected packages: python-magic
Successfully installed python-magic-0.4.24
Removing intermediate container 85feb98ec89f
  ---> 38068f959fbb
Step 4/6 : COPY hello.zip file.py /root/
  ---> 689c7cc01bca
Step 5/6 : RUN file /root/hello.zip
  ---> Running in 7b98fe1dbb5c
/root/hello.zip: Zip archive data, at least v2.0 to extract
Removing intermediate container 7b98fe1dbb5c
  ---> 66803a178e36
Step 6/6 : RUN python3 /root/file.py /root/hello.zip
  ---> Running in 73adce5e3da7
from_file: Zip archive data, at least v2.0 to extract, mime: 
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
Removing intermediate container 73adce5e3da7
  ---> 88684f486f02
Successfully built 88684f486f02
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


This appeared to be related to https://bugs.astron.com/view.php?id=228,
and applying the fix from

https://github.com/file/file/commit/33eedc8edd1b53eea3c5c74f0105ecca8cbcf3cb

did help, but only when using magic_file. For small uploads we use
magic_buffer, and that still misbehaves.

It appears that this is due to two definitions for zip files in
Magdir/archive vs. Magdir/zip. The Dockerfile for bullseye illustrates
the issue.


Basically, libmagic1 5.39-3 misreports the mime type of zip archives:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# file /root/hello.zip
/root/hello.zip: Zip archive data, made by v2.0 UNIX, extract using at
least v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size
19, method=deflate
# file --mime-type /root/hello.zip
/root/hello.zip: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This is consistent with python-magic:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/octet-stream
from_buffer: data, mime: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Building today's master from the file mirror on GitHub
(2e48f028e670659d4674ce28604ed6ac5acba70d from github.com/file/file)
only helps for from_file; the from_buffer output in Python is still wrong:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/zip
from_buffer: data, mime: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

But reverting the archive magic to
https://raw.githubusercontent.com/file/file/FILE5_38/magic/Magdir/archive
restores the old output (and loses the new information reported above):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, at least v2.0 to extract, mime: 
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Here is the (minimally redacted) log which should be reproducible with
Dockerfile.bullseye:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
$ sudo docker build -f Dockerfile.bullseye --no-cache .
Sending build context to Docker daemon  7.168kB

Step 1/15 : FROM debian:bullseye
  ---> fe3c5de03486
Step 2/15 : RUN apt-get update -qq && apt-get install --yes -qq
libmagic1 libmagic-dev file python3 python3-pip >/dev/null
  ---> Running in 1b622d917411
Removing intermediate container 1b622d917411
  ---> a3e449e749a4
Step 3/15 : RUN python3 -m pip install python-magic
  ---> Running in 39b95ffb5ffa
Collecting python-magic
   Downloading python_magic-0.4.24-py2.py3-none-any.whl (12 kB)
Installing collected packages: python-magic
Successfully installed python-magic-0.4.24
Removing intermediate container 39b95ffb5ffa
  ---> 6f0f7767f94a
Step 4/15 : COPY hello.zip file.py /root/
  ---> 887744ccafa6
Step 5/15 : WORKDIR /root
  ---> Running in 93850928bc25
Removing intermediate container 93850928bc25
  ---> ce07ea7f1532
Step 6/15 : RUN file /root/hello.zip
  ---> Running in 1e87dacb40e2
/root/hello.zip: Zip archive data, made by v2.0 UNIX, extract using at
least v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size
19, method=deflate
Removing intermediate container 1e87dacb40e2
  ---> c10fd2ab3b7d
Step 7/15 : RUN file --mime-type /root/hello.zip
  ---> Running in 85424427f9dc
/root/hello.zip: application/octet-stream
Removing intermediate container 85424427f9dc
  ---> 40814f1de0a6
Step 8/15 : RUN python3 /root/file.py /root/hello.zip
  ---> Running in 768f0158bd86
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/octet-stream
from_buffer: data, mime: application/octet-stream
Removing intermediate container 768f0158bd86
  ---> 6357c72b8555
Step 9/15 : RUN apt-get install --yes -qq build-essential libbz2-dev
liblzma-dev zlib1g-dev curl autoconf automake libtool >/dev/null
  ---> Running in 6a7f0bcb7021
Removing intermediate container 6a7f0bcb7021
  ---> 4b61e3070787
Step 10/15 : RUN curl -L
https://github.com/file/file/archive/refs/heads/master.tar.gz|tar -xzf -
  ---> Running in cf9f3d9ccb9e
  ---> 7ff3254eb0ba
Step 11/15 : RUN cd /root/file-master && autoreconf -fi && ./configure
-q && make -s install && ldconfig
  ---> Running in a41fe88b0293
[...]
Making install in python
  /bin/mkdir -p '/usr/local/lib/pkgconfig'
  /usr/bin/install -c -m 644 libmagic.pc '/usr/local/lib/pkgconfig'
Removing intermediate container a41fe88b0293
  ---> 4957bcacdf9f
Step 12/15 : RUN python3 /root/file.py /root/hello.zip
  ---> Running in f127da360721
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/zip
from_buffer: data, mime: application/octet-stream
Removing intermediate container f127da360721
  ---> fb84a7eb08c8
Step 13/15 : RUN curl
https://raw.githubusercontent.com/file/file/FILE5_38/magic/Magdir/archive 
 >
/root/file-master/magic/Magdir/archive
  ---> Running in bb1bd3515a60
Removing intermediate container bb1bd3515a60
  ---> dc3ae16b8eed
Step 14/15 : RUN cd /root/file-master && make -s install
  ---> Running in 8d55e73f8122
Making install in src
[...]
Making install in magic
  /bin/mkdir -p '/usr/local/share/misc'
  /usr/bin/install -c -m 644 magic.mgc '/usr/local/share/misc'
[...]
Removing intermediate container 8d55e73f8122
  ---> 11f7250bfee2
Step 15/15 : RUN python3 /root/file.py /root/hello.zip
  ---> Running in 484d335b8e4a
from_file: Zip archive data, at least v2.0 to extract, mime: 
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
Removing intermediate container 484d335b8e4a
  ---> 2b707c4e2fc4
Successfully built 2b707c4e2fc4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


We worked around the problem by copying the archive magic mentioned
above to /etc/magic in our application container, but I bet that others
will run into this problem as well.

BTW: Is there any reason why magic_file has to behave differently from
magic_buffer if the full content of the file is passed to the latter?


Greetings, Torsten

PS: To create the example hello.zip you can use this python command:

$ python3
Python 3.8.10 (default, Jun  2 2021, 10:49:15)
>>> import base64
>>> open("hello.zip", "wb").write(base64.b85decode(
... b'P)h>@6aWAK2mpzsB2)d`d+8Mb000vJ000R9003xZY;12Xba-_0NX^N~*HK8z%t=)!&o9bJ(c=ODP)h*<6ay3h000O8iKHS^{oH%$6#xJL6951J2><{90000000000w1EHs003xZY;12Xba-@7O9ci1000010096u0000y0000000'
... ))
137

Adding it as attachment was rejected by the mail daemon.
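
A script version of the same step, as a minimal sketch using only the
standard library. It additionally checks that the decoded bytes start with
the ZIP local-file-header signature "PK\x03\x04". The payload is the string
from the session above with a literal "@" in "Xba-@7" (mail archives tend
to rewrite that character as " at "); the constant name HELLO_ZIP_B85 is
just an illustrative choice:

```python
import base64

# Base85 payload of hello.zip, copied from the interactive session.
HELLO_ZIP_B85 = b'P)h>@6aWAK2mpzsB2)d`d+8Mb000vJ000R9003xZY;12Xba-_0NX^N~*HK8z%t=)!&o9bJ(c=ODP)h*<6ay3h000O8iKHS^{oH%$6#xJL6951J2><{90000000000w1EHs003xZY;12Xba-@7O9ci1000010096u0000y0000000'

data = base64.b85decode(HELLO_ZIP_B85)

# Every ZIP file that begins with a local file header starts with these bytes.
has_zip_signature = data.startswith(b"PK\x03\x04")

print(len(data), has_zip_signature)

if has_zip_signature:
    with open("hello.zip", "wb") as fh:
        fh.write(data)
```

The printed length should match the 137 bytes reported by write() in the
session above; a different length means the payload was damaged in transit.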

PPS: I bisected this using a simple C program. Looks like this commit 
introduced this bug:

commit c21152a62f9a62cdb67e462e66fb35d06435fa84 (HEAD, refs/bisect/bad)
Author: Christos Zoulas <christos at zoulas.com>
Date:   Sun Jun 7 17:27:10 2020 +0000

     Handle EPUB documents that have no mimetype


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dockerfile.bullseye
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dockerfile.buster
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: file.py
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment-0002.ksh>