[File] Regression in zip-archive detection
Torsten Landschoff
torsten at debian.org
Thu Sep 2 21:24:05 UTC 2021
Hello world,
tl;dr: PR/228 did not fully fix the problem. Seems like a change to
Magdir/zip introduced the bug and an update to Magdir/archive was used
to repair it.
First: thanks for your work on file and libmagic. I have been using it
for more than 20 years now.
This week I traced a regression in one of our applications at work back
to libmagic: suddenly zip files are no longer detected as such but are
reported as application/octet-stream.
After a bit of research it turned out that this is caused by our
containers now being based on Debian bullseye instead of buster.
On buster, the libmagic library (accessed via python-magic from Python)
correctly identifies zip files, both when using magic_file and when
using magic_buffer:
from_file: Zip archive data, at least v2.0 to extract, mime: application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime: application/zip
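The attached file.py is not reproduced in the list archive. A minimal sketch of a script that would print lines in this shape follows. It assumes python-magic's from_file and from_buffer helpers with their mime=True flag; the function name report and the exact formatting are assumptions, not the original script.

```python
import sys


def report(path, magic_mod):
    """Return one line per libmagic entry point for the file at `path`.

    `magic_mod` is expected to behave like the python-magic module
    (from_file / from_buffer, each with an optional mime=True flag).
    """
    with open(path, "rb") as fh:
        data = fh.read()  # the full file content, as handed to magic_buffer
    return [
        "from_file: %s, mime: %s"
        % (magic_mod.from_file(path), magic_mod.from_file(path, mime=True)),
        "from_buffer: %s, mime: %s"
        % (magic_mod.from_buffer(data), magic_mod.from_buffer(data, mime=True)),
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    import magic  # python-magic; imported late so report() works with a stub

    print("\n".join(report(sys.argv[1], magic)))
```

Run as "python3 file.py hello.zip". Passing the module in as an argument is only a convenience so the helper can be exercised without libmagic installed.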
Full log from Dockerfile.buster below:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
$ sudo docker build -f Dockerfile.buster --no-cache .
[sudo] password for torsten.landschoff:
Sending build context to Docker daemon 58.37kB
Step 1/6 : FROM debian:buster
---> 63652705977d
Step 2/6 : RUN apt-get -qq update && apt-get install -qq --yes file
libmagic1 libmagic-dev python3 python3-pip > /dev/null
---> Running in 84f5b40c127e
Removing intermediate container 84f5b40c127e
---> 1b6993462ac8
Step 3/6 : RUN python3 -m pip install python-magic
---> Running in 85feb98ec89f
Collecting python-magic
Downloading
https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Installing collected packages: python-magic
Successfully installed python-magic-0.4.24
Removing intermediate container 85feb98ec89f
---> 38068f959fbb
Step 4/6 : COPY hello.zip file.py /root/
---> 689c7cc01bca
Step 5/6 : RUN file /root/hello.zip
---> Running in 7b98fe1dbb5c
/root/hello.zip: Zip archive data, at least v2.0 to extract
Removing intermediate container 7b98fe1dbb5c
---> 66803a178e36
Step 6/6 : RUN python3 /root/file.py /root/hello.zip
---> Running in 73adce5e3da7
from_file: Zip archive data, at least v2.0 to extract, mime:
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
Removing intermediate container 73adce5e3da7
---> 88684f486f02
Successfully built 88684f486f02
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This appeared to be related to https://bugs.astron.com/view.php?id=228.
Applying the fix from
https://github.com/file/file/commit/33eedc8edd1b53eea3c5c74f0105ecca8cbcf3cb
helped, but only when using magic_file. For small uploads we use
magic_buffer, and that still misbehaves.
It appears that this is due to two definitions for zip files in
Magdir/archive vs. Magdir/zip. The Dockerfile for bullseye illustrates
the issue.
Basically, libmagic1 5.39-3 misreports the mime type of zip archives:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# file /root/hello.zip
/root/hello.zip: Zip archive data, made by v2.0 UNIX, extract using at
least v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size
19, method=deflate
# file --mime-type /root/hello.zip
/root/hello.zip: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This is consistent with python-magic:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/octet-stream
from_buffer: data, mime: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Using today's state of the file repository mirror
(2e48f028e670659d4674ce28604ed6ac5acba70d from github.com/file/file)
fixes the MIME type for from_file, but from_buffer is still wrong:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/zip
from_buffer: data, mime: application/octet-stream
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
But reverting the archive magic to
https://raw.githubusercontent.com/file/file/FILE5_38/magic/Magdir/archive
restores the old output (and loses the new information reported above):
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
from_file: Zip archive data, at least v2.0 to extract, mime:
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Here is the (minimally redacted) log which should be reproducible with
Dockerfile.bullseye:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
$ sudo docker build -f Dockerfile.bullseye --no-cache .
Sending build context to Docker daemon 7.168kB
Step 1/15 : FROM debian:bullseye
---> fe3c5de03486
Step 2/15 : RUN apt-get update -qq && apt-get install --yes -qq
libmagic1 libmagic-dev file python3 python3-pip >/dev/null
---> Running in 1b622d917411
Removing intermediate container 1b622d917411
---> a3e449e749a4
Step 3/15 : RUN python3 -m pip install python-magic
---> Running in 39b95ffb5ffa
Collecting python-magic
Downloading python_magic-0.4.24-py2.py3-none-any.whl (12 kB)
Installing collected packages: python-magic
Successfully installed python-magic-0.4.24
Removing intermediate container 39b95ffb5ffa
---> 6f0f7767f94a
Step 4/15 : COPY hello.zip file.py /root/
---> 887744ccafa6
Step 5/15 : WORKDIR /root
---> Running in 93850928bc25
Removing intermediate container 93850928bc25
---> ce07ea7f1532
Step 6/15 : RUN file /root/hello.zip
---> Running in 1e87dacb40e2
/root/hello.zip: Zip archive data, made by v2.0 UNIX, extract using at
least v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size
19, method=deflate
Removing intermediate container 1e87dacb40e2
---> c10fd2ab3b7d
Step 7/15 : RUN file --mime-type /root/hello.zip
---> Running in 85424427f9dc
/root/hello.zip: application/octet-stream
Removing intermediate container 85424427f9dc
---> 40814f1de0a6
Step 8/15 : RUN python3 /root/file.py /root/hello.zip
---> Running in 768f0158bd86
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/octet-stream
from_buffer: data, mime: application/octet-stream
Removing intermediate container 768f0158bd86
---> 6357c72b8555
Step 9/15 : RUN apt-get install --yes -qq build-essential libbz2-dev
liblzma-dev zlib1g-dev curl autoconf automake libtool >/dev/null
---> Running in 6a7f0bcb7021
Removing intermediate container 6a7f0bcb7021
---> 4b61e3070787
Step 10/15 : RUN curl -L
https://github.com/file/file/archive/refs/heads/master.tar.gz|tar -xzf -
---> Running in cf9f3d9ccb9e
---> 7ff3254eb0ba
Step 11/15 : RUN cd /root/file-master && autoreconf -fi && ./configure
-q && make -s install && ldconfig
---> Running in a41fe88b0293
[...]
Making install in python
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 libmagic.pc '/usr/local/lib/pkgconfig'
Removing intermediate container a41fe88b0293
---> 4957bcacdf9f
Step 12/15 : RUN python3 /root/file.py /root/hello.zip
---> Running in f127da360721
from_file: Zip archive data, made by v2.0 UNIX, extract using at least
v2.0, last modified Fri Mar 14 06:41:13 2014, uncompressed size 19,
method=deflate, mime: application/zip
from_buffer: data, mime: application/octet-stream
Removing intermediate container f127da360721
---> fb84a7eb08c8
Step 13/15 : RUN curl https://raw.githubusercontent.com/file/file/FILE5_38/magic/Magdir/archive > /root/file-master/magic/Magdir/archive
---> Running in bb1bd3515a60
Removing intermediate container bb1bd3515a60
---> dc3ae16b8eed
Step 14/15 : RUN cd /root/file-master && make -s install
---> Running in 8d55e73f8122
Making install in src
[...]
Making install in magic
/bin/mkdir -p '/usr/local/share/misc'
/usr/bin/install -c -m 644 magic.mgc '/usr/local/share/misc'
[...]
Removing intermediate container 8d55e73f8122
---> 11f7250bfee2
Step 15/15 : RUN python3 /root/file.py /root/hello.zip
---> Running in 484d335b8e4a
from_file: Zip archive data, at least v2.0 to extract, mime:
application/zip
from_buffer: Zip archive data, at least v2.0 to extract, mime:
application/zip
Removing intermediate container 484d335b8e4a
---> 2b707c4e2fc4
Successfully built 2b707c4e2fc4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
We worked around the problem by copying the archive magic mentioned
above to /etc/magic in our application container, but I bet that others
will run into this problem as well.
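Where replacing the magic database is not an option, a guard in application code is another possibility. The following is only a sketch under assumptions, not the /etc/magic workaround described above: detect stands for any buffer-based MIME lookup (for example python-magic's from_buffer with mime=True), and mime_with_zip_fallback is a hypothetical helper name. It second-guesses only the generic application/octet-stream answer, using zipfile.is_zipfile from the standard library.

```python
import io
import zipfile


def mime_with_zip_fallback(data, detect):
    """Return detect(data), upgraded to application/zip when appropriate.

    `detect` is any callable mapping bytes to a MIME type string.
    The fallback only applies when the detector gave the generic answer.
    """
    mime = detect(data)
    if mime == "application/octet-stream" and zipfile.is_zipfile(io.BytesIO(data)):
        return "application/zip"
    return mime
```

Note that is_zipfile looks for the end-of-central-directory record, so it also accepts files with a zip archive appended to other data. That is why the sketch never overrides a specific answer from the detector.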
BTW: Is there any reason why magic_file has to behave differently from
magic_buffer if the full content of the file is passed to the latter?
Greetings, Torsten
PS: To create the example hello.zip you can use this python command:
$ python3
Python 3.8.10 (default, Jun 2 2021, 10:49:15)
>>> import base64
>>> open("hello.zip", "wb").write(base64.b85decode(
... b'P)h>@6aWAK2mpzsB2)d`d+8Mb000vJ000R9003xZY;12Xba-_0NX^N~*HK8z%t=)!&o9bJ(c=ODP)h*<6ay3h000O8iKHS^{oH%$6#xJL6951J2><{90000000000w1EHs003xZY;12Xba-@7O9ci1000010096u0000y0000000'
... ))
137
Adding it as attachment was rejected by the mail daemon.
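As a cross-check that the encoded sample survives copy and paste, the same bytes can be decoded in memory and opened with the standard library's zipfile module. This is a sketch: the string is the one from the command above, with a literal "@" where the list archive may display " at ", and the names HELLO_ZIP_B85 and load_hello_zip are illustrative.

```python
import base64
import io
import zipfile

# Base85 text of the sample archive, split across literals for readability.
HELLO_ZIP_B85 = (
    b"P)h>@6aWAK2mpzsB2)d`d+8Mb000vJ000R9003xZY;12Xba-_0NX^N~*HK8z%t=)!&o9bJ"
    b"(c=ODP)h*<6ay3h000O8iKHS^{oH%$6#xJL6951J2><{90000000000w1EHs003xZY;12X"
    b"ba-@7O9ci1000010096u0000y0000000"
)


def load_hello_zip():
    """Decode the sample and return (raw bytes, list of member names)."""
    raw = base64.b85decode(HELLO_ZIP_B85)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return raw, archive.namelist()


raw, names = load_hello_zip()
print(len(raw), names)  # the length should match the 137 bytes written above
```

If the decoded length is not 137, the string was probably damaged in transit.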
PPS: I bisected this using a simple C program. Looks like this commit
introduced this bug:
commit c21152a62f9a62cdb67e462e66fb35d06435fa84 (HEAD, refs/bisect/bad)
Author: Christos Zoulas <christos at zoulas.com>
Date: Sun Jun 7 17:27:10 2020 +0000
Handle EPUB documents that have no mimetype
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dockerfile.bullseye
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dockerfile.buster
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: file.py
URL: <https://mailman.astron.com/pipermail/file/attachments/20210902/91a4fd71/attachment-0002.ksh>