[File] [PATCH] of Magdir/compress for gzip compressed data; update +extensions

Jörg Jenderek joerg.jen.der.ek at gmx.net
Tue Apr 16 21:38:25 UTC 2019


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,
a few days ago, out of curiosity, I ran the file command, version 5.36, on
VirtualBox Extension Packs, which appear to be gzipped tar archives.
These have their own file name extension "vbox-extpack" and MIME
type "application/x-virtualbox-vbox-extpack", at least for
VirtualBox 6.0.4 on Windows systems. With the -i option the
outdated MIME type application/x-gzip is shown, and with the --extension
option only ??? is displayed.

To catch all the different existing gzip-compressed-looking
files I used the following procedure:
Download the pattern database "triddefs_xml.7z" of TrID, another file
identifying software, found at http://mark0.net/soft-trid-e.html .
Extract the 7z archive.
Look in the extracted TrID definition files *.trid.xml for the pattern
<Bytes>1F8B08
This is the magic for gzip data compressed with the deflate method.
It is also used by the file command in Magdir/compress, in the magic line
 0       string          \037\213        gzip compressed data
This now becomes
 0       string          \037\213
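
The search through the TrID definitions described above can be sketched in Python as follows. This is only an illustration, not part of the patch: the function name find_gzip_like_definitions is made up, and the directory name "defs" in the usage note is taken from the find command quoted in the patch comments. Unlike `find -iname`, the glob here is case-sensitive on most Unix systems.

```python
from pathlib import Path

# Hex pattern for the gzip magic plus method 8 (deflate), as TrID writes it.
GZIP_DEFLATE_PATTERN = "<Bytes>1F8B08"


def find_gzip_like_definitions(defs_dir):
    """Return the sorted names of *.trid.xml files that mention the pattern."""
    matches = []
    for path in Path(defs_dir).rglob("*.trid.xml"):
        # Definitions are small XML files; tolerate odd encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        if GZIP_DEFLATE_PATTERN in text:
            matches.append(path.name)
    return sorted(matches)
```

Calling find_gzip_like_definitions("defs") on the extracted archive should list ark-gz.trid.xml, gxd.trid.xml and the other gzip-looking definitions.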

The strength of the gzip magic was doubled from 50 to 100 by the line
 !:strength * 2
I see no reason for this, so I turned that line into a comment.

So I started to change Magdir/compress. First I added the Wikipedia page
about gzip with a URL line like:
	# URL: https://en.wikipedia.org/wiki/Gzip
Then I also added the gzip file format specification with a line like
	# Reference: https://tools.ietf.org/html/rfc1952

Via the *.trid.xml files I found all the different gzip-looking types
recognized by TrID. Besides ark-gz.trid.xml for "normal" gzip-compressed
data there are also 9 definitions like gxd.trid.xml. Inside a
definition file I often find a reference URL like
http://www.generalcadd.com/ which leads to more information about
the specific file format or to downloads.
So I learned that a sample like HOCKETT-STPAUL-WRHSE.gxd is
a General CADD Drawing. Not all aspects of these file formats are fully
documented, but the knowledge in the TrID database can be used. In
gxd files, text phrases like "GXD" or "Created with General CADD" are
stored near the beginning. But according to RFC 1952, in real gzipped
files something like ASCII text near the beginning can only occur
as a comment or file name, which is indicated by bits in the flag byte.
So no FNAME and no FCOMMENT bit implies no file name or comment, which
means only binary data is expected there. This branch is now handled by
the line
 >3	byte&0x18	=0
If we nevertheless find something like ASCII text at that point, we know
it is not a real gzip-compressed file. This is confirmed by an error
message like "invalid compressed data--format violated" when running the
command `gzip -t -v` on samples like HOCKETT-STPAUL-WRHSE.gxd.
So General CADD files are identified by a line like
 >>10	string		GXD	General CADD, Drawing or Component
Afterwards the MIME type and file name extension are given by lines like
 !:mime	application/octet-stream
 !:ext	gxc/gxd
So a General CADD Drawing like HOCKETT-STPAUL-WRHSE.gxd and a General
CADD Component like BUILDINGEDGE.gxc are identified.
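
As a rough sketch of the branching described so far (the flag-byte test at offset 3, then the string tests at offset 10), the logic could be mirrored in Python like this. The function name classify is hypothetical, and this is not how the file command evaluates magic internally; it only follows the tests named in the text.

```python
GZIP_MAGIC = b"\x1f\x8b"
FNAME_OR_FCOMMENT = 0x18  # FNAME (0x08) | FCOMMENT (0x10) in the flag byte


def classify(head):
    """Classify the first bytes of a file along the lines of the new magic."""
    if len(head) < 4 or head[:2] != GZIP_MAGIC:
        return "not gzip-like"
    if head[3] & FNAME_OR_FCOMMENT:
        # A file name and/or comment follows the header, so text is legitimate.
        return "gzip compressed data"
    # No name or comment: readable text at offset 10 marks a foreign format.
    tag = head[10:13]
    if tag == b"MCD":
        return "Monu-Cad Drawing, Component or Font"
    if tag == b"GXD":
        return "General CADD, Drawing or Component"
    return "gzip compressed data"
```

Feeding it the first 16 or so bytes of a sample is enough, since no test looks past offset 12.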

The same idea applies to Monu-Cad files, based on the TrID definition
mcd-monu-cad.trid.xml. So a Monu-Cad Drawing like DEMO_DD01.MCD, a
Monu-Cad Component like HANDS96.MCC and a Monu-Cad Font like
MCALF020.FNT are identified by lines like
 >>10	string		MCD	Monu-Cad Drawing, Component or Font
 !:mime	application/octet-stream
 !:ext	mcc/mcd/fnt
The remaining samples in that branch are real gzip-compressed files,
described by a magic line like:
 >>10	default		x
Normally the file name extension "gz" is used, as in compressed man
pages like zlib.3.gz. But for a compressed tar archive like
microcode-20180312.tgz the extension "tgz" or "tpz" is used instead of
"tar.gz", with a MIME type like application/x-compressed-tar. And
VirtualBox Extension Packs like
Oracle_VM_VirtualBox_Extension_Pack-5.0.12-104815.vbox-extpack
use the extension "vbox-extpack" and the MIME type
"application/x-virtualbox-vbox-extpack". Unfortunately it is not
possible to distinguish these second-level subtypes by magic tests.
So the MIME type and extensions are given by lines like
 !:mime	application/gzip
 !:ext	gz/tgz/tpz/ipk/vbox-extpack/svgz

The other branch covers gzip-compressed files with a file name or
comment, found by testing the FNAME/FCOMMENT bits with a magic line like
 >3	byte&0x18	>0
No VirtualBox Extension Pack was found here. But for compressed
AbiWord documents (*.abw.gz) the extension "zabw" is also used, and for
compressed SVG the extension "svgz" is used. This is now expressed by
lines like:
 !:mime	application/gzip
 !:ext	gz/tgz/tpz/zabw/svgz

Because there are now 2 branches, for gzipped files without and with a
file name or comment, the part that displays information, starting with
the method, is put in a subroutine with magic lines like
 0	name				gzip-info
 >2	byte		<8		\b, reserved method
 >2	byte		>8		\b, unknown method
 ...
 >-4	lelong		x	\b, original size %u

Unfortunately the last line, with its negative offset, no longer works
inside a subroutine, and I get an error message like
	ERROR: line 114: non zero offset 1048572 at level 1
So I moved that line back to the right place in both branches.
According to the documentation the number displayed is not the original
file size; precisely, it is the length of the original
uncompressed data modulo 2^32. So the line becomes
 >>-4	ulelong		x	\b, original size modulo 2^32 %u
This can be seen with DVD images above 4 GiB in size like
KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso, where 159391744 is displayed
instead of 4454359040 (about 4.15 GiB).
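
The wrap-around can be checked with a few lines of Python. This is a minimal sketch assuming a single-member gzip stream, where the ISIZE field of RFC 1952 is the last four bytes, stored little-endian; the helper names stored_isize and isize_for are made up for the example.

```python
import gzip
import struct


def stored_isize(gz_bytes):
    """Return ISIZE: the last four bytes, little-endian, unsigned."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def isize_for(original_length):
    """Value RFC 1952 stores for an input of the given length."""
    return original_length % 2**32


# Small inputs round-trip exactly ...
sample = b"x" * 1000
print(stored_isize(gzip.compress(sample)))  # 1000
# ... but the 4454359040-byte KNOPPIX image wraps around.
print(isize_for(4454359040))  # 159391744
```

The second value is 4454359040 - 4294967296, which matches the number file prints for the KNOPPIX sample.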

With the old magic I got output like this for the samples:

1610-098.tgz:                                           gzip
	compressed data
	, last modified: Tue Sep 20 09:33:32 2016
	, from Unix
	, original size 1751040
BUILDINGEDGE.gxc:                                       gzip
	compressed data
	, from NTFS filesystem (NT)
	, original size 701
cz-qwerty.map.gz:                                       gzip
	compressed data, max compression
	, from Unix
	, original size 92009
DEMO_DD01.MCD:                                          gzip
	compressed data
	, from NTFS filesystem (NT)
	, original size 89938
fdos.tpz:                                               gzip
	compressed data, was "fdos.tar"
	, last modified: Wed Apr 10 23:04:28 2019, max compression
	, from FAT filesystem (MS-DOS, OS/2, NT)
	, original size 11776
gujin-2.8.3.tar.gz:                                     gzip
	compressed data, last modified: Wed Dec 08 19:55:01 2010
	, from Unix
	, original size 4474880
HANDS96.MCC:                                            gzip
	compressed data
	, from NTFS filesystem (NT)
	, original size 11764
HOCKETT-STPAUL-WRHSE.gxd:                               gzip
	compressed data
	, from NTFS filesystem (NT)
	, original size 9351432
kleopatra_splashscreen.svgz:                            gzip
	compressed data, was "kleo_1b_splashscreen.svg"
	, last modified: Thu Dec 17 14:42:31 2009
	, from Unix
	, original size 314947
KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso.gz:                 gzip
	compressed data, was "KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso"
	, last modified: Sat Sep 17 14:49:00 2016, max compression
	, from FAT filesystem (MS-DOS, OS/2, NT)
	, original size 159391744
lua-md5_1.2-1_i386_i486.ipk:                            gzip
	compressed data
	, from Unix
	, original size 10240
MCALF020.FNT:                                           gzip
	compressed data
	, from NTFS filesystem (NT)
	, original size 37406
microcode-20180312.tgz:                                 gzip
	compressed data
	, last modified: Mon Mar 12 18:24:37 2018
	, from Unix
	, original size 6737920
NConvert-linux.tgz:                                     gzip
	compressed data
	, last modified: Thu Oct 19 09:14:40 2017
	, from Unix
	, original size 3768320
Oracle_VM_VirtualBox_Extension_Pack-6.0.4.vbox-extpack: gzip
	compressed data, max compression
	, from TOPS/20
	, original size 91578880
terminfo.5.gz:                                          gzip
	compressed data, was "man77732"
	, last modified: Mon Sep 18 05:18:46 2017
	, from Unix
	, original size 100973
test7abiword.zabw:                                      gzip
	compressed data, was "test6abiword.abw"
	, last modified: Mon Apr 08 00:21:22 2019
	, from Unix
	, original size 78755
TESTTEXT.TGZ:                                           gzip
	compressed data, was "TESTTEXT.TAR"
	, last modified: Mon Apr 08 03:33:11 2019, max speed
	, from NTFS filesystem (NT)
	, original size 1536
zlib.3.gz:                                              gzip
	compressed data, max compression
	, from Unix
	, original size 4477
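
Most of what the listing above prints comes from fixed positions in the 10-byte header defined by RFC 1952: method at offset 2, flags at 3, modification time at 4, extra flags at 8 and the originating system at 9. The following Python sketch decodes those fields. The name describe_header is hypothetical, and the two lookup tables only cover the values that occur in these samples, with wording borrowed from the output above.

```python
import struct

# Only the codes seen in the samples; RFC 1952 defines more.
OS_NAMES = {
    0: "FAT filesystem (MS-DOS, OS/2, NT)",
    3: "Unix",
    10: "TOPS/20",
    11: "NTFS filesystem (NT)",
}
XFL_NAMES = {2: "max compression", 4: "max speed"}


def describe_header(head):
    """Decode the fixed 10-byte gzip header into a small dict."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack(
        "<2sBBIBB", head[:10]
    )
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip header")
    return {
        "method": method,                # 8 means deflate
        "has_name": bool(flags & 0x08),  # FNAME bit
        "mtime": mtime,                  # Unix time; 0 means not available
        "extra": XFL_NAMES.get(xfl),
        "os": OS_NAMES.get(os_code, "unknown (%d)" % os_code),
    }
```

A header with extra flags 2, OS code 10 and a zero time stamp decodes to "max compression" and "TOPS/20" with no modification time, which is the combination shown for the VirtualBox Extension Pack.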

After also applying the above-mentioned modifications with the patch
file-5.36-compress-gzip.diff, I now get more precise output for all
inspected gzip-like samples:
1610-098.tgz:                                           gzip
	compressed data
	, last modified: Tue Sep 20 09:33:32 2016
	, from Unix
	, original size modulo 2^32 1751040
BUILDINGEDGE.gxc:
	General CADD, Drawing or Component
cz-qwerty.map.gz:                                       gzip
	compressed data, max compression
	, from Unix
	, original size modulo 2^32 92009
DEMO_DD01.MCD:
	Monu-Cad Drawing, Component or Font
fdos.tpz:                                               gzip
	compressed data, was "fdos.tar"
	, last modified: Wed Apr 10 23:04:28 2019, max compression
	, from FAT filesystem (MS-DOS, OS/2, NT)
	, original size modulo 2^32 11776
gujin-2.8.3.tar.gz:                                     gzip
	compressed data
	, last modified: Wed Dec 08 19:55:01 2010
	, from Unix
	, original size modulo 2^32 4474880
HANDS96.MCC:
	Monu-Cad Drawing, Component or Font
HOCKETT-STPAUL-WRHSE.gxd:
	General CADD, Drawing or Component
kleopatra_splashscreen.svgz:                            gzip
	compressed data, was "kleo_1b_splashscreen.svg"
	, last modified: Thu Dec 17 14:42:31 2009
	, from Unix
	, original size modulo 2^32 314947
KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso.gz:                 gzip
	compressed data, was "KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso"
	, last modified: Sat Sep 17 14:49:00 2016, max compression
	, from FAT filesystem (MS-DOS, OS/2, NT)
	, original size modulo 2^32 159391744
lua-md5_1.2-1_i386_i486.ipk:                            gzip
	compressed data
	, from Unix
	, original size modulo 2^32 10240
MCALF020.FNT:
	Monu-Cad Drawing, Component or Font
microcode-20180312.tgz:                                 gzip
	compressed data
	, last modified: Mon Mar 12 18:24:37 2018
	, from Unix
	, original size modulo 2^32 6737920
NConvert-linux.tgz:                                     gzip
	compressed data
	, last modified: Thu Oct 19 09:14:40 2017
	, from Unix
	, original size modulo 2^32 3768320
Oracle_VM_VirtualBox_Extension_Pack-6.0.4.vbox-extpack: gzip
	compressed data, max compression
	, from TOPS/20
	, original size modulo 2^32 91578880
terminfo.5.gz:                                          gzip
	compressed data, was "man77732"
	, last modified: Mon Sep 18 05:18:46 2017
	, from Unix
	, original size modulo 2^32 100973
test7abiword.zabw:                                      gzip
	compressed data, was "test6abiword.abw"
	, last modified: Mon Apr 08 00:21:22 2019
	, from Unix
	, original size modulo 2^32 78755
TESTTEXT.TGZ:                                           gzip
	compressed data, was "TESTTEXT.TAR"
	, last modified: Mon Apr 08 03:33:11 2019, max speed
	, from NTFS filesystem (NT)
	, original size modulo 2^32 1536
zlib.3.gz:                                              gzip
	compressed data, max compression
	, from Unix
	, original size modulo 2^32 4477

Now the correct file name extension is also shown by the corresponding
command option (--extension), like:
1610-098.tgz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
BUILDINGEDGE.gxc:
	gxc/gxd
cz-qwerty.map.gz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
DEMO_DD01.MCD:
	mcc/mcd/fnt
fdos.tpz:
	gz/tgz/tpz/zabw/svgz
gujin-2.8.3.tar.gz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
HANDS96.MCC:
	mcc/mcd/fnt
HOCKETT-STPAUL-WRHSE.gxd:
	gxc/gxd
kleopatra_splashscreen.svgz:
	gz/tgz/tpz/zabw/svgz
KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso.gz:
	gz/tgz/tpz/zabw/svgz
lua-md5_1.2-1_i386_i486.ipk:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
MCALF020.FNT:
	mcc/mcd/fnt
microcode-20180312.tgz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
NConvert-linux.tgz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
Oracle_VM_VirtualBox_Extension_Pack-6.0.4.vbox-extpack:
	gz/tgz/tpz/ipk/vbox-extpack/svgz
terminfo.5.gz:
	gz/tgz/tpz/zabw/svgz
test7abiword.zabw:
	gz/tgz/tpz/zabw/svgz
TESTTEXT.TGZ:
	gz/tgz/tpz/zabw/svgz
zlib.3.gz:
	gz/tgz/tpz/ipk/vbox-extpack/svgz


Furthermore, the correct MIME types are now also shown by the --mime-type
option, like

1610-098.tgz:
	application/gzip; charset=binary
BUILDINGEDGE.gxc:
	application/octet-stream; charset=binary
cz-qwerty.map.gz:
	application/gzip; charset=binary
DEMO_DD01.MCD:
	application/octet-stream; charset=binary
fdos.tpz:
	application/gzip; charset=binary
gujin-2.8.3.tar.gz:
	application/gzip; charset=binary
HANDS96.MCC:
	application/octet-stream; charset=binary
HOCKETT-STPAUL-WRHSE.gxd:
	application/octet-stream; charset=binary
kleopatra_splashscreen.svgz:
	application/gzip; charset=binary
KNOPPIX_V7.6.1DVD-2016-01-16-DE.iso.gz:
	application/gzip; charset=binary
lua-md5_1.2-1_i386_i486.ipk:
	application/gzip; charset=binary
MCALF020.FNT:
	application/octet-stream; charset=binary
microcode-20180312.tgz:
	application/gzip; charset=binary
NConvert-linux.tgz:
	application/gzip; charset=binary
Oracle_VM_VirtualBox_Extension_Pack-6.0.4.vbox-extpack:
	application/gzip; charset=binary
terminfo.5.gz:
	application/gzip; charset=binary
test7abiword.zabw:
	application/gzip; charset=binary
TESTTEXT.TGZ:
	application/gzip; charset=binary
zlib.3.gz:
	application/gzip; charset=binary

I hope my diff file can be applied in a future version of the file utility.

I have done my best to describe the different gzip-compressed files, but
some are still missing. So I mention these in TODO lines:
# FBR	Blueberry FlashBack screen Record
# KPR	KOffice/Calligra KPresenter
# KPT	KOffice/Calligra KPresenter template?
# SAV	Diggles Saved Game File
# SAV	FarCry (demo) saved game
# DAT	ZOAGZIP game data format

With best wishes
Jörg Jenderek
- --
Jörg Jenderek

-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCXLZLRwAKCRCv8rHJQhrU
1pb8AJ4nefBJn9yeO8iXjGgDhZhVKn37eQCgrf89CnBBheqgLoLO9MlylzD8ru4=
=4tHb
-----END PGP SIGNATURE-----
-------------- next part --------------
--- file-5.36/magic/Magdir/compress.old	2018-09-01 14:52:39 +0000
+++ file-5.36/magic/Magdir/compress	2019-04-16 14:29:01 +0000
@@ -18,2 +18,5 @@
 # gzip (GNU zip, not to be confused with Info-ZIP or PKWARE zip archiver)
+# URL: https://en.wikipedia.org/wiki/Gzip
+# Reference: https://tools.ietf.org/html/rfc1952
+# Update: Joerg Jenderek, Apr 2019
 #   Edited by Chris Chittleborough <cchittleborough at yahoo.com.au>, March 2002
@@ -22,5 +25,67 @@
 #         other than 8 ("deflate", the only method defined in RFC 1952).
-0       string          \037\213        gzip compressed data
-!:mime	application/x-gzip
-!:strength * 2
+# Note: find defs -iname '*.trid.xml' -exec grep -q '<Bytes>1F8B08' {} \; -ls
+# TODO:
+# FBR	Blueberry FlashBack screen Record	https://www.flashbackrecorder.com/
+# KPR	KOffice/Calligra KPresenter		application/x-kpresenter
+# KPT	KOffice/Calligra KPresenter template?	application/x-kpresenter
+# SAV	Diggles Saved Game File			http://www.innonics.com
+# SAV	FarCry (demo) saved game		http://www.farcry-thegame.com
+# DAT	ZOAGZIP game data format		http://en.wikipedia.org/wiki/SD_Gundam_Capsule_Fighter
+0       string          \037\213
+# to display gzip compressed (strength=100=2*50) before other (strength=50)?
+#!:strength * 2
+# no FNAME and FCOMMENT bit implies no file name/comment. That means only binary
+>3	byte&0x18	=0
+# For binary gzipped no ASCII text should occur
+#	mcd-monu-cad.trid.xml
+>>10	string		MCD			Monu-Cad Drawing, Component or Font
+#>>36	string		Created\ with\ MONU-CAD	
+#!:mime	application/octet-stream
+# http://fileformats.archiveteam.org/wiki/Monu-CAD
+#	http://www.monucad.com/downloads/FullDemo-2005.EXE
+#	/HANDS96.MCC	Component
+#	/DEMO_DD01.MCD	Drawing
+#	/MCALF020.FNT	Font
+!:ext	mcc/mcd/fnt
+# http://www.generalcadd.com
+>>10	string		GXD			General CADD, Drawing or Component
+#!:mime	application/octet-stream
+#	/gxc/BUILDINGEDGE.gxc			Component
+#	/gxd/HOCKETT-STPAUL-WRHSE.gxd		Drawing
+#	/gxd/POWERLAND-MILL-ADD-11.gxd		Drawing		v9.1.06
+!:ext	gxc/gxd
+#>>>13	ubyte		0			\b, version 0
+>>>13	string		09			\b, version 9
+# other gzipped binary like gzipped tar, VirtualBox extension package,...
+>>10	default		x		gzip compressed data
+>>>0	use	gzip-info
+# size of the original (uncompressed) input data modulo 2^32
+>>>-4	ulelong		x		\b, original size modulo 2^32 %u
+# gzipped TAR or VirtualBox extension package
+!:mime	application/gzip
+#!:mime	application/x-compressed-tar
+#!:mime	application/x-virtualbox-vbox-extpack
+# https://www.w3.org/TR/SVG/mimereg.html
+#!:mime	image/svg+xml-compressed
+#	zlib.3.gz
+#	microcode-20180312.tgz
+#	tpz same as tgz
+#	lua-md5_1.2-1_i386_i486.ipk	https://en.wikipedia.org/wiki/Opkg
+#	Oracle_VM_VirtualBox_Extension_Pack-5.0.12-104815.vbox-extpack
+!:ext	gz/tgz/tpz/ipk/vbox-extpack/svgz
+# FNAME/FCOMMENT bit implies file name/comment as iso-8859-1 text
+>3	byte&0x18	>0		gzip compressed data
+!:mime	application/gzip
+# gzipped tar, gzipped Abiword document
+#!:mime	application/x-compressed-tar
+#!:mime	application/x-abiword-compressed
+#!:mime	image/svg+xml-compressed
+#	kleopatra_splashscreen.svgz	gzipped .svg
+!:ext	gz/tgz/tpz/zabw/svgz
+>>0	use	gzip-info
+# size of the original (uncompressed) input data modulo 2^32
+>>-4	ulelong		x		\b, original size modulo 2^32 %u
+#	display information of gzip compressed files
+0	name				gzip-info
+#>2	byte		x		THIS iS GZIP
 >2	byte		<8		\b, reserved method
@@ -51,3 +116,5 @@
 >9	byte		=0x0D		\b, from Acorn RISCOS
->-4	lelong		x		\b, original size %u
+# size of the original (uncompressed) input data modulo 2^32
+#>-4	ulelong		x		\b, original size modulo 2^32 %u
+#ERROR: line 114: non zero offset 1048572 at level 1
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.36-compress-gzip.diff.sig
Type: application/octet-stream
Size: 95 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20190416/e31623dc/attachment.obj>

