[File] [PATCH] Magdir/images many Computer Graphics Metafile MISSED

Jörg Jenderek joerg.jen.der.ek at gmx.net
Tue Oct 25 21:18:32 UTC 2022


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

some days ago I handled some Greenstreet Art drawings. As a cross
check I looked for other images on the install medium. Some of these
images have the file name extension CGM.

When running file command version 5.43 on such CGM examples I get
output like:

MS.CGM:                         data
SAB00012.CGM:                   BS image, Version 23610,
				Quantization 19255,
				(Decompresses to 16128 words)
SKYLINE.CGM:                    data
TAB00015.CGM:                   data
TIGER.CGM:                      data
ZAA00006.CGM:                   data
allprims.cgm:                   ASCII text
cdraw2020-cgm-3.cgm:            data
cdraw2020-cgm-webcgm1.0.cgm:    data
derby.log:                      ASCII text
fmt-301-signature-id-362.cgm:   ISO-8859 text
fmt-301-signature-id-363.cgm:   ISO-8859 text
fmt-302-signature-id-366.cgm:   ISO-8859 text
fmt-302-signature-id-367.cgm:   ISO-8859 text
fmt-303-signature-id-368.cgm:   data
fmt-304-signature-id-369.cgm:   data
fmt-305-signature-id-370.cgm:   data
fmt-306-signature-id-371.cgm:   data
input.cgm:                      data
ofz-ubsan-2.cgm:                data
ofz35504-ubsan-1.cgm:           data
ofz36348-ubsan-1.cgm:           data
ofz9707-slow-1.cgm:             data
ooo6420-1.cgm:                  clear text Computer Graphics Metafile
recurse-1.cgm:                  data
x-fmt-142-signature-id-353.cgm: ISO-8859 text
x-fmt-142-signature-id-354.cgm: ISO-8859 text

Furthermore only the generic mime type application/octet-stream or
text/plain is shown with the -i option. With option --extension only
the 3 byte sequence ??? is shown.

For comparison I ran the file format identification utility TrID
(see https://mark0.net/soft-trid-e.html). It identifies the samples
described by the file command as "clear text Computer Graphics
Metafile" as "Computer Graphics Metafile (Clear Text)" by
cgm-ct.trid.xml with mime type image/cgm and CGM extension. It also
describes more examples, like allprims.cgm, as such CGM images.
Furthermore it describes many undetected samples like SAB00012.CGM
as variant "Computer Graphics Metafile (binary)" by cgm-bin.trid.xml
(see appended trid-v-cgm.txt.gz).

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). Like TrID it
distinguishes 2 variants. The first is named with the appended
phrase "ASCII", the second with the appended phrase "(Binary)". It
also uses the same extension and mime type as TrID. Furthermore it
does sub classification. It lists sub classes called version 1, 3
and 4 under PUID x-fmt/142, fmt/301 and fmt/302 (see appended
droid-cgm.csv.gz).

Luckily, with the information given by TrID and DROID I found a page
about CGM on the file formats archive team web site and on
Wikipedia. This information is now expressed inside Magdir/images by
comment lines like:
# URL: 	http://fileformats.archiveteam.org/wiki/CGM
#	https://en.wikipedia.org/wiki/Computer_Graphics_Metafile
# Ref.:	http://mark0.net/download/triddefs_xml.7z
#	defs/c/cgm-ct.trid.xml
#	defs/c/cgm-bin.trid.xml
#	http://standards.iso.org/ittf/PubliclyAvailableStandards/
#	c032381_ISO_IEC_8632-4_1999(E).zip
#	c032380_ISO_IEC_8632-3_1999(E).zip

The detection is currently done inside Magdir/images by a line like:
0	string		BEGMF	clear text Computer Graphics Metafile
Here the BEGIN METAFILE command is only matched by the upper case
keyword BEGMF. If I understand the documentation right (Part 4:
Clear text encoding) then variations in case are allowed. So there
exist examples where the keyword is not all upper case, like
allprims.cgm found on telparia.com, where BegMF is used. According
to TrID there exist samples where more letters are in lower case;
only the letters B and M are assumed to be upper case. DROID is more
relaxed and assumes only B to be always upper case, but also says
that any letter in an element name can be in upper or lower case.
So, adopting the DROID rules, the line now becomes:
0	string/c	begmf
With this relaxed magic some DROID samples (like
fmt-301-signature-id-359.cgm fmt-301-signature-id-361.cgm
fmt-302-signature-id-364.cgm fmt-302-signature-id-365.cgm
x-fmt-142-signature-id-350.cgm x-fmt-142-signature-id-351.cgm) would
now also be described as CGM images. In reality these samples
contain only some leading bytes of CGM images, which DROID uses as
help to recognise such images. So when you try to open these files
with LibreOffice or XnView no graphic image is shown. So these
samples must be skipped.

According to the specification the keyword is followed by an
optional separating character (<SPACE | CARRIAGE RETURN | LINEFEED |
HORIZONTAL TAB | VERTICAL TAB | FORMFEED>) followed by <SF:NAME>. So
for real samples like ooo6420-1.cgm the first text line looks like:
BEGMF 'xfig-fig012228';
whereas for a DROID sample like fmt-302-signature-id-366.cgm it
looks like:
bEGMF««««««««««««««««««««MFVeRSION«4ENDMF;
So in DROID samples the keyword is followed by some nil bytes or
fill bytes with value 0xAB. To skip such DROID samples the next
lines become like:
 >5	short	!0
 >>5	short	!0xABAB	clear text Computer Graphics Metafile
 !:mime	image/cgm
 !:ext	cgm
 >>>5	string		x		%s
As a cross check I also show <SF:NAME>. Afterwards I do sub
classification by looking for the command METAFILE VERSION as done
by DROID. It is followed by 1 separating character SOFTSEP, which is
followed by one version digit (1=Version 1, 2=Version 2, 3=Version
3, 4=Version 4). So this is done by lines like:
 >>>2	search/128/c	mfversion	\b, version
 #>>>>&0	ubyte	x		SOFTSEP=%#x
 >>>>&1		ubyte	x		%c
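
The clear text tests above can be tried outside of libmagic with a
minimal Python sketch. The function name classify_clear_text is
hypothetical and the 128 byte search range is only approximated, so
this is an illustration of the logic and not what file really does:

```python
def classify_clear_text(data: bytes):
    """Return (first_line, version_char) for a plausible clear text CGM.

    Returns None when the keyword is missing or when the bytes after
    the keyword look like a DROID signature fragment.
    """
    # BEGMF in any letter case, similar to "0 string/c begmf"
    if len(data) < 7 or data[:5].lower() != b"begmf":
        return None
    # skip fragments that continue with nil bytes or 0xAB fill bytes
    if data[5:7] in (b"\x00\x00", b"\xab\xab"):
        return None
    # rest of the first text line, which carries <SF:NAME>
    first_line = data[5:].splitlines()[0].decode("latin-1").strip()
    # look for MFVERSION near the start, in any letter case
    keyword = b"mfversion"
    window = data[2:2 + 128 + len(keyword)].lower()
    hit = window.find(keyword)
    version = None
    if hit >= 0:
        # one SOFTSEP character after the keyword, then the version
        pos = 2 + hit + len(keyword) + 1
        if pos < len(data):
            version = chr(data[pos])
    return first_line, version


sample = b"BEGMF 'xfig-fig012228';\nMFVERSION 1;\n"
print(classify_clear_text(sample))
```

A fragment like the DROID sample shown above (keyword followed by
0xAB fill bytes) gives None with this sketch.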

According to "ISO 8632-3 Part 3: Binary Encoding" the words must be
considered as bit fields. According to that specification a binary
CGM starts with the command BEGIN METAFILE (that is element class 0
and ID 1 and a "random" parameter length). So that is binary
CCCC0000001PPPPP with class bits CCCC=0000 and parameter length bits
PPPPP. So to check for that pattern I mask the first word with
0xFFE0 to ignore the parameter length bits and compare it with the
value 0x0020 as described in the Encyclopedia of Graphics File
Formats. So the first line for the binary variant looks like:
 0	ubeshort&0xFFe0	0x0020
Unfortunately only 11 bits instead of 32 are used for detection. So
additional tests must be used.
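
The bit layout described above (4 bits element class, 7 bits element
ID, 5 bits parameter length) can be checked with a small Python
sketch. The function names are hypothetical and only serve as
illustration:

```python
import struct


def decode_command_header(word: int):
    """Split a 16 bit binary CGM command header into its 3 fields."""
    element_class = (word >> 12) & 0x0F   # top 4 bits
    element_id = (word >> 5) & 0x7F       # middle 7 bits
    param_len = word & 0x1F               # low 5 bits, 31 means long form
    return element_class, element_id, param_len


def looks_like_begin_metafile(data: bytes) -> bool:
    """Mimic the test "0 ubeshort&0xFFe0 0x0020"."""
    if len(data) < 2:
        return False
    (word,) = struct.unpack(">H", data[:2])
    return (word & 0xFFE0) == 0x0020


print(decode_command_header(0x0020))
print(looks_like_begin_metafile(b"\x00\x3f\x00\x38"))
```

The mask keeps class and ID and drops the parameter length, so every
first word from 0x0020 to 0x003F passes this test.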

DROID looks for the value 0040 in the last word. According to the
specification that is the command END METAFILE (element class 0 and
ID 2 and 0 parameter) that is binary 0000iiiii1iPPPPP. Unfortunately
I can not use this as a reliable test because the file command only
scans the first few MiB of examples. So possibly bigger files are
missed. So I use this only as an additional hint at the end by a
line like:
 >>>-2	ubeshort	!0x0040		\b, NOT_FOUND_END_METAFILE

The example SAB00012.CGM is described as version 1 sub class of the
binary variant by DROID with PUID fmt/303. The corresponding DROID
profile searches for the hexadecimal value 10220001. According to
the specification the METAFILE VERSION command (element class 1 and
id 1 and parameter P1 with length 2) is binary 0001iiiiii1PPP1P or
hexadecimal 1022. Afterwards comes parameter P1. Here this parameter
is the 2 byte version value with range 1-4 (or hexadecimal 0001 to
0004). In most of my examples the second command was METAFILE
VERSION (exception EAF00010.CGM 'HiJaak 2'). The worst case was
argentin.cgm with parameter length 56. So the additional test lines
look like:
 >>2 search/64/b \x10\x22\0 Computer Graphics Metafile, version
 !:mime	image/cgm
 !:ext	cgm
 >>&-1	ubeshort	x		%u
Now 35 bits (11+24) are used for recognition. The last line also
shows the "low" 2 byte metafile version. If these tests are not
sufficient then the version value could be checked to be in range
1-4. By these lines the "strange" ofz-ubsan-2.cgm is also skipped.
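
As an illustration, the search for METAFILE VERSION can be sketched
in Python. The function name find_metafile_version is hypothetical.
The default range of 64 follows the line quoted above (the appended
patch uses 128), and the optional range check 1-4 is included as an
assumption:

```python
import struct


def find_metafile_version(data: bytes, search_range: int = 64):
    """Return the metafile version found after bytes 10 22 00, or None."""
    pattern = b"\x10\x22\x00"
    # start at offset 2 like the magic line, with a limited range
    window_end = 2 + search_range + len(pattern)
    hit = data.find(pattern, 2, window_end)
    if hit < 0 or hit + 4 > len(data):
        return None
    # the 2 byte big endian version starts at the last pattern byte
    (version,) = struct.unpack(">H", data[hit + 2:hit + 4])
    # optional extra check: only versions 1 to 4 are defined
    if not 1 <= version <= 4:
        return None
    return version


# BEGIN METAFILE (short form, length 11), 'sahara.cgm', padding byte,
# then METAFILE VERSION with version 1
sample = b"\x00\x2b\x0asahara.cgm\x00\x10\x22\x00\x01"
print(find_metafile_version(sample))
```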

Unfortunately at this point some DROID samples (like
fmt-303-signature-id-368.cgm fmt-304-signature-id-369.cgm
fmt-305-signature-id-370.cgm fmt-306-signature-id-371.cgm) are still
described as CGM. In reality these samples contain only some leading
bytes (28) of CGM images, which DROID uses as help to recognise such
images. So when you try to open these files with LibreOffice or
XnView no graphic image is shown. Assuming that real CGM examples
contain more commands I skip these samples by checking for the
existence of a byte at offset 28. So the second test line just looks
like:
 >28	ubyte		x

Afterwards I create 4 branches to show SF:NAME and for control
reasons: one case each for "short" or "long" parameter form and for
even or odd parameter length. If the low 5 bits of the first word
are in range 0-30 it is a "short" parameter and the length is given
by these bits. If this value is 31 then the real parameter length is
stored in the next word. Afterwards comes the SF:NAME as Pascal
string. Because a command starts on a word boundary, for odd
parameter length one nil padding byte comes after the string.
Afterwards comes the next command. In most cases this is METAFILE
VERSION (0x1022). So I only show a different second command and
SF:NAME by lines which look for the "short" variant like:

 >>>0	ubeshort&0x001F	<31		\b, parameter length %u
 #>>>>2	ubyte		x		\b, %u BYTES (SHORT)
 >>>>2	pstring		>\0		'%s'
 >>>>0  	ubeshort&0x0001	=0
 >>>>>(2.b+3)	ubeshort !0x1022 \b, 2nd command %#4.4x (short even)
 >>>>0  	ubeshort&0x0001	=1
 #>>>>>(2.b+3)	ubyte	!0	\b, PADDING %#x
 >>>>>(2.b+4)	ubeshort !0x1022 \b, 2nd command %#4.4x (short odd)

After applying the above mentioned modifications by patch
file-5.43-images-cgm.diff most of my inspected CGM images are now
described with more details. This now looks like:

MS.CGM:                         binary Computer Graphics Metafile
				, parameter length 40 (long)
				'Micrografx CGM Translator
				, version 4.00'
				, 2nd command 0x1054 (long even)
SAB00012.CGM:                   binary Computer Graphics Metafile
				, parameter length 56 (long)
				'K:\PROJECTS\GRAPHICS\DWKS3.5
				\CLIPART\FLAGS\argentin.cgm'
SKYLINE.CGM:                    binary Computer Graphics Metafile
				, parameter length 2,
				2nd command 0x0010 (short even)
TAB00015.CGM:                   binary Computer Graphics Metafile
				, parameter length 11
				'sahara.cgm'
TIGER.CGM:                      binary Computer Graphics Metafile
				, parameter length 14
				'B:\TIGER.CGM'
				, 2nd command 0x0010 (short even)
ZAA00006.CGM:                   binary Computer Graphics Metafile
				, parameter length 30
				'MASTERCLIPS--Art Of Business '
allprims.cgm:                   clear text Computer Graphics Metafile
				"CTN-01Id";
				% 91-10-03 11:00 %
cdraw2020-cgm-3.cgm:            binary Computer Graphics Metafile
				, version 3
				, parameter length 14
				'cdraw2020-cgm'
cdraw2020-cgm-webcgm1.0.cgm:    binary Computer Graphics Metafile
				, version 4
				, parameter length 18
				'\033%/Icdraw2020-cgm'
fmt-301-signature-id-362.cgm:   ISO-8859 text,
fmt-301-signature-id-363.cgm:   ISO-8859 text,
fmt-302-signature-id-366.cgm:   ISO-8859 text,
fmt-302-signature-id-367.cgm:   ISO-8859 text,
fmt-303-signature-id-368.cgm:   data
fmt-304-signature-id-369.cgm:   data
fmt-305-signature-id-370.cgm:   data
fmt-306-signature-id-371.cgm:   data
input.cgm:                      data
ofz-ubsan-2.cgm:                data
ofz35504-ubsan-1.cgm:           data
ofz36348-ubsan-1.cgm:           data
ofz9707-slow-1.cgm:             data
ooo6420-1.cgm:                  clear text Computer Graphics Metafile
				'xfig-fig012228';
recurse-1.cgm:                  data
x-fmt-142-signature-id-353.cgm: ISO-8859 text,
x-fmt-142-signature-id-354.cgm: ISO-8859 text,

I hope my diff file can be applied in a future version of the file
utility.

Unfortunately a few of my inspected samples (like input.cgm
ofz35504-ubsan-1.cgm ofz36348-ubsan-1.cgm
ofz9707-slow-1.cgm recurse-1.cgm) do not pass the first binary test.
These samples are also not recognized by XnView. So I do not know
what is wrong there. In the Encyclopedia of Graphics File Formats it
is written that we get the impression that it is actually legal to
add padding characters (nulls) to the beginning of the file. Maybe a
graphics expert can check, improve and correct the magic lines for
such "bad/strange" examples.

Unfortunately every binary Computer Graphics Metafile (like
SAB00012.CGM) with long parameter length 56 (=38h) is also described
by a line like:
2		uleshort	0x3800		BS image,
So the 2 byte magic there is too weak. Unfortunately I do not know
what format "BS image" is. So maybe an expert for that format can
improve that magic.
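
The collision can be reproduced with a short Python sketch. The
function names are hypothetical. The sample bytes are an assumption
built from the long form header described above and the start of the
SF:NAME shown for SAB00012.CGM (length byte 55 followed by K:\):

```python
import struct


def bs_magic_matches(data: bytes) -> bool:
    """Mimic the test "2 uleshort 0x3800" of the BS image entry."""
    if len(data) < 4:
        return False
    (value,) = struct.unpack("<H", data[2:4])
    return value == 0x3800


def bs_fields(data: bytes):
    """Read the 3 little endian words printed by the BS image entry."""
    (words,) = struct.unpack("<H", data[0:2])
    (quantization,) = struct.unpack("<H", data[4:6])
    (version,) = struct.unpack("<H", data[6:8])
    return version, quantization, words


# BEGIN METAFILE in long form (0x003F), parameter length 56 (0x0038),
# then the Pascal string length 55 and the first 3 name characters
cgm_start = struct.pack(">HH", 0x003F, 56) + bytes([55]) + b"K:\\"
print(bs_magic_matches(cgm_start))
print(bs_fields(cgm_start))
```

The big endian length word 0x0038 is stored as bytes 00 38, which
read as little endian give 0x3800. With these bytes the sketch also
gives the same Version, Quantization and word count as in the first
output listing for SAB00012.CGM.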

With best wishes,
Jörg Jenderek
- --
Jörg Jenderek
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCY1hSqAAKCRCv8rHJQhrU
1pgGAJ9dYGzfyVG8omEuCr8BdiLfTUc0GACgoggaiIcSYm7FJnvDMlOjJvT+N8U=
=Mqsa
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-cgm.txt.gz
Type: application/x-gzip
Size: 940 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20221025/8dbe07e2/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-cgm.csv.gz
Type: application/x-gzip
Size: 959 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20221025/8dbe07e2/attachment-0003.bin>
-------------- next part --------------
--- file-5.43/magic/Magdir/images.old	2022-09-13 20:05:39.000000000 +0200
+++ file-5.43/magic/Magdir/images	2022-10-25 22:48:58.343380700 +0200
@@ -652,7 +652,86 @@
 >24	string		SunGKS		\b, SunGKS
 
 # CGM image files
-0	string		BEGMF		clear text Computer Graphics Metafile
+# Update:	Joerg Jenderek
+# URL: 		http://fileformats.archiveteam.org/wiki/CGM
+#		https://en.wikipedia.org/wiki/Computer_Graphics_Metafile
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/c/cgm-ct.trid.xml
+#		http://standards.iso.org/ittf/PubliclyAvailableStandards/c032381_ISO_IEC_8632-4_1999(E).zip
+# Note:		called "Computer Graphics Metafile (Clear Text)" by TrID and
+#		"Computer Graphics Metafile ASCII" by DROID or CGM by XnView
+#		verified by LibreOffice and partly by XnView `nconvert -info *.CGM`
+# According to TrID only letter B and M are always upcased and by DROID often only B is upcased for command BEGIN METAFILE
+0	string/c	begmf
+# skip SOME DROID fmt-301-signature-id-359.cgm fmt-301-signature-id-361.cgm fmt-302-signature-id-364.cgm
+# fmt-302-signature-id-365.cgm x-fmt-142-signature-id-350.cgm x-fmt-142-signature-id-351.cgm
+>5	short		!0
+# skip other versions of DROID fmt-301-signature-id-359.cgm fmt-301-signature-id-361.cgm fmt-302-signature-id-364.cgm
+# fmt-302-signature-id-365.cgm x-fmt-142-signature-id-350.cgm x-fmt-142-signature-id-351.cgm
+>>5	short		!0xABab		clear text Computer Graphics Metafile
+# https://reposcope.com/mimetype/image/cgm
+!:mime	image/cgm
+!:ext	cgm
+# SF:NAME like: 'metafile example';
+>>>5	string		x		%s
+# look for command METAFILE VERSION (MFVERSION <SOFTSEP> <I:VERSION>)
+>>>2	search/128/c	mfversion
+#>>>>&0	ubyte		x		SOFTSEP=%#x
+# version like: 1 3 4
+>>>>&1	ubyte		>0x31		\b, version %c
+# Summary:	Computer Graphics Metafile (binary)
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/c/cgm-bin.trid.xml
+#		https://standards.iso.org/ittf/PubliclyAvailableStandards/c032380_ISO_IEC_8632-3_1999(E).zip
+# Note:		called "Computer Graphics Metafile (binary)" by TrID and DROID or CGM by XnView
+#		verified by LibreOffice and partly by XnView `nconvert -info *.CGM`
+# look for BEGIN METAFILE (element Class 0 and ID 1 and "random" Parameter) that is binary C C C C 0 0 0 0 0 0 1 P P P P P
+0	ubeshort&0xFFe0		0x0020
+# skip SOME DROID fmt-303-signature-id-368.cgm fmt-304-signature-id-369.cgm fmt-305-signature-id-370.cgm fmt-306-signature-id-371.cgm
+# with containing only 28 bytes
+>28	ubyte			x
+# look for METAFILE VERSION (element class 1 and id 1 and parameter P1 with length 2) that is binary 0 0 0 1 i i i i i i 1 P P P 1 P
+# with "low" version; 2nd worst case argentin.cgm with parameter length 56
+# worst MS.CGM
+#>>2	search/73/b		\x10\x22\0	binary Computer Graphics Metafile
+>>2	search/128/b		\x10\x22\0	binary Computer Graphics Metafile
+!:mime	image/cgm
+!:ext	cgm
+# metafile 2 byte version number like: 1 (most) 2 3 4
+>>>&-1	ubeshort		>1		\b, version %u
+# length number of 1st parameter octets in range 0 to 30 implies short command
+>>>0	ubeshort&0x001F		<31		\b, parameter length %u
+# length of string like: 8 9 10 11 12 29
+#>>>>2		ubyte		x		\b, %u BYTES (SHORT)
+# string like: 'HiJaak 2' 'Example 1' 'sahara.cgm' 'MASTERCLIPS--Art Of Business '
+>>>>2		pstring		>\0		'%s'
+# after 1st short command with even parameter length comes 2nd command like: 1022h 0010h (EAF00010.CGM 'HiJaak 2' FLOPPY2.CGM TIGER.CGM 'B:\TIGER.CGM')
+>>>>0  		ubeshort&0x0001	=0
+>>>>>(2.b+3)	ubeshort	!0x1022		\b, 2nd command %#4.4x (short even)
+# after 1st short command with odd parameter length comes nil padding byte followed 2nd command like: 1022h
+>>>>0  		ubeshort&0x0001	=1
+#>>>>>(2.b+3)	ubyte		!0		\b, PADDING %#x
+>>>>>(2.b+4)	ubeshort	!0x1022		\b, 2nd command %#4.4x (short odd)
+# 11111 binary (decimal 31) in the parameter field indicates that the command is in long-form
+>>>0	ubeshort&0x001F		=0x1F
+# bit 15 is partition flag with 1 for 'not-last' partition and 0 for 'last' partition
+>>>>2  		ubeshort&0x8000	!0		\b, partition flag %#4.4x
+# bits 0 to 14 is parameter list length; the number of following parameter octets; range 0 to 32767
+# length of 1st long command parameter like: 53
+>>>>2  		ubeshort&0x7Fff	x		\b, parameter length %u (long)
+# The two header words are then followed by length of 1st string like: 52
+#>>>>4		ubyte		x		\b, %u BYTES
+# string like: 'K:\PROJECTS\GRAPHICS\DWKS3.5\CLIPART\FLAGS\Italy.cgm'
+>>>>4		pstring/B	x		'%s'
+# odd long parameter length implies single null padding octet to start command on word boundary
+>>>>2  		ubeshort&0x0001	=1
+# after 1st long command with odd parameter length comes nil padding byte followed by 2nd command like: 1022h
+#>>>>>(4.b+5)		ubyte	!0		\b, PADDING %#x
+>>>>>(4.b+6)		ubeshort !0x1022	\b, 2nd command %#4.4x (long odd)
+# even long parameter length implies next command directly is following
+>>>>2  		ubeshort&0x0001	=0
+# after 1st long command with even parameter length comes 2nd command like: 1022h 0x1054 (MS.CGM)
+>>>>>(4.b+5)	ubeshort	!0x1022		\b, 2nd command %#4.4x (long even)
+# look for END METAFILE (element class 0 and id 2 and 0 parameter) that is binary 0 0 0 0 i i i i i 1 i P P P P P
+>>>-2	ubeshort		!0x0040		\b, NOT_FOUND_END_METAFILE
 
 # MGR bitmaps  (Michael Haardt, u31b3hs at pool.informatik.rwth-aachen.de)
 0	string	yz	MGR bitmap, modern format, 8-bit aligned
@@ -2537,6 +2616,7 @@
 
 # BS encoded bitstreams
 2		uleshort	0x3800		BS image,
+# GRR: the above line is also true for binary Computer Graphics Metafile SAB00012.CGM with long parameter length 56 (=38h)
 >6		uleshort	x		Version %d,
 >4		uleshort	x		Quantization %d,
 >0		uleshort	x		(Decompresses to %d words)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.43-images-cgm.diff.sig
Type: application/octet-stream
Size: 2365 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20221025/8dbe07e2/attachment-0001.obj>

