[File] [PATCH] of Magdir/images misidentify as SGI image some TeX font metric

Jörg Jenderek joerg.jen.der.ek at gmx.net
Tue Mar 26 17:29:15 UTC 2024


Hello,

Some days ago I looked at the contents of an exotic CD-ROM. It also
contains samples which are misidentified as Silicon Graphics bitmaps.

When running the file command version 5.45 with the -k option on such
graphics and related files, I get output like:

abydos.rgba:                   SGI image data, RLE,
			       3-D, 800 x 600, 4 channels
bw.rgb:                        SGI image data, RLE,
			       3-D, 256 x 256, 3 channels
eksfi8a.tfm:                   TeX font metric data (kerkisec)
			       SGI image data,
			       0-D, 255 x 93, 16 channels
			       , "ans-Italic"
frog.rgb:                      SGI image data, RLE,
			       3-D, 496 x 497, 3 channels
greytest.rgb:                  SGI image data, RLE,
			       2-D, 256 x 256, 1 channel
input.sgi:                     SGI image data, RLE,
			       3-D, 70 x 46, 3 channels
norle-16.sgi:                  SGI image data, high precision,
			       3-D, 100 x 63, 4 channels
			       , "n.1.sgi"
pxmi.tfm:                      TeX font metric data (CMMIENCODING)
			       SGI image data,
			       0-D, 127 x 71, 16 channels
pxmi1.tfm:                     TeX font metric data (CMMIENCODING)
			       SGI image data,
			       0-D, 127 x 71, 16 channels
rle-8.sgi:                     SGI image data, RLE,
			       3-D, 100 x 63, 4 channels, "n.1.sgi"
rle.bw:                        SGI image data, RLE,
			       2-D, 150 x 97, 1 channel
rle.rgb:                       SGI image data, RLE,
			       3-D, 150 x 97, 3 channels
sample_1920x1280.sgi:          SGI image data, RLE,
			       3-D, 1920 x 1280, 3 channels
test-2channels.sgi:            SGI image data,
			       1-D, 1 x 1, 2 channels
test-5channels.sgi:            SGI image data,
			       1-D, 1 x 1, 5 channels
transtexsphere.rgb:            SGI image data, RLE,
			       3-D, 497 x 500, 3 channels
tree2.rgb:                     SGI image data, RLE,
			       3-D, 128 x 128, 3 channels
ver.bw:                        SGI image data,
			       2-D, 150 x 97, 1 channel
ver.rgb:                       SGI image data,
			       3-D, 150 x 97, 3 channels
x-fmt-140-signature-id-623.bw: SGI image data,
			       1-D, 0 x 0, 1 channel

With the --extension option only ??? is displayed. Furthermore, with the
-i option only the generic application/octet-stream is shown for the
graphic samples.

For comparison I ran the file format identification utility TrID (see
https://mark0.net/soft-trid-e.html). It lists the file name extensions
used and, with the -v option, often also the related URL pointing to
file format information. The graphic samples are described as "Silicon
Graphics bitmap" by bitmap-sgi.trid.xml. There image/x-sgi is listed as
MIME type, together with four file name suffixes (.SGI/BW/RGB/RGBA).
Some samples are described with higher priority as "Silicon Graphics RGB
bitmap" by bitmap-sgi-rgb.trid.xml, where only the suffix RGB is listed.
Other samples are described with higher priority as "Silicon Graphics
B/W bitmap" by bitmap-sgi-bw.trid.xml, where only BW is listed as
suffix. Furthermore, the TFM samples are not misidentified (see appended
trid-v-sgi.txt.gz).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here most samples are
described as "Silicon Graphics Image" by PUID x-fmt/140, and the MIME
type image/x-sgi-bw is listed. The artificial samples with 2 and 5
channels are skipped, and the TFM samples are not misidentified either.
Furthermore, only the file name suffixes RGB and BW are considered
valid here; the two suffixes RGBA and SGI are considered invalid (see
appended droid-sgi.csv.gz).

On Linux, according to the shared MIME-info database, the samples are
called "SGI image". Here image/x-sgi is shown as MIME type and only sgi
is listed as suffix. That information can be seen in the
freedesktop.org.xml.in source, found for example on
gitlab.freedesktop.org.

Luckily, with the help of these tools I found information about this
graphic file format on the Archive Team web site and on Wikipedia. That
is now expressed inside Magdir/images by comment lines like:
# URL:	http://fileformats.archiveteam.org/wiki/SGI_(image_file_format)
#	https://en.wikipedia.org/wiki/Silicon_Graphics_Image
# Ref.:	https://paulbourke.net/dataformats/sgirgb/sgiversion.html
#	http://mark0.net/download/triddefs_xml.7z
#	defs/b/bitmap-sgi.trid.xml

The currently used URL at sgi.com is invalid because the server no
longer exists, so I removed the old link.

The current description is done inside Magdir/images by lines like:
  0	ubeshort		474		SGI image data
  #>2	ubyte		0		\b, verbatim
  >2	ubyte		1		\b, RLE
  #>3	ubyte		1		\b, normal precision
  >3	ubyte		2		\b, high precision
  >4	ubeshort	x		\b, %d-D
  >6	ubeshort	x		\b, %d x
  >8	ubeshort	x		%d
  >10	ubeshort	x		\b, %d channel
  >10	ubeshort	!1		\bs
  >80	string		>0		\b, "%s"

Unfortunately only 16 bits are used for recognition. Apparently this
magic is too weak, so a few TeX font metric files with the name suffix
tfm are misidentified.
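
To illustrate how little the old first line checks, here is a minimal
Python sketch (not part of the patch). The function name and the two
4-byte prefixes are made up for illustration; the value 11 in the fourth
byte follows the invalid bytes/pixel values reported for the TFM samples.

```python
import struct

SGI_MAGIC = 474  # 0x01DA, the only value the old rule compares


def old_rule_matches(data: bytes) -> bool:
    """Mimic `0 ubeshort 474`: look only at the first big-endian 16-bit word."""
    if len(data) < 2:
        return False
    (magic,) = struct.unpack_from(">H", data, 0)
    return magic == SGI_MAGIC


# Hypothetical prefixes, not bytes copied from the real samples.
sgi_like = bytes([0x01, 0xDA, 0x01, 0x01])  # magic, RLE, 1 byte per channel
tfm_like = bytes([0x01, 0xDA, 0x00, 0x0B])  # same first word, "BPC" of 11

print(old_rule_matches(sgi_like))  # True
print(old_rule_matches(tfm_like))  # True, i.e. the false positive
```

Any file whose first two bytes happen to be 0x01 0xDA passes this test,
whatever follows.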

To check whether samples are really SGI graphics you can use the command
line tools of some graphical software (like ImageMagick, XnView) by
lines like:
	identify -verbose *.*
	nconvert -in sgi -info *.*
Looking at the output of these tools (see appended nconvert-info.txt.gz
and identify.txt.gz) we see that the TFM samples are not graphics.

To overcome the weak magic I first looked at the unused fields in the
header. After the channel information come the two 4-byte fields PINMIN
and PINMAX. The first stores the minimum pixel value in the image, which
is often zero. The second stores the maximum pixel value in the image,
which is often 255. So unusual values are shown by additional lines
like:
  >>12	ubelong		!0		\b, %u PINMIN
  >>16	ubelong		!255		\b, %u PINMAX
Afterwards 4 DUMMY bytes are stored. According to the documentation
these should be set to 0. As a check I show unexpected values by a line
like:
  >>20	ubelong		!0		\b, at 20 %#x
At offset 104 (=0x68) the COLORMAP value is stored as a 4-byte big
endian integer. Only four values are mentioned (0~normal
1~DITHEREDobsolete 2~SCREENobsolete 3~COLORMAP). In the samples I
inspected I only found the value zero. So other, unusual values are
shown by a line like:
  >>104	ubelong		!0		\b, %u COLORMAP
Afterwards come 404 padding bytes that make the header exactly 512
bytes long. According to the documentation these should be set to zero,
but this is not always true. So unusual non-zero values are shown by
lines like:
  >>111	ubyte		!0		\b, at 111 %#x
  >>113	ubyte		!0		\b, at 113 %#x
  >>118	ubeshort	!0		\b, at 118 %#4.4x
  >>121	ubyte		!0		\b, at 121 %#x
  >>132	ubelong		!0		\b, at 132 %#8.8x
  >>135	ubyte		!0		\b, at 135 %#x
  >>137	ubequad		!0		\b, at 137 %#16.16llx
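
The header layout walked through above can be sketched with Python's
struct module. This is only an illustration, not part of the patch: the
names SgiHeader, parse_header and unusual_fields are hypothetical, only
the PINMIN, PINMAX, DUMMY and COLORMAP checks are mimicked (not the
padding bytes), and the example header is synthesized from the values
printed for greytest.rgb rather than read from the real file.

```python
import struct
from typing import List, NamedTuple

# Big-endian fields up to and including COLORMAP (108 bytes);
# 404 padding bytes follow to complete the 512-byte header.
HEADER = struct.Struct(">HBBHHHHII4s80sI")


class SgiHeader(NamedTuple):
    magic: int
    storage: int
    bpc: int
    dimension: int
    xsize: int
    ysize: int
    zsize: int
    pinmin: int
    pinmax: int
    dummy: bytes
    imagename: bytes
    colormap: int


def parse_header(data: bytes) -> SgiHeader:
    """Unpack the first 108 bytes of an SGI image header."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 108 bytes")
    return SgiHeader(*HEADER.unpack_from(data, 0))


def unusual_fields(h: SgiHeader) -> List[str]:
    """Report values that differ from the usual ones, like the magic lines do."""
    notes = []
    if h.pinmin != 0:
        notes.append(f"{h.pinmin} PINMIN")
    if h.pinmax != 255:
        notes.append(f"{h.pinmax} PINMAX")
    if h.dummy != bytes(4):
        notes.append(f"at 20 {int.from_bytes(h.dummy, 'big'):#x}")
    if h.colormap != 0:
        notes.append(f"{h.colormap} COLORMAP")
    return notes


# Synthetic header: RLE, 1 byte per channel (assumed), 2-D, 256 x 256, 1 channel.
example = HEADER.pack(474, 1, 1, 2, 256, 256, 1, 9, 146, bytes(4), b"no name", 0)
header = parse_header(example + bytes(404))
print(header.xsize, header.ysize, header.zsize)  # 256 256 1
print(unusual_fields(header))  # ['9 PINMIN', '146 PINMAX']
```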

None of these fields seems to be suitable as an additional test
criterion. Since the zero values of the padding bytes do not seem to be
reliable, I also do not trust the value of the dummy bytes at offset 20.
So I take the approach used by the DROID tool.

For the STORAGE format only 2 values are allowed: 1 means RLE
compressed and 0 means not compressed. These values are shown by lines
like:
  #>>2	ubyte		0		\b, verbatim
  >>2	ubyte		1		\b, RLE

For the number of bytes per pixel component only 2 values are allowed
(1 or 2). These values are shown by lines like:
  #>>3	ubyte		1		\b, normal precision
  >>3	ubyte		2		\b, high precision

So the first test now looks again for the magic number (integer
474=0x01DA), and additionally for the storage format (0 or 1) and the
number of bytes per pixel channel (1 or 2), like the DROID tool. So the
few misidentified TeX font metric files (like pxmi.tfm pxmi1.tfm
eksfi8a.tfm, handled by Magdir/tex) with invalid "high" bytes/pixel
values (11 12) are skipped. This is done by the modified first line,
which now looks like:
  0	ubelong&0xFFffFEfc	0x01da0000
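
The effect of the mask can be checked with a small Python sketch (an
illustration only, with hypothetical function names and hand-made 4-byte
prefixes). It also shows a detail of the mask: 0xFE on the third byte
lets exactly 0 and 1 pass, while 0xFC on the fourth byte lets 0 to 3
pass, so it is slightly wider than the documented values 1 and 2.

```python
import struct

MASK = 0xFFFFFEFC
EXPECTED = 0x01DA0000


def first_test(data: bytes) -> bool:
    """Mimic `0 ubelong&0xFFffFEfc 0x01da0000` on the first 4 bytes."""
    if len(data) < 4:
        return False
    (word,) = struct.unpack_from(">I", data, 0)
    return (word & MASK) == EXPECTED


def accepted_values(mask_byte: int) -> list:
    """Byte values that become zero under the given one-byte mask."""
    return [v for v in range(256) if v & mask_byte == 0]


print(first_test(bytes([0x01, 0xDA, 1, 2])))   # True: RLE, 2 bytes per channel
print(first_test(bytes([0x01, 0xDA, 0, 11])))  # False: "BPC" 11 as in the TFM case
print(first_test(bytes([0x01, 0xDA, 0, 12])))  # False: "BPC" 12
print(accepted_values(0xFE))  # [0, 1]
print(accepted_values(0xFC))  # [0, 1, 2, 3]
```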

Unfortunately at that point the DROID sample
x-fmt-140-signature-id-623.bw is still misidentified as a graphic. But
this sample just contains some leading bytes of such a graphic; it is
used by the DROID tool to recognize SGI graphics. When we look at the
current output we see that the dimensions are shown as "0 x 0" here.
Real samples of course give "XSIZE x YSIZE" where the sizes are not
zero. This information is shown by lines like:
  >>6	ubeshort	x		\b, %d x
  >>8	ubeshort	x		%d
So the lines after the first test now become:
  >6	long			!0		SGI image data
  !:mime	image/x-sgi
  !:apple	????.SGI
So the DROID sample is now skipped. On Wikipedia image/sgi is listed as
MIME type, but this is not officially registered at IANA. The DROID
tool lists image/x-sgi-bw, which may apply only to black/white or grey
images. So I chose what is used on Linux systems by the database from
freedesktop.org.
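
What the test at offset 6 amounts to can be sketched in Python, again
only as an illustration (has_nonzero_size and make_prefix are
hypothetical helpers). Because the 4 bytes at offset 6 cover both XSIZE
and YSIZE, the test fails only when both are zero.

```python
import struct


def make_prefix(xsize: int, ysize: int) -> bytes:
    """Build 10 header bytes: magic, STORAGE, BPC, DIMENSION, XSIZE, YSIZE."""
    return struct.pack(">HBBHHH", 474, 0, 1, 2, xsize, ysize)


def has_nonzero_size(data: bytes) -> bool:
    """Mimic `>6 long !0`: the 4 bytes holding XSIZE and YSIZE are not all zero."""
    field = data[6:10]
    return len(field) == 4 and any(field)


print(has_nonzero_size(make_prefix(0, 0)))     # False, like the DROID sample
print(has_nonzero_size(make_prefix(150, 97)))  # True
print(has_nonzero_size(make_prefix(0, 5)))     # True, only "0 x 0" is rejected
```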

According to the documentation the number of channels is stored at
offset 10 as a 2-byte big endian integer. Depending on that value
different file name suffixes are used. The value 1 means black and
white. The highest value observed in my samples was 4, which means RGB
with an ALPHA channel. If I understand the documentation right, samples
with more channels may be possible. For example I can imagine an
animated RGBA image, where an additional time component is added and
the channel number would be 5. Unfortunately I found no samples with the
int suffix. I also found no sample with the inta suffix, which means
black and white with an ALPHA channel. So the channel information and
the corresponding file name suffix are now handled by lines like:
  >>10	ubeshort	x		\b, %d
  >>>10	ubeshort	1		channel
  !:ext	bw
  >>>10	ubeshort	3		channels
  !:ext	rgb/sgi
  >>>10	ubeshort	4		channels
  !:ext	rgba/sgi
  >>>10	default		x		channels
  !:ext	sgi
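
The mapping from channel count to text and suffixes can be sketched in
Python as follows. This is an illustration of the lines above, not code
from the file utility; describe_channels is a hypothetical name.

```python
from typing import List, Tuple

# Suffixes per channel count as used in the magic lines; anything else gets sgi.
SUFFIXES = {1: ["bw"], 3: ["rgb", "sgi"], 4: ["rgba", "sgi"]}


def describe_channels(zsize: int) -> Tuple[str, List[str]]:
    """Return the printed channel text and the suffix list for a ZSIZE value."""
    label = "channel" if zsize == 1 else "channels"
    return f"{zsize} {label}", SUFFIXES.get(zsize, ["sgi"])


for zsize in (1, 2, 3, 4, 5):
    print(describe_channels(zsize))
```

The default branch covers the artificial samples with 2 and 5 channels,
which only get the generic sgi suffix.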

For samples like norle-16.sgi a string like "n.1.sgi" is shown inside
double quotes. This is done by a line like:
  >80	string		>0		\b, "%s"
But that is only part of the image name. According to the documentation
an optional image name is stored after the dummy bytes and before the
COLORMAP field. This is a null terminated ASCII string with up to 79
characters. So the image name is shown correctly by a line like:
  >>24	string		>\0		\b, "%0.80s"
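
The difference between reading at offset 80 and at offset 24 can be
sketched in Python. This is an illustration only: the helper names are
hypothetical, the old rule is approximated by stopping at the end of the
name field, and the 63-character name is made up so that its tail lands
at offset 80 (the real name of norle-16.sgi is not reproduced here).

```python
NAME_OFFSET = 24  # IMAGENAME starts after the 4 dummy bytes
NAME_SIZE = 80    # field size, so COLORMAP starts at offset 104


def string_at(header: bytes, offset: int, limit: int) -> str:
    """Return the NUL-terminated ASCII string at offset, at most limit bytes."""
    field = header[offset:offset + limit]
    return field.split(b"\0", 1)[0].decode("ascii", errors="replace")


def image_name(header: bytes) -> str:
    """New rule: read the whole IMAGENAME field from offset 24."""
    return string_at(header, NAME_OFFSET, NAME_SIZE)


def old_rule_string(header: bytes) -> str:
    """Old rule: start at offset 80, i.e. 56 characters into the name."""
    return string_at(header, 80, NAME_OFFSET + NAME_SIZE - 80)


# Hypothetical name: 56 filler characters followed by "n.1.sgi".
name = "x" * 56 + "n.1.sgi"
header = bytes(24) + name.encode("ascii").ljust(NAME_SIZE, b"\0") + bytes(4)

print(old_rule_string(header))     # n.1.sgi
print(image_name(header) == name)  # True
```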

After applying the above mentioned modifications with the patch
file-5.45-images-sgi.diff and using Magdir/tex, I get more precise
output and the misidentification vanishes. With the -k option it looks
like:

abydos.rgba:                   SGI image data, RLE,
			       3-D, 800 x 600, 4 channels
bw.rgb:                        SGI image data, RLE,
			       3-D, 256 x 256, 3 channels
			       , "no name"
eksfi8a.tfm:                   TeX font metric data (kerkisec)
frog.rgb:                      SGI image data, RLE,
			       3-D, 496 x 497, 3 channels
			       , "no name"
			       , at 111 0x5, at 113 0x2
			       , at 118 0x01f0, at 121 0x2
			       , at 132 0x1001a174, at 135 0x74
			       , at 137 0x0000000001a68410
greytest.rgb:                  SGI image data, RLE,
			       2-D, 256 x 256, 1 channel
			       , "no name"
			       , 9 PINMIN, 146 PINMAX
			       , at 111 0x4, at 113 0x2
			       , at 118 0x00ff, at 132 0x10014df0
			       , at 135 0xf0, at 137 0x00000000010f5c10
input.sgi:                     SGI image data, RLE,
			       3-D, 70 x 46, 3 channels
norle-16.sgi:                  SGI image data, high precision,
			       3-D, 100 x 63, 4 channels
			       , "...rnold_SGI_Texture_
			       Crash_Bugreport_01
			       \Default_Pass_Main.1.sgi"
			       , 65535 PINMAX
pxmi.tfm:                      TeX font metric data (CMMIENCODING)
pxmi1.tfm:                     TeX font metric data (CMMIENCODING)
rle-8.sgi:                     SGI image data, RLE,
			       3-D, 100 x 63, 4 channels
			       , "...rnold_SGI_Texture_
			       Crash_Bugreport_01
			       \Default_Pass_Main.1.sgi"
rle.bw:                        SGI image data, RLE,
			       2-D, 150 x 97, 1 channel
			       , "no name"
			       , at 111 0x4, at 113 0x2
			       , at 132 0x100105f0, at 135 0xf0
			       , at 137 0x0000000000391810
rle.rgb:                       SGI image data, RLE,
			       3-D, 150 x 97, 3 channels
			       , "no name"
			       , at 111 0x4, at 113 0x2
			       , at 121 0x2, at 132 0x10011210
			       , at 135 0x10, at 137 0x0000000000a35610
sample_1920x1280.sgi:          SGI image data, RLE,
			       3-D, 1920 x 1280, 3 channels
test-2channels.sgi:            SGI image data,
			       1-D, 1 x 1, 2 channels, 0 PINMAX
test-5channels.sgi:            SGI image data,
			       1-D, 1 x 1, 5 channels, 0 PINMAX
transtexsphere.rgb:            SGI image data, RLE,
			       3-D, 497 x 500, 3 channels
			       , "no name"
			       , 211 PINMAX
			       , at 111 0x5, at 113 0x2
			       , at 118 0x01f3, at 121 0x2
			       , at 132 0x10019f28, at 135 0x28
			       , at 137 0x00000000039f7610
tree2.rgb:                     SGI image data, RLE,
			       3-D, 128 x 128, 3 channels
			       , "no name"
ver.bw:                        SGI image data,
			       2-D, 150 x 97, 1 channel
			       , "no name"
			       , at 111 0x4, at 113 0x2
			       , at 132 0x100102e0, at 135 0xe0
ver.rgb:                       SGI image data,
			       3-D, 150 x 97, 3 channels
			       , "no name"
			       , at 111 0x4, at 113 0x2
			       , at 121 0x2, at 132 0x100108f0
			       , at 135 0xf0
x-fmt-140-signature-id-623.bw: data

I hope my diff file can be applied in a future version of the file
utility. Unfortunately the magic for TeX font metric files is also too
weak and needs some polishing. For the TFM samples no unique and long
pattern exists, so I will need some time to do this work in the future.

With best wishes
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
--- file-5.39/magic/Magdir/images.old	2020-05-31 10:34:40 +0000
+++ file-5.39/magic/Magdir/images	2020-07-15 11:50:26 +0000
@@ -1413,4 +1413,6 @@
 # https://www.cartesianinc.com/Tech/
+# Reference:	http://fileformats.archiveteam.org/wiki/Cartesian_Perceptual_Compression
 0	string	CPC\262		Cartesian Perceptual Compression image
 !:mime	image/x-cpi
+!:ext	cpi/cpc
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-sgi.csv.gz
Type: application/x-gzip
Size: 808 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240326/7d4599cc/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nconvert-info.txt.gz
Type: application/x-gzip
Size: 821 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240326/7d4599cc/attachment-0005.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: identify.txt.gz
Type: application/x-gzip
Size: 567 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240326/7d4599cc/attachment-0006.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-sgi.txt.gz
Type: application/x-gzip
Size: 864 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240326/7d4599cc/attachment-0007.bin>
-------------- next part --------------
--- file-5.45/magic/Magdir/images.old	2023-07-27 20:04:45.000000000 +0200
+++ file-5.45/magic/Magdir/images	2024-03-26 18:25:20.376712000 +0100
@@ -1296,20 +1296,75 @@
 
 # SGI image file format, from Daniel Quinlan (quinlan at yggdrasil.com)
 #
-# See
-#	http://reality.sgi.com/grafica/sgiimage.html
-#
-0	ubeshort		474		SGI image data
-#>2	ubyte		0		\b, verbatim
->2	ubyte		1		\b, RLE
-#>3	ubyte		1		\b, normal precision
->3	ubyte		2		\b, high precision
->4	ubeshort	x		\b, %d-D
->6	ubeshort	x		\b, %d x
->8	ubeshort	x		%d
->10	ubeshort	x		\b, %d channel
->10	ubeshort	!1		\bs
->80	string		>0		\b, "%s"
+# Update:	Joerg Jenderek
+# URL:		http://fileformats.archiveteam.org/wiki/SGI_(image_file_format)
+#		https://en.wikipedia.org/wiki/Silicon_Graphics_Image
+# Reference:	https://paulbourke.net/dataformats/sgirgb/sgiversion.html
+#		http://mark0.net/download/triddefs_xml.7z/defs/b/bitmap-sgi.trid.xml
+# Note:		called "Silicon Graphics bitmap (generic)" by TrID,
+#		"Silicon Graphics Image" by DROID via PUID x-fmt/140 and shared MIME-info database from freedesktop.org,
+#		verfied by ImageMagick `identify -verbose *.sgi` as SGI (Irix RGB image) and
+#		verfied by XnView `nconvert -in sgi -info *.sgi` as SGI RGB
+# look for magic number (integer 474=0x01DA) + storage format (0 or 1) + number of bytes per pixel channel (1 or 2) 
+# to skip few TeX font metric data (like pxmi.tfm pxmi1.tfm eksfi8a.tfm ./tex) with invalid "high" bytes/pixel (11 12)
+0	ubelong&0xFFffFEfc	0x01da0000
+# skip DROID x-fmt-140-signature-id-623.bw with invalid "low" dimensions "0 x 0"
+>6	long			!0		SGI image data
+#!:mime	image/sgi
+!:mime	image/x-sgi
+!:apple	????.SGI
+# STORAGE format; allowed values 0~VERBATIM 1~RLE 
+#>>2	ubyte		0		\b, verbatim
+>>2	ubyte		1		\b, RLE
+#>>2	ubyte		>1		STORAGE=%#x
+# BPC; number of bytes per pixel component; allowed values 1 2
+#>>3	ubyte		1		\b, normal precision
+>>3	ubyte		2		\b, high precision
+#>>3	ubyte		x		BPC=%#x
+# DIMENSION; allowed values are 1~scanline 2~XSIZExYSIZE 3~XSIZExYSIZExZSIZE
+>>4	ubeshort	x		\b, %d-D
+# XSIZE; width of image in pixels
+>>6	ubeshort	x		\b, %d x
+# YSIZE; height of image in pixels
+>>8	ubeshort	x		%d
+# ZSIZE; number of channels in image; 1~B/W (greyscale) 3~RGB 4~RGB+ALPHA channel
+>>10	ubeshort	x		\b, %d
+# GRR: avoid
+# Magdir\images, 1347: Warning: Current entry does not yet have a description for adding a EXTENSION type
+>>>10	ubeshort	1		channel
+# GRR: exception https://sembiance.com/fileFormatSamples/image/sgi/greytest.rgb
+!:ext	bw
+# no examples found with .int suffix
+#!:ext	bw/int
+# no examples found with .inta suffix for black/white+ALPHA channel
+# no examples found with 2 channels
+#>>>10	ubeshort	2		channels
+#!:ext	sgi
+>>>10	ubeshort	3		channels
+!:ext	rgb/sgi
+>>>10	ubeshort	4		channels
+!:ext	rgba/sgi
+>>>10	default		x		channels
+# no examples found with 5 and more channels
+!:ext	sgi
+# IMAGENAME; null terminated ascii string of up to 79 characters
+>>24	string		>\0		\b, "%0.80s"
+# PINMIN; minimum pixel value in the image; often 0
+>>12	ubelong		!0		\b, %u PINMIN
+# PINMAX; maximum pixel value in the image; often 255
+>>16	ubelong		!255		\b, %u PINMAX
+# DUMMY; 4 bytes of data should be set to 0
+>>20	ubelong		!0		\b, at 20 %#x
+# COLORMAP; 0~normal 1~DITHEREDobsolete 2~SCREENobsolete 3~COLORMAP
+>>104	ubelong		!0		\b, %u COLORMAP
+# DUMMY; 404 bytes should be set to 0 but not always true; makes header exactly 512 bytes
+>>111	ubyte		!0		\b, at 111 %#x
+>>113	ubyte		!0		\b, at 113 %#x
+>>118	ubeshort	!0		\b, at 118 %#4.4x
+>>121	ubyte		!0		\b, at 121 %#x
+>>132	ubelong		!0		\b, at 132 %#8.8x
+>>135	ubyte		!0		\b, at 135 %#x
+>>137	ubequad		!0		\b, at 137 %#16.16llx
 
 0	string		IT01		FIT image data
 >4	ubelong		x		\b, %d x
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.45-images-sgi.diff.sig
Type: application/octet-stream
Size: 1917 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240326/7d4599cc/attachment-0001.obj>

