[File] [PATCH] of Magdir/images misidentify as SGI image some TeX font metric

Christos Zoulas christos at zoulas.com
Sun Mar 31 14:55:58 UTC 2024


Committed, thanks!

christos

> On Mar 26, 2024, at 1:29 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> Some days ago I looked at the contents of an exotic CD-ROM. It also
> holds samples that are misidentified as Silicon Graphics bitmaps.
> 
> When I run the file command version 5.45 with the -k option on such
> graphics and related files, I get output like:
> 
> abydos.rgba:                   SGI image data, RLE,
> 			       3-D, 800 x 600, 4 channels
> bw.rgb:                        SGI image data, RLE,
> 			       3-D, 256 x 256, 3 channels
> eksfi8a.tfm:                   TeX font metric data (kerkisec)
> 			       SGI image data,
> 			       0-D, 255 x 93, 16 channels
> 			       , "ans-Italic"
> frog.rgb:                      SGI image data, RLE,
> 			       3-D, 496 x 497, 3 channels
> greytest.rgb:                  SGI image data, RLE,
> 			       2-D, 256 x 256, 1 channel
> input.sgi:                     SGI image data, RLE,
> 			       3-D, 70 x 46, 3 channels
> norle-16.sgi:                  SGI image data, high precision,
> 			       3-D, 100 x 63, 4 channels
> 			       , "n.1.sgi"
> pxmi.tfm:                      TeX font metric data (CMMIENCODING)
> 			       SGI image data,
> 			       0-D, 127 x 71, 16 channels
> pxmi1.tfm:                     TeX font metric data (CMMIENCODING)
> 			       SGI image data,
> 			       0-D, 127 x 71, 16 channels
> rle-8.sgi:                     SGI image data, RLE,
> 			       3-D, 100 x 63, 4 channels, "n.1.sgi"
> rle.bw:                        SGI image data, RLE,
> 			       2-D, 150 x 97, 1 channel
> rle.rgb:                       SGI image data, RLE,
> 			       3-D, 150 x 97, 3 channels
> sample_1920x1280.sgi:          SGI image data, RLE,
> 			       3-D, 1920 x 1280, 3 channels
> test-2channels.sgi:            SGI image data,
> 			       1-D, 1 x 1, 2 channels
> test-5channels.sgi:            SGI image data,
> 			       1-D, 1 x 1, 5 channels
> transtexsphere.rgb:            SGI image data, RLE,
> 			       3-D, 497 x 500, 3 channels
> tree2.rgb:                     SGI image data, RLE,
> 			       3-D, 128 x 128, 3 channels
> ver.bw:                        SGI image data,
> 			       2-D, 150 x 97, 1 channel
> ver.rgb:                       SGI image data,
> 			       3-D, 150 x 97, 3 channels
> x-fmt-140-signature-id-623.bw: SGI image data,
> 			       1-D, 0 x 0, 1 channel
> 
> With the --extension option only ??? is displayed. Furthermore, with
> the -i option only the generic application/octet-stream is shown for
> the graphic samples.
> 
> For comparison I ran the file format identification utility TrID
> (see https://mark0.net/soft-trid-e.html). It lists the file name
> extensions in use and, with the -v option, often the related URL
> pointing to file format information. The graphic samples are
> described as "Silicon Graphics bitmap" by bitmap-sgi.trid.xml, which
> lists image/x-sgi as the MIME type and four file name suffixes
> (.SGI/BW/RGB/RGBA). Some samples are described with higher priority as
> "Silicon Graphics RGB bitmap" by bitmap-sgi-rgb.trid.xml, which lists
> only the suffix RGB. Others are described with higher priority as
> "Silicon Graphics B/W bitmap" by bitmap-sgi-bw.trid.xml, which lists
> only the suffix BW. Furthermore, the TFM samples are not misidentified
> (see appended trid-v-sgi.txt.gz).
> 
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). It describes most
> samples as "Silicon Graphics Image" with PUID x-fmt/140 and lists the
> MIME type image/x-sgi-bw. The artificial samples with 2 and 5 channels
> are skipped, and the TFM samples are not misidentified. Furthermore,
> DROID considers only the file name suffixes RGB and BW valid; the two
> suffixes RGBA and SGI are considered invalid (see appended
> droid-sgi.csv.gz).
> 
> On Linux the shared MIME-info database calls the samples "SGI image".
> It shows image/x-sgi as the MIME type and lists only sgi as suffix.
> That information can be seen in the freedesktop.org.xml.in source,
> found for example on gitlab.freedesktop.org.
> 
> Luckily, with the help of these tools I found information about this
> graphic file format on the Archive Team web site and Wikipedia. That is
> now expressed inside Magdir/images by comment lines like:
> # URL:	http://fileformats.archiveteam.org/wiki/SGI_(image_file_format)
> #	https://en.wikipedia.org/wiki/Silicon_Graphics_Image
> # Ref.:	https://paulbourke.net/dataformats/sgirgb/sgiversion.html
> #	http://mark0.net/download/triddefs_xml.7z
> #	defs/b/bitmap-sgi.trid.xml
> 
> The currently used URL at sgi.com is invalid because the server no
> longer exists. So I removed the old link.
> 
> The description is produced inside Magdir/images by lines like:
> 0	ubeshort		474		SGI image data
> #>2	ubyte		0		\b, verbatim
> >2	ubyte		1		\b, RLE
> #>3	ubyte		1		\b, normal precision
> >3	ubyte		2		\b, high precision
> >4	ubeshort	x		\b, %d-D
> >6	ubeshort	x		\b, %d x
> >8	ubeshort	x		%d
> >10	ubeshort	x		\b, %d channel
> >10	ubeshort	!1		\bs
> >80	string		>0		\b, "%s"
> 
> Unfortunately only 16 bits are used for recognition. Apparently this
> magic is too weak, so a few TeX font metric files with the name suffix
> tfm are misidentified.
> 
> To check whether samples are really SGI graphics you can use the
> command line tools of some graphics software (like ImageMagick or
> XnView) with lines like:
> 	identify -verbose *.*
> 	nconvert -in sgi -info *.*
> The output of these tools (see appended nconvert-info.txt.gz and
> identify.txt.gz) shows that the TFM samples are not graphics.
> 
> To overcome the weak magic I first looked at fields in the header that
> are not used yet. After the channel information come the 4-byte fields
> PINMIN and PINMAX. The first stores the minimum pixel value in the
> image; often the value is zero. The second stores the maximum pixel
> value in the image; often the value is 255. So unusual values are
> shown by additional lines like:
> >>12	ubelong		!0		\b, %u PINMIN
> >>16	ubelong		!255		\b, %u PINMAX
> Afterwards 4 DUMMY bytes are stored. According to the documentation
> these should be set to 0. As a check I show unexpected values by a
> line like:
> >>20	ubelong		!0		\b, at 20 %#x
> At offset 104 (=0x68) the COLORMAP value is stored as a 4-byte
> big-endian integer. Only four values are mentioned (0~normal
> 1~DITHERED (obsolete) 2~SCREEN (obsolete) 3~COLORMAP). In the samples I
> inspected I only found the value zero. So other, unusual values are
> shown by a line like:
> >>104	ubelong		!0		\b, %u COLORMAP
> Afterwards come 404 padding bytes that make the header exactly 512
> bytes long. According to the documentation these should be set to zero,
> but this is not always true. So unusual nonzero values are shown by
> lines like:
> >>111	ubyte		!0		\b, at 111 %#x
> >>113	ubyte		!0		\b, at 113 %#x
> >>118	ubeshort	!0		\b, at 118 %#4.4x
> >>121	ubyte		!0		\b, at 121 %#x
> >>132	ubelong		!0		\b, at 132 %#8.8x
> >>135	ubyte		!0		\b, at 135 %#x
> >>137	ubequad		!0		\b, at 137 %#16.16llx
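
The header layout described above can be sketched in Python to make the offsets easier to follow. This is only an illustration based on the field list in this message; the names SGI_HEADER and parse_sgi_header and the synthetic sample header are assumptions for the example, not part of the patch or of the file utility.

```python
import struct

# Big-endian layout of the first 108 header bytes, as listed in the text:
# MAGIC(2) STORAGE(1) BPC(1) DIMENSION(2) XSIZE(2) YSIZE(2) ZSIZE(2)
# PINMIN(4) PINMAX(4) DUMMY(4) IMAGENAME(80) COLORMAP(4)
SGI_HEADER = struct.Struct(">HBBHHHHII4s80sI")


def parse_sgi_header(data):
    """Decode the fixed fields of an SGI header into a dictionary."""
    if len(data) < SGI_HEADER.size:
        raise ValueError("need at least %d bytes" % SGI_HEADER.size)
    (magic, storage, bpc, dimension, xsize, ysize, zsize,
     pinmin, pinmax, dummy, name, colormap) = SGI_HEADER.unpack_from(data)
    return {
        "magic": magic,          # offset 0, expected 474 (0x01DA)
        "storage": storage,      # offset 2, 0 = verbatim, 1 = RLE
        "bpc": bpc,              # offset 3, bytes per pixel channel
        "dimension": dimension,  # offset 4
        "xsize": xsize,          # offset 6
        "ysize": ysize,          # offset 8
        "zsize": zsize,          # offset 10, number of channels
        "pinmin": pinmin,        # offset 12
        "pinmax": pinmax,        # offset 16
        "dummy": dummy,          # offset 20, should be zero
        # offset 24, null-terminated image name
        "name": name.split(b"\0", 1)[0].decode("ascii", errors="replace"),
        "colormap": colormap,    # offset 104
        # offsets 108..511 are padding that should be zero
        "padding_nonzero": any(data[108:512]),
    }


# Synthetic header built for this example only (not one of the samples).
sample = SGI_HEADER.pack(474, 1, 1, 3, 800, 600, 4, 0, 255,
                         b"\0" * 4, b"no name", 0) + bytes(404)
fields = parse_sgi_header(sample)
print(fields["xsize"], fields["ysize"], fields["zsize"], fields["name"])
```

The struct format covers 108 bytes, so COLORMAP lands at offset 104 and the 404 padding bytes bring the header to 512 bytes, matching the numbers given above.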
> 
> None of these fields seems suited as an additional test criterion.
> Since zero values in the padding bytes do not seem to be reliable, I
> also do not trust the dummy bytes at offset 20. So I take the approach
> used by the DROID tool.
> 
> For the STORAGE format only 2 values are allowed: 1 means RLE
> compressed and 0 means not compressed. These values are shown by lines
> like:
> #>>2	ubyte		0		\b, verbatim
> >>2	ubyte		1		\b, RLE
> 
> For the number of bytes per pixel component only 2 values are allowed
> (1 or 2). These values are shown by lines like:
> #>>3	ubyte		1		\b, normal precision
> >>3	ubyte		2		\b, high precision
> 
> So the first test now checks the magic number (integer 474=0x01DA)
> together with the storage format (0 or 1) and the number of bytes per
> pixel channel (1 or 2), like the DROID tool. The few misidentified TeX
> font metric files (like pxmi.tfm, pxmi1.tfm and eksfi8a.tfm, handled by
> Magdir/tex) have an invalid "high" bytes/pixel value (11 or 12) and are
> therefore skipped. This is done by a modified first line, which now
> looks like:
> 0	ubelong&0xFFffFEfc	0x01da0000
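
To see what the masked comparison accepts, here is a small Python sketch of the same arithmetic. It assumes the first four bytes are read as one big-endian 32-bit value, as ubelong implies; the function name passes_first_test and the byte strings are made up for this illustration.

```python
import struct

MASK = 0xFFFFFEFC      # keep magic, storage bits 1..7, bpc bits 2..7
EXPECTED = 0x01DA0000  # magic 474 with all kept bits cleared


def passes_first_test(header):
    """Mimic 'ubelong&0xFFffFEfc 0x01da0000' on the first four bytes."""
    if len(header) < 4:
        return False
    (value,) = struct.unpack_from(">I", header)
    return (value & MASK) == EXPECTED


# magic 0x01DA, storage 1 (RLE), 1 byte per channel: accepted
print(passes_first_test(bytes([0x01, 0xDA, 0x01, 0x01])))  # True
# magic 0x01DA, storage 0, 2 bytes per channel: accepted
print(passes_first_test(bytes([0x01, 0xDA, 0x00, 0x02])))  # True
# TFM-like start with 11 or 12 in the fourth byte: rejected
print(passes_first_test(bytes([0x01, 0xDA, 0x00, 0x0B])))  # False
print(passes_first_test(bytes([0x01, 0xDA, 0x00, 0x0C])))  # False
# storage byte 2 is not a valid storage format: rejected
print(passes_first_test(bytes([0x01, 0xDA, 0x02, 0x01])))  # False
```

Because the mask clears only the two lowest bits of the fourth byte, values 0 and 3 at offset 3 would pass this test as well as 1 and 2. That is slightly looser than the "1 or 2" rule stated above, but still enough to reject the TFM samples.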
> 
> Unfortunately at that point the DROID sample
> x-fmt-140-signature-id-623.bw is still misidentified as a graphic. But
> this sample just contains some leading bytes of such a graphic; the
> DROID tool uses it to recognize SGI graphics. In the current output
> its dimensions are shown as "0 x 0", whereas real samples of course
> give "XSIZE x YSIZE" with nonzero sizes. This information is shown by
> lines like:
> >>6	ubeshort	x		\b, %d x
> >>8	ubeshort	x		%d
> So the lines after the first test now look like:
> >6	long			!0		SGI image data
> !:mime	image/x-sgi
> !:apple	????.SGI
> So the DROID sample is now skipped. Wikipedia lists image/sgi as the
> MIME type, but this is not officially registered at IANA. The DROID
> tool lists image/x-sgi-bw, which may apply only to black-and-white or
> grey images. So I chose what is used on Linux systems by the database
> from freedesktop.org.
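
The size test at offset 6 can be sketched in Python as follows. It reads the four bytes that hold XSIZE and YSIZE as one unit, like the 4-byte "long" comparison against zero; the function name has_nonzero_size and the test headers are assumptions for this example.

```python
import struct


def has_nonzero_size(header):
    """True if the four bytes holding XSIZE and YSIZE are not all zero."""
    return len(header) >= 10 and any(header[6:10])


def make_header(xsize, ysize):
    """Build the first 10 header bytes with the given sizes (illustration)."""
    return struct.pack(">HBBHHH", 474, 0, 1, 1, xsize, ysize)


print(has_nonzero_size(make_header(0, 0)))      # False, like the DROID sample
print(has_nonzero_size(make_header(800, 600)))  # True
print(has_nonzero_size(make_header(1, 0)))      # True
```

Since both sizes are compared together, a header in which only one of XSIZE and YSIZE is zero would still pass; only the "0 x 0" case is rejected.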
> 
> According to the documentation the number of channels is stored at
> offset 10 as a 2-byte big-endian integer. Depending on that value
> different file name suffixes are used. The value 1 means black and
> white. The highest value observed in my samples was 4, which means RGB
> with an ALPHA channel. If I understand the documentation correctly, it
> may be possible to have samples with more channels. For example, I can
> imagine an animated RGBA image, where an additional time component is
> added and the channel number would be 5. Unfortunately I found no
> samples with the int suffix. I also found no sample with the inta
> suffix, which means black and white with an ALPHA channel. So the
> channel information and the corresponding file name suffix are now
> handled by lines like:
> >>10	ubeshort	x		\b, %d
> >>>10	ubeshort	1		channel
> !:ext	bw
> >>>10	ubeshort	3		channels
> !:ext	rgb/sgi
> >>>10	ubeshort	4		channels
> !:ext	rgba/sgi
> >>>10	default		x		channels
> !:ext	sgi
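
The channel handling above amounts to a small lookup with a fallback. A minimal Python sketch, assuming only the four cases shown in the magic lines (the function names are made up for this example):

```python
def suffixes_for_channels(zsize):
    """Return the suffix list that the magic lines assign to a channel count."""
    table = {
        1: ["bw"],           # black and white
        3: ["rgb", "sgi"],   # RGB
        4: ["rgba", "sgi"],  # RGB with alpha channel
    }
    # Every other channel count falls through to the default branch.
    return table.get(zsize, ["sgi"])


def channel_text(zsize):
    """Return the description fragment, singular only for one channel."""
    return "%d %s" % (zsize, "channel" if zsize == 1 else "channels")


for count in (1, 2, 3, 4, 5):
    print(channel_text(count), "/".join(suffixes_for_channels(count)))
```

The artificial samples with 2 and 5 channels end up in the default branch and get only the sgi suffix.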
> 
> For samples like norle-16.sgi a string like "n.1.sgi" is shown inside
> double quotes. This is done by a line like:
> >80	string		>0		\b, "%s"
> But that is only part of the image name. According to the
> documentation an optional image name is stored after the dummy bytes
> and before the COLORMAP field. This is a null-terminated ASCII string
> with up to 79 characters. So the image name is shown correctly by a
> line like:
> >>24	string		>\0		\b, "%0.80s"
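
The difference between reading the name at offset 80 and at offset 24 can be shown with a short Python sketch. The helper name read_name and the 64-character test name are hypothetical and chosen only to illustrate the effect; they are not taken from the samples.

```python
def read_name(header, offset):
    """Read a null-terminated ASCII string from offset up to byte 104."""
    field = header[offset:104]
    return field.split(b"\0", 1)[0].decode("ascii", errors="replace")


# Hypothetical 64-character image name stored in the 80-byte name field.
name = b"A" * 56 + b"tail.sgi"
header = bytes(24) + name.ljust(80, b"\0") + bytes(4)

print(read_name(header, 80))  # old offset: only the tail "tail.sgi"
print(read_name(header, 24))  # new offset: the complete name
```

The name field starts at offset 24, so a read that starts at offset 80 skips the first 56 characters and shows only what is left of a long name.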
> 
> After applying the above modifications with the patch
> file-5.45-images-sgi.diff and using Magdir/tex, I get more precise
> output and the misidentification vanishes. With the -k option that
> looks like:
> 
> abydos.rgba:                   SGI image data, RLE,
> 			       3-D, 800 x 600, 4 channels
> bw.rgb:                        SGI image data, RLE,
> 			       3-D, 256 x 256, 3 channels
> 			       , "no name"
> eksfi8a.tfm:                   TeX font metric data (kerkisec)
> frog.rgb:                      SGI image data, RLE,
> 			       3-D, 496 x 497, 3 channels
> 			       , "no name"
> 			       , at 111 0x5, at 113 0x2
> 			       , at 118 0x01f0, at 121 0x2
> 			       , at 132 0x1001a174, at 135 0x74
> 			       , at 137 0x0000000001a68410
> greytest.rgb:                  SGI image data, RLE,
> 			       2-D, 256 x 256, 1 channel
> 			       , "no name"
> 			       , 9 PINMIN, 146 PINMAX
> 			       , at 111 0x4, at 113 0x2
> 			       , at 118 0x00ff, at 132 0x10014df0
> 			       , at 135 0xf0, at 137 0x00000000010f5c10
> input.sgi:                     SGI image data, RLE,
> 			       3-D, 70 x 46, 3 channels
> norle-16.sgi:                  SGI image data, high precision,
> 			       3-D, 100 x 63, 4 channels
> 			       , "...rnold_SGI_Texture_
> 			       Crash_Bugreport_01
> 			       \Default_Pass_Main.1.sgi"
> 			       , 65535 PINMAX
> pxmi.tfm:                      TeX font metric data (CMMIENCODING)
> pxmi1.tfm:                     TeX font metric data (CMMIENCODING)
> rle-8.sgi:                     SGI image data, RLE,
> 			       3-D, 100 x 63, 4 channels
> 			       , "...rnold_SGI_Texture_
> 			       Crash_Bugreport_01
> 			       \Default_Pass_Main.1.sgi"
> rle.bw:                        SGI image data, RLE,
> 			       2-D, 150 x 97, 1 channel
> 			       , "no name"
> 			       , at 111 0x4, at 113 0x2
> 			       , at 132 0x100105f0, at 135 0xf0
> 			       , at 137 0x0000000000391810
> rle.rgb:                       SGI image data, RLE,
> 			       3-D, 150 x 97, 3 channels
> 			       , "no name"
> 			       , at 111 0x4, at 113 0x2
> 			       , at 121 0x2, at 132 0x10011210
> 			       , at 135 0x10, at 137 0x0000000000a35610
> sample_1920x1280.sgi:          SGI image data, RLE,
> 			       3-D, 1920 x 1280, 3 channels
> test-2channels.sgi:            SGI image data,
> 			       1-D, 1 x 1, 2 channels, 0 PINMAX
> test-5channels.sgi:            SGI image data,
> 			       1-D, 1 x 1, 5 channels, 0 PINMAX
> transtexsphere.rgb:            SGI image data, RLE,
> 			       3-D, 497 x 500, 3 channels
> 			       , "no name"
> 			       , 211 PINMAX
> 			       , at 111 0x5, at 113 0x2
> 			       , at 118 0x01f3, at 121 0x2
> 			       , at 132 0x10019f28, at 135 0x28
> 			       , at 137 0x00000000039f7610
> tree2.rgb:                     SGI image data, RLE,
> 			       3-D, 128 x 128, 3 channels
> 			       , "no name"
> ver.bw:                        SGI image data,
> 			       2-D, 150 x 97, 1 channel
> 			       , "no name"
> 			       , at 111 0x4, at 113 0x2
> 			       , at 132 0x100102e0, at 135 0xe0
> ver.rgb:                       SGI image data,
> 			       3-D, 150 x 97, 3 channels
> 			       , "no name"
> 			       , at 111 0x4, at 113 0x2
> 			       , at 121 0x2, at 132 0x100108f0
> 			       , at 135 0xf0
> x-fmt-140-signature-id-623.bw: data
> 
> I hope my diff file can be applied in a future version of the file
> utility. Unfortunately the magic for TeX font metrics is also too weak
> and needs some polishing. For the TFM samples there is no unique, long
> pattern, so I will need some time to do this work in the future.
> 
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
> <file-5_39-images-cpi_diff.DEFANGED-112485><droid-sgi.csv.gz><nconvert-info.txt.gz><identify.txt.gz><trid-v-sgi.txt.gz><file-5_45-images-sgi_diff.DEFANGED-112486><file-5_45-images-sgi_diff_sig.DEFANGED-112487>
> -- 
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file
> <sanitizer.log>


