[File] [PATCH] of Magdir/images misidentify as SGI image some TeX font metric
Christos Zoulas
christos at zoulas.com
Sun Mar 31 14:55:58 UTC 2024
Committed, thanks!
christos
> On Mar 26, 2024, at 1:29 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> some days ago I looked at the contents of an exotic CD-ROM. It also
> contains samples that are misidentified as Silicon Graphics bitmaps.
>
> When running the file command version 5.45 with the -k option on such
> graphics and related files I get output like:
>
> abydos.rgba: SGI image data, RLE,
> 3-D, 800 x 600, 4 channels
> bw.rgb: SGI image data, RLE,
> 3-D, 256 x 256, 3 channels
> eksfi8a.tfm: TeX font metric data (kerkisec)
> SGI image data,
> 0-D, 255 x 93, 16 channels
> , "ans-Italic"
> frog.rgb: SGI image data, RLE,
> 3-D, 496 x 497, 3 channels
> greytest.rgb: SGI image data, RLE,
> 2-D, 256 x 256, 1 channel
> input.sgi: SGI image data, RLE,
> 3-D, 70 x 46, 3 channels
> norle-16.sgi: SGI image data, high precision,
> 3-D, 100 x 63, 4 channels
> , "n.1.sgi"
> pxmi.tfm: TeX font metric data (CMMIENCODING)
> SGI image data,
> 0-D, 127 x 71, 16 channels
> pxmi1.tfm: TeX font metric data (CMMIENCODING)
> SGI image data,
> 0-D, 127 x 71, 16 channels
> rle-8.sgi: SGI image data, RLE,
> 3-D, 100 x 63, 4 channels, "n.1.sgi"
> rle.bw: SGI image data, RLE,
> 2-D, 150 x 97, 1 channel
> rle.rgb: SGI image data, RLE,
> 3-D, 150 x 97, 3 channels
> sample_1920x1280.sgi: SGI image data, RLE,
> 3-D, 1920 x 1280, 3 channels
> test-2channels.sgi: SGI image data,
> 1-D, 1 x 1, 2 channels
> test-5channels.sgi: SGI image data,
> 1-D, 1 x 1, 5 channels
> transtexsphere.rgb: SGI image data, RLE,
> 3-D, 497 x 500, 3 channels
> tree2.rgb: SGI image data, RLE,
> 3-D, 128 x 128, 3 channels
> ver.bw: SGI image data,
> 2-D, 150 x 97, 1 channel
> ver.rgb: SGI image data,
> 3-D, 150 x 97, 3 channels
> x-fmt-140-signature-id-623.bw: SGI image data,
> 1-D, 0 x 0, 1 channel
>
> With the --extension option only ??? is displayed. Furthermore, with
> the -i option only the generic application/octet-stream is shown for
> the graphic samples.
>
> For comparison I ran the file format identification utility TrID
> (see https://mark0.net/soft-trid-e.html). It lists the file name
> extensions in use and, with the -v option, often also the URL of the
> related file format information. The graphic samples are described as
> "Silicon Graphics bitmap" by bitmap-sgi.trid.xml. Here image/x-sgi is
> listed as the MIME type, together with four file name suffixes
> (.SGI/BW/RGB/RGBA). Some samples are described with higher priority as
> "Silicon Graphics RGB bitmap" by bitmap-sgi-rgb.trid.xml, which lists
> only the suffix RGB. Other samples are described with higher priority
> as "Silicon Graphics B/W bitmap" by bitmap-sgi-bw.trid.xml, which lists
> only the suffix BW. Furthermore, the TFM samples are not misidentified
> (see appended trid-v-sgi.txt.gz).
>
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). Here most samples
> are described as "Silicon Graphics Image" with PUID x-fmt/140, and the
> MIME type image/x-sgi-bw is listed. The artificial samples with 2 and 5
> channels are skipped. The TFM samples are not misidentified either.
> Furthermore, only the file name suffixes RGB and BW are considered
> valid here; the two suffixes RGBA and SGI are considered invalid (see
> appended droid-sgi.csv.gz).
>
> On Linux, according to the shared MIME-info database, the samples are
> called "SGI image". Here image/x-sgi is shown as the MIME type and only
> sgi is listed as suffix. That information can be seen in the
> freedesktop.org.xml.in source, found for example on
> gitlab.freedesktop.org.
>
> Luckily, with the help of these tools I found information about this
> graphic file format on the Archive Team web site and on Wikipedia. That
> is now expressed inside Magdir/images by comment lines like:
> # URL: http://fileformats.archiveteam.org/wiki/SGI_(image_file_format)
> # https://en.wikipedia.org/wiki/Silicon_Graphics_Image
> # Ref.: https://paulbourke.net/dataformats/sgirgb/sgiversion.html
> # http://mark0.net/download/triddefs_xml.7z
> # defs/b/bitmap-sgi.trid.xml
>
> The currently used URL at sgi.com is invalid because the server no
> longer exists. So I removed the old link.
>
> The description is currently done inside Magdir/images by lines like:
> 0 ubeshort 474 SGI image data
> #>2 ubyte 0 \b, verbatim
> >2 ubyte 1 \b, RLE
> #>3 ubyte 1 \b, normal precision
> >3 ubyte 2 \b, high precision
> >4 ubeshort x \b, %d-D
> >6 ubeshort x \b, %d x
> >8 ubeshort x %d
> >10 ubeshort x \b, %d channel
> >10 ubeshort !1 \bs
> >80 string >0 \b, "%s"
>
> Unfortunately only 16 bits are used for recognition. Apparently this
> magic is too weak, so a few TeX font metric files with the name suffix
> tfm are misidentified.
>
> To check whether samples really are SGI graphics you can use the
> command line tools of some graphics software (like ImageMagick or
> XnView) with lines like:
> identify -verbose *.*
> nconvert -in sgi -info *.*
> Looking at the output of these tools (see appended
> nconvert-info.txt.gz and identify.txt.gz) we see that the TFM samples
> are not graphics.
>
> To overcome the weak magic I first looked at the header fields that
> were not used so far. After the channel information come the two 4-byte
> fields PINMIN and PINMAX (at offsets 12 and 16). The first stores the
> minimum pixel value in the image; often the value is zero. The second
> stores the maximum pixel value in the image; often the value is 255. So
> unusual values are shown by additional lines like:
> >>12 ubelong !0 \b, %u PINMIN
> >>16 ubelong !255 \b, %u PINMAX
> Afterwards 4 DUMMY bytes are stored. According to the documentation
> these should be set to 0. As a check I show unexpected values by a line
> like:
> >>20 ubelong !0 \b, at 20 %#x
> At offset 104 (=0x68) the COLORMAP value is stored as a 4-byte
> big-endian integer. Only four values are mentioned (0~normal
> 1~DITHERED obsolete 2~SCREEN obsolete 3~COLORMAP). In the samples I
> inspected I only found the value zero. So other, unusual values are
> shown by a line like:
> >>104 ubelong !0 \b, %u COLORMAP
> Afterwards come 404 padding bytes that make the header exactly 512
> bytes long. According to the documentation these should be set to zero,
> but this is not always true. So unusual non-zero values are shown by
> lines like:
> >>111 ubyte !0 \b, at 111 %#x
> >>113 ubyte !0 \b, at 113 %#x
> >>118 ubeshort !0 \b, at 118 %#4.4x
> >>121 ubyte !0 \b, at 121 %#x
> >>132 ubelong !0 \b, at 132 %#8.8x
> >>135 ubyte !0 \b, at 135 %#x
> >>137 ubequad !0 \b, at 137 %#16.16llx
>
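The header layout described above can be sketched in Python. This is only an illustration under the assumptions stated in the quoted text (big-endian fields, PINMIN/PINMAX at offsets 12 and 16, image name at offset 24, COLORMAP at offset 104); the names SGI_HEADER and parse_sgi_header are hypothetical and not part of file(1).

```python
import struct

# First 108 bytes of the 512-byte header, all big-endian:
# magic, storage, bytes per channel, dimension, xsize, ysize, channels,
# PINMIN, PINMAX, 4 dummy bytes, 80-byte image name, COLORMAP.
# The remaining 404 bytes are padding.
SGI_HEADER = struct.Struct(">HBBHHHHII4s80sI")


def parse_sgi_header(data):
    """Return the header fields as a dict, or None if data is too short."""
    if len(data) < SGI_HEADER.size:
        return None
    (magic, storage, bpc, dimension, xsize, ysize, channels,
     pinmin, pinmax, dummy, name, colormap) = SGI_HEADER.unpack_from(data)
    return {
        "magic": magic,          # 474 (0x01DA) for SGI images
        "storage": storage,      # 0 = verbatim, 1 = RLE
        "bpc": bpc,              # 1 = normal, 2 = high precision
        "dimension": dimension,
        "xsize": xsize,
        "ysize": ysize,
        "channels": channels,
        "pinmin": pinmin,
        "pinmax": pinmax,
        "dummy": dummy,
        # The name is NUL-terminated inside its 80-byte field.
        "name": name.split(b"\x00", 1)[0].decode("ascii", errors="replace"),
        "colormap": colormap,
    }
```

The struct covers 108 bytes; together with the 404 padding bytes that gives the 512-byte header mentioned above.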
> None of these fields seems suited as an additional test criterion.
> Since zero values in the padding bytes do not seem to be reliable, I
> also do not trust the dummy bytes at offset 20. So I take the approach
> used by the DROID tool.
>
> For the STORAGE format only 2 values are allowed: 1 means RLE
> compressed and 0 means not compressed. These values are shown by lines
> like:
> #>>2 ubyte 0 \b, verbatim
> >>2 ubyte 1 \b, RLE
>
> For the number of bytes per pixel component only 2 values are allowed
> (1 or 2). These values are shown by lines like:
> #>>3 ubyte 1 \b, normal precision
> >>3 ubyte 2 \b, high precision
>
> So the first test now looks for the magic number (integer
> 474=0x01DA), the storage format (0 or 1) and the number of bytes per
> pixel channel (1 or 2), like the DROID tool does. The few
> misidentified TeX font metric files (like pxmi.tfm, pxmi1.tfm and
> eksfi8a.tfm, handled by Magdir/tex) with invalid "high" bytes/pixel
> values (11 or 12) are thereby skipped. This is done by a modified
> first line that now looks like:
> 0 ubelong&0xFFffFEfc 0x01da0000
>
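The effect of the masked comparison can be checked with a small Python sketch. It only mirrors the arithmetic of the quoted magic line; the function name passes_first_test is hypothetical, and the TFM byte sequences are constructed from the "11 12" values mentioned above rather than taken from the actual samples.

```python
import struct

MASK = 0xFFFFFEFC      # keep magic; ignore bit 0 of storage, bits 0-1 of bpc
EXPECTED = 0x01DA0000  # magic 474 with all remaining bits zero


def passes_first_test(data):
    """Mimic '0 ubelong&0xFFffFEfc 0x01da0000' on the first 4 bytes."""
    if len(data) < 4:
        return False
    (value,) = struct.unpack_from(">I", data, 0)
    return (value & MASK) == EXPECTED


# RLE, 1 byte per channel: accepted.
print(passes_first_test(bytes.fromhex("01da0101")))
# Verbatim, 2 bytes per channel: accepted.
print(passes_first_test(bytes.fromhex("01da0002")))
# Fourth byte 11 or 12, as in the TFM samples: rejected.
print(passes_first_test(bytes.fromhex("01da000b")))
print(passes_first_test(bytes.fromhex("01da000c")))
```

Note that the mask 0xFC on the fourth byte also lets the values 0 and 3 through, so the test is slightly more permissive than "1 or 2".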
> Unfortunately at that point the DROID sample
> x-fmt-140-signature-id-623.bw is still misidentified as a graphic. But
> this sample contains just some leading bytes of such a graphic; it is
> used by the DROID tool to recognize SGI graphics. In the current output
> the dimensions of this sample are shown as "0 x 0", whereas real
> samples of course show "XSIZE x YSIZE" with non-zero sizes. This
> information is shown by lines like:
> >>6 ubeshort x \b, %d x
> >>8 ubeshort x %d
> So the lines after the first test now become:
> >6 long !0 SGI image data
> !:mime image/x-sgi
> !:apple ????.SGI
> So the DROID sample is now skipped. On Wikipedia image/sgi is listed as
> the MIME type, but this is not officially registered at IANA. The DROID
> tool lists image/x-sgi-bw, which may apply only to black/white or
> grey-scale images. So I chose what is used on Linux systems by the
> database from freedesktop.org.
>
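The size check '>6 long !0' reads the four bytes at offset 6, which cover both XSIZE and YSIZE. A minimal Python sketch of that check, with the hypothetical name has_nonzero_size, could look like this; since only "not zero" is tested, byte order does not matter here.

```python
def has_nonzero_size(data):
    """Mimic '>6 long !0': bytes 6..9 (XSIZE and YSIZE) must not all be zero."""
    return len(data) >= 10 and data[6:10] != b"\x00\x00\x00\x00"


# A header start with 800 x 600 passes.
real = bytes.fromhex("01da0101" "0003" "0320" "0258")
print(has_nonzero_size(real))

# A header start with "0 x 0", like the DROID signature sample, fails.
stub = bytes.fromhex("01da0001" "0001" "0000" "0000")
print(has_nonzero_size(stub))
```

The check passes as soon as one of the two sizes is non-zero, so it is a plausibility test rather than a full validation of the dimensions.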
> According to the documentation the number of channels is stored at
> offset 10 as a 2-byte big-endian integer. Depending on that value
> different file name suffixes are used. The value 1 means black and
> white. The highest value observed in my samples was 4, which means RGB
> with an ALPHA channel. If I understand the documentation correctly, it
> may be possible to have samples with more channels. For example, I can
> imagine an animated RGBA image, where an additional time component is
> added and the channel number would be 5. Unfortunately I found no
> samples with the int suffix. I also found no sample with the inta
> suffix, which means black and white with an ALPHA channel. So channel
> information and the corresponding file name suffix are now shown by
> lines like:
> >>10 ubeshort x \b, %d
> >>>10 ubeshort 1 channel
> !:ext bw
> >>>10 ubeshort 3 channels
> !:ext rgb/sgi
> >>>10 ubeshort 4 channels
> !:ext rgba/sgi
> >>>10 default x channels
> !:ext sgi
>
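The mapping from channel count to text and suffixes in the lines above can be written as a short Python sketch. The function name describe_channels is hypothetical; the suffix lists are copied from the quoted !:ext lines.

```python
# Suffixes per channel count, taken from the quoted '!:ext' lines.
EXT_BY_CHANNELS = {
    1: ["bw"],
    3: ["rgb", "sgi"],
    4: ["rgba", "sgi"],
}


def describe_channels(channels):
    """Return (description, suffix list) for a given channel count."""
    word = "channel" if channels == 1 else "channels"
    # Any other count falls through to the 'default' branch: suffix sgi.
    suffixes = EXT_BY_CHANNELS.get(channels, ["sgi"])
    return f", {channels} {word}", suffixes


for n in (1, 2, 3, 4, 5):
    print(n, describe_channels(n))
```

Counts without a dedicated line, such as 2 or 5 in the artificial test samples, get the plural text and the generic sgi suffix.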
> For samples like norle-16.sgi a string like "n.1.sgi" is shown inside
> double quotes. This is done by a line like:
> >80 string >0 \b, "%s"
> But that is only part of the image name. According to the
> documentation an optional image name is stored after the dummy bytes
> and before the COLORMAP field. This is a null-terminated ASCII string
> with up to 79 characters. So the image name is shown correctly by a
> line like:
> >>24 string >\0 \b, "%0.80s"
>
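Reading the image name from offset 24 instead of offset 80 can be sketched in Python as follows. The function name image_name is hypothetical, and the field boundaries (offsets 24 to 103) are taken from the description above.

```python
def image_name(data):
    """Return the NUL-terminated name stored in bytes 24..103, or ''."""
    field = data[24:104]                 # 80-byte IMAGENAME field
    name = field.split(b"\x00", 1)[0]    # cut at the first NUL byte
    return name.decode("ascii", errors="replace")


# 24 bytes of preceding header, then the name padded to 80 bytes.
header = bytes(24) + b"no name".ljust(80, b"\x00")
print(repr(image_name(header)))

# An empty name field yields an empty string, so nothing would be shown.
print(repr(image_name(bytes(104))))
```

Starting at offset 80 would skip the first 56 bytes of this field, which is why the old line could show only the tail of a long name.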
> After applying the above-mentioned modifications with the patch
> file-5.45-images-sgi.diff and using Magdir/tex I get more precise
> output and the misidentification vanishes. With the -k option it looks
> like:
>
> abydos.rgba: SGI image data, RLE,
> 3-D, 800 x 600, 4 channels
> bw.rgb: SGI image data, RLE,
> 3-D, 256 x 256, 3 channels
> , "no name"
> eksfi8a.tfm: TeX font metric data (kerkisec)
> frog.rgb: SGI image data, RLE,
> 3-D, 496 x 497, 3 channels
> , "no name"
> , at 111 0x5, at 113 0x2
> , at 118 0x01f0, at 121 0x2
> , at 132 0x1001a174, at 135 0x74
> , at 137 0x0000000001a68410
> greytest.rgb: SGI image data, RLE,
> 2-D, 256 x 256, 1 channel
> , "no name"
> , 9 PINMIN, 146 PINMAX
> , at 111 0x4, at 113 0x2
> , at 118 0x00ff, at 132 0x10014df0
> , at 135 0xf0, at 137 0x00000000010f5c10
> input.sgi: SGI image data, RLE,
> 3-D, 70 x 46, 3 channels
> norle-16.sgi: SGI image data, high precision,
> 3-D, 100 x 63, 4 channels
> , "...rnold_SGI_Texture_
> Crash_Bugreport_01
> \Default_Pass_Main.1.sgi"
> , 65535 PINMAX
> pxmi.tfm: TeX font metric data (CMMIENCODING)
> pxmi1.tfm: TeX font metric data (CMMIENCODING)
> rle-8.sgi: SGI image data, RLE,
> 3-D, 100 x 63, 4 channels
> , "...rnold_SGI_Texture_
> Crash_Bugreport_01
> \Default_Pass_Main.1.sgi"
> rle.bw: SGI image data, RLE,
> 2-D, 150 x 97, 1 channel
> , "no name"
> , at 111 0x4, at 113 0x2
> , at 132 0x100105f0, at 135 0xf0
> , at 137 0x0000000000391810
> rle.rgb: SGI image data, RLE,
> 3-D, 150 x 97, 3 channels
> , "no name"
> , at 111 0x4, at 113 0x2
> , at 121 0x2, at 132 0x10011210
> , at 135 0x10, at 137 0x0000000000a35610
> sample_1920x1280.sgi: SGI image data, RLE,
> 3-D, 1920 x 1280, 3 channels
> test-2channels.sgi: SGI image data,
> 1-D, 1 x 1, 2 channels, 0 PINMAX
> test-5channels.sgi: SGI image data,
> 1-D, 1 x 1, 5 channels, 0 PINMAX
> transtexsphere.rgb: SGI image data, RLE,
> 3-D, 497 x 500, 3 channels
> , "no name"
> , 211 PINMAX
> , at 111 0x5, at 113 0x2
> , at 118 0x01f3, at 121 0x2
> , at 132 0x10019f28, at 135 0x28
> , at 137 0x00000000039f7610
> tree2.rgb: SGI image data, RLE,
> 3-D, 128 x 128, 3 channels
> , "no name"
> ver.bw: SGI image data,
> 2-D, 150 x 97, 1 channel
> , "no name"
> , at 111 0x4, at 113 0x2
> , at 132 0x100102e0, at 135 0xe0
> ver.rgb: SGI image data,
> 3-D, 150 x 97, 3 channels
> , "no name"
> , at 111 0x4, at 113 0x2
> , at 121 0x2, at 132 0x100108f0
> , at 135 0xf0
> x-fmt-140-signature-id-623.bw: data
>
> I hope my diff file can be applied in a future version of the file
> utility. Unfortunately the magic for TeX font metrics is also too weak
> and needs some polishing. For the TFM samples there exists no unique
> and long pattern, so I will need some time to do this work in the
> future.
>
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
> <file-5_39-images-cpi_diff.DEFANGED-112485><droid-sgi.csv.gz><nconvert-info.txt.gz><identify.txt.gz><trid-v-sgi.txt.gz><file-5_45-images-sgi_diff.DEFANGED-112486><file-5_45-images-sgi_diff_sig.DEFANGED-112487>--