[File] [PATCH] Magdir/wordprocessors gfxboot compiled html help misidentifies TeX font metric

Jörg Jenderek joerg.jen.der.ek at gmx.net
Mon Jun 17 00:27:48 UTC 2024


Hello,

some days ago I looked at the content of an exotic CD-ROM. It also
contains samples which are misidentified. These samples have the TFM
file name suffix and are TeX Font Metric data.

Unfortunately many TFM samples are misidentified as other file formats.
One TeX font metric sample (tri10u.tfm) is misidentified as a gfxboot
compiled html help file. I found this sample as "c:\Program
Files\MiKTeX\fonts\tfm\cg\times\tri10u.tfm" after installing MiKTeX
version 24.3 on Windows.

When running the file command version 5.45 on this sample, on real help
examples and on the related HTML files, I get output like:
de.hlp:     gfxboot compiled html help file
de.html:    HTML document, Unicode text, UTF-8 text
	    , with very long lines (508)
en.hlp:     gfxboot compiled html help file
en.html:    HTML document, ASCII text
	    , with very long lines (411)
it.hlp:     gfxboot compiled html help file
it.html:    HTML document, Unicode text
	    , UTF-8 text, with very long lines (525)
nl.hlp:     gfxboot compiled html help file
nl.html:    HTML document, Unicode text
	    , UTF-8 text, with very long lines (514)
tri10u.tfm: gfxboot compiled html help file

With the --extension option only ??? is displayed. Furthermore, with the
-i option only the generic application/octet-stream is shown for the
gfxboot samples.

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). Here only the HTML
samples are recognized. These are described as "Hypertext Markup
Language" with MIME type text/html by PUID fmt/96. The other samples are
not recognized.

For comparison I also ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). It also identifies
all such HLP examples as "gfxboot compiled HTML Help". Some samples
(like en.hlp) are described as variant (opt) by hlp-gfxboot-opt.trid.xml
with MIME type application/x-gfxboot-hlp and hlp file name suffix. Other
samples (like nl.hlp) are described as variant (main) by
hlp-gfxboot-main.trid.xml, also with MIME type application/x-gfxboot-hlp
and hlp file name suffix. TrID lists the file name extension used and,
with the -v option, the related URL pointing to the file format
information (see appended trid-v.txt.gz).

The example is recognized by this line inside Magdir/wordprocessors:
0 ulelong&0x8080FFFF	0x00001204	gfxboot compiled html help file
So only 18 bits are used for recognition. Apparently this is not always
sufficient.
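
As a rough illustration of why this mask is weak, the following Python
sketch (not part of the patch) mimics the test done by the old magic
line. The function name old_rule_matches and all sample byte strings are
made up for this illustration. The second input is a hypothetical
TFM-like header: a TFM file starts with two big-endian 16 bit fields,
the file length in words and the header length in words, so a file of
0x0412 words with a small header length passes the test. The real bytes
of tri10u.tfm are not shown in this mail, so these values are only an
assumption.

```python
import struct

MASK = 0x8080FFFF   # mask used by the old magic line
MAGIC = 0x00001204  # expected value after masking


def old_rule_matches(data: bytes) -> bool:
    """Mimic '0 ulelong&0x8080FFFF 0x00001204' on the first 4 bytes."""
    if len(data) < 4:
        return False
    (value,) = struct.unpack_from("<I", data, 0)  # little-endian 32 bit
    return (value & MASK) == MAGIC


# Number of bits that actually take part in the comparison.
print(bin(MASK).count("1"))

# Hypothetical gfxboot header: page token, label token, label "main".
print(old_rule_matches(b"\x04\x12main"))

# Hypothetical TFM-like header: lf=0x0412 words, lh=0x0012 words.
print(old_rule_matches(b"\x04\x12\x00\x12"))
```

Both sample inputs pass, because bytes 2 and 3 only need a cleared high
bit, which is true for ASCII label text as well as for small binary
values.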

Luckily, with the information given by the other tools, I found pages
about Gfxboot on the openSUSE and GitHub web servers. This information
is expressed by comment lines inside Magdir/wordprocessors like:
# URL:	https://en.opensuse.org/Gfxboot
# Ref.:	https://github.com/openSUSE/gfxboot/blob/master/gfxboot
#	http://mark0.net/download/triddefs_xml.7z
#	defs/h/hlp-gfxboot-main.trid.xml,hlp-gfxboot-opt.trid.xml
Whether an HLP sample is a real compiled html help file can be verified
by a command like:
	gfxboot --help-show	en.hlp >	en.html
The GFXBOOT(1) man page describes how the step from HTML to HLP
is done. This happens by a command line like:
	gfxboot --help-create	en.html >	en.hlp

In this step the tool "compiles" the "readable" HTML text and generates
binary HLP help pages from it. These can be considered "tokenized" HTML
pages. How this happens can be seen by looking inside the Perl script
gfxboot. The relevant lines for identification are like:
	page         => "\x04",      # start new page
	label        => "\x12",      # label start, no text output
So we see that the byte sequence 04 12 at the beginning means start new
page followed by label start, without text output. Afterwards comes the
ASCII label name.

Now comes the interesting part. In theory you can compile an HTML page
about anything under the sun, but in reality the HLP samples are used as
help text for boot loaders like GRUB, syslinux and so on. So in real
world examples I got only 2 label names. In about half of the samples
the first label is the 4 byte string main. Similar to a C program, where
execution starts at the function main, here main seems to be used as
first label. In the other half of the samples the first label is the
3 byte string opt. Apparently these start with a section about options
for booting. These 2 branches are also used by TrID. So I skip
tri10u.tfm by additional lines checking for the 2 possible labels. This
now looks like:
0	ulelong&0x8080FFFF	0x00001204
>2	regex			\^(main|opt)	gfxboot compiled html help file, label %s
!:mime	application/x-gfxboot-hlp
!:ext	hlp
Then after the label name comes the title token (\x14). Afterwards comes
the title text, which itself ends with the token \x10. So I also show
the informative title by the last additional lines. These look like:
>>&0	ubyte			0x14		\b, title
>>>&0	regex			\^[[:print:]]+	'%s'
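
The chain of checks above can be sketched in Python as follows. This is
only an approximation of what the magic lines do, not how file(1) works
internally. The function name describe and the sample byte strings are
invented for illustration (built from the token description above, not
copied from a real HLP file), and [[:print:]] is assumed to mean the
printable ASCII range 0x20-0x7e as in the C locale.

```python
import re
import struct

LABEL_RE = re.compile(rb"(main|opt)")    # the 2 label names seen so far
TITLE_RE = re.compile(rb"[\x20-\x7e]+")  # assumed meaning of [[:print:]]+


def describe(data: bytes):
    """Return a description string, or None if the data does not match."""
    if len(data) < 4:
        return None
    (head,) = struct.unpack_from("<I", data, 0)
    if head & 0x8080FFFF != 0x00001204:  # page token 0x04, label token 0x12
        return None
    label = LABEL_RE.match(data, 2)      # label name must start at offset 2
    if label is None:
        return None
    text = "gfxboot compiled html help file, label " + label.group(1).decode("ascii")
    pos = label.end()                    # like '&0': right after the match
    if pos < len(data) and data[pos] == 0x14:   # title token
        title = TITLE_RE.match(data, pos + 1)   # stops at first non-printable
        if title is not None:
            text += ", title '" + title.group(0).decode("ascii") + "'"
    return text


# Hypothetical HLP start: page, label "opt", title token, title, \x10.
print(describe(b"\x04\x12opt\x14Boot Options\x10..."))

# Hypothetical TFM-like header that passed the old 18 bit test.
print(describe(b"\x04\x12\x00\x12\x00\x00"))
```

The title match stops by itself at the \x10 end token, because that byte
is not printable, so no explicit search for the terminator is needed.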

After applying the above mentioned modifications with the patch
file-5.45-wordprocessors-hlp-gfxboot.diff and using Magdir/sgml, the
misidentification of the TeX font metric sample vanishes and I also get
more precise output like:
de.hlp:     gfxboot compiled html help file
	    , label opt, title 'Bootoptionen'
de.html:    HTML document, Unicode text, UTF-8 text
	    , with very long lines (508)
en.hlp:     gfxboot compiled html help file
	    , label opt, title 'Boot Options'
en.html:    HTML document, ASCII text
	    , with very long lines (411)
it.hlp:     gfxboot compiled html help file
	    , label opt, title 'Opzioni di avvio'
it.html:    HTML document, Unicode text
	    , UTF-8 text, with very long lines (525)
nl.hlp:     gfxboot compiled html help file
	    , label main, title 'Help voor bootloader'
nl.html:    HTML document, Unicode text
	    , UTF-8 text, with very long lines (514)
tri10u.tfm: data

I hope my diff file can be applied in a future version of the file
utility. An improvement for TeX Font Metric data is still missing. I am
working on that item. There the magic pattern is also weak (16 bits) and
not very unique. There also seem to exist a dozen variants.

With best wishes
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
--- file-5.45/magic/Magdir/wordprocessors.old	2023-02-09 18:43:53.000000000 +0100
+++ file-5.45/magic/Magdir/wordprocessors	2024-06-16 15:05:02.132196000 +0200
@@ -568,8 +568,23 @@
 !:mime	application/x-scribus
 
 # help files .hlp compiled from html and used by gfxboot added by Joerg Jenderek
+# URL: 		https://en.opensuse.org/Gfxboot
+# Reference:	https://github.com/openSUSE/gfxboot/blob/master/gfxboot
+#		http://mark0.net/download/triddefs_xml.7z/defs/h/hlp-gfxboot-main.trid.xml,hlp-gfxboot-opt.trid.xml
+# Note:		called "gfxboot compiled html help" (main),(opt) by TrID
+#		verified by command like `gfxboot --help-show en.hlp > en.html`
 # markups page=0x04,label=0x12, followed by strings like "opt" or "main" and title=0x14
-0	ulelong&0x8080FFFF	0x00001204	gfxboot compiled html help file
+0	ulelong&0x8080FFFF	0x00001204
+# display "gfxboot compiled html help file" (strength=70) after one "TeX font metric data" (tri10u.tfm strength=71=50+21) handled by ./tex
+#!:strength +0
+>2	regex			\^(main|opt)	gfxboot compiled html help file, label %s
+#!:mime	application/octet-stream
+!:mime	application/x-gfxboot-hlp
+!:ext	hlp
+# check for title token \x14
+>>&0	ubyte			0x14		\b, title
+# title text ends with \x10
+>>>&0	regex			\^[[:print:]]+	'%s'
 
 # From:		Joerg Jenderek
 # URL:		https://en.wikipedia.org/wiki/StarOffice
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v.txt.gz
Type: application/x-gzip
Size: 787 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240617/6c7e8a9c/attachment-0001.bin>