[File] [PATCH] Magdir/ispell affix definition *.aff without russian support until war against Ukraine

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Sat Jul 29 16:26:44 UTC 2023


Hello,

A few days ago I ran Piriform CCleaner. Its registry cleaner has an
item for file extensions that scans for errors, and it complained
about the file name suffix AFF.

So I looked for such files on my systems. Some samples are affix
definition text files, and in this session I only consider such text
samples. On UNIX-like systems the ispell variant samples are typically
found in directory /usr/lib/ispell, the MySpell variant samples in
/usr/share/myspell and the Hunspell variant samples in
/usr/share/hunspell. Such samples are also found beneath directory
/usr/src/dicts. Luckily such systems have package management, so
programs that need spell checking often get the affix files by
depending on spelling packages. Unfortunately Windows systems have no
such package management, so every program with spell checking ships the
affix definitions inside its own program directory. Software that
behaves in this manner includes: Calibre, LibreOffice, Scribus,
LanguageTool, Firefox, Thunderbird, gImageReader, Emacs, Gramps.

When running file command version 5.45 on such affix samples and on
"related" files (*foo*) I get output like:

1463589.aff:       ASCII text
1695964.aff:       Unicode text, UTF-8 text
2970240.aff:       ASCII text
ar.aff:            Unicode text, UTF-8 text
bulgarian.aff:     ISO-8859 text
de_DE.aff:         ISO-8859 text
de_DE_frami.aff:   ISO-8859 text
discover-foo.conf: ASCII text
en-GB-cal.aff:     Unicode text, UTF-8 (with BOM) text
en_GB.aff:         Unicode text, UTF-8 text
en_US.aff:         ASCII text
it_IT.aff:         Unicode text, UTF-8 text
ngerman.aff:       Unicode text, UTF-8 text
nilfs_foo.conf:    ASCII text
polish.aff:        Unicode text, UTF-8 text
sv_SE.aff:         ISO-8859 text
tr_TR.aff:         Unicode text, UTF-8 text

With option --extension only the 3 byte sequence ??? is shown, and with
option -i the generic text/plain is shown.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It does not recognize
the samples.

For comparison I also ran the file format identification utility TrID
(see https://mark0.net/soft-trid-e.html). A few examples like en_US.aff
and sv_SE.aff are recognized and described correctly as "Affix file" by
affix.trid.xml with the correct file name suffix AFF (see appended
trid-v-aff.txt.gz).

TrID lists the file name extension used. With the help of these tools I
found manual pages with sections about file formats and conventions for
ispell, Hunspell dictionaries and affix files. This is now expressed
inside Magdir/ispell by additional comment lines like:
# URL:	https://www.openoffice.org/lingucomponent/affix.readme
#	https://man.archlinux.org/man/hunspell.5.en
# 	https://manpages.debian.org/testing/ispell/ispell.5.en.html
# Reference:	http://mark0.net/download/triddefs_xml.7z
#		defs/a/affix.trid.xml

Unfortunately there is no strict and unique pattern that can be used
as a magic pattern. So I put the displaying part in subroutine spell-aff
inside Magdir/ispell. This starts like:
  0		name				spell-aff
  >1		ubeshort	x		affix definition
  !:mime	text/x-affix
  !:ext		aff
Instead of the generic mime type text/plain a user-defined one is shown.

At the end of the subroutine, for control purposes, the first lines are
shown if they are not empty. For the variant starting with a Byte Order
Mark (BOM=\xEF\xBB\xBF) this looks like:
  >1		ubeshort	=0xBBBF	   	\b, with BOM
  >>3		string		x		\b, 1st line "%s"
  >>>&1		ubyte		>0x1F		\b, 2nd line
  >>>>&-1		string		x		"%s"
For the variant without BOM this part looks like:
  >1		ubeshort	!0xBBBF
  >>0		ubyte		=0x0A
  >>>1		ubyte		!0x0A		\b, 2nd line
  >>>>&-1	string		x		"%s"
  >>>>>&1	ubyte		=0x0A
  >>>>>>&0	string		x		\b, 4th line "%s"
  >>0		ubyte		!0x0A
  >>>0		string		x		\b, 1st line "%s"
  >>>>&1		ubyte		>0x1F		\b, 2nd line
  >>>>>&-1	string		x		"%s"
  >>>>&1		ubyte		=0x0A		\b, 3rd line
  >>>>>&0	string		x		"%s"
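
The line display above can be approximated outside of file(1). The
following is only a rough Python sketch, not what libmagic does
internally: the function name leading_lines and the choice to decode as
Latin-1 are assumptions made for illustration, and it generalizes the
magic lines by skipping any number of empty lines.

```python
from typing import List, Tuple

BOM = b"\xef\xbb\xbf"  # UTF-8 Byte Order Mark


def leading_lines(data: bytes, count: int = 2) -> List[Tuple[int, str]]:
    """Return up to `count` non-empty leading lines as (line number, text).

    Hypothetical helper: a leading BOM is dropped first, empty lines are
    skipped but still counted, so polish.aff-like input yields lines 2 and 4.
    """
    if data.startswith(BOM):
        data = data[len(BOM):]
    found = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if raw.strip():
            # Latin-1 maps every byte to a character, so decoding never fails
            found.append((number, raw.decode("latin-1")))
        if len(found) == count:
            break
    return found
```

For input that starts with an empty line, like polish.aff, this reports
the 2nd and 4th line, matching the line labels used in the magic lines.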

So for the first lines I get output like:
"# this is the affix file of the de_DE Hunspell dictionary"
"# Affix file for British English MySpell dictionary."
"#\011Ispell affix table for German language"
"# Dizionario italiano"
"# Note: this file requires ispell 3.4.01 or later."
"#   Affix table for Bulgarian"
This gives a hint whether the file is made for the Ispell variant or
the MySpell/Hunspell variant. It also gives a hint about the language
covered by the file.

According to the hunspell man page I look for the language code command
(the upper-case keyword LANG) and its argument langcode. That is done
inside the subroutine by lines like:
  >>0		search/1117643	LANG\040	\b, language
  >>>&0		string		x		%s
Many Hunspell samples (like /usr/share/hunspell/de_AT.aff,
/usr/share/hunspell/it_IT.aff, /usr/share/hunspell/tr_TR.aff and
/usr/lib/firefox/browser/extensions/langpack-hu@firefox.mozilla.org/dictionaries/hu.aff)
contain such a language directive, and the language code argument looks
like:
de_DE hu_HU it_IT mn_MN tr_TR
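
As an illustration of the LANG lookup, here is a minimal Python sketch.
The function name language_code is hypothetical, and unlike the magic
lines (which print the rest of the line) it returns only the first
whitespace-delimited token after the keyword.

```python
import re
from typing import Optional

# keyword LANG followed by one space, then the language code argument
LANG_RE = re.compile(rb"LANG (\S+)")


def language_code(data: bytes) -> Optional[str]:
    """Return the argument of the first LANG directive, or None.

    Like the search in the magic lines, the keyword is not required
    to sit at the beginning of a line.
    """
    match = LANG_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace")
```

Applied to a sample that starts with "LANG tr_TR" this returns "tr_TR",
and for a sample without the directive it returns None.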

A few samples like /usr/share/hunspell/tr_TR.aff start directly with
that directive. For such samples the detection happens by lines like:
  0		string		LANG\040
  >0		use		spell-aff

The TrID tool looks for the 4 byte sequence SET\040 at the beginning.
According to the hunspell man page this command sets the character
encoding of words and morphemes in affix and dictionary files. Possible
values are:
UTF-8 ISO8859-1 - ISO8859-10 ISO8859-13 - ISO8859-15 KOI8-R KOI8-U
cp1251 ISCII-DEVANAGARI

Unfortunately this directive does not always come at the beginning.
Often the separator is 1 space character (0x20), but sometimes a
tabulator character (0x09) is used, like in
/opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff.
Unfortunately similar looking phrases occur in some ispell affix files.
In /usr/lib/ispell/ngerman.aff I found a line like
# PARTICULAR SETTINGS FOR ISPELL ARE NECESSARY !!!
and in /usr/lib/ispell/ogerman.aff I found a line starting like:
#   sS			>	-sS,SSET	#     schosS	>

So I look for the character SET command used in MySpell and Hunspell and
its encoding argument inside the subroutine by lines like:
  >>0		search/1117729	SET
  >>>&0	ubyte&0xD6	=0x00
  >>>>&0		ubyte		>0x48		\b,
  >>>>>&-1	string	x			"%s" encoded
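
The mask test ubyte&0xD6 =0x00 is compact but not obvious, so here is a
small Python sketch of the same idea. The function name
encoding_argument is hypothetical. Like the magic search, the sketch
looks only at the first occurrence of SET and does not retry later
ones. The mask accepts every byte that uses only the bits 0x20, 0x08
and 0x01, which covers space (0x20) and tabulator (0x09) but rejects
the letter T (0x54) that follows in SETTINGS.

```python
from typing import Optional

SEPARATOR_MASK = 0xD6   # bits that must be clear in the separator byte
MIN_FIRST_BYTE = 0x48   # first argument byte must be greater than 'H'


def encoding_argument(data: bytes) -> Optional[str]:
    """Return the encoding named after the first SET keyword, or None."""
    pos = data.find(b"SET")
    if pos < 0 or pos + 4 >= len(data):
        return None
    separator = data[pos + 3]
    first = data[pos + 4]
    if separator & SEPARATOR_MASK != 0:
        return None  # e.g. "SETTINGS": 'T' has masked bits set
    if first <= MIN_FIRST_BYTE:
        return None  # e.g. "SSET\t#": '#' is 0x23
    end = pos + 4
    while end < len(data) and data[end] not in b"\r\n":
        end += 1
    return data[pos + 4:end].decode("latin-1")
```

The encoding names listed by the hunspell man page start with U, I, K
or c, which are all greater than 0x48, so they pass the second test.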

Some samples like org/languagetool/resource/sv/hunspell/sv_SE.aff start
with the SET directive. These are detected by lines like:
  0		string		SET\040
  >0		use		spell-aff

Some samples like /usr/share/calibre/dictionaries/en-GB/en-GB.aff start
with a comment line and the SET directive comes later. There I must also
explicitly check for the encoding string in order to skip some scripts
(like /bin/affixcompress /bin/setupcon /bin/imdbpy2sql.py). Such samples
are described by lines like:
  0		ubyte		0x23
  >0		search/60459	SET\040
  >>&0		string		UTF-8
  >>>0		use		spell-aff
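
To make the purpose of the encoding check clearer, here is a hedged
Python sketch of this branch only. The name looks_like_commented_affix
is hypothetical, the accepted prefixes are the ones with an active use
line in the patch, and treating the search range as a plain byte limit
is a simplification of how search/N works.

```python
# encoding prefixes whose branches call spell-aff in the patch
ACCEPTED_PREFIXES = (b"UTF-8", b"ISO8859-", b"ISCII-")


def looks_like_commented_affix(data: bytes, search_range: int = 60459) -> bool:
    """True if data starts with '#' and has 'SET ' plus a known encoding.

    Only the SET branch is sketched; the fallback branches of the patch
    (SFX, defstringtype, suffixes) are left out.
    """
    if not data.startswith(b"#"):
        return False
    pos = data.find(b"SET ", 0, search_range)
    if pos < 0:
        return False
    # the argument must begin directly after the 4 byte keyword
    return data.startswith(ACCEPTED_PREFIXES, pos + 4)
```

A shell script that starts with "#" and merely contains the word SET
somewhere is rejected because no known encoding name follows it.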

Besides UTF-8 I must also check for other encodings like KOI8-R, KOI8-U
and cp1251, but I am not willing to support the Cyrillic alphabet as
long as Russia wages war against Ukraine. So I do not implement the
branches for Russian and Cyrillic encodings.

Some samples like
/opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/lt.aff
start with a UTF-8 Byte Order Mark (BOM=\xEF\xBB\xBF). Such samples are
described by lines like:
  0		string		\xEF\xBB\xBF
  >3		string		\x23
  >3		search/9883	SET\040
  >>0		use		spell-aff

Unfortunately a few Hunspell samples like 1463589.aff and 2970240.aff,
found as unit tests inside the Thunderbird sources, do not contain the
typical keywords. I could try to implement an exception for every such
sample, but the magic lines would then get more complicated.
So in the subroutine I first look for keywords suited for Ispell
(defstringtype, and suffixes followed by flag), and if I do not find
such words I assume it is the Hunspell variant by using the default
clause. Unfortunately I must use an extra test line so that the default
clause works. So the part for the different spell variants inside the
subroutine now looks like:
  >0		ubyte		x
  >>0		search/8251	defstringtype	for Ispell
  >>0		default		x
  >>>0		search/3233	suffixes
  >>>>&0		search/2	flag		for Ispell
  >>>>&0		default		x		for MySpell/Hunspell
  >>>0		default		x		for MySpell/Hunspell
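
The decision logic above can be written down as a short Python sketch.
The names search and spell_variant are hypothetical, and the helper
assumes that search/N tries N candidate start positions, which is my
reading of the magic format and should be treated as an assumption.

```python
def search(data: bytes, needle: bytes, start: int, positions: int) -> int:
    """Return the offset just past `needle` if it starts at one of
    `positions` consecutive offsets beginning at `start`, else -1."""
    hit = data.find(needle, start, start + positions - 1 + len(needle))
    return -1 if hit < 0 else hit + len(needle)


def spell_variant(data: bytes) -> str:
    """Classify affix text as 'Ispell' or 'MySpell/Hunspell'."""
    # an ispell declaration is a strong hint
    if search(data, b"defstringtype", 0, 8251) >= 0:
        return "Ispell"
    # ispell without declaration: 'suffixes' directly followed by 'flag'
    after = search(data, b"suffixes", 0, 3233)
    if after >= 0 and search(data, b"flag", after, 2) >= 0:
        return "Ispell"
    # default clause: everything else is assumed to be MySpell/Hunspell
    return "MySpell/Hunspell"
```

The check for flag right after suffixes keeps Hunspell comments such as
"suffixes used to create first part of a compound" from being taken
for the Ispell suffixes section.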

After applying the above mentioned modifications with patch
file-5.45-ispell-aff.diff all my real affix samples are now described
in detail with the correct name suffix. Some test affix files like
1463589.aff and 2970240.aff, found for example inside the Thunderbird
sources, are still not recognized. The output now looks like:

1463589.aff:       ASCII text
1695964.aff:       affix definition for MySpell/Hunspell
		   , 1st line
		   "# fix NEEDAFFIX homonym suggestion."
		   , 2nd line
		   "# Sf.net Bug ID 1695964, reported by
		   Bj\303\266rn Jacke."
2970240.aff:       ASCII text
ar.aff:            affix definition for MySpell/Hunspell
		   , "UTF-8" encoded
		   , 1st line "FLAG\011long"
		   , 2nd line "AF 333"
bulgarian.aff:     affix definition for Ispell
		   , 1st line
		   "#   Affix table for Bulgarian"
		   , 3rd line
		   "nroffchars\011().\\*"
de_DE.aff:         affix definition for MySpell/Hunspell
		   , language de_DE
		   , "ISO8859-1" encoded
		   , 1st line
		   "# this is the affix file of the
		   de_DE Hunspell dictionary"
		   , 2nd line
		   "# derived from the igerman98 dictionary"
de_DE_frami.aff:   affix definition for MySpell/Hunspell
		   , language de_DE
		   , "ISO8859-1" encoded
		   , 1st line "# this is the affix file of the
		   de_DE Hunspell dictionary"
		   , 2nd line "# derived from the igerman98 dictionary"
discover-foo.conf: ASCII text
en-GB-cal.aff:     affix definition for MySpell/Hunspell,
		   "UTF-8" encoded
		   , with BOM
		   , 1st line
		   "# Affix file for
		   British English MySpell dictionary."
		   , 2nd line
		   "# Also, suitable as basis for
		   Commonwealth and European English."
en_GB.aff:         affix definition for MySpell/Hunspell
		   , "UTF-8" encoded
		   , 1st line
		   "# Affix file for
		   British English MySpell dictionary"
		   , 2nd line
		   "# Also suitable as basis for
		   Commonwealth and European English."
en_US.aff:         affix definition for MySpell/Hunspell
		   , "ISO8859-1" encoded
		   , 1st line
		   "SET ISO8859-1"
		   , 2nd line
		   "TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'"
it_IT.aff:         affix definition for MySpell/Hunspell
		   , language it_IT
		   , "UTF-8" encoded
		   , 1st line
		   "# Dizionario italiano"
		   , 2nd line
		   "#"
ngerman.aff:       affix definition for Ispell
		   , 1st line
		   "#"
		   , 2nd line
		   "#\011Ispell affix table for German language"
nilfs_foo.conf:    ASCII text
polish.aff:        affix definition for Ispell
		   , 2nd line
		   "# SJP.PL"
		   , 4th line
		   "# Note: this file requires ispell 3.4.01 or later."
sv_SE.aff:         affix definition for MySpell/Hunspell
		   , "ISO8859-1" encoded
		   , 1st line
		   "SET ISO8859-1"
		   , 2nd line
		   "TRY aerndtislogmkpbhfjuv\344c\366\345
		   yqxzvw\351\342\340\341\350"
tr_TR.aff:         affix definition for MySpell/Hunspell
		   , language tr_TR
		   , "UTF-8" encoded
		   , 1st line
		   "LANG tr_TR"
		   , 2nd line
		   "SET UTF-8"


I hope my diff file can be applied in a future version of the file
utility. I hope that I did not miss a real aff sample and that my magic
test lines are unique enough.

There is still something to do. The suffix AFF is also used for other
file formats.

With best wishes,
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-aff.txt.gz
Type: application/x-gzip
Size: 495 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20230729/2c168020/attachment-0001.bin>
-------------- next part --------------
--- file-5.45/magic/Magdir/ispell.old	2021-02-23 00:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/ispell	2023-07-28 23:36:27.977560800 +0200
@@ -1,7 +1,7 @@
 
 #------------------------------------------------------------------------------
 # $File: ispell,v 1.8 2009/09/19 16:28:10 christos Exp $
-# ispell:  file(1) magic for ispell
+# ispell:  file(1) magic for ispell, MySpell and Hunspell
 #
 # Ispell 3.0 has a magic of 0x9601 and ispell 3.1 has 0x9602.  This magic
 # will match 0x9600 through 0x9603 in *both* little endian and big endian.
@@ -61,3 +61,148 @@
 >12     long            x               lexsize %d,
 >16     long            x               hashsize %d,
 >20     long            x               stblsize %d
+
+# Summary:	affixes defition text files for Ispell/MySpell/Hunspell
+# From:		Joerg Jenderek
+# URL:		https://www.openoffice.org/lingucomponent/affix.readme
+#		https://man.archlinux.org/man/hunspell.5.en
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/a/affix.trid.xml
+# Note:		called "Affix file" by TrID
+# variant starting with comment character
+0		ubyte		0x23
+# look for SET character command followed by whitespace (seems to be often 1 space character) like in:
+# /usr/share/calibre/dictionaries/en-GB/en-GB.aff
+>0		search/60459	SET\040
+# skip scripts like /bin/affixcompress /bin/setupcon /bin/imdbpy2sql.py by checking for valid character SET argument
+# character SET argument like: UTF-8
+>>&0		string		UTF-8
+>>>0		use					spell-aff
+# character SET argument like: ISO8859-1 - ISO8859-10 ISO8859-13 - ISO8859-15
+>>&0		string		ISO8859-
+>>>0		use				spell-aff
+# character SET argument for Russian with Cyrillic alphabet like: KOI8-R KOI8-U
+# no russian support until war against ukraine
+>>&0		string		KOI8-
+#>>>0		use				spell-aff
+# character SET argument for languages with Cyrillic alphabet like: cp1251
+# no cyrillic support until russia war against ukraine
+>>&0		string		cp1251
+#>>>0		use				spell-aff
+# character SET argument for Indian Script Code for Information Interchange (ISCII) like: ISCII-DEVANAGARI
+>>&0		string		ISCII-
+# no example found
+>>>0		use				spell-aff
+# not "real" affix rule files but found as tests unit inside thunderbird sources like:
+# 1463589.aff 1695964.aff 2970240.aff
+>0		default		x
+# look for suffix SFX command followed by whitespace like in:
+# 1695964.aff
+>>0		search/164	SFX\040
+>>>0		use				spell-aff
+# if not real Hunspell/MySpell affix look for ispell variant
+>>0		default		x
+# URL:		https://manpages.debian.org/testing/ispell/ispell.5.en.html
+# look for ispell declaration like in: /usr/lib/ispell/espanol.aff
+>>>0		search/8251	defstringtype
+# defstringtype declaration start with unique name (like "list" "lat" "utf8" "iso" "nroff" often like formatter name)
+# followed by formatter name (like "nroff" "tex")
+# followed by suffix list (like ".mm" ".ms" ".me" ".man" ".NeXT" ".txt" ".list")
+#>>>>&1		string		x		DECLARATION=%s
+>>>>0		use				spell-aff
+# ispell variant without declaration like in: /usr/lib/ispell/bulgarian.aff /usr/lib/ispell/russian.aff
+>>>0		default		x
+# skip /etc/nilfs_cleanerd.conf by looking for ispell suffix section
+>>>>0		search/3233	suffixes\n
+>>>>>0		use				spell-aff
+# variant starting with empty line and comment character at the beginning of 2nd line like in: /usr/lib/ispell/polish.aff
+0		ubeshort	0x0a23
+# skip /etc/discover-modprobe.conf by looking for ispell declaration
+>2		search/3118	defstringtype
+>>0		use				spell-aff
+# starting with UTF-8 Byte Order Mark (BOM) https://en.wikipedia.org/wiki/Byte_order_mark
+0		string		\xEF\xBB\xBF
+# starting with UTF-8 Byte Order Mark (BOM) followed by comment starting character
+>3		string		\x23
+# starting with UTF-8 BOM and with SET character command followed by whitespace
+# like in: /opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/lt.aff
+# look for character SET command used in MySpell and Hunspell
+>3		search/9883	SET\040
+>>0		use				spell-aff
+# look for FLAG type command used in MySpell and Hunspell
+0		string		FLAG
+# followed by space character like in
+# /opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/en_US.aff
+>4		ubyte		0x20
+>>0		use				spell-aff
+# or followed by tabulator character like in
+# /opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff
+>4		ubyte		0x09
+>>0		use				spell-aff
+# starting with character SET command used in MySpell and Hunspell like in: org/languagetool/resource/sv/hunspell/sv_SE.aff
+0		string		SET\040
+>0		use				spell-aff
+# starting with language code LANG used in MySpell and Hunspell like in: /usr/share/hunspell/tr_TR.aff
+0		string		LANG\040
+>0		use				spell-aff
+# starting with affix flag command AF used in MySpell and Hunspell like in: /usr/lib/thunderbird/extensions/langpack-hu at thunderbird.mozilla.org/dictionaries/hu.aff
+0		string		AF\040
+# look for number of flag vector aliases
+>3		regex		[0-9]{1,4}
+>>0		use				spell-aff
+#	display information (encoding,language,...) about affixes rules text for Ispell/MySpell/Hunspell
+0		name				spell-aff
+>1		ubeshort	x		affix definition
+#!:mime		text/plain
+!:mime		text/x-affix
+!:ext		aff
+# GRR: need extra test so that default clause works
+>0		ubyte		x
+# look for ispell declaration
+>>0		search/8251	defstringtype	for Ispell
+# ispell variant without declaration
+>>0		default		x
+# look for ispell suffixes command
+>>>0		search/3233	suffixes
+# skip "suffixes used to create first part of a compound" by checking for flag argument like in: languagetool\resource\sv\hunspell\sv_SE.aff
+>>>>&0		search/2	flag		for Ispell
+>>>>&0		default		x		for MySpell/Hunspell
+# without suffixes keyword
+>>>0		default		x		for MySpell/Hunspell
+# look for language code command used in MySpell and Hunspell
+# like in: /usr/share/hunspell/de_AT.aff /usr/share/hunspell/it_IT.aff /usr/share/hunspell/tr_TR.aff /usr/lib/firefox/browser/extensions/langpack-hu at firefox.mozilla.org/dictionaries/hu.aff
+>>0		search/1117643	LANG\040	\b, language
+# language code argument like: de_DE hu_HU it_IT mn_MN tr_TR
+>>>&0		string		x		%s
+# look for character SET command used in MySpell and Hunspell
+>>0		search/1117729	SET
+# skip SETTINGS like in /usr/lib/ispell/ngerman.aff
+# SET command followed often by space character (0x20) or tabulator (0x09) like in
+# /opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff
+>>>&0	ubyte&0xD6	=0x00
+# skip SSET	#     schosS in /usr/lib/ispell/ogerman.aff
+>>>>&0		ubyte		>0x48		\b,
+# character SET argument like: cp1251 ISCII-DEVANAGAR ISO8859-1 - ISO8859-10 ISO8859-13 - ISO8859-15 KOI8-R KOI8-U UTF-8
+>>>>>&-1	string	x			"%s" encoded
+# for control reasons show first non empty lines for ASCII or ISO-8859 text variant
+>1		ubeshort	!0xBBBF
+# 1st line starting with 0x0A like in /usr/src/dicts/sjp-ispell-pl-20140213/polish.aff
+>>0		ubyte		=0x0A
+>>>1		ubyte		!0x0A		\b, 2nd line
+>>>>&-1		string		x		"%s"
+# 3rd line starting with 0x0A like in polish.aff
+>>>>>&1		ubyte		=0x0A
+>>>>>>&0	string		x		\b, 4th line "%s"
+# 1st line starting with ASCII text like: 
+# this is the affix file of the de_DE Hunspell dictionary
+>>0		ubyte		!0x0A
+>>>0		string		x		\b, 1st line "%s"
+>>>>&1		ubyte		>0x1F		\b, 2nd line
+>>>>>&-1	string		x		"%s"
+# 2nd line starting with 0x0A like in /usr/lib/ispell/bulgarian.aff
+>>>>&1		ubyte		=0x0A		\b, 3rd line
+>>>>>&0		string		x		"%s"
+# for control reasons show first lines for variant starting with ByteOrderMark (BOM=\xEF\xBB\xBF)
+>1		ubeshort	=0xBBBF	   	\b, with BOM
+>>3		string		x		\b, 1st line "%s"
+>>>&1		ubyte		>0x1F		\b, 2nd line
+>>>>&-1		string		x		"%s"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.45-ispell-aff.diff.sig
Type: application/octet-stream
Size: 2822 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20230729/2c168020/attachment-0001.obj>