[File] [PATCH] Magdir/ispell affix definition *.aff without russian support until war against Ukraine

Sun Jul 30 16:03:00 UTC 2023

Committed, thanks!

christos

> On Jul 29, 2023, at 12:26 PM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> Some days ago I ran Piriform CCleaner. Its registry cleaner has an item
> for file extensions that scans for errors, and it complained about the
> file name suffix AFF.
> 
> So I looked for such files on my systems. Some samples are affix
> definition text files, and in this session I will only consider such
> text samples. On UNIX-like systems the ispell variant samples are
> typically found in directory /usr/lib/ispell, the myspell variant
> samples in /usr/share/myspell, and the hunspell variant samples in
> /usr/share/hunspell. Such samples are also found beneath directory
> /usr/src/dicts. Luckily such systems have package management, so
> programs that need spell checking often get the affix files by
> depending on spelling packages. Unfortunately Windows systems have no
> such package management, so every program with spell checking ships the
> affix definitions inside its own program directory. Software that
> behaves in this manner includes:
> Calibre, LibreOffice, Scribus, LanguageTool, Firefox, Thunderbird,
> gImageReader, Emacs, Gramps.
> 
> When running the file command version 5.45 on such affix samples and
> "related" files (*foo*) I get output like:
> 
> 1463589.aff:       ASCII text
> 1695964.aff:       Unicode text, UTF-8 text
> 2970240.aff:       ASCII text
> ar.aff:            Unicode text, UTF-8 text
> bulgarian.aff:     ISO-8859 text
> de_DE.aff:         ISO-8859 text
> de_DE_frami.aff:   ISO-8859 text
> discover-foo.conf: ASCII text
> en-GB-cal.aff:     Unicode text, UTF-8 (with BOM) text
> en_GB.aff:         Unicode text, UTF-8 text
> en_US.aff:         ASCII text
> it_IT.aff:         Unicode text, UTF-8 text
> ngerman.aff:       Unicode text, UTF-8 text
> nilfs_foo.conf:    ASCII text
> polish.aff:        Unicode text, UTF-8 text
> sv_SE.aff:         ISO-8859 text
> tr_TR.aff:         Unicode text, UTF-8 text
> 
> With option --extension only the 3-byte sequence ??? is shown, and with
> option -i the generic text/plain is shown.
> 
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). It does not
> recognize the samples.
> 
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html). A few examples like
> en_US.aff and sv_SE.aff are recognized and correctly described as
> "Affix file" by affix.trid.xml with the correct file name suffix AFF
> (see appended trid-v-aff.txt.gz).
> 
> TrID lists the file name extension used. With the help of these tools I
> found manual pages with sections about file formats and conventions for
> ispell, Hunspell dictionaries and affix files. This is now expressed
> inside Magdir/ispell by additional comment lines like:
> # URL:	https://www.openoffice.org/lingucomponent/affix.readme
> #	https://man.archlinux.org/man/hunspell.5.en
> # 	https://manpages.debian.org/testing/ispell/ispell.5.en.html
> # Reference:	http://mark0.net/download/triddefs_xml.7z
> #		defs/a/affix.trid.xml
> 
> Unfortunately there exists no strict and unique pattern that can be
> used as a magic pattern. So I put the displaying part in subroutine
> spell-aff inside Magdir/ispell. This starts like:
> 0		name				spell-aff
> >1		ubeshort	x		affix definition
> !:mime	text/x-affix
> !:ext		aff
> Instead of the generic mime type text/plain this shows a user-defined one.
> 
> At the end of the subroutine, for control reasons, the first lines are
> shown if they are not empty. For the variant starting with a Byte Order
> Mark (BOM=\xEF\xBB\xBF) this looks like:
> >1		ubeshort	=0xBBBF	   	\b, with BOM
> >>3		string		x		\b, 1st line "%s"
> >>>&1		ubyte		>0x1F		\b, 2nd line
> >>>>&-1		string		x		"%s"
> For the variant without BOM this part looks like:
> >1		ubeshort	!0xBBBF
> >>0		ubyte		=0x0A
> >>>1		ubyte		!0x0A		\b, 2nd line
> >>>>&-1	string		x		"%s"
> >>>>>&1	ubyte		=0x0A
> >>>>>>&0	string		x		\b, 4th line "%s"
> >>0		ubyte		!0x0A
> >>>0		string		x		\b, 1st line "%s"
> >>>>&1		ubyte		>0x1F		\b, 2nd line
> >>>>>&-1	string		x		"%s"
> >>>>&1		ubyte		=0x0A		\b, 3rd line
> >>>>>&0	string		x		"%s"
> 
> So for the first lines I get output like:
> "# this is the affix file of the de_DE Hunspell dictionary"
> "# Affix file for British English MySpell dictionary."
> "#\011Ispell affix table for German language"
> "# Dizionario italiano"
> "# Note: this file requires ispell 3.4.01 or later."
> "#   Affix table for Bulgarian"
> This gives a hint whether the file is made for the Ispell variant or
> the MySpell/Hunspell variant. It also gives a hint about the language
> that the file covers.
> 
> So according to the hunspell man page I look for the language code
> command (the upper-case keyword LANG) and its argument langcode. This is
> done inside the subroutine by lines like:
> >>0		search/1117643	LANG\040	\b, language
> >>>&0		string		x		%s
> Many hunspell samples (like /usr/share/hunspell/de_AT.aff
> /usr/share/hunspell/it_IT.aff /usr/share/hunspell/tr_TR.aff
> /usr/lib/firefox/browser/extensions/langpack-hu@firefox.mozilla.org/
> dictionaries/hu.aff) contain such a language directive, and the language
> code argument looks like:
> de_DE hu_HU it_IT mn_MN tr_TR
> 
> A few samples like /usr/share/hunspell/tr_TR.aff start directly with
> that directive. For such samples the detection happens by lines like:
> 0		string		LANG\040
> >0		use		spell-aff
> 
> The TrID tool looks for the 4-byte sequence SET\040 at the beginning.
> According to the hunspell man page this directive sets the character
> encoding of words and morphemes in affix and dictionary files. Possible
> values are:
> UTF-8 ISO8859-1 - ISO8859-10 ISO8859-13 - ISO8859-15 KOI8-R KOI8-U
> cp1251 ISCII-DEVANAGARI
> 
> Unfortunately this directive does not always come at the beginning.
> Often the separator is 1 space character (0x20), but sometimes a
> tabulator character (0x09) is used, like in /opt/Wolfram/WolframEngine/
> 13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff.
> Unfortunately similar looking phrases occur in some ispell affix files.
> In /usr/lib/ispell/ngerman.aff I found a line like
> # PARTICULAR SETTINGS FOR ISPELL ARE NECESSARY !!!
> and in /usr/lib/ispell/ogerman.aff I found a line starting like:
> #   sS			>	-sS,SSET	#     schosS	>
> 
> So I look for the character SET command used in MySpell and Hunspell,
> and for its encoding argument, inside the subroutine by lines like:
> >>0		search/1117729	SET
> >>>&0	ubyte&0xD6	=0x00
> >>>>&0		ubyte		>0x48		\b,
> >>>>>&-1	string	x			"%s" encoded
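
The mask test "ubyte&0xD6 =0x00" above can be checked by hand. The following is a minimal Python sketch, not part of the patch; the names MASK, passes_mask and accepted are made up for illustration. It lists every byte value that the mask lets through, which shows that both separators mentioned above (space 0x20 and tabulator 0x09) pass, while the letter T that follows SET in "SETTINGS" does not.

```python
# Illustrative only: mirrors the magic test "ubyte&0xD6 =0x00".
# MASK, passes_mask and accepted are hypothetical names, not from the patch.
MASK = 0xD6


def passes_mask(byte: int) -> bool:
    """Return True when none of the bits selected by MASK is set."""
    return byte & MASK == 0


# Every byte value (0..255) that the mask lets through.
accepted = [b for b in range(256) if passes_mask(b)]

if __name__ == "__main__":
    print([hex(b) for b in accepted])
    print("space passes:", passes_mask(ord(" ")))
    print("tab passes:  ", passes_mask(ord("\t")))
    print("'T' passes:  ", passes_mask(ord("T")))
```

Only bits 0x01, 0x08 and 0x20 may be set in an accepted byte, so besides space and tabulator a few other values (such as "!" and the parentheses) also pass; the magic lines above apply a further byte test after the mask.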
> 
> Some samples like org/languagetool/resource/sv/hunspell/sv_SE.aff start
> with the SET directive. These are detected by lines like:
> 0		string		SET\040
> >0		use		spell-aff
> 
> Some samples like /usr/share/calibre/dictionaries/en-GB/en-GB.aff start
> with a comment line and the SET directive comes later. Then I must also
> explicitly check for the encoding string in order to skip some scripts
> (like /bin/affixcompress /bin/setupcon /bin/imdbpy2sql.py). Such samples
> are described by lines like:
> 0		ubyte		0x23
> >0		search/60459	SET\040
> >>&0		string		UTF-8
> >>>0		use		spell-aff
> 
> Instead of UTF-8 I should also check for other encodings like KOI8-R
> KOI8-U cp1251, but I am not willing to support the Cyrillic alphabet as
> long as Russia wages war against Ukraine. So I do not implement the
> branches for Russian and Cyrillic encodings.
> 
> Some samples like /opt/Wolfram/WolframEngine/13.1/SystemFiles/
> Components/SpellingData/SpellingDictionaries/lt.aff also start with a
> UTF-8 Byte Order Mark (BOM=\xEF\xBB\xBF). So such samples are described
> by lines like:
> 0		string		\xEF\xBB\xBF
> >3		string		\x23
> >3		search/9883	SET\040
> >>0		use		spell-aff
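
The BOM case can be approximated in Python in the same spirit. This is a sketch under assumptions: the name looks_like_bom_affix is invented, and it only models the BOM test and the ranged search for SET\040 starting at offset 3, which is how the indentation of the lines above reads (the test for \x23 at offset 3 has no continuation line of its own).

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 Byte Order Mark


def looks_like_bom_affix(data: bytes, search_range: int = 9883) -> bool:
    """Approximate: BOM at offset 0, then 'SET ' found within range from offset 3."""
    # 0 string \xEF\xBB\xBF
    if not data.startswith(BOM):
        return False
    needle = b"SET "  # SET\040
    start = len(BOM)
    # Assumption: the needle may start at any of the first search_range
    # offsets counted from offset 3.
    end = start + search_range + len(needle) - 1
    return data.find(needle, start, end) >= 0


if __name__ == "__main__":
    print(looks_like_bom_affix(BOM + b"# comment\nSET UTF-8\n"))
```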
> 
> Unfortunately a few Hunspell samples like 1463589.aff and 2970240.aff,
> found as unit tests inside the thunderbird sources, do not contain the
> typical keywords. I could try to implement an exception for each such
> sample, but then the magic lines would get more complicated.
> So in the subroutine I first look for keywords (defstringtype and
> suffixes followed by flag) suited for Ispell, and if I do not find such
> words I assume it is the Hunspell variant by using the default clause.
> Unfortunately I must use an extra test line so that the default clause
> works. The part for the different spell variants inside the subroutine
> now looks like:
> >0		ubyte		x
> >>0		search/8251	defstringtype	for Ispell
> >>0		default		x
> >>>0		search/3233	suffixes
> >>>>&0		search/2	flag		for Ispell
> >>>>&0		default		x		for MySpell/Hunspell
> >>>0		default		x		for MySpell/Hunspell
> 
> After applying the above mentioned modifications with patch
> file-5.45-ispell-aff.diff, all my real affix samples are now described
> in detail with the correct name suffix. Some test affix files like
> 1463589.aff and 2970240.aff, found for example inside the thunderbird
> sources, are still not recognized. The output now looks like:
> 
> 1463589.aff:       ASCII text
> 1695964.aff:       affix definition for MySpell/Hunspell
> 		   , 1st line
> 		   "# fix NEEDAFFIX homonym suggestion."
> 		   , 2nd line
> 		   "# Sf.net Bug ID 1695964, reported by
> 		   Bj\303\266rn Jacke."
> 2970240.aff:       ASCII text
> ar.aff:            affix definition for MySpell/Hunspell
> 		   , "UTF-8" encoded
> 		   , 1st line "FLAG\011long"
> 		   , 2nd line "AF 333"
> bulgarian.aff:     affix definition for Ispell
> 		   , 1st line
> 		   "#   Affix table for Bulgarian"
> 		   , 3rd line
> 		   "nroffchars\011().\\*"
> de_DE.aff:         affix definition for MySpell/Hunspell
> 		   , language de_DE
> 		   , "ISO8859-1" encoded
> 		   , 1st line
> 		   "# this is the affix file of the
> 		   de_DE Hunspell dictionary"
> 		   , 2nd line
> 		   "# derived from the igerman98 dictionary"
> de_DE_frami.aff:   affix definition for MySpell/Hunspell
> 		   , language de_DE
> 		   , "ISO8859-1" encoded
> 		   , 1st line "# this is the affix file of the
> 		   de_DE Hunspell dictionary"
> 		   , 2nd line "# derived from the igerman98 dictionary"
> discover-foo.conf: ASCII text
> en-GB-cal.aff:     affix definition for MySpell/Hunspell,
> 		   "UTF-8" encoded
> 		   , with BOM
> 		   , 1st line
> 		   "# Affix file for
> 		   British English MySpell dictionary."
> 		   , 2nd line
> 		   "# Also, suitable as basis for
> 		   Commonwealth and European English."
> en_GB.aff:         affix definition for MySpell/Hunspell
> 		   , "UTF-8" encoded
> 		   , 1st line
> 		   "# Affix file for
> 		   British English MySpell dictionary"
> 		   , 2nd line
> 		   "# Also suitable as basis for
> 		   Commonwealth and European English."
> en_US.aff:         affix definition for MySpell/Hunspell
> 		   , "ISO8859-1" encoded
> 		   , 1st line
> 		   "SET ISO8859-1"
> 		   , 2nd line
> 		   "TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'"
> it_IT.aff:         affix definition for MySpell/Hunspell
> 		   , language it_IT
> 		   , "UTF-8" encoded
> 		   , 1st line
> 		   "# Dizionario italiano"
> 		   , 2nd line
> 		   "#"
> ngerman.aff:       affix definition for Ispell
> 		   , 1st line
> 		   "#"
> 		   , 2nd line
> 		   "#\011Ispell affix table for German language"
> nilfs_foo.conf:    ASCII text
> polish.aff:        affix definition for Ispell
> 		   , 2nd line
> 		   "# SJP.PL"
> 		   , 4th line
> 		   "# Note: this file requires ispell 3.4.01 or later."
> sv_SE.aff:         affix definition for MySpell/Hunspell
> 		   , "ISO8859-1" encoded
> 		   , 1st line
> 		   "SET ISO8859-1"
> 		   , 2nd line
> 		   "TRY aerndtislogmkpbhfjuv\344c\366\345
> 		   yqxzvw\351\342\340\341\350"
> tr_TR.aff:         affix definition for MySpell/Hunspell
> 		   , language tr_TR
> 		   , "UTF-8" encoded
> 		   , 1st line
> 		   "LANG tr_TR"
> 		   , 2nd line
> 		   "SET UTF-8"
> 
> 
> I hope my diff file can be applied in a future version of the file
> utility. I hope that I did not miss a real aff sample and that my magic
> test lines are unique enough.
> 
> There is still something to do: the suffix AFF is also used for other
> file formats.
> 
> With best wishes,
> Jörg Jenderek
