[File] [PATCH] Magdir/ispell affix definition *.aff without russian support until war against Ukraine
Christos Zoulas
christos at zoulas.com
Sun Jul 30 16:03:00 UTC 2023
Committed, thanks!
christos
> On Jul 29, 2023, at 12:26 PM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> Some days ago I ran Piriform CCleaner. Its registry cleaner has a
> file extension check that scans for errors, and it complained about
> the file name suffix AFF.
>
> So I looked for such files on my systems. Some samples are affix
> definition text files, and in this session I will only consider such
> text samples. On UNIX-like systems the ispell variant samples are
> typically found in /usr/lib/ispell, the myspell variant samples in
> /usr/share/myspell, and the hunspell variant samples in
> /usr/share/hunspell. Such samples are also found beneath
> /usr/src/dicts. Luckily such systems have package management, so
> programs that need spell checking usually get the affix files by
> depending on spelling packages. Unfortunately Windows systems have no
> such package management, so every program with spell checking ships
> its affix definitions inside its own program directory. Software that
> behaves in this manner includes: Calibre, LibreOffice, Scribus,
> LanguageTool, Firefox, Thunderbird, gImageReader, Emacs, Gramps.
>
> When running the file command version 5.45 on such affix samples and
> "related" files (*foo*) I get output like:
>
> 1463589.aff: ASCII text
> 1695964.aff: Unicode text, UTF-8 text
> 2970240.aff: ASCII text
> ar.aff: Unicode text, UTF-8 text
> bulgarian.aff: ISO-8859 text
> de_DE.aff: ISO-8859 text
> de_DE_frami.aff: ISO-8859 text
> discover-foo.conf: ASCII text
> en-GB-cal.aff: Unicode text, UTF-8 (with BOM) text
> en_GB.aff: Unicode text, UTF-8 text
> en_US.aff: ASCII text
> it_IT.aff: Unicode text, UTF-8 text
> ngerman.aff: Unicode text, UTF-8 text
> nilfs_foo.conf: ASCII text
> polish.aff: Unicode text, UTF-8 text
> sv_SE.aff: ISO-8859 text
> tr_TR.aff: Unicode text, UTF-8 text
>
> With the --extension option only the 3-byte sequence ??? is shown, and
> with the -i option the generic text/plain is shown.
>
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). It does not
> recognize the samples.
>
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html). A few samples like
> en_US.aff and sv_SE.aff are recognized and described correctly as
> "Affix file" by affix.trid.xml, with the correct file name suffix AFF
> (see appended trid-v-aff.txt.gz).
>
> TrID lists the file name extensions used. With the help of these tools
> I found manual pages with sections about file formats and conventions
> for ispell, Hunspell dictionaries and affix files. This is now
> expressed inside Magdir/ispell by additional comment lines like:
> # URL: https://www.openoffice.org/lingucomponent/affix.readme
> # https://man.archlinux.org/man/hunspell.5.en
> # https://manpages.debian.org/testing/ispell/ispell.5.en.html
> # Reference: http://mark0.net/download/triddefs_xml.7z
> # defs/a/affix.trid.xml
>
> Unfortunately there is no strict and unique pattern that can be used
> as a magic pattern. So I put the displaying part in the subroutine
> spell-aff inside Magdir/ispell. It starts like:
> 0 name spell-aff
> >1 ubeshort x affix definition
> !:mime text/x-affix
> !:ext aff
> Instead of the generic MIME type text/plain, a user-defined one is shown.
>
> At the end of the subroutine, for control purposes, the first lines are
> shown if they are not empty. For the variant starting with a Byte Order
> Mark (BOM=\xEF\xBB\xBF) this looks like:
> >1 ubeshort =0xBBBF \b, with BOM
> >>3 string x \b, 1st line "%s"
> >>>&1 ubyte >0x1F \b, 2nd line
> >>>>&-1 string x "%s"
> For the variant without a BOM this part looks like:
> >1 ubeshort !0xBBBF
> >>0 ubyte =0x0A
> >>>1 ubyte !0x0A \b, 2nd line
> >>>>&-1 string x "%s"
> >>>>>&1 ubyte =0x0A
> >>>>>>&0 string x \b, 4th line "%s"
> >>0 ubyte !0x0A
> >>>0 string x \b, 1st line "%s"
> >>>>&1 ubyte >0x1F \b, 2nd line
> >>>>>&-1 string x "%s"
> >>>>&1 ubyte =0x0A \b, 3rd line
> >>>>>&0 string x "%s"
>
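As an aside for readers less familiar with magic syntax, the intent of the line-display rules above can be approximated in Python. This is only a minimal sketch under simplifying assumptions: the name leading_lines is hypothetical and not part of the patch, and the sketch simply reports the first non-empty lines together with their line numbers instead of reproducing every offset test of the magic rules.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def leading_lines(data: bytes, limit: int = 2):
    """Return (has_bom, [(line_number, line_bytes), ...]) for the first
    `limit` non-empty lines, counting lines from 1 after any BOM."""
    has_bom = data.startswith(UTF8_BOM)
    body = data[len(UTF8_BOM):] if has_bom else data
    found = []
    for number, line in enumerate(body.split(b"\n"), start=1):
        line = line.rstrip(b"\r")  # tolerate CRLF line endings
        if line:
            found.append((number, line))
            if len(found) == limit:
                break
    return has_bom, found
```

For an input that begins with an empty line and has an empty third line, the sketch reports lines 2 and 4, which corresponds to the "2nd line" and "4th line" branch of the rules.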
> So for the first lines I get output like:
> "# this is the affix file of the de_DE Hunspell dictionary"
> "# Affix file for British English MySpell dictionary."
> "#\011Ispell affix table for German language"
> "# Dizionario italiano"
> "# Note: this file requires ispell 3.4.01 or later."
> "# Affix table for Bulgarian"
> This gives a hint whether the file is made for the Ispell variant or
> the MySpell/Hunspell variant. It also gives a hint about the language
> covered by the file.
>
> So according to the hunspell man page I look for the language code
> command (the upper-case keyword LANG) and its argument langcode. That
> is done inside the subroutine by lines like:
> >>0 search/1117643 LANG\040 \b, language
> >>>&0 string x %s
> Many hunspell samples (like /usr/share/hunspell/de_AT.aff
> /usr/share/hunspell/it_IT.aff /usr/share/hunspell/tr_TR.aff
> /usr/lib/firefox/browser/extensions/langpack-hu@firefox.mozilla.org/
> dictionaries/hu.aff) contain such a language directive, and the
> language code argument looks like:
> de_DE hu_HU it_IT mn_MN tr_TR
>
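The LANG lookup can be sketched in Python as follows. This is an illustration only: the name language_code is hypothetical, and unlike the magic search, which matches "LANG " anywhere in the search range, this sketch anchors the keyword at the start of a line and also accepts a tab as separator.

```python
import re

# LANG at the start of a line, then blanks, then the language code.
LANG_RE = re.compile(rb"^LANG[ \t]+(\S+)", re.MULTILINE)


def language_code(data: bytes):
    """Return the argument of the first LANG directive, or None."""
    match = LANG_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace")
```

Anchoring at the line start is a design choice of the sketch; it avoids matching the keyword inside a comment line.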
> A few samples like /usr/share/hunspell/tr_TR.aff start directly with
> that directive. Such samples are detected by lines like:
> 0 string LANG\040
> >0 use spell-aff
>
> The TrID tool looks for the 4-byte sequence SET\040 at the beginning.
> According to the hunspell man page this sets the character encoding of
> words and morphemes in affix and dictionary files. Possible values are:
> UTF-8 ISO8859-1 - ISO8859-10 ISO8859-13 - ISO8859-15 KOI8-R KOI8-U
> cp1251 ISCII-DEVANAGARI
>
> Unfortunately this directive does not always come at the beginning.
> Often the separator is 1 space character (0x20), but sometimes a
> tab character (0x09) is used, as in /opt/Wolfram/WolframEngine/
> 13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff.
> Unfortunately similar-looking phrases occur in some ispell affix files.
> In /usr/lib/ispell/ngerman.aff I found a line like
> # PARTICULAR SETTINGS FOR ISPELL ARE NECESSARY !!!
> and in /usr/lib/ispell/ogerman.aff I found a line starting like:
> # sS > -sS,SSET # schosS >
>
> So I look for the character SET command used in MySpell and Hunspell,
> and for its encoding argument, inside the subroutine by lines like:
> >>0 search/1117729 SET
> >>>&0 ubyte&0xD6 =0x00
> >>>>&0 ubyte >0x48 \b,
> >>>>>&-1 string x "%s" encoded
>
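The byte tests above rely on a little mask arithmetic: 0x20 & 0xD6 and 0x09 & 0xD6 are both 0, so space and tab pass as separators (as do a few other bytes), while the "T" in "SETTINGS" does not. The following Python sketch mirrors that idea. The name find_encoding is hypothetical, and the sketch deliberately retries later occurrences of "SET", which is an assumption of mine and not necessarily how the magic search behaves.

```python
def find_encoding(data: bytes):
    """Return the argument of the first plausible SET directive, or None."""
    pos = data.find(b"SET")
    while pos != -1:
        sep = data[pos + 3:pos + 4]    # byte right after "SET"
        first = data[pos + 4:pos + 5]  # first byte of the argument
        # Separator must clear the 0xD6 mask (space, tab, ...), and the
        # argument must start above 'H' (0x48), e.g. 'I', 'K', 'U', 'c'.
        if sep and first and (sep[0] & 0xD6) == 0 and first[0] > 0x48:
            end = pos + 4
            while end < len(data) and data[end] not in b" \t\r\n":
                end += 1
            return data[pos + 4:end].decode("ascii", "replace")
        pos = data.find(b"SET", pos + 1)
    return None
```

With these tests neither the "SETTINGS" line from ngerman.aff nor the ",SSET #" line from ogerman.aff is taken for an encoding directive.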
> Some samples like org/languagetool/resource/sv/hunspell/sv_SE.aff start
> with the SET directive. These are detected by lines like:
> 0 string SET\040
> >0 use spell-aff
>
> Some samples like /usr/share/calibre/dictionaries/en-GB/en-GB.aff start
> with a comment line and the SET directive comes later. Then I must also
> explicitly check for the encoding string in order to skip some scripts
> (like /bin/affixcompress /bin/setupcon /bin/imdbpy2sql.py). Such samples
> are described by lines like:
> 0 ubyte 0x23
> >0 search/60459 SET\040
> >>&0 string UTF-8
> >>>0 use spell-aff
>
> Besides UTF-8 I should also check for other encodings like KOI8-R,
> KOI8-U and cp1251, but I am not willing to support the Cyrillic
> alphabet as long as Russia wages war against Ukraine. So I did not
> implement the branches for Russian and Cyrillic encodings.
>
> Some samples like /opt/Wolfram/WolframEngine/13.1/SystemFiles/
> Components/SpellingData/SpellingDictionaries/lt.aff also start with a
> UTF-8 Byte Order Mark (BOM=\xEF\xBB\xBF). Such samples are described
> by lines like:
> 0 string \xEF\xBB\xBF
> >3 string \x23
> >3 search/9883 SET\040
> >>0 use spell-aff
>
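Taken together, the entry tests described so far can be summarized in a short Python sketch. It is a simplified reading, not a translation of the patch: the name looks_like_affix_start is hypothetical, the search limits are copied from the quoted lines, the exact range semantics of the magic search are only approximated, and for the BOM case the sketch requires both the comment marker and the SET directive, which is one possible reading of the quoted lines.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_affix_start(data: bytes) -> bool:
    """Rough equivalent of the top-level tests that call spell-aff."""
    # Files starting directly with a LANG or SET directive.
    if data.startswith((b"LANG ", b"SET ")):
        return True
    # Comment first, then an explicit UTF-8 SET directive later on.
    if data.startswith(b"#"):
        return data.find(b"SET UTF-8", 0, 60459) != -1
    # UTF-8 BOM, then a comment, then a SET directive later on.
    if data.startswith(UTF8_BOM + b"#"):
        return data.find(b"SET ", 3, 3 + 9883) != -1
    return False
```

A shell script that starts with "#" but has no "SET UTF-8" line is rejected, which is the purpose of the explicit encoding check.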
> Unfortunately a few Hunspell samples like 1463589.aff and 2970240.aff,
> found as unit tests inside the Thunderbird sources, do not contain the
> typical keywords. I could try to implement an exception for each such
> sample, but the magic lines would then get more complicated.
> So in the subroutine I first look for keywords (defstringtype, and
> suffixes followed by flag) typical of Ispell, and if I do not find such
> words I assume it is the Hunspell variant by using the default clause.
> Unfortunately I must use an extra test line so that the default clause
> works. So the part for the different spell variants inside the
> subroutine now looks like:
> >0 ubyte x
> >>0 search/8251 defstringtype for Ispell
> >>0 default x
> >>>0 search/3233 suffixes
> >>>>&0 search/2 flag for Ispell
> >>>>&0 default x for MySpell/Hunspell
> >>>0 default x for MySpell/Hunspell
>
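The decision logic of this part reads roughly as follows in Python. This is a sketch only: the name spell_variant is hypothetical, and the search windows approximate the ranges of the quoted magic lines without claiming to match their exact boundaries.

```python
def spell_variant(data: bytes) -> str:
    """Classify affix data as Ispell or MySpell/Hunspell by keyword."""
    # Ispell affix files commonly declare a defstringtype near the top.
    if data.find(b"defstringtype", 0, 8251 + len(b"defstringtype")) != -1:
        return "Ispell"
    # Otherwise look for "suffixes" closely followed by "flag".
    pos = data.find(b"suffixes", 0, 3233 + len(b"suffixes"))
    if pos != -1:
        after = pos + len(b"suffixes")
        if data.find(b"flag", after, after + 2 + len(b"flag")) != -1:
            return "Ispell"
    # Default clause: nothing Ispell-specific was found.
    return "MySpell/Hunspell"
```

The fallback means that any file reaching the subroutine without Ispell keywords is labelled MySpell/Hunspell, so the entry tests have to do the real filtering.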
> After applying the above-mentioned modifications with the patch
> file-5.45-ispell-aff.diff, all my real affix samples are described
> in detail with the correct name suffix. Some test affix files like
> 1463589.aff and 2970240.aff, found for example inside the Thunderbird
> sources, are not recognized. The output now looks like:
>
> 1463589.aff: ASCII text
> 1695964.aff: affix definition for MySpell/Hunspell
> , 1st line
> "# fix NEEDAFFIX homonym suggestion."
> , 2nd line
> "# Sf.net Bug ID 1695964, reported by
> Bj\303\266rn Jacke."
> 2970240.aff: ASCII text
> ar.aff: affix definition for MySpell/Hunspell
> , "UTF-8" encoded
> , 1st line "FLAG\011long"
> , 2nd line "AF 333"
> bulgarian.aff: affix definition for Ispell
> , 1st line
> "# Affix table for Bulgarian"
> , 3rd line
> "nroffchars\011().\\*"
> de_DE.aff: affix definition for MySpell/Hunspell
> , language de_DE
> , "ISO8859-1" encoded
> , 1st line
> "# this is the affix file of the
> de_DE Hunspell dictionary"
> , 2nd line
> "# derived from the igerman98 dictionary"
> de_DE_frami.aff: affix definition for MySpell/Hunspell
> , language de_DE
> , "ISO8859-1" encoded
> , 1st line "# this is the affix file of the
> de_DE Hunspell dictionary"
> , 2nd line "# derived from the igerman98 dictionary"
> discover-foo.conf: ASCII text
> en-GB-cal.aff: affix definition for MySpell/Hunspell,
> "UTF-8" encoded
> , with BOM
> , 1st line
> "# Affix file for
> British English MySpell dictionary."
> , 2nd line
> "# Also, suitable as basis for
> Commonwealth and European English."
> en_GB.aff: affix definition for MySpell/Hunspell
> , "UTF-8" encoded
> , 1st line
> "# Affix file for
> British English MySpell dictionary"
> , 2nd line
> "# Also suitable as basis for
> Commonwealth and European English."
> en_US.aff: affix definition for MySpell/Hunspell
> , "ISO8859-1" encoded
> , 1st line
> "SET ISO8859-1"
> , 2nd line
> "TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'"
> it_IT.aff: affix definition for MySpell/Hunspell
> , language it_IT
> , "UTF-8" encoded
> , 1st line
> "# Dizionario italiano"
> , 2nd line
> "#"
> ngerman.aff: affix definition for Ispell
> , 1st line
> "#"
> , 2nd line
> "#\011Ispell affix table for German language"
> nilfs_foo.conf: ASCII text
> polish.aff: affix definition for Ispell
> , 2nd line
> "# SJP.PL"
> , 4th line
> "# Note: this file requires ispell 3.4.01 or later."
> sv_SE.aff: affix definition for MySpell/Hunspell
> , "ISO8859-1" encoded
> , 1st line
> "SET ISO8859-1"
> , 2nd line
> "TRY aerndtislogmkpbhfjuv\344c\366\345
> yqxzvw\351\342\340\341\350"
> tr_TR.aff: affix definition for MySpell/Hunspell
> , language tr_TR
> , "UTF-8" encoded
> , 1st line
> "LANG tr_TR"
> , 2nd line
> "SET UTF-8"
>
>
> I hope my diff file can be applied in a future version of the file
> utility. I hope that I did not miss a real aff sample and that my magic
> test lines are unique enough.
>
> There is still something to do: the suffix AFF is also used for other
> file formats.
>
> With best wishes,
> Jörg Jenderek
> --
> Jörg Jenderek
> <Nachrichtenteil als Anhang.DEFANGED-46475><trid-v-aff.txt.gz><file-5_45-ispell-aff_diff.DEFANGED-46476><file-5_45-ispell-aff_diff_sig.DEFANGED-46477>--
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file