[File] [PATCH] Magdir/ispell aspell dictionary not recognized.
Christos Zoulas
christos at zoulas.com
Mon Oct 23 19:50:11 UTC 2023
Committed, thanks!
christos
> On Oct 22, 2023, at 7:14 PM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> Some weeks ago I sent a patch to recognize some spell affix files.
>
> Some days ago I handled some spell affix files. In this session I will
> only consider spell dictionaries. These are used/created by the aspell
> software (see the Wikipedia page https://en.wikipedia.org/wiki/GNU_Aspell).
>
> On UNIX-like systems the aspell variant samples are typically found
> inside directories like /usr/lib/aspell or /usr/lib/aspell-0.60 and
> sometimes /var/lib/aspell. In the last directory the samples are normally
> created from word list files (*.wl) or compressed word list files
> (*.cwl*) during package installation by the aspell command with the
> "create master" option.
>
> Luckily such systems have package management. So programs needing
> such spelling support often get the needed dictionary files by
> depending on aspell packages. Unfortunately Windows systems have no
> such package management. So there every software with aspell spelling
> includes such dictionary files inside its own program directory.
> Software that behaves in this manner includes Inkscape, Bluefish and
> Aspell. So on Windows systems I found such RWS samples in directories
> like:
> c:\Program Files (x86)\Aspell\dict
> c:\Programme\Bluefish\lib\Aspell-0.60
> c:\Program Files\Inkscape\lib\aspell-0.60
>
> When running the file command version 5.45 on such aspell dictionaries
> I get output like:
>
> .aspell.de_DE.prepl: ASCII text
> .aspell.de_DE.pws: ISO-8859 text
> .aspell.en.prepl: ASCII text
> .aspell.en.pws: ASCII text
> de_DE-only.rws: data
> en-only.rws: data
> en-variant_2.rws: data
> it.rws: data
>
> With the --extension option only the 3 byte sequence ??? is shown, and
> with the -i option only the generic text/plain or
> application/octet-stream is shown.
>
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). It wrongly
> describes the RWS samples as "Revit Workspace" by PUID x-fmt/448,
> based on the file name suffix.
>
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html).
> The RWS samples are described as "Aspell dictionary" with mime type
> application/x-aspell-dictionary by rws-aspell.trid.xml. The PWS samples
> are described as "aspell Personal dictionary" by pws-aspell.trid.xml
> with mime type text/x-aspell-dictionary. The PREPL samples are described
> as "aspell Personal Replacement dictionary" by prepl-aspell.trid.xml
> with mime type text/x-aspell-dictionary (see the appended
> trid-v-aspell.txt.gz).
>
> TrID lists the file name extensions used and, with the -v option, often
> the related URL pointing to the file format information used. With the
> help of these tools I found manual pages with sections about file
> formats and conventions for aspell dictionaries.
>
> Unfortunately the aspell documentation contains no explicit file
> format specification for RWS files. Even the RWS suffix is rarely
> mentioned. The man page aspell-autobuildhash(8), part of the
> dictionaries-common package on Linux Mint, mentions the standard
> location directories. It also mentions that the RWS samples are
> created from $lang.cwl.gz or $lang.mwl.gz, but the procedure is not
> described in detail. The man page word-list-compress(1), part of the
> aspell package, shows it in its example section by a command like:
> word-list-compress d <words.cwl | aspell create master ./words.rws
>
> So for RWS this information is now expressed inside Magdir/ispell by
> additional comment lines like:
> # URL: https://en.wikipedia.org/wiki/GNU_Aspell
> # https://manpages.ubuntu.com/manpages/trusty/en/man8/
> # aspell-autobuildhash.8.html
> # Reference: http://mark0.net/download/triddefs_xml.7z
> # defs/r/rws-aspell.trid.xml
> # https://ftp.gnu.org/gnu/aspell/aspell-0.60.8.tar.gz
> # aspell-0.60.8/modules/speller/default/data.cpp
> # aspell-0.60.8/modules/speller/default/readonly_ws.cpp
> Luckily aspell is open source, so I looked inside the sources of aspell
> version 0.60.8. In readonly_ws.cpp I see that the file start is
> generated from the 32 byte constant string cur_check_word, which is
> equal to "aspell default speller rowl 1.10". For older variants I found
> a string like 1.4 instead of 1.10. So I added lines at the end of
> Magdir/ispell. These start like:
> 0 string aspell\040default\040speller\040rowl aspell dictionary
> !:mime application/x-aspell-dictionary
> !:ext rws
> >28 string x \b, version %s
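For readers unfamiliar with magic syntax, the test above can be sketched in Python. This is only an illustration of the quoted lines, not code from the patch or from aspell; the function name rws_version and the constants are hypothetical, and stopping at the first NUL byte within the 32 byte check word is an assumption made so that both "1.4" and "1.10" come out cleanly.

```python
from typing import Optional

# Prefix matched at offset 0 by the quoted magic entry (27 bytes).
RWS_PREFIX = b"aspell default speller rowl"
VERSION_OFFSET = 28  # same offset as the ">28 string x" line
CHECK_WORD_LENGTH = 32  # length of cur_check_word as reported in the post


def rws_version(header: bytes) -> Optional[str]:
    """Return the version text of an RWS header, or None if it does not match."""
    if not header.startswith(RWS_PREFIX):
        return None
    # Assumption: the version ends at the first NUL byte or at the end of
    # the 32 byte check word, whichever comes first.
    raw = header[VERSION_OFFSET:CHECK_WORD_LENGTH].split(b"\0", 1)[0]
    version = raw.decode("ascii", errors="replace").strip()
    return version or None
```

Passing the first 32 bytes of a file is enough; anything that does not begin with the prefix yields None.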
> After the first structure, at offset 64, the variable section starts
> with the endian_check variable. Its value is decimal 12345678, which is
> 00BC614E hexadecimal. On little endian systems it is stored as the byte
> sequence 4e61bc00, or the string Na\274\0. Unfortunately I have no big
> endian samples, and for older aspell variants things are structured in
> another way. So additional information like endianness is shown by
> additional lines like:
> >>64 ulelong 12345678 \b, little endian
> >>64 ubelong 12345678 \b, big endian
> >>64 default x \b, old
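The endian test in these lines can be sketched in Python as follows. It is a minimal illustration under the assumptions stated in the post (a 4 byte endian_check field at offset 64 holding 12345678); the function name rws_byte_order is hypothetical, and returning "old" for every other case, including files too short to hold the field, mirrors the "default" line only loosely.

```python
import struct

ENDIAN_CHECK_OFFSET = 64  # offset of endian_check as given in the post
ENDIAN_CHECK_VALUE = 12345678  # 0x00BC614E


def rws_byte_order(data: bytes) -> str:
    """Classify an RWS file by the 4 byte endian_check field at offset 64."""
    field = data[ENDIAN_CHECK_OFFSET:ENDIAN_CHECK_OFFSET + 4]
    if len(field) == 4:
        # "<I" reads an unsigned 32 bit little endian value (ulelong).
        if struct.unpack("<I", field)[0] == ENDIAN_CHECK_VALUE:
            return "little endian"
        # ">I" reads an unsigned 32 bit big endian value (ubelong).
        if struct.unpack(">I", field)[0] == ENDIAN_CHECK_VALUE:
            return "big endian"
    # Neither test matched: treated as the older layout.
    return "old"
```

The arithmetic from the post can be checked the same way: struct.pack("<I", 12345678) yields the bytes 4e 61 bc 00.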
>
> In the next session I will only consider spell dictionary files with
> the PWS or PREPL suffix. These are used/created by the aspell software.
> The aspell variant samples are typically found inside the user's home
> directory. Depending on the spelling language used, the names are like
> .aspell.de_DE.prepl, .aspell.de_DE.pws, .aspell.en.prepl,
> .aspell.en.pws, .aspell.it.prepl and .aspell.it.pws. Luckily the aspell
> documentation contains an explicit file format specification for such
> files, titled "Format of the Personal and Replacement Dictionaries".
> So I chose this page as reference. That is expressed by lines like:
> # Reference http://aspell.net/man-html/
> # Format-of-the-Personal-and-Replacement-Dictionaries.html
> # Reference: http://mark0.net/download/triddefs_xml.7z
> # defs/p/pws-aspell.trid.xml
> # defs/p/prepl-aspell.trid.xml
> According to that documentation such dictionaries start with the phrase
> personal_. For the replacement dictionary this is followed by a phrase
> like repl-1.1, whereas for the other variant the next phrase is like
> ws-1.1. So such samples are now detected by lines like:
> 0 string personal_ aspell personal
> >9 string ws-1.1 dictionary
> !:mime text/x-aspell-dictionary
> !:ext pws
> >9 string repl replacement dictionary
> !:mime text/x-aspell-dictionary
> !:ext prepl
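The decision logic of these magic lines can be sketched in Python. This is an illustration only: the function name classify_personal is hypothetical, and the returned strings are simply the descriptions that the quoted magic lines would print.

```python
from typing import Optional

PERSONAL_PREFIX = b"personal_"  # 9 bytes, so the next test starts at offset 9


def classify_personal(data: bytes) -> Optional[str]:
    """Describe an aspell personal dictionary from the start of its content."""
    if not data.startswith(PERSONAL_PREFIX):
        return None
    rest = data[len(PERSONAL_PREFIX):]
    if rest.startswith(b"ws-1.1"):
        return "aspell personal dictionary"  # suffix pws
    if rest.startswith(b"repl"):
        return "aspell personal replacement dictionary"  # suffix prepl
    # Only the top-level test matched, so only its description applies.
    return "aspell personal"
```

Only the first line of the file matters here, so reading a few dozen bytes is sufficient.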
> The personal dictionaries are not binary files like the RWS
> dictionaries. The personal dictionary samples are "just" text files, so
> they can also be created/corrected with any text editor. So instead of
> the generic mime type text/plain I chose a user defined one.
>
> After applying the above mentioned modifications with the patch
> file-5.45-ispell-aspell.diff, all my aspell dictionaries are now
> recognized and described with the correct name suffix. This now looks
> like:
>
> .aspell.de_DE.prepl: aspell personal replacement dictionary
> .aspell.de_DE.pws: aspell personal dictionary
> .aspell.en.prepl: aspell personal replacement dictionary
> .aspell.en.pws: aspell personal dictionary
> de_DE-only.rws: aspell dictionary, version 1.4, old
> en-only.rws: aspell dictionary, version 1.4, old
> en-variant_2.rws: aspell dictionary, version 1.10, little endian
> it.rws: aspell dictionary, version 1.10, little endian
>
> I hope my diff file can be applied in a future version of the file
> utility.
>
> There is still something to do. There are other spell samples like word
> lists. I will try to handle these in a future session.
>
> With best wishes,
> Jörg Jenderek
> --
> Jörg Jenderek
> <trid-v-aspell.txt.gz><file-ispell-aspell_diff.DEFANGED-106929><file-ispell-aspell_diff_sig.DEFANGED-106930>