[File] [PATCH] Magdir/ispell aspell dictionary not recognized.

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Sun Oct 22 23:14:21 UTC 2023


Hello,

Some weeks ago I sent a patch to recognize some spell affix files. In
this session I will only consider spell dictionaries. These are
used/created by the aspell software (see the Wikipedia page
https://en.wikipedia.org/wiki/GNU_Aspell).

On UNIX-like systems the aspell variant samples are typically found
inside directories like /usr/lib/aspell or /usr/lib/aspell-0.60 and
sometimes /var/lib/aspell. In the last directory the samples are
normally created from word list files (*.wl) or compressed word list
files (*.cwl*) during package installation by the aspell command with
the "create master" option.

Luckily such systems have package management. So programs needing
spell checking often get the needed dictionary files by depending on
aspell packages. Unfortunately Windows systems have no such package
management. So there every program with aspell spell checking includes
such dictionary files inside its own program directory. Software that
behaves in this manner includes Inkscape, Bluefish and Aspell. So on
Windows systems I found such RWS samples in directories like:
	c:\Program Files (x86)\Aspell\dict
	c:\Programme\Bluefish\lib\Aspell-0.60
	c:\Program Files\Inkscape\lib\aspell-0.60

When running the file command version 5.45 on such aspell dictionaries
I get an output like:

.aspell.de_DE.prepl: ASCII text
.aspell.de_DE.pws:   ISO-8859 text
.aspell.en.prepl:    ASCII text
.aspell.en.pws:      ASCII text
de_DE-only.rws:      data
en-only.rws:         data
en-variant_2.rws:    data
it.rws:              data

With option --extension only the 3 byte sequence ??? is shown, and with
the -i option the generic text/plain or application/octet-stream is
shown.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It wrongly describes the
RWS samples as "Revit Workspace" with PUID x-fmt/448, based on the file
name suffix.

For comparison I also ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html).
The RWS samples are described as "Aspell dictionary" with mime type
application/x-aspell-dictionary by rws-aspell.trid.xml. The PWS samples
are described as "aspell Personal dictionary" by pws-aspell.trid.xml
with mime type text/x-aspell-dictionary. The PREPL samples are described
as "aspell Personal Replacement dictionary" by prepl-aspell.trid.xml
with mime type text/x-aspell-dictionary (See appended
trid-v-aspell.txt.gz).

TrID lists the used file name extensions and, with the -v option, often
the related URL pointing to file format information. With the help of
these tools I found manual pages with sections about file formats and
conventions for aspell dictionaries.

Unfortunately the aspell documentation contains no explicit file
format specification for RWS files. Even the RWS suffix is rarely
mentioned. The man page aspell-autobuildhash(8), part of the
dictionaries-common package on Linux Mint, mentions the standard
location directories. It also mentions that the RWS samples are created
from $lang.cwl.gz or $lang.mwl.gz, but the procedure is not described
in detail. The man page word-list-compress(1), part of the aspell
package, shows this in its example section by a command like:
      word-list-compress d <words.cwl | aspell create master ./words.rws

So for RWS this information is now expressed inside Magdir/ispell by
additional comment lines like:
# URL:	https://en.wikipedia.org/wiki/GNU_Aspell
#	https://manpages.ubuntu.com/manpages/trusty/en/man8/
#	aspell-autobuildhash.8.html
# Reference:	http://mark0.net/download/triddefs_xml.7z
#			defs/r/rws-aspell.trid.xml
#	https://ftp.gnu.org/gnu/aspell/aspell-0.60.8.tar.gz
#		aspell-0.60.8/modules/speller/default/data.cpp
#		aspell-0.60.8/modules/speller/default/readonly_ws.cpp
Luckily aspell is open source, so I looked inside the sources of aspell
version 0.60.8. In readonly_ws.cpp I see that such files start with the
32 byte constant string cur_check_word, which is equal to "aspell
default speller rowl 1.10". For older variants I found a version string
like 1.4 instead of 1.10. So I add lines at the end of Magdir/ispell.
These start like:
  0 string aspell\040default\040speller\040rowl	aspell dictionary
  !:mime	application/x-aspell-dictionary
  !:ext	rws
  >28	string	x				\b, version %s
After the first structure, at offset 64, the variable section starts
with the endian_check variable. Its value is decimal 12345678 or
00BC614E hexadecimal. In little endian files that is the byte sequence
4e61bc00 or the string Na\274\0. Unfortunately I have no big endian
samples, and for older aspell variants things are structured in another
way. So additional information like endianness is shown by additional
lines like:
  >>64	ulelong	12345678			\b, little endian
  >>64	ubelong	12345678			\b, big endian
  >>64	default	x				\b, old
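
The checks above can be sketched in Python for readers who want to
inspect RWS headers by hand. This is only an illustration, not how
file(1) works internally. The function name describe_rws is
hypothetical, and the sketch assumes that the version text is
NUL-terminated or fills the 32 byte check word exactly.

```python
import struct

# 27 byte prefix matched at offset 0 by the magic line.
MAGIC_PREFIX = b"aspell default speller rowl"
# Value of endian_check: decimal 12345678 = 0x00BC614E.
ENDIAN_CHECK = 12345678


def describe_rws(header: bytes) -> str:
    """Describe an RWS header roughly the way the magic lines do.

    Hypothetical helper; `header` should hold at least the first
    68 bytes of the file.
    """
    if not header.startswith(MAGIC_PREFIX):
        return "data"
    # Assumption: the version text starts at offset 28 and ends at a
    # NUL byte or at the end of the 32 byte check word.
    version = header[28:32].split(b"\0", 1)[0].decode("ascii", "replace")
    parts = ["aspell dictionary", "version " + version.strip()]
    field = header[64:68]
    if len(field) == 4 and struct.unpack("<I", field)[0] == ENDIAN_CHECK:
        parts.append("little endian")
    elif len(field) == 4 and struct.unpack(">I", field)[0] == ENDIAN_CHECK:
        parts.append("big endian")  # untested, no such sample available
    else:
        parts.append("old")
    return ", ".join(parts)


# Synthetic header for illustration only, not a real dictionary.
sample = b"aspell default speller rowl 1.10" + b"\0" * 32
sample += struct.pack("<I", ENDIAN_CHECK)  # bytes 4e 61 bc 00
print(describe_rws(sample))
```

For the synthetic header the sketch prints
"aspell dictionary, version 1.10, little endian".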

Next I will only consider spell dictionary files with the PWS or
PREPL suffix. These are used/created by the aspell software. The aspell
variant samples are typically found inside the user's home directory.
Depending on the used spelling language the names are like
.aspell.de_DE.prepl .aspell.de_DE.pws .aspell.en.prepl
.aspell.en.pws .aspell.it.prepl .aspell.it.pws. Luckily the aspell
documentation contains an explicit file format specification for such
files with the title "Format of the Personal and Replacement
Dictionaries". So I chose this page as reference. That is expressed by
lines like:
# Reference	http://aspell.net/man-html/
#		Format-of-the-Personal-and-Replacement-Dictionaries.html
# Reference:	http://mark0.net/download/triddefs_xml.7z
#		defs/p/pws-aspell.trid.xml
#		defs/p/prepl-aspell.trid.xml
According to that documentation such dictionaries start with the phrase
personal_. For the replacement dictionary this is followed by a phrase
like repl-1.1, whereas for the other variant the next phrase is like
ws-1.1. So such samples are now detected by lines like:
  0	string	personal_			aspell personal
  >9	string	ws-1.1				dictionary
  !:mime	text/x-aspell-dictionary
  !:ext	pws
  >9	string	repl				replacement dictionary
  !:mime	text/x-aspell-dictionary
  !:ext	prepl
The personal dictionaries are not binary files like the RWS
dictionaries. The personal dictionary samples are "just" text files, so
they can also be created/corrected with any text editor. So instead of
the generic mime type text/plain I chose a user defined one.
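
Because these files are plain text, the same test is easy to reproduce
outside of file(1). The following Python sketch classifies the first
line of such a file. The function name describe_personal and the sample
lines are hypothetical. The header layout "personal_ws-1.1 lang num
[encoding]" is taken from the aspell manual page referenced above.

```python
def describe_personal(first_line: str):
    """Classify the header line of an aspell personal dictionary.

    Hypothetical helper. Returns None when the line does not match;
    file(1) itself would print just "aspell personal" if only the
    personal_ prefix matched.
    """
    if not first_line.startswith("personal_"):
        return None
    rest = first_line[9:]  # the magic lines test at offset 9
    if rest.startswith("ws-"):
        description, ext = "aspell personal dictionary", "pws"
    elif rest.startswith("repl-"):
        description, ext = "aspell personal replacement dictionary", "prepl"
    else:
        return None
    # Header layout: personal_ws-1.1 lang num [encoding]
    fields = first_line.split()
    return {
        "description": description,
        "ext": ext,
        "format_version": fields[0].split("-", 1)[1],
        "lang": fields[1] if len(fields) > 1 else None,
        "num": fields[2] if len(fields) > 2 else None,
        "encoding": fields[3] if len(fields) > 3 else None,
    }


# Made-up header lines for illustration only.
print(describe_personal("personal_ws-1.1 en 2 utf-8"))
print(describe_personal("personal_repl-1.1 de_DE 0"))
```

The num field is kept as text on purpose, so a malformed header does
not raise an exception in this sketch.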

After applying the above mentioned modifications by patch
file-5.45-ispell-aspell.diff all my aspell dictionaries are now
recognized and described with the correct name suffix. This now looks like:

.aspell.de_DE.prepl: aspell personal replacement dictionary
.aspell.de_DE.pws:   aspell personal dictionary
.aspell.en.prepl:    aspell personal replacement dictionary
.aspell.en.pws:      aspell personal dictionary
de_DE-only.rws:      aspell dictionary, version 1.4, old
en-only.rws:         aspell dictionary, version 1.4, old
en-variant_2.rws:    aspell dictionary, version 1.10, little endian
it.rws:              aspell dictionary, version 1.10, little endian

I hope my diff file can be applied in a future version of the file
utility.

There is still something to do. There are other spell samples like word
lists. I will try to handle these in a future session.

With best wishes,
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-aspell.txt.gz
Type: application/x-gzip
Size: 500 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20231023/938fbfc0/attachment.bin>
-------------- next part --------------
--- file-master/magic/Magdir/ispell.old	2023-10-21 18:37:39.934990400 +0200
+++ file-master/magic/Magdir/ispell	2023-10-22 21:39:13.096931100 +0200
@@ -1,7 +1,7 @@
 
 #------------------------------------------------------------------------------
 # $File: ispell,v 1.9 2023/07/30 16:02:43 christos Exp $
-# ispell:  file(1) magic for ispell, MySpell and Hunspell
+# ispell:  file(1) magic for ispell, MySpell, Hunspell and aspell
 #
 # Ispell 3.0 has a magic of 0x9601 and ispell 3.1 has 0x9602.  This magic
 # will match 0x9600 through 0x9603 in *both* little endian and big endian.
@@ -206,3 +206,44 @@
 >>3		string		x		\b, 1st line "%s"
 >>>&1		ubyte		>0x1F		\b, 2nd line
 >>>>&-1		string		x		"%s"
+
+# From:		Joerg Jenderek
+# URL:		https://en.wikipedia.org/wiki/GNU_Aspell
+#		https://manpages.ubuntu.com/manpages/trusty/en/man8/aspell-autobuildhash.8.html
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/r/rws-aspell.trid.xml
+#		https://ftp.gnu.org/gnu/aspell/aspell-0.60.8.tar.gz
+#		aspell-0.60.8/modules/speller/default/data.cpp
+#		aspell-0.60.8/modules/speller/default/readonly_ws.cpp
+# Note:		called "aspell dictionary" by TrID
+0	string	aspell\040default\040speller\040rowl	aspell dictionary
+#!:mime	application/octet-stream
+!:mime	application/x-aspell-dictionary
+!:ext	rws
+# version like: 1.10 1.4
+>28	string	x					\b, version %s
+# u32int endian_check; 12345678=00BC614Eh
+#>64	ulelong	x					\b, endian_check=%u
+>>64	ulelong	12345678				\b, little endian
+# not tested
+>>64	ubelong	12345678				\b, big endian
+# older aspell version not like 0.60.8
+>>64	default	x					\b, old
+# URL:		https://en.wikipedia.org/wiki/GNU_Aspell
+# Reference	http://aspell.net/man-html/Format-of-the-Personal-and-Replacement-Dictionaries.html
+# personal_ws-1.1 lang num [encoding]
+0	string	personal_				aspell personal
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/p/pws-aspell.trid.xml
+# Note:		called "aspell Personal dictionary" by TrID
+>9	string	ws-					dictionary
+#!:mime	text/plain
+!:mime	text/x-aspell-dictionary
+# like: ~/.aspell.en.pws ~/.aspell.de_DE.pws ~/.aspell.it.pws
+!:ext	pws
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/p/prepl-aspell.trid.xml
+# Note:		called "aspell Personal Replacement dictionary" by TrID
+# personal_repl-1.1 lang num [encoding]
+>9	string	repl-					replacement dictionary
+#!:mime	text/plain
+!:mime	text/x-aspell-dictionary
+# like: ~/.aspell.en.prepl ~/.aspell.de_DE.prepl ~/.aspell.it.prepl
+!:ext	prepl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-ispell-aspell.diff.sig
Type: application/octet-stream
Size: 1162 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20231023/938fbfc0/attachment.obj>

