[File] [PATCH] of Magdir/scientific GEDCOM -duplicates +mime type

Sat Apr 29 17:28:24 UTC 2023

Applied, thanks!

christos

> On Apr 28, 2023, at 4:05 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hello,
> some days ago I handled some genealogical databases. One format uses
> the filename extension ged.
> 
> When running the file command version 5.44 with the -k option on such
> genealogical examples and other files with the ged extension, I get
> output that at first glance does not look bad:
> 
> HUNTERS.GED:                   data
> MY-FTW4.GED:                   GEDCOM
> 			       genealogy,
> 			       ASCII text, with CRLF line terminators
> age-all.ged:                   GEDCOM
> 			       genealogy text version 5.5.1,
> 			       ASCII text
> char_utf16be-1.ged:            GEDCOM
> 			       data
> char_utf16be-2.ged:            GEDCOM
> 			       data
> 			       GEDCOM genealogy text version 5.5.1,
> 			       Unicode text, UTF-16,
> 			       big-endian text
> char_utf16le-1.ged:            GEDCOM
> 			       data
> char_utf16le-2.ged:            GEDCOM
> 			       data
> 			       GEDCOM
> 			       genealogy text version 5.5.1,
> 			       Unicode text, UTF-16,
> 			       little-endian text
> fmt-851-signature-id-1210.ged: GEDCOM
> 			       genealogy,
> 			       ISO-8859 text,
> 			       with very long lines (522),
> 			       with CRLF line terminators
> lang-all.ged:                  GEDCOM
> 			       genealogy text version 7.0,
> 			       Unicode text, UTF-8 (with BOM) text
> notes-1.ged:                   GEDCOM
> 			       genealogy text version 7.0,
> 			       Unicode text, UTF-8 (with BOM) text
> rela_1.ged:                    GEDCOM
> 			       genealogy text version 5.5.1,
> 			       ASCII text
> 
> With the --extension option the wrong 3-byte extension "???" is
> displayed, and with the -i option the generic mime type
> application/octet-stream or text/plain is shown.
> 
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). Here many
> examples are also recognized. They are described as
> "Genealogical Data Communication (GEDCOM) Format" with the ged suffix
> and without a mime type by PUID fmt/851. The UTF-16 encoded samples
> are not recognized. My example MY-FTW4.GED is also not recognized
> (see appended output/droid-ged.csv).
> 
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html) on my GED examples.
> Most of my genealogical samples are correctly described as "GEDCOM
> Family History". All definitions look for the string "0 HEAD" near
> the beginning. Samples encoded as ASCII with the string at the very
> beginning, like age-all.ged, are identified by ged.trid.xml. Samples
> like lang-all.ged, encoded as UTF-8 with a byte order mark
> (BOM=EFBBBF) at the start, are described by ged-utf8.trid.xml with
> the additional phrase (UTF-8). Because of the BOM these are also
> described with lower priority as "Text - UTF-8 encoded" by
> txt-utf-8.trid.xml. Samples like char_utf16le-2.ged, encoded as
> UTF-16 little endian with a BOM (=FFFE) at the start, are described
> by ged-utf16.trid.xml with the additional phrase (UTF-16LE). Because
> of the BOM these are also described with lower priority as
> "Text - UTF-16 (LE) encoded" by txt-utf-16-le.trid.xml. The other
> UTF-16 encoded samples, like char_utf16be-2.ged, are only described
> generically, but correctly, as "Text - UTF-16 (BE) encoded" by
> txt-utf-16-be.trid.xml. A few samples like char_utf16be-1.ged are
> misidentified as "Adobe PhotoShop Brush" by abr.trid.xml, and some
> like char_utf16le-1.ged are described as "Unknown!" (see appended
> trid-v-ged.txt.gz).
> 
> TrID lists the file name extension used and, with the -v option,
> often a related URL pointing to more information. That is how I found
> the pages about GEDCOM on Wikipedia and on the file formats archive
> team web site. This is now expressed by additional comment lines
> inside Magdir/scientific like:
> # URL:		http://fileformats.archiveteam.org/wiki/GEDCOM
> #		https://en.wikipedia.org/wiki/GEDCOM
> # Reference:	http://mark0.net/download/triddefs_xml.7z/defs/g/
> #		ged.trid.xml ged-utf8.trid.xml ged-utf16.trid.xml
> 
> In most cases the recognition is done by lines inside
> Magdir/scientific. These look like:
> 0       search/1/c	0\ HEAD         GEDCOM genealogy text
> >&0     search		1\ GEDC
> >>&0    search		2\ VERS         version
> >>>&1   string		>\0		%s
> The last line shows the version, with values like:
> 	4 5.0 5.3 5.4 5.5 5.5.1 5.5.5 5.6 7.0
> Apparently the version is optional and is missing in my own example
> MY-FTW4.GED. The DROID tool assumes that this information is
> required and therefore does not recognize this example.
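
The header walk described above ("0 HEAD", then "1 GEDC", then an optional "2 VERS") can be sketched in Python. This is only an illustration under stated assumptions, not what the file utility does internally: the function name gedcom_version and the sample strings are hypothetical, the input is assumed to be already decoded text, and the parsing is line-based and therefore stricter than the magic search tests.

```python
def gedcom_version(text):
    """Return (looks_like_gedcom, version_or_None) for decoded text.

    The version is treated as optional, as observed in the message above.
    """
    lines = text.lstrip("\ufeff").splitlines()  # tolerate a leading BOM
    if not lines or not lines[0].strip().upper().startswith("0 HEAD"):
        return False, None
    in_gedc = False
    for line in lines[1:]:
        parts = line.strip().split(None, 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            break  # the header record has ended
        if level == "1":
            in_gedc = (tag == "GEDC")  # only VERS below GEDC counts
        elif in_gedc and level == "2" and tag == "VERS" and len(parts) == 3:
            return True, parts[2].strip()
    return True, None


# Hypothetical samples: one with a GEDC version, one without.
with_version = "0 HEAD\n1 SOUR X\n2 VERS 9.9\n1 GEDC\n2 VERS 5.5.1\n0 TRLR\n"
without_version = "0 HEAD\r\n1 SOUR FTW\r\n0 @I1@ INDI\r\n"
print(gedcom_version(with_version))
print(gedcom_version(without_version))
```

A detector written this way still accepts a header without a version, which is the behaviour the message argues for.
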
> 
> The samples are just text files, so the generic mime type text/plain
> is OK in principle. On my Pi (Debian 11 based) ged samples are
> associated with application/x-gedcom according to the shared MIME
> database. The GEDCOM page on Wikipedia lists other types. For the
> zip compressed variant with the gdz suffix, the officially registered
> type application/vnd.familysearch.gedcom+zip is listed. Apparently
> somebody assumed that application/vnd.familysearch.gedcom is the
> mime type for the variant that is not zipped, but at iana.org no
> such type exists. The correct type is expressed by a line like:
> !:mime	text/vnd.familysearch.gedcom
> 
> The shared MIME database also lists a second suffix, gedcom, besides
> ged, but I have not found such examples myself. So the suffix is now
> shown by an additional line like:
> !:ext	ged
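
For scripts that do not go through libmagic, the same type and suffix pair can be registered with Python's standard mimetypes module. This is a small sketch only: the file name family.ged is hypothetical, and the registration affects just the running process, not the system's shared MIME database.

```python
import mimetypes

# Register the type from the !:mime line for the suffix from the !:ext line.
mimetypes.add_type("text/vnd.familysearch.gedcom", ".ged")

guessed_type, guessed_encoding = mimetypes.guess_type("family.ged")
print(guessed_type)
```
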
> 
> As a first test, all tools look for the byte sequence "0 HEAD" near
> the beginning. The file command does the same with its first test
> line. This also holds for UTF-8 encoded samples with a byte order
> mark (BOM=EFBBBF) at the beginning and for the UTF-16 encoded
> samples with a BOM (=FEFF for big endian and FFFE for little endian).
> Because UTF-16 samples with a BOM are already handled by the first
> test, the last 2 tests are not needed any more. The second to last
> looked for the 0\040HEAD string as UTF-16 big endian after a BOM.
> That was done by a line like:
> 0 string \376\377\000\060\000\040\000\110\000\105\000\101\000\104
> 		GEDCOM data
> The last test looked for the 0\040HEAD string as UTF-16 little
> endian after a BOM. That was done by a line like:
> 0 string \377\376\060\000\040\000\110\000\105\000\101\000\104\000
> 		GEDCOM data
> So I commented out these lines.
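
To see what the two octal patterns quoted above actually encode, they can be compared in Python against a BOM followed by "0 HEAD" in the matching UTF-16 byte order. The variable names are illustrative only. The byte strings are copied from the magic lines above, which works because Python bytes literals accept the same three-digit octal escapes.

```python
import codecs

HEAD = "0 HEAD"

# Byte patterns copied from the two magic lines quoted above.
magic_be = b"\376\377\000\060\000\040\000\110\000\105\000\101\000\104"
magic_le = b"\377\376\060\000\040\000\110\000\105\000\101\000\104\000"

# BOM plus "0 HEAD" encoded in the corresponding byte order.
expected_be = codecs.BOM_UTF16_BE + HEAD.encode("utf-16-be")
expected_le = codecs.BOM_UTF16_LE + HEAD.encode("utf-16-le")

print(magic_be == expected_be, magic_le == expected_le)
```

Both comparisons hold, so each pattern is exactly the BOM followed by the UTF-16 form of "0 HEAD".
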
> 
> The second test describes samples with the string 0\040HEAD as
> UTF-16 big endian without a BOM. Because the same file type is
> described, in my opinion it makes no sense to call it "GEDCOM data"
> here instead of "GEDCOM genealogy text". To be consistent I changed
> the description, also look for the version information, and show
> encoding information similar to the other branch. So the second
> test now becomes:
> 0 string \000\060\000\040\000\110\000\105\000\101\000\104
> 		GEDCOM genealogy text
> !:mime	text/vnd.familysearch.gedcom
> !:ext	ged
> >12	search/0x65	V\0E\0R\0S	version
> >>&2	bestring16	x %s
> >>0	string		x \b, UTF-16 (without BOM) big-endian text
> The same was then done for the third test, for UTF-16 little endian.
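
The encoding cases discussed above (UTF-8 with BOM, UTF-16 with BOM in either byte order, UTF-16 without BOM in either byte order, and plain ASCII-compatible text) can be told apart by the leading bytes alone. The following Python sketch is an assumption-laden illustration, not the file utility's logic: classify_gedcom_start and its labels are hypothetical, and it requires an exact, case-sensitive "0 HEAD" at offset 0, whereas the first magic test searches case-insensitively within a small range.

```python
import codecs

_HEAD = "0 HEAD"

# (leading bytes, label), tried in order.
_SIGNATURES = [
    (codecs.BOM_UTF8 + _HEAD.encode("ascii"), "UTF-8 (with BOM)"),
    (codecs.BOM_UTF16_BE + _HEAD.encode("utf-16-be"),
     "UTF-16 big-endian (with BOM)"),
    (codecs.BOM_UTF16_LE + _HEAD.encode("utf-16-le"),
     "UTF-16 little-endian (with BOM)"),
    (_HEAD.encode("utf-16-be"), "UTF-16 big-endian (without BOM)"),
    (_HEAD.encode("utf-16-le"), "UTF-16 little-endian (without BOM)"),
    (_HEAD.encode("ascii"), "ASCII-compatible"),
]


def classify_gedcom_start(data):
    """Return an encoding label if data starts like a GEDCOM header, else None."""
    for prefix, label in _SIGNATURES:
        if data.startswith(prefix):
            return label
    return None


print(classify_gedcom_start("0 HEAD\n".encode("utf-16-le")))
print(classify_gedcom_start(b"\xef\xbb\xbf0 HEAD\r\n"))
print(classify_gedcom_start(b"GIF89a"))
```

The six prefixes cannot shadow one another, because each differs from the others within its first two bytes.
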
> 
> After applying the above modifications with the patch
> file-5.44-scientific-ged.diff, the duplicates vanish and I get
> consistent output with some more details. It now looks like:
> 
> HUNTERS.GED:                   data
> MY-FTW4.GED:                   GEDCOM
> 			       genealogy,
> 			       ASCII text, with CRLF line terminators
> age-all.ged:                   GEDCOM
> 			       genealogy text version 5.5.1,
> 			       ASCII text
> char_utf16be-1.ged:            GEDCOM
> 			       genealogy text version 5.5.1,
> 			       UTF-16 (without BOM)
> 			       big-endian text
> char_utf16be-2.ged:            GEDCOM
> 			       genealogy text version 5.5.1,
> 			       Unicode text, UTF-16,
> 			       big-endian text
> char_utf16le-1.ged:            GEDCOM
> 			       genealogy text version 5.5.1,
> 			       UTF-16 (without BOM)
> 			       little-endian text
> char_utf16le-2.ged:            GEDCOM
> 			       genealogy text version 5.5.1,
> 			       Unicode text, UTF-16,
> 			       little-endian text
> fmt-851-signature-id-1210.ged: GEDCOM
> 			       genealogy,
> 			       ISO-8859 text,
> 			       with very long lines (522),
> 			       with CRLF line terminators
> lang-all.ged:                  GEDCOM
> 			       genealogy text version 7.0,
> 			       Unicode text, UTF-8 (with BOM) text
> notes-1.ged:                   GEDCOM
> 			       genealogy text version 7.0,
> 			       Unicode text, UTF-8 (with BOM) text
> rela_1.ged:                    GEDCOM
> 			       genealogy text version 5.5.1,
> 			       ASCII text
> 
> I hope my diff file can be applied in a future version of the
> file utility.
> 
> Unfortunately the ged suffix is also used for some graphic image
> formats, as in HUNTERS.GED.
> With best wishes
> Jörg Jenderek
> - --
> Jörg Jenderek
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
> 
> iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCZEwnGAAKCRCv8rHJQhrU
> 1jiXAKDUTdjq37OUs1NSIVEjiqq3+ocfywCgj26toIzKEx3ucLmyxcv7uMl9RAM=
> =pgX+
> -----END PGP SIGNATURE-----
> <droid-ged.csv.gz><trid-v-ged.txt.gz><file-5_44-scientific-ged_diff.DEFANGED-9><file-5_44-scientific-ged_diff_sig.DEFANGED-10>--
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file
> <sanitizer.log>
