[File] [PATCH] of Magdir/scientific GEDCOM -duplicates +mime type
Christos Zoulas
christos at zoulas.com
Sat Apr 29 17:28:24 UTC 2023
Applied, thanks!
christos
> On Apr 28, 2023, at 4:05 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
> a few days ago I handled some genealogical databases. One format uses
> the file name extension ged.
>
> When I run the file command, version 5.44, with the -k option on such
> genealogical examples and on other files with the ged extension, I get
> output that at first glance does not look bad:
>
> HUNTERS.GED: data
> MY-FTW4.GED: GEDCOM
>     genealogy,
>     ASCII text, with CRLF line terminators
> age-all.ged: GEDCOM
>     genealogy text version 5.5.1,
>     ASCII text
> char_utf16be-1.ged: GEDCOM
>     data
> char_utf16be-2.ged: GEDCOM
>     data
>     GEDCOM genealogy text version 5.5.1,
>     Unicode text, UTF-16,
>     big-endian text
> char_utf16le-1.ged: GEDCOM
>     data
> char_utf16le-2.ged: GEDCOM
>     data
>     GEDCOM
>     genealogy text version 5.5.1,
>     Unicode text, UTF-16,
>     little-endian text
> fmt-851-signature-id-1210.ged: GEDCOM
>     genealogy,
>     ISO-8859 text,
>     with very long lines (522),
>     with CRLF line terminators
> lang-all.ged: GEDCOM
>     genealogy text version 7.0,
>     Unicode text, UTF-8 (with BOM) text
> notes-1.ged: GEDCOM
>     genealogy text version 7.0,
>     Unicode text, UTF-8 (with BOM) text
> rela_1.ged: GEDCOM
>     genealogy text version 5.5.1,
>     ASCII text
>
> With the --extension option the wrong 3-byte extension "???" is
> displayed, and with the -i option only the generic MIME type
> application/octet-stream or text/plain is shown.
>
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). It also
> recognizes many of the examples. They are described as
> "Genealogical Data Communication (GEDCOM) Format" with the ged suffix
> and without a MIME type, under PUID fmt/851. The UTF-16 encoded
> samples are not recognized. My example MY-FTW4.GED is not recognized
> either (see appended output/droid-ged.csv).
>
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html) on my GED examples.
> Most of my genealogical samples are correctly described as "GEDCOM
> Family History". All definitions look for the string "0 HEAD" near
> the beginning. Samples that are encoded as ASCII and have the string
> at the very beginning, like age-all.ged, are identified by
> ged.trid.xml. Samples like lang-all.ged, encoded as UTF-8 with a
> byte order mark (BOM=EFBBBF) at the start, are described by
> ged-utf8.trid.xml with the additional phrase (UTF-8). Because of the
> BOM these are also described, with lower priority, as "Text - UTF-8
> encoded" by txt-utf-8.trid.xml. Samples like char_utf16le-2.ged,
> encoded as UTF-16 little endian with a BOM (=FFFE) at the start, are
> described by ged-utf16.trid.xml with the additional phrase
> (UTF-16LE). Because of the BOM these are also described, with lower
> priority, as "Text - UTF-16 (LE) encoded" by txt-utf-16-le.trid.xml.
> The other UTF-16 encoded samples like char_utf16be-2.ged are only
> described generically, though correctly, as "Text - UTF-16 (BE)
> encoded" by txt-utf-16-be.trid.xml. A few samples like
> char_utf16be-1.ged are misidentified as "Adobe PhotoShop Brush" by
> abr.trid.xml, and some like char_utf16le-1.ged are described as
> "Unknown!" (see appended trid-v-ged.txt.gz).
>
> TrID lists the file name extensions used and, with the -v option,
> often a related URL pointing to further information. That is how I
> found the pages about GEDCOM on Wikipedia and on the File Formats
> Archive Team web site. This is now expressed by additional comment
> lines inside Magdir/scientific like:
> # URL: http://fileformats.archiveteam.org/wiki/GEDCOM
> # https://en.wikipedia.org/wiki/GEDCOM
> # Reference: http://mark0.net/download/triddefs_xml.7z/defs/g/
> # ged.trid.xml ged-utf8.trid.xml ged-utf16.trid.xml
>
> The recognition is mostly done by lines inside Magdir/scientific.
> These look like:
> 0 search/1/c 0\ HEAD GEDCOM genealogy text
>> &0 search 1\ GEDC
>>> &0 search 2\ VERS version
>>>> &1 string >\0 %s
> The last line shows the version, with values like:
> 4 5.0 5.3 5.4 5.5 5.5.1 5.5.5 5.6 7.0
> Apparently the version is optional; it is missing in my own example
> MY-FTW4.GED. The DROID tool assumes that this information is
> required and therefore does not recognize this example.
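
To make the optional version lookup concrete, here is a minimal Python sketch of the same idea. It is not how file(1) implements the test; the helper name gedcom_version and the regular expression are assumptions made for illustration, and the input is assumed to be already decoded text.

```python
import re
from typing import Optional

# Hypothetical helper, not part of file(1): look for a "2 VERS" line
# directly below a "1 GEDC" line in already decoded GEDCOM text.
_VERS = re.compile(r"^\s*1 GEDC\s*\n\s*2 VERS[ \t]+(\S.*?)\s*$", re.MULTILINE)

def gedcom_version(text: str) -> Optional[str]:
    """Return the version string, "" if none is given, None if not GEDCOM."""
    # Accept an optional byte order mark before the first record.
    if not text.lstrip("\ufeff").startswith("0 HEAD"):
        return None
    match = _VERS.search(text)
    # The version is treated as optional: a header without it is still GEDCOM.
    return match.group(1) if match else ""
```

A header without a version record, as described for MY-FTW4.GED, yields an empty string here instead of a rejection, which is the behaviour the magic lines above aim for.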
>
> The samples are just text files, so the generic MIME type text/plain
> is OK in principle. On my PI (Debian 11 based) ged samples are
> associated with application/x-gedcom according to the shared MIME
> database. The GEDCOM page on Wikipedia lists other types. For the
> zip compressed variant with the gdz suffix the officially registered
> type application/vnd.familysearch.gedcom+zip is listed. Apparently
> somebody assumed that application/vnd.familysearch.gedcom is the
> MIME type of the unzipped variant, but when I look at iana.org no
> such type exists. The correct type is expressed by a line like:
> !:mime text/vnd.familysearch.gedcom
>
> According to the shared MIME database a second suffix, gedcom, is
> listed beside ged, but I have not found such examples myself. So
> only the ged suffix is now shown, by an additional line like:
> !:ext ged
>
> As a first test all tools look for the byte sequence "0 HEAD" near
> the beginning. So does the file command in its first test line. This
> is also true for UTF-8 encoded samples with a byte order mark
> (BOM=EFBBBF) at the beginning and for UTF-16 encoded samples with a
> BOM (=FEFF for big endian and FFFE for little endian).
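
As an illustration of the variants listed above, the following Python sketch classifies a byte buffer by BOM and by the "0 HEAD" record. It only checks the very start of the data, which is stricter than "near the beginning"; the labels and the function name gedcom_encoding are assumptions made for this example, not output of any of the tools.

```python
import codecs
from typing import Optional

HEAD = "0 HEAD"

# (label, byte prefix) pairs. Labels are illustrative only.
_SIGNATURES = [
    ("UTF-8 with BOM", codecs.BOM_UTF8 + HEAD.encode("ascii")),
    ("UTF-16 big endian with BOM", codecs.BOM_UTF16_BE + HEAD.encode("utf-16-be")),
    ("UTF-16 little endian with BOM", codecs.BOM_UTF16_LE + HEAD.encode("utf-16-le")),
    ("UTF-16 big endian without BOM", HEAD.encode("utf-16-be")),
    ("UTF-16 little endian without BOM", HEAD.encode("utf-16-le")),
    ("ASCII compatible", HEAD.encode("ascii")),
]

def gedcom_encoding(data: bytes) -> Optional[str]:
    """Return a label for the detected variant, or None if no prefix matches."""
    for label, prefix in _SIGNATURES:
        if data.startswith(prefix):
            return label
    return None
```

The six prefixes cannot shadow each other, because each one differs from the others within its first two bytes.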
> Because UTF-16 samples with a BOM are already handled by the first
> test, the last 2 tests are not needed any more. The second to last
> test looks for the 0\040HEAD string as UTF-16 big endian after the
> BOM. That was done by a line like:
> 0 string \376\377\000\060\000\040\000\110\000\105\000\101\000\104
> GEDCOM data
> The last test looks for the 0\040HEAD string as UTF-16 little endian
> after the BOM. That was done by a line like:
> 0 string \377\376\060\000\040\000\110\000\105\000\101\000\104\000
> GEDCOM data
> So I commented out these lines.
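
The octal escapes in the two commented-out lines can be checked independently. This short Python sketch rebuilds both byte strings from a BOM plus "0 HEAD" in the matching UTF-16 byte order; the helper name utf16_signature is made up for the example.

```python
import codecs

# Byte strings copied from the two magic lines, written as octal escapes.
BE_WITH_BOM = b"\376\377\000\060\000\040\000\110\000\105\000\101\000\104"
LE_WITH_BOM = b"\377\376\060\000\040\000\110\000\105\000\101\000\104\000"

def utf16_signature(bom: bytes, codec: str) -> bytes:
    """Build BOM + "0 HEAD" encoded with the given BOM-less UTF-16 codec."""
    return bom + "0 HEAD".encode(codec)

print(BE_WITH_BOM == utf16_signature(codecs.BOM_UTF16_BE, "utf-16-be"))  # True
print(LE_WITH_BOM == utf16_signature(codecs.BOM_UTF16_LE, "utf-16-le"))  # True
```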
>
> Then we see that the second test describes samples with the string
> 0\040HEAD as UTF-16 big endian without a BOM. Because the same file
> type is described, in my opinion it makes no sense to call it
> "GEDCOM data" here instead of "GEDCOM genealogy text". To be
> consistent I changed the describing text, now also look for the
> version information, and show encoding information similar to the
> other branch. So the second test now becomes:
> 0 string \000\060\000\040\000\110\000\105\000\101\000\104
> GEDCOM genealogy text
> !:mime text/vnd.familysearch.gedcom
> !:ext ged
>> 12 search/0x65 V\0E\0R\0S version
>>> &2 bestring16 x %s
>>> 0 string x \b, UTF-16 (without BOM) big-endian text
> I then did the same for the third test, for UTF-16 little endian.
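
The byte-level lookup in the new big-endian test can be imitated in Python as a rough sketch. The function name utf16be_version is hypothetical, and the exact end of the search window is an assumption; it is only meant to approximate the range given as search/0x65 at offset 12.

```python
from typing import Optional

HEAD_BE = "0 HEAD".encode("utf-16-be")   # the 12 bytes of the main test
NEEDLE = b"V\x00E\x00R\x00S"             # "VERS" in UTF-16BE, from the low byte of "V"

def utf16be_version(data: bytes) -> Optional[str]:
    """Return the version, "" if no VERS is found, None if the header is absent."""
    if not data.startswith(HEAD_BE):
        return None
    # Approximation of "12 search/0x65": start positions 12 .. 12 + 0x65.
    pos = data.find(NEEDLE, 12, 12 + 0x65 + len(NEEDLE))
    if pos < 0:
        return ""
    # Like "&2": skip the two bytes of the space that follows "VERS".
    start = pos + len(NEEDLE) + 2
    chars = []
    for i in range(start, len(data) - 1, 2):
        code = int.from_bytes(data[i:i + 2], "big")
        if code in (0, 10, 13):          # stop at NUL, LF or CR
            break
        chars.append(chr(code))
    return "".join(chars)
```

For the little-endian case the same approach would apply with the byte order swapped; that variant is left out here to keep the sketch short.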
>
> After applying the above modifications with the patch
> file-5.44-scientific-ged.diff the duplicates vanish, and I get
> consistent output with some more details. It now looks like:
>
> HUNTERS.GED: data
> MY-FTW4.GED: GEDCOM
>     genealogy,
>     ASCII text, with CRLF line terminators
> age-all.ged: GEDCOM
>     genealogy text version 5.5.1,
>     ASCII text
> char_utf16be-1.ged: GEDCOM
>     genealogy text version 5.5.1,
>     UTF-16 (without BOM)
>     big-endian text
> char_utf16be-2.ged: GEDCOM
>     genealogy text version 5.5.1,
>     Unicode text, UTF-16,
>     big-endian text
> char_utf16le-1.ged: GEDCOM
>     genealogy text version 5.5.1,
>     UTF-16 (without BOM)
>     little-endian text
> char_utf16le-2.ged: GEDCOM
>     genealogy text version 5.5.1,
>     Unicode text, UTF-16,
>     little-endian text
> fmt-851-signature-id-1210.ged: GEDCOM
>     genealogy,
>     ISO-8859 text,
>     with very long lines (522),
>     with CRLF line terminators
> lang-all.ged: GEDCOM
>     genealogy text version 7.0,
>     Unicode text, UTF-8 (with BOM) text
> notes-1.ged: GEDCOM
>     genealogy text version 7.0,
>     Unicode text, UTF-8 (with BOM) text
> rela_1.ged: GEDCOM
>     genealogy text version 5.5.1,
>     ASCII text
>
> I hope my diff file can be applied in a future version of the
> file utility.
>
> Unfortunately the ged suffix is also used for some graphic image
> formats, like in HUNTERS.GED.
> With best wishes
> Jörg Jenderek
> <droid-ged.csv.gz><trid-v-ged.txt.gz><file-5_44-scientific-ged_diff.DEFANGED-9><file-5_44-scientific-ged_diff_sig.DEFANGED-10>--