[File] [PATCH] of Magdir/scientific GEDCOM -duplicates +mime type

Fri Apr 28 20:05:44 UTC 2023

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,
some days ago I worked with some genealogical databases. One format
used the filename extension ged.

When I run the file command version 5.44 with the -k option on such
genealogical examples and other files with the ged extension, I get
output that at first glance does not look bad:

HUNTERS.GED:                   data
MY-FTW4.GED:                   GEDCOM
			       genealogy,
			       ASCII text, with CRLF line terminators
age-all.ged:                   GEDCOM
			       genealogy text version 5.5.1,
			       ASCII text
char_utf16be-1.ged:            GEDCOM
			       data
char_utf16be-2.ged:            GEDCOM
			       data
			       GEDCOM genealogy text version 5.5.1,
			       Unicode text, UTF-16,
			       big-endian text
char_utf16le-1.ged:            GEDCOM
			       data
char_utf16le-2.ged:            GEDCOM
			       data
			       GEDCOM
			       genealogy text version 5.5.1,
			       Unicode text, UTF-16,
			       little-endian text
fmt-851-signature-id-1210.ged: GEDCOM
			       genealogy,
			       ISO-8859 text,
			       with very long lines (522),
			       with CRLF line terminators
lang-all.ged:                  GEDCOM
			       genealogy text version 7.0,
			       Unicode text, UTF-8 (with BOM) text
notes-1.ged:                   GEDCOM
			       genealogy text version 7.0,
			       Unicode text, UTF-8 (with BOM) text
rela_1.ged:                    GEDCOM
			       genealogy text version 5.5.1,
			       ASCII text

With the --extension option the wrong 3-byte extension "???" is
displayed, and with the -i option the generic mime type
application/octet-stream or text/plain is shown.

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). Many of the
examples are recognized here as well. They are described as
"Genealogical Data Communication (GEDCOM) Format" with ged suffix and
without mime type by PUID fmt/851. The UTF-16 encoded samples are not
recognized. My example MY-FTW4.GED is not recognized either (see
appended output/droid-ged.csv).

For comparison I also ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html) on my GED examples.
Most of my genealogical samples are correctly described as "GEDCOM
Family History". All definitions look for the string "0 HEAD" near
the beginning. When a sample is encoded as ASCII and the string is at
the very beginning, as in age-all.ged, it is identified by
ged.trid.xml. Samples like lang-all.ged, encoded as UTF-8 with a byte
order mark (BOM=EFBBBF) at the start, are described by
ged-utf8.trid.xml with the additional phrase (UTF-8). Because of the
BOM these are also described with lower priority as "Text - UTF-8
encoded" by txt-utf-8.trid.xml. Samples like char_utf16le-2.ged,
encoded as UTF-16 little endian with a BOM (=FFFE) at the start, are
described by ged-utf16.trid.xml with the additional phrase
(UTF-16LE). Because of the BOM these are also described with lower
priority as "Text - UTF-16 (LE) encoded" by txt-utf-16-le.trid.xml.
The other UTF-16 encoded samples like char_utf16be-2.ged are only
described generically, but correctly, as "Text - UTF-16 (BE) encoded"
by txt-utf-16-be.trid.xml. A few samples like char_utf16be-1.ged are
misidentified as "Adobe PhotoShop Brush" by abr.trid.xml and some
like char_utf16le-1.ged are described as "Unknown!" (see appended
trid-v-ged.txt.gz).

TrID lists the file name extensions used and, with the -v option,
often a related URL pointing to further information. That is how I
found the pages about GEDCOM on Wikipedia and on the file formats
archive team web site. This is now expressed by additional comment
lines inside Magdir/scientific like:
# URL:		http://fileformats.archiveteam.org/wiki/GEDCOM
#		https://en.wikipedia.org/wiki/GEDCOM
# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/g/
#		ged.trid.xml ged-utf8.trid.xml ged-utf16.trid.xml

In most cases the recognition is done by these lines inside
Magdir/scientific:
 0       search/1/c	0\ HEAD         GEDCOM genealogy text
 >&0     search		1\ GEDC
 >>&0    search		2\ VERS         version
 >>>&1   string		>\0		%s
The last line shows the version, with values like:
	4 5.0 5.3 5.4 5.5 5.5.1 5.5.5 5.6 7.0
Apparently the version is optional and is missing in my own example
MY-FTW4.GED. The DROID tool assumes that this information is
required and therefore does not recognize this example.
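
The chain of tests above can be sketched in Python as follows. This
is only an illustration of the logic, not how file evaluates magic
entries: the function name describe_gedcom is made up for this
example, the search ranges and the /c flag of the magic entry are not
modelled, and the input is assumed to be already decoded text.

```python
import re


def describe_gedcom(text):
    """Rough imitation of the first GEDCOM test (hypothetical helper).

    Returns a description string, or None when the text does not
    start with the "0 HEAD" record.
    """
    if not text.startswith("0 HEAD"):
        return None
    description = "GEDCOM genealogy text"
    # Like the magic entry, look for "1 GEDC" first and for "2 VERS"
    # only after it.
    gedc = text.find("1 GEDC")
    if gedc < 0:
        return description
    vers = text.find("2 VERS", gedc)
    if vers < 0:
        return description
    # The version is whatever follows the tag on the same line.
    rest = text[vers + len("2 VERS"):]
    match = re.match(r"[ \t]*([^\r\n]+)", rest)
    if match:
        description += " version " + match.group(1).strip()
    return description
```

A file without version information, like MY-FTW4.GED, would still be
described as GEDCOM in this sketch, only without the version part.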

The samples are just text files, so the generic mime type text/plain
is OK in principle. On my PI (Debian 11 based) ged samples are
associated with application/x-gedcom according to the shared mime
database. The GEDCOM page on Wikipedia lists other types. For the
zip compressed variant with gdz suffix the officially registered type
application/vnd.familysearch.gedcom+zip is listed. Apparently
somebody assumed that application/vnd.familysearch.gedcom is the
mime type for the variant that is not zipped, but when I look at
iana.org no such type exists. The correct type is now expressed by
a line like:
!:mime	text/vnd.familysearch.gedcom

According to the shared mime database a second suffix, gedcom, is
listed beside ged, but I myself did not find such examples. So only
the ged suffix is now shown by an additional line like:
!:ext	ged

All tools look for the byte sequence "0 HEAD" near the beginning as
their first test. So does the file command with its first test line.
This also matches UTF-8 encoded samples with a byte order mark
(BOM=EFBBBF) at the beginning and UTF-16 encoded samples with a BOM
(=FEFF for big endian and FFFE for little endian).
Because UTF-16 samples with BOM are already handled by the first
test, the last 2 tests are not needed any more. The second to last
looks for the 0\040HEAD string as UTF-16 big endian after the BOM.
That was done by a line like:
0 string \376\377\000\060\000\040\000\110\000\105\000\101\000\104
		GEDCOM data
The last test looks for the 0\040HEAD string as UTF-16 little endian
after the BOM. That was done by a line like:
0 string \377\376\060\000\040\000\110\000\105\000\101\000\104\000
		GEDCOM data
So I commented out these lines.
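
The BOM values mentioned above can be checked with a few lines of
Python. This is a sketch under assumptions: the helper names are
hypothetical, and it does not claim to mirror how file internally
prepares text before running the first test. It only shows that once
the BOM is stripped and the bytes are decoded, a single "0 HEAD"
check covers all three encodings.

```python
import codecs

# BOM bytes and the codec used to decode what follows them.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def decode_with_bom(data):
    """Return (text, encoding) if data starts with a known BOM, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            text = data[len(bom):].decode(encoding, errors="replace")
            return text, encoding
    return None


def bom_sample_starts_with_head(data):
    """True if a BOM-prefixed sample begins with the "0 HEAD" record."""
    decoded = decode_with_bom(data)
    return decoded is not None and decoded[0].startswith("0 HEAD")
```

For example, a UTF-16 little endian sample starting with the bytes
FF FE 30 00 20 00 48 00 ... passes this check just like a UTF-8
sample starting with EF BB BF 30 20 48 ...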

The second test describes samples with the string 0\040HEAD encoded
as UTF-16 big endian without BOM. Because the same file type is
described, in my opinion it makes no sense to call it "GEDCOM data"
here instead of "GEDCOM genealogy text". To be consistent I changed
the describing text, and the test now also looks for the version
information and shows encoding information similar to the other
branch. So the second test now becomes:
 0 string \000\060\000\040\000\110\000\105\000\101\000\104
		GEDCOM genealogy text
 !:mime	text/vnd.familysearch.gedcom
 !:ext	ged
 >12	search/0x65	V\0E\0R\0S	version
 >>&2	bestring16	x %s
 >>0	string		x \b, UTF-16 (without BOM) big-endian text
I then did the same for the third test, for UTF-16 little endian.
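
The byte patterns of the two tests without BOM, and the resulting
description, can be reproduced in Python like this. It is a minimal
sketch, not a translation of the magic entry: the function name is
hypothetical, the whole buffer is decoded instead of searching a
limited range after offset 12, and the VERS tag is looked up with a
regular expression on the decoded text.

```python
import re

# "0 HEAD" without BOM: \000\060\000\040... and \060\000\040\000...
HEAD_BE = "0 HEAD".encode("utf-16-be")
HEAD_LE = "0 HEAD".encode("utf-16-le")


def describe_bomless_utf16_gedcom(data):
    """Describe a GEDCOM sample stored as UTF-16 without BOM, else None."""
    if data.startswith(HEAD_BE):
        encoding, label = "utf-16-be", "big-endian"
    elif data.startswith(HEAD_LE):
        encoding, label = "utf-16-le", "little-endian"
    else:
        return None
    text = data.decode(encoding, errors="replace")
    description = "GEDCOM genealogy text"
    # Take the rest of the first "2 VERS" line as the version, if any.
    match = re.search(r"^2 VERS[ \t]+([^\r\n]+)", text, re.MULTILINE)
    if match:
        description += " version " + match.group(1).strip()
    return description + ", UTF-16 (without BOM) " + label + " text"
```

Because both byte orders are handled by the same decoded-text logic
here, the different relative offsets needed in the magic lines for
big and little endian do not show up in this sketch.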

After applying the above mentioned modifications with the patch
file-5.44-scientific-ged.diff the duplicates vanish, and I get
consistent output with some more details. This now looks like:

HUNTERS.GED:                   data
MY-FTW4.GED:                   GEDCOM
			       genealogy,
			       ASCII text, with CRLF line terminators
age-all.ged:                   GEDCOM
			       genealogy text version 5.5.1,
			       ASCII text
char_utf16be-1.ged:            GEDCOM
			       genealogy text version 5.5.1,
			       UTF-16 (without BOM)
			       big-endian text
char_utf16be-2.ged:            GEDCOM
			       genealogy text version 5.5.1,
			       Unicode text, UTF-16,
			       big-endian text
char_utf16le-1.ged:            GEDCOM
			       genealogy text version 5.5.1,
			       UTF-16 (without BOM)
			       little-endian text
char_utf16le-2.ged:            GEDCOM
			       genealogy text version 5.5.1,
			       Unicode text, UTF-16,
			       little-endian text
fmt-851-signature-id-1210.ged: GEDCOM
			       genealogy,
			       ISO-8859 text,
			       with very long lines (522),
			       with CRLF line terminators
lang-all.ged:                  GEDCOM
			       genealogy text version 7.0,
			       Unicode text, UTF-8 (with BOM) text
notes-1.ged:                   GEDCOM
			       genealogy text version 7.0,
			       Unicode text, UTF-8 (with BOM) text
rela_1.ged:                    GEDCOM
			       genealogy text version 5.5.1,
			       ASCII text

I hope my diff file can be applied in a future version of the
file utility.

Unfortunately the ged suffix is also used for some graphic image
format, as in HUNTERS.GED.
With best wishes
Jörg Jenderek
- --
Jörg Jenderek
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCZEwnGAAKCRCv8rHJQhrU
1jiXAKDUTdjq37OUs1NSIVEjiqq3+ocfywCgj26toIzKEx3ucLmyxcv7uMl9RAM=
=pgX+
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-ged.csv.gz
Type: application/x-gzip
Size: 523 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20230428/08b25015/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-ged.txt.gz
Type: application/x-gzip
Size: 791 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20230428/08b25015/attachment-0001.bin>
-------------- next part --------------

--- file-5.44/magic/Magdir/scientific.old	2020-05-31 13:34:40.000000000 +0200
+++ file-5.44/magic/Magdir/scientific	2023-04-28 21:46:31.623534100 +0200
@@ -62,15 +62,48 @@
 
 # Type: GEDCOM genealogical (family history) data
 # From: Giuseppe Bilotta
+# Update:	Joerg Jenderek
+# URL:		http://fileformats.archiveteam.org/wiki/GEDCOM
+#		https://en.wikipedia.org/wiki/GEDCOM
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/g/
+#		ged.trid.xml ged-utf8.trid.xml ged-utf16.trid.xml
+# Note:		called "GEDCOM Family History" by TrID and "Genealogical Data Communication (GEDCOM) Format" by DROID via PUID fmt/851
 0       search/1/c	0\ HEAD         GEDCOM genealogy text
+#!:mime	text/plain
+#!:mime	application/x-gedcom
+# https://www.iana.org/assignments/media-types/text/vnd.familysearch.gedcom
+!:mime	text/vnd.familysearch.gedcom
+!:ext	ged
+# no gedcom sample found and ged suffix also used for other formats
+#!:ext	ged/gedcom
 >&0     search		1\ GEDC
 >>&0    search		2\ VERS         version
+# 4 5.0 5.3 5.4 5.5 5.5.1 5.5.5 5.6 7.0 or no version
 >>>&1   string		>\0		%s
 # From: Phil Endecott <phil05 at chezphil.org>
-0	string	\000\060\000\040\000\110\000\105\000\101\000\104		GEDCOM data
-0	string	\060\000\040\000\110\000\105\000\101\000\104\000		GEDCOM data
-0	string	\376\377\000\060\000\040\000\110\000\105\000\101\000\104	GEDCOM data
-0	string	\377\376\060\000\040\000\110\000\105\000\101\000\104\000	GEDCOM data
+# 0\040HEAD as UTF-16 big endian without BOM
+0	string	\000\060\000\040\000\110\000\105\000\101\000\104		GEDCOM genealogy text
+!:mime	text/vnd.familysearch.gedcom
+!:ext	ged
+# look for VERS tag encoded as UTF-16 big endian
+>12		search/0x65	V\0E\0R\0S					version
+# version like: 5.5.1
+>>&2		bestring16	x						%s
+>>0		string		x						\b, UTF-16 (without BOM) big-endian text
+# 0\040HEAD as UTF-16 little endian without BOM
+0	string	\060\000\040\000\110\000\105\000\101\000\104\000		GEDCOM genealogy text
+!:mime	text/vnd.familysearch.gedcom
+!:ext	ged
+# look for VERS tag encoded as UTF-16 little endian
+>12		search/0x65	V\0E\0R\0S					version
+# version like: 5.5.1
+>>&3		lestring16	x						%s
+>>2		string		x						\b, UTF-16 (without BOM) little-endian text
+# Note:		UTF-16 with BOM variants already described above by first test as "GEDCOM genealogy text"
+# 0\040HEAD as UTF-16 big endian with BOM
+#0	string	\376\377\000\060\000\040\000\110\000\105\000\101\000\104	GEDCOM data
+# 0\040HEAD as UTF-16 little endian with BOM
+#0	string	\377\376\060\000\040\000\110\000\105\000\101\000\104\000	GEDCOM data
 
 # PDB: Protein Data Bank files
 # Adam Buchbinder <adam.buchbinder at gmail.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.44-scientific-ged.diff.sig
Type: application/octet-stream
Size: 1155 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20230428/08b25015/attachment.obj>