[File] [PATCH] Magdir/mail.news for Mailbox *.mbox

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Sun Oct 8 17:39:29 UTC 2023


Hello,

Some months ago I migrated my system to Windows 10, so I also had to
transfer my mail data handled by Thunderbird. I had some problems,
so I looked at the files belonging to Thunderbird.
When running the file command version 5.45 on mail messages I get
output like:

INBOX:                         Unicode text
			       , UTF-8 text, with CRLF line terminators
Stromanbieter.mbox:            ASCII text
			       , with CRLF, LF line terminators
Verivox.mbox:                  ASCII text
			       , with CRLF line terminators
file5.18patch-dyadic.mbox:     ASCII text
file5.19patchWindows.PIF.mbox: ASCII text

With option -i only the generic text/plain is displayed, and with
option --extension only ??? is displayed.

For comparison I ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). Many of the mail
samples are described with highest priority as "Standard Unix Mailbox"
by mbox.trid.xml with the correct file name suffix MBOX and mime type
application/mbox. All samples are described with low priority as "E-Mail
message (Var. 2)" by eml-var2.trid.xml with mime type message/rfc822 and
the wrong file suffix EML (see appended trid-v-mbox.txt.gz).

For comparison I also ran the file format identification
utility DROID (see https://sourceforge.net/projects/droid/).
Here all examples are described as "MIME Email" with mime type
message/rfc822 by PUID fmt/950. For samples with suffix mbox and for
samples without a file name suffix the names are considered invalid
(see EXTENSION_MISMATCH true in droid-mbox.csv.gz).

According to the shared-mime-info database the samples are called
"Mailbox file" with mime type application/mbox and file name suffix mbox.

TrID lists the file name extension used and, with the -v option, often
also the related URL pointing to information about the file format.

With the help of these tools I added more lines. This is now expressed
inside Magdir/mail.news, after the other mail/news entries, by
additional comment lines like:
# URL:		https://tools.ietf.org/rfc/rfc4155.txt
# Reference:	http://mark0.net/download/triddefs_xml.7z
#		defs/m/mbox.trid.xml

According to all tools and documentation the mail samples start with
the capitalized word From followed by one space character. Instead of
text/plain an officially registered mime type should be used.
So these are now described by lines like:
0	string			From\040	Mailbox text
!:mime	application/mbox
!:ext	/mbox
   >0	string		x	\b, 1st line "%s"
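
For illustration only, the test above can be approximated outside of
file(1) with a few lines of Python. This is a minimal sketch and not part
of the patch; the function name describe_mbox is hypothetical, and the
output string only imitates what the magic lines print.

```python
from typing import Optional


def describe_mbox(data: bytes) -> Optional[str]:
    """Imitate the magic entry: match 'From ' at offset 0, then show line 1."""
    # Same test as: 0 string From\040 (capitalized word plus one space)
    if not data.startswith(b"From "):
        return None
    # bytes.splitlines() handles LF, CR and CRLF line terminators
    first_line = data.splitlines()[0].decode("utf-8", errors="replace")
    return f'Mailbox text, 1st line "{first_line}"'


if __name__ == "__main__":
    sample = b"From - Tue May 30 21:55:54 2023\r\nX-Mozilla-Status: 0001\r\n"
    print(describe_mbox(sample))
```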

As described in the documentation, the file name suffix mbox is often
used. But I also found samples like INBOX without a suffix. I am not
sure that the starting pattern is unique enough, so as a check the
complete first line is shown. Additional test lines may be added if the
pattern turns out to match other files.
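
To give an idea of what such a stricter test could look at, here is a
hedged Python sketch that checks whether a first line has the shape
"From <sender> <asctime-like date>", as seen in my samples. The regular
expression and the name looks_like_from_line are assumptions made for
this illustration; real mailboxes may use other variants of the line, so
it is not meant as a definitive rule.

```python
import re

# Assumed shape: "From ", a sender without spaces (Thunderbird writes "-"),
# one or more spaces, then weekday, month, day, time and 4-digit year.
FROM_LINE = re.compile(
    r"^From (?P<sender>\S+) +"
    r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ 0-3]?\d \d{2}:\d{2}:\d{2} \d{4}$"
)


def looks_like_from_line(line: str) -> bool:
    """Return True if line looks like an mbox separator line."""
    return FROM_LINE.match(line.rstrip("\r\n")) is not None


if __name__ == "__main__":
    for candidate in (
        "From - Tue May 30 21:55:54 2023",
        "From the desk of the manager",
    ):
        print(candidate, "->", looks_like_from_line(candidate))
```

A plain text file that merely begins with the word From would fail this
check, while the first lines shown by the patched file command would pass.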

After applying the above mentioned modifications with the patch
file-5.45-mail.news-mbox.diff, my mail messages are recognized
and described with some details. This now looks like:

INBOX:                         Mailbox text, 1st line
			       "From - Tue May 30 21:55:54 2023"
Stromanbieter.mbox:            Mailbox text, 1st line
			       "From - Wed Apr 08 17:44:27 2015"
Verivox.mbox:                  Mailbox text, 1st line
			       "From - Tue Apr 07 18:34:15 2015"
file5.18patch-dyadic.mbox:     Mailbox text, 1st line
			       "From joerg.jen.der.ek at gmx.net
			       Sat May 31 20:31:20 2014"
file5.19patchWindows.PIF.mbox: Mailbox text, 1st line
			       "From joerg.jen.der.ek at gmx.net
			       Fri Aug 22 17:56:31 2014"

The world seems to be crazy. Everybody talks about AI and spends much
time and resources in this area, but mail handling, which has been
standardized and established for decades, is still not 100% working
today. What a shame for all people working in the IT sector.

I hope my diff file is unique enough and can be applied in a future
version of the file utility.

With best wishes,
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-mbox.txt.gz
Type: application/x-gzip
Size: 515 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20231008/a283f18e/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-mbox.csv
Type: text/csv
Size: 1152 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20231008/a283f18e/attachment.csv>
-------------- next part --------------
--- file-5.45/magic/Magdir/mail.news.old	2022-11-06 19:33:00.000000000 +0100
+++ file-5.45/magic/Magdir/mail.news	2023-10-08 04:04:21.599838900 +0200
@@ -44,4 +44,18 @@
 #0	string/t		Content-	MIME entity text
 
+# From:		Joerg Jenderek
+# URL:		https://tools.ietf.org/rfc/rfc4155.txt
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/m/mbox.trid.xml
+# Note:		called "Standard Unix Mailbox" by TrID and
+#		"mailbox file" by shared MIME-info database
+#https://gitlab.freedesktop.org/xdg/shared-mime-info/-/blob/master/data/freedesktop.org.xml.in?ref_type=heads
+0	string			From\040	Mailbox text
+#!:mime	text/plain
+!:mime	application/mbox
+# like: INBOX 1.mbox
+!:ext	/mbox
+# For control reasons show first line like: "From - Tue May 30 21:55:54 2023" "From noreply at unitymedia.info  Thu Oct 13 17:23:38 2016"
+>0	string		x	\b, 1st line "%s"
+
 # TNEF files...
 # URL:		http://fileformats.archiveteam.org/wiki/Transport_Neutral_Encapsulation_Format
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.45-mail.news-mbox.diff.sig
Type: application/octet-stream
Size: 792 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20231008/a283f18e/attachment.obj>

