[File] [PATCH] Magdir/diff context diff; C source first because of strength

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Wed Feb 7 02:33:27 UTC 2024


Hello,

some days ago I had to handle some patch files. Unfortunately there are
about a dozen different variants. In this session I mainly handle
"context" samples, which are not "unified".

So I ran the file command, version 5.45, with the -k option on hundreds
of such patch examples. I got output like:

create.dgux.patch:        context diff output, ASCII text
gpxe.diff:                unified diff output text
			  diff output, ASCII text
hqx.diff:                 C source text
			  context diff output, ASCII text
osx-roots.diff:           unified diff output text
			  diff output, ASCII text
			  , with very long lines (435)
progname.h.diff:          context diff output text
			  C source, ASCII text
python-2.7.6-mingw.patch: unified diff output text
			  Python script text executable
			  diff output, ASCII text
sac.patch:                context diff output text
			  exported SGML document, ASCII text
vblade-17-aio.2.diff:     unified diff output text
			  C source text
			  diff output text
			  C source, ASCII text

With the --extension option only the 3-byte sequence ??? is shown, and
with the -i option text/x-diff is often shown.

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). Here the samples
are recognized.

On Linux, according to the shared MIME-info database, such samples are
called "Differences between files". The MIME type used there is
text/x-patch; text/x-diff is listed as an alias and the parent type is
text/plain. The context samples are recognized just by looking for the
3-byte sequence "---" at the beginning, followed by 1 space or tab
character. Two suffixes (*.diff *.patch) are listed. The other samples
are recognized by looking for the 4-byte phrase diff at the beginning,
followed by 1 space or tab character. This information can be seen in
the source file freedesktop.org.xml.in, found for example on
gitlab.freedesktop.org.

For comparison I also ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). There the context samples
(like hqx.diff) are also described as "context diff output", by
diff-context.trid.xml. text/x-patch is listed as MIME type, and 2
suffixes are listed (.DIFF/PATCH). Another variant (like
vblade-17-aio.2.diff), which the file command describes as "diff output
text", is described there as "diff output text" too, by diff.trid.xml.
Again text/x-patch is listed as MIME type. Here 4 suffixes are listed
(.PATCH/PCH/DIFF/DIF; see the appended trid-v-diff-context.txt.gz).

This tool lists the file name extensions used and, with the -v option,
the related URL pointing to a web site with file format information.
This information is expressed by comment lines inside Magdir/diff like:
# URL:	https://en.wikipedia.org/wiki/Diff_utility#Context_format
# Ref.:	http://mark0.net/download/triddefs_xml.7z
#	defs/d/diff.trid.xml
#	defs/d/diff-context.trid.xml

The context samples are recognized by lines inside Magdir/diff. These
look like:
0	search/1	***\040
>&0	search/1024	\n---\040	context diff output text
!:mime	text/x-diff
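
As a rough illustration of what these two magic lines test, here is a
minimal Python sketch. It is not the code of the file utility. It
assumes that search/N tries N start positions and that &0 means
"directly after the previous match"; the function name is_context_diff
is hypothetical.

```python
def is_context_diff(data: bytes, search_range: int = 1024) -> bool:
    """Approximate the two magic lines for context diffs (assumed semantics)."""
    head = b"*** "        # 0    search/1     ***\040
    if not data.startswith(head):
        return False
    marker = b"\n--- "    # >&0  search/1024  \n---\040
    start = len(head)     # assumption: &0 is the end of the first match
    # The marker may begin at any of `search_range` positions after `start`.
    end = start + search_range + len(marker) - 1
    return data.find(marker, start, end) != -1


sample = b"*** a.c\t2024-01-01\n--- b.c\t2024-01-02\n***************\n"
print(is_context_diff(sample))
print(is_context_diff(b"--- a.c\n+++ b.c\n"))
```

If the "---" line comes later than the search range allows, this sketch
reports no match, which is the situation where the range would have to
be raised.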

For the context samples I get the same problems as found and described
for the "unified" diff variant. Patches are often used to describe the
differences between source text files, so the patch files contain
fragments of some programming language. For that reason most samples are
also described as a source file for a programming language like C (at
least when using the -k option). The strength of the context variant was
38. Unfortunately samples like hqx.diff are also described as C source
(with strength values like 41, 39, 37). So in order for the diff
description of such samples to come first, the magic strength must be
raised by at least 4, to get a total strength of 42. Maybe there exist
some HTML samples which need even more strength (170) or POSIX shell
scripts (strength=130 ./commands), but I did not find such text in my
hundreds of examples.
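
The arithmetic can be checked with a small Python sketch. It is only a
simplified model that assumes the description with the highest strength
is reported first; the strength values are the ones quoted in this
message, and order_by_strength is a hypothetical helper, not part of the
file utility.

```python
def order_by_strength(entries):
    """Return the descriptions sorted so that the strongest comes first."""
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [name for name, _strength in ranked]


# Strength values of the competing C source entries, as quoted above.
C_SOURCE = [("C source", 41), ("C source", 39), ("C source", 37)]

before = order_by_strength([("context diff output text", 38)] + C_SOURCE)
after = order_by_strength([("context diff output text", 38 + 4)] + C_SOURCE)

print(before[0])  # strongest description without the boost
print(after[0])   # strongest description with "!:strength +4"
```

With a boost of only +3 the total would be 41, a tie with the strongest
C source entry, which is why at least +4 is needed.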

Such output is used/created by the diff and patch utilities. Therefore
these 2 names are often used as file name suffixes. On the old FAT file
system there is an 8+3 limit for file names, so there the maximal length
of a suffix is 3. Apparently dif is used there instead of diff, and pch
instead of patch. For the "unified" variant I found such short name
examples, but not for the "context" variant. So at the moment I show
only 2 suffixes. For the unified variant the diff characteristic
sometimes comes some KB later instead of at offset 0. So maybe there
exist context samples where this occurs, but I did not find any among my
inspected samples. If others find such examples then the search range
must be raised. So these magic lines now become:
0	search/1	***\040
!:strength +4
>&0	search/1024	\n---\040	context diff output text
!:mime	text/x-diff
!:ext	diff/patch

The other samples are described by lines like:
0	search/1	diff\040	diff output text
!:mime	text/x-diff

So these lines become:
0	search/1	diff\040	diff output text
!:mime	text/x-diff
!:ext	diff/patch
I found only examples with 2 suffixes, but maybe there also exist short
name samples like in the unified variant. If others find such samples
then I expect 4 suffixes (diff/patch/dif/pch). The Linux shared MIME
database also accepts samples with a tab instead of a space character,
and the TrID tool also checks for a minus character after the space
character. These tools also check for the magic phrase only at the
beginning. For the unified variant the diff characteristic comes some KB
later, so maybe there exist examples where the magic fragment comes
later. Furthermore all inspected samples of this variant are also
described as "unified diff output text" because the unified
characteristics come later. So I do not know whether there exist patches
which start with the diff phrase and are not unified. So I keep that
first variant.
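
To make the difference between the three checks concrete, here is a
small Python sketch. The rules are written only from the description in
this message, not copied from freedesktop.org.xml.in or from
diff.trid.xml, and the three function names are hypothetical.

```python
def matches_file_rule(data: bytes) -> bool:
    """Magdir/diff: the phrase diff followed by a space at offset 0."""
    return data.startswith(b"diff ")


def matches_shared_mime_rule(data: bytes) -> bool:
    """Shared MIME-info, as described here: a space or a tab is accepted."""
    return data.startswith((b"diff ", b"diff\t"))


def matches_trid_rule(data: bytes) -> bool:
    """TrID, as described here: a minus must follow the space."""
    return data.startswith(b"diff -")


for sample in (b"diff -u a.c b.c\n", b"diff\t-u a.c b.c\n", b"diff a.c b.c\n"):
    print(sample[:8], matches_file_rule(sample),
          matches_shared_mime_rule(sample), matches_trid_rule(sample))
```

A sample starting with diff and a tab would be matched only by the
shared MIME-info style check in this sketch, and a sample without an
option after diff would not be matched by the TrID style check.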

After applying the above-mentioned modifications with the patch
file-5.45-diff-context.diff, the diff description comes first for all
inspected examples. The output now looks like:

create.dgux.patch:        context diff output, ASCII text
gpxe.diff:                unified diff output text
			  diff output, ASCII text
hqx.diff:                 context diff output text
			  C source, ASCII text
osx-roots.diff:           unified diff output text
			  diff output, ASCII text
			  , with very long lines (435)
progname.h.diff:          context diff output text
			  C source, ASCII text
python-2.7.6-mingw.patch: unified diff output text
			  diff output, ASCII text
sac.patch:                context diff output text
			  exported SGML document, ASCII text
vblade-17-aio.2.diff:     unified diff output text
			  C source text
			  diff output text
			  C source, ASCII text

I hope my diff file can be applied in a future version of the file
utility. There are still other patch formats which are sometimes not
recognized or not described completely. I will try to handle these
in a future session.

With best wishes,
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-diff-context.txt.gz
Type: application/x-gzip
Size: 565 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240207/17e09aa6/attachment.bin>
-------------- next part --------------
--- file-5.45/magic/Magdir/diff.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/diff	2024-02-07 03:08:13.229299000 +0100
@@ -5,7 +5,31 @@
 #
+# Update:	Joerg Jenderek
+# URL: 		https://en.wikipedia.org/wiki/Diff
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/d/diff.trid.xml
+# Note:		called "diff output text" by TrID and
+#		"Differences between files" by shared MIME-info database from freedesktop.org
+#		According to shared MIME-info database also tabulator character instead of space character and
+#		by TrID minus character after space character
 0	search/1	diff\040	diff output text
+# diff output text (strength=40=40+0) after unified diff output (strength=131=38+93)
+#!:strength +0
 !:mime	text/x-diff
+#!:mime	text/x-patch
+!:ext	diff/patch
+# no short named pch dif examples found
+#!:ext	diff/patch/dif/pch
+# URL:		https://en.wikipedia.org/wiki/Diff_utility#Context_format
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/d/diff-context.trid.xml
+# Note:		called "context diff output" by TrID
+#		and "Differences between files" by shared MIME-info database from freedesktop.org
 0	search/1	***\040
+# context diff output text (strength=42=38+4) before
+# C source (strength=41,39,37)				exported SGML document (strength=39,28)
+!:strength +4
 >&0	search/1024	\n---\040	context diff output text
 !:mime	text/x-diff
+#!:mime	text/x-patch
+!:ext	diff/patch
+# no short named pch dif examples found
+#!:ext	diff/patch/dif/pch
 0	search/1	Only\040in\040 	diff output text
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.45-diff-context.diff.sig
Type: application/octet-stream
Size: 812 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240207/17e09aa6/attachment.obj>

