[File] [PATCH] Magdir/diff unified diff; error for binary+POSIX shell script

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Sun Jan 28 16:47:07 UTC 2024


Hello,

Some days ago I had to handle some patch files. Unfortunately there are
about a dozen different variants, and some are not recognized.

In this session I handle "unified diff" samples and related files.

When running file command version 5.45 on thousands of such samples with
option -k, I get output which at first glance does not look bad:

NewClients.patch:              data
Python-2.6.1-mingw.patch:      unified diff output text
			       Python script, ASCII text executable
ShellR64.patch:                ASCII text, with CRLF
			       , LF line terminators
aspell-0.50.4.1-vc++.diff:     unified diff output text
			       C++ source text
			       C++ source text
			       diff output, ASCII text
diff_file.diff:                C source, ASCII text
doublecmd.diff:                unified diff output text
			       RCS/CVS diff output, ASCII text
			       , with CRLF line terminators
fdiskpt.dif:                   unified diff output, ASCII text
file-5.12-msdos-encoding.diff: data
file-5.30-apple-Ctrl-Z.diff:   data
file-5.32-macintosh-type.diff: unified diff output text
			       JavaScript source, ASCII text
file-5.40-algol68-a68.diff:    unified diff output text
			       Algol 68 source text
			       Pascal source, ASCII text
file-5.40-images-pdd.diff:     unified diff output text
			       TeX document, ASCII text
file-5.45-database-mork.diff:  unified diff output text
			       exported SGML document, ASCII text
fix-qt5.6-build.patch:         POSIX shell script text executable
			       unified diff output text
			       a /bin/sh script, ASCII text executable
httpclient.patch:              unified diff output text
			       HTML document, ISO-8859 text
indent-header.patch:           ASCII text
ldlang.c.rej:                  unified diff output, ASCII text
xsltml_2.1.2.patch:            unified diff output text
			       LaTeX document text
			       exported SGML document, Unicode text,
			       UTF-8 text, with very long lines (309)
zip.c.diff:                    unified diff output text
			       C source, ASCII text


Furthermore, for most samples text/x-diff is shown with the -i option.
With option --extension only the 3 byte sequence ??? is shown.

For comparison I ran the file format identification utility TrID (see
https://mark0.net/soft-trid-e.html). It also recognizes most samples.
They are described as "unified diff output" with mime type text/x-patch
by diff-unified.trid.xml, or as "RCS/CVS diff output" with mime type
text/plain by diff-rcs.trid.xml. For the first, 5 file name suffixes
(.DIFF/DIF/PATCH/PCH/REJ) are listed (see appended
trid-v-diff-unified.txt.gz).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Most samples are not
recognized. Samples with the 3 byte suffix DIF are therefore wrongly
described as "VisiCalc Database" by PUID x-fmt/368. Samples which the
file command describes as "LaTeX", like xsltml_2.1.2.patch, are
described here as "LaTeX (Subdocument)" by PUID fmt/281, which looks for
relevant tags (like \usepackage \chapter \section \subsection \begin).
No mime types are listed here (see appended droid-diff-unified.csv.gz).

On Linux, according to the shared MIME-info database, such samples are
called "Differences between files". Here text/x-patch is used as mime
type; text/x-diff is listed as an alias and the parent type is
text/plain. The unified samples are recognized just by looking for the
4 byte sequence "--- " at the beginning. That information can be seen in
the source freedesktop.org.xml.in, found for example on
gitlab.freedesktop.org.

With the help of these tools I found pages about the DIFF file format
(especially unified) on Wikipedia and on the Archive Team file formats
wiki. That is expressed inside Magdir/diff by comment lines like:
# URL:	http://fileformats.archiveteam.org/wiki/Unified_diff
#	https://en.wikipedia.org/wiki/Diff_utility#Unified_format
# Ref.:	https://www.artima.com/weblogs/viewpost.jsp?thread=164293
#	http://mark0.net/download/triddefs_xml.7z
#	defs/d/diff-unified.trid.xml

The unified samples are detected by lines inside Magdir/diff which
look like:
  0	search/4096	---\040
  >&0	search/1024 	\n
  >>&0	search/1 	+++\040
  >>>&0	search/1024 	\n
  >>>>&0	search/1	@@		unified diff output text
  !:mime	text/x-diff
  !:strength + 90
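
As a rough illustration, the chain of tests above can be approximated in
Python. This is only a sketch, not libmagic's implementation: the names
search and matches_old_rule are made up here, search/N is modeled as N
candidate start positions as described in the magic(5) manual page, and
the text/binary distinction is ignored.

```python
# Each step: (number of start positions to try, byte sequence to find).
OLD_RULE = [
    (4096, b"--- "),   # 0      search/4096  ---\040
    (1024, b"\n"),     # >&0    search/1024  \n
    (1,    b"+++ "),   # >>&0   search/1     +++\040
    (1024, b"\n"),     # >>>&0  search/1024  \n
    (1,    b"@@"),     # >>>>&0 search/1     @@
]

def search(data: bytes, pos: int, positions: int, needle: bytes) -> int:
    """Return the offset just past the first match that starts in
    [pos, pos + positions), or -1 if there is no such match."""
    hit = data.find(needle, pos, pos + positions + len(needle) - 1)
    return -1 if hit < 0 else hit + len(needle)

def matches_old_rule(data: bytes) -> bool:
    """Run the steps in order; each one continues after the previous match."""
    pos = 0
    for positions, needle in OLD_RULE:
        pos = search(data, pos, positions, needle)
        if pos < 0:
            return False
    return True

plain = b"--- a/x.c\n+++ b/x.c\n@@ -1 +1 @@\n-old\n+new\n"
print(matches_old_rule(plain))            # True
print(matches_old_rule(b"just text\n"))   # False
```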

Patches are often used to describe the differences between source text
files, so the patch files contain fragments of some programming
language. For that reason most samples are also described as a source
file of a programming language like C++ (at least when using the -k
option). To avoid duplicate messages one could try to exclude the
unified characteristics from the tests describing source fragments. In
reality that is impossible, because the test for every exotic source
fragment would have to be adapted. Furthermore the characteristics are
often neither simple nor very unique. So apparently somebody in the past
raised the magic strength for unified diff by adding 90 to the original
strength 38. Unfortunately, like many people, that person did not
explain or document the facts and reasons. So it took me some weeks to
find out that this is sometimes WRONG! At the moment the total strength
is 128, but for at least 2 text types the strength is higher. I checked
this by running the file command with the --list option and grepping for
source/text magic. These types are:
	# HTML document text (strength=170,90,71,53,52,51,49)
	# POSIX shell script (strength=130 ./commands)
So a sample like fix-qt5.6-build.patch is described first as "POSIX
shell script" because of magic strength 130. Looking at the first lines
of such patches, we see that this is a self-applying patch. It starts as
a shell script (shebang line #! /bin/sh) that calls the patch tool,
which applies the appended text differences (see appended
head-diff-unified.txt.gz). If the intention of the diff creator had been
to give other users an easy method to change some sources, the author
would have named the file something like fix-qt5.6-build.sh. Obviously
for such samples the diff aspect comes first. That is implied by the
standard PATCH suffix. So in order that the description as diff comes
first for such samples, the magic strength must be raised by at least 93
to get a total strength of 131. Maybe there exist some HTML samples
which need even more strength, but I did not find such text in my
thousands of examples.
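
The strength arithmetic can be written down as a small Python sketch.
The numbers are the ones quoted from the --list output; the function
name min_bonus is only illustrative.

```python
BASE_STRENGTH = 38    # unified diff test without any "!:strength" bonus
SHELL_SCRIPT = 130    # "POSIX shell script" entry from the --list output

def min_bonus(base: int, rival: int) -> int:
    """Smallest N for '!:strength + N' such that base + N exceeds rival."""
    return rival - base + 1

old_total = BASE_STRENGTH + 90                        # current magic
new_bonus = min_bonus(BASE_STRENGTH, SHELL_SCRIPT)    # proposed bonus
new_total = BASE_STRENGTH + new_bonus
print(old_total, new_bonus, new_total)   # 128 93 131
```

The strongest HTML test in the list (170) would still rank above 131.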

The next error is found in example ShellR64.patch, found in the UEFI
SDK. Here much more text comes before the unified fragment. So the
search range of about 4 KB is too low and must be raised to an 11 KB
limit.
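
Whether a sample is affected by the search range can be checked by
looking at the offset of the first "--- " sequence. The following Python
sketch uses a synthetic sample: the 5000 byte preamble is made up and
only stands in for the text before the unified fragment, and the name
first_marker_offset is hypothetical.

```python
OLD_RANGE = 4096     # search/4096 in the current magic
NEW_RANGE = 11054    # search/11054 in the proposed magic

def first_marker_offset(data: bytes) -> int:
    """Offset of the first '--- ' sequence, or -1 if there is none."""
    return data.find(b"--- ")

preamble = b"X" * 4999 + b"\n"    # synthetic text before the diff
sample = preamble + b"--- a/Shell.c\n+++ b/Shell.c\n@@ -1 +1 @@\n"

offset = first_marker_offset(sample)
print(offset, offset < OLD_RANGE, offset < NEW_RANGE)   # 5000 False True
```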

The next error is found in examples like diff_file.diff. In order to
understand what is going wrong, we first recapitulate the unified
characteristics. The magic checks for 3 adjacent lines. The first starts
with 3 minus signs and a space character. The second starts with 3 plus
signs and a space character. The third line starts with 2 at signs.
Often this 3 line construct comes at the beginning, but not always. So
the search construct is used to match unusual samples, but then you must
carefully consider the lines before these 3 lines characteristic for
unified diff.
One case is phrases at the beginning which are interpreted as other
diff variants. So we must look at the strength of these diff variants.
These look like:
	# RCS/CVS diff (strength=36=36+0),
	# diff output text (strength=38=38+0)
So a patch for Revision Control System (RCS) can also be in unified
diff format. Then both variants are described, but the unified
description comes first because of the raised strength. So this does not
hurt.

Now comes the case with the subtle error (indent-header.patch,
diff_file.diff). Here, before the unified part, there are lines starting
with 3 minus signs and a space character. This triggers the first magic
test, but in the next test no plus sign fragment is found on the next
line and magic execution stops here. So such samples are not recognized.
So I relax the tests a little bit: I just test whether a sample has
lines with the plus and minus fragments near the beginning. This would
also match samples where the plus line comes first and the line with the
minus fragment comes later. But I keep the check that the line after the
plus line contains the at sign fragment. In the indent-header.patch
sample, however, an indented diff fragment comes before. So I must
increase the search range a little bit, but to compensate I also check
for a space and a minus character after the at signs. I hope that the
whole tests are still unique for this diff variant. So this now looks
like:
  0	search/11054	---\040
  !:strength + 93
  >0	use		diff-unified
  0	name		diff-unified
  >0	search/11084 	+++\040
  >>&0	search/1024 	\n
  >>>&0	search/2	@@\040-	unified diff output text
  !:mime	text/x-diff
  !:ext	diff/patch/dif/pch/rej
  >>>>0	string	!---\040
  >>>>>0	string		x	\b, 1st line "%s"
  >>>>>>&1 string	x	\b, 2nd line "%s"
  >>>>>>>&1 string	x	\b, 3rd line "%s"
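
The relaxed tests can be approximated in Python in the same spirit. This
is a sketch under assumptions, not libmagic's implementation: the names
find_within and matches_new_rule are made up, search/N is modeled as N
candidate start positions as described in the magic(5) manual page, the
"+++ " search is assumed to start again at offset 0 (as the ">0" level
in the named test suggests), and the text/binary distinction is ignored.

```python
def find_within(data: bytes, start: int, positions: int, needle: bytes) -> int:
    """Return the offset just past the first match that starts in
    [start, start + positions), or -1 if there is no such match."""
    hit = data.find(needle, start, start + positions + len(needle) - 1)
    return -1 if hit < 0 else hit + len(needle)

def matches_new_rule(data: bytes) -> bool:
    # 0     search/11054  ---\040   (a minus line somewhere near the start)
    if find_within(data, 0, 11054, b"--- ") < 0:
        return False
    # >0    search/11084  +++\040   (searched independently from offset 0)
    pos = find_within(data, 0, 11084, b"+++ ")
    if pos < 0:
        return False
    # >>&0  search/1024   \n        (end of the plus line)
    pos = find_within(data, pos, 1024, b"\n")
    if pos < 0:
        return False
    # >>>&0 search/2      @@\040-   (hunk header at the start of the next line)
    return find_within(data, pos, 2, b"@@ -") >= 0

# An earlier line starting with "--- " no longer stops the test chain.
early_minus = (b"--- some heading\nplain text\n"
               b"--- a/x\n+++ b/x\n@@ -1 +1 @@\n-old\n+new\n")
print(matches_new_rule(early_minus))                           # True
print(matches_new_rule(b"--- a\n+++ b\nno hunk header\n"))     # False
```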

Such output is used/created by the diff and patch utilities. Therefore
these 2 names are often used as file name suffix. On the old FAT file
system there is an 8+3 limit for file names, so there the maximal length
of a suffix is 3. Apparently for that reason dif is used there instead
of diff, and pch instead of patch. According to the patch documentation,
if patch cannot find a place to install a hunk of the patch, it puts the
hunk out to a reject file, which normally has the name of the output
file plus a .rej suffix or similar. These extensions are also listed on
https://file-extension.net/seeker/
As a check I also show the first 3 lines for samples where the unified
fragment is not at the beginning.

Error number 4 is that a few samples (like NewClients.patch,
file-5.12-msdos-encoding.diff, file-5.30-apple-Ctrl-Z.diff) are not
recognized and are described as data, whereas TrID correctly recognizes
these samples. Normally patches are differences of pure ASCII text
files, but that is not always true. The above mentioned samples contain
control characters (like Ctrl-Z Ctrl-D Ctrl-V). The search directive
used without option is the same as using /t, so the samples are tested
as text files. This test fails for these "binary" samples, and the
remaining tests and the displaying part are never executed. So I do it
similarly to the text variant, but just use the binary option in the
search directive. So these additional lines look like:
  0	search/4096/b	---\040
  !:strength + 93
  >0	use		diff-unified
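
To see which control characters a sample contains, a small Python sketch
like the following can help. It is a simplification and not the check
the file command itself performs: only tab, LF and CR are assumed to be
harmless here, and the name stray_controls is made up.

```python
ALLOWED = {0x09, 0x0A, 0x0D}    # tab, LF, CR (assumed harmless here)

def stray_controls(data: bytes) -> list:
    """Names of the control characters below 0x20 that are not allowed."""
    found = {b for b in data if b < 0x20 and b not in ALLOWED}
    # Ctrl-D is byte 0x04, Ctrl-V is 0x16, Ctrl-Z is 0x1A.
    return sorted("Ctrl-" + chr(b + 0x40) for b in found)

sample = b"--- a\n+++ b\n@@ -1 +1 @@\n-eof\x1a\n+eof\x04\n"
print(stray_controls(sample))    # ['Ctrl-D', 'Ctrl-Z']
```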

After applying the above mentioned modifications by patch
file-5.45-diff-unified.diff and using more sources from Magdir, all my
"unified" samples are recognized. With the -k option this now looks
like:
NewClients.patch:              unified diff output text
			       , 1st line
			       "Index: controls.pp"
			       , 2nd line
			       "=============================
			       , 3rd line
			       "--- controls.pp\011(revision 20785)"
Python-2.6.1-mingw.patch:      unified diff output
			       Python script, ASCII text executable
			       , ASCII text
ShellR64.patch:                unified diff output text
			       , 1st line
			       "From 6451e0daf7f733a27e1
			       afb3c7ac662a620d8b93b
			       Mon Sep 17 00:00:00 2001"
			       , 2nd line
			       "From: Olivier Martin
			       <olivier.martin at arm.com>"
			       , 3rd line
			       "Date: Tue, 14 Jan 2014 14:43:50 +0000"
			       , ASCII text, with CRLF
			       , LF line terminators
aspell-0.50.4.1-vc++.diff:     unified diff output text
			       , 1st line
			       "Only in aspell-win32: Debug"
			       , 2nd line
			       "Only in aspell-win32: Release"
			       , 3rd line
			       "Only in aspell-win32: StdAfx.cpp"
			       C++ source text
			       C++ source text
			       diff output, ASCII text
diff_file.diff:                unified diff output
			       C source, ASCII text
doublecmd.diff:                unified diff output text
			       , 1st line
			       "Index: kperm_64.inc"
			       , 2nd line
			       ""
			       , 3rd line
			       "============================
			       RCS/CVS diff output
			       , ASCII text, with CRLF line terminators
fdiskpt.dif:                   unified diff output, ASCII text
file-5.12-msdos-encoding.diff: unified diff output text
file-5.30-apple-Ctrl-Z.diff:   unified diff output text
file-5.32-macintosh-type.diff: unified diff output
			       JavaScript source, ASCII text
file-5.40-algol68-a68.diff:    unified diff output
			       Algol 68 source text
			       Pascal source, ASCII text
file-5.40-images-pdd.diff:     unified diff output
			       TeX document, ASCII text
file-5.45-database-mork.diff:  unified diff output
			       exported SGML document, ASCII text
fix-qt5.6-build.patch:         unified diff output text
			       , 1st line
			       "#! /bin/sh"
			       , 2nd line
			       "patch -p1 -l -f -R $* < $0"
			       , 3rd line
			       "exit $?", ASCII text
httpclient.patch:              unified diff output, ISO-8859 text
indent-header.patch:           unified diff output text
			       , 1st line
			       "  --- /dev/null"
			       , 2nd line
			       "  +++ b/symlink/index-file"
			       , 3rd line
			       "  @@ -0,0 +1,1 @@"
			       POSIX shell script, ASCII text executable
ldlang.c.rej:                  unified diff output, ASCII text
xsltml_2.1.2.patch:            unified diff output
			       LaTeX document text
			       exported SGML document
			       , Unicode text
			       , UTF-8 text, with very long lines (309)
zip.c.diff:                    unified diff output
			       C source, ASCII text

I hope my diff file can be applied in a future version of the file
utility.

There are still other patch formats which are sometimes not recognized
or not described completely. I will try to handle these in a future
session.

With best wishes,
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: head-diff-unified.txt.gz
Type: application/x-gzip
Size: 3625 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240128/84afb647/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-diff-unified.txt.gz
Type: application/x-gzip
Size: 827 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240128/84afb647/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-diff-unified.csv.gz
Type: application/x-gzip
Size: 1003 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240128/84afb647/attachment-0005.bin>
-------------- next part --------------
--- file-5.45/magic/Magdir/diff.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/diff	2024-01-28 17:14:23.689605300 +0100
@@ -11,3 +11,7 @@
 0	search/1	Only\040in\040 	diff output text
+# diff output text output text (strength=38=38+0) after unified diff output (strength=131=38+93)
+#!:strength +0
 !:mime	text/x-diff
+#!:mime	text/x-patch
+!:ext	diff
 0	search/1	Common\040subdirectories:\040 	diff output text
@@ -15,4 +19,12 @@
 
+# URL: 		https://en.wikipedia.org/wiki/Diff#Extensions
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/d/diff-rcs.trid.xml
+# Note:		called "RCS/CVS diff output" by TrID
+#		and "Differences between files" by shared MIME-info database from freedesktop.org
 0	search/1	Index:		RCS/CVS diff output text
+# RCS/CVS diff output text (strength=36=36+0) after unified diff output (strength=131=38+93)
+#!:strength +0
 !:mime	text/x-diff
+#!:mime	text/x-patch
+!:ext	diff/patch
 
@@ -23,9 +35,39 @@
 # unified diff
-0	search/4096	---\040
->&0	search/1024 	\n
->>&0	search/1 	+++\040
->>>&0	search/1024 	\n
->>>>&0	search/1	@@		unified diff output text
+# URL:		http://fileformats.archiveteam.org/wiki/Unified_diff
+#		https://en.wikipedia.org/wiki/Diff_utility#Unified_format
+# Reference:	https://www.artima.com/weblogs/viewpost.jsp?thread=164293
+#		http://mark0.net/download/triddefs_xml.7z/defs/d/diff-unified.trid.xml
+# Note:		called "unified diff output" by TrID and
+#		"Differences between files" by shared MIME-info database from freedesktop.org
+# use b flag to forces the test to be done for binary files (non ASCII text like with Ctrl-D Ctrl-V Ctrl-Z)
+0	search/4096/b	---\040
+!:strength + 93
+>0	use		diff-unified
+# most samples are just pure ASCII text like: ShellR64.patch
+0	search/11054	---\040
+# unified diff (strength=131=38+93) before
+# HTML document text (strength=170,90,71,53,52,51,49)	POSIX shell script (fix-qt5.6-build.patch strength=130 ./commands)
+# JavaScript source (strength=112,84,81,80,79,78,72,69)	C++ source (strength=71,70,69,68,67,54),
+# Python script (strength=69,67,63,60,58,57,56,54,52,37)LaTeX document text (strength=62,56,55,51,43)
+# TeX document (strength=51,38)				C source (strength=41,39,37)
+# exported SGML document (strength=39,28)		diff output text (strength=38=38+0)
+# Pascal source (strength=37)				RCS/CVS diff (strength=36=36+0),
+# Algol 68 source (strength=?)				CSV ASCII text (strength=?)
+!:strength + 93
+>0	use		diff-unified
+#	check for 3 characteristic lines of unified diff
+0	name			diff-unified
+>0	search/11084 	+++\040
+>>&0	search/1024 	\n
+# at signs line sometimes other (with 2 space chars before) like: indent-header.patch
+>>>&0	search/2	@@\040-	unified diff output text
 !:mime	text/x-diff
-!:strength + 90
+#!:mime	text/x-patch
+# https://file-extension.net/seeker/file_extension_dif file_extension_pch file_extension_rej
+!:ext	diff/patch/dif/pch/rej
+# GRR: mainly for debugging purpose for variants with text before real diff output
+>>>>0	string	!---\040
+>>>>>0	string		x	\b, 1st line "%s"
+>>>>>>&1 string		x	\b, 2nd line "%s"
+>>>>>>>&1 string	x	\b, 3rd line "%s"
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.45-diff-unified.diff.sig
Type: application/octet-stream
Size: 1454 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240128/84afb647/attachment-0001.obj>

