[File] [PATCH] Magdir/diff unified diff; error for binary+POSIX shell script

Christos Zoulas christos at zoulas.com
Sun Jan 28 17:40:56 UTC 2024


Committed, thanks!

christos

> On Jan 28, 2024, at 11:47 AM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> Some days ago I had to handle some patch files. Unfortunately there are
> about a dozen different variants, and some are not recognized.
> 
> In this session I will handle "unified diff" samples and related files.
> 
> When running the file command version 5.45 on thousands of such samples
> with option -k, I get output that at first glance does not look bad:
> 
> NewClients.patch:              data
> Python-2.6.1-mingw.patch:      unified diff output text
> 			       Python script, ASCII text executable
> ShellR64.patch:                ASCII text, with CRLF
> 			       , LF line terminators
> aspell-0.50.4.1-vc++.diff:     unified diff output text
> 			       C++ source text
> 			       C++ source text
> 			       diff output, ASCII text
> diff_file.diff:                C source, ASCII text
> doublecmd.diff:                unified diff output text
> 			       RCS/CVS diff output, ASCII text
> 			       , with CRLF line terminators
> fdiskpt.dif:                   unified diff output, ASCII text
> file-5.12-msdos-encoding.diff: data
> file-5.30-apple-Ctrl-Z.diff:   data
> file-5.32-macintosh-type.diff: unified diff output text
> 			       JavaScript source, ASCII text
> file-5.40-algol68-a68.diff:    unified diff output text
> 			       Algol 68 source text
> 			       Pascal source, ASCII text
> file-5.40-images-pdd.diff:     unified diff output text
> 			       TeX document, ASCII text
> file-5.45-database-mork.diff:  unified diff output text
> 			       exported SGML document, ASCII text
> fix-qt5.6-build.patch:         POSIX shell script text executable
> 			       unified diff output text
> 			       a /bin/sh script, ASCII text executable
> httpclient.patch:              unified diff output text
> 			       HTML document, ISO-8859 text
> indent-header.patch:           ASCII text
> ldlang.c.rej:                  unified diff output, ASCII text
> xsltml_2.1.2.patch:            unified diff output text
> 			       LaTeX document text
> 			       exported SGML document, Unicode text,
> 			       UTF-8 text, with very long lines (309)
> zip.c.diff:                    unified diff output text
> 			       C source, ASCII text
> 
> 
> Furthermore, for most samples text/x-diff is shown with the -i option.
> With option --extension only the 3-byte sequence ??? is shown.
> 
> For comparison I ran the file format identification utility TrID
> (see https://mark0.net/soft-trid-e.html). It also recognizes most
> samples. They are described as "unified diff output" with mime type
> text/x-patch by diff-unified.trid.xml, or as "RCS/CVS diff output" with
> mime type text/plain by diff-rcs.trid.xml. For the former, 5 file name
> suffixes (.DIFF/DIF/PATCH/PCH/REJ) are listed (see appended
> trid-v-diff-unified.txt.gz).
> 
> For comparison I also ran the file format identification utility DROID
> (see https://sourceforge.net/projects/droid/). Most samples are not
> recognized. Samples with the 3-byte suffix DIF are therefore wrongly
> described as "VisiCalc Database" by PUID x-fmt/368. Samples which the
> file command describes as "LaTeX", like xsltml_2.1.2.patch, are
> described as "LaTeX (Subdocument)" via PUID fmt/281 by looking for
> relevant tags (like \usepackage \chapter \section \subsection \begin).
> No mime types are listed (see appended droid-diff-unified.csv.gz).
> 
> On Linux, according to the shared MIME-info database, such samples are
> called "Differences between files". Here text/x-patch is used as the
> mime type; text/x-diff is listed as an alias and the parent type is
> text/plain. The unified samples are recognized just by looking for the
> 4-byte sequence "--- " at the beginning. That information can be seen in
> the source freedesktop.org.xml.in, found for example on
> gitlab.freedesktop.org.
> 
> With the help of these tools I found pages about the DIFF file format
> (especially unified) on Wikipedia and the Archive Team file formats wiki.
> That is expressed inside Magdir/diff by comment lines like:
> # URL:	http://fileformats.archiveteam.org/wiki/Unified_diff
> #	https://en.wikipedia.org/wiki/Diff_utility#Unified_format
> # Ref.:	https://www.artima.com/weblogs/viewpost.jsp?thread=164293
> #	http://mark0.net/download/triddefs_xml.7z
> #	defs/d/diff-unified.trid.xml
> 
> Unified diff samples are detected by lines inside Magdir/diff which
> look like:
> 0	search/4096	---\040
> >&0	search/1024 	\n
> >>&0	search/1 	+++\040
> >>>&0	search/1024 	\n
> >>>>&0	search/1	@@		unified diff output text
> !:mime	text/x-diff
> !:strength + 90
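
Editorial sketch: the chain of tests above can be roughly emulated in Python to see which inputs it accepts. This is not libmagic's code. The helper names `find` and `old_unified_check` are hypothetical, and treating `search/N` as N candidate start offsets is an assumption; libmagic's exact boundary handling may differ.

```python
def find(data: bytes, needle: bytes, start: int, span: int) -> int:
    """Return the offset just past the first match of `needle` that begins
    within `span` candidate positions from `start`, or -1 if none is found.
    (Assumed reading of search/N; only the first match is ever considered.)"""
    pos = data.find(needle, start, start + span - 1 + len(needle))
    return -1 if pos < 0 else pos + len(needle)


def old_unified_check(data: bytes) -> bool:
    """Emulate the nested tests: '--- ', end of line, '+++ ' right after,
    end of line, then '@@' right after."""
    pos = 0
    steps = ((b"--- ", 4096), (b"\n", 1024), (b"+++ ", 1),
             (b"\n", 1024), (b"@@", 1))
    for needle, span in steps:
        pos = find(data, needle, pos, span)
        if pos < 0:
            return False  # a failed continuation test ends the whole chain
    return True


plain = b"--- a/x\n+++ b/x\n@@ -1 +1 @@\n-old\n+new\n"
print(old_unified_check(plain))
print(old_unified_check(b"--- some note\n" + plain))
```

Under these assumptions the second call fails, because the first "--- " line is not directly followed by a "+++ " line and the chain does not retry with a later match.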
> 
> Patches are often used to describe the differences between source text
> files, so patch files contain fragments of some programming language.
> For that reason most samples are also described as source files for a
> programming language like C++ (at least when using the -k option).
> To avoid duplicate messages one could try to exclude unified diff
> characteristics from the source-describing magic. In reality that is
> impossible, because the magic for every exotic source fragment would
> have to be adapted. Furthermore, the characteristics are often neither
> simple nor very unique. So apparently someone in the past raised the
> magic strength for unified diff by adding 90 to the original strength
> 38. Unfortunately, as many people do, they did not explain or document
> the facts and reasons. So it took me some weeks to realize that this is
> sometimes WRONG! At the moment the total strength is 128, but for at
> least 2 text types the strength is higher. I checked this by running the
> file command with the --list option and grepping for source/text magic.
> These types are:
> 	# HTML document text (strength=170,90,71,53,52,51,49)
> 	# POSIX shell script (strength=130 ./commands)
> So a sample like fix-qt5.6-build.patch is described first as "POSIX
> shell script" because of magic strength 130. Looking at the first lines
> of such patches we see that this is a self-applying patch. It starts as
> a shell script (shebang line #! /bin/sh) that calls the patch tool, which
> applies the appended text differences (see appended
> head-diff-unified.txt.gz). If the intention of the diff creator had been
> to give other users an easy way to change some sources, the author
> would have named the file something like fix-qt5.6-build.sh. Obviously
> for such samples the diff aspect comes first; that is implied by the
> standard PATCH suffix. So in order for the diff description of such
> samples to come first, the magic strength must be raised by at least 93
> to get a total strength of 131. Maybe there exist some HTML samples
> which need even more strength, but I did not find such text in my
> thousands of examples.
> 
> The next error is found in the example ShellR64.patch from the UEFI SDK.
> Here much more text comes before the unified fragment. So the search
> range of about 4 KB is too small and must be raised to about 11 KB.
> 
> The next error is found in examples like diff_file.diff. In order to
> understand what is going wrong, we first recapitulate the unified diff
> characteristics. The magic checks for 3 adjacent lines. The first starts
> with 3 minus signs and a space character. The second starts with 3 plus
> signs and a space character. The third starts with 2 at signs. Often
> this 3-line construct comes at the beginning, but not always. So the
> search construct is used to match unusual samples, but then you must
> carefully consider the lines before these 3 lines characteristic of
> unified diff.
> One case is phrases at the beginning which are interpreted as other
> diff variants. So we must look at the strength of these diff variants.
> These look like:
> 	# RCS/CVS diff (strength=36=36+0),
> 	# diff output text (strength=38=38+0)
> So a patch for the Revision Control System (RCS) can also be in unified
> diff format. Then both variants are described, but the unified
> description comes first because of the raised strength. So this does not
> hurt.
> 
> Now comes the case with the subtle error (indent-header.patch,
> diff_file.diff). Here, before the unified part, there are lines starting
> with 3 minus signs and a space character. This triggers the first magic
> test, but in the next test no plus sign fragment is found on the next
> line, and magic execution stops there. So such samples are not
> recognized. So I relax the tests a little bit: I just test whether a
> sample has lines with plus and minus fragments near the beginning. This
> would also match samples where the plus line comes first and the line
> with the minus fragment comes later. But I keep the check that the line
> after the plus line carries the at sign fragment. In the
> indent-header.patch sample, however, an indented diff fragment comes
> first. So I must increase the search range a little bit, but to
> compensate I also check for a space and a minus character after the at
> signs. I hope that the tests as a whole are still unique for this diff
> variant. So this now looks like:
> 0	search/11054	---\040
> !:strength + 93
> >0	use		diff-unified
> 0	name		diff-unified
> >0	search/11084 	+++\040
> >>&0	search/1024 	\n
> >>>&0	search/2	@@\040-	unified diff output text
> !:mime	text/x-diff
> !:ext	diff/patch/dif/pch/rej
> >>>>0	string	!---\040
> >>>>>0	string		x	\b, 1st line "%s"
> >>>>>>&1 string	x	\b, 2nd line "%s"
> >>>>>>>&1 string	x	\b, 3rd line "%s"
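
Editorial sketch: the relaxed tests above can be roughly emulated in Python. The names `find` and `relaxed_unified_check` are hypothetical, the printing of the first 3 lines is left out, and reading `search/N` as N candidate start offsets is an assumption about boundary behaviour rather than a statement about libmagic.

```python
def find(data: bytes, needle: bytes, start: int, span: int) -> int:
    """Return the offset just past the first match of `needle` that begins
    within `span` candidate positions from `start`, or -1 if none is found."""
    pos = data.find(needle, start, start + span - 1 + len(needle))
    return -1 if pos < 0 else pos + len(needle)


def relaxed_unified_check(data: bytes) -> bool:
    """'--- ' and '+++ ' are both searched from offset 0, independently of
    each other; only the '@@ -' line must directly follow the '+++ ' line."""
    if find(data, b"--- ", 0, 11054) < 0:
        return False
    pos = find(data, b"+++ ", 0, 11084)
    if pos < 0:
        return False
    pos = find(data, b"\n", pos, 1024)      # end of the '+++ ' line
    if pos < 0:
        return False
    return find(data, b"@@ -", pos, 2) >= 0  # at signs, space, minus


plain = b"--- a/x\n+++ b/x\n@@ -1 +1 @@\n-old\n+new\n"
print(relaxed_unified_check(b"--- some note\n" + plain))
print(relaxed_unified_check(b"+++ b/x\n@@ -1 +1 @@\n--- a/x\n"))
```

Both calls succeed in this model: an earlier "--- " line no longer breaks the chain, and, as noted in the message, a sample where the plus line comes before the minus line matches too.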
> 
> Such output is used/created by the diff and patch utilities. Therefore
> these 2 names are often used as file name suffixes. On the old FAT file
> system there is an 8+3 limit for file names, so there the maximal length
> of a suffix is 3. Apparently that is why dif is used there instead of
> diff, and pch instead of patch. According to the patch documentation, if
> patch cannot find a place to install a hunk of the patch, it puts the
> hunk out to a reject file, which normally is the name of the output file
> plus a .rej suffix or similar. These extensions are also listed on
> https://file-extension.net/seeker/
> As a check I also show the first 3 lines for samples where the unified
> fragment is not at the beginning.
> 
> Error number 4 is that a few samples (like NewClients.patch,
> file-5.12-msdos-encoding.diff, file-5.30-apple-Ctrl-Z.diff) are not
> recognized and are described as data, whereas TrID correctly recognizes
> these samples. Normally patches are differences of pure ASCII text
> files, but that is not always true. The above mentioned samples contain
> control characters (like Ctrl-Z Ctrl-D Ctrl-V). The search directive
> without an option behaves the same as with /t, so the samples are tested
> as text files. This test fails for these "binary" samples, and the
> remaining tests and the displaying part are never executed. So I do it
> similarly to the text variant, but use the binary option in the search
> directive. These additional lines look like:
> 0	search/4096/b	---\040	uni~b
> !:strength + 93
> >0	use		diff-unified
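
Editorial sketch: to see whether a given patch falls into this "binary" group, one can list the control bytes it contains. The helper name `control_bytes` and the set of bytes treated as harmless (tab, LF, CR) are assumptions for illustration; this is not how libmagic itself classifies text.

```python
HARMLESS = {0x09, 0x0A, 0x0D}  # tab, LF, CR (assumed to be ordinary text)


def control_bytes(data: bytes) -> set:
    """Return the set of ASCII control bytes in `data`, ignoring tab/LF/CR
    (hypothetical helper)."""
    return {b for b in data if b < 0x20 and b not in HARMLESS}


# Ctrl-Z is 0x1A, Ctrl-D is 0x04, Ctrl-V is 0x16.
sample = b"--- a/x\n+++ b/x\n@@ -1 +1 @@\n-end\x1a\n+end\n"
print(sorted(hex(b) for b in control_bytes(sample)))
```

A non-empty result only hints that the text-mode search may be skipped for that file; the extra `/b` entry above is what actually covers such samples.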
> 
> After applying the above mentioned modifications with the patch
> file-5.45-diff-unified.diff and using more sources from Magdir,
> all my "unified" samples are recognized. With the -k option the output
> now looks like:
> NewClients.patch:              unified diff output text
> 			       , 1st line
> 			       "Index: controls.pp"
> 			       , 2nd line
> 			       "=============================
> 			       , 3rd line
> 			       "--- controls.pp\011(revision 20785)"
> Python-2.6.1-mingw.patch:      unified diff output
> 			       Python script, ASCII text executable
> 			       , ASCII text
> ShellR64.patch:                unified diff output text
> 			       , 1st line
> 			       "From 6451e0daf7f733a27e1
> 			       afb3c7ac662a620d8b93b
> 			       Mon Sep 17 00:00:00 2001"
> 			       , 2nd line
> 			       "From: Olivier Martin
> 			       <olivier.martin at arm.com>"
> 			       , 3rd line
> 			       "Date: Tue, 14 Jan 2014 14:43:50 +0000"
> 			       , ASCII text, with CRLF
> 			       , LF line terminators
> aspell-0.50.4.1-vc++.diff:     unified diff output text
> 			       , 1st line
> 			       "Only in aspell-win32: Debug"
> 			       , 2nd line
> 			       "Only in aspell-win32: Release"
> 			       , 3rd line
> 			       "Only in aspell-win32: StdAfx.cpp"
> 			       C++ source text
> 			       C++ source text
> 			       diff output, ASCII text
> diff_file.diff:                unified diff output
> 			       C source, ASCII text
> doublecmd.diff:                unified diff output text
> 			       , 1st line
> 			       "Index: kperm_64.inc"
> 			       , 2nd line
> 			       ""
> 			       , 3rd line
> 			       "============================
> 			       RCS/CVS diff output
> 			       , ASCII text, with CRLF line terminators
> fdiskpt.dif:                   unified diff output, ASCII text
> file-5.12-msdos-encoding.diff: unified diff output text
> file-5.30-apple-Ctrl-Z.diff:   unified diff output text
> file-5.32-macintosh-type.diff: unified diff output
> 			       JavaScript source, ASCII text
> file-5.40-algol68-a68.diff:    unified diff output
> 			       Algol 68 source text
> 			       Pascal source, ASCII text
> file-5.40-images-pdd.diff:     unified diff output
> 			       TeX document, ASCII text
> file-5.45-database-mork.diff:  unified diff output
> 			       exported SGML document, ASCII text
> fix-qt5.6-build.patch:         unified diff output text
> 			       , 1st line
> 			       "#! /bin/sh"
> 			       , 2nd line
> 			       "patch -p1 -l -f -R $* < $0"
> 			       , 3rd line
> 			       "exit $?", ASCII text
> httpclient.patch:              unified diff output, ISO-8859 text
> indent-header.patch:           unified diff output text
> 			       , 1st line
> 			       "  --- /dev/null"
> 			       , 2nd line
> 			       "  +++ b/symlink/index-file"
> 			       , 3rd line
> 			       "  @@ -0,0 +1,1 @@"
> 			       POSIX shell script, ASCII text executable
> ldlang.c.rej:                  unified diff output, ASCII text
> xsltml_2.1.2.patch:            unified diff output
> 			       LaTeX document text
> 			       exported SGML document
> 			       , Unicode text
> 			       , UTF-8 text, with very long lines (309)
> zip.c.diff:                    unified diff output
> 			       C source, ASCII text
> 
> I hope my diff file can be applied in a future version of the file
> utility.
> 
> There are still other patch formats which are sometimes not
> recognized or not described completely. I will try to handle these in a
> future session.
> 
> With best wishes,
> Jörg Jenderek
> --
> Jörg Jenderek
> <Nachrichtenteil als Anhang.DEFANGED-2349><head-diff-unified.txt.gz><trid-v-diff-unified.txt.gz><droid-diff-unified.csv.gz><file-5_45-diff-unified_diff.DEFANGED-2350><file-5_45-diff-unified_diff_sig.DEFANGED-2351>-- 
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file
