[File] [PATCH] Magdir/diff context diff; C source first because of strength

Christos Zoulas christos at zoulas.com
Fri Feb 9 00:50:50 UTC 2024


Committed, thanks!

christos

> On Feb 6, 2024, at 9:33 PM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> A few days ago I had to handle some patch files. Unfortunately, about a
> dozen different variants exist. In this session I mainly handle
> "context" samples, which are not "unified".
> 
> So I ran the file command, version 5.45, with the -k option on hundreds
> of such patch examples. I get output like:
> 
> create.dgux.patch:        context diff output, ASCII text
> gpxe.diff:                unified diff output text
> 			  diff output, ASCII text
> hqx.diff:                 C source text
> 			  context diff output, ASCII text
> osx-roots.diff:           unified diff output text
> 			  diff output, ASCII text
> 			  , with very long lines (435)
> progname.h.diff:          context diff output text
> 			  C source, ASCII text
> python-2.7.6-mingw.patch: unified diff output text
> 			  Python script text executable
> 			  diff output, ASCII text
> sac.patch:                context diff output text
> 			  exported SGML document, ASCII text
> vblade-17-aio.2.diff:     unified diff output text
> 			  C source text
> 			  diff output text
> 			  C source, ASCII text
> 
> With the --extension option only the 3-byte sequence ??? is shown, and
> with the -i option text/x-diff is often shown.
> 
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). Here the samples
> are recognized.
> 
> On Linux, according to the shared MIME-info database, such samples are
> called "Differences between files". Here text/x-patch is used as the
> MIME type; text/x-diff is listed as an alias and the parent type is
> text/plain. The context samples are recognized just by looking for the
> 3-byte sequence "---" at the beginning, followed by 1 space or tab
> character. Two suffixes (*.diff *.patch) are listed. The other samples
> are recognized by looking for the 4-byte phrase diff at the beginning,
> followed by 1 space or tab character. This information can be seen in
> the source freedesktop.org.xml.in, found for example on
> gitlab.freedesktop.org.
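
The prefix test described above for the shared MIME-info database can be
sketched in Python. This only illustrates the rule as summarized in this
message; it is not code from the database or from any tool, and the
function name looks_like_patch is hypothetical.

```python
def looks_like_patch(data: bytes) -> bool:
    """Return True if data starts with one of the prefixes named above.

    Assumption: the prefixes are taken from the description in this
    message ("---" or "diff", each followed by a space or a tab), not
    copied from freedesktop.org.xml.in.
    """
    prefixes = (b"--- ", b"---\t", b"diff ", b"diff\t")
    # bytes.startswith accepts a tuple and tests each prefix in turn.
    return data.startswith(prefixes)
```

Such a check looks only at the very first bytes, so a patch whose header
is preceded by other text would not be matched.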
> 
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html). The context samples
> (like hqx.diff) are also described there as "context diff output" by
> diff-context.trid.xml, with text/x-patch listed as the MIME type and 2
> suffixes (.DIFF/PATCH). Another variant (as in vblade-17-aio.2.diff),
> which the file command describes as "diff output text", is likewise
> described as "diff output text" by diff.trid.xml. Here too text/x-patch
> is listed as the MIME type, and 4 suffixes are listed
> (.PATCH/PCH/DIFF/DIF; see the appended trid-v-diff-context.txt.gz).
> 
> This tool lists the file name extensions used and, with the -v option,
> the related URL pointing to a web site with file format information.
> This information is expressed by comment lines inside Magdir/diff like:
> # URL:	https://en.wikipedia.org/wiki/Diff_utility#Context_format
> # Ref.:	http://mark0.net/download/triddefs_xml.7z
> #	defs/d/diff.trid.xml
> #	defs/d/diff-context.trid.xml
> 
> The context samples are recognized by lines inside Magdir/diff. These
> look like:
> 0	search/1	***\040
> >&0	search/1024	\n---\040	context diff output text
> !:mime	text/x-diff
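
As a rough illustration of what these two magic lines test, here is a
Python sketch. It assumes that the continuation search starts right
after the first match and that a hit must begin within the next 1024
bytes; the real matching is done by libmagic and may differ in detail.
The name is_context_diff is made up for this example.

```python
def is_context_diff(data: bytes, window: int = 1024) -> bool:
    """Approximate the two-level rule: "*** " at offset 0, then "\\n--- "."""
    head = b"*** "
    marker = b"\n--- "
    # Level 0: the buffer has to begin with "*** ".
    if not data.startswith(head):
        return False
    # Level 1: look for "\n--- " starting within `window` bytes after head.
    rest = data[len(head):]
    return marker in rest[: window + len(marker) - 1]
```

For example, a buffer beginning with "*** a.c" on the first line and
"--- b.c" on the second is accepted, while a unified diff that starts
with "--- " is rejected at the first level.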
> 
> For the context samples I get the same problems as found and described
> for the "unified" diff variant. Patches are often used to describe the
> differences between source text files, so the patch files contain
> fragments of some programming language. For that reason most samples are
> also described as source files for a programming language like C (at
> least when using the -k option). The strength of the context variant was
> 38. Unfortunately samples like hqx.diff are also described as C source
> (with strength values like 41, 39, 37). So in order for the diff
> description of such samples to come first, the magic strength must be
> raised by at least 4 to reach a total strength of 42. Maybe there exist
> some HTML samples which need even more strength (170), or POSIX shell
> scripts (strength=130 ./commands), but I did not find such text in my
> hundreds of examples.
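
The strength arithmetic above can be checked with a small Python sketch.
It assumes only that the description with the highest strength is
reported first; the numbers are the ones quoted in this message, and
first_description is a hypothetical helper, not part of file.

```python
CONTEXT_DIFF_BASE = 38            # strength of the context rule, as quoted
BONUS = 4                         # the proposed "!:strength +4"
C_SOURCE_STRENGTHS = [41, 39, 37]  # C source strengths quoted for hqx.diff


def first_description(candidates):
    """Return the description with the highest strength."""
    return max(candidates, key=lambda item: item[1])[0]


c_source = [("C source", s) for s in C_SOURCE_STRENGTHS]
before = [("context diff output text", CONTEXT_DIFF_BASE)] + c_source
after = [("context diff output text", CONTEXT_DIFF_BASE + BONUS)] + c_source

print(first_description(before))
print(first_description(after))
```

With the base strength of 38 the strongest C source rule (41) wins; with
38 + 4 = 42 the context diff description sorts ahead of it.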
> 
> Such output is created and used by the diff and patch utilities.
> Therefore these 2 names are often used as file name suffixes. On old FAT
> file systems there is an 8+3 limit for file names, so the maximal length
> of a suffix there is 3. Apparently dif is used there instead of diff,
> and pch instead of patch. For the "unified" variant I found such short
> name examples, but not for the "context" variant. So at the moment I
> show only 2 suffixes. For the unified variant the diff characteristic
> sometimes comes some KB later instead of at offset 0. So maybe samples
> exist where this occurs, but I did not find such examples among my
> inspected samples. If others find such examples then the search range
> must be raised. So these magic lines now become:
> 0	search/1	***\040
> !:strength +4
> >&0	search/1024	\n---\040	context diff output text
> !:mime	text/x-diff
> !:ext	diff/patch
> 
> The other samples are also described by lines like:
> 0	search/1	diff\040	diff output text
> !:mime	text/x-diff
> 
> So these lines become:
> 0	search/1	diff\040	diff output text
> !:mime	text/x-diff
> !:ext	diff/patch
> I found only examples with 2 suffixes, but maybe there also exist
> short name samples as in the unified variant. If others find such
> samples then I expect 4 suffixes (diff/patch/dif/pch). The Linux shared
> MIME database also accepts samples with a tab instead of a space
> character, and the TrID tool also checks for a minus character after the
> space character. These tools also check for the magic phrase only at the
> beginning. For the unified variant the diff characteristic comes some KB
> later, so maybe there exist examples where the magic fragment comes
> later. Furthermore, all inspected samples of this variant are also
> described as "unified diff output text" because the unified
> characteristics come later. So I do not know whether there exist patches
> which start with the diff phrase at the beginning and are not unified.
> So I keep that first variant.
> 
> After applying the above-mentioned modifications with the patch
> file-5.45-diff-context.diff, the diff description comes first for all
> inspected examples. This now looks like:
> 
> create.dgux.patch:        context diff output, ASCII text
> gpxe.diff:                unified diff output text
> 			  diff output, ASCII text
> hqx.diff:                 context diff output text
> 			  C source, ASCII text
> osx-roots.diff:           unified diff output text
> 			  diff output, ASCII text
> 			  , with very long lines (435)
> progname.h.diff:          context diff output text
> 			  C source, ASCII text
> python-2.7.6-mingw.patch: unified diff output text
> 			  diff output, ASCII text
> sac.patch:                context diff output text
> 			  exported SGML document, ASCII text
> vblade-17-aio.2.diff:     unified diff output text
> 			  C source text
> 			  diff output text
> 			  C source, ASCII text
> 
> I hope my diff file can be applied in a future version of the file
> utility. There are still other patch formats which are sometimes not
> recognized or not described completely. I will try to handle these
> in a future session.
> 
> With best wishes,
> Jörg Jenderek
> --
> Jörg Jenderek


