[File] [PATCH] Magdir/diff context diff; C source first because of strength
Christos Zoulas
christos at zoulas.com
Fri Feb 9 00:50:50 UTC 2024
Committed, thanks!
christos
> On Feb 6, 2024, at 9:33 PM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> Some days ago I had to handle some patch files. Unfortunately there are
> about a dozen different variants. In this session I will mainly handle
> "context" samples, which are not "unified".
>
> So I ran the file command, version 5.45, with the -k option on hundreds
> of such patch examples. I get output like:
>
> create.dgux.patch: context diff output, ASCII text
> gpxe.diff: unified diff output text
> diff output, ASCII text
> hqx.diff: C source text
> context diff output, ASCII text
> osx-roots.diff: unified diff output text
> diff output, ASCII text
> , with very long lines (435)
> progname.h.diff: context diff output text
> C source, ASCII text
> python-2.7.6-mingw.patch: unified diff output text
> Python script text executable
> diff output, ASCII text
> sac.patch: context diff output text
> exported SGML document, ASCII text
> vblade-17-aio.2.diff: unified diff output text
> C source text
> diff output text
> C source, ASCII text
>
> With the --extension option only the 3-byte sequence ??? is shown, and
> with the -i option text/x-diff is often shown.
>
> For comparison I also ran the file format identification utility
> DROID (see https://sourceforge.net/projects/droid/). Here the samples
> are recognized.
>
> On Linux, according to the shared MIME-info database, such samples are
> called "Differences between files". Here text/x-patch is used as the
> MIME type; text/x-diff is listed as an alias, and the parent type is
> text/plain. The context samples are recognized just by looking for the
> 3-byte sequence "---" at the beginning, followed by one space or tab
> character. Two suffixes (*.diff, *.patch) are listed. The other samples
> are recognized by looking for the 4-byte phrase diff at the beginning,
> followed by one space or tab character. That information can be seen
> in the source file freedesktop.org.xml.in, found for example on
> gitlab.freedesktop.org.
>
> For comparison I also ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html). The context samples
> (like hqx.diff) are described there as "context diff output" by
> diff-context.trid.xml, with text/x-patch as the MIME type and two
> suffixes (.DIFF/PATCH). Another variant (like vblade-17-aio.2.diff),
> which the file command describes as "diff output text", is likewise
> described as "diff output text" by diff.trid.xml, with text/x-patch as
> the MIME type and four suffixes (.PATCH/PCH/DIFF/DIF; see the appended
> trid-v-diff-context.txt.gz).
>
> This tool lists the file name extensions used and, with the -v option,
> the related URL pointing to a web site with file format information.
> That information is expressed by comment lines inside Magdir/diff like:
> # URL: https://en.wikipedia.org/wiki/Diff_utility#Context_format
> # Ref.: http://mark0.net/download/triddefs_xml.7z
> # defs/d/diff.trid.xml
> # defs/d/diff-context.trid.xml
>
> The context samples are recognized by lines inside Magdir/diff. These
> look like:
> 0 search/1 ***\040
> >&0 search/1024 \n---\040 context diff output text
> !:mime text/x-diff
>
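
The two magic lines above can be approximated in Python as follows. This
is only a sketch: the function name, the reading of "&0" as "right after
the first match" and the window arithmetic are assumptions made for
illustration, and the real matching is done by libmagic.

```python
def looks_like_context_diff(data: bytes, search_range: int = 1024) -> bool:
    """Approximate the two-level magic rule for context diffs."""
    head = b"*** "        # level 0: "***\040" must sit at offset 0
    needle = b"\n--- "    # level 1: "\n---\040" within the search range
    if not data.startswith(head):
        return False
    start = len(head)     # assumption: &0 means "right after the first match"
    # Let the needle start at any of the next `search_range` offsets.
    window = data[start:start + search_range + len(needle) - 1]
    return needle in window


sample = (b"*** old.c\tMon Jan  1 00:00:00 2024\n"
          b"--- new.c\tTue Jan  2 00:00:00 2024\n")
print(looks_like_context_diff(sample))
```

A file whose "\n--- " line only appears after the search window would not
match in this model, which is why the message says the search range would
have to be raised for such samples.
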
> For the context samples I get the same problems as found and described
> for the "unified" diff variant. Patches are often used to describe the
> differences between source text files, so the patch files contain
> fragments of some programming language. For that reason most samples are
> also described as source files of a programming language like C (at
> least when using the -k option). The strength of the context variant was
> 38. Unfortunately samples like hqx.diff are also described as C source
> (with strength values like 41, 39, 37). So in order for the diff
> description of such samples to come first, the magic strength must be
> raised by at least 4 to get a total strength of 42. Maybe there exist
> some HTML samples which need even more strength (170), or POSIX shell
> scripts (strength=130 ./commands), but I did not find such text in my
> hundreds of examples.
>
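
The arithmetic in the paragraph above can be checked with a few lines of
Python. The strength values 38 and 41 and the increment of 4 are taken
from the message; the dictionary and the sorting are only an illustrative
model (an assumption), not how file itself stores or orders its entries.

```python
def order_by_strength(strengths: dict[str, int]) -> list[str]:
    """Return descriptions sorted from highest to lowest strength."""
    return sorted(strengths, key=lambda name: strengths[name], reverse=True)


before = {"context diff output text": 38, "C source": 41}
after = dict(before)
after["context diff output text"] += 4   # raise by 4: 38 + 4 = 42

print(order_by_strength(before))
print(order_by_strength(after))
```

With these numbers the C source description sorts first before the change
and the context diff description sorts first afterwards.
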
> Such output is used/created by the diff and patch utilities. Therefore
> these 2 names are often used as file name suffixes. On old FAT file
> systems there is an 8+3 limit for file names, so the maximal suffix
> length there is 3. Apparently dif is therefore used instead of diff, and
> pch instead of patch. For the "unified" variant I found such short-name
> examples, but not for the "context" variant, so at the moment I show
> only 2 suffixes. For the unified variant the diff characteristic
> sometimes comes some KB later instead of at offset 0. So maybe there
> exist context samples where this occurs, but I did not find such
> examples in my inspected samples. If others find such examples, then
> the search range must be raised. So these magic lines now become:
> 0 search/1 ***\040
> !:strength +4
> >&0 search/1024 \n---\040 context diff output text
> !:mime text/x-diff
> !:ext diff/patch
>
> The other samples are also described by lines like:
> 0 search/1 diff\040 diff output text
> !:mime text/x-diff
>
> So these lines become:
> 0 search/1 diff\040 diff output text
> !:mime text/x-diff
> !:ext diff/patch
> I found only examples with 2 suffixes, but maybe there also exist
> short-name samples as in the unified variant. If others find such
> samples, then I expect 4 suffixes (diff/patch/dif/pch). The Linux shared
> MIME database also accepts samples with a tab instead of a space
> character, and the TrID tool also checks for a minus character after the
> space character. These tools also check for the magic phrase only at
> the beginning. For the unified variant the diff characteristic comes
> some KB later, so maybe there exist examples where the magic fragment
> comes later. Furthermore, all inspected samples of this variant are
> also described as "unified diff output text", because the unified
> characteristics come later. So I do not know whether there exist
> patches which start with the diff phrase at the beginning and are not
> unified. So I keep that first variant.
>
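
The difference between the current "diff\040" magic line and the shared
MIME database check described above is only which separator may follow
the word diff. A small Python sketch of that comparison; the function
name and the allow_tab parameter are hypothetical and exist only for
illustration:

```python
def starts_with_diff_keyword(data: bytes, allow_tab: bool = False) -> bool:
    """Check for the 4-byte phrase "diff" at offset 0 plus a separator."""
    prefixes = [b"diff "]            # what "diff\040" in Magdir/diff tests
    if allow_tab:
        prefixes.append(b"diff\t")   # tab variant, as described in the message
    return data.startswith(tuple(prefixes))


print(starts_with_diff_keyword(b"diff -u a.c b.c\n"))
print(starts_with_diff_keyword(b"diff\t-u a.c b.c\n"))
print(starts_with_diff_keyword(b"diff\t-u a.c b.c\n", allow_tab=True))
```
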
> After applying the above-mentioned modifications with the patch
> file-5.45-diff-context.diff, the diff description comes first for all
> inspected examples. This now looks like:
>
> create.dgux.patch: context diff output, ASCII text
> gpxe.diff: unified diff output text
> diff output, ASCII text
> hqx.diff: context diff output text
> C source, ASCII text
> osx-roots.diff: unified diff output text
> diff output, ASCII text
> , with very long lines (435)
> progname.h.diff: context diff output text
> C source, ASCII text
> python-2.7.6-mingw.patch: unified diff output text
> diff output, ASCII text
> sac.patch: context diff output text
> exported SGML document, ASCII text
> vblade-17-aio.2.diff: unified diff output text
> C source text
> diff output text
> C source, ASCII text
>
> I hope my diff file can be applied in a future version of the file
> utility. There are still other patch formats which are sometimes not
> recognized or not described completely. I will try to handle these in a
> future session.
>
> With best wishes,
> Jörg Jenderek
> --
> Jörg Jenderek
> <trid-v-diff-context.txt.gz><file-5_45-diff-context_diff.DEFANGED-453><file-5_45-diff-context_diff_sig.DEFANGED-454>--
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file