[File] [PATCH] Magdir/magic C-source, scripts misidentified as magic text fragment
Christos Zoulas
christos at zoulas.com
Mon Jun 10 23:10:18 UTC 2024
Committed, thanks!
christos
> On Jun 9, 2024, at 8:48 AM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> Some time ago I sent a patch for Magdir/magic to recognize the text
> fragments used to build the binary magic database magic.mgc.
> Unfortunately these fragments have no unique pattern.
>
> When running the file command version 5.45 on real magic text fragments,
> on some misidentified C sources and on some scripts (sed, Python, AWK
> found in the fetchmail package), I get output like:
>
> EMSINIT.INC: magic text fragment for file(1) cmd, 2nd line
> ";-------------------------------------------
> -----------------------------;", 3rd line
> ";\011 Initialization...\011\011\011\011\011\011 ;"
> RESTPARS.C: magic text fragment for file(1) cmd, 2nd line
> "/*-------------------------------", 3rd line
> "/* SOURCE FILE NAME: restpars.c"
> RTDO1.C: magic text fragment for file(1) cmd, 2nd line
> "/*-----------------------------", 3rd line
> "/* SOURCE FILE NAME: RTDO1.C"
> aria: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -----------------------------------", 3rd line
> "# URL: \011\011https://de.wikipedia.org/wiki/
> Aria_(Software)"
> biosig: magic text fragment for file(1) cmd, 2nd line
> "############################################
> ##################################", 3rd line
> "#"
> constants.pxi: magic text fragment for file(1) cmd, 1st line
> "#-------------------------------------------
> ----------------------------------", 2nd line
> "# Python module level constants"
> ctf: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -------------------", 3rd line
> "# ctf: file(1) magic for CTF (Common Trace Format)
> trace files"
> fse.sed: magic text fragment for file(1) cmd, 1st line
> "# ------------------------------------------
> -----------------------", 2nd line
> "# _FSEHLQ is the High Level Qualifier used for the
> FSE files"
> gotmail.awk: magic text fragment for file(1) cmd, 1st line
> "#-------------------------------------------
> ----------------------------------", 2nd line
> "#"
> map: magic text fragment for file(1) cmd, 2nd line
> "", 3rd line "#------------------------------
> ------------------------------------------------"
> msx: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -----------------------------------", 3rd line
> "# msx: file(1) magic for the MSX Home Computer"
> music: magic text fragment for file(1) cmd, 1st line
> "#-------------------------------------------
> -----------------------------------", 2nd line
> "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
> nasa: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -----------------------------------", 3rd line
> "# nasa:\011file(1) magic"
> symbos: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -----------------------------------", 3rd line
> "# msx: file(1) magic for the SymbOS operating system"
> weak: magic text fragment for file(1) cmd, 2nd line
> "#-------------------------------------------
> -----------------------------------", 3rd line
> "# weak: file(1) magic for very weak magic entries,
> disabled by default"
>
> Luckily the display part is done by the subroutine magic-fragment inside
> Magdir/magic. It starts like:
> 0 name magic-fragment
> >0 string x magic text fragment for file(1) cmd
> !:mime text/x-file
> !:ext /news/out/script
> For verification, the first text lines are shown afterwards by magic
> lines like:
> >0 ubyte =0x0A
> >>1 string x \b, 2nd line "%s"
> >>>&1 string x \b, 3rd line "%s"
> Most (305/339) samples start with an empty first line. In typical
> samples like "music" the second line then consists of a comment separator
> line that makes the text easier for humans to read. That line is a hash
> character followed by about 78 minus characters. On the third line I
> often found a Revision Control System (RCS) keyword starting with "$File".
> Unfortunately, for some examples like "map" neither of these two
> characteristic patterns is shown: for this sample the current
> subroutine shows only the separator. So show some more text lines by
> additional lines afterwards like:
> >>>>&1 string x \b, 4th line "%s"
> >>>>>&1 string x \b, 5th line "%s"
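The display step described above can be sketched in Python. This is only an illustration of the idea, not how file(1) implements `string x`: the function name `describe_fragment` is hypothetical, and the escaping that file applies to tabs (`\011`) and any length limit are left out.

```python
# Hypothetical helper; it mimics the shape of the magic-fragment output,
# not file(1)'s real string handling (no \011 escaping, no length limit).
ORDINALS = ["1st", "2nd", "3rd", "4th", "5th"]


def describe_fragment(data: bytes, max_lines: int = 5) -> str:
    """Return a description that quotes the first text lines of data."""
    count = min(max_lines, len(ORDINALS))
    lines = data.decode("latin-1").split("\n")[:count]
    parts = ["magic text fragment for file(1) cmd"]
    for index, line in enumerate(lines):
        if index == 0 and line == "":
            continue  # an empty first line is skipped, as in the 0x0A branch
        parts.append(f'{ORDINALS[index]} line "{line}"')
    return ", ".join(parts)


if __name__ == "__main__":
    sample = b"\n#----\n# $File: map,v $\n# map: file(1) magic for Map data\n"
    print(describe_fragment(sample, max_lines=3))  # only up to the 3rd line
    print(describe_fragment(sample, max_lines=5))  # also shows the 4th line
```

With `max_lines=3` a sample like "map" would show little more than the separator; raising the limit to 5 brings later lines such as the RCS keyword into view.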
>
> So now we see that in the "map" sample the RCS keyword comes some lines
> later. Furthermore we can now see that many samples contain, near the
> beginning, a comment line with a reference to the file man page. These
> lines look like:
> # map: file(1) magic for Map data
> # music: file (1) magic for music formats
> But caution, because sometimes there is a space character before the
> opening parenthesis. So now the characteristic patterns that distinguish
> fragments from misidentified scripts are known, and only additional
> test lines must be inserted before calling the subroutine.
>
> Some samples (28/339 archive arm assembler beetle c-lang clojure
> compress der filesystems firmware gentoo lammps m4 mail.news make marc21
> music parrot pascal pc88 pc98 perl ringdove tcl varied.script
> webassembly x68000 zfs) start with a comment line followed by a separator
> line. These are handled by the current branch, which looks like:
> 0 ubyte =0x23
> >4 string --------
> >>0 use magic-fragment
> In order to skip some scripts (like fse.sed stage1.sed constants.pxi
> gotmail.awk from the fetchmail package) I look near the beginning for a
> reference to the man page, either file(1) {often, as in lammps v 1.1} or
> file (1) (in a few samples like music v 1.1). This branch now becomes:
> 0 ubyte =0x23
> >4 string --------
> >>12 search/180 (1)
> >>>0 use magic-fragment
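A rough Python equivalent of this branch may make the test order easier to follow. It is a sketch under assumptions: the helper names `search` and `hash_branch_matches` are hypothetical, and `search` reflects my reading of magic(5), namely that `search/N` tries a match at N consecutive start positions beginning at the given offset.

```python
def search(data: bytes, offset: int, count: int, needle: bytes) -> bool:
    """Emulate `search/count` (assumed semantics): the match may start
    at any of `count` positions beginning at `offset`."""
    end = offset + count - 1 + len(needle)
    return data.find(needle, offset, end) != -1


def hash_branch_matches(data: bytes) -> bool:
    """True if data would reach `use magic-fragment` in the '#' branch."""
    return (
        data[:1] == b"#"                      # 0   ubyte =0x23
        and data[4:12] == b"--------"         # >4  string --------
        and search(data, 12, 180, b"(1)")     # >>12 search/180 (1)
    )


if __name__ == "__main__":
    separator = b"#" + b"-" * 78 + b"\n"
    fragment = separator + b"# music:\tfile (1) magic for music formats\n"
    script = separator + b"#\n# some awk script\n"
    print(hash_branch_matches(fragment), hash_branch_matches(script))
```

Searching only for `(1)` rather than `file(1)` is what lets both spellings, with and without the space before the parenthesis, pass the same test.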
>
> Many samples start with an empty first line followed by a separator line.
> Such samples are handled by a branch that looks like:
> 0 ubyte =0x0A
> >4 string --------
> >>0 use magic-fragment
>
> In order to skip some MS-DOS C source text {EMSINIT.INC MEM.C RESTPARS.C
> RTDO.C RTDO1.C RTFILE.C RTFILE1.C RTNEW.C RTNEW1.C RTOLD.C RTOLD1.C
> RTT1.C RTT3.C} I now also look for the Revision Control System keyword
> near the beginning. That matches many fragments. If no RCS keyword is
> found, I look for the reference to the man page. This applies to a few
> samples {like ctf (2022-03-26) msx (2021-06-30) nasa (2021-02-23)
> symbos (2021-02-23) weak (2021-02-23)}. If this does not match either, I
> also look for the magic mime keyword. This applies to one sample, aria
> (2021-12-24), but there that characteristic is not found near the
> beginning, so a larger search range is used. So this branch now becomes:
> 0 ubyte =0x0A
> >4 string --------
> >>1 search/128 $File
> >>>0 use magic-fragment
> >>1 default x
> >>>1 search/180 file(1)
> >>>>0 use magic-fragment
> >>>1 default x
> >>>>1 search/1024 \041:mime
> >>>>>0 use magic-fragment
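The cascade of `search` and `default` tests above can be read as a chain of fallbacks. The following Python sketch mirrors that order; the names `search` and `classify_newline_branch` and the returned labels are hypothetical, and the range handling is my assumption about how `search/N` counts start positions.

```python
from typing import Optional


def search(data: bytes, offset: int, count: int, needle: bytes) -> bool:
    """Emulate `search/count` (assumed semantics): the match may start
    at any of `count` positions beginning at `offset`."""
    end = offset + count - 1 + len(needle)
    return data.find(needle, offset, end) != -1


def classify_newline_branch(data: bytes) -> Optional[str]:
    """Return which test lets data reach magic-fragment, or None."""
    if data[:1] != b"\n" or data[4:12] != b"--------":
        return None                           # not this branch at all
    if search(data, 1, 128, b"$File"):        # >>1 search/128 $File
        return "rcs"
    if search(data, 1, 180, b"file(1)"):      # >>>1 search/180 file(1)
        return "manpage"
    if search(data, 1, 1024, b"!:mime"):      # \041 is the octal code of '!'
        return "mime"
    return None                               # e.g. the MS-DOS C sources


if __name__ == "__main__":
    separator = b"\n#" + b"-" * 78 + b"\n"
    print(classify_newline_branch(separator + b"# $File: map,v 1.10 $\n"))
    print(classify_newline_branch(b"\n/*" + b"-" * 30 + b"\n/* SOURCE FILE\n"))
```

Each later test is tried only when the earlier ones fail, which is the role the `default x` lines play in the magic source.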
>
> After applying the above-mentioned modifications with the patch
> file-5.45-magic.diff, I get output like:
> EMSINIT.INC: ISO-8859 text
> RESTPARS.C: data
> RTDO1.C: data
> aria: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> ------------------------------------", 3rd line
> "# URL: \011\011https://de.wikipedia.org/wiki/
> Aria_(Software)", 4th line
> "# Reference:\011https://github.com/aria2/aria2/blob/
> master/doc/manual-src/en/technical-notes.rst"
> , 5th line "# From:\011\011Joerg Jenderek"
> biosig: magic text fragment for file(1) cmd, 2nd line
> "###########################################
> ###################################", 3rd line
> "#"
> , 4th line "# Magic ids for biomedical signal
> file formats ", 5th line
> "# Copyright (C) 2018 Alois Schloegl
> <alois.schloegl at gmail.com>"
> constants.pxi: ASCII text
> ctf: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> --------------------", 3rd line
> "# ctf: file(1) magic for CTF (Common Trace Format)
> trace files", 4th line
> "#", 5th line
> "# Specs. available here: <https://www.efficios.com/ctf>"
> fse.sed: ASCII text
> gotmail.awk: ASCII text
> map: magic text fragment for file(1) cmd, 2nd line
> "", 3rd line
> "#------------------------------------------
> ------------------------------------", 4th line
> "# $File: map,v 1.10 2023/02/03 20:41:57 christos Exp $"
> , 5th line "# map: file(1) magic for Map data"
> msx: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> ------------------------------------", 3rd line
> "# msx: file(1) magic for the MSX Home Computer"
> , 4th line "# v1.3", 5th line
> "# Fabio R. Schmidlin
> <sd-snatcher at users.sourceforge.net>"
> music: magic text fragment for file(1) cmd, 1st line
> "#------------------------------------------
> ------------------------------------", 2nd line
> "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
> , 3rd line
> "# music: file (1) magic for music formats", 4th line
> "", 5th line
> "# BWW format used by Bagpipe Music Writer Gold by
> Robert MacNeil Musicworks"
> nasa: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> ------------------------------------", 3rd line
> "# nasa:\011file(1) magic", 4th line
> "", 5th line "# From: Barry Carter
> <carter.barry at gmail.com>"
> symbos: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> ------------------------------------", 3rd line
> "# msx: file(1) magic for the SymbOS operating system"
> , 4th line "# http://www.symbos.de", 5th line
> "# Fabio R. Schmidlin <frs at pop.com.br>"
> weak: magic text fragment for file(1) cmd, 2nd line
> "#------------------------------------------
> ------------------------------------", 3rd line
> "# weak: file(1) magic for very weak magic entries,
> disabled by default", 4th line "#"
> , 5th line "# These entries are so weak that they
> might interfere identification of"
>
> When running the file command on magic fragments in a known directory
> like Magdir, it is not disturbing when a few samples look different. But
> when such samples are found in another directory, kept there as a backup
> or recovered into /lost+found after a system crash, then it is
> irritating when a few samples look different. So I started to unify
> the fragments to match the most common appearance. That look is: first
> line empty, second line a separator line (with 78 minus characters),
> third line with the RCS keyword and fourth line with the man page
> reference. Another advantage is that in the end only one branch is
> executed. If this is cached by the operating system this can speed up
> program execution. So this procedure is done by
> file-5.45-aria-unified.diff
> file-5.45-biosig-unified.diff
> file-5.45-ctf-unified.diff
> file-5.45-map-unified.diff
> file-5.45-msx-unified.diff
> file-5.45-music-unified.diff
> file-5.45-nasa-unified.diff
> file-5.45-symbos-unified.diff
> file-5.45-lammps-unified.diff
> file-5.45-espressif-unified.diff
> There are still dozens of fragments which do not have the unified look.
> But at the moment I do not want to apply these steps, because then I get
> so many patches that I may lose the overview. So I will do the remaining
> fragments in the future.
>
> I hope my diff files can be applied in a future version of the
> file utility.
>
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek