[File] [PATCH] Magdir/magic C-source, scripts misidentified as magic text fragment

Christos Zoulas christos at zoulas.com
Mon Jun 10 23:10:18 UTC 2024


Committed, thanks!

christos

> On Jun 9, 2024, at 8:48 AM, Jörg Jenderek (GMX) <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> Some time ago I sent a patch for Magdir/magic to recognize the text
> fragments used to build the binary magic database magic.mgc.
> Unfortunately these fragments have no unique pattern.
> 
> When running file command version 5.45 on real magic text fragments,
> some misidentified C sources and some scripts (sed, Python, AWK, found
> in the fetchmail package), I get output like:
> 
> EMSINIT.INC:   magic text fragment for file(1) cmd, 2nd line
> 	       ";-------------------------------------------
> 	       -----------------------------;", 3rd line
> 	       ";\011   Initialization...\011\011\011\011\011\011 ;"
> RESTPARS.C:    magic text fragment for file(1) cmd, 2nd line
> 	       "/*-------------------------------", 3rd line
> 	       "/* SOURCE FILE NAME: restpars.c"
> RTDO1.C:       magic text fragment for file(1) cmd, 2nd line
> 	       "/*-----------------------------", 3rd line
> 	       "/* SOURCE FILE NAME:  RTDO1.C"
> aria:          magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 3rd line
> 	       "# URL: \011\011https://de.wikipedia.org/wiki/
> 	       Aria_(Software)"
> biosig:        magic text fragment for file(1) cmd, 2nd line
> 	       "############################################
> 	       ##################################", 3rd line
> 	       "#"
> constants.pxi: magic text fragment for file(1) cmd, 1st line
> 	       "#-------------------------------------------
> 	       ----------------------------------", 2nd line
> 	       "# Python module level constants"
> ctf:           magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -------------------", 3rd line
> 	       "# ctf:  file(1) magic for CTF (Common Trace Format)
> 	       trace files"
> fse.sed:       magic text fragment for file(1) cmd, 1st line
> 	       "# ------------------------------------------
> 	       -----------------------", 2nd line
> 	       "# _FSEHLQ is the High Level Qualifier used for the
> 	       FSE files"
> gotmail.awk:   magic text fragment for file(1) cmd, 1st line
> 	       "#-------------------------------------------
> 	       ----------------------------------", 2nd line
> 	       "#"
> map:           magic text fragment for file(1) cmd, 2nd line
> 	       "", 3rd line "#------------------------------
> 	       ------------------------------------------------"
> msx:           magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 3rd line
> 	       "# msx:  file(1) magic for the MSX Home Computer"
> music:         magic text fragment for file(1) cmd, 1st line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 2nd line
> 	       "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
> nasa:          magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 3rd line
> 	       "# nasa:\011file(1) magic"
> symbos:        magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 3rd line
> 	       "# msx:  file(1) magic for the SymbOS operating system"
> weak:          magic text fragment for file(1) cmd, 2nd line
> 	       "#-------------------------------------------
> 	       -----------------------------------", 3rd line
> 	       "# weak:  file(1) magic for very weak magic entries,
> 	       disabled by default"
> 
> Luckily the display part is done by the subroutine magic-fragment
> inside Magdir/magic. It starts like:
> 0	name		magic-fragment
>   >0	string		x	magic text fragment for file(1) cmd
> !:mime	text/x-file
> !:ext	/news/out/script
> For checking purposes the first text lines are shown afterwards by
> magic lines like:
>   >0	ubyte		=0x0A
>   >>1	string		x		\b, 2nd line "%s"
>   >>>&1	string		x		\b, 3rd line "%s"
> Most (305/339) samples start with an empty first line. In typical
> samples like "music" the second line then consists of a comment
> separator line to make the text easier for humans to read. That line is
> a hash character followed by about 78 minus characters. On the third
> line I often found a Revision Control System (RCS) keyword starting
> with "$File". Unfortunately for some examples like "map" not both of
> these characteristic patterns are shown: for this sample the current
> subroutine shows only the separator. So I show some more text lines by
> additional lines afterwards like:
>   >>>>&1	string		x		\b, 4th line "%s"
>   >>>>>&1	string		x		\b, 5th line "%s"
> 
> So now we see that in the "map" sample the RCS keyword comes some lines
> later. Furthermore we can now see that many samples contain, near the
> beginning, a comment line with a reference to the file man page. These
> lines look like:
> # map:  file(1) magic for Map data
> # music:  file (1) magic for music formats
> But caution: sometimes there is a space character before the opening
> parenthesis. So now the characteristic patterns that distinguish
> fragments from misidentified scripts are known, and only additional
> test lines must be inserted before calling the subroutine.
> 
> Some samples (28/339 archive arm assembler beetle c-lang clojure
> compress der filesystems firmware gentoo lammps m4 mail.news make marc21
> music parrot pascal pc88 pc98 perl ringdove tcl varied.script
> webassembly x68000 zfs) start with a comment line followed by a
> separator line. These are handled by the current branch, which looks
> like:
> 0	ubyte		=0x23
> >4	string		--------
> >>0	use		magic-fragment
> In order to skip some scripts (like fse.sed stage1.sed constants.pxi
> gotmail.awk from the fetchmail package) I look near the beginning for
> the man page reference, either file(1) {the usual form, as in lammps
> v 1.1} or file (1) (in a few samples like music v 1.1). Searching for
> just "(1)" covers both spellings. This branch now becomes:
> 0	ubyte		=0x23
> >4	string		--------
> >>12	search/180	(1)
> >>>0	use		magic-fragment
> 
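
The patched comment-first branch can be mimicked outside of file(1) to try it on sample headers. The following Python is only a rough sketch, not how libmagic is implemented: the helper names `search` and `comment_first_branch` are made up here, the sample byte strings are shortened stand-ins, and it assumes that `search/180` attempts the match at 180 consecutive start positions beginning at the given offset.

```python
def search(data: bytes, start: int, span: int, pattern: bytes) -> bool:
    """Approximate magic(5) 'search/span': try the pattern at `span`
    consecutive start positions beginning at offset `start` (assumption)."""
    end = start + span + len(pattern) - 1
    return data.find(pattern, start, end) != -1


def comment_first_branch(data: bytes) -> bool:
    """Return True if the patched branch would call magic-fragment."""
    if data[:1] != b"#":                  # 0    ubyte      =0x23
        return False
    if data[4:12] != b"--------":         # >4   string     --------
        return False
    return search(data, 12, 180, b"(1)")  # >>12 search/180 (1)


# Shortened stand-ins for the headers of "music" and "constants.pxi".
fragment = (b"#" + b"-" * 78 + b"\n"
            b"# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $\n"
            b"# music:  file (1) magic for music formats\n")
script = (b"#" + b"-" * 77 + b"\n"
          b"# Python module level constants\n")

print(comment_first_branch(fragment))  # True
print(comment_first_branch(script))    # False
```

Because only "(1)" is searched, both "file(1)" and "file (1)" pass, while a script header without any man page reference near the top is rejected.
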
> Many samples start with an empty first line followed by a separator
> line. Such samples are handled by a branch that looks like:
> 0	ubyte		=0x0A
> >4	string		--------
> >>0	use		magic-fragment
> 
> In order to skip some MS-DOS C source texts {EMSINIT.INC MEM.C
> RESTPARS.C RTDO.C RTDO1.C RTFILE.C RTFILE1.C RTNEW.C RTNEW1.C RTOLD.C
> RTOLD1.C RTT1.C RTT3.C} I now also look for the Revision Control System
> keyword near the beginning. That matches many fragments. If no RCS
> keyword is found, I look for the reference to the man page. This
> applies to a few samples {like ctf (2022-03-26) msx (2021-06-30) nasa
> (2021-02-23) symbos (2021-02-23) weak (2021-02-23)}. If this does not
> match either, I look for the magic mime keyword. This applies to one
> sample, aria (2021-12-24), but that characteristic is not found near
> the beginning, so a larger search range is used. So this branch now
> becomes:
> 0	ubyte		=0x0A
> >4	string		--------
> >>1	search/128	$File
> >>>0	use		magic-fragment
> >>1	default		x
> >>>1	search/180	file(1)
> >>>>0	use		magic-fragment
> >>>1	default		x
> >>>>1	search/1024	\041:mime
> >>>>>0	use		magic-fragment
> 
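
The fallback chain above (RCS keyword first, then man page reference, then mime keyword) can be sketched in Python to make the order of the tests explicit. This is an illustration under assumptions, not libmagic's code: the names `search` and `empty_first_line_branch` are hypothetical, the sample headers are shortened stand-ins, `default` is modelled as "taken only when the earlier test at the same level did not match", and `search/N` is assumed to try N consecutive start positions.

```python
def search(data: bytes, start: int, span: int, pattern: bytes) -> bool:
    """Approximate magic(5) 'search/span' starting at offset `start`."""
    end = start + span + len(pattern) - 1
    return data.find(pattern, start, end) != -1


def empty_first_line_branch(data: bytes) -> bool:
    """Return True if the patched branch would call magic-fragment."""
    if data[:1] != b"\n":                   # 0     ubyte       =0x0A
        return False
    if data[4:12] != b"--------":           # >4    string      --------
        return False
    if search(data, 1, 128, b"$File"):      # >>1   search/128  $File
        return True
    if search(data, 1, 180, b"file(1)"):    # >>>1  search/180  file(1)
        return True
    return search(data, 1, 1024, b"!:mime") # >>>>1 search/1024 \041:mime


# Shortened stand-ins for the headers of "ctf" and "RESTPARS.C".
ctf = (b"\n#" + b"-" * 62 + b"\n"
       b"# ctf:  file(1) magic for CTF (Common Trace Format) trace files\n")
c_source = (b"\n/*" + b"-" * 31 + b"\n"
            b"/* SOURCE FILE NAME: restpars.c\n")

print(empty_first_line_branch(ctf))       # True
print(empty_first_line_branch(c_source))  # False
```

The C source passes the first two tests (newline, then dashes at offset 4) but contains none of the three keywords, so the subroutine is never called for it.
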
> After applying the above mentioned modifications by patch
> file-5.45-magic.diff I get output like:
> EMSINIT.INC:   ISO-8859 text
> RESTPARS.C:    data
> RTDO1.C:       data
> aria:          magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 3rd line
> 	       "# URL: \011\011https://de.wikipedia.org/wiki/
> 	       Aria_(Software)", 4th line
> 	       "# Reference:\011https://github.com/aria2/aria2/blob/
> 	       master/doc/manual-src/en/technical-notes.rst"
> 	       , 5th line "# From:\011\011Joerg Jenderek"
> biosig:        magic text fragment for file(1) cmd, 2nd line
> 	       "###########################################
> 	       ###################################", 3rd line
> 	       "#"
> 	       , 4th line "#    Magic ids for biomedical signal
> 	       file formats ", 5th line
> 	       "#    Copyright (C) 2018 Alois Schloegl
> 	       <alois.schloegl at gmail.com>"
> constants.pxi: ASCII text
> ctf:           magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       --------------------", 3rd line
> 	       "# ctf:  file(1) magic for CTF (Common Trace Format)
> 	       trace files", 4th line
> 	       "#", 5th line
> 	       "# Specs. available here: <https://www.efficios.com/ctf>"
> fse.sed:       ASCII text
> gotmail.awk:   ASCII text
> map:           magic text fragment for file(1) cmd, 2nd line
> 	       "", 3rd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 4th line
> 	       "# $File: map,v 1.10 2023/02/03 20:41:57 christos Exp $"
> 	       , 5th line "# map:  file(1) magic for Map data"
> msx:           magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 3rd line
> 	       "# msx:  file(1) magic for the MSX Home Computer"
> 	       , 4th line "# v1.3", 5th line
> 	       "# Fabio R. Schmidlin
> 	       <sd-snatcher at users.sourceforge.net>"
> music:         magic text fragment for file(1) cmd, 1st line
> 	       "#------------------------------------------
> 	       ------------------------------------", 2nd line
> 	       "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
> 	       , 3rd line
> 	       "# music:  file (1) magic for music formats", 4th line
> 	       "", 5th line
> 	       "# BWW format used by Bagpipe Music Writer Gold by
> 	       Robert MacNeil Musicworks"
> nasa:          magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 3rd line
> 	       "# nasa:\011file(1) magic", 4th line
> 	       "", 5th line "# From: Barry Carter
> 	       <carter.barry at gmail.com>"
> symbos:        magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 3rd line
> 	       "# msx:  file(1) magic for the SymbOS operating system"
> 	       , 4th line "# http://www.symbos.de", 5th line
> 	       "# Fabio R. Schmidlin <frs at pop.com.br>"
> weak:          magic text fragment for file(1) cmd, 2nd line
> 	       "#------------------------------------------
> 	       ------------------------------------", 3rd line
> 	       "# weak:  file(1) magic for very weak magic entries,
> 	       disabled by default", 4th line "#"
> 	       , 5th line "# These entries are so weak that they
> 	       might interfere identification of"
> 
> When running the file command on magic fragments in a known directory
> like Magdir, it is not disturbing when a few samples look different.
> But when such samples are found in another directory as a backup, or in
> the /lost+found directory after a system crash, it is irritating when a
> few samples look different. So I started to unify the fragments to
> match the most common appearance. That look is: first line empty,
> second line a separator line (with 78 minus characters), third line
> with the RCS keyword and fourth line with the man page reference.
> Another advantage is that in the end only one branch is executed. If
> this is cached by the operating system, this can speed up program
> execution. This unification is done by:
> file-5.45-aria-unified.diff
> file-5.45-biosig-unified.diff
> file-5.45-ctf-unified.diff
> file-5.45-map-unified.diff
> file-5.45-msx-unified.diff
> file-5.45-music-unified.diff
> file-5.45-nasa-unified.diff
> file-5.45-symbos-unified.diff
> file-5.45-lammps-unified.diff
> file-5.45-espressif-unified.diff
> There are still dozens of fragments which do not have the unified look.
> But at the moment I do not want to apply these steps, because then I
> would get so many patches that I might lose the overview. So I will do
> the remaining fragments in the future.
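
To find the fragments that still lack the unified look, the four-line layout described above could be checked with a small script. The sketch below is hypothetical and not part of the patches: the function name `has_unified_look` is invented, and it assumes a separator is "#" followed by at least 70 minus characters (the mail says 78) and that the man page reference may be spelled "file(1)" or "file (1)".

```python
import re

# Assumption: accept "#" plus 70 or more dashes as a separator line.
SEPARATOR = re.compile(r"#-{70,}")
# Man page reference, with or without a space before the parenthesis.
MAN_REF = re.compile(r"file ?\(1\)")


def has_unified_look(text: str) -> bool:
    """Check the layout: empty line, separator, RCS keyword, man page ref."""
    lines = text.split("\n")
    if len(lines) < 4:
        return False
    return (lines[0] == ""
            and SEPARATOR.fullmatch(lines[1]) is not None
            and "$File" in lines[2]
            and MAN_REF.search(lines[3]) is not None)


unified = ("\n#" + "-" * 78 + "\n"
           "# $File: map,v 1.10 2023/02/03 20:41:57 christos Exp $\n"
           "# map:  file(1) magic for Map data\n")
not_unified = ("#" + "-" * 78 + "\n"
               "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $\n"
               "# music:  file (1) magic for music formats\n")

print(has_unified_look(unified))      # True
print(has_unified_look(not_unified))  # False
```

Run over every file in Magdir, such a check would list the fragments that remain to be unified.
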
> 
> I hope my diff files can be applied in a future version of the
> file utility.
> 
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
