[File] [PATCH] Magdir/magic C-source, scripts misidentified as magic text fragment

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Sun Jun 9 12:48:23 UTC 2024


Hello,

some time ago I sent a patch for Magdir/magic to recognize the text
fragments from which the binary magic database magic.mgc is built.
Unfortunately these fragments have no unique pattern.

When running file command version 5.45 on real magic text fragments,
on some misidentified C sources and on some scripts (sed, Python, AWK
found in the fetchmail package), I get output like:

EMSINIT.INC:   magic text fragment for file(1) cmd, 2nd line
	       ";-------------------------------------------
	       -----------------------------;", 3rd line
	       ";\011   Initialization...\011\011\011\011\011\011 ;"
RESTPARS.C:    magic text fragment for file(1) cmd, 2nd line
	       "/*-------------------------------", 3rd line
	       "/* SOURCE FILE NAME: restpars.c"
RTDO1.C:       magic text fragment for file(1) cmd, 2nd line
	       "/*-----------------------------", 3rd line
	       "/* SOURCE FILE NAME:  RTDO1.C"
aria:          magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -----------------------------------", 3rd line
	       "# URL: \011\011https://de.wikipedia.org/wiki/
	       Aria_(Software)"
biosig:        magic text fragment for file(1) cmd, 2nd line
	       "############################################
	       ##################################", 3rd line
	       "#"
constants.pxi: magic text fragment for file(1) cmd, 1st line
	       "#-------------------------------------------
	       ----------------------------------", 2nd line
	       "# Python module level constants"
ctf:           magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -------------------", 3rd line
	       "# ctf:  file(1) magic for CTF (Common Trace Format)
	       trace files"
fse.sed:       magic text fragment for file(1) cmd, 1st line
	       "# ------------------------------------------
	       -----------------------", 2nd line
	       "# _FSEHLQ is the High Level Qualifier used for the
	       FSE files"
gotmail.awk:   magic text fragment for file(1) cmd, 1st line
	       "#-------------------------------------------
	       ----------------------------------", 2nd line
	       "#"
map:           magic text fragment for file(1) cmd, 2nd line
	       "", 3rd line "#------------------------------
	       ------------------------------------------------"
msx:           magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -----------------------------------", 3rd line
	       "# msx:  file(1) magic for the MSX Home Computer"
music:         magic text fragment for file(1) cmd, 1st line
	       "#-------------------------------------------
	       -----------------------------------", 2nd line
	       "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
nasa:          magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -----------------------------------", 3rd line
	       "# nasa:\011file(1) magic"
symbos:        magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -----------------------------------", 3rd line
	       "# msx:  file(1) magic for the SymbOS operating system"
weak:          magic text fragment for file(1) cmd, 2nd line
	       "#-------------------------------------------
	       -----------------------------------", 3rd line
	       "# weak:  file(1) magic for very weak magic entries,
	       disabled by default"

Luckily the display part is done by the subroutine magic-fragment inside
Magdir/magic. It starts like:
0	name		magic-fragment
    >0	string		x	magic text fragment for file(1) cmd
!:mime	text/x-file
!:ext	/news/out/script
For control reasons the first text lines are shown afterwards by magic
lines like:
    >0	ubyte		=0x0A
    >>1	string		x		\b, 2nd line "%s"
    >>>&1	string		x		\b, 3rd line "%s"
Most (305/339) samples start with an empty first line. In typical
samples like "music" the second line consists of a comment separator
line that makes the text easier for humans to read. That line is a hash
character followed by about 78 minus characters. On the third line I
often found a Revision Control System (RCS) keyword starting with
"$File". Unfortunately for some examples like "map" neither of these two
characteristic patterns is shown. For this sample the current
subroutine shows only the separator. So show some more text lines by
additional lines afterwards like:
    >>>>&1	string		x		\b, 4th line "%s"
    >>>>>&1	string		x		\b, 5th line "%s"
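
As a rough illustration only (not part of the patch), the effect of these
display lines can be imitated in Python. The function name
describe_first_lines is hypothetical, and unlike file(1) this sketch does
not escape tabs as \011:

```python
def describe_first_lines(data: bytes, max_lines: int = 5) -> str:
    """Quote the first few text lines, similar to the control output."""
    ordinals = ["1st", "2nd", "3rd", "4th", "5th"]
    # Never look at more lines than we have ordinal names for.
    limit = min(max_lines, len(ordinals))
    lines = data.split(b"\n")[:limit]
    parts = []
    for number, line in enumerate(lines):
        # Samples starting with an empty first line begin at the 2nd line.
        if number == 0 and line == b"":
            continue
        text = line.decode("latin-1")
        parts.append(f'{ordinals[number]} line "{text}"')
    return "magic text fragment for file(1) cmd, " + ", ".join(parts)
```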

So now we see that in the "map" sample the RCS keyword comes some lines
later. Furthermore we can now see that many samples contain near the
beginning a comment line with a reference to the file man page. These
lines look like:
# map:  file(1) magic for Map data
# music:  file (1) magic for music formats
But caution, because sometimes there is a space character before the
opening parenthesis. So now the characteristic patterns that separate
fragments from misidentified scripts are known. Only additional test
lines must be inserted before calling the subroutine.
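
A minimal Python sketch of the man page reference check, tolerating the
optional space before the parenthesis. The names MAN_PAGE_REFERENCE and
mentions_file_man_page are made up for this illustration and do not
appear in the patch:

```python
import re

# Matches "file(1)" and the variant with one space, "file (1)".
MAN_PAGE_REFERENCE = re.compile(rb"file ?\(1\)")


def mentions_file_man_page(head: bytes) -> bool:
    """Return True if the given bytes contain a file(1) man page reference."""
    return MAN_PAGE_REFERENCE.search(head) is not None
```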

Some samples (28/339 archive arm assembler beetle c-lang clojure
compress der filesystems firmware gentoo lammps m4 mail.news make marc21
music parrot pascal pc88 pc98 perl ringdove tcl varied.script
webassembly x68000 zfs) start with a comment line followed by separator
line. These are handled by the current branch, which looks like:
0	ubyte		=0x23
  >4	string		--------
  >>0	use		magic-fragment
In order to skip some scripts (like fse.sed stage1.sed constants.pxi
gotmail.awk from the fetchmail package) I look near the beginning for a
reference to the man page, written as file(1) {often, like in lammps
v 1.1} or as file (1) {in few samples, like music v 1.1}. Searching only
for "(1)" covers both spellings. This branch now becomes:
0	ubyte		=0x23
  >4	string		--------
  >>12	search/180	(1)
  >>>0	use		magic-fragment
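
The logic of this branch can be sketched in Python as follows. This is
only an approximation of how I understand magic(5) search/N (the match is
attempted at N start positions beginning at the given offset); the helper
names magic_search and hash_branch_matches are hypothetical:

```python
def magic_search(data: bytes, offset: int, span: int, pattern: bytes) -> bool:
    """Approximate search/span: try the match at `span` start positions."""
    # The last allowed start position is offset + span - 1.
    end = offset + span + len(pattern) - 1
    return data.find(pattern, offset, end) != -1


def hash_branch_matches(data: bytes) -> bool:
    """Emulate: first byte '#', minus characters at offset 4, then '(1)'."""
    return (
        data[:1] == b"#"
        and data[4:12] == b"-" * 8
        and magic_search(data, 12, 180, b"(1)")
    )
```

A script like constants.pxi passes the first two tests but has no "(1)"
near the beginning, so the subroutine is no longer called for it.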

Many samples start with an empty first line followed by separator line.
Such samples are handled by a branch that looks like:
0	ubyte		=0x0A
  >4	string		--------
  >>0	use		magic-fragment

In order to skip some MS-DOS C source text {EMSINIT.INC MEM.C RESTPARS.C
RTDO.C RTDO1.C RTFILE.C RTFILE1.C RTNEW.C RTNEW1.C RTOLD.C RTOLD1.C
RTT1.C RTT3.C} I now also look for the Revision Control System keyword
near the beginning. That matches many fragments. If no RCS keyword is
found, then I look for the reference to the man page. This applies to few
samples {like ctf (2022-03-26) msx (2021-06-30) nasa (2021-02-23)
symbos (2021-02-23) weak (2021-02-23)}. If this is not matched either, I
look for the magic mime keyword. This applies to one sample, aria
(2021-12-24). But that characteristic is not found near the beginning,
so a larger search range is used. So this branch now becomes:
0	ubyte		=0x0A
  >4	string		--------
  >>1	search/128	$File
  >>>0	use		magic-fragment
  >>1	default		x
  >>>1	search/180	file(1)
  >>>>0	use		magic-fragment
  >>>1	default		x
  >>>>1	search/1024	\041:mime
  >>>>>0	use		magic-fragment
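
The nested "default" lines act as a chain of fallbacks. A Python sketch
of that chain, under the same assumptions about search/N as before and
with hypothetical helper names (magic_search, newline_branch_matches):

```python
def magic_search(data: bytes, offset: int, span: int, pattern: bytes) -> bool:
    """Approximate search/span: try the match at `span` start positions."""
    end = offset + span + len(pattern) - 1
    return data.find(pattern, offset, end) != -1


def newline_branch_matches(data: bytes) -> bool:
    """Emulate the branch for samples starting with an empty first line."""
    if data[:1] != b"\n" or data[4:12] != b"-" * 8:
        return False
    return (
        # 1st try: RCS keyword near the beginning
        magic_search(data, 1, 128, b"$File")
        # 2nd try: reference to the man page
        or magic_search(data, 1, 180, b"file(1)")
        # 3rd try: magic mime keyword (\041 is the octal code of '!')
        or magic_search(data, 1, 1024, b"!:mime")
    )
```

The MS-DOS C sources pass the separator test at offset 4 but fail all
three searches, so they fall through to other magic entries.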

After applying the above mentioned modifications with the patch
file-5.45-magic.diff I get output like:
EMSINIT.INC:   ISO-8859 text
RESTPARS.C:    data
RTDO1.C:       data
aria:          magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       ------------------------------------", 3rd line
	       "# URL: \011\011https://de.wikipedia.org/wiki/
	       Aria_(Software)", 4th line
	       "# Reference:\011https://github.com/aria2/aria2/blob/
	       master/doc/manual-src/en/technical-notes.rst"
	       , 5th line "# From:\011\011Joerg Jenderek"
biosig:        magic text fragment for file(1) cmd, 2nd line
	       "###########################################
	       ###################################", 3rd line
	       "#"
	       , 4th line "#    Magic ids for biomedical signal
	       file formats ", 5th line
	       "#    Copyright (C) 2018 Alois Schloegl
	       <alois.schloegl at gmail.com>"
constants.pxi: ASCII text
ctf:           magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       --------------------", 3rd line
	       "# ctf:  file(1) magic for CTF (Common Trace Format)
	       trace files", 4th line
	       "#", 5th line
	       "# Specs. available here: <https://www.efficios.com/ctf>"
fse.sed:       ASCII text
gotmail.awk:   ASCII text
map:           magic text fragment for file(1) cmd, 2nd line
	       "", 3rd line
	       "#------------------------------------------
	       ------------------------------------", 4th line
	       "# $File: map,v 1.10 2023/02/03 20:41:57 christos Exp $"
	       , 5th line "# map:  file(1) magic for Map data"
msx:           magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       ------------------------------------", 3rd line
	       "# msx:  file(1) magic for the MSX Home Computer"
	       , 4th line "# v1.3", 5th line
	       "# Fabio R. Schmidlin
	       <sd-snatcher at users.sourceforge.net>"
music:         magic text fragment for file(1) cmd, 1st line
	       "#------------------------------------------
	       ------------------------------------", 2nd line
	       "# $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $"
	       , 3rd line
	       "# music:  file (1) magic for music formats", 4th line
	       "", 5th line
	       "# BWW format used by Bagpipe Music Writer Gold by
	       Robert MacNeil Musicworks"
nasa:          magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       ------------------------------------", 3rd line
	       "# nasa:\011file(1) magic", 4th line
	       "", 5th line "# From: Barry Carter
	       <carter.barry at gmail.com>"
symbos:        magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       ------------------------------------", 3rd line
	       "# msx:  file(1) magic for the SymbOS operating system"
	       , 4th line "# http://www.symbos.de", 5th line
	       "# Fabio R. Schmidlin <frs at pop.com.br>"
weak:          magic text fragment for file(1) cmd, 2nd line
	       "#------------------------------------------
	       ------------------------------------", 3rd line
	       "# weak:  file(1) magic for very weak magic entries,
	       disabled by default", 4th line "#"
	       , 5th line "# These entries are so weak that they
	       might interfere identification of"

When running the file command on magic fragments in a known directory
like Magdir, it is not disturbing when a few samples look different. But
when such samples are found in another directory for backup reasons, or
in the /lost+found directory after a system crash, it is irritating when
a few samples look different. So I started to unify the fragments to
match the most common appearance. That look is: first line empty, second
line a separator line (with 78 minus characters), third line with the
RCS keyword and fourth line with the man page reference.
Another advantage is that in the end only one branch is executed.
If this is cached by the operating system this can speed up program
execution. This procedure is done by
file-5.45-aria-unified.diff
file-5.45-biosig-unified.diff
file-5.45-ctf-unified.diff
file-5.45-map-unified.diff
file-5.45-msx-unified.diff
file-5.45-music-unified.diff
file-5.45-nasa-unified.diff
file-5.45-symbos-unified.diff
file-5.45-weak-unified.diff
file-5.45-lammps-unified.diff
file-5.45-espressif-unified.diff
There are still dozens of fragments which do not have the unified look.
But at the moment I do not want to apply these steps to them, because
then I would get so many patches that I might lose the overview. So I
will handle the remaining fragments in the future.
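
To find the remaining candidates, a small Python check for the unified
look could be used. This is only a sketch: the function name
has_unified_look is hypothetical, and it does not insist on exactly 78
minus characters because some separators (like in ctf) are shorter:

```python
def has_unified_look(data: bytes) -> bool:
    """Check: empty line, separator, RCS keyword, man page reference."""
    lines = data.decode("latin-1").split("\n")
    if len(lines) < 4:
        return False
    separator = lines[1]
    return (
        lines[0] == ""
        # separator: '#' followed only by minus characters
        and separator.startswith("#-")
        and set(separator[1:]) == {"-"}
        and lines[2].startswith("# $File")
        and "file(1)" in lines[3]
    )
```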

I hope my diff files can be applied in a future version of the
file utility.

With best wishes
Jörg Jenderek
--
Jörg Jenderek
-------------- next part --------------
--- file-5.45/magic/Magdir/magic.old	2023-07-02 13:52:38.000000000 +0200
+++ file-5.45/magic/Magdir/magic	2024-06-08 21:45:38.492691100 +0200
@@ -12,9 +12,15 @@
 !:ext	/
-#
-# some samples start with a comment line
+# 
+# some (34/339) samples start with a comment line
 0	ubyte		=0x23
-# many samples start with separator line
+# some (28/339) samples start with separator line (about 78 minus characters) like:
+# archive arm assembler beetle c-lang clojure compress der filesystems firmware gentoo lammps
+# m4 mail.news make marc21 music parrot pascal pc88 pc98 perl ringdove tcl varied.script webassembly x68000 zfs
 >4	string		--------
->>0	use		magic-fragment
-# few samples with 1st comment line and without seperator comment line
+# skip scripts fse.sed stage1.sed constants.pxi gotmail.awk from fetchmail package by
+# searching for reference to man page file(1) {lammps v 1.1} or file (1) {music v 1.1}
+>>12	search/180	(1)
+>>>0	use		magic-fragment
+# few (6/339) samples with 1st comment line and without separator comment line
+# like: blcr bsi selinux sisu ssh svf
 >4	default		x
@@ -35,6 +41,18 @@
 0	ubyte		=0x0A
-# many samples sttart with separator comment line
+# many samples start with separator comment line
 >4	string		--------
->>0	use		magic-fragment
-# few samples with 1st empty line and without seperator comment line like: biosig espressif
+# skip some MS-DOS C source text {EMSINIT.INC MEM.C RESTPARS.C RTDO.C RTDO1.C RTFILE.C RTFILE1.C RTNEW.C RTNEW1.C RTOLD.C RTOLD1.C RTT1.C RTT3.C}
+# and match many fragments by looking for Revision Control System keyword near the beginning
+>>1	search/128	$File
+>>>0	use		magic-fragment
+# few samples {ctf (2022-03-26) msx (2021-06-30) nasa (2021-02-23) symbos (2021-02-23) weak (2021-02-23)}
+# with 1st empty line, separator comment line and without Revision Control System keyword but with reference to man page file(1)
+>>1	default		x
+>>>1	search/180	file(1)
+>>>>0	use		magic-fragment
+>>>1	default		x
+# sample aria (2021-12-24) with 1st empty line, separator comment line and without Revision Control System keyword and without reference to man page file(1) 
+>>>>1	search/1024	\041:mime
+>>>>>0	use		magic-fragment
+# few samples with 1st empty line and without separator comment line like: biosig (2021-02-23) espressif (v 1.3)
 >4	default		x
@@ -49,3 +67,3 @@
 # next lines are mainly for control reasons
-# some (34/339) samples start comment line
+# some (34/339) samples start with comment line
 >0	ubyte		!0x0A
@@ -53,2 +71,6 @@
 >>>&1	string		x		\b, 2nd line "%s"
+# show more information to see difference between fragments and misidentified scripts
+>>>>&1	string		x		\b, 3rd line "%s"
+>>>>>&1	string		x		\b, 4th line "%s"
+>>>>>>&1 string		x		\b, 5th line "%s"
 # but most (305/339) samples start with an empty first line
@@ -57,2 +79,5 @@
 >>>&1	string		x		\b, 3rd line "%s"
+# show more information to see difference between fragments and misidentified scripts
+>>>>&1	string		x		\b, 4th line "%s"
+>>>>>&1	string		x		\b, 5th line "%s"
 #
@@ -66,2 +91,4 @@
 >4	lelong		x		(version %d) (little endian)
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/m/mgc-be.trid.xml
+# Note:		called "magic compiled data (BE)" by TrID
 0	belong		0xF11E041C	magic binary file for file(1) cmd
-------------- next part --------------
--- file-5.45/magic/Magdir/aria.old	2021-12-24 19:08:32.000000000 +0100
+++ file-5.45/magic/Magdir/aria	2024-06-08 21:11:08.583674800 +0200
@@ -1,5 +1,7 @@
 
 #------------------------------------------------------------------------------
+# $File$
+# aria:		file(1) magic for download manager aria
 # URL: 		https://de.wikipedia.org/wiki/Aria_(Software)
 # Reference:	https://github.com/aria2/aria2/blob/master/doc/manual-src/en/technical-notes.rst
 # From:		Joerg Jenderek
-------------- next part --------------
--- file-5.45/magic/Magdir/biosig.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/biosig	2024-06-08 21:20:01.103661300 +0200
@@ -1,7 +1,7 @@
 
-##############################################################################
-#
-#    Magic ids for biomedical signal file formats 
+#------------------------------------------------------------------------------
+# $File$
+#    file(1) magic for biomedical signal file formats 
 #    Copyright (C) 2018 Alois Schloegl <alois.schloegl at gmail.com>
 #
 #    The list has been derived from biosig projects
-------------- next part --------------
--- file-5.45/magic/Magdir/ctf.old	2022-03-26 19:58:39.000000000 +0100
+++ file-5.45/magic/Magdir/ctf	2024-06-08 21:24:06.302667000 +0200
@@ -1,5 +1,6 @@
 
 #--------------------------------------------------------------
+# $File$
 # ctf:  file(1) magic for CTF (Common Trace Format) trace files
 #
 # Specs. available here: <https://www.efficios.com/ctf>
-------------- next part --------------
--- file-5.45/magic/Magdir/map.old	2023-02-09 18:43:53.000000000 +0100
+++ file-5.45/magic/Magdir/map	2024-06-08 21:31:01.107367500 +0200
@@ -1,5 +1,4 @@
 
-
 #------------------------------------------------------------------------------
 # $File: map,v 1.10 2023/02/03 20:41:57 christos Exp $
 # map:  file(1) magic for Map data
-------------- next part --------------
--- file-5.45/magic/Magdir/msx.old	2021-06-30 11:43:35.000000000 +0200
+++ file-5.45/magic/Magdir/msx	2024-06-08 21:36:26.183679300 +0200
@@ -1,5 +1,6 @@
 
 #------------------------------------------------------------------------------
+# $File$
 # msx:  file(1) magic for the MSX Home Computer
 # v1.3
 # Fabio R. Schmidlin <sd-snatcher at users.sourceforge.net>
-------------- next part --------------
--- file-5.45/magic/Magdir/music.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/music	2024-06-08 21:40:23.598899700 +0200
@@ -1,6 +1,6 @@
 #------------------------------------------------------------------------------
 # $File: music,v 1.1 2011/11/25 03:28:17 christos Exp $
-# music:  file (1) magic for music formats
+# music:  file(1) magic for music formats
 
 # BWW format used by Bagpipe Music Writer Gold by Robert MacNeil Musicworks
 # and Bagpipe Writer by Doug Wickstrom
-------------- next part --------------
--- file-5.45/magic/Magdir/nasa.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/nasa	2024-06-08 21:49:48.497081700 +0200
@@ -1,6 +1,7 @@
 
 #------------------------------------------------------------------------------
-# nasa:	file(1) magic
+# $File$
+# nasa:	file(1) magic for NASA SPICE file
 
 # From: Barry Carter <carter.barry at gmail.com>
 0	string	DAF/SPK				NASA SPICE file (binary format)
-------------- next part --------------
--- file-5.45/magic/Magdir/symbos.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/symbos	2024-06-08 21:54:50.420628800 +0200
@@ -1,6 +1,7 @@
 
 #------------------------------------------------------------------------------
-# msx:  file(1) magic for the SymbOS operating system
+# $File$
+# symbos:  file(1) magic for the SymbOS operating system
 # http://www.symbos.de
 # Fabio R. Schmidlin <frs at pop.com.br>
 
-------------- next part --------------
--- file-5.45/magic/Magdir/weak.old	2021-02-23 01:49:24.000000000 +0100
+++ file-5.45/magic/Magdir/weak	2024-06-08 22:01:13.036233400 +0200
@@ -1,5 +1,6 @@
 
 #------------------------------------------------------------------------------
+# $File$
 # weak:  file(1) magic for very weak magic entries, disabled by default
 #
 # These entries are so weak that they might interfere identification of
-------------- next part --------------
--- file-5.45/magic/Magdir/lammps.old	2021-03-14 17:24:18.000000000 +0100
+++ file-5.45/magic/Magdir/lammps	2024-06-09 14:19:54.178587100 +0200
@@ -1,7 +1,6 @@
+
 #------------------------------------------------------------------------------
 # $File: lammps,v 1.1 2021/03/14 16:24:18 christos Exp $
-#
-
 # Magic file patterns for use with file(1) for the
 # LAMMPS molecular dynamics simulation software.
 # https://lammps.sandia.gov
-------------- next part --------------
--- file-5.45/magic/Magdir/espressif.old	2021-06-30 11:43:34.000000000 +0200
+++ file-5.45/magic/Magdir/espressif	2024-06-09 14:37:16.049396900 +0200
@@ -1,5 +1,7 @@
 
+#------------------------------------------------------------------------------
 # $File: espressif,v 1.3 2021/04/26 15:56:00 christos Exp $
+# espressif:  file(1) magic for ESP8266 based devices
 # configuration dump of Tasmota firmware for ESP8266 based devices by Espressif
 # URL: https://github.com/arendst/Sonoff-Tasmota/
 # Reference: https://codeload.github.com/arendst/Sonoff-Tasmota/zip/release-6.2/