[File] mixed and seemingly inconsistent results....

Sun Jul 26 01:03:55 UTC 2020

When I run the file command on my system over some RAID stripes, I'm seeing
a couple of inconsistencies and/or oddities.

The main oddity is in files that start with an interpreter line;
examples:

#!/usr/bin/perl -w = a /usr/bin/perl -w script executable (binary data)
#!/bin/sh          = POSIX shell script executable (binary data)
#!/bin/bash        = Bourne-Again shell script executable (binary data)

Note, in perl, we also see:
  Perl5 module source, ASCII text (starts with package NAME)
  Perl POD document, ASCII text   (starts with '=head1 NAME')
  Perl POD document, UTF-8 Unicode text
  Perl Script text executable

Part of this may be that I'm looking at disk stripes: while many start
with a file, one 64K stripe may hold several files, each followed by a
run of binary zeros that pads it out to a sector boundary (4K sector size).

When I started this post, I didn't understand the binary data annotation,
since the sources in them were not binary -- but that's likely explained by
file looking at a 64K disk stripe and seeing multiple files separated by
NUL's.
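
To illustrate that guess, here is a minimal Python sketch. It is not how
file works internally; the function names, the 4K constant, and the two
tiny scripts are made up for illustration. It pads each "file" with NULs
to a sector boundary, concatenates them as a stripe would, and then
splits on the NULs to recover the text pieces:

```python
SECTOR = 4096  # assumed 4K sector size, as described above


def pad_to_sector(content, sector=SECTOR):
    """Append NUL bytes so the length is a multiple of the sector size."""
    remainder = len(content) % sector
    if remainder == 0:
        return content
    return content + b"\0" * (sector - remainder)


def split_on_nul_runs(blob):
    """Return the non-empty pieces of blob that lie between runs of NULs."""
    return [piece for piece in blob.split(b"\0") if piece]


# Hypothetical stripe contents: two small scripts, each padded to a sector.
script_a = b"#!/bin/sh\necho hello\n"
script_b = b"#!/usr/bin/perl -w\nprint qq(hi\\n);\n"
stripe = pad_to_sector(script_a) + pad_to_sector(script_b)

pieces = split_on_nul_runs(stripe)
print(len(stripe), len(pieces))  # 8192 2
```

The stripe begins with a valid interpreter line but also contains
thousands of NUL bytes, which would fit a "script ... (binary data)"
description.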

The other oddity is the separate names for various Perl files.
What I mean by that is that I have a Perl module file that is a
module, has POD for the module in it, can be executed
like a program, and has UTF-8 characters in it.

It was ID'd as a Perl Script text executable, but it would also qualify as a:
  Perl5 module source, UTF-8 text
  Perl POD document, UTF-8 Unicode text
    (isn't "Unicode" after "UTF-8" redundant?)

Beginning of file looks like:
    #!/usr/bin/perl  -w
    # vim=:SetNumberAndWidth

    =encoding utf-8

    =head1 NAME
    P  -   Safer, friendlier printf/print/sprintf + say

    =head1 VERSION

    Version  "1.1.38"

    =cut

    { package P;
      use warnings; use strict;use mem;
      our $VERSION='1.1.38';

I feel 'file' made an acceptable choice in calling it a
Perl script text executable, though its primary purpose is being a
module: the executable part was to demo features of the
module.

Conversely, some C-source files that also had NUL's between
them were simply labeled:
    "fname:   data".

They were several C-source files separated by the NUL's, and
started out:

    // SPDX-License-Identifier: GPL-2.0-or-later
    /*
     * CRC32C
     *@Article{castagnoli-crc,
     * authors =      { Guy C. Stefan B. and Martin H.},
     * month =        {June},
     *}
     * Used by the iSCSI driver, possibly others, and derived from the
     * the iscsi-crc.c module of the linux-iscsi driver at
     * http://linux-iscsi.sourceforge.net.
     */
     #include <crypto/hash.h>
     #include <linux/err.h>

    static struct crypto_shash *tfm;

I.e., it's C source.  It turns out that the file "ldb" is several
C source files, each followed by zeroed space after its EOF.

For the ones starting with C-source files, I'm guessing the
NULs would have made file want to label them with (binary data),
but that would conflict with C source -- even though 'file' had
no problem displaying some script files with the (binary data) tag.
Not wrong, exactly, just inconsistent.

In these cases, it's almost like it needs to look at content
to know what type of source file it is (Perl/Bash/C), but rather
than label the file as (binary data), it would be more useful
(not sure of what's involved) to note that the binary was
'nul' data between C-source files, vs. labelling the whole thing
as just 'data' (as it did with C source files, but not script
files).
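
As a rough illustration of the kind of note I mean, here is a Python
sketch. It is purely hypothetical and says nothing about how file is
implemented; describe_stripe and its output strings are my own
invention, and "decodes as UTF-8" is a crude stand-in for "is text":

```python
def describe_stripe(blob):
    """Summarize a blob as text pieces separated by NUL padding, if it is one."""
    pieces = [p for p in blob.split(b"\0") if p]
    if not pieces:
        return "empty or all NULs"
    for piece in pieces:
        try:
            piece.decode("utf-8")  # crude text check, an assumption
        except UnicodeDecodeError:
            return "data"
    nul_count = blob.count(b"\0")
    if nul_count == 0:
        return "text"
    return f"{len(pieces)} text segment(s) with {nul_count} NUL padding bytes"


# Two made-up C fragments with NUL padding after each one.
c_a = b"/* a.c */\nint a;\n"
c_b = b"/* b.c */\nint b;\n"
blob = c_a + b"\0" * 10 + c_b + b"\0" * 5
print(describe_stripe(blob))  # 2 text segment(s) with 15 NUL padding bytes
```

Something along those lines would distinguish "text plus padding" from
truly opaque data, whatever the type of the first text segment.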

I have written too much (sorry), but I'm trying to be clear
with examples.  I use file a lot, so don't think I'm criticizing
or even "need a fix", but I wanted to point out that having many
disparate sources may be having the effect of creating
inconsistencies in the output (some of which I am contributing
to by running it over images-in-files that are stripes from
a RAID disk).

Thanks for the fish & tool!
Astara (aka L.A. Walsh)