[File] mixed and seemingly inconsistent results....

Sun Jul 26 17:06:19 UTC 2020

Hi,

> On Jul 25, 2020, at 9:03 PM, Astara <file at tlinx.org> wrote:
> 
> When I run the file command on my system over some RAID stripes, I'm seeing
> a couple of inconsistencies and/or oddities.
> 
> The main class that is odd, is in files that have an interpreter line,
> examples:
> 
> #!/usr/bin/perl -w = a /usr/bin/perl -w script executable (binary data)
> #!/bin/sh     = POSIX shell script executable (binary data)
> #!/bin/bash = Bourne-Again shell script executable (binary data)
> 
> Note, in perl, we also see:
>  Perl5 module source, ASCII text (starts with package NAME)
>  Perl POD document, ASCII text   (starts with '=head1 NAME')
>  Perl POD document, UTF-8 Unicode text
>  Perl Script text executable
> 
> Maybe part of this is I'm looking at disk stripes and while many start with
> a file, it may be several files in one 64K stripe with a bunch of binary
> 00000's after the file to line it up to a sector (4k sector size).
> 
> When I started this post, I didn't understand the binary data annotation,
> since the sources in them were not binary -- but that's likely explained by
> file looking at a 64K disk stripe and seeing multiple files separated by
> NUL's.
> 
> The other oddity is the separate names for various Perl files.
> What I mean by that is that I have a Perl file that is a
> module file, has POD code for the module in it, can be executed
> like a program, and has UTF-8 characters in it.
> 
> It ID'd as a Perl Script text executable, but would also be a:
>  Perl5 module source, UTF-8 text
>  Perl POD document, UTF-8 Unicode text
>    (isn't "Unicode" after "UTF-8" redundant?)

Well, it is... The encoding magic text was inconsistent anyway with
the unicode magic file so I made it match. It will now print:

Unicode text, UTF-8

Still redundant but at least consistent everywhere within the program.
The rationale is to print Unicode text (which it is) followed by the
encoding, and optionally followed by endianness.
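
As a rough illustration of that ordering (a sketch only, not the code in
file itself; the function name describe_unicode and its parameters are
made up for this example):

```python
def describe_unicode(encoding, endianness=None):
    """Build a description: 'Unicode text', then the encoding,
    then the endianness if one is given (hypothetical helper)."""
    parts = ["Unicode text", encoding]
    if endianness is not None:
        parts.append(endianness)
    return ", ".join(parts)


print(describe_unicode("UTF-8"))
print(describe_unicode("UTF-16", "little-endian"))
```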

> 
> Beginning of file looks like:
>    #!/usr/bin/perl  -w
>    # vim=:SetNumberAndWidth
> 
>    =encoding utf-8
> 
>    =head1 NAME
>    P  -   Safer, friendlier printf/print/sprintf + say
> 
>    =head1 VERSION
> 
>    Version  "1.1.38"
> 
>    =cut
> 
>    { package P;
>      use warnings; use strict;use mem;
>      our $VERSION='1.1.38';
> 
> 
> I feel 'file' made an acceptable choice in calling it a
> perl script text executable, though its primary purpose is being a
> module: the executable part was to demo features of the
> module.

Well, if it has #!, it was meant to be executed.

> 
> 
> 
> Conversely, some C-source files that also had NUL's between
> them were simply labeled:
>    "fname:   data".
> 
> They were several C-source files separated by the NUL's, and
> started out:
> 
>    // SPDX-License-Identifier: GPL-2.0-or-later
>    /*
>     * CRC32C
>     *@Article{castagnoli-crc,
>     * authors =      { Guy C. Stefan B. and Martin H.},
>     * month =        {June},
>     *}
>     * Used by the iSCSI driver, possibly others, and derived from the
>     * the iscsi-crc.c module of the linux-iscsi driver at
>     * http://linux-iscsi.sourceforge.net.
>     */
>     #include <crypto/hash.h>
>     #include <linux/err.h>
> 
>    static struct crypto_shash *tfm;
> 
> I.e. it's C source.  Past the NUL's, it turns out the file "ldb" is several
> C source files with zeroed EOF space after each C file.

This is an unusual setup that you have with all the NUL's in
the data. Perhaps what's needed here is an option to ignore them.
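
Until something like that exists, one workaround is to split the stripe
yourself before looking at it. A minimal sketch in Python, assuming the
padding is plain runs of NUL bytes and the stripe fits in memory;
split_on_nul_padding is a hypothetical helper, not part of file:

```python
import re


def split_on_nul_padding(stripe):
    """Return the non-empty pieces found between runs of NUL bytes.

    Hypothetical helper: assumes NUL bytes only ever appear as padding.
    """
    return [piece for piece in re.split(rb"\x00+", stripe) if piece]


# Example stripe: a shell script and a C fragment, each NUL-padded.
stripe = b"#!/bin/sh\necho hi\n" + b"\x00" * 8 + b"/* C source */\n" + b"\x00" * 4
for piece in split_on_nul_padding(stripe):
    print(piece[:20])
```

Each piece could then be written out and examined on its own.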

> 
> 
> On the ones starting with C-source files, I'm guessing the
> NUL's would have had file wanting to label it with (binary data),
> but that would conflict with C-source -- even though, 'file' had
> no problem displaying some script files w/tag of (binary data).
> Not wrong, exactly, but just inconsistent.

There are two types of magic, ASCII and binary. If the file has
NUL's, only binary magic is consulted.
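
In very rough terms, and only as an illustration (the real checks in
file are more involved than a single NUL test; choose_magic_class is a
made-up name for this sketch):

```python
def choose_magic_class(buf):
    """Return 'binary' if the buffer contains a NUL byte, else 'ascii'.

    Simplified stand-in for the decision described above, not file's
    actual implementation.
    """
    return "binary" if b"\x00" in buf else "ascii"


print(choose_magic_class(b"#include <linux/err.h>\n"))
print(choose_magic_class(b"#include <linux/err.h>\n\x00\x00\x00"))
```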

> In these cases, it's almost like it needs to look at content
> to know what type of source file it is (Perl/Bash/C), but rather
> than label the file as (binary data), it would be more useful
> (not sure of what's involved) to note that the binary was
> 'nul' data between C-source files, vs. labelling the whole thing
> as just 'data' (as it did with C source files, but not script
> files).
> 
> I have written too much (sorry), but I was trying to be clear
> w/examples.  I use file a lot, so don't think I'm criticizing
> or even "need a fix", but I wanted to point out that having many
> disparate sources may be having the effect of creating
> inconsistencies in the output (some of which I am contributing
> to by running it over images-in-files that are stripes from
> a RAID disk).
> 
> Thanks for the fish & tool!
> Astara (aka L.A. Walsh)

You are welcome.

christos
