[File] mixed and seemingly inconsistent results....

Mon Jul 27 22:35:00 UTC 2020

On Sun, Jul 26, 2020 at 10:06 AM Christos Zoulas <christos at zoulas.com> wrote:
>
> Hi,
>
>
> > On Jul 25, 2020, at 9:03 PM, Astara <file at tlinx.org> wrote:
> >
> > When I run the file command on my system over some RAID stripes, I'm seeing
> > a couple of inconsistencies and/or oddities.
> >
> > The main class that is odd, is in files that have an interpreter line,
> > examples:
> >
> > #!/usr/bin/perl -w = a /usr/bin/perl -w script executable (binary data)
> > #!/bin/sh     = POSIX shell script executable (binary data)
> > #!/bin/bash = Bourne-Again shell script executable (binary data)
> >
> > Note, in perl, we also see:
> >  Perl5 module source, ASCII text (starts with package NAME)
> >  Perl POD document, ASCII text   (starts with '=head1 NAME')
> >  Perl POD document, UTF-8 Unicode text
> >  Perl Script text executable
> >

> >  Perl5 module source, UTF-8 text
> >  Perl POD document, UTF-8 Unicode text
> >    (isn't "Unicode" after "UTF-8" redundant?)
>
> Well, it is... The encoding magic text was inconsistent anyway with
> the unicode magic file so I made it match. It will now print:
>
> Unicode text, UTF-8
---
Anything is fine...was just pointing out things that looked inconsistent...

>
> Still redundant but at least consistent everywhere within the program.
> The rationale is to print Unicode text (which it is) followed by the
> encoding, and optionally followed by endianness.
----
right -- with UTF-16LE being the native order on Windows (with no BOM)
and being the default in many windows-internal OS-files (some logs).
(of course BE was the default order for many network protocols as they were
developed when big-iron held sway) (unrelated endian thoughts)
>
> >
> > Beginning of file looks like:
> >    #!/usr/bin/perl  -w...
> >    =encoding utf-8...
> >    { package P;...

> > I feel 'file' made an acceptable choice in calling it a
> > perl script text executable, though its primary purpose is being a
> > module: the executable part was to demo features of the
> > module.
>
> Well, if it has #! , it was meant to be executed.
---
     Sorta -- that was added in as an afterthought, but I do it with many of
my perl modules, as it is easier to test the modules separately / standalone
than to use an external program that "use"s them.

> >
> > Conversely, some C-source files that also had NUL's between
> > them were simply labeled:
> >    "fname:   data".
> >
> > They were several C-source files separated by the NUL's, and
> > started out:
> >

> > I.e. it's C-source.  After nulls, turns out 'file' "ldb" is several
> > C source files with zero'd EOF space after each C file.
>
> This is an unusual setup that you have with all the NUL's in
> the data. Perhaps what's needed here is an option to ignore them.
----
re: NUL's...
(please understand these statements from above before you write
code for this type of case)...
  "When I run the file command on my system over some RAID stripes..."
> > Maybe part of this is I'm looking at disk stripes and while many start with
> > a file, it may be several files in one 64K stripe with a bunch of binary
> > 00000's after the file to line it up to a sector (4k sector size).

These aren't native files.  They are what I see when recovering/looking at
data on a disk.  The data is written in 64K "stripes", on parallel disks, which
I called logical volume A...
(lva lvb lvc...)

I'm looking at them to see how the data fits together (primarily to order the
parallel stripes).

So a 199K file would use 3 x 64K stripes and 7K (8K allocated) of a fourth.
So, say, initially lda+ldb+ldc = first 192K of the file, and ldd would contain
the last 7K of the file. Since the disks use a 4KB sector size, 8K would be
used out of a 4th 64K segment, with NUL's padding from 7K-8K.   Then at the
8K mark I might see the start of another file. In one 64K stripe, I might see
several small files, each padded with NUL's to line them up to a 4K boundary
within the 64K stripe.
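The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: the function name `stripe_layout` and its return keys are made up here, and it assumes the file starts exactly at a stripe boundary, with the 64K stripe and 4K sector sizes mentioned above.

```python
STRIPE = 64 * 1024   # stripe size chosen when the RAID was created
SECTOR = 4 * 1024    # disk sector size

def stripe_layout(size, stripe=STRIPE, sector=SECTOR):
    """Describe how a file of `size` bytes lands on stripes.

    Assumes (for illustration only) that the file begins at the start
    of a stripe.
    """
    full, tail = divmod(size, stripe)          # whole stripes, leftover bytes
    allocated = -(-tail // sector) * sector    # leftover rounded up to a sector
    return {
        "full_stripes": full,
        "tail_bytes": tail,
        "tail_allocated": allocated,
        "nul_padding": allocated - tail,
    }

# The example from the text: 192K in three stripes plus a 7K tail.
layout = stripe_layout(192 * 1024 + 7 * 1024)
print(layout)
# {'full_stripes': 3, 'tail_bytes': 7168, 'tail_allocated': 8192, 'nul_padding': 1024}
```

So the fourth stripe holds 7K of data, 1K of NUL fill, and the next file could begin at the 8K mark.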

So the files represent stripes from a RAID, with 64K being the unit in which
the RAID controller manages its space (64K was chosen when I created the RAID).

I'm not sure it is worth the bother to write special code for this case, as
one hopes that one won't ever be trying to recover data from a failed RAID10.
I'd thought about trying to separate areas that looked like multiple files
packed into one 64K space, but unless you ALSO want to add functionality for
data recovery, I wouldn't bother with special switches.

A 64K stripe that contains multiple short files might have:
file1...nul | file2...nul | file3...nul (etc) up to an even 64K.
Ideally, if I'm trying to identify data, I might look for non-NUL data
followed by one or more NULs. If the NULs run up to a 4K (in my case) boundary,
that's like an end-of-file for file1, and a new file would start after the
NULs.  The NULs represent unused space allocated at the end of a file just to
make the next file line up to a 4K boundary.
 >>>They wouldn't likely be part of the data<<<, but do serve the purpose of
marking an end of file and the start of a new one.

I can't imagine how rare the use of such a feature might be, so unless I can
specify the feature in terms of a more generally useful piece, I wouldn't
waste my time. OTOH, if file were operating in a mode where it skips NUL's,
at least in this case, it would need to recognize that a new file (and new
file type) likely comes after such an area. With each non-NUL area after an
aligned NUL fill, file would have an opportunity to identify such small files
within some larger block of data (64K in my case).   I'd probably first want
to prototype it by using some program to look for such transitions between NUL
and non-NUL data, split them into separate files, and call "file" on those
small files. I.e. I'd script such a behavior first to see how useful it might be.
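A rough prototype of such a script might look like the following. It is a sketch under stated assumptions, not a tested recovery tool: the names `split_stripe` and `identify` are invented here, files are assumed to start on 4K boundaries, and a file is assumed to end in the first sector whose last byte is NUL. A file that ends exactly on a sector boundary, or one whose data has a NUL in the last byte of a sector, would be mis-split. `identify` assumes a `file` binary on the PATH and feeds it the chunk on stdin (`file -b -`).

```python
import subprocess

def split_stripe(data, sector=4096):
    """Split a stripe into (offset, chunk) pairs at NUL-padded sector ends.

    Heuristic only: a chunk starts at the first sector that is not all
    NULs and ends with the first sector whose last byte is NUL.
    Trailing NUL padding is stripped from each chunk.
    """
    chunks = []
    start = None
    for off in range(0, len(data), sector):
        block = data[off:off + sector]
        if start is None:
            if not block.strip(b"\0"):   # all-NUL sector: unused space
                continue
            start = off
        if block.endswith(b"\0"):        # padding reached: treat as end of file
            chunks.append((start, data[start:off + sector].rstrip(b"\0")))
            start = None
    if start is not None:                # data ran to the end of the stripe
        chunks.append((start, data[start:].rstrip(b"\0")))
    return chunks

def identify(chunk):
    """Ask file(1) what a chunk is; assumes `file` is installed."""
    result = subprocess.run(["file", "-b", "-"], input=chunk,
                            capture_output=True, check=True)
    return result.stdout.decode(errors="replace").strip()
```

Usage would be something like reading one 64K stripe with `open(path, "rb").read()` and printing `offset, identify(chunk)` for each pair that `split_stripe` returns.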

I was wondering why some files were tagged as some type of file, whereas
others were tagged as just "data".  And that, I think, is when I see multiple
files in one 64K area separated by NUL's.

>
> >
> >
> > On the ones starting with C-source files, I'm guessing the
> > NUL's would have had file wanting to label it with (binary data),
> > but that would conflict with C-source -- even though, 'file' had
> > no problem displaying some script files w/tag of (binary data).
> > Not wrong, exactly, but just inconsistent.
>
> There are two types of magic, ascii and binary. If the file has
> NUL's only binary magic is consulted.
----
     In this case multiple files may reside in the same 64K block separated
by NUL's, but that's something to script first if needed, and not change
'file' for... :-)

(how often do users say "don't implement a feature or case for this, because
it might not be generally useful enough"? :-) )

Thanks again!