[File] [PATCH] of Magdir/mathematica "Matlab v4 mat-file" misidentified of Netwfw00.dat tokens.dat TileCacheLogo-*.dat
Christos Zoulas
christos at zoulas.com
Sun Nov 7 16:27:57 UTC 2021
Committed, thanks!
christos
> On Nov 6, 2021, at 4:13 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> In July 2021 I sent a patch for Magdir/mathematica to recognise Matlab
> v4 mat-files. A few days ago I inspected samples with the DAT file
> name extension.
> When running file command version 5.41 on such examples and
> related MATrix files I get output like:
>
> $I3KREPH.dat: Matlab v4 mat-file (little endian)
> \2014\326\001N,
> text, rows 0, columns 1200
> Netwfw00.dat: Matlab v4 mat-file (little endian)
> numeric, rows 0, columns 0,
> imaginary
> Netwfw02.dat: Matlab v4 mat-file (little endian)
> !\001\025 \216\372,
> rows 161, columns 65536,
> imaginary
> PreviousEntries.dat: Matlab v4 mat-file (little endian)
> \304P\344@\001,
> text, rows 23, columns 259
> TileCacheLogo-1050505875_100.dat: Matlab v4 mat-file (little endian)
> _\001,
> numeric, rows 351, columns 100,
> imaginary
> TileCacheLogo-133343421_100.dat: Matlab v4 mat-file (little endian)
> ]\001,
> numeric, rows 385, columns 100,
> imaginary
> TileCacheLogo-738609890_100.dat: Matlab v4 mat-file (little endian)
> \376\001,
> numeric, rows 551, columns 100,
> imaginary
> TileCacheLogo-947200500_100.dat: Matlab v4 mat-file (little endian)
> j\001,
> numeric, rows 362, columns 100,
> imaginary
> test_mat4_le_floats.mat: Matlab v4 mat-file (little endian)
> a,
> numeric, rows 1, columns 2
> testcomplex_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
> testcomplex,
> numeric, rows 1, columns 9,
> imaginary
> teststringarray_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
> teststringarray,
> text, rows 3, columns 5
> testvec_4_GLNX86.mat: Matlab v4 mat-file (little endian)
> fit_params,
> numeric, rows 2, columns 1
> tokens.dat: Matlab v4 mat-file (little endian)
> \237\360\006,
> text, rows 4, columns 455157,
> imaginary
>
> Unfortunately level 4 MAT files have no significant magic pattern. So
> I put the displaying part inside a subroutine named matlab4 and then
> added test lines to identify such matrices in a unique manner. The
> subroutine starts inside Magdir/mathematica with lines like:
> 0 name matlab4 Matlab v4 mat-file
> !:mime application/x-matlab-data
> !:ext mat
>
> So in principle only the test lines must be changed or added.
> The matrix name at offset 20 for real MAT samples looks like
> fit_params, a, testcomplex, whereas for misidentified DAT examples I
> get 2-byte names like j\001 and _\001, a 3-byte sequence like
> \237\360\006, or the 5-byte sequence \304P\344@\001 in
> PreviousEntries.dat.
>
> There was just one line that checks for a "valid ASCII" matrix name:
>> 20 ubyte >0x1F
> This line only checks that the first character of the matrix name
> is not a control character. The documentation gives no explicit
> specification for the matrix name.
> Furthermore it is not clear whether a matrix name is required or
> whether it can be empty, as in the misidentified examples
> Netwfw00.dat and Netwfw01.dat.
> So it is quite difficult to restrict the check of the matrix name.
>
> Finally I also had to add a test that the first character of the
> name is not "too high". With this additional line the bad example
> PreviousEntries.dat with invalid name \304P\344@\001 is skipped. So
> this line sequence now becomes:
>> 20 ubyte >0x1F
>>> 20 ubyte <0304
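The two tests on the byte at offset 20 can be restated outside of magic syntax. The following is only a minimal Python sketch of that range check, assuming the input holds at least the first 21 bytes of a candidate file; the function name first_name_byte_plausible is hypothetical and not part of the patch.

```python
def first_name_byte_plausible(header: bytes) -> bool:
    """Mirror the magic tests '20 ubyte >0x1F' and '20 ubyte <0304'."""
    if len(header) < 21:
        # Not enough data to hold a first name character at offset 20.
        return False
    first = header[20]
    # 0o304 is the octal constant 0304 used in the magic line (decimal 196).
    return 0x1F < first < 0o304
```

A name starting with \304, as in PreviousEntries.dat, fails the upper bound, while names starting with j or _ still pass and need the later tests.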
>
> The matrix name was shown by a line like:
>> 16 pstring/L x %s
> because the name length is stored as a 4 byte integer before it.
> Furthermore the name is still nul-terminated. That information can
> be shown by debugging lines like:
>> 16 ubelong x \b, name length %u
>> (16.L+19) ubyte x \b, TERMINATING NAME CHARACTER %#x
>> 21 ubyte x \b, MAYBE 2ND CHAR=%c
> Unfortunately this is also true for some Netwfw examples and all my
> TileCacheLogo examples. So this test is not well suited.
>
> In the end I inserted additional tests before calling the subroutine
> in the little endian branch. First I look at the matrix name length.
> Because I found no misidentified example with a "short" matrix name,
> I call the subroutine directly in that case. For examples with a
> "longer" matrix name I inspect the next character of the name. If it
> is ASCII-like I continue by calling the subroutine. By this step
> TileCacheLogo-*.dat with invalid 2nd character \001 of name and name
> length 96 are skipped. This is now done by additional lines like:
>>>>>> 16 ulelong <3
>>>>>>> 0 use \^matlab4
>>>>>> 16 ulelong >2
>>>>>>> 21 ubyte >0x1F
>>>>>>>> 0 use \^matlab4
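As an illustration of the branch logic in these lines, here is a small Python sketch, assuming the header is little endian and the name length sits at offset 16 as described in the mail. The helper name name_passes_le_checks is hypothetical.

```python
import struct


def name_passes_le_checks(header: bytes) -> bool:
    """Sketch of the little endian name tests before 'use matlab4'."""
    if len(header) < 20:
        return False
    # '16 ulelong': the 4 byte little endian name length.
    (namlen,) = struct.unpack_from("<I", header, 16)
    if namlen < 3:
        # "Short" names are accepted without a further look.
        return True
    if len(header) < 22:
        return False
    # '21 ubyte >0x1F': the second name character must not be a control byte.
    return header[21] > 0x1F
```

With a name length of 96 and a second byte of \001, as reported for the TileCacheLogo samples, the function returns False.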
>
> Many DAT examples are described as "imaginary".
> At offset 12 the imaginary flag is stored as a 4 byte integer. If this
> is 1, then the matrix has an imaginary part. If 0, there is only real
> data. So this information for a non-real (that means imaginary)
> matrix was shown by a line like:
>> 12 ubelong !0 \b, imaginary
> For control purposes I changed this to a line like:
>> 12 ubelong !0 \b, imaginary (%u)
>
> So it becomes visible that for many DAT examples I get an invalid
> imaginary flag value like 12 for the tokens example or 2147483648 for
> example $I3KREPH.dat.
>
> There was only one test of the imaginary flag. It only tests whether
> the 2 middle bytes are nil, by a line like:
> 13 ushort 0
>
> So after the check for a "valid low" little endian type flag the
> subroutine is called by lines like:
>>> 0 ulelong <53
>>>> 0 use \^matlab4
> With an additional check for an invalid imaginary flag value this now
> becomes:
>>> 0 ulelong <53
>>>> 12 ulelong <2
>>>>> 0 use \^matlab4
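The combined type flag and imaginary flag tests for the little endian branch could be sketched in Python as below. This is an illustration under the offsets stated in the mail (type flag at 0, imaginary flag at 12), and the function name type_and_imag_ok_le is hypothetical.

```python
import struct


def type_and_imag_ok_le(header: bytes) -> bool:
    """Sketch of '0 ulelong <53' followed by '12 ulelong <2'."""
    if len(header) < 16:
        return False
    (type_flag,) = struct.unpack_from("<I", header, 0)
    (imagf,) = struct.unpack_from("<I", header, 12)
    # Only 0 (real) and 1 (has imaginary part) are accepted for the flag.
    return type_flag < 53 and imagf < 2
```

The flag values 12 and 2147483648 quoted above for the bad DAT examples are both rejected by the second comparison.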
>
> At offset 4 the number of rows in the matrix is stored as a 4 byte
> integer (like: 1 3 8). At offset 8 the number of columns in the
> matrix is stored as a 4 byte integer (like 1 3 4 5 9 43). So the
> matrix dimensions are shown by lines like:
>> 4 ubelong x \b, rows %u
>> 8 ubelong x \b, columns %u
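To make the field layout described in this mail easier to follow (type flag at offset 0, rows at 4, columns at 8, imaginary flag at 12, name length at 16, name at 20), here is a minimal Python sketch that decodes those fields. The names Mat4Header and parse_mat4_header are hypothetical, and the byte order is passed in by the caller rather than detected.

```python
import struct
from typing import NamedTuple


class Mat4Header(NamedTuple):
    type_flag: int
    rows: int
    columns: int
    imaginary: int
    name: bytes


def parse_mat4_header(data: bytes, little_endian: bool) -> Mat4Header:
    """Decode the five 4 byte integers and the nul-terminated name."""
    order = "<" if little_endian else ">"
    # Raises struct.error if fewer than 20 bytes are available.
    type_flag, rows, columns, imaginary, namlen = struct.unpack_from(
        order + "5I", data, 0
    )
    # The stored length covers the terminating nul, so cut at the first nul.
    name = data[20:20 + namlen].split(b"\x00", 1)[0]
    return Mat4Header(type_flag, rows, columns, imaginary, name)
```

For a header like the one reported for testcomplex_4.2c_SOL2.mat this would yield rows 1, columns 9 and the name testcomplex.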
>
> It is not explicitly mentioned, but a matrix must have at least 1 row
> and 1 column. That means the value zero should not occur, as it does
> in bad example $I3KREPH.dat or some Netwfw examples. So skip
> such examples by an additional test for non-zero rows via a line like:
>> 4 ulong !0 ROW_OK
>
> So I inserted such a line after the check for a "valid ASCII" matrix
> name and before the check for a valid type flag. This now looks like:
>> 20 ubyte >0x1F
>>> 4 ulong !0
>>>> 0 ubelong&0xFFffFF00 0x00000300
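The row test can be written without knowing the byte order, because a 4 byte integer is zero in one byte order exactly when it is zero in the other. A tiny Python sketch of that idea follows; the function name rows_nonzero is hypothetical.

```python
def rows_nonzero(header: bytes) -> bool:
    """Sketch of '4 ulong !0': the row count at offset 4 must not be zero."""
    if len(header) < 8:
        return False
    # Comparing raw bytes avoids choosing big or little endian here.
    return header[4:8] != b"\x00\x00\x00\x00"
```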
>
> I hope that my additional lines are now unique enough to identify MAT
> level 4 files.
>
> After applying the above mentioned modifications by patch
> file-5.40-mathematica-v4.diff all my matrix examples are still
> described and the misidentification of DAT examples has vanished:
>
> $I3KREPH.dat: data
> Netwfw00.dat: data
> Netwfw02.dat: data
> PreviousEntries.dat: data
> TileCacheLogo-1050505875_100.dat: data
> TileCacheLogo-133343421_100.dat: data
> TileCacheLogo-738609890_100.dat: data
> TileCacheLogo-947200500_100.dat: data
> test_mat4_le_floats.mat: Matlab v4 mat-file (little endian)
> a,
> numeric, rows 1, columns 2
> testcomplex_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
> testcomplex,
> numeric, rows 1, columns 9,
> imaginary (1)
> teststringarray_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
> teststringarray,
> text, rows 3, columns 5
> testvec_4_GLNX86.mat: Matlab v4 mat-file (little endian)
> fit_params,
> numeric, rows 2, columns 1
> tokens.dat: data
>
> I hope my diff file can be applied in a future version of the file utility.
>
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
> <file-5_41-mathematica-v4_diff.DEFANGED-127><file-5_41-mathematica-v4_diff_sig.DEFANGED-128>--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 235 bytes
Desc: Message signed with OpenPGP
URL: <https://mailman.astron.com/pipermail/file/attachments/20211107/860ac5fe/attachment.asc>