[File] [PATCH] of Magdir/mathematica "Matlab v4 mat-file" misidentified for Netwfw00.dat tokens.dat TileCacheLogo-*.dat

Christos Zoulas christos at zoulas.com
Sun Nov 7 16:27:57 UTC 2021


Committed, thanks!

christos

> On Nov 6, 2021, at 4:13 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> In July 2021 I sent a patch for Magdir/mathematica to recognise Matlab
> v4 mat-files. A few days ago I inspected samples with the DAT file name
> extension.
> When running file command version 5.41 on such examples and on
> related MATrix files, I get output like:
> 
> $I3KREPH.dat:                     Matlab v4 mat-file (little endian)
> 				  \2014\326\001N,
> 				  text, rows 0, columns 1200
> Netwfw00.dat:                     Matlab v4 mat-file (little endian)
> 				  numeric, rows 0, columns 0,
> 				  imaginary
> Netwfw02.dat:                     Matlab v4 mat-file (little endian)
> 				  !\001\025 \216\372,
> 				  rows 161, columns 65536,
> 				  imaginary
> PreviousEntries.dat:              Matlab v4 mat-file (little endian)
> 				  \304P\344@\001,
> 				  text, rows 23, columns 259
> TileCacheLogo-1050505875_100.dat: Matlab v4 mat-file (little endian)
> 				  _\001,
> 				  numeric, rows 351, columns 100,
> 				  imaginary
> TileCacheLogo-133343421_100.dat:  Matlab v4 mat-file (little endian)
> 				  ]\001,
> 				  numeric, rows 385, columns 100,
> 				  imaginary
> TileCacheLogo-738609890_100.dat:  Matlab v4 mat-file (little endian)
> 				  \376\001,
> 				  numeric, rows 551, columns 100,
> 				  imaginary
> TileCacheLogo-947200500_100.dat:  Matlab v4 mat-file (little endian)
> 				  j\001,
> 				  numeric, rows 362, columns 100,
> 				  imaginary
> test_mat4_le_floats.mat:          Matlab v4 mat-file (little endian)
> 				  a,
> 				  numeric, rows 1, columns 2
> testcomplex_4.2c_SOL2.mat:        Matlab v4 mat-file (big endian)
> 				  testcomplex,
> 				  numeric, rows 1, columns 9,
> 				  imaginary
> teststringarray_4.2c_SOL2.mat:    Matlab v4 mat-file (big endian)
> 				  teststringarray,
> 				  text, rows 3, columns 5
> testvec_4_GLNX86.mat:             Matlab v4 mat-file (little endian)
> 				  fit_params,
> 				  numeric, rows 2, columns 1
> tokens.dat:                       Matlab v4 mat-file (little endian)
> 				  \237\360\006,
> 				  text, rows 4, columns 455157,
> 				  imaginary
> 
> Unfortunately level 4 MAT files have no significant magic pattern. So
> I put the displaying part inside a subroutine named matlab4 and then
> added test lines to identify such matrices more reliably. The
> subroutine starts with lines displaying text inside Magdir/mathematica
> like:
>  0	name	matlab4		Matlab v4 mat-file
>  !:mime	application/x-matlab-data
>  !:ext	mat
> 
> So in principle only the test lines must be changed or added.
> Obviously the matrix name at offset 20 for real MAT samples looks like
> fit_params, a, or testcomplex, whereas for misidentified DAT examples I
> get 2-byte names like j\001 and _\001, a 3-byte sequence like
> \237\360\006, or the 5-byte sequence \304P\344@\001 in
> PreviousEntries.dat.
> 
> There was just one line that checks for a "valid ASCII" matrix name, like:
>> 20	ubyte	>0x1F
> This line only checks that the first character of the matrix name
> is not a control character. The documentation does not explicitly
> specify what a matrix name may contain.
> Furthermore it is not clear whether a matrix name is required or
> whether it can be empty, as in the misidentified examples Netwfw00.dat
> and Netwfw01.dat.
> So it is quite difficult to restrict the check of the matrix name.
> 
> Finally I must also add a test that the first character of the name is
> not "too high". With this additional line the bad example
> PreviousEntries.dat with the invalid name \304P\344@\001 is skipped. So
> this line sequence now becomes:
>> 20	ubyte	>0x1F
>>> 20	ubyte	<0304
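
The two name tests above can also be expressed outside of magic syntax. The following is only a minimal Python sketch, assuming the name starts at offset 20 as described; the helper name `first_name_byte_ok` is hypothetical and not part of the patch.

```python
def first_name_byte_ok(header: bytes) -> bool:
    """Mimic '>20 ubyte >0x1F' followed by '>>20 ubyte <0304'."""
    if len(header) <= 20:
        return False          # no name byte available at offset 20
    b = header[20]
    # Above the control range, and below octal 0304 (decimal 196).
    return 0x1F < b < 0o304


# 20 placeholder header bytes followed by a matrix name.
good = bytes(20) + b"fit_params\x00"
bad = bytes(20) + b"\304P\344@\001"   # name reported for PreviousEntries.dat

print(first_name_byte_ok(good))  # True
print(first_name_byte_ok(bad))   # False
```

A name starting with \237, as reported for tokens.dat, still passes this check, which is why further tests are needed.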
> 
> The matrix name was shown by a line like:
>> 16	pstring/L	x	%s
> because the name length is stored as a 4-byte integer just before the
> name. Furthermore the name is still nul-terminated. That information
> can be shown by debugging lines like:
>> 16		ubelong	x	\b, name length %u
>> (16.L+19)	ubyte	x	\b, TERMINATING NAME CHARACTER %#x
>> 21		ubyte	x	\b, MAYBE 2ND CHAR=%c
> Unfortunately this also holds for some Netwfw examples and all my
> TileCacheLogo examples, so this test is not well suited.
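
To make the layout behind the debugging lines concrete, here is a rough Python sketch. It assumes the layout described in the message (4-byte name length at offset 16, name from offset 20, last name byte at offset name length + 19). The function `read_name` and the sample headers are illustrative only and are built by hand, not taken from the real files.

```python
import struct


def read_name(buf: bytes, little_endian: bool = True):
    """Return (name_length, name_bytes, terminator) from a level 4 header."""
    fmt = "<I" if little_endian else ">I"
    (namlen,) = struct.unpack_from(fmt, buf, 16)
    name = buf[20:20 + namlen]
    # The stored length includes the terminator, so the last name byte sits
    # at offset namlen + 19, which is what '(16.L+19)' addresses.
    terminator = name[-1] if name else None
    return namlen, name, terminator


# Hand-built headers: type, rows, columns, imaginary flag, name length.
le_header = struct.pack("<5I", 0, 2, 1, 0, 11) + b"fit_params\x00"
be_header = struct.pack(">5I", 1000, 3, 5, 0, 16) + b"teststringarray\x00"

print(read_name(le_header))
print(read_name(be_header, little_endian=False))
```

As the message notes, a zero terminator alone does not separate real MAT files from the DAT samples.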
> 
> In the end I inserted additional tests before calling the subroutine
> in the little endian branch. First I look at the matrix name length.
> Because I found no misidentified example with a "short" matrix name, I
> call the subroutine directly in that case. For examples with a
> "longer" matrix name I inspect the next character of the name. If it
> looks like ASCII I continue with calling the subroutine. By this step
> the TileCacheLogo-*.dat examples, with invalid 2nd name character \001
> and name length 96, are skipped. This is now done by additional lines
> like:
>>>>>> 16	ulelong		<3
>>>>>>> 0		use	\^matlab4
>>>>>> 16	ulelong		>2
>>>>>>> 21	ubyte	>0x1F
>>>>>>>> 0	use	\^matlab4
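
The branch on the name length can be read as the following Python sketch. It is only an approximation of the magic lines above for the little endian case; `name_tests_pass` is a hypothetical helper, and the sample header is synthetic, using only the name length 96 and the 2nd character \001 mentioned for the TileCacheLogo files.

```python
import struct


def name_tests_pass(buf: bytes) -> bool:
    """Little endian branch: decide whether matlab4 would be called."""
    (namlen,) = struct.unpack_from("<I", buf, 16)
    if namlen < 3:
        # '16 ulelong <3': short name, call the subroutine directly.
        return True
    # '16 ulelong >2' then '21 ubyte >0x1F': the 2nd name character
    # must not be a control character.
    return len(buf) > 21 and buf[21] > 0x1F


# Synthetic header resembling the TileCacheLogo case.
tile_like = struct.pack("<5I", 0, 362, 100, 1, 96) + b"j\001" + bytes(94)
# Synthetic header resembling test_mat4_le_floats.mat (name "a").
short_name = struct.pack("<5I", 0, 1, 2, 0, 2) + b"a\x00"

print(name_tests_pass(tile_like))   # False
print(name_tests_pass(short_name))  # True
```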
> 
> Many DAT examples are described as "imaginary".
> At offset 12 the imaginary flag is stored as a 4-byte integer. If it
> is 1, the matrix has an imaginary part. If it is 0, there is only real
> data. The information for a non-real (that is, imaginary) matrix was
> shown by a line like:
>> 12	ubelong		!0	\b, imaginary
> For checking purposes I changed this to a line like:
>> 12	ubelong		!0	\b, imaginary (%u)
> 
> So it becomes visible that for many DAT examples I get an invalid
> imaginary flag value, like 12 for the tokens example or 2147483648 for
> example $I3KREPH.dat.
> 
> There was only one test of the imaginary flag. It only checks that the
> 2 middle bytes are zero, via a line like:
>  13	ushort	0
> 
> So after the check for a "valid low" little endian type flag, the
> subroutine was called by lines like:
>>> 0	ulelong		<53
>>>> 0	use		\^matlab4
> With the additional check for an invalid imaginary flag value this now
> becomes:
>>> 0	ulelong		<53
>>>> 12	ulelong		<2
>>>>> 0	use		\^matlab4
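
Why the old middle-bytes test lets the bad flag values through can be shown with a short Python sketch. It assumes the little endian layout described above; the two function names are hypothetical, and the headers are synthetic apart from the flag values 12 and 2147483648 quoted in the message.

```python
import struct


def old_imag_test(buf: bytes) -> bool:
    """'13 ushort 0': only the two middle bytes of the flag must be zero."""
    return struct.unpack_from("<H", buf, 13)[0] == 0


def new_imag_test(buf: bytes) -> bool:
    """'12 ulelong <2': the whole flag must be 0 or 1."""
    return struct.unpack_from("<I", buf, 12)[0] < 2


for flag in (0, 1, 12, 2147483648):
    # Synthetic header; only the imaginary flag at offset 12 matters here.
    header = struct.pack("<5I", 0, 1, 1, flag, 2) + b"a\x00"
    print(flag, old_imag_test(header), new_imag_test(header))
```

Both 12 (bytes 0c 00 00 00) and 2147483648 (bytes 00 00 00 80) keep bytes 13 and 14 at zero, so only the full-width comparison rejects them.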
> 
> At offset 4 the number of rows in the matrix is stored as a 4-byte
> integer (like 1 3 8). At offset 8 the number of columns in the
> matrix is stored as a 4-byte integer (like 1 3 4 5 9 43). So the matrix
> dimensions are shown by lines like:
>> 4	ubelong		x	\b, rows %u
>> 8	ubelong		x	\b, columns %u
> 
> It is not explicitly mentioned, but in a matrix the numbers of rows and
> columns must be at least 1. That means the value zero cannot occur,
> as it does in the bad example $I3KREPH.dat or some Netwfw examples. So
> skip such examples by an additional test for non-zero rows via a line
> like:
>> 4	ulong		!0	ROW_OK
> 
> So I inserted such a line after the check for a "valid ASCII" matrix
> name and before the check of a valid type flag. This now looks like:
>> 20	ubyte	>0x1F
>>> 4	ulong		!0
>>>> 0	ubelong&0xFFffFF00	0x00000300
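
The ordering of these checks can be sketched in Python as follows. This is only an approximation under stated assumptions: the big endian mask and the little endian limit 53 are taken from the magic lines quoted in this message, while the function name `classify`, the order in which the two type tests are tried, and the hand-built headers are illustrative.

```python
import struct


def classify(buf: bytes):
    """Return 'big endian', 'little endian', or None if a check fails."""
    if len(buf) < 21 or buf[20] <= 0x1F:
        return None                      # '20 ubyte >0x1F'
    if buf[4:8] == bytes(4):
        return None                      # '4 ulong !0': rows must be non-zero
    (be,) = struct.unpack_from(">I", buf, 0)
    (le,) = struct.unpack_from("<I", buf, 0)
    if be & 0xFFFFFF00 == 0x00000300:    # '0 ubelong&0xFFffFF00 0x00000300'
        return "big endian"
    if le < 53:                          # '0 ulelong <53'
        return "little endian"
    return None


le_header = struct.pack("<5I", 0, 2, 1, 0, 11) + b"fit_params\x00"
be_header = struct.pack(">5I", 1000, 1, 9, 1, 12) + b"testcomplex\x00"
zero_rows = struct.pack("<5I", 0, 0, 1200, 0, 2) + b"a\x00"

for h in (le_header, be_header, zero_rows):
    print(classify(h))
```

Because a zero value is the same in either byte order, the row test does not need to know the endianness and can run before the type flag is examined.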
> 
> I hope that my additional lines are now unique enough to identify MAT
> level 4 files.
> 
> After applying the above mentioned modifications with patch
> file-5.40-mathematica-v4.diff, all my matrix examples are still
> described and the misidentification of the DAT examples has vanished:
> 
> $I3KREPH.dat:                     data
> Netwfw00.dat:                     data
> Netwfw02.dat:                     data
> PreviousEntries.dat:              data
> TileCacheLogo-1050505875_100.dat: data
> TileCacheLogo-133343421_100.dat:  data
> TileCacheLogo-738609890_100.dat:  data
> TileCacheLogo-947200500_100.dat:  data
> test_mat4_le_floats.mat:          Matlab v4 mat-file (little endian)
> 				  a,
> 				  numeric, rows 1, columns 2
> testcomplex_4.2c_SOL2.mat:        Matlab v4 mat-file (big endian)
> 				  testcomplex,
> 				  numeric, rows 1, columns 9,
> 				  imaginary (1)
> teststringarray_4.2c_SOL2.mat:    Matlab v4 mat-file (big endian)
> 				  teststringarray,
> 				  text, rows 3, columns 5
> testvec_4_GLNX86.mat:             Matlab v4 mat-file (little endian)
> 				  fit_params,
> 				  numeric, rows 2, columns 1
> tokens.dat:                       data
> 
> I hope my diff file can be applied in a future version of the file utility.
> 
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
