[File] [PATCH] for Magdir/mathematica: Netwfw00.dat tokens.dat TileCacheLogo-*.dat misidentified as "Matlab v4 mat-file"

Jörg Jenderek joerg.jen.der.ek at gmx.net
Sat Nov 6 20:13:46 UTC 2021


Hello,

In July 2021 I sent a patch for Magdir/mathematica to recognise Matlab
v4 mat-files. A few days ago I inspected samples with the DAT file name
extension.
When running the file command version 5.41 on such examples and on
related MATrix files, I get output like:

$I3KREPH.dat:                     Matlab v4 mat-file (little endian)
				  \2014\326\001N,
				  text, rows 0, columns 1200
Netwfw00.dat:                     Matlab v4 mat-file (little endian)
				  numeric, rows 0, columns 0,
				  imaginary
Netwfw02.dat:                     Matlab v4 mat-file (little endian)
				  !\001\025 \216\372,
				  rows 161, columns 65536,
				  imaginary
PreviousEntries.dat:              Matlab v4 mat-file (little endian)
				  \304P\344@\001,
				  text, rows 23, columns 259
TileCacheLogo-1050505875_100.dat: Matlab v4 mat-file (little endian)
				  _\001,
				  numeric, rows 351, columns 100,
				  imaginary
TileCacheLogo-133343421_100.dat:  Matlab v4 mat-file (little endian)
				  ]\001,
				  numeric, rows 385, columns 100,
				  imaginary
TileCacheLogo-738609890_100.dat:  Matlab v4 mat-file (little endian)
				  \376\001,
				  numeric, rows 551, columns 100,
				  imaginary
TileCacheLogo-947200500_100.dat:  Matlab v4 mat-file (little endian)
				  j\001,
				  numeric, rows 362, columns 100,
				  imaginary
test_mat4_le_floats.mat:          Matlab v4 mat-file (little endian)
				  a,
				  numeric, rows 1, columns 2
testcomplex_4.2c_SOL2.mat:        Matlab v4 mat-file (big endian)
				  testcomplex,
				  numeric, rows 1, columns 9,
				  imaginary
teststringarray_4.2c_SOL2.mat:    Matlab v4 mat-file (big endian)
				  teststringarray,
				  text, rows 3, columns 5
testvec_4_GLNX86.mat:             Matlab v4 mat-file (little endian)
				  fit_params,
				  numeric, rows 2, columns 1
tokens.dat:                       Matlab v4 mat-file (little endian)
				  \237\360\006,
				  text, rows 4, columns 455157,
				  imaginary

Unfortunately level 4 MAT files have no significant magic pattern. So
I put the displaying part inside a subroutine named matlab4 and then added
test lines to identify such matrices as uniquely as possible. Inside
Magdir/mathematica the subroutine starts with lines like:
  0	name	matlab4		Matlab v4 mat-file
  !:mime	application/x-matlab-data
  !:ext	mat

So in principle only the test lines must be changed or added.
The matrix name at offset 20 of real MAT samples looks like
fit_params, a or testcomplex, whereas for misidentified DAT examples I
get 2-byte names like j\001 and _\001, a 3-byte sequence like
\237\360\006, or the 5-byte sequence \304P\344@\001 in PreviousEntries.dat.

There was just one line that checks for a "valid ASCII" matrix name:
>20	ubyte	>0x1F
This line only checks that the first character of the matrix name
is not a control character (any byte below 0x20). The documentation
gives no explicit specification for the matrix name.
Furthermore it is not clear whether a matrix name is required or whether
it can be empty, as in the misidentified examples Netwfw00.dat and
Netwfw01.dat.
So it is quite difficult to restrict the check of the matrix name.

Finally I must also add a test that the first character of the name is
not "too high". This additional line skips the bad example
PreviousEntries.dat with the invalid name \304P\344@\001. So this line
sequence now becomes:
>20	ubyte	>0x1F
>>20	ubyte	<0304

The matrix name was shown by a line like:
>16	pstring/L	x	%s
because the name length is stored as a 4 byte integer before it.
Furthermore the name is also nul-terminated. That information can
be shown by debugging lines like:
>16		ubelong	x	\b, name length %u
>(16.L+19)	ubyte	x	\b, TERMINATING NAME CHARACTER %#x
>21		ubyte	x	\b, MAYBE 2ND CHAR=%c
Unfortunately the name is also nul-terminated in some Netwfw examples
and all my TileCacheLogo examples. So this test is not well suited.
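
As a rough illustration outside the patch, the header layout described
above can be sketched in Python. The function name read_mat4_header and
the synthetic sample are hypothetical; the sample is only modelled on
the output reported for testvec_4_GLNX86.mat, and its type flag 0 is an
assumption.

```python
import struct

def read_mat4_header(data: bytes, little_endian: bool = True) -> dict:
    """Unpack the five 4-byte header integers and the matrix name."""
    if len(data) < 20:
        raise ValueError("need at least 20 header bytes")
    byte_order = "<" if little_endian else ">"
    type_flag, mrows, ncols, imagf, namlen = struct.unpack_from(
        byte_order + "5I", data, 0)
    raw_name = data[20:20 + namlen]   # namlen bytes, including the NUL
    last = namlen + 19                # same offset as (16.L+19) in magic
    terminator = data[last] if namlen > 0 and last < len(data) else None
    return {
        "type": type_flag, "rows": mrows, "columns": ncols,
        "imagf": imagf, "namlen": namlen,
        "name": raw_name.split(b"\x00", 1)[0],
        "terminator": terminator,
    }

# Synthetic header (assumption): name fit_params, rows 2, columns 1.
name = b"fit_params\x00"
sample = struct.pack("<5I", 0, 2, 1, 0, len(name)) + name
info = read_mat4_header(sample)
print(info["name"], info["rows"], info["columns"], info["terminator"])
```

A terminator of 0 only shows that the byte at namlen+19 is NUL, which
random data can satisfy as well.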

In the end I inserted additional tests before calling the subroutine in
the little endian branch. First I look at the matrix name length.
Because I found no misidentified example with a "short" matrix name, I
call the subroutine directly in that case. For examples with a "longer"
matrix name I inspect the next character of the name. If it looks like
ASCII I continue by calling the subroutine. This step skips
TileCacheLogo-*.dat with the invalid 2nd character \001 of the name and
name length 96. So this is now done by additional lines like:
>>>>>16	ulelong		<3
>>>>>>0		use	\^matlab4
>>>>>16	ulelong		>2
>>>>>>21	ubyte	>0x1F
>>>>>>>0	use	\^matlab4

Many DAT examples are described as "imaginary".
At offset 12 the imaginary flag is stored as a 4 byte integer. If it
is 1, the matrix has an imaginary part. If it is 0, there is only real
data. So this information for a non-real (that means imaginary)
matrix was shown by a line like:
>12	ubelong		!0	\b, imaginary
For control reasons I changed this to a line like:
>12	ubelong		!0	\b, imaginary (%u)

So it becomes visible that for many DAT examples I get an invalid
imaginary flag value like 12 for the tokens example or 2147483648 for
the example $I3KREPH.dat.

There was only one test of the imaginary flag. It only tests whether
the 2 middle bytes are nil, by a line like:
  13	ushort	0

So after the check for a "valid low" little endian type flag the
subroutine is called by lines like:
>>0	ulelong		<53
>>>0	use		\^matlab4
With the additional check for an invalid imaginary flag value this now
becomes:
>>0	ulelong		<53
>>>12	ulelong		<2
>>>>0	use		\^matlab4
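
To illustrate why the old test was too weak, here is a small Python
sketch (not part of the patch). The function names and the helper
header_with_imagf are hypothetical; the flag values 12 and 2147483648
are the ones reported above, placed into otherwise empty synthetic
headers.

```python
import struct

def old_imag_test(header: bytes) -> bool:
    """Mirror of '13 ushort 0': bytes 13 and 14 must both be zero."""
    return header[13:15] == b"\x00\x00"

def new_imag_test(header: bytes) -> bool:
    """Mirror of '>>>12 ulelong <2': the whole flag must be 0 or 1."""
    (imagf,) = struct.unpack_from("<I", header, 12)
    return imagf < 2

def header_with_imagf(value: int) -> bytes:
    """Hypothetical helper: 12 zero bytes plus a little endian flag."""
    return bytes(12) + struct.pack("<I", value)

results = {}
for value in (0, 1, 12, 2147483648):
    h = header_with_imagf(value)
    results[value] = (old_imag_test(h), new_imag_test(h))
    print(value, results[value])
```

Both invalid values have zero middle bytes when stored little endian, so
only the comparison of the whole 4 byte value rejects them.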

At offset 4 the number of rows in the matrix is stored as a 4 byte
integer (like: 1 3 8). At offset 8 the number of columns in the
matrix is stored as a 4 byte integer (like 1 3 4 5 9 43). So the matrix
dimensions are shown by lines like:
  >4	ubelong		x	\b, rows %u
  >8	ubelong		x	\b, columns %u

It is not explicitly mentioned, but in a matrix the numbers of rows and
columns must each be at least 1. That means the value zero cannot occur,
as it does in the bad example $I3KREPH.dat or some Netwfw examples. So I
skip such examples by an additional test for non-zero rows via a line like:
  >4	ulong		!0	ROW_OK

So I inserted such a line after the check for a "valid ASCII" matrix
name and before the check for a valid type flag. This now looks like:
>20	ubyte	>0x1F
>>4	ulong		!0
>>>0	ubelong&0xFFffFF00	0x00000300

I hope that my additional lines are now specific enough to identify MAT
level 4 files.
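
For readers who prefer code over magic lines, the little endian test
chain of the patch can be mirrored roughly in Python. This is only a
sketch under assumptions: the names looks_like_le_mat4 and make_header
are hypothetical, the big endian branch and any tests outside the shown
hunk are left out, and all sample headers are synthetic, only modelled
on the values reported above.

```python
import struct

def looks_like_le_mat4(data: bytes) -> bool:
    """Rough mirror of the little endian tests in the patch."""
    if len(data) < 22:
        return False
    type_flag, mrows, ncols, imagf, namlen = struct.unpack_from("<5I", data, 0)
    if not 0x1F < data[20] < 0o304:  # >20 ubyte >0x1F, >>20 ubyte <0304
        return False
    if mrows == 0:                   # >>>4 ulong !0 (byte order irrelevant)
        return False
    if type_flag >= 53:              # >>>>0 ulelong <53
        return False
    if imagf >= 2:                   # >>>>>12 ulelong <2
        return False
    if namlen < 3:                   # >>>>>>16 ulelong <3
        return True
    return data[21] > 0x1F           # >>>>>>>21 ubyte >0x1F

def make_header(type_flag, mrows, ncols, imagf, namlen, name_bytes):
    """Hypothetical helper that builds a synthetic header."""
    return struct.pack("<5I", type_flag, mrows, ncols, imagf, namlen) + name_bytes

# Synthetic samples (assumptions); each bad one fails exactly one test.
good_short = make_header(0, 1, 2, 0, 2, b"a\x00")
good_long = make_header(0, 2, 1, 0, 11, b"fit_params\x00")
tile_like = make_header(0, 362, 100, 1, 96, b"j\x01" + bytes(94))
tokens_like = make_header(1, 4, 455157, 12, 4, b"\x9f\xf0\x06\x00")
prev_like = make_header(1, 23, 259, 0, 6, b"\xc4P\xe4@\x01\x00")
zero_rows = make_header(1, 0, 1200, 0, 6, b"\x814\xd6\x01N\x00")

for label, blob in [("good_short", good_short), ("good_long", good_long),
                    ("tile_like", tile_like), ("tokens_like", tokens_like),
                    ("prev_like", prev_like), ("zero_rows", zero_rows)]:
    print(label, looks_like_le_mat4(blob))
```

The order of the checks follows the nesting levels of the patch, so a
sample is rejected by the first test it fails.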

After applying the above mentioned modifications by patch
file-5.41-mathematica-v4.diff, all my matrix examples are still
described and the misidentification of the DAT examples has vanished:

$I3KREPH.dat:                     data
Netwfw00.dat:                     data
Netwfw02.dat:                     data
PreviousEntries.dat:              data
TileCacheLogo-1050505875_100.dat: data
TileCacheLogo-133343421_100.dat:  data
TileCacheLogo-738609890_100.dat:  data
TileCacheLogo-947200500_100.dat:  data
test_mat4_le_floats.mat:          Matlab v4 mat-file (little endian)
				  a,
				  numeric, rows 1, columns 2
testcomplex_4.2c_SOL2.mat:        Matlab v4 mat-file (big endian)
				  testcomplex,
				  numeric, rows 1, columns 9,
				  imaginary (1)
teststringarray_4.2c_SOL2.mat:    Matlab v4 mat-file (big endian)
				  teststringarray,
				  text, rows 3, columns 5
testvec_4_GLNX86.mat:             Matlab v4 mat-file (little endian)
				  fit_params,
				  numeric, rows 2, columns 1
tokens.dat:                       data

I hope my diff file can be applied in a future version of the file utility.

With best wishes
Jörg Jenderek
--
Jörg Jenderek

-------------- next part --------------
--- file-5.41/magic/Magdir/mathematica.old	2021-08-15 06:08:37 +0000
+++ file-5.41/magic/Magdir/mathematica	2021-11-06 19:45:53 +0000
@@ -118,14 +118,27 @@
 # check for valid ASCII matrix name
 >20	ubyte	>0x1F
+# skip PreviousEntries.dat with "invalid high" name \304P\344@\001
+>>20	ubyte	<0304
+# skip some Netwfw*.dat and $I3KREPH.dat by checking for non zero number of rows
+>>>4	ulong		!0
 # skip some CD-ROM filesystem like test-hfs.iso by looking for valid big endian type flag
->>0	ubelong&0xFFffFF00	0x00000300
->>>0	use	matlab4
+>>>>0	ubelong&0xFFffFF00	0x00000300
+>>>>>0	use	matlab4
 # no example for 8-bit and 16-bit integers matrix
->>0	ubelong&0xFFffFF00	0x00000400
->>>0	use	matlab4
->>0	ulelong		x
+>>>>0	ubelong&0xFFffFF00	0x00000400
+>>>>>0	use	matlab4
+#	branch for Little-Endian variant of Matlab MATrix version 4
 # skip big endian variant by looking for valid low lttle endian type flag
->>0	ulelong		<53
->>>0	use	\^matlab4
+>>>>0	ulelong		<53
+# skip tokens.dat and some Netwfw*.dat by check for valid imaginary flag value of MAT version 4
+>>>>>12	ulelong		<2
+# no misidentified little endian MATrix example with "short" matrix name
+>>>>>>16	ulelong		<3
+>>>>>>>0	use	\^matlab4
+# little endian MATrix with "long" matrix name or some misidentified samples
+>>>>>>16	ulelong		>2
+# skip TileCacheLogo-*.dat with invalid 2nd character \001 of matrix name with length 96
+>>>>>>>21 ubyte	>0x1F
+>>>>>>>>0 use	\^matlab4
 #	display information of Matlab v4 mat-file
 0	name	matlab4		Matlab v4 mat-file
@@ -146,6 +159,8 @@
 # namlen; the length of the matrix name
 #>16	ubelong		x	\b, name length %u
+#>(16.L+19)	ubyte	x	\b, TERMINATING NAME CHARACTER=%#x
 # nul terminated matrix name like: fit_params testmatrix testsparsecomplex teststringarray
 #>20	string		x	\b, MATRIX NAME="%s"
+#>21		ubyte	x	\b, MAYBE 2ND CHAR=%c
 >16	pstring/L	x	%s
 # T indicates the matrix type: 0~numeric 1~text 2~sparse
@@ -159,4 +174,4 @@
 >8	ubelong		x	\b, columns %u
 # imagf; imaginary flag; 1~matrix has an imaginary part 0~only real data
->12	ubelong		!0	\b, imaginary
+>12	ubelong		!0	\b, imaginary (%u)
 # real; Real part of the matrix consists of mrows * ncols numbers
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.41-mathematica-v4.diff.sig
Type: application/octet-stream
Size: 1189 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20211106/6d963e95/attachment.obj>