[File] [PATCH] of Magdir/mathematica, images Matlab mat-file *.mat, Hierarchical Data Format *.hdf

Jörg Jenderek joerg.jen.der.ek at gmx.net
Sat Jul 10 16:11:50 UTC 2021


Hello,

Some days ago I inspected some Matlab examples with the file name
extension mat.

When running file command version 5.40 on such examples and
related files with the -k option I get output like:

abydos.h5:                 Hierarchical Data Format (version 5) data
big_endian.mat:            Matlab v5 mat-file
			   (big endian) version 0x0100
input_256.hdf:             Hierarchical Data Format (version 4) data
malformed1.mat:            Matlab v5 mat-file
			   (little endian) version 0x0100
miuint32_for_miint32.mat:  Matlab v5 mat-file
			   (little endian) version 0x0100
one_by_zero_char.mat:      Matlab v5 mat-file
			   (little endian) version 0x0100
ReactOS-LiveCD.iso:        ISO 9660 CD-ROM filesystem data
			   'REACTOS' (bootable)
			   (Lepton 3.x), scale 0-0,
			   (Lepton 2.x), scale 0-0,
test-hfs.iso:              ISO 9660 CD-ROM filesystem data
			   (DOS/MBR boot sector)
			   'test-hfs-cdrom-hybrid'
			   (Lepton 3.x), scale 0-0,
			   (Lepton 2.x), scale 0-0,
testbool_8_WIN64.mat:      Matlab v5 mat-file
			   (little endian) version 0x0100
testcell_6.1_SOL2.mat:     Matlab v5 mat-file
			   (big endian) version 0x0100
testcomplex_4.2c_SOL2.mat: data
testhdf5_7.4_GLNX86.mat:   Hierarchical Data Format (version 5)
			   with 512 bytes user block
			   Matlab v5 mat-file
			   (little endian) version 0x0200
testsparse_4.2c_SOL2.mat:  data
teststring_4.2c_SOL2.mat:  data
testvec_4_GLNX86.mat:      data

Most MAT examples like one_by_zero_char.mat are described correctly
by Magdir/mathematica as "Matlab v5 mat-file".
But a few examples like testcomplex_4.2c_SOL2.mat are only described
as "data".

Furthermore, with --extension only ??? is displayed, and with the -i
option only the generic application/octet-stream is shown for MAT
examples.

For comparison I ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html).

A few MAT examples like malformed1.mat, miuint32_for_miint32.mat and
one_by_zero_char.mat, which are described correctly by the file
command, are misidentified by TrID as "SMS Material" by
mat-sms.trid.xml.

Most of the examples like testcomplex_4.2c_SOL2.mat,
testsparse_4.2c_SOL2.mat and teststring_4.2c_SOL2.mat, which are
described as "data" by the file command, are described by TrID as
"Matlab Level 4 MAT-File (big-endian)" by mat-l4-be.trid.xml (see
appended MAT-trid-v.txt.gz). TrID also displays a related URL and the
file name extension.

MAT examples are detected by lines inside Magdir/mathematica like:

  0       string  MATLAB  Matlab v5 mat-file
  >126    short   0x494d  (big endian)
  >>124   beshort x       version 0x%04x
  >126    short   0x4d49  (little endian)
  >>124   leshort x       version 0x%04x
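
For illustration only (not part of the patch), the same check can be
sketched in Python. The function name describe_mat5 is hypothetical.
The sketch assumes the 128-byte Level 5 header layout from
matfile_format.pdf, where a file written in big-endian order carries
the bytes "MI" at offset 126 and a little-endian one carries "IM":

```python
import struct


def describe_mat5(data: bytes):
    """Return a short description of a Level 5 style MAT header, or None."""
    # Header layout: 116 bytes text, 8 bytes subsystem offset,
    # 2 bytes version (offset 124), 2 bytes endian indicator (offset 126).
    if len(data) < 128 or not data.startswith(b"MATLAB"):
        return None
    indicator = data[126:128]
    if indicator == b"MI":      # 0x4D49 stored in big-endian order
        endian, fmt = "big endian", ">H"
    elif indicator == b"IM":    # 0x4D49 stored in little-endian order
        endian, fmt = "little endian", "<H"
    else:
        return "Matlab mat-file (unknown endian indicator)"
    (version,) = struct.unpack_from(fmt, data, 124)
    return f"Matlab mat-file ({endian}) version 0x{version:04x}"
```

Comparing the raw bytes keeps the sketch independent of the byte order
of the machine that runs it.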

For the MAT examples TrID mentioned a page about MAT on the File
Formats Archive Team website as related URL. That page mentions the
MAT-File Format documentation matfile_format.pdf. So this
information is now expressed by additional comment lines like:
  # URL:		http://fileformats.archiveteam.org/wiki/MAT
  # Reference:
  # https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf

According to the documentation the first 116 bytes of the header can
contain text data in human-readable form. This text typically
provides information that describes how the MAT-file was created.
MAT-files created by MATLAB include the following information in
their headers:
1) Level of the MAT-file
2) Platform on which the file was created
3) Date and time the file was created
This often looks like the following string:
MATLAB 5.0 MAT-file, Platform: SOL2, Created on: Thu Nov 13
10:10:27 1997

The file command only tests for the start keyword MATLAB, whereas the
TrID command looks at more bytes. So I look for the platform tag
(which is like: GLNX86 PCWIN PCWIN64 SOL2 Windows_7 nt posix)
and for the creation time. In a few examples like malformed1.mat
and miuint32_for_miint32.mat the leading comma (0x2C) before the
platform part is missing. And in one example not created by MATLAB,
one_by_zero_char.mat, the leading ASCII string looks like
"MATLAB 5.0 MAT-file, written by Octave 3.2.3, 2011-01-25 19:30:48
UTC".
So here the platform part is missing and the creation time is stored
in another format. This information is now shown by additional lines
like:


  >>20	search/2	Platform:\040	\b, platform
  >>>&0	string		x		%-0.2s
  >>>&2		ubyte	!0x2C		\b%c
  >>>>&0		ubyte	!0x2C		\b%c
  >>>>>&0	ubyte	!0x2C		\b%c
  >>>>>>&0	ubyte	!0x2C		\b%c
  >>>>>>>&0	ubyte	!0x2C		\b%c
  >>>>>>>>&0	ubyte	!0x2C		\b%c
  >>>>>>>>>&0	ubyte	!0x2C		\b%c
  >>20	default		x
  >>>11	string		x	"%s"
  >34	search/9/c	created\040on:\040	\b, created
  >>&0	string	x		%-.24s

One MAT example, testhdf5_7.4_GLNX86.mat, was not identified by TrID
because it starts with the ASCII string "MATLAB 7.0" instead of the
string "MATLAB 5.0" as in the other examples. So this is a variant
with the higher version level 7. This is also visible in the
hexadecimal version, which is 0x0200 in that case, whereas for level 5
this value is 0x0100. So this example should be described as something
like "Matlab v7 mat-file" instead of "Matlab v5 mat-file". This is now
done by lines like:
  0       string  MATLAB  Matlab v
  !:mime	application/x-matlab-data
  !:ext	mat
  >7	ubyte	=0x35	\b5 mat-file
  >7	ubyte	!0x35
  >>7	string	x	\b%.3s mat-file

Instead of the generic application/octet-stream, the mime type
application/x-matlab-data is now shown, and the file name
extension mat is now also displayed.

This MAT example testhdf5_7.4_GLNX86.mat was identified first as
"Hierarchical Data Format (version 5) with 512 bytes user block"
by Magdir/images with lines like:
  512 string \211HDF\r\n\032\n Hierarchical Data Format \
                               (version 5) with 512 bytes user block
  !:mime	application/x-hdf
After inspecting the MAT file in more detail, it becomes clear that
this example is really a matrix file that just tests some HDF aspects.
Therefore it also contains the short HDF pattern at a suitable
position. So I skip HDF recognition of such examples by looking for
MATLAB characteristics. The above lines now become:

  512 string \211HDF\r\n\032\n
  >0  string !MATLAB	Hierarchical Data Format \
                         (version 5) with 512 bytes user block
  !:mime	application/x-hdf5
  !:ext	h5/hdf5/hdf/he5
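
A minimal Python sketch of this decision, for illustration only. The
name describe_hdf5 is hypothetical. The user block offsets are the
ones already tested in Magdir/images, and the MATLAB exclusion is
applied only at offset 512, as in the lines above:

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"   # \211HDF\r\n\032\n in magic syntax

USER_BLOCKS = {512: "512 bytes", 1024: "1k", 2048: "2k", 4096: "4k"}


def describe_hdf5(data: bytes):
    """Return an HDF5 description, or None if no HDF5 rule should fire."""
    if data.startswith(HDF5_SIGNATURE):
        return "Hierarchical Data Format (version 5) data"
    for offset, size in USER_BLOCKS.items():
        if data[offset:offset + 8] != HDF5_SIGNATURE:
            continue
        if offset == 512 and data.startswith(b"MATLAB"):
            return None   # leave the file to the MAT rules
        return f"Hierarchical Data Format (version 5) with {size} user block"
    return None
```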

According to Wikipedia I now show four extensions for version 5 and
three for version 4, but in my examples I found only the hdf extension
for version 4 and the h5 extension for version 5. For version 5 the
mime type application/x-hdf5 is now used instead of application/x-hdf.

The mentioned link to hdf.ncsa.uiuc.edu does not exist any more. So I
add URLs of the Archive Team and Wikipedia pages about HDF. This is
now expressed by comment lines like:
  # URL: http://fileformats.archiveteam.org/wiki/HDF
  #	https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Besides the Level 5 MAT-File Format, the MAT-File Format documentation
matfile_format.pdf also explains the older Level 4 MAT-File Format.
So I see that the unrecognized ("data") MAT samples are just older
level 4 examples.

Unfortunately level 4 MAT files have no significant magic pattern. So
I put the displaying part inside a subroutine named matlab4 and then
add enough test lines to identify such matrices reliably. The
subroutine starts with lines that display text similar to that for
level 5 mat-files:
  0	name	matlab4		Matlab v4 mat-file
  !:mime	application/x-matlab-data
  !:ext	mat

According to the specification such MAT files start with a 20-byte
header of 5 long integers that describe certain attributes of the
matrix.
At offset 0 the type flag is stored as a 4-byte integer in the byte
order of the writing machine. In decimal that type integer is
represented as MOPT, where M is the thousands digit and indicates the
numeric format of numbers on the machine. The biggest possible value
is 4052 (=0xFD4). That means the 2 upper bytes are always 0.
For big endian (that means Macintosh, SPARC, Apollo, SGI, HP
9000/300, other Motorola systems) the M value is 1. So the lowest flag
value is 1000 (=3E8 hexadecimal) and the highest value is 1052 (=41C
hexadecimal). So the value of the second lowest byte is 3 or 4. The
highest value with 3 as second lowest byte is 3FF hexadecimal (=1023
decimal). That range covers floating point numbers (P=0 for
double-precision 64-bit or P=1 for single-precision 32-bit) and
32-bit signed integers (P=2). The value 4 as second lowest byte only
occurs for 16-bit signed integers (P=3), 16-bit unsigned integers
(P=4) and 8-bit unsigned integers (P=5).
According to the documentation for little endian (PC, 386, 486, DEC
RISC) machines the M value is 0. That means the highest type value is
52 (=34 hexadecimal).
This is used to display information about the machine type (big
endian, for example, in the same manner as for level 5) by lines like:

  #>0	ubelong		x	\b, type flag %u
  #>0	ubelong		x	(0x%x)
  #>0	ubelong/1000	x	\b, M=%u
  >0	ubelong/1000	0	(little endian)
  >0	ubelong/1000	1	(big endian)
  >0	ubelong/1000	2	(VAX D-float)
  >0	ubelong/1000	3	(VAX G-float)
  >0	ubelong/1000	4	(Cray)

Furthermore this information is used as the third test, to skip some
CD-ROM filesystems like test-hfs.iso that have many nil values at the
right positions, by lines like:
  >>0	ubelong&0xFFffFF00	0x00000300
  >>>0	use	matlab4
  >>0	ubelong&0xFFffFF00	0x00000400
  >>>0	use	matlab4
  >>0	ulelong		x
  >>0	ulelong		<53
  >>>0	use	\^matlab4
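
The claim that every valid big endian flag passes one of the two mask
tests, and every valid little endian flag is below 53, can be checked
by enumeration. This Python sketch assumes O=0, P in 0..5 and T in
0..2, as described above. The helper names are hypothetical:

```python
def valid_type_flags(m: int) -> list:
    """All MOPT values for machine digit m with O=0, P in 0..5, T in 0..2."""
    return [1000 * m + 10 * p + t for p in range(6) for t in range(3)]


def passes_big_endian_mask(flag: int) -> bool:
    """Mirror of the two ubelong&0xFFffFF00 tests."""
    return (flag & 0xFFFFFF00) in (0x00000300, 0x00000400)


big = valid_type_flags(1)
little = valid_type_flags(0)
print(min(big), max(big), all(passes_big_endian_mask(f) for f in big))
print(max(little), all(f < 53 for f in little))
```

The enumeration says nothing about how often unrelated files pass the
same tests. That still depends on the other two test lines.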

At offset 20 the null terminated matrix name is stored as an ASCII
string (like testmatrix testsparsecomplex teststringarray testcomplex)
and at offset 16 the length of this string is stored as a 4-byte
integer. So the matrix name is shown by lines like:
  #>16	ubelong		x	\b, name length %u
  #>20	string		x	\b, MATRIX NAME="%s"
  >16	pstring/L	x	%s
The existence of a valid printable ASCII matrix name is used as the
second test by a line like:
  >20	ubyte	>0x1F

At offset 4 the number of rows in the matrix is stored as a 4-byte
integer (like: 1 3 8). At offset 8 the number of columns in the
matrix is stored as a 4-byte integer (like 1 3 4 5 9 43). So the
matrix dimensions are shown by lines like:
  >4	ubelong		x	\b, rows %u
  >8	ubelong		x	\b, columns %u

At offset 12 the imaginary flag is stored as a 4-byte integer. If this
is 1, then the matrix has an imaginary part. If 0, there is only real
data. So this information is printed for a matrix that is not purely
real by a line like:
  >12	ubelong		!0	\b, imaginary
Depending on the byte order the value 1 occurs in the byte at offset
12 or 15, which means that the two middle bytes (offsets 13 and 14)
are nil for both endian variants.
That is used as the first test by a line like:
  13	ushort	0

I hope that these 3 test lines are specific enough to identify MAT
level 4 files. According to the specification the header looks
different for VAX and Cray machines. So maybe other test conditions
must be created for such machine types.
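
Putting the three tests and the header fields together, the level 4
recognition can be sketched in Python as below. This is only an
illustration of the logic, not a replacement for the magic lines.
The function name sniff_mat4 is hypothetical, and VAX and Cray
headers are not handled:

```python
import struct


def sniff_mat4(data: bytes):
    """Return the header fields of a level 4 MAT-file, or None."""
    if len(data) < 21:
        return None
    # Test 1: the two middle bytes of the imaginary flag must be nil.
    if data[13:15] != b"\x00\x00":
        return None
    # Test 2: the matrix name must not start with a control character.
    if data[20] <= 0x1F:
        return None
    # Test 3: plausible type flag, big endian first, then little endian.
    (big,) = struct.unpack_from(">I", data, 0)
    (little,) = struct.unpack_from("<I", data, 0)
    if (big & 0xFFFFFF00) in (0x00000300, 0x00000400):
        order, endian = ">", "big endian"
    elif little < 53:
        order, endian = "<", "little endian"
    else:
        return None
    flag, rows, cols, imagf, namlen = struct.unpack_from(order + "5I", data, 0)
    # namlen counts the terminating NUL, so cut the name at the first NUL.
    raw_name = data[20:20 + namlen].split(b"\x00", 1)[0]
    return {
        "endian": endian,
        "type_flag": flag,
        "name": raw_name.decode("ascii", errors="replace"),
        "rows": rows,
        "columns": cols,
        "imaginary": imagf != 0,
    }
```

A buffer of nil bytes, as found at the start of many ISO images, is
rejected by the second test because the byte at offset 20 is 0.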

After applying the above mentioned modifications by patches
file-5.40-mathematica-matlab.diff and file-5.40-images-matlab.diff
all matrix examples and Hierarchical Data Format (HDF) images
are recognized and described in more detail, and some
misidentifications have vanished:

abydos.h5:                 Hierarchical Data Format (version 5) data
big_endian.mat:            Matlab v5 mat-file (big endian)
			   version 0x0100, platform Windows 7,
			   created Tue Feb 26 11:20:36 GMT
input_256.hdf:             Hierarchical Data Format (version 4) data
malformed1.mat:            Matlab v5 mat-file (little endian)
			   version 0x0100, platform nt,
			   created Thu Mar 24 17:53:52 2016
miuint32_for_miint32.mat:  Matlab v5 mat-file (little endian)
			   version 0x0100, platform posix,
			   created Sat Jan 31 13:15:43 2015
one_by_zero_char.mat:      Matlab v5 mat-file (little endian)
			   version 0x0100
			   "MAT-file, written by Octave 3.2.3,
			   2011-01-25 19:30:48 UTC"
ReactOS-LiveCD.iso:        ISO 9660 CD-ROM filesystem data
			   'REACTOS' (bootable)
test-hfs.iso:              ISO 9660 CD-ROM filesystem data
			   (DOS/MBR boot sector)
			   'test-hfs-cdrom-hybrid'
testbool_8_WIN64.mat:      Matlab v5 mat-file (little endian)
			   version 0x0100, platform PCWIN64,
			   created Fri Apr 12 16:18:43 2013
testcell_6.1_SOL2.mat:     Matlab v5 mat-file (big endian)
			   version 0x0100, platform SOL2,
			   created Sat Aug 19 09:37:19 2006
testcomplex_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
			   testcomplex, numeric, rows 1, columns 9,
			   imaginary
testhdf5_7.4_GLNX86.mat:   Matlab v7.0 mat-file (little endian)
			   version 0x0200, platform GLNX86,
			   created Sat Oct  4 19:01:58 2008
testsparse_4.2c_SOL2.mat:  Matlab v4 mat-file (big endian)
			   testsparse, sparse, rows 8, columns 3
teststring_4.2c_SOL2.mat:  Matlab v4 mat-file (big endian)
			   teststring, text, rows 1, columns 43
testvec_4_GLNX86.mat:      Matlab v4 mat-file (little endian)
			   fit_params, numeric, rows 2, columns 1

I hope my 2 diff files can be applied in a future version of the file
utility.

Furthermore many examples like ReactOS-LiveCD.iso and test-hfs.iso
are still misidentified by the subroutine diy-thermocam-checker
inside Magdir/measure as "(Lepton 3.x)" and "(Lepton 2.x)". This
subroutine still gives too many false hits.

With best wishes
Jörg Jenderek
--
Jörg Jenderek

-------------- next part --------------
--- file-5.40/magic/Magdir/images.old	2021-02-22 23:49:24 +0000
+++ file-5.40/magic/Magdir/images	2021-07-07 14:43:58 +0000
@@ -1450,17 +1450,32 @@
 # Hierarchical Data Format, used to facilitate scientific data exchange
 # specifications at http://hdf.ncsa.uiuc.edu/
+# URL: 		http://fileformats.archiveteam.org/wiki/HDF
+#		https://en.wikipedia.org/wiki/Hierarchical_Data_Format
+# Reference:	https://portal.hdfgroup.org/download/attachments/52627880/HDF5_File_Format_Specification_Version-3.0.pdf
 0	belong	0x0e031301	Hierarchical Data Format (version 4) data
 !:mime	application/x-hdf
+!:ext	hdf/hdf4/h4
 0	string	\211HDF\r\n\032\n	Hierarchical Data Format (version 5) data
-!:mime	application/x-hdf
-512	string	\211HDF\r\n\032\n	Hierarchical Data Format (version 5) with 512 bytes user block
-!:mime	application/x-hdf
+#!:mime	application/x-hdf
+!:mime	application/x-hdf5
+!:ext	h5/hdf5/hdf/he5
+512	string	\211HDF\r\n\032\n
+# skip Matlab v5 mat-file testhdf5_7.4_GLNX86.mat handled by ./mathematica
+>0	string	!MATLAB			Hierarchical Data Format (version 5) with 512 bytes user block
+#!:mime	application/x-hdf
+!:mime	application/x-hdf5
+!:ext	h5/hdf5/hdf/he5
 1024	string	\211HDF\r\n\032\n	Hierarchical Data Format (version 5) with 1k user block
-!:mime	application/x-hdf
+#!:mime	application/x-hdf
+!:mime	application/x-hdf5
+!:ext	h5/hdf5/hdf/he5
 2048	string	\211HDF\r\n\032\n	Hierarchical Data Format (version 5) with 2k user block
-!:mime	application/x-hdf
+#!:mime	application/x-hdf
+!:mime	application/x-hdf5
+!:ext	h5/hdf5/hdf/he5
 4096	string	\211HDF\r\n\032\n	Hierarchical Data Format (version 5) with 4k user block
-!:mime	application/x-hdf
-
+#!:mime	application/x-hdf
+!:mime	application/x-hdf5
+!:ext	h5/hdf5/hdf/he5
 
 # From: Tobias Burnus <burnus at net-b.de>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.40-images-hdf.diff.sig
Type: application/octet-stream
Size: 763 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20210710/ee22fa2c/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MAT-trid-v.txt.gz
Type: application/x-gzip
Size: 1541 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20210710/ee22fa2c/attachment-0001.bin>
-------------- next part --------------
--- file-5.40/magic/Magdir/mathematica.old	2021-02-22 23:51:10 +0000
+++ file-5.40/magic/Magdir/mathematica	2021-07-10 15:58:18 +0000
@@ -74,8 +74,90 @@
 #########################
 # MatLab v5
-0       string  MATLAB  Matlab v5 mat-file
+# URL:		http://fileformats.archiveteam.org/wiki/MAT
+# Reference:	https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf
+# first 116 bytes of header contain text in human-readable form
+0       string  MATLAB  Matlab v
+#>11	string/T	x	\b, at 11 "%.105s"
+#!:mime	application/octet-stream
+!:mime	application/x-matlab-data
+!:ext	mat
+#	https://de.mathworks.com/help/matlab/import_export/mat-file-versions.html
+# level of the MAT-file like: 5.0 7.0 or maybe 7.3
+#>7	string	x	LEVEL "%.3s"
+>7	ubyte	=0x35	\b5 mat-file
+>7	ubyte	!0x35
+>>7	string	x	\b%.3s mat-file
 >126    short   0x494d  (big endian)
 >>124   beshort x       version 0x%04x
 >126    short   0x4d49  (little endian)
+# 0x0100 for level 5.0 and 0x0200 for level 7.0
 >>124   leshort x       version 0x%04x
+# test again so that default clause works
+>126	short	x
+# created by MATLAB include Platform sometimes without leading comma (0x2C) or missing
+# like: GLNX86 PCWIN PCWIN64 SOL2 Windows\0407 nt posix
+>>20	search/2	Platform:\040	\b, platform
+>>>&0	string		x		%-0.2s
+>>>&2		ubyte	!0x2C		\b%c
+>>>>&0		ubyte	!0x2C		\b%c
+>>>>>&0		ubyte	!0x2C		\b%c
+>>>>>>&0	ubyte	!0x2C		\b%c
+>>>>>>>&0	ubyte	!0x2C		\b%c
+>>>>>>>>&0	ubyte	!0x2C		\b%c
+>>>>>>>>>&0	ubyte	!0x2C		\b%c
+# examples without Platform tag like one_by_zero_char.mat
+>>20	default		x
+>>>11	string		x	"%s"
+# created by MATLAB include time like: Fri Feb 20 15:26:59 2009
+>34	search/9/c	created\040on:\040	\b, created
+>>&0	string	x		%-.24s
+#	MatLab v4
+# From:	Joerg Jenderek
+# check for valid imaginary flag of Matlab matrix version 4
+13	ushort	0
+# check for valid ASCII matrix name
+>20	ubyte	>0x1F
+# skip some CD-ROM filesystem like test-hfs.iso by looking for valid big endian type flag
+>>0	ubelong&0xFFffFF00	0x00000300
+>>>0	use	matlab4
+# no example for 8-bit and 16-bit integers matrix
+>>0	ubelong&0xFFffFF00	0x00000400
+>>>0	use	matlab4
+>>0	ulelong		x
+# skip big endian variant by looking for valid low little endian type flag
+>>0	ulelong		<53
+>>>0	use	\^matlab4
+#	display information of Matlab v4 mat-file
+0	name	matlab4		Matlab v4 mat-file
+#!:mime	application/octet-stream
+!:mime	application/x-matlab-data
+!:ext	mat
+# 20-byte header with 5 long integers that contains information describing certain attributes of the Matrix
+# type flag decimal MOPT; maximal 4052=FD4h; maximal 52=34h for little endian
+#>0	ubelong		x	\b, type flag %u
+#>0	ubelong		x	(0x%x)
+# M: 0~little endian 1~Big Endian 2~VAX D-float 3~VAX G-float 4~Cray
+#>0	ubelong/1000	x	\b, M=%u
+>0	ubelong/1000	0	(little endian)
+>0	ubelong/1000	1	(big endian)
+>0	ubelong/1000	2	(VAX D-float)
+>0	ubelong/1000	3	(VAX G-float)
+>0	ubelong/1000	4	(Cray)
+# namlen; the length of the matrix name
+#>16	ubelong		x	\b, name length %u
+# nul terminated matrix name like: fit_params testmatrix testsparsecomplex teststringarray
+#>20	string		x	\b, MATRIX NAME="%s"
+>16	pstring/L	x	%s
+# T indicates the matrix type: 0~numeric 1~text 2~sparse
+#>0	ubelong%10	x	\b, T=%u
+>0	ubelong%10	0	\b, numeric
+>0	ubelong%10	1	\b, text
+>0	ubelong%10	2	\b, sparse
+# mrows; number of rows in the matrix like: 1 3 8
+>4	ubelong		x	\b, rows %u
+# ncols; number of columns in the matrix like: 1 3 4 5 9 43
+>8	ubelong		x	\b, columns %u
+# imagf; imaginary flag; 1~matrix has an imaginary part 0~only real data
+>12	ubelong		!0	\b, imaginary
+# real; Real part of the matrix consists of mrows * ncols numbers
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.40-mathematica-mat.diff.sig
Type: application/octet-stream
Size: 1759 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20210710/ee22fa2c/attachment-0003.obj>

