[File] [PATCH] of Magdir/mathematica, images Matlab matfile *.mat, Hierarchical Data Format *.hdf
Jörg Jenderek
joerg.jen.der.ek at gmx.net
Sat Jul 10 16:11:50 UTC 2021
Hello,
some days ago I inspected some Matlab examples with the file name
extension mat.
When running the file command version 5.40 on such examples and
related files with the -k option, I get output like:
abydos.h5: Hierarchical Data Format (version 5) data
big_endian.mat: Matlab v5 matfile
    (big endian) version 0x0100
input_256.hdf: Hierarchical Data Format (version 4) data
malformed1.mat: Matlab v5 matfile
    (little endian) version 0x0100
miuint32_for_miint32.mat: Matlab v5 matfile
    (little endian) version 0x0100
one_by_zero_char.mat: Matlab v5 matfile
    (little endian) version 0x0100
ReactOSLiveCD.iso: ISO 9660 CDROM filesystem data
    'REACTOS' (bootable)
    (Lepton 3.x), scale 00,
    (Lepton 2.x), scale 00,
testhfs.iso: ISO 9660 CDROM filesystem data
    (DOS/MBR boot sector)
    'testhfscdromhybrid'
    (Lepton 3.x), scale 00,
    (Lepton 2.x), scale 00,
testbool_8_WIN64.mat: Matlab v5 matfile
    (little endian) version 0x0100
testcell_6.1_SOL2.mat: Matlab v5 matfile
    (big endian) version 0x0100
testcomplex_4.2c_SOL2.mat: data
testhdf5_7.4_GLNX86.mat: Hierarchical Data Format (version 5)
    with 512 bytes user block
    Matlab v5 matfile
    (little endian) version 0x0200
testsparse_4.2c_SOL2.mat: data
teststring_4.2c_SOL2.mat: data
testvec_4_GLNX86.mat: data
Most MAT examples like one_by_zero_char.mat are described correctly
by Magdir/mathematica as "Matlab v5 matfile".
But a few examples like testcomplex_4.2c_SOL2.mat are only described
as "data".
Furthermore, with the --extension option only ??? is displayed, and
with the -i option only the generic application/octet-stream is shown
for MAT examples.
For comparison I ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html).
A few MAT examples like malformed1.mat, miuint32_for_miint32.mat and
one_by_zero_char.mat, which are described correctly by the file
command, are misidentified by TrID as "SMS Material" by matsms.trid.xml.
Most examples like testcomplex_4.2c_SOL2.mat,
testsparse_4.2c_SOL2.mat and teststring_4.2c_SOL2.mat, which are
described as "data" by the file command, are described by TrID as "Matlab
Level 4 MATFile (bigendian)" by matl4be.trid.xml (see appended
MATtridv.txt.gz). TrID also displays a related URL and the file name
extension.
MAT examples are detected by lines inside Magdir/mathematica like:
0 string MATLAB Matlab v5 matfile
>126 short 0x494d (big endian)
>>124 beshort x version 0x%04x
>126 short 0x4d49 (little endian)
>>124 leshort x version 0x%04x
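For illustration only (this is not part of the patch), the same offsets can be checked outside of the file command with a short Python sketch. The function name describe_mat5 is made up here, and the sketch assumes the 128-byte level 5 header layout that these magic lines rely on: a 2-byte version at offset 124 and the 2-byte endian indicator "MI" at offset 126, which appears as "IM" when written by a little endian machine.

```python
import struct

def describe_mat5(header: bytes):
    """Return (endianness, version) for a level 5 style MAT header, or None.

    Hypothetical helper for illustration; it mirrors the magic lines above.
    """
    if len(header) < 128 or not header.startswith(b"MATLAB"):
        return None
    indicator = header[126:128]
    if indicator == b"MI":      # 0x4d49 stored high byte first
        endian, fmt = "big endian", ">H"
    elif indicator == b"IM":    # same value stored low byte first
        endian, fmt = "little endian", "<H"
    else:
        return None
    (version,) = struct.unpack(fmt, header[124:126])
    return endian, version

# Build a synthetic little endian header instead of reading a real file.
text = b"MATLAB 5.0 MAT-file, Platform: PCWIN64".ljust(124, b" ")
sample = text + struct.pack("<H", 0x0100) + b"IM"
endian, version = describe_mat5(sample)
print(f"Matlab v5 matfile ({endian}) version 0x{version:04x}")
```

The sample header is synthetic, so the sketch runs without any of the example files mentioned in this mail.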
For the MAT examples, TrID mentions a page about MAT on the Archive
Team file formats website as related URL. That page mentions the
MAT-File Format documentation matfile_format.pdf. This
information is now expressed by additional comment lines like:
# URL: http://fileformats.archiveteam.org/wiki/MAT
# Reference:
# https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf
According to the documentation, the first 116 bytes of the header can
contain text data in human-readable form. This text typically
provides information that describes how the MAT-file was created.
MAT-files created by MATLAB include the following information in
their headers:
1) Level of the MAT-file
2) Platform on which the file was created
3) Date and time the file was created
This often looks like the following string:
MATLAB 5.0 MAT-file, Platform: SOL2, Created on: Thu Nov 13
10:10:27 1997
The file command only tests for the starting keyword MATLAB, whereas the
TrID command looks at more bytes. So I look for the platform tag
part (which is like: GLNX86 PCWIN PCWIN64 SOL2 Windows_7 nt posix)
and for the creation time. In a few examples like malformed1.mat
and miuint32_for_miint32.mat the leading comma (0x2C) before the
platform part is missing. And in one example not created by MATLAB,
one_by_zero_char.mat, the leading ASCII string looks like
"MATLAB 5.0 MAT-file, written by Octave 3.2.3, 2011-01-25 19:30:48
UTC".
So here the platform part is missing and the creation time is stored in
another format. That information is now shown by additional lines like:
>>20 search/2 Platform:\040 \b, platform
>>>&0 string x %0.2s
>>>&2 ubyte !0x2C \b%c
>>>>&0 ubyte !0x2C \b%c
>>>>>&0 ubyte !0x2C \b%c
>>>>>>&0 ubyte !0x2C \b%c
>>>>>>>&0 ubyte !0x2C \b%c
>>>>>>>>&0 ubyte !0x2C \b%c
>>>>>>>>>&0 ubyte !0x2C \b%c
>>20 default x
>>>11 string x "%s"
>34 search/9/c created\040on:\040 \b, created
>>&0 string x %.24s
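The intent of these lines can be sketched in Python with two regular expressions. This is only an approximation for illustration: the names PLATFORM_RE, CREATED_RE and header_fields are hypothetical, the second sample string is constructed from the description above (no comma before "Platform:") rather than copied from malformed1.mat, and unlike the magic lines the sketch does not cap the platform name at 9 characters.

```python
import re

# Hypothetical patterns for illustration; the comma before "Platform:" is optional.
PLATFORM_RE = re.compile(rb"Platform: ([^,]+)")
CREATED_RE = re.compile(rb"created on: (.{1,24})", re.IGNORECASE)

def header_fields(text: bytes):
    """Extract platform and creation time from the descriptive header text."""
    text = text[:116]  # only the first 116 bytes hold descriptive text
    platform = PLATFORM_RE.search(text)
    created = CREATED_RE.search(text)
    return (
        platform.group(1).decode("ascii", "replace").strip() if platform else None,
        created.group(1).decode("ascii", "replace").strip() if created else None,
    )

samples = [
    b"MATLAB 5.0 MAT-file, Platform: SOL2, Created on: Thu Nov 13 10:10:27 1997",
    b"MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Mar 24 17:53:52 2016",
    b"MATLAB 5.0 MAT-file, written by Octave 3.2.3, 2011-01-25 19:30:48 UTC",
]
for s in samples:
    print(header_fields(s))
```

For the Octave style header both fields come back as None, which corresponds to the default branch that prints the raw text instead.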
One MAT example, testhdf5_7.4_GLNX86.mat, was not identified by TrID
because it starts with the ASCII string "MATLAB 7.0" instead of the
string "MATLAB 5.0" found in the other examples. So this is a variant
with the higher version level 7. This is also visible in the hexadecimal
version, which is 0x0200 in that case, whereas for level 5 this value is
0x0100. So this example should be described as something like
"Matlab v7 matfile" instead of "Matlab v5 matfile". This is now done by
lines like:
0 string MATLAB Matlab v
!:mime application/x-matlab-data
!:ext mat
>7 ubyte =0x35 \b5 matfile
>7 ubyte !0x35
>>7 string x \b%.3s matfile
Instead of the generic application/octet-stream, the mentioned MIME type
application/x-matlab-data is now shown, and the file name
extension mat is now also displayed.
This MAT example testhdf5_7.4_GLNX86.mat was identified first as
"Hierarchical Data Format (version 5) with 512 bytes user block"
by Magdir/images with lines like:
512 string \211HDF\r\n\032\n Hierarchical Data Format \
(version 5) with 512 bytes user block
!:mime application/x-hdf
After inspecting more details of the MAT file, it becomes clear that this
example is really a matrix file that just tests some HDF aspects.
Therefore it also contains the short HDF pattern at a suitable position.
So I skip HDF recognition of such examples by looking for MATLAB
characteristics. The above lines now become:
512 string \211HDF\r\n\032\n
>0 string !MATLAB Hierarchical Data Format \
(version 5) with 512 bytes user block
!:mime application/x-hdf5
!:ext h5/hdf5/hdf/he5
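The order of these tests can be illustrated with a small Python sketch. It is not part of the patch: the function name classify and the returned strings are invented here, and the sketch only reproduces the idea that an HDF5 signature at offset 512 is not reported as HDF when the file starts with MATLAB.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"   # same bytes as \211HDF\r\n\032\n

def classify(data: bytes) -> str:
    """Rough classification mirroring the tests described above."""
    if data.startswith(HDF5_SIGNATURE):
        return "HDF5"
    for offset in (512, 1024, 2048, 4096):
        if data[offset:offset + 8] == HDF5_SIGNATURE:
            # Only the 512 byte variant is guarded, as in the patched lines.
            if offset == 512 and data.startswith(b"MATLAB"):
                return "Matlab matfile with embedded HDF5"
            return f"HDF5 with {offset} bytes user block"
    return "unknown"

# Synthetic inputs: a plain user block and a MATLAB style user block.
plain = b"\x00" * 512 + HDF5_SIGNATURE
matlab = b"MATLAB 7.0 MAT-file".ljust(512, b" ") + HDF5_SIGNATURE
print(classify(plain))
print(classify(matlab))
```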
According to Wikipedia, I now show four extensions for version 5 and
three for version 4, but in my examples I found only the hdf extension
for version 4 and the h5 extension for version 5. For version 5 the MIME
type application/x-hdf5 is used instead of application/x-hdf.
The mentioned link to hdf.ncsa.uiuc.edu does not exist any more, so I
add the URL of the Wikipedia page about HDF. This is now expressed by
comment lines like:
# URL: http://fileformats.archiveteam.org/wiki/HDF
# https://en.wikipedia.org/wiki/Hierarchical_Data_Format
Beside the Level 5 MAT-File Format, the MAT-File Format documentation
matfile_format.pdf also explains the older Level 4 MAT-File Format.
So I see that the unrecognized ("data") MAT samples are
just older level 4 examples.
Unfortunately level 4 MAT files have no significant magic pattern. So
I put the displaying part inside a subroutine named matlab4 and then add
enough test lines to identify such matrices unambiguously. The
subroutine starts with lines displaying text similar to that for
level 5 matfiles, like:
0 name matlab4 Matlab v4 matfile
!:mime application/x-matlab-data
!:ext mat
According to the specification, such MAT files start with a 20-byte
header of 5 long integers that describe certain
attributes of the matrix.
At offset 0 the type flag is stored as a 4-byte integer whose byte order
depends on the endianness. In decimal that type integer is represented
as MOPT, where M counts the thousands and indicates the numeric format
of numbers on the machine. The biggest possible value is 4052 (=0xFD4).
That means the 2 upper bytes are always 0.
For big endian (that means Macintosh, SPARC, Apollo, SGI, HP
9000/300, other Motorola systems) the M value is 1. So the lowest flag
value is 1000 (=3E8 hexadecimal) and the highest value is 1052 (=41C
hexadecimal). The highest hexadecimal value with 3 as second-lowest byte
is 3FF (=1023 decimal). That covers floating point numbers (P=0
for double-precision 64-bit or P=1 for single-precision 32-bit) and
32-bit integers (P=2). So the value of the second-lowest byte is 3 or 4,
and the value 4 only occurs for 16-bit signed integers (P=3), 16-bit
unsigned integers (P=4) and 8-bit unsigned integers (P=5).
According to the documentation, for little endian machines (PC, 386,
486, DEC RISC) the M value is 0. That means the highest type value is
52 (=34 hexadecimal).
This is used to display the machine type (for example big endian, in
the same manner as for level 5) by lines like:
#>0 ubelong x \b, type flag %u
#>0 ubelong x (0x%x)
#>0 ubelong/1000 x \b, M=%u
>0 ubelong/1000 0 (little endian)
>0 ubelong/1000 1 (big endian)
>0 ubelong/1000 2 (VAX Dfloat)
>0 ubelong/1000 3 (VAX Gfloat)
>0 ubelong/1000 4 (Cray)
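The arithmetic behind the type flag can be checked with a few lines of Python. This is a worked example only; the function name decode_type_flag is hypothetical, and the digit meanings are the ones given in this mail (M machine, O reserved, P number format, T matrix type).

```python
def decode_type_flag(flag: int):
    """Split the decimal type flag MOPT into its digits M, O, P, T."""
    m, rest = divmod(flag, 1000)
    o, rest = divmod(rest, 100)
    p, t = divmod(rest, 10)
    return m, o, p, t

# Boundary values mentioned in the text, with their hexadecimal form.
for flag in (0, 52, 1000, 1022, 1030, 1052, 4052):
    m, o, p, t = decode_type_flag(flag)
    print(f"{flag:4d} = 0x{flag:03X} -> M={m} O={o} P={p} T={t}")
```

The loop shows that big endian flags with P of 0 to 2 keep 3 in the second-lowest byte, while P of 3 to 5 moves it to 4, which is what the mask tests in the patch rely on.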
Furthermore, this information is used as the third test, to skip some
CDROM filesystem images like testhfs.iso that have many nil bytes at the
relevant positions, by lines like:
>>0 ubelong&0xFFffFF00 0x00000300
>>>0 use matlab4
>>0 ubelong&0xFFffFF00 0x00000400
>>>0 use matlab4
>>0 ulelong x
>>0 ulelong <53
>>>0 use \^matlab4
At offset 20 the nul-terminated matrix name is stored as an ASCII string
(like testmatrix testsparsecomplex teststringarray testcomplex), and
at offset 16 the length of this string is stored as a 4-byte integer.
So the matrix name is shown by lines like:
#>16 ubelong x \b, name length %u
#>20 string x \b, MATRIX NAME="%s"
>16 pstring/L x %s
The existence of a valid printable ASCII matrix name is used as the
second test by a line like:
>20 ubyte >0x1F
At offset 4 the number of rows in the matrix is stored as a 4-byte
integer (like: 1 3 8). At offset 8 the number of columns in the
matrix is stored as a 4-byte integer (like: 1 3 4 5 9 43). So the matrix
dimensions are shown by lines like:
>4 ubelong x \b, rows %u
>8 ubelong x \b, columns %u
At offset 12 the imaginary flag is stored as a 4-byte integer. If this
is 1, then the matrix has an imaginary part. If 0, there is only real
data. So this is printed for a matrix that is not purely real (that is,
one with an imaginary part) by a line like:
>12 ubelong !0 \b, imaginary
Because of endianness the value 1 can occur in the byte at offset 12 or
15, but that also means that the two middle bytes are nil for both
endian variants.
That information is used as the first test line:
13 ushort 0
I hope that these 3 test lines are specific enough to identify MAT
level 4 files. According to the specification, the header looks
different for VAX and Cray machines. So for such machine types other
test conditions may have to be created.
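Taken together, the three tests and the header fields can be sketched in Python as follows. This is a rough illustration under the assumptions stated in this mail (20-byte header of five 4-byte integers, name at offset 20), not a replacement for the magic lines; the function name looks_like_mat4 and the sample header are made up for the example, and VAX and Cray variants are not handled.

```python
import struct

def looks_like_mat4(data: bytes):
    """Heuristic mirroring the three tests above; returns a dict or None."""
    if len(data) < 21:
        return None
    if data[13:15] != b"\x00\x00":      # test 1: middle bytes of imaginary flag are nil
        return None
    if data[20] <= 0x1F:                # test 2: name starts with a printable byte
        return None
    big = struct.unpack(">I", data[:4])[0]
    little = struct.unpack("<I", data[:4])[0]
    if (big & 0xFFFFFF00) in (0x300, 0x400):   # test 3: plausible big endian type flag
        order = ">"
    elif little < 53:                          # or a plausible little endian type flag
        order = "<"
    else:
        return None
    flag, rows, cols, imagf, namlen = struct.unpack(order + "5I", data[:20])
    name = data[20:20 + namlen].split(b"\x00")[0].decode("ascii", "replace")
    return {"endian": "big" if order == ">" else "little", "type": flag,
            "rows": rows, "columns": cols, "imaginary": imagf != 0, "name": name}

# Synthetic big endian header resembling the testcomplex example.
header = struct.pack(">5I", 1000, 1, 9, 1, 12) + b"testcomplex\x00"
print(looks_like_mat4(header))
```

Like the magic lines, this is a heuristic: other binary data with nil bytes in the right places and a small first integer can still pass.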
After applying the above-mentioned modifications by the patches
file5.40mathematicamatlab.diff and file5.40imagesmatlab.diff,
all matrix examples and Hierarchical Data Format (HDF) images
are recognized and described in more detail, and some
misidentifications vanish, like:
abydos.h5: Hierarchical Data Format (version 5) data
big_endian.mat: Matlab v5 matfile (big endian)
    version 0x0100, platform Windows 7,
    created Tue Feb 26 11:20:36 GMT
input_256.hdf: Hierarchical Data Format (version 4) data
malformed1.mat: Matlab v5 matfile (little endian)
    version 0x0100, platform nt,
    created Thu Mar 24 17:53:52 2016
miuint32_for_miint32.mat: Matlab v5 matfile (little endian)
    version 0x0100, platform posix,
    created Sat Jan 31 13:15:43 2015
one_by_zero_char.mat: Matlab v5 matfile (little endian)
    version 0x0100
    "MAT-file, written by Octave 3.2.3,
    2011-01-25 19:30:48 UTC"
ReactOSLiveCD.iso: ISO 9660 CDROM filesystem data
    'REACTOS' (bootable)
testhfs.iso: ISO 9660 CDROM filesystem data
    (DOS/MBR boot sector)
    'testhfscdromhybrid'
testbool_8_WIN64.mat: Matlab v5 matfile (little endian)
    version 0x0100, platform PCWIN64,
    created Fri Apr 12 16:18:43 2013
testcell_6.1_SOL2.mat: Matlab v5 matfile (big endian)
    version 0x0100, platform SOL2,
    created Sat Aug 19 09:37:19 2006
testcomplex_4.2c_SOL2.mat: Matlab v4 matfile (big endian)
    testcomplex, numeric, rows 1, columns 9,
    imaginary
testhdf5_7.4_GLNX86.mat: Matlab v7.0 matfile (little endian)
    version 0x0200, platform GLNX86,
    created Sat Oct 4 19:01:58 2008
testsparse_4.2c_SOL2.mat: Matlab v4 matfile (big endian)
    testsparse, sparse, rows 8, columns 3
teststring_4.2c_SOL2.mat: Matlab v4 matfile (big endian)
    teststring, text, rows 1, columns 43
testvec_4_GLNX86.mat: Matlab v4 matfile (little endian)
    fit_params, numeric, rows 2, columns 1
I hope my 2 diff files can be applied in a future version of the file
utility.
Furthermore, many examples like ReactOSLiveCD.iso and testhfs.iso
are still misidentified by the subroutine diythermocamchecker
inside Magdir/measure as "(Lepton 3.x)" and "(Lepton 2.x)". This
subroutine still gives too many false hits.
With best wishes
Jörg Jenderek

-------------- next part --------------
--- file-5.40/magic/Magdir/images.old 2021-02-22 23:49:24 +0000
+++ file-5.40/magic/Magdir/images 2021-07-07 14:43:58 +0000
@@ -1450,17 +1450,32 @@
# Hierarchical Data Format, used to facilitate scientific data exchange
# specifications at http://hdf.ncsa.uiuc.edu/
+# URL: http://fileformats.archiveteam.org/wiki/HDF
+# https://en.wikipedia.org/wiki/Hierarchical_Data_Format
+# Reference: https://portal.hdfgroup.org/download/attachments/52627880/HDF5_File_Format_Specification_Version3.0.pdf
0 belong 0x0e031301 Hierarchical Data Format (version 4) data
!:mime application/x-hdf
+!:ext hdf/hdf4/h4
0 string \211HDF\r\n\032\n Hierarchical Data Format (version 5) data
-!:mime application/x-hdf
-512 string \211HDF\r\n\032\n Hierarchical Data Format (version 5) with 512 bytes user block
-!:mime application/x-hdf
+#!:mime application/x-hdf
+!:mime application/x-hdf5
+!:ext h5/hdf5/hdf/he5
+512 string \211HDF\r\n\032\n
+# skip Matlab v5 matfile testhdf5_7.4_GLNX86.mat handled by ./mathematica
+>0 string !MATLAB Hierarchical Data Format (version 5) with 512 bytes user block
+#!:mime application/x-hdf
+!:mime application/x-hdf5
+!:ext h5/hdf5/hdf/he5
1024 string \211HDF\r\n\032\n Hierarchical Data Format (version 5) with 1k user block
-!:mime application/x-hdf
+#!:mime application/x-hdf
+!:mime application/x-hdf5
+!:ext h5/hdf5/hdf/he5
2048 string \211HDF\r\n\032\n Hierarchical Data Format (version 5) with 2k user block
-!:mime application/x-hdf
+#!:mime application/x-hdf
+!:mime application/x-hdf5
+!:ext h5/hdf5/hdf/he5
4096 string \211HDF\r\n\032\n Hierarchical Data Format (version 5) with 4k user block
-!:mime application/x-hdf

+#!:mime application/x-hdf
+!:mime application/x-hdf5
+!:ext h5/hdf5/hdf/he5
# From: Tobias Burnus <burnus at netb.de>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MATtridv.txt.gz
Type: application/x-gzip
Size: 1541 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20210710/ee22fa2c/attachment-0001.bin>
-------------- next part --------------
--- file-5.40/magic/Magdir/mathematica.old 2021-02-22 23:51:10 +0000
+++ file-5.40/magic/Magdir/mathematica 2021-07-10 15:58:18 +0000
@@ -74,8 +74,90 @@
#########################
# MatLab v5
-0 string MATLAB Matlab v5 matfile
+# URL: http://fileformats.archiveteam.org/wiki/MAT
+# Reference: https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf
+# first 116 bytes of header contain text in human-readable form
+0 string MATLAB Matlab v
+#>11 string/T x \b, at 11 "%.105s"
+#!:mime application/octet-stream
+!:mime application/x-matlab-data
+!:ext mat
+# https://de.mathworks.com/help/matlab/import_export/mat-file-versions.html
+# level of the MAT-file like: 5.0 7.0 or maybe 7.3
+#>7 string x LEVEL "%.3s"
+>7 ubyte =0x35 \b5 matfile
+>7 ubyte !0x35
+>>7 string x \b%.3s matfile
>126 short 0x494d (big endian)
>>124 beshort x version 0x%04x
>126 short 0x4d49 (little endian)
+# 0x0100 for level 5.0 and 0x0200 for level 7.0
>>124 leshort x version 0x%04x
+# test again so that default clause works
+>126 short x
+# created by MATLAB include Platform sometimes without leading comma (0x2C) or missing
+# like: GLNX86 PCWIN PCWIN64 SOL2 Windows\0407 nt posix
+>>20 search/2 Platform:\040 \b, platform
+>>>&0 string x %0.2s
+>>>&2 ubyte !0x2C \b%c
+>>>>&0 ubyte !0x2C \b%c
+>>>>>&0 ubyte !0x2C \b%c
+>>>>>>&0 ubyte !0x2C \b%c
+>>>>>>>&0 ubyte !0x2C \b%c
+>>>>>>>>&0 ubyte !0x2C \b%c
+>>>>>>>>>&0 ubyte !0x2C \b%c
+# examples without Platform tag like one_by_zero_char.mat
+>>20 default x
+>>>11 string x "%s"
+# created by MATLAB include time like: Fri Feb 20 15:26:59 2009
+>34 search/9/c created\040on:\040 \b, created
+>>&0 string x %.24s
+# MatLab v4
+# From: Joerg Jenderek
+# check for valid imaginary flag of Matlab matrix version 4
+13 ushort 0
+# check for valid ASCII matrix name
+>20 ubyte >0x1F
+# skip some CDROM filesystem like testhfs.iso by looking for valid big endian type flag
+>>0 ubelong&0xFFffFF00 0x00000300
+>>>0 use matlab4
+# no example for 8bit and 16bit integers matrix
+>>0 ubelong&0xFFffFF00 0x00000400
+>>>0 use matlab4
+>>0 ulelong x
+# skip big endian variant by looking for valid low little endian type flag
+>>0 ulelong <53
+>>>0 use \^matlab4
+# display information of Matlab v4 matfile
+0 name matlab4 Matlab v4 matfile
+#!:mime application/octet-stream
+!:mime application/x-matlab-data
+!:ext mat
+# 20-byte header with 5 long integers that contains information describing certain attributes of the Matrix
+# type flag decimal MOPT; maximal 4052=FD4h; maximal 52=34h for little endian
+#>0 ubelong x \b, type flag %u
+#>0 ubelong x (0x%x)
+# M: 0~little endian 1~Big Endian 2~VAX Dfloat 3~VAX Gfloat 4~Cray
+#>0 ubelong/1000 x \b, M=%u
+>0 ubelong/1000 0 (little endian)
+>0 ubelong/1000 1 (big endian)
+>0 ubelong/1000 2 (VAX Dfloat)
+>0 ubelong/1000 3 (VAX Gfloat)
+>0 ubelong/1000 4 (Cray)
+# namlen; the length of the matrix name
+#>16 ubelong x \b, name length %u
+# nul terminated matrix name like: fit_params testmatrix testsparsecomplex teststringarray
+#>20 string x \b, MATRIX NAME="%s"
+>16 pstring/L x %s
+# T indicates the matrix type: 0~numeric 1~text 2~sparse
+#>0 ubelong%10 x \b, T=%u
+>0 ubelong%10 0 \b, numeric
+>0 ubelong%10 1 \b, text
+>0 ubelong%10 2 \b, sparse
+# mrows; number of rows in the matrix like: 1 3 8
+>4 ubelong x \b, rows %u
+# ncols; number of columns in the matrix like: 1 3 4 5 9 43
+>8 ubelong x \b, columns %u
+# imagf; imaginary flag; 1~matrix has an imaginary part 0~only real data
+>12 ubelong !0 \b, imaginary
+# real; Real part of the matrix consists of mrows * ncols numbers