[File] [PATCH] of Magdir/mathematica, images Matlab mat-file *.mat, Hierarchical Data Format *.hdf

Christos Zoulas christos at zoulas.com
Wed Jul 14 09:07:28 UTC 2021


Committed, thanks!

christos

> On Jul 10, 2021, at 7:11 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
> 
> Hello,
> 
> Some days ago I inspected some Matlab examples with the file name
> extension mat.
> 
> When running the file command version 5.40 with the -k option on such
> examples and related files, I get output like:
> 
> abydos.h5:                 Hierarchical Data Format (version 5) data
> big_endian.mat:            Matlab v5 mat-file
> 			   (big endian) version 0x0100
> input_256.hdf:             Hierarchical Data Format (version 4) data
> malformed1.mat:            Matlab v5 mat-file
> 			   (little endian) version 0x0100
> miuint32_for_miint32.mat:  Matlab v5 mat-file
> 			   (little endian) version 0x0100
> one_by_zero_char.mat:      Matlab v5 mat-file
> 			   (little endian) version 0x0100
> ReactOS-LiveCD.iso:        ISO 9660 CD-ROM filesystem data
> 			   'REACTOS' (bootable)
> 			   (Lepton 3.x), scale 0-0,
> 			   (Lepton 2.x), scale 0-0,
> test-hfs.iso:              ISO 9660 CD-ROM filesystem data
> 			   (DOS/MBR boot sector)
> 			   'test-hfs-cdrom-hybrid'
> 			   (Lepton 3.x), scale 0-0,
> 			   (Lepton 2.x), scale 0-0,
> testbool_8_WIN64.mat:      Matlab v5 mat-file
> 			   (little endian) version 0x0100
> testcell_6.1_SOL2.mat:     Matlab v5 mat-file
> 			   (big endian) version 0x0100
> testcomplex_4.2c_SOL2.mat: data
> testhdf5_7.4_GLNX86.mat:   Hierarchical Data Format (version 5)
> 			   with 512 bytes user block
> 			   Matlab v5 mat-file
> 			   (little endian) version 0x0200
> testsparse_4.2c_SOL2.mat:  data
> teststring_4.2c_SOL2.mat:  data
> testvec_4_GLNX86.mat:      data
> 
> Most MAT examples like one_by_zero_char.mat are described by
> Magdir/mathematica correctly as "Matlab v5 mat-file".
> But a few examples like testcomplex_4.2c_SOL2.mat are only described
> as "data".
> 
> Furthermore, with --extension only ??? is displayed, and with the -i
> option only the generic application/octet-stream is shown for MAT
> examples.
> 
> For comparison I ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html).
> 
> A few MAT examples like malformed1.mat, miuint32_for_miint32.mat and
> one_by_zero_char.mat, which are described correctly by the file
> command, are misidentified by TrID as "SMS Material" via
> mat-sms.trid.xml.
> 
> Most examples like testcomplex_4.2c_SOL2.mat,
> testsparse_4.2c_SOL2.mat and teststring_4.2c_SOL2.mat, which are
> described as "data" by the file command, are described by TrID as
> "Matlab Level 4 MAT-File (big-endian)" via mat-l4-be.trid.xml (see
> the appended MAT-trid-v.txt.gz). TrID also displays a related URL and
> the file name extension.
> 
> The detection of MAT examples happens by lines inside
> Magdir/mathematica like:
> 
> 0       string  MATLAB  Matlab v5 mat-file
> >126    short   0x494d  (big endian)
> >>124   beshort x       version 0x%04x
> >126    short   0x4d49  (little endian)
> >>124   leshort x       version 0x%04x
> 
> For the MAT examples, TrID mentioned the MAT page on the File Formats
> Archive Team website as a related URL. That page in turn mentions the
> MAT-File Format documentation matfile_format.pdf. This information is
> now expressed by additional comment lines like:
> # URL:		http://fileformats.archiveteam.org/wiki/MAT
> # Reference:
> # https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf
> 
> According to the documentation, the first 116 bytes of the header can
> contain text data in human-readable form. This text typically
> provides information that describes how the MAT-file was created.
> MAT-files created by MATLAB include the following information in
> their headers:
> 1) Level of the MAT-file
> 2) Platform on which the file was created
> 3) Date and time the file was created
> This often looks like the following string:
> MATLAB 5.0 MAT-file, Platform: SOL2, Created on: Thu Nov 13
> 10:10:27 1997
> 
> The file command only tests for the leading keyword MATLAB, whereas
> the TrID command looks at more bytes. So I now look for the platform
> tag (which is like: GLNX86 PCWIN PCWIN64 SOL2 Windows_7 nt posix)
> and for the creation time. In a few examples like malformed1.mat
> and miuint32_for_miint32.mat the leading comma (0x2C) before the
> platform part is missing. And in one example not created by MATLAB,
> one_by_zero_char.mat, the leading ASCII string looks like
> "MATLAB 5.0 MAT-file, written by Octave 3.2.3, 2011-01-25 19:30:48
> UTC".
> So here the platform part is missing and the creation time is stored
> in another format. This information is now shown by additional lines
> like:
> 
> 
> >>20	search/2	Platform:\040	\b, platform
> >>>&0	string		x		%-0.2s
> >>>&2		ubyte	!0x2C		\b%c
> >>>>&0		ubyte	!0x2C		\b%c
> >>>>>&0	ubyte	!0x2C		\b%c
> >>>>>>&0	ubyte	!0x2C		\b%c
> >>>>>>>&0	ubyte	!0x2C		\b%c
> >>>>>>>>&0	ubyte	!0x2C		\b%c
> >>>>>>>>>&0	ubyte	!0x2C		\b%c
> >>20	default		x
> >>>11	string		x	"%s"
> >34	search/9/c	created\040on:\040	\b, created
> >>&0	string	x		%-.24s
> 
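A minimal Python sketch of the same header decoding, for readers who want to cross-check samples outside of file(1). It assumes the layout described above (text in bytes 0..115, version at offset 124, endian indicator at offset 126). The function name parse_mat5_header and the regular expressions are illustrative assumptions, not part of the patch:

```python
import re
import struct


def parse_mat5_header(header: bytes) -> dict:
    """Decode the 128-byte header of a Level 5 (or later) MAT-file.

    Hypothetical helper: it mirrors the magic lines above, not file(1) itself.
    """
    if len(header) < 128 or not header.startswith(b"MATLAB"):
        raise ValueError("not a Level 5 MAT-file header")

    # Bytes 0..115: human-readable description, padded with blanks or NULs.
    text = header[:116].decode("ascii", errors="replace").rstrip(" \x00")

    # Bytes 126..127: endian indicator. "MI" on disk means big endian,
    # "IM" means little endian.
    indicator = header[126:128]
    if indicator == b"MI":
        endian, fmt = "big", ">H"
    elif indicator == b"IM":
        endian, fmt = "little", "<H"
    else:
        raise ValueError("unknown endian indicator %r" % indicator)

    # Bytes 124..125: version, e.g. 0x0100 (level 5) or 0x0200 (level 7).
    (version,) = struct.unpack_from(fmt, header, 124)

    # Platform tag and creation time are optional (absent in Octave files).
    platform = re.search(r"Platform:\s*([^,]+)", text)
    created = re.search(r"Created on:\s*(.{1,24})", text, re.IGNORECASE)

    return {
        "text": text,
        "endian": endian,
        "version": version,
        "platform": platform.group(1).strip() if platform else None,
        "created": created.group(1).strip() if created else None,
    }
```

Called on the first 128 bytes of a sample, this returns None for platform and created when the text lacks those parts, as in the Octave-written example.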
> One MAT example, testhdf5_7.4_GLNX86.mat, was not identified by TrID
> because it starts with the ASCII string "MATLAB 7.0" instead of
> "MATLAB 5.0" as in the other examples. So this is a variant with the
> higher version level 7. This is also visible in the hexadecimal
> version, which is 0x0200 in that case whereas for level 5 it is
> 0x0100. So this example should be described as something like
> "Matlab v7 mat-file" instead of "Matlab v5 mat-file". This is now
> done by lines like:
> 0       string  MATLAB  Matlab v
> !:mime	application/x-matlab-data
> !:ext	mat
> >7	ubyte	=0x35	\b5 mat-file
> >7	ubyte	!0x35
> >>7	string	x	\b%.3s mat-file
> 
> Instead of the generic application/octet-stream, the MIME type
> application/x-matlab-data is now shown, and the file name
> extension mat is now displayed as well.
> 
> This MAT example testhdf5_7.4_GLNX86.mat was identified first as
> "Hierarchical Data Format (version 5) with 512 bytes user block"
> by Magdir/images with lines like:
> 512 string \211HDF\r\n\032\n Hierarchical Data Format \
>                              (version 5) with 512 bytes user block
> !:mime	application/x-hdf
> After inspecting the MAT file in more detail, it becomes clear that
> this example is really a matrix file that just tests some HDF aspects.
> Therefore it also contains the short HDF pattern at the suitable
> position. So I skip HDF recognition of such examples by looking for
> MATLAB characteristics. The above lines now become:
> 
> 512 string \211HDF\r\n\032\n
> >0  string !MATLAB	Hierarchical Data Format \
>                        (version 5) with 512 bytes user block
> !:mime	application/x-hdf5
> !:ext	h5/hdf5/hdf/he5
> 
> Following Wikipedia, I now show four extensions for version 5 and
> three for version 4, but in my examples I found only the hdf
> extension for version 4 and the h5 extension for version 5. For
> version 5 the MIME type application/x-hdf5 is used instead of
> application/x-hdf.
> 
> The mentioned link to hdf.ncsa.uiuc.edu does not exist any more. So I
> added a URL to the Wikipedia page about HDF. This is now expressed by
> comment lines like:
> # URL: http://fileformats.archiveteam.org/wiki/HDF
> #	https://en.wikipedia.org/wiki/Hierarchical_Data_Format
> 
> Besides the Level 5 MAT-File Format, the MAT-File Format
> documentation matfile_format.pdf also explains the older Level 4
> MAT-File Format. So I see that the unrecognized ("data") MAT samples
> are just older level 4 examples.
> 
> Unfortunately level 4 MAT files have no significant magic pattern. So
> I put the displaying part inside a subroutine named matlab4 and then
> added enough test lines to identify such matrices unambiguously. The
> subroutine starts with lines displaying text similar to that for
> level 5 mat-files, like:
> 0	name	matlab4		Matlab v4 mat-file
> !:mime	application/x-matlab-data
> !:ext	mat
> 
> According to the specification, such MAT files start with a 20-byte
> header of 5 long integers that contain information describing
> certain attributes of the matrix.
> At offset 0 the type flag is stored as a 4-byte integer in the
> machine's byte order. In decimal that type integer is represented as
> MOPT, where M counts the thousands and indicates the numeric format
> of numbers on the machine. The biggest possible value is 4052
> (=0xFD4). That means the 2 upper bytes are always 0.
> For big endian (that means Macintosh, SPARC, Apollo, SGI, HP
> 9000/300, other Motorola systems) the M value is 1. So the lowest
> flag value is 1000 (=3E8 hexadecimal) and the highest value is 1052
> (=41C hexadecimal). So the value of the second-lowest byte is 3 or
> 4. The highest hexadecimal value with 3 as that byte is 3FF (=1023
> decimal), which covers floating point numbers (P=0 for
> double-precision 64-bit or P=1 for single-precision 32-bit) and
> 32-bit integers (P=2). The value 4 as that byte only occurs for
> 16-bit signed integers (P=3), 16-bit unsigned integers (P=4) and
> 8-bit unsigned integers (P=5).
> According to the documentation, for little endian (PC, 386, 486, DEC
> RISC) machines the M value is 0. That means the highest type value
> is 52 (=34 hexadecimal).
> This is used to display information about the machine type (big
> endian, for example, in the same manner as for level 5) by lines
> like:
> 
> #>0	ubelong		x	\b, type flag %u
> #>0	ubelong		x	(0x%x)
> #>0	ubelong/1000	x	\b, M=%u
> >0	ubelong/1000	0	(little endian)
> >0	ubelong/1000	1	(big endian)
> >0	ubelong/1000	2	(VAX D-float)
> >0	ubelong/1000	3	(VAX G-float)
> >0	ubelong/1000	4	(Cray)
> 
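The MOPT arithmetic above can be sketched in Python as follows. This is only an illustration under the assumptions stated in the text (a valid flag never exceeds 4052, so reading it in the wrong byte order yields a much larger number). The function name decode_type_flag and the dictionary labels are hypothetical, and the T digit meanings (numeric, text, sparse) are taken from the Level 4 documentation:

```python
import struct

# M digit: numeric format of the machine that wrote the file.
NUMERIC_FORMATS = {0: "little endian", 1: "big endian",
                   2: "VAX D-float", 3: "VAX G-float", 4: "Cray"}
# P digit: storage type of the matrix elements.
DATA_TYPES = {0: "double", 1: "single", 2: "int32",
              3: "int16", 4: "uint16", 5: "uint8"}
# T digit: kind of matrix.
MATRIX_TYPES = {0: "numeric", 1: "text", 2: "sparse"}


def decode_type_flag(first4: bytes) -> dict:
    """Split the Level 4 type flag into its decimal digits M, O, P, T."""
    (as_be,) = struct.unpack(">I", first4[:4])
    (as_le,) = struct.unpack("<I", first4[:4])

    # Heuristic: pick whichever byte order gives a plausible flag value.
    flag = as_be if as_be <= 4052 else as_le
    if flag > 4052:
        raise ValueError("type flag out of range in both byte orders")

    m, rest = divmod(flag, 1000)   # thousands digit
    o, rest = divmod(rest, 100)    # hundreds digit, reserved, expected 0
    p, t = divmod(rest, 10)        # tens and ones digits

    return {
        "flag": flag,
        "M": m, "O": o, "P": p, "T": t,
        "format": NUMERIC_FORMATS.get(m, "unknown"),
        "data_type": DATA_TYPES.get(p, "unknown"),
        "matrix_type": MATRIX_TYPES.get(t, "unknown"),
    }
```

For instance, the big endian bytes 00 00 03 E8 decode to flag 1000, that is M=1, P=0, T=0: a big endian double-precision numeric matrix.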
> Furthermore this information is used as the third test, to skip some
> CD-ROM filesystem images like test-hfs.iso that have many nil bytes
> at the right positions, by lines like:
> >>0	ubelong&0xFFffFF00	0x00000300
> >>>0	use	matlab4
> >>0	ubelong&0xFFffFF00	0x00000400
> >>>0	use	matlab4
> >>0	ulelong		x
> >>0	ulelong		<53
> >>>0	use	\^matlab4
> 
> At offset 20 the null-terminated matrix name is stored as an ASCII
> string (like testmatrix testsparsecomplex teststringarray
> testcomplex) and at offset 16 the length of this string is stored as
> a 4-byte integer. So the matrix name is shown by lines like:
> #>16	ubelong		x	\b, name length %u
> #>20	string		x	\b, MATRIX NAME="%s"
> >16	pstring/L	x	%s
> The existence of a valid printable ASCII matrix name is used as the
> second test by a line like:
> >20	ubyte	>0x1F
> 
> At offset 4 the number of rows in the matrix is stored as a 4-byte
> integer (like: 1 3 8). At offset 8 the number of columns in the
> matrix is stored as a 4-byte integer (like 1 3 4 5 9 43). So the
> matrix dimensions are shown by lines like:
> >4	ubelong		x	\b, rows %u
> >8	ubelong		x	\b, columns %u
> 
> At offset 12 the imaginary flag is stored as a 4-byte integer. If it
> is 1, the matrix has an imaginary part. If 0, there is only real
> data. So this information is printed for non-real (that is, complex)
> matrices by a line like:
> >12	ubelong		!0	\b, imaginary
> Depending on endianness the value 1 can occur in the byte at offset
> 12 or 15, but that also means that the two middle bytes are nil for
> both endian variants.
> That information is used as the first test by a line like:
> 13	ushort	0
> 
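Put together, the three tests and the header fields can be approximated in Python like this. It is a rough sketch of the logic described above, not a reimplementation of the magic engine. The names looks_like_mat4 and parse_mat4_header are hypothetical, and VAX and Cray layouts are not handled:

```python
import struct


def looks_like_mat4(data: bytes) -> bool:
    """Apply the three heuristic tests for a Level 4 MAT-file."""
    if len(data) < 21:
        return False
    # Test 1: the two middle bytes of the imaginary flag (offsets 13, 14)
    # are nil for both byte orders.
    if data[13:15] != b"\x00\x00":
        return False
    # Test 2: the matrix name at offset 20 starts with a printable byte.
    if data[20] <= 0x1F:
        return False
    # Test 3: the type flag is plausible, either big endian (0x03xx or
    # 0x04xx) or little endian (below 53).
    (be,) = struct.unpack_from(">I", data, 0)
    (le,) = struct.unpack_from("<I", data, 0)
    return (be & 0xFFFFFF00) in (0x300, 0x400) or le < 53


def parse_mat4_header(data: bytes) -> dict:
    """Read the five 4-byte header integers and the matrix name."""
    (be,) = struct.unpack_from(">I", data, 0)
    order = ">" if (be & 0xFFFFFF00) in (0x300, 0x400) else "<"
    type_flag, rows, cols, imagf, namlen = struct.unpack_from(
        order + "5I", data, 0)
    # The name is stored with its terminating NUL; namlen includes it.
    raw_name = data[20:20 + namlen].split(b"\x00", 1)[0]
    return {
        "endian": "big" if order == ">" else "little",
        "type_flag": type_flag,
        "rows": rows,
        "columns": cols,
        "imaginary": imagf != 0,
        "name": raw_name.decode("ascii", errors="replace"),
    }
```

Calling looks_like_mat4 before parse_mat4_header keeps the loose field parsing away from unrelated data, in the same way that the magic tests guard the use of the subroutine.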
> I hope that these 3 test lines are specific enough to identify MAT
> level 4 files. According to the specification, for VAX and Cray
> machines the header looks different. So for such machine types other
> test conditions may have to be created.
> 
> After applying the above-mentioned modifications via the patches
> file-5.40-mathematica-matlab.diff and file-5.40-images-matlab.diff,
> all matrix examples and Hierarchical Data Format (HDF) images
> are recognized and described in more detail, and some
> misidentifications vanish:
> 
> abydos.h5:                 Hierarchical Data Format (version 5) data
> big_endian.mat:            Matlab v5 mat-file (big endian)
> 			   version 0x0100, platform Windows 7,
> 			   created Tue Feb 26 11:20:36 GMT
> input_256.hdf:             Hierarchical Data Format (version 4) data
> malformed1.mat:            Matlab v5 mat-file (little endian)
> 			   version 0x0100, platform nt,
> 			   created Thu Mar 24 17:53:52 2016
> miuint32_for_miint32.mat:  Matlab v5 mat-file (little endian)
> 			   version 0x0100, platform posix,
> 			   created Sat Jan 31 13:15:43 2015
> one_by_zero_char.mat:      Matlab v5 mat-file (little endian)
> 			   version 0x0100
> 			   "MAT-file, written by Octave 3.2.3,
> 			   2011-01-25 19:30:48 UTC"
> ReactOS-LiveCD.iso:        ISO 9660 CD-ROM filesystem data
> 			   'REACTOS' (bootable)
> test-hfs.iso:              ISO 9660 CD-ROM filesystem data
> 			   (DOS/MBR boot sector)
> 			   'test-hfs-cdrom-hybrid'
> testbool_8_WIN64.mat:      Matlab v5 mat-file (little endian)
> 			   version 0x0100, platform PCWIN64,
> 			   created Fri Apr 12 16:18:43 2013
> testcell_6.1_SOL2.mat:     Matlab v5 mat-file (big endian)
> 			   version 0x0100, platform SOL2,
> 			   created Sat Aug 19 09:37:19 2006
> testcomplex_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
> 			   testcomplex, numeric, rows 1, columns 9,
> 			   imaginary
> testhdf5_7.4_GLNX86.mat:   Matlab v7.0 mat-file (little endian)
> 			   version 0x0200, platform GLNX86,
> 			   created Sat Oct  4 19:01:58 2008
> testsparse_4.2c_SOL2.mat:  Matlab v4 mat-file (big endian)
> 			   testsparse, sparse, rows 8, columns 3
> teststring_4.2c_SOL2.mat:  Matlab v4 mat-file (big endian)
> 			   teststring, text, rows 1, columns 43
> testvec_4_GLNX86.mat:      Matlab v4 mat-file (little endian)
> 			   fit_params, numeric, rows 2, columns 1
> 
> I hope my 2 diff files can be applied in a future version of the file
> utility.
> 
> Furthermore many examples like ReactOS-LiveCD.iso and test-hfs.iso
> are still misidentified by the subroutine diy-thermocam-checker
> inside Magdir/measure as "(Lepton 3.x)" and "(Lepton 2.x)". This
> subroutine still gives too many false hits.
> 
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
> 
> <file-5_40-images-hdf_diff.DEFANGED-14><file-5_40-images-hdf_diff_sig.DEFANGED-15><MAT-trid-v.txt.gz><file-5_40-mathematica-mat_diff.DEFANGED-16><file-5_40-mathematica-mat_diff_sig.DEFANGED-17>--
