[File] [PATCH] of Magdir/mathematica, images Matlab mat-file *.mat, Hierarchical Data Format *.hdf
Christos Zoulas
christos at zoulas.com
Wed Jul 14 09:07:28 UTC 2021
Committed, thanks!
christos
> On Jul 10, 2021, at 7:11 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
>
> Hello,
>
> some days ago I inspected some Matlab examples with the file name
> extension mat.
>
> When running the file command version 5.40 on such examples and
> related files with the -k option, I get output like:
>
> abydos.h5: Hierarchical Data Format (version 5) data
> big_endian.mat: Matlab v5 mat-file
>     (big endian) version 0x0100
> input_256.hdf: Hierarchical Data Format (version 4) data
> malformed1.mat: Matlab v5 mat-file
>     (little endian) version 0x0100
> miuint32_for_miint32.mat: Matlab v5 mat-file
>     (little endian) version 0x0100
> one_by_zero_char.mat: Matlab v5 mat-file
>     (little endian) version 0x0100
> ReactOS-LiveCD.iso: ISO 9660 CD-ROM filesystem data
>     'REACTOS' (bootable)
>     (Lepton 3.x), scale 0-0,
>     (Lepton 2.x), scale 0-0,
> test-hfs.iso: ISO 9660 CD-ROM filesystem data
>     (DOS/MBR boot sector)
>     'test-hfs-cdrom-hybrid'
>     (Lepton 3.x), scale 0-0,
>     (Lepton 2.x), scale 0-0,
> testbool_8_WIN64.mat: Matlab v5 mat-file
>     (little endian) version 0x0100
> testcell_6.1_SOL2.mat: Matlab v5 mat-file
>     (big endian) version 0x0100
> testcomplex_4.2c_SOL2.mat: data
> testhdf5_7.4_GLNX86.mat: Hierarchical Data Format (version 5)
>     with 512 bytes user block
>     Matlab v5 mat-file
>     (little endian) version 0x0200
> testsparse_4.2c_SOL2.mat: data
> teststring_4.2c_SOL2.mat: data
> testvec_4_GLNX86.mat: data
>
> Most MAT examples like one_by_zero_char.mat are described by
> Magdir/mathematica correctly as "Matlab v5 mat-file".
> But a few examples like testcomplex_4.2c_SOL2.mat are only described
> as "data".
>
> Furthermore, with --extension only ??? is displayed, and with the -i
> option only the generic application/octet-stream is shown for MAT
> examples.
>
> For comparison I ran the file format identification utility
> TrID (see https://mark0.net/soft-trid-e.html).
>
> A few MAT examples like malformed1.mat, miuint32_for_miint32.mat and
> one_by_zero_char.mat, which are described correctly by the file
> command, are misidentified by TrID as "SMS Material" by
> mat-sms.trid.xml.
>
> Examples like testcomplex_4.2c_SOL2.mat,
> testsparse_4.2c_SOL2.mat and teststring_4.2c_SOL2.mat, which are
> described as "data" by the file command, are described by TrID as
> "Matlab Level 4 MAT-File (big-endian)" by mat-l4-be.trid.xml (see
> appended MAT-trid-v.txt.gz). TrID also displays a related URL and the
> file name extension.
>
> The detection of MAT examples happens by lines inside
> Magdir/mathematica like:
>
> 0 string MATLAB Matlab v5 mat-file
> >126 short 0x494d (big endian)
> >>124 beshort x version 0x%04x
> >126 short 0x4d49 (little endian)
> >>124 leshort x version 0x%04x
>
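
A minimal Python sketch of what the Level 5 magic lines above check, for readers who do not read magic syntax. The function name describe_mat5_header is purely illustrative and not part of the patch. It compares the two bytes at offset 126 directly instead of relying on a native short, which is an assumption about how the test is meant to behave on any host.

```python
import struct


def describe_mat5_header(header: bytes):
    """Return (byte order, version string) for a Level 5 MAT header, or None."""
    if len(header) < 128 or not header.startswith(b"MATLAB"):
        return None
    indicator = header[126:128]
    if indicator == b"MI":      # 0x4D49 written by a big endian machine
        order, fmt = "big endian", ">H"
    elif indicator == b"IM":    # same value written by a little endian machine
        order, fmt = "little endian", "<H"
    else:
        return None
    # The 16-bit version field sits right before the endian indicator.
    (version,) = struct.unpack(fmt, header[124:126])
    return order, f"0x{version:04x}"
```

Called on the first 128 bytes of a file, this yields the same two facts the magic prints: the byte order and the version as four hexadecimal digits.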
> For the MAT examples TrID mentions a page about MAT on the File
> Formats Archive Team website as related URL. That page mentions the
> MAT-File Format documentation matfile_format.pdf. So this
> information is now expressed by additional comment lines like:
> # URL: http://fileformats.archiveteam.org/wiki/MAT
> # Reference:
> # https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf
>
> According to the documentation the first 116 bytes of the header can
> contain text data in human-readable form. This text typically
> provides information that describes how the MAT-file was created.
> MAT-files created by MATLAB include the following information in
> their headers:
> 1) Level of the MAT-file
> 2) Platform on which the file was created
> 3) Date and time the file was created
> This often looks like the following string:
> MATLAB 5.0 MAT-file, Platform: SOL2, Created on: Thu Nov 13
> 10:10:27 1997
>
> The file command only tests for the starting keyword MATLAB, whereas
> the TrID command looks at more bytes. So I look for the platform tag
> part (which is like: GLNX86 PCWIN PCWIN64 SOL2 Windows_7 nt posix)
> and for the creation time. In a few examples like malformed1.mat
> and miuint32_for_miint32.mat the leading comma (0x2C) before the
> platform part is missing. And in one example not created by MATLAB,
> one_by_zero_char.mat, the leading ASCII string looks like
> "MATLAB 5.0 MAT-file, written by Octave 3.2.3, 2011-01-25 19:30:48
> UTC".
> So here the platform part is missing and the creation time is stored
> in another format. So I now show that information by additional lines
> like:
>
>
> >>20 search/2 Platform:\040 \b, platform
> >>>&0 string x %-0.2s
> >>>&2 ubyte !0x2C \b%c
> >>>>&0 ubyte !0x2C \b%c
> >>>>>&0 ubyte !0x2C \b%c
> >>>>>>&0 ubyte !0x2C \b%c
> >>>>>>>&0 ubyte !0x2C \b%c
> >>>>>>>>&0 ubyte !0x2C \b%c
> >>>>>>>>>&0 ubyte !0x2C \b%c
> >>20 default x
> >>>11 string x "%s"
> >34 search/9/c created\040on:\040 \b, created
> >>&0 string x %-.24s
>
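
As a rough illustration of the idea behind these lines (not a translation of the magic itself), the following Python sketch pulls the platform tag and the 24-character creation time out of the 116-byte text field. The name header_details and the regular expressions are assumptions made for the example; the magic uses a bounded search and prints bytes up to the next comma instead.

```python
import re


def header_details(text_field: bytes) -> dict:
    """Extract platform and creation time from the Level 5 descriptive text."""
    text = text_field[:116].decode("ascii", errors="replace").rstrip("\x00 ")
    # The platform tag runs up to the next comma (0x2C).
    platform = re.search(r"Platform: ([^,]*)", text)
    # The creation time is a ctime-style string of at most 24 characters.
    created = re.search(r"created on: (.{1,24})", text, re.IGNORECASE)
    return {
        "platform": platform.group(1).strip() if platform else None,
        "created": created.group(1).strip() if created else None,
        "text": text,
    }
```

For a header written by Octave neither pattern matches, so a caller would fall back to showing the raw text, which is what the default branch of the magic does.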
> One MAT example, testhdf5_7.4_GLNX86.mat, was not identified by TrID
> because it starts with the ASCII string "MATLAB 7.0" instead of the
> string "MATLAB 5.0" found in the other examples. So this is a variant
> with the higher version level 7. This is also visible in the
> hexadecimal version, which is 0x0200 in that case whereas for level 5
> this value is 0x0100. So this example should be described as
> "Matlab v7 mat-file" instead of "Matlab v5 mat-file". This is now
> done by lines like:
> 0 string MATLAB Matlab v
> !:mime application/x-matlab-data
> !:ext mat
> >7 ubyte =0x35 \b5 mat-file
> >7 ubyte !0x35
> >>7 string x \b%.3s mat-file
>
> Instead of generic application/octet-stream the mentioned mime type
> application/x-matlab-data is now shown and now also file name
> extension mat is displayed.
>
> This MAT example testhdf5_7.4_GLNX86.mat was identified first as
> "Hierarchical Data Format (version 5) with 512 bytes user block"
> by Magdir/images with lines like:
> 512 string \211HDF\r\n\032\n Hierarchical Data Format \
> (version 5) with 512 bytes user block
> !:mime application/x-hdf
> After inspecting more details of the MAT file, it becomes clear that
> this example is really a matrix file that just tests some HDF
> aspects. Therefore it also contains the short HDF pattern at the
> suited position. So I skip HDF recognition of this example by looking
> for MATLAB characteristics. So the above lines now become:
>
> 512 string \211HDF\r\n\032\n
> >0 string !MATLAB Hierarchical Data Format \
> (version 5) with 512 bytes user block
> !:mime application/x-hdf5
> !:ext h5/hdf5/hdf/he5
>
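
The ordering of the two tests can be sketched in Python as follows. This is only an illustration under the assumption that the whole file prefix is available as bytes; classify_hdf5_userblock is a made-up name, and the returned strings are shortened descriptions rather than the exact output of the file command.

```python
# Octal \211 HDF \r \n \032 \n from the magic line, written in hex escapes.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def classify_hdf5_userblock(data: bytes):
    """Mimic the patched rule: HDF5 signature at 512, unless it is a MAT-file."""
    if data[512:520] != HDF5_SIGNATURE:
        return None
    if data.startswith(b"MATLAB"):
        # Left to the MATLAB entry in Magdir/mathematica.
        return "Matlab mat-file"
    return "Hierarchical Data Format (version 5) with 512 bytes user block"
```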
> According to Wikipedia I now show four extensions for version 5 and
> three for version 4, but in my examples I found only the hdf
> extension for version 4 and the h5 extension for version 5. For
> version 5 the mime type application/x-hdf5 is used instead of
> application/x-hdf.
>
> The mentioned link to hdf.ncsa.uiuc.edu does not exist any more. So I
> add a URL to the Wikipedia page about HDF. This is now expressed by
> comment lines like:
> # URL: http://fileformats.archiveteam.org/wiki/HDF
> # https://en.wikipedia.org/wiki/Hierarchical_Data_Format
>
> In the MAT-File Format documentation matfile_format.pdf, besides the
> Level 5 MAT-File Format, the older Level 4 MAT-File Format is also
> explained. So I see that the unrecognized ("data") MAT samples are
> just older level 4 examples.
>
> Unfortunately level 4 MAT files have no significant magic pattern. So
> I put the displaying part inside a subroutine named matlab4 and then
> add enough test lines to identify such matrices in a unique manner.
> The subroutine starts with lines displaying text similar to that for
> level 5 mat-files, like:
> 0 name matlab4 Matlab v4 mat-file
> !:mime application/x-matlab-data
> !:ext mat
>
> According to the specification such MAT files start with a 20-byte
> header of 5 long integers that contain information describing
> certain attributes of the matrix.
> At offset 0 the type flag is stored as a 4-byte integer whose byte
> order depends on the machine. In decimal that type integer is
> represented as MOPT, where M counts the thousands and indicates the
> numeric format of numbers on the machine. The biggest possible value
> is 4052 (=0xFD4). That means the 2 upper bytes are always 0.
> For big endian (that means Macintosh, SPARC, Apollo, SGI, HP
> 9000/300, other Motorola systems) the M value is 1. So the lowest
> flag value is 1000 (=3E8 hexadecimal) and the highest value is 1052
> (=41C hexadecimal). The highest hexadecimal value with 3 as second
> lowest byte is 3FF (=1023 decimal). That covers floating point
> numbers (P=0 for double-precision 64-bit or P=1 for single-precision
> 32-bit) and 32-bit integers (P=2). So the value of the second lowest
> byte is 3 or 4, and the value 4 only occurs for 16-bit signed
> integers (P=3), 16-bit unsigned integers (P=4) and 8-bit unsigned
> integers (P=5).
> According to the documentation, for little endian (PC, 386, 486, DEC
> RISC) machines the M value is 0. That means the highest type value
> is 52 (=34 hexadecimal).
> This is used to display information about the machine type (big
> endian, for example, in the same manner as for level 5) by lines
> like:
>
> #>0 ubelong x \b, type flag %u
> #>0 ubelong x (0x%x)
> #>0 ubelong/1000 x \b, M=%u
> >0 ubelong/1000 0 (little endian)
> >0 ubelong/1000 1 (big endian)
> >0 ubelong/1000 2 (VAX D-float)
> >0 ubelong/1000 3 (VAX G-float)
> >0 ubelong/1000 4 (Cray)
>
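
The MOPT arithmetic described above can be checked with a few lines of Python. This is a sketch: decode_type_flag is a hypothetical helper, and the P and T tables follow my reading of the Level 4 section of matfile_format.pdf.

```python
MACHINE = {0: "little endian", 1: "big endian", 2: "VAX D-float",
           3: "VAX G-float", 4: "Cray"}
PRECISION = {0: "double", 1: "single", 2: "32-bit signed integer",
             3: "16-bit signed integer", 4: "16-bit unsigned integer",
             5: "8-bit unsigned integer"}
MATRIX_TYPE = {0: "numeric", 1: "text", 2: "sparse"}


def decode_type_flag(flag: int) -> dict:
    """Split the decimal type flag MOPT into its four digits."""
    m, rest = divmod(flag, 1000)   # M: thousands, machine format
    o, rest = divmod(rest, 100)    # O: hundreds, expected to be 0
    p, t = divmod(rest, 10)        # P: tens, T: ones
    return {
        "machine": MACHINE.get(m, "unknown"),
        "reserved": o,
        "precision": PRECISION.get(p, "unknown"),
        "matrix_type": MATRIX_TYPE.get(t, "unknown"),
    }
```

For example, hex(1000) is 0x3e8 and hex(1052) is 0x41c, so every big endian flag has 3 or 4 in its second lowest byte, which is the property the mask 0xFFffFF00 tests for.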
> Furthermore this information is used as third test to skip some
> CD-ROM filesystems like test-hfs.iso with many low nil values at the
> right positions by lines like:
> >>0 ubelong&0xFFffFF00 0x00000300
> >>>0 use matlab4
> >>0 ubelong&0xFFffFF00 0x00000400
> >>>0 use matlab4
> >>0 ulelong x
> >>0 ulelong <53
> >>>0 use \^matlab4
>
> At offset 20 the null-terminated matrix name is stored as an ASCII
> string (like testmatrix testsparsecomplex teststringarray
> testcomplex) and at offset 16 the length of this string is stored as
> a 4-byte integer. So the matrix name is shown by lines like:
> #>16 ubelong x \b, name length %u
> #>20 string x \b, MATRIX NAME="%s"
> >16 pstring/L x %s
> The existence of a valid printable ASCII matrix name is used as
> second test by a line like:
> >20 ubyte >0x1F
>
> At offset 4 the number of rows in the matrix is stored as a 4-byte
> integer (like: 1 3 8). At offset 8 the number of columns in the
> matrix is stored as a 4-byte integer (like 1 3 4 5 9 43). So the
> matrix dimensions are shown by lines like:
> >4 ubelong x \b, rows %u
> >8 ubelong x \b, columns %u
>
> At offset 12 the imaginary flag is stored as a 4-byte integer. If
> this is 1, then the matrix has an imaginary part. If 0, there is only
> real data. So this information is printed for a matrix that is not
> purely real (that means it has an imaginary part) by a line like:
> >12 ubelong !0 \b, imaginary
> Depending on the byte order the value 1 can occur in the byte at
> offset 12 or 15, but that also means that the two middle bytes are
> nil for both endian variants.
> That information is used as first test by a line like:
> 13 ushort 0
>
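
Put together, the three tests and the header fields can be sketched in Python like this. The function parse_mat4_header is hypothetical and only covers the big and little endian cases discussed above. VAX and Cray layouts are not handled, and the limit 53 is taken from the magic line.

```python
import struct


def parse_mat4_header(data: bytes):
    """Apply the three Level 4 tests, then decode the 20-byte header."""
    if len(data) < 21:
        return None
    # Test 1: the two middle bytes of the imaginary flag are nil.
    if data[13:15] != b"\x00\x00":
        return None
    # Test 2: the matrix name starts with a printable character.
    if data[20] <= 0x1F:
        return None
    # Test 3: the type flag is plausible in one of the two byte orders.
    (be_flag,) = struct.unpack(">I", data[0:4])
    (le_flag,) = struct.unpack("<I", data[0:4])
    if (be_flag & 0xFFFFFF00) in (0x300, 0x400):
        order = ">"
    elif le_flag < 53:
        order = "<"
    else:
        return None
    flag, rows, cols, imag, namlen = struct.unpack(order + "5I", data[:20])
    raw_name = data[20:20 + namlen].split(b"\x00", 1)[0]
    return {
        "byte_order": "big endian" if order == ">" else "little endian",
        "type_flag": flag,
        "rows": rows,
        "columns": cols,
        "imaginary": imag != 0,
        "name": raw_name.decode("ascii", errors="replace"),
    }
```

A block of zero bytes, as found at the start of many ISO images, already fails the second test, which matches the intent of skipping CD-ROM filesystems.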
> I hope that 3 test lines are unique enough to identify MAT level 4
> files. According to specification for VAX and Cray machines the
> header file looks different. So maybe for such machine types other
> test conditions must be created.
>
> After applying the above mentioned modifications by patches
> file-5.40-mathematica-matlab.diff and file-5.40-images-matlab.diff,
> all matrix examples and Hierarchical Data Format (HDF) images
> are recognized and described with more details, and some
> misidentifications vanish, like:
>
> abydos.h5: Hierarchical Data Format (version 5) data
> big_endian.mat: Matlab v5 mat-file (big endian)
>     version 0x0100, platform Windows 7,
>     created Tue Feb 26 11:20:36 GMT
> input_256.hdf: Hierarchical Data Format (version 4) data
> malformed1.mat: Matlab v5 mat-file (little endian)
>     version 0x0100, platform nt,
>     created Thu Mar 24 17:53:52 2016
> miuint32_for_miint32.mat: Matlab v5 mat-file (little endian)
>     version 0x0100, platform posix,
>     created Sat Jan 31 13:15:43 2015
> one_by_zero_char.mat: Matlab v5 mat-file (little endian)
>     version 0x0100
>     "MAT-file, written by Octave 3.2.3,
>     2011-01-25 19:30:48 UTC"
> ReactOS-LiveCD.iso: ISO 9660 CD-ROM filesystem data
>     'REACTOS' (bootable)
> test-hfs.iso: ISO 9660 CD-ROM filesystem data
>     (DOS/MBR boot sector)
>     'test-hfs-cdrom-hybrid'
> testbool_8_WIN64.mat: Matlab v5 mat-file (little endian)
>     version 0x0100, platform PCWIN64,
>     created Fri Apr 12 16:18:43 2013
> testcell_6.1_SOL2.mat: Matlab v5 mat-file (big endian)
>     version 0x0100, platform SOL2,
>     created Sat Aug 19 09:37:19 2006
> testcomplex_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
>     testcomplex, numeric, rows 1, columns 9,
>     imaginary
> testhdf5_7.4_GLNX86.mat: Matlab v7.0 mat-file (little endian)
>     version 0x0200, platform GLNX86,
>     created Sat Oct 4 19:01:58 2008
> testsparse_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
>     testsparse, sparse, rows 8, columns 3
> teststring_4.2c_SOL2.mat: Matlab v4 mat-file (big endian)
>     teststring, text, rows 1, columns 43
> testvec_4_GLNX86.mat: Matlab v4 mat-file (little endian)
>     fit_params, numeric, rows 2, columns 1
>
> I hope my 2 diff files can be applied in future version of file utility.
>
> Furthermore many examples like ReactOS-LiveCD.iso and test-hfs.iso
> are still misidentified by the subroutine diy-thermocam-checker
> inside Magdir/measure as "(Lepton 3.x)" and "(Lepton 2.x)". This
> subroutine still gives too many false hits.
>
> With best wishes
> Jörg Jenderek
> --
> Jörg Jenderek
>
> <file-5_40-images-hdf_diff.DEFANGED-14><file-5_40-images-hdf_diff_sig.DEFANGED-15><MAT-trid-v.txt.gz><file-5_40-mathematica-mat_diff.DEFANGED-16><file-5_40-mathematica-mat_diff_sig.DEFANGED-17>--