[File] [PATCH] of Magdir/ole2compounddocs,msdos,wordprocessors for CDF; updates + extension

Christos Zoulas christos at zoulas.com
Fri Aug 2 18:08:54 UTC 2019


Committed, thanks!

christos

> On Jul 28, 2019, at 8:02 PM, Jörg Jenderek <joerg.jen.der.ek at gmx.net> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hello,
> a few days ago I received a Microsoft Publisher document (*.pub). When
> I ran the file command, version 5.37, on such files, they were
> recognized as OLE 2 Compound Document, but the subtype identification
> was either missing or duplicated. I then looked for other Compound
> Document files (abbreviated as CDF) on my system, especially those of
> other popular office applications such as StarOffice and other
> Microsoft programs. With that test collection, the subtype stays
> undetermined for many examples when I run file with the options
> -k -e cdf, as shown in the appended file-5.37-soft.txt.
> 
> The identification happens in Magdir/ole2compounddocs through a line
> like
> 0   string  \320\317\021\340\241\261\032\341 OLE 2 Compound Document
> Unfortunately this pattern alone does not distinguish the subtypes,
> so additional test lines are needed. Furthermore, only ??? is
> displayed when the file command is run with the --extension option.
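
A minimal sketch, assuming Python, of what that first magic line tests. The
function name is hypothetical. The octal escapes are the eight bytes
D0 CF 11 E0 A1 B1 1A E1, which every CDF subtype shares, so a match says
nothing yet about the subtype:

```python
# The magic string \320\317\021\340\241\261\032\341 written as Python bytes.
OLE2_SIGNATURE = b"\320\317\021\340\241\261\032\341"


def is_ole2(data: bytes) -> bool:
    """Return True when data starts with the OLE 2 compound document signature."""
    return data.startswith(OLE2_SIGNATURE)


# The octal escapes and the hex notation describe the same eight bytes.
assert OLE2_SIGNATURE == bytes.fromhex("D0CF11E0A1B11AE1")
```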
> 
> Even when I use the internal CDF recognition of the file command,
> some examples are not recognized precisely. With the option -e soft
> I get output like that in the appended file-5.37-cdf.txt.
> 
> First I looked at how the samples are identified by other tools. I
> used TrID, a utility designed to identify file types from their
> binary signatures (see http://mark0.net/soft-trid-e.html).
> To catch as many existing types as possible, I used the following
> procedure:
> Download the TrID pattern database "triddefs_xml.7z". Extract the 7z
> archive. Search the extracted TrID definition files *.trid.xml for
> the pattern
> 	<Bytes>D0CF11E0A1B11AE100</Bytes>
> 
> DROID (Digital Record and Object Identification) is a software tool
> developed by The National Archives to perform automated batch
> identification of file formats. See
> 	https://digital-preservation.github.io/droid/
> 
> So I added lines to Magdir/ole2compounddocs. Some information about
> the file format is found on the OpenOffice website, so I added a
> comment line like
> # reference:	https://www.openoffice.org/sc/compdocfileformat.pdf
> 
> The major version number described in that document is now shown by
> a line like
>>>> 0x1A	uleshort	x	\b, v%u
> In real examples this number is 3 or 4. Bad DROID skeleton files such
> as fmt-39-signature-id-128.doc are skipped by an additional test for
> a valid version
>> 0x1A	ushort		!0xABAB
> 
> The byte order is stored inside OLE2 files. Big-endian could occur,
> but all inspected real-world samples use little-endian. So I test for
> little-endian and only then continue, by an additional line
>>> 0x1C	uleshort		=0xfffe
> 
> For little-endian examples some useful information is now shown. The
> number of the first sector of the directory stream is displayed by
> the line
>>>> 48	ulelong			x	\b, SecID 0x%x
> 
> The header stores the sector size as an exponent of base 2. The
> minimum value is 7. In most cases the value is 9, for a sector size
> of 512, and sometimes the value 0xC (12), for 4096, is found. Each
> block size gets its own branch of magic lines. For block size 512
> the branch starts with a line like:
>>>> 0x1E	uleshort	9	\b, blocksize 512
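
The header fields tested so far can be sketched as follows, assuming Python
and its struct module. The function name and the dictionary keys are
hypothetical; the offsets (0x1A, 0x1C, 0x1E and 48) are the ones used in the
magic lines above:

```python
import struct


def parse_ole2_header(header: bytes) -> dict:
    """Decode the header fields that the magic lines above test or print."""
    # "<" selects little-endian; H is an unsigned 16-bit value,
    # I an unsigned 32-bit value.
    version, byte_order, sector_shift = struct.unpack_from("<HHH", header, 0x1A)
    (dir_secid,) = struct.unpack_from("<I", header, 48)
    if byte_order != 0xFFFE:
        raise ValueError("not a little-endian compound document")
    return {
        "version": version,                 # 0x1A: 3 or 4 in real samples
        "sector_size": 1 << sector_shift,   # 0x1E: 9 -> 512, 12 -> 4096
        "dir_secid": dir_secid,             # 48: first directory sector
    }
```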
> 
> Afterwards jump to the block just before the root storage block (512
> bytes per block here) by a line like
>>>>> (48.l*512)	ubyte		x
> 
> Unfortunately this pointer construct does not work with the standard
> configuration for samples like Red-Carpet-presentation-1.0-1.sdd,
> sg10.sdv, XnView_metadata.doc,
> "Barham, Lisa - Die Shopping-Prinzessinnen.doc" or
> 2000_GA_Annual_Review_Data.xls, and the file command displays no
> error message. So I display such a message after the SecID
> information by an additional line
>>>> 48	ulelong	>0x800		too big for FILE_BYTES_MAX = 1 MiB
> 
> If such samples are to be recognized, the FILE_BYTES_MAX value must
> be raised. For a doc file with SecID 0x3144 a size of 6.158203125 MiB
> is needed. So raise the limit to 7 MiB by changing the line in
> src/file.h to
> # define FILE_BYTES_MAX (7168 * 1024)
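
The numbers above can be checked with a short calculation. This is only a
sketch in Python with a hypothetical helper name; it reproduces the offset
that the indirect magic offset (48.l*512) computes. Note that the directory
sector itself starts one block later, so slightly more than this offset has
to be readable:

```python
MIB = 1024 * 1024
SECTOR_SIZE = 512


def pointer_offset(secid: int, sector_size: int = SECTOR_SIZE) -> int:
    """Offset computed by the indirect magic offset (48.l*512)."""
    return secid * sector_size


# SecID 0x800 lands exactly on the default limit of 1 MiB.
print(pointer_offset(0x800) / MIB)    # 1.0
# SecID 0x3144 needs the size quoted in the mail.
print(pointer_offset(0x3144) / MIB)   # 6.158203125
# The proposed FILE_BYTES_MAX of 7168 * 1024 bytes is 7 MiB.
print(7168 * 1024 / MIB)              # 7.0
```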
> 
> Afterwards jump one block forward to the directory stream, inspect
> this structure in a subroutine and display the subtype according to
> the stored GUID
>>>>>> &511 	use		ole2-directory
> 0	name			ole2-directory
> To check that the jump to the directory stream was successful, I first
> check for the root entry by testing for the type value 5 with the line
>> 66 	ubyte		5
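
As a sketch of the two-step jump, assuming Python: the helper name is
hypothetical, and the 128-byte entry size is an assumption taken from the
compound file layout (it matches the offset 128 used later for the second
entry). The ubyte matched at SecID*512 ends one byte later, so the relative
offset &511 lands at (SecID + 1) * 512:

```python
def root_entry(data: bytes, dir_secid: int, sector_size: int = 512) -> bytes:
    """Return the 128-byte root directory entry or raise ValueError."""
    # Sector 0 starts right after the header block, hence the "+ 1".
    # This equals the magic steps (48.l*512) followed by &511.
    offset = (dir_secid + 1) * sector_size
    entry = data[offset:offset + 128]
    # Byte 66 of a directory entry is its type; 5 marks the root storage.
    if len(entry) < 128 or entry[66] != 5:
        raise ValueError("no root entry at offset 0x%x" % offset)
    return entry
```

A failed check usually means that the SecID or the block size was read
wrongly, or that the directory lies beyond the bytes that were read.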
> 
> Previously the first subtype classification in ole2compounddocs was
> done by the line
>> 0x480 string V\000i\000s\000i\000o\000D\000o\000c : Visio Document
> 
> This matched only under accidental conditions: the second directory
> entry is the UTF-16 string "VisioDocument", the SecID is 0x1 and the
> block size is 512. So test files mentioned on
> fileformats.archiveteam.org with SecID 0x2, like Visio2002Test.vsd,
> are not recognized, because the characteristic string then appears at
> offset 0x680. Instead of looking for the directory entry name, I test
> for the 16-byte unique identifier when possible (that is, when the
> GUID clsid is not null).
> The GUID of inspected samples can be shown by debug lines like
>> 80 	ubequad		!0			\b, clsid 0x%16.16llx
>>> 88 	ubequad		x			\b%16.16llx
> Or look up known GUIDs on web sites like
> 	https://wikileaks.org/ciav7p1/cms/page_13762814.html
> Often GUIDs are listed in string format with curly braces. For the
> file command such a string must be converted to its hexadecimal
> representation by converters like
> www.windowstricks.in/online-windows-guid-converter.
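
The conversion can also be done locally. The following is a minimal sketch,
assuming Python's uuid module; the function name is hypothetical. The GUID
string in the example is an assumption: it is the string that yields the
values 0x0609020000000000 and 0xc000000000000046 used by the Word test
lines in this mail:

```python
import uuid


def guid_to_ubequads(guid: str) -> tuple:
    """Convert a GUID string into the two ubequad values at offsets 80 and 88."""
    # bytes_le gives the on-disk layout: the first three GUID fields are
    # stored little-endian, the last eight bytes are stored as written.
    raw = uuid.UUID(guid).bytes_le
    return (int.from_bytes(raw[:8], "big"), int.from_bytes(raw[8:], "big"))


first, second = guid_to_ubequads("{00020906-0000-0000-C000-000000000046}")
print("0x%016x 0x%016x" % (first, second))
```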
> 
> So I test for the second part of the GUID by a line like
>>> 88 	ubequad		0xc000000000000046	: Microsoft
> This detects some Microsoft files. Microsoft Visio samples are then
> described by the other 8 bytes of the GUID by a line like
>>>> 80 ubequad 0x131a020000000000 Visio Document, stencil or template
> Afterwards the MIME type is shown by the line
> !:mime	application/vnd.visio
> The filename extension varies: VSD is used for drawings, VSS for
> stencils and VST for templates. This is now expressed by the line
> !:ext	vsd/vss/vst
> Another advantage of testing by GUID is that the GUID changes between
> file type versions. The above ID is used by Visio versions 2000-2002,
> and Visio 2003-2010 versions are recognized by an additional test
> block starting with a line like
>>>> 80 	ubequad		0x141a020000000000	Visio 2003 Document
> 
> The GUID part tested first (0xc000000000000046) is also used by
> Windows Installer files. These are described by additional test lines
> like
>>>> 80 ubequad	0x84100c0000000000	Windows Installer Package
> !:mime	application/x-msi
> !:ext	msi
>>>> 80 ubequad	0x86100c0000000000	Windows Installer Patch
> !:mime	application/x-wine-extension-msp
> !:ext	msp
> 
> It is also used by Word document variants like the example
> harmless-clean.doc, described by lines like
>>>> 80 	ubequad	0x0609020000000000	Word 97-2003 doc or template
> !:mime	application/msword
> doc is the usual file name extension, but dot is used for templates
> and on Macintosh no file name extension is used. This is now
> expressed by the line
> !:ext	doc/dot/
> For templates the Apple ID is MSWDW8TN; doc files use a different
> one, expressed by the line
> !:apple	MSWDWDBN
> These lines replace the old unreliable test starting with line
>> 546	string	bjbj			: Microsoft Word Document
> 
> Some Outlook files are matched as well. Such files are often
> described as Outlook Message, but in the Windows registry I found the
> description text Outlook Item. Furthermore, there is no official MIME
> type application/vnd.ms-outlook. Templates use another GUID and the
> file name extension oft instead of msg. So such Outlook examples are
> described by the code segments
>>>> 80 	ubequad	0x0b0d020000000000	Outlook 97-2003 item
> !:mime	application/x-ms-msg
> !:ext	msg
>>>> 80 	ubequad	0x46f0060000000000	Outlook 97-2003 item template
> !:mime	application/x-ms-oft
> !:ext	oft
> 
> Then I added the remaining observed GUIDs of other file types in the
> same way. To catch further non-null clsid values I also added lines
> like
>>> 88 	default		x			: UNKNOWN
> !:mime	application/x-ole-storage
>>>> 80 	ubequad		!0			\b, clsid 0x%16.16llx
>>>> 88 	ubequad		x			\b%16.16llx
> So otherwise undetected samples, like the test file
> guidunknown.ole2, get a correct MIME type.
> 
> Unfortunately not all CDF types have a unique GUID. So samples with a
> CLSID GUID value of zero are handled by a sub-branch starting with
> lines like
>>> 88 	ubequad		0x0
>>>> 80 	ubequad		0x0
> For these types exact file format documentation often does not exist,
> but there is often a characteristic directory entry name.
> 
> For instance, Microstation V8 DGN samples like 1344468165.dgn seem to
> have a second directory entry named "Dgn~H" or "Dgn~S". This is now
> expressed by a code segment like
>>>>> 128 	lestring16	Dgn~	: Microstation V8 CAD
> !:mime	application/x-bentley-dgn
> !:ext	dgn
> This replaces the old unreliable segment with line
>> 0x480  string  D\000g\000n\000~\000H	: Microstation V8 DGN
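
The difference between the old fixed offset and the new relative test can be
sketched like this, assuming Python. The function name is hypothetical, and
the 64-byte name field is an assumption taken from the compound file layout.
The second directory entry starts 128 bytes into the directory stream, so
its absolute offset depends on SecID and block size:

```python
def second_entry_name(data: bytes, dir_secid: int, sector_size: int = 512) -> str:
    """Decode the UTF-16LE name of the second directory entry."""
    # Directory stream start plus one 128-byte entry.
    offset = (dir_secid + 1) * sector_size + 128
    # The name field is NUL-terminated and padded; cut at the first NUL.
    raw = data[offset:offset + 64]
    return raw.decode("utf-16-le", errors="replace").split("\x00", 1)[0]


# The old fixed offset only fits SecID 1 with 512-byte sectors:
print(hex((1 + 1) * 512 + 128))   # 0x480
print(hex((2 + 1) * 512 + 128))   # 0x680
```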
> 
> Hangul (Korean) 5.0 Word Processor files like 100723-.hwp and
> example.hwp seem to have a second directory entry named FileHeader
> with the significant signature "HWP Document File". These are now
> identified by lines like
>>>>> 128 lestring16 FileHeader : Hangul (Korean) 5.0 Word Processor
> !:mime	application/x-hwp
> !:ext	hwp
> To detect further CDF examples with a null clsid, like the test file
> guidNull.ole2, I added lines like
>>>>> 128 	default		x	: UNKNOWN
> !:mime	application/x-ole-storage
> 
> After handling the samples with block size 512, the same procedure is
> applied to samples with exponent 12 (that is, block size 4096) by an
> additional branch like
>>>> 0x1E		uleshort	0xc	\b, blocksize 4096
>>>>> (48.l*4096)	ubyte		x
>>>>>> &4095 		use		ole2-directory
> Now also samples like AusweisApp2-1.16.1.msi are identified precisely.
> 
> For some samples I got duplicate descriptions. When an example is
> already recognized by Magdir/ole2compounddocs, the additional
> descriptions from Magdir/msdos are not needed. So, for examples like
> WORD60ES.DOC, I removed lines like
> 2080 string Documento\ Microsoft\ Word\ 6 Spanish Microsoft Word 6
> 2112 string MSWordDoc		Microsoft Word document data
> For examples like DOT60.DOT, I removed the lines with
> 2080 string Microsoft\ Word\ 6.0\ Document	%s
> For examples like test-italian.xls, I removed the lines with
> 2080	string	Foglio\ di\ lavoro\ Microsoft\ Exce	%s
> For examples like ANALYSIS.XLA, I removed the lines with
> 2080	string	Microsoft\ Excel\ 5.0\ Worksheet	%s
> For examples like SLIDES.XLT, I removed the lines with
> 2114	string	Biff5		Microsoft Excel 5.0 Worksheet
> I found no example for the next code segment, but apparently it also
> describes a CDF format. So I also removed the lines with
> 2121	string	Biff5		Microsoft Excel 5.0 Worksheet
> 
> At the beginning of the directory entry structure the name is stored
> as a UTF-16 string, which can be displayed by a debug line like:
>> 0 	lestring16	x 			\b, 1st %.10s
> For the root directory this is "Root Entry".
> So, for examples like surfnet-keynote.doc, I removed the lines with
> 512 string R\0o\0o\0t\0\ \0E\0n\0t\0r\0y Microsoft Word Document
> This pattern applies not only to DOC files but also to other CDF
> files with a suitable combination of SecID and block size.
> The same error occurs in Magdir/wordprocessors. So I removed the
> lines there starting with
> 512 string R\0o\0o\0t\0 Hangul (Korean) Word Processor File 2000
> With those lines, samples like the Outlook item 15.11.10_Schiwy.msg
> and the Corel PrintHouse image DINASAUR.CPH are also misidentified as
> Hangul (Korean) Word Processor File 2000.
> 
> After applying the modifications described above with the patches
> file-5.37-ole2compounddocs.diff, file-5.37-msdos-ole2compounddocs.diff
> and file-5.37-wordprocessors-ole2compounddocs.diff, CDF files no
> longer get duplicate descriptions and are described more precisely,
> as shown in the appended file-soft-new.txt.
> 
> I hope my diff files can be applied in a future version of the file
> utility. I have done my best to catch the most popular CDF types, but
> according to the TrID database more exotic CDF files still exist.
> Furthermore, for the MIME type and the Apple ID there is often no
> precise declaration.
> 
> With best wishes
> Jörg Jenderek
> - --
> Jörg Jenderek
> 
> -----BEGIN PGP SIGNATURE-----
> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
> 
> iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCXT3VEAAKCRCv8rHJQhrU
> 1qDfAKCgQRfhqyyp28PEaMy9qShu5c46xgCfcKJjtF2VSsd7UD3kuqwvh4xpnhs=
> =McaO
> -----END PGP SIGNATURE-----
> <file-5_37-msdos-ole2compounddocs_diff.DEFANGED-8813><file-5_37-wordprocessors-ole2compounddocs_diff.DEFANGED-8814><file-5.37-cdf.txt><file-soft-new.txt><file-5.37-soft.txt><file-5_37-ole2compounddocs_diff.DEFANGED-8815><file-5_37-msdos-ole2compounddocs_diff_sig.DEFANGED-8816><file-5_37-wordprocessors-ole2compounddocs_diff_sig.DEFANGED-8817><file-5_37-cdf_txt_sig.DEFANGED-8818><file-soft-new_txt_sig.DEFANGED-8819><file-5_37-soft_txt_sig.DEFANGED-8820><file-5_37-ole2compounddocs_diff_sig.DEFANGED-8821>-- 
> File mailing list
> File at astron.com
> https://mailman.astron.com/mailman/listinfo/file