[File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documents (*.doc *.wri)

Christos Zoulas christos at zoulas.com
Sun Jun 2 15:26:14 UTC 2019


On Jun 2, 12:39am, joerg.jen.der.ek at gmx.net (Jörg Jenderek) wrote:
-- Subject: [File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documen

| Hello,
| A few days ago I inspected some old Microsoft documents with the file
| extensions doc and wri, using file version 5.37. Such documents and
| similar files are described only generically as "Microsoft Word
| Document". The output looks like:
| 
| dlgaccf.doc:                    Microsoft Word Document
| DOS6DS.WRI:                     Microsoft Word Document
| MEMO.DOC:                       Microsoft Word Document
| splitter.doc:                   Microsoft Word Document
| x-fmt-12-signature-id-614.wri:  Microsoft Word Document
| x-fmt-274-signature-id-488.doc: Microsoft Word Document
| x-fmt-275-signature-id-489.doc: Microsoft Word Document
| x-fmt-276-signature-id-490.doc: Microsoft Word Document
| Furthermore, with the --extension option ??? is displayed, and with
| the --apple option UNKNUNKN is shown.
| 
| The file identifying tool TrID ( http://mark0.net/soft-trid-e.html )
| describes inspected examples as "Microsoft Word for DOS Document"
| and "Windows Write Document"
| 
| DROID, the file identification tool of the UK National Archives,
| describes such examples as "Write for Windows Document 3.0" or
| "Microsoft Word for MS-DOS Document" 4.0 or 5.5. See
| https://sourceforge.net/projects/droid
| 
| So I added more lines to Magdir/msdos. Some information can be found
| on the fileformats.archiveteam.org website, so I added a comment line
| like
|  # URL: http://fileformats.archiveteam.org/wiki/DOC
| That page mentions a website about the old "MS-WORD FORMAT" that is
| preserved at web.archive.org. So I use this:
|  # Reference:	https://web.archive.org/web/20170206041048/
|  #		http://www.msxnet.org/word2rtf/formats/ffh-dosword5
| 
| According to that site, such documents start with a characteristic
| 4-byte signature. This was expressed by the line
|  0	belong	0x31be0000		Microsoft Word Document
| Unfortunately this is not unique enough: DROID test skeletons like
| x-fmt-274-signature-id-488.doc are misidentified. So this line now
| becomes
|  0	belong	0x31be0000
| More magic lines are needed. In real documents the text content
| starts at offset 128 (80h), whereas the skeleton examples contain
| nothing at that place. So skip DROID skeletons like
| x-fmt-274-signature-id-488.doc with the additional line
|  >128	ubyte		>0  			Microsoft
| 
| At the end, show the stored ASCII text. Sometimes it starts with 4
| non-printable characters like carriage returns or line feeds. So
| first test for a printable character. If one is found, print the
| string. If not, jump to the next character and repeat the procedure.
| This is expressed by magic lines starting like
|  >>128	ubyte		x			\b,
|  >>>128		ubyte	>0x1F
|  >>>>128	string	x			%s
|  >>>128		ubyte	<0x20
|  >>>>129	ubyte	>0x1F
| 
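The skip-then-print idea described above can be sketched in Python. This is only an illustration, not the code of file(1): the function name `first_text`, the limit of 4 skipped bytes, the cut at the next control byte, and the cp437 decoding are all assumptions made for this example.

```python
def first_text(data: bytes, start: int = 128, max_skip: int = 4) -> str:
    """Return the text found at `start`, skipping leading control bytes.

    Mirrors the magic idea: test "ubyte >0x1F" at offset 128, then 129,
    and so on. The skip limit and the stop condition are assumptions.
    """
    for offset in range(start, min(start + max_skip + 1, len(data))):
        if data[offset] > 0x1F:  # printable, same test as "ubyte >0x1F"
            end = offset
            # Assumption: stop at the next control byte (CR, LF, NUL, ...).
            while end < len(data) and data[end] > 0x1F:
                end += 1
            return data[offset:end].decode("cp437")
    return ""


# Hypothetical document body: 128 header bytes, then CR/LF pairs and text.
sample = bytes(128) + b"\r\n\r\nMEMORANDUM\r\n"
print(first_text(sample))
```

A real magic file cannot loop, which is why the patch spells out one nested test per skipped position.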
| The described Word document format is also used in a variant by some
| Microsoft Write versions. According to the archived web site
| http://msxnet.org/word2rtf/formats/write.txt, examining the value of
| word 48 of the header (byte offset 96) is a good way to distinguish
| Write files from Microsoft Word files. If it equals 0, the file
| originated in Word. Any other value identifies a Write file. This is
| now expressed by the lines
|  >>96	uleshort	=0		Word
|  !:mime	application/msword
|  !:apple	MSWDWDBN
|  !:ext	doc/dcx
| For Word documents the MIME type is "application/msword" and the file
| name extension is "doc" on DOS systems and "dcx" on Unix systems,
| according to TrID.
|  >>96	uleshort	!0		Write 3.0 (Windows) Document
|  !:mime	application/x-mswrite
|  !:apple	MSWDWDBN
| For Write documents the MIME type is "application/x-mswrite" and the
| file name extension is in most cases "wri". But I also found 3
| examples like splitter.doc or srchtest.doc with the "doc" extension.
| I do not know if this is an accident.
| 
| According to the reference, the 18 bytes at offset 6Eh are always 00h
| in version 4.0, but from version 5.0 on they are used for unknown
| code. So the different DOS variants can be distinguished by the
| additional lines
|  >>>0x6E	ulequad		=0	1.0-4.0
|  >>>0x6E	ulequad		!0	5.0-6.0
|  >>>0x6E	ulequad		x	(DOS) Document
| 
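The header tests described so far (signature, non-empty text at offset 128, word 48, and the bytes at 6Eh) can be combined into a short Python sketch. The function name `classify` and the returned strings are chosen for illustration only. Like the `ulequad` magic lines, it looks at just the first 8 of the 18 bytes at offset 6Eh.

```python
import struct

SIGNATURE = b"\x31\xbe\x00\x00"  # "belong 0x31be0000" at offset 0


def classify(data: bytes) -> str:
    """Roughly mimic the Word/Write decision made by the magic lines."""
    # Skeleton files carry the signature but nothing at offset 128.
    if len(data) <= 128 or data[:4] != SIGNATURE or data[128] == 0:
        return "data"
    # Word 48 of the header is a little-endian short at byte offset 96.
    (word48,) = struct.unpack_from("<H", data, 96)
    if word48 != 0:
        return "Microsoft Write 3.0 (Windows) Document"
    # Only 8 of the 18 bytes at 0x6E are inspected, as with "ulequad".
    (reserved,) = struct.unpack_from("<Q", data, 0x6E)
    version = "1.0-4.0" if reserved == 0 else "5.0-6.0"
    return f"Microsoft Word {version} (DOS) Document"


# Hypothetical minimal header for demonstration purposes.
header = bytearray(129)
header[0:4] = SIGNATURE
header[128] = ord("M")
print(classify(bytes(header)))
```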
| To print such a document from DOS, additional style files like
| NORMAL.STY and DOS printer drivers like HPLASMS are needed. This
| information is also stored inside the document and is shown by lines
| like
|  >>0x1E	string		>0		\b, formated by %-.66s
|  >>0x62	string		>0		\b, %-.8s printer
| 
| According to the reference, the block pointer to the optional file
| manager information block is stored at offset 1Ch. This can be
| displayed by a line like:
|  >>0x1C	uleshort	x			\b, at 0x%x info block
| Because the block size is 128 bytes, jump to the optional file
| manager block with the line
|  >>(0x1C.s*128)	uleshort x
| Then test for the valid information start values 14 or maybe 12 with
| the line
|  >>>&-2		uleshort	=0x0014
| Afterward show the appended ASCIIZ names, starting with the document
| name, with a line like
|  >>>>&0x12	string		x		%s
| and finally the modification and creation dates, stored as MM/DD/YY,
| with the lines
|  >>>>>>>>>>&1	string		x		\b, %-.8s
|  >>>>>>>>>>&9	string		x		created %-.8s
| 
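The indirect jump to the file manager information block can be illustrated with a small Python sketch. The function name `document_name` is hypothetical, only the start value 0x0014 from the magic line is checked, and the cp437 decoding is an assumption. The name offset follows from the relative offset `&0x12`, which counts from the end of the 2-byte start value.

```python
import struct

BLOCK_SIZE = 128  # block pointers count 128-byte blocks


def document_name(data: bytes):
    """Return the document name from the info block, or None if absent."""
    if len(data) < 0x1E:
        return None
    # Block pointer: little-endian short at offset 0x1C.
    (block,) = struct.unpack_from("<H", data, 0x1C)
    start = block * BLOCK_SIZE  # same as the indirect offset (0x1C.s*128)
    if start + 2 > len(data):
        return None
    (marker,) = struct.unpack_from("<H", data, start)
    if marker != 0x0014:  # start value tested by the magic line
        return None
    name_start = start + 2 + 0x12  # relative offset &0x12 after the marker
    nul = data.find(b"\x00", name_start)
    end = len(data) if nul == -1 else nul
    return data[name_start:end].decode("cp437")


# Hypothetical file: header block, one text block, info block at block 2.
doc = bytearray(3 * BLOCK_SIZE)
struct.pack_into("<H", doc, 0x1C, 2)
struct.pack_into("<H", doc, 2 * BLOCK_SIZE, 0x0014)
doc[2 * BLOCK_SIZE + 0x14:2 * BLOCK_SIZE + 0x19] = b"MEMO\x00"
print(document_name(bytes(doc)))
```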
| After applying the above-mentioned modifications with the patch
| file-5.37-msdos-doc_wri.diff, such old Microsoft documents are
| identified and described more precisely, like:
| 
| dlgaccf.doc:
| 	Microsoft Word 1.0-4.0 (DOS) Document,
| 	formated by DOC.STY, HPLASPS printer,
| 	Dialog Accelerators   -----   Technical Information
| DOS6DS.WRI:
| 	Microsoft Write 3.0 (Windows) Document,
| 	224 pages,
| 	MS-DOS 6.0 DoubleSpace Test - Final Summary.
| MEMO.DOC:
| 	Microsoft Word 5.0-6.0 (DOS) Document
| 	Sample heading for memo, author Microsoft Corp., reviser jpf,
| 	keywords memo sample heading,
| 	comment Learning Word refers to this file.,
| 	version , 10/01/90 created 10/01/90, HPLASMS printer,
| 	9 blocks,
| 	MEMORANDUM
| splitter.doc:
| 	Microsoft Write 3.0 (Windows) Document,
| 	formated by RP.STY, 20552 pages, LASPS printer,
| 	Window Splitting in Opus -- \001
| x-fmt-12-signature-id-614.wri:  data
| x-fmt-274-signature-id-488.doc: data
| x-fmt-275-signature-id-489.doc: data
| x-fmt-276-signature-id-490.doc: data
| 
| I hope my diff file can be applied in a future version of the
| file utility.

Thanks, committed!

christos

