[File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documents (*.doc *.wri)

Jörg Jenderek joerg.jen.der.ek at gmx.net
Sat Jun 1 22:39:50 UTC 2019


Some days ago I inspected some old Microsoft documents with the file
name extensions doc and wri. I ran file version 5.37 on them. Such
documents and similar files are described only generically as
"Microsoft Word Document". The output looks like:

dlgaccf.doc:                    Microsoft Word Document
DOS6DS.WRI:                     Microsoft Word Document
MEMO.DOC:                       Microsoft Word Document
splitter.doc:                   Microsoft Word Document
x-fmt-12-signature-id-614.wri:  Microsoft Word Document
x-fmt-274-signature-id-488.doc: Microsoft Word Document
x-fmt-275-signature-id-489.doc: Microsoft Word Document
x-fmt-276-signature-id-490.doc: Microsoft Word Document
Furthermore, with the --extension option ??? is displayed, and with
the --apple option UNKNUNKN is shown.

The file identification tool TrID ( http://mark0.net/soft-trid-e.html )
describes the inspected examples as "Microsoft Word for DOS Document"
and "Windows Write Document".

DROID, the file identification program of the UK National Archives,
describes such examples as "Write for Windows Document 3.0" or
"Microsoft Word for MS-DOS Document" 4.0 or 5.5.
See https://sourceforge.net/projects/droid

So I added more lines to Magdir/msdos. Some information can be found
on the fileformats.archiveteam.org website, so I added a comment line
like
 # URL: http://fileformats.archiveteam.org/wiki/DOC
That page mentions a website about the old "MS-WORD FORMAT" that is
preserved at web.archive.org, so I use this:
 # Reference:	https://web.archive.org/web/20170206041048/
 #		http://www.msxnet.org/word2rtf/formats/ffh-dosword5

According to that site, such documents start with a characteristic
4-byte signature. This was expressed by the line
 0	belong	0x31be0000		Microsoft Word Document
Unfortunately this is not unique enough. DROID test skeletons like
x-fmt-274-signature-id-488.doc are misidentified. So this line is now
 0	belong	0x31be0000
More magic lines are needed. In real documents the text content
starts at offset 128 (80h), whereas the skeleton examples contain
nothing at that place. So DROID skeletons like
x-fmt-274-signature-id-488.doc are skipped by the additional line
 >128	ubyte		>0  			Microsoft

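For illustration only, the two tests above can be mirrored outside of
file in a few lines of Python. This is a minimal sketch and not part of
the patch; the function name and the sample byte strings are
hypothetical.

```python
import struct

# Big-endian value tested by the magic line "0 belong 0x31be0000".
WORD_MAGIC = 0x31BE0000


def looks_like_real_document(data: bytes) -> bool:
    """Mirror both tests: signature at offset 0, non-zero byte at offset 128."""
    if len(data) < 129:
        return False
    (magic,) = struct.unpack_from(">I", data, 0)
    return magic == WORD_MAGIC and data[128] > 0


# Hypothetical samples: a skeleton with only the signature, and a
# document whose text starts at offset 128.
skeleton = bytes.fromhex("31be0000") + bytes(200)
real = bytes.fromhex("31be0000") + bytes(124) + b"Hello"

print(looks_like_real_document(skeleton))
print(looks_like_real_document(real))
```

The skeleton sample fails the second test because nothing is stored at
offset 128, which is the behaviour wanted for the DROID skeletons.
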
At the end, this stored ASCII text is shown. It sometimes starts with
4 non-printable characters like carriage returns or line feeds. So
first test for a printable character. If this is true, print the
string. If it is not printable, jump to the next character and repeat
the procedure. This is expressed by magic lines starting like
 >>128	ubyte		x			\b,
 >>>128		ubyte	>0x1F
 >>>>128	string	x			%s
 >>>128		ubyte	<0x20
 >>>>129	ubyte	>0x1F

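The same "skip leading control characters" idea can be sketched in
Python as a loop. This is an illustration under assumptions, not what
file does internally: the function name is hypothetical, the limit of 5
skipped bytes follows the offsets 128 to 133 used in the attached patch,
and ending the text at the next control byte is a simplification of this
sketch.

```python
def first_text(data: bytes, start: int = 128, max_skip: int = 5) -> str:
    """Return the text found at the first byte above 0x1F near `start`."""
    for offset in range(start, start + max_skip + 1):
        if offset >= len(data):
            break
        if data[offset] > 0x1F:  # same test as "ubyte >0x1F"
            end = offset
            # Simplification of this sketch: stop at the next control byte.
            while end < len(data) and data[end] > 0x1F:
                end += 1
            return data[offset:end].decode("latin-1")
    return ""


# Hypothetical sample: 128 header bytes, two CR LF pairs, then text.
sample = bytes(128) + b"\r\n\r\nHello\r\nmore text"
print(first_text(sample))
```

Like the magic lines, the loop gives up when no byte above 0x1F is
found within the tested offsets.
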
The described Word document format is also used in a variant by some
Microsoft Write versions. According to the archived web site
http://msxnet.org/word2rtf/formats/write.txt examining the value of
word 48 of the header (byte offset 96) is a good way to distinguish
Write files from Microsoft Word files. If it equals 0, the file
originated in Word. Any other value identifies a Write file. This is
now expressed by the lines
 >>96	uleshort	=0		Word
 !:mime	application/msword
 !:apple	MSWDWDBN
 !:ext	doc/dcx
For Word documents the MIME type is "application/msword" and the file
name extension is "doc" on DOS systems and "dcx" on Unix systems,
according to TrID.
 >>96	uleshort	!0		Write 3.0 (Windows) Document
 !:mime	application/x-mswrite
 !:apple	MSWDWDBN
For Write documents the MIME type is "application/x-mswrite" and the
file name extension is in most cases "wri". But I also found 3
examples, such as splitter.doc and srchtest.doc, with the "doc"
extension. I do not know whether this is an accident.
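
As a small illustration of the Word/Write decision, the test on word 48
can be written in Python as follows. This is only a sketch: the function
name and the sample headers are hypothetical, and the value 224 is taken
from the page count reported for one of the Write samples.

```python
import struct


def classify_by_word_48(header: bytes) -> tuple:
    """Return (product, MIME type) from the little-endian word at offset 96."""
    (word48,) = struct.unpack_from("<H", header, 96)
    if word48 == 0:
        return ("Word", "application/msword")
    return ("Write", "application/x-mswrite")


# Hypothetical 128-byte headers: all zero for Word, word 48 set for Write.
word_header = bytearray(128)
write_header = bytearray(128)
struct.pack_into("<H", write_header, 96, 224)

print(classify_by_word_48(bytes(word_header)))
print(classify_by_word_48(bytes(write_header)))
```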

According to the reference, the 18 bytes at offset 6Eh are always 00h
in version 4.0, but from version 5.0 on they are used for an unknown
code. So the different DOS variants can be distinguished by the
additional lines
 >>>0x6E	ulequad		=0	1.0-4.0
 >>>0x6E	ulequad		!0	5.0-6.0
 >>>0x6E	ulequad		x	(DOS) Document

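The version test can be sketched in Python in the same way. Note that,
like the magic lines, this sketch looks only at the first 8 of the 18
bytes (a little-endian quad); the function name and sample data are
hypothetical.

```python
import struct


def dos_word_version_range(header: bytes) -> str:
    """Mirror the "ulequad" test on the 8 bytes starting at offset 0x6E."""
    (code,) = struct.unpack_from("<Q", header, 0x6E)
    return "1.0-4.0" if code == 0 else "5.0-6.0"


# Hypothetical headers: zero-filled, and one with a non-zero code byte.
old_header = bytearray(128)
new_header = bytearray(128)
new_header[0x70] = 0x01

print(dos_word_version_range(bytes(old_header)))
print(dos_word_version_range(bytes(new_header)))
```
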
To print such a document from DOS, additional style files like
NORMAL.STY and a DOS printer driver like HPLASMS are needed. This
information is also stored inside the document and is shown by lines
like
 >>0x1E	string		>0		\b, formatted by %-.66s
 >>0x62	string		>0		\b, %-.8s printer

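To illustrate how these two header fields can be read, here is a short
Python sketch. It assumes, as the format widths %-.66s and %-.8s
suggest, that the style file name occupies at most 66 bytes from offset
1Eh and the printer driver name at most 8 bytes from offset 62h, each
ended by a NUL byte if shorter. The helper name is hypothetical.

```python
def read_cstring(data: bytes, offset: int, limit: int) -> str:
    """Read at most `limit` bytes at `offset`, stopping at the first NUL."""
    raw = data[offset:offset + limit]
    return raw.split(b"\x00", 1)[0].decode("latin-1")


# Hypothetical header using the names mentioned in the text.
header = bytearray(128)
header[0x1E:0x1E + 10] = b"NORMAL.STY"
header[0x62:0x62 + 7] = b"HPLASMS"

style = read_cstring(bytes(header), 0x1E, 66)
printer = read_cstring(bytes(header), 0x62, 8)
if style:  # same idea as "string >0": only show non-empty names
    print("formatted by", style)
if printer:
    print(printer, "printer")
```
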
According to the reference, a block pointer to the optional file
manager information block is stored at offset 1Ch. It can be
displayed by a line like:
 >>0x1C	uleshort	x			\b, at 0x%x info block
Because the block size is 128 bytes, jump to the optional file
manager block by the line
 >>(0x1C.s*128)	uleshort x
Then test for the valid information start value 14h (or maybe 12h) by
the line
 >>>&-2		uleshort	=0x0014
Afterward show the appended ASCIIZ names, starting with the document
name, by a line like
 >>>>&0x12	string		x		%s
and finally the modification and creation dates, stored as MM/DD/YY, by
 >>>>>>>>>>&1	string		x		\b, %-.8s
 >>>>>>>>>>&9	string		x		created %-.8s

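The walk through the file manager information block can be sketched in
Python as below. This follows the offsets used by the magic lines (block
pointer at 1Ch, start value 14h, first name at 14h inside the block, six
NUL-terminated fields, then two 8-character dates). It is an
illustration only: the function name, the dictionary keys and the sample
layout are hypothetical, and the field values are taken from the MEMO.DOC
output shown in this mail.

```python
import struct

FIELDS = ("name", "author", "reviser", "keywords", "comment", "version")


def parse_info_block(data: bytes):
    """Return the info block fields as a dict, or None if the test fails."""
    if len(data) < 0x1E:
        return None
    (block_no,) = struct.unpack_from("<H", data, 0x1C)
    base = block_no * 128  # blocks are 128 bytes long
    if base + 2 > len(data):
        return None
    (start,) = struct.unpack_from("<H", data, base)
    if start != 0x14:  # same test as "uleshort =0x0014"
        return None
    pos = base + 0x14
    info = {}
    for key in FIELDS:
        end = data.find(b"\x00", pos)
        if end == -1:
            return None
        info[key] = data[pos:end].decode("latin-1")
        pos = end + 1  # skip the terminating NUL
    info["changed"] = data[pos:pos + 8].decode("latin-1")
    info["created"] = data[pos + 8:pos + 16].decode("latin-1")
    return info


# Hypothetical sample: info block placed in block 2 (byte offset 256).
doc = bytearray(512)
doc[0:4] = bytes.fromhex("31be0000")
struct.pack_into("<H", doc, 0x1C, 2)
struct.pack_into("<H", doc, 256, 0x14)
strings = (b"Sample heading for memo\x00Microsoft Corp.\x00jpf\x00"
           b"memo sample heading\x00Learning Word refers to this file.\x00"
           b"\x00" b"10/01/90" b"10/01/90")
doc[276:276 + len(strings)] = strings

print(parse_info_block(bytes(doc)))
```

A block pointer of 0 leads back to the start of the file, where the
start value test fails, so documents without such a block yield None in
this sketch.
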
After applying the above-mentioned modifications with the patch
file-5.37-msdos-doc_wri.diff, such old Microsoft documents are
identified and described more precisely, like:

	Microsoft Word 1.0-4.0 (DOS) Document,
	formatted by DOC.STY, HPLASPS printer,
	Dialog Accelerators   -----   Technical Information
	Microsoft Write 3.0 (Windows) Document,
	224 pages,
	MS-DOS 6.0 DoubleSpace Test - Final Summary.
	Microsoft Word 5.0-6.0 (DOS) Document
	Sample heading for memo, author Microsoft Corp., reviser jpf,
	keywords memo sample heading,
	comment Learning Word refers to this file.,
	version , 10/01/90 created 10/01/90, HPLASMS printer,
	9 blocks,
	Microsoft Write 3.0 (Windows) Document,
	formatted by RP.STY, 20552 pages, LASPS printer,
	Window Splitting in Opus -- \001
x-fmt-12-signature-id-614.wri:  data
x-fmt-274-signature-id-488.doc: data
x-fmt-275-signature-id-489.doc: data
x-fmt-276-signature-id-490.doc: data

I hope my diff file can be applied in a future version of the
file utility.

With best wishes
Jörg Jenderek

-------------- next part --------------
--- file-5.37/magic/Magdir/msdos.old	2019-04-19 00:42:27 +0000
+++ file-5.37/magic/Magdir/msdos	2019-06-01 22:29:06 +0000
@@ -657,4 +657,79 @@
-0	belong	0x31be0000			Microsoft Word Document
+# Update:	Joerg Jenderek
+# URL:		http://fileformats.archiveteam.org/wiki/DOC
+# Reference:	https://web.archive.org/web/20170206041048/
+#		http://www.msxnet.org/word2rtf/formats/ffh-dosword5
+# wIdent+dty
+0	belong	0x31be0000
+# skip droid skeleton like x-fmt-274-signature-id-488.doc
+>128	ubyte		>0  			Microsoft
+>>96	uleshort	=0			Word
 !:mime	application/msword
+!:apple	MSWDWDBN
+# DCX is used in the Unix version.
+!:ext	doc/dcx
+>>>0x6E	ulequad		=0			1.0-4.0
+>>>0x6E	ulequad		!0			5.0-6.0
+>>>0x6E	ulequad		x			(DOS) Document
+# https://web.archive.org/web/20130831064118/http://msxnet.org/word2rtf/formats/write.txt
+>>96	uleshort	!0			Write 3.0 (Windows) Document
+!:mime	application/x-mswrite
+!:apple	MSWDWDBN
+# sometimes also doc like in splitter.doc srchtest.doc
+!:ext	wri/doc
+# wTool must be 0125400 octal
+#>>4	uleshort	!0xAB00			\b, wTool %o
+# reserved; must be zero
+#>>6	ulelong		!0			\b, reserved %u
+# block pointer to the block containing optional file manager information
+#>>0x1C	uleshort	x			\b, at 0x%x info block
+# jump to File manager information block
+>>(0x1C.s*128)	uleshort x
+# test for valid information start; maybe also 0012h
+>>>&-2		uleshort	=0x0014
+# Document ASCIIZ name
+>>>>&0x12	string		x		%s
+# author name
+>>>>>&1		string		x		\b, author %s
+# reviser name
+>>>>>>&1	string		x		\b, reviser %s
+# keywords
+>>>>>>>&1	string		x		\b, keywords %s
+# comment
+>>>>>>>>&1	string		x		\b, comment %s
+# version number
+>>>>>>>>>&1	string		x		\b, version %s
+# date of last change MM/DD/YY
+>>>>>>>>>>&1	string		x		\b, %-.8s
+# creation date MM/DD/YY
+>>>>>>>>>>&9	string		x		created %-.8s
+# file name of print format like NORMAL.STY
+>>0x1E	string		>0			\b, formatted by %-.66s
+# count of pages in whole file for write variant; maybe some times wrong
+>>96	uleshort	>0			\b, %u pages
+# name of the printer driver like HPLASMS
+>>0x62	string		>0			\b, %-.8s printer
+# number of blocks used in the file; seems to be 0 for Word 4.0 and Write 3.0
+>>0x6A	uleshort	>0			\b, %u blocks
+# bit field for corrected text areas
+#>>0x6C	uleshort	x			\b, 0x%x bit field
+# text of document; some times start with 4 non printable characters like CR LF
+>>128	ubyte		x			\b,
+>>>128		ubyte	>0x1F
+>>>>128		string	x			%s
+>>>128		ubyte	<0x20
+>>>>129		ubyte	>0x1F
+>>>>>129	string	x			%s
+>>>>129		ubyte	<0x20
+>>>>>130	ubyte	>0x1F
+>>>>>>130	string	x			%s
+>>>>>130	ubyte	<0x20
+>>>>>>131	ubyte	>0x1F
+>>>>>>>131	string	x			%s
+>>>>>>131	ubyte	<0x20
+>>>>>>>132	ubyte	>0x1F
+>>>>>>>>132	string	x			%s
+>>>>>>>132	ubyte	<0x20
+>>>>>>>>133	ubyte	>0x1F
+>>>>>>>>>133	string	x			%s