[File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documents (*.doc *.wri)
Jörg Jenderek
joerg.jen.der.ek at gmx.net
Sat Jun 1 22:39:50 UTC 2019
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hello,
some days ago I inspected some old Microsoft documents with the file
name extensions doc and wri. I ran file version 5.37 on them. Such
documents and similar files are only described generically as
"Microsoft Word Document". The output looks like:
dlgaccf.doc: Microsoft Word Document
DOS6DS.WRI: Microsoft Word Document
MEMO.DOC: Microsoft Word Document
splitter.doc: Microsoft Word Document
x-fmt-12-signature-id-614.wri: Microsoft Word Document
x-fmt-274-signature-id-488.doc: Microsoft Word Document
x-fmt-275-signature-id-489.doc: Microsoft Word Document
x-fmt-276-signature-id-490.doc: Microsoft Word Document
Furthermore, with the --extension option ??? is displayed, and with
the --apple option UNKNUNKN is shown.
The file identification tool TrID ( http://mark0.net/soft-trid-e.html )
describes the inspected examples as "Microsoft Word for DOS Document"
and "Windows Write Document".
DROID, the program of the UK National Archives, describes such
examples as "Write for Windows Document 3.0" or "Microsoft Word for
MS-DOS Document" 4.0 or 5.5. See https://sourceforge.net/projects/droid
So I added more lines to Magdir/msdos. Some information is found on
the fileformats.archiveteam.org website, so I added a comment line like
# URL: http://fileformats.archiveteam.org/wiki/DOC
That page mentions a website about the old "MS-WORD FORMAT" which is
preserved at web.archive.org. So I use this:
# Reference: https://web.archive.org/web/20170206041048/
# http://www.msxnet.org/word2rtf/formats/ffh-dosword5
According to that site, such documents start with a characteristic
4-byte signature. This was expressed by the line
0 belong 0x31be0000 Microsoft Word Document
Unfortunately this is not unique enough. DROID test skeletons like
x-fmt-274-signature-id-488.doc are misidentified. So this line now
becomes
0 belong 0x31be0000
More magic lines are needed. For real documents the text content starts
at offset 128 (80h), whereas the skeleton examples contain nothing
at that place. So skip DROID skeletons like
x-fmt-274-signature-id-488.doc by the additional line
>128 ubyte >0 Microsoft
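To illustrate the two tests above outside of magic syntax, here is a minimal Python sketch. The function name and the sample byte strings are made up for this illustration; only the signature value and the offset 128 come from the magic lines.

```python
import struct

SIGNATURE = 0x31BE0000   # value tested by "0 belong 0x31be0000"
TEXT_OFFSET = 128        # text content starts at offset 80h


def looks_like_dos_word(data: bytes) -> bool:
    """Signature at offset 0 plus a non-zero byte at offset 128."""
    if len(data) <= TEXT_OFFSET:
        return False      # too short, like the DROID skeletons
    (magic,) = struct.unpack_from(">I", data, 0)   # belong = big-endian 32 bit
    return magic == SIGNATURE and data[TEXT_OFFSET] > 0


# Synthetic samples, not real documents.
skeleton = bytes.fromhex("31be0000")
document = bytes.fromhex("31be0000").ljust(128, b"\0") + b"MEMORANDUM"
print(looks_like_dos_word(skeleton), looks_like_dos_word(document))
```

A file that carries only the 4-byte signature is rejected here, which is the behaviour wanted for the skeleton files.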
At the end show this stored ASCII text. Sometimes it starts with 4
non-printable characters like carriage returns or line feeds. So first
test for a printable character. If this is true, print the string. If
it is not printable, jump to the next character and repeat the
procedure. This is expressed by magic lines starting like
>>128 ubyte x \b,
>>>128 ubyte >0x1F
>>>>128 string x %s
>>>128 ubyte <0x20
>>>>129 ubyte >0x1F
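The same "skip leading non-printable bytes" idea can be written as a loop. This is only a sketch: the function name, the limit of 5 skipped bytes and the choice to stop the text at the next control character are my assumptions, and the file utility may cut the printed string differently.

```python
def leading_text(data: bytes, start: int = 128, max_skip: int = 5) -> str:
    """Return the first printable run found at or shortly after `start`."""
    last = min(start + max_skip, len(data) - 1)
    for offset in range(start, last + 1):
        if data[offset] > 0x1F:            # same test as "ubyte >0x1F"
            end = offset
            # Assumption: the text ends at the next control character.
            while end < len(data) and data[end] > 0x1F:
                end += 1
            return data[offset:end].decode("latin-1")
    return ""                              # nothing printable close to start


# Synthetic sample: 128 header bytes, then CR LF CR LF before the text.
sample = bytes(128) + b"\r\n\r\nMEMORANDUM\r\n"
print(leading_text(sample))
```

The magic file has no loops, which is why the patch spells out every offset as a nested pair of tests.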
The described Word document format is also used in a variant by some
Microsoft Write versions. According to the archived web site
http://msxnet.org/word2rtf/formats/write.txt, examining the value of
word 48 of the header (byte offset 96) is a good way to distinguish
Write files from Microsoft Word files. If it equals 0, the file
originated in Word. Any other value identifies a Write file. This is
now expressed by the lines
>>96 uleshort =0 Word
!:mime application/msword
!:apple MSWDWDBN
!:ext doc/dcx
For Word documents the MIME type is "application/msword" and the file
name extension is "doc" on DOS systems and "dcx" on Unix systems
according to TrID.
>>96 uleshort !0 Write 3.0 (Windows) Document
!:mime application/x-mswrite
!:apple MSWDWDBN
For Write documents the MIME type is "application/x-mswrite" and the
file name extension is in most cases "wri". But I also found 3 examples
like splitter.doc or srchtest.doc with the "doc" extension. I do not
know if this is an accident.
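As a sketch of the Word/Write decision, the following Python function reads the little-endian 16-bit value at byte offset 96 (word 48) and returns the description and MIME type used above. The function name and the returned tuple are only chosen for this illustration.

```python
import struct


def classify_variant(header: bytes) -> tuple[str, str]:
    """Return (description, MIME type) based on word 48 of the header."""
    (word48,) = struct.unpack_from("<H", header, 96)   # uleshort at offset 96
    if word48 == 0:
        return ("Word", "application/msword")
    # Any non-zero value identifies a Write file.
    return ("Write", "application/x-mswrite")


# Synthetic 128-byte headers, not real documents.
word_header = bytearray(128)
write_header = bytearray(128)
write_header[96:98] = (224).to_bytes(2, "little")
print(classify_variant(bytes(word_header)), classify_variant(bytes(write_header)))
```

The file name extension is deliberately not used here, because examples like splitter.doc carry the "doc" extension but are Write files by this test.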
According to the reference, the 18 bytes at offset 6Eh are always 00h
in version 4.0, but from version 5.0 on they are used for unknown code.
So the different DOS variants can be distinguished by the additional
lines
>>>0x6E ulequad =0 1.0-4.0
>>>0x6E ulequad !0 5.0-6.0
>>>0x6E ulequad x (DOS) Document
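The version test can be sketched in Python as follows. Note that ulequad is a 64-bit value, so like the magic lines this only looks at the first 8 of the 18 bytes; the function name is hypothetical.

```python
import struct


def dos_word_version_range(header: bytes) -> str:
    """Guess the Word for DOS version range from the bytes at offset 6Eh."""
    (quad,) = struct.unpack_from("<Q", header, 0x6E)   # ulequad, 8 bytes
    return "1.0-4.0" if quad == 0 else "5.0-6.0"


# Synthetic headers: all zero, and one with a single non-zero byte at 6Eh.
old_header = bytes(128)
new_header = bytearray(128)
new_header[0x6E] = 0x01
print(dos_word_version_range(old_header), dos_word_version_range(bytes(new_header)))
```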
To print such a document from DOS, additional style files like
NORMAL.STY and a DOS printer driver like HPLASMS are needed. This
information is also stored inside the document and is shown by lines
like
>>0x1E string >0 \b, formatted by %-.66s
>>0x62 string >0 \b, %-.8s printer
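For illustration, the two name fields can be read like this in Python. The offsets and maximum lengths (66 and 8 bytes) are taken from the magic lines; the helper name and the latin-1 decoding are assumptions of this sketch, since the real documents may use a DOS code page.

```python
def read_field(header: bytes, offset: int, max_len: int) -> str:
    """Read a NUL-terminated name of at most max_len bytes."""
    raw = header[offset:offset + max_len]
    return raw.split(b"\0", 1)[0].decode("latin-1")


# Synthetic header with a style file name and a printer driver name.
header = bytearray(128)
header[0x1E:0x1E + 10] = b"NORMAL.STY"
header[0x62:0x62 + 7] = b"HPLASMS"

style = read_field(bytes(header), 0x1E, 66)    # like "%-.66s"
printer = read_field(bytes(header), 0x62, 8)   # like "%-.8s"
print(style, printer)
```

An empty result corresponds to the ">0" test failing, so nothing would be printed for that field.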
According to the reference, the block pointer to the optional file
manager information block is stored at offset 1Ch. This can be
displayed by a line like:
>>0x1C uleshort x \b, at 0x%x info block
Because the block size is 128 bytes, jump to the optional file manager
block by the line
>>(0x1C.s*128) uleshort x
Then test for the valid information start values 0014h or maybe 0012h
by the line
>>>&-2 uleshort =0x0014
Afterwards show the appended ASCIIZ names, starting with the document
name, by a line like
>>>>&0x12 string x %s
and finally the modification and creation dates, stored as MM/DD/YY, by
the lines
>>>>>>>>>>&1 string x \b, %-.8s
>>>>>>>>>>&9 string x created %-.8s
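The walk through the file manager information block can be sketched in Python as below. The layout is my reading of the magic lines in the patch (start value 0014h at the beginning of the block, first name at block offset 14h, six ASCIIZ strings, then two 8-character dates); the function and field names are hypothetical.

```python
import struct

BLOCK_SIZE = 128
FIELDS = ("name", "author", "reviser", "keywords", "comment", "version")


def parse_info_block(data: bytes):
    """Return the info block fields as a dict, or None if no valid block."""
    (block_no,) = struct.unpack_from("<H", data, 0x1C)   # block pointer
    base = block_no * BLOCK_SIZE
    if base + 2 > len(data):
        return None
    (start,) = struct.unpack_from("<H", data, base)
    if start != 0x0014:                                  # valid start value
        return None
    pos = base + 0x14                                    # first ASCIIZ name
    info = {}
    for field in FIELDS:
        end = data.find(b"\0", pos)
        if end == -1:
            return None
        info[field] = data[pos:end].decode("latin-1")
        pos = end + 1                                    # skip the NUL
    info["modified"] = data[pos:pos + 8].decode("latin-1")
    info["created"] = data[pos + 8:pos + 16].decode("latin-1")
    return info


# Synthetic document: header pointing to block 2, then the info block.
header = bytearray(256)
header[0:4] = bytes.fromhex("31be0000")
header[0x1C:0x1E] = (2).to_bytes(2, "little")
block = (0x14).to_bytes(2, "little") + bytes(0x12)
block += b"Sample\0Microsoft Corp.\0jpf\0memo\0note\0\0"
block += b"10/01/9010/01/90"
print(parse_info_block(bytes(header) + block))
```

Only the start value 0014h is accepted here, as in the patch; the possible value 0012h is not handled.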
After applying the above-mentioned modifications with the patch
file-5.37-msdos-doc_wri.diff, such old Microsoft documents are
identified and described more precisely, like:
dlgaccf.doc:
Microsoft Word 1.0-4.0 (DOS) Document,
formatted by DOC.STY, HPLASPS printer,
Dialog Accelerators ----- Technical Information
DOS6DS.WRI:
Microsoft Write 3.0 (Windows) Document,
224 pages,
MS-DOS 6.0 DoubleSpace Test - Final Summary.
MEMO.DOC:
Microsoft Word 5.0-6.0 (DOS) Document
Sample heading for memo, author Microsoft Corp., reviser jpf,
keywords memo sample heading,
comment Learning Word refers to this file.,
version , 10/01/90 created 10/01/90, HPLASMS printer,
9 blocks,
MEMORANDUM
splitter.doc:
Microsoft Write 3.0 (Windows) Document,
formatted by RP.STY, 20552 pages, LASPS printer,
Window Splitting in Opus -- \001
x-fmt-12-signature-id-614.wri: data
x-fmt-274-signature-id-488.doc: data
x-fmt-275-signature-id-489.doc: data
x-fmt-276-signature-id-490.doc: data
I hope my diff file can be applied in a future version of the
file utility.
With best wishes
Jörg Jenderek
- --
Jörg Jenderek
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCXPL+qwAKCRCv8rHJQhrU
1m+rAKCCzXs44S6V77proafvmQnIhXGSSwCfUKsoHvZAYrM1CdENE+6J1m1siZo=
=Of0g
-----END PGP SIGNATURE-----
-------------- next part --------------
--- file-5.37/magic/Magdir/msdos.old 2019-04-19 00:42:27 +0000
+++ file-5.37/magic/Magdir/msdos 2019-06-01 22:29:06 +0000
@@ -657,4 +657,79 @@
#
-0 belong 0x31be0000 Microsoft Word Document
+# Update: Joerg Jenderek
+# URL: http://fileformats.archiveteam.org/wiki/DOC
+# Reference: https://web.archive.org/web/20170206041048/
+# http://www.msxnet.org/word2rtf/formats/ffh-dosword5
+# wIdent+dty
+0 belong 0x31be0000
+# skip droid skeleton like x-fmt-274-signature-id-488.doc
+>128 ubyte >0 Microsoft
+>>96 uleshort =0 Word
!:mime application/msword
+!:apple MSWDWDBN
+# DCX is used in the Unix version.
+!:ext doc/dcx
+>>>0x6E ulequad =0 1.0-4.0
+>>>0x6E ulequad !0 5.0-6.0
+>>>0x6E ulequad x (DOS) Document
+# https://web.archive.org/web/20130831064118/http://msxnet.org/word2rtf/formats/write.txt
+>>96 uleshort !0 Write 3.0 (Windows) Document
+!:mime application/x-mswrite
+!:apple MSWDWDBN
+# sometimes also doc like in splitter.doc srchtest.doc
+!:ext wri/doc
+# wTool must be 0125400 octal
+#>>4 uleshort !0xAB00 \b, wTool %o
+# reserved; must be zero
+#>>6 ulelong !0 \b, reserved %u
+# block pointer to the block containing optional file manager information
+#>>0x1C uleshort x \b, at 0x%x info block
+# jump to File manager information block
+>>(0x1C.s*128) uleshort x
+# test for valid information start; maybe also 0012h
+>>>&-2 uleshort =0x0014
+# Document ASCIIZ name
+>>>>&0x12 string x %s
+# author name
+>>>>>&1 string x \b, author %s
+# reviser name
+>>>>>>&1 string x \b, reviser %s
+# keywords
+>>>>>>>&1 string x \b, keywords %s
+# comment
+>>>>>>>>&1 string x \b, comment %s
+# version number
+>>>>>>>>>&1 string x \b, version %s
+# date of last change MM/DD/YY
+>>>>>>>>>>&1 string x \b, %-.8s
+# creation date MM/DD/YY
+>>>>>>>>>>&9 string x created %-.8s
+# file name of print format like NORMAL.STY
+>>0x1E string >0 \b, formatted by %-.66s
+# count of pages in whole file for write variant; maybe some times wrong
+>>96 uleshort >0 \b, %u pages
+# name of the printer driver like HPLASMS
+>>0x62 string >0 \b, %-.8s printer
+# number of blocks used in the file; seems to be 0 for Word 4.0 and Write 3.0
+>>0x6A uleshort >0 \b, %u blocks
+# bit field for corrected text areas
+#>>0x6C uleshort x \b, 0x%x bit field
+# text of document; some times start with 4 non printable characters like CR LF
+>>128 ubyte x \b,
+>>>128 ubyte >0x1F
+>>>>128 string x %s
+>>>128 ubyte <0x20
+>>>>129 ubyte >0x1F
+>>>>>129 string x %s
+>>>>129 ubyte <0x20
+>>>>>130 ubyte >0x1F
+>>>>>>130 string x %s
+>>>>>130 ubyte <0x20
+>>>>>>131 ubyte >0x1F
+>>>>>>>131 string x %s
+>>>>>>131 ubyte <0x20
+>>>>>>>132 ubyte >0x1F
+>>>>>>>>132 string x %s
+>>>>>>>132 ubyte <0x20
+>>>>>>>>133 ubyte >0x1F
+>>>>>>>>>133 string x %s
#
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.37-msdos-doc_wri.diff.sig
Type: application/octet-stream
Size: 95 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20190602/61549f03/attachment.obj>