[File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documents (*.doc *.wri)
Christos Zoulas
christos at zoulas.com
Sun Jun 2 15:26:14 UTC 2019
On Jun 2, 12:39am, joerg.jen.der.ek at gmx.net (Jörg Jenderek) wrote:
-- Subject: [File] [PATCH] of Magdir/msdos for old Microsoft DOS Word documen
| Hello,
| A few days ago I inspected some old Microsoft documents with the file
| extensions doc and wri, running version 5.37 on them. Such
| documents and similar files are only described generically as "Microsoft
| Word Document". The output looks like:
|
| dlgaccf.doc: Microsoft Word Document
| DOS6DS.WRI: Microsoft Word Document
| MEMO.DOC: Microsoft Word Document
| splitter.doc: Microsoft Word Document
| x-fmt-12-signature-id-614.wri: Microsoft Word Document
| x-fmt-274-signature-id-488.doc: Microsoft Word Document
| x-fmt-275-signature-id-489.doc: Microsoft Word Document
| x-fmt-276-signature-id-490.doc: Microsoft Word Document
| Furthermore, with the --extension option ??? is displayed, and with
| the --apple option UNKNUNKN is shown.
|
| The file identifying tool TrID ( http://mark0.net/soft-trid-e.html )
| describes inspected examples as "Microsoft Word for DOS Document"
| and "Windows Write Document"
|
| Droid, the UK National Archives' program, describes such
| examples as "Write for Windows Document 3.0" or "Microsoft Word for
| MS-DOS Document" 4.0 or 5.5. See https://sourceforge.net/projects/droid
|
| So I added more lines to Magdir/msdos. Some information can be found on
| the fileformats.archiveteam.org website, so I added a comment line like
| # URL: http://fileformats.archiveteam.org/wiki/DOC
| That page mentions a website about the old "MS-WORD FORMAT", preserved
| at web.archive.org. So I use this:
| # Reference: https://web.archive.org/web/20170206041048/
| # http://www.msxnet.org/word2rtf/formats/ffh-dosword5
|
| According to that site, such documents start with a characteristic
| 4-byte signature. This was expressed by the line
| 0 belong 0x31be0000 Microsoft Word Document
| Unfortunately this is not unique enough. Droid test skeletons like
| x-fmt-274-signature-id-488.doc are misidentified. So this line now
| becomes
| 0 belong 0x31be0000
| More magic lines are needed. In real documents the text content starts
| at offset 128 (80h), whereas the skeleton examples contain nothing
| at that place. So skip Droid skeletons like
| x-fmt-274-signature-id-488.doc with the additional line
| >128 ubyte >0 Microsoft
|
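The two checks above (the 4-byte signature, then a non-zero byte at offset 128) can be sketched outside of magic syntax. This is only an illustration in Python and not part of the patch; the function name looks_like_dos_word and the constant names are hypothetical.

```python
import struct

DOS_WORD_SIGNATURE = 0x31BE0000  # value tested by "0 belong 0x31be0000"
TEXT_OFFSET = 128                # 80h, where real documents start their text


def looks_like_dos_word(data: bytes) -> bool:
    """Hypothetical helper mirroring the two magic tests quoted above."""
    if len(data) <= TEXT_OFFSET:
        return False  # too short to hold a header plus any text
    # "belong" is a big-endian 32-bit value.
    (signature,) = struct.unpack_from(">I", data, 0)
    # ">128 ubyte >0": skeleton files have nothing at the text offset.
    return signature == DOS_WORD_SIGNATURE and data[TEXT_OFFSET] > 0
```

A skeleton that carries only the signature followed by zero bytes fails the second test, which is the behaviour the additional magic line is meant to give.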
| At the end, show the stored ASCII text. Sometimes it starts with 4
| non-printable characters like carriage returns or line feeds. So first
| test for a printable character. If it is printable, print the string.
| If it is not, jump to the next character and repeat the procedure. This
| is expressed by magic lines starting like
| >>128 ubyte x \b,
| >>>128 ubyte >0x1F
| >>>>128 string x %s
| >>>128 ubyte <0x20
| >>>>129 ubyte >0x1F
|
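The skip-then-print procedure above can be written as a loop. The following Python sketch is an illustration only: the name leading_text, the limit of 4 skipped bytes (taken from the "4 non-printable characters" remark), the stop at the first control byte, and the cp437 decoding are all assumptions made for the example, not details taken from the patch.

```python
TEXT_OFFSET = 128  # 80h, start of the text area
MAX_SKIP = 4       # assumption: at most 4 leading non-printable bytes


def leading_text(data: bytes, max_len: int = 64) -> str:
    """Hypothetical helper: skip leading control bytes, then return the text."""
    pos = TEXT_OFFSET
    skipped = 0
    # Mirror "ubyte <0x20": step over a non-printable byte and test the next.
    while pos < len(data) and data[pos] < 0x20 and skipped < MAX_SKIP:
        pos += 1
        skipped += 1
    end = pos
    # Mirror "ubyte >0x1F": collect bytes until a control byte or NUL appears.
    while end < len(data) and end - pos < max_len and data[end] >= 0x20:
        end += 1
    # cp437 is an assumed DOS code page; it maps every byte value.
    return data[pos:end].decode("cp437")
```

The magic lines unroll the same idea with one nesting level per skipped character, since magic syntax has no loops.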
| The described Word document format is also used in a variant by some
| Microsoft Write versions. According to the archived web site
| http://msxnet.org/word2rtf/formats/write.txt, examining the value of
| word 48 of the header is a good way to distinguish Write files from
| Microsoft Word files. If it equals 0, the file originated in Word;
| any other value identifies a Write file. This is now expressed by the lines
| >>96 uleshort =0 Word
| !:mime application/msword
| !:apple MSWDWDBN
| !:ext doc/dcx
| For Word documents the mime type is "application/msword" and the file
| name extension is "doc" on DOS systems and "dcx" on Unix systems,
| according to TrID.
| >>96 uleshort !0 Write 3.0 (Windows) Document
| !:mime application/x-mswrite
| !:apple MSWDWDBN
| For Write documents the mime type is "application/x-mswrite" and the file
| name extension is in most cases "wri". But I also found 3 examples
| like splitter.doc or srchtest.doc with the "doc" extension. I do not
| know if this is an accident.
|
| According to the reference, the 18 bytes at offset 6Eh are always 00h
| in version 4.0, but from version 5.0 on they are used for unknown code.
| So the different DOS variants can be distinguished by the additional lines
| >>>0x6E ulequad =0 1.0-4.0
| >>>0x6E ulequad !0 5.0-6.0
| >>>0x6E ulequad x (DOS) Document
|
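The Word/Write decision and the DOS version guess can be combined into one small function. This Python sketch is an illustration, not the patch itself; the name classify is hypothetical, and like the ulequad test it looks only at the first 8 of the 18 bytes at offset 6Eh.

```python
import struct


def classify(header: bytes) -> str:
    """Hypothetical helper mirroring the offset 96 and offset 0x6E tests."""
    if len(header) < 0x6E + 8:
        raise ValueError("header too short")
    # Word 48 of the header is at byte offset 96 (little-endian 16-bit).
    (word48,) = struct.unpack_from("<H", header, 96)
    if word48 != 0:
        return "Write 3.0 (Windows) Document"
    # Same width as "ulequad": 8 bytes starting at offset 0x6E.
    (reserved,) = struct.unpack_from("<Q", header, 0x6E)
    version = "1.0-4.0" if reserved == 0 else "5.0-6.0"
    return f"Word {version} (DOS) Document"
```

The function expects a buffer that has already passed the signature test; it does not check the signature again.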
| To print such a document from DOS, additional style files like
| NORMAL.STY and a DOS printer driver like HPLASMS are needed. This
| information is also stored inside the document and is shown by lines like
| >>0x1E string >0 \b, formated by %-.66s
| >>0x62 string >0 \b, %-.8s printer
|
| According to the reference, the block pointer to the optional file manager
| information block is stored at offset 1Ch. This can be displayed by
| a line like:
| >>0x1C uleshort x \b, at 0x%x info block
| Because the block size is 128 bytes, jump to the optional file manager
| block with the line
| >>(0x1C.s*128) uleshort x
| Then test for the valid information start values 14 or maybe 12 with the line
| >>> &-2 uleshort =0x0014
| Afterward show the appended ASCIIZ names, starting with the document name,
| with a line like
| >>>>&0x12 string x %s
| and finally the modification and creation dates, stored as MM/DD/YY, with
| the lines
| >>>>>>>>>>&1 string x \b, %-.8s
| >>>>>>>>>>&9 string x created %-.8s
|
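The offset arithmetic behind these relative magic lines is easy to get wrong, so here it is spelled out as a Python sketch. It is an illustration only: the name document_name is hypothetical, only the start value 0x0014 is accepted, and cp437 is an assumed DOS code page. The name offset follows from the quoted lines: the block starts at pointer*128, the 2-byte start value is read there, and "&0x12" counts from the end of that value, giving block start + 2 + 0x12 = block start + 0x14.

```python
import struct

BLOCK_SIZE = 128  # block size stated in the mail


def document_name(data: bytes):
    """Hypothetical helper: return the first ASCIIZ name, or None if absent."""
    if len(data) < 0x1E:
        return None
    # "(0x1C.s*128)": little-endian 16-bit block pointer at offset 1Ch.
    (block_no,) = struct.unpack_from("<H", data, 0x1C)
    start = block_no * BLOCK_SIZE
    if start + 0x14 > len(data):
        return None  # pointer leads past the end of the file
    # "&-2 uleshort =0x0014": re-read the value at the start of the block.
    (marker,) = struct.unpack_from("<H", data, start)
    if marker != 0x0014:
        return None
    name_start = start + 2 + 0x12  # "&0x12" relative to the end of the marker
    name_end = data.find(b"\x00", name_start)
    if name_end == -1:
        name_end = len(data)
    return data[name_start:name_end].decode("cp437")
```

Further names and the two dates would follow the document name; they are left out here because their exact layout is not spelled out in the mail.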
| After applying the above-mentioned modifications with the patch
| file-5.37-msdos-doc_wri.diff, such old Microsoft documents are
| identified and described more precisely, like:
|
| dlgaccf.doc:
| Microsoft Word 1.0-4.0 (DOS) Document,
| formated by DOC.STY, HPLASPS printer,
| Dialog Accelerators ----- Technical Information
| DOS6DS.WRI:
| Microsoft Write 3.0 (Windows) Document,
| 224 pages,
| MS-DOS 6.0 DoubleSpace Test - Final Summary.
| MEMO.DOC:
| Microsoft Word 5.0-6.0 (DOS) Document
| Sample heading for memo, author Microsoft Corp., reviser jpf,
| keywords memo sample heading,
| comment Learning Word refers to this file.,
| version , 10/01/90 created 10/01/90, HPLASMS printer,
| 9 blocks,
| MEMORANDUM
| splitter.doc:
| Microsoft Write 3.0 (Windows) Document,
| formated by RP.STY, 20552 pages, LASPS printer,
| Window Splitting in Opus -- \001
| x-fmt-12-signature-id-614.wri: data
| x-fmt-274-signature-id-488.doc: data
| x-fmt-275-signature-id-489.doc: data
| x-fmt-276-signature-id-490.doc: data
|
| I hope my diff file can be applied in a future version of the
| file utility.
Thanks, committed!
christos