[File] [PATCH] Magdir/ole2compounddocs for "older" Microsoft Publisher

Jörg Jenderek (GMX) joerg.jen.der.ek at gmx.net
Thu May 23 15:02:38 UTC 2024


Hello,

A few days ago I had to handle an old CD-ROM. It contains some older
Microsoft Publisher files with the file name suffix pub.

When running file command version 5.45 on such "older" and "newer"
PUB samples, I get output like:

MSPUB.PUB:                    Composite Document File V2 Document
			      , Little Endian, Os 0, Version:
			      3.10
MSPublisher95.PUB:            Composite Document File V2 Document
			      , Cannot read section info
MSPublisherv2.PUB:            Composite Document File V2 Document
			      , Little Endian, Os 0, Version:
			      3.10
MSPublisher2013-Sample.pub:   Composite Document File V2 Document
			      , Little Endian, Os: Windows, Version
			      6.1, Code page: 1252, Author: Windows User
MSPublisher97.pub:            Composite Document File V2 Document
			      , Little Endian, Os 0, Version:
			      4.90
PublisherMuster_quer2000.pub: Composite Document File V2 Document
			      , Little Endian, Os: Windows, Version
			      10.0, Code page: 1252, Author: Jenderek
PublisherMuster_quer98.pub:   Composite Document File V2 Document
			      , Little Endian, Os: Windows, Version
			      10.0, Code page: 1252, Author: Jenderek

When running file command version 5.45 with the -e cdf option on such
samples, I get output like:

MSPUB.PUB:                    OLE 2 Compound Document, v3.59, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft
MSPublisher95.PUB:            OLE 2 Compound Document, v3.62, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft
MSPublisherv2.PUB:            OLE 2 Compound Document, v3.31, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft
MSPublisher2013-Sample.pub:   OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher
MSPublisher97.pub:            OLE 2 Compound Document, v3.62, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft Publisher
PublisherMuster_quer2000.pub: OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher
PublisherMuster_quer98.pub:   OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher


With option --extension, only the 3 byte sequence ??? is shown for
unrecognized samples, whereas pub is shown for recognized samples. With
option -i, only the generic application/octet-stream is shown for
unrecognized samples, whereas application/vnd.ms-publisher is shown for
known samples.

For comparison I ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). It also identifies
all examples with low priority as "Generic OLE2 / Multistream
Compound" with mime type application/x-ole-storage by docfile.trid.xml.
The examples are described with high priority as "Microsoft Publisher
document" by pub.trid.xml without a mime type and with 1 possible file
name extension (pub; see appended trid-v-pub.txt.gz).

For comparison I also ran the file format identification
utility DROID (see https://sourceforge.net/projects/droid/). It
identifies all PUB samples as "Microsoft Publisher" with mime type
application/x-mspublisher. It also does a sub classification:
version "2.0" is identified by PUID x-fmt/252, "95" by x-fmt/253,
"97" by x-fmt/254, "98" by x-fmt/255, "2000" by x-fmt/256 and
"2013" by fmt/1515.

On Linux, according to the shared MIME-info database, such samples are
called "Microsoft Publisher document". Here application/vnd.ms-publisher
is used as mime type and the file name suffix pub is also shown. The
samples are recognized by looking for the byte sequence
\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1 at the beginning. That is
characteristic of all Compound files. It also looks for the byte
sequence
\x01\x12\x02\x00\x00\x00\x00\x00\x00\xc0\x00\x00\x00\x00\x00\x46 at
offset range 592-8192.
That characteristic is also used by the current pattern inside
Magdir/ole2compounddocs. That information can be seen in the source
freedesktop.org.xml.in, found for example on gitlab.freedesktop.org.
Luckily, with the information given by the other tools, I also found a
page about Microsoft Publisher on the file formats archive team web
site. Links to downloadable samples are also listed there. This
information is expressed by comment lines inside
Magdir/ole2compounddocs like:
# URL:	http://fileformats.archiveteam.org/wiki/Microsoft_Publisher
# Ref.:	http://mark0.net/download/triddefs_xml.7z
#	defs/p/pub.trid.xml

The Publisher files are recognized as "OLE 2 Compound Document"
by the starting bytes (\320\317\021\340\241\261\032\341) at the
beginning inside Magdir/ole2compounddocs. There already exists a code
fragment that does sub class identification. So most examples are
described correctly as "Microsoft Publisher" by the branch with the
clsid. That looked like:
  >>88 	ubequad	0x00c0000000000046	: Microsoft
  >>>80 	ubequad	0x0112020000000000	Publisher
  !:mime	application/vnd.ms-publisher
  !:ext	pub
Looking at samples of different versions, this only applies to the
"newer" versions in the range 97-2013. All these samples have the
"version" string MSPublisher.3 inside the CompObj stream. So for me this
is version range 3.0-11.0. There is a discrepancy with the documentation
(what is the version?). I also do not know how to distinguish the
Publisher versions as DROID does. I also do not know about the newer
versions that are part of Office 365. So the above lines become:
 >>88 	ubequad	0x00c0000000000046	: Microsoft
 >>>80 	ubequad	0x0112020000000000	Publisher 97-2013 (3.0-11.0)

For the "older" versions the clsid differs in 1 byte. So the
first test line is also true but the second is not. So only the phrase
": Microsoft" is shown at the end. All these samples have the "version"
string MSPublisher.2 inside the CompObj stream. So these variants are
handled by an additional second test that starts like:
 >>>80 	ubequad	0x0012020000000000	Publisher 95 (2.0)
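
The logic of the two branches can be illustrated by a small Python
sketch. It is not part of the patch. The names are hypothetical, and it
assumes that the caller passes a buffer to which the offsets 80 and 88
of the test lines apply; locating that buffer inside a real file is left
to the surrounding magic rules, which are not reproduced here.

```python
import struct
from typing import Optional

# Value of the first test line (ubequad at offset 88).
CLSID_SECOND_HALF = 0x00C0000000000046

# Values of the second test lines (ubequad at offset 80).
PUBLISHER_BRANCHES = {
    0x0012020000000000: "Publisher 95 (2.0)",
    0x0112020000000000: "Publisher 97-2013 (3.0-11.0)",
}


def describe_clsid(buf: bytes) -> Optional[str]:
    """Mimic the nested tests on a buffer (hypothetical helper)."""
    if len(buf) < 96:
        return None
    # ubequad means unsigned big endian 64 bit, so ">Q" is used.
    first_half, second_half = struct.unpack_from(">QQ", buf, 80)
    if second_half != CLSID_SECOND_HALF:
        return None
    branch = PUBLISHER_BRANCHES.get(first_half)
    # Without a matching second test only "Microsoft" is described.
    return "Microsoft" if branch is None else "Microsoft " + branch
```

The 1 byte difference is visible in the two dictionary keys: only the
leading byte changes from 0x01 to 0x00.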

After applying the above mentioned modifications with the patch
file-ole2compounddocs-pub.diff, all my inspected Microsoft
Publisher examples are recognized. With the -e cdf option this now
looks like:

MSPUB.PUB:                    OLE 2 Compound Document, v3.59, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft Publisher 95 (2.0)
MSPublisher95.PUB:            OLE 2 Compound Document, v3.62, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft Publisher 95 (2.0)
MSPublisherv2.PUB:            OLE 2 Compound Document, v3.31, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft Publisher 95 (2.0)
MSPublisher2013-Sample.pub:   OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher 97-2013 (3.0-11.0)
MSPublisher97.pub:            OLE 2 Compound Document, v3.62, SecID 0x1
			      , Mini FAT start sector 0x2
			      : Microsoft Publisher 97-2013 (3.0-11.0)
PublisherMuster_quer2000.pub: OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher 97-2013 (3.0-11.0)
PublisherMuster_quer98.pub:   OLE 2 Compound Document, v3.62, SecID 0x1
			      , 2 FAT sectors, Mini FAT start sector 0x4
			      : Microsoft Publisher 97-2013 (3.0-11.0)

I hope my diff file can be applied in a future version of the file
utility.

With best wishes,
Jörg Jenderek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-pub.txt.gz
Type: application/x-gzip
Size: 578 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240523/cd24a50a/attachment.bin>
-------------- next part --------------
--- file-master/magic/Magdir/ole2compounddocs.old	2023-09-11 17:43:30.718795400 +0200
+++ file-master/magic/Magdir/ole2compounddocs	2024-05-22 15:39:51.063907900 +0200
@@ -500,16 +500,23 @@
 #!:mime	application/x-ole-storage
 # https://www.iana.org/assignments/media-types/application/vnd.ms-works
 !:mime	application/vnd.ms-works
 # https://extension.nirsoft.net/wsb
 # like: wsbsamp.wsb WORKS2003_CD:\MSWorks\Common\Sammlung.wsb
 !:ext	wsb
-#??
-# URL:	http://fileformats.archiveteam.org/wiki/Microsoft_Publisher
+#
+# Update:	Joerg Jenderek
+# URL:		http://fileformats.archiveteam.org/wiki/Microsoft_Publisher
+# Reference:	http://mark0.net/download/triddefs_xml.7z/defs/p/pub.trid.xml
+# Note:		called like "Microsoft Publisher document" by TrID
+#		"version" string MSPublisher.2 inside CompObj stream
 >>88 	ubequad		0x00c0000000000046	: Microsoft
->>>80 	ubequad		0x0112020000000000	Publisher
+>>>80 	ubequad		0x0012020000000000	Publisher 95 (2.0)
+!:mime	application/vnd.ms-publisher
+!:ext	pub
+>>>80 	ubequad		0x0112020000000000	Publisher 97-2013 (3.0-11.0)
 !:mime	application/vnd.ms-publisher
 !:ext	pub
 #
 # URL:	http://fileformats.archiveteam.org/wiki/PPT
 #??
 >>88 	ubequad		0xa90300aa00510ea3	: Microsoft
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-ole2compounddocs-pub.diff.sig
Type: application/octet-stream
Size: 771 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20240523/cd24a50a/attachment.obj>

