[File] [Patch] Magdir/rtf for Microsoft Pocket Word *.pwd *.psw + urtf variant

Jörg Jenderek joerg.jen.der.ek at gmx.net
Sat May 16 12:27:03 UTC 2020


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

some days ago I used SoftMaker Office. Its text module can read/write
Microsoft Pocket Word documents with the file name extensions pwd and
psw. When running file command version 5.38 on such documents, some
related RTF documents and some test files, I get output like:

fmt-355-signature-id-522.rtf: Rich Text Format data,
			      unknown version
fmt-45-signature-id-30.rtf:   Rich Text Format data,
			      unknown version
fmt-50-signature-id-158.rtf:  Rich Text Format data,
			      unknown version
fmt-52-signature-id-26.rtf:   Rich Text Format data,
			      unknown version
fmt-53-signature-id-523.rtf:  Rich Text Format data,
			      unknown version
fdo78502.rtf:                 Rich Text Format data,
			      unknown version
fdo85889-pca.rtf:             Rich Text Format data,
			      version 1,
			      IBM PC, code page 437
			      IBM PS/2, code page 850
fdo85889-pc.rtf:              Rich Text Format data,
			      version 1, IBM PC, code page 437
footer-para.rtf:              Rich Text Format data,
			      version 1, unknown character set
License-Enterprise.rtf:       Rich Text Format data,
			      version 1, unknown character set
Readme-0.72-Persian.rtf:      Rich Text Format data,
			      version 1, unknown character set
RTF-Spec-1.7.rtf:             Rich Text Format data,
			      version 1, Apple Macintosh
urtf-sample.rtf:              ASCII text, with CRLF line terminators
PocketWord-HandheldPC.pwd:    ASCII text, with CRLF line terminators
PocketWord-PocketPC.psw:      ASCII text, with CRLF line terminators
test-v2.pwd:                  ASCII text, with no line terminators


Furthermore, with the --extension option only ??? is shown, and with
the --apple option UNKNUNKN is displayed.

Some information can be found on the Rich Text Format page on
Wikipedia. That is expressed by an additional comment line with the URL:
 # URL:		https://en.wikipedia.org/wiki/Rich_Text_Format
More details can be found in the Rich Text Format (RTF) Specification,
available for example as the older version 1.7. That is now expressed
by an additional comment line like:
 # Reference:	http://www.snake.net/software/RTF/RTF-Spec-1.7.rtf

Inside Magdir/rtf the first test lines for RTF look like
 0	string		{\\rtf		Rich Text Format data,
 !:mime	text/rtf
Afterwards I now also show the Apple type and file name extension. I
moved the test lines showing version and character set into a
subroutine, so that I can reuse them for file formats related to Rich
Text Format. This now looks like
 >0	use		rtf-info
 0	name		rtf-info

In the current magic file the RTF version is shown by lines like
 >5	string		1		version 1,
 >5	default		x		unknown version
For most documents the version is 1, but the byte at that position is
"2" for newer Pocket Word documents like test-v2.pwd, a space character
in the LibreOffice test document fdo78502.rtf, and an opening brace in
some urtf samples. To show every value except the brace, this now
becomes
 >5	ubyte		!0x7b		\b, version %c

The DROID test signatures fmt-*.rtf are misidentified as Rich Text
Format. I skip these signatures by checking for a valid RTF version
with additional test lines, so this now becomes
 0	string		{\\rtf
 >5	ubyte		!0xAB
 >>5	ubyte		!0x5C		Rich Text Format data
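
As an illustration only (this is not part of the patch), the three
nested tests can be sketched in Python. The function name
looks_like_rtf is made up for this sketch, and it only approximates
how libmagic evaluates the lines above:

```python
def looks_like_rtf(data: bytes) -> bool:
    """Rough equivalent of the three nested magic tests (illustrative only)."""
    # 0   string  {\\rtf
    if not data.startswith(b"{\\rtf"):
        return False
    # A byte test at offset 5 cannot succeed if the file ends before it.
    if len(data) < 6:
        return False
    # >5  ubyte !0xAB   and   >>5  ubyte !0x5C
    return data[5] not in (0xAB, 0x5C)
```

With this check a header such as {\rtf1 or {\rtf followed by a space
is accepted, while {\rtf followed by byte 0xAB or by a backslash is
rejected.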

The code page information was shown for version 1 by the lines
 >>6	string		\\ansi		ANSI
 >>6	string		\\mac		Apple Macintosh
 >>6	string		\\pc		IBM PC, code page 437
 >>6	string		\\pca		IBM PS/2, code page 850
 >>6	default		x		unknown character set
This now becomes
 >6	string		\\mac		\b, Apple Macintosh
 >6	string		\\pc
 >>9	ubyte		=0x61		\b, IBM PS/2, code page 850
 >>9	ubyte		!0x61		\b, IBM PC, code page 437
 >6	search/502	\\ansi		\b, ANSI
 >6	default		x		\b, unknown character set


So it now distinguishes between the "pc" and "pca" code page variants.
For the LibreOffice example fdo85889-pca.rtf only the one code page
phrase "IBM PS/2, code page 850" is now shown, without the wrong
additional second phrase "IBM PC, code page 437".
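
The reason for the extra byte test is that a plain string match on
\pc is also a prefix match on \pca. A small Python sketch of the new
logic, given as an illustration only (the function name pc_charset
and the sample headers in the usage note are hypothetical):

```python
from typing import Optional


def pc_charset(data: bytes) -> Optional[str]:
    """Tell \\pc from \\pca by looking at the byte at offset 9."""
    # >6  string  \\pc   (also true for \pca, hence the second test)
    if data[6:9] != b"\\pc":
        return None
    # The byte tests at offset 9 need that byte to exist.
    if len(data) < 10:
        return None
    # >>9  ubyte  =0x61  -> the control word is \pca
    if data[9] == 0x61:
        return "IBM PS/2, code page 850"
    # >>9  ubyte  !0x61  -> plain \pc
    return "IBM PC, code page 437"
```

For a header like {\rtf1\pca\deff0 this gives the PS/2 phrase only,
and for {\rtf1\pc\deff0 the IBM PC phrase only.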

According to specification version 1.9, after specifying the RTF
version you must declare the default character set used in the
document unless it is \ansi (the default). The control word for the
character set must precede any plain text or any table control words.
But I found examples like Readme-0.72-Persian.rtf where "\ansi"
appears later, after other control words like \adeflang1025, \info,
\title, \author, \category or \manager, as in the example
"Burow, Steffanie - Im Tal des Schneeleoparden.rtf"

The explicit code page number is often stored after the keyword
\ansicpg. So I now also look for that keyword and display a valid code
page number (not 0 as in the example fdo78502.rtf) in the range
from 437 for United States IBM to 57011 for Punjabi with the lines
 >5	search/500	\\ansicpg
 >>&0	ubyte		!0x30		\b, code page
 >>>&-1		string	x		%-.3s
 >>>&2		ubyte	>0x2F
 >>>>&-1	ubyte	<0x3A		\b%c
 >>>>>&0	ubyte	>0x2F
 >>>>>>&-1	ubyte	<0x3A		\b%c
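
The chain of relative offsets is easier to follow when written out
step by step. The following Python sketch is an illustration only and
not part of the patch; the function name ansicpg_code_page is made up,
and the search window only approximates what search/500 does:

```python
from typing import Optional


def ansicpg_code_page(data: bytes) -> Optional[str]:
    """Approximate the \\ansicpg lines: 3 characters plus up to 2 digits."""
    key = b"\\ansicpg"
    # >5  search/500  \\ansicpg   (window bounds are an approximation)
    start = data.find(key, 5, 5 + 500 + len(key))
    if start < 0:
        return None
    pos = start + len(key)
    first = data[pos:pos + 1]
    # >>&0  ubyte  !0x30   skip a missing value or a leading "0"
    if not first or first == b"0":
        return None
    # >>>&-1  string  x  %-.3s   the first three characters, unchecked
    out = data[pos:pos + 3]
    # Optional 4th and 5th digit, each only if it is in "0".."9".
    for i in (3, 4):
        ch = data[pos + i:pos + i + 1]
        if ch and b"0" <= ch <= b"9":
            out += ch
        else:
            break
    return out.decode("ascii", "replace")
```

So \ansicpg936 followed by a backslash yields 936, \ansicpg1252
yields 1252, \ansicpg10000 yields 10000, and \ansicpg0 yields
nothing.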

In the same manner I look for language IDs (LCID) stored after the
keywords adeflang or deflang and display this information with lines
like:
 >>6	search/497 \\adeflang	\b, default middle east language ID
 >>>&0	string	x		%.4s
 >>>&4	ubyte	>0x2F
 >>>>&-1 ubyte	<0x3A		\b%c
 >>6	default	x
 >>>6	search/505 \\deflang
 >>>>&0	string	>0		\b, default language ID %-.4s
 >>>>&4	ubyte	>0x2F
 >>>>>&-1 ubyte	<0x3A		\b%c
So for examples like Readme-0.72-Persian.rtf the language ID actually
used, such as 1065 for Persian, is shown.
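
The "adeflang first, otherwise deflang" fallback can be sketched in
Python as follows. This is an illustration only, not part of the
patch; the function name default_language is made up, the search
windows are approximations, and the sample headers in the usage note
are hypothetical:

```python
from typing import Optional, Tuple


def default_language(data: bytes) -> Optional[Tuple[str, str]]:
    """Return (label, language ID), preferring \\adeflang over \\deflang."""
    candidates = (
        (b"\\adeflang", "default middle east language ID", 497),
        (b"\\deflang", "default language ID", 505),
    )
    for key, label, window in candidates:
        # search starts at offset 6; "\deflang" cannot match inside
        # "\adeflang" because the backslash is followed by "a" there.
        start = data.find(key, 6, 6 + window + len(key))
        if start < 0:
            continue
        pos = start + len(key)
        ident = data[pos:pos + 4]        # %.4s
        if not ident:
            continue
        fifth = data[pos + 4:pos + 5]    # possible 5th digit
        if fifth and b"0" <= fifth <= b"9":
            ident += fifth
        return label, ident.decode("ascii", "replace")
    return None
```

For a header containing both \adeflang1065 and \deflang1033 only the
first is reported, and a header with only \deflang1033 falls through
to the second branch.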

According to the documentation, some variants encoded in the Universal
Character Set start with a special control word. This is now handled
by additional lines like
 0	string	{\\urtf		Rich Text Format unicoded data
 !:mime	text/rtf
 !:ext	rtf
 >1	use		rtf-info
Unfortunately I found only a few samples like urtf-sample.rtf on some
Chinese or other Asian web sites. So this variant is not very well
tested.

According to the documentation, Microsoft Pocket Word documents start
with their own control word. So these examples are now described by
lines starting with
 0	string		{\\pwd	Pocket Word document or template
SoftMaker Office registers the same MIME type application/msword as
for Microsoft Word documents. But some sites like reposcope.com and
the TrID identification tool use a different one, so I use that type
as well. This is expressed by the additional line:
 !:mime	application/x-pocket-word
The PWD extension is used for the Handheld PC variant, the PSW
extension for the Pocket PC variant and the PWT extension for
templates. I do not know the exact differences, but the extensions
are displayed by the line:
 !:ext	pwd/psw/pwt

There is also a Pocket Word document type with the pwi extension,
starting with the phrase {\pwi and called "InkWriter" or "Note Taker",
but that format is not related to RTF.
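
Taken together, the three top-level entries dispatch on the leading
control word. A simplified Python sketch, given as an illustration
only (the names PREFIXES and classify are made up, and the extra
version-byte checks for {\rtf are left out here):

```python
from typing import Optional, Tuple

# (leading bytes, description, MIME type) as used by the three entries
PREFIXES = (
    (b"{\\rtf", "Rich Text Format data", "text/rtf"),
    (b"{\\urtf", "Rich Text Format unicoded data", "text/rtf"),
    (b"{\\pwd", "Pocket Word document or template",
     "application/x-pocket-word"),
)


def classify(data: bytes) -> Optional[Tuple[str, str]]:
    """Return (description, MIME type) for a known leading control word."""
    for prefix, description, mime in PREFIXES:
        if data.startswith(prefix):
            return description, mime
    # {\pwi and everything else is not handled by these entries
    return None
```

A file starting with {\pwi is deliberately left unmatched in this
sketch, in line with the remark above.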

After applying the above modifications with the patch
file-5.38-rtf-pwd.diff, all my Pocket Word documents are recognized,
the DROID test examples are no longer misidentified, the RTF documents
are described more precisely, and I get output like:

fmt-355-signature-id-522.rtf: ISO-8859 text, with very long lines,
			      with no line terminators
fmt-45-signature-id-30.rtf:   ASCII text, with no line terminators
fmt-50-signature-id-158.rtf:  ASCII text, with no line terminators
fmt-52-signature-id-26.rtf:   ISO-8859 text, with no line terminators
fmt-53-signature-id-523.rtf:  ISO-8859 text, with no line terminators
fdo78502.rtf:                 Rich Text Format data,
			      version  , ANSI
fdo85889-pca.rtf:             Rich Text Format data,
			      version 1, IBM PS/2, code page 850
fdo85889-pc.rtf:              Rich Text Format data,
			      version 1, IBM PC, code page 437
footer-para.rtf:              Rich Text Format data,
			      version 1, ANSI
License-Enterprise.rtf:       Rich Text Format data,
			      version 1, ANSI, code page 936,
			      default language ID 2052
Readme-0.72-Persian.rtf:      Rich Text Format data,
			      version 1, ANSI, code page 1256,
			      default middle east language ID 1065
RTF-Spec-1.7.rtf:             Rich Text Format data,
			      version 1, Apple Macintosh, ANSI,
			      code page 10000,
			      default language ID 1033
urtf-sample.rtf:              Rich Text Format unicoded data,
			      unknown character set
PocketWord-HandheldPC.pwd:    Pocket Word document or template,
			      version 1, ANSI, code page 1252
PocketWord-PocketPC.psw:      Pocket Word document or template,
			      version 1, ANSI, code page 1252
test-v2.pwd:                  Pocket Word document or template,
			      version 2


I hope my patch can be applied in a future version of the file utility.

With best wishes
Jörg Jenderek

-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCXr/b9wAKCRCv8rHJQhrU
1p6YAJ4zqFDPi5mBS6namp5ulP+Vs/sSjACfYlBkC7TjF/Pd4L+eFitx+9exGTA=
=NHKl
-----END PGP SIGNATURE-----
-------------- next part --------------
--- file-5.38/magic/Magdir/rtf.old	2019-02-22 13:06:34 +0000
+++ file-5.38/magic/Magdir/rtf	2020-05-15 16:54:55 +0000
@@ -6,11 +6,89 @@
 # Duncan P. Simpson, D.P.Simpson at dcs.warwick.ac.uk
-#
-0	string		{\\rtf		Rich Text Format data,
+# Update:	Joerg Jenderek
+# URL:		https://en.wikipedia.org/wiki/Rich_Text_Format
+# Reference:	http://www.snake.net/software/RTF/RTF-Spec-1.7.rtf
+#		http://www.kleinlercher.at/tools/Windows_Protocols/Word2007RTFSpec9.pdf
+0	string		{\\rtf
+# skip DROID fmt-355-signature-id-522.rtf by looking for valid version
+>5	ubyte		!0xAB
+# skip also \ in DROID fmt-50-signature-id-158.rtf by looking for valid version
+>>5	ubyte		!0x5C		Rich Text Format data
 !:mime	text/rtf
->5	string		1		version 1,
->>6	string		\\ansi		ANSI
->>6	string		\\mac		Apple Macintosh
->>6	string		\\pc		IBM PC, code page 437
->>6	string		\\pca		IBM PS/2, code page 850
->>6	default		x		unknown character set
->5	default		x		unknown version
+!:apple	????RTF
+!:ext	rtf
+>>>0	use		rtf-info
+#	display information like version, language and code page of RTF
+0	name		rtf-info
+# 1 mostly, 2 for newer Pocket Word documents, space for test like fdo78502.rtf, { for some urtf
+>5	ubyte		!0x7b		\b, version %c
+# The word for character set must precede any text or most other control words
+>6	string		\\mac		\b, Apple Macintosh
+>6	string		\\pc
+# control word \pca
+>>9	ubyte		=0x61		\b, IBM PS/2, code page 850
+>>9	ubyte		!0x61		\b, IBM PC, code page 437
+# unknown character set or ANSI later after control words like
+# \adeflang1025 \info \title \author \category \manager
+# "Burow, Steffanie - Im Tal des Schneeleoparden.rtf"
+#>6	search/105	\\ansi		\b, ANSI
+>6	search/502	\\ansi		\b, ANSI
+>6	default		x		\b, unknown character set
+# look for explicit codepage keyword
+# "Burow, Steffanie - Im Tal des Schneeleoparden.rtf"
+#>5	search/110	\\ansicpg
+>5	search/500	\\ansicpg
+# skip unknown or buggy codepage string 0 like in fdo78502.rtf
+>>&0	ubyte		!0x30		\b, code page
+# codepage string: 437~United States IBM, ..., 1252~WesternEuropean, ..., 57011~Punjabi
+>>>&-1		string	x		%-.3s
+# skip space or \ and display possible 4th digit of code page string
+>>>&2		ubyte	>0x2F
+>>>>&-1		ubyte	<0x3A		\b%c
+# possible 5th digit of code page string
+>>>>>&0		ubyte	>0x2F
+>>>>>>&-1	ubyte	<0x3A		\b%c
+# look again at version byte to use default clause
+>5	ubyte		x
+# Default language ID for South Asian/Middle Eastern text
+# language ID: 1025, ..., 1065~Persian, ..., 2057~English_UnitedKingdom, ..., 58380~French_NorthAfrica
+# Readme-0.72-Persian.rtf
+#>6	search/1	\\adeflang	\b, default middle east language ID
+>>6	search/497	\\adeflang	\b, default middle east language ID
+# https://docs.microsoft.com/en-us/openspecs/office_standards/ms-oe376/6c085406-a698-4e12-9d4d-c3b0ee3dbc4a
+>>>&0	string		x		%.4s
+# skip \ and NL and show possible 5th digit of language string
+>>>&4	ubyte		>0x2F
+>>>>&-1	ubyte		<0x3A		\b%c
+# else look for default language to be used when the \plain control word is encountered
+>>6	default		x
+# "Burow, Steffanie - Im Tal des Schneeleoparden.rtf"
+#>>>6	search/127	\\deflang
+>>>6	search/505	\\deflang
+>>>>&0	string		>0		\b, default language ID %-.4s
+# possible 5th digit of language string
+>>>>&4		ubyte	>0x2F
+>>>>>&-1	ubyte	<0x3A		\b%c
+
+# Reference:	http://latex2rtf.sourceforge.net/rtfspec_63.html
+# Note:		no real world example found
+0	string		{\\urtf		Rich Text Format unicoded data
+!:mime	text/rtf
+#!:apple	????RTF
+!:ext	rtf
+>1	use		rtf-info
+
+# URL:		https://en.wikipedia.org/wiki/Microsoft_Word
+# Reference:	http://fileformats.archiveteam.org/wiki/Microsoft_Word
+# Note:	called by TrID "Pocket Word document"
+#	by PlanMaker "Pocket Word-Handheld PC" for pwd
+#	by PlanMaker "Pocket Word-Pocket PC" for psw
+0	string		{\\pwd		Pocket Word document or template
+# by SoftMaker Office	http://extension.nirsoft.net/pwd
+#!:mime	application/msword
+# https://reposcope.com/mimetype/application/x-pocket-word
+!:mime	application/x-pocket-word
+# PWD for Handheld PC variant and PSW for Pocket PC variant
+# PWT for template
+!:ext	pwd/psw/pwt
+>0	use		rtf-info
+

