[File] [PATCH] Magdir/archive for ARJ, JAR (ARJ Software, Inc.) versus Java archive data (JAR)

Jörg Jenderek joerg.jen.der.ek at gmx.net
Sat Mar 12 18:40:42 UTC 2022


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

A few days ago I wanted to handle some Java archives, which are ZIP
based and normally have the 3-byte file name extension jar or
maybe the 1-byte extension j.

Unfortunately these extensions are also used by other compression
tools. When running the file command version 5.41 on such non-ZIP
examples I get a nearly correct output like:
19GXE.ARJ:         ARJ archive data, v3
		   , original name: #9GXE.ARJ
		   , os: MS-DOS
MY_JARC.JAR:       JAR (ARJ Software, Inc.) archive data
SAMPLE.J:          JAR (ARJ Software, Inc.) archive data
TEST-hk2.ARJ:      ARJ archive data, v11
		   , slash-switched
		   , original name: \003,
WP60.ARJ:          ARJ archive data, v4
		   , slash-switched
		   , original name: WP60.ARJ
		   , os: MS-DOS 4]
pmext4pc.arj:      ARJ archive data, v6
		   , slash-switched
		   , original name: PMEXT4PC.ARJ
		   , os: MS-DOS 3]
test-je-v360K.e01: ARJ archive data, v11
		   , slash-switched
		   , original name: ,
test-r-v360.a02:   ARJ archive data, v11
		   , multi-volume
		   , slash-switched
		   , original name: ,
zip300.j:          JAR (ARJ Software, Inc.) archive data
zip300.j01:        JAR (ARJ Software, Inc.) archive data


With the --extension option only ??? is displayed. Furthermore, with
the -i option application/x-arj is shown only for the ARJ samples. For
the other examples only the generic application/octet-stream is shown.

For comparison I ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). It describes
samples with the JAR extension like MY_JARC.JAR as "JARCS compressed
archive" by ark-jarcs.trid.xml. Most of the others are described as
"JAR compressed archive" by ark-jar.trid.xml or as a variant with
additional "(with Security Envelope)" by ark-jar-se.trid.xml (see
the appended jar_j_trid-v.txt.gz).

Luckily, with the -v option TrID displays the file name extension and
a related URL. With the information from this tool I found a page
about JAR (ARJ Software) on the File Formats Archive Team web site.
That information is expressed by comment lines inside Magdir/archive
like:
# URL:	http://fileformats.archiveteam.org/wiki/JAR_(ARJ_Software)
# ref.:	http://mark0.net/download/triddefs_xml.7z
#	defs/a/ark-jar.trid.xml

The description is done inside Magdir/archive by a line like:
0xe	string	\x1aJar\x1b JAR (ARJ Software, Inc.) archive data
This now becomes:
0xe	string	\x1aJar\x1b JAR (ARJ Software, Inc.) archive data
!:mime	application/x-compress-j
> 0	ulelong	x		\b, CRC32 %#x
!:ext	j/j01/j02
Instead of the generic mime type application/octet-stream I display a
user-defined one. The standard suffix is ".j", but if you create
multiple volumes then the first one gets this suffix and the following
ones get suffixes in numerical order like:
j01 j02 ... j99 100 ... 990

For the one example with the jar suffix (MY_JARC.JAR) the description
is done inside Magdir/archive by a line like:
0	string	JARCS JAR (ARJ Software, Inc.) archive data
So we get the same description as for the other examples, but there
the CRC is stored at the beginning whereas here we find the text
string JARCS. So the magic lines now become:
0	string	JARCS JAR (ARJ Software, Inc.) archive data
!:mime	application/x-compress-jar
!:ext	jar
Instead of the generic mime type application/octet-stream I display
another user-defined one. The standard suffix is ".jar", which is
also used for Java archives. The information about that format is
expressed inside Magdir/archive by lines like:
# URL:		http://fileformats.archiveteam.org/wiki/JARCS
# reference:	http://mark0.net/download/triddefs_xml.7z
#		a/ark-jarcs.trid.xml
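
The two JAR tests described above can be sketched outside of magic
syntax as follows. This is only an illustration of the byte positions
involved; the function name sniff_jar is hypothetical and not part of
the file utility.

```python
# Sketch only: mirrors the two magic tests quoted above.
# sniff_jar is a hypothetical helper, not an API of file(1).
JAR_MARKER = b"\x1aJar\x1b"   # expected at offset 0x0e
JARCS_MARKER = b"JARCS"       # expected at offset 0


def sniff_jar(data: bytes):
    """Return (description, mime type, extensions) or None."""
    if data[0x0E:0x0E + len(JAR_MARKER)] == JAR_MARKER:
        # the first 4 bytes are shown as little-endian CRC32
        crc32 = int.from_bytes(data[0:4], "little")
        return ("JAR (ARJ Software, Inc.) archive data, CRC32 %#x" % crc32,
                "application/x-compress-j", "j/j01/j02")
    if data.startswith(JARCS_MARKER):
        return ("JAR (ARJ Software, Inc.) archive data",
                "application/x-compress-jar", "jar")
    return None
```

The order of the two tests does not matter here, because a file
starting with JARCS cannot also start with a 4-byte CRC followed by
the marker at 0x0e unless both happen to match.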

Because the "jar" formats are described as based on ARJ I also checked
such examples. These are described by TrID as "ARJ compressed archive"
by ark-arj.trid.xml and as "ARJ File Format" by DROID via PUID fmt/610.
It is also mentioned there that the ARJ specification can be found
in a file named TECHNOTE.TXT, which is part of the unarj or
multiarc source trees. That information is expressed by comment
lines inside Magdir/archive like:
# URL:		http://fileformats.archiveteam.org/wiki/ARJ
# reference:	http://mark0.net/download/triddefs_xml.7z
#		defs/a/ark-arj.trid.xml
#		https://github.com/FarGroup/FarManager/
#		blob/master/plugins/multiarc/arc.doc/arj.txt

Often information about the operating system is shown. That is done
by lines like:
> 7	byte		0		os: MS-DOS
> 7	byte		1		os: PRIMOS
...
> 7	byte		9		os: VAX/VMS
But sometimes no such information is shown, because higher numbers
are used for "new" systems. So according to the newer
specification this is now done by additional lines like:
> 7	byte		10		os: WIN95
> 7	byte		11		os: WIN32
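
As a rough illustration, the host OS lookup can be written as a small
table. Only the codes quoted in this mail are listed; the values
elided by "..." above are deliberately left out, and the names
HOST_OS and host_os are hypothetical.

```python
# Sketch only: host OS byte at offset 7 of the ARJ main header.
# Codes 2..8 are elided in the mail and therefore omitted here.
HOST_OS = {
    0: "MS-DOS",
    1: "PRIMOS",
    9: "VAX/VMS",
    10: "WIN95",
    11: "WIN32",
}


def host_os(header: bytes) -> str:
    """Return the OS label for the byte at offset 7."""
    code = header[7]
    return HOST_OS.get(code, "unknown (%d)" % code)
```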

Afterwards a digit and a bracket are often shown, as for example
WP60.ARJ. That was done by a line like:
> 3	byte		>0		%d]

But according to the specification this byte belongs to the basic
header size, a 16-bit value at offset 2 (like: 0x002b 0x002c 0x04e0
0x04e3 0x04e7). So if you are interested in this information for
debugging purposes then show it correctly by lines like:
> 2	uleshort	x	basic header size %#4.4x
> (2.s)	ubequad		x	NEXT FRAGMENT CONTENT %#16.16llx

The archiver version number (like: 3 4 6 11 102) is stored in the
archive and that information is shown by a line like:
> 5	byte		x		\b, v%d,
Afterwards the minimum archiver version needed to extract, like 1, is
stored. Similar to the ZIP examples, this information is now shown too
by an additional line like:
> 6	ubyte		!1		minimum %u to extract,

Often the original archive name is shown. This was done by a line like:
> 34	string		x		original name: %s,
But sometimes this is missing or obviously wrong like in example
TEST-hk2.ARJ. If I understand the documentation right then sometimes 4
extra bytes are inserted before the 0-terminated file name. So this
now becomes:
> 34	byte		x		original name:
> 34	byte		<0x0B
>> 38	string		x		%s,
> 34	byte		>0x0A
>> 34	string		x		%s,
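
The name lookup above can be sketched like this, assuming the
interpretation given in this mail (a small value at offset 34 means
that 4 extra bytes precede the name). The function name
original_name is hypothetical.

```python
import struct


def original_name(header: bytes) -> str:
    """Return the 0-terminated archive name from an ARJ main header."""
    # magic's "byte" type is signed, so read the byte the same way
    first = struct.unpack_from("b", header, 34)[0]
    # values below 0x0B are taken as extra data, name starts 4 bytes later
    start = 38 if first < 0x0B else 34
    end = header.index(b"\x00", start)
    # latin-1 never fails to decode; the real code page is not known here
    return header[start:end].decode("latin-1")
```

For a header without extra data the name is read directly at offset
34, for one with extra data at offset 38.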

At the offset of the file name sometimes the arj protection factor is
stored. The maximal value is 10. This value is given by an arj
command switch like hky, where y is a digit and the factor is
calculated by adding one to the y value. The existence of a data
protection record is indicated by a set ARJPROT_FLAG bit in the flags
byte. So this information is now shown by lines like:
> 8	byte		&0x08		recoverable
>> 0x22	byte		x		(factor %u),

Normally a 3-byte suffix like ".arj" or the upper-case variant on DOS
systems is used. For multi-volume archives the first name is
archive.arj and the following parts are named archive.a01, archive.a02
and so on. In the following parts the "multi-volume" flag is set. For
self-extracting multi-volume archives the first name is archive.exe.
This is correctly identified as an executable for MS Windows, with the
additional tag "ARJ self-extracting archive". The following parts are
normal archives with names like archive.e01, archive.e02 and so on.
Astonishingly, here the multi-volume flag is not set. So the
extensions are now shown by additional lines like:

> 0x26	search/1024	\0
#>>&-5	string		x		extension %.4s
>> &-5	string/c	.arj		data
!:ext	arj
>> &-5	default		x
>>> 8	byte		&0x04		data
!:ext	a01/a02
>>> 8	byte		^0x04		data, SFX multi-volume
!:ext	e01/e02
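
The decision made by these lines can be sketched as follows. It is
only an approximation of the magic search semantics, and the function
name suggested_extensions is hypothetical.

```python
def suggested_extensions(header: bytes):
    """Return the !:ext value the lines above would select, or None."""
    # look for the terminating 0 byte of the stored name from 0x26 on
    nul = header.find(b"\x00", 0x26, 0x26 + 1024)
    if nul < 0:
        return None
    # the 4 bytes in front of the terminator hold the stored suffix
    suffix = bytes(header[nul - 4:nul])
    if suffix.lower() == b".arj":
        return "arj"
    # other suffix: VOLUME_FLAG (0x04) in the flags byte decides
    if header[8] & 0x04:
        return "a01/a02"
    return "e01/e02"
```

Note that the stored name, not the name of the file on disk, is
inspected, which matches the new output where the original name of a
later volume is for example test-r-v360.a02.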

I also saw that only a few bits of the flags byte are shown and
interpreted. So according to the documentation I added more flag
values. For example TEST-gstew.ARJ shows GARBLED_FLAG1. If this bit is
set then the archive content is garbled with the password given by the
g switch. So this information is shown together with the encryption
version by lines like:
> 8	byte		&0x01		garbled
>> 0x20	ubyte		x		(v%u),
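
For illustration, the whole flags byte can be decoded with a small
table. The labels are the ones used in the attached patch; the names
ARJ_FLAGS and describe_flags are hypothetical.

```python
# Sketch only: bit values and labels as used in the attached patch.
ARJ_FLAGS = [
    (0x01, "garbled"),         # GARBLED_FLAG1
    (0x02, "ANSI codepage"),   # ANSIPAGE_FLAG
    (0x04, "multi-volume"),    # VOLUME_FLAG
    (0x08, "recoverable"),     # ARJPROT_FLAG
    (0x10, "slash-switched"),
    (0x20, "backup"),          # BACKUP_FLAG
    (0x40, "secured"),         # SECURED_FLAG
    (0x80, "dual-name"),       # ALTNAME_FLAG
]


def describe_flags(header: bytes):
    """Return the labels of all bits set in the flags byte at offset 8."""
    flags = header[8]
    parts = []
    for bit, label in ARJ_FLAGS:
        if not flags & bit:
            continue
        if bit == 0x01:
            # encryption version is stored at offset 0x20
            label += " (v%d)" % header[0x20]
        elif bit == 0x08:
            # protection factor is stored at offset 0x22
            label += " (factor %d)" % header[0x22]
        parts.append(label)
    return parts
```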

Starting at offset 0xC the date+time stamps for creation and
modification are stored. Similar to ZIP archives, that information is
stored in MS-DOS format.
So show this by the subroutine dos-date inside the newest Magdir/msdos
or use the new internal functions lemsdosdate and lemsdostime. This is
now done by lines like:
> 0xC	ulelong		x		created
> 0xC	use		dos-date
#>0xE	lemsdosdate	x		%s
#>0xC	lemsdostime	x		%s
> 0xC	ulelong		x		\b,
> 0x10	ulelong		>0		modified
>> 0x10	use		dos-date
#>>0x12	lemsdosdate	x		%s
#>>0x10	lemsdostime	x		%s
>> 0x10	ulelong		x		\b,
That information can be verified by commands like:
	arj	l 	pmext4pc.arj
	7z	l -tarj	PHRACK1.ARJ
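
To cross-check such a time stamp by hand, the packed MS-DOS format can
be decoded as sketched below. The bit layout is the usual MS-DOS one
(time word first, date word 2 bytes later, which is why the commented
lines use 0xC for the time and 0xE for the date). The function name
dos_datetime is hypothetical.

```python
import datetime


def dos_datetime(header: bytes, offset: int = 0x0C) -> datetime.datetime:
    """Decode a 4-byte little-endian MS-DOS time+date stamp."""
    time_word = int.from_bytes(header[offset:offset + 2], "little")
    date_word = int.from_bytes(header[offset + 2:offset + 4], "little")
    year = 1980 + (date_word >> 9)        # 7 bits, counted from 1980
    month = (date_word >> 5) & 0x0F       # 4 bits
    day = date_word & 0x1F                # 5 bits
    hour = time_word >> 11                # 5 bits
    minute = (time_word >> 5) & 0x3F      # 6 bits
    second = (time_word & 0x1F) * 2       # 5 bits, 2 second steps
    # raises ValueError for invalid stamps, e.g. an all-zero field
    return datetime.datetime(year, month, day, hour, minute, second)
```

Under this layout a year field of 13 means 1993, which seems to be
what an output like "19 may 1980+13" expresses.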

The detection is done by starting magic lines like:
0	leshort		0xea60		ARJ archive data
!:mime	application/x-arj
That uses only 2 bytes. That is not a strong magic and is in
contrast to the recommendation to use at least 4 bytes. The DROID test
example fmt-610-signature-id-946.arj contains just these first 2
bytes. So by the current magic this is also described as ARJ archive
data. This is not what you really want. So I skip this example by an
additional test for the valid file type (2) of the main header. I also
put the displaying part inside the subroutine arj-archive. The
starting lines now become:
0	leshort		0xea60
> 0xA	ubyte		2
>> 0	use		arj-archive
0	name		arj-archive
> 0	leshort		x		ARJ archive
!:mime	application/x-arj
At first glance this looks like overkill, but it has some
advantages. According to the comment line "[JW] idarc" there exist
samples where the magic occurs 2 bytes later. This was expressed by
the line
2	leshort		0xea60		ARJ archive data
Unfortunately I have no such example myself, but I prepared lines
to use the subroutine here as well. So this now becomes:
2	leshort		0xea60		ARJ archive data
#2	leshort		0xea60
#>2	use		arj-archive
Also the SFX archive contains the real ARJ archive after the executing
stub. So it is possible to jump to the right position and show the
information by calling the subroutine at that offset.
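
The strengthened test and the idea of reusing it at another offset
can be sketched as follows. The function name looks_like_arj and its
offset parameter are hypothetical and only illustrate the approach.

```python
ARJ_MAGIC = 0xEA60  # stored little-endian, i.e. bytes 60 EA


def looks_like_arj(data: bytes, offset: int = 0) -> bool:
    """Check the 2-byte magic plus file type 2 of the main header."""
    # need at least the byte at offset+0x0A; a bare 2-byte file fails here
    if len(data) < offset + 0x0B:
        return False
    magic = int.from_bytes(data[offset:offset + 2], "little")
    return magic == ARJ_MAGIC and data[offset + 0x0A] == 2
```

With offset 0 this corresponds to the new starting lines, with offset
2 to the prepared "[JW] idarc" variant, and with the end of an SFX
stub to the self-extracting case.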

Some fields described in the documentation, like archive size and
filespec position, are not understandable for me or I do not get the
expected values. So I added these fields only as comment lines like:
# archive size (currently used only for secured archives); MAYBE?
#>0x14	ulelong		!0		file size %u,
# security envelope file position; MAYBE?
#>0x18	ulelong		!0		at %#x security envelope,
# filespec position in filename; WHAT IS THAT?
#>0x1C	uleshort	>0		filespec position %#x,

After applying the above mentioned modifications by the patch
file-5.41-archive-jar_j.diff and using the newest Magdir/msdos,
all samples are described as before but with corrections and more
details like:

19GXE.ARJ:         ARJ archive data, v3
		   , created 19 may 1980+13
		   , original name: #9GXE.ARJ
		   , os: MS-DOS
MY_JARC.JAR:       JAR (ARJ Software, Inc.) archive data
SAMPLE.J:          JAR (ARJ Software, Inc.) archive data
		   , CRC32 0x92c93391
TEST-hk2.ARJ:      ARJ archive data, v11
		   , recoverable (factor 3)
		   , slash-switched
		   , created 10 mar 1980+42
		   , original name: TEST-hk2.ARJ
		   , os: WIN32
WP60.ARJ:          ARJ archive data, v4
		   , ANSI codepage
		   , slash-switched
		   , created 2 jun 1980+13
		   , security envelope length 0x471
		   , original name: WP60.ARJ
		   , os: MS-DOS
pmext4pc.arj:      ARJ archive data, v6
		   , slash-switched
		   , created 13 mar 1980+15
		   , original name: PMEXT4PC.ARJ
		   , os: MS-DOS
test-je-v360K.e01: ARJ archive data, SFX multi-volume, v11
		   , slash-switched
		   , created 12 mar 1980+42
		   , original name: test-je-v360K.e01
		   , os: WIN32
test-r-v360.a02:   ARJ archive data, v11
		   , multi-volume
		   , slash-switched
		   , created 12 mar 1980+42
		   , original name: test-r-v360.a02
		   , os: WIN32
zip300.j:          JAR (ARJ Software, Inc.) archive data
		   , CRC32 0x37c4d93d
zip300.j01:        JAR (ARJ Software, Inc.) archive data
		   , CRC32 0xedaf841b


With the --extension option I now get the expected output like:

19GXE.ARJ:         arj
MY_JARC.JAR:       jar
SAMPLE.J:          j/j01/j02
TEST-hk2.ARJ:      arj
WP60.ARJ:          arj
pmext4pc.arj:      arj
test-je-v360K.e01: e01/e02
test-r-v360.a02:   a01/a02
zip300.j:          j/j01/j02
zip300.j01:        j/j01/j02

I hope my diff file can be applied in a future version of the file
utility.

With the -i option the mime type is shown, which is given by a magic
line looking like "!:mime	application/x-arj". So it would be nice to
implement in a similar way an option to show the TrID,
shared-mime-info or DROID description and/or the PUID identification
number. Why? It reminds me of anti-virus software. Every company
names things differently. So if you are in trouble and are uncertain
because you get different descriptions then it is difficult to decide
which one is correct or whether it is the same fact just described
with another text.

With best wishes
Jörg Jenderek
- --
Jörg Jenderek





-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCYizpKQAKCRCv8rHJQhrU
1gotAKCZbwWfj9HC+ZlUzqPbpVOmOf8BXgCfdcbThoGbfGELNhE84O2BfDJ8I/s=
=dU/G
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jar_j_trid-v.txt.gz
Type: application/x-gzip
Size: 537 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220312/b0152411/attachment-0002.bin>
-------------- next part --------------
--- file-5.41/magic/Magdir/archive.old	2021-08-30 11:10:26.000000000 +0200
+++ file-5.41/magic/Magdir/archive	2022-03-12 19:19:42.763217700 +0100
@@ -927,13 +927,123 @@
 # JAR archiver (.j), this is the successor to ARJ, not Java's JAR (which is essentially ZIP)
+# Update:	Joerg Jenderek
+# URL:		http://fileformats.archiveteam.org/wiki/JAR_(ARJ_Software)
+# reference:	http://mark0.net/download/triddefs_xml.7z/defs/a/ark-jar.trid.xml
+#		https://www.sac.sk/download/pack/jar102x.exe/TECHNOTE.DOC
+# Note:		called "JAR compressed archive" by TrID
 0xe	string	\x1aJar\x1b JAR (ARJ Software, Inc.) archive data
+#!:mime	application/octet-stream
+!:mime	application/x-compress-j
+>0	ulelong	x		\b, CRC32 %#x
+# standard suffix is ".j"; for multi volumes following order j01 j02 ... j99 100 ... 990
+!:ext	j/j01/j02
+# URL:		http://fileformats.archiveteam.org/wiki/JARCS
+# reference:	http://mark0.net/download/triddefs_xml.7z/defs/a/ark-jarcs.trid.xml
+# Note:		called "JARCS compressed archive" by TrID
 0	string	JARCS JAR (ARJ Software, Inc.) archive data
+#!:mime	application/octet-stream
+!:mime	application/x-compress-jar
+!:ext	jar
 
 # ARJ archiver (jason at jarthur.Claremont.EDU)
-0	leshort		0xea60		ARJ archive data
+# URL:		http://fileformats.archiveteam.org/wiki/ARJ
+# reference:	http://mark0.net/download/triddefs_xml.7z/defs/a/ark-arj.trid.xml
+#		https://github.com/FarGroup/FarManager/
+#		blob/master/plugins/multiarc/arc.doc/arj.txt
+# Note:		called "ARJ compressed archive" by TrID and
+#		"ARJ File Format" by DROID via PUID fmt/610
+#		verified by `7z l -tarj PHRACK1.ARJ` and
+#		`arj.exe l TEST-hk9.ARJ`
+0	leshort		0xea60
+# skip DROID fmt-610-signature-id-946.arj by check for valid file type of main header
+>0xA	ubyte		2
+>>0	use		arj-archive
+0	name		arj-archive
+>0	leshort		x		ARJ archive
 !:mime	application/x-arj
+# look for terminating 0-character of filename
+>0x26	search/1024	\0
+# file name extension is normally .arj but not for parts of multi volume
+#>>&-5	string		x		extension %.4s
+>>&-5	string/c	.arj		data
+!:ext	arj
+>>&-5	default		x
+# for multi volume first name is archive.arj then following parts archive.a01 archive.a02 ...
+>>>8	byte		&0x04		data
+!:ext	a01/a02
+# for SFX first name is archive.exe then following parts archive.e01 archive.e02 ...
+>>>8	byte		^0x04		data, SFX multi-volume
+!:ext	e01/e02
+# basic header size like: 0x002b 0x002c 0x04e0 0x04e3 0x04e7
+#>2	uleshort	x		basic header size %#4.4x
+# next fragment content like: 0x0a200a003a8fc713 0x524a000010bb3471 0x524a0000c73c70f9
+#>(2.s)	ubequad		x		NEXT FRAGMENT CONTENT %#16.16llx
+# first_hdr_size; seems to be same as basic header size
+#>2	uleshort	x		1st header size %#x
+# archiver version number like: 3 4 6 11 102
 >5	byte		x		\b, v%d,
+# minimum archiver version to extract like: 1
+>6	ubyte		!1		minimum %u to extract,
+# FOR DEBUGGING
+#>8	byte		x		FLAGS %#x,
+# GARBLED_FLAG1; garble with password; g switch
+>8	byte		&0x01		garbled
+# encryption version: 0~old  1~old 2~new 3~reserved 4~40 bit key GOST
+>>0x20	ubyte		x		(v%u),
+# ANSIPAGE_FLAG; indicates ANSI codepage used by ARJ32; hy switch
+>8	byte		&0x02		ANSI codepage,
+# VOLUME_FLAG indicates presence of succeeding volume
 >8	byte		&0x04		multi-volume,
+# ARJPROT_FLAG; build with data protection record; hk switch
+>8	byte		&0x08		recoverable
+# arj protection factor; maximal 10; switch hky -> factor=y+1
+>>0x22	byte		x		(factor %u),
 >8	byte		&0x10		slash-switched,
+# BACKUP_FLAG; obsolete
 >8	byte		&0x20		backup,
->34	string		x		original name: %s,
+# SECURED_FLAG;
+>8	byte		&0x40		secured,
+# ALTNAME_FLAG; indicates dual-name archive
+>8	byte		&0x80		dual-name,
+# security version; 0~old 2~current
+>9	ubyte		!0
+>>9	ubyte		!2		security version %u,
+# file type; 2 in main header; 0~binary 1~7-bitText 2~comment 3~directory 4~VolumeLabel 5=ChapterLabel
+>0xA	ubyte		!2		file type %u,
+# date+time when original archive was created in MS-DOS format via ./msdos
+>0xC	ulelong		x		created
+>0xC	use		dos-date
+# or date and time by new internal function
+#>0xE	lemsdosdate	x		%s
+#>0xC	lemsdostime	x		%s
+>0xC	ulelong		x		\b,
+# FOR DEBUGGING
+#>0x12	uleshort	x		RAW DATE %#4.4x
+#>0x10	uleshort	x		RAW TIME %#4.4x
+# date+time when archive was last modified; sometimes nil or
+# maybe wrong like in HP4DRVR.ARJ
+#>0x10	ulelong		>0		modified
+#>>0x10	use		dos-date
+# or date and time by new internal function
+#>>0x12	lemsdosdate	x		%s
+#>>0x10	lemsdostime	x		%s
+#>>0x10	ulelong		x		\b,
+# archive size (currently used only for secured archives); MAYBE?
+#>0x14	ulelong		!0		file size %u,
+# security envelope file position; MAYBE?
+#>0x18	ulelong		!0		at %#x security envelope,
+# filespec position in filename; WHAT IS THAT?
+#>0x1C	uleshort	>0		filespec position %#x,
+# length in bytes of security envelope data like: 2CAh 301h 364h 471h
+>0x1E	uleshort	!0		security envelope length %#x,
+# last chapter like: 0 1
+>0x21	ubyte		!0		last chapter %u,
+# filename (null-terminated string); sometimes at 0x26 when 4 bytes for extra data
+>34	byte		x		original name:
+# with extras data
+>34	byte		<0x0B
+>>38	string		x		%s,
+# without extras data
+>34	byte		>0x0A
+>>34	string		x		%s,
+# host OS: 0~MSDOS ... 11~WIN32
 >7	byte		0		os: MS-DOS
@@ -948,5 +1058,8 @@
 >7	byte		9		os: VAX/VMS
->3	byte		>0		%d]
+>7	byte		10		os: WIN95
+>7	byte		11		os: WIN32
 # [JW] idarc says this is also possible
 2	leshort		0xea60		ARJ archive data
+#2	leshort		0xea60
+#>2	use		arj-archive
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.41-archive-jar_j.diff.sig
Type: application/octet-stream
Size: 2583 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220312/b0152411/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jar_j-droid.csv.gz
Type: application/x-gzip
Size: 545 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220312/b0152411/attachment-0003.bin>

