[File] [PATCH] Magdir/archive for Comic Book Archive, tar archive *.CBT

Jörg Jenderek joerg.jen.der.ek at gmx.net
Wed Jul 27 23:04:08 UTC 2022


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

Some days ago I sent patches for DOS COM executables. One Syslinux
COMboot variant uses the file name extension CBT instead of COM.

As a cross-check I looked for other files with the CBT extension on
my systems. According to the DROID utility this extension is also
used for Comic Book Archive. When I run the file command (version
5.42) on such examples and related tar archives with the -e tar
option, I get output like:

Black_Cobra_003.cbt: POSIX tar archive (GNU), file
		     19.jpg, mode 000644 ,
		     size 00003315356,
		     seconds 11540725637
M129-pax.tar:        POSIX.1-2001 tar archive, global
		     /tmp/GlobalHead.2512.2, mode 0000644,
		     uid 0000000, gid 0000000,
		     size 00000000141,
		     seconds 13071714760
TAR3214-j.TAR:       tar archive (old), file
		     tar3214.txt, mode    666 ,
		     uid      0 , gid      0 ,
		     size        3400 ,
		     seconds  6450504352, comment:
		     comment field created by -j option by DO
archive.dir.tar:     POSIX tar archive (GNU), directory
		     gettext-0.10.35/, mode 0000755,
		     uid 0000000, gid 0000000,
		     size 00000000000,
		     seconds 11401732537,
		     user root, group root
comics.cbt:          POSIX tar archive (GNU), file
		     test.jpg, mode 0000644,
		     uid 0001750, gid 0001750,
		     size 00000001121,
		     seconds 10665023160,
		     user jjmarin, group jjmarin
dpmi-en.tar:         POSIX tar archive, file
		     0.9.gif, mode 000644 ,
		     uid 000124 , gid 000024 ,
		     size 00000000147 ,
		     seconds 05762024207,
		     user dj, group user
gtarfail.tar:        POSIX tar archive, file
		     vedpowered.gif, mode 0000644,
		     uid 0000746, gid 0002044,
		     size 00000001006 ,
		     seconds 07303467402,
		     user jes, group glone
id-high2037-old.tar: tar archive (V7), file
		     6Mar2037.txt, mode 0000644,
		     uid 7777777, gid 7777777,
		     size 00000000374,
		     seconds 17626765756
test-png.cbt:        POSIX tar archive (GNU), file
		     0001.png, mode 000644 ,
		     size 00000002567,
		     seconds 13273174121
test4digit.tar:      POSIX tar archive (GNU), file
		     2712.txt, mode 000644 ,
		     size 00000204500,
		     seconds 13220741303
test_data.tar:       POSIX tar archive (GNU), file
		     0000000000000000.empty.br,
		     mode 0000600,
		     uid 0423055, gid 0257523,
		     size 00000000001,
		     seconds 13266421766,
		     user eustas, group primarygroup
win10iso-gnu.tar:    POSIX tar archive (GNU), file
		     m/vm/14393.0.160715-1616.
		     RS1_RELEASE_CLIENTENTERPRISE_S_EVAL,
		     mode 0000644,
		     uid 0002464, gid 0001143,
		     size 0xd72db800,
		     seconds 13031057704,
		     user joerg, group Administratoren
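
The mode, size and seconds values in this listing are the ASCII octal
strings from the tar header, printed as stored. As a reading aid, here
is a minimal Python sketch that decodes them. The helper name
parse_octal_field is hypothetical, and the sketch assumes that
"seconds" is the modification time in seconds since the Unix epoch. It
does not cover the 0x... form shown for win10iso-gnu.tar.

```python
def parse_octal_field(raw: bytes) -> int:
    """Decode a tar header numeric field stored as ASCII octal digits.

    Hypothetical helper: padding spaces and NUL bytes are stripped first.
    """
    text = raw.decode("ascii").strip(" \0")
    return int(text, 8) if text else 0


# Values copied from the Black_Cobra_003.cbt entry of the listing.
size = parse_octal_field(b"00003315356")
seconds = parse_octal_field(b"11540725637")
mode = parse_octal_field(b"000644 ")

print("size in bytes:", size)
print("seconds since epoch (assumed meaning):", seconds)
print("mode:", oct(mode))
```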

When I run the file command without that option, I get output like:

Black_Cobra_003.cbt: POSIX tar archive (GNU)
M129-pax.tar:        POSIX tar archive
TAR3214-j.TAR:       tar archive
archive.dir.tar:     POSIX tar archive (GNU)
comics.cbt:          POSIX tar archive (GNU)
dpmi-en.tar:         POSIX tar archive
gtarfail.tar:        POSIX tar archive
id-high2037-old.tar: tar archive
test-png.cbt:        POSIX tar archive (GNU)
test4digit.tar:      POSIX tar archive (GNU)
test_data.tar:       POSIX tar archive (GNU)
win10iso-gnu.tar:    data

With the option to show the file name extension I get, for the CBT
samples, an unsuitable phrase like tar/gtar or ???, and with the
option to show the MIME type I get an unsuitable phrase like
application/x-tar or application/x-gtar.

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). It identifies
most CBT examples as "Comic Book Archive" by PUID fmt/1462, based on
the file name extension (see the appended
droid-comicbook-cbt.csv.gz).

For comparison I also ran the file format identification utility
TrID (see https://mark0.net/soft-trid-e.html). Many examples are
described with low priority as "Tape ARchive (file)" by definition
ark-tar-file.trid.xml. With higher priority many examples are also
described as "TAR - Tape ARchive (GNU)" by ark-tar-gnu.trid.xml (see
the appended trid-v-comicbook-cbt.txt.gz).

There are pages about the Comic Book Archive format on Wikipedia and
on the Archive Team file formats wiki. This is now expressed by
comment lines like:
# URL:		https://en.wikipedia.org/wiki/Comic_book_archive
#		http://fileformats.archiveteam.org/wiki/Comic_Book_Archive

Luckily, inside Magdir/archive the display part for tar archives is
done by calling the subroutine tar-file.

So inside Magdir/archive I created lines for the Comic Book Archive
TAR variant as a subroutine tar-cbt, which looks like:
 0	name		tar-cbt
 >0	string		x	Comic Book Archive, tar archive
 !:mime	application/vnd.comicbook
 !:ext	cbt
 >0	string		>\0	\b, 1st image %-.60s
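
To illustrate what these lines print, here is a minimal Python sketch
(not magic syntax). It assumes the member name is the NUL-terminated
name field in the first 100 bytes of the tar header, and it imitates
the %-.60s format. The function names first_member_name and
describe_cbt are hypothetical.

```python
import io
import tarfile


def first_member_name(header: bytes) -> str:
    """Return the NUL-terminated name field at offset 0 of a tar header."""
    return header[:100].split(b"\0", 1)[0].decode("utf-8", errors="replace")


def describe_cbt(header: bytes) -> str:
    """Imitate the tar-cbt description; the name is cut after 60 characters."""
    text = "Comic Book Archive, tar archive, 1st image %-.60s"
    return text % first_member_name(header)


# Build a tiny in-memory tar archive so there is a header to inspect.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    data = b"placeholder, not a real image"
    info = tarfile.TarInfo(name="0001.png")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

first_header = buf.getvalue()[:512]
print(describe_cbt(first_header))
```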

Instead of a generic MIME type like application/x-tar I display a
different one. For the other variants a type starting with
application/vnd.comicbook is used: the CBZ variant appends +zip and
the CBR variant appends -rar. For the TAR variant I found nothing,
but logically it should at least start with
application/vnd.comicbook. For the TAR-packed variant the extension
CBT is used instead of TAR.
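
The naming scheme can be summarised as a small lookup table. This is
only a Python sketch: the cbt entry is the value proposed in this
patch, not a type I found documented for the TAR variant, and the
function name comic_mime and its fallback value are hypothetical.

```python
# Extension -> MIME type. The "cbt" entry is the proposal of this patch.
COMIC_MIME = {
    "cbz": "application/vnd.comicbook+zip",
    "cbr": "application/vnd.comicbook-rar",
    "cbt": "application/vnd.comicbook",
}


def comic_mime(filename: str, fallback: str = "application/octet-stream") -> str:
    """Look up the comic book MIME type by file name extension only."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return fallback  # no extension at all
    return COMIC_MIME.get(ext.lower(), fallback)


for name in ("Black_Cobra_003.cbt", "example.cbz", "example.cbr"):
    print(name, "->", comic_mime(name))
```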

Unfortunately there exists no precise specification. It is
described that every page of the comic is stored as an image, where
only a few types are used (mainly JPEG or PNG, but TIFF, GIF and BMP
can also occur). The file names are chosen so that they represent
the sort order of the page numbers. This information is shown by the
last line of the subroutine and should look like 19.jpg, 0001.png or
0002.png.

So as an additional test I check the name of the 1st archive member
with a regular expression: the main name must consist of digits and
the extension must be one of the image types. If this is true, it is
probably a Comic Book Archive, so I call the new subroutine. If it is
false, it is probably a "normal" tar archive, and I call the old
subroutine. This is done by additional lines which look like:
 >>>>>>>>0 regex \^[0-9]{2,4}[.](png|jpg|jpeg|tif|tiff|gif|bmp)
 >>>>>>>>>0	use	tar-cbt
 >>>>>>>>0	default		x
 >>>>>>>>>0	use	tar-file
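
The effect of this branch can be tried outside of file with a short
Python sketch. It assumes that "\^" in the magic line stands for the
"^" anchor and that the match is case sensitive as written. The names
are first member names taken from the listings in this mail, and
choose_subroutine is a hypothetical name.

```python
import re

# Python spelling of the magic regex (assumption: "\^" means the "^" anchor).
CBT_FIRST_NAME = re.compile(r"^[0-9]{2,4}[.](png|jpg|jpeg|tif|tiff|gif|bmp)")


def choose_subroutine(first_name: str) -> str:
    """Return the subroutine the branch would call for this member name."""
    return "tar-cbt" if CBT_FIRST_NAME.match(first_name) else "tar-file"


samples = [
    "19.jpg",          # Black_Cobra_003.cbt
    "0001.png",        # test-png.cbt
    "test.jpg",        # comics.cbt: image suffix, but no digits
    "0.9.gif",         # dpmi-en.tar: only one digit before the dot
    "2712.txt",        # test4digit.tar: digits, but no image suffix
    "vedpowered.gif",  # gtarfail.tar
]
for name in samples:
    print(name, "->", choose_subroutine(name))
```

Only the first two names select tar-cbt, which agrees with the output
after patching shown in this mail.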

I do not know if this is always true, because it is written that
folders may be used to group images in a more logical layout within
the archive, like book chapters. Some applications also support
additional tag information in the form of embedded XML files in the
archive, like ComicInfo.xml. So maybe more test lines or branches, or
more sophisticated regular expressions, must be used for exotic
samples.
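
What a more tolerant check might accept can be sketched in Python.
This is not magic syntax and not part of the patch: it walks all
members with the standard tarfile module, which a magic test on the
first header cannot do. The rule itself (skip folders and
ComicInfo.xml, require numbered image names otherwise) and the names
looks_like_cbt and make_tar are assumptions for illustration.

```python
import io
import re
import tarfile

IMAGE_NAME = re.compile(r"^[0-9]+[.](png|jpe?g|tiff?|gif|bmp)$", re.IGNORECASE)
METADATA_NAMES = {"comicinfo.xml"}


def looks_like_cbt(fileobj) -> bool:
    """Heuristic: all regular members are numbered images or known metadata."""
    images = 0
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isdir():
                continue  # folders may group pages, e.g. by chapter
            base = member.name.rsplit("/", 1)[-1]
            if base.lower() in METADATA_NAMES:
                continue  # embedded tag information
            if not (member.isfile() and IMAGE_NAME.match(base)):
                return False
            images += 1
    return images > 0


def make_tar(names):
    """Build an in-memory tar with empty members (demo helper only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            if name.endswith("/"):
                info.type = tarfile.DIRTYPE
            tar.addfile(info)
    buf.seek(0)
    return buf


print(looks_like_cbt(make_tar(["ch1/", "ch1/0001.png", "ComicInfo.xml"])))
print(looks_like_cbt(make_tar(["gettext-0.10.35/", "gettext-0.10.35/README"])))
```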

After applying the above-mentioned modifications with the patch
file-5.42-archive-cbt.diff, most Comic Book CBT Archive samples are
identified correctly and related TAR files are still described as
before. This now looks like:

Black_Cobra_003.cbt: Comic Book Archive,
		     tar archive, 1st image
		     19.jpg
M129-pax.tar:        POSIX.1-2001 tar archive, global
		     /tmp/GlobalHead.2512.2, mode 0000644,
		     uid 0000000, gid 0000000,
		     size 00000000141,
		     seconds 13071714760
TAR3214-j.TAR:       tar archive (old), file
		     tar3214.txt, mode    666 ,
		     uid      0 , gid      0 ,
		     size        3400 ,
		     seconds  6450504352, comment:
		     comment field created by -j option by DO
archive.dir.tar:     POSIX tar archive (GNU), directory
		     gettext-0.10.35/, mode 0000755,
		     uid 0000000, gid 0000000,
		     size 00000000000,
		     seconds 11401732537,
		     user root, group root
comics.cbt:          POSIX tar archive (GNU), file
		     test.jpg, mode 0000644,
		     uid 0001750, gid 0001750,
		     size 00000001121,
		     seconds 10665023160,
		     user jjmarin, group jjmarin
dpmi-en.tar:         POSIX tar archive, file
		     0.9.gif, mode 000644 ,
		     uid 000124 , gid 000024 ,
		     size 00000000147 ,
		     seconds 05762024207,
		     user dj, group user
gtarfail.tar:        POSIX tar archive, file
		     vedpowered.gif, mode 0000644,
		     uid 0000746, gid 0002044,
		     size 00000001006 ,
		     seconds 07303467402,
		     user jes, group glone
id-high2037-old.tar: tar archive (V7), file
		     6Mar2037.txt, mode 0000644,
		     uid 7777777, gid 7777777,
		     size 00000000374,
		     seconds 17626765756
test-png.cbt:        Comic Book Archive,
		     tar archive, 1st image
		     0001.png
test4digit.tar:      POSIX tar archive (GNU), file
		     2712.txt, mode 000644 ,
		     size 00000204500,
		     seconds 13220741303
test_data.tar:       POSIX tar archive (GNU), file
		     0000000000000000.empty.br, mode 0000600,
		     uid 0423055, gid 0257523,
		     size 00000000001,
		     seconds 13266421766,
		     user eustas, group primarygroup
win10iso-gnu.tar:    POSIX tar archive (GNU), file
		     m/vm/14393.0.160715-1616.
		     RS1_RELEASE_CLIENTENTERPRISE_S_EVAL,
		     mode 0000644,
		     uid 0002464, gid 0001143,
		     size 0xd72db800,
		     seconds 13031057704,
		     user joerg, group Administratoren

I hope my diff file can be applied in a future version of the file
utility.

There still exist some other file formats with the CBT suffix. I
will try to handle these in a future session.

With best wishes,
Jörg Jenderek
- --
Jörg Jenderek
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iF0EARECAB0WIQS5/qNWKD4ASGOJGL+v8rHJQhrU1gUCYuHEYAAKCRCv8rHJQhrU
1kwSAKDfwwjm/RhQycZJXwBPbV9XGPWwGQCgx5Ld5nthG93biSG3g/PygDy8p8Q=
=VhQO
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trid-v-comicbook-cbt.txt.gz
Type: application/x-gzip
Size: 1052 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220728/b0d11f14/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: droid-comicbook-cbt.csv.gz
Type: application/x-gzip
Size: 709 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220728/b0d11f14/attachment-0001.bin>
-------------- next part --------------
--- file-5.42/magic/Magdir/archive.old	2022-05-28 22:13:23.000000000 +0200
+++ file-5.42/magic/Magdir/archive	2022-07-27 22:42:40.851865900 +0200
@@ -25,7 +25,16 @@
 >>>>>>155 ubyte&0xDF	=0	
 # space or ascii digit 0 at start of check sum
 >>>>>>>148	ubyte&0xEF	=0x20	
->>>>>>>>0	use	tar-file
+# FOR DEBUGGING: 
+#>>>>>>>>0	regex		\^[0-9]{2,4}[.](png|jpg|jpeg|tif|tiff|gif|bmp)	NAME "%s"
+# check for 1st image main name with digits used for sorting
+# and for name extension case insensitive like: PNG JPG JPEG TIF TIFF GIF BMP
+>>>>>>>>0	regex		\^[0-9]{2,4}[.](png|jpg|jpeg|tif|tiff|gif|bmp)
+#foo
+>>>>>>>>>0	use	tar-cbt
+# if 1st member name without digits and without used image suffix then it is a TAR archive
+>>>>>>>>0	default		x
+>>>>>>>>>0	use	tar-file
 #	minimal check and then display tar archive information which can also be
 #	embedded inside others like Android Backup, Clam AntiVirus database
 0	name		tar-file
@@ -146,6 +155,19 @@
 >>508	default		x		
 # padding[255] in old tar sometimes comment field
 >>>257	string		>\0		\b, comment: %-.40s
+# Summary:	Comic Book Archive *.CBT with TAR format
+# URL:		https://en.wikipedia.org/wiki/Comic_book_archive
+#		http://fileformats.archiveteam.org/wiki/Comic_Book_Archive
+# Note:		there exist also RAR, ZIP, ACE and 7Z packed variants
+0	name		tar-cbt
+>0	string		x		Comic Book archive, tar archive
+#!:mime	application/x-tar
+!:mime	application/vnd.comicbook
+#!:mime	application/vnd.comicbook+tar
+!:ext	cbt
+# name[100] probably like: 19.jpg 0001.png 0002.png
+# or maybe like ComicInfo.xml
+>0	string		>\0		\b, 1st image %-.60s
 
 # Incremental snapshot gnu-tar format from:
 # https://www.gnu.org/software/tar/manual/html_node/Snapshot-Files.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.42-archive-comicbook-cbt.diff.sig
Type: application/octet-stream
Size: 1117 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20220728/b0d11f14/attachment.obj>

