[File] Update magic entries
Jason Summers
jason1 at pobox.com
Tue Jun 3 15:14:28 UTC 2025
Here are a few items I want to mention.
-----
File's database consists of several hundred pattern files, each with its
own topic. What would be the plan for integrating these new
currently-unsorted patterns? Is someone going to go through and decide
which file each new pattern belongs in?
-----
Including "!:ext" filename extensions was my suggestion, because I thought
it might be easy. But I have concerns about the quality of the data. Lots
of them are clearly wrong, as they have a digit appended to them for no
apparent reason. Beyond that, I wonder if some of them were just made up.
And I know that PRONOM allows a format to have multiple extensions, so why
do none of the new patterns have more than one extension? Where to set the
bar for including an extension is debatable, but I think that low-quality
data could be worse than no data.
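
To make the digit concern concrete, here is a minimal sketch of a screening pass. It assumes the usual "!:ext" syntax with slash-separated values, and the function name is my own invention. It only produces candidates for manual review, since some legitimate extensions do end in a digit.

```python
import re

# Matches a "!:ext" annotation line and captures its slash-separated value.
EXT_LINE = re.compile(r"^!:ext\s+(\S+)")


def suspicious_extensions(magic_text):
    """Return (line_number, extension) pairs for !:ext values ending in a digit.

    This is a heuristic only: a trailing digit does not prove the
    extension is wrong, it just marks it for a human to look at.
    """
    hits = []
    for lineno, line in enumerate(magic_text.splitlines(), start=1):
        m = EXT_LINE.match(line.strip())
        if not m:
            continue
        for ext in m.group(1).split("/"):
            if ext and ext[-1].isdigit():
                hits.append((lineno, ext))
    return hits
```

Running it over the whole proposed magic file would give a rough count of how many extensions need a second look.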
-----
There are still some new patterns that duplicate or overlap patterns
already in file's database. Here are some (not saying whose pattern is
better):
0 string =\x00\ \ \ \ \ \ \ \ \ \ \ \x00\x00 LBR Archive Data
0 string pcxLib PCX Library
3 string pm2 PMarc Compressed Archive
0 string PP11 Power Packer
0 string PP20 Power Packer
0 string SQSH Squash Compressed Data
2 string -sw1- SourceWare Archival Group Pascal Archive
0 string DMS! Disk Masher System compressed disk
0 string EDILZSS EDI Install Packed File EDI LZSSLi
0 string TFMX-SONG\x20 TFMX Module Sound Data
>8 string IFRSRIdx BLORB Interactive Fiction File
8 string ANIMFORM ANIM Animated Raster Graphic
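
For what it's worth, exact duplicates like some of these could be found mechanically. The following is a rough sketch, not a real magic parser: it only looks at top-level lines, compares (offset, type, test) literally, and treats a leading "=" on a string test as the default it is. The function names are made up for illustration, and overlapping (non-identical) patterns would still need manual review.

```python
import re

# offset, type, then a test value that may contain backslash-escaped spaces.
MAGIC_LINE = re.compile(r"^(\S+)\s+(\S+)\s+((?:\\.|[^\s\\])+)")


def top_level_keys(magic_text):
    """Collect (offset, type, test) for top-level pattern lines only."""
    keys = set()
    for line in magic_text.splitlines():
        # Skip blanks, comments, "!:" annotations, and continuation lines.
        if not line or line.startswith(("#", "!:", ">")):
            continue
        m = MAGIC_LINE.match(line)
        if not m:
            continue
        offset, kind, test = m.groups()
        # "=" is the default operator for strings, so drop it before comparing.
        if kind.startswith("string") and test.startswith("=") and len(test) > 1:
            test = test[1:]
        keys.add((offset, kind, test))
    return keys


def exact_duplicates(new_text, existing_text):
    """Return the sorted keys that appear in both magic texts."""
    return sorted(top_level_keys(new_text) & top_level_keys(existing_text))
```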
-----
Some of the new patterns, such as the IFF and RIFF-based ones, ought to be
in the form of a patch to an existing pattern, not a separate pattern.
Examples:
>0 search/600/b _JPSJPS_ JPS Stereoscopic Image
>30 string XMIDFORM Extended MIDI Audio File
-----
Patterns that start with a "search", in particular, probably ought to be
evaluated for their performance impact and likelihood of false positives.
-----
None of the new patterns report metadata, like image dimensions. I'm not
saying that's bad, but I want to make sure the people reading this are
aware.
-----
Lots of the new patterns have unnecessary redundancies, and could, with
effort, be simplified. For example, this:
0 string CA\x00\x00\x00 Crack Art Image
0 string CA\x00\x00\x01 Crack Art Image
0 string CA\x00\x00\x02 Crack Art Image
0 string CA\x00\x00\x03 Crack Art Image
0 string CA\x01\x00\x00 Crack Art Image
0 string CA\x01\x00\x01 Crack Art Image
0 string CA\x01\x00\x02 Crack Art Image
0 string CA\x01\x00\x03 Crack Art Image
0 string CA\x02\x00\x00 Crack Art Image
0 string CA\x02\x00\x01 Crack Art Image
0 string CA\x02\x00\x02 Crack Art Image
0 string CA\x02\x00\x03 Crack Art Image
could be written as:
0 string CA
>2 ubyte <3
>>3 ubeshort <4 Crack Art Image
But that's very hard to do in an automated way.
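
As a sanity check on that rewrite, here is a small sketch that emulates the three-line rule in Python and compares what it accepts against the twelve explicit headers. It is only my emulation of the tests, not file's actual matching code, and the names are illustrative.

```python
def matches_simplified(data: bytes) -> bool:
    """Emulate: 0 string CA / >2 ubyte <3 / >>3 ubeshort <4."""
    if len(data) < 5 or data[:2] != b"CA":
        return False
    if data[2] >= 3:  # >2 ubyte <3
        return False
    # >>3 ubeshort <4: unsigned big-endian 16-bit value at offset 3
    return int.from_bytes(data[3:5], "big") < 4


# The twelve explicit patterns: "CA", a byte in 0..2, 0x00, a byte in 0..3.
EXPLICIT = {b"CA" + bytes([a, 0, b]) for a in range(3) for b in range(4)}


def accepted_headers() -> set:
    """Enumerate every 5-byte "CA" header the simplified rule accepts."""
    accepted = set()
    for b2 in range(256):
        if b2 >= 3:
            continue  # the ubyte test fails regardless of the later bytes
        for short in range(65536):
            header = b"CA" + bytes([b2]) + short.to_bytes(2, "big")
            if matches_simplified(header):
                accepted.add(header)
    return accepted


if __name__ == "__main__":
    print(accepted_headers() == EXPLICIT)
```

The comparison holds because a big-endian short below 4 at offset 3 forces byte 3 to be zero and byte 4 to be 0 through 3, which is exactly what the explicit list spells out.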
On Fri, May 2, 2025 at 12:54 PM Jason Summers <jason1 at pobox.com> wrote:
> To the list: I've tried to help to get these patterns into basic working
> order. And while I'd definitely like to see *some* of them added to file's
> database, I've been skeptical about how this will work, and I'm not taking
> sides. Many of the patterns are not really ready for production. I have a
> number of concerns that I'm prepared to bring up, but I'm hoping that some
> other people here will offer their feedback.
>
> On Fri, May 2, 2025 at 7:42 AM Gregory Lepore <greg at rhobard.com> wrote:
>
>> With a ton of help from Jason Summers I have updated my collection of
>> 1,200 new magic entries for file. I have updated each entry to include
>> links to the supporting documentation I used to create the signatures.
>> The file now passes Jason's mgcchk script and all entries have been
>> tested.
>>
>> I would like to find out the best way to get these signatures verified
>> and into 'file'. I don't think anybody wants 1,200 separate emails/bug
>> reports for each of the entries (but I will do it if needed.)
>>
>> The magic file (lepore_magic) is at:
>>
>> https://github.com/glepore70/pronom-research/tree/master
>>
>> along with sample files and supporting documentation for every new entry
>> in the sample_files directory. Also included are the signatures in
>> PRONOM format.
>>
>> I've also uploaded a helper script, combomask.py which generates a basic
>> file or PRONOM formatted entry based on an analysis of files in ./.
>>
>> Thanks.
>>
>
>
> --
> Jason Summers
>
>
--
Jason Summers