[File] Update magic entries

Jason Summers jason1 at pobox.com
Tue Jun 3 15:14:28 UTC 2025


Here are a few items I want to mention.

-----
File's database consists of several hundred pattern files, each with its
own topic. What would be the plan for integrating these new
currently-unsorted patterns? Is someone going to go through and decide
which file each new pattern belongs in?

-----
Including "!:ext" filename extensions was my suggestion, because I thought
it might be easy. But I have concerns about the quality of the data. Lots
of them are clearly wrong, as they have a digit appended to them for no
apparent reason. Beyond that, I wonder if some of them were just made up.
And I know that PRONOM allows a format to have multiple extensions, so why
do none of the new patterns have more than one extension? Where to set the
bar for including an extension is very debatable, but I think that
low-quality data could be worse than no data.
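
A rough way to get a review list for the digit problem is to scan the
"!:ext" lines and flag every extension that ends in a digit. This is only a
sketch: the function name and the sample entries are made up for
illustration, and it assumes multiple extensions are separated by "/".
Legitimate extensions such as "mp3" will be flagged too, so the output is a
list to check by hand, not a list of errors.

```python
def suspicious_extensions(magic_text):
    """Return (line_number, extension) pairs for !:ext values ending in a digit."""
    flagged = []
    for lineno, line in enumerate(magic_text.splitlines(), start=1):
        if not line.startswith("!:ext"):
            continue
        # Assumption: several extensions on one line are separated by "/".
        for ext in line[len("!:ext"):].strip().split("/"):
            if ext and ext[-1].isdigit():
                flagged.append((lineno, ext))
    return flagged


# Hypothetical entries, not taken from the submitted magic file.
sample = (
    "0 string ABC Example Format\n"
    "!:ext abc1\n"
    "0 string XYZ Other Format\n"
    "!:ext xyz/xy\n"
)
print(suspicious_extensions(sample))
```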

-----
There are still some new patterns that duplicate or overlap patterns
already in file's database. Here are some (not saying whose pattern is
better):

0   string  =\x00\ \ \ \ \ \ \ \ \ \ \ \x00\x00 LBR Archive Data
0   string  pcxLib          PCX Library
3   string  pm2             PMarc Compressed Archive
0   string  PP11            Power Packer
0   string  PP20            Power Packer
0   string  SQSH            Squash Compressed Data
2   string  -sw1-           SourceWare Archival Group Pascal Archive
0   string  DMS!            Disk Masher System compressed disk
0   string  EDILZSS         EDI Install Packed File EDI LZSSLi
0   string  TFMX-SONG\x20   TFMX Module Sound Data
>8  string  IFRSRIdx        BLORB Interactive Fiction File
8   string  ANIMFORM        ANIM Animated Raster Graphic
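
Exact duplicates, at least, can be found mechanically. The sketch below
compares the top-level lines of two magic files on (offset, type, test) and
reports the ones that appear in both. The function names and sample text are
hypothetical, and it makes simplifying assumptions: ">" continuation lines
are skipped, and patterns that merely overlap (different offset, longer or
shorter test, "search" vs. "string") are not detected.

```python
import re

# Fields: offset, type, test (backslash-escaped spaces allowed), message.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+((?:\\.|\S)+)\s*(.*)$")


def top_level_patterns(magic_text):
    """Map (offset, type, test) to the message for each top-level line."""
    patterns = {}
    for line in magic_text.splitlines():
        # Skip blanks, comments, "!:" annotations, and ">" continuation lines.
        if not line.strip() or line.startswith(("#", "!:", ">")):
            continue
        match = LINE_RE.match(line)
        if match:
            offset, kind, test, message = match.groups()
            patterns.setdefault((offset, kind, test), message)
    return patterns


def exact_duplicates(new_text, existing_text):
    """List patterns whose offset, type, and test are identical in both texts."""
    new = top_level_patterns(new_text)
    old = top_level_patterns(existing_text)
    return [(key, new[key], old[key]) for key in sorted(new.keys() & old.keys())]


# Hypothetical inputs; the "existing" description is a placeholder.
new_sample = "0\tstring\tPP20\tPower Packer\n0\tstring\tZZZZ\tHypothetical Format\n"
old_sample = "0 string PP20 placeholder description\n"
for key, new_msg, old_msg in exact_duplicates(new_sample, old_sample):
    print(key, "| new:", new_msg, "| existing:", old_msg)
```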

-----
Some of the new patterns, such as the IFF and RIFF-based ones, ought to be
in the form of a patch to an existing pattern, not a separate pattern.
Examples:

>0   search/600/b  _JPSJPS_  JPS Stereoscopic Image
>30  string        XMIDFORM  Extended MIDI Audio File

-----
Patterns that start with a "search", in particular, probably ought to be
evaluated for their performance impact and likelihood of false positives.
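
To build the list of patterns to evaluate, something like the following
could pull out the top-level entries whose type is "search", along with the
search range when one is given. It is a minimal sketch: the function name
and the sample entry are invented, and it only recognizes a decimal range.

```python
def top_level_searches(magic_text):
    """Return (line_number, offset, search_range) for top-level search patterns."""
    found = []
    for lineno, line in enumerate(magic_text.splitlines(), start=1):
        # Only top-level pattern lines are of interest here.
        if line.startswith(("#", ">", "!:")):
            continue
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        offset, kind = fields[0], fields[1]
        parts = kind.split("/")
        if parts[0] != "search":
            continue
        # Assumption: the range, if present, is written as a decimal number.
        span = next((int(p) for p in parts[1:] if p.isdigit()), None)
        found.append((lineno, offset, span))
    return found


# Hypothetical entries for illustration only.
sample = (
    "0 search/600/b EXAMPLE Hypothetical Format\n"
    ">30 string ABCD Nested line, ignored\n"
    "0 string PP20 Power Packer\n"
)
print(top_level_searches(sample))
```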

-----
None of the new patterns report metadata, like image dimensions. I'm not
saying that's bad, but I want to make sure the people reading this are
aware.

-----
Lots of the new patterns have unnecessary redundancies, and could, with
effort, be simplified. For example, this:

0   string  CA\x00\x00\x00  Crack Art Image
0   string  CA\x00\x00\x01  Crack Art Image
0   string  CA\x00\x00\x02  Crack Art Image
0   string  CA\x00\x00\x03  Crack Art Image
0   string  CA\x01\x00\x00  Crack Art Image
0   string  CA\x01\x00\x01  Crack Art Image
0   string  CA\x01\x00\x02  Crack Art Image
0   string  CA\x01\x00\x03  Crack Art Image
0   string  CA\x02\x00\x00  Crack Art Image
0   string  CA\x02\x00\x01  Crack Art Image
0   string  CA\x02\x00\x02  Crack Art Image
0   string  CA\x02\x00\x03  Crack Art Image

Could be written as:

0    string    CA
>2   ubyte     <3
>>3  ubeshort  <4  Crack Art Image

But that's very hard to do in an automated way.
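
Checking a hand-written simplification is much easier than producing one
automatically. As a sketch, the following confirms that the three-line form
accepts exactly the same five-byte prefixes as the twelve explicit patterns.
It models the tests directly (ubyte at offset 2 less than 3, big-endian
unsigned short at offset 3 less than 4) and does not run file itself.

```python
import struct

# The twelve explicit signatures: "CA", a byte in 0..2, a zero byte, a byte in 0..3.
explicit = {b"CA" + bytes([b2, 0, b4]) for b2 in range(3) for b4 in range(4)}

# Every five-byte prefix the compact form accepts:
#   >2  ubyte     <3
#   >>3 ubeshort  <4
accepted = {
    b"CA" + bytes([b2]) + struct.pack(">H", value)
    for b2 in range(256) if b2 < 3
    for value in range(65536) if value < 4
}

print(len(explicit), accepted == explicit)
```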

On Fri, May 2, 2025 at 12:54 PM Jason Summers <jason1 at pobox.com> wrote:

> To the list: I've tried to help to get these patterns into basic working
> order. And while I'd definitely like to see *some* of them added to file's
> database, I've been skeptical about how this will work, and I'm not taking
> sides. Many of the patterns are not really ready for production. I have a
> number of concerns that I'm prepared to bring up, but I'm hoping that some
> other people here will offer their feedback.
>
> On Fri, May 2, 2025 at 7:42 AM Gregory Lepore <greg at rhobard.com> wrote:
>
>> With a ton of help from Jason Summers I have updated my collection of
>> 1,200 new magic entries for file. I have updated each entry to include
>> links to the supporting documentation I used to create the signatures.
>> The file now passes Jason's mgcchk script and all entries have been
>> tested.
>>
>> I would like to find out the best way to get these signatures verified
>> and into 'file'. I don't think anybody wants 1,200 separate emails/bug
>> reports for each of the entries (but I will do it if needed.)
>>
>> The magic file (lepore_magic) is at:
>>
>> https://github.com/glepore70/pronom-research/tree/master
>>
>> along with sample files and supporting documentation for every new entry
>> in the sample_files directory. Also included are the signatures in
>> PRONOM format.
>>
>> I've also uploaded a helper script, combomask.py which generates a basic
>> file or PRONOM formatted entry based on an analysis of files in ./.
>>
>> Thanks.
>> --
>> File mailing list
>> File at astron.com
>> https://mailman.astron.com/mailman/listinfo/file
>>
>
>
> --
> Jason Summers
>
>

-- 
Jason Summers