[File] Update magic entries

Gregory Lepore greg at rhobard.com
Wed Jun 4 14:43:06 UTC 2025


Jason,

Thank you for the suggestions!

For integrating the new signatures - I will take any and all advice. I'm
not proficient enough to produce a patch like other submitters have done,
but I'm willing to learn. I can also go through my signatures and classify
them into one of the existing topics in file.

For the extensions - I have cleaned up the numerical suffixes (a legacy of
my directory structure, which used them to prevent overlaps). I have also
added multiple extensions where I found them in my test data.
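
A minimal sketch of how leftover digit suffixes might be found, assuming the
signature file uses the usual slash-separated "!:ext" lines. The helper name
and the sample text are hypothetical. It only flags candidates for manual
review, since real extensions such as "mp3" also end in a digit.

```python
import re

# Matches a "!:ext" directive and captures its extension list.
EXT_LINE = re.compile(r"^!:ext\s+(\S+)")


def digit_suffixed_extensions(magic_text):
    """Return (line number, extension) pairs for extensions ending in a digit."""
    flagged = []
    for lineno, line in enumerate(magic_text.splitlines(), start=1):
        match = EXT_LINE.match(line)
        if not match:
            continue
        # "!:ext" takes a slash-separated list, e.g. "jpeg/jpg/jpe".
        for ext in match.group(1).split("/"):
            if ext and ext[-1].isdigit():
                flagged.append((lineno, ext))
    return flagged


# Hypothetical sample entries, not taken from lepore_magic.
sample = (
    "0\tstring\tFOO\tFoo data\n"
    "!:ext\tfoo1\n"
    "0\tstring\tBAR\tBar data\n"
    "!:ext\tbar/baz\n"
)
print(digit_suffixed_extensions(sample))  # -> [(2, 'foo1')]
```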

For the duplicates - I missed removing a few, but I kept PCX, SWAG, and
TFMX because I think my signatures match a few more genuine files, without
false negatives. I will do some more research to double-check this.
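
One way the remaining exact duplicates might be listed is sketched below. It
assumes both files are plain magic text and compares only the offset, type,
and test of top-level entries. The function names and sample strings are
hypothetical, and overlaps written differently (for example a "string" test
versus a "belong" test on the same bytes) would not be found this way.

```python
import re

# Offset, type, and test of a top-level entry. "\ " inside the test is an
# escaped space and must not end the field.
ENTRY = re.compile(r"^(\S+)\s+(\S+)\s+((?:\\.|\S)+)")


def top_level_keys(magic_text):
    """Collect (offset, type, test) tuples for top-level entries."""
    keys = set()
    for line in magic_text.splitlines():
        # Skip blank lines, comments, "!:" directives, and ">" continuations.
        if not line or line[0] in "#!>":
            continue
        match = ENTRY.match(line)
        if match:
            keys.add(match.groups())
    return keys


def exact_overlaps(new_text, existing_text):
    """Return entries whose offset, type, and test appear in both files."""
    return sorted(top_level_keys(new_text) & top_level_keys(existing_text))


# Hypothetical sample data for illustration only.
new = "0\tstring\tPP20\tPower Packer\n0\tstring\tCA\\x00\tCrack Art Image\n"
existing = "# comment\n0\tstring\tPP20\tPower Packer (Amiga)\n>4\tbyte\tx\tdetail\n"
print(exact_overlaps(new, existing))  # -> [('0', 'string', 'PP20')]
```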

For the metadata extraction - it's my personal philosophy that a file
identification tool should only identify the file format. Anything beyond
that would be the province of a metadata extraction tool.

I agree completely about the search/ signatures and the IFF/RIFF
signatures. It would be easy to pull them out of my sig file but I don't
know how to apply them as patches to the existing files. I've generally
paid attention to any performance hit for the search/ signatures, which is
one reason I don't do any EOF matching (plus I think that leaks over into
conformance, not identification.)

I have updated the signature file at:
https://github.com/glepore70/pronom-research/blob/master/lepore_magic

Thanks again. I am very willing to do whatever it takes to get these
signatures into file, and I understand if the volume is daunting. If
necessary we can take them one by one, but that would take forever.

On Tue, Jun 3, 2025 at 11:20 AM Jason Summers <jason1 at pobox.com> wrote:

> Here are a few items I want to mention.
>
> -----
> File's database consists of several hundred pattern files, each with its
> own topic. What would be the plan for integrating these new
> currently-unsorted patterns? Is someone going to go through and decide
> which file each new pattern belongs in?
>
> -----
> Including "!:ext" filename extensions was my suggestion, because I thought
> it might be easy. But I have concerns about the quality of the data. Lots
> of them are clearly wrong, as they have a digit appended to them for no
> apparent reason. Beyond that, I wonder if some of them were just made up.
> And I know that PRONOM allows a format to have multiple extensions, so why
> do none of the new patterns have more than one extension? Where to set the
> bar for whether to include an extension is very debatable, but I think that
> low quality data could be worse than no data.
>
> -----
> There are still some new patterns that duplicate or overlap patterns
> already in file's database. Here are some (not saying whose pattern is
> better):
>
> 0   string  =\x00\ \ \ \ \ \ \ \ \ \ \ \x00\x00 LBR Archive Data
> 0   string  pcxLib  PCX Library
> 3   string  pm2 PMarc Compressed Archive
> 0   string  PP11    Power Packer
> 0   string  PP20    Power Packer
> 0   string  SQSH    Squash Compressed Data
> 2   string  -sw1-   SourceWare Archival Group Pascal Archive
> 0   string  DMS!    Disk Masher System compressed disk
> 0   string  EDILZSS EDI Install Packed File EDI LZSSLi
> 0   string  TFMX-SONG\x20   TFMX Module Sound Data
> >8  string  IFRSRIdx    BLORB Interactive Fiction File
> 8   string  ANIMFORM    ANIM Animated Raster Graphic
>
> -----
> Some of the new patterns, such as the IFF and RIFF-based ones, ought to be
> in the form of a patch to an existing pattern, not a separate pattern.
> Examples:
>
> >0  search/600/b    _JPSJPS_    JPS Stereoscopic Image
> >30 string  XMIDFORM    Extended MIDI Audio File
>
> -----
> Patterns that start with a "search", in particular, probably ought to be
> evaluated for their performance impact, and likelihood of false positives.
>
> -----
> None of the new patterns report metadata, like image dimensions. I'm not
> saying that's bad, but I want to make sure the people reading this are
> aware.
>
> -----
> Lots of the new patterns have unnecessary redundancies, and could, with
> effort, be simplified. For example, this:
>
> 0   string  CA\x00\x00\x00  Crack Art Image
> 0   string  CA\x00\x00\x01  Crack Art Image
> 0   string  CA\x00\x00\x02  Crack Art Image
> 0   string  CA\x00\x00\x03  Crack Art Image
> 0   string  CA\x01\x00\x00  Crack Art Image
> 0   string  CA\x01\x00\x01  Crack Art Image
> 0   string  CA\x01\x00\x02  Crack Art Image
> 0   string  CA\x01\x00\x03  Crack Art Image
> 0   string  CA\x02\x00\x00  Crack Art Image
> 0   string  CA\x02\x00\x01  Crack Art Image
> 0   string  CA\x02\x00\x02  Crack Art Image
> 0   string  CA\x02\x00\x03  Crack Art Image
>
> Could be written as:
>
> 0    string    CA
> >2   ubyte     <3
> >>3  ubeshort  <4  Crack Art Image
>
> But that's very hard to do in an automated way.
>
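
As a sanity check on the Crack Art simplification quoted above, the sketch
below compares the twelve explicit prefixes with a Python model of the
compact rule. The function compact_rule is a hypothetical stand-in for the
three magic lines, not file's own matching code, and it assumes that "ubyte"
and "ubeshort" are an unsigned byte and an unsigned big-endian 16-bit value.

```python
# The twelve explicit 5-byte prefixes from the quoted list.
explicit = {
    b"CA" + bytes([b2, 0, b4])
    for b2 in range(3)
    for b4 in range(4)
}


def compact_rule(data):
    """Model of: 0 string CA / >2 ubyte <3 / >>3 ubeshort <4."""
    if len(data) < 5 or data[:2] != b"CA":
        return False
    if data[2] >= 3:
        return False
    return int.from_bytes(data[3:5], "big") < 4


# The two numeric tests are independent, so each field is enumerated on its
# own instead of looping over all 2**24 combinations.
accepted_b2 = {
    b2 for b2 in range(256)
    if compact_rule(b"CA" + bytes([b2, 0, 0]))
}
accepted_tail = {
    v for v in range(65536)
    if compact_rule(b"CA\x00" + v.to_bytes(2, "big"))
}
matched = {
    b"CA" + bytes([b2]) + v.to_bytes(2, "big")
    for b2 in accepted_b2
    for v in accepted_tail
}

print(matched == explicit)  # -> True
```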
> On Fri, May 2, 2025 at 12:54 PM Jason Summers <jason1 at pobox.com> wrote:
>
>> To the list: I've tried to help to get these patterns into basic working
>> order. And while I'd definitely like to see *some* of them added to file's
>> database, I've been skeptical about how this will work, and I'm not taking
>> sides. Many of the patterns are not really ready for production. I have a
>> number of concerns that I'm prepared to bring up, but I'm hoping that some
>> other people here will offer their feedback.
>>
>> On Fri, May 2, 2025 at 7:42 AM Gregory Lepore <greg at rhobard.com> wrote:
>>
>>> With a ton of help from Jason Summers I have updated my collection of
>>> 1,200 new magic entries for file. I have updated each entry to include
>>> links to the supporting documentation I used to create the signatures.
>>> The file now passes Jason's mgcchk script and all entries have been
>>> tested.
>>>
>>> I would like to find out the best way to get these signatures verified
>>> and into 'file'. I don't think anybody wants 1,200 separate emails/bug
>>> reports for each of the entries (but I will do it if needed.)
>>>
>>> The magic file (lepore_magic) is at:
>>>
>>> https://github.com/glepore70/pronom-research/tree/master
>>>
>>> along with sample files and supporting documentation for every new entry
>>> in the sample_files directory. Also included are the signatures in
>>> PRONOM format.
>>>
>>> I've also uploaded a helper script, combomask.py which generates a basic
>>> file or PRONOM formatted entry based on an analysis of files in ./.
>>>
>>> Thanks.
>>> --
>>> File mailing list
>>> File at astron.com
>>> https://mailman.astron.com/mailman/listinfo/file
>>>
>>
>>
>> --
>> Jason Summers
>>
>>
>
> --
> Jason Summers
>