<div dir="ltr">Here are a few items I want to mention.<br><br>-----<br>File's database consists of several hundred pattern files, each with its own topic. What would be the plan for integrating these new currently-unsorted patterns? Is someone going to go through and decide which file each new pattern belongs in?<br><br>-----<br>Including "!:ext" filename extensions was my suggestion, because I thought it might be easy. But I have concerns about the quality of the data. Lots of them are clearly wrong, as they have a digit appended to them for no apparent reason. Beyond that, I wonder if some of them were just made up. And I know that PRONOM allows a format to have multiple extensions, so why do none of the new patterns have more than one extension? Where to set the bar for whether to include an extension is very debatable, but I think that low quality data could be worse than no data.<br><br>-----<br>There are still some new patterns that duplicate or overlap patterns already in file's database. Here are some (not saying whose pattern is better):<br><br>0 string =\x00\ \ \ \ \ \ \ \ \ \ \ \x00\x00 LBR Archive Data<br>0 string pcxLib PCX Library<br>3 string pm2 PMarc Compressed Archive<br>0 string PP11 Power Packer<br>0 string PP20 Power Packer<br>0 string SQSH Squash Compressed Data<br>2 string -sw1- SourceWare Archival Group Pascal Archive<br>0 string DMS! Disk Masher System compressed disk<br>0 string EDILZSS EDI Install Packed File EDI LZSSLi<br>0 string TFMX-SONG\x20 TFMX Module Sound Data<br>>8 string IFRSRIdx BLORB Interactive Fiction File<br>8 string ANIMFORM ANIM Animated Raster Graphic<br><br>-----<br>Some of the new patterns, such as the IFF and RIFF-based ones, ought to be in the form of a patch to an existing pattern, not a separate pattern. 
Examples:<br><br>>0 search/600/b _JPSJPS_ JPS Stereoscopic Image<br>>30 string XMIDFORM Extended MIDI Audio File<br><br>-----<br>Patterns that start with a "search", in particular, probably ought to be evaluated for their performance impact, and likelihood of false positives.<br><br>-----<br>None of the new patterns report metadata, like image dimensions. I'm not saying that's bad, but I want to make sure the people reading this are aware.<br><br>-----<br>Lots of the new patterns have unnecessary redundancies, and could, with effort, be simplified. For example, this:<br><br>0 string CA\x00\x00\x00 Crack Art Image<br>0 string CA\x00\x00\x01 Crack Art Image<br>0 string CA\x00\x00\x02 Crack Art Image<br>0 string CA\x00\x00\x03 Crack Art Image<br>0 string CA\x01\x00\x00 Crack Art Image<br>0 string CA\x01\x00\x01 Crack Art Image<br>0 string CA\x01\x00\x02 Crack Art Image<br>0 string CA\x01\x00\x03 Crack Art Image<br>0 string CA\x02\x00\x00 Crack Art Image<br>0 string CA\x02\x00\x01 Crack Art Image<br>0 string CA\x02\x00\x02 Crack Art Image<br>0 string CA\x02\x00\x03 Crack Art Image<br><br>Could be written as:<br><br>0 string CA<br>>2 ubyte <3<br>>>3 ubeshort <4 Crack Art Image<br><br>But that's very hard to do in an automated way.<br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, May 2, 2025 at 12:54 PM Jason Summers <<a href="mailto:jason1@pobox.com">jason1@pobox.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">To the list: I've tried to help to get these patterns into basic working order. And while I'd definitely like to see *some* of them added to file's database, I've been skeptical about how this will work, and I'm not taking sides. Many of the patterns are not really ready for production. 
I have a number of concerns that I'm prepared to bring up, but I'm hoping that some other people here will offer their feedback.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 2, 2025 at 7:42 AM Gregory Lepore <<a href="mailto:greg@rhobard.com" target="_blank">greg@rhobard.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">With a ton of help from Jason Summers I have updated my collection of <br>
1,200 new magic entries for file. Each entry now includes <br>
links to the supporting documentation I used to create the signatures. <br>
The file now passes Jason's mgcchk script and all entries have been tested.<br>
<br>
I would like to find out the best way to get these signatures verified <br>
and into 'file'. I don't think anybody wants 1,200 separate emails/bug <br>
reports, one for each entry (but I will do it if needed).<br>
<br>
The magic file (lepore_magic) is at:<br>
<br>
<a href="https://github.com/glepore70/pronom-research/tree/master" rel="noreferrer" target="_blank">https://github.com/glepore70/pronom-research/tree/master</a><br>
<br>
along with sample files and supporting documentation for every new entry <br>
in the sample_files directory. Also included are the signatures in <br>
PRONOM format.<br>
<br>
I've also uploaded a helper script, combomask.py, which generates a basic <br>
file- or PRONOM-formatted entry based on an analysis of the files in the current directory (./).<br>
<br>
Thanks.<br>
-- <br>
File mailing list<br>
<a href="mailto:File@astron.com" target="_blank">File@astron.com</a><br>
<a href="https://mailman.astron.com/mailman/listinfo/file" rel="noreferrer" target="_blank">https://mailman.astron.com/mailman/listinfo/file</a><br>
</blockquote></div><div><br clear="all"></div><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr">Jason Summers<div><br></div></div></div>
</blockquote></div><div><br clear="all"></div><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr">Jason Summers<div><br></div></div></div>
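<div dir="ltr"><br>-----<br>As a rough sanity check on the Crack Art simplification suggested earlier in this message, the Python sketch below models both versions of the rule and compares them on a set of boundary values for bytes 2 through 4. The helper names (matches_original, matches_compact, rules_agree) are hypothetical and not part of file or mgcchk, and the model only approximates how file evaluates "string", "ubyte" and "ubeshort" tests. It is a spot check, not a proof that the two forms are equivalent.<br></div>
<pre>
```python
from itertools import product

# The twelve literal prefixes from the original Crack Art patterns:
# "CA", then a byte in 0..2, a zero byte, and a byte in 0..3.
ORIGINAL_PREFIXES = {
    b"CA" + bytes([b2, 0, b4]) for b2, b4 in product(range(3), range(4))
}


def matches_original(data: bytes) -> bool:
    """True if data starts with one of the twelve literal strings."""
    return data[:5] in ORIGINAL_PREFIXES


def matches_compact(data: bytes) -> bool:
    """Assumed model of the three-line rule: "CA" at offset 0,
    an unsigned byte below 3 at offset 2, and an unsigned
    big-endian short below 4 at offset 3."""
    if len(data) not in range(5, len(data) + 1) or data[:2] != b"CA":
        return False  # too short, or wrong leading string
    ubyte_at_2 = data[2]
    ubeshort_at_3 = int.from_bytes(data[3:5], "big")
    return ubyte_at_2 in range(3) and ubeshort_at_3 in range(4)


def rules_agree(values=(0, 1, 2, 3, 4, 5, 127, 255)) -> bool:
    """Compare both models on every combination of the given
    boundary values for bytes 2, 3 and 4 (not exhaustive)."""
    for b2, b3, b4 in product(values, repeat=3):
        header = b"CA" + bytes([b2, b3, b4])
        if matches_original(header) != matches_compact(header):
            return False
    return True


if __name__ == "__main__":
    print("rules agree on sampled headers:", rules_agree())
```
</pre>
<div dir="ltr">Comparing against the literal list, rather than trusting the rewritten rule by inspection, is the kind of check that would be needed before collapsing other redundant groups.<br></div>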