[File] matching arbitrary data

B Watson yalhcru at gmail.com
Wed Jan 16 11:21:55 UTC 2019


I'm pretty new to writing magic rules, so apologies if this is a dumb
question.

What I'm trying to do is write magic that matches a complex tokenized
format. Reference: https://www.atariarchives.org/dere/chapt10.php

The first part is easy: the files all start with 0x0000. By itself this
isn't unique enough to use, I don't think.

I've got about 5000 BASIC files to test with (plus the ability to
create as many as I want by writing code in the Atari800 emulator and
saving it). What I found from examining a bunch of them: there's no
other fixed-location signature to check for, and no relocatable fixed
signatures that are guaranteed to exist.

According to the doc, it looks like every valid file should end in 0x16,
but many of my test files don't (they have garbage at the end, added by
a bug in the BASIC itself), and yet they still load and run fine in the
emulator. So this won't work:

0	uleshort	0x0000
>-1	ubyte	0x16

...and even if that did work, it might result in too many false positives.

So what I hit on: every single one of these files should contain the
tokenized form of the command that was used to save it. In BASIC, you'd
say e.g. SAVE "D:FILENAME" to save the file... and the save command
itself gets saved in the file (actually there are some rare exceptions,
but I'm not worrying about those).

The token sequence for the SAVE command is always:

0x19 - the SAVE token
0x0f - means "start of string constant"
?    - one byte length of the string (can range 0-255, usually 15 or less)
filename - a string matching the regex [DHC][0-9]{0,1}:

I thought to use a regex, start matching at byte 14 (first byte after the
header), but regexes don't mix well with binary files. I tried this:

0	uleshort	0x0000
>14	regex/32768	[\x19][\x0f].[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC

...but it never matches. I'm assuming the regex starts trying to match
at byte 14, and considers the first 0x00 (or \n, or maybe any unprintable
character) to end the search space.

OK, no problem, I tried this:

0	uleshort	0x0000
>14	search	\x19\x0f
>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC

This works for very small/simple test files, where the byte sequence 0x19
0x0f only occurs once (just before the SAVE filename). But in longer
more complex programs, these 2 bytes can occur in other contexts: part
of a floating point constant, or string data, or the \x19 is, instead of
a SAVE command token, a TO operator token that happens to be followed
by 0x0f... to do this right, I'd have to properly detokenize the file
(which I could do in C or whatever, but not magic-language).

So what I really need (and don't think exists) is a way to write something
like a loop: repeat the 'search \x19\x0f' if the regex doesn't match. Or
in plain English: I want to identify the file as tokenized BASIC if *any*
occurrence of 0x190f (followed by any one byte) is followed by a valid
Atari filename.
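
To make the intended logic concrete, here is a rough sketch of it in Python
rather than magic. The function and constant names are made up for
illustration, and it assumes the layout described above (0x0000 at offset 0,
a 14-byte header, one length byte between 0x19 0x0f and the filename). It is
not a real detokenizer, just the "try every occurrence" loop:

```python
import re

# Device spec at the start of the SAVE filename: D:, D1:, H2:, C:, ...
FILENAME_RE = re.compile(rb"[DHC][0-9]?:")
SAVE_PREFIX = b"\x19\x0f"   # SAVE token + "start of string constant"
HEADER_LEN = 14             # assumed: first byte after the header


def looks_like_atari_basic(data: bytes) -> bool:
    """True if any 0x19 0x0f past the header is followed by one length
    byte and then something that looks like an Atari filename."""
    if data[:2] != b"\x00\x00":
        return False
    pos = data.find(SAVE_PREFIX, HEADER_LEN)
    while pos != -1:
        # Skip the two token bytes plus the one-byte string length.
        if FILENAME_RE.match(data, pos + 3):
            return True
        # Not a SAVE command after all: retry from the next byte.
        pos = data.find(SAVE_PREFIX, pos + 1)
    return False
```

The while loop is the part I can't find a way to express in a magic file.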

I came up with a horrible "pyramid scheme" extension, that looks like:

0	uleshort	0x0000
>14	search	\x19\x0f
>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>&0	search	\x19\x0f
>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>&0	search	\x19\x0f
>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>>&0	search	\x19\x0f
>>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>>>&0	search	\x19\x0f
>>>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>>>>&0	search	\x19\x0f
>>>>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>>>>>&0	search	\x19\x0f
>>>>>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC
>>>>>>>>&0	search	\x19\x0f
>>>>>>>>>&1	regex	[DHC][0-9]{0,1}:	Atari 8-bit tokenized BASIC

etc etc. I extended it out to 20 levels. It actually works. Unless there
are more than 20 non-filename occurrences of 0x190f. So it works with
99.5% of my 5000-odd test files, but it offends me. Plus, one of the
test files has over 50 occurrences...

*Please* someone tell me if there's a right way to do this, that I'm
not getting from the man page and existing stuff in Magdir.

I tried another approach: search for the fixed strings D: D1: D2: etc
up to D9:, also H: H1: etc, and C:. But I don't see a way to tell file
to terminate and print its identification if any one of the searches
succeeds. Something like:

0	uleshort	0x0000
>14	search	D:	Atari 8-bit tokenized BASIC
>14	search	D1:	Atari 8-bit tokenized BASIC
>14	search	D2:	Atari 8-bit tokenized BASIC
(etc etc for the other drive IDs).

It works, but if the file contains more than one drive spec, file prints
the identification multiple times.

