[File] Erroneous byte in Magdir/msdos?
Christos Zoulas
christos at zoulas.com
Wed Feb 12 16:58:00 UTC 2025
On 2025-02-12 9:49 am, Jason Summers wrote:
> If the encoding of pattern files is documented somewhere, I'd like to
> see it. I couldn't find anything when I researched it for Mgchkj
> (https://github.com/jsummers/mgchkj). There's no single encoding that
> is valid for all the current patterns and comments.
>
> Of course, you can use 'file' itself to tell you which files are not
> valid UTF-8. Mgchkj is more precise, and it does warn about the issue
> you're reporting:
>
> filesystems:2610: Line has non-ASCII characters (probably not UTF-8)
> [# From: Thomas Weißschuh <thomas at t-8ch.de>
> firmware:177: Line has non-ASCII characters (probably not UTF-8) [#
> Note: called "Intel Hexadecimal object format" by TrID, "Intel®
> hexadecimal object file" on Linux]
> images:647: Line has non-ASCII characters (probably not UTF-8) [#
> binary data variant with non ASCII text characters like Control-A or
> ºC in thermostat.fig]
> msdos:2526: Line has non-ASCII characters (probably not UTF-8) [# 1st
> member name like: "Class Notes.one" "test-onenote.one" "Open
> Notebook.onetoc2" "Editor Öffnen.onetoc2"]
>
> If you use the "-w3" option, it also warns about (non-ASCII) UTF-8.
> I'll remove that warning if it's documented as being correct.
>
> There is probably a way to configure your Python input method to
> handle errors differently, but I don't know enough to help with that.
>
> On Tue, Feb 11, 2025 at 7:23 PM Sudarshan S Chawathe
> <chaw at eip10.org> wrote:
>
>> In the file Magdir/msdos, there seems to be a strange byte at offset
>> 108406. Examining it in emacs gives:
>>
>> Char: \326 (4194262, #o17777726, #x3fffd6, raw-byte) point=108406 of
>> 127680 (85%) column=93
>>
>> [I replaced the actual byte with the string "\326" above to avoid
>> potential email problems.]
>>
>> Trying to read that line using 'input' in python3 gives:
>>
>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position
>> 1909: invalid continuation byte
>>
>> Is that byte a typo of some sort, or should that file be read using a
>> different text encoding (or method)?
>>
>> Regards,
>>
These are all ISO-8859-1 characters: German sharp s ("ss"), registered
sign (R), ordinal indicator, and O with umlaut.
I have replaced them with their ASCII equivalents.
Best,
christos
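
For anyone who still has an older copy of the pattern files, the following is a minimal Python sketch of one way to read them without hitting the UnicodeDecodeError quoted above. It is not part of 'file' or Mgchkj. The helper names decode_line and non_ascii_lines are made up for illustration, and the fallback to ISO-8859-1 is an assumption based on the reply above.

```python
from typing import Iterator, Tuple


def decode_line(raw: bytes) -> str:
    """Decode as UTF-8; fall back to ISO-8859-1 if the bytes are not valid UTF-8.

    ISO-8859-1 maps every byte to a code point, so the fallback cannot fail.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def non_ascii_lines(data: bytes) -> Iterator[Tuple[int, bool, str]]:
    """Yield (line number, is_valid_utf8, decoded text) for each non-ASCII line."""
    for lineno, raw in enumerate(data.splitlines(), start=1):
        if raw.isascii():
            continue
        try:
            raw.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            valid = False
        yield lineno, valid, decode_line(raw)


if __name__ == "__main__":
    # Sample bytes only: 0xd6 is the byte reported in Magdir/msdos.
    sample = b'# "Open Notebook.onetoc2" "Editor \xd6ffnen.onetoc2"\n'
    for lineno, valid, text in non_ascii_lines(sample):
        print(lineno, "UTF-8" if valid else "not UTF-8", text)
```

To scan a real file, read it in binary mode (for example open(path, "rb").read()) and pass the bytes to non_ascii_lines. If you only need text and know the encoding, open(path, encoding="iso-8859-1") avoids the error as well.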