[File] Gemtext file badly recognized as HTML
ploumfile at offpunk.net
Sat Jan 31 17:25:19 UTC 2026
On 26 Jan 31 12:09, Christos Zoulas wrote:
>Yes, we can require that <title> exists for HTML documents, but that will break other uses,
>for example HTML embeddings. Anyway, it is just heuristics, and if you try to fix one document,
>you might break another.
Indeed. That’s why I’m really curious about file’s philosophy on this. At
what point is a false positive considered a bug? To what degree should
a file be identified as a given format even if it doesn’t closely follow
the standard?
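
To make the trade-off concrete, here is a minimal Python sketch of the three hints listed in the quoted message below (leading "<", an <html> tag, a <title> tag). It is only an illustration, not how file evaluates its magic entries; the function name html_hints and the sample strings are hypothetical.

```python
import re

def html_hints(text: str) -> dict:
    """Evaluate the three 'is this HTML?' hints; names are hypothetical."""
    lowered = text.lower()
    return {
        # Hint 1: the first non-whitespace character is "<".
        "starts_with_lt": text.lstrip().startswith("<"),
        # Hint 2: an <html> tag appears somewhere.
        "has_html_tag": re.search(r"<html[\s>]", lowered) is not None,
        # Hint 3: a <title> tag appears somewhere.
        "has_title_tag": re.search(r"<title[\s>]", lowered) is not None,
    }

# Made-up samples: a gemtext-like page quoting an anchor, a full HTML
# document, and a bare HTML fragment meant for embedding.
gemtext_sample = '# 2026-01-30 Locking the gate\n\nThe config contains <a href="/x"> in a quoted line.\n'
html_sample = "<!DOCTYPE html>\n<html><head><title>t</title></head><body></body></html>\n"
fragment = '<a href="/x">link</a>'

for name, sample in [("gemtext", gemtext_sample), ("html", html_sample), ("fragment", fragment)]:
    print(name, html_hints(sample))
```

The gemtext-like sample fails all three hints and the full document passes all three, but the embedded fragment passes only the first. That is the case where requiring <title> would break detection, as noted above.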
I’m also willing to help with identifying gemtext files, but I’ve been told
that there’s no magic number for gemtext. Is there a way to contribute
one, one way or another?
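
Since gemtext has no magic number, any detection has to rely on its line-oriented syntax. The most distinctive feature is the link line, which starts with "=>". Below is a rough Python sketch of such a heuristic, under the assumption that the share of link lines is a usable signal. The function name gemtext_score and the sample are hypothetical, and this is not a magic entry for file.

```python
def gemtext_score(text: str) -> float:
    """Fraction of non-empty lines, outside preformatted blocks, that are link lines."""
    in_pre = False
    links = 0
    total = 0
    for line in text.splitlines():
        # A line starting with ``` toggles preformatted mode in gemtext.
        if line.startswith("```"):
            in_pre = not in_pre
            continue
        if in_pre or not line.strip():
            continue
        total += 1
        # A link line is "=>" followed by a non-empty target.
        if line.startswith("=>") and line[2:].strip():
            links += 1
    return links / total if total else 0.0

sample = (
    "# Title\n"
    "\n"
    "Some text.\n"
    "=> gemini://example.org/ Example\n"
    "```\n"
    "=> not a link, preformatted\n"
    "```\n"
)
print(gemtext_score(sample))
```

A score like this is still only a heuristic: a gemtext page with no links scores 0, so it could lower false positives but not remove them.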
Regards
>
>Best,
>
>christos
>
>> On Jan 31, 2026, at 11:42 AM, ploumfile at offpunk.net wrote:
>>
>> On 26 Jan 31 10:47, Christos Zoulas wrote:
>>> Running with -d says:
>>>
>>> *unknown*, 115: > 0 search/wct/4096,=<a href=,"HTML document text"]
>>> search: [# 2026-01-30 Locking the gate\n\nThe last few days have once again been pretty stressful as the scrape...] for [<a href=] found
>>> 0 == 0 = 1 strength=68
>>> [try ascmagic 1]
>>> /Users/christos/bad_mime.DEFANGED-16: HTML document, Unicode text, UTF-8 text, with very long lines (440)
>>
>> I overlooked this: there’s indeed a single <a href= in the middle of the document, in a quoted text (the document describes a webserver configuration).
>>
>> So this is definitely a false positive for HTML.
>>
>> Is this a bug that should be fixed or not?
>>
>> There are multiple hints that this document is not HTML:
>>
>> 1. No opening "<" (in most HTML documents, the first non-empty character should probably be "<")
>> 2. No <html> (although I’m not sure it is mandatory)
>> 3. No <title> tag (that one is mandatory according to RFC 1866: https://www.ietf.org/rfc/rfc1866.txt)
>>
>> What’s your opinion on this?
>>
>> Ploum
>>
>
--
Ploum - Lionel Dricot
Blog: https://www.ploum.net
Bikepunk: https://bikepunk.fr/