[File] Gemtext file badly recognized as HTML

Christos Zoulas christos at zoulas.com
Sat Jan 31 17:09:26 UTC 2026


Yes, we can require that <title> exists for HTML documents, but that will break other uses,
for example HTML embeddings. Anyway, it is just heuristics, and if you try to fix one document,
you might break another.
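
To make the trade-off concrete, here is a rough Python sketch. It is not file's magic
code: the function names and the three sample inputs are invented for illustration, and
the plain substring test is only a simplified stand-in for the search rule shown in the
debug output quoted below.

```python
import re

WINDOW = 4096  # the quoted debug output shows a search limited to 4096 bytes


def loose_is_html(data: bytes) -> bool:
    """Simplified stand-in for the current rule: any '<a href=' in the window."""
    return b"<a href=" in data[:WINDOW].lower()


def strict_is_html(data: bytes) -> bool:
    """Hypothetical stricter rule: additionally require a <title> tag."""
    head = data[:WINDOW].lower()
    return b"<a href=" in head and re.search(rb"<title[\s>]", head) is not None


# Invented samples: a text file that merely quotes a link, an embedded HTML
# fragment without <title>, and a complete HTML page.
text_quoting_link = b'# Some notes\n\nThe config serves <a href="/trap"> to bots.\n'
html_fragment = b'<p>See <a href="https://example.org/">this page</a>.</p>\n'
full_page = b'<html><head><title>t</title></head><body><a href="/">x</a></body></html>'

for name, doc in [
    ("text quoting a link", text_quoting_link),
    ("HTML fragment", html_fragment),
    ("full page", full_page),
]:
    print(f"{name}: loose={loose_is_html(doc)} strict={strict_is_html(doc)}")
```

In this sketch the stricter rule stops matching the text file, but it also stops
matching the fragment, which is the kind of breakage I mean.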

Best,

christos

> On Jan 31, 2026, at 11:42 AM, ploumfile at offpunk.net wrote:
> 
> On 26 Jan 31 10:47, Christos Zoulas wrote:
>> Running with -d says:
>> 
>> *unknown*, 115: > 0 search/wct/4096,=<a href=,"HTML document text"]
>> search: [# 2026-01-30 Locking the gate\n\nThe last few days have once again been pretty stressful as the scrape...] for [<a href=] found
>> 0 == 0 = 1 strength=68
>> [try ascmagic 1]
>> /Users/christos/bad_mime.DEFANGED-16: HTML document, Unicode text, UTF-8 text, with very long lines (440)
> 
> I overlooked this: there’s indeed a single <a href= in the middle of the document, in a quoted text (the document describes a webserver configuration)
> 
> So this is definitely a false-positive for HTML.
> 
> Is this a bug that should be fixed or not?
> 
> There are multiple hints that this document is not HTML:
> 
> 1. No opening "<" (in most HTML documents, the first non-empty character should probably be "<")
> 2. No <html> (although I’m not sure it is mandatory)
> 3. No <title> tag (that one is mandatory according to RFC1866 (https://www.ietf.org/rfc/rfc1866.txt))
> 
> What’s your opinion on this?
> 
> Ploum
> 
> 
