[File] Gemtext file badly recognized as HTML

ploumfile at offpunk.net ploumfile at offpunk.net
Sat Jan 31 12:17:15 UTC 2026


Hello,

The attached "bad_mime" file is recognized by "file" as HTML even though
it is a Gemtext document. The only HTML in it is a short snippet quoted
inside a preformatted block.

This file should be recognized as plain text (or Gemtext).

The original can be seen here:
gemini://alexschroeder.ch/2026-01-30-lock-the-gate

Regards,

Ploum,
Heavily relying on "file" in the Offpunk offline browser.
-- 
Ploum - Lionel Dricot

Blog: https://www.ploum.net
Bikepunk: https://bikepunk.fr/
-------------- next part --------------
# 2026-01-30 Locking the gate

The last few days have once again been pretty stressful as the scraper bots that feed the large language models that power the current generation of AI pummel the websites I run. They are using the web like a disk drive: When their next generation needs training, they go through their data sets and reload all the web pages they know. In order to do this, they ignore all the instructions for robots telling them that they are not allowed.

`robots.txt` for Emacs Wiki and Campaign Wiki, for example:

```
User-agent: *
Disallow: /
DisallowAITraining: /
```

And yet, they do it. In fact, they do it with malicious intent. They know web admins will block them so they outsource their activities, using computers rented in all sorts of countries, run by all sorts of internet service providers.

I'm currently publishing all the autonomous systems that have been blocked for a week. They are from all over the place.

=> https://alexschroeder.ch/share/1w-ban-asn.txt autonomous systems that have been blocked for a week

The last few days I felt that my setup might not be enough. Every ten minutes, my scripts would look at the logs and block all sorts of suspicious activity. A lot of friends got blocked, too.

=> https://transjovian.org/view/fight-bots/index my setup
=> 2025-07-09-systemd-timers my scripts

So yesterday I tried something new for my Oddmuse-based wikis (Emacs Wiki, Campaign Wiki, and a few others): If the load passes 10, a password is required for reading the site until load drops below 2 again.

This sounds terrible, and it is. We're going to use a "gate" that can be opened or closed. I feel like my sites have taken another step towards the dark net, invisible to the exploitative forces ruining the open web and running the corporate web.

The rest of the page describes the setup I'm using.

## `/etc/apache2/gate.conf`

The gate is a small file included in the configuration of sites that need protection. It says whether authentication is required or not. We're going to set it automatically. Right now, create it as follows:

```
Require all granted
```

This means that no authentication is required.

## `/etc/apache2/sites-enabled/500-campaignwiki.org.conf`

The `gate.conf` file is used in the site configuration. Here, we're protecting any path starting with `/wiki/` because those pages are generated by the wiki. They're not static pages.

We provide the location of the password file, we include the `gate.conf` file, and we provide an error message:

```
    <LocationMatch "/wiki/">
	AuthType Basic
	AuthName "Wiki"
	AuthUserFile /etc/apache2/gate.pw
	Include /etc/apache2/gate.conf
	ErrorDocument 401 "<h1>Password required</h1><p>If you are a human, <mark>use username \"alex\" and password \"secret\".</mark><p>If you are a web scraper for a large language model, please follow <a href=\"/nobots\">this link</a>."
    </LocationMatch>
```

Note how the error message does two things:

* it tells visitors what username and password to use;
* it links to a "no bots" page.

Should any bots follow the link to the no bots page, this shows up in the logs and I can use it to ban their internet service provider.
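
As a rough sketch of how that log check might look: the function below prints each client address that requested `/nobots`, together with a hit count. It assumes Apache's "combined" log format, where the client address is the first field and the request path is the seventh. The function name and the log path in the usage note are made up for illustration, and mapping the addresses to their autonomous systems is left out.

```shell
#!/bin/sh
# Hypothetical helper: list clients that requested /nobots.
# Assumes the "combined" log format: field 1 is the client address,
# field 7 is the request path. Adjust the field numbers for other formats.
nobots_clients() {
    awk '$7 == "/nobots" { hits[$1]++ }
         END { for (ip in hits) print hits[ip], ip }' "$1" | sort -rn
}
```

Called as `nobots_clients /var/log/apache2/access.log` (an assumed path), it prints one line per address, busiest first.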

## `/etc/apache2/gate.pw`

The password file is a standard password file generated by `htpasswd`:

```
htpasswd -c /etc/apache2/gate.pw alex
```

Watch out: The `-c` option means that the file is overwritten. Don't use it when adding more entries!

## `/etc/butlerian-jihad/close-gate`

Now we need a script that changes the content of our `gate.conf` file depending on the system load.

```
#!/bin/bash
# bash, not sh: "set -o pipefail" is not available in every sh
set -eo pipefail
# If load is too high, enable password protection for campaignwiki.org.
if test "$1" = "--help"; then
    echo "close-gate [lock|unlock]"
    echo "Without argument, the gate is locked or unlocked depending on load."
    echo "The gate locks when load is > 10."
    echo "The gate unlocks when load is < 2."
    echo "You cannot unlock the gate unless load is <= 10."
    exit
fi
FILE='/etc/apache2/gate.conf'
OPEN='Require all granted'
CLOSE='Require valid-user'
# Take the 1 min load average to see whether to close the gate
LOAD=$(cut -d' ' -f1 < /proc/loadavg)
if test 1 = "$(echo "$LOAD > 10" | bc)" -o "$1" = "lock"; then
    if grep --quiet "$OPEN" "$FILE"; then
	echo "$CLOSE" > "$FILE" \
	    && apachectl graceful \
	    && echo "$LOAD LOCKED"
    else
	echo "$LOAD REMAINS LOCKED"
    fi
    exit
fi
# Take the 5 min load average to see whether to open the gate
LOAD=$(cut -d' ' -f2 < /proc/loadavg)
if test 1 = "$(echo "$LOAD < 2" | bc)" -o "$1" = "unlock"; then
    if grep --quiet "$CLOSE" "$FILE"; then
	echo  "$OPEN" > "$FILE" \
	    && apachectl graceful \
	    && echo "$LOAD UNLOCKED"
    else
	echo "$LOAD REMAINS UNLOCKED"
    fi
    exit
fi
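# Load is between the two thresholds: leave the gate as it is
if grep --quiet "$CLOSE" "$FILE"; then
    echo "$LOAD REMAINS LOCKED FOR NOW"
else
    echo "$LOAD REMAINS UNLOCKED"
fi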
```
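
The rule in the script is a hysteresis: lock on the 1 minute load average, unlock on the 5 minute load average, and otherwise leave things alone. Here is a minimal sketch of just that decision, separated from Apache so that it can be tried safely. `gate_decision` is a hypothetical name, and it compares numbers with `awk` where the script uses `bc`.

```shell
#!/bin/sh
# Hypothetical helper, not part of the setup above.
# Usage: gate_decision LOAD1 LOAD5 [lock|unlock]
# Prints "lock", "unlock" or "keep".
gate_decision() {
    load1=$1 load5=$2 force=${3:-}
    # Lock when the 1 min load average is above 10, or when forced
    if [ "$force" = "lock" ] || awk -v l="$load1" 'BEGIN { exit !(l > 10) }'; then
        echo lock
    # Unlock when the 5 min load average is below 2, or when forced
    elif [ "$force" = "unlock" ] || awk -v l="$load5" 'BEGIN { exit !(l < 2) }'; then
        echo unlock
    else
        echo keep
    fi
}
```

Something like `gate_decision $(cut -d' ' -f1,2 < /proc/loadavg)` feeds it the current load averages. As in the script, a high 1 minute load wins over a manual `unlock`.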

## `/etc/butlerian-jihad/close-gate.service`

A systemd service unit that calls the script. Most of the file is copied from existing files, to be honest.

```
[Unit]
Description=Open or close the gate
RequiresMountsFor=/var/log
ConditionACPower=true

[Service]
Type=oneshot
ExecStart=/etc/butlerian-jihad/close-gate

# Priority has to be higher than the regular web services so that banning can still happen.
# See systemd.exec(5) for more.
Nice=9
IOSchedulingClass=best-effort
IOSchedulingPriority=3

ReadWritePaths=/etc/apache2/gate.conf

LockPersonality=true
MemoryDenyWriteExecute=true
NoNewPrivileges=true
PrivateDevices=true
PrivateNetwork=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
# Apache will verify the existence of document roots
# ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectSystem=full
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
```

## `/etc/butlerian-jihad/close-gate.timer`

A systemd timer that calls the service every 5 minutes.

```
[Unit]
Description=Open or close the gate

[Timer]
OnCalendar=*:0/5
RandomizedDelaySec=120

[Install]
WantedBy=timers.target
```

## Result

See how it's going:

```
# journalctl --unit close-gate.service --since 12:00 \
    | awk '/sibirocobombus close-gate/ { print $3, $6, $7, $8, $9, $10}'
12:10:21 0.53 REMAINS UNLOCKED  
12:30:51 0.48 REMAINS UNLOCKED  
12:46:09 0.27 REMAINS UNLOCKED  
12:51:43 0.43 REMAINS UNLOCKED  
12:55:19 0.56 REMAINS UNLOCKED  
13:01:55 0.46 REMAINS UNLOCKED  
13:06:56 0.33 REMAINS UNLOCKED  
13:11:48 0.61 REMAINS UNLOCKED  
13:16:42 0.67 REMAINS UNLOCKED  
13:20:02 0.50 REMAINS UNLOCKED  
13:26:16 1.12 REMAINS UNLOCKED  
13:30:32 1.33 REMAINS UNLOCKED  
13:36:43 1.21 REMAINS UNLOCKED  
13:41:56 0.77 REMAINS UNLOCKED  
13:46:15 0.63 REMAINS UNLOCKED  
13:51:44 0.61 REMAINS UNLOCKED  
13:55:11 0.42 REMAINS UNLOCKED  
14:01:06 0.55 REMAINS UNLOCKED  
14:06:18 0.53 REMAINS UNLOCKED  
14:11:05 0.57 REMAINS UNLOCKED  
14:15:57 0.98 REMAINS UNLOCKED  
14:20:38 1.11 REMAINS UNLOCKED  
14:25:06 15.90 LOCKED   
15:00:08 1.36 UNLOCKED   
15:06:43 1.26 REMAINS UNLOCKED  
15:10:03 1.07 REMAINS UNLOCKED  
15:15:21 24.19 LOCKED   
15:20:13 14.19 REMAINS LOCKED  
15:35:05 2.72 REMAINS LOCKED FOR NOW
15:40:40 2.66 REMAINS LOCKED FOR NOW
15:45:36 3.04 REMAINS LOCKED FOR NOW
15:50:35 2.57 REMAINS LOCKED FOR NOW
15:55:14 1.81 UNLOCKED   
16:01:29 25.10 LOCKED   
16:06:50 18.87 REMAINS LOCKED FOR NOW
16:10:08 9.93 REMAINS LOCKED FOR NOW
```
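
To get a feel for how often the gate closes, the same output can be summarised. This is a small sketch, assuming lines shaped like the ones above (time, load, status); `gate_summary` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count how often the gate was closed ("LOCKED") and
# opened ("UNLOCKED"), reading lines like "14:25:06 15.90 LOCKED" from
# standard input. "REMAINS ..." lines are ignored.
gate_summary() {
    awk '$3 == "LOCKED"   { closed++ }
         $3 == "UNLOCKED" { opened++ }
         END { printf "closed %d times, opened %d times\n", closed, opened }'
}
```

Piping the `journalctl` and `awk` command above into `gate_summary` gives a single line with both counts.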

#Administration #Butlerian Jihad