[File] brief impossible bug: magic_file(MAGIC_COMPRESS) for test.lz -> vfork() + waitpid(pid) + waitpid(0) = sleep forever
raf
file at raf.org
Mon Apr 6 02:44:34 EDT 2026
The problem is back again today.
I think I'll just have to stop using MAGIC_COMPRESS.
The alternative solution of calling magic_file() from a
new child process isn't appealing for several reasons,
one of which is the fact that there is something wierd
going on with the writepid variable whatever I do.
And it's reproducible without my program. The following
happens with both file-5.44 and file-5.46:
$ file /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz
/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz: lzip compressed data,
version: 1
$ file -z /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz
/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz: ERROR:[lzip: Wait failed, No
child processes] (lzip compressed data, version: 1)
The strace -f output is attached (no - that is sent
in the next message).
The second, spurious,
waitpid() is passed a bizarre pid again, but this time
it's a huge number. The previous real/valid pid was
522859. The invalid pid is 1141822634. I'm sure I don't
have a billion processes running on this little vm. :-)
I'll change my program to stop using MAGIC_COMPRESS,
but to allow the user to specify flags as well just in
case they want MAGIC_COMPRESS and this bug gets fixed.
I wish I could help more, but I can't see what the
problem is. Hmm, when I sprinkle debug writes to see
what writepid is, it's -1 all the time (like it should
be). It's pid that's wrong (the pid from the vfork).
It's showing up as 32716 which is too low.
OK, it's not using vfork(), it's using posix_spawnp() and the pid is an uninitialized stack variable, which
is never being set by posix_spawnp(), that's why it's
sometimes too low, and sometimes too high.
OK. The code is misinterpreting the return value of
posix_spawnp(). It expects the return value to be -1 on
error, and for errno to be set, but posix_spawnp()
doesn't work like that. The manual entry says:
Upon successful completion, posix_spawn() and
posix_spawnp() place the PID of the child process in
pid, and return 0. If there is an error during the
fork() step, then no child is created, the contents of
*pid are un‐ specified, and these functions return an
error number as described below.
It's returning 2 for me which means ENOENT (No such
file or directory). Ah, I don't have lzip(1) installed,
so that makes sense.
The attached patch will fix it. It interprets
posix_spawnp()'s return value correctly, and fails at the
right time with the right error message. Initializing
pid to -1 isn't necessary, so you could leave that out.
With the patch, the above example looks like this:
$ ./file -m ../magic/magic.mgc -z
/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz
/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz: ERROR:[lzip: Cannot
posix_spawn `lzip', No such file or directory] (lzip compressed data, version: 1)
And it eliminates the possibility of calling waitpid(0, ...)
and putting my program to sleep forwver. Yay!
Please let me know which version of libmagic will
include this fix. I want to warn my users in the
documentation not to enable MAGIC_COMPRESS unless
they have that version or later installed.
cheers,
raf
On Sun, Apr 05, 2026 at 11:59:15PM +1000, raf <file at raf.org> wrote:
> Hi,
>
> debian-12, libmagic-5.44
>
> Yesterday, I encountered a wierd bug using magic_file()
> with MAGIC_COMPRESS on a compressed file. It would go
> to sleep until I interrupted it. It only happened when
> my code had its own child process, but it happened
> almost every time that that was the case. For weeks, it
> was fine, but for 1-2 days, it would sleep forever.
>
> At first I thought I must be corrupting memory somehow
> and damaging whatever is behind the magic_t pointer,
> but ubsan, asan, msan, and valgrind found no problems.
> I also checked that the magic_t pointer itself and the
> first 30 bytes of whatever it pointed to weren't being
> overwritten, and they weren't. It looked like something
> inside libmagic was just going to sleep.
>
> Today it's back to normal and working fine. I looked at
> the code for 5.46 and it looks completely impossible
> for yesterday's behaviour to happen (assuming no relevant
> differences between 5.44 and 5.46).
>
> For compressed files, uncompressbuf() does vfork() and
> does waitpid() for it.
>
> clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7fe771fa1000, stack_size=0x9000}, 88) = 480299
> wait4(480299, NULL, 0, NULL) = 480299
>
> If fd == -1, it calls writechild() which does fork()
> and sets writepid. Later, if writepid > 0, then a second
> waitpid() happens. Strangely, the second waitpid() does
> happen, even though writepid should still be -1 rather
> than > 0, because the fork() in writechild() never
> happens.
>
> When broken, writepid was 0 which makes no sense. It
> can't be 0. It's either -1 or > 0, and the waitpid()
> only happens when writepid > 0.
>
> wait4(0, 0x7fff19477658, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
>
> Anyway, when it was 0, and my code had its own child
> process, the waitpid() in uncompressbuf() would go to
> sleep, presumably/mistakenly waiting for my code's
> child process to terminate, which it wasn't going to
> do.
>
> Today, now that it's not broken, the value of writepid
> is > 0 even though there was no writechild() process
> created. It just returns -1 because there's no such
> child process. So my program doesn't go to sleep and it
> works fine...
>
> wait4(32764, 0x7ffc45c09118, 0, NULL) = -1 ECHILD (No child processes)
>
> BUT the value of writepid is suspiciously low (ignoring
> the fact that it should actually be -1). The pid just
> before this was 519823, and pids tend to increment by 1
> each time there's a fork(), so it's odd to drop to such
> a low value as 32764. If the corresponding fork() had
> happened, writepid would probably = 519824.
>
> It's still wrong though. childpid should be -1 because
> there's no second fork(), and the second waitpid()
> shouldn't be happening at all.
>
> Maybe it's cosmic rays, or a hardware error, but maybe
> there is something wierd happening in uncompressbuf()
> that I can't make sense of.
>
> It looks like writepid is being overwritten with junk
> (even in between "if (writepid > 0)" and
> "w = waitpid(writepid, &status, 0)" on the next line.
> but none of the analyzers noticed anything wrong.
>
> I was thinking I'd have to call magic_file() from a
> separate child process so it can't make the mistake of
> waiting for my code's child process. Or maybe just stop
> using MAGIC_COMPRESS. Maybe I'll have to, just to be
> on the safe side.
>
> Can anyone make sense of the the following strace
> output files? There's one extract for non-sleepforever
> but still broken writepid value (good) and another
> extract for sleepforever behaviour because writepid != 0
> and then writepid == 0 immediately after (bad):
>
> good:
>
> newfstatat(AT_FDCWD, "/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz", {st_mode=S_IFREG|0644, st_size=36, ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(AT_FDCWD, "/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 4
> newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=36, ...}, AT_EMPTY_PATH) = 0
> read(4, "LZIP\1\f\0\203\377\373\377\377\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0$\0\0\0"..., 7340032) = 36
> rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f896af95050}, {sa_handler=0x56128af72180, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f896af95050}, 8) = 0
> brk(0x56129dab4000) = 0x56129dab4000
> write(1, "{\"path\":\"/home/linuxbrew/.linuxb"..., 3986) = 3986
> pipe2([7, 9], O_CLOEXEC) = 0
> pipe2([10, 11], O_CLOEXEC) = 0
> lseek(4, 0, SEEK_SET) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(0, F_SETFD, 0) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(1, F_SETFD, 0) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(2, F_SETFD, 0) = 0
> mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f896b21c000
> rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
> clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7f896b21c000, stack_size=0x9000}, 88) = 519823
> wait4(519823, NULL, 0, NULL) = 519823
> munmap(0x7f896b21c000, 36864) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=519823, si_uid=1000, si_status=127, si_utime=0, si_stime=0} ---
> close(9) = 0
> close(11) = 0
> read(7, "", 7340032) = 0
> read(10, "", 7340032) = 0
> close(7) = 0
> close(10) = 0
> wait4(32764, 0x7ffc45c09118, 0, NULL) = -1 ECHILD (No child processes)
> openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 7
> newfstatat(7, "", {st_mode=S_IFREG|0644, st_size=2996, ...}, AT_EMPTY_PATH) = 0
> read(7, "# Locale name alias data base.\n#"..., 4096) = 2996
> read(7, "", 4096) = 0
> close(7) = 0
> openat(AT_FDCWD, "/usr/share/locale/en_AU/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> rt_sigaction(SIGPIPE, {sa_handler=0x56128af72180, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f896af95050}, NULL, 8) = 0
> brk(0x56129ccb3000) = 0x56129ccb3000
> close(4) = 0
>
> bad:
>
> newfstatat(AT_FDCWD, "/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz", {st_mode=S_IFREG|0644, st_size=36, ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(AT_FDCWD, "/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/test/support/fixtures/test.lz", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 4
> newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=36, ...}, AT_EMPTY_PATH) = 0
> read(4, "LZIP\1\f\0\203\377\373\377\377\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0$\0\0\0"..., 7340032) = 36
> rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe771d1a050}, {sa_handler=0x55dd31ba7180, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe771d1a050}, 8) = 0
> pipe2([7, 9], O_CLOEXEC) = 0
> pipe2([10, 11], O_CLOEXEC) = 0
> lseek(4, 0, SEEK_SET) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(0, F_SETFD, 0) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(1, F_SETFD, 0) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
> fcntl(2, F_SETFD, 0) = 0
> mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fe771fa1000
> rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
> clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7fe771fa1000, stack_size=0x9000}, 88) = 480299
> wait4(480299, NULL, 0, NULL) = 480299
> munmap(0x7fe771fa1000, 36864) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=480299, si_uid=1000, si_status=127, si_utime=0, si_stime=0} ---
> close(9) = 0
> close(11) = 0
> read(7, "", 7340032) = 0
> read(10, "", 7340032) = 0
> close(7) = 0
> close(10) = 0
> wait4(0, 0x7fff19477658, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
> rt_sigprocmask(SIG_BLOCK, NULL, [INT], 8) = 0
> rt_sigprocmask(SIG_BLOCK, [INT PIPE], NULL, 8) = 0
> write(5, "-stay_open\n0\n", 13) = 13
> rt_sigprocmask(SIG_SETMASK, [INT], NULL, 8) = 0
> close(5) = 0
> close(6) = 0
> close(8) = 0
> wait4(480297, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 480297
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=480297, si_uid=1000, si_status=1, si_utime=54 /* 0.54 s */, si_stime=4 /* 0.04 s */} ---
> exit_group(2) = ?
> +++ exited with 2 +++
>
> cheers,
> raf
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file-5.46-posix_spawnp-error.patch
Type: text/x-diff
Size: 764 bytes
Desc: not available
URL: <https://mailman.astron.com/pipermail/file/attachments/20260406/a9bc3bba/attachment-0001.bin>
More information about the File
mailing list