[Tcsh] tcsh Deadlock with SIGHUP

Mon Jan 20 14:08:36 UTC 2020

(Resending what I posted last August ... let me know if there's
anything I can do to get this (or a different fix for the same issue)
into the tree.)

tcsh can deadlock with itself if savehist is configured with "merge" and
"lock", and two SIGHUPs are received in rapid succession.  The
mechanism is that the first SIGHUP triggers a rechist(), and while that
rechist() is executing (and after it has created the lock file), the
second SIGHUP triggers another rechist(), which then waits forever for
the lock that the first rechist() created to be released.  That will
never happen, because the second rechist() is running nested inside the
first (see the backtrace), so the first cannot resume until the second
returns.
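
To make the mechanism concrete, here is a toy model of it in C.  This is
not tcsh code: every name in it is hypothetical, the lock file is modeled
by a flag, and instead of polling forever the nested call just counts the
attempt that would have hung.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from tcsh. */
static bool lock_held = false;   /* models the history lock file existing */
static int  pending_hups = 0;    /* models SIGHUPs queued for later dispatch */
static int  would_deadlock = 0;  /* nested saves that would have spun forever */

static void save_history(void);

/* Models the point where deferred signals get dispatched (e.g. around I/O). */
static void dispatch_pending(void)
{
    while (pending_hups > 0) {
        pending_hups--;
        save_history();          /* the hangup handler saves history again */
    }
}

static void save_history(void)
{
    if (lock_held) {
        /* Real code would poll here until the lock goes away.  The holder
         * is our own suspended caller, so that would never happen. */
        would_deadlock++;
        return;
    }
    lock_held = true;            /* lock acquired */
    dispatch_pending();          /* writing history does I/O, which dispatches */
    lock_held = false;           /* lock released */
}

/* Returns how many nested saves would have hung for n queued hangups. */
static int simulate_hangups(int n)
{
    lock_held = false;
    would_deadlock = 0;
    pending_hups = n;
    dispatch_pending();
    return would_deadlock;
}
```

In this model simulate_hangups(1) returns 0, while simulate_hangups(2)
returns 1: the second hangup is dispatched while the first save still
holds the lock.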

A backtrace from when it's deadlocked:

#1  0x00007fe3a48f7877 in usleep (useconds=useconds@entry=100000)
    at ../sysdeps/posix/usleep.c:32
#2  0x000055c7b9368974 in dot_lock (
    fname=fname@entry=0x55c7ba174540 "/home/rbf/.history",
    pollinterval=pollinterval@entry=100) at dotlock.c:166
#3  0x000055c7b935950f in rechist (fname=0x55c7ba1e5960 L"/home/rbf/.history",
    ref=<optimized out>) at sh.hist.c:1293
#4  0x000055c7b9344cc0 in record () at sh.c:2512
#5  0x000055c7b9346b29 in phup () at sh.c:1842
#6  0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
#7  0x000055c7b935ec53 in xwrite (fildes=3,
    buf=buf@entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
#8  0x000055c7b9360104 in flush () at sh.print.c:260
#9  0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
    sfmt=sfmt@entry=0x55c7b938d0ad "%S", ap=ap@entry=0x7ffc4fb9cd60)
    at tc.printf.c:294
#10 0x000055c7b9387823 in xprintf (fmt=fmt@entry=0x55c7b938d0ad "%S")
    at tc.printf.c:392
#11 0x000055c7b935b392 in prlex (sp0=sp0@entry=0x55c7ba17efc0)
    at sh.lex.c:228
#12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
    at sh.hist.c:1071
#13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
#14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
#15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0 L"/home/rbf/.history",
    ref=<optimized out>) at sh.hist.c:1322
#16 0x000055c7b9344cc0 in record () at sh.c:2512
#17 0x000055c7b9346b29 in phup () at sh.c:1842
#18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
#19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf@entry=0x7ffc4fb9e030,
    nbyte=nbyte@entry=1) at sh.misc.c:662

This patch (which is to 6.20.00 ... but there doesn't appear to be
anything in 6.21.00 or 6.22.00 which would address this, so I'm
reasonably confident the problem exists there as well) fixes the
problem for me.  It disables processing of pending SIGHUPs at the start
of rechist() (and then restores the previous setting on completion).

(It does this even if savehist isn't configured with lock; so it avoids
starting a second write while the first one is in progress even in
cases where it won't deadlock.)

(I did consider just having handle_pending_signals not redispatch
phup() if one was already running, but it looks like the same deadlock
could occur if a single SIGHUP arrived while the shell was saving the
history for other reasons, although I haven't produced (or tried to
produce) that behavior.)
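
The save-and-restore idea described above can be sketched in isolation
like this.  Again this is a toy model and not tcsh code: hup_disabled
only stands in for the phup_disabled flag, how the real dispatcher
consults that flag is an assumption here, and all other names are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; hup_disabled models tcsh's phup_disabled flag. */
static bool hup_disabled = false;
static bool lock_held = false;   /* models the history lock file existing */
static int  pending_hups = 0;    /* models SIGHUPs queued for later dispatch */
static int  would_deadlock = 0;  /* nested saves that would have spun forever */
static int  saves_completed = 0; /* saves that ran to completion */

static void save_history_guarded(void);

/* Assumed dispatcher behavior: leave hangups queued while disabled. */
static void dispatch_pending(void)
{
    while (pending_hups > 0 && !hup_disabled) {
        pending_hups--;
        save_history_guarded();
    }
}

static void save_history_guarded(void)
{
    bool saved = hup_disabled;   /* remember the caller's setting */
    hup_disabled = true;         /* defer hangups for the whole save */

    if (lock_held) {             /* would poll forever in real code */
        would_deadlock++;
        hup_disabled = saved;    /* restore on every exit path */
        return;
    }
    lock_held = true;
    dispatch_pending();          /* does nothing while hup_disabled is set */
    lock_held = false;
    saves_completed++;

    hup_disabled = saved;        /* restore, do not blindly clear */
}

/* Returns how many nested saves would have hung for n queued hangups. */
static int simulate_guarded(int n)
{
    hup_disabled = false;
    lock_held = false;
    would_deadlock = 0;
    saves_completed = 0;
    pending_hups = n;
    dispatch_pending();
    return would_deadlock;
}
```

In this model simulate_guarded(2) returns 0 and leaves saves_completed
at 2: the second hangup stays queued until the first save has released
the lock, and then runs on its own.  Restoring the saved value rather
than clearing the flag keeps the guard correct if the caller had already
disabled hangups.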

--- tcsh-6.20.00.orig/sh.hist.c
+++ tcsh-6.20.00/sh.hist.c
@@ -1223,7 +1223,7 @@ void
 rechist(Char *fname, int ref)
 {
     Char    *snum, *rs;
-    int     fp, ftmp, oldidfds;
+    int     fp, ftmp, oldidfds, phup_disabled_tmp;
     struct varent *shist;
     char path[MAXPATHLEN];
     struct stat st;
@@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
 
     if (fname == NULL && !ref) 
        return;
+
+       phup_disabled_tmp = phup_disabled;
+       phup_disabled = 1;
+
     /*
      * If $savehist is just set, we use the value of $history
      * else we use the value in $savehist
@@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
     if (fp == -1) {
        didfds = oldidfds;
        cleanup_until(fname);
+       phup_disabled = phup_disabled_tmp;
        return;
     }
     /* Try to preserve ownership and permissions of the original history file */
@@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
     didfds = oldidfds;
     (void)rename(path, short2str(fname));
     cleanup_until(fname);
+    phup_disabled = phup_disabled_tmp;
 }

(As background, below is where and how I found this.)

For me, this is occurring on Linux; and on systemd systems it's easy to
recreate -- when systemd attempts to terminate a session, the shell
often ends up getting two SIGHUPs in rapid succession (my assumption is
that one is directly from systemd and another is a result of the parent
sshd terminating).  I can get it to happen about 50% of the time with
"systemctl stop session-XX.scope" when that session is an ssh
connection that has a tcsh shell configured with "savehist = ( XXX
merge lock )".  (I have a history of about 200 lines to write out.)

Obviously, it's racy ... sometimes the second SIGHUP is early enough,
or late enough, to avoid the problem.

With the above fix, I can no longer reproduce the deadlock.

     -- Brett