[Tcsh] tcsh Deadlock with SIGHUP

Thu Apr 9 20:52:07 UTC 2020

On Thu, Apr 09, 2020 at 02:40:57PM -0400, Christos Zoulas wrote:
> All that is fine, but who sends the SIGHUP? And if we ignore the SIGHUP,
> that someone might send SIGKILL and then what?

I assume systemd is sending the SIGHUP.  Nothing I've done or commented
on below will cause the SIGHUP to be ignored -- my code just defers
processing it once history writing starts until it is completed (note
that even without my change -- there's still some latency to handling a
SIGHUP because handle_pending_signals() is only called so often).  I
did mention possibly just permanently ignoring SIGUP once committed to
the "goodbye" path (done: label in main()) -- once tcsh is in that
path, it's committed to exiting, and that path (at first glance) isn't
really any much slower (if at all slower) at exiting than the phup()
path is.

I never tracked down the exact sources -- I assumed in my original
deadlock, one SIGHUP came from systemd and one came from the kernel as
a result of SSHD getting a sIGHUP and exiting -- and in the current
race I am seeing on ubuntu 20.04, I assume the -EIO is a result of sshd
closing the pty on its end, and the SIGHUP is coming from systemd.  But
I don't konw either of those things for sure -- just that when I
shutdown the system (or stop a session via systemctl) the dual SIGHUP
occurs on my older machines and that the-EIO/SIGHUP race occurs
(leaving a .history_lock) on my 20.04 machines.

You're right that systemd will eventually send a stronger signal if the
shell sticks around, but it will wait before doing so.  Unless there
are other issues on the system, the exit (from the first SIGHUP and/or
the -EIO on the read) will complete before systemd gets tired of
waiting and sents a SIGKILL.  (Ultimately, back when I was having the
deadlock, a SIGKILL did get sent, but systemd was waiting 90 seconds to
do that.  Absent a deadlock, tcsh is existing sub-second, of course.)

Obviosuly, the SIGKILL risk cannot be mitigated -- if it comes in while
.history_lock exists, the file is going to remain -- but I haven't
experienced any conditions in practice where a SIGKILL immediately
follows a SIGHUP or an EOF (or -EIO) on STDIN.  But there are real
world cases where a SIGHUP quickly follows another SIGHUP and/or
quickly follows EOF/-EIO on STDIN.

(The "history -S" followed by "SIGHUP" potential deadlock, on the other
hand, would just be badluck; I wouldn't expect non-contrived
reproducible cases of that in practice ... I only fixed that for
philosophical reasons -- because I knew it was a race that could be
lost.)

     -- Brett

> christos
> 
> > On Apr 7, 2020, at 9:47 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
> > 
> > Your fix, which was different from mine, does successfully prevent
> > deadlocks.  However, I have discovered a similar issue: If the STDIN
> > for the shell is closed (in my particular reproduction, the read() in
> > which the shell is waiting for input is returning EIO), and then a
> > SIGHUP is received shortly thereafter, there is a race condition that,
> > while it cannot deadlock, and result in .history_lock being left in
> > $HOME, causing hangs when future shells exit.  (I am seeing this with
> > shutdowns of a Ubuntu 20.04 (beta release) server; the timing there
> > appears to be different than on my other servers.)
> > 
> > What appears to be happening is that the EIO is triggering the shell to
> > exit, and record() (after the done: label in main()) is being called.
> >> From within record(), recdirs() is called, and then .history_lock is
> > created and the history is written.  While the history is being
> > written, the SIGHUP arrives and phup() ends up getting invoked.
> > phup()'s call to record() will not deadlock because the (again) fix
> > will cause record to immediately return; but ultimately the shell will
> > exit without finishing the original recdirs().
> > 
> > So the history will not have been completely written (the .history.XXX
> > temporary file will remain) and .history_lock will still exist.  So (a)
> > the history is lost, and (b) the remaining .history_lock will hang
> > future shells that exit.
> > 
> > Separately, although I haven't experienced it, I also think the original
> > deadlock could occur if a "history -S" was in progress when a SIGHUP
> > was received.  ("history -S" doesn't call record(), so rechist() would
> > be entered once as a result of "history -S" and again as a result of
> > phup().  The second entry will deadlock with the first one.)
> > 
> > And, finally, I think all of the issues described above can occur with
> > savedirs (although the symptoms are different since recdirs() doesn't
> > appear to use a lockfile or an intermediate temporary file), although I
> > don't personally use that feature and so have not experienced any
> > problems there.
> > 
> > Below is the patch I have applied to my tcsh.  It includes three
> > things:
> > 
> > (1) Changes to sh.c to include temporarily disabling phup processing
> > while in record().  This is intended to fix the issue with a SIGHUP
> > being received while in record() while the shell is exiting for some
> > other reason, resulting in the original rechist() (and/or recdirs())
> > not completing and leaving behind a lock file (in the case of rechist)
> > or a partially written dirs file (in the case of recdirs)).
> > 
> > (2) Reapplies my original fix (from below) to sh.hist.c to prevent a
> > deadlock when a SIGHUP is received while processing "history -S".
> > 
> > (3) A similar fix (for a similar reason -- although it won't deadlock
> > since there's no lock file option) to sh.dirs.c.
> > 
> > That's probably not the only way to fix things.  And (1) really doesn't
> > accomplish much if (2) and (3) are implemented.  (And instead of my
> > patch to sh.c, I wonder if phup processing should just be disabled once
> > done: is reached in main() ... it looks like the code attempts to
> > cleanup any pending SIGHUPs there, but doesn't really account for the
> > race condition if a SIGHUP is received right after
> > handle_pending_signals() returns).
> > 
> > Anyway, lots of options, but would be nice to get at least (1) or (2)
> > implemented, or something else that fixes the same issue, to prevent
> > the problem I am actually seeing on Ubuntu 20.04.
> > 
> > Thanks ...
> > 
> >     -- Brett
> > 
> > ========= PATCHES ============
> > 
> > 
> > Index: tcsh-6.22.02/sh.c
> > ===================================================================
> > --- tcsh-6.22.02.orig/sh.c
> > +++ tcsh-6.22.02/sh.c
> > @@ -2521,14 +2521,21 @@ static void
> > record(void)
> > {
> >     static int again = 0;
> > +    int phup_disabled_tmp;
> > +
> >     if (again++)
> >        return;
> > 
> > +   phup_disabled_tmp = phup_disabled;
> > +   phup_disabled = 1;
> > +
> >     if (!fast) {
> >        recdirs(NULL, adrof(STRsavedirs) != NULL);
> >        rechist(NULL, adrof(STRsavehist) != NULL);
> >     }
> >     displayHistStats("Exiting");       /* no-op unless DEBUG_HIST */
> > +
> > +   phup_disabled = phup_disabled_tmp;
> > }
> > 
> > /*
> > 
> > Index: tcsh-6.22.02/sh.hist.c
> > ===================================================================
> > --- tcsh-6.22.02.orig/sh.hist.c
> > +++ tcsh-6.22.02/sh.hist.c
> > @@ -1219,7 +1219,7 @@ void
> > rechist(Char *fname, int ref)
> > {
> >     Char    *snum, *rs;
> > -    int     fp, ftmp, oldidfds;
> > +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
> >     struct varent *shist;
> >     char path[MAXPATHLEN];
> >     struct stat st;
> > @@ -1227,6 +1227,10 @@ rechist(Char *fname, int ref)
> > 
> >     if (fname == NULL && !ref)
> >        return;
> > +
> > +   phup_disabled_tmp = phup_disabled;
> > +   phup_disabled = 1;
> > +
> >     /*
> >      * If $savehist is just set, we use the value of $history
> >      * else we use the value in $savehist
> > @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
> >     if (fp == -1) {
> >        didfds = oldidfds;
> >        cleanup_until(fname);
> > +   phup_disabled = phup_disabled_tmp;
> >        return;
> >     }
> >     /* Try to preserve ownership and permissions of the original history file */
> > @@ -1329,6 +1334,7 @@ rechist(Char *fname, int ref)
> >     (void)ReplaceFile( short2str(fname),path,NULL,0,NULL,NULL);
> > #endif
> >     cleanup_until(fname);
> > +    phup_disabled = phup_disabled_tmp;
> > }
> > 
> > Index: tcsh-6.22.02/sh.dir.c
> > ===================================================================
> > --- tcsh-6.22.02.orig/sh.dir.c
> > +++ tcsh-6.22.02/sh.dir.c
> > @@ -1356,7 +1356,7 @@ loaddirs(Char *fname)
> > void
> > recdirs(Char *fname, int def)
> > {
> > -    int     fp, ftmp, oldidfds;
> > +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
> >     int     cdflag = 0;
> >     struct directory *dp;
> >     unsigned int    num;
> > @@ -1366,6 +1366,9 @@ recdirs(Char *fname, int def)
> >     if (fname == NULL && !def)
> >        return;
> > 
> > +   phup_disabled_tmp = phup_disabled;
> > +   phup_disabled = 1;
> > +
> >     if (fname == NULL) {
> >        if ((fname = varval(STRdirsfile)) == STRNULL)
> >            fname = Strspl(varval(STRhome), &STRtildotdirs[1]);
> > @@ -1378,6 +1381,7 @@ recdirs(Char *fname, int def)
> > 
> >     if ((fp = xcreat(short2str(fname), 0600)) == -1) {
> >        cleanup_until(fname);
> > +       phup_disabled = phup_disabled_tmp;
> >        return;
> >     }
> > 
> > @@ -1413,4 +1417,5 @@ recdirs(Char *fname, int def)
> >     SHOUT = ftmp;
> >     didfds = oldidfds;
> >     cleanup_until(fname);
> > +    phup_disabled = phup_disabled_tmp;
> > }
> > 
> > 
> > ======== PREVIOUS MESSAGES ==========
> > 
> > 
> > On Mon, Jan 20, 2020 at 11:40:49AM -0500, Christos Zoulas wrote:
> >> Thanks, it should be fixed now.
> >> 
> >> christos
> >> 
> >>> On Jan 20, 2020, at 9:08 AM, Brett Frankenberger <rbf at rbfnet.com> wrote:
> >>> 
> >>> (Resending what I posted last August ... let me know if there's
> >>> anything I can do to get this (or a differnet fix for the same issue)
> >>> into the tree.)
> >>> 
> >>> tcsh can deadlock with itself if savehist is confgured with "merge" and
> >>> "lock", and two SIGHUPs are received in rapid succession.  The
> >>> mechanism of the deadlock is the first SIGHUP triggers a rechist() and
> >>> while that rechist() is executing (and after it has created the lock
> >>> file), the second SIGHUP triggers a another rechist() which then waits
> >>> forever for the lock the the first rechist() created to be released
> >>> (which will never happen).
> >>> 
> >>> A backtrace from when it's deadlocked:
> >>> 
> >>> #1  0x00007fe3a48f7877 in usleep (useconds=useconds at entry=100000)
> >>>   at ../sysdeps/posix/usleep.c:32
> >>> #2  0x000055c7b9368974 in dot_lock (
> >>>   fname=fname at entry=0x55c7ba174540 "/home/rbf/.history",
> >>>   pollinterval=pollinterval at entry=100) at dotlock.c:166
> >>> #3  0x000055c7b935950f in rechist (fname=0x55c7ba1e5960L"/home/rbf/.history",
> >>>   ref=<optimized out>) at sh.hist.c:1293
> >>> #4  0x000055c7b9344cc0 in record () at sh.c:2512
> >>> #5  0x000055c7b9346b29 in phup () at sh.c:1842
> >>> #6  0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
> >>> #7  0x000055c7b935ec53 in xwrite (fildes=3,
> >>>   buf=buf at entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
> >>> #8  0x000055c7b9360104 in flush () at sh.print.c:260
> >>> #9  0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
> >>>   sfmt=sfmt at entry=0x55c7b938d0ad "%S", ap=ap at entry=0x7ffc4fb9cd60)
> >>>   at tc.printf.c:294
> >>> #10 0x000055c7b9387823 in xprintf (fmt=fmt at entry=0x55c7b938d0ad "%S")
> >>>   at tc.printf.c:392
> >>> #11 0x000055c7b935b392 in prlex (sp0=sp0 at entry=0x55c7ba17efc0) at
> >>> sh.lex.c:228
> >>> #12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
> >>>   at sh.hist.c:1071
> >>> #13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
> >>> #14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
> >>> #15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0
> >>> #L"/home/rbf/.history",
> >>>   ref=<optimized out>) at sh.hist.c:1322
> >>> #16 0x000055c7b9344cc0 in record () at sh.c:2512
> >>> #17 0x000055c7b9346b29 in phup () at sh.c:1842
> >>> #18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
> >>> #19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf at entry=0x7ffc4fb9e030,
> >>>   nbyte=nbyte at entry=1) at sh.misc.c:662
> >>> 
> >>> This patch (which is to 6.20.00 ... but there's doesn't appear to be
> >>> anything in 6.21.00 or 6.22.00 which would address this, so I'm
> >>> reasonably confident the problem exists there as well) fixes the
> >>> problem for me.  It disables processing of pending SIGHUPs at the start
> >>> of rechist (and then restores on completion).
> >>> 
> >>> (It does this even if savehist isn't configured with lock; so it avoids
> >>> starting a second write while the first one is in progress even in
> >>> cases where it won't deadlock.)
> >>> 
> >>> (I did consider just having handle_pending_signals not redispatch
> >>> phup() if one was already running, but it looks like the same deadlock
> >>> could occur if a single SIGHUP arrived while the shell was saving the
> >>> history for other reasons, although I haven't produced (or tried to
> >>> produce) that behavior.)
> >>> 
> >>> --- tcsh-6.20.00.orig/sh.hist.c
> >>> +++ tcsh-6.20.00/sh.hist.c
> >>> @@ -1223,7 +1223,7 @@ void
> >>> rechist(Char *fname, int ref)
> >>> {
> >>>    Char    *snum, *rs;
> >>> -    int     fp, ftmp, oldidfds;
> >>> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
> >>>    struct varent *shist;
> >>>    char path[MAXPATHLEN];
> >>>    struct stat st;
> >>> @@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
> >>> 
> >>>    if (fname == NULL && !ref)
> >>>       return;
> >>> +
> >>> +       phup_disabled_tmp = phup_disabled;
> >>> +       phup_disabled = 1;
> >>> +
> >>>    /*
> >>>     * If $savehist is just set, we use the value of $history
> >>>     * else we use the value in $savehist
> >>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
> >>>    if (fp == -1) {
> >>>       didfds = oldidfds;
> >>>       cleanup_until(fname);
> >>> +       phup_disabled = phup_disabled_tmp;
> >>>       return;
> >>>    }
> >>>    /* Try to preserve ownership and permissions of the original history file */
> >>> @@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
> >>>    didfds = oldidfds;
> >>>    (void)rename(path, short2str(fname));
> >>>    cleanup_until(fname);
> >>> +    phup_disabled = phup_disabled_tmp;
> >>> }
> >>> 
> >>> (As background, below is where/how I found this)
> >>> 
> >>> For me, this is occurring on Linux; and on systemd systems it's easy to
> >>> recreate -- when systemd attempts to terminate a session, the shell
> >>> often ends up getting two SIGHUPs in rapid succession (my assumption is
> >>> that one is directly from systemd and another is a result of the parent
> >>> sshd terminating).  I get get it to happen about 50% of the time with
> >>> "systemctl stop session-XX.scope" when that session is an ssh
> >>> connection that has a tcsh shell configured with "savehist = ( XXX
> >>> merge lock )".  (I have a history of about 200 lines to write out.)
> >>> 
> >>> Obviously, it's racy ... sometimes the second SIGHUP is early enough,
> >>> or late enough, to avoid the problem.
> >>> 
> >>> With the above fix, I can not reproduce the deadlock.
> >>> 
> >>>    -- Brett
> >>> --
> >>> Tcsh mailing list
> >>> Tcsh at astron.com
> >>> https://mailman.astron.com/mailman/listinfo/tcsh
> >> 
> > 
> > 
> > 
> >> --
> >> Tcsh mailing list
> >> Tcsh at astron.com
> >> https://mailman.astron.com/mailman/listinfo/tcsh
>