[Tcsh] tcsh Deadlock with SIGHUP

Christos Zoulas christos at zoulas.com
Thu Apr 9 18:40:57 UTC 2020


All that is fine, but who sends the SIGHUP? And if we ignore the SIGHUP,
that someone might send SIGKILL and then what?

christos

> On Apr 7, 2020, at 9:47 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
> 
> Your fix, which was different from mine, does successfully prevent
> deadlocks.  However, I have discovered a similar issue: If the STDIN
> for the shell is closed (in my particular reproduction, the read() in
> which the shell is waiting for input is returning EIO), and then a
> SIGHUP is received shortly thereafter, there is a race condition that,
> while it cannot deadlock, and result in .history_lock being left in
> $HOME, causing hangs when future shells exit.  (I am seeing this with
> shutdowns of a Ubuntu 20.04 (beta release) server; the timing there
> appears to be different than on my other servers.)
> 
> What appears to be happening is that the EIO is triggering the shell to
> exit, and record() (after the done: label in main()) is being called.
>> From within record(), recdirs() is called, and then .history_lock is
> created and the history is written.  While the history is being
> written, the SIGHUP arrives and phup() ends up getting invoked.
> phup()'s call to record() will not deadlock because the (again) fix
> will cause record to immediately return; but ultimately the shell will
> exit without finishing the original recdirs().
> 
> So the history will not have been completely written (the .history.XXX
> temporary file will remain) and .history_lock will still exist.  So (a)
> the history is lost, and (b) the remaining .history_lock will hang
> future shells that exit.
> 
> Separately, although I haven't experienced it, I also think the original
> deadlock could occur if a "history -S" was in progress when a SIGHUP
> was received.  ("history -S" doesn't call record(), so rechist() would
> be entered once as a result of "history -S" and again as a result of
> phup().  The second entry will deadlock with the first one.)
> 
> And, finally, I think all of the issues described above can occur with
> savedirs (although the symptoms are different since recdirs() doesn't
> appear to use a lockfile or an intermediate temporary file), although I
> don't personally use that feature and so have not experienced any
> problems there.
> 
> Below is the patch I have applied to my tcsh.  It includes three
> things:
> 
> (1) Changes to sh.c to include temporarily disabling phup processing
> while in record().  This is intended to fix the issue with a SIGHUP
> being received while in record() while the shell is exiting for some
> other reason, resulting in the original rechist() (and/or recdirs())
> not completing and leaving behind a lock file (in the case of rechist)
> or a partially written dirs file (in the case of recdirs)).
> 
> (2) Reapplies my original fix (from below) to sh.hist.c to prevent a
> deadlock when a SIGHUP is received while processing "history -S".
> 
> (3) A similar fix (for a similar reason -- although it won't deadlock
> since there's no lock file option) to sh.dirs.c.
> 
> That's probably not the only way to fix things.  And (1) really doesn't
> accomplish much if (2) and (3) are implemented.  (And instead of my
> patch to sh.c, I wonder if phup processing should just be disabled once
> done: is reached in main() ... it looks like the code attempts to
> cleanup any pending SIGHUPs there, but doesn't really account for the
> race condition if a SIGHUP is received right after
> handle_pending_signals() returns).
> 
> Anyway, lots of options, but would be nice to get at least (1) or (2)
> implemented, or something else that fixes the same issue, to prevent
> the problem I am actually seeing on Ubuntu 20.04.
> 
> Thanks ...
> 
>     -- Brett
> 
> ========= PATCHES ============
> 
> 
> Index: tcsh-6.22.02/sh.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.c
> +++ tcsh-6.22.02/sh.c
> @@ -2521,14 +2521,21 @@ static void
> record(void)
> {
>     static int again = 0;
> +    int phup_disabled_tmp;
> +
>     if (again++)
>        return;
> 
> +   phup_disabled_tmp = phup_disabled;
> +   phup_disabled = 1;
> +
>     if (!fast) {
>        recdirs(NULL, adrof(STRsavedirs) != NULL);
>        rechist(NULL, adrof(STRsavehist) != NULL);
>     }
>     displayHistStats("Exiting");       /* no-op unless DEBUG_HIST */
> +
> +   phup_disabled = phup_disabled_tmp;
> }
> 
> /*
> 
> Index: tcsh-6.22.02/sh.hist.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.hist.c
> +++ tcsh-6.22.02/sh.hist.c
> @@ -1219,7 +1219,7 @@ void
> rechist(Char *fname, int ref)
> {
>     Char    *snum, *rs;
> -    int     fp, ftmp, oldidfds;
> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>     struct varent *shist;
>     char path[MAXPATHLEN];
>     struct stat st;
> @@ -1227,6 +1227,10 @@ rechist(Char *fname, int ref)
> 
>     if (fname == NULL && !ref)
>        return;
> +
> +   phup_disabled_tmp = phup_disabled;
> +   phup_disabled = 1;
> +
>     /*
>      * If $savehist is just set, we use the value of $history
>      * else we use the value in $savehist
> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>     if (fp == -1) {
>        didfds = oldidfds;
>        cleanup_until(fname);
> +   phup_disabled = phup_disabled_tmp;
>        return;
>     }
>     /* Try to preserve ownership and permissions of the original history file */
> @@ -1329,6 +1334,7 @@ rechist(Char *fname, int ref)
>     (void)ReplaceFile( short2str(fname),path,NULL,0,NULL,NULL);
> #endif
>     cleanup_until(fname);
> +    phup_disabled = phup_disabled_tmp;
> }
> 
> Index: tcsh-6.22.02/sh.dir.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.dir.c
> +++ tcsh-6.22.02/sh.dir.c
> @@ -1356,7 +1356,7 @@ loaddirs(Char *fname)
> void
> recdirs(Char *fname, int def)
> {
> -    int     fp, ftmp, oldidfds;
> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>     int     cdflag = 0;
>     struct directory *dp;
>     unsigned int    num;
> @@ -1366,6 +1366,9 @@ recdirs(Char *fname, int def)
>     if (fname == NULL && !def)
>        return;
> 
> +   phup_disabled_tmp = phup_disabled;
> +   phup_disabled = 1;
> +
>     if (fname == NULL) {
>        if ((fname = varval(STRdirsfile)) == STRNULL)
>            fname = Strspl(varval(STRhome), &STRtildotdirs[1]);
> @@ -1378,6 +1381,7 @@ recdirs(Char *fname, int def)
> 
>     if ((fp = xcreat(short2str(fname), 0600)) == -1) {
>        cleanup_until(fname);
> +       phup_disabled = phup_disabled_tmp;
>        return;
>     }
> 
> @@ -1413,4 +1417,5 @@ recdirs(Char *fname, int def)
>     SHOUT = ftmp;
>     didfds = oldidfds;
>     cleanup_until(fname);
> +    phup_disabled = phup_disabled_tmp;
> }
> 
> 
> ======== PREVIOUS MESSAGES ==========
> 
> 
> On Mon, Jan 20, 2020 at 11:40:49AM -0500, Christos Zoulas wrote:
>> Thanks, it should be fixed now.
>> 
>> christos
>> 
>>> On Jan 20, 2020, at 9:08 AM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>> 
>>> (Resending what I posted last August ... let me know if there's
>>> anything I can do to get this (or a differnet fix for the same issue)
>>> into the tree.)
>>> 
>>> tcsh can deadlock with itself if savehist is confgured with "merge" and
>>> "lock", and two SIGHUPs are received in rapid succession.  The
>>> mechanism of the deadlock is the first SIGHUP triggers a rechist() and
>>> while that rechist() is executing (and after it has created the lock
>>> file), the second SIGHUP triggers a another rechist() which then waits
>>> forever for the lock the the first rechist() created to be released
>>> (which will never happen).
>>> 
>>> A backtrace from when it's deadlocked:
>>> 
>>> #1  0x00007fe3a48f7877 in usleep (useconds=useconds at entry=100000)
>>>   at ../sysdeps/posix/usleep.c:32
>>> #2  0x000055c7b9368974 in dot_lock (
>>>   fname=fname at entry=0x55c7ba174540 "/home/rbf/.history",
>>>   pollinterval=pollinterval at entry=100) at dotlock.c:166
>>> #3  0x000055c7b935950f in rechist (fname=0x55c7ba1e5960L"/home/rbf/.history",
>>>   ref=<optimized out>) at sh.hist.c:1293
>>> #4  0x000055c7b9344cc0 in record () at sh.c:2512
>>> #5  0x000055c7b9346b29 in phup () at sh.c:1842
>>> #6  0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>> #7  0x000055c7b935ec53 in xwrite (fildes=3,
>>>   buf=buf at entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
>>> #8  0x000055c7b9360104 in flush () at sh.print.c:260
>>> #9  0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
>>>   sfmt=sfmt at entry=0x55c7b938d0ad "%S", ap=ap at entry=0x7ffc4fb9cd60)
>>>   at tc.printf.c:294
>>> #10 0x000055c7b9387823 in xprintf (fmt=fmt at entry=0x55c7b938d0ad "%S")
>>>   at tc.printf.c:392
>>> #11 0x000055c7b935b392 in prlex (sp0=sp0 at entry=0x55c7ba17efc0) at
>>> sh.lex.c:228
>>> #12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
>>>   at sh.hist.c:1071
>>> #13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
>>> #14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
>>> #15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0
>>> #L"/home/rbf/.history",
>>>   ref=<optimized out>) at sh.hist.c:1322
>>> #16 0x000055c7b9344cc0 in record () at sh.c:2512
>>> #17 0x000055c7b9346b29 in phup () at sh.c:1842
>>> #18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>> #19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf at entry=0x7ffc4fb9e030,
>>>   nbyte=nbyte at entry=1) at sh.misc.c:662
>>> 
>>> This patch (which is to 6.20.00 ... but there's doesn't appear to be
>>> anything in 6.21.00 or 6.22.00 which would address this, so I'm
>>> reasonably confident the problem exists there as well) fixes the
>>> problem for me.  It disables processing of pending SIGHUPs at the start
>>> of rechist (and then restores on completion).
>>> 
>>> (It does this even if savehist isn't configured with lock; so it avoids
>>> starting a second write while the first one is in progress even in
>>> cases where it won't deadlock.)
>>> 
>>> (I did consider just having handle_pending_signals not redispatch
>>> phup() if one was already running, but it looks like the same deadlock
>>> could occur if a single SIGHUP arrived while the shell was saving the
>>> history for other reasons, although I haven't produced (or tried to
>>> produce) that behavior.)
>>> 
>>> --- tcsh-6.20.00.orig/sh.hist.c
>>> +++ tcsh-6.20.00/sh.hist.c
>>> @@ -1223,7 +1223,7 @@ void
>>> rechist(Char *fname, int ref)
>>> {
>>>    Char    *snum, *rs;
>>> -    int     fp, ftmp, oldidfds;
>>> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>>>    struct varent *shist;
>>>    char path[MAXPATHLEN];
>>>    struct stat st;
>>> @@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
>>> 
>>>    if (fname == NULL && !ref)
>>>       return;
>>> +
>>> +       phup_disabled_tmp = phup_disabled;
>>> +       phup_disabled = 1;
>>> +
>>>    /*
>>>     * If $savehist is just set, we use the value of $history
>>>     * else we use the value in $savehist
>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>>    if (fp == -1) {
>>>       didfds = oldidfds;
>>>       cleanup_until(fname);
>>> +       phup_disabled = phup_disabled_tmp;
>>>       return;
>>>    }
>>>    /* Try to preserve ownership and permissions of the original history file */
>>> @@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
>>>    didfds = oldidfds;
>>>    (void)rename(path, short2str(fname));
>>>    cleanup_until(fname);
>>> +    phup_disabled = phup_disabled_tmp;
>>> }
>>> 
>>> (As background, below is where/how I found this)
>>> 
>>> For me, this is occurring on Linux; and on systemd systems it's easy to
>>> recreate -- when systemd attempts to terminate a session, the shell
>>> often ends up getting two SIGHUPs in rapid succession (my assumption is
>>> that one is directly from systemd and another is a result of the parent
>>> sshd terminating).  I get get it to happen about 50% of the time with
>>> "systemctl stop session-XX.scope" when that session is an ssh
>>> connection that has a tcsh shell configured with "savehist = ( XXX
>>> merge lock )".  (I have a history of about 200 lines to write out.)
>>> 
>>> Obviously, it's racy ... sometimes the second SIGHUP is early enough,
>>> or late enough, to avoid the problem.
>>> 
>>> With the above fix, I can not reproduce the deadlock.
>>> 
>>>    -- Brett
>>> --
>>> Tcsh mailing list
>>> Tcsh at astron.com
>>> https://mailman.astron.com/mailman/listinfo/tcsh
>> 
> 
> 
> 
>> --
>> Tcsh mailing list
>> Tcsh at astron.com
>> https://mailman.astron.com/mailman/listinfo/tcsh

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 235 bytes
Desc: Message signed with OpenPGP
URL: <https://mailman.astron.com/pipermail/tcsh/attachments/20200409/47f8c24c/attachment.asc>


More information about the Tcsh mailing list