[Tcsh] tcsh Deadlock with SIGHUP

Sat Apr 18 16:55:21 UTC 2020

Committed, sorry for the delay!

christos

> On Apr 9, 2020, at 4:52 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
> 
> On Thu, Apr 09, 2020 at 02:40:57PM -0400, Christos Zoulas wrote:
>> All that is fine, but who sends the SIGHUP? And if we ignore the SIGHUP,
>> that someone might send SIGKILL and then what?
> 
> I assume systemd is sending the SIGHUP.  Nothing I've done or commented
> on below will cause the SIGHUP to be ignored -- my code just defers
> processing it once history writing starts until it is completed (note
> that even without my change -- there's still some latency to handling a
> SIGHUP because handle_pending_signals() is only called so often).  I
> did mention possibly just permanently ignoring SIGUP once committed to
> the "goodbye" path (done: label in main()) -- once tcsh is in that
> path, it's committed to exiting, and that path (at first glance) isn't
> really any much slower (if at all slower) at exiting than the phup()
> path is.
> 
> I never tracked down the exact sources -- I assumed in my original
> deadlock, one SIGHUP came from systemd and one came from the kernel as
> a result of SSHD getting a sIGHUP and exiting -- and in the current
> race I am seeing on ubuntu 20.04, I assume the -EIO is a result of sshd
> closing the pty on its end, and the SIGHUP is coming from systemd.  But
> I don't konw either of those things for sure -- just that when I
> shutdown the system (or stop a session via systemctl) the dual SIGHUP
> occurs on my older machines and that the-EIO/SIGHUP race occurs
> (leaving a .history_lock) on my 20.04 machines.
> 
> You're right that systemd will eventually send a stronger signal if the
> shell sticks around, but it will wait before doing so.  Unless there
> are other issues on the system, the exit (from the first SIGHUP and/or
> the -EIO on the read) will complete before systemd gets tired of
> waiting and sents a SIGKILL.  (Ultimately, back when I was having the
> deadlock, a SIGKILL did get sent, but systemd was waiting 90 seconds to
> do that.  Absent a deadlock, tcsh is existing sub-second, of course.)
> 
> Obviosuly, the SIGKILL risk cannot be mitigated -- if it comes in while
> .history_lock exists, the file is going to remain -- but I haven't
> experienced any conditions in practice where a SIGKILL immediately
> follows a SIGHUP or an EOF (or -EIO) on STDIN.  But there are real
> world cases where a SIGHUP quickly follows another SIGHUP and/or
> quickly follows EOF/-EIO on STDIN.
> 
> (The "history -S" followed by "SIGHUP" potential deadlock, on the other
> hand, would just be badluck; I wouldn't expect non-contrived
> reproducible cases of that in practice ... I only fixed that for
> philosophical reasons -- because I knew it was a race that could be
> lost.)
> 
>     -- Brett
> 
> 
>> christos
>> 
>>> On Apr 7, 2020, at 9:47 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>> 
>>> Your fix, which was different from mine, does successfully prevent
>>> deadlocks.  However, I have discovered a similar issue: If the STDIN
>>> for the shell is closed (in my particular reproduction, the read() in
>>> which the shell is waiting for input is returning EIO), and then a
>>> SIGHUP is received shortly thereafter, there is a race condition that,
>>> while it cannot deadlock, and result in .history_lock being left in
>>> $HOME, causing hangs when future shells exit.  (I am seeing this with
>>> shutdowns of a Ubuntu 20.04 (beta release) server; the timing there
>>> appears to be different than on my other servers.)
>>> 
>>> What appears to be happening is that the EIO is triggering the shell to
>>> exit, and record() (after the done: label in main()) is being called.
>>>> From within record(), recdirs() is called, and then .history_lock is
>>> created and the history is written.  While the history is being
>>> written, the SIGHUP arrives and phup() ends up getting invoked.
>>> phup()'s call to record() will not deadlock because the (again) fix
>>> will cause record to immediately return; but ultimately the shell will
>>> exit without finishing the original recdirs().
>>> 
>>> So the history will not have been completely written (the .history.XXX
>>> temporary file will remain) and .history_lock will still exist.  So (a)
>>> the history is lost, and (b) the remaining .history_lock will hang
>>> future shells that exit.
>>> 
>>> Separately, although I haven't experienced it, I also think the original
>>> deadlock could occur if a "history -S" was in progress when a SIGHUP
>>> was received.  ("history -S" doesn't call record(), so rechist() would
>>> be entered once as a result of "history -S" and again as a result of
>>> phup().  The second entry will deadlock with the first one.)
>>> 
>>> And, finally, I think all of the issues described above can occur with
>>> savedirs (although the symptoms are different since recdirs() doesn't
>>> appear to use a lockfile or an intermediate temporary file), although I
>>> don't personally use that feature and so have not experienced any
>>> problems there.
>>> 
>>> Below is the patch I have applied to my tcsh.  It includes three
>>> things:
>>> 
>>> (1) Changes to sh.c to include temporarily disabling phup processing
>>> while in record().  This is intended to fix the issue with a SIGHUP
>>> being received while in record() while the shell is exiting for some
>>> other reason, resulting in the original rechist() (and/or recdirs())
>>> not completing and leaving behind a lock file (in the case of rechist)
>>> or a partially written dirs file (in the case of recdirs)).
>>> 
>>> (2) Reapplies my original fix (from below) to sh.hist.c to prevent a
>>> deadlock when a SIGHUP is received while processing "history -S".
>>> 
>>> (3) A similar fix (for a similar reason -- although it won't deadlock
>>> since there's no lock file option) to sh.dirs.c.
>>> 
>>> That's probably not the only way to fix things.  And (1) really doesn't
>>> accomplish much if (2) and (3) are implemented.  (And instead of my
>>> patch to sh.c, I wonder if phup processing should just be disabled once
>>> done: is reached in main() ... it looks like the code attempts to
>>> cleanup any pending SIGHUPs there, but doesn't really account for the
>>> race condition if a SIGHUP is received right after
>>> handle_pending_signals() returns).
>>> 
>>> Anyway, lots of options, but would be nice to get at least (1) or (2)
>>> implemented, or something else that fixes the same issue, to prevent
>>> the problem I am actually seeing on Ubuntu 20.04.
>>> 
>>> Thanks ...
>>> 
>>>    -- Brett
>>> 
>>> ========= PATCHES ============
>>> 
>>> 
>>> Index: tcsh-6.22.02/sh.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.c
>>> +++ tcsh-6.22.02/sh.c
>>> @@ -2521,14 +2521,21 @@ static void
>>> record(void)
>>> {
>>>    static int again = 0;
>>> +    int phup_disabled_tmp;
>>> +
>>>    if (again++)
>>>       return;
>>> 
>>> +   phup_disabled_tmp = phup_disabled;
>>> +   phup_disabled = 1;
>>> +
>>>    if (!fast) {
>>>       recdirs(NULL, adrof(STRsavedirs) != NULL);
>>>       rechist(NULL, adrof(STRsavehist) != NULL);
>>>    }
>>>    displayHistStats("Exiting");       /* no-op unless DEBUG_HIST */
>>> +
>>> +   phup_disabled = phup_disabled_tmp;
>>> }
>>> 
>>> /*
>>> 
>>> Index: tcsh-6.22.02/sh.hist.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.hist.c
>>> +++ tcsh-6.22.02/sh.hist.c
>>> @@ -1219,7 +1219,7 @@ void
>>> rechist(Char *fname, int ref)
>>> {
>>>    Char    *snum, *rs;
>>> -    int     fp, ftmp, oldidfds;
>>> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>>>    struct varent *shist;
>>>    char path[MAXPATHLEN];
>>>    struct stat st;
>>> @@ -1227,6 +1227,10 @@ rechist(Char *fname, int ref)
>>> 
>>>    if (fname == NULL && !ref)
>>>       return;
>>> +
>>> +   phup_disabled_tmp = phup_disabled;
>>> +   phup_disabled = 1;
>>> +
>>>    /*
>>>     * If $savehist is just set, we use the value of $history
>>>     * else we use the value in $savehist
>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>>    if (fp == -1) {
>>>       didfds = oldidfds;
>>>       cleanup_until(fname);
>>> +   phup_disabled = phup_disabled_tmp;
>>>       return;
>>>    }
>>>    /* Try to preserve ownership and permissions of the original history file */
>>> @@ -1329,6 +1334,7 @@ rechist(Char *fname, int ref)
>>>    (void)ReplaceFile( short2str(fname),path,NULL,0,NULL,NULL);
>>> #endif
>>>    cleanup_until(fname);
>>> +    phup_disabled = phup_disabled_tmp;
>>> }
>>> 
>>> Index: tcsh-6.22.02/sh.dir.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.dir.c
>>> +++ tcsh-6.22.02/sh.dir.c
>>> @@ -1356,7 +1356,7 @@ loaddirs(Char *fname)
>>> void
>>> recdirs(Char *fname, int def)
>>> {
>>> -    int     fp, ftmp, oldidfds;
>>> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>>>    int     cdflag = 0;
>>>    struct directory *dp;
>>>    unsigned int    num;
>>> @@ -1366,6 +1366,9 @@ recdirs(Char *fname, int def)
>>>    if (fname == NULL && !def)
>>>       return;
>>> 
>>> +   phup_disabled_tmp = phup_disabled;
>>> +   phup_disabled = 1;
>>> +
>>>    if (fname == NULL) {
>>>       if ((fname = varval(STRdirsfile)) == STRNULL)
>>>           fname = Strspl(varval(STRhome), &STRtildotdirs[1]);
>>> @@ -1378,6 +1381,7 @@ recdirs(Char *fname, int def)
>>> 
>>>    if ((fp = xcreat(short2str(fname), 0600)) == -1) {
>>>       cleanup_until(fname);
>>> +       phup_disabled = phup_disabled_tmp;
>>>       return;
>>>    }
>>> 
>>> @@ -1413,4 +1417,5 @@ recdirs(Char *fname, int def)
>>>    SHOUT = ftmp;
>>>    didfds = oldidfds;
>>>    cleanup_until(fname);
>>> +    phup_disabled = phup_disabled_tmp;
>>> }
>>> 
>>> 
>>> ======== PREVIOUS MESSAGES ==========
>>> 
>>> 
>>> On Mon, Jan 20, 2020 at 11:40:49AM -0500, Christos Zoulas wrote:
>>>> Thanks, it should be fixed now.
>>>> 
>>>> christos
>>>> 
>>>>> On Jan 20, 2020, at 9:08 AM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>>>> 
>>>>> (Resending what I posted last August ... let me know if there's
>>>>> anything I can do to get this (or a differnet fix for the same issue)
>>>>> into the tree.)
>>>>> 
>>>>> tcsh can deadlock with itself if savehist is confgured with "merge" and
>>>>> "lock", and two SIGHUPs are received in rapid succession.  The
>>>>> mechanism of the deadlock is the first SIGHUP triggers a rechist() and
>>>>> while that rechist() is executing (and after it has created the lock
>>>>> file), the second SIGHUP triggers a another rechist() which then waits
>>>>> forever for the lock the the first rechist() created to be released
>>>>> (which will never happen).
>>>>> 
>>>>> A backtrace from when it's deadlocked:
>>>>> 
>>>>> #1  0x00007fe3a48f7877 in usleep (useconds=useconds at entry=100000)
>>>>>  at ../sysdeps/posix/usleep.c:32
>>>>> #2  0x000055c7b9368974 in dot_lock (
>>>>>  fname=fname at entry=0x55c7ba174540 "/home/rbf/.history",
>>>>>  pollinterval=pollinterval at entry=100) at dotlock.c:166
>>>>> #3  0x000055c7b935950f in rechist (fname=0x55c7ba1e5960L"/home/rbf/.history",
>>>>>  ref=<optimized out>) at sh.hist.c:1293
>>>>> #4  0x000055c7b9344cc0 in record () at sh.c:2512
>>>>> #5  0x000055c7b9346b29 in phup () at sh.c:1842
>>>>> #6  0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>>>> #7  0x000055c7b935ec53 in xwrite (fildes=3,
>>>>>  buf=buf at entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
>>>>> #8  0x000055c7b9360104 in flush () at sh.print.c:260
>>>>> #9  0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
>>>>>  sfmt=sfmt at entry=0x55c7b938d0ad "%S", ap=ap at entry=0x7ffc4fb9cd60)
>>>>>  at tc.printf.c:294
>>>>> #10 0x000055c7b9387823 in xprintf (fmt=fmt at entry=0x55c7b938d0ad "%S")
>>>>>  at tc.printf.c:392
>>>>> #11 0x000055c7b935b392 in prlex (sp0=sp0 at entry=0x55c7ba17efc0) at
>>>>> sh.lex.c:228
>>>>> #12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
>>>>>  at sh.hist.c:1071
>>>>> #13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
>>>>> #14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
>>>>> #15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0
>>>>> #L"/home/rbf/.history",
>>>>>  ref=<optimized out>) at sh.hist.c:1322
>>>>> #16 0x000055c7b9344cc0 in record () at sh.c:2512
>>>>> #17 0x000055c7b9346b29 in phup () at sh.c:1842
>>>>> #18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>>>> #19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf at entry=0x7ffc4fb9e030,
>>>>>  nbyte=nbyte at entry=1) at sh.misc.c:662
>>>>> 
>>>>> This patch (which is to 6.20.00 ... but there's doesn't appear to be
>>>>> anything in 6.21.00 or 6.22.00 which would address this, so I'm
>>>>> reasonably confident the problem exists there as well) fixes the
>>>>> problem for me.  It disables processing of pending SIGHUPs at the start
>>>>> of rechist (and then restores on completion).
>>>>> 
>>>>> (It does this even if savehist isn't configured with lock; so it avoids
>>>>> starting a second write while the first one is in progress even in
>>>>> cases where it won't deadlock.)
>>>>> 
>>>>> (I did consider just having handle_pending_signals not redispatch
>>>>> phup() if one was already running, but it looks like the same deadlock
>>>>> could occur if a single SIGHUP arrived while the shell was saving the
>>>>> history for other reasons, although I haven't produced (or tried to
>>>>> produce) that behavior.)
>>>>> 
>>>>> --- tcsh-6.20.00.orig/sh.hist.c
>>>>> +++ tcsh-6.20.00/sh.hist.c
>>>>> @@ -1223,7 +1223,7 @@ void
>>>>> rechist(Char *fname, int ref)
>>>>> {
>>>>>   Char    *snum, *rs;
>>>>> -    int     fp, ftmp, oldidfds;
>>>>> +    int     fp, ftmp, oldidfds, phup_disabled_tmp;
>>>>>   struct varent *shist;
>>>>>   char path[MAXPATHLEN];
>>>>>   struct stat st;
>>>>> @@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
>>>>> 
>>>>>   if (fname == NULL && !ref)
>>>>>      return;
>>>>> +
>>>>> +       phup_disabled_tmp = phup_disabled;
>>>>> +       phup_disabled = 1;
>>>>> +
>>>>>   /*
>>>>>    * If $savehist is just set, we use the value of $history
>>>>>    * else we use the value in $savehist
>>>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>>>>   if (fp == -1) {
>>>>>      didfds = oldidfds;
>>>>>      cleanup_until(fname);
>>>>> +       phup_disabled = phup_disabled_tmp;
>>>>>      return;
>>>>>   }
>>>>>   /* Try to preserve ownership and permissions of the original history file */
>>>>> @@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
>>>>>   didfds = oldidfds;
>>>>>   (void)rename(path, short2str(fname));
>>>>>   cleanup_until(fname);
>>>>> +    phup_disabled = phup_disabled_tmp;
>>>>> }
>>>>> 
>>>>> (As background, below is where/how I found this)
>>>>> 
>>>>> For me, this is occurring on Linux; and on systemd systems it's easy to
>>>>> recreate -- when systemd attempts to terminate a session, the shell
>>>>> often ends up getting two SIGHUPs in rapid succession (my assumption is
>>>>> that one is directly from systemd and another is a result of the parent
>>>>> sshd terminating).  I get get it to happen about 50% of the time with
>>>>> "systemctl stop session-XX.scope" when that session is an ssh
>>>>> connection that has a tcsh shell configured with "savehist = ( XXX
>>>>> merge lock )".  (I have a history of about 200 lines to write out.)
>>>>> 
>>>>> Obviously, it's racy ... sometimes the second SIGHUP is early enough,
>>>>> or late enough, to avoid the problem.
>>>>> 
>>>>> With the above fix, I can not reproduce the deadlock.
>>>>> 
>>>>>   -- Brett
>>>>> --
>>>>> Tcsh mailing list
>>>>> Tcsh at astron.com
>>>>> https://mailman.astron.com/mailman/listinfo/tcsh
>>>> 
>>> 
>>> 
>>> 
>>>> --
>>>> Tcsh mailing list
>>>> Tcsh at astron.com
>>>> https://mailman.astron.com/mailman/listinfo/tcsh
>> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 235 bytes
Desc: Message signed with OpenPGP
URL: <https://mailman.astron.com/pipermail/tcsh/attachments/20200418/578f79f9/attachment.asc>