[Tcsh] tcsh Deadlock with SIGHUP
Christos Zoulas
christos at zoulas.com
Sat Apr 18 16:55:21 UTC 2020
Committed, sorry for the delay!
christos
> On Apr 9, 2020, at 4:52 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>
> On Thu, Apr 09, 2020 at 02:40:57PM -0400, Christos Zoulas wrote:
>> All that is fine, but who sends the SIGHUP? And if we ignore the SIGHUP,
>> that someone might send SIGKILL and then what?
>
> I assume systemd is sending the SIGHUP. Nothing I've done or commented
> on below will cause the SIGHUP to be ignored -- my code just defers
> processing it once history writing starts until it is completed (note
> that even without my change -- there's still some latency to handling a
> SIGHUP because handle_pending_signals() is only called so often). I
> did mention possibly just permanently ignoring SIGUP once committed to
> the "goodbye" path (done: label in main()) -- once tcsh is in that
> path, it's committed to exiting, and that path (at first glance) isn't
> really any much slower (if at all slower) at exiting than the phup()
> path is.
>
> I never tracked down the exact sources -- I assumed in my original
> deadlock, one SIGHUP came from systemd and one came from the kernel as
> a result of SSHD getting a sIGHUP and exiting -- and in the current
> race I am seeing on ubuntu 20.04, I assume the -EIO is a result of sshd
> closing the pty on its end, and the SIGHUP is coming from systemd. But
> I don't konw either of those things for sure -- just that when I
> shutdown the system (or stop a session via systemctl) the dual SIGHUP
> occurs on my older machines and that the-EIO/SIGHUP race occurs
> (leaving a .history_lock) on my 20.04 machines.
>
> You're right that systemd will eventually send a stronger signal if the
> shell sticks around, but it will wait before doing so. Unless there
> are other issues on the system, the exit (from the first SIGHUP and/or
> the -EIO on the read) will complete before systemd gets tired of
> waiting and sents a SIGKILL. (Ultimately, back when I was having the
> deadlock, a SIGKILL did get sent, but systemd was waiting 90 seconds to
> do that. Absent a deadlock, tcsh is existing sub-second, of course.)
>
> Obviosuly, the SIGKILL risk cannot be mitigated -- if it comes in while
> .history_lock exists, the file is going to remain -- but I haven't
> experienced any conditions in practice where a SIGKILL immediately
> follows a SIGHUP or an EOF (or -EIO) on STDIN. But there are real
> world cases where a SIGHUP quickly follows another SIGHUP and/or
> quickly follows EOF/-EIO on STDIN.
>
> (The "history -S" followed by "SIGHUP" potential deadlock, on the other
> hand, would just be badluck; I wouldn't expect non-contrived
> reproducible cases of that in practice ... I only fixed that for
> philosophical reasons -- because I knew it was a race that could be
> lost.)
>
> -- Brett
>
>
>> christos
>>
>>> On Apr 7, 2020, at 9:47 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>>
>>> Your fix, which was different from mine, does successfully prevent
>>> deadlocks. However, I have discovered a similar issue: If the STDIN
>>> for the shell is closed (in my particular reproduction, the read() in
>>> which the shell is waiting for input is returning EIO), and then a
>>> SIGHUP is received shortly thereafter, there is a race condition that,
>>> while it cannot deadlock, and result in .history_lock being left in
>>> $HOME, causing hangs when future shells exit. (I am seeing this with
>>> shutdowns of a Ubuntu 20.04 (beta release) server; the timing there
>>> appears to be different than on my other servers.)
>>>
>>> What appears to be happening is that the EIO is triggering the shell to
>>> exit, and record() (after the done: label in main()) is being called.
>>>> From within record(), recdirs() is called, and then .history_lock is
>>> created and the history is written. While the history is being
>>> written, the SIGHUP arrives and phup() ends up getting invoked.
>>> phup()'s call to record() will not deadlock because the (again) fix
>>> will cause record to immediately return; but ultimately the shell will
>>> exit without finishing the original recdirs().
>>>
>>> So the history will not have been completely written (the .history.XXX
>>> temporary file will remain) and .history_lock will still exist. So (a)
>>> the history is lost, and (b) the remaining .history_lock will hang
>>> future shells that exit.
>>>
>>> Separately, although I haven't experienced it, I also think the original
>>> deadlock could occur if a "history -S" was in progress when a SIGHUP
>>> was received. ("history -S" doesn't call record(), so rechist() would
>>> be entered once as a result of "history -S" and again as a result of
>>> phup(). The second entry will deadlock with the first one.)
>>>
>>> And, finally, I think all of the issues described above can occur with
>>> savedirs (although the symptoms are different since recdirs() doesn't
>>> appear to use a lockfile or an intermediate temporary file), although I
>>> don't personally use that feature and so have not experienced any
>>> problems there.
>>>
>>> Below is the patch I have applied to my tcsh. It includes three
>>> things:
>>>
>>> (1) Changes to sh.c to include temporarily disabling phup processing
>>> while in record(). This is intended to fix the issue with a SIGHUP
>>> being received while in record() while the shell is exiting for some
>>> other reason, resulting in the original rechist() (and/or recdirs())
>>> not completing and leaving behind a lock file (in the case of rechist)
>>> or a partially written dirs file (in the case of recdirs)).
>>>
>>> (2) Reapplies my original fix (from below) to sh.hist.c to prevent a
>>> deadlock when a SIGHUP is received while processing "history -S".
>>>
>>> (3) A similar fix (for a similar reason -- although it won't deadlock
>>> since there's no lock file option) to sh.dirs.c.
>>>
>>> That's probably not the only way to fix things. And (1) really doesn't
>>> accomplish much if (2) and (3) are implemented. (And instead of my
>>> patch to sh.c, I wonder if phup processing should just be disabled once
>>> done: is reached in main() ... it looks like the code attempts to
>>> cleanup any pending SIGHUPs there, but doesn't really account for the
>>> race condition if a SIGHUP is received right after
>>> handle_pending_signals() returns).
>>>
>>> Anyway, lots of options, but would be nice to get at least (1) or (2)
>>> implemented, or something else that fixes the same issue, to prevent
>>> the problem I am actually seeing on Ubuntu 20.04.
>>>
>>> Thanks ...
>>>
>>> -- Brett
>>>
>>> ========= PATCHES ============
>>>
>>>
>>> Index: tcsh-6.22.02/sh.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.c
>>> +++ tcsh-6.22.02/sh.c
>>> @@ -2521,14 +2521,21 @@ static void
>>> record(void)
>>> {
>>> static int again = 0;
>>> + int phup_disabled_tmp;
>>> +
>>> if (again++)
>>> return;
>>>
>>> + phup_disabled_tmp = phup_disabled;
>>> + phup_disabled = 1;
>>> +
>>> if (!fast) {
>>> recdirs(NULL, adrof(STRsavedirs) != NULL);
>>> rechist(NULL, adrof(STRsavehist) != NULL);
>>> }
>>> displayHistStats("Exiting"); /* no-op unless DEBUG_HIST */
>>> +
>>> + phup_disabled = phup_disabled_tmp;
>>> }
>>>
>>> /*
>>>
>>> Index: tcsh-6.22.02/sh.hist.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.hist.c
>>> +++ tcsh-6.22.02/sh.hist.c
>>> @@ -1219,7 +1219,7 @@ void
>>> rechist(Char *fname, int ref)
>>> {
>>> Char *snum, *rs;
>>> - int fp, ftmp, oldidfds;
>>> + int fp, ftmp, oldidfds, phup_disabled_tmp;
>>> struct varent *shist;
>>> char path[MAXPATHLEN];
>>> struct stat st;
>>> @@ -1227,6 +1227,10 @@ rechist(Char *fname, int ref)
>>>
>>> if (fname == NULL && !ref)
>>> return;
>>> +
>>> + phup_disabled_tmp = phup_disabled;
>>> + phup_disabled = 1;
>>> +
>>> /*
>>> * If $savehist is just set, we use the value of $history
>>> * else we use the value in $savehist
>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>> if (fp == -1) {
>>> didfds = oldidfds;
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> return;
>>> }
>>> /* Try to preserve ownership and permissions of the original history file */
>>> @@ -1329,6 +1334,7 @@ rechist(Char *fname, int ref)
>>> (void)ReplaceFile( short2str(fname),path,NULL,0,NULL,NULL);
>>> #endif
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> }
>>>
>>> Index: tcsh-6.22.02/sh.dir.c
>>> ===================================================================
>>> --- tcsh-6.22.02.orig/sh.dir.c
>>> +++ tcsh-6.22.02/sh.dir.c
>>> @@ -1356,7 +1356,7 @@ loaddirs(Char *fname)
>>> void
>>> recdirs(Char *fname, int def)
>>> {
>>> - int fp, ftmp, oldidfds;
>>> + int fp, ftmp, oldidfds, phup_disabled_tmp;
>>> int cdflag = 0;
>>> struct directory *dp;
>>> unsigned int num;
>>> @@ -1366,6 +1366,9 @@ recdirs(Char *fname, int def)
>>> if (fname == NULL && !def)
>>> return;
>>>
>>> + phup_disabled_tmp = phup_disabled;
>>> + phup_disabled = 1;
>>> +
>>> if (fname == NULL) {
>>> if ((fname = varval(STRdirsfile)) == STRNULL)
>>> fname = Strspl(varval(STRhome), &STRtildotdirs[1]);
>>> @@ -1378,6 +1381,7 @@ recdirs(Char *fname, int def)
>>>
>>> if ((fp = xcreat(short2str(fname), 0600)) == -1) {
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> return;
>>> }
>>>
>>> @@ -1413,4 +1417,5 @@ recdirs(Char *fname, int def)
>>> SHOUT = ftmp;
>>> didfds = oldidfds;
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> }
>>>
>>>
>>> ======== PREVIOUS MESSAGES ==========
>>>
>>>
>>> On Mon, Jan 20, 2020 at 11:40:49AM -0500, Christos Zoulas wrote:
>>>> Thanks, it should be fixed now.
>>>>
>>>> christos
>>>>
>>>>> On Jan 20, 2020, at 9:08 AM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>>>>
>>>>> (Resending what I posted last August ... let me know if there's
>>>>> anything I can do to get this (or a differnet fix for the same issue)
>>>>> into the tree.)
>>>>>
>>>>> tcsh can deadlock with itself if savehist is confgured with "merge" and
>>>>> "lock", and two SIGHUPs are received in rapid succession. The
>>>>> mechanism of the deadlock is the first SIGHUP triggers a rechist() and
>>>>> while that rechist() is executing (and after it has created the lock
>>>>> file), the second SIGHUP triggers a another rechist() which then waits
>>>>> forever for the lock the the first rechist() created to be released
>>>>> (which will never happen).
>>>>>
>>>>> A backtrace from when it's deadlocked:
>>>>>
>>>>> #1 0x00007fe3a48f7877 in usleep (useconds=useconds at entry=100000)
>>>>> at ../sysdeps/posix/usleep.c:32
>>>>> #2 0x000055c7b9368974 in dot_lock (
>>>>> fname=fname at entry=0x55c7ba174540 "/home/rbf/.history",
>>>>> pollinterval=pollinterval at entry=100) at dotlock.c:166
>>>>> #3 0x000055c7b935950f in rechist (fname=0x55c7ba1e5960L"/home/rbf/.history",
>>>>> ref=<optimized out>) at sh.hist.c:1293
>>>>> #4 0x000055c7b9344cc0 in record () at sh.c:2512
>>>>> #5 0x000055c7b9346b29 in phup () at sh.c:1842
>>>>> #6 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>>>> #7 0x000055c7b935ec53 in xwrite (fildes=3,
>>>>> buf=buf at entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
>>>>> #8 0x000055c7b9360104 in flush () at sh.print.c:260
>>>>> #9 0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
>>>>> sfmt=sfmt at entry=0x55c7b938d0ad "%S", ap=ap at entry=0x7ffc4fb9cd60)
>>>>> at tc.printf.c:294
>>>>> #10 0x000055c7b9387823 in xprintf (fmt=fmt at entry=0x55c7b938d0ad "%S")
>>>>> at tc.printf.c:392
>>>>> #11 0x000055c7b935b392 in prlex (sp0=sp0 at entry=0x55c7ba17efc0) at
>>>>> sh.lex.c:228
>>>>> #12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
>>>>> at sh.hist.c:1071
>>>>> #13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
>>>>> #14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
>>>>> #15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0
>>>>> #L"/home/rbf/.history",
>>>>> ref=<optimized out>) at sh.hist.c:1322
>>>>> #16 0x000055c7b9344cc0 in record () at sh.c:2512
>>>>> #17 0x000055c7b9346b29 in phup () at sh.c:1842
>>>>> #18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>>>> #19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf at entry=0x7ffc4fb9e030,
>>>>> nbyte=nbyte at entry=1) at sh.misc.c:662
>>>>>
>>>>> This patch (which is to 6.20.00 ... but there's doesn't appear to be
>>>>> anything in 6.21.00 or 6.22.00 which would address this, so I'm
>>>>> reasonably confident the problem exists there as well) fixes the
>>>>> problem for me. It disables processing of pending SIGHUPs at the start
>>>>> of rechist (and then restores on completion).
>>>>>
>>>>> (It does this even if savehist isn't configured with lock; so it avoids
>>>>> starting a second write while the first one is in progress even in
>>>>> cases where it won't deadlock.)
>>>>>
>>>>> (I did consider just having handle_pending_signals not redispatch
>>>>> phup() if one was already running, but it looks like the same deadlock
>>>>> could occur if a single SIGHUP arrived while the shell was saving the
>>>>> history for other reasons, although I haven't produced (or tried to
>>>>> produce) that behavior.)
>>>>>
>>>>> --- tcsh-6.20.00.orig/sh.hist.c
>>>>> +++ tcsh-6.20.00/sh.hist.c
>>>>> @@ -1223,7 +1223,7 @@ void
>>>>> rechist(Char *fname, int ref)
>>>>> {
>>>>> Char *snum, *rs;
>>>>> - int fp, ftmp, oldidfds;
>>>>> + int fp, ftmp, oldidfds, phup_disabled_tmp;
>>>>> struct varent *shist;
>>>>> char path[MAXPATHLEN];
>>>>> struct stat st;
>>>>> @@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
>>>>>
>>>>> if (fname == NULL && !ref)
>>>>> return;
>>>>> +
>>>>> + phup_disabled_tmp = phup_disabled;
>>>>> + phup_disabled = 1;
>>>>> +
>>>>> /*
>>>>> * If $savehist is just set, we use the value of $history
>>>>> * else we use the value in $savehist
>>>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>>>> if (fp == -1) {
>>>>> didfds = oldidfds;
>>>>> cleanup_until(fname);
>>>>> + phup_disabled = phup_disabled_tmp;
>>>>> return;
>>>>> }
>>>>> /* Try to preserve ownership and permissions of the original history file */
>>>>> @@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
>>>>> didfds = oldidfds;
>>>>> (void)rename(path, short2str(fname));
>>>>> cleanup_until(fname);
>>>>> + phup_disabled = phup_disabled_tmp;
>>>>> }
>>>>>
>>>>> (As background, below is where/how I found this)
>>>>>
>>>>> For me, this is occurring on Linux; and on systemd systems it's easy to
>>>>> recreate -- when systemd attempts to terminate a session, the shell
>>>>> often ends up getting two SIGHUPs in rapid succession (my assumption is
>>>>> that one is directly from systemd and another is a result of the parent
>>>>> sshd terminating). I get get it to happen about 50% of the time with
>>>>> "systemctl stop session-XX.scope" when that session is an ssh
>>>>> connection that has a tcsh shell configured with "savehist = ( XXX
>>>>> merge lock )". (I have a history of about 200 lines to write out.)
>>>>>
>>>>> Obviously, it's racy ... sometimes the second SIGHUP is early enough,
>>>>> or late enough, to avoid the problem.
>>>>>
>>>>> With the above fix, I can not reproduce the deadlock.
>>>>>
>>>>> -- Brett
>>>>> --
>>>>> Tcsh mailing list
>>>>> Tcsh at astron.com
>>>>> https://mailman.astron.com/mailman/listinfo/tcsh
>>>>
>>>
>>>
>>>
>>>> --
>>>> Tcsh mailing list
>>>> Tcsh at astron.com
>>>> https://mailman.astron.com/mailman/listinfo/tcsh
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 235 bytes
Desc: Message signed with OpenPGP
URL: <https://mailman.astron.com/pipermail/tcsh/attachments/20200418/578f79f9/attachment.asc>
More information about the Tcsh
mailing list