[Tcsh] tcsh Deadlock with SIGHUP
Christos Zoulas
christos at zoulas.com
Thu Apr 9 18:40:57 UTC 2020
All that is fine, but who sends the SIGHUP? And if we ignore the SIGHUP,
that someone might send SIGKILL and then what?
christos
> On Apr 7, 2020, at 9:47 PM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>
> Your fix, which was different from mine, does successfully prevent
> deadlocks. However, I have discovered a similar issue: If the STDIN
> for the shell is closed (in my particular reproduction, the read() in
> which the shell is waiting for input is returning EIO), and then a
> SIGHUP is received shortly thereafter, there is a race condition that,
> while it cannot deadlock, and result in .history_lock being left in
> $HOME, causing hangs when future shells exit. (I am seeing this with
> shutdowns of a Ubuntu 20.04 (beta release) server; the timing there
> appears to be different than on my other servers.)
>
> What appears to be happening is that the EIO is triggering the shell to
> exit, and record() (after the done: label in main()) is being called.
>> From within record(), recdirs() is called, and then .history_lock is
> created and the history is written. While the history is being
> written, the SIGHUP arrives and phup() ends up getting invoked.
> phup()'s call to record() will not deadlock because the (again) fix
> will cause record to immediately return; but ultimately the shell will
> exit without finishing the original recdirs().
>
> So the history will not have been completely written (the .history.XXX
> temporary file will remain) and .history_lock will still exist. So (a)
> the history is lost, and (b) the remaining .history_lock will hang
> future shells that exit.
>
> Separately, although I haven't experienced it, I also think the original
> deadlock could occur if a "history -S" was in progress when a SIGHUP
> was received. ("history -S" doesn't call record(), so rechist() would
> be entered once as a result of "history -S" and again as a result of
> phup(). The second entry will deadlock with the first one.)
>
> And, finally, I think all of the issues described above can occur with
> savedirs (although the symptoms are different since recdirs() doesn't
> appear to use a lockfile or an intermediate temporary file), although I
> don't personally use that feature and so have not experienced any
> problems there.
>
> Below is the patch I have applied to my tcsh. It includes three
> things:
>
> (1) Changes to sh.c to include temporarily disabling phup processing
> while in record(). This is intended to fix the issue with a SIGHUP
> being received while in record() while the shell is exiting for some
> other reason, resulting in the original rechist() (and/or recdirs())
> not completing and leaving behind a lock file (in the case of rechist)
> or a partially written dirs file (in the case of recdirs)).
>
> (2) Reapplies my original fix (from below) to sh.hist.c to prevent a
> deadlock when a SIGHUP is received while processing "history -S".
>
> (3) A similar fix (for a similar reason -- although it won't deadlock
> since there's no lock file option) to sh.dirs.c.
>
> That's probably not the only way to fix things. And (1) really doesn't
> accomplish much if (2) and (3) are implemented. (And instead of my
> patch to sh.c, I wonder if phup processing should just be disabled once
> done: is reached in main() ... it looks like the code attempts to
> cleanup any pending SIGHUPs there, but doesn't really account for the
> race condition if a SIGHUP is received right after
> handle_pending_signals() returns).
>
> Anyway, lots of options, but would be nice to get at least (1) or (2)
> implemented, or something else that fixes the same issue, to prevent
> the problem I am actually seeing on Ubuntu 20.04.
>
> Thanks ...
>
> -- Brett
>
> ========= PATCHES ============
>
>
> Index: tcsh-6.22.02/sh.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.c
> +++ tcsh-6.22.02/sh.c
> @@ -2521,14 +2521,21 @@ static void
> record(void)
> {
> static int again = 0;
> + int phup_disabled_tmp;
> +
> if (again++)
> return;
>
> + phup_disabled_tmp = phup_disabled;
> + phup_disabled = 1;
> +
> if (!fast) {
> recdirs(NULL, adrof(STRsavedirs) != NULL);
> rechist(NULL, adrof(STRsavehist) != NULL);
> }
> displayHistStats("Exiting"); /* no-op unless DEBUG_HIST */
> +
> + phup_disabled = phup_disabled_tmp;
> }
>
> /*
>
> Index: tcsh-6.22.02/sh.hist.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.hist.c
> +++ tcsh-6.22.02/sh.hist.c
> @@ -1219,7 +1219,7 @@ void
> rechist(Char *fname, int ref)
> {
> Char *snum, *rs;
> - int fp, ftmp, oldidfds;
> + int fp, ftmp, oldidfds, phup_disabled_tmp;
> struct varent *shist;
> char path[MAXPATHLEN];
> struct stat st;
> @@ -1227,6 +1227,10 @@ rechist(Char *fname, int ref)
>
> if (fname == NULL && !ref)
> return;
> +
> + phup_disabled_tmp = phup_disabled;
> + phup_disabled = 1;
> +
> /*
> * If $savehist is just set, we use the value of $history
> * else we use the value in $savehist
> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
> if (fp == -1) {
> didfds = oldidfds;
> cleanup_until(fname);
> + phup_disabled = phup_disabled_tmp;
> return;
> }
> /* Try to preserve ownership and permissions of the original history file */
> @@ -1329,6 +1334,7 @@ rechist(Char *fname, int ref)
> (void)ReplaceFile( short2str(fname),path,NULL,0,NULL,NULL);
> #endif
> cleanup_until(fname);
> + phup_disabled = phup_disabled_tmp;
> }
>
> Index: tcsh-6.22.02/sh.dir.c
> ===================================================================
> --- tcsh-6.22.02.orig/sh.dir.c
> +++ tcsh-6.22.02/sh.dir.c
> @@ -1356,7 +1356,7 @@ loaddirs(Char *fname)
> void
> recdirs(Char *fname, int def)
> {
> - int fp, ftmp, oldidfds;
> + int fp, ftmp, oldidfds, phup_disabled_tmp;
> int cdflag = 0;
> struct directory *dp;
> unsigned int num;
> @@ -1366,6 +1366,9 @@ recdirs(Char *fname, int def)
> if (fname == NULL && !def)
> return;
>
> + phup_disabled_tmp = phup_disabled;
> + phup_disabled = 1;
> +
> if (fname == NULL) {
> if ((fname = varval(STRdirsfile)) == STRNULL)
> fname = Strspl(varval(STRhome), &STRtildotdirs[1]);
> @@ -1378,6 +1381,7 @@ recdirs(Char *fname, int def)
>
> if ((fp = xcreat(short2str(fname), 0600)) == -1) {
> cleanup_until(fname);
> + phup_disabled = phup_disabled_tmp;
> return;
> }
>
> @@ -1413,4 +1417,5 @@ recdirs(Char *fname, int def)
> SHOUT = ftmp;
> didfds = oldidfds;
> cleanup_until(fname);
> + phup_disabled = phup_disabled_tmp;
> }
>
>
> ======== PREVIOUS MESSAGES ==========
>
>
> On Mon, Jan 20, 2020 at 11:40:49AM -0500, Christos Zoulas wrote:
>> Thanks, it should be fixed now.
>>
>> christos
>>
>>> On Jan 20, 2020, at 9:08 AM, Brett Frankenberger <rbf at rbfnet.com> wrote:
>>>
>>> (Resending what I posted last August ... let me know if there's
>>> anything I can do to get this (or a differnet fix for the same issue)
>>> into the tree.)
>>>
>>> tcsh can deadlock with itself if savehist is confgured with "merge" and
>>> "lock", and two SIGHUPs are received in rapid succession. The
>>> mechanism of the deadlock is the first SIGHUP triggers a rechist() and
>>> while that rechist() is executing (and after it has created the lock
>>> file), the second SIGHUP triggers a another rechist() which then waits
>>> forever for the lock the the first rechist() created to be released
>>> (which will never happen).
>>>
>>> A backtrace from when it's deadlocked:
>>>
>>> #1 0x00007fe3a48f7877 in usleep (useconds=useconds at entry=100000)
>>> at ../sysdeps/posix/usleep.c:32
>>> #2 0x000055c7b9368974 in dot_lock (
>>> fname=fname at entry=0x55c7ba174540 "/home/rbf/.history",
>>> pollinterval=pollinterval at entry=100) at dotlock.c:166
>>> #3 0x000055c7b935950f in rechist (fname=0x55c7ba1e5960L"/home/rbf/.history",
>>> ref=<optimized out>) at sh.hist.c:1293
>>> #4 0x000055c7b9344cc0 in record () at sh.c:2512
>>> #5 0x000055c7b9346b29 in phup () at sh.c:1842
>>> #6 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>> #7 0x000055c7b935ec53 in xwrite (fildes=3,
>>> buf=buf at entry=0x55c7b95b6e00 <linbuf>, nbyte=12) at sh.misc.c:690
>>> #8 0x000055c7b9360104 in flush () at sh.print.c:260
>>> #9 0x000055c7b9387219 in doprnt (addchar=0x55c7b9360390 <xputchar>,
>>> sfmt=sfmt at entry=0x55c7b938d0ad "%S", ap=ap at entry=0x7ffc4fb9cd60)
>>> at tc.printf.c:294
>>> #10 0x000055c7b9387823 in xprintf (fmt=fmt at entry=0x55c7b938d0ad "%S")
>>> at tc.printf.c:392
>>> #11 0x000055c7b935b392 in prlex (sp0=sp0 at entry=0x55c7ba17efc0) at
>>> sh.lex.c:228
>>> #12 0x000055c7b9358510 in phist (hp=0x55c7ba17efc0, hflg=<optimized out>)
>>> at sh.hist.c:1071
>>> #13 0x000055c7b93596d3 in dophist (hflg=65, n=200) at sh.hist.c:1114
>>> #14 dohist (vp=<optimized out>, c=<optimized out>) at sh.hist.c:1177
>>> #15 0x000055c7b93593f7 in rechist (fname=0x55c7ba1dade0
>>> #L"/home/rbf/.history",
>>> ref=<optimized out>) at sh.hist.c:1322
>>> #16 0x000055c7b9344cc0 in record () at sh.c:2512
>>> #17 0x000055c7b9346b29 in phup () at sh.c:1842
>>> #18 0x000055c7b93895a6 in handle_pending_signals () at tc.sig.c:72
>>> #19 0x000055c7b935ebb3 in xread (fildes=16, buf=buf at entry=0x7ffc4fb9e030,
>>> nbyte=nbyte at entry=1) at sh.misc.c:662
>>>
>>> This patch (which is to 6.20.00 ... but there's doesn't appear to be
>>> anything in 6.21.00 or 6.22.00 which would address this, so I'm
>>> reasonably confident the problem exists there as well) fixes the
>>> problem for me. It disables processing of pending SIGHUPs at the start
>>> of rechist (and then restores on completion).
>>>
>>> (It does this even if savehist isn't configured with lock; so it avoids
>>> starting a second write while the first one is in progress even in
>>> cases where it won't deadlock.)
>>>
>>> (I did consider just having handle_pending_signals not redispatch
>>> phup() if one was already running, but it looks like the same deadlock
>>> could occur if a single SIGHUP arrived while the shell was saving the
>>> history for other reasons, although I haven't produced (or tried to
>>> produce) that behavior.)
>>>
>>> --- tcsh-6.20.00.orig/sh.hist.c
>>> +++ tcsh-6.20.00/sh.hist.c
>>> @@ -1223,7 +1223,7 @@ void
>>> rechist(Char *fname, int ref)
>>> {
>>> Char *snum, *rs;
>>> - int fp, ftmp, oldidfds;
>>> + int fp, ftmp, oldidfds, phup_disabled_tmp;
>>> struct varent *shist;
>>> char path[MAXPATHLEN];
>>> struct stat st;
>>> @@ -1231,6 +1231,10 @@ rechist(Char *fname, int ref)
>>>
>>> if (fname == NULL && !ref)
>>> return;
>>> +
>>> + phup_disabled_tmp = phup_disabled;
>>> + phup_disabled = 1;
>>> +
>>> /*
>>> * If $savehist is just set, we use the value of $history
>>> * else we use the value in $savehist
>>> @@ -1305,6 +1309,7 @@ rechist(Char *fname, int ref)
>>> if (fp == -1) {
>>> didfds = oldidfds;
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> return;
>>> }
>>> /* Try to preserve ownership and permissions of the original history file */
>>> @@ -1325,6 +1330,7 @@ rechist(Char *fname, int ref)
>>> didfds = oldidfds;
>>> (void)rename(path, short2str(fname));
>>> cleanup_until(fname);
>>> + phup_disabled = phup_disabled_tmp;
>>> }
>>>
>>> (As background, below is where/how I found this)
>>>
>>> For me, this is occurring on Linux; and on systemd systems it's easy to
>>> recreate -- when systemd attempts to terminate a session, the shell
>>> often ends up getting two SIGHUPs in rapid succession (my assumption is
>>> that one is directly from systemd and another is a result of the parent
>>> sshd terminating). I get get it to happen about 50% of the time with
>>> "systemctl stop session-XX.scope" when that session is an ssh
>>> connection that has a tcsh shell configured with "savehist = ( XXX
>>> merge lock )". (I have a history of about 200 lines to write out.)
>>>
>>> Obviously, it's racy ... sometimes the second SIGHUP is early enough,
>>> or late enough, to avoid the problem.
>>>
>>> With the above fix, I can not reproduce the deadlock.
>>>
>>> -- Brett
>>> --
>>> Tcsh mailing list
>>> Tcsh at astron.com
>>> https://mailman.astron.com/mailman/listinfo/tcsh
>>
>
>
>
>> --
>> Tcsh mailing list
>> Tcsh at astron.com
>> https://mailman.astron.com/mailman/listinfo/tcsh
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 235 bytes
Desc: Message signed with OpenPGP
URL: <https://mailman.astron.com/pipermail/tcsh/attachments/20200409/47f8c24c/attachment.asc>
More information about the Tcsh
mailing list