[Tcsh] Multi-byte characters in promptchars

Fri Apr 12 09:45:06 UTC 2024

As I said in the original thread, I have a rather half-assed fix for
handling UTF-8 on Windows.

https://github.com/amoldeshpande/tcsh/tree/unicode-with-full-vt-restore

The diff from tcsh's master is pretty large, but I think the changes that
matter are in ed.screen.c and ed.char.c. They depend on Windows versions
of NLSClassify() and NLSWidth(), so you'd need specific versions for
however Unixes handle multibyte UTF-8.

From my recollection, the refresh code is set up for Unicode being 4-byte
wchar_t (or 2-byte for Windows), so it did not handle true multibyte
UTF-8.
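
To make the byte / code-unit distinction concrete, here is a rough sketch.
It is not code from tcsh or from the branch above; utf8_decode() and
utf16_units() are made-up names for illustration only. It decodes one UTF-8
sequence and reports how many bytes it took, and how many 2-byte wchar_t
units the same code point would need on Windows.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of tcsh): decode one UTF-8 sequence
 * from s[0..n-1] into *cp.  Returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or malformed.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx */
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx */
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx */
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;                           /* stray continuation or bad lead */
    }
    if (n < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a 10xxxxxx continuation */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates, and out-of-range values. */
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/*
 * Hypothetical helper: number of UTF-16 code units (2-byte wchar_t on
 * Windows) needed for cp.  Anything above U+FFFF takes a surrogate pair.
 */
static int
utf16_units(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}
```

Both prompt characters from the quoted message below, U+10348 and U+1F427,
decode from 4 bytes and need 2 UTF-16 units, yet they are reported to render
in one and two columns respectively. Neither count can stand in for the
display width.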

Since I seem to be the only person still using the native Windows version
of tcsh, I didn't bother to make a proper, clean fix (there's a C++
hashtable used to cache lookups of Unicode codepoints, for example).

Some personal stuff has me rather harried and pressed for time this summer
but maybe later in the year I can try to make a nicer fix and test on WSL
to make sure it's cross-platform.

Or someone can hold their nose and spelunk through the above diff.

-amol

On Fri, Apr 12, 2024 at 2:45 PM Kimmo Suominen <kim at netbsd.org> wrote:

> On Fri, Apr 05, 2024 at 12:08:59PM +0200, H.Merijn Brand wrote:
> > That was not the tone I intended. I'm a volunteer myself and I did not
> > see it as a complaint. Sorry if that was not clear!
>
> Thank you for clarifying — I appreciate it.
>
> > On Fri, 5 Apr 2024 12:47:05 +0300, Kimmo Suominen <kim at netbsd.org>
> wrote:
> > > I think this part of the commit is your proposed fix:
> > >
> > > diff --git a/ed.refresh.c b/ed.refresh.c
> > > index f1913801..bc902f5e 100644
> > > --- a/ed.refresh.c
> > > +++ b/ed.refresh.c
> > > @@ -1155,6 +1160,8 @@ CalcPosition(int w, int th, int *h, int *v)
> > >         *h += 4;
> > >         break;
> > >     case NLSCLASS_ILLEGAL2:
> > > +       *h += NLSCLASS_ILLEGAL_SIZE(w);
> > > +       break;
> > >     case NLSCLASS_ILLEGAL3:
> > >     case NLSCLASS_ILLEGAL4:
> > >     case NLSCLASS_ILLEGAL5:
> > >
> > > Why does it only apply to NLSCLASS_ILLEGAL2?
> >
> > Because that was the smallest change required to make "it work", and I
> > do not understand the underlying internals, so keeping the scope as
> > small as possible was a way to do it the safest way possible.
>
> I gave this a try.  I noticed that the penguin is rendering across
> two columns.  If you use a character that renders in a single screen
> position, the cursor is placed off by one.  While this could be
> considered better than the original, which will be off by several
> column positions, it is clearly not correct.
>
> You can reproduce with the hwair character:
>
>     set promptchars=$'\U10348#'
>
> versus the penguin character:
>
>     set promptchars=$'\U1F427#'
>
> Do we have something already that correctly provides the rendering width
> of the character?  Here is a Stack Overflow answer that points to using
> wcwidth(3) and wcswidth(3):
>
>     https://stackoverflow.com/a/9145712/1511370
>
>     https://man.netbsd.org/wcwidth.3
>     https://man.netbsd.org/wcswidth.3
>
> I'm still not at all clear about the meanings of NLSCLASS_ILLEGAL*, but
> I'm guessing it is about the number of bytes taken to represent each
> character, which does not appear to equate with the rendering width of
> the characters.
>
> And then should we also handle combining characters?  What if I wanted
> the Finnish flag in my prompt?
>
>     echo $'\U1F1EB\U1F1EE'
>
> Cheers,
> + Kimmo
>
> --
> Tcsh mailing list
> Tcsh at astron.com
> https://mailman.astron.com/mailman/listinfo/tcsh
>