<div dir="ltr">As I said in the original thread, I have a rather half-assed fix for handling UTF-8 on Windows.<div><br></div><div><a href="https://github.com/amoldeshpande/tcsh/tree/unicode-with-full-vt-restore">https://github.com/amoldeshpande/tcsh/tree/unicode-with-full-vt-restore</a><br></div><div><br></div><div>The diff from tcsh's master is pretty large, but I think the changes that matter are in ed.screen.c and ed.char.c. They depend on Windows versions of NLSClassify() and NLSWidth(), so you'd need platform-specific versions for however Unix systems handle multibyte UTF-8.</div><div><br></div><div>From my recollection, the refresh code is set up for Unicode being a 4-byte wchar_t (or 2-byte on Windows), so it did not handle true multibyte UTF-8.</div><div><br></div><div>Since I seem to be the only person still using the native Windows version of tcsh, I didn't bother to make a proper, clean fix (for example, there's a C++ hashtable used to cache lookups of Unicode codepoints).</div><div><br></div><div>Some personal stuff has me rather harried and pressed for time this summer, but maybe later in the year I can try to make a nicer fix and test on WSL to make sure it's cross-platform.</div><div><br></div><div>Or someone can hold their nose and spelunk through the above diff.</div><div><br></div><div>-amol</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Apr 12, 2024 at 2:45 PM Kimmo Suominen &lt;<a href="mailto:kim@netbsd.org">kim@netbsd.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Apr 05, 2024 at 12:08:59PM +0200, H.Merijn Brand wrote:<br>

> That was not the tone I intended. I'm a volunteer myself and I did not<br>

> see it as a complaint. Sorry if that was not clear!<br>

<br>

Thank you for clarifying — I appreciate it.<br>

<br>

> On Fri, 5 Apr 2024 12:47:05 +0300, Kimmo Suominen &lt;<a href="mailto:kim@netbsd.org" target="_blank">kim@netbsd.org</a>&gt; wrote:<br>

> > I think this part of the commit is your proposed fix:<br>

> > <br>

> > diff --git a/ed.refresh.c b/ed.refresh.c<br>

> > index f1913801..bc902f5e 100644<br>

> > --- a/ed.refresh.c<br>

> > +++ b/ed.refresh.c<br>

> > @@ -1155,6 +1160,8 @@ CalcPosition(int w, int th, int *h, int *v)<br>

> >         *h += 4;<br>

> >         break;<br>

> >     case NLSCLASS_ILLEGAL2:<br>

> > +       *h += NLSCLASS_ILLEGAL_SIZE(w);<br>

> > +       break;<br>

> >     case NLSCLASS_ILLEGAL3:<br>

> >     case NLSCLASS_ILLEGAL4:<br>

> >     case NLSCLASS_ILLEGAL5:<br>

> > <br>

> > Why does it only apply to NLSCLASS_ILLEGAL2?<br>

> <br>

> Because that was the smallest change required to make "it work", and I<br>

> do not understand the underlying internals, so keeping the scope as<br>

> small as possible was a way to do it the safest way possible.<br>

<br>

I gave this a try.  I noticed that the penguin is rendering across<br>

two columns.  If you use a character that renders in a single screen<br>

position, the cursor is placed off by one.  While this could be<br>

considered better than the original, which will be off by several<br>

column positions, it is clearly not correct.<br>

<br>

You can reproduce with the hwair character:<br>

<br>

    set promptchars=$'\U10348#'<br>

<br>

versus the penguin character:<br>

<br>

    set promptchars=$'\U1F427#'<br>

<br>

Do we have something already that correctly provides the rendering width<br>

of the character?  Here is a Stack Overflow answer that points to using<br>

wcwidth(3) and wcswidth(3):<br>

<br>

    <a href="https://stackoverflow.com/a/9145712/1511370" rel="noreferrer" target="_blank">https://stackoverflow.com/a/9145712/1511370</a><br>

<br>

    <a href="https://man.netbsd.org/wcwidth.3" rel="noreferrer" target="_blank">https://man.netbsd.org/wcwidth.3</a><br>

    <a href="https://man.netbsd.org/wcswidth.3" rel="noreferrer" target="_blank">https://man.netbsd.org/wcswidth.3</a><br>
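<br>
A minimal sketch of what a wcwidth(3)-based lookup might look like.<br>
The helper name display_columns is hypothetical and not part of tcsh.<br>
It assumes a platform where wchar_t holds a full code point, and the<br>
result depends on the active locale and on the libc's Unicode tables,<br>
so a caller would need setlocale(LC_CTYPE, "") first:<br>

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> on glibc */
#include <wchar.h>

/*
 * Hypothetical helper, not tcsh code: number of terminal columns that
 * one wide character occupies according to the C library.
 * wcwidth() returns -1 for characters it considers non-printable in
 * the current locale; this sketch falls back to one column for those.
 */
int
display_columns(wchar_t wc)
{
    int w = wcwidth(wc);

    return w < 0 ? 1 : w;
}
```

The fallback to one column is a design choice of this sketch only; the<br>
editor might prefer to count the escaped representation instead.<br>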

<br>

I'm still not at all clear about the meanings of NLSCLASS_ILLEGAL*, but<br>

I'm guessing it is about the number of bytes taken to represent each<br>

character, which does not appear to equate with the rendering width of<br>

the characters.<br>
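<br>
To illustrate why a byte count cannot stand in for the rendering width,<br>
here is a small sketch (utf8_encoded_length is a hypothetical name, not<br>
a tcsh function) that computes the UTF-8 length of a code point.  Both<br>
the hwair (U+10348) and the penguin (U+1F427) come out as 4 bytes, even<br>
though they render in one and two columns respectively:<br>

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes UTF-8 needs to encode the code
 * point cp, or 0 if cp is not a Unicode scalar value (a surrogate, or
 * above U+10FFFF).
 */
int
utf8_encoded_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                   /* beyond the Unicode range */
}
```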

<br>

And then should we also handle combining characters and other<br>

multi-code-point sequences?  What if I wanted the Finnish flag in my<br>

prompt?<br>

<br>

    echo $'\U1F1EB\U1F1EE'<br>
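<br>
The flag is encoded as two separate code points (regional indicators F<br>
and I), 4 bytes each in UTF-8.  A rough sketch of counting code points<br>
in a UTF-8 string follows; utf8_count_code_points is a hypothetical<br>
helper and assumes well-formed input.  Terminals that draw the pair as<br>
one flag glyph would need more than a per-code-point width to place the<br>
cursor correctly:<br>

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the code points in a NUL-terminated,
 * well-formed UTF-8 string by counting every byte that is not a
 * continuation byte (continuation bytes look like 10xxxxxx).
 */
size_t
utf8_count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the flag above, "\xF0\x9F\x87\xAB\xF0\x9F\x87\xAE", this counts 2<br>
code points in 8 bytes.<br>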

<br>

Cheers,<br>

+ Kimmo<br>

<br>

-- <br>

Tcsh mailing list<br>

<a href="mailto:Tcsh@astron.com" target="_blank">Tcsh@astron.com</a><br>

<a href="https://mailman.astron.com/mailman/listinfo/tcsh" rel="noreferrer" target="_blank">https://mailman.astron.com/mailman/listinfo/tcsh</a><br>

</blockquote></div>