<div dir="ltr">As I said in the original thread, I have a rather half-assed fix for handling UTF-8 on Windows.<div><br></div><div><a href="https://github.com/amoldeshpande/tcsh/tree/unicode-with-full-vt-restore">https://github.com/amoldeshpande/tcsh/tree/unicode-with-full-vt-restore</a><br></div><div><br></div><div>The diff from tcsh's master is pretty large, but I think the changes that matter are in ed.screen.c and ed.char.c. They depend on Windows versions of NLSClassify() and NLSWidth(), so you'd need specific versions for however Unixes handle multibyte UTF-8.</div><div><br></div><div>From my recollection, the refresh code is set up for Unicode being a 4-byte wchar_t (or 2-byte on Windows), so it did not handle true multibyte UTF-8.</div><div><br></div><div>Since I seem to be the only person still using the native Windows version of tcsh, I didn't bother to make a proper, clean fix (for example, there's a C++ hashtable used to cache lookups of Unicode codepoints).</div><div><br></div><div>Some personal stuff has me rather harried and pressed for time this summer, but maybe later in the year I can try to make a nicer fix and test on WSL to make sure it's cross-platform.</div><div><br></div><div>Or someone can hold their nose and spelunk through the above diff.</div><div><br></div><div>-amol</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Apr 12, 2024 at 2:45 PM Kimmo Suominen &lt;<a href="mailto:kim@netbsd.org">kim@netbsd.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Apr 05, 2024 at 12:08:59PM +0200, H.Merijn Brand wrote:<br>
> That was not the tone I intended. I'm a volunteer myself and I did not<br>
> see it as a complaint. Sorry if that was not clear!<br>
<br>
Thank you for clarifying — I appreciate it.<br>
<br>
> On Fri, 5 Apr 2024 12:47:05 +0300, Kimmo Suominen &lt;<a href="mailto:kim@netbsd.org" target="_blank">kim@netbsd.org</a>&gt; wrote:<br>
> > I think this part of the commit is your proposed fix:<br>
> > <br>
> > diff --git a/ed.refresh.c b/ed.refresh.c<br>
> > index f1913801..bc902f5e 100644<br>
> > --- a/ed.refresh.c<br>
> > +++ b/ed.refresh.c<br>
> > @@ -1155,6 +1160,8 @@ CalcPosition(int w, int th, int *h, int *v)<br>
> > *h += 4;<br>
> > break;<br>
> > case NLSCLASS_ILLEGAL2:<br>
> > + *h += NLSCLASS_ILLEGAL_SIZE(w);<br>
> > + break;<br>
> > case NLSCLASS_ILLEGAL3:<br>
> > case NLSCLASS_ILLEGAL4:<br>
> > case NLSCLASS_ILLEGAL5:<br>
> > <br>
> > Why does it only apply to NLSCLASS_ILLEGAL2?<br>
> <br>
> Because that was the smallest change required to make "it work", and I<br>
> do not understand the underlying internals, so keeping the scope as<br>
> small as possible was a way to do it the safest way possible.<br>
<br>
I gave this a try. I noticed that the penguin is rendering across<br>
two columns. If you use a character that renders in a single screen<br>
position, the cursor is placed off by one. While this could be<br>
considered better than the original, which will be off by several<br>
column positions, it is clearly not correct.<br>
<br>
You can reproduce with the hwair character:<br>
<br>
set promptchars=$'\U10348#'<br>
<br>
versus the penguin character:<br>
<br>
set promptchars=$'\U1F427#'<br>
<br>
Do we have something already that correctly provides the rendering width<br>
of the character? Here is a Stack Overflow answer that points to using<br>
wcwidth(3) and wcswidth(3):<br>
<br>
<a href="https://stackoverflow.com/a/9145712/1511370" rel="noreferrer" target="_blank">https://stackoverflow.com/a/9145712/1511370</a><br>
<br>
<a href="https://man.netbsd.org/wcwidth.3" rel="noreferrer" target="_blank">https://man.netbsd.org/wcwidth.3</a><br>
<a href="https://man.netbsd.org/wcswidth.3" rel="noreferrer" target="_blank">https://man.netbsd.org/wcswidth.3</a><br>
<br>
I'm still not at all clear about the meanings of NLSCLASS_ILLEGAL*, but<br>
I'm guessing it is about the number of bytes taken to represent each<br>
character. Which does not appear to equate with the rendering width of<br>
the characters.<br>
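<br>
To illustrate the distinction between byte count and rendering width, here is a minimal Python sketch (not tcsh code). The helpers utf8_len() and approx_columns() are hypothetical names, and approx_columns() only approximates what wcwidth(3) reports by using the Unicode East Asian Width property:<br>
<pre>
```python
import unicodedata


def utf8_len(ch):
    """Number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


def approx_columns(ch):
    """Rough column count for one character.

    Assumption: 0 for characters with a nonzero canonical combining
    class, 2 for East Asian Wide or Fullwidth, 1 otherwise. This only
    approximates wcwidth(3); it ignores control characters and any
    terminal-specific behaviour.
    """
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


# The two prompt characters from the reproduction above.
for name, ch in (("hwair", "\U00010348"), ("penguin", "\U0001F427")):
    print(name, utf8_len(ch), approx_columns(ch))
```
</pre>
Both characters take four bytes in UTF-8, yet under this approximation the hwair occupies one column and the penguin two, which matches the rendering described above.<br>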
<br>
And then should we also handle combining characters? What if I wanted<br>
the Finnish flag in my prompt?<br>
<br>
echo $'\U1F1EB\U1F1EE'<br>
<br>
Cheers,<br>
+ Kimmo<br>
<br>
-- <br>
Tcsh mailing list<br>
<a href="mailto:Tcsh@astron.com" target="_blank">Tcsh@astron.com</a><br>
<a href="https://mailman.astron.com/mailman/listinfo/tcsh" rel="noreferrer" target="_blank">https://mailman.astron.com/mailman/listinfo/tcsh</a><br>
</blockquote></div>