How to work with projects using non-UTF-8 coding systems? #135

whatacold · 2018-10-26T00:59:30Z

Hi,

At work I have some C++ projects whose source files are encoded in chinese-gbk.
With these files, eglot together with ccls fails during completion:

[jsonrpc] (warning) Invalid JSON: (: 0) {"jsonrpc":"2.0","id":7,"result":{"isIncomplete":false,"items":[{"label":"hello(const char *name) -> void","kind":3,"detail":"","documentation":"Say hello to someone ��ĳ���ʺ�","sortText":"...........","insertText":"hello(${1:const char *name})$0","filterText":"hello","insertTextFormat"

If I temporarily change the coding to utf-8, it works fine.

Is this a bug in eglot or in ccls? The spec says:

> It defaults to utf-8, which is the only encoding supported right now.

Software info:

eglot 20181024.1053
ccls as LS


MaskRay · 2018-10-26T03:14:21Z

I would not call this anyone's fault (eglot's or ccls's). GBK-encoded files will not work. You could suggest that your employer use Unicode.

Fortunately, I think the encoding issue only applies to comments and, on rare occasions, string literals. Just ignore them.

mkcms · 2018-10-26T04:59:58Z

It might be the fault of either eglot or the jsonrpc library. I quickly looked at the sources and couldn't find anything related to encoding conversion there.

whatacold · 2018-10-26T05:16:21Z

> I would not call this anyone's fault (eglot's or ccls's). GBK-encoded files will not work. You could suggest that your employer use Unicode.

Does this stem from the spec supporting only UTF-8?

> Fortunately, I think the encoding issue only applies to comments and, on rare occasions, string literals. Just ignore them.

I noticed that if a function's comment (its "docstring") contains GBK text, eglot cannot show the completion candidates, so it hurts usability.

whatacold · 2018-10-26T05:20:39Z

> It might be the fault of either eglot or the jsonrpc library. I quickly looked at the sources and couldn't find anything related to encoding conversion there.

Is it related to make-process? I tried changing it to chinese-gbk, but that doesn't help.

whatacold · 2018-10-30T15:23:12Z

With the current implementations on both sides, the problem arises as follows:

1. eglot sends everything out in UTF-8 (as set in make-process), even if the file uses GBK. (Emacs does the conversion.)
2. ccls sends, say, hover info back in the file's encoding, GBK in this case, and does not specify a Content-Type header (so UTF-8 is the default).
3. The filter function in jsonrpc decodes the response as UTF-8, then feeds the response to json-read.
4. json-read complains about the definitely invalid JSON object.
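The sequence above can be reproduced outside Emacs with a minimal Python sketch. The payload here is made up for illustration (it is not the actual ccls reply), and `parse_as_utf8` is a hypothetical helper. Python's strict UTF-8 decoder raises an error at the decoding step, whereas the exact failure mode inside Emacs and json-read may differ.

```python
import json

# A tiny stand-in for a server reply whose documentation text was copied
# byte-for-byte from a GBK-encoded source file (the payload is made up).
doc = "向某人问好"
prefix = b'{"jsonrpc":"2.0","id":7,"result":{"documentation":"'
suffix = b'"}}'


def parse_as_utf8(raw: bytes):
    """Decode strictly as UTF-8, as the LSP base protocol assumes, then parse."""
    try:
        return json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError as err:
        return f"cannot decode: {err.reason}"


# GBK bytes inside an otherwise ASCII JSON body are not valid UTF-8.
gbk_result = parse_as_utf8(prefix + doc.encode("gbk") + suffix)

# The same text encoded as UTF-8 parses cleanly.
utf8_result = parse_as_utf8(prefix + doc.encode("utf-8") + suffix)
```

Only the bytes of the documentation string differ between the two calls; the JSON structure itself is ASCII and identical in both encodings.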

I can think of two workarounds:

1. eglot provides a customizable way for the user to set the coding system.
2. ccls lets the user set the coding system via arguments.

I tend to prefer the second one, considering the spec. I've tried converting the encoding to UTF-8 with libiconv.

MaskRay · 2018-10-30T16:34:04Z

For option 2: ccls uses naive UTF-8 transcoding to convert between clang's byte-based SourceLocation/SourceRange and UTF-8-measured line/character positions. In practice, these clang-based language servers (ccls, clangd, cquery) may read files from the file system (i.e. they do not treat the file contents sent by the language client as the single source) for indexing (and completion/diagnostics). A couple of months ago someone asked on cfe-dev whether it would be reasonable to have native UTF-16 support; the answer was basically that MS should fix their stuff.

// ccls/src/working_files.cc
int GetOffsetForPosition(lsPosition pos, std::string_view content) {
  size_t i = 0;
  for (; pos.line > 0 && i < content.size(); i++)
    if (content[i] == '\n')
      pos.line--;
  for (; pos.character > 0 && i < content.size() && content[i] != '\n';
       pos.character--)
    if (uint8_t(content[i++]) >= 128) {
      // Skip 0b10xxxxxx
      while (i < content.size() && uint8_t(content[i]) >= 128 &&
             uint8_t(content[i]) < 192)
        i++;
    }
  return int(i);
}

// and also in src/message_handler.cc src/messages/textDocument_formatting.cc

The line number will be correct, but the character offset can be inaccurate for other encodings that retain the property of low bytes being 1-byte characters. If you don't put indexable identifiers on a line after GBK text, this should work fine:

long RIP; // 金庸

This will break character measurement:

int 紅顏彈指老, 剎那芳華;

I think you only care about Chinese in comments, so I guess GetOffsetForPosition may work to some degree.
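To make the miscounting concrete, here is a rough Python rendering of the second loop of GetOffsetForPosition, applied to a single line. The names `offset_for_character` and `true_offset` are hypothetical helpers for this sketch, not part of ccls, and "character" is treated as a count of code points, which is what the loop effectively assumes.

```python
def offset_for_character(line: bytes, character: int) -> int:
    """Mimic the character-skipping loop of GetOffsetForPosition on one line."""
    i = 0
    while character > 0 and i < len(line) and line[i] != 0x0A:
        byte = line[i]
        i += 1
        if byte >= 128:
            # Skip what would be UTF-8 continuation bytes (0b10xxxxxx).
            while i < len(line) and 128 <= line[i] < 192:
                i += 1
        character -= 1
    return i


def true_offset(text: str, character: int, encoding: str) -> int:
    """Byte offset of the given character index when text is stored in encoding."""
    return len(text[:character].encode(encoding))


# An identifier that comes BEFORE the Chinese text: all bytes so far are ASCII.
comment_line = "long RIP; // 金庸"
# An identifier that comes AFTER Chinese text on the same line.
ident_line = "中文 x"

utf8_after = offset_for_character(ident_line.encode("utf-8"), 3)
gbk_after = offset_for_character(ident_line.encode("gbk"), 3)
gbk_before = offset_for_character(comment_line.encode("gbk"), 5)
```

With UTF-8 input, `utf8_after` matches the real byte offset of `x`. With GBK input, the trail bytes of these characters are not in the 0b10xxxxxx range, so each 2-byte character is counted as two characters and `gbk_after` falls short of the real offset. `gbk_before` is still correct because only ASCII precedes the identifier.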

whatacold · 2018-10-31T13:59:26Z

Thanks for pointing out GetOffsetForPosition. I haven't really understood the source code yet,
but I did wonder how it converts between character-based and byte-based positions/ranges.

I may need some time to see whether my workaround fork works well or not.


> The line number will be correct, but the character can be inaccurate for other encodings retaining the feature of low bytes being 1-byte characters. If you don't put indexable identities on a line after GBK, this should work fine:
>
> long RIP; // 金庸
>
> This will break character measurement:
>
> int 紅顏彈指老, 剎那芳華;

There is no code like the second case above in our projects, so it should work fine I guess.

joaotavora · 2018-12-02T11:08:21Z

@whatacold can you check whether using clangd with the fixes from #124 and #125 fixes your problem? You might have to evaluate:

(setf eglot-current-column-function #'eglot-lsp-abiding-column
      eglot-move-to-column-function #'eglot-move-to-lsp-abiding-column)

whatacold · 2018-12-03T13:56:56Z

Unfortunately, it doesn't help, using the HEAD commit of eglot.el (38da3d3),
clangd, and the settings you suggested:

(setf eglot-current-column-function #'eglot-lsp-abiding-column
      eglot-move-to-column-function #'eglot-move-to-lsp-abiding-column)

I made a demo repo, if you're interested.

joaotavora · 2018-12-03T13:59:27Z

Thanks @whatacold. What LSP server are you using (name and version)?

whatacold · 2018-12-03T14:01:41Z

I use clangd with version:

clangd --version
LLVM (http://llvm.org/):
  LLVM version 6.0.1
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: broadwell

joaotavora · 2018-12-03T14:09:48Z

Looks like clangd version 6. If you switch to version 7 or 8 you should be OK, I think. @mkcms?

whatacold · 2018-12-03T14:19:26Z

It seems that if we want to support non-UTF-8 coding systems, we need to
allow the user to specify the coding system:

for make-process in eglot--connect in eglot.el
for jsonrpc--json-read in jsonrpc.el

I tried to work around it there, but the result was not very elegant, so I gave up.

I also managed to convert the JSON string to UTF-8 on the ccls side,
but I can't get ccls itself to work at work, so I gave that up too.

In the end, I came up with a workaround: a thin translator written in Python,
which translates the LSP server's output to UTF-8 as I specify.
It works fine for my projects for now, and it has an extra benefit:
I can use it with cquery, ccls, or clangd, depending on which works best.
(It should work with any LS, though.)
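For readers wondering what such a translator has to do per message, here is a minimal sketch. It is not the actual script described above. It assumes each server message arrives as one complete LSP frame (headers, a blank line, then the body) and that every non-ASCII byte in the body is in the given source encoding. `transcode_message` is a hypothetical name.

```python
def transcode_message(raw: bytes, source_encoding: str = "gbk") -> bytes:
    """Re-encode one framed LSP message body to UTF-8 and fix Content-Length."""
    header, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete message: header terminator not found")

    # Decode with the project's coding system, then re-encode as UTF-8.
    new_body = body.decode(source_encoding).encode("utf-8")

    # Content-Length counts bytes, so it must be recomputed after transcoding.
    other_headers = [
        line
        for line in header.split(b"\r\n")
        if not line.lower().startswith(b"content-length:")
    ]
    headers = [b"Content-Length: %d" % len(new_body)] + other_headers
    return b"\r\n".join(headers) + b"\r\n\r\n" + new_body


# Example: a made-up reply whose body is GBK-encoded.
body_text = '{"documentation":"向某人问好"}'
gbk_body = body_text.encode("gbk")
framed = b"Content-Length: %d\r\n\r\n" % len(gbk_body) + gbk_body
fixed = transcode_message(framed)
```

A real translator would also need to read frames incrementally from the server's stdout and pass client requests through unchanged; both are omitted here.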

whatacold · 2018-12-03T14:20:47Z

> Looks like clangd version 6. If you switch to version 7 or 8 you should be OK

Is it related to the clangd version? I'll see if I can get clangd 7 or 8 to verify this.

---UPDATE---
Yeah, auto-completion does work with clangd 7:

clangd --version
clangd version 7.0.0 (tags/RELEASE_700/final)

but it seems clangd doesn't include the comment for the function under point:

client-request (id:108) Mon Dec  3 22:25:56 2018:
(:jsonrpc "2.0" :id 108 :method "textDocument/documentHighlight" :params
          (:textDocument
           (:uri "file:///home/hgw/test/eglot-gbk-demo/main.c")
           :position
           (:line 8 :character 6)))

server-reply (id:107) Mon Dec  3 22:25:56 2018:
(:id 107 :jsonrpc "2.0" :result
     (:contents
      (:kind "plaintext" :value "Declared in global namespace\n\nvoid hello(const char *name)")))

joaotavora · 2018-12-03T14:59:16Z

> It seems that if we want to support non-UTF-8 coding systems, we need to
> allow the user to specify the coding system:

I'm sorry, perhaps I misread the whole discussion, but at some point it was about misreported character/column positions. Text communication between the client and the LSP server, as specified by the standard (Base Protocol, Content Part), is utf-8 and there are no plans to change it to anything else.

> Yeah, auto-completion does work with clangd 7:

Completion? I gave you a solution for incorrect column reporting. Everything else would be a problem with the server.

whatacold · 2018-12-05T15:45:02Z

I'm sorry that I haven't made the issue clear.

> is utf-8 and there are no plans to change it to anything else.

I totally understand that, so I'm going to close this.

> I gave you a solution for incorrect column reporting.

I'll verify this when I have some time, and report back here.

whatacold closed this as completed Dec 5, 2018

joaotavora mentioned this issue Mar 27, 2019

What unicode unit does eglot use in lsp ranges? #244

Closed
