Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to work with projects using non-UTF-8 coding systems? #135

Closed
whatacold opened this issue Oct 26, 2018 · 16 comments
Closed

How to work with projects using non-UTF-8 coding systems? #135

whatacold opened this issue Oct 26, 2018 · 16 comments

Comments

@whatacold
Copy link
Contributor

Hi,

At work I have some cpp projects using chinese-gkb to encode source files,
eglot together with ccls errors to work when doing completion:

[jsonrpc] (warning) Invalid JSON: (: 0) {"jsonrpc":"2.0","id":7,"result":{"isIncomplete":false,"items":[{"label":"hello(const char *name) -> void","kind":3,"detail":"","documentation":"Say hello to someone ��ij���ʺ�","sortText":"...........","insertText":"hello(${1:const char *name})$0","filterText":"hello","insertTextFormat"

And if I change the coding to utf-8 temporally, it works fine.

Is it a bug for eglot or ccls? The spec
says:

It defaults to utf-8, which is the only encoding supported right now.

Software info:

  • eglot 20181024.1053
  • ccls as LS
@MaskRay
Copy link
Contributor

MaskRay commented Oct 26, 2018

I would not call this as anyone (eglot or ccls)'s fault. GBK encoded file would not work. You can suggest your employer use Unicode.

Fortunately I think the encoding issue only applies to comments, and in rare occasions string literals. Just ignore them.

@mkcms
Copy link
Collaborator

mkcms commented Oct 26, 2018

It might be either eglot's or jsonrpc's library fault, I quickly looked at the sources and I couldn't find anything related to encoding conversion there.

@whatacold
Copy link
Contributor Author

whatacold commented Oct 26, 2018

I would not call this as anyone (eglot or ccls)'s fault. GBK encoded file would not work. You can suggest your employer use Unicode.

Does it stem from the spec that only supporting UTF-8?

Fortunately I think the encoding issue only applies to comments, and in rare occasions string literals. Just ignore them.

I noticed that if the comment for functions ("docstring") contains gbk, eglot won't be able to prompt the completion candidates, so it hurts usability.

@whatacold
Copy link
Contributor Author

It might be either eglot's or jsonrpc's library fault, I quickly looked at the sources and I couldn't find anything related to encoding conversion there.

Is it related to make-process ? I've tried changing it to chinese-gbk but doesn't help.

@whatacold
Copy link
Contributor Author

whatacold commented Oct 30, 2018

With current implementations of both sides, the problem comes by:

  1. eglot sends everything out in UTF-8(as in make-process), even if file using GBK. (Emacs helps do the job.)
  2. ccls sends say hover info back as in the file, GBK in this case, and does not specify Content-Type header(UTF-8 as default).
  3. The filter function in jsonrpc will decode the response with UTF-8, then feeds the response to json-read
  4. json-read complains about the definitively invalid json object.

I can think of two workarounds to this:

  1. eglot have a customizable way to allow user to set the coding system
  2. ccls allows user to set the coding system via arguments

I tend to prefer the second one considering the spec, I've tried to convert the encoding to UTF-8 with libiconv.

@MaskRay
Copy link
Contributor

MaskRay commented Oct 30, 2018

For option 2, ccls uses a naive UTF-8 transcoding stuff to convert between clang byte-based SourceLocation/SourceRange and UTF-8 measured line/character. In practice, these clang based language servers (ccls, clangd, cquery) may read files from the file system (i.e. do not take file contents send by the language client as the single source) for indexing (and completion/diagnostics). A couple of months ago someone asked if it is reasonable to have native UTF-16 support on cfe-dev, the answer is basically that MS should fix their stuff.

// ccls/src/working_files.cc
int GetOffsetForPosition(lsPosition pos, std::string_view content) {
  size_t i = 0;
  for (; pos.line > 0 && i < content.size(); i++)
    if (content[i] == '\n')
      pos.line--;
  for (; pos.character > 0 && i < content.size() && content[i] != '\n';
       pos.character--)
    if (uint8_t(content[i++]) >= 128) {
      // Skip 0b10xxxxxx
      while (i < content.size() && uint8_t(content[i]) >= 128 &&
             uint8_t(content[i]) < 192)
        i++;
    }
  return int(i);
}

// and also in src/message_handler.cc src/messages/textDocument_formatting.cc

The line number will be correct, but the character can be inaccurate for other encodings retaining the feature of low bytes being 1-byte characters. If you don't put indexable identities on a line after GBK, this should work fine:

long RIP; // 金庸

This will break character measurement:

int 紅顏彈指老, 剎那芳華;

I think you only care about Chinese in comments... I guess GetOffsetForPosition may work to some degree..

@whatacold
Copy link
Contributor Author

Thanks for pointing out GetOffsetForPosition, actually I haven't understood the source code too much,
but I did wonder how it does the conversion between character-based and byte-based positon/range.

I may need some time to see whether my workaround fork works well or not.


The line number will be correct, but the character can be inaccurate for other encodings retaining the feature of low bytes being 1-byte characters. If you don't put indexable identities on a line after GBK, this should work fine:

```c++
long RIP; // 金庸

This will break character measurement:

int 紅顏彈指老, 剎那芳華;

There is no code like the second case above in our projects, so it should work fine I guess.

@joaotavora
Copy link
Owner

@whatacold can you check if using clangd with the fixes from #124 and #125 fix your problem? You might have to

(setf eglot-current-column-function #'eglot-lsp-abiding-column
      eglot-move-to-column-function #'eglot-move-to-lsp-abiding-column)

@whatacold
Copy link
Contributor Author

Unfortunately, it doesn't help with the HEAD commit of eglot.el 38da3d3 ,
, clangd and below settings as you suggest:

(setf eglot-current-column-function #'eglot-lsp-abiding-column
      eglot-move-to-column-function #'eglot-move-to-lsp-abiding-column)

I make a demo repo if you're insterested.

@joaotavora
Copy link
Owner

Thanks @whatacold, what LSP server are you using (name and version).

@whatacold
Copy link
Contributor Author

I use clangd with version:

clangd --version
LLVM (http://llvm.org/):
  LLVM version 6.0.1
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: broadwell

@joaotavora
Copy link
Owner

looks like clangd version 6. If you switch to version 7 or 8 you should be OK, I think. @mkcms?

@whatacold
Copy link
Contributor Author

It seems that if we want to support non utf-8 coding systems, we need to
allow user to specify the coding system:

  1. for make-process in eglot--connect in eglot.el
  2. for jsonrpc--json-read in jsonrpc.el

I tried to work around it there, but it seems not that elegant, so I quit.

And I also have managed to convert the json string to UTF-8 in ccls side,
but I can't make ccls itself work at work, so I quit it too.

At last, I came up with a workaround with a thin translator written in python,
which translates the LSP server output to UTF-8 as I specified.
It works fine for my projects for now, and it has an extra benefit, that is
I can use it to fit cquery, ccls and clangd, depends on which works best.
(It should work with any LS though.)

@whatacold
Copy link
Contributor Author

whatacold commented Dec 3, 2018

looks like clangd version 6. If you switch to version 7 or 8 you should be OK

Is it related to clangd version? I'll try to see if I could get clangd 7/8 to verify this.

---UPDATE---
Yeah, auto-completion does work with clangd 7:

clangd --version
clangd version 7.0.0 (tags/RELEASE_700/final)

but it seems clangd doesn't include the comment for the function under point:

client-request (id:108) Mon Dec  3 22:25:56 2018:
(:jsonrpc "2.0" :id 108 :method "textDocument/documentHighlight" :params
          (:textDocument
           (:uri "file:///home/hgw/test/eglot-gbk-demo/main.c")
           :position
           (:line 8 :character 6)))

server-reply (id:107) Mon Dec  3 22:25:56 2018:
(:id 107 :jsonrpc "2.0" :result
     (:contents
      (:kind "plaintext" :value "Declared in global namespace\n\nvoid hello(const char *name)")))

@joaotavora
Copy link
Owner

It seems that if we want to support non utf-8 coding systems, we need to
allow user to specify the coding system:

I'm sorry perhaps I misread the whole discussion, but at some point it was related to misreported character/column positions. Text communication between client and LSP server as specified by the standard (Base Protocol, Content Part) is utf-8 and there are no plans to change it to anything else.

Yeah, auto-completion does work with clangd 7:

completion? I gave you a solution for incorrect column reporting. Everything else would be a problem with the server.

@whatacold
Copy link
Contributor Author

I'm sorry that I haven't make the issue clear.

is utf-8 and there are no plans to change it to anything else.

Totally understand that, so I gonna close this.

I gave you a solution for incorrect column reporting.

I'll verify this if I have some time, and report back here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants