-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #124: Treat :character offsets as UTF-16 code units #125
Conversation
Rather than use UTF-16 unconditionally, you may consider hiding this under an option (I would even suggest using code points for performance consideration). Some other language clients don't UTF-8 (LanguageClient-neovim, lsp-mode) either and this isn't an issue in practice because code points in Supplementary Planes are rarely used. I dislike UTF-16 so much that I didn't even consider doing the conversion in cquery and now in ccls. There is a bug tracking the conversion to code points and you may raise that topic again microsoft/language-server-protocol#376 |
What potential problems could arise from using it unconditionally?
For the UTF-16-ignorant like me, what are code points? |
A code point usually represents a single character (the closest concept is probably "grapheme cluster"). 𝛱 has 1 code point but requires 2 UTF-16 code units to encode, thus causing the incorrect computation of I find the time spent on forward-line is positively related to the distance. Converting between position and LSP |
LSP can be used for things other than just programming. I didn't really benchmark this change, but maybe you're right and it will make eglot slower, although probably not by a large factor. |
@mkcms although probably not by a large factor. Yes if line/character -> position conversion is only done a few times. But in some scenarios (semantic highlight, codeLens), computation through Many language servers may deliberately choose not to conform to LSP for the UTF-16 transcoding. It seems palantir/python-language-server diverges from LSP, too. Please provide the UTF-16 transcoding under a |
The spec says that :character offsets are based on a UTF-16 representation. Previously we were assuming that :character specifies the number of characters. * eglot.el (eglot-full-position-conversion): New defvar. (eglot--count-characters): New function. (eglot--pos-to-lsp-position): (eglot--lsp-position-to-point): Use it.
I pushed a commit which adds a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say @mkcms, I don't like this at all. It's:
- very bad from a readability point of view;
- bad in performance, more than probably;
- questionable in benefit (though I have understood your Issues when buffer contains special characters #124 bug report, and it is indeed a bug)
I feel some effort should be made to use "normal" emacs abstractions. I'll have a better look at the problem, until then, let's shelve this.
@@ -721,23 +721,52 @@ CONNECT-ARGS are passed as additional arguments to | |||
(let ((warning-minimum-level :error)) | |||
(display-warning 'eglot (apply #'format format args) :warning))) | |||
|
|||
(defvar eglot-full-position-conversion t | |||
"Whether positions are calculated in full compliance with the standard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If non-nil, calculate positions in full compliance with standard"
(defvar eglot-full-position-conversion t | ||
"Whether positions are calculated in full compliance with the standard. | ||
Setting this to nil may improve performance, but it can also | ||
introduce bugs when characters wider than two UTF-16 code units |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably meant: "wider than one".
introduce bugs when characters wider than two UTF-16 code units | ||
are used.") | ||
|
||
(defun eglot--count-characters (beg end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd name this "eglot--lsp-chars-between"
If `eglot-full-position-conversion' is non-nil, then convert the | ||
region to UTF-16 and count the number of code units. Otherwise | ||
return the distance between BEG and END." | ||
(cond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need so many newlines, it makes reading very strenuous, for me at least.
(line-beginning-position) (line-end-position)))) | ||
(pos 0)) | ||
(while (< pos lsp-pos) | ||
(cl-incf pos (eglot--count-characters (point) (1+ (point)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole section is extremely hacky, counting character lengths one by one, it's like you're reimplementing Emacs's coding systems.
@mkcms, have a look at 1999245 and tell me what you think. There's a very simple test, which we could overhaul with many difficult situations. It'd be interesting to compare the performance of your approach and mine's (which I think uses more of Emacs's built-in machinery to do what you do by hand, and is therefore more readable). Interestingly, Emacs's |
Actually look here (3bbc937), I bungled that other commit quite badly with some stuff that wasn't supposed to go in. |
@joaotavora Thanks for the review.
Yes, of course it will slow down the position conversion functions. What I meant before was overall performance that affects the user. My reasoning was that converting Emacs to LSP positions only takes a fraction of the overall time (e.g. when applying workspace edits, I/O and buffer modifications take the most time). Obviously, I haven't really benchmarked, so it's possible that my change would slow down Eglot by a large factor in other use cases. I only tested this change by doing regular coding work and I haven't noticed any slowdown.
I agree that 99% of the time, this is just redundant computation. Actually this issue only happened to me exactly once, and it wasn't even serious.
I also looked at the Maybe instead of unconditionally converting the positions, perhaps we could add a |
Depends. Consider a very big file with a very large list of diagnostics. It might take a much larger fraction there.
Actually, it doesn't just happen with crazy characters like the one you had, my name João is also a candidade because of the ã. But I don't write my name often in code, only in the header at the most.
This is true, but the buffer we're talking about doesn't need to be the buffer visiting the file, but a temporary buffer whose only contents are the line you're trying to discover the column in (EDIT: and this is what my version does).
Most likely. They also offer this exactness/performance trade-off
This is a good idea. |
Alright, I will do this. I also did some benchmarks, comparing execution times of
I've also managed to optimize my solution to get these times:
The benchmark suite: benchmark.zip |
@mkcms, I'll have a look at your benchmark but in the meantime, to avoid confusion of mental translations I've pushed two branches |
@mkcms your benchmark idea is very good, but I don't understand Also, |
Right, I missed that. Attached is a new benchmark suite which will also test the number of times each solution was incorrect.
With optimization:
The optimization is this: |
@mkcms, so I basically did the same as you and rewrote the benchmark to verify correctness. Indeed the Emacs function
I pushed it to
I'm using your unoptimized version, I think. What is the optimization you had? Can it be applied to my version? |
And some followup questions: won't this break those servers that are not following the standard and returning just the actual column number when confronted with complex |
Yes, it can be applied from what I can see.
To get to
Good question. I will test some servers and report the stats here. |
Nice. I thought of that just now, too. It's a little faster indeed
But it might take more than two iterations if there are more multibyte chars before the target column. |
Continuing this fun project, I've managed to optimized it a tad further by using a binary search:
More seriously though, this breaks at least |
@joaotavora Don't we need to also have a variable for |
Nice:) This is interpolation search.
for offset = (max 1 (abs (/ diff 2)))
do (if (> diff 0) (forward-char offset) (backward-char offset)) be simplified to something like |
lol, not sure if that would be simplifying it. I was looking for a
I don't think so. Do we? Don't we just need an alternate way to move to the LSP column? |
Actually, it isn't a bad idea, especially if I change this to |
* eglot.el (eglot-move-to-lsp-abiding-column): Simplify slightly.
Yes, we do.
|
Also instead of having two variables we can have one that defaults to
|
Curious, but I don't understand what is happening, in what situation did eglot move to the wrong position? |
Never mind. Of course we do: we need to translate back to the format that So I'd make a second variable indeed that is commonly set to Then we can take care of the problem of setting both variables at once, if it is indeed a problem. |
@mkcms have a look at my latest commit. Curiously, the problem in #124 hasn't gone away, at least in clangd-6.0. It's like my clang is reporting for
What clangd version are you using? |
@joaotavora Thanks, that seems to fix the issue.
EDIT: If clangd-6.0 is older than April 27 2018, then it won't use UTF-16 code units: |
@mkcms, thanks for the tests. To solve the problem of automatically setting the variables for Also, in retrospect, I think my original review of your PR was too harsh :-) I wasn't reading it correctly and had kind of a knee-jerk reaction. |
That's a cool idea, I opened #152 for it.
No problem, I appreciate how much you pay attention to code quality. I'm not a professional lisper myself, so hopefully your reviews will make me a better emacs hacker. |
Eglot treats the offset as characters by default (codepoints). joaotavora/eglot#125 Eglot and lsp-mode no longer listed as "work in progress" on https://langserver.org
Also close joaotavora/eglot#125. Idea and much of design contributed by Michał Krzywkowski <k.michal@zoho.com> This introduces the variable eglot-move-to-column-function. According to the standard, LSP column/character offsets are based on a count of UTF-16 code units, not actual visual columns. So when LSP says position 3 of a line containing just \"aXbc\", where X is a multi-byte character, it actually means `b', not `c'. This is what the function `eglot-move-to-lsp-abiding-column' does. However, many servers don't follow the spec this closely, and thus this variable should be set to `move-to-column' in buffers managed by those servers. * eglot.el (eglot-move-to-column-function): New variable. (eglot-move-to-lsp-abiding-column): New function. (eglot--lsp-position-to-point): Use eglot-move-to-column-function.
* eglot.el (eglot-move-to-lsp-abiding-column): Simplify slightly.
* eglot.el (eglot-current-column-function): New variable. (eglot-lsp-abiding-column): New helper. (eglot--pos-to-lsp-position): Use eglot-current-column-function. (eglot-move-to-column-function): Tweak docstring.
Also close joaotavora/eglot#125. Idea and much of design contributed by Michał Krzywkowski <k.michal@zoho.com> This introduces the variable eglot-move-to-column-function. According to the standard, LSP column/character offsets are based on a count of UTF-16 code units, not actual visual columns. So when LSP says position 3 of a line containing just \"aXbc\", where X is a multi-byte character, it actually means `b', not `c'. This is what the function `eglot-move-to-lsp-abiding-column' does. However, many servers don't follow the spec this closely, and thus this variable should be set to `move-to-column' in buffers managed by those servers. * eglot.el (eglot-move-to-column-function): New variable. (eglot-move-to-lsp-abiding-column): New function. (eglot--lsp-position-to-point): Use eglot-move-to-column-function.
* eglot.el (eglot-move-to-lsp-abiding-column): Simplify slightly.
* eglot.el (eglot-current-column-function): New variable. (eglot-lsp-abiding-column): New helper. (eglot--pos-to-lsp-position): Use eglot-current-column-function. (eglot-move-to-column-function): Tweak docstring.
Also close #125. Idea and much of design contributed by Michał Krzywkowski <k.michal@zoho.com> This introduces the variable eglot-move-to-column-function. According to the standard, LSP column/character offsets are based on a count of UTF-16 code units, not actual visual columns. So when LSP says position 3 of a line containing just \"aXbc\", where X is a multi-byte character, it actually means `b', not `c'. This is what the function `eglot-move-to-lsp-abiding-column' does. However, many servers don't follow the spec this closely, and thus this variable should be set to `move-to-column' in buffers managed by those servers. * eglot.el (eglot-move-to-column-function): New variable. (eglot-move-to-lsp-abiding-column): New function. (eglot--lsp-position-to-point): Use eglot-move-to-column-function. #124: joaotavora/eglot#124 #125: joaotavora/eglot#125
* eglot.el (eglot-move-to-lsp-abiding-column): Simplify slightly. #125: joaotavora/eglot#125
* eglot.el (eglot-current-column-function): New variable. (eglot-lsp-abiding-column): New helper. (eglot--pos-to-lsp-position): Use eglot-current-column-function. (eglot-move-to-column-function): Tweak docstring. #125: joaotavora/eglot#125
Also close joaotavora/eglot#125. Idea and much of design contributed by Michał Krzywkowski <k.michal@zoho.com> This introduces the variable eglot-move-to-column-function. According to the standard, LSP column/character offsets are based on a count of UTF-16 code units, not actual visual columns. So when LSP says position 3 of a line containing just \"aXbc\", where X is a multi-byte character, it actually means `b', not `c'. This is what the function `eglot-move-to-lsp-abiding-column' does. However, many servers don't follow the spec this closely, and thus this variable should be set to `move-to-column' in buffers managed by those servers. * eglot.el (eglot-move-to-column-function): New variable. (eglot-move-to-lsp-abiding-column): New function. (eglot--lsp-position-to-point): Use eglot-move-to-column-function. GitHub-reference: fix joaotavora/eglot#124
* eglot.el (eglot-move-to-lsp-abiding-column): Simplify slightly. GitHub-reference: joaotavora/eglot#125
* eglot.el (eglot-current-column-function): New variable. (eglot-lsp-abiding-column): New helper. (eglot--pos-to-lsp-position): Use eglot-current-column-function. (eglot-move-to-column-function): Tweak docstring. GitHub-reference: fix joaotavora/eglot#125
@MaskRay Since you participated in this discussion and in the definition of Also, do you know about any servers that already implement |
https://microsoft.github.io/language-server-protocol/specification#text-documents
The spec says that :character offsets are based on a UTF-16
representation. Previously we were assuming that :character specifies
the number of characters.
(eglot--pos-to-lsp-position):
(eglot--lsp-position-to-point): Use it.