Language Server Index Format #623

Closed
dbaeumer opened this issue Dec 4, 2018 · 40 comments

@dbaeumer
Member

dbaeumer commented Dec 4, 2018

The purpose of the Language Server Index Format (LSIF) is to define a standard format for language servers or other programming tools to dump their knowledge about a workspace. This dump can later be used to answer language server LSP requests for the same workspace without running the language server itself. Since much of the information would be invalidated by a change to the workspace, the dumped information typically excludes requests used when mutating a document. So, for example, the result of a code complete request is typically not part of such a dump.

A first draft of a specification is available here
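To make the idea concrete, here is a minimal sketch of answering a go-to-definition request from precomputed data alone. The shapes (`Pos`, `Span`, `Loc`, `DefinitionEntry`) and the function `definitionFromDump` are hypothetical and heavily simplified; the actual LSIF schema is the one defined in the draft specification, not this one.

```typescript
// Hypothetical, simplified shapes; not the LSIF schema from the specification.
interface Pos { line: number; character: number; }
interface Span { start: Pos; end: Pos; }
interface Loc { uri: string; span: Span; }

// One precomputed answer: a definition request inside `at` resolves to `definition`.
interface DefinitionEntry { at: Loc; definition: Loc; }

// True when `pos` lies inside `span` (both ends inclusive).
function contains(span: Span, pos: Pos): boolean {
  const afterStart = pos.line > span.start.line ||
    (pos.line === span.start.line && pos.character >= span.start.character);
  const beforeEnd = pos.line < span.end.line ||
    (pos.line === span.end.line && pos.character <= span.end.character);
  return afterStart && beforeEnd;
}

// Answers the request from the dump only; no language server is running.
function definitionFromDump(dump: DefinitionEntry[], uri: string, pos: Pos): Loc | undefined {
  const hit = dump.find(e => e.at.uri === uri && contains(e.at.span, pos));
  return hit ? hit.definition : undefined;
}

const dump: DefinitionEntry[] = [{
  at: { uri: "file:///src/main.ts", span: { start: { line: 3, character: 4 }, end: { line: 3, character: 7 } } },
  definition: { uri: "file:///src/util.ts", span: { start: { line: 0, character: 16 }, end: { line: 0, character: 19 } } },
}];

const result = definitionFromDump(dump, "file:///src/main.ts", { line: 3, character: 5 });
```

Requests that only make sense while editing, such as code completion, have no entry in a dump like this, which matches the exclusion described above.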

@dbaeumer dbaeumer added the lsif label Dec 4, 2018
@rcjsuen
Contributor

rcjsuen commented Dec 4, 2018

@dbaeumer Not sure if I'm misinterpreting something here, but is it correct to say that this is something for LSP clients to implement and not for servers?

@LaurentTreguier
Contributor

LaurentTreguier commented Dec 4, 2018

I understood it as something dumped by servers, and then used by clients afterwards, so both would have their part to implement

@rcjsuen
Contributor

rcjsuen commented Dec 4, 2018

Hm...good point. I had interpreted it as more of a caching system for the LSP client. However, I guess the first sentence should've made it clear to me... :(

The purpose of the Language Server Index Format (LSIF) is to define a standard format for language servers or other programming tools to dump their knowledge about a workspace.

Well then...!

@dbaeumer
Member Author

dbaeumer commented Dec 4, 2018

@rcjsuen no, the idea is that this is dumped either by the server or by a separate tool. As mentioned in the spec I have already written a tool for TypeScript and a generic extension that serves the dump via LSP to any kind of LSP client. I will make these open source in the next couple of days.

@tsmaeder

tsmaeder commented Dec 5, 2018

So if I understand this correctly, this is supposed to be a library for helping language server implementers solve the problem of maintaining an index? If so, why does it need a specification as opposed to simply a documentation of the code? What are the reuse cases?

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

Looks like I was pretty bad at explaining it: the goal is that we can produce an index to answer LSP requests for read-only workspaces without firing up a language server specific to the programming language. There will be one generic language server that can serve the index. Furthermore the index will allow relating symbols across repositories. See the demo here where Jonathan navigates from the use of observable to the actual definition in source.

@svenefftinge
Contributor

Sorry, still not sure I got it. Is it some sort of cache middleware for language servers?
If so why does it need to be part of the protocol?

@mickaelistria

If so why does it need to be part of the protocol?

I am glad the LSP protocol also includes some technical proposals, middleware or intermediary formats to allow combination of different language servers.

So this index format will come with an implementation of a language server able to process multiple indexes to return results? How is this "composite index-based language server" expected to know which LS to retrieve indexes from or how to get indexes?
And will the LSP be extended so that existing LSs can dump an index and then shut down, assuming the index doesn't have to change?

@svenefftinge
Contributor

to allow combination of different language servers

Where did you read that this would be supported? It would indeed be useful if language servers could access a common cross-language index.

I am glad the LSP protocol also includes some technical proposals, middleware or intermediary formats

I was not implying it should not, but wanted to understand why and how it depends on the LSP (technically).

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

Some clarifications: the LSIF will not be part of the protocol itself since it is not a protocol. What might be part of the protocol are requests to ask a server to dump its state.

Why did we decide to put it here: the LSIF is based on LSP data types. The questions that are answerable by a dump are typical LSP requests useful on a read only workspace (for example goto definition, find all references).

Yes, we have developed a generic language server that can read in many indices and serve LSP requests on them. So it can serve a C# index in parallel to a TS index. I will make the TS index generator and the generic language server with a VS Code extension public soon. Will add a message here when available.

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

Here we go:

@tsmaeder

tsmaeder commented Dec 5, 2018

I'm still a bit hazy on the motivation here: are you trying to solve a latency problem? Also, what's the use case for "read-only workspaces"?

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

It is more about repositories and published versions. In my projects I usually have dependencies on many other npm packages, each at a certain version. To be able to navigate and browse them there is no need to spin up a whole language server (no need for code complete, signature help, ...). If there were an index it would be relatively cheap to serve these and to support navigating to them even without cloning the repository locally.

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

Shown here to navigate from one source base to mobx sources on Github.

@tsmaeder

tsmaeder commented Dec 5, 2018

It is more about repositories and published versions. In my projects I usually have dependencies on many other npm packages

Ahh...precomputed indexes. We've been thinking about this for jdt for a long time :-)
However, if you think about maven, for example, the dependencies are not in the workspace tree (like npm deps). I'll have to think about this, but I'm not sure this is not better solved inside the language server.

@ShaneDelmore

I would love to use this to precompute indexes in CI. For devs working in large repos this would be very helpful.

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

@tsmaeder if you look at the specification then this is basically split into two passes. The LS or the language tool will generate monikers specific to the tool. A linker tool will make them package manager specific by consulting other information. We even split this for TS / npm to demonstrate that embedding this into the Language tool is problematic. So the idea is more one of a compiler and then a symbol linker.
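The two passes can be sketched roughly as follows. This is only an illustration under stated assumptions: the shapes `ToolMoniker`, `PackageMoniker` and `PackageInfo`, the function `linkMoniker`, the identifier layout, the paths and the version number are all hypothetical and are not taken from the specification or from the TS / npm tools.

```typescript
// Pass 1 output (hypothetical shape): a moniker that only makes sense to the tool that emitted it.
interface ToolMoniker { scheme: "tsc"; file: string; symbol: string; }

// Pass 2 output (hypothetical shape): the same symbol expressed in package manager terms.
interface PackageMoniker { scheme: "npm"; identifier: string; }

// The "other information" the linker consults, e.g. data read from a package manifest.
interface PackageInfo { packageName: string; version: string; rootDir: string; }

// The linker pass: rewrites a tool-specific moniker into a package-manager-specific one.
// Returns undefined when the file does not belong to the package.
function linkMoniker(m: ToolMoniker, pkg: PackageInfo): PackageMoniker | undefined {
  const prefix = pkg.rootDir.endsWith("/") ? pkg.rootDir : pkg.rootDir + "/";
  if (!m.file.startsWith(prefix)) return undefined;
  const relative = m.file.slice(prefix.length);
  // Assumption: package name, version and package-relative path together identify the symbol.
  return { scheme: "npm", identifier: `${pkg.packageName}@${pkg.version}:${relative}:${m.symbol}` };
}

// Made-up example data.
const toolMoniker: ToolMoniker = { scheme: "tsc", file: "/work/mobx/lib/mobx.d.ts", symbol: "observable" };
const mobxInfo: PackageInfo = { packageName: "mobx", version: "5.0.0", rootDir: "/work/mobx" };
const linked = linkMoniker(toolMoniker, mobxInfo);
```

Keeping the linker separate means the language tool never has to know about npm, which is the compiler and symbol linker split described above.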

@felixfbecker

Is the JSON graph format built on any standard JSON graph representation? I would assume there are already existing formats for that which would be nice to build on, since there may be existing tools that can read/generate them (e.g. save in database, query, visualise, build, etc.)

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

@felixfbecker yes and no: it uses the same property names as graphSON does, like label, inV, outV, but doesn't fully emit standard graphSON. The reason is that graphSON is optimized so that the output can be processed by different servers, whereas we focused on easy and early emit. But the TSC example contains a graphSON emitter: https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/emitters/graphSON.ts#L1. But I haven't tested it for a while.
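A rough sketch of what "easy and early emit" could look like: every vertex and edge is written out as one JSON line as soon as it is known, and edges refer to vertices through outV and inV. The class `LineEmitter`, the exact element shapes, and the labels used here (`document`, `range`, `contains`) are illustrative assumptions; the specification is the authority on the real element set.

```typescript
// Ids may be numbers or strings.
type Id = number | string;

// Simplified element shapes using the property names mentioned above.
interface Vertex { id: Id; type: "vertex"; label: string; [extra: string]: unknown; }
interface Edge { id: Id; type: "edge"; label: string; outV: Id; inV: Id; }

// Hypothetical emitter: each element becomes one JSON line the moment it is created,
// so nothing has to be held back until the whole graph is known.
class LineEmitter {
  readonly lines: string[] = [];
  private nextId = 1;

  vertex(label: string, extra: Record<string, unknown> = {}): Id {
    const v: Vertex = { ...extra, id: this.nextId++, type: "vertex", label };
    this.lines.push(JSON.stringify(v));
    return v.id;
  }

  edge(label: string, outV: Id, inV: Id): Id {
    const e: Edge = { id: this.nextId++, type: "edge", label, outV, inV };
    this.lines.push(JSON.stringify(e));
    return e.id;
  }
}

const emitter = new LineEmitter();
const docId = emitter.vertex("document", { uri: "file:///src/main.ts" });
const rangeId = emitter.vertex("range", { start: { line: 3, character: 4 }, end: { line: 3, character: 7 } });
emitter.edge("contains", docId, rangeId);
```

A consumer can read such output line by line and rebuild the graph, or convert it to another graph format in a separate step.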

@svenefftinge
Contributor

I still don't understand how the index-based LS and the 'real' LS would work together.
Having a brief look at the repos you shared, it looks like they would simply both be registered for the same language. If that is the case I don't understand how e.g. find references would work. Could you say a few words about that or point me to some information?

@dbaeumer
Member Author

dbaeumer commented Dec 5, 2018

That depends on how we decide, in the end, to run the indexer. We haven't made any decision on this. Options are:

  • standalone tool.
    • Then you need two servers and someone to hot swap them
    • or you need to pre-index
    • or you need to wait
  • embedded in the language server. Then after the state is dumped it could be served from there
  • or some other combination

See also: microsoft/lsif-node#6

I am fully open for ideas here.

@tsenart

tsenart commented Dec 6, 2018

embedded in the language server. Then after the state is dumped it could be served from there

With this approach, could we extend LSIF to be a write-through cache? This would support incremental lazy LSIF cache filling which could be merged with asynchronous full "dumps" over time.

If you squint hard enough, this would look similar to the Lambda Architecture, where incoming LSP requests that lazily fill the LSIF cache would map to the Streaming and Serving layers and the asynchronous full "dumps" would map to the Batch layer.

@tsenart

tsenart commented Dec 6, 2018

Addendum: In theory, from what I can see, even a generalized LSIF caching proxy for different language servers would work. So there wouldn't be a need to change each language server individually.
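As a rough illustration of such a proxy, the sketch below keeps a lazily filled cache in front of a backend that stands in for any language server, and lets a full dump be merged in later. Everything here (`IndexRequest`, `Backend`, `CachingProxy`, `cacheKey`) is hypothetical, answers are plain strings, and calls are synchronous for brevity, whereas real LSP traffic is asynchronous JSON-RPC.

```typescript
// Hypothetical request shape; a real proxy would forward full LSP messages.
interface IndexRequest { method: string; uri: string; line: number; character: number; }

// Stands in for a real language server.
type Backend = (req: IndexRequest) => string;

function cacheKey(req: IndexRequest): string {
  return [req.method, req.uri, req.line, req.character].join("|");
}

class CachingProxy {
  private readonly cache = new Map<string, string>();
  private readonly backend: Backend;

  constructor(backend: Backend) { this.backend = backend; }

  // Merges the results of an asynchronous full dump into the cache.
  loadDump(entries: Array<[IndexRequest, string]>): void {
    for (const [req, answer] of entries) this.cache.set(cacheKey(req), answer);
  }

  // Serves from the cache when possible; otherwise asks the backend and records the answer.
  handle(req: IndexRequest): string {
    const k = cacheKey(req);
    const cached = this.cache.get(k);
    if (cached !== undefined) return cached;
    const answer = this.backend(req);
    this.cache.set(k, answer);
    return answer;
  }
}

let backendCalls = 0;
const proxy = new CachingProxy(req => { backendCalls++; return `answer for ${req.uri}:${req.line}`; });
const sampleRequest: IndexRequest = { method: "textDocument/definition", uri: "file:///a.ts", line: 1, character: 2 };
const first = proxy.handle(sampleRequest);
const second = proxy.handle(sampleRequest);
```

The sketch ignores invalidation entirely; any change to the workspace would make cached answers stale, which is the same reason the dump excludes requests tied to document mutation.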

@matklad
Contributor

matklad commented Dec 8, 2018

Might be a good idea to take a look at the kythe schema, which seems to serve a similar-ish purpose. The primary difference is that kythe does try to define symbols & references, while LSIF works purely on the Ranges & Offsets. That is, you can map the kythe model to LSIF, but not vice versa. Which is a good thing: LSIF seems much simpler and can be used directly!

@dbaeumer
Member Author

@matklad we looked at kythe and other symbol databases and then purposely decided not to use one. Mainly for the reasons you pointed out.

@robinp

robinp commented Jan 15, 2019

@dbaeumer @matklad I put together a quick list of first impression differences between LSIF and Kythe: https://gist.github.com/robinp/76f9d3d91387da5162f773895d4e1d15. Disclaimer: I don't know much about LSP/LSIF other than browsing the spec and the query docs a bit, and I am somewhat biased towards Kythe due to previous work with it, so offset that.

@yaohaizh

One use case of LSIF for Java is that before the language server is initialized, which might take a while, the client can use the LSIF dump to unblock some smartness scenarios immediately after the user opens the workspace.

@akaroml
Member

akaroml commented Jan 17, 2019

One use case of LSIF for Java is that before the language server is initialized, which might take a while, the client can use the LSIF dump to unblock some smartness scenarios immediately after the user opens the workspace.

This can be very useful for the warm load case. The language server knowledge can be persisted with LSIF in the previous session. And the knowledge can be used to enable basic language server features like symbol navigation in the new session before the actual language server finishes loading the project.

@fbricon This is something we would like to try for the Java language server.
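A minimal sketch of the warm load idea, assuming a hypothetical `DefinitionSource` interface and `WarmStartSource` class: requests are answered from the index persisted in the previous session until the real language server reports that it is ready, and from the live server afterwards. Answers are plain strings here and the call is synchronous, which a real client would not do.

```typescript
// Hypothetical abstraction over "something that can answer a definition request".
interface DefinitionSource {
  definition(uri: string, line: number, character: number): string | undefined;
}

class WarmStartSource implements DefinitionSource {
  private live: DefinitionSource | undefined;
  private readonly persisted: DefinitionSource;

  constructor(persisted: DefinitionSource) { this.persisted = persisted; }

  // Called once the real language server has finished loading the project.
  serverReady(live: DefinitionSource): void { this.live = live; }

  definition(uri: string, line: number, character: number): string | undefined {
    // Prefer the live server; fall back to the index from the previous session.
    const source = this.live !== undefined ? this.live : this.persisted;
    return source.definition(uri, line, character);
  }
}

const fromLastSession: DefinitionSource = { definition: () => "file:///src/util.ts (persisted index)" };
const liveServer: DefinitionSource = { definition: () => "file:///src/util.ts (live server)" };

const warm = new WarmStartSource(fromLastSession);
const beforeReady = warm.definition("file:///src/main.ts", 3, 5);
warm.serverReady(liveServer);
const afterReady = warm.definition("file:///src/main.ts", 3, 5);
```

The persisted answers may be out of date if files changed between sessions, so a client would probably limit this to navigation features and switch over as soon as the server is ready.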

@dbaeumer
Member Author

@yaohaizh nice use case.

@dbaeumer
Member Author

@robinp thanks a lot for the comparison. Some first feedback to the feedback:

  • I want to reiterate that LSIF (as LSP) is not providing any symbol information. This is by design. It provides data structures to navigate code using editor abstractions (see https://microsoft.github.io/language-server-protocol/overview). Therefore it doesn't need to specify any programming language specific constructs. The downside is of course that the LSP and LSIF can only answer questions that are foreseen.

  • We have an issue for providing a compressed format. We started with the verbose JSON one since it is easier to read and understand.

  • Regarding: At Google, such approach ran into problems due to codebase size: can you provide more insights here? We chose that approach since we thought it is more scalable, since code can be indexed independently.

  • Regarding having two libraries: we are discussing in which form the moniker needs to have version information to better support this.

  • We are also discussing whether an edge should have a scope property to support different versions of the same file in different contexts (for example header files in cpp)

  • Scalability: since LSP scales for large projects I am confident that LSIF does as well assuming that we have a more compact format than JSON.

You might be also interested in https://github.com/Microsoft/lsif-typescript/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc

@tsmaeder

One problem I see with LSIF is that some queries depend on knowing the whole program. Imagine you index a maven project that declares an interface with a method foo(). When we try to find implementers of "foo", the answer depends on what the user has open in his workspace. It's even worse: the language server might determine that a particular declaration of "foo" is not an implementation (maybe because it's from a different version of the project, not the one the code in the workspace compiles against).

@robinp

robinp commented Jan 22, 2019

@dbaeumer Thanks! Sounds fair. Re problems due to codebase size I meant what you list as next, the multi-version moniker issue, where references can become ambiguous without exact versioning. I didn't mean some performance issue.

@dbaeumer
Member Author

@tsmaeder we discussed this lately and one idea was that LSP adds support to resolve a moniker and that we could have a global index where these can be mapped to LSIF dumps.
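One possible reading of that idea, as a sketch only: a global index that maps a moniker to the LSIF dumps that define it. The class `MonikerIndex`, the moniker strings and the dump locations are all hypothetical. Several dumps may be registered for the same moniker (for example different versions of a package), so resolving returns every candidate and leaves the choice to the caller.

```typescript
// Hypothetical global index: moniker string -> locations of dumps that define it.
class MonikerIndex {
  private readonly dumpsByMoniker = new Map<string, string[]>();

  // Records that `dumpLocation` contains the definition for `moniker`.
  register(moniker: string, dumpLocation: string): void {
    const known = this.dumpsByMoniker.get(moniker) || [];
    if (known.indexOf(dumpLocation) === -1) known.push(dumpLocation);
    this.dumpsByMoniker.set(moniker, known);
  }

  // Returns a copy of all candidate dumps; an empty array when the moniker is unknown.
  resolve(moniker: string): string[] {
    return [...(this.dumpsByMoniker.get(moniker) || [])];
  }
}

const monikerIndex = new MonikerIndex();
monikerIndex.register("npm:mobx:observable", "dumps/mobx-a.lsif.json");
monikerIndex.register("npm:mobx:observable", "dumps/mobx-b.lsif.json");
const candidates = monikerIndex.resolve("npm:mobx:observable");
```

Picking the right candidate is exactly the whole-program problem raised above: it depends on which version the code in the workspace compiles against, which the index alone does not know.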

@mpickering

I have now implemented a program which generates LSIF indices for Haskell files. I have two main concerns so far about the format.

  1. For a 100 line file, my program has produced a (once formatted) 23000 line JSON file. I can't imagine how big the output is going to be when I try to index a project like GHC which is over 100 000 lines. It could be that my output could be compressed in some way, I didn't make a big effort to do that yet but the preliminary signs are worrying.

  2. The format is not very compositional. The assumption seems to be that a single project produces one lsif.json file. This would be fine, if there was an easy way to combine together lots of index files without doing lots of recomputation but I can't think of an elegant way to do this and be confident that ids will point to the correct things. So for a big project if you change one of the files you have to completely regenerate the lsif.json file from scratch.

I also don't understand the bit in the specification about imports/exports but there's another issue #680 about that already.

@zfy0701

zfy0701 commented Feb 21, 2019

@dbaeumer this is awesome for LSP! The biggest concern I have is the numerical value of the vertex id. It has several limitations as far as I can see:

  • if the indexer is not perfect, it's common that it crashes and needs to recover, and using a numerical value that is incremented by the indexer makes it hard to recover from the middle
  • for a large repo, it's hard to run multiple indexing tasks in parallel
  • it's hard to do the indexing incrementally

I think the goal of this project is basically similar to Google's kythe project. They use something called vname to identify a vertex, which is basically a string value that can be generated deterministically (https://kythe.io/docs/schema/writing-an-indexer.html). I would suggest that at least we make it a string and let users override the way the id is generated.

Also, I would suggest that there should be some dump options, e.g. we may just want to index references information

@dbaeumer
Member Author

@zfy0701 the protocol defines the id as number | string and the tsc-lsif tool has support to emit UUIDs. See https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/shared/protocol.ts#L12 and --id here: https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/main.ts#L19
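Since the id may be a string, an indexer could derive ids deterministically from what a vertex describes instead of using a counter. The sketch below is one hypothetical scheme (`RangeKey` and `deterministicRangeId` are made-up names); it is not how the tsc-lsif tool produces its UUIDs and it is not Kythe's vname format.

```typescript
// The id type as described above: a number or a string.
type Id = number | string;

// Hypothetical description of what a range vertex stands for.
interface RangeKey {
  uri: string;
  startLine: number;
  startCharacter: number;
  endLine: number;
  endCharacter: number;
}

// Same input always yields the same id, independent of emit order or of which
// indexing task produced it, so a crashed or parallel run can pick up anywhere.
function deterministicRangeId(label: string, r: RangeKey): Id {
  return `${label}:${r.uri}#${r.startLine}:${r.startCharacter}-${r.endLine}:${r.endCharacter}`;
}

const key: RangeKey = { uri: "file:///src/main.ts", startLine: 3, startCharacter: 4, endLine: 3, endCharacter: 7 };
const idFromFirstRun = deterministicRangeId("range", key);
const idFromSecondRun = deterministicRangeId("range", { ...key });
```

The trade-off is size: ids like these are much longer than small integers, which matters for a format that is already considered verbose.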

@dbaeumer
Member Author

@mpickering I agree that the current version is too verbose and I have an item for this. It is microsoft/lsif-node#4.

I started to prototype a compressed JSON format that is fully array based and self describing. Will ping if I have something to comment on.

Regarding composition: the idea is that projects can be parsed independently and that import / export results can be used to link symbols between them. I will continue on #680 and look into implementing that for TypeScript.

@jdneo
Member

jdneo commented Mar 13, 2019

@tsmaeder we discussed this lately and one idea was that LSP adds support to resolve a moniker and that we could have a global index where these can be mapped to LSIF dumps.

@dbaeumer Does that mean the moniker should contain the version info for a project with different versions?

@dbaeumer
Member Author

@jdneo see the discussion here: microsoft/lsif-node#10

@dbaeumer
Member Author

I will close the issue now that we have lsif-node

@vscodebot vscodebot bot locked and limited conversation to collaborators Dec 16, 2019