ripgrep: Use the new json output #2622
Comments
Exhaustive docs for the format can be found here: https://docs.rs/grep-printer/0.1.0/grep_printer/struct.JSON.html If you have any feedback on the format, it would be great to get it! This hasn't landed in a release yet, so I can still make breaking changes to the format if there is anything seriously wrong with it.
Thanks! @marcdumais-work, then maybe we can work on this sooner rather than later? (@BurntSushi is ripgrep's author.)
@BurntSushi, I'm trying to use the JSON interface, and so far it's going well. The only thing is that we need the match position in the line in terms of characters (where one multi-byte UTF-8 character counts as one), but rg only gives the offset in bytes. I imagine most editors using rg will be interested in that value as well. Is that information readily available in rg, so that it could be output at no extra cost? Otherwise, we'll continue converting byte offsets to character offsets ourselves.
@simark Ah interesting! Generally, getting the offset in terms of characters is "the wrong thing," so I'm not sure other folks would want it. Moreover, the output isn't actually guaranteed to be UTF-8, and in that case, it's pretty difficult to determine what exactly is a character. Certainly, ripgrep never ever never deals in character offsets. It's always byte offsets. The reason "character index" is tricky to use is that "character" is difficult to define. A common but typically incorrect way of approaching this is to store your string as a series of 32-bit integers, each representing a codepoint, and to call each codepoint a character. But multiple codepoints can join together to form a grapheme, which a human might see as a single character. If you know what you're doing here with respect to Unicode, then feel free to ignore me! But if this sounds interesting to you, I'd be happy to elaborate further. :-) In short, if you do insist on character indices, you will definitely need to do that conversion yourself.
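For readers following along, here is a minimal sketch of the conversion discussed above, assuming Node.js and assuming the line is valid UTF-8 with the byte offset falling on a character boundary. The helper name `byteOffsetToCharOffsets` is hypothetical and is not part of ripgrep or Theia. It reports both UTF-16 code units (what JavaScript string indices count) and Unicode codepoints; neither is a grapheme count, per the caveat above.

```javascript
// Hypothetical helper (not from ripgrep or Theia): convert a byte offset
// into a UTF-8 encoded line to JavaScript-friendly offsets.
// Assumes `byteOffset` lies on a character boundary of valid UTF-8.
function byteOffsetToCharOffsets(lineText, byteOffset) {
  const bytes = Buffer.from(lineText, 'utf8');
  if (byteOffset < 0 || byteOffset > bytes.length) {
    throw new RangeError('byteOffset is outside the encoded line');
  }
  // Decode only the bytes that precede the match.
  const prefix = bytes.subarray(0, byteOffset).toString('utf8');
  return {
    utf16Units: prefix.length,             // index usable with String.prototype.slice
    codePoints: Array.from(prefix).length, // one per Unicode codepoint
  };
}

// "wörld" starts at byte 7 of "héllo wörld" because "é" takes two bytes.
console.log(byteOffsetToCharOffsets('h\u00e9llo w\u00f6rld', 7));
// → { utf16Units: 6, codePoints: 6 }

// An emoji outside the BMP is 4 bytes, 2 UTF-16 units, 1 codepoint.
console.log(byteOffsetToCharOffsets('\u{1F600} x', 5));
// → { utf16Units: 3, codePoints: 2 }
```

Which of the two numbers is wanted depends on the consumer: an editor API that indexes JavaScript strings usually expects UTF-16 code units, while a "one multi-byte character counts as one" definition corresponds to codepoints.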
You are right, text encoding is complex and what I am trying to do here is most likely not accurate in all scenarios. I am only considering UTF-8 for now (where the arbitrary data object has the When parsing the UTF-8 string in the
@simark Yeah unfortunately I've not used Javascript for a while, so I'm not familiar with best practices there, but I'd probably start with a UTF-8 library? https://www.npmjs.com/package/utf8
The problem is that once you do
You should get back the original, yes. If you're only dealing with
I think that code should be enough to then establish a mapping of some sort or otherwise do your conversion. IIRC, there will be one tricky part of the above: you'll need to make sure JavaScript hands you complete codepoints and not surrogates. cc @roblourens: I assume you'll need to do something like this for VS Code, and I think you're using JavaScript as the glue code. Do you have any advice here?
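As an illustration of the mapping idea and the surrogate caveat, here is a sketch assuming Node.js. The function `buildByteToCodePointMap` is a hypothetical name, not code from this thread. It relies on the fact that `for...of` over a JavaScript string iterates by codepoint, so a surrogate pair arrives as one element instead of two halves.

```javascript
// Hypothetical helper: map each codepoint's starting UTF-8 byte offset
// to its codepoint index within the string.
function buildByteToCodePointMap(text) {
  const map = new Map();
  let byteOffset = 0;
  let codePointIndex = 0;
  // String iteration yields whole codepoints, never half of a surrogate pair.
  for (const codePoint of text) {
    map.set(byteOffset, codePointIndex);
    byteOffset += Buffer.byteLength(codePoint, 'utf8');
    codePointIndex += 1;
  }
  // Also record the end of the string so that "end" offsets can be looked up.
  map.set(byteOffset, codePointIndex);
  return map;
}

const map = buildByteToCodePointMap('a\u{1F600}b');
console.log(map.get(0)); // 0 ("a")
console.log(map.get(1)); // 1 (the emoji, 4 bytes long)
console.log(map.get(5)); // 2 ("b")
console.log(map.get(2)); // undefined: byte 2 is in the middle of the emoji
```

Building the map once per line keeps each lookup cheap when a line contains several matches. It still counts codepoints, so combining sequences that render as a single grapheme are counted as several entries.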
Ah ok, that makes sense. I don't know why, but I thought there would be some normalization going on. I'll try what you suggest; it should be enough 99.9% of the time.
@simark Yeah, I guess you should double check what your implementation of JavaScript does. I wouldn't expect any normalization here, but it is possible!
I am hitting a weird case. I get different results when invoking rg from the command line than from node (from a test case run by mocha). This is the command line my test runs:
This is what the test receives from rg:
This is what I get when running the command manually from the shell:
Notice the difference in the
Any idea what could cause this?
@simark In the results from your test harness, note that the match starts with a space:
@simark Also,
Ah damn, well spotted, thanks! Sorry about the noise.
Here's the patch: https://github.com/simark/theia/commits/rg-json When a ripgrep with the JSON output is released and made available through vscode-ripgrep, we can revisit it. Thanks for your help @BurntSushi!
@simark Awesome! Thanks so much for trying it out and giving feedback!
ripgrep now seems to have an option to output results as json:
BurntSushi/ripgrep#1017
We should look into it, to replace the escape sequence parsing code that we have now.