fulltext search #95
cc @comex I think you had some thoughts on this.
cc @DivineDominion - you might have some thoughts on this; it relates to your comment seongjaelee/nvatom#35 (comment)
I have about 50 GB of "code" in the form of Git repos with long histories across a corporate codebase. GitHub's search is abysmal in terms of functionality and ripgrep is the best thing since sliced bread, so I'm keen to wrap ripgrep in a minimal web UI. To begin with, I'll be happy to do something as crude as shelling out to rg for every query. I remember reading in your blog post that the multitude of stat calls during directory traversal takes a non-trivial amount of time, so I'm wondering whether a long-lived rg process that keeps that information cached would help. Would this be a useful optimisation?

The 50GB in question consists of ~700K files across ~1500 repos. Searches typically take about 3 seconds. I'm not sure if rg yet has built-in profiling to report how much time is spent in the directory traversal and text scanning stages.
@aidansteele To be clear, this issue is about fulltext search. It's not an issue about turning ripgrep into a daemon. :-) Turning ripgrep into a daemon is one particular implementation path for fulltext search, because fulltext search generally implies the management of an index. It's hard to imagine that working well without some kind of daemon watching files, but that seems more like a convenience to me. It should always be possible to use ripgrep without a daemon, whereby the end user updates the index manually. OK, with that said, I'll respond more directly.
This sounds exciting. If you ever want to chat more about this, I'd be happy if you opened a new more targeted issue.
Thanks! Unfortunately, there's no low hanging fruit right now. Fulltext search is merely an idea in its infancy. (It's totally not even clear to me whether it should be a separate tool or not. I could make a very good argument for either way.) The most important thing anyone can do at this point in the game is help me come up with use cases. For example, if you could share more about how you envision your interaction with the tool, that would be great. Details about stat calls and daemons are better left out, but for example, things like this are useful:
Some of these are lofty goals---and they might not ever happen---but they are examples of the kind of thing I'm interested in at this point in time.
To give you an idea of how premature this is, I wouldn't expect rg-as-a-daemon to be relevant for at least another year, and that's only if we start working on the fulltext side of things very soon.
A non-trivial amount of time is spent not only in directory traversal, but also in matching gitignore rules against each file path. I wouldn't be surprised if it shaved off a third of your search time, but I wouldn't expect much more than that. In any case, even if we do rg-as-a-daemon, the point isn't so much to "save on directory traversal" as it is to "keep an index up to date."
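(As a rough sketch of what "caching the traversal" could mean, and assuming the `ignore` crate, which is the library behind ripgrep's gitignore-aware walking: a hypothetical daemon could build a file list like this once, keep it in memory, and refresh it from filesystem events instead of re-walking the tree for every search. The function name and structure here are illustrative only.)

```rust
use std::path::PathBuf;

use ignore::WalkBuilder;

/// Illustrative sketch, not part of ripgrep: walk a tree once, respecting
/// .gitignore and friends, and return the file list so later searches can
/// skip traversal and ignore-matching entirely.
fn cached_file_list(root: &str) -> Vec<PathBuf> {
    WalkBuilder::new(root)
        .build()
        .filter_map(Result::ok)
        .filter(|entry| entry.file_type().map_or(false, |ft| ft.is_file()))
        .map(|entry| entry.path().to_path_buf())
        .collect()
}
```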
Since you asked for a use case: I wish I had the power of ripgrep behind the file index of my note-taking app.
@DivineDominion Thanks! Can you give more details on the precise interaction you'd want to see? I'm not familiar with your note-taking app (or Macs, for that matter), so I don't understand what the specific integration points are. For example, can you say more about your "file index"? Do you have enough data that an index is beneficial?
I'd like to stress another point here: fulltext search isn't necessarily about making ripgrep faster. It might help for truly large-scale workloads ("I want to search a 50GB repo faster than 3 seconds"), but an index isn't going to do much for the average Joe. What I'm particularly interested in doing is ranking results as well. (And this is kind of why I might consider pushing this to a new tool, because I'm not sure how well ranking meshes with a CLI search tool.)
In an ideal world, the app offers auto-completion of note titles while you search and filters the results live. (It's already pretty fast with C's strstr().) Hitting the drive for every search request (i.e. every n-th keystroke) taxes the hardware a bit too much for my taste, though, so I'm working on an index at the moment. Nothing clever, just full-text search based on word components.
Like this one here (Notational Velocity/nvALT):
https://www.youtube.com/watch?v=vP-rLLKL_6U&t=0m55s
Edit:
Yes, for my purposes the appeal of an index only becomes real when you can query it in memory. It's more like 100k notes with 200 words each that I'm dealing with. Relevancy ranking would be great for my purposes, too :)
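For what it's worth, here is a minimal sketch of the kind of in-memory word index described above (my own illustration, not the app's actual code): each lowercased word maps to the set of note ids containing it, and a query intersects the sets for its words. Relevance ranking would be layered on top of this.

```rust
use std::collections::{HashMap, HashSet};

/// Toy in-memory index: each lowercased word maps to the ids of the notes
/// that contain it. Indexing "word components" (prefixes, stems) would just
/// mean adding extra keys per word.
#[derive(Default)]
struct NoteIndex {
    postings: HashMap<String, HashSet<usize>>,
}

impl NoteIndex {
    fn add(&mut self, note_id: usize, text: &str) {
        for word in text.split(|c: char| !c.is_alphanumeric()) {
            if !word.is_empty() {
                self.postings
                    .entry(word.to_lowercase())
                    .or_default()
                    .insert(note_id);
            }
        }
    }

    /// Ids of notes containing every word of the query (AND semantics).
    fn search(&self, query: &str) -> HashSet<usize> {
        let mut result: Option<HashSet<usize>> = None;
        for word in query.split_whitespace() {
            let ids = self
                .postings
                .get(&word.to_lowercase())
                .cloned()
                .unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).copied().collect(),
            });
        }
        result.unwrap_or_default()
    }
}
```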
I'm a little late to this issue, but I'll share this in case it helps @BurntSushi with any ideas. I've been maintaining and using the https://github.com/alphapapa/helm-org-rifle Emacs package for about a year and a half now. It works well for searching Org-mode files that are already loaded into Emacs buffers, but for unopened files, it has to wait for Emacs to load each file before it can search it. Emacs/Org users' styles vary: some use a few large Org files, while others use many smaller files, sometimes in a deep hierarchy of directories. For example, one user uses a deep hierarchy of thousands of Org files essentially as a museum's database system. Searching with helm-org-rifle works, but it takes some time to initially load all of the files into Emacs.

So for some time I've been looking for and thinking about solutions for searching Org files with external processes; that way Emacs would only have to load the files that the user selects results from. And searching text with an optimized binary like ripgrep would be much faster than doing it in Emacs Lisp. Now of course, any plain-text search tool works in the general sense, but what I really want is matching at the level of whole Org entries (a heading plus its content), not individual lines. The closest I've come to finding a solution still falls short, and I've looked at just about every similar tool I can find. So, for me, I wish ripgrep supported that kind of full-text, multi-line matching.

Maybe these requirements are too specific for a general-purpose tool like ripgrep. Thanks.
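As a toy illustration of the entry-level matching being asked for here (my own sketch, with a deliberately simplified notion of an Org entry as a heading line plus everything up to the next heading), the point is that the match is evaluated against the whole entry rather than a single line:

```rust
/// Split an Org-mode document into entries: each entry starts at a line
/// beginning with '*' (a heading) and runs until the next heading. Real Org
/// structure (heading levels, drawers, properties) is richer than this.
fn org_entries(text: &str) -> Vec<&str> {
    let mut starts = Vec::new();
    let mut offset = 0;
    for line in text.split_inclusive('\n') {
        if line.starts_with('*') {
            starts.push(offset);
        }
        offset += line.len();
    }
    starts
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = starts.get(i + 1).copied().unwrap_or(text.len());
            &text[start..end]
        })
        .collect()
}

/// Entries whose full text (heading plus body) contains the query, so a
/// match may span lines as long as they belong to the same entry.
fn matching_entries<'a>(text: &'a str, query: &str) -> Vec<&'a str> {
    org_entries(text)
        .into_iter()
        .filter(|entry| entry.contains(query))
        .collect()
}
```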
Just chiming in to say I've been working on a code search service similar to Hound (using the same frontend, actually) powered by ripgrep under the hood. It's looking good so far, but this is going to be running across a massive codebase, so I imagine an index would help for performance. (Specifically, the codebase is the entire WordPress.org plugins repository, which is some 30GB+ on disk.) It would also be quite useful to be able to rank the results, although it's probably not our main use case. (As far as I could see, there also wasn't an actual rg-as-a-daemon thread apart from this one.)
@rmccue Wow, that's very cool! I'll be excited to see the results of that. :-) My own personal leanings have somewhat shifted from "add fulltext search to ripgrep" to "fulltext search should probably be a completely different tool, albeit with large swathes of shared implementation details with ripgrep." Note though that I am specifically thinking of ranking results, which is perhaps too ambitious. :)
I'm going to close this. It's a nice idea, but it is very nascent, and the more I've thought about this, the more I think it probably should be a separate tool. We can always revisit it.
I'm a bit curious about this topic now, as I am running into trouble scaling up Hound. I'd love to hear what alternative directions, perhaps involving Rust or ripgrep, we could explore.
There are alternatives to Hound, such as Zoekt and Sourcegraph (disclaimer: I work at Sourcegraph).
Thanks for the suggestions. Zoekt looks quite primitive, but I will invest some time into researching it. (The docs don't seem to mention regex, though.) Sourcegraph is one of the solutions we have been looking into; it seems to be a bit complex.

Perhaps what I am hoping for is the ability to use ripgrep (as well as equivalent tools) on top of some kind of pre-built index. ripgrep itself does not need to do the indexing, but it could support some sort of index file/cache specification that could speed up the search (plaintext or regex). Any comment would be much appreciated @BurntSushi
I don't understand what kind of answer you're hoping for here. I closed this issue and haven't given much thought to it since.
It already has one? You can already tell ripgrep to search only a specific set of files.

To be honest, I still have a desire to build a tool like this, but I just do not have the time. My focus right now is on other improvements.

Many of the technical pieces required for building a tool like this already exist in the Rust ecosystem. You don't even need ripgrep. ripgrep's core is just a shell (albeit, a complex one) around a bunch of libraries. The index itself could be built with existing libraries.

But the success of a tool like this, IMO, is heavily dependent on its engineering quality, and in particular, its reliability. This is very hard to do, because the speed of the tool is the main selling point, and the speed is in turn dependent on an out-of-band index. That index needs to encapsulate state from the files being searched in a very reliable way. If the index's state gets out of sync, then searches are going to turn up incorrect or missing results. (A small number of false positives are probably tolerable, but false negatives are probably never tolerable.) This state synchronization, IMO, is the most significant initial hurdle that one has to overcome to build a tool like this. Because if you give wrong results, users aren't going to trust the tool, and if users don't trust the tool, they just aren't going to use it.
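To illustrate the state-synchronization problem in the smallest possible terms, here is a hypothetical sketch (the names and structure are mine, and a real tool would likely use content hashes and filesystem notifications rather than just size and mtime): the index records a fingerprint per file at index time, and any mismatch at query time means that file must be re-indexed before its results can be trusted.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// What a hypothetical index remembers about a file's on-disk state at
/// index time. Size + mtime is the cheapest approximation of "has this
/// file changed?"; a real tool would likely hash contents instead.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Fingerprint {
    len: u64,
    mtime: SystemTime,
}

fn fingerprint(path: &Path) -> io::Result<Fingerprint> {
    let md = fs::metadata(path)?;
    Ok(Fingerprint { len: md.len(), mtime: md.modified()? })
}

/// The piece of index state that has to stay in sync with the filesystem.
struct IndexState {
    files: HashMap<PathBuf, Fingerprint>,
}

impl IndexState {
    /// Files whose on-disk state no longer matches what was indexed; these
    /// must be re-indexed (or searched directly) before results from them
    /// can be trusted, otherwise we risk false negatives.
    fn stale_files(&self) -> Vec<PathBuf> {
        self.files
            .iter()
            .filter(|&(path, &recorded)| {
                fingerprint(path).map(|now| now != recorded).unwrap_or(true)
            })
            .map(|(path, _)| path.clone())
            .collect()
    }
}
```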
If only you were an
Yes, exactly. John Kitchin came up with a SQLite-based indexing tool for Org files, and I used it as a base for a prototype, which works fine, but the issue is expiring old data and reindexing whole files from scratch (because any change in the plain-text source file changes the positions of all the nodes later in the file). The bottleneck seems to be deleting all the old table rows when a file changes. Anyway, having full-text/multi-line matching in ripgrep would provide adequate performance for searching many large Org files, even without an out-of-band index. I've tried all the grep-like tools I could find, but none of them were adequate to the task.
@alphapapa I don't think you're just asking for fulltext search, you're asking for something that is aware of the semantic nature of the content that is being searched. As long as you don't have gigabytes of org mode files, I imagine your use case could be solved quite well without any index at all. You'd just need something that knows how to parse org mode files to the extent that it can show the output you want. I think fulltext search, at least as I conceptualize it here, is primarily about scale, and secondarily, potentially about relevance ranking. I think that in order to do relevance ranking well, particularly with code, you probably do need some semantic awareness. But it doesn't necessarily correspond to the same semantic awareness as needed for showing each result. But yeah, building a search engine is a lot of work. :-)
Have you heard of the percolate concept? https://docs.manticoresearch.com/2.7.5/html/searching/percolate_query.html. I learned about this through Elasticsearch, but I think this page explains it well. Essentially it's search in reverse: you run documents through a "compiled" query. I think that would be a way to accomplish this feature, and it could work quite quickly / on the fly, because you only need to index the current search document.
@rmchale Sorry, but I'm not sure I understand the relevance of percolate queries to ripgrep. Could you please be more specific?
See also: #1497
It's an inside-out search, so you don't have to create a daemon. You index the query and send every document through the query.
I think you're just repeating what the link says. :) What I don't understand is why I would want to do that for ripgrep. See my linked issue. There is no plan to make a daemon. The plan is to build an embedded IR engine, just like how Lucene works.
As far as I can tell, percolate queries aren't an alternative to indexes. They still read through the whole search corpus just like a non-indexed search. The use case is if you have a list of N different queries you want to run on the same corpus (you have to know all the queries in advance); with percolate queries you can read through the corpus one time instead of N times, while still giving each query its own results list. But that doesn't really match the usage model of ripgrep.
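A rough sketch of that model using the regex crate's RegexSet (my own illustration, not a proposal for ripgrep, and the documents and patterns are made up): the queries are compiled once up front, and every document is streamed through them in a single pass, with each query accumulating its own result list.

```rust
use regex::RegexSet;

fn main() {
    // Hypothetical corpus; in the ripgrep analogy these would be the files
    // that survive ignore/type filtering.
    let docs = [
        ("notes.org", "TODO buy milk"),
        ("readme.md", "ripgrep is fast"),
    ];

    // "Index the queries": compile all of them once up front...
    let queries = RegexSet::new([r"\bTODO\b", r"ripgrep"]).unwrap();

    // ...then make a single pass over the corpus, letting each query
    // accumulate its own list of matching documents.
    let mut hits: Vec<Vec<&str>> = vec![Vec::new(); queries.len()];
    for &(name, body) in &docs {
        for query_index in queries.matches(body).iter() {
            hits[query_index].push(name);
        }
    }

    for (pattern, matched) in queries.patterns().iter().zip(&hits) {
        println!("{pattern}: {matched:?}");
    }
}
```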
ripgrep seems to be super fast at determining which files to run the regex/query/search on, but it doesn't know anything about the files it's searching until run time. Typically, search engines like Lucene (and, I think, what you are proposing) require that you have an index built up in order to return results. This seems a bit counter-intuitive to how ripgrep currently works. With percolate (or lucene-monitor), you don't have to build up an index of the documents: you take the query (or multiple queries), compile it (index it, in search terms; this should be super quick if there's only one), and run it against every document that matches the ripgrep file result set. You would throw away the index after every call to ripgrep, because most likely you would be indexing a new search the next time rg was called. To me at least it seems like a more intuitive fit for how ripgrep works.
That isn't any different from what ripgrep already does.
I agree it's similar to what rg is doing, except it doesn't support full-text search?
Sounds good |
One of the things I've wanted ripgrep to do from before I even started writing it was fulltext search. General fulltext search is hard, but I wonder how far we can get by focusing on fulltext search for code. I know there are some tools out there already that aspire to do this.

I'd like to start collecting use cases for functionality like this. In particular, most of the technical problems are already solved. Obviously, we have ripgrep, and we have a lightning fast data structure for representing an inverted index already (indeed, it's the same as what Lucene uses internally). I think the key problems remaining are figuring out the user interaction story. Some things off the top of my head: should this live in ripgrep itself? Can use of an index be done seamlessly while still being a top-notch general-purpose search tool?

Anyway, I'd like to start thinking about this. I don't know when I'll start on an implementation, but it'd be good to get ideas from other folks.
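As a hedged illustration of the inverted-index data structure mentioned above, assuming it refers to the fst crate (finite state transducers, the same approach Lucene uses for its term dictionary): a term dictionary could map each term to a slot in a table of posting lists, along these lines. The terms, doc ids, and layout here are purely illustrative.

```rust
use fst::Map;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Postings: for each term, the list of document ids containing it.
    // (The doc ids here are made up for illustration.)
    let postings: Vec<Vec<u32>> = vec![
        vec![0, 2],    // "fulltext"
        vec![1, 2, 3], // "ripgrep"
        vec![3],       // "search"
    ];

    // Term dictionary: an FST mapping each term (keys must be inserted in
    // lexicographic order) to the index of its posting list above.
    let terms = Map::from_iter(vec![
        ("fulltext", 0u64),
        ("ripgrep", 1),
        ("search", 2),
    ])?;

    // Query: look the term up in the FST, then fetch its posting list.
    if let Some(slot) = terms.get("ripgrep") {
        println!("docs containing 'ripgrep': {:?}", postings[slot as usize]);
    }
    Ok(())
}
```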