Serialise #14
serialises the index and all its stores
loads a previously serialised index
Hi, I've been playing with the serializer. Great stuff! I'm encountering a bug, though, which I could not resolve. See http://jsbin.com/upabeb/1 for my test case (check the source). Inside is a build of the lunr.js serialise branch, followed by my dataset, followed by a minimal test script. Check the console for output. Essentially, I load a big dataset, then search for 'lady', which returns 21 results. Searching the serialised-then-deserialised index returns an empty set, so clearly some data is lost. I saw your test case, which essentially does the same, but that one runs fine. The only difference is the bigger dataset, I guess? Also, I've been looking into ways to optimize the size of the index. Currently it compiles to 900k for me. Most overhead is in the
Hey, sorry for the late reply, I've been away on holiday. That is a strange issue you are seeing; I'll take a look and see what's going on. As for the size of the serialised index, there are probably ways to store the tokenStore more efficiently. For example, several characters could be combined into a single node to reduce the number of nodes required. I haven't looked into how achievable this is though, it's just a thought.
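To make the "several characters per node" idea concrete, here is a minimal sketch. It assumes a toy one-character-per-node trie, which is not necessarily how lunr's tokenStore is laid out; `insert`, `compress` and `countNodes` are hypothetical helpers invented for this illustration.

```javascript
// Hypothetical toy trie with one node per character.
// This is NOT lunr's actual tokenStore layout, only an illustration.
function insert(root, word) {
  let node = root;
  for (const ch of word) {
    node.children[ch] = node.children[ch] || { children: {}, terminal: false };
    node = node.children[ch];
  }
  node.terminal = true;
}

// Merge chains of single-child, non-terminal nodes into one
// multi-character edge, so fewer nodes need to be stored.
function compress(node) {
  const out = { children: {}, terminal: node.terminal };
  for (const key of Object.keys(node.children)) {
    let label = key;
    let child = node.children[key];
    let keys = Object.keys(child.children);
    while (!child.terminal && keys.length === 1) {
      label += keys[0];
      child = child.children[keys[0]];
      keys = Object.keys(child.children);
    }
    out.children[label] = compress(child);
  }
  return out;
}

// Count every node in the tree, including the root.
function countNodes(node) {
  return 1 + Object.values(node.children).reduce((n, c) => n + countNodes(c), 0);
}

const root = { children: {}, terminal: false };
["lady", "ladies", "lad"].forEach((w) => insert(root, w));
const compact = compress(root);

console.log(countNodes(root), countNodes(compact)); // 8 4
```

How much this would save on a real index depends on how many tokens share prefixes, so treat the node counts here as illustrative only.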
Okay, so I think the problem you were having with the serialised branch is due to the pipeline not being serialised. This worked in my example because the pipeline was empty; I assume in your example it was not. I'm not sure of the best way to handle this, since functions cannot be serialised into JSON. I see a number of options:
Whilst the first option is the simplest and requires the least change, I think it would cause too many problems, and it's not particularly obvious. I think when you load an index it should be ready to go, without any extra set-up. Having to register pipeline functions is a bit of an overhead. For the built-in pipeline functions this isn't a problem, but anyone adding extra pipeline functions would need to make sure that they are registered before trying to load an index. I'm leaning towards option 2; however, I'm keen to get any ideas or feedback.
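For anyone wondering why the pipeline gets lost, this small sketch shows what `JSON.stringify` does with functions. The `lowercase` function is a made-up stand-in for a pipeline function, not one of lunr's.

```javascript
// Hypothetical pipeline function, for illustration only.
function lowercase(token) {
  return token.toLowerCase();
}

// A function inside an array is replaced with null.
const inArray = JSON.stringify({ pipeline: [lowercase] });

// A function-valued property is dropped entirely.
const asProperty = JSON.stringify({ fn: lowercase });

// A plain string name survives the round trip unchanged.
const byName = JSON.stringify({ pipeline: ["lowercase"] });

console.log(inArray);    // {"pipeline":[null]}
console.log(asProperty); // {}
console.log(byName);     // {"pipeline":["lowercase"]}
```

So an index dumped with a non-empty pipeline comes back with no usable pipeline unless the functions are stored in some other form, such as a name.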
pipeline functions need to be registered if they are to be serialised
the pipeline will issue warnings about using unregistered pipeline functions
Hey Oliver, hope you had a good holiday :) My preference is to use named functions too. Maybe we can have a fallback in which we supply functions encoded as strings, which will be
@ssured Yep, I've gone with naming the pipeline functions. They have to be registered with lunr before they can be successfully serialised. The included pipeline functions are automatically registered, so this will only affect people using custom pipeline functions. Please try out the latest version with your example again; I think it should have solved the problem you were having before. I'm aiming to get a 0.3.0 release with this feature out in a week's time or so.
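Roughly, the named-function approach can be pictured like this. It is a standalone sketch, not lunr's implementation or API: `registry`, `register`, `dumpPipeline`, `loadPipeline` and `stripPunctuation` are all hypothetical names made up for the example.

```javascript
// Hypothetical registry mapping labels to functions (not lunr's own code).
const registry = {};

function register(fn, label) {
  fn.label = label;
  registry[label] = fn;
}

// Serialise a pipeline as an array of labels, warning about
// any function that was never registered.
function dumpPipeline(fns) {
  return fns.map((fn) => {
    if (!fn.label || registry[fn.label] !== fn) {
      console.warn("Unregistered pipeline function:", fn.name || "(anonymous)");
    }
    return fn.label;
  });
}

// Rebuild the pipeline from labels; fail loudly if a label is unknown.
function loadPipeline(labels) {
  return labels.map((label) => {
    const fn = registry[label];
    if (!fn) throw new Error("Cannot load unregistered function: " + label);
    return fn;
  });
}

// A custom pipeline function must be registered before dumping or loading.
function stripPunctuation(token) {
  return token.replace(/[^\w]/g, "");
}
register(stripPunctuation, "stripPunctuation");

const saved = JSON.stringify(dumpPipeline([stripPunctuation]));
const restored = loadPipeline(JSON.parse(saved));

console.log(saved);                // ["stripPunctuation"]
console.log(restored[0]("lady!")); // lady
```

The trade-off is the one discussed above: whoever loads the index has to register the same custom functions under the same labels first.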
The serialise functionality has been released in version 0.3.0 of lunr. An example can be seen in
This branch includes early support for dumping and loading an index. The internals may change, but the interface is likely to remain the same. This branch will be released as v0.3.0. I haven't had a chance to update the docs; I'll probably get around to releasing this properly in a week or so.
The index now supports serialisation via `JSON.stringify`. A dumped index can then be reloaded via `lunr.Index.load`.
I have updated the example, in `/example`, to use a pre-built index. The load time is significantly faster! I've also included a simple node script for pre-building the index. I imagine that something similar could be extracted into a standalone tool, but that is for another day.
Any feedback is welcome.
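As a rough illustration of the dump-and-reload pattern, here is a sketch using a toy index. `ToyIndex` is hypothetical, and lunr's real internals and serialised format will differ. It relies only on the standard behaviour that `JSON.stringify` calls an object's `toJSON` method when one is present.

```javascript
// Hypothetical toy index; lunr's real internals and dump format differ.
class ToyIndex {
  constructor() {
    this.postings = {};
  }

  // Index a document by splitting its text on whitespace.
  add(id, text) {
    for (const token of text.toLowerCase().split(/\s+/)) {
      if (!token) continue;
      (this.postings[token] = this.postings[token] || []).push(id);
    }
  }

  search(term) {
    return this.postings[term.toLowerCase()] || [];
  }

  // JSON.stringify calls toJSON automatically when it is present.
  toJSON() {
    return { version: 1, postings: this.postings };
  }

  // Rebuild an index from parsed JSON without re-indexing the documents.
  static load(data) {
    const idx = new ToyIndex();
    idx.postings = data.postings;
    return idx;
  }
}

const idx = new ToyIndex();
idx.add("a", "The lady vanishes");
idx.add("b", "Lady and the tramp");

const dumped = JSON.stringify(idx);
const reloaded = ToyIndex.load(JSON.parse(dumped));

console.log(reloaded.search("lady")); // [ 'a', 'b' ]
```

Loading skips tokenising and indexing every document again, which is where the faster start-up of a pre-built index comes from.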