
Serialise #14

Closed

wants to merge 6 commits into from

Conversation

olivernn
Owner

This branch includes early support for dumping and loading an index. The internals may change, but the interface is likely to remain the same; this branch will be released as v0.3.0. I haven't had a chance to update the docs, so I'll probably get around to releasing this properly in a week or so.

The index now supports serialisation via JSON.stringify. A dumped index can then be reloaded via lunr.Index.load.
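
For illustration, a minimal sketch of the round trip (the documents and field names here are made up, not taken from this branch):

```javascript
// Build a small index; lunr's default ref is 'id'.
var idx = lunr(function () {
  this.field('title')
  this.field('body')
})

idx.add({ id: 1, title: 'Serialisation', body: 'Dump and reload an index.' })

// Dump the index to a JSON string...
var serialised = JSON.stringify(idx)

// ...and later rebuild a working index from it. Assumes load takes
// the parsed JSON rather than the raw string.
var loaded = lunr.Index.load(JSON.parse(serialised))
```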

I have updated the example, in /example, to use a pre-built index. The load time is significantly faster! I've also included a simple node script for pre-building the index. I imagine that something similar could be extracted into a standalone tool, but that is for another day.
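
As a sketch, the pre-building script could look something along these lines; the file names and fields below are assumptions, not the actual contents of the example:

```javascript
// Hypothetical node pre-build script.
var lunr = require('./lunr.js'),
    fs = require('fs')

var documents = JSON.parse(fs.readFileSync('./documents.json', 'utf8'))

var idx = lunr(function () {
  this.field('title', { boost: 10 })
  this.field('body')
})

documents.forEach(function (doc) {
  idx.add(doc)
})

// Write out the serialised index so the browser can load it pre-built.
fs.writeFileSync('./example_index.json', JSON.stringify(idx))
```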

Any feedback is welcome.

olivernn mentioned this pull request Mar 13, 2013
@ssured

ssured commented Mar 14, 2013

Hi

I've been playing with the serializer. Great stuff! I'm encountering a bug though, which I could not resolve.

See http://jsbin.com/upabeb/1 for my test case, check source. Inside is a build of the lunr.js serialise branch, followed by my dataset, followed by a minimal test script. Check console for output.

Essentially what I do is I load a big dataset. Then I search for 'lady' which returns 21 results. Searching in the serialised-deserialised index returns an empty set. Clearly some data is lost.

I saw your test case, which essentially does the same, but that one runs fine. The only difference is the bigger dataset I guess?

Also, I've been looking into ways to optimize the size of the index. Currently it compiles to 900k for me. Most of the overhead is in the tokenStore part. I was wondering if it's possible to quickly compute (a part of) the tokenStore in some way. What's your guess?
A small enhancement can be made in the documentStore: instead of copying the strings, we could reference the index of the string in the corpusTokens.
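
Conceptually something like this (a simplified sketch, not lunr's actual serialisation format):

```javascript
// Today: token strings are repeated per document.
var current = {
  corpusTokens: ['lady', 'lake'],
  documentStore: { '1': ['lady', 'lake'], '2': ['lady'] }
}

// Suggested: reference indexes into corpusTokens instead.
var proposed = {
  corpusTokens: ['lady', 'lake'],
  documentStore: { '1': [0, 1], '2': [0] }
}
```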

Great stuff!

@olivernn
Owner Author

Hey, sorry for the late reply, I've been away on holiday.

That is a strange issue you are seeing; I'll take a look and see what's going on.

As for the size of the serialised index, there probably are ways to store the tokenStore more efficiently; for example, several characters could be combined into a node so as to lessen the number of nodes required. I haven't looked into how achievable this is though, it's just a thought.
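
Roughly the idea, as a simplified sketch (not the actual tokenStore structure):

```javascript
// One node per character, as the tokenStore is stored today
// (simplified; real nodes also carry document references).
var perCharacterNodes = { l: { a: { d: { y: { docs: [1] } } } } }

// Runs of single-child nodes collapsed into one multi-character
// node, lessening the number of nodes required.
var combinedNodes = { lady: { docs: [1] } }
```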

@olivernn
Owner Author

Okay, so I think the problem you were having with the serialised branch is due to the pipeline not being serialised. This worked in my example because the pipeline was empty; I assume in your example it was not.

I'm not sure of the best way to handle this, since functions cannot be serialised into JSON. I see two options:

  1. Don't serialise the pipeline at all. This is the simplest option; it would mean that the user would have to manually set up the pipeline to be exactly the same as when the index was serialised. This is the current implementation.
  2. Serialise the names of the pipeline functions. This would mean that each pipeline function would have to have a name or label property, which would be serialised and then looked up when loading a serialised index. To make the lookup easier, pipeline functions would need to be 'registered' with lunr, so the loading process would have a list of functions to look up.

Whilst the first option is the simplest and requires the least amount of change, I think it would cause too many problems, and it's not particularly obvious. I think when you load an index it should be ready to go, without any extra set-up.

Having to register pipeline functions is a bit of an overhead; for the built-in pipeline functions this isn't a problem, but anyone adding extra pipeline functions would need to make sure that they are registered before trying to load an index.

I'm leaning towards option 2; however, I'm keen to get any ideas or feedback.

pipeline functions need to be registered if they are to be serialised
the pipeline will issue warnings about using unregistered pipeline functions
@ssured

ssured commented Mar 26, 2013

Hey Oliver, hope you had a good holiday :)

My preference is to use named functions too. Maybe we can have a fallback in which we supply functions encoded as strings, which would be eval'd by lunr. This is a security risk, but inside closed networks (intranets) the risk is low.

@olivernn
Owner Author

@ssured Yep, I've gone with naming the pipeline functions. They have to be registered with lunr before they can be successfully serialised. The included pipeline functions are automatically registered, so this will only affect people using custom pipeline functions.
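
For custom pipeline functions, registration looks roughly like this; the stripNumbers function below is made up for illustration:

```javascript
// A hypothetical custom pipeline function; in lunr 0.x a pipeline
// function receives a token string and returns the transformed token.
var stripNumbers = function (token) {
  return token.replace(/\d+/g, '')
}

// Register it under a name so it can be serialised with the index
// and looked up again on load.
lunr.Pipeline.registerFunction(stripNumbers, 'stripNumbers')

var idx = lunr(function () {
  this.field('body')
  this.pipeline.add(stripNumbers)
})
```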

Please try out the latest version with your example again, I think it should have solved the problem you were having before.

I'm aiming to get a 0.3.0 release out with this feature in a week's time or so.

@olivernn
Owner Author

olivernn commented Apr 1, 2013

The serialise functionality has been released in version 0.3.0 of lunr.

An example can be seen in example/index_builder.js. Basically the index can now be serialised using JSON.stringify. The output can then be loaded again using lunr.Index.load(serialisedIndex).
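
A minimal loading sketch, assuming serialisedIndex is the parsed output of JSON.stringify on an index:

```javascript
// Rebuild a ready-to-query index from the serialised data.
var idx = lunr.Index.load(serialisedIndex)

// No extra pipeline set-up is needed for the built-in functions.
idx.search('lady')
```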

olivernn closed this Apr 1, 2013