Using invertedIndex for autocomplete #287
Sorry for the late reply. You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don't think it has been updated for lunr 2.

As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I've been suggesting to people:

```js
idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })
  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })
  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})
```

I disable the pipeline to prevent stemming getting in the way; you would have to experiment to see whether this makes sense for your use case, especially if you wanted to add the unicode normalising plugin. Additionally, when using the …
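To make the plugin suggestion concrete, here is a minimal sketch of wrapping a normalise step into a lunr 2 plugin. The diacritic-stripping rule and the names `normaliser`/`normalisePlugin` are illustrative assumptions, not code from this thread:

```js
// Hypothetical normaliser: strip combining diacritics from each token.
var normaliser = function (token) {
  return token.update(function (str) {
    return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
  })
}

// Register it by name so an index that uses it can be serialised.
lunr.Pipeline.registerFunction(normaliser, 'normaliser')

// The plugin adds the normaliser ahead of the stemmer on both pipelines.
var normalisePlugin = function (builder) {
  builder.pipeline.before(lunr.stemmer, normaliser)
  builder.searchPipeline.before(lunr.stemmer, normaliser)
}
```

It would then be enabled with `this.use(normalisePlugin)` inside the `lunr(function () { ... })` builder.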
Dear Oliver, no need to be sorry! With the help of your snippet I cobbled together a little more context: I can now skip creating an extra iindex upfront and instead feed the autocomplete from lunr itself, and I can also drop my own normalizer by using the pipeline:

```js
// lunr query for autocomplete terms
function acsearch(searchTerm) {
  var results = index.query(function (q) {
    // exact matches should have the highest boost
    q.term(searchTerm, { boost: 100 })
    // wildcard matches should be boosted slightly
    q.term(searchTerm, { boost: 10, usePipeline: true, wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING })
    // finally, try a fuzzy search, without any boost
    q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
  });

  if (!results.length) { return ""; }

  return results.map(function (v, i, a) { // extract matched terms
    return Object.keys(v.matchData.metadata);
  }).reduce(function (a, b) { // flatten
    return a.concat(b);
  }).filter(function (v, i, a) { // uniq
    return a.indexOf(v) === i;
  });
}

// install jquery autocomplete widget
$('#query').autocomplete({
  appendTo: '#dlg',
  minLength: 3,
  source: function (inp, out) {
    out(acsearch(inp.term.toLowerCase()));
  }
});
```

The resulting list reads much the same, though now it is sorted by score and some new fuzzy terms appear. Performance on my desktop is also good. I think this shows all that can be pulled from an inverted index; anything more would need a mapping to the unstemmed original text, and that is beyond the scope of this experiment, at least for now. Thank you very much indeed!
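For readers following along, the reason `Object.keys(v.matchData.metadata)` yields completion candidates is that lunr keys the match metadata by matched index term. Roughly, one entry of `results` looks like this (an illustrative sketch with made-up values and field names, not output from this thread):

```js
var exampleResult = {
  ref: "42",     // document reference
  score: 3.21,   // relevance score
  matchData: {
    metadata: {
      // matched index term -> fields it matched in -> whitelisted metadata
      "normal": { "title": {} },
      "normalizer": { "body": {} }
    }
  }
};
```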
@hungerburg glad you managed to make some progress. Indeed, the stemming does make implementing really good autocomplete difficult.

Just thinking aloud, but you can get Lunr to create and store the mapping to the unstemmed words, though it will likely increase the index size considerably. During indexing you can add a pipeline function that stores the unstemmed word as metadata on a token; the concept is discussed a bit in the guides. You would need to add a pipeline function before the stemmer:

```js
var storeUnstemmed = function (token) {
  token.metadata['unstemmed'] = token.toString()
  return token
}
```

The idea is that you would then get the unstemmed words from the search results; the guide has a more detailed example. First prize would be to have an algorithmic unstemmer, but I don't think it's possible.
Closing this as it has gone stale, feel free to comment or re-open if you still have any unanswered questions.
First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it's working great. For anyone wanting to do the same, this is what worked for me. I'm indexing like this:

```js
// Store the unstemmed term in the metadata. See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {
  // Define a pipeline function that keeps the unstemmed word
  const pipelineFunction = function(token) {
    token.metadata['unstemmed'] = token.toString();
    return token;
  };

  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');

  // Add the pipeline function to the indexing pipeline, before the stemmer
  builder.pipeline.before(lunr.stemmer, pipelineFunction);

  // Whitelist the unstemmed metadata key
  builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
  this.use(storeUnstemmed);
  ...
});
```

And I modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:

```js
autoComplete(searchTerm) {
  const results = this._index.query(function(q) {
    // exact matches should have the highest boost
    q.term(searchTerm, { boost : 100 })
    // wildcard matches should be boosted slightly
    q.term(searchTerm, {
      boost : 10,
      usePipeline : true,
      wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
    })
    // finally, try a fuzzy search, without any boost
    q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
  });

  if (!results.length) {
    return "";
  }

  return results.map(function(v, i, a) { // extract unstemmed terms
    const unstemmedTerms = {};
    Object.keys(v.matchData.metadata).forEach(function(term) {
      Object.keys(v.matchData.metadata[term]).forEach(function(field) {
        v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
          unstemmedTerms[word] = true;
        });
      });
    });
    return Object.keys(unstemmedTerms);
  }).reduce(function(a, b) { // flatten
    return a.concat(b);
  }).filter(function(v, i, a) { // uniq
    return a.indexOf(v) === i;
  });
}
```

Thanks!
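As a side note, an index built with this plugin can still be serialised and reloaded using the standard lunr 2 pattern; a short sketch, not part of the original comment:

```js
// Serialise the index, e.g. at build time:
const serialised = JSON.stringify(index);

// Reload it later, e.g. in the browser. lunr.Index.load rebuilds the index,
// including the whitelisted 'unstemmed' metadata stored in the inverted index:
const loaded = lunr.Index.load(JSON.parse(serialised));
```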
For reference, the original issue text that started this thread:

Not so much an issue with lunr, which is great! More a quick try to get ideas going…

In a shell, with the jq utility, I pull my terms from the lunr index in advance:

```sh
jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json
```

I can feed that to the http://api.jqueryui.com/autocomplete/ widget like below.
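The snippet the issue referred to was not preserved in this thread; a minimal sketch of that wiring, assuming the extracted terms are served as iindex.json, might look like this:

```js
// Load the pre-extracted term list and hand it to the jQuery UI widget,
// which accepts a plain array as its source option.
$.getJSON('iindex.json', function (terms) {
  $('#query').autocomplete({
    minLength: 3,
    source: terms
  });
});
```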
Not fully nice, but it works acceptably so far. What would be truly nice is to create the autocomplete index on the client and have the term to match processed by the indexer instead of that crude normalizer above.