Optimize search algorithm + documentation of prediction.cppm #9
Conversation
Thanks a lot for your changes. I had a look at your [Bug?] labels and commented my thoughts on each of them separately.
nlpcore/src/latin/prediction.cppm
Outdated
auto isWordCandidate = token_index > 0 && candidateCost <= state.weights_.max_cost_sum_;
auto prefixCost = 0.0; // Initialized in the next line only if isWordPrefix evaluates to true
// TODO: improve prefix searching performance (run time and stop detection)
// Eligible as prefix candidate? Searching for a prefix, current word and query word are the same up to `token_index`, and the cost at word size is within bound.
// [Bug?] editDistanceAt at this stage is either 0 or a leftover from the last time we searched till word.size().
Normally it shouldn't be a bug, and the edit distance should be the current one, but I am not 100% sure.
nlpcore/src/latin/prediction.cppm
Outdated
@@ -350,13 +458,18 @@ void predictWordInternal(std::span<const fl::str::UniString> sentence, const Rec
         params.results_.insert({properties->shortcut_phrase, 1.0}, params.flags_);
     }
 } else {
-    // We have an n-gram
+    // We have an n-gram (n > 1)
+    // Only search the user's dictionary for n-grams for now. [Bug?]
Not a bug currently; it behaves as described. Only if, in the future, we also want to query the language pack dictionaries would this become a bug. We should probably slap a TODO label on it.
nlpcore/src/latin/prediction.cppm
Outdated
/* Compute frequency and merged properties from word level, n-gram level, and shortcut level.
 * Returns pair:
 * - merged_properties: absolute score = sum of frequencies from all types, offensive/hidden = OR of each n's flags
 * - frequency: average of (smoothed) frequencies of all types. [Bug?] Should add all n = 1..N and normalize by the (n-1)-gram's frequency, not the root's.
No, normally P and EntryType are synced, so if I query n-grams I get the merged frequency for n-grams only.
Thanks a lot for the changes, really appreciate it!
I've now run your changes using the nlptools binary on PC and have compared the old and new results for suggestion queries. Here are some observations:
A. Dictionary loading in parallel seems to run very smoothly and is barely noticeable, except if one immediately tries to type before everything is loaded, but on mobile that's more than enough.
B. The confidence value in the old algorithm was a normalized value between 0 and 1; using the new algorithm, however, the confidence values are all extremely low. The question is why this occurs and how it could be fixed.
C. Sometimes the confidence value drops below 0, which should not happen.
D. The current implementation breaks the nlptool actions prep and train due to API changes in nlpcore; these need to be fixed, please.
nlpcore/src/common/dictionary.cppm
Outdated
    ~WrappedJThread() {
        thread.join();
    }
}
Suggested change:
-}
+};
You missed a semicolon here
How did this get past compilation...
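For reference, a minimal self-contained sketch of the corrected wrapper (the joinable() guard and the surrounding struct layout are illustrative assumptions, not the PR's exact code):

#include <thread>

struct WrappedJThread {
    std::jthread thread;

    ~WrappedJThread() {
        // Guard against joining a thread that was already joined or detached.
        if (thread.joinable()) {
            thread.join();
        }
    }
};  // the semicolon the suggestion above adds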
nlpcore/src/latin/dictionary.cppm
Outdated
@@ -55,15 +57,16 @@ export enum class LatinDictionarySection {
 export class LatinDictionary : public Dictionary {
   public:
     LatinDictId dict_id_;
-    std::shared_ptr<LatinTrieNode> data_;
+    // all operations on data_->first need to acquire a lock on data_->second please.
+    std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
Regarding the `std::pair<LatinTrieNode, std::shared_mutex>`: what I would suggest here is to declare a new struct named `LatinTrieNodeWithLock`, which looks something like this:

struct LatinTrieNodeWithLock {
    LatinTrieNode node;
    std::shared_mutex lock;
};

This way we make code using it more readable, and we avoid having to use `first` and `second`, which are not descriptive at all.
Then we can write:
std::shared_ptr<LatinTrieNodeWithLock> data_;
which is far more readable.
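For illustration, the `insertNgram` wrapper from the diff below would then read like this (a sketch assuming the struct above; otherwise identical to the PR's code):

inline LatinTrieNode* insertNgram(std::span<const fl::str::UniString> ngram) noexcept {
    // Exclusive lock, since insertion mutates the trie.
    std::scoped_lock<std::shared_mutex> lock(data_->lock);
    return algorithms::insertNgram(&(data_->node), dict_id_, ngram);
}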
Sounds good, will do.
nlpcore/src/latin/dictionary.cppm
Outdated
 inline LatinTrieNode* insertNgram(std::span<const fl::str::UniString> ngram) noexcept {
-    return algorithms::insertNgram(data_.get(), dict_id_, ngram);
+    std::scoped_lock<std::shared_mutex> lock(data_->second);
+    return algorithms::insertNgram(&(data_->first), dict_id_, ngram);
 }

-inline void forEachEntry(algorithms::EntryAction& action) {
-    algorithms::forEachEntry(data_.get(), dict_id_, action);
+inline void forEachEntryReadSafe(algorithms::EntryAction& action) {
+    std::shared_lock<std::shared_mutex> lock(data_->second);
+    algorithms::forEachEntry(&(data_->first), dict_id_, action);
 }
Is there a reason why you sometimes use `scoped_lock` and sometimes `shared_lock`?
Because `scoped_lock` is a non-shared lock (write lock) while `shared_lock` is a shared lock (read lock). I'd use a `scoped_shared_lock` if there were one, but there isn't.
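A minimal standalone sketch of the distinction (the names here are illustrative, not from the PR):

#include <mutex>
#include <shared_mutex>

std::shared_mutex mtx;

void writePath() {
    // Exclusive (write) lock: blocks all other readers and writers.
    std::scoped_lock lock(mtx);
    // ... mutate the trie ...
}

void readPath() {
    // Shared (read) lock: any number of readers may hold it concurrently,
    // but it blocks while a writer holds the exclusive lock.
    std::shared_lock lock(mtx);
    // ... traverse the trie ...
}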
nlpcore/src/latin/prediction.cppm
Outdated
similarity = -cost;
double confidence = (w1 * similarity + w2 * frequency) / (w1 + w2);
Why assign negative cost to similarity?
It's from the "just get it working" mindset. The similarity does not really need to be in 0-1 for suggestions; it just needs to be correctly ordered such that better matches score higher. I'll work out the math in the next update.
Thanks for the review, Patrick.
A. It would only be noticeable if the user types an uncommon word straight away. Otherwise it should be equivalent.
B. I have no idea what's going on; it should not be so small. I'll take a look.
C. I haven't given the score much attention, thinking that as long as it works, it's fine. Right now it's not constrained to 0-1, because that would make it hard to keep the cost non-decreasing during the A* search. I'll make changes so the math is more formal. However, please note that instead of having a 0-1 score, we can simply say the score is the log confidence, ranging from -infinity to 0. Log probability is adopted in some libraries, including KenLM, mentioned somewhere here. Subtracting a number is just multiplying the confidence by something smaller than 1; for example, subtracting one edit distance from the log confidence is equivalent to multiplying the confidence by 1/e. I'll probably do something along this line and return an exponent at the end.
D. I'm on Debian so I couldn't build nlptools. I'll ping you to get your setup.
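A sketch of the log-confidence idea described in C (the function and variable names are illustrative assumptions, not the PR's code):

#include <cmath>

// Score candidates in log space: 0 is a perfect, maximally frequent match;
// more negative is worse. Each unit of edit cost subtracts from the log
// confidence, which multiplies the linear confidence by exp(-1) ≈ 0.37.
double scoreCandidate(double log_frequency, double edit_cost) {
    double log_confidence = log_frequency - edit_cost;
    return std::exp(log_confidence); // map back to (0, 1] at the very end
}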
It seems that merging in the latest main has messed up the branch, and the diff now shows changes which aren't yours, making PR review pretty hard. Could you maybe undo the merge and properly rebase on main?
Hey Patrick,
Added documentation as I try to understand the prediction implementation.
Please also check the items labeled as [Bug?] and see if they are actually bugs. I'll try optimizing the search in another PR.