
Optimize search algorithm + documentation of prediction.cppm #9

Open
wants to merge 10 commits into base: main

Conversation

@moonbeamcelery (Author):

Added documentation as I try to understand the prediction implementation.
Please also check the items labeled as [Bug?] and see if they are actually bugs.

I'll try optimizing the search in another PR.

@moonbeamcelery moonbeamcelery changed the title Documentation of prediction.cppm Optimize search algorithm + documentation of prediction.cppm Oct 20, 2023
@patrickgold (Member) left a comment:

Thanks a lot for your changes. I had a look at your [Bug?] labels and commented my thoughts on each of them separately.

auto isWordCandidate = token_index > 0 && candidateCost <= state.weights_.max_cost_sum_;
auto prefixCost = 0.0; // Is initialized in next line only if isWordPrefix results in true
// TODO: improve prefix searching performance (run time and stop detection)
// Eligible as prefix candidate? searching for prefix, current word and query word same until `token_index`, and cost at word size within bound.
// [Bug?] editDistanceAt at this stage is either 0 or leftover from last time we searched till word.size().
@patrickgold (Member):

Normally it shouldn't be a bug and the edit distance should be the current one, but I am not 100% sure
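For context, here is a hypothetical sketch of how the leftover value could arise; the names are illustrative assumptions, not the actual prediction.cppm API. A fuzzy trie search often keeps one DP table and recomputes only row `depth` when it descends to a node at that depth, so rows deeper than the current node still hold values from the previously explored branch.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the staleness concern; hypothetical names. dp[depth][j] is the
// edit distance between the first `depth` characters of the current trie
// path and the first `j` characters of the query word. The table is reused
// across branches instead of being reallocated per node.
struct FuzzySearchState {
    std::string word;                  // query word
    std::vector<std::vector<int>> dp;  // reused DP table

    FuzzySearchState(std::string w, std::size_t maxDepth)
        : word(std::move(w)),
          dp(maxDepth + 1, std::vector<int>(word.size() + 1, 0)) {
        for (std::size_t j = 0; j <= word.size(); ++j) {
            dp[0][j] = static_cast<int>(j); // empty path needs j insertions
        }
    }

    // Called when the search descends to `depth` via character `token`.
    // Only row `depth` is recomputed; deeper rows keep stale values from
    // whatever branch was explored before backtracking.
    void setTokenAt(std::size_t depth, char token) {
        dp[depth][0] = static_cast<int>(depth);
        for (std::size_t j = 1; j <= word.size(); ++j) {
            int sub = dp[depth - 1][j - 1] + (word[j - 1] == token ? 0 : 1);
            dp[depth][j] = std::min({sub, dp[depth - 1][j] + 1, dp[depth][j - 1] + 1});
        }
    }

    // Reading a row deeper than the current depth returns either the
    // initial 0 or a leftover from a previous branch, which is exactly
    // the situation the [Bug?] comment asks about.
    int editDistanceAt(std::size_t depth) const {
        return dp[depth][word.size()];
    }
};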

@@ -350,13 +458,18 @@ void predictWordInternal(std::span<const fl::str::UniString> sentence, const Rec
         params.results_.insert({properties->shortcut_phrase, 1.0}, params.flags_);
     }
 } else {
-    // We have an n-gram
+    // We have an n-gram (n > 1)
+    // Only search the user's dictionary for n-grams for now. [Bug?]
@patrickgold (Member):

Not a bug currently; it behaves as described. It would only become a bug in the future if we also want to query the language pack dictionaries. We should probably slap a TODO label on it.

/* Compute frequency and merged properties from word-level, n-gram level, and shortcut level.
* Returns pair:
* - merged_properties: absolute score = sum frequencies from all types, offensive/hidden = or(that of each n)
* - frequency: average of (smoothed) frequencies of all types [Bug?] Should add all n = 1..N. should normalize by (n-1)-gram's frequency, not the root's.
@patrickgold (Member):

No, normally P and EntryType are synced, so if I query n-grams I get the merged frequency for n-grams only.
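For illustration, a rough sketch of the merge the doc comment describes; the struct and function names here are assumptions, not actual nlpcore code.

#include <utility>
#include <vector>

// Hypothetical per-entry-type properties, for illustration only.
struct TypeProperties {
    double frequency;   // (smoothed) frequency for this entry type
    bool is_offensive;
    bool is_hidden;
};

struct MergedProperties {
    double absolute_score = 0.0;  // sum of frequencies from all types
    bool is_offensive = false;    // OR over all contributing types
    bool is_hidden = false;       // OR over all contributing types
};

// Merge word-level, n-gram-level and shortcut-level properties as the
// doc comment describes: the score is a sum, the flags are OR-ed, and
// the returned frequency is the average over the contributing types.
std::pair<MergedProperties, double> mergeProperties(const std::vector<TypeProperties>& types) {
    MergedProperties merged;
    for (const auto& t : types) {
        merged.absolute_score += t.frequency;
        merged.is_offensive = merged.is_offensive || t.is_offensive;
        merged.is_hidden = merged.is_hidden || t.is_hidden;
    }
    double frequency = types.empty() ? 0.0 : merged.absolute_score / types.size();
    return {merged, frequency};
}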

@patrickgold (Member) left a comment:

Thanks a lot for the changes, really appreciate it!

I've now run your changes using the nlptools binary on PC and have compared the old and new results for suggestion queries. Here are some observations:

A. Dictionary loading in parallel seems to run very smoothly and is barely noticeable, except when one immediately starts typing before everything is loaded; on mobile that's more than enough.

B. In the old implementation the confidence value was normalized between 0 and 1 [screenshot]. With the new algorithm, however, the confidence values are all extremely low [screenshot]. The question is why this occurs and how it could be fixed.

C. Sometimes the confidence value drops below 0, which should not happen [screenshot].

D. The current implementation breaks the nlptool actions prep and train due to API changes in nlpcore; please fix these.

~WrappedJThread() {
    thread.join();
}
}
@patrickgold (Member):

Suggested change
-}
+};

You missed a semicolon here

@moonbeamcelery (Author):

How did this get past compilation...
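For reference, a minimal self-contained sketch of such a wrapper. The joinable() guard is an addition, since joining a non-joinable thread terminates the program; note also that C++20's std::jthread already joins on destruction and could replace the wrapper entirely.

#include <thread>
#include <utility>

// Minimal RAII thread wrapper that joins on destruction. The joinable()
// check guards against double-join and moved-from states.
struct WrappedJThread {
    std::thread thread;

    template <typename F, typename... Args>
    explicit WrappedJThread(F&& f, Args&&... args)
        : thread(std::forward<F>(f), std::forward<Args>(args)...) {}

    WrappedJThread(WrappedJThread&&) noexcept = default;
    WrappedJThread& operator=(WrappedJThread&&) = delete;

    ~WrappedJThread() {
        if (thread.joinable()) {
            thread.join();
        }
    }
}; // note the semicolon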

@@ -55,15 +57,16 @@ export enum class LatinDictionarySection {
 export class LatinDictionary : public Dictionary {
   public:
     LatinDictId dict_id_;
-    std::shared_ptr<LatinTrieNode> data_;
+    // all operations on data_->first need to acquire lock on data_->second please.
+    std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
@patrickgold (Member):

Regarding the std::pair<LatinTrieNode, std::shared_mutex>:

What I would suggest to do here is to declare a new struct named LatinTrieNodeWithLock, which in turn looks something like this:

struct LatinTrieNodeWithLock {
    LatinTrieNode node;
    std::shared_mutex lock;
};

This way we can make code using this more readable, and we avoid having to use first and second, which are not descriptive at all.

Then we can write:

Suggested change
-    std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
+    std::shared_ptr<LatinTrieNodeWithLock> data_;

which is far more readable.
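For illustration, call sites could then read like this; a sketch only, with placeholder bodies and a stubbed node type.

#include <memory>
#include <mutex>
#include <shared_mutex>

struct LatinTrieNode { /* stub for illustration */ };

struct LatinTrieNodeWithLock {
    LatinTrieNode node;
    std::shared_mutex lock;
};

// Write path: exclusive lock on the named member, then mutate the node.
void insertExample(const std::shared_ptr<LatinTrieNodeWithLock>& data) {
    std::scoped_lock lock(data->lock);
    // ... mutate data->node ...
}

// Read path: shared lock, so multiple readers may proceed concurrently.
void readExample(const std::shared_ptr<LatinTrieNodeWithLock>& data) {
    std::shared_lock lock(data->lock);
    // ... read data->node ...
}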

@moonbeamcelery (Author):

Sounds good, will do.

Comment on lines 77 to 90
 inline LatinTrieNode* insertNgram(std::span<const fl::str::UniString> ngram) noexcept {
-    return algorithms::insertNgram(data_.get(), dict_id_, ngram);
+    std::scoped_lock<std::shared_mutex> lock(data_->second);
+    return algorithms::insertNgram(&(data_->first), dict_id_, ngram);
 }

-inline void forEachEntry(algorithms::EntryAction& action) {
-    algorithms::forEachEntry(data_.get(), dict_id_, action);
+inline void forEachEntryReadSafe(algorithms::EntryAction& action) {
+    std::shared_lock<std::shared_mutex> lock(data_->second);
+    algorithms::forEachEntry(&(data_->first), dict_id_, action);
 }
@patrickgold (Member):

Is there a reason why you sometimes use scoped_lock and sometimes shared_lock?

@moonbeamcelery (Author):

Because scoped_lock is a non-shared (write) lock, while shared_lock is a shared (read) lock. I'd use scoped_shared_lock if there were one, but there isn't.
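A small standalone demonstration of the difference (not project code): std::shared_mutex admits many concurrent shared holders but only one exclusive holder, and std::shared_lock is the RAII type for the shared side.

#include <cstdio>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

std::shared_mutex mtx;
int shared_value = 0;

void writer(int v) {
    std::scoped_lock lock(mtx);  // exclusive: blocks readers and writers
    shared_value = v;
}

int reader() {
    std::shared_lock lock(mtx);  // shared: many readers can hold it at once
    return shared_value;
}

int main() {
    std::vector<std::thread> threads;
    threads.emplace_back(writer, 42);
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([] { std::printf("%d\n", reader()); });
    }
    for (auto& t : threads) t.join();
}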

Comment on lines 125 to 126
similarity = -cost;
double confidence = (w1 * similarity + w2 * frequency) / (w1 + w2);
@patrickgold (Member):

Why assign negative cost to similarity?

@moonbeamcelery (Author):

It's from the "just get it working" mindset. The similarity does not really need to be 0-1 for suggestions; it just needs to be correctly ordered so that better matches are scored higher. I'll work out the math in the next update.

@moonbeamcelery (Author):

Thanks for the review, Patrick.

A. It would only be noticeable if the user types an uncommon word straight away. Otherwise it should be equivalent.

B. I have no idea what's going on; it should not be so small. I'll take a look.

C. I haven't given the score much attention, thinking that as long as it works, it's fine. Right now it's not constrained to 0-1 because that would make it hard to keep the cost non-decreasing during the A* search. I'll make the math more formal.

However, please note that instead of a 0-1 score we can simply say the score is the log confidence, ranging from -infinity to 0. Log probabilities are adopted in some libraries, including KenLM, which is mentioned somewhere here. Subtracting a number from the log confidence is just multiplying the confidence by something smaller than 1: for example, subtracting one edit distance from the log confidence is equivalent to multiplying the confidence by 1/e. I'll probably do something along these lines and exponentiate at the end (see the sketch after this list).

D. I'm on Debian so I couldn't build nlptools. I'll ping you to get your setup.
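A tiny sketch of the log-domain scoring idea from point C; the names and weights here are hypothetical, not the planned implementation.

#include <cmath>
#include <cstdio>

// Log-domain confidence: each unit of edit distance subtracts 1 from the
// log confidence, which multiplies the linear-domain confidence by 1/e.
double logConfidence(double logFrequency, int editDistance) {
    return logFrequency - static_cast<double>(editDistance);
}

int main() {
    double exact = std::exp(logConfidence(-2.0, 0));   // no edits
    double oneEdit = std::exp(logConfidence(-2.0, 1)); // one edit
    // oneEdit == exact / e: exponentiating at the end recovers a 0-1 score.
    std::printf("%f %f\n", exact, oneEdit);
}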

@patrickgold (Member):

It seems that merging in the latest main has messed up the branch, and the diff now shows changes that aren't yours, making PR review pretty hard. Could you maybe undo the merge and properly rebase on main?

@moonbeamcelery (Author):

Hey Patrick,
