[Enhancement] Change id hash map #5304
Conversation
memset(hash_map_.get(), -1, sizeof(Mapping) * capacity);

// This code block is to fill the ids into hash_map_.
IdArray unique_ids = NewIdArray(num_ids, ctx, sizeof(IdType) * 8);
IdType* unique_ids_data = unique_ids.Ptr<IdType>();
// Fill in the first `num_seeds` ids.
parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {
What's the common scale of num_seeds? Since kGrainSize is 256 already, do we need to use parallel?
The scale depends on your fan-out, which is usually about 1/10 of the original nodes. When the input node set is huge, it can also be very large. And the parallelism doesn't introduce side effects, so keeping it here should be better.
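For context, a grain-size-based `parallel_for` typically only forks when the range exceeds the grain, so keeping the parallel call is essentially free for small `num_seeds`. Below is a minimal sketch of that dispatch pattern; it is illustrative only and not DGL's actual runtime, which schedules chunks on its own thread pool.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int64_t kGrainSize = 256;

template <typename Body>
void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain, Body body) {
  const int64_t n = end - begin;
  if (n <= 0) return;
  if (n <= grain) {
    // Small ranges (e.g. a handful of seed ids) run serially, so the
    // parallel call costs nothing extra in that case.
    body(begin, end);
    return;
  }
  // Otherwise split the range into chunks of roughly `grain` elements.
  std::vector<std::thread> workers;
  for (int64_t s = begin; s < end; s += grain) {
    const int64_t e = std::min(s + grain, end);
    workers.emplace_back([&body, s, e] { body(s, e); });
  }
  for (auto& t : workers) t.join();
}
```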
// Fill in the first `num_seeds` ids.
parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {
  for (int64_t i = s; i < e; i++) {
    InsertAndSet(ids_data[i], static_cast<IdType>(i));
Why don't we use AttemptInsertAt for seed ids?
A seed id's mapping value is exactly its index in the array, so the key and value need to be set at the same time. Seed ids are unique, so the insertion is simpler and some checks can be removed to save effort.
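To make the distinction concrete, here is a hypothetical sketch of the two insertion paths for an open-addressing table whose keys are claimed with a CAS. The struct and function names are illustrative and do not reproduce the actual `ConcurrentIdHashMap` internals.

```cpp
#include <atomic>
#include <cstdint>

using IdType = int64_t;
constexpr IdType kEmpty = -1;  // slots start at -1, matching the memset above

struct MappingSketch {
  std::atomic<IdType> key{kEmpty};
  IdType value{kEmpty};
};

// Seed ids are unique among themselves, so a failed CAS can only mean a slot
// collision with a *different* id; no "already inserted?" check is needed,
// and the value (the id's position in `ids`) is written right after the key.
bool InsertAndSetSketch(MappingSketch* table, size_t pos, IdType id, IdType index) {
  IdType expected = kEmpty;
  if (!table[pos].key.compare_exchange_strong(expected, id)) {
    return false;  // slot taken by another id: caller probes the next slot
  }
  table[pos].value = index;
  return true;
}

// Non-seed ids may repeat across threads, so each insertion must also learn
// whether the id was already present to avoid emitting it twice.
bool AttemptInsertAtSketch(MappingSketch* table, size_t pos, IdType id, bool* collided) {
  IdType expected = kEmpty;
  if (table[pos].key.compare_exchange_strong(expected, id)) {
    *collided = false;
    return true;  // this thread inserted the id first
  }
  *collided = (expected != id);  // a different id holds the slot: probe on
  return false;                  // id already present, or a slot collision
}
```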
 *
 * For example, for an array A with following entries:
 * [98, 98, 100, 99, 97, 99, 101, 100, 102]
 * For example, for an array A having 4 seed ids with following entries:
Can you clarify what the seed ids are?
Sure, added.
 * And then insert the items in `ids` concurrently to generate the
 * mappings, in passing returning the unique ids in `ids`.
 * @brief Initialize the hashmap with an array of ids. The first `num_seeds`
 * ids are unqiue and must be mapped to a contiguous array starting
unqiue -> unique
fixed
 * For example, for an array A with following entries:
 * [98, 98, 100, 99, 97, 99, 101, 100, 102]
 * For example, for an array A having 4 seed ids with following entries:
 * [99, 98, 100, 97, 97, 101, 101, 102, 101]
In the comment below you mentioned the first `num_seeds` ids are unique. I am assuming the first 4 are seed ids, but I see a duplicated 97 in this example; is this intended?
Yes. Seed ids are unique among themselves, but they can duplicate other ids. So putting it here may help clarify this for users.
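For readers of this thread, here is the mapping the example is meant to illustrate, assuming the semantics described above; the order of the non-seed entries is only one possible outcome.

```cpp
// Worked illustration of the example array (one possible outcome):
//   ids        = [99, 98, 100, 97, 97, 101, 101, 102, 101], num_seeds = 4
//   seed ids   : 99 -> 0, 98 -> 1, 100 -> 2, 97 -> 3   (stable, by position)
//   other ids  : 101 -> 4, 102 -> 5                     (first-winner order)
//   duplicates : the second 97 and the repeated 101s map to existing entries
//   unique ids = [99, 98, 100, 97, 101, 102]  (or [..., 102, 101])
```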
 * mapped to [0, num_seed_ids) and `left ids` to [num_seed_ids, num_unique_ids).
 * Notice that mapping order is stable for `seed ids` while not for the left.
 * divided into 2 parts: [`seed ids`, `left ids`]. `Seed ids` refer to
 * a set ids chosen as the input for sampling process and `left ids` are the
left ids -> sampled ids.
Sampled ids are the ids newly sampled by the process (note that seed ids might be sampled in the process, but they are not included in the sampled ids to avoid duplication).
Good description, adopted it in the notes.
One small correction: seed ids can also be included in the sampled ids.
@@ -111,7 +111,9 @@ IdArray ConcurrentIdHashMap<IdType>::Init(
  parallel_for(num_seeds, num_ids, kGrainSize, [&](int64_t s, int64_t e) {
    size_t count = 0;
    for (int64_t i = s; i < e; i++) {
      Insert(ids_data[i], &valid, i);
      if (Insert(ids_data[i])) {
I am assuming each i will only be accessed once, so it can be simplified to:
valid[i] = Insert(ids_data[i]);
count += valid[i];
This is actually better, since the existing code in L107 assumes `valid` will be initialized to 0, which might not be true for every C++ compiler.
It makes sense. Changed to this style.
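For clarity, the loop with the suggested change applied would look roughly like the sketch below. Names follow the diff above; it assumes `Insert` returns true only on the first insertion of an id and that `valid` has one slot per id.

```cpp
parallel_for(num_seeds, num_ids, kGrainSize, [&](int64_t s, int64_t e) {
  size_t count = 0;
  for (int64_t i = s; i < e; i++) {
    // Write the flag unconditionally so valid[i] never depends on whatever
    // the buffer happened to contain before this loop.
    valid[i] = Insert(ids_data[i]);
    count += valid[i];
  }
  // `count` is then used by the bookkeeping that follows in the original
  // function.
});
```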
/**
 * @brief The result state of an attempt to insert.
 */
enum class InsertState {
move to private section?
changed
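For reference, moving the enum into the private section would look roughly like this; the class skeleton and enumerator names are illustrative, not the exact declaration.

```cpp
template <typename IdType>
class ConcurrentIdHashMap {
 public:
  // ... public API such as Init() ...

 private:
  /**
   * @brief The result state of an attempt to insert.
   */
  enum class InsertState { kInserted, kExisted, kOccupied };  // names illustrative

  // ... private helpers that return InsertState ...
};
```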
* change concurrent id hash map
* change concurrent id hash map
Description
The concurrent id hash map is mainly used in `to_block` to map an id array to a new contiguous one, but the current solution doesn't ensure the mapping order, while `to_block` requires the first N unique seed nodes to be mapped to [0, N). This PR changes the id hash map to meet that requirement.
Checklist
Please feel free to remove inapplicable items for your PR.
Changes