[Enhancement] Change id hash map #5304

peizhou001 · 2023-02-16T06:08:02Z

Description

The concurrent id hash map is mainly used in to_block to map an id array to a new contiguous one, but the current solution doesn't guarantee the mapping order. to_block requires that the first N unique seed nodes be mapped to [0, N). This PR changes the id hash map to meet that requirement.
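For readers new to the requirement, here is a minimal single-threaded sketch of the intended mapping. It is not the DGL implementation: `BuildIdMapping` is a hypothetical name, and `std::unordered_map` stands in for the concurrent hash map. It assumes the first `num_seeds` ids are unique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical single-threaded reference, not the DGL implementation.
// The first `num_seeds` ids (assumed unique) are mapped to their own index,
// then every id not seen before receives the next free index.
std::unordered_map<int64_t, int64_t> BuildIdMapping(
    const std::vector<int64_t>& ids, size_t num_seeds) {
  assert(num_seeds <= ids.size());
  std::unordered_map<int64_t, int64_t> mapping;
  for (size_t i = 0; i < num_seeds; ++i) {
    mapping.emplace(ids[i], static_cast<int64_t>(i));
  }
  int64_t next = static_cast<int64_t>(num_seeds);
  for (size_t i = num_seeds; i < ids.size(); ++i) {
    // emplace() leaves an existing entry untouched and reports it via .second.
    if (mapping.emplace(ids[i], next).second) ++next;
  }
  return mapping;
}
```

In this sequential sketch the non-seed ids also get a deterministic order; a concurrent version would only be expected to keep the seed part stable.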

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
I've leveraged the tools to beautify the python and c++ code.
The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referred in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot · 2023-02-16T06:15:00Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2023-02-16T08:36:19Z

Commit ID: abd5e5eaf3021519e5da17310e81a3a4b154def1

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot · 2023-02-16T09:57:09Z

Commit ID: 85a12cc16f348f2f07fd360061bbf40ee051b6d5

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot · 2023-02-17T06:13:49Z

Commit ID: 78b467807c46bf3d72fa75a218b07118850e5afa

Build ID: 3

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot · 2023-02-17T06:52:08Z

Commit ID: 172e7ef809b84a4d4710856d829476797709eb01

Build ID: 4

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

frozenbugs · 2023-02-17T07:50:53Z

src/array/cpu/concurrent_id_hash_map.cc

  memset(hash_map_.get(), -1, sizeof(Mapping) * capacity);

  // This code block is to fill the ids into hash_map_.
  IdArray unique_ids = NewIdArray(num_ids, ctx, sizeof(IdType) * 8);
+  IdType* unique_ids_data = unique_ids.Ptr<IdType>();
+  // Fill in the first `num_seeds` ids.
+  parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {


What's the common scale of num_seeds? Since kGrainSize is 256 already, do we need to use parallel?

The scale depends on your fan-out; it is usually about 1/10 of the original nodes. When the number of input nodes is huge, it can also be very large. Parallelizing doesn't introduce side effects, so it is better to keep it here.
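To illustrate why keeping the parallel loop costs little for small inputs, here is a sketch of a typical grain-size policy. `NumWorkers` and its parameters are hypothetical and only assume that the runtime assigns roughly one worker per grain; the actual policy of DGL's `parallel_for` is not shown in this thread and may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical grain-size policy (an assumption, not DGL's code): one worker
// per `grain_size` items, capped at `max_threads`, and a single worker when
// the whole range fits in one grain.
inline int64_t NumWorkers(int64_t begin, int64_t end, int64_t grain_size,
                          int64_t max_threads) {
  if (end - begin <= grain_size) return 1;
  const int64_t chunks = (end - begin + grain_size - 1) / grain_size;
  return std::min(max_threads, chunks);
}
```

Under this assumption, a range of 100 seeds with a grain size of 256 runs on one worker, so the parallel form only fans out once num_seeds is large.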

frozenbugs · 2023-02-17T07:51:39Z

src/array/cpu/concurrent_id_hash_map.cc

+  // Fill in the first `num_seeds` ids.
+  parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {
+    for (int64_t i = s; i < e; i++) {
+      InsertAndSet(ids_data[i], static_cast<IdType>(i));


Why don't we use AttemptInsertAt for seed ids?

A seed id's mapped value is exactly its index in the array, so the key and value need to be set at the same time.

Seed ids are unique, so the insertion is simpler and some checks can be removed to save work.
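The difference between the two insertion paths can be sketched with a miniature open-addressing table. Everything here is hypothetical (`TinyIdTable`, linear probing, the power-of-two capacity); only the names `InsertAndSet` and `Insert` come from the diff, and the real signatures and probing scheme may differ. The sketch assumes the table is never full.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical miniature of an open-addressing id table (illustration only).
class TinyIdTable {
 public:
  static constexpr int64_t kEmpty = -1;

  // `capacity` must be a power of two and larger than the number of ids.
  explicit TinyIdTable(size_t capacity)
      : mask_(capacity - 1),
        keys_(new std::atomic<int64_t>[capacity]),
        values_(new int64_t[capacity]) {
    assert(capacity > 0 && (capacity & mask_) == 0);
    for (size_t i = 0; i < capacity; ++i) {
      keys_[i].store(kEmpty, std::memory_order_relaxed);
      values_[i] = kEmpty;
    }
  }

  // Seed path: the caller guarantees `id` is not present yet, so the first
  // slot claimed belongs to this call and the value is written right away.
  void InsertAndSet(int64_t id, int64_t value) {
    size_t pos = static_cast<size_t>(id) & mask_;
    for (;;) {
      int64_t expected = kEmpty;
      if (keys_[pos].compare_exchange_strong(expected, id)) {
        values_[pos] = value;
        return;
      }
      pos = (pos + 1) & mask_;  // linear probing, chosen for brevity
    }
  }

  // General path: returns true only if this call inserted `id`; it needs the
  // extra "already present" check that the seed path can skip.
  bool Insert(int64_t id) {
    size_t pos = static_cast<size_t>(id) & mask_;
    for (;;) {
      int64_t expected = kEmpty;
      if (keys_[pos].compare_exchange_strong(expected, id)) return true;
      if (expected == id) return false;
      pos = (pos + 1) & mask_;
    }
  }

  // Returns the stored value, or kEmpty if `id` is absent or has no value yet.
  int64_t ValueOf(int64_t id) const {
    size_t pos = static_cast<size_t>(id) & mask_;
    for (size_t probes = 0; probes <= mask_; ++probes) {
      const int64_t key = keys_[pos].load();
      if (key == id) return values_[pos];
      if (key == kEmpty) return kEmpty;
      pos = (pos + 1) & mask_;
    }
    return kEmpty;
  }

 private:
  size_t mask_;
  std::unique_ptr<std::atomic<int64_t>[]> keys_;
  std::unique_ptr<int64_t[]> values_;
};
```

Because each seed's value is its own index, it is known at insertion time; the values of the remaining ids depend on how many unique ids exist, so in this sketch they would be filled in by a later pass.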

frozenbugs · 2023-02-17T07:53:30Z

src/array/cpu/concurrent_id_hash_map.h

 *
- * For example, for an array A with following entries:
- * [98, 98, 100, 99, 97, 99, 101, 100, 102]
+ * For example, for an array A having 4 seed ids with following entries:


Can you clarify what are the seed ids?

Sure, added.

frozenbugs · 2023-02-17T07:54:47Z

src/array/cpu/concurrent_id_hash_map.h

-   * And then insert the items in `ids` concurrently to generate the
-   * mappings, in passing returning the unique ids in `ids`.
+   * @brief Initialize the hashmap with an array of ids. The first `num_seeds`
+   * ids are unqiue and must be mapped to a contiguous array starting


unqiue -> unique

frozenbugs · 2023-02-17T07:55:20Z

src/array/cpu/concurrent_id_hash_map.h

- * For example, for an array A with following entries:
- * [98, 98, 100, 99, 97, 99, 101, 100, 102]
+ * For example, for an array A having 4 seed ids with following entries:
+ * [99, 98, 100, 97, 97, 101, 101, 102, 101]


In the comment below you mentioned that the num_seeds ids are unique. I am assuming the first 4 are seed ids, but I see a duplicated 97 in this example. Is this intended?

Yes. Seed ids are unique among themselves, but they can be duplicated by other ids. Putting one here may help clarify that for users.
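The example array can be checked against a small sequential reference. `UniqueIdsSeedsFirst` is a hypothetical helper, not part of the PR; it keeps the seed ids as a prefix in their original order and appends each remaining id the first time it appears.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical sequential reference for the array of unique ids.
std::vector<int64_t> UniqueIdsSeedsFirst(const std::vector<int64_t>& ids,
                                         size_t num_seeds) {
  assert(num_seeds <= ids.size());
  const auto seeds_end = ids.begin() + static_cast<std::ptrdiff_t>(num_seeds);
  std::vector<int64_t> unique_ids(ids.begin(), seeds_end);
  std::unordered_set<int64_t> seen(unique_ids.begin(), unique_ids.end());
  assert(seen.size() == num_seeds);  // seed ids must be unique among themselves
  for (size_t i = num_seeds; i < ids.size(); ++i) {
    // A non-seed id that repeats a seed (like the second 97) is skipped here.
    if (seen.insert(ids[i]).second) unique_ids.push_back(ids[i]);
  }
  return unique_ids;
}
```

For [99, 98, 100, 97, 97, 101, 101, 102, 101] with 4 seed ids, this sketch returns [99, 98, 100, 97, 101, 102]. In the concurrent map only the first four positions are guaranteed; 101 and 102 could come out in either order.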

frozenbugs · 2023-02-20T07:36:52Z

src/array/cpu/concurrent_id_hash_map.h

- * mapped to [0, num_seed_ids) and `left ids` to [num_seed_ids, num_unique_ids).
- * Notice that mapping order is stable for `seed ids` while not for the left.
+ * divided into 2 parts: [`seed ids`, `left ids`]. `Seed ids` refer to
+ * a set ids chosen as the input for sampling process and `left ids` are the


left ids -> sampled ids.

Sampled ids are the ids newly sampled by the process (note that the seed ids might be sampled in the process, but are not included in the sampled ids to avoid duplication).

Good description, adopted in the notes.
One small correction: seed ids can also be included in the sampled ids.

dgl-bot · 2023-02-20T07:50:32Z

Commit ID: edeccf11f8f839f33bd319c64bad5a1f67a3645d

Build ID: 5

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot · 2023-02-20T11:24:23Z

Commit ID: cf9830eb1fc83755d827891f0248d97f04485319

Build ID: 6

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

frozenbugs · 2023-02-20T12:53:17Z

src/array/cpu/concurrent_id_hash_map.cc

@@ -111,7 +111,9 @@ IdArray ConcurrentIdHashMap<IdType>::Init(
  parallel_for(num_seeds, num_ids, kGrainSize, [&](int64_t s, int64_t e) {
    size_t count = 0;
    for (int64_t i = s; i < e; i++) {
-      Insert(ids_data[i], &valid, i);
+      if (Insert(ids_data[i])) {


I am assuming each i will only be accessed once, so it can be simplified to:
valid[i] = Insert(ids_data[i]);
count += valid[i];

This is actually better, since the existing code at L107 assumes valid is initialized to 0, which might not hold for every C++ compiler.

Makes sense. Changed to this style.
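A sequential sketch of the suggested pattern, for illustration only: `InsertOnce` is a hypothetical stand-in for the hash map's `Insert`, and the element type of `valid` is an assumption. The point is that every slot of `valid` is assigned unconditionally, so the result does not depend on the buffer's initial contents.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the hash map's Insert: true if `id` was new.
inline bool InsertOnce(std::unordered_set<int64_t>* seen, int64_t id) {
  return seen->insert(id).second;
}

// Marks, for every position, whether that position introduced a new id and
// returns how many positions did. Each slot is overwritten exactly once.
size_t MarkFirstOccurrences(const std::vector<int64_t>& ids,
                            std::vector<int16_t>* valid) {
  valid->resize(ids.size());
  std::unordered_set<int64_t> seen;
  size_t count = 0;
  for (size_t i = 0; i < ids.size(); ++i) {
    (*valid)[i] = InsertOnce(&seen, ids[i]);
    count += static_cast<size_t>((*valid)[i]);
  }
  return count;
}
```

Passing in a `valid` buffer pre-filled with arbitrary values still yields only 0/1 flags, which is the property the review comment asks for.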

frozenbugs · 2023-02-20T12:54:51Z

src/array/cpu/concurrent_id_hash_map.h

+  /**
+   * @brief The result state of an attempt to insert.
+   */
+  enum class InsertState {


move to private section?

dgl-bot · 2023-02-21T08:13:02Z

Commit ID: e2976db

Build ID: 8

Status: ✅ CI test succeeded

Report path: link

Full logs path: link
Note: A new CI run will cancel previous CI runs, but an incorrect "success"
status might be shown for the previous runs. Please double check the report
before merging the PR.

* change concurrent id hash map

refactor id hash map

178d594

peizhou001 changed the title ~~Refactor id hash map~~ [Enhancement]Refactor id hash map Feb 16, 2023

peizhou001 self-assigned this Feb 16, 2023

peizhou001 added the topic: system performance label Feb 16, 2023

peizhou001 changed the title ~~[Enhancement]Refactor id hash map~~ [Enhancement] Change id hash map Feb 16, 2023

peizhou001 marked this pull request as ready for review February 17, 2023 04:24

peizhou001 requested review from BarclayII, frozenbugs and Rhett-Ying February 17, 2023 04:24

lint

c350b53

frozenbugs reviewed Feb 17, 2023

add notes

028ba30

frozenbugs reviewed Feb 20, 2023

add Insert state

8a3bf4c

frozenbugs reviewed Feb 20, 2023

frozenbugs approved these changes Feb 20, 2023

Ubuntu and others added 2 commits February 21, 2023 03:57

fix issue

70bb8d0

Merge branch 'master' into peizhou/changeidhashmap

e2976db

peizhou001 merged commit ed2e540 into dmlc:master Feb 21, 2023

peizhou001 deleted the peizhou/changeidhashmap branch February 21, 2023 08:47

paoxiaode pushed a commit to paoxiaode/dgl that referenced this pull request Mar 24, 2023

[Enhancement] Change id hash map (dmlc#5304)

3ae7272

* change concurrent id hash map

DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024

[Enhancement] Change id hash map (dmlc#5304)

215a8f1

* change concurrent id hash map
