
CPU Multi-process based sampling performing worse in DGL 1.1.2 as compared to DGL 1.1.1 #6315

Closed
UtkrishtP opened this issue Sep 12, 2023 · 7 comments
UtkrishtP commented Sep 12, 2023

🐛 Performance Bug

Hello Team,

I have been conducting some rigorous experiments measuring sampling time with CPU-based multi-process sampling (num_workers > 0).

DGL 1.1.1

Experiment details:

Dataset : ogbn-papers100M
Sampler : Neighbor Sampler
Fanout : [10,10,10]
Batch_size : [512, 1024, 2048, 4096, 8192]
# workers : [1, 2, 4, 8, 16, 32, 64, 128]

Hardware details:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127

Sample code used to measure sampling time:

from timeit import default_timer as timer

from dgl.dataloading import DataLoader, NeighborSampler

# `g` (the graph), `dataset`, `device`, `batch`, and `workers` are set up earlier in the script.
train_idx = dataset.train_idx
val_idx = dataset.val_idx

sampler = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,
)

# Warm-up: iterate once so that worker-process launch overhead is not measured.
# persistent_workers=True reuses the same worker pool for the timed epoch below.
for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
    break

for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
        continue
    end = timer()
    print("Sampling Time : ", end - start)

DGL 1.1.2

Experiment details:
In addition to the default (fused) neighbor sampler, we also measure the neighbor sampler with fused=False.

Dataset : ogbn-papers100M
Sampler : Neighbor Sampler
Fanout : [10,10,10]
Batch_size : [512, 1024, 2048, 4096, 8192]
# workers : [0, 4, 8, 16, 32]

Sample code to measure time:

train_idx = dataset.train_idx
val_idx = dataset.val_idx

sampler = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

sampler_ = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    fused = False,
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,    
)

train_dataloader_ = DataLoader(
    g,
    train_idx,
    sampler_,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,    
)

# Warm-up: iterate once over each dataloader so that worker-process launch overhead
# is not measured; persistent_workers=True reuses the same worker pool for the timed epochs.

for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader
    ):
    break

for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader_
    ):
    break


for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader
    ):
        continue
    
    end = timer()
    print("Fused Sampling : ", end - start)

for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader_
    ):
        continue
    
    end = timer()
    print("Neighbor Sampling : ", end - start)

Results

NOTE : All the results are for 1 epoch.

DGL 1.1.1

Below are the sampling times for all the combinations of workers and batch_sizes:

[image: sampling times for all worker/batch-size combinations, DGL 1.1.1]

Here, I have listed out the best performing combination:

[image: best-performing worker/batch-size combinations, DGL 1.1.1]

DGL 1.1.2

As per the release notes (#5924), the introduction of fused neighbor sampling yields a performance improvement, especially for the # workers = 0 case.

The red bars are for fused sampling, whereas the blue bars are for neighbor sampling (fused=False).

[image: fused vs. non-fused (fused=False) sampling times, DGL 1.1.2]

As per the logic and the claims made in #5328 (comment):

  • We should see a performance improvement when using multi-processing.
  • The performance should be at least the same as, or better than, DGL 1.1.1.

Both of these claims are contradicted, as explained below.

Observations

Case 1

Sampling times for the fused neighbor sampler
(red bars are the # workers = 0 case):

[image: fused neighbor sampler times; red bars are # workers = 0]
  • For the smaller batch sizes (512 and 1024), it performs worse with more workers than its # workers = 0 counterpart.
  • For larger batch sizes we see a very slight performance improvement with many workers, but with fewer workers it still performs worse.

Case 2:

Sampling times for the neighbor sampler (fused=False)
(the red bars are the best DGL 1.1.1 case when using the multi-process neighbor sampler):

[image: non-fused (fused=False) sampling times, DGL 1.1.2; red bars are the DGL 1.1.1 multi-process best case]
  • These numbers should have been the same as, or better than, their DGL 1.1.1 counterparts.
  • Instead, they are worse by a significant margin.

Case 3:

Here we compare fused neighbor sampling against DGL 1.1.1's best CPU multi-process case
(red bars highlight the DGL 1.1.1 CPU multi-process best case):

[image: fused neighbor sampling, DGL 1.1.2, vs. the DGL 1.1.1 CPU multi-process best case (red bars)]
  • Except for batch size 512, fused neighbor sampling does not perform as well as the DGL 1.1.1 neighbor sampler.
  • Even with higher num_workers, the fused neighbor sampler cannot catch up with the DGL 1.1.1 best case.

Based on the above observations, I suspect a performance bug in the latest DGL 1.1.2.
Let me know if any other information is required.

TIA.

@UtkrishtP UtkrishtP changed the title CPU Multi-process based sampling performing worse in DGL 1.2 as compared to DGL 1.1 CPU Multi-process based sampling performing worse in DGL 1.1.2 as compared to DGL 1.1.1 Sep 13, 2023
anko-intel (Collaborator) commented Sep 14, 2023

@UtkrishtP could you also provide "model name" from lscpu?

UtkrishtP (Author):
@anko-intel Sure, here is the output:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
Stepping:                        7
CPU MHz:                         1201.461
CPU max MHz:                     3900.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5600.00
Virtualization:                  VT-x
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        64 MiB
L3 cache:                        88 MiB
NUMA node0 CPU(s):               0-15,64-79
NUMA node1 CPU(s):               16-31,80-95
NUMA node2 CPU(s):               32-47,96-111
NUMA node3 CPU(s):               48-63,112-127

czkkkkkk (Collaborator):
@UtkrishtP, thanks for your comprehensive study on DGL. Let's discuss the issues one by one.

  • DGL multi-process fused neighbor sampling is slower than single-process. We found this is because PyTorch limits the threads used by workers if multi-processing is enabled (a possible workaround is sketched after this list).
  • The non-fused neighbor sampling of DGL 1.1.2 is slower than in DGL 1.1.1. This could really be a problem. To understand more about this issue, could you help us profile some data points for a direct comparison between the two versions? For example, a direct comparison with the same number of workers and batch size.
  • The fused neighbor sampler is slower than the non-fused neighbor sampler in some cases. From your results, we cannot safely draw this conclusion, because you are comparing fused sampling in DGL 1.1.2 against non-fused sampling in DGL 1.1.1. Considering the possible performance regression of DGL 1.1.2 in the previous point, it would be fairer to compare both fused and non-fused sampling on DGL 1.1.2.
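
A minimal sketch of the kind of workaround meant above, assuming DGL's DataLoader forwards worker_init_fn to the underlying torch.utils.data.DataLoader and that the worker thread cap is indeed the limiting factor (not verified here):

import torch

# Hypothetical worker_init_fn: raise the intra-op thread count inside each sampling
# worker, since (per the point above) PyTorch limits the threads used by workers
# when multi-processing is enabled.
def raise_worker_threads(worker_id):
    torch.set_num_threads(8)  # e.g. 4 workers x 8 threads each on a 32-core machine

# Reuses g, train_idx, sampler, device, and batch from the snippets above.
train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    num_workers=4,
    persistent_workers=True,
    worker_init_fn=raise_worker_threads,  # assumption: extra kwargs are passed through to torch
)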

UtkrishtP (Author):
@czkkkkkk, thanks for getting back and looking into this issue.

DGL multi-process fused neighbor sampling is slower than single-process. We found this is because PyTorch limits the threads used by workers if multi-processing is enabled.

  • Is there any workaround for this to improve performance? Also, compared with its non-fused counterpart in DGL 1.1.1, performance there increases as we scale # workers.

  • Is there a reason why this has changed? It would be great if you could provide a more detailed explanation, or point to the part of the PyTorch dataloader code that changed.

  • Up to DGL 1.1.1, scaling # workers for CPU-based sampling yielded great performance, so I am curious why this behavior was changed in PyTorch's dataloader recently.

The non-fused neighbor sampling of DGL 1.1.2 is slower than in DGL 1.1.1. This could really be a problem. To understand more about this issue, could you help us profile some data points for a direct comparison between the two versions? For example, a direct comparison with the same number of workers and batch size.

  • I have added a graph in the DGL 1.1.1 section that shows runtimes for various batch-size and worker combinations, and in the experiment section I have compared the best case for DGL 1.1.1 with DGL 1.1.2. Attaching the side-by-side comparison below:
[images: side-by-side comparison of DGL 1.1.1 and DGL 1.1.2 runtimes across batch sizes and worker counts]

The fused neighbor sampler is slower than the non-fused neighbor sampler in some cases. From your results, we cannot safely draw this conclusion, because you are comparing fused sampling in DGL 1.1.2 against non-fused sampling in DGL 1.1.1. Considering the possible performance regression of DGL 1.1.2 in the previous point, it would be fairer to compare both fused and non-fused sampling on DGL 1.1.2.

  • The reason I showed this comparison is that the fused sampling implementation ([RFC] Faster CPU sampling through fused sampling+compaction, #5328) merges the two operations together, and it was claimed, and shown experimentally, that performance improves significantly over the non-fused neighbor sampling counterpart.

  • Hence it should have been better than, or at least equal to, DGL 1.1.1's non-fused neighbor sampling, but it seems to have regressed.

Let me know if you have any further questions or need any data.

agrabows (Contributor):
I tried to reproduce this regression, initially using a CPU-only configuration on an AWS machine (AMI: ami-0ff11ac96c22c53a5, type: r6i.32xlarge - 8375C) with the prefetcher options turned off, and then on an AWS machine with a GPU (AMI: ami-0705983c654abda59, type: g4dn.16xlarge - 8259CL); both attempts failed to show the regression.
[images: reproduction results on the CPU-only and GPU AWS instances]
Are there other configurations that show this performance regression of the unfused neighbor sampler between versions 1.1.1 and 1.1.2?
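
For clarity, "turning off prefetcher options" refers to a CPU-only DataLoader configured roughly like the sketch below (shown as an assumed illustration, not an exact record of the runs):

import torch

# Same graph / sampler / index setup as in the original snippets; only the
# prefetcher-related options differ from the issue's configuration.
train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=torch.device("cpu"),
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=False,
    use_prefetch_thread=False,
    use_alternate_streams=False,
    persistent_workers=True,
)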

github-actions:
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

frozenbugs (Collaborator):
@UtkrishtP Can you double-check whether the problem is resolved or not? If not, feel free to reopen and we can investigate further.
