
CPU Multi-process based sampling performing worse in DGL 1.1.2 as compared to DGL 1.1.1 #6315

Closed
UtkrishtP opened this issue Sep 12, 2023 · 7 comments
UtkrishtP commented Sep 12, 2023

🐛 Performance Bug

Hello Team,

I have been conducting some rigorous experiments measuring sampling time with CPU-based multi-process sampling (num_workers > 0).

DGL 1.1.1

Experiment details:

Dataset : ogbn-papers100M
Sampler : Neighbor Sampler
Fanout : [10,10,10]
Batch_size : [512, 1024, 2048, 4096, 8192]
# workers : [1, 2, 4, 8, 16, 32, 64, 128]

Hardware details:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127

Sample code used to measure sampling time:

from timeit import default_timer as timer

from dgl.dataloading import DataLoader, NeighborSampler

# `g` (the graph), `dataset`, `device`, `batch`, and `workers` are set up earlier in the script.
train_idx = dataset.train_idx
val_idx = dataset.val_idx

sampler = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,
)

# Warm-up: iterate once so that worker-process launch overhead is not measured.
# persistent_workers=True reuses the same worker pool for the timed epoch below.
for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
    break

for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
        continue
    end = timer()
    print("Sampling Time : ", end - start)

DGL 1.1.2

Experiment details:
In addition to the default (fused) neighbor sampler, we also measure the neighbor sampler with fused=False.

Dataset : ogbn-papers100M
Sampler : Neighbor Sampler
Fanout : [10,10,10]
Batch_size : [512, 1024, 2048, 4096, 8192]
# workers : [0, 4, 8, 16, 32]

Sample code to measure time:

train_idx = dataset.train_idx
val_idx = dataset.val_idx

sampler = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

sampler_ = NeighborSampler(
    [10, 10, 10],  # fanout for [layer-0, layer-1, layer-2]
    fused = False,
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
)

train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,    
)

train_dataloader_ = DataLoader(
    g,
    train_idx,
    sampler_,
    device=device,
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=True,
    use_prefetch_thread=True,
    use_alternate_streams=True,
    persistent_workers=True,    
)

# Warm-up: iterate once over each dataloader so that worker-process launch overhead
# is not measured; persistent_workers=True reuses the same worker pool for the timed epochs.

for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader
    ):
    break

for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader_
    ):
    break


for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader
    ):
        continue
    
    end = timer()
    print("Fused Sampling : ", end - start)

for epoch in range(1):
    start = timer()
    for it, (input_nodes, output_nodes, blocks) in enumerate(
        train_dataloader_
    ):
        continue
    
    end = timer()
    print("Neighbor Sampling : ", end - start)

Results

NOTE : All the results are for 1 epoch.

DGL 1.1.1

Below are the sampling times for all the combinations of workers and batch_sizes:

[image: sampling times for all worker/batch-size combinations, DGL 1.1.1]

Here, I have listed out the best performing combination:

[image: best-performing worker/batch-size combinations, DGL 1.1.1]

DGL 1.1.2

As per the release notes (#5924), the introduction of fused neighbor sampling yields a performance improvement, especially for the # workers = 0 case.

The red bars are for fused sampling, whereas the blue bars are for neighbor sampling (fused=False).

[image: fused vs. non-fused (fused=False) sampling times, DGL 1.1.2]

As per the logic and the claims made in #5328 (comment):

  • We should see a performance improvement when using multi-processing.
  • The performance should be at least the same as, or better than, DGL 1.1.1.

Both of these claims are contradicted, as explained below.

Observations

Case 1

Sampling times for the fused neighbor sampler
(red bars are the # workers = 0 case):

[image: fused neighbor sampler times; red bars are # workers = 0]
  • For the smaller batch sizes (512 and 1024), it performs worse with more workers than its # workers = 0 counterpart.
  • For larger batch sizes we see a very slight performance improvement with many workers, but with fewer workers it still performs worse.

Case 2:

Sampling times for the neighbor sampler (fused=False)
(the red bars are the best DGL 1.1.1 case when using the multi-process neighbor sampler):

[image: non-fused (fused=False) sampling times, DGL 1.1.2; red bars are the DGL 1.1.1 multi-process best case]
  • These numbers should have been the same as, or better than, their DGL 1.1.1 counterparts.
  • Instead, they are worse by a significant margin.

Case 3:

Here we compare fused neighbor sampling against DGL 1.1.1's best CPU multi-process case
(red bars highlight the DGL 1.1.1 CPU multi-process best case):

[image: fused neighbor sampling, DGL 1.1.2, vs. the DGL 1.1.1 CPU multi-process best case (red bars)]
  • Except for batch size 512, fused neighbor sampling does not perform as well as the DGL 1.1.1 neighbor sampler.
  • Even with higher num_workers, the fused neighbor sampler cannot catch up with the DGL 1.1.1 best case.

Based on the above observations, I suspect a performance bug in the latest DGL 1.1.2.
Let me know if any other information is required.

TIA.

@UtkrishtP UtkrishtP changed the title CPU Multi-process based sampling performing worse in DGL 1.2 as compared to DGL 1.1 CPU Multi-process based sampling performing worse in DGL 1.1.2 as compared to DGL 1.1.1 Sep 13, 2023
anko-intel (Collaborator) commented Sep 14, 2023

@UtkrishtP could you also provide "model name" from lscpu?

UtkrishtP (Author):
@anko-intel Sure, here is the output:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
Stepping:                        7
CPU MHz:                         1201.461
CPU max MHz:                     3900.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5600.00
Virtualization:                  VT-x
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        64 MiB
L3 cache:                        88 MiB
NUMA node0 CPU(s):               0-15,64-79
NUMA node1 CPU(s):               16-31,80-95
NUMA node2 CPU(s):               32-47,96-111
NUMA node3 CPU(s):               48-63,112-127

czkkkkkk (Collaborator):
@UtkrishtP, thanks for your comprehensive study on DGL. Let's discuss the issues one by one.

  • DGL multi-process fused neighbor sampling is slower than single-process. We found this is because PyTorch limits the threads used by workers if multi-processing is enabled (a possible workaround is sketched after this list).
  • The non-fused neighbor sampling of DGL 1.1.2 is slower than in DGL 1.1.1. This could really be a problem. To understand more about this issue, could you help us profile some data points for a direct comparison between the two versions? For example, a direct comparison with the same number of workers and batch size.
  • The fused neighbor sampler is slower than the non-fused neighbor sampler in some cases. From your results, we cannot safely draw this conclusion, because you are comparing fused sampling in DGL 1.1.2 against non-fused sampling in DGL 1.1.1. Considering the possible performance regression of DGL 1.1.2 in the previous point, it would be fairer to compare both fused and non-fused sampling on DGL 1.1.2.
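
A minimal sketch of the kind of workaround meant above, assuming DGL's DataLoader forwards worker_init_fn to the underlying torch.utils.data.DataLoader and that the worker thread cap is indeed the limiting factor (not verified here):

import torch

# Hypothetical worker_init_fn: raise the intra-op thread count inside each sampling
# worker, since (per the point above) PyTorch limits the threads used by workers
# when multi-processing is enabled.
def raise_worker_threads(worker_id):
    torch.set_num_threads(8)  # e.g. 4 workers x 8 threads each on a 32-core machine

# Reuses g, train_idx, sampler, device, and batch from the snippets above.
train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=device,
    batch_size=batch,
    num_workers=4,
    persistent_workers=True,
    worker_init_fn=raise_worker_threads,  # assumption: extra kwargs are passed through to torch
)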

UtkrishtP (Author):
@czkkkkkk, thanks for getting back and looking into this issue.

DGL multi-process fused neighbor sampling is slower than single-process. We found this is because PyTorch limits the threads used by workers if multi-processing is enabled.

  • Is there any workaround for this to improve performance? Also, compared with its non-fused counterpart in DGL 1.1.1, performance there increases as we scale # workers.

  • Is there a reason why this has changed? It would be great if you could provide a more detailed explanation, or point to the part of the PyTorch dataloader code that changed.

  • Up to DGL 1.1.1, scaling # workers for CPU-based sampling yielded great performance, so I am curious why this behavior was changed in PyTorch's dataloader recently.

The non-fused neighbor sampling of DGL 1.1.2 is slower than in DGL 1.1.1. This could really be a problem. To understand more about this issue, could you help us profile some data points for a direct comparison between the two versions? For example, a direct comparison with the same number of workers and batch size.

  • I have added a graph in the DGL 1.1.1 section that shows runtimes for various batch-size and worker combinations, and in the experiment section I have compared the best case for DGL 1.1.1 with DGL 1.1.2. Attaching the side-by-side comparison below:
[images: side-by-side comparison of DGL 1.1.1 and DGL 1.1.2 runtimes across batch sizes and worker counts]

The fused neighbor sampler is slower than the non-fused neighbor sampler in some cases. From your results, we cannot safely draw this conclusion, because you are comparing fused sampling in DGL 1.1.2 against non-fused sampling in DGL 1.1.1. Considering the possible performance regression of DGL 1.1.2 in the previous point, it would be fairer to compare both fused and non-fused sampling on DGL 1.1.2.

  • The reason I showed this comparison is that the fused sampling implementation ([RFC] Faster CPU sampling through fused sampling+compaction, #5328) merges the two operations together, and it was claimed, and shown experimentally, that performance improves significantly over the non-fused neighbor sampling counterpart.

  • Hence it should have been better than, or at least equal to, DGL 1.1.1's non-fused neighbor sampling, but it seems to have regressed.

Let me know if you have any further questions or need any data.

agrabows (Contributor):
I tried to reproduce this regression, initially using a CPU-only configuration on an AWS machine (AMI: ami-0ff11ac96c22c53a5, type: r6i.32xlarge - 8375C) with the prefetcher options turned off, and then on an AWS machine with a GPU (AMI: ami-0705983c654abda59, type: g4dn.16xlarge - 8259CL); both attempts failed to show the regression.
[images: reproduction results on the CPU-only and GPU AWS instances]
Are there other configurations that show this performance regression of the unfused neighbor sampler between versions 1.1.1 and 1.1.2?
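
For clarity, "turning off prefetcher options" refers to a CPU-only DataLoader configured roughly like the sketch below (shown as an assumed illustration, not an exact record of the runs):

import torch

# Same graph / sampler / index setup as in the original snippets; only the
# prefetcher-related options differ from the issue's configuration.
train_dataloader = DataLoader(
    g,
    train_idx,
    sampler,
    device=torch.device("cpu"),
    batch_size=batch,
    shuffle=True,
    drop_last=False,
    num_workers=workers,
    use_uva=False,
    pin_prefetcher=False,
    use_prefetch_thread=False,
    use_alternate_streams=False,
    persistent_workers=True,
)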

github-actions:
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

frozenbugs (Collaborator):
@UtkrishtP Can you double-check whether the problem is resolved or not? If not, feel free to reopen and we can investigate further.
