[Feature] Add multi-GPU UnifiedTensor unit test #3184

Merged Aug 2, 2021 (15 commits)

@davidmin7 (Contributor) commented Jul 25, 2021

Description

This is a follow-up to #3086. It adds a multi-GPU unit test case for the PyTorch backend using UnifiedTensor.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Add multi-gpu unified tensor test.

@dgl-bot (Collaborator) commented Jul 25, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@VoVAllen (Collaborator)

The bus error is probably related to the Docker settings. I will take a look at this tomorrow.

@davidmin7 davidmin7 marked this pull request as draft July 25, 2021 15:09
@davidmin7 (Contributor Author) commented Jul 25, 2021

> The bus error is probably related to the Docker settings. I will take a look at this tomorrow.

@VoVAllen Thanks for the quick reply! Based on my quick research, I believe the Docker container's shared memory space is too small. Here is the relevant link: pytorch/pytorch#2244.
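
For readers hitting the same bus error, one way to check this hypothesis is to look at the size of /dev/shm inside the container. Docker gives a container 64 MB of shared memory by default, and the `--shm-size` option of `docker run` raises it. The following is a minimal sketch, not code from this PR: the function name `shm_status` and the 1 GiB threshold are assumptions chosen for illustration.

```python
import shutil


def shm_status(path="/dev/shm", min_bytes=1 << 30):
    """Report whether the filesystem at `path` is at least `min_bytes` in size.

    `min_bytes` (1 GiB here) is an arbitrary illustrative threshold, not a
    value taken from this PR or from DGL.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "total": usage.total,
        "free": usage.free,
        "ok": usage.total >= min_bytes,
    }


if __name__ == "__main__":
    try:
        print(shm_status())
    except FileNotFoundError:
        print("/dev/shm not found on this system")
```

If `ok` is false inside the CI container, restarting it with a larger `--shm-size` would be the usual next thing to try.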

@classicsong (Contributor)

> The bus error is probably related to the Docker settings. I will take a look at this tomorrow.

Do we extend the shm size for the docker container?

@VoVAllen (Collaborator)

Fixed the shared memory problem.

@davidmin7 davidmin7 marked this pull request as ready for review July 31, 2021 13:18
@classicsong (Contributor) left a comment

LGTM

@classicsong classicsong merged commit c793593 into dmlc:master Aug 2, 2021
@davidmin7 davidmin7 deleted the dgl-unified-multigpu branch August 10, 2021 14:16
4 participants