
[Feat] Add CheckpointManager support with TFRA Dynamic Embedding Horovod Training. #57

Closed
wants to merge 2 commits

Conversation

@MoFHeka commented Dec 25, 2023

Description

CheckpointManager is now available in Deepray when training with TFRA Dynamic Embedding.
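
For reference, a minimal sketch of the kind of CheckpointManager usage this enables is shown below. It is illustrative only: model, optimizer, model_dir, and global_step are placeholders, and the real integration (including how per-worker dynamic embedding state is handled) lives in the Deepray trainer.

    import tensorflow as tf

    # Track the state to save/restore. With TFRA Dynamic Embedding the embedding
    # parameters live in hash tables rather than dense variables, which is why
    # checkpointing them needs explicit support in the trainer.
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(checkpoint, directory=model_dir, max_to_keep=3)

    # Resume from the newest checkpoint if one exists; this is a no-op otherwise.
    checkpoint.restore(manager.latest_checkpoint)

    # Periodically persist progress inside the training loop.
    manager.save(checkpoint_number=global_step)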

[fix] In deepray/core/base_trainer.py, gpu_affinity had no effect when the NVML shared library was not found.
[fix] In deepray/core/base_trainer.py line 784, self.loss_container.metrics may be empty when FLAGS.stop_steps = 0 in tools/testing/horovod_sync_train_test.py (a rough sketch of both guards follows).
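
The sketch below shows roughly what the two guards amount to; gpu_affinity, hvd, and the logging calls are stand-ins for the actual names in base_trainer.py, not the exact code.

    import logging
    import horovod.tensorflow as hvd

    def setup_gpu_affinity():
      # Best-effort affinity: if NVML cannot be loaded, log and continue
      # instead of letting training fail.
      try:
        import gpu_affinity  # placeholder for Deepray's affinity helper
        gpu_affinity.set_affinity(hvd.local_rank())
      except Exception as e:  # e.g. "NVML Shared Library Not Found"
        logging.warning("GPU affinity not set (%s); continuing without it.", e)

    def log_loss_metrics(loss_container):
      # The container's metrics list can be empty when FLAGS.stop_steps = 0,
      # so only iterate when there is something to report.
      if loss_container is not None and loss_container.metrics:
        for metric in loss_container.metrics:
          logging.info("%s: %f", metric.name, float(metric.result()))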

The added test script also supports testing TF Embedding when training with Horovod.

Type of change

  • New feature (CheckpointManager support for TFRA Dynamic Embedding training with Horovod)
  • Bug fix (gpu_affinity and loss_container.metrics fixes in deepray/core/base_trainer.py)

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running find ./ -name '*.py' -exec yapf --style=./.yapf -ir {} \;
    • By running pre-commit hooks
  • This PR addresses an already submitted issue for Deepray
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • This PR contains modifications to C++ custom-ops

How Has This Been Tested?

mpirun -np 2 -H localhost:2 --allow-run-as-root pytest -v tools/testing/horovod_sync_train_test.py

or

horovodrun -np 2  pytest -v tools/testing/horovod_sync_train_test.py
