[fleet_executor] Add entrance of FleetExecutor in AnalysisPredictor for distributed inference #39992
Thanks for your contribution!
LGTM
Don't use VLOG(3) for every log message; it produces redundant logs you don't need. Also, variable names should clearly express their meaning.
Add a concise description of the PR, like #37725.
LGTM
b6f9ec3 to fd41be6
LGTM
LGTM
LGTM
PR types
Others
PR changes
Others
Describe
Add an entry point for the fleet executor in AnalysisPredictor. Add some helper methods to initialize the NCCL environment for distributed inference.
To use the fleet executor for inference, the settings below should be configured in DistConfig.
Note that:
use_dist_model_ must be set to true by calling the corresponding DistConfig method.
nranks and rank are set together by a single call.
trainer_endpoints and current_endpoint are also set together, by calling SetEndpoints(std::vector<std::string> trainer_endpoints, std::string current_endpoint);
DistConfig should be set on AnalysisConfig by calling SetDistConfig(dConfig);
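The call order described above can be sketched roughly as follows. This is only an illustration built on small stand-in structs, not the Paddle classes themselves. SetEndpoints and SetDistConfig follow the signatures quoted in this description; EnableDistModel, SetRanks, the member names with trailing underscores, the helper MakeTwoRankConfig, and the endpoint addresses are all assumptions introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for illustration only; NOT the real Paddle DistConfig.
struct DistConfig {
  bool use_dist_model_ = false;
  int nranks_ = 1;
  int rank_ = 0;
  std::vector<std::string> trainer_endpoints_;
  std::string current_endpoint_;

  // Hypothetical name: the method that sets use_dist_model_ is not
  // preserved in the PR description.
  void EnableDistModel(bool enable) { use_dist_model_ = enable; }

  // Hypothetical name and signature: sets nranks and rank in one call.
  void SetRanks(int nranks, int rank) {
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    nranks_ = nranks;
    rank_ = rank;
  }

  // Signature as quoted in the PR description.
  void SetEndpoints(std::vector<std::string> trainer_endpoints,
                    std::string current_endpoint) {
    trainer_endpoints_ = std::move(trainer_endpoints);
    current_endpoint_ = std::move(current_endpoint);
  }
};

// Stand-in for illustration only; NOT the real Paddle AnalysisConfig.
struct AnalysisConfig {
  DistConfig dist_config_;
  void SetDistConfig(const DistConfig& dist_config) {
    dist_config_ = dist_config;
  }
};

// Hypothetical helper: builds the config for one rank of a two-process job.
AnalysisConfig MakeTwoRankConfig(int rank) {
  // Placeholder addresses, one per rank.
  const std::vector<std::string> endpoints = {"127.0.0.1:6170",
                                              "127.0.0.1:6171"};
  DistConfig dConfig;
  dConfig.EnableDistModel(true);
  dConfig.SetRanks(static_cast<int>(endpoints.size()), rank);
  dConfig.SetEndpoints(endpoints, endpoints[static_cast<std::size_t>(rank)]);

  AnalysisConfig config;
  config.SetDistConfig(dConfig);  // hand the distributed settings over last
  return config;
}
```

Each process would call MakeTwoRankConfig with its own rank, so that current_endpoint matches that rank's entry in trainer_endpoints.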
The converter config should have some sections like this: