Segmentation fault when Smart schedule is enabled #184
This means the feature dimension of the input / output tensors of the experts should be equal. I think your code fulfills this requirement. I am not able to reproduce your issue with the code you provided, using either randn or ones as input data. Can you please provide more information about the error, e.g. the shape of …? Also, you can try turning off some features, for example …
Thank you for your reply!
```python
class MoETorchTransformerBlock(TorchTransformerBlock):
    ...  # body elided in the original post

class FastMoe(FMoE):
    ...  # body elided in the original post
```
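The class bodies were lost above; purely for illustration, a minimal sketch of how a custom expert is usually wired into an `FMoE` subclass might look like the following. `MyExpert`, `d_model`, and the constructor arguments are assumptions (FastMoE's examples pass the expert as a callable that receives `d_model`), not the reporter's actual code:

```python
import torch
import torch.nn as nn
from fmoe import FMoE


class MyExpert(nn.Module):
    """Illustrative expert: maps d_model -> d_model, as FMoE requires."""

    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model)
        self.fc2 = nn.Linear(4 * d_model, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


class FastMoe(FMoE):
    def __init__(self, d_model, num_expert=1, world_size=1):
        # FMoE instantiates the expert callable once per local expert.
        super().__init__(num_expert=num_expert, d_model=d_model,
                         world_size=world_size,
                         expert=lambda d: MyExpert(d))
```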
When I don't turn on smart schedule, no errors occur, but when I add FMOE_FASTER_SCHEDULE_ENABLE=1 the segmentation fault occurs.
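For context, that switch is just an environment variable, so it can be set in the launch command or in Python before FastMoE reads its configuration; a minimal sketch (the script name in the comment is illustrative):

```python
import os

# Equivalent to launching with:
#   FMOE_FASTER_SCHEDULE_ENABLE=1 torchrun --nnodes=1 --nproc_per_node=4 train.py
# Set the variable before fmoe reads its configuration.
os.environ["FMOE_FASTER_SCHEDULE_ENABLE"] = "1"

import fmoe  # noqa: E402  (imported after the env var is set)
```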
> > whether the `local_expert_count` obtained on each card (world_size) is the same or different
In addition, `local_expert_count` is calculated by a function inside FMoE. Is it because my use of FMoE is written incorrectly that every `local_expert_count` comes out the same? My `num_expert` is set to 1, `world_size` is set from `os.environ['WORLD_SIZE']`, my `nnodes` is 1, and `nproc_per_node=4`.
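One way to answer that question empirically is to all-gather the tensor and compare it on rank 0. A sketch using plain `torch.distributed` (`gather_counts` is a hypothetical helper, and `local_expert_count` is whatever tensor you captured inside FMoE under pdb):

```python
import torch
import torch.distributed as dist

def gather_counts(local_expert_count: torch.Tensor) -> None:
    # Collect every rank's local_expert_count so rank 0 can compare them.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_expert_count) for _ in range(world_size)]
    dist.all_gather(gathered, local_expert_count)
    if dist.get_rank() == 0:
        for r, c in enumerate(gathered):
            print(f"rank {r}: local_expert_count = {c.tolist()}")
```

If all ranks print identical counts even though their inputs differ, the gate is likely seeing the same data (or the same device) on every process.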
Thank you very much for your guidance again.
Yes, it was indeed for this reason. Thank you very much for your help!
Well, thank you very much for reporting this issue and debugging it. I think we should explicitly specify the device of tensors when we allocate them in our library. We will update the codebase before closing this issue.
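The pattern the maintainer describes amounts to deriving the device from an existing tensor instead of relying on the default device; a generic sketch (`make_buffer` is a hypothetical helper, not FastMoE code):

```python
import torch

def make_buffer(inp: torch.Tensor, n: int) -> torch.Tensor:
    # Allocating with device=inp.device keeps the buffer on the same GPU
    # as the input; torch.zeros(n) alone would land on the default device
    # and can crash when mixed into multi-GPU communication.
    return torch.zeros(n, dtype=torch.long, device=inp.device)
```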
When I use a custom expert that inherits from the FMoE class and turn on Smart schedule, an error is reported.
I located the error with the pdb debugger.
It's not clear to me what it means that the input and output need to be constrained.
The input and output features have to be of the same length for the experts.
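Concretely, the constraint allows any hidden size inside the expert as long as the output feature dimension equals the input one; a small sketch with illustrative sizes:

```python
import torch.nn as nn

d_model, d_hidden = 512, 2048  # illustrative sizes

# OK: the output feature size matches the input feature size.
good_expert = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.GELU(),
    nn.Linear(d_hidden, d_model),
)

# Not OK for FMoE: the expert changes the feature dimension (512 -> 256),
# so the gathered outputs no longer line up with the input tensors.
bad_expert = nn.Sequential(nn.Linear(d_model, d_model // 2))
```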
My definition goes something like this:
DDP is also used:
`num_expert` is 1.
I want to run one expert per GPU (expert parallelism); the setup is sketched below.
I'd appreciate it if anyone could point me in the right direction.
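For completeness, the setup described above (one expert per process, four processes on one node) usually looks roughly like the sketch below, using FastMoE's ready-made `FMoETransformerMLP` and its `DistributedGroupedDataParallel` wrapper. Parameter names follow FastMoE's examples, and `d_model` is an assumption:

```python
import os
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel

# Launched with e.g.:  torchrun --nnodes=1 --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(rank)

d_model = 512  # illustrative size
# num_expert=1 with world_size=4 places exactly one expert on each GPU.
moe = FMoETransformerMLP(num_expert=1, d_model=d_model, d_hidden=4 * d_model,
                         world_size=world_size).cuda()
# FastMoE's wrapper (not torch's DistributedDataParallel) knows which
# parameters are expert-parallel and which must be all-reduced.
model = DistributedGroupedDataParallel(moe)
```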