Remove hardcoded value from softmax in flat_pa #280
Conversation
looks good, do we have a noticeable throughput loss with this fix?
We had one without the workaround using .max(). With it there is a slight drop, but I don't have exact numbers at hand.
I tested this change on mixtral and accuracy is still fine. Merging then.
This reverts commit 35a4a98.
This PR removes the hardcoded value used to normalize softmax in flat_pa. The current approach is to use the global maximum, as it is very easy to compute, but it has the drawback that other samples in a batch might slightly affect numerical stability. This is a first step toward eliminating some of the INF/NaN issues we see in certain configurations, and it is by no means a complete solution. This needs to be revised in the future.
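To illustrate the trade-off the description refers to, here is a minimal NumPy sketch (not the actual flat_pa kernel; function names are hypothetical) contrasting softmax normalized by a batch-global maximum with softmax normalized per sample. Both are mathematically identical, since the subtracted constant cancels in the ratio; the difference is purely numerical. With a global max, a sample whose own logits are far below the batch maximum can have all of its exponents underflow, which is how one sample in the batch can degrade another's stability.

```python
import numpy as np

def global_max_softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax shifted by the global maximum over the whole batch.

    Very cheap to compute, but couples samples: one sample's extreme
    logit shifts every other sample's exponents toward underflow.
    """
    shifted = scores - scores.max()  # single scalar max for the batch
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def per_sample_softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax shifted by each sample's own maximum.

    Slightly more work (a reduction per row), but each sample's
    numerics are independent of the rest of the batch.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)
```

For well-scaled inputs the two agree; the per-sample variant only pays off when logit magnitudes vary widely across the batch.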