SwiGLU: Support for no bias #504
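For readers unfamiliar with the operator, here is a minimal pure-Python sketch of a SwiGLU feed-forward block in which every bias is optional. It is an illustration only, not the kernel or the API touched by this PR: the helper names `silu`, `linear` and `swiglu`, the argument order, and the `[out][in]` weight layout are assumptions made for this example.

```python
import math


def silu(v):
    """SiLU / Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(x, weight, bias=None):
    """Apply y = x @ weight^T (+ bias) to a list of rows.

    weight is laid out as [out_features][in_features]; bias may be None.
    """
    out = []
    for row in x:
        out_row = []
        for j, w_row in enumerate(weight):
            acc = sum(a * b for a, b in zip(row, w_row))
            if bias is not None:  # the "no bias" case simply skips the add
                acc += bias[j]
            out_row.append(acc)
        out.append(out_row)
    return out


def swiglu(x, w1, b1, w2, b2, w3, b3):
    """Reference SwiGLU: (silu(x W1^T + b1) * (x W2^T + b2)) W3^T + b3."""
    x1 = linear(x, w1, b1)
    x2 = linear(x, w2, b2)
    hidden = [[silu(a) * b for a, b in zip(r1, r2)] for r1, r2 in zip(x1, x2)]
    return linear(hidden, w3, b3)


if __name__ == "__main__":
    x = [[1.0, 2.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    w3 = [[1.0, 1.0]]
    # Passing None for all three biases exercises the no-bias path.
    print(swiglu(x, identity, None, identity, None, w3, None))
```

Passing `None` for a bias and passing an all-zero bias give the same result in this sketch; the point of a dedicated no-bias path in a fused implementation is to avoid loading and adding a tensor that contributes nothing.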
ghstack-source-id: 5130cd816667ac26b5dce4db68e052b059e80229 Pull Request resolved: #504
**PERFORMANCE**

```
[-------------------------- swiglu_fw --------------------------]
                                          |  optimized  |  eager
1 threads: ------------------------------------------------------
  b16    B=9456, I=1536, H=4096 bias      |    1338.5   |  1568.4
  b16    B=9456, I=1536, H=4096 nobi      |    1268.3   |  1477.1
  f16    B=9456, I=1536, H=4096 bias      |    1305.3   |  1607.4
  f16    B=9456, I=1536, H=4096 nobi      |    1291.7   |  1493.0
  f16.ac B=9456, I=1536, H=4096 bias      |    1445.1   |  1754.7
  f16.ac B=9456, I=1536, H=4096 nobi      |    1434.0   |  1640.5
  b16    B=4440, I=1536, H=4096 bias      |     580.7   |   726.5
  b16    B=4440, I=1536, H=4096 nobi      |     578.7   |   725.1
  f16    B=4440, I=1536, H=4096 bias      |     597.4   |   736.5
  f16    B=4440, I=1536, H=4096 nobi      |     598.0   |   732.3
  f16.ac B=4440, I=1536, H=4096 bias      |     712.8   |   853.8
  f16.ac B=4440, I=1536, H=4096 nobi      |     701.0   |   841.7
  b16    B=4728, I=1536, H=4096 bias      |     620.3   |   772.6
  b16    B=4728, I=1536, H=4096 nobi      |     618.2   |   744.7
  f16    B=4728, I=1536, H=4096 bias      |     634.9   |   785.9
  f16    B=4728, I=1536, H=4096 nobi      |     633.7   |   750.8
  f16.ac B=4728, I=1536, H=4096 bias      |     746.8   |   897.1
  f16.ac B=4728, I=1536, H=4096 nobi      |     740.0   |   854.1
  b16    B=4728, I=1536, H=1024 bias      |     160.1   |   199.3
  b16    B=4728, I=1536, H=1024 nobi      |     158.6   |   196.4
  f16    B=4728, I=1536, H=1024 bias      |     163.6   |   202.1
  f16    B=4728, I=1536, H=1024 nobi      |     161.9   |   198.8
  f16.ac B=4728, I=1536, H=1024 bias      |     237.1   |   278.5
  f16.ac B=4728, I=1536, H=1024 nobi      |     227.3   |   265.1

Times are in microseconds (us).
```

```
[-------------------------- swiglu_bw --------------------------]
                                          |  optimized  |  eager
1 threads: ------------------------------------------------------
  b16    B=9456, I=1536, H=4096 bias      |    2223.2   |  2705.4
  b16    B=9456, I=1536, H=4096 nobi      |    2192.1   |  2568.4
  f16    B=9456, I=1536, H=4096 bias      |    2292.8   |  2700.6
  f16    B=9456, I=1536, H=4096 nobi      |    2183.5   |  2560.7
  f16.ac B=9456, I=1536, H=4096 bias      |    2603.5   |  2992.8
  f16.ac B=9456, I=1536, H=4096 nobi      |    2457.3   |  2834.5
  b16    B=4440, I=1536, H=4096 bias      |    1176.6   |  1418.9
  b16    B=4440, I=1536, H=4096 nobi      |    1159.1   |  1339.6
  f16    B=4440, I=1536, H=4096 bias      |    1199.8   |  1416.1
  f16    B=4440, I=1536, H=4096 nobi      |    1154.0   |  1335.1
  f16.ac B=4440, I=1536, H=4096 bias      |    1407.7   |  1631.2
  f16.ac B=4440, I=1536, H=4096 nobi      |    1348.7   |  1535.2
  b16    B=4728, I=1536, H=4096 bias      |    1233.5   |  1491.8
  b16    B=4728, I=1536, H=4096 nobi      |    1215.1   |  1409.0
  f16    B=4728, I=1536, H=4096 bias      |    1248.2   |  1486.0
  f16    B=4728, I=1536, H=4096 nobi      |    1208.1   |  1404.2
  f16.ac B=4728, I=1536, H=4096 bias      |    1480.1   |  1705.0
  f16.ac B=4728, I=1536, H=4096 nobi      |    1408.5   |  1605.9
  b16    B=4728, I=1536, H=1024 bias      |     459.9   |   517.1
  b16    B=4728, I=1536, H=1024 nobi      |     425.2   |   461.0
  f16    B=4728, I=1536, H=1024 bias      |     436.6   |   495.9
  f16    B=4728, I=1536, H=1024 nobi      |     400.0   |   441.3
  f16.ac B=4728, I=1536, H=1024 bias      |     558.4   |   617.8
  f16.ac B=4728, I=1536, H=1024 nobi      |     512.6   |   555.2

Times are in microseconds (us).
```
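To read the tables as speedups rather than raw timings, the eager time can be divided by the optimized time. The sketch below does this for the four `b16`, `B=9456, I=1536, H=4096` rows only; the timings are copied from the tables above, while the names `TIMINGS_US`, `speedup` and `summarize` are made up for this example.

```python
# Timings in microseconds, copied from the swiglu_fw / swiglu_bw tables
# for b16, B=9456, I=1536, H=4096.
# Layout: (pass, variant) -> (optimized_us, eager_us)
TIMINGS_US = {
    ("fw", "bias"): (1338.5, 1568.4),
    ("fw", "nobi"): (1268.3, 1477.1),
    ("bw", "bias"): (2223.2, 2705.4),
    ("bw", "nobi"): (2192.1, 2568.4),
}


def speedup(optimized_us, eager_us):
    """Return how many times faster the optimized path is than eager."""
    return eager_us / optimized_us


def summarize(timings):
    """Map each (pass, variant) key to its speedup, rounded to 3 decimals."""
    return {key: round(speedup(*pair), 3) for key, pair in timings.items()}


if __name__ == "__main__":
    for (direction, variant), ratio in summarize(TIMINGS_US).items():
        print(f"swiglu_{direction} {variant}: {ratio:.3f}x over eager")
```

For these rows the optimized path comes out roughly 1.16x to 1.22x faster than eager, with and without bias. The other shapes and dtypes can be added to the dictionary in the same way.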
Codecov Report
Base: 88.33% // Head: 88.01% // Decreases project coverage by 0.33%
Additional details and impacted files
@@ Coverage Diff @@
## gh/danthe3rd/58/base #504 +/- ##
========================================================
- Coverage 88.33% 88.01% -0.33%
========================================================
Files 80 80
Lines 4802 4822 +20
========================================================
+ Hits 4242 4244 +2
- Misses 560 578 +18
LGTM, thanks!
ghstack-source-id: 39f1898bb6bf1636291fb79bfc9dc6c680c002a3 Pull Request resolved: #504
Stack from ghstack (oldest at bottom):