SwiGLU: Support for no bias #504

Merged on Nov 10, 2022 (10 commits)

@danthe3rd (Contributor) commented Nov 2, 2022

Stack from ghstack (oldest at bottom):

**PERFORMANCE**

```
[-------------------------- swiglu_fw --------------------------]
                                          |  optimized  |  eager 
1 threads: ------------------------------------------------------
      b16    B=9456, I=1536, H=4096 bias  |    1338.5   |  1568.4
      b16    B=9456, I=1536, H=4096 nobi  |    1268.3   |  1477.1
      f16    B=9456, I=1536, H=4096 bias  |    1305.3   |  1607.4
      f16    B=9456, I=1536, H=4096 nobi  |    1291.7   |  1493.0
      f16.ac B=9456, I=1536, H=4096 bias  |    1445.1   |  1754.7
      f16.ac B=9456, I=1536, H=4096 nobi  |    1434.0   |  1640.5
      b16    B=4440, I=1536, H=4096 bias  |     580.7   |   726.5
      b16    B=4440, I=1536, H=4096 nobi  |     578.7   |   725.1
      f16    B=4440, I=1536, H=4096 bias  |     597.4   |   736.5
      f16    B=4440, I=1536, H=4096 nobi  |     598.0   |   732.3
      f16.ac B=4440, I=1536, H=4096 bias  |     712.8   |   853.8
      f16.ac B=4440, I=1536, H=4096 nobi  |     701.0   |   841.7
      b16    B=4728, I=1536, H=4096 bias  |     620.3   |   772.6
      b16    B=4728, I=1536, H=4096 nobi  |     618.2   |   744.7
      f16    B=4728, I=1536, H=4096 bias  |     634.9   |   785.9
      f16    B=4728, I=1536, H=4096 nobi  |     633.7   |   750.8
      f16.ac B=4728, I=1536, H=4096 bias  |     746.8   |   897.1
      f16.ac B=4728, I=1536, H=4096 nobi  |     740.0   |   854.1
      b16    B=4728, I=1536, H=1024 bias  |     160.1   |   199.3
      b16    B=4728, I=1536, H=1024 nobi  |     158.6   |   196.4
      f16    B=4728, I=1536, H=1024 bias  |     163.6   |   202.1
      f16    B=4728, I=1536, H=1024 nobi  |     161.9   |   198.8
      f16.ac B=4728, I=1536, H=1024 bias  |     237.1   |   278.5
      f16.ac B=4728, I=1536, H=1024 nobi  |     227.3   |   265.1

Times are in microseconds (us).
```

```
[-------------------------- swiglu_bw --------------------------]
                                          |  optimized  |  eager 
1 threads: ------------------------------------------------------
      b16    B=9456, I=1536, H=4096 bias  |    2223.2   |  2705.4
      b16    B=9456, I=1536, H=4096 nobi  |    2192.1   |  2568.4
      f16    B=9456, I=1536, H=4096 bias  |    2292.8   |  2700.6
      f16    B=9456, I=1536, H=4096 nobi  |    2183.5   |  2560.7
      f16.ac B=9456, I=1536, H=4096 bias  |    2603.5   |  2992.8
      f16.ac B=9456, I=1536, H=4096 nobi  |    2457.3   |  2834.5
      b16    B=4440, I=1536, H=4096 bias  |    1176.6   |  1418.9
      b16    B=4440, I=1536, H=4096 nobi  |    1159.1   |  1339.6
      f16    B=4440, I=1536, H=4096 bias  |    1199.8   |  1416.1
      f16    B=4440, I=1536, H=4096 nobi  |    1154.0   |  1335.1
      f16.ac B=4440, I=1536, H=4096 bias  |    1407.7   |  1631.2
      f16.ac B=4440, I=1536, H=4096 nobi  |    1348.7   |  1535.2
      b16    B=4728, I=1536, H=4096 bias  |    1233.5   |  1491.8
      b16    B=4728, I=1536, H=4096 nobi  |    1215.1   |  1409.0
      f16    B=4728, I=1536, H=4096 bias  |    1248.2   |  1486.0
      f16    B=4728, I=1536, H=4096 nobi  |    1208.1   |  1404.2
      f16.ac B=4728, I=1536, H=4096 bias  |    1480.1   |  1705.0
      f16.ac B=4728, I=1536, H=4096 nobi  |    1408.5   |  1605.9
      b16    B=4728, I=1536, H=1024 bias  |     459.9   |   517.1
      b16    B=4728, I=1536, H=1024 nobi  |     425.2   |   461.0
      f16    B=4728, I=1536, H=1024 bias  |     436.6   |   495.9
      f16    B=4728, I=1536, H=1024 nobi  |     400.0   |   441.3
      f16.ac B=4728, I=1536, H=1024 bias  |     558.4   |   617.8
      f16.ac B=4728, I=1536, H=1024 nobi  |     512.6   |   555.2

Times are in microseconds (us).
```
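
As a reading aid for the numbers above, here is a small sketch that turns a few of the reported timings into eager/optimized ratios. The rows are copied from the swiglu_fw and swiglu_bw tables; the `FW`, `BW`, `speedup` and `report` names are hypothetical and only exist for this illustration. The interpretation of a ratio above 1.0 as "optimized is faster" assumes both columns measure the same workload.

```python
# (label, optimized_us, eager_us), copied from the benchmark tables above.
FW = [
    ("b16 B=9456 H=4096 bias", 1338.5, 1568.4),
    ("b16 B=9456 H=4096 nobi", 1268.3, 1477.1),
    ("f16 B=4728 H=1024 bias", 163.6, 202.1),
    ("f16 B=4728 H=1024 nobi", 161.9, 198.8),
]
BW = [
    ("b16 B=9456 H=4096 bias", 2223.2, 2705.4),
    ("b16 B=9456 H=4096 nobi", 2192.1, 2568.4),
    ("f16 B=4728 H=1024 bias", 436.6, 495.9),
    ("f16 B=4728 H=1024 nobi", 400.0, 441.3),
]


def speedup(optimized_us, eager_us):
    """Ratio of eager time to optimized time (hypothetical helper)."""
    return eager_us / optimized_us


def report(name, rows):
    """Print one line per row and return the list of ratios."""
    ratios = []
    for label, optimized_us, eager_us in rows:
        ratio = speedup(optimized_us, eager_us)
        ratios.append(ratio)
        print(f"{name} {label}: {ratio:.3f}x")
    return ratios


fw_ratios = report("fw", FW)
bw_ratios = report("bw", BW)
```

Only eight of the 48 rows are included here, so this is not a summary of the whole benchmark.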

[ghstack-poisoned]
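
For readers unfamiliar with the `bias` / `nobi` rows, the sketch below shows what a SwiGLU forward pass with optional biases can look like. It is a pure-Python illustration on plain lists, assuming the common formulation `linear(silu(linear(x)) * linear(x))`; it is not the code from this PR, and `silu`, `linear` and `swiglu_reference` are hypothetical names. Passing `None` for a bias corresponds to the no-bias case.

```python
import math


def silu(v):
    """SiLU / swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(x, w, b=None):
    """Compute W x, plus b when a bias is given. `w` is a list of rows."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    if b is not None:
        out = [o + bi for o, bi in zip(out, b)]
    return out


def swiglu_reference(x, w1, b1, w2, b2, w3, b3):
    """Assumed SwiGLU formulation; any of b1, b2, b3 may be None."""
    x1 = linear(x, w1, b1)          # branch that goes through SiLU
    x2 = linear(x, w2, b2)          # gating branch
    hidden = [silu(a) * g for a, g in zip(x1, x2)]
    return linear(hidden, w3, b3)   # output projection


# Tiny example: 2 input features, 2 hidden features, 1 output feature.
x = [1.0, 2.0]
w1 = [[0.5, -1.0], [1.5, 0.25]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
w3 = [[1.0, 1.0]]

out_nobias = swiglu_reference(x, w1, None, w2, None, w3, None)
out_zero_bias = swiglu_reference(x, w1, [0.0, 0.0], w2, [0.0, 0.0], w3, [0.0])
```

With all-zero biases the result matches the no-bias call, which is the equivalence a no-bias code path relies on.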
@facebook-github-bot added the CLA Signed label Nov 2, 2022
danthe3rd pushed a commit that referenced this pull request Nov 2, 2022
ghstack-source-id: 5130cd816667ac26b5dce4db68e052b059e80229
Pull Request resolved: #504
@danthe3rd danthe3rd mentioned this pull request Nov 2, 2022
danthe3rd added 3 commits November 3, 2022 08:21

@codecov-commenter commented Nov 3, 2022

Codecov Report

Base: 88.33% // Head: 88.01% // Decreases project coverage by -0.32% ⚠️

Coverage data is based on head (4e3adec) compared to base (00fbd9b).
Patch coverage: 7.89% of modified lines in pull request are covered.

Additional details and impacted files
@@                   Coverage Diff                    @@
##           gh/danthe3rd/58/base     #504      +/-   ##
========================================================
- Coverage                 88.33%   88.01%   -0.33%     
========================================================
  Files                        80       80              
  Lines                      4802     4822      +20     
========================================================
+ Hits                       4242     4244       +2     
- Misses                      560      578      +18     
| Flag | Coverage Δ |
|---|---|
| Python | 88.01% <7.89%> (-0.33%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| xformers/ops/swiglu.py | 33.79% <7.89%> (-2.43%) ⬇️ |
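
The report quotes the coverage change as both -0.32% and -0.33%. A quick check, using only the Hits and Lines figures from the diff table above, suggests the unrounded difference lies between those two values. The variable names and the `coverage_pct` helper are hypothetical, and nothing here assumes how Codecov itself rounds.

```python
# Figures copied from the Codecov diff table above.
base_hits, base_lines = 4242, 4802
head_hits, head_lines = 4244, 4822


def coverage_pct(hits, lines):
    """Line coverage as a percentage (hypothetical helper)."""
    return 100.0 * hits / lines


base_pct = coverage_pct(base_hits, base_lines)
head_pct = coverage_pct(head_hits, head_lines)
delta = head_pct - base_pct

# Changes in the raw counts, to compare against the +20 / +2 / +18 column.
new_lines = head_lines - base_lines
new_hits = head_hits - base_hits
new_misses = (head_lines - head_hits) - (base_lines - base_hits)

print(f"base {base_pct:.3f}%  head {head_pct:.3f}%  delta {delta:+.3f}")
print(f"lines {new_lines:+d}  hits {new_hits:+d}  misses {new_misses:+d}")
```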



@danthe3rd danthe3rd mentioned this pull request Nov 4, 2022
danthe3rd added 2 commits November 4, 2022 10:04


Contributor

@fmassa fmassa left a comment

LGTM, thanks!


@danthe3rd danthe3rd merged commit 4e3adec into gh/danthe3rd/58/base Nov 10, 2022
danthe3rd pushed a commit that referenced this pull request Nov 10, 2022
ghstack-source-id: 39f1898bb6bf1636291fb79bfc9dc6c680c002a3
Pull Request resolved: #504
@danthe3rd danthe3rd deleted the gh/danthe3rd/58/head branch November 10, 2022 18:11
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
4 participants