Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: arm64 performance optimizations #4288

Merged
merged 8 commits into from
Dec 12, 2022
Merged

Conversation

rami-lv
Copy link
Contributor

@rami-lv rami-lv commented Nov 24, 2022

Recently at @liftoff I have been leveraging arm64 machines to train our models using vowpal wabbit.
Using arm64 in our infrastructure has helped us reduce the cost of our ML infrastructure.

To get the most out of vowpal wabbit on these machines, we added arm64 compiler optimization and transported the SIMD instructions used in vowpal wabbit via sse2neon. To do that, we followed AWS guide on how to optimize builds for arm64 machines, there are probably more optimizations to apply which require a deeper knowledge of the training algorithm.

These optimizations improved our ML pipeline time by roughly 20% (the pipeline contains steps other than training with vowpal wabbit so more experiments should be conducted if you are interested in getting the improvement on vowpal wabbit).

The proper solution would be to fill the placeholder in lda_core.cc with the corresponding instructions but that would require a deeper understanding of the code.

@rami-lv
Copy link
Contributor Author

rami-lv commented Nov 24, 2022

I couldn't properly add sse2neno as an external dependency, I would appreciate some help on that.

@rami-lv rami-lv marked this pull request as ready for review November 24, 2022 16:28
@rami-lv rami-lv changed the title Arm64 performance optimizations Add arm64 performance optimizations Nov 24, 2022
@rami-lv rami-lv changed the title Add arm64 performance optimizations [perf ] arm64 performance optimizations Nov 24, 2022
@rami-lv rami-lv changed the title [perf ] arm64 performance optimizations perf: arm64 performance optimizations Nov 24, 2022
@zwd-ms
Copy link
Collaborator

zwd-ms commented Nov 29, 2022

Hi @rami-lv, thank you for the contribution!

Regarding your question on adding an external dependency, it might be much easier for VW to consume a library if it's already added in vcpkg. Would you mind adding a port to vcpkg and then use it in VW?

Just curious, sse2neon seems to support up to SSE4, while SIMDe supports a wider range of instruction sets including AVX2 (and partially AVX512). Would it be possible to use SIMDe for your purposes? And I think it is already ported in vcpkg.

@rami-lv
Copy link
Contributor Author

rami-lv commented Dec 2, 2022

@zwd-ms Thanks for the suggestion I didn't know about the project.
However, integrating SIMDe means that we have to rename the intrinsics calls in lda_core.cc to match those of SIMDe which might be not preferable for the maintainers.
Also, SIMDe has merged all sse2neon code so there is no advantage of using one or another.

In my opinion, unless the maintainers decide to integrate SIMDe this should be temporary, and the long-term solution is to actually implement the corresponding NEON logic.

As for AVX2 I couldn't find any usage of AVX intrinsics through the code, did you?

@rami-lv
Copy link
Contributor Author

rami-lv commented Dec 2, 2022

As for AVX2 I couldn't find any usage of AVX intrinsics through the code, did you?

Never mind I just found them.

Comment on lines 21 to 24
set(LINUX_ARM64_OPT_FLAGS "")
if("${CMAKE_SYSTEM_PROCESSOR}" MATCHES "aarch64|arm64|ARM64")
set(LINUX_ARM64_OPT_FLAGS -mcpu=neoverse-n1)
endif()
Copy link
Member

@jackgerrits jackgerrits Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to this CPU, so we should not be adding it whenever we encounter an arm CPU. Generally for specific optimization flags like this it is better to add them when configuring your own build.

So, in this case we should remove this but in your build you would add the following to your command line to achieve the same outcome:

-DCMAKE_CXX_FLAGS_RELEASE="-mcpu=neoverse-n1"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to go ahead and push a change removing this piece so we can move forward here.

@@ -32,6 +32,10 @@ VW_WARNING_STATE_POP
#include "vw/core/vw_versions.h"
#include "vw/io/logger.h"

#if defined(__ARM_NEON)
#include <sse2neon/sse2neon.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has not been made available to the vw_core cmake target, so the build won't be able to find this. We also don't have any CI which is going to cause this path to be exercised so we need to be careful.

The include should be added here:

target_include_directories(vw_core PRIVATE ${CMAKE_CURRENT_LIST_DIR}/src)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just created a package in vcpkg for sse2neon microsoft/vcpkg#28129

Would it be okay if I add the header directly in the source code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is still preferable to use the submodule. I can push a commit to the branch with this change,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can push a commit to the branch with this change

That would be nice of you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackgerrits FYI, I added sse2neon to vcpkg. Would you prefer that I add the package instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding that, that's super helpful! There was a tiny issue with the installed path, I went ahead and opened a fix here microsoft/vcpkg#28268. When that gets merged I will update this PR to consume.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and merge this one and we can update to consume vcpkg later.

@jackgerrits jackgerrits self-requested a review December 12, 2022 22:41
@zwd-ms zwd-ms enabled auto-merge (squash) December 12, 2022 22:57
@zwd-ms zwd-ms merged commit ca0c23c into VowpalWabbit:master Dec 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants