Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change options based on arch for Mesa #1892

Merged
merged 20 commits into from
Mar 23, 2020

Conversation

edmondac
Copy link
Contributor

arch-specific config opts for Mesa

"-Dswr-arches=avx,avx2,skx,knl" fails on too many types of system, so removing it
Flamefire
Flamefire previously approved these changes Feb 4, 2020
Copy link
Contributor

@Flamefire Flamefire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@smoors smoors left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edmondac could you select the -Dswr-arches options according the CPU microarchitecture, similar to what is done in the lammps easyblock?

smoors
smoors previously requested changes Mar 9, 2020
easybuild/easyblocks/m/mesa.py Outdated Show resolved Hide resolved
Co-Authored-By: Sam Moors <smoors@users.noreply.github.com>
@edmondac edmondac requested a review from smoors March 9, 2020 15:56
@bartoldeman
Copy link
Contributor

Just a few comments:

the reason why I put in the all the swr arches unconditionally is that they are selected at runtime, and you may be "cross-compiling". The problem with the recipes was that in the 19.1.* they run into a GCC bug (fixed in GCC 9 IIRC) where -march=native is not overridden with -march=skylake-avx512 or similar, so avx512 code when compiled with -march=native -march=skylake-avx512 errors out when the build machine does not support avx512 . That is why I included a patch in the easyconfig for 19.2..1

@edmondac
Copy link
Contributor Author

Just a few comments:

the reason why I put in the all the swr arches unconditionally is that they are selected at runtime, and you may be "cross-compiling". The problem with the recipes was that in the 19.1.* they run into a GCC bug (fixed in GCC 9 IIRC) where -march=native is not overridden with -march=skylake-avx512 or similar, so avx512 code when compiled with -march=native -march=skylake-avx512 errors out when the build machine does not support avx512 . That is why I included a patch in the easyconfig for 19.2..1

Hi @bartoldeman - the problem with this approach is that it doesn't build on POWER9 CPUs:

src/gallium/drivers/swr/meson.build:200:2: ERROR: Problem encountered: Cannot find AVX support for swr. (these are required for SWR an all architectures.)

The new approach does, see easybuilders/easybuild-easyconfigs#9764 (comment)

Is cross-compilation "supported" by EasyBuild? Not using it, I wouldn't know. So many things auto-detect their CPU features that I'd have thought it would mostly not build with the right support - but I've not tried it.

@akesandgren
Copy link
Contributor

Using self.cfg.update('configopts', "-Dswr-arches=avx,avx2,skx,knl") works on our broadwell nodes at least.

@Flamefire
Copy link
Contributor

My point was that there is a pending investigation but this was about to be merged.
The existing easyconfig with all archs has worked so far, so the proposed approach is reasonable.
The question is: why did it fail for you? Can you post the easyconfig and easyblock used together with the log? I suggested that the problem might be in Mesa and already known and solved. Is that the case?

@lexming
Copy link
Contributor

lexming commented Mar 21, 2020

Using -Dgallium-drivers='swrast,swr' -Dswr-arches=knl,skx,avx2,avx (your proposal) hits Mesa bug #105029 (https://bugs.freedesktop.org/show_bug.cgi?id=105029) on IvyBridge for easyconfigs of Mesa < v19.3.7 using GCC-8.

The current easyblock for mesa in this PR does not hit that bug because it compiles Mesa with -Dgallium-drivers='swrast,swr' -Dswr-arches=avx.

Since there is already a PR open (easybuilders/easybuild-easyconfigs#9764) to use the mesa easyblock on Mesa easyconfigs below v19.3.7, this is a serious concern.

As I said, the current state of this PR seems to be a safer path forward. If any user wants to do cross-compiling, this easyblock already allows the user to do that by manually overriding swr-arches in the easyconfig (ie adding -Dswr-arches=knl,skx,avx2,avx to configopts)

This is just my advice, I leave the final decision to other maintainers.

@Flamefire
Copy link
Contributor

Do I misunderstood the bug report because to me this looks like an issue with a compiler not with the architecture.
That's why I suggested to just add the archs in easyconfigs where the compiler is known. Currently I expect this to fail with Mesa < 19.3.7 on archs supporting avx 512 but with older compilers (?)

So this makes this an arch dependent failure which is even worse

And I mentioned that this is probably what was mentioned in #1892 (comment)

Comment on lines 53 to 56
# avx512f: AVX-512 Foundation - introduced in Skylake
# avx512er: AVX-512 Exponential and Reciprocal Instructions implemented in Knights Landing
x86_features = {'avx': 'avx', 'avx1.0': 'avx', 'avx2': 'avx2', 'avx512f': 'skx', 'avx512er': 'knl'}
swr_arches = [farch for fname, farch in x86_features.items() if fname in cpu_features]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after finding that Mesa < 19.3.7 bug, suggested code would be:

if arch == X86_64:
  x86_features = {'avx': 'avx', 'avx1.0': 'avx', 'avx2': 'avx2', 'avx512f': 'skx', 'avx512er': 'knl'}
  if LooseVersion(self.version) < LooaseVersion('19.3.7'):
    swr_arches = [farch for fname, farch in x86_features.items() if fname in cpu_features]
  else:
    swr_arches = [farch for fname, farch in x86_features.items() ]

Or something similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @lexming here that including this version check is not a good idea, especially since there's a patch available to fix the problem you're dancing around here.

@akesandgren
Copy link
Contributor

@Flamefire @lexming good enough for both of you?

And if so, @edmondac can you make that effect if the4y both agree?

@Flamefire
Copy link
Contributor

Again the question: is this a bug related to the current arch or the current compiler? I understood it to be the latter. Then this is not acceptable as it will fail when one tries to build on knl or skylake(?)

@lexming
Copy link
Contributor

lexming commented Mar 21, 2020

Based on my tests, the current mesa easyblock (commit cac0c0c) works well in the following cases:

  • Mesa-19.0.1-GCCcore-8.2.0.eb: IvyBridge and Skylake
  • Mesa-19.1.7-GCCcore-8.3.0.eb: IvyBridge and Skylake
  • Mesa-20.0.2-GCCcore-9.3.0.eb: IvyBridge and Skylake

So the bug occurs in a combination of architecture and compiler when using -Dswr-arches=knl,skx,avx2,avx on non-AVX512 archs (?) That is unclear, but in any case, my tests are all positive so far with this easyblock as it is.

@akesandgren I would really prefer avoiding the burden and complication of maintaining version checks of the compiler or Mesa unless we find a case where this easyblock fails. This is not currently the case. Moreover (I repeat myself) users wanting to do cross-compiling can set swr-arches in configopts and this easyblock will respect that.

Having said all of this, if a majority of you still prefer to add version checks on the compiler and Mesa, I won't oppose this change obviously.

@Flamefire
Copy link
Contributor

And you are still missing the point. To summarize:

  • All evidence suggest that there is a known issue with Mesa < 19.3.7 (?) which runs into a GCC 8 bug that makes it fail to compile if knl or skx is specified for swr-arches. This is from https://bugs.freedesktop.org/show_bug.cgi?id=105029 and Change options based on arch for Mesa #1892 (comment)
  • These flags will be used with the current EasyBlock if skylake or knl architectures are used to build mesa
  • Hence there is a case where this easyblock fails. You did not test compiling on skylake, did you? Hence this isn't noticed yet but it will fail

users wanting to do cross-compiling can set swr-arches in configopts

See above: The bug is the compiler with older Mesa versions. There was not any evidence that this bug only applies when running on non-AVX-512 archs. The linked issue shows a compile failure, not a runtime failure.

I would really prefer avoiding the burden and complication of maintaining version checks of the compiler or Mesa

The burden of having a bug that is only reproducible on 1 architecture due to the EasyBlock changing behavior based on the architecture is IMO higher.

Furthermore there exists a patch for that which has been mentioned in #1892 (comment): https://github.com/easybuilders/easybuild-easyconfigs/pull/9764/files#diff-a56c37857393d269498e8fd48f40500dR29

@lexming
Copy link
Contributor

lexming commented Mar 21, 2020

Hence there is a case where this easyblock fails. You did not test compiling on skylake, did you? Hence this isn't noticed yet but it will fail

/facepalm

I already tested this easyblock in a Skylake (Intel(R) Xeon(R) Gold 6126) and it works. Precisely with Mesa-19.0.1-GCCcore-8.2.0.eb, Mesa-19.1.7-GCCcore-8.3.0.eb and Mesa-20.0.2-GCCcore-9.3.0.eb. I just said that in my previous comment (#1892 (comment)). Are you reading them?

I am tired of repeating myself. All those tests listed in #1892 (comment) with this easyblock as it is in commit cac0c0c are successful.

I won't participate in this discussion any more unless you provide some evidence to your claims. So far you have talked a lot, but have not submitted the results of any test. I already provided tests of several Mesa easyconfigs in different architectures, I tested both the current easyblock and your proposed changes, and I am the one who found that we could be affected by bug 105029 by using indiscriminate swr-archs. So I really doubt that I'm missing any point here.

@Flamefire
Copy link
Contributor

Flamefire commented Mar 21, 2020

Apologies, it seems it is KNL affected by the bug you mentioned, not skylake. Got fooled by the mentioning of avx512. The upstream fix is this: https://gitlab.freedesktop.org/mesa/mesa/-/commit/f1fbeb1a53 and it seems to be fixed in 18.1.0. So that doesn't look like the bug you encountered. But since no test reports of the failure were submitted despite me asking I couldn't know better.

So far you have talked a lot, but have not submitted the results of any test.

I did link https://github.com/easybuilders/easybuild-easyconfigs/pull/9764/files#diff-a56c37857393d269498e8fd48f40500dR29 and #1892 (comment) multiple times where all archs are passed for an existing EasyConfig and all tests (submitted when this was merged and current usage) show that using all archs is possible.

and I am the one who found that we could be affected by bug 105029 by using indiscriminate swr-archs. So I really doubt that I'm missing any point here.

And I'm the one who pointed out that we might be affected by the issue reported in #1892 (comment) and asked to verify this. It seems you must have missed #1892 (comment).

I did just do this: I used this EC: https://gist.github.com/Flamefire/6730d3b40d75dabf47d9df3a3ca515f3

It is the current Mesa-19.1.7-GCCcore-8.3.0.eb with -Dswr-arches=avx,avx2,skx,knl and the mentioned patch added. I successfully built this on a sandy-bridge. I also verified that removing the patch makes it fail with errors like:

../mesa-19.1.7/src/gallium/drivers/swr/rasterizer/core/format_types.h:1023:45: error: cannot convert simd16scalar {aka SIMDImpl::SIMD512Impl::Float} to __m512 {aka __vector(16) float}
result = _mm512_mask_blend_ps(mask, result2, result);

So conclusion: We run into https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69471 as mentioned by @bartoldeman in #1892 (comment) and there is a patch available so we can continue to build for all architectures.

However building only with the swr for the current arch happens to work as -march=native and the swr-arch-specific -march don't conflict in that situation. This is why the current state works but has the downside of reduced performance when building for the highest common CPU architecture used.

@boegel
Copy link
Member

boegel commented Mar 21, 2020

Seems like it's time I pitch in here... :)

EasyBuild usually builds for the host processor architecture (usually by just using -march=native), so that's also what this easyblock should do by default...

Also listing other values in the swr-arches configuration option which correspond to older processor architectures (for example also listing avx when building on Intel Skylake) is fine though. In some sense that's cross-compiling, but only for older stuff that is definitely supported for the build host, so I don't see an issue there.

I'm not in favor of always listing all supported architectures to try and obtain a Mesa installation that will perform "optimally" on hosts that have newer processors than the build host. That's pretty atypical for EasyBuild, so that shouldn't be the default.

In the current state (commit cac0c0c) there's an easy way to override the default of building for the host architecture: as soon as you include swr-arches=... via configopts you're in full control, and you can, for example, opt for also listing skx when building on a Haswell host to obtain a Mesa installation that will switch at runtime to something optimized for Skylake.

@Flamefire I understand your point that obtaining a Mesa installation that will perform well on other (newer) architectures can be useful, but the problem that @bartoldeman fixed with a patch shows that this is also a bit risky.
Although it's fixed now, I can envision a future where a similar issue is introduced again.

So I would prefer going forward with the current implementation, which fits better with what EasyBuild usually does by default, and is less likely to cause build problems going forward.

I hope that settles the discussion.

I would like to see some minor stylistic changes to the easyblock (like stretching out the x86_features dict to have one entry per line), as well as adding a custom sanity_check_step method, before merging this. I'll look into this myself right now...

@Flamefire
Copy link
Contributor

EasyBuild usually builds for the host processor architecture (usually by just using -march=native), so that's also what this easyblock should do by default...

I'm not in favor of always listing all supported architectures to try and obtain a Mesa installation that will perform "optimally" on hosts that have newer processors than the build host. That's pretty atypical for EasyBuild, so that shouldn't be the default.

Ok, to me it looked like it should do that as it was removed from the EC: https://github.com/easybuilders/easybuild-easyconfigs/pull/9764/files#diff-a56c37857393d269498e8fd48f40500dL64
So I guess it should be added there again to not change existing behavior.

BTW: What is done for other Software that does runtime selection? Doesn't OpenBLAS or FFTW does something similar? This is why I didn't consider it atypical.

@boegel
Copy link
Member

boegel commented Mar 21, 2020

EasyBuild usually builds for the host processor architecture (usually by just using -march=native), so that's also what this easyblock should do by default...

I'm not in favor of always listing all supported architectures to try and obtain a Mesa installation that will perform "optimally" on hosts that have newer processors than the build host. That's pretty atypical for EasyBuild, so that shouldn't be the default.

Ok, to me it looked like it should do that as it was removed from the EC: https://github.com/easybuilders/easybuild-easyconfigs/pull/9764/files#diff-a56c37857393d269498e8fd48f40500dL64
So I guess it should be added there again to not change existing behavior.

I consider this approach better than hardcoding avx,avx2,skx,knl, so I consider the changed behavior as a bug fix, not a regression (so it's OK to change).

BTW: What is done for other Software that does runtime selection? Doesn't OpenBLAS or FFTW does something similar? This is why I didn't consider it atypical.

I don't know of any other software which does runtime selection to an arch that is newer than what the software was built for.
For FFTW, we only enable building for CPU features supported by the build host, and then leave it to FFTW itself to do runtime selection, so very similar to what we do here for Mesa.
For OpenBLAS, the easyblock is only aware of building for a specific architecture as far as making sure a TARGET=... that is included in buildopts is also set in testopts and installopts, but it doesn't specify any target arch at all and leaves it up to OpenBLAS itself (which detects the host CPU and targets that by default).

@boegel
Copy link
Member

boegel commented Mar 21, 2020

@edmondac Please take a look at bear-rsg#17, I took this a step further, mostly w.r.t. the sanity check...

Copy link
Contributor

@lexming lexming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@boegel boegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@boegel boegel merged commit c1afbd1 into easybuilders:develop Mar 23, 2020
@edmondac edmondac deleted the mesa-for-power9 branch March 23, 2020 14:11
'dirs': [os.path.join('include', 'GL', 'internal')],
}

if self.swr_arches:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, this is indeed wrong: It expects swr to be build without checking the configopts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Flamefire Did this get fixed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it was in #2006...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants