rocfft: split kernel compilation into separate derivations #230881

Merged
3 commits merged on Jun 2, 2023

Conversation

kira-bruneau
Contributor

@kira-bruneau kira-bruneau commented May 9, 2023

Description of changes

I found a way around the "output limit exceeded" error on Hydra: building the kernel device libs in a separate derivation for each GPU target.

(I've verified this by compiling for one GPU target; I'm still waiting on a nixpkgs-review run to verify the complete build.)

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@kira-bruneau
Contributor Author

kira-bruneau commented May 10, 2023

Oops, I realized I pushed this too soon. PyTorch doesn't like the page misalignment that patchelf is causing in librocfft.so:

Check whether the following modules can be imported: torch
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <lambda>
  File "/nix/store/ysd0w2ffcn21snv39770gna78wcf524j-python3-3.10.11/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/nix/store/mk0zjy8dmlhwpagvhqxxa0v2g9l90npq-python3.10-torch-2.0.0/lib/python3.10/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: librocfft.so.0: ELF load command address/offset not page-aligned

It looks like this might be an upstream bug in patchelf (NixOS/patchelf#492). I'm going to mark this as a draft until I can find another way (probably by moving the device libs to buildInputs and adding the per-device libraries to the CMake target).
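
For anyone who wants to check a library offline for the condition behind the ImportError above, here is a minimal sketch. It is not part of this PR. It assumes a little-endian ELF64 file, and it assumes the loader rejects a PT_LOAD segment whose virtual address and file offset disagree modulo p_align, which is my understanding of the glibc check. The function name misaligned_load_segments is hypothetical.

```python
import struct

PT_LOAD = 1
# ELF64 file header and program header layouts (little-endian).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")


def misaligned_load_segments(data: bytes):
    """Return (index, p_vaddr, p_offset, p_align) for each suspicious PT_LOAD."""
    if len(data) < EHDR.size or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    header = EHDR.unpack_from(data, 0)
    e_phoff, e_phentsize, e_phnum = header[5], header[9], header[10]
    bad = []
    for i in range(e_phnum):
        (p_type, _flags, p_offset, p_vaddr,
         _paddr, _filesz, _memsz, p_align) = PHDR.unpack_from(
            data, e_phoff + i * e_phentsize)
        if p_type != PT_LOAD or p_align <= 1:
            continue
        # Assumed loader rule: address and offset must agree modulo p_align.
        if (p_vaddr - p_offset) % p_align != 0:
            bad.append((i, p_vaddr, p_offset, p_align))
    return bad
```

To use it, read the library in binary mode and pass the bytes in; an empty list means no PT_LOAD segment violates the assumed rule.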

@kira-bruneau kira-bruneau marked this pull request as draft May 10, 2023 15:55
@kira-bruneau
Contributor Author

I pushed some more changes, and I think it should be good to go! I'll just let nixpkgs-review run overnight before un-marking this as a draft PR.

@kira-bruneau kira-bruneau changed the title rocfft: split device lib build by gpu target rocfft: split kernel compilation into separate derivations May 24, 2023
@ofborg ofborg bot requested a review from Madouura May 24, 2023 04:19
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin and removed 10.rebuild-darwin: 1 10.rebuild-darwin: 1-10 labels May 24, 2023
@Tungsten842
Member

Result of nixpkgs-review pr 230881 run on x86_64-linux

6 packages built:
  • hipfft
  • python310Packages.torchWithRocm
  • python310Packages.torchWithRocm.dev
  • python310Packages.torchWithRocm.dist
  • python310Packages.torchWithRocm.lib
  • rocfft

@kira-bruneau
Contributor Author

Result of nixpkgs-review run on x86_64-linux

6 packages built:
  • hipfft
  • python310Packages.torchWithRocm
  • python310Packages.torchWithRocm.dev
  • python310Packages.torchWithRocm.dist
  • python310Packages.torchWithRocm.lib
  • rocfft

@kira-bruneau kira-bruneau marked this pull request as ready for review May 24, 2023 14:20
@kira-bruneau
Contributor Author

> nix path-info -rsSh ./results/rocfft
/nix/store/0fvh2p4irz0lw0cpy2ll1rf2hbhbym3g-xgcc-12.2.0-libgcc              	 139.3K	 139.3K
/nix/store/f437nzzrmjmf5k7q93mawr93vfrsqdb0-libunistring-1.1                	   1.8M	   1.8M
/nix/store/xqcrq3aw2fnrj7cp3id90wzy8x46qhw3-libidn2-2.3.4                   	 350.4K	   2.1M
/nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8                    	  28.8M	  31.1M
/nix/store/4g9phbpakh51bbw2n391vipz9r5z56kw-ncurses-6.4                     	   3.5M	  34.6M
/nix/store/hr3m53r0nhyqx80sg0bz9xjgk6jg009k-zlib-1.2.13                     	 129.6K	  31.2M
/nix/store/hx7dy0y08w1kqpm7h85n9m4p67kk28f1-libxml2-2.10.4                  	   1.5M	  32.7M
/nix/store/5gk8zqasr9hdhm9nhl0y7g0g7bf5lvbc-gcc-12.2.0-libgcc               	 139.3K	 139.3K
/nix/store/yazs3bdl481s2kyffgsa825ihy1adn8f-gcc-12.2.0-lib                  	   7.5M	  38.8M
/nix/store/0jv3wbw5asfjnjh59y5yv15qr2533a6z-rocm-llvm-lld-5.4.4             	 101.6M	 145.6M
/nix/store/0v9qczcpm304n51dq0ym6m3c4k4iha78-rocm-llvm-llvm-5.4.4            	  44.7M	  83.4M
/nix/store/19diy37d1q2mnvpmgaa9xkmjz830gmbj-gmp-with-cxx-6.2.1              	 730.4K	  39.5M
/nix/store/1cyi24kkmqr18h3k5s709mwsxcr5ipnp-xz-5.4.3                        	 781.5K	  31.9M
/nix/store/1d5xan2fc7bvkrkq6h7qmywbfhci2zmq-linux-headers-6.2               	   6.1M	   6.1M
/nix/store/2lrb1i34x4p4f16x5z1d1f3zqyp2hd24-rocm-llvm-libunwind-5.4.4       	 192.2K	  31.3M
/nix/store/2r20fz7wb14xa7i9j4c3sn990swp53af-rocm-device-libs-5.4.4          	   3.2M	   3.2M
/nix/store/7492cr0i0kglzbyr7l1401dzkr809dcx-rocm-llvm-libcxx-5.4.4          	   9.6M	  93.0M
/nix/store/8hpqbpflmx0h7g3mgda4345fh69vkfh1-rocm-llvm-libcxxabi-5.4.4       	1012.3K	  84.4M
/nix/store/w237hnxridnmjjwxfz1s1lfyppdzrrrb-attr-2.5.1                      	  78.8K	  31.2M
/nix/store/p0ikbnq88v649sk7rdrwhdp8qaqqjill-acl-2.3.1                       	 108.9K	  31.3M
/nix/store/ahkfdxq8mcpsb5kvdvgqr1wv8zjngbh4-coreutils-9.1                   	   1.4M	  41.1M
/nix/store/fzr584cz03fv689c950lz36qf0lvx877-rocm-llvm-compiler-rt-5.4.4     	  33.6M	  64.9M
/nix/store/jcg5zqj20bhxl1fmc034syma47g38dq8-ncurses-6.4-man                 	 597.8K	 597.8K
/nix/store/rhvbjmcfnkg8i2dxpzr114cp1ws7f667-bash-5.2-p15                    	   1.6M	  32.7M
/nix/store/9jmgsy8bll4ya21v4yvv96mr2ky1cc52-ncurses-6.4-dev                 	 389.3K	  37.2M
/nix/store/qwmhvny4in8134s96ssfy92w5acbwc4c-zlib-1.2.13-dev                 	 113.2K	  31.3M
/nix/store/m4l5y2hh0jn6knhk6a0z0g339dhl82m1-rocm-llvm-llvm-5.4.4            	   1.4G	   1.5G
/nix/store/csc9q2hz91d692vzid41d1gpy40v0nz1-libedit-20221030-3.1            	 282.6K	  34.9M
/nix/store/pakjbhlkgij9x3y6a4sjmvha52k5bb5g-rocm-llvm-clang-5.4.4           	   1.2G	   1.2G
/nix/store/b0352j49z81vxlvv4kxj1xalrbrn6mp4-rocm-llvm-clang-5.4.4           	   1.2M	   2.8G
/nix/store/4r7wx6h7hgrdq4x16r8ly2285sq31mdc-gmp-6.2.1                       	 686.5K	  31.8M
/nix/store/nkl7i39lqbky0b5kjxnblqxp9w1y7kxy-mpfr-4.2.0                      	 774.6K	  32.5M
/nix/store/3r5npdwjkmmx7l888zinqvi00viwbb8m-libmpc-1.3.1                    	 273.9K	  32.8M
/nix/store/g2v76k4cb3cqr96c8whw3xj7v1k0xk6a-isl-0.20                        	   2.5M	  34.2M
/nix/store/lyvhsvwp2pzy74fkcn7qbs5vcgy5d7vl-glibc-2.37-8-bin                	   2.7M	  33.8M
/nix/store/rfw51dqr3qn7b6fjy8hmx6f0x3hfwbx6-glibc-2.37-8-dev                	   2.2M	  42.1M
/nix/store/dcd1zhv56rk0d2z7akzfjgzr076c4jl9-gcc-12.2.0                      	 206.4M	 260.4M
/nix/store/hh1l75s66jqfj96g024qdpv83rzylji4-expand-response-params          	  16.4K	  31.1M
/nix/store/lm4971jip508qzb0k92zfbvglp9zmr92-rocm-llvm-binutils-5.4.4        	  26.8K	   1.6G
/nix/store/imdylwmpbm6xv93iyz1l9qmp6xwinqdb-rocm-llvm-binutils-wrapper-5.4.4	  38.9K	   1.6G
/nix/store/dmcbq0a8igikxayc5wni62nhyp337l9f-pcre-8.45                       	 514.4K	  31.6M
/nix/store/zrls4w5lxynqakh1jlrp03kg4bxzp9yi-gnugrep-3.7                     	 773.1K	  32.4M
/nix/store/2x9a5n3v36hjcmvg7xpmcyfkqd922szr-rocm-llvm-clang-wrapper-5.4.4   	  46.4K	   3.1G
/nix/store/k6x7rlrhhl8vd86yx9bp2h98kxnj69l4-hwdata-0.370                    	   8.7M	   8.7M
/nix/store/y3yivmqpacyf52nqzd55rm3jkscg16nf-libpciaccess-0.16               	  67.8K	  39.9M
/nix/store/f9bx50s498ssdy627kdm090qw0ah8xfv-libdrm-2.4.115                  	 530.0K	  40.4M
/nix/store/40anvvxz6x9hgbh7kbayqh36pq573iqx-rocm-thunk-5.4.4                	 359.7K	  40.9M
/nix/store/fxhq0kn94fi877clk03ksrvxxpv234x8-libxcrypt-4.4.33                	 128.0K	  31.2M
/nix/store/5qg40449fx96jl4sq0mbgifdppxq4zap-perl-5.36.0                     	  53.7M	  95.0M
/nix/store/zzab1mabj0xjsq05py3n1lldycwdsxad-rocm-comgr-5.4.4                	 136.5M	 171.5M
/nix/store/82k645bsx2mh3zviad3lpcj75hhbgp72-rocclr-5.4.4                    	   3.1M	 174.6M
/nix/store/6i68qmqhm149rd8wijbhf2nk3v6fil8s-getopt-1.1.6                    	  22.6K	  31.1M
/nix/store/8402jxnhn5ghisvh025ril7xhdsli6xp-lsb_release                     	   4.8K	  42.7M
/nix/store/iwihad8z8kszy0yhz0qldcnd9hx4k41f-libelf-0.8.13                   	 334.8K	  31.4M
/nix/store/s7wg6dwgfq2mddcy1xw9zgl94c09dj5g-numactl-2.0.16                  	 243.2K	  31.3M
/nix/store/vf7r78lim5516mnlvdc2py1xwdf2mi48-rocm-runtime-5.4.3              	   4.7M	   2.9G
/nix/store/64pyqgnjar1cdqg2l5kim7bngqpggx63-sqlite-3.41.2                   	   1.4M	  32.6M
/nix/store/6h9ywdy70bx0bgrym2b2vrihy73g83vb-gdbm-1.23                       	 805.4K	  31.9M
/nix/store/gavih4vph8awxlmz1ryqiajikgmql2r1-readline-8.2p1                  	 459.3K	  35.1M
/nix/store/jhgh02lyizd1kyl71brvc01ygsmgi40a-tzdata-2023c                    	   2.0M	   2.0M
/nix/store/jy8fxqz6rg1zx5h9qdvp6cqlivl223bc-mailcap-2.1.53                  	 109.4K	 109.4K
/nix/store/mnq0hqsqivdbaqzmzc287l0z9zw8dp15-libffi-3.4.4                    	  55.2K	  31.2M
/nix/store/r8dlmcy5a93qxdgsvsywbg3s51fwyvwi-expat-2.5.0                     	 253.0K	  31.3M
/nix/store/wqpb61g795dxhbiqkrni5lnpwpdglq2j-bzip2-1.0.8                     	  79.5K	  31.2M
/nix/store/wvrg1kgiy79sln1fzhvj8w6g604ghsad-openssl-3.0.8                   	   6.2M	  37.3M
/nix/store/95cxzy2hpizr23343b8bskl4yacf4b3l-python3-3.10.11                 	  83.0M	 139.2M
/nix/store/i7rizgw92kmsj3dw2vn61pswwggnhkr6-busybox-1.36.0                  	   1.2M	  32.3M
/nix/store/vhbyjpv0fdvn4g1v49c9c26ksh23cfra-rocminfo-5.4.4                  	   1.3M	   3.0G
/nix/store/lss26vc068x7wmwrflxw8vw8sglzs14a-hip-common-5.4.2                	   7.4M	   3.2G
/nix/store/vi13x9r70r9jd5ixxmlz9s5m0wmyjg8n-source                          	   2.4M	   2.4M
/nix/store/vv3vq7fnkbddkln85xm3pl87dbxs9f0x-rocm-opencl-runtime-5.4.4       	   5.4M	   3.0G
/nix/store/3336crsrp936qf662ia7y7a0r3xgpbh4-hip-amd-5.4.4                   	  32.4M	   3.4G
/nix/store/4zyar3xw806nijghkjr26p5nypb23gz0-rocfft-kernel-cache-5.4.3       	 634.7M	 634.7M
/nix/store/6xnfzkzmn6z9510cmw5g5cjpa5vqfh8k-rocfft-device-gfx803-5.4.3      	 377.9M	   3.8G
/nix/store/bpcw5avbzw3xag8y0c8aqffdpv3kry3g-rocfft-device-gfx906-5.4.3      	 373.5M	   3.8G
/nix/store/f34xamwma6fiagr9wdl6qrh3s8rlnxm3-rocfft-device-gfx1100-5.4.3     	 399.4M	   3.8G
/nix/store/ns2f2sczdsrlyrbxwqx9nzqs8gqiywdl-rocfft-device-gfx908-5.4.3      	 374.3M	   3.8G
/nix/store/qaqrdij7110caxqrgyxpv3536xq61iz8-rocfft-device-gfx900-5.4.3      	 374.4M	   3.8G
/nix/store/s3d1jrz7shp97y2g33s6afp6wj3j3v1y-rocfft-device-gfx1102-5.4.3     	 399.4M	   3.8G
/nix/store/skw6pvan8dsvbhhrp5y1hz3g2zazkmqd-rocfft-device-gfx90a-5.4.3      	 376.3M	   3.8G
/nix/store/wlg8239a3vbvcmcncjsif3wz1m9dhc5r-rocfft-device-gfx1030-5.4.3     	 385.3M	   3.8G
/nix/store/jzil3jc76pxxn4v7s1bx35gf3d6a1ffc-rocfft-5.4.3                    	   5.3M	   7.0G
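
To make the effect of the split easier to read off a listing like the one above, here is a small sketch that pulls out the per-target device outputs and their own (NAR) sizes. It assumes three whitespace-separated columns per line (path, size, closure size) and that the K/M/G suffixes are binary multiples. The helper names are hypothetical, and SAMPLE reuses a few lines from the listing.

```python
import re

# Multipliers to MiB, assuming the -h suffixes are binary multiples.
TO_MIB = {"": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mib(text: str) -> float:
    """Convert a human-readable size such as '377.9M' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMG]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * TO_MIB[match.group(2)]


def device_lib_sizes(listing: str) -> dict:
    """Map GPU target -> own size in MiB for every rocfft-device-* path."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip the prompt line and anything malformed
        path, nar_size, _closure_size = fields
        target = re.search(r"-rocfft-device-(gfx[0-9a-z]+)-", path)
        if target:
            sizes[target.group(1)] = to_mib(nar_size)
    return sizes


SAMPLE = """\
/nix/store/6xnfzkzmn6z9510cmw5g5cjpa5vqfh8k-rocfft-device-gfx803-5.4.3   377.9M   3.8G
/nix/store/bpcw5avbzw3xag8y0c8aqffdpv3kry3g-rocfft-device-gfx906-5.4.3   373.5M   3.8G
/nix/store/f34xamwma6fiagr9wdl6qrh3s8rlnxm3-rocfft-device-gfx1100-5.4.3  399.4M   3.8G
/nix/store/jzil3jc76pxxn4v7s1bx35gf3d6a1ffc-rocfft-5.4.3                 5.3M     7.0G
"""

sizes = device_lib_sizes(SAMPLE)
print(sorted(sizes), round(sum(sizes.values()), 1))
```

In the full listing, the eight rocfft-device-* outputs add up to about 3060.5M, while no single one is larger than 399.4M and the final rocfft output itself is 5.3M.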

@kira-bruneau
Contributor Author

kira-bruneau commented Jun 1, 2023

I have commit access to merge this, but I was just wondering if this looks ok to you @Flakebi & @Madouura before I go ahead!

Also, I can add myself as a maintainer to help with future maintenance around this split if it'd help!

@Madouura
Contributor

Madouura commented Jun 1, 2023

It's building well so far. As for the LLD issue, I would suggest making a rocmClangNoLLDStdEnv with an overridden clang (clangNoLLD?) in pkgs/development/compilers/llvm/rocm/default.nix and setting rocfft in all-packages.nix to use rocmClangNoLLDStdEnv. (Edit: This may be a bad idea and could result in OOM.)
So far LGTM.

Would love to have you as a maintainer, good work!

@kira-bruneau
Contributor Author

kira-bruneau commented Jun 1, 2023

Oh, the issue I was pointing out was that the build wasn't running in parallel (e.g. the last chunk of the build would only use one core for a long stretch), not that there were problems with it running in parallel. I think it'd make sense to stick with lld.

To avoid output limit exceeded errors in hydra, we build kernel device
libs and the kernel RTC cache database in separate derivations
@ofborg ofborg bot added the 11.by: package-maintainer This PR was created by the maintainer of the package it changes label Jun 1, 2023
@Madouura
Contributor

Madouura commented Jun 1, 2023

Getting segmentation faults with tests, benchmarks, and samples.
If torchWithRocm compiles and runs my benchmark I won't consider it a blocker though.

@kira-bruneau
Contributor Author

Hmm, I wonder if they're just segfaulting because of the way I've split them out. I could leave off the last commit to keep the param-based approach.

@Madouura
Contributor

Madouura commented Jun 1, 2023

Reverting the last commit doesn't seem to fix tests or benchmarks. Oh well.

Contributor

@Madouura Madouura left a comment

Builds and torchWithRocm runs fine.
LGTM

@mweinelt mweinelt added the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Jun 2, 2023
@kira-bruneau kira-bruneau merged commit 564e538 into NixOS:master Jun 2, 2023
@Madouura Madouura mentioned this pull request Jun 5, 2023
12 tasks
Labels
10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 11.by: package-maintainer This PR was created by the maintainer of the package it changes 12.approvals: 1 This PR was reviewed and approved by one reputable person

5 participants