Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci][fix] Fix cuda_exp ci #5438

Merged
merged 10 commits into from
Aug 29, 2022
Merged

[ci][fix] Fix cuda_exp ci #5438

merged 10 commits into from
Aug 29, 2022

Conversation

shiyu1994
Copy link
Collaborator

Unfortunately, in a very long time (since the merge of #4630), the ci tests for cuda_exp has not actually run.
Since the task name for both cuda and cuda_exp is cuda


in test.sh the cuda_exp option cannot be detected, and it will always run for device cuda!

LightGBM/.ci/test.sh

Lines 198 to 199 in 702db13

elif [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
if [[ $TASK == "cuda" ]]; then

That's a serious mistake. And I think we need to fix it right now for our on-going development of cuda_exp.

@shiyu1994
Copy link
Collaborator Author

See #5425 (comment)

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for noticing and proposing a fix for this! But I think there's a much simpler fix that will achieve the same goal.

I think doing the following would prevent adding new complexity to .ci/test.sh:

  1. replace treelearner: in .github/workflows/cuda.yml with task:
  2. remove this line:
  3. change the following line to TASK="${{ matrix.task }}":
    TASK=${{ env.task }}

I remember we had a discussion in #4630 about not adding more complexity to test.sh: #4630 (comment).


In addition to that request...right now, both cuda_exp CI jobs are failing with the following error

Fatal Python error: Aborted

(build link)

We should never merge a PR that will result in CI being broken on master, so please do one of these:

  • push a fix for that issue to this PR
  • remove the cuda_exp CI jobs in this PR, document the work to add them back in a separate issue, merge this PR, and then push a follow-up PR fixing cuda_exp and adding back the CI jobs

I hope a fix can just be pushed here, but I don't know how complicated it will be to debug this. If it is very complicated, then we might as well eliminate those two cuda_exp jobs that are not actually testing cuda_exp, to save CI time.

And if the "make an issue and fix this later" approach is taken, then I think we should not merge any more cuda_exp PRs until those CI jobs have been re-enabled.

.ci/test.sh Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

I've added the label maintenance to this PR.

As I mentioned in #5403 (comment) and #5413 (comment), please add a labels when you create pull requests here, so your PR will be correctly categorized in the release notes.

@shiyu1994
Copy link
Collaborator Author

shiyu1994 commented Aug 25, 2022

I think doing the following would prevent adding new complexity to .ci/test.sh:

Thanks for the suggestion. Already applied via 0206da8 and eb34dd5.

We should never merge a PR that will result in CI being broken on master, so please do one of these:

  1. push a fix for that issue to this PR
  2. remove the cuda_exp CI jobs in this PR, document the work to add them back in a separate issue, merge this PR, and then push a follow-up PR fixing cuda_exp and adding back the CI jobs

I'd prefer the first choice, and already pushed the fixes via ca7df38.

As I mentioned in #5403 (comment) and #5413 (comment), please add a labels when you create pull requests here, so your PR will be correctly categorized in the release notes.

Thanks again for the reminder.

@shiyu1994
Copy link
Collaborator Author

Note that this is blocking other related PRs, including (#5425 and #4827).

@shiyu1994 shiyu1994 requested a review from jmoralez as a code owner August 25, 2022 11:25
@shiyu1994
Copy link
Collaborator Author

This is ready. @jameslamb @guolinke Could you please review this again. If it looks good to you, let's merge this so that we can work on related PRs. Thanks.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for .github/workflows/cuda.yml and tests/python_package_test/test_utilities.py except one minor simplification.

Comment on lines +459 to +460
CHECK_EQ(gradients_pointer_, gradients_.data());
CHECK_EQ(hessians_pointer_, hessians_.data());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these checks slow? Maybe move them under #ifdef DEBUG?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. These checks are fast.

tests/python_package_test/test_utilities.py Outdated Show resolved Hide resolved
@jameslamb jameslamb dismissed their stale review August 28, 2022 16:31

dismissing to prevent blocking, as I might not be able to review again today

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any additional comments.

Please address @StrikerRUS 's suggestions, and then I think @guolinke should re-review this (since you've added significant changes since his original review) prior to this being merged.

Thanks for fixing this!

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@shiyu1994
Copy link
Collaborator Author

shiyu1994 commented Aug 29, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2945515006

Status: success ✔️.

@shiyu1994
Copy link
Collaborator Author

/gha run r-solaris

@jameslamb
Copy link
Collaborator

@shiyu1994 Solaris support was removed in #5226. The r-solaris check no longer exists, and is no longer necessary.

But thank you for being so thorough!

@shiyu1994
Copy link
Collaborator Author

@guolinke @jameslamb @StrikerRUS Thanks for reviewing this. I'll merge this since all tests have been passed and it is blocking many related PRs including #4827, #5425 and #4266.

@shiyu1994 shiyu1994 merged commit be7f321 into master Aug 29, 2022
@shiyu1994 shiyu1994 deleted the cuda/fix-cuda-exp-ci branch August 29, 2022 06:38
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
40 tasks
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants