Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add streaming concurrency tests #5437

Merged
merged 6 commits into from
Sep 3, 2022
Merged

Conversation

svotaw
Copy link
Contributor

@svotaw svotaw commented Aug 23, 2022

This PR adds concurrency testing to new streaming APIs added in PR #5299.

It also attempts to fix a concurrency issue in SparseBin PushRow API. WIthout this fix, concurrent calls often fail to the LGBM_DatasetPushRowsByCSRWithMetadata API.

@svotaw svotaw marked this pull request as draft August 24, 2022 20:49
@svotaw svotaw marked this pull request as ready for review August 24, 2022 22:24
@svotaw
Copy link
Contributor Author

svotaw commented Aug 24, 2022

This PR is 99% new CPP tests, which pass on the pipeline. I'm not sure why there are so many other pipeline failures (first commit showed no such failures). The 1-line change in production code seems to be innocuous for python tests, but feel free to tell me otherwise.

Although this helps with concurrency failures locally, we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests.

@jameslamb jameslamb self-requested a review August 25, 2022 04:44
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your continued work on LightGBM, @svotaw !

we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests

I'm not sure who "we" / "our" refers to in this statement or what you mean by "once checked in", so I want to be sure to set the right expectation...we will not merge pull requests that do not pass all CI jobs into master in this repository.

This PR will not be merged until all of the existing CI jobs are passing.

The 1-line change ... seems to be innocuous for python tests, but feel free to tell me otherwise

When working on LightGBM, it's important to understand that the Python package's tests are the most comprehensive tests this project has for its C/C++ code. Those tests cover more of the project's C/C++ code than the dedicated tests in https://github.com/microsoft/LightGBM/tree/master/tests/cpp_tests.

The R package tests also cover a significant portion of the C/C++ code, and as of #5312 at least one code path not covered by the Python tests (LGBM_BoosterPredictForMatSingleRowFast() and related code).

So when you push a change to C/C++ code in this project and see most of the R and Python tests failing, that is a strong indication that the change you've pushed is a breaking change to LightGBM.

Maybe you've uncovered a correctness bug while trying to fix a source of runtime exceptions (which would be great!) or maybe this change has a bigger impact on model training than just fixing a runtime exception related to concurrency issues. I'm not experienced enough with C++ or the implementation of SparseBin to say for sure. @guolinke or @shiyu1994 should be able to help.

Please let me know if you need help with how to run only the failing Python test cases locally (so you can work on this with a faster feedback cycle than waiting for all of this project's CI jobs to run).

src/io/sparse_bin.hpp Outdated Show resolved Hide resolved
@svotaw
Copy link
Contributor Author

svotaw commented Aug 25, 2022

I'm not sure who "we" / "our" refers to in this statement or what you mean by "once checked in", so I want to be sure to set the right expectation...we will not merge pull requests that do not pass all CI jobs into master in this repository.

@jameslamb By "we", I mean our SynapseML wrapper. And yes, of course I would never expect anything to be checked in to LightGBM without passing all tests. :) I just thought I'd check and see if there were any current CI problems that might have caused those other things to fail. They were passing on an earlier iteration of this that included the only production code change that is left. But if there are no other issues, I will investigate further. Probably this change then.

@guolinke
Copy link
Collaborator

@svotaw I am quite confused about this PR, what does "concurrency" mean?
The PushRow is already designed for multi-threading, and LightGBM also uses a multi-threading solution to construct Dataset.

@svotaw
Copy link
Contributor Author

svotaw commented Aug 26, 2022

@svotaw I am quite confused about this PR, what does "concurrency" mean? The PushRow is already designed for multi-threading, and LightGBM also uses a multi-threading solution to construct Dataset.

You can also say multi-threading instead of concurrency. I mean the same thing.

The multi-threaded tests in this PR were just added since someone asked for them in the previous PR. I said that I would add them, and this is the followup.

And by the way, the external LGBM_DatasetPushRowsByCSR is NOT thread safe, which is why I modified LGBM_DatasetPushRowsByCSRWithMetadata to be able to handle being called by multiple threads. The old LGBM_DatasetPushRowsByCSR API will crash if you call it from multiple threads. We found that out empirically. Each external thread will generate the same set of OpenMP tid's, which means they will attempt to push to the same sparse buffers and eventually fail. LGBM_DatasetPushRowsByCSRWithMetadata is supposed to be completely thread safe (although see next).

As far as changing that one production code line in SparseBin.hpp, that is an experiment. Even with the thread-safe design, we still see some crash failures in SparseBin.PushRow() in our own tests using the top of LightGBM main (we made a maven package of it for SynapseML). They have been hard to debug, because they aren't reproducible locally and have only occurred in our CI pipeline runs (making it hard to get crash dumps). We only know the method that fails. I was using this PR to help find the issue, and making that change seemed to make the tests pass reliably. But admittedly, I'm not convinced yet that it is needed, so will experiment some more before asking to check this in.

The changes to tests (99% of this PR) should have no production impact, as long as they pass reliably and don't cause CI issues. They are just to have the extra multi-threaded coverage. Note that these tests only work for LGBM_DatasetPushRowsByCSRWithMetadata. LGBM_DatasetPushRowsByCSR would fail if tested the same way.

Note that I manually made threads in the test since OpenMP does not seem to work for tests. I tried, but settled on manual thread creation.

@svotaw
Copy link
Contributor Author

svotaw commented Aug 26, 2022

@guolinke here is the error we see with multi-threaded sparse data in our SynapseML tests:
JRE version: OpenJDK Runtime Environment (8.0_345-b01) (build 1.8.0_345-b01)
Java VM: OpenJDK 64-Bit Server VM (25.345-b01 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [lib_lightgbm.so+0x18a72b] LightGBM::SparseBin::Push(int, int, unsigned int)+0x3b

It might be something other than a threading issue but given the debugging I've done it seems highly likely.

@jameslamb
Copy link
Collaborator

@shiyu1994 or @guolinke can you please re-review this, now that it's only testing changes and CI is passing?


passing on an earlier iteration of this that included the only code change that is left

I don't think that's true.

The first commit in this PR, 62a97c2, which was passing all CI jobs except lint, contained only this change + testing changes:

image

Then cc3f431 moved that code to a different location and introduced another change.

image

From that commit until you removed the SparseBin changes in e796f5e, most of the Python and R jobs were failing.

Just pointing it out in case it helps you in your investigation.

@svotaw
Copy link
Contributor Author

svotaw commented Aug 31, 2022

Yes, apologies, was working on something else for last week.

Yes, indeed it looks like you were right and the production code change was responsible for the failures. I have removed it and this PR can be reviewed as just test changes.

The SparseBin:Push method still crashes on our Synapse CI linux machines when run multithreaded, and not sure why yet. But at least now we have tests here in this repo that prove it's not a general LightGBM problem and is something else.

@svotaw svotaw requested review from jameslamb and removed request for shiyu1994 September 2, 2022 19:56
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

In the future, please don't push PRs with whitespace-only changes like this

image

unless those changes are directly related to the goal of the PR.

We rely on the git blame view as a source of history for changes to files, and whitespace-only changes that are not related to functionality add friction to the use of that view.

I'm approving this one since these tests are new and under such active development, and to avoid delaying you further.

@svotaw
Copy link
Contributor Author

svotaw commented Sep 2, 2022

@jameslamb Ah, my apologies. The teams I've been on have always had the policy to do simple style fixes whenever they saw them, but I will avoid doing that here from now on.
Do you just make separate PRs for this kind of change?

@jameslamb
Copy link
Collaborator

Do you just make separate PRs for this kind of change?

thanks very much! Yes, separate PRs are preferred for unrelated whitespace changes, typo/grammar fixes, etc. I personally have found that having all the changes in a PR's diff relate only to the stated purpose of the PR (in this example, "add more tests") makes PRs much easier to review.

And I personally use this project's git blame regularly to trace back to the relevant PRs that changed a specific section of the code.

@jameslamb jameslamb changed the title fix: Add streaming concurrency tests Add streaming concurrency tests Sep 3, 2022
@jameslamb jameslamb merged commit d0ea321 into microsoft:master Sep 3, 2022
@@ -320,7 +320,7 @@ TEST(Stream, PushSparseRowsWithMetadata) {
TestUtils::CreateRandomSparseData(nrows, ncols, nclasses, sparse_percent, &indptr, &indices, &vals, &labels, &weights, &init_scores, &groups);

const std::vector<int32_t> batch_counts = { 1, nrows / 100, nrows / 10, nrows };
const std::vector<int8_t> creation_types = { 0, 1 };
const std::vector<int8_t> creation_types = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svotaw
Why there are only zeros here now and no any ones?
Previously creation_types = { 0, 1 }; made sense to me, but now I'm very confused by such test parametrization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS oops, yes need to revert this. It was a quick and dirty experiment just to force many repeats. Some of the early testing failures were only sporadic, so I added this to give me better confidence on no failures. I will fix.

I had planned to do a last pass over the PR to look for things like this, but it got checked in. Not used to someone else being in control of the checkin. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, thanks a lot for your awesome work on LightGBM!

Luckily, we need to revert just one line of the code. 🙂

@svotaw svotaw deleted the streaming-tests branch September 4, 2022 20:22
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
40 tasks
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants