Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FOLLOW-UP] Support for building DMatrix from Apache Arrow data format #7512

Merged
merged 1 commit into from
Mar 15, 2022

Conversation

xwu99
Copy link
Contributor

@xwu99 xwu99 commented Dec 16, 2021

Continue work for #7283 #5667

@xwu99
Copy link
Contributor Author

xwu99 commented Dec 16, 2021

@trivialfis This is a follow-up work for #7283, Could you help to kick off CI? It looks it is still using Jenkins there but my fork is using github actions? There is a type check error in github actions but not relavent to this PR.

image

@trivialfis
Copy link
Member

Could you rebase to master?

@xwu99
Copy link
Contributor Author

xwu99 commented Dec 16, 2021

rebased. I didn't change xgboost/dask.py. Could you approve ci again? @trivialfis

@trivialfis
Copy link
Member

Submitted a fix #7513 .

@trivialfis
Copy link
Member

I saw some compiler warnings with gcc 9.3 and nvcc 11.4, some of them like pointless comparison are quite useful. Please review and address them.

log.txt

@xwu99
Copy link
Contributor Author

xwu99 commented Dec 20, 2021

@trivialfis I have enabled test_with_arrow for CI.
Github Actions: "macos-10.15" passed. But "windows-2016" just skipped the test.
Jenkins: "Win64 Test" failed.
It seems they have very different packages installed for different OSes. I need more time to debug this.

@trivialfis
Copy link
Member

Thanks for the update. Will review the PR later. I don't mind disabling Windows support for now.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 4, 2022

Thanks for the update. Will review the PR later. I don't mind disabling Windows support for now.

I have fixed for win64_cpu_test.yml (added cffi) but looks there is some other packages missing in win64_test.yml in order to run with arrow. The test case error silently without useful messages.

@trivialfis
Copy link
Member

It seems to be hanging on both Linux and Windows.

https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-7512/18/pipeline/180

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 5, 2022

It seems to be hanging on both Linux and Windows.

https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-7512/18/pipeline/180

Thanks, you are right, I will look into details.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 5, 2022

It seems to be hanging on both Linux and Windows.

https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-7512/18/pipeline/180

@trivialfis Hi, my Jenkinsfile-win64 has changed for debugging, I have added -DUSE_ARROW=ON to cmake and also changed pytest line.

But the change is not effective, Jenkins still execute the old Jenkinsfile-win64 file. I am not familiar with Jenkins, anything missing?

https://xgboost-ci.net/blue/organizations/jenkins/xgboost-win64/detail/PR-7512/22/pipeline/39

@trivialfis
Copy link
Member

@hcho3 Could you please help take a look if time allows?

@hcho3
Copy link
Collaborator

hcho3 commented Jan 5, 2022

The test server should now use the updated Jenkinsfile.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 6, 2022

The test server should now use the updated Jenkinsfile.

Thanks a lot, I will rebase to master next time. Seems Jenkins only pull in rebased one.

@trivialfis
Copy link
Member

I have interrupted the win64 CI as it was approaching 3 hours.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 6, 2022

It seems to be hanging on both Linux and Windows.
https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-7512/18/pipeline/180

Thanks, you are right, I will look into details.

The Linux one is fixed by adding -DUSE_ARROW=ON, for Win64 it passed yesterday by adding -DUSE_ARROW=ON https://xgboost-ci.net/blue/organizations/jenkins/xgboost-win64/detail/PR-7512/24/pipeline. but the same setting hang today.

@trivialfis
Copy link
Member

trivialfis commented Jan 6, 2022

Thanks a lot, I will rebase to master next time. Seems Jenkins only pull in rebased one.

Please let me know when you have rebased the branch. I will make some more changes based on your current PR. I couldn't open clean PR to your repository due to the upstream merge.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 6, 2022

@trivialfis I have rebased now. Right now when I only test tests\python\test_with_arrow.py, it passed. I am not sure why it doesn't proceed as a whole (tests\python).

@trivialfis
Copy link
Member

I am not sure why it doesn't proceed as a whole (tests\python).

I don't know either. Could you reproduce it locally?

I opened a PR for cmake script: xwu99#1 will open some other PRs later.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 7, 2022

I am not sure why it doesn't proceed as a whole (tests\python).

I don't know either. Could you reproduce it locally?

I opened a PR for cmake script: xwu99#1 will open some other PRs later.

I will check Windows locally. Could you cancel the checks? it seems to hang.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 9, 2022

I am not sure why it doesn't proceed as a whole (tests\python).

I don't know either. Could you reproduce it locally?
I opened a PR for cmake script: xwu99#1 will open some other PRs later.

I will check Windows locally. Could you cancel the checks? it seems to hang.
0
It's a hard to find bug. Digging out code locally on Win10 finally I found improrting modin.pandas mess up with pyarrow iterator on Windows. Attach the reproducer. hang on next(rb_iter) if comment out import modin line.

Environment:

OS: Windows 10
modin 0.11.3 py38haa95532_0
pyarrow 6.0.1 py38h0b4a547_7_cpu conda-forge

test_modin_arrow_hang.zip

For XGboost , the previous running of test_dmatrix.py: test_unknown_data (imported modin.pandas) impact the following test_with_arrow.

I have not found a workaround for modin and pyarrow to co-exist on Windows yet. If user import modin.pandas first, and then use arrow to load, it will hang.

[UPDATE] It also hangs on Linux. If I removed modin, all tests passed.
[UPDATE] only works with arrow 4.0 rather than arrow 6.0 (latest modin)

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 15, 2022

@trivialfis I have specified arrow to use 4.0 version in the CI instead of latest version due to importing modin resulting in hang for arrow 6.0. May need to file an issue for modin.

@trivialfis
Copy link
Member

@xwu99 Thanks for the hard work on this. Do you know of any plan on supporting latest arrow and modin?

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 15, 2022

I will contact Modin to see if there is a fix.

[UPDATE] modin-project/modin#3982
[UPDATE] It's a pyarrow issue. workaround in https://issues.apache.org/jira/browse/ARROW-15362

@trivialfis
Copy link
Member

@xwu99 Do you mind if I rebase your commits?

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 17, 2022

@xwu99 Do you mind if I rebase your commits?

Pls go ahead to do whatever you need. Btw. Modin team is looking into the hang issue.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 19, 2022

@trivialfis I think I have fixed arrow hang issue from workaround https://issues.apache.org/jira/browse/ARROW-15362. the CI error seems irrelavant, you can rebase as you wish.

@trivialfis
Copy link
Member

Now the arrow is built by default, clang tidy will check the code.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 20, 2022

Now the arrow is built by default, clang tidy will check the code.

ok, I will run clang tidy locally to check.

@trivialfis
Copy link
Member

I will follow up on this PR. Apologies for being slow here.

@xwu99
Copy link
Contributor Author

xwu99 commented Jan 20, 2022

I will follow up on this PR. Apologies for being slow here.

Thanks. for clang-tidy some renames will do. pls let me know when you need me.

@trivialfis
Copy link
Member

I made some cleanups to the PR by removing unused code. From my perspective, it's best to have everything in an adapter instead of DMatrix, but I don't think I will try to implement that in this release.

@hcho3 @RAMitchell Could you please take a look?

Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame we can't use any existing data iterator structure to do this. After all the attempted refactoring of xgboost's data import mechanisms and DMatrix it seems to get more and more confusing and we are just adding API functions for each new format.

@trivialfis
Copy link
Member

The arrow format can chunk the data arbitrarily for both rows and columns, which makes iterating it really difficult. We can move some of the code into adapter at the cost of some inefficiency. Will follow up tomorrow and rebase.

* Integrate with Arrow C data API.
* Support Arrow dataset.
* Support Arrow table.

Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
@trivialfis trivialfis merged commit 613ec36 into dmlc:master Mar 15, 2022
@trivialfis
Copy link
Member

Thank you for the great work and apologies for the slow response.

@pseudotensor
Copy link
Contributor

pseudotensor commented May 8, 2022

It's a shame we can't use any existing data iterator structure to do this. After all the attempted refactoring of xgboost's data import mechanisms and DMatrix it seems to get more and more confusing and we are just adding API functions for each new format.

And the datatable ingest became horribly slow after the refactor.

@hcho3
Copy link
Collaborator

hcho3 commented May 8, 2022

@pseudotensor Thanks for letting us know. Please file a new GitHub issue so that we can track it. Or is it related to #7642?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants