-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FOLLOW-UP] Support for building DMatrix from Apache Arrow data format #7512
Conversation
@trivialfis This is a follow-up work for #7283, Could you help to kick off CI? It looks it is still using Jenkins there but my fork is using github actions? There is a type check error in github actions but not relavent to this PR. |
Could you rebase to master? |
rebased. I didn't change xgboost/dask.py. Could you approve ci again? @trivialfis |
Submitted a fix #7513 . |
I saw some compiler warnings with gcc 9.3 and nvcc 11.4, some of them like |
@trivialfis I have enabled test_with_arrow for CI. |
Thanks for the update. Will review the PR later. I don't mind disabling Windows support for now. |
I have fixed for win64_cpu_test.yml (added cffi) but looks there is some other packages missing in win64_test.yml in order to run with arrow. The test case error silently without useful messages. |
It seems to be hanging on both Linux and Windows. https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-7512/18/pipeline/180 |
Thanks, you are right, I will look into details. |
@trivialfis Hi, my Jenkinsfile-win64 has changed for debugging, I have added -DUSE_ARROW=ON to cmake and also changed pytest line. But the change is not effective, Jenkins still execute the old Jenkinsfile-win64 file. I am not familiar with Jenkins, anything missing? https://xgboost-ci.net/blue/organizations/jenkins/xgboost-win64/detail/PR-7512/22/pipeline/39 |
@hcho3 Could you please help take a look if time allows? |
The test server should now use the updated Jenkinsfile. |
Thanks a lot, I will rebase to master next time. Seems Jenkins only pull in rebased one. |
I have interrupted the win64 CI as it was approaching 3 hours. |
The Linux one is fixed by adding -DUSE_ARROW=ON, for Win64 it passed yesterday by adding -DUSE_ARROW=ON https://xgboost-ci.net/blue/organizations/jenkins/xgboost-win64/detail/PR-7512/24/pipeline. but the same setting hang today. |
Please let me know when you have rebased the branch. I will make some more changes based on your current PR. I couldn't open clean PR to your repository due to the upstream merge. |
@trivialfis I have rebased now. Right now when I only test tests\python\test_with_arrow.py, it passed. I am not sure why it doesn't proceed as a whole (tests\python). |
I don't know either. Could you reproduce it locally? I opened a PR for cmake script: xwu99#1 will open some other PRs later. |
I will check Windows locally. Could you cancel the checks? it seems to hang. |
Environment: OS: Windows 10 For XGboost , the previous running of test_dmatrix.py: test_unknown_data (imported modin.pandas) impact the following test_with_arrow. I have not found a workaround for modin and pyarrow to co-exist on Windows yet. If user import modin.pandas first, and then use arrow to load, it will hang. [UPDATE] It also hangs on Linux. If I removed modin, all tests passed. |
@trivialfis I have specified arrow to use 4.0 version in the CI instead of latest version due to importing modin resulting in hang for arrow 6.0. May need to file an issue for modin. |
@xwu99 Thanks for the hard work on this. Do you know of any plan on supporting latest arrow and modin? |
I will contact Modin to see if there is a fix. [UPDATE] modin-project/modin#3982 |
@xwu99 Do you mind if I rebase your commits? |
Pls go ahead to do whatever you need. Btw. Modin team is looking into the hang issue. |
@trivialfis I think I have fixed arrow hang issue from workaround https://issues.apache.org/jira/browse/ARROW-15362. the CI error seems irrelavant, you can rebase as you wish. |
Now the arrow is built by default, clang tidy will check the code. |
ok, I will run clang tidy locally to check. |
I will follow up on this PR. Apologies for being slow here. |
Thanks. for clang-tidy some renames will do. pls let me know when you need me. |
I made some cleanups to the PR by removing unused code. From my perspective, it's best to have everything in an adapter instead of DMatrix, but I don't think I will try to implement that in this release. @hcho3 @RAMitchell Could you please take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a shame we can't use any existing data iterator structure to do this. After all the attempted refactoring of xgboost's data import mechanisms and DMatrix it seems to get more and more confusing and we are just adding API functions for each new format.
The arrow format can chunk the data arbitrarily for both rows and columns, which makes iterating it really difficult. We can move some of the code into adapter at the cost of some inefficiency. Will follow up tomorrow and rebase. |
* Integrate with Arrow C data API. * Support Arrow dataset. * Support Arrow table. Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
618717d
to
2fd751e
Compare
Thank you for the great work and apologies for the slow response. |
And the datatable ingest became horribly slow after the refactor. |
@pseudotensor Thanks for letting us know. Please file a new GitHub issue so that we can track it. Or is it related to #7642? |
Continue work for #7283 #5667