Export Python Interface for external memory. #7070

Merged: 14 commits, Jul 22, 2021

@trivialfis trivialfis commented Jun 30, 2021

This is the final PR of the original Sparse DMatrix rewrite. The original description is kept in the section below. This PR exposes the Python API for the data iterator. Beyond the interface, it also finalizes the documentation, examples, and tests.

Old description

This is a proof of concept for using an iterative-DMatrix-style callback to handle external memory. The data iterator in Python can now also handle CPU data and be used by CPU-based algorithms. Details:

  • I implemented caching with an iterative-DMatrix-style callback, using simple, deterministic, lock-free asynchronous fetching. Writing to the cache, however, is sequential.
  • The cache file is named after a pointer address, which should avoid most collisions.
  • Users can now define their own data iterator without needing a parser in XGBoost. I put together an example that uses single-node Dask (without distributed) as a lazy data generator and lets XGBoost consume data chunks from it.
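
The callback style described in the list above can be sketched in plain Python. This is only an illustration, not XGBoost's implementation: `BatchIter`, `build_cache`, and `read_cache` are hypothetical names, pickle files stand in for the real page format, and a single worker thread stands in for the lock-free fetcher. It shows a user-defined iterator with `next`/`reset` callbacks, sequential cache writes, and a one-page-ahead background read that keeps the original order.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


class BatchIter:
    """Hypothetical user-defined iterator: hands out one batch per `next` call."""

    def __init__(self, batches: Sequence[List[float]]) -> None:
        self._batches = batches
        self._pos = 0

    def next(self, consume: Callable[[List[float]], None]) -> bool:
        """Pass the next batch to `consume`; return False when exhausted."""
        if self._pos == len(self._batches):
            return False
        consume(self._batches[self._pos])
        self._pos += 1
        return True

    def reset(self) -> None:
        """Rewind so the data can be iterated again."""
        self._pos = 0


def build_cache(it: BatchIter, prefix: str) -> List[str]:
    """Drain the iterator and write each batch to its own file, one at a time."""
    paths: List[str] = []

    def write(batch: List[float]) -> None:
        path = f"{prefix}-{len(paths)}.page"
        with open(path, "wb") as fd:
            pickle.dump(batch, fd)
        paths.append(path)

    it.reset()
    while it.next(write):
        pass
    return paths


def _load(path: str) -> List[float]:
    with open(path, "rb") as fd:
        return pickle.load(fd)


def read_cache(paths: Sequence[str]) -> Iterator[List[float]]:
    """Yield cached batches in order while the next one loads in the background."""
    if not paths:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(_load, paths[0])
        for nxt in paths[1:]:
            current = pending.result()
            pending = pool.submit(_load, nxt)  # start fetching the next page
            yield current
        yield pending.result()


with tempfile.TemporaryDirectory() as tmp:
    source = BatchIter([[1.0, 2.0], [3.0], [4.0, 5.0]])
    cache = build_cache(source, os.path.join(tmp, "cache"))
    pages = list(read_cache(cache))
```

Because each read is submitted only after the previous one has been collected, the pages come back in the same order on every pass, which is one simple way to keep a prefetching reader deterministic.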

Todos

  • Integrate existing ellpack external memory support into this.
  • Remove old external memory implementation.
  • Remove lz4
  • Remove GPU page size.
  • Verify the correctness of column page.
  • Concatenate ellpack early on.
  • Tests.

@trivialfis trivialfis marked this pull request as draft June 30, 2021 15:02
@trivialfis trivialfis force-pushed the external-iterative-dmatrix-1 branch from d9ccd05 to 7893d66 Compare July 1, 2021 06:15
@trivialfis trivialfis mentioned this pull request Jul 1, 2021
5 tasks

codecov-commenter commented Jul 3, 2021

Codecov Report

Merging #7070 (d8059f4) into master (bd1f3a3) will increase coverage by 0.98%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #7070      +/-   ##
==========================================
+ Coverage   81.60%   82.58%   +0.98%     
==========================================
  Files          13       13              
  Lines        3903     3962      +59     
==========================================
+ Hits         3185     3272      +87     
+ Misses        718      690      -28     
| Impacted Files | Coverage Δ |
|---|---|
| python-package/xgboost/core.py | 83.73% <77.02%> (+2.27%) ⬆️ |
| python-package/xgboost/data.py | 67.20% <78.26%> (+4.05%) ⬆️ |
| python-package/xgboost/__init__.py | 89.47% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd1f3a3...d8059f4. Read the comment docs.

src/data/data.cc (outdated, resolved)
src/data/sparse_page_dmatrix.h (outdated, resolved)
src/data/sparse_page_dmatrix.h (outdated, resolved)
python-package/xgboost/data.py (outdated, resolved)
python-package/xgboost/data.py (outdated, resolved)
src/data/ellpack_page_raw_format.cu (outdated, resolved)
src/data/file_iterator.h (outdated, resolved)
src/data/file_iterator.h (outdated, resolved)
src/data/adapter.h (outdated, resolved)
tests/cpp/helpers.h (outdated, resolved)
src/data/ellpack_page_raw_format.cu (outdated, resolved)

trivialfis commented Jul 7, 2021

Close #7022.
Close #6719.
Close #6336.
Close #6307 (might be reopened if the failure is observed again).
Close #6167.

Related:
#5851


trivialfis commented Jul 8, 2021

Running it multiple times to see whether removing the dmlc parser fixes the std::bad_alloc.

Ran it 5 times so far and it seems fine. I will keep monitoring.

@trivialfis trivialfis force-pushed the external-iterative-dmatrix-1 branch 2 times, most recently from d8eea6b to 53f0361 Compare July 13, 2021 08:39
python-package/xgboost/core.py (outdated, resolved)
python-package/xgboost/core.py (resolved)
@trivialfis trivialfis force-pushed the external-iterative-dmatrix-1 branch from 46aa916 to 2fbdab5 Compare July 16, 2021 04:49
@trivialfis trivialfis self-assigned this Jul 16, 2021
@trivialfis trivialfis marked this pull request as ready for review July 16, 2021 06:27
@trivialfis trivialfis changed the title from "[POC] Use iterative DMatrix for external memory." to "Export Python Interface for external memory." Jul 16, 2021
"""
_T = TypeVar("_T")

def __init__(self, cache_prefix: Optional[str] = None) -> None:

cache_prefix could be a parameter of DMatrix instead of DataIter; I don't have a strong preference. Note, however, that the parameter is useful: users might have a URI that is not a local file path, so we can't drop it.
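
To illustrate why an explicit prefix is still needed, here is a minimal sketch of the decision involved. `resolve_cache_prefix` is a hypothetical helper, not part of XGBoost, and the `s3://` URI is an assumed example of a non-local source. A local path can double as a cache location, while a remote URI cannot, so the caller has to supply one.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_cache_prefix(uri: str, cache_prefix: Optional[str] = None) -> str:
    """Hypothetical helper: choose where cache files should be written."""
    if cache_prefix is not None:
        return cache_prefix  # an explicit prefix always wins
    parsed = urlparse(uri)
    if parsed.scheme in ("", "file"):
        return parsed.path  # a local path can double as the prefix
    # A remote URI gives no usable local location, so the caller must pass one.
    raise ValueError(f"cannot derive a local cache prefix from {uri!r}")


local = resolve_cache_prefix("/data/train.libsvm")
remote = resolve_cache_prefix("s3://bucket/train.libsvm", cache_prefix="/tmp/xgb-cache")
```

Calling the helper with the remote URI and no `cache_prefix` raises `ValueError`. The same argument applies whether the parameter lives on DataIter or on DMatrix.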

demo/c-api/external-memory/external_memory.c (outdated, resolved)
demo/c-api/external-memory/README.md (outdated, resolved)
python-package/xgboost/data.py (outdated, resolved)
src/data/proxy_dmatrix.cu (resolved)
tests/python/test_data_iterator.py (resolved)
@trivialfis trivialfis merged commit e608836 into dmlc:master Jul 22, 2021
@trivialfis trivialfis deleted the external-iterative-dmatrix-1 branch July 22, 2021 07:15
4 participants