fix: fix ray sink error when there are no data to write #2919

Merged (3 commits) on Sep 27, 2024

@SaintBacchus (Contributor) commented Sep 20, 2024

Python code to reproduce:

import ray
from lance.ray.sink import LanceDatasink

ray.init()

sink = LanceDatasink("./data.lance")
ray.data.range(10).filter((lambda row: row["id"] > 10)).map(lambda x: {"id": x["id"], "str": f"str-{x['id']}"}).write_datasink(sink)

When the Lance Ray sink is used to write a Lance dataset, an empty write (which can happen when a filter operator in Ray Data removes every row) raises the following exception:

  File "/opt/conda/lib/python3.11/site-packages/ray/data/dataset.py", line 3621, in write_datasink
    datasink.on_write_complete(write_results)
  File "/opt/conda/lib/python3.11/site-packages/lance/ray/sink.py", line 141, in on_write_complete
    op = lance.LanceOperation.Overwrite(schema, fragments)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 5, in __init__
  File "/opt/conda/lib/python3.11/site-packages/lance/dataset.py", line 1962, in __post_init__
    raise TypeError(
TypeError: schema must be pyarrow.Schema, got <class 'NoneType'>

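The on_write_complete function takes the schema from the written fragments. If there are no fragments, the schema is None, so constructing lance.LanceOperation.Overwrite fails with the TypeError shown in the traceback.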
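
For illustration only, here is a minimal sketch of one way to guard against this case. It is not the patch from this PR. The helper names (`collect_write_results`, `on_write_complete_sketch`), the assumed shape of `write_results` (one list of `(fragment, schema)` pairs per write task), and the `commit` callback are all hypothetical stand-ins, so the snippet runs without Ray or Lance installed.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Assumption: each write task returns a list of (fragment, schema) pairs,
# and a task that received no rows returns an empty list.
TaskResult = Sequence[Tuple[Any, Any]]


def collect_write_results(
    write_results: Sequence[TaskResult],
) -> Tuple[List[Any], Optional[Any]]:
    """Flatten per-task results into (fragments, schema)."""
    fragments: List[Any] = []
    schema: Optional[Any] = None
    for task_result in write_results:
        for fragment, fragment_schema in task_result:
            fragments.append(fragment)
            schema = fragment_schema
    return fragments, schema


def on_write_complete_sketch(
    write_results: Sequence[TaskResult],
    commit: Callable[[Any, List[Any]], None],
) -> bool:
    """Commit only when at least one fragment was written.

    `commit` is a hypothetical stand-in for building and committing the
    overwrite operation. Returns True if a commit happened.
    """
    fragments, schema = collect_write_results(write_results)
    if not fragments:
        # Nothing was written, so schema is still None. Skip the commit
        # instead of passing None where a schema is required.
        return False
    commit(schema, fragments)
    return True
```

With this guard, a run in which every task returns an empty list ends without a commit instead of raising. Whether the sink should skip the commit, warn, or write an empty dataset is a design choice this sketch does not settle.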

github-actions bot added the bug and python labels on Sep 20, 2024

@wjones127 (Contributor) left a comment

Thanks for the PR!

Would you be willing to add a test for this?

@wjones127 (Contributor)

We have ray integration tests here: https://github.com/lancedb/lance/blob/main/python/python/tests/test_ray.py

@westonpace (Contributor)

(agree that a test would be useful)

@SaintBacchus (Contributor, Author)

OK, I will add a test case later

@SaintBacchus force-pushed the RaySinkEmpty branch 2 times, most recently from 1e1e010 to 779254e, on September 26, 2024 06:28

@SaintBacchus (Contributor, Author)

The new commit does not trigger the CI unit tests. Could you please trigger them manually?

@westonpace (Contributor) left a comment

Thanks for adding the tests! A few small suggestions.

python/python/tests/test_ray.py (outdated, resolved)
python/python/lance/ray/sink.py (outdated, resolved)
@westonpace westonpace merged commit 75aa2c2 into lancedb:main Sep 27, 2024
11 checks passed
Labels: bug, python
3 participants