
Regression: Invalid comparison operation: Utf8 == Utf8View error during LEFT ANTI JOIN #13510

Open
Tracked by #13334
sergiimk opened this issue Nov 21, 2024 · 7 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@sergiimk
Contributor

sergiimk commented Nov 21, 2024

Describe the bug

Between versions 42.2.0 and 43.0.0, a regression appears to have been introduced that produces the following error:

External(ArrowError(InvalidArgumentError("Invalid comparison operation: Utf8 == Utf8View"), None))

Note that the error happens at the plan execution phase, i.e. plan validation passes successfully.

To Reproduce

Minimal repro is:

  • Read data from CSV (Utf8 columns)
  • Read data from Parquet (Utf8View)
  • Do a LEFT ANTI JOIN

Including a test project with sample data: datafusion-13510.zip
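The shape of the failing scenario can be sketched in SQL (table names, column DDL, and the OPTIONS syntax are assumptions for illustration, and the exact DDL varies between DataFusion versions; also note that the scenario did not reproduce through datafusion-cli, so registering the tables through the Rust API may matter):

```sql
-- Hypothetical table names; CSV string columns are inferred as Utf8
CREATE EXTERNAL TABLE csv_side (date DATE, city VARCHAR, population BIGINT)
STORED AS CSV LOCATION 'data/data2.csv'
OPTIONS ('format.has_header' 'true');

-- With schema_force_view_types = true (the v43 default),
-- Parquet string columns are exposed as Utf8View
CREATE EXTERNAL TABLE parquet_side
STORED AS PARQUET LOCATION 'data/data1.parquet';

-- NOT EXISTS is planned as a LEFT ANTI hash join on the
-- mixed Utf8 / Utf8View keys
SELECT c.*
FROM csv_side c
WHERE NOT EXISTS (
  SELECT 1 FROM parquet_side p
  WHERE p.date = c.date AND p.city = c.city
);
```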

Physical plan:

CoalesceBatchesExec: target_batch_size=8192
  HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(date@0, date@0), (city@1, city@1)]
    CoalesceBatchesExec: target_batch_size=8192
      RepartitionExec: partitioning=Hash([date@0, city@1], 16), input_partitions=16
        RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
          CsvExec: file_groups={1 group: [[home/.../datafusion-13510/data/data2.csv]]}, projection=[date, city, population], has_header=true
    CoalesceBatchesExec: target_batch_size=8192
      RepartitionExec: partitioning=Hash([date@0, city@1], 16), input_partitions=1
        ParquetExec: file_groups={1 group: [[home/.../datafusion-13510/data/data1.parquet]]}, projection=[date, city]

Expected behavior

No error, or an error raised during planning if some operation is invalid.

Additional context

No response

@sergiimk added the bug (Something isn't working) label on Nov 21, 2024
@ttencate

Related to #13568? I filed that as a separate issue, but it seems like they might have the same root cause.

@findepi
Member

findepi commented Nov 26, 2024

@sergiimk are you able to simplify the failing query so that we have the simplest possible reproducer?

@sergiimk
Contributor Author

@findepi I was slammed this week, unfortunately. I will re-test with the latest master and try to narrow the issue down this weekend.

@alamb
Contributor

alamb commented Dec 3, 2024

Does the problem go away if you turn off this config setting:

https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.schema_force_view_types
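As a sketch, the setting can be turned off in a datafusion-cli session (the same key can also be set programmatically on the session configuration when embedding DataFusion):

```sql
-- Keep Parquet string columns as Utf8 instead of Utf8View
SET datafusion.execution.parquet.schema_force_view_types = false;
```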

We are still working through some additional needed support.

@sergiimk
Contributor Author

sergiimk commented Dec 5, 2024

Apologies for the slow turnaround. Sharing findings from today:

Compiling against the latest DataFusion main did NOT fix the issue.

After simulating my ANTI JOIN in datafusion-cli between CSV and Parquet to force the view/no-view mismatch, I still could not find a smaller repro; I will continue digging.

@alamb setting datafusion.execution.parquet.schema_force_view_types = false made the issue go away. Thanks a lot for the suggestion. This will allow us to upgrade to v43 and not fall behind.

@sergiimk
Contributor Author

sergiimk commented Dec 5, 2024

Phew, finally got the minimal repro:

  • Read data from CSV (Utf8 columns)
  • Read data from Parquet (Utf8View)
  • Do a LEFT ANTI JOIN

Including the test project with sample data:
datafusion-13510.zip

Still not sure why the exact same scenario didn't break when using datafusion-cli.

Will update the ticket description with the repro.

@alamb mentioned this issue on Dec 6, 2024
@alamb
Contributor

alamb commented Dec 22, 2024

Summary:

  • This issue can be worked around by setting the datafusion.execution.parquet.schema_force_view_types config to false

The error comes from arrow-rs (source: https://github.com/apache/arrow-rs/blob/2c84f243b882eff69806cd7294d38bf422fdb24a/arrow-ord/src/cmp.rs#L241).

Here is the stack of the error:

arrow_ord::cmp::compare_op cmp.rs:246
arrow_ord::cmp::eq cmp.rs:79
datafusion_physical_plan::joins::hash_join::eq_dyn_null hash_join.rs:1220
datafusion_physical_plan::joins::hash_join::equal_rows_arr hash_join.rs:1242
datafusion_physical_plan::joins::hash_join::lookup_join_hashmap hash_join.rs:1190
datafusion_physical_plan::joins::hash_join::HashJoinStream::process_probe_batch hash_join.rs:1374
datafusion_physical_plan::joins::hash_join::HashJoinStream::poll_next_impl hash_join.rs:1290
<datafusion_physical_plan::joins::hash_join::HashJoinStream as futures_core::stream::Stream>::poll_next hash_join.rs:1532
<core::pin::Pin<P> as futures_core::stream::Stream>::poll_next stream.rs:130
futures_util::stream::stream::StreamExt::poll_next_unpin mod.rs:1638
datafusion_physical_plan::coalesce_batches::CoalesceBatchesStream::poll_next_inner coalesce_batches.rs:293
<datafusion_physical_plan::coalesce_batches::CoalesceBatchesStream as futures_core::stream::Stream>::poll_next coalesce_batches.rs:229
<core::pin::Pin<P> as futures_core::stream::Stream>::poll_next stream.rs:130
futures_util::stream::stream::StreamExt::poll_next_unpin mod.rs:1638
<futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll next.rs:32
datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}} stream.rs:288
tokio::runtime::task::core::Core<T,S>::poll::{{closure}} core.rs:331
[Inlined] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut unsafe_cell.rs:16
tokio::runtime::task::core::Core<T,S>::poll core.rs:320
tokio::runtime::task::harness::poll_future::{{closure}} harness.rs:499
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once unwind_safe.rs:272
std::panicking::try::do_call panicking.rs:557
__rust_try 0x0000000105056f44
[Inlined] std::panicking::try panicking.rs:520
std::panic::catch_unwind panic.rs:358
tokio::runtime::task::harness::poll_future harness.rs:487
tokio::runtime::task::harness::Harness<T,S>::poll_inner harness.rs:209
tokio::runtime::task::harness::Harness<T,S>::poll harness.rs:154
tokio::runtime::task::raw::poll raw.rs:271
tokio::runtime::task::raw::RawTask::poll raw.rs:201
tokio::runtime::task::LocalNotified<S>::run mod.rs:435
tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}} worker.rs:596
[Inlined] tokio::runtime::coop::with_budget coop.rs:107
[Inlined] tokio::runtime::coop::budget coop.rs:73
tokio::runtime::scheduler::multi_thread::worker::Context::run_task worker.rs:595
tokio::runtime::scheduler::multi_thread::worker::Context::run worker.rs:558
tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}} worker.rs:511
tokio::runtime::context::scoped::Scoped<T>::set scoped.rs:40
tokio::runtime::context::set_scheduler::{{closure}} context.rs:180
std::thread::local::LocalKey<T>::try_with local.rs:283
std::thread::local::LocalKey<T>::with local.rs:260
tokio::runtime::context::set_scheduler context.rs:180
tokio::runtime::scheduler::multi_thread::worker::run::{{closure}} worker.rs:506
tokio::runtime::context::runtime::enter_runtime runtime.rs:65
tokio::runtime::scheduler::multi_thread::worker::run worker.rs:498
tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}} worker.rs:464
<tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll task.rs:42
tokio::runtime::task::core::Core<T,S>::poll::{{closure}} core.rs:331
[Inlined] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut unsafe_cell.rs:16
tokio::runtime::task::core::Core<T,S>::poll core.rs:320
tokio::runtime::task::harness::poll_future::{{closure}} harness.rs:499
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once unwind_safe.rs:272
std::panicking::try::do_call panicking.rs:557
__rust_try 0x000000010683e59c
[Inlined] std::panicking::try panicking.rs:520
std::panic::catch_unwind panic.rs:358
tokio::runtime::task::harness::poll_future harness.rs:487
tokio::runtime::task::harness::Harness<T,S>::poll_inner harness.rs:209
tokio::runtime::task::harness::Harness<T,S>::poll harness.rs:154
tokio::runtime::task::raw::poll raw.rs:271
tokio::runtime::task::raw::RawTask::poll raw.rs:201
tokio::runtime::task::UnownedTask<S>::run mod.rs:472
tokio::runtime::blocking::pool::Task::run pool.rs:161
tokio::runtime::blocking::pool::Inner::run pool.rs:511
tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}} pool.rs:469
std::sys::backtrace::__rust_begin_short_backtrace backtrace.rs:154
std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} mod.rs:538
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once unwind_safe.rs:272
std::panicking::try::do_call panicking.rs:557
__rust_try 0x000000010683008c
[Inlined] std::panicking::try panicking.rs:520
[Inlined] std::panic::catch_unwind panic.rs:358
std::thread::Builder::spawn_unchecked_::{{closure}} mod.rs:537
core::ops::function::FnOnce::call_once{{vtable.shim}} function.rs:250
[Inlined] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once boxed.rs:2454
[Inlined] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once boxed.rs:2454
std::sys::pal::unix::thread::Thread::new::thread_start thread.rs:105
_pthread_start 0x00000001940832e4

I suspect that fixing this issue requires inserting a coercion somewhere in DataFusion so that the join keys are coerced to a common string type.
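Until such a coercion is in place, one possible user-side workaround (an untested sketch, using DataFusion's built-in arrow_cast function and hypothetical table names matching the repro) is to cast the Utf8 join key to Utf8View explicitly:

```sql
SELECT c.*
FROM csv_side c
WHERE NOT EXISTS (
  SELECT 1 FROM parquet_side p
  WHERE p.date = c.date
    AND p.city = arrow_cast(c.city, 'Utf8View')  -- align string types
);
```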

@alamb added the help wanted (Extra attention is needed) label on Dec 22, 2024
4 participants