Bug: RepartitionExec sometimes incorrectly reports "Error" when output is not completely consumed #575

alamb · 2021-06-16T20:41:52Z

Describe the bug
If one of the repartition operator's output streams is not completely consumed, the repartition exec may return an error on one of the other streams.

So roughly the picture looks like:

                   ┌───────────────┐
                   │   Consumer    │
                   └───────────────┘
                           │
              ┌────────────┴─────────────┐
              │                          │
              ▼                          ▼
   ┌────────────────────┐     ┌────────────────────┐
   │RepartitionStream 0 │     │RepartitionStream 1 │
   └────────────────────┘     └────────────────────┘
              │                          │
              │                          ├───────────────┐
              │                          │               ▼
           ┌──┤                          │    ┌────────────────────┐
           │  └──────────────────────────┼───▶│   InputStream B    │
           │                             │    └────────────────────┘
           ▼                             │
┌────────────────────┐                   │
│   InputStream A    │◀──────────────────┘
└────────────────────┘

If RepartitionStream 0 is dropped before both InputStream A and InputStream B have completed, the repartition exec may still try to send a batch to RepartitionStream 0, find the channel closed, and report an error, which is then seen by RepartitionStream 1.
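
To illustrate the failure mode, here is a minimal sketch using only the standard library's unbounded `std::sync::mpsc` channel rather than the channels RepartitionExec actually uses. The names `distribute_strict` and `demo_dropped_consumer` are hypothetical, and `i32` values stand in for record batches. It only shows the general pattern: a send to an output whose receiver has been dropped fails, and if that failure is propagated the whole distribution is reported as an error.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical stand-in for the repartition loop: distributes `batches`
/// round-robin across `outputs` and fails on the first send error.
fn distribute_strict(batches: &[i32], outputs: &[Sender<i32>]) -> Result<(), String> {
    for (i, batch) in batches.iter().enumerate() {
        let idx = i % outputs.len();
        outputs[idx]
            .send(*batch)
            .map_err(|e| format!("error sending to output {}: {}", idx, e))?;
    }
    Ok(())
}

/// Simulates a consumer that hangs up on output 0 before any batch is sent,
/// while the consumer of output 1 is still alive.
fn demo_dropped_consumer() -> Result<(), String> {
    let (tx0, rx0) = channel::<i32>();
    let (tx1, _rx1) = channel::<i32>(); // `_rx1` stays alive until the function returns
    drop(rx0); // output 0 is no longer being consumed
    distribute_strict(&[1, 2, 3, 4], &[tx0, tx1])
}
```

In this sketch `demo_dropped_consumer()` returns an `Err` even though the consumer of output 1 never did anything wrong, which mirrors the behavior described above.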

To Reproduce
I am working on a reproducer.

Reproducing this error is made more challenging by the fact that the repartition stream uses unbounded channels, so it is very timing dependent.

Expected behavior
No errors should be produced when an output stream is dropped before its input is exhausted.
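
One way to get this behavior, sketched with the standard library's `std::sync::mpsc` channel and not taken from the DataFusion code, is to treat a failed send as "this output has hung up" and discard the batch instead of reporting an error. The names `distribute_tolerant` and `demo_tolerant` are hypothetical, and `i32` values stand in for record batches.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical tolerant variant: a failed send means that output's
/// receiver was dropped, so the batch is discarded rather than reported.
/// Returns how many batches were actually delivered.
fn distribute_tolerant(batches: &[i32], outputs: &[Sender<i32>]) -> usize {
    let mut delivered = 0;
    for (i, batch) in batches.iter().enumerate() {
        if outputs[i % outputs.len()].send(*batch).is_ok() {
            delivered += 1;
        }
    }
    delivered
}

/// Drops the receiver for output 0, distributes four batches round-robin,
/// and returns the delivered count plus what output 1 received.
fn demo_tolerant() -> (usize, Vec<i32>) {
    let (tx0, rx0) = channel::<i32>();
    let (tx1, rx1) = channel::<i32>();
    drop(rx0); // output 0 is no longer being consumed
    let delivered = distribute_tolerant(&[1, 2, 3, 4], &[tx0, tx1]);
    // Collect whatever is queued for output 1 without blocking.
    let received: Vec<i32> = rx1.try_iter().collect();
    (delivered, received)
}
```

Here the batches routed to the dropped output are silently skipped, and the remaining output still receives its share without seeing an error.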

Additional context
We have a test that fails intermittently: https://github.com/influxdata/influxdb_iox/issues/1735

In the plan below, 'ExecutionPlan(PlaceHolder)' is an extension node that behaves like LIMIT, in that it may decide to stop consuming its input after producing some output.

The plan being run looks like:

ExecutionPlan(PlaceHolder)
  ProjectionExec: expr=[borough, city, state]
    CoalesceBatchesExec: target_batch_size=500
      FilterExec: 1 <= time AND time < 550 AND CAST(state AS Utf8) = NY
        RepartitionExec: partitioning=RoundRobinBatch(4)
          IOxReadFilterNode: table_name=o2, chunks=1 predicate=Predicate exprs: [TimestampNanosecond(1) LtEq #time, #time Lt TimestampNanosecond(550), #state Eq Utf8("NY")]

While I have recently been messing with RepartitionExec as part of #521, it appears the error behavior predates that change. However, the error is now passed up to the caller.

alamb added the bug Something isn't working label Jun 16, 2021

alamb self-assigned this Jun 16, 2021

alamb mentioned this issue Jun 16, 2021

RepartitionExec should not error if output has hung up #576

Merged

alamb closed this as completed in #576 Jun 18, 2021
