Should we ban full outer join for streaming query? #8084
Comments
cc @yuhao-su |
What's the cause of |
left side: +[null] --> Full Join --> +[null, null]; right side: +[null] --> Full Join --> +[null, null]. From the downstream's view, the full join operator inserts the same row twice. |
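To make the collision concrete, here is a minimal Python sketch. It is not RisingWave code: `full_outer_join` is a hypothetical helper that models single-column tables as lists, with `None` standing in for NULL, and it assumes the downstream stream key is the pair (left pk, right pk).

```python
from collections import Counter

def full_outer_join(left, right):
    """Toy full outer join over single-column tables (None models NULL)."""
    out = []
    matched_right = set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            # NULL never equals anything, including another NULL.
            if l is not None and r is not None and l == r:
                out.append((l, r))
                matched_right.add(i)
                hit = True
        if not hit:
            out.append((l, None))   # unmatched left row, right side padded
    for i, r in enumerate(right):
        if i not in matched_right:
            out.append((None, r))   # unmatched right row, left side padded
    return out

rows = full_outer_join([None, 1], [None, 1])
# Assumed downstream stream key: (left pk, right pk).
duplicates = {key: n for key, n in Counter(rows).items() if n > 1}
print(duplicates)
```

The two unmatched NULL rows, one from each side, both come out as `(None, None)`, so a downstream operator keyed on (left pk, right pk) sees the same key inserted twice.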
|
I don't think it is a bug. The batch query will output 2 rows instead of one row. |
You are right! I can't think of any easy way to fix this. Maybe we should ban it for now. cc. @st1page |
😇 in fact, I prefer to ban all null primary keys. The only origin of that is Materialized view with |
Link to #8059. |
Group by |
I prefer to keep the full outer join. In batch queries it's almost useless, but I guess it might be more useful for stream-to-stream joins. Just a guess. I don't have any idea how to fix it now, though 🤔 |
Maybe we can use a trick like the one I used in the union all operator before: use a project plus a constant (indicating which side) to extend the stream key of both sides. In this way, we can tell the difference between left side +[null] and right side +[null], because they will become +[null, 0] and +[null, 1].
dev=> explain create materialized view v as select * from t union all select * from t;
 QUERY PLAN
-------------------------------------------------------------------------------
 StreamMaterialize { columns: [a, 0:Int32(hidden)], pk_columns: [a, 0:Int32] }
 └─StreamUnion { all: true }
   ├─StreamExchange { dist: HashShard(t.a, 0:Int32) }
   | └─StreamProject { exprs: [t.a, 0:Int32] }
   |   └─StreamTableScan { table: t, columns: [a] }
   └─StreamExchange { dist: HashShard(t.a, 1:Int32) }
     └─StreamProject { exprs: [t.a, 1:Int32] }
       └─StreamTableScan { table: t, columns: [a] }
(8 rows) |
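The side-tagging idea can be sketched in a few lines of Python. This only illustrates the principle and is not the planner's implementation: `tag_side` and `union_all_with_tags` are hypothetical names, and `None` models a NULL key.

```python
LEFT_TAG, RIGHT_TAG = 0, 1  # constants appended by the (assumed) project step

def tag_side(rows, tag):
    """Extend each row's key with a constant that records its input side."""
    return [(row, tag) for row in rows]

def union_all_with_tags(left, right):
    """Union both inputs after tagging, so keys from different sides differ."""
    return tag_side(left, LEFT_TAG) + tag_side(right, RIGHT_TAG)

keys = union_all_with_tags([None], [None])
print(keys)
```

Both inputs carry a NULL key, but the extended keys `(None, 0)` and `(None, 1)` no longer collide.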
I think this is the correct direction. Let me explain more about my thoughts. Theoretically, you may consider a
For the left outer join, only 1 & 2 exist. Luckily, they cannot conflict because a
For the full outer join, however, the problem happens because result rows in 2 & 3 can conflict, as @chenzl25's example shows. Thus, I think adding a column to mark the "source" (1/2/3, as explained above) of the result row is the correct solution, but it might be too heavy. |
Another way to mitigate the problem is to forbid null PKs on base tables. I know this cannot solve the problem completely because you can construct a MView with aggregation, but it can reduce the odds hopefully. By the way, PG also rejects null PK:
|
Yes, we can simply remove the pk constraint and get the same result. I think PG will add a hidden column on the source in this case. I can't think of any way to fully solve this problem by banning null pk from the source, since we have agg. So I prefer the solution of adding a column to mark the "source". The cost of adding 1 column on each side, only for full outer joins, sounds acceptable to me. |
How about just adding a filter that ignores NULL stream keys for the full outer join, as a workaround? |
This will provide incorrect result 🥵 |
We can limit the incorrect part with a narrower predicate on the output of the outer join: just add a filter that removes the outer join's output rows where the output stream key is NULL.
SELECT * from t1 full outer join t2 on t1.pk = t2.pk;
/* will be transformed to */
SELECT * from (
    /* pk columns are aliased so the outer filter can reference them; other columns omitted */
    SELECT t1.pk AS t1_pk, t2.pk AS t2_pk from t1 full outer join t2 on t1.pk = t2.pk
) AS j where NOT (j.t1_pk IS NULL AND j.t2_pk IS NULL); |
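A small Python sketch of what this filter does to the join output, under the assumption that each output row is the pair (t1.pk, t2.pk) with `None` modelling NULL. `drop_all_null_keys` is a hypothetical name, and the sample rows are made up for illustration.

```python
def drop_all_null_keys(joined_rows):
    """Keep only rows where at least one side of the stream key is non-NULL."""
    return [(l, r) for (l, r) in joined_rows if not (l is None and r is None)]

# Illustrative join output: two colliding all-NULL rows plus two ordinary rows.
joined = [(None, None), (1, 1), (None, None), (2, None)]
filtered = drop_all_null_keys(joined)
dropped = len(joined) - len(filtered)
print(filtered, dropped)
```

The remaining keys are unique, so the downstream no longer sees a conflict. The two all-NULL rows are gone from the result, though, and a batch query would have returned them.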
Why don't we just check every derived pk in each plan node and verify that it can't be all nullable? |
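Such a check could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `PlanNode`, its `stream_key` and `nullable` fields, and `nodes_with_all_nullable_key` are hypothetical and do not correspond to the optimizer's real data structures.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class PlanNode:
    """Hypothetical plan node: a derived stream key plus its nullable columns."""
    name: str
    stream_key: List[str]
    nullable: Set[str]
    inputs: List["PlanNode"] = field(default_factory=list)

def nodes_with_all_nullable_key(node):
    """Return names of nodes whose stream key consists only of nullable columns."""
    bad = []
    if node.stream_key and all(col in node.nullable for col in node.stream_key):
        bad.append(node.name)
    for child in node.inputs:
        bad.extend(nodes_with_all_nullable_key(child))
    return bad

# Both scans have NOT NULL keys, but a full outer join can pad either side with NULL.
scan1 = PlanNode("scan_t1", ["t1.pk"], set())
scan2 = PlanNode("scan_t2", ["t2.pk"], set())
join = PlanNode("full_outer_join", ["t1.pk", "t2.pk"], {"t1.pk", "t2.pk"}, [scan1, scan2])
flagged = nodes_with_all_nullable_key(join)
print(flagged)
```

This sketch flags any node whose whole key can be NULL at once. Tracking nullability per column is the piece it takes for granted.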
@liurenjie1024 :lark_cry |
I think this is a bug in how our optimizer determines the primary key of each streaming plan node. In a full outer join, the pk + join key may not be unique when the pk can be null, and in this case we may need to add an extra column to ensure uniqueness. |
Correct, but we need to trade off these very rare cases against complexity. After an offline discussion, I agree with @st1page that we can tolerate this incorrect behavior (i.e. just don't panic). |
It feels weird to me to allow incorrect behavior. Why not just ban the kind of query that may cause this wrong behavior? I think this is much effort. |
We will ban it once we have a "not null" property in the optimizer; for now there is a workaround in #8520. |
Describe the bug
A null row from either left or right side produces the same row (null, null) to the downstream.
To Reproduce
No response
Expected behavior
No response
Additional context
No response