-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sure that the data types are supported in hashjoin before genera… #2702
Changes from all commits
cc56f63
06569b2
9d74954
b9b9434
ea06ab5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,6 +25,7 @@ use crate::logical_plan::{ | |
Values, Window, | ||
}; | ||
use crate::{Expr, ExprSchemable, LogicalPlan, LogicalPlanBuilder}; | ||
use arrow::datatypes::{DataType, TimeUnit}; | ||
use datafusion_common::{ | ||
Column, DFField, DFSchema, DFSchemaRef, DataFusionError, Result, ScalarValue, | ||
}; | ||
|
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> { | |
} | ||
} | ||
|
||
/// can this data type be used in hash join equal conditions?? | ||
/// data types here come from function 'equal_rows', if more data types are supported | ||
/// in equal_rows(hash join), add those data types here to generate join logical plan. | ||
pub fn can_hash(data_type: &DataType) -> bool { | ||
match data_type { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. decimal should be supported here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Data types here come from function equal_rows, decimal is not supported in equal_rows so that hash join currently does not support joining on columns of decimal data type. That's why decimal is not here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it may help to add a comment here (or in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'll add this comment. |
||
DataType::Null => true, | ||
DataType::Boolean => true, | ||
DataType::Int8 => true, | ||
DataType::Int16 => true, | ||
DataType::Int32 => true, | ||
DataType::Int64 => true, | ||
DataType::UInt8 => true, | ||
DataType::UInt16 => true, | ||
DataType::UInt32 => true, | ||
DataType::UInt64 => true, | ||
DataType::Float32 => true, | ||
DataType::Float64 => true, | ||
DataType::Timestamp(time_unit, None) => match time_unit { | ||
TimeUnit::Second => true, | ||
TimeUnit::Millisecond => true, | ||
TimeUnit::Microsecond => true, | ||
TimeUnit::Nanosecond => true, | ||
}, | ||
DataType::Utf8 => true, | ||
DataType::LargeUtf8 => true, | ||
_ => false, | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use super::*; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
From the issue description #2145 (comment) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow
cast
kernels are quite efficient for things likeDate32
->Int32
(no copies) as the representations are the same@pjmore what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
And this pr only wants to make hash unsupported join running in cross join instead of error/panic, we can support more data types in hash join continuously.