[Proposal] Decouple logical from physical types #11513

Open
notfilippo opened this issue Jul 17, 2024 · 62 comments

Comments

@notfilippo

notfilippo commented Jul 17, 2024

Abstract

Logical types are an abstract representation of data that emphasises structure without regard for physical implementation, while physical types are tangible representations that describe how data will be stored, accessed, and retrieved.

Currently the type model in DataFusion, in both logical and physical plans, relies purely on arrow's DataType enum. While this choice makes sense for its physical execution engine (DataTypes map 1:1 with the physical array representation, defining implicit rules for storage, access, and retrieval), it has flaws when dealing with logical plans. This is because some logically equivalent array types in the Arrow format have very different DataTypes: for example, a logical string array can be represented by Arrow arrays of DataType Utf8, Dictionary(Utf8), RunEndEncoded(Utf8), and StringView (without mentioning the different index types that can be specified for dictionaries and REE arrays).
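
To make the mismatch concrete, here is a minimal arrow-rs sketch (assuming the arrow crate is available; the values and encodings are illustrative only) building the same logical string values with three different physical encodings, each reporting a different DataType:

use arrow::array::{Array, DictionaryArray, StringArray, StringViewArray};
use arrow::datatypes::Int32Type;

fn main() {
    // The same logical values ["a", "b", "a"] in three different physical encodings.
    let flat = StringArray::from(vec!["a", "b", "a"]);
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let view = StringViewArray::from_iter_values(["a", "b", "a"]);

    // Each array reports a different DataType even though the values are logically equal.
    assert_ne!(flat.data_type(), dict.data_type()); // Utf8 vs Dictionary(Int32, Utf8)
    assert_ne!(flat.data_type(), view.data_type()); // Utf8 vs Utf8View
}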

This proposal evaluates possible solutions for decoupling the notion of types in DataFusion's logical plans from Arrow DataTypes, evaluating their impact on DataFusion itself and on downstream crates.

Goals

  • Introduce a subset of DataTypes to be used as logical types
  • Define the boundaries of logical types and physical types in the planning and execution path
  • Define a solution to keep track of a column’s logical and physical type
  • Introduce a new logical ScalarValue
  • Define how to specify materialisation options for physical types
  • Evaluate the impact on downstream crates

Proposal

Defining a logical type

To define the list of logical types we must first take a look at the physical representation of the engine: the Arrow columnar format. DataTypes are the physical types of the DataFusion engine and they define the storage and access patterns for buffers in the Arrow format.

Looking at the list of possible DataTypes it's clear that while some map 1:1 with their logical representation, others also specify information about the encoding (e.g. Large*, FixedSize*, Dictionary, RunEndEncoded...). The latter must be consolidated into what they represent, discarding the encoding information; in general, types that can store different ranges of values should be different logical types. (ref)

What follows is a list of DataTypes and how they would map to their respective logical types following the rules above:

| DataType (aka physical type) | Logical type | Note |
|---|---|---|
| Null | Null | |
| Boolean | Boolean | |
| Int8 | Int8 | |
| Int16 | Int16 | |
| Int32 | Int32 | |
| Int64 | Int64 | |
| UInt8 | UInt8 | |
| UInt16 | UInt16 | |
| UInt32 | UInt32 | |
| UInt64 | UInt64 | |
| Float16 | Float16 | |
| Float32 | Float32 | |
| Float64 | Float64 | |
| Timestamp(unit, tz) | Timestamp(unit, tz) | |
| Date32 | Date | |
| Date64 | Date | Date64 doesn't actually provide more precision. (docs) |
| Time32(unit) | Time32(unit) | |
| Time64(unit) | Time64(unit) | |
| Duration(unit) | Duration(unit) | |
| Interval(unit) | Interval(unit) | |
| Binary | Binary | |
| FixedSizeBinary(size) | Binary | |
| LargeBinary | Binary | |
| BinaryView | Binary | |
| Utf8 | Utf8 | |
| LargeUtf8 | Utf8 | |
| Utf8View | Utf8 | |
| List(field) | List(field) | |
| ListView(field) | List(field) | |
| FixedSizeList(field, size) | List(field) | |
| LargeList(field) | List(field) | |
| LargeListView(field) | List(field) | |
| Struct(fields) | Struct(fields) | |
| Union(fields, mode) | Union(fields) | |
| Dictionary(index_type, data_type) | underlying data_type, converted to logical type | |
| Decimal128(precision, scale) | Decimal128(precision, scale) | |
| Decimal256(precision, scale) | Decimal256(precision, scale) | |
| Map(fields, sorted) | Map(fields, sorted) | |
| RunEndEncoded(run_ends_type, data_type) | underlying data_type, converted to logical type | |

User defined types

User defined physical types

The Arrow columnar format provides guidelines to define Extension types through the composition of native DataTypes and custom metadata in fields. Since this proposal already includes a mapping from DataType to logical type we could extend it to support user defined types (through extension types) which would map to a known logical type.

For example an extension type with the DataType of List(UInt8) and a custom metadata field {'ARROW:extension:name': 'myorg.mystring'} could have a logical type of Utf8.
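
As a hedged sketch of this idea (the myorg.mystring extension name and the helper function are hypothetical, used only for illustration), the logical type could be derived from the field's extension metadata before falling back to the plain DataType mapping:

use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

// Hypothetical helper: if the field carries a known extension name, map it to a
// logical type name; otherwise the DataType mapping from the table above applies.
fn extension_logical_type(field: &Field) -> Option<&'static str> {
    match field.metadata().get("ARROW:extension:name").map(String::as_str) {
        Some("myorg.mystring") => Some("Utf8"), // user defined physical type, known logical type
        _ => None, // fall back to the plain DataType -> logical type mapping
    }
}

fn main() {
    let metadata = HashMap::from([(
        "ARROW:extension:name".to_string(),
        "myorg.mystring".to_string(),
    )]);
    let field = Field::new("s", DataType::new_list(DataType::UInt8, false), false)
        .with_metadata(metadata);
    assert_eq!(extension_logical_type(&field), Some("Utf8"));
}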

User defined logical types

Arrow extension types can also be used to extend the list of supported logical types. An additional logical type called Extension could be introduced. This extension type would contain a structure detailing its logical type and the extension type metadata.

Boundaries of logical and physical types

In plans and expressions

As the prefix suggests, logical types should be used exclusively in logical plans (LogicalPlan and Expr) while physical types should be used exclusively in physical plans (ExecutionPlan and PhysicalExpr). This would enable logical plans to be purely logical, without worrying about underlying encodings.

Expr in logical plans would need to represent their resulting value as logical types through the trait method ExprSchemable::get_type, which would need to return a logical type instead.

In functions

ScalarUDF, WindowUDF, and AggregateUDF all define their Signatures through the use of DataTypes. Function arguments are currently validated against signatures through type coercion during logical planning. With logical types, Signatures would be expressed without the need to specify the underlying encoding. This would simplify the type coercion rules, removing the need to traverse dictionaries and handle different containers, and focusing instead on explicit logical rules (e.g. all logical types can be coerced to Utf8).
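
As a rough illustration of how much smaller the coercion surface becomes (the reduced LogicalType enum and the coercion rule below are assumptions, not existing DataFusion APIs), the rules can be written purely against logical types, with no dictionary or container traversal:

// Hypothetical, reduced logical type enum for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Boolean,
    Int64,
    Utf8,
    // ...
}

// Whether an argument of type `from` can be coerced to `to` during logical planning.
// There is no Dictionary / LargeUtf8 / Utf8View case to handle: those are physical concerns.
fn coercible(from: &LogicalType, to: &LogicalType) -> bool {
    match (from, to) {
        _ if from == to => true,
        // Example rule from the text: every logical type can be coerced to Utf8.
        (_, LogicalType::Utf8) => true,
        _ => false,
    }
}

fn main() {
    assert!(coercible(&LogicalType::Int64, &LogicalType::Utf8));
    assert!(!coercible(&LogicalType::Utf8, &LogicalType::Boolean));
}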

During execution the function receives a slice of ColumnarValue that is guaranteed to match the signature. Being strictly a physical operation, the function will have to deal with physical types. The ColumnarValue enum could be extended so that functions could choose to provide their own optimised implementation for a subset of physical types and then fall back to a generic implementation that materialises the argument to a known physical type. This would potentially allow native functions to support user defined physical types that map to known logical types.
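
A hedged sketch of what such a fallback can look like with today's arrow-rs APIs (the function itself and the choice of Utf8 as the fallback materialisation are illustrative only):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Specialised path for Utf8; a generic fallback materialises any other string-like
// encoding (Dictionary, LargeUtf8, Utf8View, ...) to Utf8 before operating on it.
fn string_lengths(input: &ArrayRef) -> Result<Vec<Option<usize>>, ArrowError> {
    let flat: ArrayRef = match input.data_type() {
        DataType::Utf8 => Arc::clone(input),         // fast path: no conversion
        _ => cast(input.as_ref(), &DataType::Utf8)?, // generic fallback
    };
    let strings = flat
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("materialised to Utf8 above");
    Ok(strings.iter().map(|v| v.map(str::len)).collect())
}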

In substrait

The datafusion_substrait crate provides helper functions to enable interoperability between substrait plans and DataFusion's plans. While some effort has been made to support converting from / to DataTypes via type_variation_reference (example here), dictionaries are not supported as either literal types or cast types, leading to potential errors when trying to encode a valid LogicalPlan into a substrait plan. The usage of logical types would enable a more seamless transition between DataFusion's native logical plan and substrait.

Keeping track of the physical type

While logical types simplify the list of possible types that can be handled during logical planning, the relation to their underlying physical representation needs to be accounted for when transforming the logical plan into nested ExecutionPlan and PhysicalExpr, which will dictate how the query will execute.

This proposal introduces a new trait that represents the link between a logical type and its underlying physical representation:

pub enum LogicalType {/*...*/}

pub type TypeRelationRef = Arc<dyn TypeRelation + Send + Sync>;

pub trait TypeRelation: std::fmt::Debug {
    fn logical(&self) -> &LogicalType;
    fn physical(&self) -> &DataType;
    // ...
}

#[derive(Clone, Debug)]
pub struct NativeType {
    logical: LogicalType,
    physical: DataType,
}

impl TypeRelation for NativeType {/*...*/}
impl From<DataType> for NativeType {/*...*/}
impl From<DataType> for LogicalType {/*...*/}


#[derive(Clone, Debug)]
pub struct LogicalPhysicalType(TypeRelationRef);

impl TypeRelation for LogicalPhysicalType {/*...*/}

While NativeType would be primarily used for standard DataTypes and their logical relation, TypeRelation is defined to provide support for user defined physical types.

What follows is an exploration of the areas in which LogicalPhysicalType would need to be introduced:

A new type of Schema

To support the creation of LogicalPhysicalType a new schema must be introduced, which can be consumed either as a logical schema or used to access the underlying physical representation. Currently DFSchema is used throughout DataFusion as a thin wrapper for Arrow's native Schema in order to qualify fields originating from potentially different tables. This proposal suggests decoupling DFSchema from its underlying Schema and instead adopting a new Schema-compatible structure (LogicalPhysicalSchema) with DataTypes replaced by LogicalPhysicalType. This would also mean the introduction of a new Field-compatible structure (LogicalPhysicalField) which supports LogicalPhysicalType instead of Arrow's native Field DataType.

DFSchema would be used by DataFusion's planning and execution engine to derive either logical or physical type information of each field. It should retain the current interoperability with Schema (and additionally the new LogicalPhysicalSchema) allowing easy Into & From conversion.

Type sources

Types in plans are sourced from Arrow's native Schema returned by implementations of TableSource / TableProvider, variable DataTypes returned by VarProvider, and ScalarValue. To allow the definition of custom LogicalPhysicalTypes these type sources should be edited to return LogicalPhysicalSchema / LogicalPhysicalType.

Tables

For tables a non-breaking way of editing the trait to support LogicalPhysicalSchema could be:

  1. Declare a new trait method logical_physical_schema() -> LogicalPhysicalSchema, whose default implementation calls schema() and converts it to LogicalPhysicalSchema without introducing any custom LogicalPhysicalType. Implementers are free to override this method and add custom LogicalPhysicalTypes (see the sketch after this list).
  2. Declare the existing schema() method to return impl Into<LogicalPhysicalSchema>.
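
A minimal sketch of option 1 (the LogicalPhysicalSchema wrapper and the trait shown here are hypothetical stand-ins for the real TableSource / TableProvider traits):

use std::sync::Arc;
use arrow::datatypes::{Schema, SchemaRef};

// Hypothetical schema wrapper from this proposal, reduced to a placeholder.
struct LogicalPhysicalSchema(SchemaRef);

impl From<SchemaRef> for LogicalPhysicalSchema {
    fn from(schema: SchemaRef) -> Self {
        // Every DataType would be wrapped in its default NativeType relation, no custom types.
        LogicalPhysicalSchema(schema)
    }
}

// Stand-in for TableSource / TableProvider.
trait Table {
    fn schema(&self) -> SchemaRef;

    // New method with a default implementation: existing implementers keep compiling,
    // while providers with custom LogicalPhysicalTypes can override it.
    fn logical_physical_schema(&self) -> LogicalPhysicalSchema {
        self.schema().into()
    }
}

struct EmptyTable;
impl Table for EmptyTable {
    fn schema(&self) -> SchemaRef {
        Arc::new(Schema::empty())
    }
}

fn main() {
    let _schema = EmptyTable.logical_physical_schema(); // uses the default implementation
}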

Open question: Should we really introduce a new schema type or should we reuse DFSchema? The qualifiers for fields in DFSchema should not be specified by the Tables themselves.

VarProvider

VarProvider needs to be edited in order to return a LogicalPhysicalType when getting the type of the variable, while the actual variable can very well remain a ScalarValue.

ScalarValue

ScalarValue should be wrapped in order to have a way of retrieving both its logical and its underlying physical type. When reasoning about logical plans it should be treated as its logical type while its physical properties should be accessed exclusively by the physical plan.

Open question: Should ScalarValue split into a LogicalScalarValue and a PhysicalScalarValue? Or should it just provide generic information during logical planning (e.g. its logical type, is_null(), is_zero()) without accessing the underlying physical representation?

Physical execution

For physical planning and execution, much like the invocation of UDFs, ExecutionPlan and PhysicalExpr must also be granted access to the LogicalPhysicalType in order to be able to perform optimised execution for a subset of supported physical types and fall back to a generic implementation that materialises other types to a known physical type. This can be achieved by substituting the use of DataType and Schema with, respectively, LogicalPhysicalType and LogicalPhysicalSchema.

Impact on downstream dependencies

Care must be put in place not to introduce breaking changes for downstream crates and dependencies that build on top of DataFusion.

The most impactful changes introduced by this proposal are the LogicalPhysicalType, LogicalPhysicalSchema and LogicalPhysicalField types. These structures would replace most of the mentions of DataType, Schema and Field in the DataFusion codebase. Type sources (TableProvider / TableSource, VarProvider, and ScalarValue) and Logical / ExecutionPlan nodes would be greatly affected by this change. This effect can be mitigated by providing good Into & From implementations for the new types and by editing existing function arguments and return types to accept impl Into<LogicalPhysical*>, but it will still break a lot of things.

Case study: datafusion-comet

datafusion-comet is a high-performance accelerator for Apache Spark, built on top of DataFusion. A fork of the project containing changes from this proposal currently compiles without modifications. As more features in this proposal are implemented, namely UDF logical Signatures, some refactoring might be required (e.g. for CometScalarFunction and other functions defined in the codebase). Refer to the draft's TODOs to see what's missing.

Non-goal topics that might be interesting to dive into

While not the primary goals of this proposal, here are some interesting topics that could be explored in the future:

RecordBatches with same logical type but different physical types

Integrating LogicalPhysicalSchema into DataFusion's RecordBatches, streamed from one ExecutionPlan to the other, could be an interesting approach to support the possibility of two record batches having logically equivalent schemas with different underlying physical types. This could be useful in situations where data, stored in multiple pages mapping 1:1 with RecordBatches, is encoded with different strategies based on the density of the rows and the cardinality of the values.
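
To make that concrete, a small sketch (using only arrow-rs types; the notion of the two schemas being "logically equivalent" is the hypothetical part) of two per-page schemas that differ physically but describe the same logical data:

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    let page_a = Schema::new(vec![Field::new("url", DataType::Utf8, true)]);
    let page_b = Schema::new(vec![Field::new(
        "url",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]);

    // Physically the schemas differ...
    assert_ne!(page_a, page_b);
    // ...but both columns are logically strings, so record batches from either page
    // could flow through the same logical plan without an up-front cast.
}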

Current work effort

The [Epic] issue tracking this effort can be found at #12622.

@alamb
Contributor

alamb commented Jul 19, 2024

Thank you @notfilippo -- I think this proposal is well thought out and makes a lot of sense to me.

If we were to implement it I think the benefits for DataFusion would be enormous

From my perspective, the use of Arrow types in logical planning in DataFusion (e.g. type coercion) has always been a little bit of an impedance mismatch. When there were just a few variants (e.g. String/LargeString and Dictionary) it was annoying but manageable.

As Arrow evolves (e.g. to include REEArray, StringViewArray, etc) the mismatch is becoming more painful (e.g. #10220 is an example)

Care must be put in place not to introduce breaking changes for downstream crates and dependencies that build on top of DataFusion.

I think breaking changes to the API are inevitable, but I think we can manage the pain through careful API thought and deprecation. More thoughts to follow

@alamb
Contributor

alamb commented Jul 19, 2024

Thoughts on the technical details

This would also mean the introduction of new Field-compatible structure (LogicalPhysicalField)

Since LogicalPlans already use DFSchema, maybe we could avoid introducing a LogicalSchema at all and instead update DFSchema to know how to report LogicalType 🤔

Open question: Should ScalarValue split into a LogicalScalarValue and a PhysicalScalarValue? Or should it just provide generic information during logical planning (e.g. its logical type, is_null(), is_zero()) without access the underlying physical representation?

Another possibility would be to migrate ScalarValue so it was entirely logical (the use case for a physical scalar value is not large). We could slowly remove the non-logical versions of ScalarValue (e.g. remove ScalarValue::LargeString and ScalarValue::Dictionary) -- this would give us some idea of where the boundary would be as well as, I think, improve usability (we have, for example, hit issues where ScalarValue::StringView was treated differently than ScalarValue::Utf8)

I think the biggest challenge of this project will be managing the transition from Arrow DataType to LogicalTypes. You have done a great job articulating the places that would need to be changed. I think we could manage the transition over time by, for example, deprecating (but leaving in place) UDF APIs expressed in terms of DataType

ColumnarValue enum could be extended so that functions could choose to provide their own optimised implementation for a subset of physical types and then fall back to a generic implementation that materialises the argument to known physical type.

Another possibility would be to make a function like ColumnarValue::into_one_of that ensured (via casting if necessary) the input was one of the types supported by the operator. For example

let input: &ColumnarValue = &args[0];

// get input as one of the named types, casting if necessary
let input = input.into_one_of(&[DataType::Utf8View, DataType::Utf8])?;

match input.data_type() {
  DataType::Utf8View => { /*specialized impl for StringViewArray */ },
  DataType::Utf8 =>     { /*specialized impl for StringArray */ },
  _ => unreachable!()
}  

RecordBatches with same logical type but different physical types

Another thing we could do is relax the requirement that the DataTypes match exactly between physical plans, and instead just require that the LogicalTypes match (in other words an ExecutionPlan could report it generates output like Utf8 but could actually produce output like Utf8View) 🤔

@notfilippo
Author

Since LogicalPlans already use DFSchema, maybe we could avoid introducing a LogicalSchema at all and instead update DFSchema to know how to report LogicalType 🤔

Initially, I planned to propose repurposing the DFSchema for this change. Still, I decided against it (at least for the first draft) because of this open question that I've noted above:

Open question: Should we really introduce a new schema type or should we reuse DFSchema? The qualifiers for fields in DFSchema should not be specified by the Tables themselves.

This issue originates from the fact that TableSource and TableProvider (the "native" sources of schemas) would have to return a DFSchema to include the LogicalPhysicalTypes. Adopting DFSchema at that level would mean that both TableSource and TableProvider could edit the qualifiers of the DFSchema, which doesn't seem right (even if the qualifiers would then be replaced or ignored).

@andygrove
Member

This proposal makes sense to me. Thanks for driving this @notfilippo.

@alamb
Contributor

alamb commented Jul 19, 2024

This issue originates from the fact that TableSource and TableProvider (the "native" sources of schemas) would have to return a DFSchema to include the LogicalPhysicalTypes

I was thinking that a DFSchema could still wrap an arrow Schema but translate to LogicalType

So like if a TableProvider said it returned a StringView or a Dictionary(Int32, String) the DFSchema would report LogicalType::Utf8 or something

I haven't looked at the code so there may be some reason this wouldn't work

@ozankabak
Contributor

ozankabak commented Jul 19, 2024

The main challenge I see is that this will be a very large project. The high-level premise of separating logical and physical types makes sense from a first-principle POV, but the cost/benefit analysis at this point is not very clear to me. The latter will probably depend on the implementation strategy and the exact architecture we adopt, so I think we should try out different ideas and approaches before committing to a strategy/architecture and accepting/rejecting/deferring the idea based on a possibly premature choice.

@ozankabak
Contributor

ozankabak commented Jul 19, 2024

Another possibility would be to migrate ScalarValue so it was entirely logical (the use case for a physical scalar value is not large).

AFAICT ScalarValue is used quite heavily in physical stuff like accumulators and other places. Many "state"-like information we track during the course of execution relies on ScalarValue. Maybe I'm misunderstanding what you meant here?

@alamb
Contributor

alamb commented Jul 19, 2024

AFAICT ScalarValue is used quite heavily in physical stuff like accumulators and other places. Many "state"-like information we track during the course of execution relies on ScalarValue. Maybe I'm misunderstanding what you meant here?

I was thinking that there is no fundamental difference between using ScalarValue::String and ScalarValue::LargeString and ScalarValue::Dictionary(..) to accumulate string values as they all eventually wrap String

Thus I was thinking we could simplify the physical implementations by not having different codepaths, and this would also give us some first-hand experience of how mapping Logical --> Physical types might look
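
For reference, a small example of the duplication being described (assuming the datafusion_common crate): both variants wrap the same Rust String and differ only in the physical type they report:

use datafusion_common::ScalarValue;

fn main() {
    // Same logical value, two "physical" ScalarValue variants.
    let a = ScalarValue::Utf8(Some("hello".to_string()));
    let b = ScalarValue::LargeUtf8(Some("hello".to_string()));

    // They report different DataTypes, so code handling scalars needs separate paths today.
    assert_ne!(a.data_type(), b.data_type()); // Utf8 vs LargeUtf8
}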

@wjones127
Member

This generally looks good. I agree that making ScalarValue represent a logical type would be a good starting point.

One small nit: I don't think I would lump together FixedSizeBinary with Binary and FixedSizeList with List. The fixed lengths often have semantics that should be considered. For example, we use FixedSizeList<f32> as a vector, where the list length is the vector dimension.

@jayzhan211
Contributor

jayzhan211 commented Jul 20, 2024

Is it possible to have the mapping of the relation between arrow's DataType and LogicalType without TypeRelation?

As long as there is only one LogicalType for each arrow::DataType, we can build up the mapping via From/Into conversion.

I was thining that a DFSchema could still wrap an arrow Schema but translate to LogicalType

We can then easily get the LogicalType from DataType inside Schema.

Is there any type mapping that can't be done without TypeRelation 🤔 ? In other words, is there any arrow::DataType that
can be mapped to more than one LogicalType?

@alamb
Contributor

alamb commented Jul 20, 2024

Is there any type mapping that can't be done without TypeRelation 🤔 ? In other words, is there any arrow::DataType that
can be mapped to more than one LogicalType?

I think DataType::Dictionary(..) would map to more than one LogicalType (it would map to whatever the dictionary values type was)
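
A short sketch of how such a From-style mapping can still stay single-valued by recursing into the dictionary's value type (the reduced LogicalType enum is an assumption, following the mapping table in the proposal):

use arrow::datatypes::DataType;

// Reduced hypothetical logical enum for illustration.
#[derive(Debug, PartialEq)]
enum LogicalType {
    Int64,
    Utf8,
    // ...
}

impl From<&DataType> for LogicalType {
    fn from(dt: &DataType) -> Self {
        match dt {
            DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => LogicalType::Utf8,
            DataType::Int64 => LogicalType::Int64,
            // A dictionary is logically its value type: the encoding is discarded.
            DataType::Dictionary(_keys, values) => LogicalType::from(values.as_ref()),
            _ => unimplemented!("remaining variants follow the mapping table in the proposal"),
        }
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(LogicalType::from(&dict), LogicalType::Utf8);
}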

@ozankabak
Contributor

ozankabak commented Jul 20, 2024

Thus I was thinking we could simplify the physical implementations by not having different codepaths and this would also give us some first hand experience in how mapping Logical --> Physical types might look like

This could be interesting to try -- both in terms of whether we can somehow simplify ScalarValue and also accumulate learnings to help towards this proposal.

It would also be a relatively contained draft, and if the result is satisfactory, can possibly get merged in even if the full proposal is for some reason not yet doable in our current state

@notfilippo
Author

It would also be a relatively contained draft, and if the result is satisfactory, can possibly get merged in even if the full proposal is for some reason not yet doable in our current state.

I agree with this approach. I'll dive deeper next week and report back my findings.

@jayzhan211
Contributor

jayzhan211 commented Jul 20, 2024

Is there any type mapping that can't be done without TypeRelation 🤔 ? In other words, is there any arrow::DataType that
can be mapped to more than one LogicalType?

I think DataType::Dictionary(..) would map to more than one LogicalType (it would map to whatever the dictionary values type was)

Why don't we have LogicalType::Dict(LogicalPhysicalFieldRef), similar to List, Struct, and Map? Dict is not really a nested type like those 3, but with LogicalType::Dict(LogicalPhysicalFieldRef) we could probably avoid replacing DFSchema with LogicalPhysicalSchema

or something like

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum LogicalType {
    Primitive(LogicalPrimitiveType),
    Date,
    Time32(TimeUnit),
    Time64(TimeUnit),
    Timestamp(TimeUnit, Option<Arc<str>>),
    Duration(TimeUnit),
    Interval(IntervalUnit),
    List(LogicalPhysicalFieldRef),
    Struct(LogicalPhysicalFields),
    Dict(LogicalPrimitiveType),
    Map(LogicalPhysicalFieldRef, bool),
    Decimal128(u8, i8),
    Decimal256(u8, i8),
    Union(LogicalUnionFields, UnionMode), // TODO: extension signatures?
    // UserDefinedType
}

pub enum LogicalPrimitiveType {
    Null,
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Boolean,
    Float16,
    Float32,
    Float64,
    Utf8,
    Binary, 
}

@alamb
Contributor

alamb commented Jul 21, 2024

we could probably avoid replacing DFSchema with LogicalPhysicalSchema

That is an interesting idea. I don't fully understand the implications, but if we can figure out how to avoid having to switch from DFSchema, I suspect it would make this project less impactful on downstream crates

@findepi
Member

findepi commented Jul 22, 2024

I like the idea of separating logical types from arrow types, but it would be great to understand the exact consequences.
DataFusion is both a SQL execution engine (so it has a SQL frontend) and also a query execution library (so it has rich programmable API).

The SQL frontend should have "very logical types". For example, we don't need Time32(unit) and Time64(unit). The SQL user isn't interested in how different precisions are implemented. Having Time(unit) should be enough (and different units could use different physical representations under the hood).
Also, Timestamp(unit, tz) may want to be revisited from a SQL perspective. The SQL spec defines two separate types, "timestamp" (without zone) and "timestamp with time zone" (point-in-time semantics + zone information). It might be possible (with some limitations) to have both use the same physical representation (like arrow Timestamp(unit, tz)), but logically they want to be distinct.

Then, from a DF-as-a-library perspective, physical representation becomes important.
To drive the design for this part, we need to understand how DF-as-a-library is used. What are its necessary contractual obligations and what can be an implementation detail? At this level we need to be even more extensible, since adding more types into the type system feels very natural for the DF-as-a-library use-case. We also may need to be more physical, like perhaps discerning between time32 and time64.
It would be great if, as proposed here, this layer still didn't have to deal with various equivalent ways of encoding semantically equivalent data. I.e. the DF-as-a-library use-case still doesn't want to discern between Utf8, Dictionary(Utf8), and RunEndEncoded(Utf8). This shouldn't be in the type system at all. A function operating on string values should work with any valid representation of string values (either intrinsically, or with an adapter).

ColumnarValue enum could be extended so that functions could choose to provide their own optimised implementation for a subset of physical types and then fall back to a generic implementation that materialises the argument to known physical type. This would potentially allow native functions to support user defined physical types that map to known logical types.

I am concerned that deriving support for a logical type from the support for a physical type is actually a slippery slope.

Let's consider an example. Let's assume I have a my_duration({micros | nanos}) type which uses a 64-bit integer physical type for representation. my_duration values are always stored with nano precision, and the unit defines how aligned they have to be. I.e. 1s is always stored as the integer value 1_000_000_000 for both allowed precisions. Every value of my_duration(micros) is represented as an i64 number divisible by 1000.

Now, I have an add_one() function that can take 64-bit integer values and add +1 to them. The +1 operation is a perfectly valid operation for i64 -- it's valid for the SQL long/bigint type. It's also valid for my_duration(nanos), but it's not valid for my_duration(micros), since it produces an unaligned value (not divisible by 1000).

@doki23
Contributor

doki23 commented Jul 23, 2024

I found a previous discussion and reference it here: #7421

@doki23
Contributor

doki23 commented Jul 23, 2024

I have a question: how do users specify the underlying physical types? FYI, Clickhouse exposes physical types to users like this.

@jayzhan211
Contributor

I have a question: how do users specify the underlying physical types? FYI, Clickhouse exposes physical types to users like this.

Physical Type in Datafusion is Arrow Type

@doki23
Contributor

doki23 commented Jul 23, 2024

I have a question: how do users specify the underlying physical types? FYI, Clickhouse exposes physical types to users like this.

Physical Type in Datafusion is Arrow Type

Apologies for the unclear description. I meant to ask, if we opt for logicalType, how do users then specify the physical types? This is important because in certain scenarios, only the users can determine the most suitable data types.

@notfilippo
Author

Now, i have add_one() function that can take 64-bit integer values and add +1 to them. The +1 operation is perfectly valid operation for i64 -- it's valid for sql long/bigint type. It's valid also for my_duration(nanos), but it's not valid for my_duration(micros), since it produces unaligned value (not divisible by 1000).

@findepi -- If I understand correctly, your example is about the logical type my_duration({micro | nano}) with the underlying physical type int64. In this case the type my_duration(micro) is logically different from my_duration(nano) (and both are logically different from an int64).

Functions that have specialised implementations for my_duration, specified in their signature, would have to handle either case differently. Instead, functions such as add_one(int64) would have a signature accepting only logical int64s and not my_duration.

Hypothetically, if add_one was to be edited to support my_duration (by accepting any numeric type) it would either have to provide a specific implementation for the my_duration logical type or it would have to materialise the array to a known numeric type like int64. The materialisation wouldn't necessarily match the underlying physical type, and for this example I think it would make sense to materialise it as an int64 array with the number of nanoseconds (physical type) multiplied by the unit of the duration.

@notfilippo
Author

if we opt for logicalType, how do users then specify the physical types?

@doki23 -- Through the use of a custom implementation of the TypeRelation trait, which then can be used inside a LogicalPhysicalSchema. All type sources (see proposal) would need to be edited to support returning this new type of schema which will then be embedded into a DFSchema and used to carry both the physical and the logical type knowledge throughout logical and physical planning.

Example
#[derive(Debug)]
struct MyString {
    logical: LogicalType,
    physical: DataType,
}

impl Default for MyString {
    fn default() -> Self {
        Self {
            logical: LogicalType::Utf8,
            physical: DataType::new_list(DataType::UInt8, false),
        }
    }
}

impl TypeRelation for MyString {
    fn logical(&self) -> &LogicalType {
        &self.logical
    }
    fn physical(&self) -> &DataType {
        &self.physical
    }
    // ...
}

@notfilippo
Author

notfilippo commented Jul 23, 2024

One small nit: I don't think I would lump together FixedSizeBinary with Binary and FixedSizeList with List. The fixed lengths often have semantics that should be considered. For example, we use FixedSizeList as a vector, where the length list is the vector dimension.

@wjones127 -- Noted! I was also hesitant about folding the Fixed* variants into the base ones and your explanation makes sense to me. While I agree that having a fixed length constraint for a list of logical types makes sense, I am not convinced about FixedSizeBinaries. What would be the use case? Do you have some example in mind?

@wjones127
Member

wjones127 commented Jul 23, 2024

While I agree that having fixed length constraint for a list of logical types makes sense I am not convinced about FixedSizeBinaries. What would be the use case? Do you have some example in mind?

Example uses of fixed size binary are representing arrays of data types not supported in Arrow, such as bf16 or i128 (UUID). A value with a different number of bytes would be considered invalid.

@findepi
Member

findepi commented Sep 24, 2024

Keeping track of the physical type

While logical types simplify the list of possible types that can be handled during logical planning, the relation to their underlying physical representation needs to be accounted for when transforming the logical plan into nested ExecutionPlan and PhysicalExpr, which will dictate how the query will execute.

Sorry if this is a dumb question.
Do we actually need to track physical representation at planning time?

Let's look at an example.
For example Utf8, LargeUtf8, Dictionary of Utf8, Utf8View, REE/RLE of Utf8 -- these are different physical representations that correspond to some string logical type (or varchar / varchar(n); doesn't matter).
Knowing the data will be dictionary-encoded at (physical) planning time is useful (knowing more is always useful), but it actually prevents run-time adaptiveness. One parquet file can have data flat and another dict-encoded, and the Parquet writer can be adaptive, so at planning time there is no way to know the physical representation "for sure". There is only a way to "propose a physical representation", but then data needs to be made to fit into that.

At what cost and for what benefit? Cost is of two kinds

  • the system is more complex than necessary. It needs to deal with physical types in places where they do not matter. For example, what does it mean to arrow_cast(literal, dict_type)? What does it mean to coerce a literal to REE/RLE? This is what brings the requirement for ScalarValue to carry any physical type, even though for a constant scalar value this is not meaningful
  • the data may get unnecessarily converted to a different representation imposed by the plan

Benefit?

  • the system could in theory be faster if we knew the physical representation and we wouldn't need to type check on it. But our functions do this anyway (because they are polymorphic). Since function resolution is logical-type based, we won't get rid of runtime type checks, unless we let logical functions resolve to physical functions (sounds very complicated, let's not do this).
    Not sure if this benefit exists in practice.
  • others?

Let me draw comparison, if I may...
Trino's internal design is IMO very well thought through. Trino has the same concepts of physical representations (flat, dict, rle), but they don't percolate into the type system. The type system is "logical types" (or just "types"), but more importantly the functions are authored in terms of those same types. The function/expression implementor doesn't need to bother with dict or rle inputs, because they are taken care of once and for all by the projection/filter operator. What do they miss by not dealing with "physical types" at planning time?

Summing up, I propose that

  • we introduce the concept of "data fusion type". This is the "logical type" @notfilippo proposed.
  • we use this "data fusion type" for logical plans
  • we use this "data fusion type" for physical plans as well
    • this leaves existing "physical plans" to be a runtime concept
  • we use this "data fusion type" for function authoring, scalar/constant/literal values in the plan

cc @notfilippo @alamb @comphead @andygrove @jayzhan211

@notfilippo
Author

Do we actually need to track physical representation at planning time?

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and/or physical execution phase.

Currently some limitations in datafusion's abstraction design don't allow a clear-cut distinction between the types (this I think is clear when you look at how you can call LogicalPlan::schema). Knowing this, some care needs to be taken in order to slowly introduce the distinction, which mainly comes in the form of storing the DataType alongside logical values.

The objective would be to eventually remove that knowledge and make it available directly through the data source (i.e. the RecordBatch) to support the run-time adaptiveness you are mentioning above and that I've also mentioned in the proposal:

RecordBatches with same logical type but different physical types

Integrating LogicalPhysicalSchema into DataFusion's RecordBatches, streamed from one ExecutionPlan to the other, could be an interesting approach to support the possibility of two record batches having logically equivalent schemas with different underlying physical types. This could be useful in situations where data, stored in multiple pages mapping 1:1 with RecordBatches, is encoded with different strategies based on the density of the rows and the cardinality of the values.

@findepi
Member

findepi commented Sep 24, 2024

distinction between the types (this I think is clear when you look on how you can call LogicalPlan::schema). Knowing this, some care needs to be taken in order to slowly introduce the distinction, which mainly comes in the form of storing the DataType alongside logical values.

Well, if LogicalPlan is expressed in terms of logical types (what I'd like to be the datafusion types), then LogicalPlan::schema should return a schema expressed in those types as well. There would be no need to track (arrow's) DataType.

The objective would be to eventually remove that knowledge and making it available directly through the data source (i.e. the RecordBatch) to support the run-time adaptiveness you are mentioning above and that've also mention in the proposal:

Awesome. Let's graduate it from "proposal" to "project plan".

@notfilippo
Author

Well, if LogicalPlan is expressed in terms of logical types (what I'd like to be the datafusion types), then LogicalPlan::schema should return a schema expressed in those types as well. There would be no need to track (arrow's) DataType.

Yes, that's the plan! 👍

Awesome. Let's graduate it from "proposal" to "project plan".

I really like @alamb's idea of creating an epic and in #12536 (comment) I've suggested some starting issues.

@andygrove
Member

This sounds good to me.

We are running into the RecordBatches with same logical type but different physical types issue in DataFusion Comet. For a single table, a column may be dictionary-encoded in some Parquet files, and not in others, so we are forced to cast them all to the same type, which introduces unnecessary dictionary encoding (or decoding) overhead.

@findepi
Member

findepi commented Sep 24, 2024

@andygrove yeah, that's exactly the problem I anticipated based on DF design and Trino experience. Happy that you have real world validation that this is a problem for DF Comet too, thanks!

@jayzhan211
Contributor

jayzhan211 commented Sep 24, 2024

I don't get it, how do you get the actual value if there is no physical type (DataType)?
Given ScalarValue::Utf8(String), how do you convert this to StringArray or StringViewArray given no hint about what they are exactly?

In my mind, at some point we need to build the arrow's Array. How do we build it if we don't know what type is it?

@alamb
Contributor

alamb commented Sep 24, 2024

The idea of dynamically switching to different physical encodings is neat, but I think it presumes all operators / functions can handle any of the different encodings (RLE, Dict, String, etc), which is not the case today

Benefit?

I think another benefit of the current type system is that the implementations of functions (and operators, etc) declare what types of arrays (physical encodings) they have specializations for, and then the optimizers and analyzers ensure that the types line up and try to minimize conversions at runtime.

For example, in a query like this that computes several operations on a column

SELECT COUNT(*), substr(url, 1, 4), regexp_match(url, '%google.com%')
FROM ...
GROUP BY url

If we change the hash group by operator to return StringViewArray for certain grouping operations when the input is a StringArray, and the column is used twice -- once in substr and once in regexp_match, which don't have specialized code paths for StringViewArray -- we have to be careful the array will not be cast to StringArray twice

In my mind, at some point we need to build the arrow's Array. How do we build it if we don't know what type is it?

I think @jayzhan211's is the key question in my mind. At some point you need to get the actual array and have code that operates on exactly that type. Figuring out at what point this conversion happens is important.

@ozankabak
Contributor

I think another benefit of the current type system is that the implementations of functions (and operators, etc) declare what types of arrays (physical encodings) they have specializations for, and then the optimizers and analyzers ensure that the types line up and try to minimize conversions at runtime.

IMO this is quite important and we should be careful to preserve this kind of optimizations in any new design. Admittedly, we don't do a great job in removing/avoiding runtime checks in various other use cases within today's architecture, but let's not lose that capability completely -- I remember some cases where this kind of specializations yielded very significant performance improvements.

@thinkharderdev
Contributor

I think another benefit of the current type system is that the implementations of functions (and operators, etc) declare what types of arrays (physical encodings) they have specializations for, and then the optimizers and analyzers ensure that the types line up and try to minimize conversions at runtime.

IMO this is quite important and we should be careful to preserve this kind of optimizations in any new design. Admittedly, we don't do a great job in removing/avoiding runtime checks in various other use cases within today's architecture, but let's not lose that capability completely -- I remember some cases where this kind of specializations yielded very significant performance improvements.

I'm a bit confused as to what the goal of this work is if we still need to track the physical type during planning?

@notfilippo
Author

notfilippo commented Sep 25, 2024

I'm a bit confused as to what the goal of this work is if we still need to track the physical type during planning?

I would like to stress that the intent of this proposal remains to decouple logical types from physical types in order to achieve the following goal:

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

LogicalPlans will use LogicalType while PhysicalPlans will use DataType.

While the goal seems to have achieved wide consensus, the path to reach it has not been finalised. Through some experiments (#11160 -> #11978 -> #12536) we've been trying to narrow down on a possible approach to commit to in order to make progress.

As this proposal aims at changing the tires on a moving car, there is and will be a lot of discussion in order to complete the migration safely and without breaking too much functionality for end users. This will certainly result in an intermediate state where the existing behaviour is supported by temporarily tracking DataType alongside some objects which will only have a logical type, until the type can be extracted from the context itself.


Re: @findepi's proposal,

Summing up, I propose that

  • we introduce the concept of "data fusion type". This is the "logical type" @notfilippo proposed.
  • we use this "data fusion type" for logical plans
  • we use this "data fusion type" for physical plans as well
    • this leaves existing "physical plans" to be a runtime concept
  • we use this "data fusion type" for function authoring, scalar/constant/literal values in the plan

This proposal is compatible with (and actually depends on) decoupling logical from physical types, but I think it's a further step ahead to consider once we at least clear the initial steps needed to make LogicalTypes happen.

Additionally I think it should be filed as a separate, but related, ticket. I understand that it heavily depends on and influences the choices of this proposal, but judging by the comments above I think there needs to be a separate discussion in order to validate the idea on its own.


I think another benefit of the current type system is that the implementations of functions (and operators, etc) declare what types of arrays (physical encodings) they have specializations for, and then the optimizers and analyzers ensure that the types line up and try to minimize conversions at runtime.

Not sure where we discussed this already but I would love to support both logical types and physical types when declaring function signatures in order to let the user have full control over the arguments: as little control as simply specifying a LogicalType + cast or as much control as precise function signatures for specific DataTypes.

Instead I was planning on keeping return_type and invoke as is, potentially adding a return_logical_type helper if needed.

@findepi
Member

findepi commented Sep 27, 2024

This proposal is compatible with (and actually depends on) the decoupling logical from physical types but I think it's a further step ahead to consider once we at least clear the initial steps to take in order to make LogicalTypes happen.

Additionally I think it should be filed as a separate, but related, ticket. I understand that it heavily depends and influences the choices of this proposal but judging by the comments above I think there needs to be a separate discussion in order to validate the idea on its own.

I don't mind creating a new ticket if needed. I created #12644 to track Extension Types which -- although already mentioned in this issue -- feel like an even higher goal to achieve.

I am not convinced, however, that we should separate these discussions: "use new DataFusion types in logical plans" and "use new DataFusion types in physical plans as well". They feel very related and both impact the type system design. I don't want to paint ourselves into a corner just because we narrowed the horizon.

Using logical types for all of the planning (including physical planning) is very important for performance -- see @andygrove's Comet comment #11513 (comment), but obviously this is not a Comet-only thing. As with any performance improvement, it may be impossible not to regress some specialized cases along the way. There is no point in breaking something for the sake of breaking. Just that if we allow only incremental improvements, we achieve only a local optimum.

@jayzhan211
Contributor

jayzhan211 commented Sep 28, 2024

@findepi

The core issue seems to be mapping between Arrow's DataType and DataFusion's logical type. While we could technically bypass logical types and handle everything using Arrow's DataType, it's beneficial to have a simplified version of DataType for easier processing. This is where the logical type comes in—it serves as a simplified version of DataType to enumerate all semantically equivalent types. In this case, the logical type should be a more constrained or reduced representation than Arrow's DataType.

We want a one-way mapping from Arrow's DataType to user-defined or extension types, where the logical type (DataFusion's native type) acts as the single source of truth within DataFusion, much like native types in Rust.

To achieve this, we need two traits for type mapping:

  • One for user-defined types.
  • One for Arrow's DataType.

If these types map to the same logical type, it implies that we can correctly decode the value as the expected type. Otherwise, it signals a type mismatch.

#[derive(Clone)]
pub enum LogicalType {
    Int32,
    String,
    Float32,
    Float64,
    FixedSizeList(Box<LogicalType>, usize),
    // and more
    Extension(Arc<dyn ExtensionType>),
}

pub trait ExtensionType {
    fn logical_type(&self) -> LogicalType;
}

pub struct JsonType {}

impl ExtensionType for JsonType {
    fn logical_type(&self) -> LogicalType {
        LogicalType::String
    }
}

pub struct GeoType {
    n_dim: usize
}

impl ExtensionType for GeoType {
    fn logical_type(&self) -> LogicalType {
        LogicalType::FixedSizeList(Box::new(LogicalType::Float64), self.n_dim)
    }
}

pub trait PhysicalType {
    fn logical_type(&self) -> LogicalType;
}

impl PhysicalType for DataType {
    fn logical_type(&self) -> LogicalType {
        match self {
            DataType::Int32 => LogicalType::Int32,
            DataType::FixedSizeList(f, n) => {
                LogicalType::FixedSizeList(Box::new(f.data_type().logical_type()), *n as usize)
            }
            _ => todo!("")
        }
    }
}

I’d love to hear if this design is sound, or if there are any potential pitfalls in how I’ve approached type mapping.

@findepi
Member

findepi commented Sep 28, 2024

FYI i touched upon the topic of types on DataFusion meetup in Belgrade yesterday.
The slides are here if anyone is interested: https://docs.google.com/presentation/d/1VW_JCGbN22lrGUOMRvUXGpAmlJopbG02hn_SDYJouiY . It was an attempt to summarize why we need both: simpler types (#11513), more types (#12644), and simple function "SDK" (#12635).
The document has comments disabled to avoid diverging the discussion from the github issue.

@alamb
Contributor

alamb commented Sep 28, 2024

I’d love to hear if this design is sound, or if there are any potential pitfalls in how I’ve approached type mapping.

One thing I found a bit confusing was that ExtensionType::logical_type in your example returns a LogicalType

pub trait ExtensionType {
    fn logical_type(&self) -> LogicalType;
}

But one of the LogicalType variants is itself an ExtensionType

#[derive(Clone)]
pub enum LogicalType {
    //...
    Extension(Arc<dyn ExtensionType>),
}

It seems like the idea is that the logical type would report its underlying representation to DataFusion.

I wonder when we would ever need to know the "logical_type" of an extension type. I think it would never make sense to DataFusion to treat a logical extension type as its underlying representation

Just because I know JSON columns are represented as Strings internally, I think they should never be treated as Strings unless the user explicitly requests such a conversion and the ExtensionType defines how to cast to a String

@jayzhan211
Contributor

jayzhan211 commented Sep 29, 2024

Just because I know JSON columns are represented as Strings internally, I think they should never be treated as Strings unless the user explicitly requests such a conversion and the ExtensionType defines how to cast to a String

I agree with this, but we don't need to treat it as a string if we don't care about it. In the logical layer, we can differentiate Json and String by the enum itself, or maybe by name if we use a trait. The physical layer is where we need to understand the underlying representation. Then we can treat Json as String and get the StringArray we want.

I rewrote another version to show that we can differentiate Json and String but also treat them the same if we need to.

// Minimum set as Datafusion Native type
#[derive(Clone, PartialEq, Eq)]
pub enum DatafusionNativeType {
    Int32,
    UInt64,
    String,
    Float32,
    Float64,
    FixedSizeList(Box<DatafusionNativeType>, usize),
}

pub trait LogicalDataType {
    fn name(&self) -> &str;
    fn native_type(&self) -> DatafusionNativeType;
}

// This allows us to treat `DatafusionNativeType` as Trait `LogicalDataType`, 
// there might be better design but the idea is to get `DatafusionNativeType`
// for both UserDefinedType and BuiltinType
//
// Alternative design is like
// pub enum DatafusionType {
//   Builtin(DatafusionNativeType),
//   Extension(Arc<dyn LogicalDataType>)
// }
impl LogicalDataType for DatafusionNativeType {
    fn native_type(&self) -> DatafusionNativeType {
        match self {
            DatafusionNativeType::Int32 => DatafusionNativeType::Int32,
            _ => self.clone()
        }
    }

    fn name(&self) -> &str {
        match self {
            DatafusionNativeType::Int32 => "i32",
            DatafusionNativeType::Float32 => "f32",
            _ => todo!("")
        }
    }
}

fn is_numeric(logical_data_type: &Arc<dyn LogicalDataType>) -> bool {
    matches!(logical_data_type.native_type(), DatafusionNativeType::Int32 | DatafusionNativeType::UInt64) // and more
}

// function where we only care about the Logical type
fn logical_func(logical_type: Arc<dyn LogicalDataType>) {

    if is_numeric(&logical_type) {
        // process it as numeric
    }

    // For user-defined type, maybe there is another way to differentiate the type instead by name
    match logical_type.name() {
        "json" => { 
            // process json 
        },
        "geo" => {
            // process geo
        },
        _ => todo!("")
    }
}


// function where we care about the internal physical type so we can modify the Array.
fn physical_func(logical_type: Arc<dyn LogicalDataType>, array: ArrayRef, schema: Schema) -> Result<()>{
    let data_type_in_schema = schema.field(0).data_type();
    let actual_native_type = data_type_in_schema.logical_type();

    if logical_type.native_type() != actual_native_type {
        return internal_err!("logical type mismatches with the actual data type in schema & array")
    }

    // For Json type, we know the internal physical type is String, so we need to ensure the
    // Array is able to cast to StringArray variant, we can check the schema.
    match logical_type.native_type() {
        DatafusionNativeType::String => {
            match data_type_in_schema {
                DataType::Utf8 => {
                    let string_arr = array.as_string::<i32>();
                    Ok(())
                }
                DataType::Utf8View => {
                    let string_view_arr = array.as_string_view();
                    Ok(())
                }
                _ => todo!("")
            }
        }
        _ => todo!("")
    }
}

pub struct JsonType {}

impl LogicalDataType for JsonType {
    fn native_type(&self) -> DatafusionNativeType {
        DatafusionNativeType::String
    }
    fn name(&self) -> &str {
        "json"
    }
}

pub struct GeoType {
    n_dim: usize
}

impl LogicalDataType for GeoType {
    fn native_type(&self) -> DatafusionNativeType {
        DatafusionNativeType::FixedSizeList(Box::new(DatafusionNativeType::Float64), self.n_dim)
    }
    fn name(&self) -> &str {
        "geo"
    }
}

pub trait PhysicalType {
    fn logical_type(&self) -> DatafusionNativeType;
}

impl PhysicalType for DataType {
    fn logical_type(&self) -> DatafusionNativeType {
        match self {
            DataType::Int32 => DatafusionNativeType::Int32,
            DataType::FixedSizeList(f, n) => {
                DatafusionNativeType::FixedSizeList(Box::new(f.data_type().logical_type()), *n as usize)
            }
            _ => todo!("")
        }
    }
}

@alamb
Contributor

alamb commented Sep 29, 2024

The physical layer is where we need to understand the underlying representation. Then we can treat Json as String and get the StringArray we want.

Yes, I think it would make a lot of sense in the physical layer to simply pass along the actual arrow type (String in this case) as all the function calls / operators will have been resolved to do whatever conversion is needed (e.g. for comparisons)

@alamb
Contributor

alamb commented Sep 29, 2024

// function where we care about the internal physical type so we can modify the Array.

Another way to potentially handle this is that the function call gets resolved to a specific implementation that gets specific Array types (like today). Physical planning / the extension type would be responsible for figuring out which specific implementation to call, etc.

@findepi
Member

findepi commented Sep 30, 2024

We want a one-way mapping from Arrow's DataType to user-defined or extension types, where the logical type (DataFusion's native type) acts as the single source of truth within DataFusion, much like native types in Rust.

I agree with the need for DataFusion's native type to be the single source of truth within DataFusion, to be "the types" of expressions, columns, etc.

We clearly cannot have a reverse / back mapping from Arrow's DataType to the DF type. An Arrow DataType cannot say "it's JSON" or "it's geometry". We can encode this information in the Arrow field metadata, and that would be useful for clients reading data from DataFusion. However, it won't be sufficient for intermediate expressions, so DataFusion types need to be treated as DataFusion's responsibility, with Arrow types (DataType) serving only as the carrier / physical type.
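
As a sketch of the "encode it in metadata" part (the helper name is made up; the key is Arrow's canonical extension-name metadata key):

use arrow::datatypes::Field;

// Attach the logical type's name to the output field so clients reading data
// from DataFusion can recover it; the DataType itself stays a plain carrier.
fn tag_with_logical_type(field: Field, logical_name: &str) -> Field {
    let mut metadata = field.metadata().clone();
    metadata.insert("ARROW:extension:name".to_string(), logical_name.to_string());
    field.with_metadata(metadata)
}

// e.g. tag_with_logical_type(Field::new("payload", DataType::Utf8, true), "arrow.json")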

What was confusing was that ExtensionType::logical_type in your example returns a LogicalType:

pub trait ExtensionType {
    fn logical_type(&self) -> LogicalType;
}

I see this could be seen as a solution to some problems like function calls (see simple functions #12635).
If we can generate function calls for the various string array types for a function accepting string, then we should be able to do the same for a function accepting JSON. I agree, however, that we need to express this differently, i.e. a [logical] type should not redirect to another [logical] type for its implementation, otherwise we will force ourselves into a bad design with first-class and second-class types.

BTW what if this LogicalType / Type thing was not an enum at all, just a trait?

@jayzhan211
Contributor

jayzhan211 commented Sep 30, 2024

We clearly cannot have a reverse / back mapping from Arrow's DataType to the DF type. An Arrow DataType cannot say "it's JSON" or "it's geometry". We can encode this information in the Arrow field metadata, and that would be useful for clients reading data from DataFusion. However, it won't be sufficient for intermediate expressions, so DataFusion types need to be treated as DataFusion's responsibility, with Arrow types (DataType) serving only as the carrier / physical type.

I still don't see why not. Is there any example where we would require more than one logical type for a given DataType?

Given an Arrow extension type like Json, we can consider it either LogicalType::UserDefined(Json) or LogicalType::String if you want (it is customizable by design), but it does not make sense to have both at once.

If the one-way mapping doesn't work, we can introduce an able_to_decode method on DataType that returns a Vec of LogicalTypes. If any of them matches, we can decode the value to the target DataType.

BTW what if this LogicalType / Type thing was not an enum at all, just a trait?

A trait for user-defined types is much more flexible. For native types we could use either a trait or an enum.

@findepi
Member

findepi commented Sep 30, 2024

I still don't see why not. Is there any example where we would require more than one logical type for a given DataType?

If we allow types to be extended (#12644), many different types may use the same Arrow type as their carrier type (e.g. JSON -> String, Geo -> String).

Given an Arrow extension type like Json, we can consider it either LogicalType::UserDefined(Json) or LogicalType::String if you want (it is customizable by design), but it does not make sense to have both at once.

logical JSON and logical String may both use Arrow String, but they have different behavior, e.g. when casting from/to some other type.
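
A tiny illustration of that point (the cast rules below are invented, and the enum is only for illustration; the decision keys off the logical type rather than the carrier DataType):

// Both String and Json are carried as Utf8 physically, yet they cast differently.
enum Logical { String, Json, Int64 }

fn cast_allowed(from: &Logical, to: &Logical) -> bool {
    match (from, to) {
        (Logical::Json, Logical::String) => true,   // keep the raw JSON text
        (Logical::String, Logical::Json) => true,   // would validate the text as JSON
        (Logical::String, Logical::Int64) => true,  // plain parse of the text
        (Logical::Json, Logical::Int64) => false,   // needs a JSON-aware kernel, so reject
        (a, b) => std::mem::discriminant(a) == std::mem::discriminant(b),
    }
}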

If the one-way mapping doesn't work, we can introduce an able_to_decode method on DataType that returns a Vec of LogicalTypes.

I may be naive here, but I do very much hope we need only a unidirectional mapping, from logical types to Arrow types.
We need to retain the information about what the logical/DF type was (e.g. when returning data to the client, so they know too), but we might not need this information at runtime.

BTW what if this LogicalType / Type thing was not an enum at all, just a trait?

A trait for user-defined types is much more flexible. For native types we could use either a trait or an enum.

Yes, both impl strategies seem syntactically equivalent, and starting from an enum feels like a lazy choice.
I am experimenting with the trait-as-first-class-citizen approach to see how it works.
(This would be similar in spirit to unifying built-in and extension UDAFs under a single trait, #8708.)

@jayzhan211
Contributor

jayzhan211 commented Sep 30, 2024

If we allow types to be extended (#12644), many different types may use the same Arrow type as their carrier type (e.g. JSON -> String, Geo -> String).

I think this is where we don't have consensus: for Arrow extension types, I think each should be considered an individual type. Json, Geo, and String are 3 different Arrow data types, and thus 3 different logical types. Even if they all use a StringArray internally, the metadata or the rules describing that StringArray should be considered different; otherwise, why not just use StringArray in the first place?

Logical types can share the same DataType: logical JSON and logical String may both use Arrow String, and both can cast to StringArray, yet we can still tell the difference between these 2 logical types and process them with different logic.

I think my assumption should be changed to "at most one DatafusionNativeType for each DataType" instead. We can also think of it like this: we can find the minimum type set (the DatafusionNativeTypes), such that every Arrow DataType (including extension types) and every LogicalType (including user-defined types) maps to at most one type in the DatafusionNativeTypes.
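
A sketch of that "at most one" property (the function name is made up and the arms are only examples): every DataType collapses to at most one DatafusionNativeType, regardless of its encoding.

fn native_of(data_type: &DataType) -> Option<DatafusionNativeType> {
    match data_type {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => Some(DatafusionNativeType::String),
        DataType::Int32 => Some(DatafusionNativeType::Int32),
        // Extend as needed; there is never more than one candidate per DataType.
        _ => None,
    }
}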

@jayzhan211
Contributor

If the one-way mapping doesn't work, we can introduce an able_to_decode method on DataType that returns a Vec of LogicalTypes. If any of them matches, we can decode the value to the target DataType.

Maybe what we need is just to check whether the LogicalType we have is able to decode to a certain Arrow DataType 🤔 In this case we just need to know the set of types:

trait LogicalType {
    fn can_decode_to(&self, data_type: DataType) -> bool;
}

struct String {}
impl LogicalType for String {
    fn can_decode_to(&self, data_type: DataType) -> bool {
        matches!(data_type, DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8)
    }
}

struct Json {}
impl LogicalType for Json {
    fn can_decode_to(&self, data_type: DataType) -> bool {
        matches!(data_type, DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8)
    }
}

struct StringViewOnly {}
impl LogicalType for StringViewOnly {
    fn can_decode_to(&self, data_type: DataType) -> bool {
        matches!(data_type, DataType::Utf8View)
    }
}
struct Geo {}
impl LogicalType for Geo {
    fn can_decode_to(&self, data_type: DataType) -> bool {
        if let DataType::FixedSizeList(inner_type, n) = data_type {
            return matches!(inner_type.data_type(), DataType::Float32 | DataType::Float64) && n > 1
        }

        return false;
    }
}
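
One possible way to use can_decode_to before touching the buffers (a sketch reusing the trait above; the helper name is made up):

fn downcast_checked(logical: &dyn LogicalType, array: &ArrayRef) -> Result<()> {
    // Reject arrays whose physical layout this logical type cannot decode.
    if !logical.can_decode_to(array.data_type().clone()) {
        return internal_err!("logical type cannot decode a {} array", array.data_type());
    }

    // From here on, match on the concrete DataType as usual.
    if let DataType::Utf8 = array.data_type() {
        let _strings = array.as_string::<i32>();
    }
    Ok(())
}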

@notfilippo
Author

struct String {}
impl LogicalType for String {
    fn can_decode_to(&self, data_type: DataType) -> bool {
        matches!(data_type, DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8)
    }
}

I don't really like that this function would require recursion to handle Dictionary(Dictionary(...(Utf8))), but it's definitely an interesting approach.
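
For illustration, a recursive variant that unwraps the encodings before checking the value type could look like this (a sketch only, not a concrete proposal):

fn decodes_to_string(data_type: &DataType) -> bool {
    match data_type {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => true,
        // Peel off the encodings and check what they ultimately store.
        DataType::Dictionary(_, value_type) => decodes_to_string(value_type.as_ref()),
        DataType::RunEndEncoded(_, values) => decodes_to_string(values.data_type()),
        _ => false,
    }
}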

@notfilippo
Author

@jayzhan211 and @findepi I really like the approach of LogicalType as a trait and I was able to validate it by using an old working branch with the original proposal code. I ended up using this:

pub trait LogicalType {
    fn native(&self) -> &NativeType;
    fn name(&self) -> Option<&str>; // as in 'ARROW:extension:name'.
}

pub struct String {}

impl LogicalType for String {
    fn native(&self) -> &NativeType {
        &NativeType::Utf8
    }

    fn name(&self) -> Option<&str> {
        None
    }
}

pub struct JSON {}

pub const JSON_EXTENSION_NAME: &'static str = "arrow.json";

impl LogicalType for JSON {
    fn native(&self) -> &NativeType {
        &NativeType::Utf8
    }

    fn name(&self) -> Option<&str> {
        Some(JSON_EXTENSION_NAME)
    }
}

And I modified the temporary TypeRelation trait to keep track of both physical and logical types across the whole DataFusion system:

pub trait TypeRelation {
    fn logical(&self) -> &dyn LogicalType;
    fn physical(&self) -> &DataType;
}

pub struct NativeTypeRelation {
    logical: Arc<dyn LogicalType>,
    physical: DataType,
}

impl TypeRelation for NativeTypeRelation {
    fn logical(&self) -> &dyn LogicalType {
        &*self.logical
    }

    fn physical(&self) -> &DataType {
        &self.physical
    }
}

impl From<DataType> for NativeTypeRelation {
    fn from(value: DataType) -> Self {
        let logical = match value {
            DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8 => {
                Arc::new(String {})
            }
            _ => todo!(),
        };
        Self {
            logical,
            physical: value,
        }
    }
}

I was able to pass some string tests and write "JSON-specific" functions.
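
A minimal usage sketch of how the two pieces fit together (assuming NativeType derives PartialEq and Debug; the example itself is made up):

fn example() {
    // A plain Utf8 column: the relation keeps Utf8 as the physical type,
    // while the logical side is a String with no extension name.
    let relation = NativeTypeRelation::from(DataType::Utf8);

    assert_eq!(relation.physical(), &DataType::Utf8);
    assert_eq!(relation.logical().name(), None);
    assert_eq!(relation.logical().native(), &NativeType::Utf8);
}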
