-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Format] Consider adding an official variant type to Arrow #42069
Comments
What would be the benefit of this over the current Union types? Is it just to alleviate the need to specify all the types up front? |
That's part of it, yes, but many sources also support variants containing semistructured types where you could conceivably need a combinatorial explosion of unions to reflect all the data. (See https://docs.snowflake.com/en/sql-reference/data-types-semistructured, for instance.) |
I’m investigating using the Spark spec as an extension type in DataFusion. I’ll report back here whether it turns out to work well with Arrow layouts. |
Notes from discussion with original developersI talked to developers at Databricks who worked on adding this feature to Spark and Delta Lake. Here a few notes from that.
|
An Arrow extension type?In the near term, I think this would make a good Arrow extension type. This would be:
The metadata will usually be a single binary shared across all rows, but could be multiple. (Multiple might happen if two different batches are concatenated together, for example.) Either dictionary or REE encoded array would be appropriate. The data could be either binary, large binary, or binary view. Binary view isn’t widely supported right now, but could be very useful for this data type. This is because sub-objects can be sliced out of variants. From the spec 1:
Where could this be useful?A few immediate places I think this extension type could be useful:
Extension type pitfallsThe main pitfall of using an extension type for this is the storage type is meaningless to users. They need to have special libraries to interpret the bytes if pulled into a system that doesn't understand the variant extension type. In addition, most existing Arrow systems I've worked with don't have a way to customize how extension arrays are printed. I think this is something we should fix. A reasonable workaround in the meantime is providing functions that convert these back to JSON strings for the purpose of printing. Footnotes |
I think this makes sense as a extension type. I think given subcolumnarization work happening one might also want to store a union in the type as well for columns that have been split out |
One other thought, I think the variant type in spark has a more limited type surface then Arrow, that is potentially something that might need reconciling |
Yeah, I think there are really two different requests possible here: an Arrow-native variant type and a Spark-compatible variant type. The surface area thing works both ways: like Parquet, the Spark variant supports 32-bit and 64-bit decimal values while Arrow does not. |
I think it’s too early to say whether an Arrow-native one makes sense. The Spark / delta lake teams have intentions that their standard will proliferate to other engines. At which point it will not be a Spark specific thing and might make more sense to align with these types. If that succeeds, it would make sense for us to align with the format. The Open Variant Data Type has a If this standard doesn’t proliferate to other engines and ends up being Spark specific while other engines maintain different standards, we will have to have a conversation about what kind of variant type would make a good interchange format. That would be a point where Arrow designing its own format would make sense. Either way, it’s much too early to know which direction to go. Spark 4.0 isn’t even release yet. I think this is the stage where we should experiment with this type as a non-canonical extension type and keep an eye on the data types adoption in the wider ecosystem. |
Yeah, fwiw there is an iceberg proposal to also support variant type and if IIUC the current incarnation is to support spark with iceberg types but it hasn't made it very far yet |
Even if there were no Spark (or Iceberg) variant type there would still be variants stored in databases and it would be nice for ADBC to be able to return those in a somewhat-consistent fashion. I suppose ADBC could define its own extension type for this purpose. |
For curious observers, there's a thread about the Iceberg proposal at https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 and the proposal itself at https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit#heading=h.rt0cvesdzsj7. |
BTW we are actively working on implementing StringView / BinaryView support in arrow-rs apache/arrow-rs#5374 and DataFusion apache/datafusion#10918 and thanks to @XiangpengXao, @Weijun-H and other we are making good progress |
Parquet recently added the variant spec and variant data type. (Iceberg folks decided that it would be better to maintain the spec in parquet instead of Iceberg) Is there any plans in arrow to adopt it soon? |
@julienledem and I were just talking about this. I agree it would be nice to add the variant type to arrow -- I think the challenge will be finding people willing to help implement it in two languages. I don't think I will have time to help with Rust anytime soon, though I can help coordinate and I'll see if I can muster anyone to help. I filed this to track the ideae |
What would be required on the Arrow side of things? Just an extension, or would we also need methods to access/manipulate the content as well? |
I think the extension type is potentially not super useful without some methods to manipulate it. Note, the Parquet spec is still in experimental stage. I think having an extension type that mirrors the parquet spec once it is ready makes sense. |
Describe the enhancement requested
This could be aligned with the new Spark variant type or it could not be.
Component(s)
Format
The text was updated successfully, but these errors were encountered: