PARQUET-304: Add an option to make requested schema case insensitive in read path #210
base: master
Conversation
cc @isnotinvain, cc @liancheng — requesting your thoughts on this. I have tested with the latest Spark master and was able to eliminate the schema read/reconciliation step for Spark SQL (where the metastore schema is available).
@saucam Thanks for working on this! To be more specific, Spark SQL itself can be configured to be either case sensitive or case insensitive. But when using Spark SQL to access Hive tables whose metadata are stored in the Hive metastore, it has to be case insensitive because the Hive metastore is. Our current solution is to read both the Hive metastore schema and the Parquet schema, then derive an arbitrated schema by taking case information from the Parquet schema and column type information from the Hive schema. To be honest, I have mixed feelings about this issue. From the perspective of Spark SQL, the schema resolution code could be simpler if Parquet provided this configuration. However, I personally feel that case insensitivity can be a footgun in many cases and a source of bugs. Let's leave this to the Parquet committers to decide :)
The way Hive resolves this is that the client never reads the footer (only Hive metadata); when the task processes the file, it reads the footer and resolves case sensitivity there (see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L155). I'm not opposed to making it configurable, but I'm not clear on why it's necessary.
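The task-side resolution described above amounts to mapping each requested (lower-cased) column name onto its actual spelling in the file schema once the footer is available. A minimal sketch of that matching, using plain Java rather than the actual Hive or Parquet classes (the class and method names here are illustrative, not from either codebase):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch: resolve requested column names against a file schema,
// ignoring case, the way a task can once it has read the footer.
public class CaseInsensitiveResolver {
    // Returns the file-schema spelling for each requested name,
    // or null when the column is absent from the file.
    public static List<String> resolve(List<String> requested, List<String> fileColumns) {
        Map<String, String> byLower = new HashMap<>();
        for (String col : fileColumns) {
            byLower.put(col.toLowerCase(Locale.ROOT), col);
        }
        List<String> resolved = new ArrayList<>();
        for (String name : requested) {
            resolved.add(byLower.get(name.toLowerCase(Locale.ROOT)));
        }
        return resolved;
    }
}
```

Because this runs per task against the concrete file schema, no case resolution is needed on the client/driver side for projections; the open question in the thread is filters, which are constructed before the tasks exist.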
@danielcweeks this is possible to do in Spark as well (I have tried it; it works for projections), but how do you resolve column names in the filter objects that are pushed down to Parquet before the tasks are created? For that, it seems inevitable to have a resolved schema on the driver side. I wonder if and how Hive resolves that?
@saucam it looks like Hive pushes the filter expression to the task side as well and evaluates it there. However, I don't see that they address the case sensitivity issue — though I only took a cursory look at the code. I assume they could do the same as they do for column projection.
@julienledem requesting your thoughts on this.
Presto also handles case sensitivity itself on the task side; please see this PR.
@nezihyigitbasi Presto handles case sensitivity via the PR you reference, but that fix is limited to top-level columns. It doesn't address querying case-sensitive nested data structures, which is currently not supported in Presto (see the related prestodb/presto#2863). I believe the Parquet fix proposed here would address that.
If we add an option for case insensitivity in Parquet, we should make sure it is consistent and decide what to do with conflicting names. It sounds like this is case-insensitive selection: projections and filters apply to all columns that match, ignoring case?
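The conflicting-names concern above is concrete: a Parquet file schema may legally contain two columns that differ only in case (e.g. `Name` and `name`), and under case-insensitive resolution any lookup of either becomes ambiguous. A small sketch (illustrative names, not Parquet API) of detecting such collisions, which an implementation would need before deciding on an error, match-all, or first-wins policy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: find column names that collide once case is folded away.
public class CaseCollisionCheck {
    // Returns the lower-cased names that more than one column maps to.
    public static Set<String> collisions(List<String> columns) {
        Map<String, Integer> counts = new HashMap<>();
        for (String col : columns) {
            counts.merge(col.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        Set<String> ambiguous = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                ambiguous.add(e.getKey());
            }
        }
        return ambiguous;
    }
}
```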
```java
  validate(predicate, schema, true);
}

public static void validate(FilterPredicate predicate, MessageType schema, boolean isCaseSensitive) {
```
I would think that isCaseSensitive should be a property of the FilterPredicate.
I made comments inline. This is the kind of feature that is easy to add but hard to get right, and hard to change if it isn't right.
For projects such as Hive and Spark SQL that use Parquet, the schema of the stored tables is always lowercase (because of limitations of the Hive metastore). It would be great to have a configurable option to read data from Parquet irrespective of the case of the requested schema supplied via the ReadContext object.
This PR adds the configurable option ParquetInputFormat.CASE_SENSITIVITY, which can be set to false. In that case, Parquet resolves requested columns irrespective of case. This frees projects like Spark SQL from having to read footers on the driver side and reconcile the schema read from the footers with the metastore schema before calling the Parquet read path.
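At its core, the behavior the flag toggles is just the comparison used when matching a requested column against the file schema. A minimal sketch of that decision point (the class and method names are illustrative, not the PR's actual code):

```java
import java.util.List;

// Sketch: how a read path might honor a case-sensitivity flag, such as
// the proposed ParquetInputFormat.CASE_SENSITIVITY option, when checking
// whether a requested column exists in the file schema.
public class RequestedSchemaMatcher {
    public static boolean contains(List<String> fileColumns, String requested, boolean caseSensitive) {
        for (String col : fileColumns) {
            boolean matches = caseSensitive
                ? col.equals(requested)
                : col.equalsIgnoreCase(requested);
            if (matches) {
                return true;
            }
        }
        return false;
    }
}
```

With the flag set to true (the default in the PR's diff, where the existing overload delegates with `true`), behavior is unchanged; setting it to false lets a lowercase metastore schema match mixed-case Parquet columns without a driver-side reconciliation pass.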