Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-304: Add an option to make requested schema case insensitive in read path #210

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

saucam
Copy link

@saucam saucam commented Jun 9, 2015

For projects such as hive and spark-sql which use parquet, the schema of the stored tables is always lowercase (because of limitations of hive metastore). It would be great if we can have a configurable option to read data from parquet irrespective of the case of the requested schema , supplied via ReadContext object.

This PR adds configurable option of ParquetInputFormat.CASE_SENSITIVITY which can be set to false. In that case, parquet will resolve requested columns irrespective of case. This alleviates projects like spark-sql to read footers at the driver side and reconcile schema (to be read from footers) with metastore schema before calling parquet read.

@saucam
Copy link
Author

saucam commented Jun 9, 2015

cc @isnotinvain

cc @liancheng request your thoughts on this ?

I have tested with latest spark master , and was able to eliminate schema read/ reconciliation for spark-sql purposes (where metastore schema is available)

@liancheng
Copy link
Contributor

@saucam Thanks for working on this!

To be more specific, Spark SQL itself can be configured to be either case sensitive or case insensitive. But when using Spark SQL to access Hive tables whose metadata are stored in Hive metastore, it have to be case insensitive because Hive metastore is so. Our current solution is to read both Hive metastore schema and Parquet schema, then try to get an arbitrative schema by having case information from Parquet schema and column type information from Hive schema.

To be honest, I feel kinda complicated about this issue. From the perspective of Spark SQL, schema resolution code can be simpler if Parquet provides this configuration. However, personally I feel that case insensitivity can be a footgun in many cases and a source of bugs. Let's leave this to Parquet committers to decide :)

@danielcweeks
Copy link

The way Hive resolves this is that the client never reads the footer (only hive metadata), but when the task processes the file it will read the footer and resolve the case sensitivity (see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L155). I'm not opposed to having it configurable, but I'm not clear why it's necessary.

@saucam
Copy link
Author

saucam commented Jun 13, 2015

@danielcweeks this is possible to do in spark as well (have tried, works for projections) but how do you resolve column names in filter objects which are pushed down in parquet before creating the tasks ? For that it becomes inevitable to have a resolved schema on the driver side ? I wonder if and how hive is able to resolve that ?

@danielcweeks
Copy link

@saucam it looks like in hive they push the filter expression to the task side as well and evaluate it there. However, I don't see that they address the case sensitivity issue, but I just took a cursory look at the code. I assume they could do the same as the do for column projection.

@saucam
Copy link
Author

saucam commented Jun 16, 2015

@julienledem request your thoughts on this ...

@nezihyigitbasi
Copy link
Contributor

Presto also handles case sensitivity itself on the task side, please see this PR.

@billonahill
Copy link

@nezihyigitbasi Presto works with case sensitivity with the PR you reference but that fix is limited to top-level columns only. It doesn't address querying case sensitive nested data structures, which is currently not supported in presto (see related prestodb/presto#2863). I believe this proposed parquet fix would address that.

@julienledem
Copy link
Member

If we add an option for case insensitivity in Parquet, we should make sure it is consistent and decide what to do with conflicting names. It sounds like it's case insensitive selection. projections and filters apply to all column that match ignoring the case?

validate(predicate, schema, true);
}

public static void validate(FilterPredicate predicate, MessageType schema, boolean isCaseSensitive) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that isCaseSensitive should be a property of the FilterPredicate.

@julienledem
Copy link
Member

I made comments inline. This is the kind of feature that is easy to add but hard to get right and to change if not right.
Sorry for letting this drag for long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants