PARQUET-304: Add an option to make requested schema case insensitive in read path #210
base: master
Conversation
cc @isnotinvain, cc @liancheng — requesting your thoughts on this. I have tested with the latest Spark master and was able to eliminate the schema read/reconciliation step for Spark SQL (where the metastore schema is available).
@saucam Thanks for working on this! To be more specific, Spark SQL itself can be configured to be either case sensitive or case insensitive. But when using Spark SQL to access Hive tables whose metadata are stored in the Hive metastore, it has to be case insensitive because the Hive metastore is. Our current solution is to read both the Hive metastore schema and the Parquet schema, then derive an arbitrated schema by taking case information from the Parquet schema and column type information from the Hive schema. To be honest, I have mixed feelings about this issue. From the perspective of Spark SQL, the schema resolution code could be simpler if Parquet provided this configuration. However, I personally feel that case insensitivity can be a footgun in many cases and a source of bugs. Let's leave this to the Parquet committers to decide :)
The way Hive resolves this is that the client never reads the footer (only Hive metadata); when the task processes the file, it reads the footer and resolves case sensitivity there (see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L155). I'm not opposed to making it configurable, but I'm not clear on why it's necessary.
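The task-side resolution described above amounts to mapping each requested (lower-cased) column name onto its actual spelling in the file schema once the footer is available. A minimal sketch of that matching, using plain Java rather than the actual Hive or Parquet classes (the class and method names here are illustrative, not from either codebase):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch: resolve requested column names against a file schema,
// ignoring case, the way a task can once it has read the footer.
public class CaseInsensitiveResolver {
    // Returns the file-schema spelling for each requested name,
    // or null when the column is absent from the file.
    public static List<String> resolve(List<String> requested, List<String> fileColumns) {
        Map<String, String> byLower = new HashMap<>();
        for (String col : fileColumns) {
            byLower.put(col.toLowerCase(Locale.ROOT), col);
        }
        List<String> resolved = new ArrayList<>();
        for (String name : requested) {
            resolved.add(byLower.get(name.toLowerCase(Locale.ROOT)));
        }
        return resolved;
    }
}
```

Because this runs per task against the concrete file schema, no case resolution is needed on the client/driver side for projections; the open question in the thread is filters, which are constructed before the tasks exist.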
@danielcweeks this is possible to do in Spark as well (I have tried it; it works for projections), but how do you resolve column names in the filter objects that are pushed down to Parquet before the tasks are created? For that, it seems inevitable to have a resolved schema on the driver side. I wonder if and how Hive resolves that?
@saucam it looks like Hive pushes the filter expression to the task side as well and evaluates it there. However, I don't see that they address the case sensitivity issue — though I only took a cursory look at the code. I assume they could do the same as they do for column projection.
@julienledem requesting your thoughts on this.
Presto also handles case sensitivity itself on the task side; please see this PR.
@nezihyigitbasi Presto handles case sensitivity via the PR you reference, but that fix is limited to top-level columns. It doesn't address querying case-sensitive nested data structures, which is currently not supported in Presto (see the related prestodb/presto#2863). I believe the Parquet fix proposed here would address that.
If we add an option for case insensitivity in Parquet, we should make sure it is consistent and decide what to do with conflicting names. It sounds like this is case-insensitive selection: projections and filters apply to all columns that match, ignoring case?
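The conflicting-names concern above is concrete: a Parquet file schema may legally contain two columns that differ only in case (e.g. `Name` and `name`), and under case-insensitive resolution any lookup of either becomes ambiguous. A small sketch (illustrative names, not Parquet API) of detecting such collisions, which an implementation would need before deciding on an error, match-all, or first-wins policy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: find column names that collide once case is folded away.
public class CaseCollisionCheck {
    // Returns the lower-cased names that more than one column maps to.
    public static Set<String> collisions(List<String> columns) {
        Map<String, Integer> counts = new HashMap<>();
        for (String col : columns) {
            counts.merge(col.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        Set<String> ambiguous = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                ambiguous.add(e.getKey());
            }
        }
        return ambiguous;
    }
}
```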
```java
  validate(predicate, schema, true);
}

public static void validate(FilterPredicate predicate, MessageType schema, boolean isCaseSensitive) {
```
I would think that isCaseSensitive should be a property of the FilterPredicate.
I made comments inline. This is the kind of feature that is easy to add but hard to get right, and hard to change if it isn't right.
For projects such as Hive and Spark SQL that use Parquet, the schema of the stored tables is always lowercase (because of limitations of the Hive metastore). It would be great to have a configurable option to read data from Parquet irrespective of the case of the requested schema supplied via the ReadContext object.
This PR adds the configurable option ParquetInputFormat.CASE_SENSITIVITY, which can be set to false. In that case, Parquet resolves requested columns irrespective of case. This frees projects like Spark SQL from having to read footers on the driver side and reconcile the schema read from the footers with the metastore schema before calling the Parquet read path.
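At its core, the behavior the flag toggles is just the comparison used when matching a requested column against the file schema. A minimal sketch of that decision point (the class and method names are illustrative, not the PR's actual code):

```java
import java.util.List;

// Sketch: how a read path might honor a case-sensitivity flag, such as
// the proposed ParquetInputFormat.CASE_SENSITIVITY option, when checking
// whether a requested column exists in the file schema.
public class RequestedSchemaMatcher {
    public static boolean contains(List<String> fileColumns, String requested, boolean caseSensitive) {
        for (String col : fileColumns) {
            boolean matches = caseSensitive
                ? col.equals(requested)
                : col.equalsIgnoreCase(requested);
            if (matches) {
                return true;
            }
        }
        return false;
    }
}
```

With the flag set to true (the default in the PR's diff, where the existing overload delegates with `true`), behavior is unchanged; setting it to false lets a lowercase metastore schema match mixed-case Parquet columns without a driver-side reconciliation pass.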