Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #5] enable user provided schema when reading exasol table #19

Merged
merged 5 commits into from
Oct 18, 2018

Conversation

3cham
Copy link
Contributor

@3cham 3cham commented Oct 17, 2018

No description provided.

@morazow
Copy link
Contributor

morazow commented Oct 18, 2018

Good morning @3cham,

Thanks for PR! Please keep up the good work!

Overall, everything looks good. However, I have some suggestions. I am going to list them from easier (easy to explain :)) to harder points.

  • [ExasolRelation] Remove using configSchema in buildScan() here. The user schema will be again automatically selected here.

  • [ExasolRelation] Keep querySchema function as before. But please rename it to inferSchema, that describes it clearly.

  • [ExasolRelation] Take configSchema parameter as option, e.g configSchema: Option[StructType]

  • [ExasolRelation] Change overrided schema function to get configSchema if available and only otherwise inferSchema. Usually, in Scala there are many ways to do this with options. However, let us stick with fold. So in the schema function should be for example:

// This will use inferSchema if configSchema is None
 override def schema: StructType = configSchema.fold(inferSchema){ userSchema =>
    logger.info(s"Using provided schema $userSchema")
    userSchema 
  }  

You can read about it here, http://blog.originate.com/blog/2014/06/15/idiomatic-scala-your-options-do-not-match/, https://kwangyulseo.com/2014/05/21/scala-option-fold-vs-option-mapgetorelse/

  • [DefaultSource] Here, let's put query string check and manager creating part into its own private function. E.g,
private def fromParameters(params: Map...): (String, ExasolManager) = {
    val queryString = parameters.get("query") match ..
    ...
    val manager = ExasolConnectionManager(config)
   (queryString, manager)
}

And both createRelation function call this, and return respective ExasolRelation.

 override def createRelation(...): BaseRelation = { 
    val (queryString, manager) = fromParameters(parameters)
    new ExasolRelation(sqlContext, queryString, None, manager)
  }

  override def createRelation(
    ....,                                                       
    schema: StructType                
  ): BaseRelation = {
    val (queryString, manager) = fromParameters(parameters)
    new ExasolRelation(sqlContext, queryString, Option(schema), manager)
  }     

Hope these are clear. Please let me know if you have questions!

}
}

override def schema: StructType = querySchema

override def buildScan(): RDD[Row] =
buildScan(Array.empty, Array.empty)
buildScan(configSchema.map(_.name).toArray, Array.empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove using configSchema here and keep as before.

Copy link
Contributor Author

@3cham 3cham Oct 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @morazow, this is where I spent a lot of time to think about. In the ExasolRDD implementation, we use Converter.resultSetToRows(resultSet, querySchema), in this case the jdbc connection still have to fetch all columns including the ones not provided in the configSchema. If the idea is user must provide the full schema that is returns from the queryString, then it is fine to let it like that. However, we could restrict jdbc to fetch only the required columns. Do you think it is a good idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @3cham,

Yes, we only fetch required columns for both cases when schema is provided and is inferred.

I comment below, for better explaining.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I could not comment on un-changed lines :)

Below, on lines 51 / 60 you will see,

Types.selectColumns(requiredColumns, schema)

Here, schema will be either user provided configSchema or inferred inferSchema. Then if we have required columns we select only those from that schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I oversaw this line. Thank you for pointing that out :)

class ExasolRelation(
context: SQLContext,
queryString: String,
configSchema: StructType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configSchema with type Option[StructType]


override def shortName(): String = "exasol"

override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]
): BaseRelation =
createRelation(sqlContext, parameters, new StructType())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call new private function to check that query exist and create manager variable. Then new relation with schema as None.

val (queryString, manager) = fromParameter(params)
new ExasolRelation(context, queryString, None, manager)

@@ -27,7 +39,6 @@ class DefaultSource extends RelationProvider with DataSourceRegister {
val config = ExasolConfiguration(parameters)
val manager = ExasolConnectionManager(config)

new ExasolRelation(sqlContext, queryString, manager)
new ExasolRelation(sqlContext, queryString, schema, manager)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before but with schema provided,

val (queryString, manager) = fromParameter(params)
new ExasolRelation(context, queryString, Option(schema), manager)

And put all other into separate private function which returns queryString and manager.

@@ -1,20 +1,32 @@
package com.exasol.spark

import java.util.InvalidPropertiesFormatException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception is not used anywhere

@3cham 3cham force-pushed the task/enable-user-provided-schema branch from 1177d38 to f555955 Compare October 18, 2018 12:30
@3cham
Copy link
Contributor Author

3cham commented Oct 18, 2018

Hi @morazow, please check the following up commits. If you don't mind, allow me to change the author info in the last PR ;)

@morazow
Copy link
Contributor

morazow commented Oct 18, 2018

Hey @3cham ,
PR looks good! Going to merge soon!

Yes, sure. I was also confused not seeing you in contributors list. Just for your information, I am also going to add authors list from my fork, https://github.com/morazow/spark-exasol-connector/blob/task/authors/AUTHORS.md.

Best

@3cham
Copy link
Contributor Author

3cham commented Oct 18, 2018

@morazow: many thanks, I used my lap from work with a different email config. That's why I need to change the info from the last PR.

@morazow morazow merged commit e6a197b into exasol:master Oct 18, 2018
@3cham 3cham deleted the task/enable-user-provided-schema branch October 19, 2018 11:09
@3cham 3cham restored the task/enable-user-provided-schema branch October 19, 2018 11:09
@3cham 3cham deleted the task/enable-user-provided-schema branch October 19, 2018 11:09
morazow added a commit that referenced this pull request Nov 4, 2018
[Issue #5] enable user provided schema when reading exasol table
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants