Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Coral-Hive] Extend VersionedSqlUserDefinedFunction to support version-specific function names #508

Merged
merged 5 commits into from
Aug 27, 2024

Conversation

ljfgem
Copy link
Collaborator

@ljfgem ljfgem commented Jun 5, 2024

What changes are proposed in this pull request, and why are they necessary?

This PR is to support versioning UDF resolution for one unversioned UDF static registration.
For example, user can register com.linkedin.abc once and then use variations with standard versioning prefix in Hive view definitions. A UDF class is considered versioned if the prefix before the first dot follows coral_udf_version_(\d+|x)_(\d+|x)_(\d+|x) format.
To achieve the goal, we made the following changes:

  1. modify HiveFunctionResolver to allow versioning UDF resolution while static registry only contains unversioned class name.
  2. add functionClassName and getShortFunctionName() to VersionedSqlUserDefinedFunction.

After successful resolution in coral-hive, coral-spark would be able to return the UDF with versioning class name (functionClassName) so that Spark is able to register the versioning UDF class well. And coral-trino will use registered short name or fall back to convert the camel case class name to lowercase underscore function name (getShortFunctionName()).

How was this patch tested?

  1. New unit tests in coral-hive, coral-spark and coral-trino, worked as expected.
  2. Regression test on prod views.

@wmoustafa
Copy link
Contributor

A UDF is considered shaded if the prefix before the first dot contains an underscore.

This convention is brittle.

Why is shading indicated by an underscore? Can we use something that utilizes patterns/templates or maybe suffixes? We will have to be explicit about how to resolve conflicts in case of multiple names/classes match.

By the way, resolving conflicts could happen in the "underscore" based approach, so it is not unique to the more generic solution suggested above. One resolution strategy is to fail loading the view if there is a collision.

We need a representation that does not open the door for too generic names like .* but also flexible enough to allow for minor things like version number.

@ljfgem
Copy link
Collaborator Author

ljfgem commented Jun 7, 2024

A UDF is considered shaded if the prefix before the first dot contains an underscore.

This convention is brittle.

Why is shading indicated by an underscore? Can we use something that utilizes patterns/templates or maybe suffixes? We will have to be explicit about how to resolve conflicts in case of multiple names/classes match.

By the way, resolving conflicts could happen in the "underscore" based approach, so it is not unique to the more generic solution suggested above. One resolution strategy is to fail loading the view if there is a collision.

We need a representation that does not open the door for too generic names like .* but also flexible enough to allow for minor things like version number.

@wmoustafa The shading prefix resolution only happens for static registered UDFs in a_b.xyz format (must contain dot and the prefix before dot must contain underscore), so there won't be issue if we ensure that the future registered UDFs follow this pattern (no underscore in the prefix before dot).
But agreed that we can create a more deterministic and well specified template for our use.

Copy link
Contributor

@wmoustafa wmoustafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First high-level review.

@@ -26,7 +28,7 @@ public SourceOperatorMatchSqlCallTransformer(String sourceOpName, int numOperand

@Override
protected boolean condition(SqlCall sqlCall) {
return sourceOpName.equalsIgnoreCase(sqlCall.getOperator().getName())
return sourceOpName.equalsIgnoreCase(FunctionUtils.removeShadingPrefix(sqlCall.getOperator().getName()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not leak this shading business to a standard operator like this. Is there any way to handle this outside this operator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to CoralRegistryOperatorRenameSqlCallTransformer in coral-trino.


/**
* Checks if the given class name has a shading prefix.
* A class name is considered shaded if the prefix before the first dot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove references to shading and use versioning/versioned in the description and the code? Using shading terminology unnecessarily complicates the description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, modified.



public class FunctionUtils {
public static final String CORAL_VERSIONED_UDF_PREFIX = "coralversionedudf_(\\d+|x)_(\\d+|x)_(\\d+|x)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coral_udf_version_..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified.

Comment on lines 46 to 51
/**
* Generates a versioned function name based on the given function name and class name.
* For example, if the function name is "myFunction" and the class name is "coralversionedudf_1_0_0.com.linkedin.MyClass",
* the versioned function name will be "myFunction_1_0_0". If the class name is not shaded, such as "com.linkedin.MyClass",
* the versioned function name will be "myFunction".
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is not this done at the view definition level? Sounds like we are adding too much use case specific code to the core Coral logic. I am guessing because doing it in the view will hurt Trino, but this is not a valid reason for adding custom code like this. FYI @phd3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seeing the only use is in Spark (and the behavior is Spark specific), not sure if this fits in coral-common.

Copy link
Collaborator Author

@ljfgem ljfgem Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to VersionedSqlUserDefinedFunction.

@@ -66,15 +67,15 @@ protected SqlCall transform(SqlCall sqlCall) {
throw new UnsupportedUDFException(operatorName);
}
final String viewDependentFunctionName = operator.getViewDependentFunctionName();
String versionedFunctionName = FunctionUtils.getVersionedFunctionName(viewDependentFunctionName, operatorName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we leverage VersionedSqlUserDefinedFunction jointly with this new code, is there a way to encapsulate all this functionality in this class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a new method getVersionedViewDependentFunctionName in VersionedSqlUserDefinedFunction.

// we want to see class name in the exception message for coverage testing
// so throw exception here
throw new UnknownSqlFunctionException(funcClassName);
}

return dynamicResolvedFunctions;
} else {
if (isVersioningUDF(funcClassName)) {
Copy link
Contributor

@wmoustafa wmoustafa Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here should be:

  • If it has the necessary properties (and hence Dali UDF)
    • If it is on the format [name]+_version or db_t_+[name], then
      • short name = [name].
      • function name = unversioned class name
      • long name = [name]+_version or db_t_+[name] or whatever original name is
    • else a special case of the above

Ideally Dali views should have function-identifier property that makes it possible to use this directly in the registry or in function name here and not derive function name from the class name but the latter is acceptable in the interim, especially everything is encapsulated in this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in case short name = [name] is not useful for Trino, we can have short name depend on function name through a hardcoded dependency. Intention of short name is to be used in engines that cannot leverage versioning like Trino.

Copy link
Collaborator Author

@ljfgem ljfgem Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the code by adding a new method getTrinoFunctionName to VersionedSqlUserDefinedFunction, which returns the Trino function name as how we did before:

  1. Try to retrieve the Trino function name from the registered map TRINO_FUNC_NAME_MAP based on the UDF class name
  2. If the UDF class name doesn't exist in the map, follow the current approach by converting the camel case class name to lowercase underscore function name.

Since we don't have a standard for version in view function name (we only have a standard for versioning prefix in class name), and view function name may not be related to the registered Trino function name, I would prefer to keep the current approach of generating Trino function name, which would also avoid regressions.

@ljfgem ljfgem changed the title [Coral-Hive] Support various shaded UDF resolution for one unshaded UDF static registration [Coral-Hive] Support versioning UDF resolution for one unshaded UDF static registration Aug 22, 2024
* the short function name.
*/
public String getShortFunctionName() {
String unversionedClassName = getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add java doc about the nature of the function name (e.g., its format etc). You can say this is just a convention and other naming approaches are still valid as long as they identify the type inference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -356,6 +356,14 @@ public void testDaliUDFCall() {
assertEquals(RelOptUtil.toString(rel), expectedPlan);
}

@Test
public void testUDFWithVersioningClassName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify the unit test name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified

@@ -44,6 +44,7 @@
* Class to resolve hive function names in SQL to Function.
*/
public class HiveFunctionResolver {
private static final String CORAL_VERSIONED_UDF_PREFIX = "coral_udf_version_(\\d+|x)_(\\d+|x)_(\\d+|x)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove CORAL_ since VERSIONED_UDF is clear enough that it is related to VersionedUDF class?
Should use CLASS_NAME_PREFIX?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified

@@ -218,7 +222,7 @@ public void addDynamicFunctionToTheRegistry(String funcClassName, Function funct
new SqlUserDefinedFunction(new SqlIdentifier(funcClassName, ZERO),
new HiveGenericUDFReturnTypeInference(funcClassName, hiveTable.getDaliUdfDependencies()), null,
createSqlOperandTypeChecker(numOfOperands), null, null),
hiveTable.getDaliUdfDependencies(), functionName));
hiveTable.getDaliUdfDependencies(), functionName, funcClassName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we deal with functionName as a class name (e.g., String unversionedClassName = getName();). Is there a way to stick to one terminology througout the whole code?

Copy link
Collaborator Author

@ljfgem ljfgem Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to use functionName to refer to the function name in view text, and funcClassName to refer to the function class string.

Comment on lines 107 to 109
public String getUDFClassName() {
return udfClassName;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunctionName vs UDF Name.. can we fix all terminology and unify them to minimize confusion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified

Copy link
Contributor

@wmoustafa wmoustafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @ljfgem for working on this. Looks great now. Could you please update the title and description to match the new approach/sematics? I do not "shading" is a thing anymore.

It is probably along the lines of Extend VersionedSqlUserDefinedFunction to support version-specific function names.

Please make sure Trino tests pass both regression and server side.

@ljfgem ljfgem changed the title [Coral-Hive] Support versioning UDF resolution for one unshaded UDF static registration [Coral-Hive] Extend VersionedSqlUserDefinedFunction to support version-specific function names Aug 27, 2024
@ljfgem
Copy link
Collaborator Author

ljfgem commented Aug 27, 2024

Thank you so much @ljfgem for working on this. Looks great now. Could you please update the title and description to match the new approach/sematics? I do not "shading" is a thing anymore.

It is probably along the lines of Extend VersionedSqlUserDefinedFunction to support version-specific function names.

Please make sure Trino tests pass both regression and server side.

@wmoustafa Thank you for your reviews and suggestions on this PR! Modified the title and description, also finished the required tests.

@ljfgem ljfgem merged commit a8b86e7 into linkedin:master Aug 27, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants