
[FEA] Generate updated supported CSV files from plugin repo #847

Merged: 13 commits merged into NVIDIA:dev on Mar 22, 2024

Conversation

cindyyuanjiang
Collaborator

@cindyyuanjiang cindyyuanjiang commented Mar 8, 2024

Fixes #846

The spark-rapids repo added per-Apache-Spark-version generation of CSV files under spark-rapids/tools/generated_files.

This PR syncs with the plugin and updates the existing CSV files under spark-rapids-tools/core/src/main/resources. The new files are generated using the Python script included in this PR. The high-level approach is to take the union of the CSV files across all Spark versions, which is explained in more detail in the Python file.
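As a rough illustration of that union step, a minimal pandas sketch (the function name and the conflict handling here are illustrative, not the exact code in process_supported_files.py):

import glob
import os

import pandas as pd

def union_supported_files(root_dir, file_name):
    # Read one supported*.csv from every per-Spark-version subdirectory.
    paths = glob.glob(os.path.join(root_dir, "*", file_name))
    merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    # Rows identical across versions collapse to one; rows that conflict for
    # the same operator still need a precedence rule (see the CO/S discussion
    # later in this thread).
    return merged.drop_duplicates(ignore_index=True)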

Follow up:

  1. This PR introduces new Execs. We need to add new parsers.
  2. Automate this process by setting up a regular Jenkins job.
  3. File an issue in the plugin repo to remove the stale files.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Collaborator

@nartal1 nartal1 left a comment


Thanks @cindyyuanjiang !
I have a couple of questions:

  1. Do the dependencies for this script (pandas) have to be specified?
  2. What would be the function of the Jenkins job, i.e., would this Python script update the supported*.csv files and create a new PR?

Two review threads on core/src/main/resources/dev/process_supported_files.py (outdated, resolved)
NA = 2
PS = 3
S = 4
CO = 5
Collaborator


CO means "configured off", correct? So does it take precedence over S if an Exec is CO in one Spark version and S in another?

Collaborator Author


@nartal1 Thanks! I need to confirm what CO means. The current script assumes CO takes the highest precedence. I can change this once I get an answer.
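For illustration, the precedence assumption being discussed amounts to something like the following (a sketch only; the enum ordering mirrors the snippet quoted above and was still being settled in review):

from enum import IntEnum

class Support(IntEnum):
    NS = 1
    NA = 2
    PS = 3
    S = 4
    CO = 5  # assumed highest precedence, pending clarification of CO's meaning

def merge_cell(values):
    # Across Spark versions, keep the value with the highest rank, so a CO
    # in any one version currently wins over an S in another.
    return max(values, key=lambda v: Support[v])

print(merge_cell(["S", "CO"]))  # -> "CO" under the current assumption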

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Mar 8, 2024

Thanks @nartal1!

  1. Do the dependencies for this script (pandas) have to be specified?

I added some comments in the script about its dependencies. My original intention was to include this script as a reference for review and future development. Do we need to explicitly list the dependencies in a file like requirements.txt?

  2. What would be the function of the Jenkins job, i.e., would this Python script update the supported*.csv files and create a new PR?

The Jenkins job would likely have the following process:

  1. Use the Python script in this PR to generate the union of the CSV files from the spark-rapids repo.
  2. Compare the generated union files with the existing files in the tools repo.
  3. Update the tools repo and file an automated PR for review.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang for working on this new feature.
I haven't gotten the chance to review the PR yet, but I have one suggestion: it would be nice if the new Python script accepted an input to override the data pulled from the plugin. For example, we could have a config file in JSON format describing the explicit overriding we need to do on the tools side.

Examples:

  • Previously, the tools resource files manually added promote_precision as the SQL func name for PromotePrecision; that SQL func name did not exist in the plugin files. It would be nice to have the configuration file do that for us.
  • In some cases, user-facing teams ask to explicitly override a certain behavior for certain customers. These can also be part of the configuration files.
  • Down the road (future improvement), the config file can be an optional input. This allows changing the behavior without building a custom wheel file or making a new release.

CCing @viadea, @kuhushukla to get their thoughts because this feature would be used frequently by Hao and Kuhu.
CC @mattahrens
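A sketch of the kind of JSON override being suggested (the file name, schema, and column names here are hypothetical, not the format the PR ultimately adopted):

import json

import pandas as pd

# Hypothetical overrides.json:
# [
#   {"match": {"Expression": "PromotePrecision"},
#    "set": {"SQL Func": "`promote_precision`"}}
# ]

def apply_overrides(df, config_path):
    with open(config_path) as f:
        rules = json.load(f)
    for rule in rules:
        # Select the rows matching every "match" column, then set the
        # requested values on just those rows.
        mask = pd.Series(True, index=df.index)
        for col, val in rule["match"].items():
            mask &= df[col] == val
        for col, val in rule["set"].items():
            df.loc[mask, col] = val
    return df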

Collaborator

@parthosa parthosa left a comment


Thanks @cindyyuanjiang. A few minor questions.

Review thread on core/src/main/resources/dev/process_supported_files.py (outdated, resolved)
Comment on lines 100 to 101
def unify_all_files(root_dir, file_name, key_names):
final_df = pd.DataFrame()
Collaborator


Docstrings should be inside the definition for all functions

https://peps.python.org/pep-0257/
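For reference, PEP 257 puts the docstring as the first statement inside the def (the summary wording below is a guess at the function's intent):

def unify_all_files(root_dir, file_name, key_names):
    """Unify `file_name` across all Spark-version subdirectories of
    `root_dir`, keyed on the `key_names` columns."""
    final_df = pd.DataFrame()
    ...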

Collaborator Author


Thanks @parthosa! I will update this.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Mar 12, 2024

It would be nice if the new Python script accepted an input to override the data pulled from the plugin. For example, we could have a config file in JSON format describing the explicit overriding we need to do on the tools side.

Thanks @amahussein! I updated the Python script to accept an optional configs file, which overrides the final output, and also included the config file we need to use for the tools.

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
@cindyyuanjiang
Collaborator Author

What does CO mean in the supported CSV files?
CC: @amahussein @mattahrens

@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review March 12, 2024 23:36
@nartal1
Collaborator

nartal1 commented Mar 13, 2024

What does CO mean in the supported CSV files?

CO in this context means "Configured Off", specifically for read formats. If the read format is off by default in the plugin, it's marked as CO. Here Avro and JSON are off by default, so we end up adding CO for these read formats.
Code where CO is assigned: link

val readOps = types.map { t =>
  if (!formatEnabled) {
    // indicate configured off by default
    "CO"
  } else {
    read.support(t).text
  }
}

The JSON and Avro formats are disabled by default.
I think we can consider these the same as "NS", as we do for CollectLimitExec (which is also disabled by default).
@amahussein @mattahrens - your thoughts on this?

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang

  • Can we have the generated files sorted alphabetically, similar to what the plugin sources have?
  • I have some concerns about committing the changes in the CSV files:
    • It is hard to tell if the new change overrides any manual change we made in those files. For example, if I hadn't called out promote_precision, we would not have seen it. Is there another diff that can show us the change made to each operator? A possible workaround is to maintain the order of the columns so that the diff won't be bogus.
    • There are new rows introduced here, but we have not done any audit to assess the handling of the new operators, especially for Execs and some of the UDF expressions added in exprs.csv.

@cindyyuanjiang did you have any thoughts on how to address those issues?

CC @mattahrens

@cindyyuanjiang
Collaborator Author

cindyyuanjiang commented Mar 15, 2024

Thanks @amahussein!

  • Can we have the generated files sorted alphabetically, similar to what the plugin sources have?

I looked into the generated files in the plugin repo. They are not sorted alphabetically either. If we need to follow the order strictly, I will spend more time on it. Currently the order is generally the same, with a few exceptions.

It is hard to tell if the new change overrides any manual change we made in those files. For example, if I hadn't called out promote_precision, we would not have seen it. Is there another diff that can show us the change made to each operator? A possible workaround is to maintain the order of the columns so that the diff won't be bogus.

The manual changes made on the tools side are currently not tracked (we can use the configs file override_supported_configs.json to do it). I believe so far promote_precision is the only one, based on previous Jenkins job messages.

The current bogus diff is due to two new columns, DAYTIME and YEARMONTH. If we remove these columns and compare with the old CSV files in the plugin again, we see the following:
in supportedDataSource.csv:
no difference

in supportedExecs.csv:

@@ -17 +17 @@
-AQEShuffleReadExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
+CustomShuffleReaderExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
@@ -24,4 +23,0 @@
-WriteFilesExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,S,S,S,PS,PS,PS,S
-AppendDataExecV1,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,NS,S,NS,PS,PS,PS,NS
-AtomicCreateTableAsSelectExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,NS,S,NS,PS,PS,PS,NS
-AtomicReplaceTableAsSelectExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,NS,S,NS,PS,PS,PS,NS
@@ -29 +24,0 @@
-OverwriteByExpressionExecV1,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,NS,S,NS,PS,PS,PS,NS
@@ -52 +46,0 @@
-PythonMapInArrowExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NS,PS,NS,PS,NS
@@ -56 +49,0 @@
-WindowGroupLimitExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,PS,PS,PS,NS
@@ -58 +50,0 @@
-CustomShuffleReaderExec,S,None,Input/Output,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS

in supportedExprs.csv:

@@ -112,3 +111,0 @@
-BloomFilterMightContain,S, ,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,S,NA,NA,NA,NA,NA
-BloomFilterMightContain,S, ,None,project,rhs,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA
-BloomFilterMightContain,S, ,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -124,2 +121,2 @@
-CheckOverflowInTableInsert,S, ,None,project,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,S,PS,PS,PS,S
-CheckOverflowInTableInsert,S, ,None,project,result,S,S,S,S,S,S,S,S,PS,S,S,S,S,S,PS,PS,PS,S
+CheckOverflow,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
+CheckOverflow,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -159,3 +156,3 @@
-DateAdd,S,`date_add`; `dateadd`,None,project,startDate,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DateAdd,S,`date_add`; `dateadd`,None,project,days,NA,S,S,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DateAdd,S,`date_add`; `dateadd`,None,project,result,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateAdd,S,`date_add`,None,project,startDate,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateAdd,S,`date_add`,None,project,days,NA,S,S,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateAdd,S,`date_add`,None,project,result,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -165,3 +162,3 @@
-DateDiff,S,`date_diff`; `datediff`,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DateDiff,S,`date_diff`; `datediff`,None,project,rhs,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DateDiff,S,`date_diff`; `datediff`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateDiff,S,`datediff`,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateDiff,S,`datediff`,None,project,rhs,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+DateDiff,S,`datediff`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -185,6 +181,0 @@
-DivideDTInterval,S, ,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DivideDTInterval,S, ,None,project,rhs,NA,S,S,S,S,S,S,NA,NA,NA,NS,NA,NA,NA,NA,NA,NA,NA
-DivideDTInterval,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DivideYMInterval,S, ,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-DivideYMInterval,S, ,None,project,rhs,NA,S,S,S,S,S,S,NA,NA,NA,NS,NA,NA,NA,NA,NA,NA,NA
-DivideYMInterval,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -196,2 +186,0 @@
-Empty2Null,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-Empty2Null,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
@@ -300,2 +288,0 @@
-KnownNullable,S, ,None,project,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,S,PS,PS,PS,S
-KnownNullable,S, ,None,project,result,S,S,S,S,S,S,S,S,PS,S,S,S,S,S,PS,PS,PS,S
@@ -317,2 +304,2 @@
-Length,S,`char_length`; `character_length`; `len`; `length`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NS,NA,NA,NA,NA,NA
-Length,S,`char_length`; `character_length`; `len`; `length`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+Length,S,`char_length`; `character_length`; `length`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NS,NA,NA,NA,NA,NA
+Length,S,`char_length`; `character_length`; `length`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -379,6 +365,0 @@
-MultiplyDTInterval,S, ,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-MultiplyDTInterval,S, ,None,project,rhs,NA,S,S,S,S,S,S,NA,NA,NA,NS,NA,NA,NA,NA,NA,NA,NA
-MultiplyDTInterval,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-MultiplyYMInterval,S, ,None,project,lhs,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-MultiplyYMInterval,S, ,None,project,rhs,NA,S,S,S,S,S,S,NA,NA,NA,NS,NA,NA,NA,NA,NA,NA,NA
-MultiplyYMInterval,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -425,8 +406,2 @@
-PythonUDAF,S, ,None,aggregation,param,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NS,PS,NS,PS,NS
-PythonUDAF,S, ,None,aggregation,result,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NA,PS,NS,PS,NA
-PythonUDAF,S, ,None,reduction,param,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NS,PS,NS,PS,NS
-PythonUDAF,S, ,None,reduction,result,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NA,PS,NS,PS,NA
-PythonUDAF,S, ,None,window,param,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NS,PS,NS,PS,NS
-PythonUDAF,S, ,None,window,result,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NA,PS,NS,PS,NA
-PythonUDAF,S, ,None,project,param,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NS,PS,NS,PS,NS
-PythonUDAF,S, ,None,project,result,S,S,S,S,S,S,S,S,PS,S,NS,NS,NS,NA,PS,NS,PS,NA
+PromotePrecision,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
+PromotePrecision,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -443,3 +418,3 @@
-RLike,S,`regexp_like`; `regexp`; `rlike`,None,project,str,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-RLike,S,`regexp_like`; `regexp`; `rlike`,None,project,regexp,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA
-RLike,S,`regexp_like`; `regexp`; `rlike`,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
+RLike,S,`rlike`,None,project,str,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
+RLike,S,`rlike`,None,project,regexp,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA
+RLike,S,`rlike`,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -479,6 +453,0 @@
-RoundCeil,S, ,None,project,value,NA,S,S,S,S,PS,PS,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-RoundCeil,S, ,None,project,scale,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-RoundCeil,S, ,None,project,result,NA,S,S,S,S,S,S,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-RoundFloor,S, ,None,project,value,NA,S,S,S,S,PS,PS,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-RoundFloor,S, ,None,project,scale,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-RoundFloor,S, ,None,project,result,NA,S,S,S,S,S,S,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -647 +616 @@
-XxHash64,S,`xxhash64`,None,project,input,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NS,NS,NS
+XxHash64,S,`xxhash64`,None,project,input,S,S,S,S,S,NS,NS,S,PS,S,S,S,NS,NS,NS,NS,NS,NS
@@ -668 +637 @@
-Average,S,`avg`; `mean`,None,aggregation,input,NA,S,S,S,S,S,S,NA,NA,NA,S,S,NA,NS,NA,NA,NA,NA
+Average,S,`avg`; `mean`,None,aggregation,input,NA,S,S,S,S,S,S,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -670 +639 @@
-Average,S,`avg`; `mean`,None,reduction,input,NA,S,S,S,S,S,S,NA,NA,NA,S,S,NA,NS,NA,NA,NA,NA
+Average,S,`avg`; `mean`,None,reduction,input,NA,S,S,S,S,S,S,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -672 +641 @@
-Average,S,`avg`; `mean`,None,window,input,NA,S,S,S,S,S,S,NA,NA,NA,S,S,NA,NS,NA,NA,NA,NA
+Average,S,`avg`; `mean`,None,window,input,NA,S,S,S,S,S,S,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
@@ -674,10 +643,6 @@
-BloomFilterAggregate,S, ,None,reduction,child,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-BloomFilterAggregate,S, ,None,reduction,estimatedItems,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-BloomFilterAggregate,S, ,None,reduction,numBits,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
-BloomFilterAggregate,S, ,None,reduction,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA
-CollectList,S,`array_agg`; `collect_list`,None,aggregation,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
-CollectList,S,`array_agg`; `collect_list`,None,aggregation,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
-CollectList,S,`array_agg`; `collect_list`,None,reduction,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
-CollectList,S,`array_agg`; `collect_list`,None,reduction,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
-CollectList,S,`array_agg`; `collect_list`,None,window,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
-CollectList,S,`array_agg`; `collect_list`,None,window,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
+CollectList,S,`collect_list`,None,aggregation,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
+CollectList,S,`collect_list`,None,aggregation,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
+CollectList,S,`collect_list`,None,reduction,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
+CollectList,S,`collect_list`,None,reduction,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
+CollectList,S,`collect_list`,None,window,input,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
+CollectList,S,`collect_list`,None,window,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
@@ -766,2 +730,0 @@
-InSubqueryExec,S, ,None,project,input,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,NS,NS
-InSubqueryExec,S, ,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
@@ -773,4 +735,0 @@
-CheckOverflow,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-CheckOverflow,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-PromotePrecision,S,`promote_precision`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA
-PromotePrecision,S,`promote_precision`,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA

The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:

  1. new Execs and Expressions
  2. longer SQL func values (date_add; dateadd vs date_add)
  3. different row ordering on the plugin side

There are new rows introduced here, but we have not done any audit to assess the handling of the new operators, especially for Execs and some of the UDF expressions added in exprs.csv.

For the new Execs and Expressions, I should file follow-up PRs to add the support.

Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang

The point I was making in my previous comment is that, regardless of follow-ups, we cannot have the new CSVs merged as they are without proper evaluation.

This PR is not just about adding the script to the repo. The expectation is to make it work end-to-end, with confidence that the current behavior does not break.

If we need to follow the order strictly, I will spend more time on it.

I do not understand why it would take more development time to sort the final CSV file by its first column. The script already uses pandas DataFrames.

The current bogus diff is due to two new columns, DAYTIME and YEARMONTH. If we remove these columns and compare with the old CSV files in the plugin again, we see the following:

The diff file is incorrect. I will show just one case here, but there are possibly other entries. For example, the "Empty2Null" entry below:

@@ -196,2 +186,0 @@
-Empty2Null,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-Empty2Null,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA

However, Empty2Null exists in the tools CSV file. Also, the override_supported_configs file has no entry to set the SQL function empty2null:

Empty2Null,S,`empty2null`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA
Empty2Null,S,`empty2null`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

longer SQL func values (date_add; dateadd vs date_add)

This is not 100% correct. Another reason the diff is bogus is that the ordering within the SQL function list has changed on the plugin side. For example, in the tools CSV we have `=` followed by `==`, while on the plugin side we have `==` followed by `=`.

EqualTo,S,`=`; `==`,None,project,lhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,rhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
EqualTo,S,`=`; `==`,None,AST,lhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,rhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

https://github.com/NVIDIA/spark-rapids/blob/7177c3ae88661c758c776bd638a28f73a134b925/tools/generated_files/supportedExprs.csv?plain=1#L193-L198

The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:

That is what my initial comment was all about. What we need from this PR is to include a report of what is changing.
Obviously, diff is not the way to do it.
This PR introduces a script written in Python, which is a fully fledged framework. We can use its capabilities to report the actual difference for each record, regardless of column order and of ordering within cells.
There is no need to reinvent the wheel, as a Python package such as pandas has all those capabilities.
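A minimal sketch of that idea (helper and column names are mine; cells holding SQL-func lists are normalized by sorting their tokens so ordering within a cell stops producing noise):

import pandas as pd

def normalize(df, keys, func_col="SQL Func"):
    df = df.copy()
    # Sort multi-valued cells such as "`=`; `==`" so token order is irrelevant.
    if func_col in df.columns:
        df[func_col] = df[func_col].map(
            lambda s: "; ".join(sorted(str(s).split("; "))))
    # Index on the key columns and sort both axes so column order is irrelevant.
    return df.set_index(keys).sort_index().sort_index(axis=1)

def per_record_diff(old, new, keys):
    # Assumes both files share the same columns (e.g. DAYTIME/YEARMONTH
    # already handled) and that `keys` uniquely identify each row.
    old, new = normalize(old, keys), normalize(new, keys)
    common = old.index.intersection(new.index)
    # DataFrame.compare reports only the cells that actually differ.
    return old.loc[common].compare(new.loc[common])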

Suggestions to move forward

What I expect from this PR is to get a CSV file that I can instantly use to replace the current tools CSV file, with confidence that the Qualification tool's behavior is identical.
It does not make much sense to split the logic of the operation between Jenkins and this script.

  • Why then would we have this Python script in the GitHub repo?
  • How do we test when the logic is split across two different places?
  • What if we face a blocker getting the automated PR implemented? Then we are stuck with a non-end-to-end sync mechanism. If the logic is in one place in the GitHub repo, we should be able to run the script manually to update the CSV files without waiting for the Jenkins job to run (given all the complexity of how the automated PR is going to work).

  1. In my comment, I described different enums specific to changes coming from the tools side ("TON"/"TOFF").
  2. The script should take an argument representing the directory of the tools-generated CSV files, to be used as a reference to maintain the new changes.
  3. The script should generate a report with the following sections (see the pandas sketch after this list):
    • "New entries": entirely new entries. Those entries should not be added automatically to the tools CSV file.
    • "Removed entries": 1) we define ML functions that were not defined on the plugin's side; 2) remember that the plugin side drops support for specific Spark versions, and an entry may somehow disappear from the record. On the tools side, entries stick once they are added to the system, because the tools support event logs back to Spark 2.x.
    • "Modified entries": the changes in support level made to existing entries. Those are the entries whose support level has been modified compared to the existing tools file.
  4. The above generated report is important because a dev can manually audit it to verify that the "merge" is doing its job and that no new changes are introduced without notice.
  5. Optional: new entries can be added to the CSV file given that their support level is marked as TOFF, or even TNEW (standing for a newly added entry). Then, after proper testing, their support level can be restored. Note that this requires the script to append them automatically to the override-config files.
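A pandas outer merge with an indicator column gives those three report buckets directly (a sketch; key and column names assumed):

import pandas as pd

def report_sections(tools_df, plugin_df, keys):
    merged = tools_df.merge(plugin_df, on=keys, how="outer",
                            suffixes=("_tools", "_plugin"), indicator=True)
    removed = merged[merged["_merge"] == "left_only"]   # "Removed entries"
    new = merged[merged["_merge"] == "right_only"]      # "New entries"
    both = merged[merged["_merge"] == "both"]
    # "Modified entries": present on both sides with at least one differing cell.
    value_cols = [c[:-len("_tools")] for c in merged.columns if c.endswith("_tools")]
    diff_mask = pd.concat(
        [both[c + "_tools"].ne(both[c + "_plugin"]) for c in value_cols],
        axis=1).any(axis=1)
    return new, removed, both[diff_mask]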

@nartal1 has good knowledge of operators, and I suggest he can be a good source of assistance navigating through this, since you can both meet in person.

Comment on lines 54 to 58
NA = 1
NS = 2
CO = 3
PS = 4
S = 5
Collaborator


It is difficult to know why a given operator gets its level of support, especially since we want to override that support on demand. For instance, when Hao wants to troubleshoot an operator, he won't be able to tell whether the value comes from the plugin, was overridden by the tools, or comes from a specific version of Spark.

I suggest we define two other members:

  • TOFF standing for "tools-OFF".
  • TON standing for "tools-ON".

This will make it easy for anyone to know that an operator is turned on/off on the tools side.

  • When we want to enforce disabling an operator:
    • Add an entry in the override-config file defining its support level to be "TOFF".
    • Run the script to generate a new CSV file, which will have TOFF.
  • When we want to enforce enabling an operator:
    • Add an entry in the override-config file defining its support level to be "TON".
    • Run the script to generate a new CSV file, which will have TON.

This change requires that we mirror those two members on the Scala side as well.
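A sketch of the extended enum on the Python side (the Scala mirror is omitted; the TOFF/TON semantics follow the proposal above, the helper is hypothetical):

from enum import Enum

class SupportLevel(str, Enum):
    NA = "NA"
    NS = "NS"
    CO = "CO"
    PS = "PS"
    S = "S"
    TOFF = "TOFF"  # explicitly disabled on the tools side via the override-config file
    TON = "TON"    # explicitly enabled on the tools side via the override-config file

def resolve(plugin_value, override=None):
    # An override-config entry always wins over the merged plugin value, so a
    # TOFF/TON cell is immediately recognizable as a tools-side decision.
    return override if override in (SupportLevel.TOFF, SupportLevel.TON) else plugin_value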

Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang !
This is going to be a very useful script moving forward.

@amahussein amahussein merged commit fa15398 into NVIDIA:dev Mar 22, 2024
13 checks passed
@cindyyuanjiang cindyyuanjiang deleted the sync-supported-files branch March 22, 2024 18:22
Labels: core_tools (Scope the core module (scala))

Successfully merging this pull request may close these issues:

[FEA] Sync supported CSV files with the spark-rapids repo

4 participants