
Improve error handling in prediction code #950

Merged · 1 commit merged into NVIDIA:dev on Apr 19, 2024

Conversation

@parthosa (Collaborator) commented Apr 18, 2024

Fixes #908. Currently, the prediction code uses a broad exception handling mechanism: an error in any single app stops prediction for all apps.

This PR introduces more granular error handling in the prediction code.

Method

  • Identified error checkpoints at which processing can safely continue for the remaining applications.
  • Manually injected relevant errors at these checkpoints and compared the behaviour against the current state.

Pseudo Code Flow

```text
predict()
    _get_model()
    get_qual_data()
        find_paths()       -> add error handling: return value expected to be None
        load_qtool_execs() -> add error handling: return value expected to be None
        load_qual_csv()    -> add error handling: return value unused in prediction
    for prof in prof_dirs:          # a single folder in our case
        load_profiles()
            for dataset in profile.items():
                for path in profile_paths:
                    for app in app_meta.keys():
                        column access and string processing
                            -> add error handling: each app is processed in a loop
                extract_raw_features()
                    app_id_tables = [load_csv_files() for app_id in app_ids]
                        -> add error handling: each app is processed in a loop
                impute()
    for dataset, input_df in processed_dfs.items():
        # input_df combines information from all apps;
        # past this point individual apps cannot be isolated
    print_speedup_summary() -> add error handling
    return predictions
```
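The per-app checkpoints above amount to wrapping each app's processing step in its own try/except so that one failing app does not abort the rest. A minimal sketch of the pattern (function and data names here are illustrative placeholders, not the actual spark-rapids-tools code):

```python
import logging

logger = logging.getLogger(__name__)

def load_csv_for_app(app_id):
    """Illustrative stand-in for per-app CSV loading; raises on bad input."""
    if app_id is None:
        raise KeyError('missing app id')
    return {'appId': app_id, 'rows': []}

def extract_raw_features(app_ids):
    """Load per-app tables, skipping apps that fail instead of aborting all."""
    app_id_tables = []
    failed_apps = []
    for app_id in app_ids:
        try:
            app_id_tables.append(load_csv_for_app(app_id))
        except Exception as ex:  # non-fatal: isolate the failure to this app
            logger.warning('Skipping app %s: %s', app_id, ex)
            failed_apps.append(app_id)
    return app_id_tables, failed_apps

tables, failed = extract_raw_features(['app-001', None, 'app-002'])
# the bad entry is skipped; the other two apps are still processed
```

Because each iteration is isolated, this pattern only applies while apps are still processed one at a time, which is why the handling stops once the per-app DataFrames are combined.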

Testing

| Function | Injected Error | Prediction State (Current) | Prediction State (New) | Reason |
|---|---|---|---|---|
| `find_paths()` | `ValueError` | Fails for all apps | Continues | Return value expected to be None |
| `load_qual_csv()` | `FileNotFoundError` | Fails for all apps | Continues | Return value unused for prediction |
| `load_qtool_execs()` | `FileNotFoundError` | Fails for all apps | Continues | Return value expected to be None |
| `load_csv_files()` | `KeyError` | Fails for all apps | Continues | Individual app processed in a loop |
| `load_profiles()` | `IndexError` during string slicing | Fails for all apps | Continues | Individual app processed in a loop |
| `_print_speedup_summary()` | `ValueError` | Fails for all apps | Returns result | Printing errors should be handled |
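The manual error injection described above can be reproduced with a small monkeypatch-style harness: temporarily replace a checkpoint function with one that raises, then verify the pipeline still completes. A sketch under assumed names (the `Pipeline` class and its methods are placeholders, not the real prediction code):

```python
import contextlib

class Pipeline:
    """Placeholder standing in for the prediction pipeline under test."""

    def find_paths(self, root):
        return [root]

    def predict(self):
        try:
            paths = self.find_paths('/data')
        except ValueError:
            paths = None  # granular handling: continue without the paths
        return {'paths': paths, 'status': 'completed'}

@contextlib.contextmanager
def inject_error(obj, name, exc):
    """Temporarily replace obj.<name> with a function that raises exc."""
    original = getattr(obj, name)
    def raiser(*args, **kwargs):
        raise exc
    setattr(obj, name, raiser)
    try:
        yield
    finally:
        setattr(obj, name, original)

p = Pipeline()
with inject_error(p, 'find_paths', ValueError('injected')):
    result = p.predict()
assert result['status'] == 'completed'  # prediction continued despite error
```

In a test suite the same effect is usually achieved with `pytest`'s `monkeypatch` fixture or `unittest.mock.patch`.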

Analysis

  • load_qual_csv() and its callers could be removed, as their results are not used in prediction.
  • Apart from these checkpoints, other errors cannot be handled at a granular level because:
    • Fatal errors should not be suppressed.
      • The goal is to target errors that are non-fatal to the overall prediction process (e.g. an error during model loading should not allow prediction to continue).
    • Individual apps cannot be isolated.
      • After the initial data processing steps, once the DataFrames are combined, individual apps can no longer be isolated for separate error handling.
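For the checkpoints whose return value is expected to be None on failure (e.g. `find_paths()` and `load_qtool_execs()` in the table above), the handling reduces to catching the error and falling back to a default instead of propagating. A hedged sketch of that pattern, with a hypothetical `safe_call` helper and a simulated failing checkpoint (neither is the actual tools code):

```python
import logging

logger = logging.getLogger(__name__)

def safe_call(fn, *args, default=None, **kwargs):
    """Run a non-critical checkpoint; on failure, log and return a default.

    Only appropriate where callers already tolerate the default value;
    fatal steps such as model loading must NOT be wrapped this way.
    """
    try:
        return fn(*args, **kwargs)
    except Exception as ex:
        logger.warning('%s failed, continuing with %r: %s',
                       fn.__name__, default, ex)
        return default

def load_qtool_execs(path):
    """Simulated checkpoint that fails the way an injected error would."""
    raise FileNotFoundError(path)

execs = safe_call(load_qtool_execs, '/missing/qual_output')
print(execs)  # None: downstream code treats the missing value as 'no data'
```

The design choice is that the fallback value must already be a legal input for every caller; otherwise suppressing the error only moves the failure downstream.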

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa added the `bug` (Something isn't working) and `user_tools` (Scope the wrapper module running CSP, QualX, and reports (python)) labels on Apr 18, 2024
@parthosa parthosa self-assigned this Apr 18, 2024
@amahussein (Collaborator) left a comment

Thanks @parthosa!
LGTM

I'd like to get more feedback from @mattahrens, @eordentlich, and @leewyang.

@parthosa (Collaborator, Author) commented Apr 18, 2024

> I'd like to get more feedback from @mattahrens, @eordentlich, and @leewyang.

Yes @amahussein. One thing we want to make sure of is that allowing the code to proceed does not affect the accuracy of the predictions. I have tested this on my local machine, and none of these changes affect the accuracy of the prediction values.

@leewyang (Collaborator) left a comment
LGTM 👍

@amahussein (Collaborator) left a comment

Thanks @parthosa
LGTM

@parthosa parthosa merged commit 5e7d63c into NVIDIA:dev Apr 19, 2024
16 checks passed
@parthosa parthosa deleted the spark-rapids-tools-908 branch April 19, 2024 16:05
Successfully merging this pull request may close these issues.

[BUG] Improve Error handling in Prediction code