
Improve error handling in prediction code #950

Merged · 1 commit merged into NVIDIA:dev on Apr 19, 2024

Conversation

@parthosa (Collaborator) commented Apr 18, 2024

Fixes #908. Currently, the prediction code uses a broad exception handling mechanism: an error in any single app stops prediction for all apps.

This PR introduces more granular error handling in the prediction code.

Method

  • Identified error checkpoints at which processing can safely continue for the remaining applications.
  • Manually injected relevant errors at these checkpoints and compared the behaviour against the current state.

Pseudo Code Flow

```text
predict()
    _get_model()
    get_qual_data()
        find_paths()       -> add error handling: return value expected to be None
        load_qtool_execs() -> add error handling: return value expected to be None
        load_qual_csv()    -> add error handling: return value unused in prediction
    for prof in prof_dirs:          # a single folder in our case
        load_profiles()
            for dataset in profile.items():
                for path in profile_paths:
                    for app in app_meta.keys():
                        column access and string processing
                            -> add error handling: each app is processed in a loop
                extract_raw_features()
                    app_id_tables = [load_csv_files() for app_id in app_ids]
                        -> add error handling: each app is processed in a loop
                impute()
    for dataset, input_df in processed_dfs.items():
        # input_df combines information from all apps;
        # past this point individual apps cannot be isolated
    print_speedup_summary() -> add error handling
    return predictions
```
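The per-app checkpoints above amount to wrapping each app's processing step in its own try/except so that one failing app does not abort the rest. A minimal sketch of the pattern (function and data names here are illustrative placeholders, not the actual spark-rapids-tools code):

```python
import logging

logger = logging.getLogger(__name__)

def load_csv_for_app(app_id):
    """Illustrative stand-in for per-app CSV loading; raises on bad input."""
    if app_id is None:
        raise KeyError('missing app id')
    return {'appId': app_id, 'rows': []}

def extract_raw_features(app_ids):
    """Load per-app tables, skipping apps that fail instead of aborting all."""
    app_id_tables = []
    failed_apps = []
    for app_id in app_ids:
        try:
            app_id_tables.append(load_csv_for_app(app_id))
        except Exception as ex:  # non-fatal: isolate the failure to this app
            logger.warning('Skipping app %s: %s', app_id, ex)
            failed_apps.append(app_id)
    return app_id_tables, failed_apps

tables, failed = extract_raw_features(['app-001', None, 'app-002'])
# the bad entry is skipped; the other two apps are still processed
```

Because each iteration is isolated, this pattern only applies while apps are still processed one at a time, which is why the handling stops once the per-app DataFrames are combined.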

Testing

| Function | Injected Error | Prediction State (Current) | Prediction State (New) | Reason |
|---|---|---|---|---|
| `find_paths()` | `ValueError` | Fails for all apps | Continues | Return value expected to be None |
| `load_qual_csv()` | `FileNotFoundError` | Fails for all apps | Continues | Return value unused for prediction |
| `load_qtool_execs()` | `FileNotFoundError` | Fails for all apps | Continues | Return value expected to be None |
| `load_csv_files()` | `KeyError` | Fails for all apps | Continues | Individual app processed in a loop |
| `load_profiles()` | `IndexError` during string slicing | Fails for all apps | Continues | Individual app processed in a loop |
| `_print_speedup_summary()` | `ValueError` | Fails for all apps | Returns result | Printing errors should be handled |
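The manual error injection described above can be reproduced with a small monkeypatch-style harness: temporarily replace a checkpoint function with one that raises, then verify the pipeline still completes. A sketch under assumed names (the `Pipeline` class and its methods are placeholders, not the real prediction code):

```python
import contextlib

class Pipeline:
    """Placeholder standing in for the prediction pipeline under test."""

    def find_paths(self, root):
        return [root]

    def predict(self):
        try:
            paths = self.find_paths('/data')
        except ValueError:
            paths = None  # granular handling: continue without the paths
        return {'paths': paths, 'status': 'completed'}

@contextlib.contextmanager
def inject_error(obj, name, exc):
    """Temporarily replace obj.<name> with a function that raises exc."""
    original = getattr(obj, name)
    def raiser(*args, **kwargs):
        raise exc
    setattr(obj, name, raiser)
    try:
        yield
    finally:
        setattr(obj, name, original)

p = Pipeline()
with inject_error(p, 'find_paths', ValueError('injected')):
    result = p.predict()
assert result['status'] == 'completed'  # prediction continued despite error
```

In a test suite the same effect is usually achieved with `pytest`'s `monkeypatch` fixture or `unittest.mock.patch`.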

Analysis

  • load_qual_csv() and its callers could be removed, as their results are not used in prediction.
  • Apart from these checkpoints, other errors cannot be handled at a granular level because:
    • Fatal errors should not be suppressed.
      • The goal is to target errors that are non-fatal to the overall prediction process (e.g. an error during model loading should not allow prediction to continue).
    • Individual apps cannot be isolated.
      • After the initial data processing steps, once the DataFrames are combined, individual apps can no longer be isolated for separate error handling.
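For the checkpoints whose return value is expected to be None on failure (e.g. `find_paths()` and `load_qtool_execs()` in the table above), the handling reduces to catching the error and falling back to a default instead of propagating. A hedged sketch of that pattern, with a hypothetical `safe_call` helper and a simulated failing checkpoint (neither is the actual tools code):

```python
import logging

logger = logging.getLogger(__name__)

def safe_call(fn, *args, default=None, **kwargs):
    """Run a non-critical checkpoint; on failure, log and return a default.

    Only appropriate where callers already tolerate the default value;
    fatal steps such as model loading must NOT be wrapped this way.
    """
    try:
        return fn(*args, **kwargs)
    except Exception as ex:
        logger.warning('%s failed, continuing with %r: %s',
                       fn.__name__, default, ex)
        return default

def load_qtool_execs(path):
    """Simulated checkpoint that fails the way an injected error would."""
    raise FileNotFoundError(path)

execs = safe_call(load_qtool_execs, '/missing/qual_output')
print(execs)  # None: downstream code treats the missing value as 'no data'
```

The design choice is that the fallback value must already be a legal input for every caller; otherwise suppressing the error only moves the failure downstream.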

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa added the `bug` (Something isn't working) and `user_tools` (Scope the wrapper module running CSP, QualX, and reports (python)) labels on Apr 18, 2024
@parthosa parthosa self-assigned this Apr 18, 2024
@amahussein (Collaborator) left a comment

Thanks @parthosa!
LGTM

I'd like to get more feedback from @mattahrens, @eordentlich, and @leewyang.

@parthosa (Collaborator, Author) commented Apr 18, 2024

> I'd like to get more feedback from @mattahrens, @eordentlich, and @leewyang.

Yes @amahussein. One thing we want to make sure of is that allowing the code to proceed does not affect the accuracy of the predictions. I have tested this on my local machine, and none of these changes affect the accuracy of the prediction values.

@leewyang (Collaborator) left a comment
LGTM 👍

@amahussein (Collaborator) left a comment

Thanks @parthosa
LGTM

@parthosa parthosa merged commit 5e7d63c into NVIDIA:dev Apr 19, 2024
16 checks passed
@parthosa parthosa deleted the spark-rapids-tools-908 branch April 19, 2024 16:05
Successfully merging this pull request may close these issues.

[BUG] Improve Error handling in Prediction code