
Add context to handler functions #103

Merged
merged 7 commits into aws:main on Sep 30, 2023
Conversation

@sachanub (Contributor) commented Sep 21, 2023

Issue #, if available:

MMS stores request information in a context object. As in the other inference toolkits, this context is passed to the handler service. A PR was created for the SageMaker Inference toolkit to pass the context to the handler functions. However, because the HuggingFaceHandlerService class and its handler functions do not directly inherit from the DefaultHandlerService and DefaultInferenceHandler classes in the base Inference toolkit, the context is currently not passed to the handler functions in the HuggingFace Inference toolkit. This PR passes the context to the handler functions in the HuggingFace Inference toolkit so that users can use that information in their custom handler functions.

One particular use case is multi-GPU inference for HuggingFace models. The default load function in the handler service can select the GPU device dynamically by leveraging the context and can assign the model to different devices. However, we currently do not pass the context as an input argument to the default load, preprocess, predict, postprocess and transform_fn functions. As a result, customers cannot access the context in their custom handler functions to dynamically assign the data/model to different GPUs: the same GPU device is always chosen, and the data and model land on that single device even on a multi-GPU host. By passing the context to the handler functions, users can assign the data/model to different GPUs and utilize all the GPU resources available on their host.

Below is an example of a model_fn that loads and assigns a model to a device:

# Current handler function without context:

import torch
from diffusers import StableDiffusionPipeline

def model_fn(model_dir):
    # Load Stable Diffusion and move it to the GPU
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )

    pipe = pipe.to("cuda")

    return pipe
# New handler function with context:

import torch
from diffusers import StableDiffusionPipeline

def model_fn(model_dir, context):
    # Load Stable Diffusion and move it to the GPU assigned to this worker
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )
    # Pick the GPU for this worker from the MMS context
    gpu_id = str(context.system_properties.get("gpu_id"))

    pipe = pipe.to("cuda:" + gpu_id)

    return pipe

After this PR, handler functions both with and without the context are supported, so existing use cases won't break. For example, the old signature still works:

model_fn(model_dir)  # should still work
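
For illustration, a custom inference.py can mix both styles. This is a hedged sketch only: the function bodies are placeholders, and the exact default signatures are defined in the toolkit's handler service.

# Illustrative sketch: a custom inference.py after this change.
def input_fn(input_data, content_type):      # old signature, still supported
    return input_data

def predict_fn(data, model, context):        # new signature, context is passed in
    # context.system_properties can be inspected, e.g. to pick a GPU device
    return model(data)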

Description of changes:

  • Add context as an input argument to the default handler functions.
  • Add a helper function to support handler functions both with and without context.

Description of testing:

  • Updated unit tests to make sure the context gets passed and that both new and old handler functions work properly.
  • Tested end-to-end by loading a HuggingFace model with 5 workers on a host with 4 GPUs, both with and without the context.

With context, the workers are allocated to different GPUs as expected:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0              31W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0              30W /  70W |   5587MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   26C    P0              30W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0              30W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    462923      C   /opt/conda/bin/python3.10                  2792MiB |
|    1   N/A  N/A    462916      C   /opt/conda/bin/python3.10                  2792MiB |
|    1   N/A  N/A    462922      C   /opt/conda/bin/python3.10                  2792MiB |
|    2   N/A  N/A    462919      C   /opt/conda/bin/python3.10                  2792MiB |
|    3   N/A  N/A    462921      C   /opt/conda/bin/python3.10                  2792MiB |
+---------------------------------------------------------------------------------------+

Without the context, all workers are allocated to the same GPU, exhausting its available memory.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   27C    P0              31W /  70W |  13943MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   23C    P8               8W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   23C    P8               8W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    461619      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461621      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461623      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461624      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461625      C   /opt/conda/bin/python3.10                  2788MiB |
+---------------------------------------------------------------------------------------+

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@philschmid (Collaborator) left a comment

Thank you for working on this! I added a comment regarding the implementation.

Comment on lines 291 to 308

def run_handler_function(self, func, *argv):
    """Helper to call the handler function which covers 2 cases:
    1. the handle function takes context
    2. the handle function does not take context
    """
    num_func_input = len(signature(func).parameters)
    if num_func_input == len(argv):
        # function does not take context
        result = func(*argv)
    elif num_func_input == len(argv) + 1:
        # function takes context
        argv_context = argv + (self.context,)
        result = func(*argv_context)
    else:
        raise TypeError(
            "{} takes {} arguments but {} were given.".format(func.__name__, num_func_input, len(argv))
        )
    return result
@philschmid (Collaborator):

Is that needed? What overhead does it add to a request/function call?

Why don't we just pass self.context in the handle method, in the line

response = self.transform_fn(self.model, input_data, content_type, accept)
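
(For reference, that suggestion would amount to a one-line change along these lines, assuming every transform_fn accepts the extra argument; sketch only:)

response = self.transform_fn(self.model, input_data, content_type, accept, self.context)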

@sachanub (Contributor, Author) commented Sep 25, 2023

Hi @philschmid. Thank you for your review. Maybe I am not understanding your point correctly, but I don't think we can directly pass self.context in the handle method. If we do that, don't we risk breaking existing customers who are not using the context as an input argument? With the above-mentioned run_handler_function, we should be able to support both customers who do not want to use the context and customers who do. Please correct me if I misunderstood. Thanks!

@philschmid (Collaborator):

Why would it break it? You mean if they have a custom inference.py that defines a transform_fn method?

@sachanub (Contributor, Author):

Yes, we might affect those who define a custom transform_fn, right?

Contributor:

In handler_service.py, we try to import user-defined functions from inference.py:

            load_fn = getattr(user_module, "model_fn", None)
            preprocess_fn = getattr(user_module, "input_fn", None)
            predict_fn = getattr(user_module, "predict_fn", None)
            postprocess_fn = getattr(user_module, "output_fn", None)
            transform_fn = getattr(user_module, "transform_fn", None)

If they exist, they are used to overwrite self.load, self.preprocess, self.predict, self.postprocess and self.transform_fn:

            if load_fn is not None:
                self.load = load_fn
            if preprocess_fn is not None:
                self.preprocess = preprocess_fn
            if predict_fn is not None:
                self.predict = predict_fn
            if postprocess_fn is not None:
                self.postprocess = postprocess_fn
            if transform_fn is not None:
                self.transform_fn = transform_fn

In this PR, we introduce an additional context parameter for the default handlers. However, existing user scripts do not have this parameter in their customized handler functions, so we have to be careful when we call these functions. That's why @sachanub is adding a helper function to determine when to call functions with the new parameter.
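
A minimal sketch of the backward-compatibility problem described above, using a hypothetical user-defined transform_fn written against the old contract:

# Hypothetical existing user function from inference.py (old 4-argument contract)
def transform_fn(model, input_data, content_type, accept):
    return model(input_data)

# If the toolkit unconditionally appended the context:
#     transform_fn(model, data, content_type, accept, context)
# Python would raise:
#     TypeError: transform_fn() takes 4 positional arguments but 5 were given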

@philschmid (Collaborator):

Is that something we could deprecate and remove with a v3 of the toolkit? It feels very error-prone, or at least a big overhead, to parse and add args like this for every incoming request.

What do you think of adding a check to see if there is an inference.py provided, and if not, using self.transform_fn directly? Most customers deploy models using the "zero-code" deployment feature, where you provide a MODEL_ID and TASK and don't need an inference.py script.

tests/unit/test_handler_service.py (outdated comment, resolved)
@philschmid (Collaborator) left a comment

Added a suggestion on how we can avoid running all requests through this run_handler_function.
Do we know what overhead, in terms of latency, gets added by running a request through run_handler_function?
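
One way to quantify that concern (a rough micro-benchmark sketch, not something measured in this PR) is to time the per-call signature inspection that run_handler_function performs:

import timeit
from inspect import signature

def predict_fn(data, model):
    return data

# Approximate extra work run_handler_function does per request on top of the plain call
n = 10_000
per_call = timeit.timeit(lambda: len(signature(predict_fn).parameters), number=n) / n
print("signature inspection: ~{:.1f} us per call".format(per_call * 1e6))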


@sachanub (Contributor, Author) commented Sep 28, 2023

@philschmid I figured out another way to handle the context. Instead of using run_handler_function for every single inference request, we can determine the number of input arguments needed for each handler function in the validate_and_initialize_user_module method. Since the initialize method is executed only once (and hence validate_and_initialize_user_module is also executed only once), we will not have to use run_handler_function for each inference. In validate_and_initialize_user_module, for each handler function, we can figure out whether the context is required as an additional argument. If it is, we store it in a list (for example self.preprocess_extra_arg); if it is not, the list stays empty. Finally, we can just unpack the input arguments into the function call like this:

self.predict(*([processed_data, model] + self.predict_extra_arg))

By doing this, we avoid running every inference request through run_handler_function and add no latency overhead. With this approach, we should be able to maintain backwards compatibility without impacting performance.
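
A condensed sketch of that approach (the helper name and attribute names below follow the comment above and may differ from the final implementation):

from inspect import signature

def extra_context_args(func, num_base_args, context):
    """Return [context] if func declares one parameter beyond the base signature, else []."""
    return [context] if len(signature(func).parameters) == num_base_args + 1 else []

# In validate_and_initialize_user_module (executed once per worker), for example:
#     self.predict_extra_arg = extra_context_args(self.predict, 2, self.context)
# Per-request call site, with no per-request signature inspection:
#     prediction = self.predict(*([processed_data, model] + self.predict_extra_arg))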

@philschmid (Collaborator) left a comment

LGTM

@sachanub merged commit 8b10230 into aws:main on Sep 30, 2023
2 checks passed