
Add context to handler functions #103

Merged
merged 7 commits into aws:main on Sep 30, 2023
Conversation

@sachanub (Contributor) commented Sep 21, 2023

Issue #, if available:

MMS stores request information in a context object. As in the other inference toolkits, this context is passed to the handler service. A PR was created for the SageMaker Inference toolkit to pass the context to the handler functions. However, because the HuggingFaceHandlerService class and its handler functions do not directly inherit from the DefaultHandlerService and DefaultInferenceHandler classes in the base Inference toolkit, the context is currently not passed to the handler functions in the HuggingFace Inference toolkit. This PR passes the context to the handler functions in the HuggingFace Inference toolkit so that users can use that information in their custom handler functions.

One particular use case is multi-GPU inference for HuggingFace models. The default load function in the handler service can select the GPU device dynamically by leveraging the context and can assign the model to different devices. However, we currently do not pass the context as an input argument to the default load, preprocess, predict, postprocess and transform_fn functions. As a result, customers cannot access the context in their custom handler functions to dynamically assign the data/model to different GPUs: the same GPU device is always chosen, and the data and model land on that single device even on a multi-GPU host. By passing the context to the handler functions, users can assign the data/model to different GPUs and utilize all the GPU resources available on their host.

Below is an example of a model_fn that loads and assigns a model to a device:

# Current handler function without context:

import torch
from diffusers import StableDiffusionPipeline

def model_fn(model_dir):
    # Load Stable Diffusion and move it to the GPU
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )

    pipe = pipe.to("cuda")

    return pipe
# New handler function with context:

import torch
from diffusers import StableDiffusionPipeline

def model_fn(model_dir, context):
    # Load Stable Diffusion and move it to the GPU assigned to this worker
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )
    # Pick the GPU for this worker from the MMS context
    gpu_id = str(context.system_properties.get("gpu_id"))

    pipe = pipe.to("cuda:" + gpu_id)

    return pipe

After this PR, handler functions both with and without the context are supported, so existing use cases won't break. For example, the old signature still works:

model_fn(model_dir)  # should still work
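
For illustration, a custom inference.py can mix both styles. This is a hedged sketch only: the function bodies are placeholders, and the exact default signatures are defined in the toolkit's handler service.

# Illustrative sketch: a custom inference.py after this change.
def input_fn(input_data, content_type):      # old signature, still supported
    return input_data

def predict_fn(data, model, context):        # new signature, context is passed in
    # context.system_properties can be inspected, e.g. to pick a GPU device
    return model(data)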

Description of changes:

  • Add context as an input argument to the default handler functions.
  • Add a helper function to support handler functions both with and without context.

Description of testing:

  • Updated unit tests to make sure the context gets passed and that both new and old handler functions work properly.
  • Tested end-to-end by loading a HuggingFace model with 5 workers on a host with 4 GPUs, both with and without the context.

With context, the workers are allocated to different GPUs as expected:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0              31W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0              30W /  70W |   5587MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   26C    P0              30W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0              30W /  70W |   2795MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    462923      C   /opt/conda/bin/python3.10                  2792MiB |
|    1   N/A  N/A    462916      C   /opt/conda/bin/python3.10                  2792MiB |
|    1   N/A  N/A    462922      C   /opt/conda/bin/python3.10                  2792MiB |
|    2   N/A  N/A    462919      C   /opt/conda/bin/python3.10                  2792MiB |
|    3   N/A  N/A    462921      C   /opt/conda/bin/python3.10                  2792MiB |
+---------------------------------------------------------------------------------------+

Without the context, all workers are allocated to the same GPU, exhausting its available memory.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1B.0 Off |                    0 |
| N/A   27C    P0              31W /  70W |  13943MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:1C.0 Off |                    0 |
| N/A   23C    P8               8W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       On  | 00000000:00:1D.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   23C    P8               8W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    461619      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461621      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461623      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461624      C   /opt/conda/bin/python3.10                  2788MiB |
|    0   N/A  N/A    461625      C   /opt/conda/bin/python3.10                  2788MiB |
+---------------------------------------------------------------------------------------+

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@philschmid (Collaborator) left a comment

Thank you for working on this! I added a comment regarding the implementation.

Comment on lines 291 to 308

def run_handler_function(self, func, *argv):
    """Helper to call the handler function which covers 2 cases:
    1. the handle function takes context
    2. the handle function does not take context
    """
    num_func_input = len(signature(func).parameters)
    if num_func_input == len(argv):
        # function does not take context
        result = func(*argv)
    elif num_func_input == len(argv) + 1:
        # function takes context
        argv_context = argv + (self.context,)
        result = func(*argv_context)
    else:
        raise TypeError(
            "{} takes {} arguments but {} were given.".format(func.__name__, num_func_input, len(argv))
        )
    return result
@philschmid (Collaborator):

Is that needed? What overhead does it add to a request/function call?

Why don't we just pass self.context in the handle method, in the line

response = self.transform_fn(self.model, input_data, content_type, accept)
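
(For reference, that suggestion would amount to a one-line change along these lines, assuming every transform_fn accepts the extra argument; sketch only:)

response = self.transform_fn(self.model, input_data, content_type, accept, self.context)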

@sachanub (Contributor, Author) commented Sep 25, 2023

Hi @philschmid. Thank you for your review. Maybe I am not understanding your point correctly, but I don't think we can directly pass self.context in the handle method. If we do that, don't we risk breaking existing customers who are not using the context as an input argument? With the above-mentioned run_handler_function, we should be able to support both customers who do not want to use the context and customers who do. Please correct me if I misunderstood. Thanks!

@philschmid (Collaborator):

Why would it break it? You mean if they have a custom inference.py that defines a transform_fn method?

@sachanub (Contributor, Author):

Yes, we might affect those who define a custom transform_fn, right?

Contributor:

In handler_service.py, we try to import user-defined functions from inference.py:

            load_fn = getattr(user_module, "model_fn", None)
            preprocess_fn = getattr(user_module, "input_fn", None)
            predict_fn = getattr(user_module, "predict_fn", None)
            postprocess_fn = getattr(user_module, "output_fn", None)
            transform_fn = getattr(user_module, "transform_fn", None)

If they exist, they are used to overwrite self.load, self.preprocess, self.predict, self.postprocess and self.transform_fn:

            if load_fn is not None:
                self.load = load_fn
            if preprocess_fn is not None:
                self.preprocess = preprocess_fn
            if predict_fn is not None:
                self.predict = predict_fn
            if postprocess_fn is not None:
                self.postprocess = postprocess_fn
            if transform_fn is not None:
                self.transform_fn = transform_fn

In this PR, we introduce an additional context parameter for the default handlers. However, existing user scripts do not have this parameter in their customized handler functions, so we have to be careful when we call these functions. That's why @sachanub is adding a helper function to determine when to call functions with the new parameter.
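
A minimal sketch of the backward-compatibility problem described above, using a hypothetical user-defined transform_fn written against the old contract:

# Hypothetical existing user function from inference.py (old 4-argument contract)
def transform_fn(model, input_data, content_type, accept):
    return model(input_data)

# If the toolkit unconditionally appended the context:
#     transform_fn(model, data, content_type, accept, context)
# Python would raise:
#     TypeError: transform_fn() takes 4 positional arguments but 5 were given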

@philschmid (Collaborator):

Is that something we could deprecate and remove with a v3 of the toolkit? It feels very error-prone, or at least a big overhead, to parse and add args like this for every incoming request.

What do you think of adding a check to see if there is an inference.py provided, and if not, using self.transform_fn directly? Most customers deploy models using the "zero-code" deployment feature, where you provide a MODEL_ID and TASK and don't need an inference.py script.

tests/unit/test_handler_service.py (outdated comment, resolved)
@philschmid (Collaborator) left a comment

Added a suggestion on how we can avoid running all requests through this run_handler_function.
Do we know what overhead, in terms of latency, gets added by running a request through run_handler_function?
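
One way to quantify that concern (a rough micro-benchmark sketch, not something measured in this PR) is to time the per-call signature inspection that run_handler_function performs:

import timeit
from inspect import signature

def predict_fn(data, model):
    return data

# Approximate extra work run_handler_function does per request on top of the plain call
n = 10_000
per_call = timeit.timeit(lambda: len(signature(predict_fn).parameters), number=n) / n
print("signature inspection: ~{:.1f} us per call".format(per_call * 1e6))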


@sachanub (Contributor, Author) commented Sep 28, 2023

@philschmid I figured out another way to handle the context. Instead of using run_handler_function for every single inference request, we can determine the number of input arguments needed for each handler function in the validate_and_initialize_user_module method. Since the initialize method is executed only once (and hence validate_and_initialize_user_module is also executed only once), we will not have to use run_handler_function for each inference. In validate_and_initialize_user_module, for each handler function, we can figure out whether the context is required as an additional argument. If it is, we store it in a list (for example self.preprocess_extra_arg); if it is not, the list stays empty. Finally, we can just unpack the input arguments into the function call like this:

self.predict(*([processed_data, model] + self.predict_extra_arg))

By doing this, we avoid running every inference request through run_handler_function and add no latency overhead. With this approach, we should be able to maintain backwards compatibility without impacting performance.
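
A condensed sketch of that approach (the helper name and attribute names below follow the comment above and may differ from the final implementation):

from inspect import signature

def extra_context_args(func, num_base_args, context):
    """Return [context] if func declares one parameter beyond the base signature, else []."""
    return [context] if len(signature(func).parameters) == num_base_args + 1 else []

# In validate_and_initialize_user_module (executed once per worker), for example:
#     self.predict_extra_arg = extra_context_args(self.predict, 2, self.context)
# Per-request call site, with no per-request signature inspection:
#     prediction = self.predict(*([processed_data, model] + self.predict_extra_arg))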

@philschmid (Collaborator) left a comment

LGTM

@sachanub merged commit 8b10230 into aws:main on Sep 30, 2023
2 checks passed