To learn how to create a Triton backend, and to see a best-practices baseline onto which you can add your own backend logic, follow the Tutorial.
Triton also provides a couple of example backends that demonstrate specific aspects of the backend API not covered by the Tutorial.
- The repeat backend shows a more advanced example of how a backend can produce multiple responses per request.
- The stateful backend shows an example of how a backend can manage model state tensors on the server side for the sequence batcher, avoiding the transfer of state tensors between client and server. Triton also implements Implicit State Management, which allows backends to behave in a stateless manner and leave state management to Triton.
The Triton Backend API exposes a large number of features. The backend utilities and classes provide many functions commonly used when creating a backend. But to create a functional backend it is not necessary to use most of the backend API or utilities. The tutorial starts with an implementation that shows a minimal backend and then adds on recommended and optional enhancements. The tutorial implementations follow best practices for Triton backends and so can be used as templates for your own backend.
The source code for the minimal backend is contained in minimal.cc. The source code contains extensive documentation describing the operation of the backend and the use of the Triton Backend API and the backend utilities. Before reading the source code, make sure you understand the concepts associated with Triton backend abstractions TRITONBACKEND_Backend, TRITONBACKEND_Model, and TRITONBACKEND_ModelInstance.
The minimal backend does not perform any interesting operation; it simply copies a single input tensor to a single output tensor. Even so, it demonstrates the basic organization required for a Triton backend.
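As a rough illustration of that organization, the following sketch (not taken from minimal.cc) shows the extern "C" entry points that a backend shared library exports. Error handling, model/instance state, and the actual tensor copy are omitted, so treat it only as a structural outline.

// Structural sketch of a Triton backend. The entry points below are the
// ones Triton looks for in the backend shared library; this sketch omits
// the response/output handling that minimal.cc demonstrates.
#include "triton/core/tritonbackend.h"

extern "C" {

// Called once when the backend shared library is loaded by Triton.
TRITONSERVER_Error*
TRITONBACKEND_Initialize(TRITONBACKEND_Backend* backend)
{
  return nullptr;  // success
}

// Called once for each model that uses this backend.
TRITONSERVER_Error*
TRITONBACKEND_ModelInitialize(TRITONBACKEND_Model* model)
{
  return nullptr;  // success
}

// Called once for each instance of each model.
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
{
  return nullptr;  // success
}

// Called with a batch of requests to execute. A real backend must create
// and send a response for each request and then release the requests;
// see minimal.cc for that logic.
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceExecute(
    TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
    const uint32_t request_count)
{
  return nullptr;  // success
}

}  // extern "C"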
The minimal backend is complete, but for clarity it leaves out some important aspects of writing a full-featured backend that are described in the Recommended Triton Backend. When creating your own backend, use the Recommended Triton Backend as a starting point.
backends/minimal/CMakeLists.txt shows the recommended build and install script for a Triton backend. To build the minimal backend and install in a local directory use the following commands.
$ cd backends/minimal
$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
$ make install
The following required Triton repositories will be pulled and used in the build. By default, the "main" branch/tag will be used for each repo, but the CMake arguments listed below can be used to override that.
- triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
- triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
- triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]
If you are building on a release branch (or on a development branch that is based on a release branch), then you must set these CMake arguments to point to that release branch as well. For example, if you are building the r21.10 identity_backend branch, then you need to use the following additional CMake flags:
-DTRITON_BACKEND_REPO_TAG=r21.10
-DTRITON_CORE_REPO_TAG=r21.10
-DTRITON_COMMON_REPO_TAG=r21.10
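Putting it together, the cmake invocation shown above would become, for example:
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
    -DTRITON_BACKEND_REPO_TAG=r21.10 \
    -DTRITON_CORE_REPO_TAG=r21.10 \
    -DTRITON_COMMON_REPO_TAG=r21.10 ..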
After building, the install directory will contain a backends/minimal directory that contains the minimal backend. Instructions for adding this backend to the Triton server are described in Backend Shared Library.
After adding the minimal backend to the Triton server as described in Backend Shared Library, you can run Triton and have it load the models in model_repos/minimal_models. Assuming you have created a tritonserver Docker image by adding the minimal backend to Triton, the following command will run Triton:
$ docker run --rm -it --net=host -v/path/to/model_repos/minimal_models:/models tritonserver --model-repository=/models
The console output will be similar to the following, indicating that the batching and nonbatching models from the minimal_models repository have loaded correctly. Note that the model repository has two models that both use the minimal backend. A backend can support any number of different models.
I1215 23:46:00.250284 68 server.cc:589]
+-------------+---------+--------+
| Model | Version | Status |
+-------------+---------+--------+
| batching | 1 | READY |
| nonbatching | 1 | READY |
+-------------+---------+--------+
The models are identical except that the batching model enables the dynamic batcher and supports batch sizes up to 8. Note that the batching model sets the batch delay to 5 seconds so that the example client described below can demonstrate how the minimal backend receives a batch of requests.
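The configuration of the batching model is roughly as follows. This is a sketch based on the tensor names, datatype, and shape shown in the client output below; see model_repos/minimal_models/batching/config.pbtxt for the actual file.

name: "batching"
backend: "minimal"
max_batch_size: 8
input [ { name: "IN0" data_type: TYPE_INT32 dims: [ 4 ] } ]
output [ { name: "OUT0" data_type: TYPE_INT32 dims: [ 4 ] } ]
dynamic_batching {
  # 5 second delay so the example client can demonstrate batching.
  max_queue_delay_microseconds: 5000000
}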
The clients directory holds example clients. The minimal_client Python script demonstrates sending a couple of inference requests to the minimal backend. With Triton running as described in Running Triton with the Minimal Backend, execute the client:
$ clients/minimal_client
The minimal_client first sends a single request to the nonbatching model. From the output you can see that the input value is returned in the output.
=========
Sending request to nonbatching model: IN0 = [1 2 3 4]
Response: {'model_name': 'nonbatching', 'model_version': '1', 'outputs': [{'name': 'OUT0', 'datatype': 'INT32', 'shape': [4], 'parameters': {'binary_data_size': 16}}]}
OUT0 = [1 2 3 4]
In the Triton console output you can see the log message printed by the minimal backend that indicates that it received a batch containing the single request.
I1221 18:14:12.964836 86 minimal.cc:348] model nonbatching: requests in batch 1
I1221 18:14:12.964857 86 minimal.cc:356] batched IN0 value: [ 1, 2, 3, 4 ]
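If you prefer, the same request can be sent without the Python client by posting directly to Triton's HTTP/REST inference endpoint (assuming Triton's default HTTP port 8000):
$ curl -X POST localhost:8000/v2/models/nonbatching/infer \
    -d '{"inputs":[{"name":"IN0","datatype":"INT32","shape":[4],"data":[1,2,3,4]}]}'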
The minimal_client next sends 2 requests at the same time to the batching model. Triton will dynamically batch those requests into a single batch and send that single batch to the minimal backend.
=========
Sending request to batching model: IN0 = [[10 11 12 13]]
Sending request to batching model: IN0 = [[20 21 22 23]]
Response: {'model_name': 'batching', 'model_version': '1', 'outputs': [{'name': 'OUT0', 'datatype': 'INT32', 'shape': [1, 4], 'parameters': {'binary_data_size': 16}}]}
OUT0 = [[10 11 12 13]]
Response: {'model_name': 'batching', 'model_version': '1', 'outputs': [{'name': 'OUT0', 'datatype': 'INT32', 'shape': [1, 4], 'parameters': {'binary_data_size': 16}}]}
OUT0 = [[20 21 22 23]]
In the Triton console output you can see the log message indicating that the minimal backend received a batch containing both requests.
I1221 18:14:17.965982 86 minimal.cc:348] model batching: requests in batch 2
I1221 18:14:17.966035 86 minimal.cc:356] batched IN0 value: [ 10, 11, 12, 13, 20, 21, 22, 23 ]
The source code for the recommended backend is contained in recommended.cc. The source code contains extensive documentation describing the operation of the backend and the use of the Triton Backend API and the backend utilities. Before reading the source code, make sure you understand the concepts associated with Triton backend abstractions TRITONBACKEND_Backend, TRITONBACKEND_Model, and TRITONBACKEND_ModelInstance.
The recommended backend improves the minimal backend to include the following features which should be present in any robust backend implementation:
- Enhances the backend to support models with input/output tensors that have datatypes other than INT32.
- Enhances the backend to support models with input/output tensors that have any shape.
- Uses the Triton backend metric APIs to record statistics about requests executing in the backend. These metrics can then be queried using the Triton metrics and statistics APIs.
- Adds error checking to ensure that the backend's version is compatible with Triton and that each model's configuration is compatible with the backend.
As with the minimal backend, the recommended backend just returns the input tensor value in the output tensor. Because of the additions described above, the recommended backend can serve as a starting point for your backend.
backends/recommended/CMakeLists.txt shows the recommended build and install script for a Triton backend. Building and installing is the same as described in Building the Minimal Backend.
After adding the recommended backend to the Triton server as described in Backend Shared Library, you can run Triton and have it load the models in model_repos/recommended_models. Assuming you have created a tritonserver Docker image by adding the recommended backend to Triton, the following command will run Triton:
$ docker run --rm -it --net=host -v/path/to/model_repos/recommended_models:/models tritonserver --model-repository=/models
The console output will be similar to the following, indicating that the batching model from the recommended_models repository has loaded correctly.
I1215 23:46:00.250284 68 server.cc:589]
+-------------+---------+--------+
| Model | Version | Status |
+-------------+---------+--------+
| batching | 1 | READY |
+-------------+---------+--------+
The clients directory holds example clients. The recommended_client Python script demonstrates sending a couple of inference requests to the recommended backend. With Triton running as described in Running Triton with the Recommended Backend, execute the client:
$ clients/recommended_client
The recommended_client sends 2 requests at the same time to the batching model, similar to what was done above with the minimal backend. Triton will dynamically batch those requests into a single batch and send that single batch to the recommended backend. For this model, batching is supported, the datatype is FP32, and the tensor shape is [ -1, 4, 4 ].
=========
Sending request to batching model: input = [[[1. 1.1 1.2 1.3]
[2. 2.1 2.2 2.3]
[3. 3.1 3.2 3.3]
[4. 4.1 4.2 4.3]]]
Sending request to batching model: input = [[[10. 10.1 10.2 10.3]
[20. 20.1 20.2 20.3]
[30. 30.1 30.2 30.3]
[40. 40.1 40.2 40.3]]]
Response: {'model_name': 'batching', 'model_version': '1', 'outputs': [{'name': 'OUTPUT', 'datatype': 'FP32', 'shape': [1, 4, 4], 'parameters': {'binary_data_size': 64}}]}
OUTPUT = [[[1. 1.1 1.2 1.3]
[2. 2.1 2.2 2.3]
[3. 3.1 3.2 3.3]
[4. 4.1 4.2 4.3]]]
Response: {'model_name': 'batching', 'model_version': '1', 'outputs': [{'name': 'OUTPUT', 'datatype': 'FP32', 'shape': [1, 4, 4], 'parameters': {'binary_data_size': 64}}]}
OUTPUT = [[[10. 10.1 10.2 10.3]
[20. 20.1 20.2 20.3]
[30. 30.1 30.2 30.3]
[40. 40.1 40.2 40.3]]]
In the Triton console output you can see the log message indicating that the recommended backend received a batch containing both requests.
I1221 18:30:52.223226 127 recommended.cc:604] model batching: requests in batch 2
I1221 18:30:52.223313 127 recommended.cc:613] batched INPUT value: [ 1.000000, 1.100000, 1.200000, 1.300000, 2.000000, 2.100000, 2.200000, 2.300000, 3.000000, 3.100000, 3.200000, 3.300000, 4.000000, 4.100000, 4.200000, 4.300000, 10.000000, 10.100000, 10.200000, 10.300000, 20.000000, 20.100000, 20.200001, 20.299999, 30.000000, 30.100000, 30.200001, 30.299999, 40.000000, 40.099998, 40.200001, 40.299999 ]
Because the recommended backend can support models that have input/output tensors with any datatype and shape, you can edit the model configuration and the client to experiment with these options.
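For example, a hypothetical edit that changes the datatype and shape of the batching model might replace the input/output entries with something like the following. The values are illustrative only, and the client must be updated to send matching tensors.

input [
  {
    name: "INPUT"
    data_type: TYPE_FP64   # was TYPE_FP32 in the example model
    dims: [ 2, 8 ]         # illustrative new shape
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP64
    dims: [ 2, 8 ]
  }
]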
To see the metrics collected for these two inference requests, use the following command to access Triton's metrics endpoint.
$ curl localhost:8002/metrics
The output will be metric values in Prometheus data format. The metrics documentation gives a description of these metric values.
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="batching",version="1"} 2.000000
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="batching",version="1"} 0.000000
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="batching",version="1"} 2.000000
# HELP nv_inference_exec_count Number of model executions performed
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="batching",version="1"} 1.000000
...
You can also see the collected statistics using the statistics endpoint.
$ curl localhost:8000/v2/models/batching/stats
{"model_stats":[{"name":"batching","version":"1","last_inference":1640111452223,"inference_count":2,"execution_count":1,"inference_stats":{"success":{"count":2,"ns":9997025869},"fail":{"count":0,"ns":0},"queue":{"count":2,"ns":9996491319},"compute_input":{"count":2,"ns":95288},"compute_infer":{"count":2,"ns":232202},"compute_output":{"count":2,"ns":195850}},"batch_stats":[{"batch_size":2,"compute_input":{"count":1,"ns":47644},"compute_infer":{"count":1,"ns":116101},"compute_output":{"count":1,"ns":97925}}]}]}
For an example of a backend that uses Business Logic Scripting (BLS), please see the documentation of the BLS Backend.
When using the dynamic batcher, Triton allows you to set custom batching rules. These rules are added on top of the specified dynamic batcher behavior. To set them, you pass in a library that implements the custom batching API. Two example batching libraries are located in the batching_strategies directory.
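Conceptually, such a library is just a shared object that implements a handful of batching entry points declared in tritonbackend.h. The skeleton below always includes every request (that is, it adds no extra constraint) and is only meant to show the shape of the API; consult tritonbackend.h and the volume_batching example for the authoritative signatures and a real strategy.

// Skeleton of a custom batching library that imposes no extra constraint;
// check tritonbackend.h for the exact, authoritative declarations.
#include "triton/core/tritonbackend.h"

extern "C" {

// Called when the model is loaded: create any state shared by all batches.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherInitialize(
    TRITONBACKEND_Batcher** batcher, TRITONBACKEND_Model* model)
{
  *batcher = nullptr;  // no shared state needed for this skeleton
  return nullptr;
}

// Called when Triton starts forming a new batch: create per-batch state.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchInitialize(
    const TRITONBACKEND_Batcher* batcher, void** userp)
{
  *userp = nullptr;  // a real strategy would track e.g. accumulated bytes
  return nullptr;
}

// Called for each candidate request: decide whether it joins the batch.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchIncludeRequest(
    TRITONBACKEND_Request* request, void* userp, bool* should_include)
{
  *should_include = true;  // no custom constraint
  return nullptr;
}

// Called when the batch is complete: free per-batch state.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatchFinalize(void* userp)
{
  return nullptr;
}

// Called when the model is unloaded: free shared state.
TRITONSERVER_Error*
TRITONBACKEND_ModelBatcherFinalize(TRITONBACKEND_Batcher* batcher)
{
  return nullptr;
}

}  // extern "C"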
For this tutorial, you can use the volume_batching example to set up a maximum byte volume per request. To build the library and install in a local directory, use the following commands:
$ cd batching_strategies/volume_batching
$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
$ make install
Next, move the library to the desired location. You can pass the file location via the model configuration. If not specified, Triton will look for a library called batchstrategy.so in the model version, model, and backend directories, in that order. For ease, we'll pass it via the model configuration.
Select a model to use this strategy with. Then, update the model configuration to have these fields:
dynamic_batching { }
parameters: { key: "TRITON_BATCH_STRATEGY_PATH", value: {string_value: "/path/to/libtriton_volumebatching.so"}}
parameters { key: "MAX_BATCH_VOLUME_BYTES" value: {string_value: "96"}}
You can update the path to the filepath of your library. You can also update the value of MAX_BATCH_VOLUME_BYTES to the maximum volume per batch for your use case. After starting Triton, you should see the scheduler apply a volume constraint per batch on top of the default batching behavior for your model. This can be made more visible by setting a max queue delay to give the scheduler more time for each batch to be completed. For example, you could set the delay to 100,000 microseconds.
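Combining the fields shown above with that queue delay, the relevant portion of the model configuration would look something like this:

dynamic_batching {
  max_queue_delay_microseconds: 100000
}
parameters: { key: "TRITON_BATCH_STRATEGY_PATH", value: {string_value: "/path/to/libtriton_volumebatching.so"}}
parameters: { key: "MAX_BATCH_VOLUME_BYTES", value: {string_value: "96"}}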
This section describes several optional features that you can add to enhance the capabilities of your backend.
Automatic model configuration generation is enabled by the backend implementing the appropriate logic (for example, in a function called AutoCompleteConfig) during TRITONBACKEND_ModelInitialize. For the recommended backend you would add a call to AutoCompleteConfig in the ModelState constructor just before the call to ValidateModelConfig. The AutoCompleteConfig function can fill in the input tensor, output tensor, and max-batch-size settings of the model configuration and then register the updated configuration with Triton using TRITONBACKEND_ModelSetConfig. Examples can be found in the ONNXRuntime backend, the TensorFlow backend, and other backends.
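A rough sketch of what such a function could look like for a ModelState similar to the one in recommended.cc is shown below. It assumes the model_config_ TritonJson object and TritonModel() accessor provided by the BackendModel base class in the backend utilities, and it fills in only max_batch_size as an example.

// Hypothetical AutoCompleteConfig sketch, assuming a ModelState that derives
// from triton::backend::BackendModel (as recommended.cc does) and therefore
// has model_config_ (a TritonJson value) and TritonModel() available.
TRITONSERVER_Error*
ModelState::AutoCompleteConfig()
{
  // Only fill in max_batch_size if the configuration does not already set it.
  triton::common::TritonJson::Value existing_mbs;
  if (!model_config_.Find("max_batch_size", &existing_mbs)) {
    RETURN_IF_ERROR(model_config_.AddInt("max_batch_size", 8));
  }

  // Input, output, and other settings could be auto-completed here in the
  // same way before handing the updated configuration back to Triton.

  triton::common::TritonJson::WriteBuffer buffer;
  RETURN_IF_ERROR(model_config_.Write(&buffer));

  TRITONSERVER_Message* message;
  RETURN_IF_ERROR(TRITONSERVER_MessageNewFromSerializedJson(
      &message, buffer.Base(), buffer.Size()));
  RETURN_IF_ERROR(TRITONBACKEND_ModelSetConfig(
      TritonModel(), 1 /* config_version */, message));
  RETURN_IF_ERROR(TRITONSERVER_MessageDelete(message));

  return nullptr;  // success
}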
A backend can add a key-value pair to a response any time after the response is created and before it is sent. The parameter key must be a string and the parameter value can be a string, integer or boolean. The following example shows the TRITONBACKEND API used to set response parameters. Error checking code is not shown to improve clarity.
TRITONBACKEND_ResponseSetStringParameter(responses[r], "param0", "an example string parameter");
TRITONBACKEND_ResponseSetIntParameter(responses[r], "param1", 42);
TRITONBACKEND_ResponseSetBoolParameter(responses[r], "param2", false);
A backend can access any of the files in a model's area of the model repository. These files are typically needed during TRITONBACKEND_ModelInitialize but can be accessed at other times as well. The TRITONBACKEND_ModelRepository API gives the location of the model's repository. For example, the following code can be run during TRITONBACKEND_ModelInitialize to write the location to the log.
// Can get location of the model artifacts. Normally we would need
// to check the artifact type to make sure it was something we can
// handle... but we are just going to log the location so we don't
// need the check. We would use the location if we wanted to load
// something from the model's repo.
TRITONBACKEND_ArtifactType artifact_type;
const char* clocation;
RETURN_IF_ERROR(
TRITONBACKEND_ModelRepository(model, &artifact_type, &clocation));
LOG_MESSAGE(
TRITONSERVER_LOG_INFO,
(std::string("Repository location: ") + clocation).c_str());
The framework backends (for example, TensorRT, ONNXRuntime, TensorFlow, PyTorch) read the actual model file from the model repository using this API. See those backends for examples of how it can be used.