
[ORT 1.18.1 Release] Cherry pick 3rd round #21129

Merged: 10 commits, Jun 24, 2024
2 changes: 1 addition & 1 deletion cmake/deps.txt
@@ -38,7 +38,7 @@ mp11;https://github.com/boostorg/mp11/archive/refs/tags/boost-1.82.0.zip;9bc9e01
neural_speed;https://github.com/intel/neural-speed/archive/refs/tags/v0.3.zip;5ec64e3071edc7347ebd8a81679cf06e2bb9b851
onnx;https://github.com/onnx/onnx/archive/refs/tags/v1.16.0.zip;a6d8b619459fb4657f8bec7d1c6d95ad6d4c069d
#use the latest commit of 10.0-GA
onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/eb43908b02a296ea0594432f06e9d3fac288d672.zip;94d07871810a36a5bc70a1def5c50504101c9bd1
onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/06adf4461ac84035bee658c6cf5df39f7ab6071d.zip;46dceef659d75d276e7914a8057c2282269d5e7b
protobuf;https://github.com/protocolbuffers/protobuf/archive/refs/tags/v21.12.zip;7cf2733949036c7d52fda017badcab093fe73bfa
protoc_win64;https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip;b4521f7ada5b260380f94c4bd7f1b7684c76969a
protoc_win32;https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win32.zip;3688010318192c46ce73213cdfb6b3e5656da874
2 changes: 2 additions & 0 deletions docs/ContribOperators.md
@@ -1597,6 +1597,8 @@ This version of the operator has been available since version 1 of the 'com.micr
<dd>Usually each EPContext node is associated with one graph partition. But in some cases, such as QNN, a single EPContext node contains all partitions. In that case, the node with ep_cache_context should set main_context=1; the other nodes set main_context=0 and skip ep_cache_context. The path is relative to this ONNX file. Default is 1.</dd>
<dt><tt>notes</tt> : string</dt>
<dd>(Optional) Some notes for the model</dd>
<dt><tt>onnx_model_filename</tt> : string</dt>
<dd>(Optional) Filename of the original ONNX model.</dd>
<dt><tt>partition_name</tt> : string</dt>
<dd>(Optional) partitioned graph name.</dd>
<dt><tt>source</tt> : string</dt>
include/onnxruntime/core/providers/tensorrt/tensorrt_provider_options.h
@@ -64,10 +64,21 @@
* - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
* - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
*
* 3. In the case of building weight-stripped engines, the same security reasons as listed in 1) apply to the
*    "onnx_model_filename" node attribute of the EP context node, which contains the filename of the ONNX model with
*    the weights needed for the refit process. The user can specify a folder path relative to the current working
*    directory by means of the "trt_onnx_model_folder_path" option.
*
*/
int trt_dump_ep_context_model{0}; // Dump EP context node model
const char* trt_ep_context_file_path{nullptr}; // Specify file name to dump EP context node model. Can be a path or a file name or a file name with path.
int trt_ep_context_embed_mode{0}; // Specify EP context embed mode. Default 0 = context is engine cache path, 1 = context is engine binary data
int trt_dump_ep_context_model{0}; // Dump EP context node model
const char* trt_ep_context_file_path{nullptr}; // Specify file name to dump EP context node model. Can be a path or a file name or a file name with path.

[cpplint] include/onnxruntime/core/providers/tensorrt/tensorrt_provider_options.h:74: Lines should be <= 120 characters long [whitespace/line_length] [2]
int trt_ep_context_embed_mode{0}; // Specify EP context embed mode. Default 0 = context is engine cache path, 1 = context is engine binary data

[cpplint] include/onnxruntime/core/providers/tensorrt/tensorrt_provider_options.h:75: Lines should be <= 120 characters long [whitespace/line_length] [2]
int trt_weight_stripped_engine_enable{0}; // Enable weight-stripped engine build. Default 0 = false,
// nonzero = true
const char* trt_onnx_model_folder_path{nullptr}; // Folder path relative to the current working directory for
// the ONNX model containing the weights (applicable only when
// the "trt_weight_stripped_engine_enable" option is enabled)

const char* trt_engine_cache_prefix{nullptr}; // specify engine cache prefix
int trt_engine_hw_compatible{0}; // Enable hardware compatibility. Default 0 = false, nonzero = true
};
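To make the interplay of these options concrete, below is a minimal sketch of how they might be populated when dumping an EP context model that references a weight-stripped engine cache. It assumes the fields shown above belong to the TensorRT provider options struct declared in this header (OrtTensorRTProviderOptionsV2); the values are illustrative only and are not part of this change.

// Usage sketch only (not part of this diff); assumes the struct is OrtTensorRTProviderOptionsV2.
OrtTensorRTProviderOptionsV2 trt_options{};
trt_options.trt_dump_ep_context_model = 1;                                     // dump the EP context node model
trt_options.trt_ep_context_file_path = "./context_model_dir/model_ctx.onnx";   // where the context model is written
trt_options.trt_ep_context_embed_mode = 0;                                     // 0 = context stores the engine cache path
trt_options.trt_weight_stripped_engine_enable = 1;                             // build a weight-stripped engine
trt_options.trt_onnx_model_folder_path = "models";                             // folder (relative to CWD) holding the weighted ONNX model
trt_options.trt_engine_cache_prefix = "TRTKernel";                             // optional engine cache prefix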
1 change: 1 addition & 0 deletions onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_kernel.cu
@@ -17,6 +17,7 @@
// Licensed under the MIT License.

#include <algorithm>
#include <cfloat>
#include <cuda.h>
#include <cuda_fp16.h>
#include <math.h>
5 changes: 5 additions & 0 deletions onnxruntime/core/graph/contrib_ops/contrib_defs.cc
@@ -3299,6 +3299,11 @@ void RegisterContribSchemas() {
"(Optional) SDK version used to convert the model.",
AttributeProto::STRING,
OPTIONAL_VALUE)
.Attr(
"onnx_model_filename",
"(Optional) Filename of the original ONNX model.",
AttributeProto::STRING,
OPTIONAL_VALUE)
.Attr(
"hardware_architecture",
"(Optional) Hardware architecture.",
77 changes: 72 additions & 5 deletions onnxruntime/core/providers/tensorrt/onnx_ctx_model_helper.cc
@@ -8,8 +8,10 @@
#include "onnx_ctx_model_helper.h"
#include "core/providers/cuda/shared_inc/cuda_call.h"
#include "core/framework/execution_provider.h"
#include "tensorrt_execution_provider.h"

[cpplint] onnxruntime/core/providers/tensorrt/onnx_ctx_model_helper.cc:11: Include the directory when naming header files [build/include_subdir] [4]

namespace onnxruntime {
extern TensorrtLogger& GetTensorrtLogger(bool verbose_log);

/*
* Check whether the graph has the EP context contrib op.
@@ -67,7 +69,8 @@
char* engine_data,
size_t size,
const int64_t embed_mode,
std::string compute_capability,
const std::string compute_capability,
const std::string onnx_model_path,
const logging::Logger* logger) {
auto model_build = graph_viewer.CreateModel(*logger);
auto& graph_build = model_build->MainGraph();
@@ -88,6 +91,7 @@
auto attr_0 = ONNX_NAMESPACE::AttributeProto::Create(); // embed_mode
auto attr_1 = ONNX_NAMESPACE::AttributeProto::Create(); // ep_cache_context
auto attr_2 = ONNX_NAMESPACE::AttributeProto::Create(); // hardware_architecture
auto attr_3 = ONNX_NAMESPACE::AttributeProto::Create(); // onnx_model_filename
std::string engine_data_str = "";
attr_0->set_name(EMBED_MODE);
attr_0->set_type(onnx::AttributeProto_AttributeType_INT);
@@ -106,13 +110,17 @@
attr_2->set_name(COMPUTE_CAPABILITY);
attr_2->set_type(onnx::AttributeProto_AttributeType_STRING);
attr_2->set_s(compute_capability);
attr_3->set_name(ONNX_MODEL_FILENAME);
attr_3->set_type(onnx::AttributeProto_AttributeType_STRING);
attr_3->set_s(std::filesystem::path(onnx_model_path).filename().string());

auto node_attributes = ONNX_NAMESPACE::NodeAttributes::Create();
int num_attributes = 3;
constexpr int num_attributes = 4;
node_attributes->reserve(num_attributes);
node_attributes->emplace(EMBED_MODE, *attr_0);
node_attributes->emplace(EP_CACHE_CONTEXT, *attr_1);
node_attributes->emplace(COMPUTE_CAPABILITY, *attr_2);
node_attributes->emplace(ONNX_MODEL_FILENAME, *attr_3);

// Create EP context node
graph_build.AddNode(EPCONTEXT_OP, EPCONTEXT_OP, "", inputs, outputs, node_attributes.get(), EPCONTEXT_OP_DOMAIN);
@@ -205,7 +213,7 @@
LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] Dumped " + ctx_model_path;
}

bool IsAbsolutePath(std::string& path_string) {
bool IsAbsolutePath(const std::string& path_string) {
#ifdef _WIN32
onnxruntime::PathString ort_path_string = onnxruntime::ToPathString(path_string);
auto path = std::filesystem::path(ort_path_string.c_str());
@@ -219,7 +227,7 @@
}

// Like "../file_path"
bool IsRelativePathToParentPath(std::string& path_string) {
bool IsRelativePathToParentPath(const std::string& path_string) {
#ifdef _WIN32
onnxruntime::PathString ort_path_string = onnxruntime::ToPathString(path_string);
auto path = std::filesystem::path(ort_path_string.c_str());
@@ -236,6 +244,28 @@
#endif
}

/*
* Get the weight-refitted engine cache path from a weight-stripped engine cache path
*
* Weight-stripped engine:
* An engine with its weights stripped; its size is smaller than that of a regular engine.
* The cache name of a weight-stripped engine is TensorrtExecutionProvider_TRTKernel_XXXXX.stripped.engine
*
* Weight-refitted engine:
* An engine whose weights have been refitted; it is simply a regular engine.
* The cache name of a weight-refitted engine is TensorrtExecutionProvider_TRTKernel_XXXXX.engine
*/
std::string GetWeightRefittedEnginePath(std::string stripped_engine_cache) {
std::filesystem::path stripped_engine_cache_path(stripped_engine_cache);
std::string refitted_engine_cache_path = stripped_engine_cache_path.stem().stem().string() + ".engine";
return refitted_engine_cache_path;
}

bool IsWeightStrippedEngineCache(std::filesystem::path& engine_cache_path) {
// The weight-stripped engine cache has the naming of xxx.stripped.engine
return engine_cache_path.stem().extension().string() == ".stripped";
}
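As a quick illustration of the naming convention these two helpers encode, the following sketch (not part of this change; the filename is a placeholder) turns a stripped cache name back into its refitted counterpart:

// Illustration only: shows how the helpers above relate; not code from this PR.
std::filesystem::path cache("TensorrtExecutionProvider_TRTKernel_XXXXX.stripped.engine");
if (IsWeightStrippedEngineCache(cache)) {
  // stem() twice drops ".engine" and then ".stripped", so the refitted cache name
  // becomes "TensorrtExecutionProvider_TRTKernel_XXXXX.engine".
  std::string refitted = GetWeightRefittedEnginePath(cache.string());
}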

Status TensorRTCacheModelHandler::GetEpContextFromGraph(const GraphViewer& graph_viewer) {
if (!ValidateEPCtxNode(graph_viewer)) {
return ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL, "It's not a valid EP Context node");
@@ -271,6 +301,22 @@
// The engine cache and context model (current model) should be in the same directory
std::filesystem::path ctx_model_dir(GetPathOrParentPathOfCtxModel(ep_context_model_path_));
auto engine_cache_path = ctx_model_dir.append(cache_path);
LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] GetEpContextFromGraph engine_cache_path: " + engine_cache_path.string();

// If it's a weight-stripped engine cache, it needs to be refitted even though the refit flag is not enabled
if (!weight_stripped_engine_refit_) {
weight_stripped_engine_refit_ = IsWeightStrippedEngineCache(engine_cache_path);
}

// If the serialized refitted engine is present, use it directly without refitting the engine again
if (weight_stripped_engine_refit_) {
const std::filesystem::path refitted_engine_cache_path = GetWeightRefittedEnginePath(engine_cache_path.string());
if (std::filesystem::exists(refitted_engine_cache_path)) {
LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] " + refitted_engine_cache_path.string() + " exists.";
engine_cache_path = refitted_engine_cache_path.string();
weight_stripped_engine_refit_ = false;
}
}

if (!std::filesystem::exists(engine_cache_path)) {
return ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL,
@@ -290,6 +336,21 @@
"TensorRT EP could not deserialize engine from cache: " + engine_cache_path.string());
}
LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] DeSerialized " + engine_cache_path.string();

if (weight_stripped_engine_refit_) {
const std::string onnx_model_filename = attrs.at(ONNX_MODEL_FILENAME).s();
std::string weight_stripped_engine_cache = engine_cache_path.string();
auto status = TensorrtExecutionProvider::RefitEngine(onnx_model_filename,
onnx_model_folder_path_,
weight_stripped_engine_cache,
true /* path check for security */,
(*trt_engine_).get(),
true /* serialize refitted engine to disk */,
detailed_build_log_);
if (status != Status::OK()) {
return ORT_MAKE_STATUS(ONNXRUNTIME, EP_FAIL, status.ErrorMessage());
}
}
}
return Status::OK();
}
@@ -306,7 +367,13 @@
// Show the warning if compute capability is not matched
if (attrs.count(COMPUTE_CAPABILITY) > 0) {
std::string model_compute_capability = attrs.at(COMPUTE_CAPABILITY).s();
if (model_compute_capability != compute_capability_) {
// Verify whether the engine was compiled with Ampere+ hardware compatibility enabled
if (model_compute_capability == "80+") {
LOGS_DEFAULT(WARNING) << "[TensorRT EP] Engine is compatible with all Ampere+ GPUs (except Jetson)";
if (std::stoi(compute_capability_) < 80) {
LOGS_DEFAULT(WARNING) << "[TensorRT EP] However, this GPU doesn't match. The compute capability of the GPU: " << compute_capability_;

[cpplint] onnxruntime/core/providers/tensorrt/onnx_ctx_model_helper.cc:374: Lines should be <= 120 characters long [whitespace/line_length] [2]
}
} else if (model_compute_capability != compute_capability_) {
LOGS_DEFAULT(WARNING) << "[TensorRT EP] Engine was compiled for a different compatibility level and might not work or perform suboptimal";
LOGS_DEFAULT(WARNING) << "[TensorRT EP] The compute capability of the engine: " << model_compute_capability;
LOGS_DEFAULT(WARNING) << "[TensorRT EP] The compute capability of the GPU: " << compute_capability_;
24 changes: 20 additions & 4 deletions onnxruntime/core/providers/tensorrt/onnx_ctx_model_helper.h
@@ -5,6 +5,7 @@

#include <string>
#include <filesystem>
#include <memory>

#include "core/providers/tensorrt/nv_includes.h"
#include "core/providers/shared_library/provider_api.h"
Expand All @@ -15,6 +16,7 @@
static const std::string EMBED_MODE = "embed_mode";
static const std::string EP_CACHE_CONTEXT = "ep_cache_context";
static const std::string COMPUTE_CAPABILITY = "hardware_architecture";
static const std::string ONNX_MODEL_FILENAME = "onnx_model_filename";

[cpplint] onnxruntime/core/providers/tensorrt/onnx_ctx_model_helper.h:19: For a static/global string constant, use a C style string instead: "static const char ONNX_MODEL_FILENAME[]". [runtime/string] [4]
static const std::string EPCONTEXT_OP_DOMAIN = "com.microsoft";
static const std::string EPCONTEXT_WARNING =
"It's suggested to set the ORT graph optimization level to 0 and \
@@ -29,12 +31,13 @@
char* engine_data,
size_t size,
const int64_t embed_mode,
std::string compute_capability,
const std::string compute_capability,
const std::string onnx_model_path,
const logging::Logger* logger);
std::string GetCtxModelPath(const std::string& ep_context_file_path,
const std::string& original_model_path);
bool IsAbsolutePath(std::string& path_string);
bool IsRelativePathToParentPath(std::string& path_string);
bool IsAbsolutePath(const std::string& path_string);
bool IsRelativePathToParentPath(const std::string& path_string);
void DumpCtxModel(ONNX_NAMESPACE::ModelProto* model_proto,
const std::string& ctx_model_path);
void UpdateCtxNodeModelEngineContext(ONNX_NAMESPACE::ModelProto* model_proto,
@@ -46,7 +49,17 @@
TensorRTCacheModelHandler(std::unique_ptr<nvinfer1::ICudaEngine>* trt_engine,
nvinfer1::IRuntime* trt_runtime,
std::string ep_context_model_path,
std::string compute_capability) : trt_engine_(trt_engine), trt_runtime_(trt_runtime), ep_context_model_path_(ep_context_model_path), compute_capability_(compute_capability) {
std::string compute_capability,
bool weight_stripped_engine_refit,
std::string onnx_model_folder_path,
bool detailed_build_log)
: trt_engine_(trt_engine),
trt_runtime_(trt_runtime),
ep_context_model_path_(ep_context_model_path),
compute_capability_(compute_capability),
weight_stripped_engine_refit_(weight_stripped_engine_refit),
onnx_model_folder_path_(onnx_model_folder_path),
detailed_build_log_(detailed_build_log) {
}
ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(TensorRTCacheModelHandler);

@@ -59,5 +72,8 @@
nvinfer1::IRuntime* trt_runtime_;
std::string ep_context_model_path_; // If using a context model, the context model and the engine cache are assumed to be in the same directory
std::string compute_capability_;
bool weight_stripped_engine_refit_;
std::string onnx_model_folder_path_;
bool detailed_build_log_;
}; // TRTCacheModelHandler
} // namespace onnxruntime
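For orientation, here is a brief sketch (not from this PR) of how a caller might construct the extended handler and trigger the EP context load; the variable names and values are placeholders, and GetEpContextFromGraph is assumed to be the public entry point used by the TensorRT EP.

// Sketch only; placeholder values, not code from this change.
std::unique_ptr<nvinfer1::ICudaEngine> trt_engine;
nvinfer1::IRuntime* trt_runtime = nullptr;  // assumed to be created elsewhere, e.g. via nvinfer1::createInferRuntime()
TensorRTCacheModelHandler handler(&trt_engine,
                                  trt_runtime,
                                  "./context_model_dir/model_ctx.onnx",    // ep_context_model_path
                                  "89",                                    // compute_capability of the current GPU
                                  /*weight_stripped_engine_refit=*/false,  // may still be turned on by a *.stripped.engine cache
                                  "models",                                // onnx_model_folder_path (relative to CWD)
                                  /*detailed_build_log=*/false);
auto status = handler.GetEpContextFromGraph(graph_viewer);  // graph_viewer: GraphViewer of the EP context model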