
feat(ml): introduce support of onnxruntime-rocm for AMD GPU #11063

Draft · Zelnes wants to merge 7 commits into main from feature/Add-rocm-support-for-machine-learning
Conversation

@Zelnes Zelnes commented Jul 12, 2024

I'd like to propose this feature, which introduces support for machine learning on AMD GPUs.

⚠️ Not stable

It relies on an open PR that disables some caching features so the build can run in parallel. (IMHO, parallelizing without the cache is still faster than caching in single-threaded mode.)

Important note

I just tried to make something that works for me; I'm not claiming to propose something that works for everyone everywhere.
I'm proposing it here so that advanced users and developers can help, add some tests, and make this available to others.

I hope I'll have some feedback 👍

Notes

Docker size

The downside of this new AMD-capable Docker image is the 28 GB size of the final image. I hope someone can help reduce it.
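
For reference, a minimal multi-stage sketch of the kind of restructuring that could help; the stage names and base images below are assumptions, not what this PR currently does, and the actual build steps are elided:

# Sketch only: compile ONNX Runtime in a throwaway stage and copy just the wheel,
# so the ROCm build toolchain never ends up in the final image.
# Note: the final stage would still need the ROCm *runtime* libraries (omitted here).
FROM rocm/dev-ubuntu-22.04 AS builder
# ... clone and build onnxruntime here, producing /code/onnxruntime/build/Linux/Release/dist/*.whl

FROM python:3.10-slim-bookworm
COPY --from=builder /code/onnxruntime/build/Linux/Release/dist/*.whl /opt/
RUN python3 -m pip install --no-cache-dir /opt/*.whl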

Links

Please see this discussion, where I exchanged with @mertalev on this and posted more detailed explanations, which led me here.

@Zelnes Zelnes requested review from mertalev and bo0tzz as code owners July 12, 2024 23:29
@github-actions github-actions bot added documentation Improvements or additions to documentation 🧠machine-learning labels Jul 12, 2024
Contributor

@mertalev mertalev left a comment


Needs some polishing, but looks very promising!

machine-learning/Dockerfile (outdated; resolved)
machine-learning/Dockerfile (outdated; resolved)
machine-learning/Dockerfile (outdated; resolved)
machine-learning/app/sessions/ort.py (outdated; resolved)
Comment on lines 45 to 51
# I ran into a compilation error when parallelizing the build.
# I used 12 threads to build onnxruntime, but it needs more than 16 GB of RAM, and that's all the RAM my machine has.
# I lowered the number of threads to 8, and it worked.
# Even with 12 threads, the compilation took more than 1.5 hours to fail.
RUN ./build.sh --allow_running_as_root --config Release --build_wheel --update --build --parallel 9 --cmake_extra_defines \
    ONNXRUNTIME_VERSION=1.18.1 --use_rocm --rocm_home=/opt/rocm
Member


Should we run this on Mich @mertalev?

Contributor


Absolutely yes.
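
As an aside on the RAM constraint noted in the Dockerfile comment quoted above (12 build jobs needing more than 16 GB), here is a hedged sketch, not part of this PR, of deriving --parallel from the builder's memory instead of hard-coding it; the roughly-2-GB-per-job figure is an assumption:

# Sketch only: size the build parallelism from MemTotal, assuming ~2 GB per compile job.
RUN jobs="$(( $(awk '/MemTotal/ {print int($2 / 1048576)}' /proc/meminfo) / 2 ))"; \
    if [ "${jobs}" -lt 1 ]; then jobs=1; fi; \
    ./build.sh --allow_running_as_root --config Release --build_wheel --update --build \
        --parallel "${jobs}" --cmake_extra_defines ONNXRUNTIME_VERSION=1.18.1 \
        --use_rocm --rocm_home=/opt/rocm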

@Zelnes Zelnes force-pushed the feature/Add-rocm-support-for-machine-learning branch from d706380 to 8aca62b Compare July 15, 2024 07:59
@Zelnes
Author

Zelnes commented Jul 15, 2024

I did some tests on the latest version.
I launched the 4 ML-related jobs at the same time, using these concurrency settings:

"job": {
  "backgroundTask": {
    "concurrency": 5
  },
  "smartSearch": {
    "concurrency": 5
  },
  "metadataExtraction": {
    "concurrency": 5
  },
  "faceDetection": {
    "concurrency": 8
  },
  "search": {
    "concurrency": 5
  },
  "sidecar": {
    "concurrency": 5
  },
  "library": {
    "concurrency": 5
  },
  "migration": {
    "concurrency": 5
  },
  "thumbnailGeneration": {
    "concurrency": 3
  },
  "videoConversion": {
    "concurrency": 1
  },
  "notifications": {
    "concurrency": 5
  }
}

This crashes, I think because of a memory allocation failure; see the logs below.

Error logs 'Failed to allocate memory'
2024-07-15 10:04:14.356608464 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running PRelu node. Name:'PRelu_1' Status Message: /code/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 9633792

[07/15/24 10:04:14] ERROR    Exception in ASGI application

                             ╭─────── Traceback (most recent call last) ───────╮
                             │ /usr/src/app/main.py:151 in predict             │
                             │                                                 │
                             │   148 │   │   inputs = text                     │
                             │   149 │   else:                                 │
                             │   150 │   │   raise HTTPException(400, "Either  │
                             │ ❱ 151 │   response = await run_inference(inputs │
                             │   152 │   return ORJSONResponse(response)       │
                             │   153                                           │
                             │   154                                           │
                             │                                                 │
                             │ /usr/src/app/main.py:176 in run_inference       │
                             │                                                 │
                             │   173 │   without_deps, with_deps = entries     │
                             │   174 │   await asyncio.gather(*[_run_inference │
                             │   175 │   if with_deps:                         │
                             │ ❱ 176 │   │   await asyncio.gather(*[_run_infer │
                             │   177 │   if isinstance(payload, Image):        │
                             │   178 │   │   response["imageHeight"], response │
                             │   179                                           │
                             │                                                 │
                             │ /usr/src/app/main.py:169 in _run_inference      │
                             │                                                 │
                             │   166 │   │   │   │   message = f"Task {entry[' │
                             │       output of {dep}"                          │
                             │   167 │   │   │   │   raise HTTPException(400,  │
                             │   168 │   │   model = await load(model)         │
                             │ ❱ 169 │   │   output = await run(model.predict, │
                             │   170 │   │   outputs[model.identity] = output  │
                             │   171 │   │   response[entry["task"]] = output  │
                             │   172                                           │
                             │                                                 │
                             │ /usr/src/app/main.py:187 in run                 │
                             │                                                 │
                             │   184 │   if thread_pool is None:               │
                             │   185 │   │   return func(*args, **kwargs)      │
                             │   186 │   partial_func = partial(func, *args, * │
                             │ ❱ 187 │   return await asyncio.get_running_loop │
                             │   188                                           │
                             │   189                                           │
                             │   190 async def load(model: InferenceModel) ->  │
                             │                                                 │
                             │ /usr/lib/python3.10/concurrent/futures/thread.p │
                             │ y:58 in run                                     │
                             │                                                 │
                             │ /usr/src/app/models/base.py:60 in predict       │
                             │                                                 │
                             │    57 │   │   self.load()                       │
                             │    58 │   │   if model_kwargs:                  │
                             │    59 │   │   │   self.configure(**model_kwargs │
                             │ ❱  60 │   │   return self._predict(*inputs, **m │
                             │    61 │                                         │
                             │    62 │   @abstractmethod                       │
                             │    63 │   def _predict(self, *inputs: Any, **mo │
                             │                                                 │
                             │ /usr/src/app/models/facial_recognition/recognit │
                             │ ion.py:52 in _predict                           │
                             │                                                 │
                             │   49 │   │   │   return []                      │
                             │   50 │   │   inputs = decode_cv2(inputs)        │
                             │   51 │   │   cropped_faces = self._crop(inputs, │
                             │ ❱ 52 │   │   embeddings = self._predict_batch(c │
                             │      self._predict_single(cropped_faces)        │
                             │   53 │   │   return self.postprocess(faces, emb │
                             │   54 │                                          │
                             │   55 │   def _predict_batch(self, cropped_faces │
                             │      NDArray[np.float32]:                       │
                             │                                                 │
                             │ /usr/src/app/models/facial_recognition/recognit │
                             │ ion.py:56 in _predict_batch                     │
                             │                                                 │
                             │   53 │   │   return self.postprocess(faces, emb │
                             │   54 │                                          │
                             │   55 │   def _predict_batch(self, cropped_faces │
                             │      NDArray[np.float32]:                       │
                             │ ❱ 56 │   │   embeddings: NDArray[np.float32] =  │
                             │   57 │   │   return embeddings                  │
                             │   58 │                                          │
                             │   59 │   def _predict_single(self, cropped_face │
                             │      NDArray[np.float32]:                       │
                             │                                                 │
                             │ /opt/venv/lib/python3.10/site-packages/insightf │
                             │ ace/model_zoo/arcface_onnx.py:84 in get_feat    │
                             │                                                 │
                             │   81 │   │                                      │
                             │   82 │   │   blob = cv2.dnn.blobFromImages(imgs │
                             │   83 │   │   │   │   │   │   │   │   │     (sel │
                             │      self.input_mean), swapRB=True)             │
                             │ ❱ 84 │   │   net_out = self.session.run(self.ou │
                             │   85 │   │   return net_out                     │
                             │   86 │                                          │
                             │   87 │   def forward(self, batch_data):         │
                             │                                                 │
                             │ /usr/src/app/sessions/ort.py:49 in run          │
                             │                                                 │
                             │    46 │   │   input_feed: dict[str, NDArray[np. │
                             │    47 │   │   run_options: Any = None,          │
                             │    48 │   ) -> list[NDArray[np.float32]]:       │
                             │ ❱  49 │   │   outputs: list[NDArray[np.float32] │
                             │       run_options)                              │
                             │    50 │   │   return outputs                    │
                             │    51 │                                         │
                             │    52 │   @property                             │
                             │                                                 │
                             │ /opt/venv/lib/python3.10/site-packages/onnxrunt │
                             │ ime/capi/onnxruntime_inference_collection.py:22 │
                             │ 0 in run                                        │
                             │                                                 │
                             │    217 │   │   if not output_names:             │
                             │    218 │   │   │   output_names = [output.name  │
                             │    219 │   │   try:                             │
                             │ ❱  220 │   │   │   return self._sess.run(output │
                             │    221 │   │   except C.EPFail as err:          │
                             │    222 │   │   │   if self._enable_fallback:    │
                             │    223 │   │   │   │   print(f"EP Error: {err!s │
                             ╰─────────────────────────────────────────────────╯
                             RuntimeException: [ONNXRuntimeError] : 6 :
                             RUNTIME_EXCEPTION : Non-zero status code returned
                             while running PRelu node. Name:'PRelu_1' Status
                             Message:
                             /code/onnxruntime/onnxruntime/core/framework/bfc_ar
                             ena.cc:376 void*
                             onnxruntime::BFCArena::AllocateRawInternal(size_t,
                             bool, onnxruntime::Stream*, bool,
                             onnxruntime::WaitNotificationFn) Failed to allocate
                             memory for requested buffer of size 9633792

But, with

 "job": {
   "backgroundTask": {
     "concurrency": 5
   },
   "smartSearch": {
-    "concurrency": 5
+    "concurrency": 4
   },
   "metadataExtraction": {
     "concurrency": 5
   },
   "faceDetection": {
-    "concurrency": 8
+    "concurrency": 4
   },
   "search": {
     "concurrency": 5
   },
   "sidecar": {
     "concurrency": 5
   },
   "library": {
     "concurrency": 5
   },
   "migration": {
     "concurrency": 5
   },
   "thumbnailGeneration": {
     "concurrency": 3
   },
   "videoConversion": {
     "concurrency": 1
   },
   "notifications": {
     "concurrency": 5
   }
 }

It works well.

For reference, my server runs on 12 cores and 16 GB of RAM, with an AMD RX 6400.

@Zelnes Zelnes force-pushed the feature/Add-rocm-support-for-machine-learning branch from 8aca62b to 18f5d4d Compare July 15, 2024 10:30
@Zelnes Zelnes requested a review from mertalev July 15, 2024 10:31
Contributor

@mertalev mertalev left a comment


Great job updating the documentation!

machine-learning/Dockerfile (outdated; resolved)
# Even with 12 threads, the compilation took more than 1.5 hours to fail.
RUN ./build.sh --allow_running_as_root --config Release --build_wheel --update --build --parallel 9 --cmake_extra_defines \
    ONNXRUNTIME_VERSION=1.18.1 --use_rocm --rocm_home=/opt/rocm
RUN mv /code/onnxruntime/build/Linux/Release/dist/*.whl /opt/
Contributor


What are the different .whl files? We should only need one.


There's only one, and its name is only known after compilation. I copied this from this Dockerfile, but if you prefer, I can replace it with the full name.

Contributor


For clarity, it'd be nice to make it obvious that there's only one wheel involved here.
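
A hedged sketch, not part of the PR, of one way to keep the glob (the exact wheel name is only known after compilation) while making it explicit that exactly one wheel is expected:

# Sketch only: fail the build if the dist directory ever contains more than one wheel.
RUN set -e; \
    count="$(ls /code/onnxruntime/build/Linux/Release/dist/*.whl | wc -l)"; \
    [ "${count}" -eq 1 ] || { echo "expected exactly one wheel, found ${count}"; exit 1; }; \
    mv /code/onnxruntime/build/Linux/Release/dist/*.whl /opt/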

Comment on lines 118 to 120
if [ "${DEVICE}" != "rocm" ]; then \
extra=libmimalloc2.0; \
fi && \
Contributor


I assume you did this to avoid duplicating the apt command. You can make the apt command multi-line and have this if condition provide either libmimalloc2.0 or nothing.


I'm sorry, but I'm not sure I understand what you mean. Could you provide a suggestion?

machine-learning/Dockerfile (outdated; resolved)
[tool.poetry.group.rocm]
optional = true

[tool.poetry.group.rocm.dependencies]
Contributor


Just thinking about whether we can put anything here.

We could list onnxruntime-rocm with a file path, but that's basically just for our build, not usable by others. It's very easy to do though and at least points to the relevant package.

Ideally, we could build onnxruntime in the base-images repo, have the wheel be a GH artifact, and reference its URL here. This would be nice, but it's not something you need to work on (unless you want to).


I don't have any Poetry knowledge at all, but from what I understood, we'd need to specify the dependency here (for instance onnxruntime) as a local file, and if we do so, we must provide the path to it.
But the wheel is compiled during the build process, so the file can't exist on the host.
And if we update poetry.lock, the file must exist. I tested a few things, but the only way I found was to declare an empty group and perform the installation during the Docker build.

This is a problem for someone who wants to develop on their host without Docker, but I think this is tied to the compilation of onnxruntime, not the way Poetry is configured.

Contributor

@mertalev mertalev Jul 21, 2024


Hmm, that's a fair point. We can leave this as-is for this PR and work on providing it through a link later.
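
For illustration, a hedged sketch of the "file path" option discussed above, using the Poetry CLI rather than editing pyproject.toml by hand; the wheel filename is hypothetical and would only exist inside the Docker build, which is exactly the limitation noted above:

# Sketch only: register the locally built wheel in the optional rocm group.
# The wheel name below is hypothetical; the real name is only known after compilation.
poetry add --group rocm /opt/onnxruntime_rocm-1.18.1-cp310-cp310-linux_x86_64.whl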

@@ -0,0 +1,176 @@
From a598a88db258f82a6e4bca75810921bd6bcee7e0 Mon Sep 17 00:00:00 2001
Contributor

@mertalev mertalev Jul 15, 2024


We need to add a license for this code since it isn't ours. It can go in its own folder with a LICENSE file inside (ONNX Runtime's license). This should also end up in the final image when the wheel is copied (unless the wheel already has the license embedded in it, not sure if it does).


I don't have a clue how this is done; can you help me here?

Contributor


I mean that you can make a patches folder that includes this patch and ONNX Runtime's [license](https://github.com/microsoft/onnxruntime/blob/7ec51f0a13d5d8cbd796de33276bf04210ce6176/LICENSE). For the wheel part, first use pip-licenses to see if the installed onnxruntime has a license in it. If it does, you don't need to do anything. If not, you would copy this LICENSE file in the Dockerfile so it's in the final image.
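
A rough sketch of those two steps, assuming the built distribution is named onnxruntime-rocm and the patch lives under a patches/ folder (both assumptions, not confirmed by this PR):

# Sketch only: check whether the installed wheel already embeds a license file.
RUN python3 -m pip install --no-cache-dir pip-licenses && \
    pip-licenses --packages onnxruntime-rocm --with-license-file
# If it doesn't, ship ONNX Runtime's LICENSE next to the patch in the final image.
COPY patches/LICENSE /licenses/onnxruntime-rocm.LICENSE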


@mghesh-yseop mghesh-yseop left a comment


I won't have the time to work on this before September, so if anyone wants to complete the PR or open a new one, feel free.

@Zelnes Zelnes force-pushed the feature/Add-rocm-support-for-machine-learning branch from 18f5d4d to 840b1a5 Compare July 20, 2024 13:03
Comment on lines 118 to 121
if [ "${DEVICE}" != "rocm" ]; then \
extra=libmimalloc2.0; \
fi && \
apt-get install -y --no-install-recommends tini "${extra}" && \
Contributor


Suggested change
if [ "${DEVICE}" != "rocm" ]; then \
extra=libmimalloc2.0; \
fi && \
apt-get install -y --no-install-recommends tini "${extra}" && \
apt-get install -y --no-install-recommends tini $(if [ "${DEVICE}" != "rocm" ]; then echo "libmimalloc2.0"; fi) && \

@NicholasFlamy
Member

NicholasFlamy commented Oct 3, 2024

The downside of this new AMD-capable Docker image is the 28 GB size of the final image. I hope someone can help reduce it.

It may be necessary to do something similar to what Frigate did and split the containers up per GPU chipset (they later removed that feature due to yolov8 licensing issues; nothing to do with AMD, the code was just specific to running yolov8 ONNX models):
blakeblackshear/frigate#9762

(screenshot attached)

@zackpollard zackpollard marked this pull request as draft November 13, 2024 01:30
@zackpollard
Contributor

I've converted this to a draft as it seems very much still a WIP.

@Zelnes can you update if this is still something you're trying to pursue getting merged into Immich?
@mertalev can you give some info on whether this is something we want to merge and what steps would still be left to get to that point?

@mertalev
Contributor

mertalev commented Nov 13, 2024

Definitely want to see this merged. I think the remaining work can be broken down into:

  1. Updating and rebasing the PR to use more recent dependencies
  2. Shrinking the image size
  3. Testing correctness
  4. (optional) Moving the build step to be a separate release artifact to be installed in normal builds

@Zelnes
Author

Zelnes commented Nov 13, 2024

I've converted this to a draft as it seems very much still a WIP.

Thanks, it makes sense indeed.

@Zelnes can you update if this is still something you're trying to pursue getting merged into Immich?

Unfortunately I don't have the time, and I don't have the hardware to pursue the work and testing.

Sorry I didn't keep you updated before; I hope this can be merged one day.

@zackpollard
Contributor

@mertalev, given the above comment, is this something you'd want to pursue yourself? If not, I'm not sure it's worth keeping this open.

@mertalev
Contributor

Yes, except for (3).

@NicholasFlamy
Member

Yes, except for (3).

(If this means testing, I can help; it's just that I've been very busy recently. I think I'm getting to the point where I'll have more free time, though.)

@zackpollard
Contributor

Yes, except for (3).

Alright cool, will leave it with you then! 😄

@NicholasFlamy
Member

NicholasFlamy commented Nov 17, 2024

I got it running and working correctly (note about something funky later) on my PC:
AMD Ryzen 5 5600X CPU with 32 GB of RAM
AMD Radeon RX 6700 XT GPU
Debian 12 with KDE Plasma on a 240 GB SSD with LVM and split root and home.

Funky thing: https://discord.com/channels/979116623879368755/1291425089539018907/1307529638985203732 (screenshot attached)
This was after I re-ran smart search and face detection. My guess is I had too much going on, and when the GPU got slammed with the AI workload, the display part of the driver crapped out. Even then, Immich kept working, and so did the rest of the system (I accessed Immich from my phone).

Edit: I theorize this is just funky AMD (because unstable drivers were normal with their 5000 series) and/or Linux behavior.

@mertalev mertalev force-pushed the feature/Add-rocm-support-for-machine-learning branch from 840b1a5 to 46c505a Compare December 19, 2024 23:32
Contributor

github-actions bot commented Dec 20, 2024

📖 Documentation deployed to pr-11063.preview.immich.app

Labels: changelog:feature, documentation, 🧠machine-learning