install ROCm has a mistake #107
Comments
This should probably be reported in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver Are there any error messages in the dmesg log? |
There are some error messages in the dmesg log, and I don't know how to fix them:
[ 0.636557] pci 0000:00:00.2: AMD-Vi: Unable to read/write to IOMMU perf counter.
[ 2.300156] snd_pci_acp3x 0000:03:00.5: Invalid ACP audio mode : 1
[ 4.212601] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.1 (-110).
[ 5.236614] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.1 (-110).
[ 6.260365] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.2.1 (-110).
[ 7.284212] amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.3.1 (-110).
[ 7.292712] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
|
Something is clearly going wrong during driver initialization at boot time. I cannot give you a diagnosis from a few hand-picked error messages. That usually leads to incorrect conclusions. Please provide a complete kernel log, which will include a lot more context to work with: kernel version, boot parameters, PCI device list, memory map, other errors you may have missed, etc. Can you also provide the output of "dkms status"? |
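A minimal sketch of how the requested information could be gathered in one pass, so that nothing is hand-picked. The helper names (`run_command`, `build_report`) and the section header format are assumptions made for illustration; only the `dmesg` and `dkms status` commands come from the thread, and `dmesg` may need root privileges on some systems.

```python
import subprocess

# Commands whose complete, untrimmed output was requested in this thread.
COMMANDS = {
    "dmesg": ["dmesg"],
    "dkms_status": ["dkms", "status"],
}


def run_command(argv):
    """Run one command and return its output, or a note if it could not start."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
    except OSError as exc:
        return f"(could not run {argv[0]}: {exc})\n"
    # Keep stderr as well, since permission errors are reported there.
    return result.stdout + result.stderr


def build_report(commands=COMMANDS, runner=run_command):
    """Concatenate the full output of every command under its own header."""
    sections = []
    for name, argv in commands.items():
        header = f"===== {name}: {' '.join(argv)} ====="
        sections.append(f"{header}\n{runner(argv)}")
    return "\n".join(sections)
```

Writing the string returned by `build_report()` to a text file gives a single attachment that can be dragged onto the issue page.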
I uninstalled Ubuntu 20.04 and installed Ubuntu 18.04.5 LTS, and ROCm installed successfully.
Then I installed TensorFlow and Jupyter to test it. The code is:
import tensorflow as tf
tf.__version__
tf.test.is_gpu_available()
But it fails with an error (the hipErrorNoBinaryForGpu line in the log below):
cfl@cfl-KPR-WX9:~/ts$ jupyter-notebook
[I 18:01:22.614 NotebookApp] Serving notebooks from local directory: /home/cfl/ts
[I 18:01:22.614 NotebookApp] 0 active kernels
[I 18:01:22.614 NotebookApp] The Jupyter Notebook is running at:
[I 18:01:22.614 NotebookApp] http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:22.614 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:01:22.615 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=694b4d14330f2b194fa1a1c24e250f16d03b40e9ad650245
[I 18:01:23.750 NotebookApp] Accepting one-time-token-authenticated connection from 127.0.0.1
[W 18:01:24.537 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 15.10ms referer=http://localhost:8888/tree
[W 18:01:24.545 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 2.06ms referer=http://localhost:8888/tree
[W 18:01:26.569 NotebookApp] 404 GET /static/components/moment/locale/zh-cn.js?v=20201217180122 (127.0.0.1) 1.58ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.573 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20201217180122 (127.0.0.1) 1.74ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[W 18:01:26.575 NotebookApp] 404 GET /i18n/zh-CN/LC_MESSAGES/nbjs.json?v=20201217180122 (127.0.0.1) 1.59ms referer=http://localhost:8888/notebooks/Untitled.ipynb
[I 18:01:26.707 NotebookApp] Kernel started: 9eab146c-0971-422e-89db-e301e3558abb
2020-12-17 18:01:35.760935: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-17 18:01:35.789512: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2295605000 Hz
2020-12-17 18:01:35.790200: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ae6580 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-17 18:01:35.790243: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-12-17 18:01:35.793787: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
[I 18:01:38.704 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 9eab146c-0971-422e-89db-e301e3558abb restarted
[I 18:03:26.701 NotebookApp] Saving file at /Untitled.ipynb
|
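I suggest running lspci -vt to show information about the GPU. BTW: does RX Vega10 mean RX Vega 64 or an APU? ROCm doesn't support APUs yet. |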
Also please attach a kernel log / dmesg output as fxkamd suggested. |
The dmesg log is attached. Please check it.
|
lspci -vt:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Device 15d0
           +-00.2 Advanced Micro Devices, Inc. [AMD] Device 15d1
           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-01.2-[01]----00.0 Intel Corporation Wireless 8265 / 8275
           +-01.7-[02]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
           +-08.1-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Picasso
           |            +-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Device 15df
           |            +-00.3 Advanced Micro Devices, Inc. [AMD] Device 15e0
           |            +-00.4 Advanced Micro Devices, Inc. [AMD] Device 15e1
           |            +-00.5 Advanced Micro Devices, Inc. [AMD] Device 15e2
           |            \-00.6 Advanced Micro Devices, Inc. [AMD] Device 15e3
           +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0 Advanced Micro Devices, Inc. [AMD] Device 15e8
           +-18.1 Advanced Micro Devices, Inc. [AMD] Device 15e9
           +-18.2 Advanced Micro Devices, Inc. [AMD] Device 15ea
           +-18.3 Advanced Micro Devices, Inc. [AMD] Device 15eb
           +-18.4 Advanced Micro Devices, Inc. [AMD] Device 15ec
           +-18.5 Advanced Micro Devices, Inc. [AMD] Device 15ed
           +-18.6 Advanced Micro Devices, Inc. [AMD] Device 15ee
           \-18.7 Advanced Micro Devices, Inc. [AMD] Device 15ef
But I don't know what that means...
I think RX Vega10 is a GPU, because I see the GPU label in Windows 10, but I don't know whether it means RX Vega 64.
My laptop is an HONOR (HUAWEI) MagicBook 2019 with an AMD 3700U.
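For reading a tree like the one above, a rough sketch that lists the functions carrying the [AMD/ATI] vendor tag, which is where the GPU shows up in this output. The function name `amd_ati_devices` and the regular expression are assumptions for illustration, not part of lspci; the tag can also appear on non-display functions of the same device, so the result is only a shortlist.

```python
import re

# A "slot.function" pair followed by its description at the end of a line,
# e.g. "00.0 Advanced Micro Devices, Inc. [AMD/ATI] Picasso".
ENTRY = re.compile(r"([0-9a-f]{2}\.[0-9a-f])\s+(.+)$")


def amd_ati_devices(lspci_text):
    """Return (slot.function, description) pairs tagged [AMD/ATI]."""
    found = []
    for line in lspci_text.splitlines():
        match = ENTRY.search(line)
        if match and "[AMD/ATI]" in match.group(2):
            found.append((match.group(1), match.group(2).strip()))
    return found


# Three lines copied from the lspci -vt output in this thread.
sample = """\
           +-01.2-[01]----00.0 Intel Corporation Wireless 8265 / 8275
           +-08.1-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Picasso
           |            +-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
"""
for slot, description in amd_ati_devices(sample):
    print(slot, description)
```

Here the entry named Picasso is the integrated GPU of the 3700U, which makes this an APU rather than a discrete RX Vega 64.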
|
Thanks, but I don't see an attachment. Looks like you responded by email rather than the web page - it's possible that attachments via email don't get included. The web page dialog suggests that you have to drag & drop or paste attachments. Looks like your GPU is the integrated GPU of a Picasso (3700U) so as fxkamd mentioned it's not officially supported under HIP yet. @fxkamd I think Picasso is the first APU where we used GPUVM code paths rather than ATC/IOMMU but I don't know if that helps at all. |
Picasso is the same as Raven. It uses the IOMMUv2 code path by default. But we recently added fallbacks for systems with disabled IOMMUv2 or broken/missing CRAT tables where we treat it as a dGPU. I'm not sure whether that has made it into ROCm release branches yet. The error message "/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")" comes from the HIP runtime. It looks like the GPU code was not compiled for the correct ISA for your GPU. |
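The ISA mismatch described above can be sketched as a simple set comparison: the gfx names the device reports against the gfx names a binary was built for. Everything specific here is an assumption for illustration: the helper names, the sample agent text (modelled on the kind of "gfx..." names rocminfo prints), the gfx902 value, and the list of build targets. None of these values were confirmed in this thread.

```python
import re

# gfx ISA names look like "gfx900" or "gfx90c".
GFX = re.compile(r"\bgfx[0-9a-f]+\b")


def gfx_targets(text):
    """Return the sorted set of gfx ISA names mentioned in text."""
    return sorted(set(GFX.findall(text)))


def missing_targets(agent_text, binary_targets):
    """Return ISAs reported by the device that the binary was not built for."""
    built = set(binary_targets)
    return [isa for isa in gfx_targets(agent_text) if isa not in built]


# Hypothetical excerpt of device information (assumed values, not from the thread).
agent_text = """\
  Name:                    gfx902
  ISA 1
    Name:                  amdgcn-amd-amdhsa--gfx902
"""
# Hypothetical list of targets a prebuilt package was compiled for.
built_for = ["gfx803", "gfx900", "gfx906"]

print(missing_targets(agent_text, built_for))
```

A non-empty result corresponds to the situation in which the HIP runtime has no code object it can load for the device.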
Here is the dmesg log. Will my GPU be supported in the future?
Ubuntu 20.04 + Radeon RX Vega 10 Graphics.
Running /opt/rocm/bin/rocminfo fails with an error:
ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
cfl is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
How can I fix it?
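A small sketch for ruling out the simplest cause, a missing or inaccessible device node. The function name `describe_device_node` is an assumption for illustration. In the output above the user is already in the render group and the failure is "Bad address" rather than a permission error, so a clean result here points to the driver rather than to file permissions.

```python
import os
import stat


def describe_device_node(path="/dev/kfd"):
    """Report whether path exists, is a character device, and is read-write for us."""
    info = {
        "path": path,
        "exists": os.path.exists(path),
        "char_device": False,
        "read_write": False,
    }
    if info["exists"]:
        info["char_device"] = stat.S_ISCHR(os.stat(path).st_mode)
        # os.access checks the permission bits only; it does not open the device.
        info["read_write"] = os.access(path, os.R_OK | os.W_OK)
    return info


print(describe_device_node())
```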