Per-node memory is not displayed in the user space #1460

Closed
liu-shaobo opened this issue Nov 20, 2024 · 15 comments
Labels
bug Something isn't working

Comments

@liu-shaobo

liu-shaobo commented Nov 20, 2024

What happened

Per-node memory is not displayed in the user space.
image

  • Memory fields reported by scontrol show node=nodename
   RealMemory=3932160 AllocMem=0 FreeMem=4119101 Sockets=4 Boards=1

Environment

- OS: Ubuntu 20.04
- Scheduler: slurm-22.05.11
- Docker: 24.0.9
- Docker-compose: 2.24.1
- SCOW cli: v1.5.2
- SCOW: v1.5.2
- Adapter: slurm-adapter v1.5.0
@liu-shaobo liu-shaobo added the bug Something isn't working label Nov 20, 2024
@liu-shaobo liu-shaobo changed the title from "[Bug/Help] auth: User does not exist when I login" to "Per-node memory is not displayed in the user space" Nov 20, 2024
@piccaSun
Contributor

Is this problem still occurring? If so, please check whether the partition memory information printed by scontrol show partition=C96T4 in Slurm is displayed correctly.

@liu-shaobo
Author

liu-shaobo commented Nov 27, 2024

Is the display broken because the mem value ends in T? As far as I can tell, the adapter reads the RealMemory value.
image

@283713406

Please run this command: scontrol show node=node221 | grep RealMemory=| awk '{print $1}' | awk -F'=' '{print $2}'
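
For readers following along in Go, the extraction that the pipeline above performs can be sketched roughly as follows. This is only an illustration, not the adapter's code: the function name realMemoryField is hypothetical, and the sample text is the RealMemory line quoted in the issue description.

```go
package main

import (
	"fmt"
	"strings"
)

// realMemoryField returns the raw value of the RealMemory= field found in
// the text printed by `scontrol show node=<name>`. The boolean is false
// when no such field is present. (Hypothetical helper for illustration.)
func realMemoryField(scontrolOutput string) (string, bool) {
	for _, field := range strings.Fields(scontrolOutput) {
		if strings.HasPrefix(field, "RealMemory=") {
			return strings.TrimPrefix(field, "RealMemory="), true
		}
	}
	return "", false
}

func main() {
	// Sample line copied from the issue description.
	sample := "RealMemory=3932160 AllocMem=0 FreeMem=4119101 Sockets=4 Boards=1"
	value, ok := realMemoryField(sample)
	fmt.Println(value, ok)
}
```

Slurm reports RealMemory in megabytes, so the returned string is a plain integer with no unit suffix.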

@liu-shaobo
Author

image

@liu-shaobo
Author

liu-shaobo commented Nov 28, 2024

SCOW is at 1.5.2 and the adapter at 1.5.0. I set the adapter's log level to trace.

# Log level
log:
  level: "trace"

After restarting the adapter, the level is still info. Why is that?

{"level":"info","msg":"Received request GetClusterConfig: ","time":"2024-11-28T17:13:43+08:00"}

@piccaSun
Contributor

piccaSun commented Nov 29, 2024

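Hello. Version 1.5.0 does not yet handle the trace log level.
Update the adapter to master and set the log level to trace to get more detailed logs.

We have located the cause of this issue and will fix it in a later change. Thank you for reporting it.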
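
One way such a symptom can arise is a level parser that has no case for "trace" and silently falls back to info. The sketch below is purely illustrative and is not taken from the adapter: the Level type, its constants, and parseLevel are all hypothetical names. It returns a flag for unrecognised names so that the caller can warn instead of quietly using the default.

```go
package main

import (
	"fmt"
	"strings"
)

// Level is a hypothetical log severity, ordered from most to least verbose.
type Level int

const (
	LevelTrace Level = iota
	LevelDebug
	LevelInfo
	LevelWarn
	LevelError
)

// parseLevel maps a configured name to a Level. The boolean is false when
// the name is not recognised, in which case the info default is returned.
func parseLevel(name string) (Level, bool) {
	switch strings.ToLower(strings.TrimSpace(name)) {
	case "trace":
		return LevelTrace, true
	case "debug":
		return LevelDebug, true
	case "info":
		return LevelInfo, true
	case "warn", "warning":
		return LevelWarn, true
	case "error":
		return LevelError, true
	default:
		return LevelInfo, false
	}
}

func main() {
	for _, name := range []string{"trace", "verbose"} {
		level, ok := parseLevel(name)
		if !ok {
			fmt.Printf("unknown log level %q, falling back to info\n", name)
		}
		fmt.Println(name, "->", level)
	}
}
```

Removing the "trace" case from the switch reproduces the reported behaviour: the configured value is accepted without complaint, yet the effective level stays at info.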

@283713406

@liu-shaobo You can build an adapter from the adapter's fix-memory branch, which should resolve this problem: https://github.com/PKUHPC/scow-slurm-adapter/tree/fix-memory

@liu-shaobo
Author

OK, I'll rebuild and try it.

@283713406

@liu-shaobo Has this problem been resolved?

@liu-shaobo
Author

liu-shaobo commented Dec 5, 2024

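I was away for a few days and did not get to build it. I have now built from the fix-memory branch and tested it on scow-1.5.2; opening the SCOW management system returns a 500 error.

  • slurm-adapter trace log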
[2024-12-05 10:33:22] [trace] [config.go:797 scow-slurm-adapter/services/config.(*ServerConfig).GetAvailablePartitions] GetAvailablePartitions: partitions:{name:"C96T4"  cores:576  nodes:6  qos:"normal"  comment:""}
  • mis-server log
mis-server-1  | {"level":30,"time":"2024-12-05T02:38:57.794Z","pid":18,"hostname":"c24cde95c3ba","plugin":"price","msg":"Tenant specific prices {}"}
mis-server-1  | {"level":50,"time":"2024-12-05T02:38:57.850Z","pid":18,"hostname":"c24cde95c3ba","plugin":"price","err":{"type":"Error","message":"13 INTERNAL: convert memory error","stack":"Error: 13 INTERNAL: convert memory error\n    at callErrorFromStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client.js:193:76)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n    at /app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n    at ServiceClientImpl.makeUnaryRequest (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client.js:161:32)\n    at ServiceClientImpl.getClusterConfig (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n    at /app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:18:13\n    at new Promise (<anonymous>)\n    at asyncClientCall (/app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:15:12)\n    at /app/apps/mis-server/build/bl/PriceMap.js:49:117\n    at /app/apps/mis-server/build/plugins/clusters.js:84:24\n    at Array.map (<anonymous>)\n    at Object.callOnAll (/app/apps/mis-server/build/plugins/clusters.js:83:18)\n    at createPriceMap (/app/apps/mis-server/build/bl/PriceMap.js:49:39)","code":13,"details":"convert memory error","metadata":{"content-type":["application/grpc"],"grpc-status-details-bin":[{"type":"Buffer","data":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}},"msg":"Executing on hpc01 failed"}
mis-server-1  | {"level":50,"time":"2024-12-05T02:38:57.851Z","pid":18,"hostname":"c24cde95c3ba","plugin":"price","msg":"Cluster ops fails at clusters [{\"cluster\":\"hpc01\",\"error\":{\"code\":13,\"details\":\"convert memory error\",\"metadata\":{\"content-type\":[\"application/grpc\"],\"grpc-status-details-bin\":[{\"type\":\"Buffer\",\"data\":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}}}]"}
mis-server-1  | /app/apps/mis-server/build/plugins/clusters.js:107
mis-server-1  |                 throw new tsgrpc_common_1.ServiceError({
mis-server-1  |                       ^
mis-server-1  |
mis-server-1  | ServiceError
mis-server-1  |     at Object.callOnAll (/app/apps/mis-server/build/plugins/clusters.js:107:23)
mis-server-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
mis-server-1  |     at async createPriceMap (/app/apps/mis-server/build/bl/PriceMap.js:49:19)
mis-server-1  |     at async /app/apps/mis-server/build/plugins/price.js:20:22
mis-server-1  |     at async Server.register (/app/node_modules/.pnpm/@ddadaal+tsgrpc-server@0.19.5_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-server/lib/server.js:117:9)
mis-server-1  |     at async createServer (/app/apps/mis-server/build/app.js:41:9)
mis-server-1  |     at async main (/app/apps/mis-server/build/index.js:19:20) {
mis-server-1  |   code: 13,
mis-server-1  |   details: 'Cluster ID : hpc01, Details : Error: 13 INTERNAL: convert memory error',
mis-server-1  |   metadata: Metadata {
mis-server-1  |     internalRepr: Map(3) {
mis-server-1  |       'is_scow_error' => [ '1' ],
mis-server-1  |       'scow_error_code' => [ 'CLUSTEROPS_ERROR' ],
mis-server-1  |       'clustererrors' => [
mis-server-1  |         '[{"clusterId":"hpc01","details":{"code":13,"details":"convert memory error","metadata":{"content-type":["application/grpc"],"grpc-status-details-bin":[{"type":"Buffer","data":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}}}]'
mis-server-1  |       ]
mis-server-1  |     },
mis-server-1  |     options: {}
mis-server-1  |   }
mis-server-1  | }
mis-server-1  |
mis-server-1  | Node.js v20.13.1
  • Log after switching back to 1.5.0
[2024-12-05 10:42:07] [info] [config.go:34 scow-slurm-adapter/services/config.(*ServerConfig).GetClusterConfig] Received request GetClusterConfig:
[2024-12-05 10:42:07] [error] [config.go:183 scow-slurm-adapter/services/config.(*ServerConfig).GetClusterConfig] GetClusterConfig failed: rpc error: code = Internal desc = convert memory error
  • Build error when compiling on top of 1.5.2
root@node196:~/scow-slurm-adapter# make build
CGO_BUILD=0 GOARCH=amd64 go build -o scow-slurm-adapter-amd64
# scow-slurm-adapter
./main.go:1391:5: undefined: caller
make: *** [Makefile:10: build] Error 1
  • Contents of main.go
1374                         // if strings.Contains(totalMemsTmp, "M") {
1375                         // if stri      totalMemsInt, _ := strconv.Atoi(strings.Split(totalMemsTmp, "M")[0])
1376                         // if stri      totalMems = totalMemsInt
1377                         // if stri} else if strings.Contains(totalMemsTmp, "G") {
1378                         // if stri      totalMemsInt, _ := strconv.Atoi(strings.Split(totalMemsTmp, "G")[0])
1379                         // if stri      totalMems = totalMemsInt * 1024
1380                         // if stri} else if strings.Contains(totalMemsTmp, "T") {
1381                         // if stri      totalMemsInt, _ := strconv.Atoi(strings.Split(totalMemsTmp, "T")[0])
1382                         // if stri      totalMems = totalMemsInt * 1024 * 1024
1383                         memString := strings.Split(totalMemsTmp, "M")[0]
1384                         totalMemInt, err = utils.ConvertMemory(memString)
1385                         if err != nil {
1386                                 errInfo := &errdetails.ErrorInfo{
1387                                         Reason: "CONVERT_MEMORY_FAILED",
1388                                 }
1389                                 st := status.New(codes.Internal, "convert memory error")
1390                                 st, _ = st.WithDetails(errInfo)
1391                                 caller.Logger.Errorf("GetClusterConfig failed: %v", st.Err())
1392                                 return nil, st.Err()
1393                         }
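
For context on what a unit-aware conversion needs to do, here is a minimal sketch that turns values such as 3932160M, 3840G or 4T into megabytes. It is not the adapter's utils.ConvertMemory, whose implementation is not shown in this thread; the function name toMegabytes and the rule that a bare number is already in MB are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// toMegabytes converts a memory string such as "3932160", "3932160M",
// "3840G" or "4T" into a whole number of megabytes.
// Assumption: a value without a suffix is already in MB, as RealMemory is.
func toMegabytes(s string) (int64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	if s == "" {
		return 0, fmt.Errorf("empty memory value")
	}
	multiplier := 1.0
	switch s[len(s)-1] {
	case 'M':
		s = s[:len(s)-1]
	case 'G':
		multiplier = 1024
		s = s[:len(s)-1]
	case 'T':
		multiplier = 1024 * 1024
		s = s[:len(s)-1]
	}
	// ParseFloat also accepts fractional values such as "1.5".
	value, err := strconv.ParseFloat(s, 64)
	if err != nil || value < 0 || math.IsNaN(value) || math.IsInf(value, 0) {
		return 0, fmt.Errorf("invalid memory value %q", s)
	}
	return int64(math.Round(value * multiplier)), nil
}

func main() {
	for _, in := range []string{"3932160M", "3840G", "4T", "oops"} {
		mb, err := toMegabytes(in)
		fmt.Println(in, mb, err)
	}
}
```

With this sketch, 3840G converts to 3932160 MB, which equals the RealMemory value reported at the top of the issue, and an unparseable value returns an error instead of being silently treated as zero.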

@283713406

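@liu-shaobo Sorry, the code on the fix-memory branch had a small problem. I have changed it again, as follows:
image
Could you rebuild from the fix-memory branch and try again? Many thanks!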

@liu-shaobo
Author

liu-shaobo commented Dec 5, 2024

Built from the fix-memory branch and tested on scow-1.5.2; opening the SCOW management system returns a 500 error.

  • Contents of services/config/config.go
image
  • slurm-adapter trace log
[2024-12-05 15:20:25] [error] [config.go:182 scow-slurm-adapter/services/config.(*ServerConfig).GetClusterConfig] GetClusterConfig failed: rpc error: code = Internal desc = convert memory error
[2024-12-05 15:20:26] [trace] [version.go:14 scow-slurm-adapter/services/version.(*ServerVersion).GetVersion] Adapter Version is: major:1 minor:6
[2024-12-05 15:20:26] [info] [config.go:959 scow-slurm-adapter/services/config.(*ServerConfig).GetClusterInfo] Received request GetClusterInfo:
[2024-12-05 15:20:26] [trace] [config.go:1216 scow-slurm-adapter/services/config.(*ServerConfig).GetClusterInfo] GetClusterInfo: cluster_name:"hpc01" partitions:{partition_name:"C96T2" node_count:1 idle_node_count:1 cpu_core_count:4 idle_cpu_count:4 partition_status:AVAILABLE}
  • mis-server log
mis-server-1  | > @scow/mis-server@1.5.2 serve
mis-server-1  | > node build/index.js
mis-server-1  |
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.149Z","pid":18,"hostname":"94d7ff0d503b","msg":"Hook is not configured."}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.391Z","pid":18,"hostname":"94d7ff0d503b","version":{"tag":"v1.5.2","commit":"665f5316212c99f7d461736134c1744e1b38084b"},"msg":"@scow/mis-server: "}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.391Z","pid":18,"hostname":"94d7ff0d503b","config":{"HOST":"0.0.0.0","PORT":5000,"LOG_LEVEL":"info","LOG_PRETTY":false,"SSH_PRIVATE_KEY_PATH":"/root/.ssh/id_rsa","SSH_PUBLIC_KEY_PATH":"/root/.ssh/id_rsa.pub","AUTH_URL":"","DB_PASSWORD":"must!chang3this"},"msg":"Loaded env config"}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.673Z","pid":18,"hostname":"94d7ff0d503b","msg":"Checking if root can login to hpc01 by login node node196"}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.727Z","pid":18,"hostname":"94d7ff0d503b","msg":"Root can login to hpc01 by login node node196"}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.753Z","pid":18,"hostname":"94d7ff0d503b","plugin":"price","msg":"Default Price Map: {\"hpc01.C96T2.normal\":{\"id\":1,\"itemId\":\"1\",\"path\":[\"hpc01\",\"C96T2\",\"normal\"],\"description\":\"\",\"price\":\"0.1\",\"amount\":\"cpusAlloc\",\"createTime\":\"2024-12-05T01:30:02.378Z\"}}"}
mis-server-1  | {"level":30,"time":"2024-12-05T07:20:25.753Z","pid":18,"hostname":"94d7ff0d503b","plugin":"price","msg":"Tenant specific prices {}"}
mis-server-1  | {"level":50,"time":"2024-12-05T07:20:25.809Z","pid":18,"hostname":"94d7ff0d503b","plugin":"price","err":{"type":"Error","message":"13 INTERNAL: convert memory error","stack":"Error: 13 INTERNAL: convert memory error\n    at callErrorFromStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client.js:193:76)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n    at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n    at /app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n    at ServiceClientImpl.makeUnaryRequest (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/client.js:161:32)\n    at ServiceClientImpl.getClusterConfig (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.6/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n    at /app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:18:13\n    at new Promise (<anonymous>)\n    at asyncClientCall (/app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:15:12)\n    at /app/apps/mis-server/build/bl/PriceMap.js:49:117\n    at /app/apps/mis-server/build/plugins/clusters.js:84:24\n    at Array.map (<anonymous>)\n    at Object.callOnAll (/app/apps/mis-server/build/plugins/clusters.js:83:18)\n    at createPriceMap (/app/apps/mis-server/build/bl/PriceMap.js:49:39)","code":13,"details":"convert memory error","metadata":{"content-type":["application/grpc"],"grpc-status-details-bin":[{"type":"Buffer","data":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}},"msg":"Executing on hpc01 failed"}
mis-server-1  | {"level":50,"time":"2024-12-05T07:20:25.810Z","pid":18,"hostname":"94d7ff0d503b","plugin":"price","msg":"Cluster ops fails at clusters [{\"cluster\":\"hpc01\",\"error\":{\"code\":13,\"details\":\"convert memory error\",\"metadata\":{\"content-type\":[\"application/grpc\"],\"grpc-status-details-bin\":[{\"type\":\"Buffer\",\"data\":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}}}]"}
mis-server-1  | /app/apps/mis-server/build/plugins/clusters.js:107
mis-server-1  |                 throw new tsgrpc_common_1.ServiceError({
mis-server-1  |                       ^
mis-server-1  |
mis-server-1  | ServiceError
mis-server-1  |     at Object.callOnAll (/app/apps/mis-server/build/plugins/clusters.js:107:23)
mis-server-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
mis-server-1  |     at async createPriceMap (/app/apps/mis-server/build/bl/PriceMap.js:49:19)
mis-server-1  |     at async /app/apps/mis-server/build/plugins/price.js:20:22
mis-server-1  |     at async Server.register (/app/node_modules/.pnpm/@ddadaal+tsgrpc-server@0.19.5_@grpc+grpc-js@1.10.6/node_modules/@ddadaal/tsgrpc-server/lib/server.js:117:9)
mis-server-1  |     at async createServer (/app/apps/mis-server/build/app.js:41:9)
mis-server-1  |     at async main (/app/apps/mis-server/build/index.js:19:20) {
mis-server-1  |   code: 13,
mis-server-1  |   details: 'Cluster ID : hpc01, Details : Error: 13 INTERNAL: convert memory error',
mis-server-1  |   metadata: Metadata {
mis-server-1  |     internalRepr: Map(3) {
mis-server-1  |       'is_scow_error' => [ '1' ],
mis-server-1  |       'scow_error_code' => [ 'CLUSTEROPS_ERROR' ],
mis-server-1  |       'clustererrors' => [
mis-server-1  |         '[{"clusterId":"hpc01","details":{"code":13,"details":"convert memory error","metadata":{"content-type":["application/grpc"],"grpc-status-details-bin":[{"type":"Buffer","data":[8,13,18,20,99,111,110,118,101,114,116,32,109,101,109,111,114,121,32,101,114,114,111,114,26,67,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,23,10,21,67,79,78,86,69,82,84,95,77,69,77,79,82,89,95,70,65,73,76,69,68]}]}}}]'
mis-server-1  |       ]
mis-server-1  |     },
mis-server-1  |     options: {}
mis-server-1  |   }
mis-server-1  | }
mis-server-1  |
mis-server-1  | Node.js v20.13.1
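
The grpc-status-details-bin byte array in the log above is a serialized google.rpc.Status message, and it can be read without extra dependencies. The following is a minimal sketch that handles only the two protobuf wire types this payload uses (varint and length-delimited); the type and function names (field, statusInfo, parseFields, decodeStatus) are hypothetical, and a real client would normally use the generated protobuf types instead.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// sample is the grpc-status-details-bin byte array copied from the log above.
var sample = []byte{8, 13, 18, 20, 99, 111, 110, 118, 101, 114, 116, 32, 109, 101, 109, 111, 114, 121, 32, 101, 114, 114, 111, 114, 26, 67, 10, 40, 116, 121, 112, 101, 46, 103, 111, 111, 103, 108, 101, 97, 112, 105, 115, 46, 99, 111, 109, 47, 103, 111, 111, 103, 108, 101, 46, 114, 112, 99, 46, 69, 114, 114, 111, 114, 73, 110, 102, 111, 18, 23, 10, 21, 67, 79, 78, 86, 69, 82, 84, 95, 77, 69, 77, 79, 82, 89, 95, 70, 65, 73, 76, 69, 68}

// field is one decoded protobuf field: a varint value or length-delimited bytes.
type field struct {
	num    uint64
	varint uint64
	data   []byte
}

// parseFields walks protobuf wire data, supporting only wire types 0 and 2.
func parseFields(b []byte) ([]field, error) {
	var out []field
	for len(b) > 0 {
		key, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("bad field key")
		}
		b = b[n:]
		f := field{num: key >> 3}
		// Both supported wire types start with a varint: the value itself
		// (type 0) or the payload length (type 2).
		value, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("bad varint")
		}
		b = b[n:]
		switch key & 7 {
		case 0:
			f.varint = value
		case 2:
			if value > uint64(len(b)) {
				return nil, errors.New("length exceeds input")
			}
			f.data, b = b[:value], b[value:]
		default:
			return nil, fmt.Errorf("unsupported wire type %d", key&7)
		}
		out = append(out, f)
	}
	return out, nil
}

// statusInfo holds the parts of google.rpc.Status this sketch extracts.
type statusInfo struct {
	Code    uint64
	Message string
	TypeURL string
	Reason  string
}

// decodeStatus reads Status{code=1, message=2, details=3}, where each detail
// is an Any{type_url=1, value=2} assumed to wrap an ErrorInfo{reason=1}.
func decodeStatus(b []byte) (statusInfo, error) {
	var s statusInfo
	top, err := parseFields(b)
	if err != nil {
		return s, err
	}
	for _, f := range top {
		switch f.num {
		case 1:
			s.Code = f.varint
		case 2:
			s.Message = string(f.data)
		case 3:
			anyFields, err := parseFields(f.data)
			if err != nil {
				return s, err
			}
			for _, a := range anyFields {
				if a.num == 1 {
					s.TypeURL = string(a.data)
				} else if a.num == 2 {
					info, err := parseFields(a.data)
					if err != nil {
						return s, err
					}
					for _, i := range info {
						if i.num == 1 {
							s.Reason = string(i.data)
						}
					}
				}
			}
		}
	}
	return s, nil
}

func main() {
	s, err := decodeStatus(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("code=%d message=%q type=%q reason=%q\n", s.Code, s.Message, s.TypeURL, s.Reason)
}
```

Decoding the array yields code 13, the message "convert memory error", the type URL type.googleapis.com/google.rpc.ErrorInfo, and the reason CONVERT_MEMORY_FAILED, which is the same reason string set in the adapter code quoted earlier in the thread.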

@liu-shaobo
Author

The slurm-adapter built from the fix-memory branch now fixes the problem.
PKUHPC/scow-slurm-adapter@0371344
image

@283713406 283713406 removed their assignment Dec 5, 2024
@piccaSun
Contributor

piccaSun commented Dec 5, 2024

The fix above has been merged into the adapter's main branch: https://github.com/PKUHPC/scow-slurm-adapter
This problem is fixed; closing this issue.

@piccaSun piccaSun closed this as completed Dec 5, 2024