Upgraded Xen-tools from 4.15 to 4.19.0 #4186

Merged
merged 4 commits into from
Oct 1, 2024


roja-zededa
Contributor

  1. Added ninja through apk in xen-tools/Dockerfile and xen/Dockerfile
  2. Changed the seabios and Xen versions to 1.16.3 and 4.19.0 respectively
  3. Added 12-remove-vanillaqemu4.19-cpupinning.patch, which removes the new qemu_thread_set_affinity implementation (QEMU 8.0.4); also retained Nikolay's CPU pinning patch (No. 15)
  4. The warnings-treated-as-errors flag is available by default, so removed 12-disable-Werror-to-build-under-gcc-11.2.patch
  5. NetdevTapOptions doesn't have a has_br member, so changed 10-bridge-helper-support.patch
  6. vhost-vsock and vhost-scsi are enabled by default, so removed the corresponding enable flags from xen-tools/Dockerfile
  7. Removed 11-char-socket-revert.patch as it's unnecessary
  8. Removed the [realtime] option from kvm.go and replaced it with [overcommit]; the hypervisor.go unit test needs to be changed to reflect [overcommit]
  9. 08-Revert__Revert__vfio_pci-quirks_c__Disable_stolen_memory_for_igd_VFIO__.patch looked super messy, so cleaned it up.
  10. Replaced [realtime] with [overcommit] in kvm_test.go so that the unit test case passes
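
Items 8 and 10 above swap the `[realtime]` section for `[overcommit]` in the QEMU configuration. A minimal sketch of that rename, assuming the configuration is an INI-style file as read by QEMU's `-readconfig`, and that the option carried over is the memory-lock setting (`mlock` under `[realtime]`, `mem-lock` under `[overcommit]`). The helper name `migrateRealtimeSection` and the sample fragment are hypothetical illustrations, not code from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// migrateRealtimeSection rewrites a QEMU config fragment so that the
// deprecated [realtime] section becomes [overcommit] and its mlock key
// becomes mem-lock. Hypothetical helper, for illustration only.
func migrateRealtimeSection(cfg string) string {
	var out []string
	inRealtime := false
	for _, line := range strings.Split(cfg, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "[realtime]":
			// Rename the section header itself.
			inRealtime = true
			line = strings.Replace(line, "[realtime]", "[overcommit]", 1)
		case strings.HasPrefix(trimmed, "["):
			// Any other section header ends the [realtime] section.
			inRealtime = false
		case inRealtime && strings.HasPrefix(trimmed, "mlock"):
			// Rename the key only inside the migrated section.
			line = strings.Replace(line, "mlock", "mem-lock", 1)
		}
		out = append(out, line)
	}
	return strings.Join(out, "\n")
}

func main() {
	// Made-up sample fragment; the real template lives in kvm.go.
	old := "[realtime]\n  mlock = \"off\"\n\n[chardev \"charmonitor\"]\n  backend = \"socket\""
	fmt.Println(migrateRealtimeSection(old))
}
```

Unit tests that compare the generated configuration against a literal string would need the same rename applied to their expected output, which is what item 10 describes.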

@roja-zededa
Contributor Author

@OhmSpectator @rene @rouming Fixed unit test case, looks good to me. Please merge this when you can.

P.S. Had to move #4133 here because of the Request Code Owners Review / auto_request_review error.

@shjala
Member

shjala commented Aug 29, 2024

@OhmSpectator @rene @rouming Fixed unit test case, looks good to me. Please merge this when you can.

P.S. Had to move #4133 here because of the Request Code Owners Review / auto_request_review error.

There are still unanswered questions/change requests in the original PR.

@OhmSpectator
Member

@OhmSpectator @rene @rouming Fixed unit test case, looks good to me. Please merge this when you can.
P.S. Had to move #4133 here because of the Request Code Owners Review / auto_request_review error.

There are still unanswered questions/change requests in the original PR.

Do you mean this one? #4133 (comment)
or something else?

@shjala
Member

shjala commented Aug 30, 2024

@OhmSpectator @rene @rouming Fixed unit test case, looks good to me. Please merge this when you can.
P.S. Had to move #4133 here because of the Request Code Owners Review / auto_request_review error.

There are still unanswered questions/change requests in the original PR.

Do you mean this one? #4133 (comment) or something else?

#4133 (comment)

@rouming
Contributor

rouming commented Aug 30, 2024

@roja-zededa we usually don't merge all the patches in one commit: it is difficult to review, difficult to revert, and difficult to maintain. Each logical change goes into its own commit. For example, test fixes are an ideal candidate for a separate commit with its own good description of why the test needs a fix; removal of old patches and docker changes are also good candidates for separate commits. We try to follow Linux kernel best practices, which are well described here: https://github.com/torvalds/linux/blob/master/Documentation/process/submitting-patches.rst#separate-your-changes

Update: your PR description consists of 10 sentences, which perfectly illustrate the "logical change" approach. Of course there don't have to be exactly 10 commits, but some of them definitely "want" to be separated.

Contributor

@eriknordmark eriknordmark left a comment

There are still some comments here and from #4133 to respond to, but kicking off tests.

@OhmSpectator
Member

There are still some comments here and from #4133 to respond to, but kicking off tests.

It makes sense to rebase on the latest master before kicking off the tests.

@roja-zededa
Contributor Author

@OhmSpectator @rene @rouming Fixed unit test case, looks good to me. Please merge this when you can.
P.S. Had to move #4133 here because of the Request Code Owners Review / auto_request_review error.

There are still unanswered questions/change requests in the original PR.

Do you mean this one? #4133 (comment) or something else?

#4133 (comment)

@shjala I only added libgcc, so I don't know why sudo, bash, and yajl were there. However, I can successfully build an image without these utilities. @rene Can I remove sudo, bash, and yajl from the Dockerfile?

@roja-zededa
Contributor Author

There are still some comments here and from #4133 to respond to, but kicking off tests.

It makes sense to rebase on the latest master before kicking off the tests.

Tested with the latest master, and no issues were found.

roja-zededa force-pushed the xentools-bump branch 2 times, most recently from df6005f to 6d9e7fa (September 3, 2024 20:34)
@rouming
Contributor

rouming commented Sep 23, 2024

@roja-zededa This is the source of the error: https://github.com/qemu/qemu/blob/01dc65a3bc262ab1bec8fe89775e9bbfa627becb/ui/vnc.c#L4069

The problem should be reproducible by using VNC with a password. Try to follow the build recommendations from here: https://bugs.gentoo.org/832494

@eriknordmark
Contributor

@roja-zededa This is the source of the error: https://github.com/qemu/qemu/blob/01dc65a3bc262ab1bec8fe89775e9bbfa627becb/ui/vnc.c#L4069

Is that "Cipher backend does not support DES algorithm" check new in the new version of qemu?

@rouming
Contributor

rouming commented Sep 24, 2024

Apparently something was changed, for example qemu/qemu@83bee4b, either in the code or in the qemu build system, which (only a guess) does not enable by default a few libraries needed by vnc. But with an explicit error it is easy to verify: configure vnc with a password, run, tweak the qemu build, repeat.

@milan-zededa
Contributor

@Roja-Eswaran You can run these commands locally to reproduce:

make clean && make build-tests
./eden config add default
./eden config set default --key eve.tag --value 0.0.0-xentools-bump-8843ff1f
./eden config set default --key eve.log-level --value debug
./eden setup
./dist/bin/eden+ports.sh 2223:2223 2224:2224 
./eden start
./eden eve onboard
./eden test tests/vnc -e vnc_test -v debug

# later run:
./eden pod logs vnc-app

roja-zededa force-pushed the xentools-bump branch 3 times, most recently from 038f0df to 949b192 (September 24, 2024 20:36)
@roja-zededa
Contributor Author

Apparently something was changed, for example qemu/qemu@83bee4b, either in the code or in the qemu build system, which (only a guess) does not enable by default a few libraries needed by vnc. But with an explicit error it is easy to verify: configure vnc with a password, run, tweak the qemu build, repeat.

This issue has been resolved by adding gnutls and gnutls-dev to xen-tools/Dockerfile.

Contributor

@eriknordmark eriknordmark left a comment

Kick off tests

@eriknordmark
Contributor

@Roja-Eswaran please rebase on master (to pick up #4290) so we can get the build to work again.

Upgraded xen/Dockerfile and xen-tools/Dockerfile to point to the new version (1)

Removed the following unused patches: 11-char-socket-revert.patch, 12-disable-Werror-to-build-under-gcc-11.2.patch, 15-qemu-Set-the-affinity-of-QEMU-threads-according-to-t.patch, 16-imammedo_x86_acpi_use_offset_instead_of_pointer_when_using_build_header.patch, 0003-arch-arm-small-hack-for-rpi4-usb.patch

Signed-off-by: Roja Eswaran <roja@zededa.com>

Patch 12-remove-vanillaqemu4.19-cpupinning.patch, which removes the new qemu_thread_set_affinity implementation (QEMU 8.0.4)
Patch 20-return_zero_close_if_special_file.patch, which removes a CVE fix

Signed-off-by: Roja Eswaran <roja@zededa.com>

Replacing [realtime] with [overcommit], as [realtime] is deprecated in QEMU 8.0.4 provided by xen-tools 4.19.0

Signed-off-by: Roja Eswaran <roja@zededa.com>

Adding gnutls and gnutls-dev in xen-tools/Dockerfile for VNC error

Signed-off-by: Roja Eswaran <roja@zededa.com>
@roja-zededa
Contributor Author

roja-zededa commented Sep 25, 2024

@Roja-Eswaran please rebase on master (to pick up #4290) so we can get the build to work again.

Done! Please kick off the tests.

@roja-zededa
Contributor Author

roja-zededa commented Sep 25, 2024

I am unable to launch native containers. The eden test failed again! I was able to reproduce it locally and here is the log:

content: {"file":"/pillar/cmd/zedagent/handleconfig.go:824","func":"github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.requestConfigByURL","level":"error","msg":"inhaleDeviceConfig failed: 4","pid":2022,"source":"zedagent","time":"2024-09-25T21:23:05.974509884Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:202","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while stopping domain with socket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp. write unix @-\u003e/run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp: write: broken pipe","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:06.521514741Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:205","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while quitting domain with socket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp. dial unix /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp: connect: no such file or directory","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:10.527364286Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:193","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while accessing listenerSocket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/listener.qmp. stat /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/listener.qmp: no such file or directory","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:10.527630294Z"}
severity: error

content: {"file":"/pillar/hypervisor/hypervisor.go:155","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.logError","level":"error","msg":"containerd looking up domain 1c868ce6-f417-4281-90ab-f98aec092c81.1.1 resulted in CtrContainerInfo: couldn't load container 1c868ce6-f417-4281-90ab-f98aec092c81.1.1: CtrLoadContainer: Exception while loading container: container \"1c868ce6-f417-4281-90ab-f98aec092c81.1.1\" in namespace \"eve-user-apps\": not found","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:26.053377236Z"}
severity: error

content: {"file":"/pillar/hypervisor/hypervisor.go:155","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.logError","level":"error","msg":"containerd looking up domain 1c868ce6-f417-4281-90ab-f98aec092c81.1.1 resulted in CtrContainerInfo: couldn't load container 1c868ce6-f417-4281-90ab-f98aec092c81.1.1: CtrLoadContainer: Exception while loading container: container \"1c868ce6-f417-4281-90ab-f98aec092c81.1.1\" in namespace \"eve-user-apps\": not found","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:41.852764882Z"}
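
The entries above are JSON objects carrying `file`, `func`, `level`, `msg`, `pid`, `source`, and `time` fields, which makes them easy to filter programmatically when hunting for a root cause. A minimal sketch of decoding one entry in Go; the `logEntry` type and `parseEntry` function are hypothetical names chosen for this illustration and are not taken from the EVE code base.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// logEntry mirrors the fields visible in the log lines quoted above.
// Hypothetical type, for illustration only.
type logEntry struct {
	File   string    `json:"file"`
	Func   string    `json:"func"`
	Level  string    `json:"level"`
	Msg    string    `json:"msg"`
	Pid    int       `json:"pid"`
	Source string    `json:"source"`
	Time   time.Time `json:"time"`
}

// parseEntry decodes the JSON payload that follows the "content: " prefix.
func parseEntry(raw string) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

func main() {
	// First entry from the log above, without the "content: " prefix.
	raw := `{"file":"/pillar/cmd/zedagent/handleconfig.go:824","func":"github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.requestConfigByURL","level":"error","msg":"inhaleDeviceConfig failed: 4","pid":2022,"source":"zedagent","time":"2024-09-25T21:23:05.974509884Z"}`
	e, err := parseEntry(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s [%s] %s: %s\n", e.Time.Format(time.RFC3339), e.Level, e.Source, e.Msg)
}
```

Sorting decoded entries by `Time` and grouping them by `Source` is one way to see which component reported a failure first.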

FWIW (AMD64/KVM): sometimes the zedcontroller throws this error while launching a native container:

failed to create task: new task failed: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: brctl: bridge bn1: Resource busy
brctl: bridge bn1: Resource busy
brctl: bridge bn1: Resource busy
Cannot find device "nbu1x2.1": unknown

@rene
Contributor

rene commented Sep 26, 2024

I am unable to launch native containers. The eden test failed again! I was able to reproduce it locally and here is the log:

content: {"file":"/pillar/cmd/zedagent/handleconfig.go:824","func":"github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.requestConfigByURL","level":"error","msg":"inhaleDeviceConfig failed: 4","pid":2022,"source":"zedagent","time":"2024-09-25T21:23:05.974509884Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:202","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while stopping domain with socket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp. write unix @-\u003e/run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp: write: broken pipe","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:06.521514741Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:205","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while quitting domain with socket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp. dial unix /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/qmp: connect: no such file or directory","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:10.527364286Z"}
severity: error

content: {"file":"/pillar/hypervisor/qmp.go:193","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.qmpEventHandler","level":"error","msg":"qmpEventHandler: Exception while accessing listenerSocket: /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/listener.qmp. stat /run/hypervisor/kvm/1c868ce6-f417-4281-90ab-f98aec092c81.1.1/listener.qmp: no such file or directory","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:10.527630294Z"}
severity: error

content: {"file":"/pillar/hypervisor/hypervisor.go:155","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.logError","level":"error","msg":"containerd looking up domain 1c868ce6-f417-4281-90ab-f98aec092c81.1.1 resulted in CtrContainerInfo: couldn't load container 1c868ce6-f417-4281-90ab-f98aec092c81.1.1: CtrLoadContainer: Exception while loading container: container \"1c868ce6-f417-4281-90ab-f98aec092c81.1.1\" in namespace \"eve-user-apps\": not found","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:26.053377236Z"}
severity: error

content: {"file":"/pillar/hypervisor/hypervisor.go:155","func":"github.com/lf-edge/eve/pkg/pillar/hypervisor.logError","level":"error","msg":"containerd looking up domain 1c868ce6-f417-4281-90ab-f98aec092c81.1.1 resulted in CtrContainerInfo: couldn't load container 1c868ce6-f417-4281-90ab-f98aec092c81.1.1: CtrLoadContainer: Exception while loading container: container \"1c868ce6-f417-4281-90ab-f98aec092c81.1.1\" in namespace \"eve-user-apps\": not found","pid":2022,"source":"zedbox","time":"2024-09-25T21:23:41.852764882Z"}

These logs alone are not enough to determine the root cause of the crash.

FWIW (AMD64/KVM): sometimes the zedcontroller throws this error while launching a native container:

failed to create task: new task failed: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: brctl: bridge bn1: Resource busy
brctl: bridge bn1: Resource busy
brctl: bridge bn1: Resource busy
Cannot find device "nbu1x2.1": unknown

This error message happens a lot while deploying native containers when running an EVE device on QEMU (I think it takes too long to set up the bridge device in this case). However, the virtual Ethernet hook script retries a couple of times, so it usually succeeds after the second retry. If it manages to run the container, you don't need to worry about it.

Contributor

@rene rene left a comment

Kicking off Eden Tests...

@milan-zededa
Contributor

Eden tests are green. Should we merge?
@rouming You will need to remove your "changes requested" for merge to be allowed.

Contributor

@eriknordmark eriknordmark left a comment

Re-running the same qemu device as before:
make -e HV=xen ACCEL=y QEMU_MEMORY=16384 MEDIA_SIZE=65536 SSH_PORT=2323 pkg/pillar pkg/watchdog rootfs live run

With the controller deploying two VMs and two containers, the alpine container fails after 5 minutes. This does not happen when I run the same thing on master, nor when I run it with KVM on this PR, so that failure needs to be investigated.

@roja-zededa
Contributor Author

@eriknordmark Could you please share the error trace?

@roja-zededa
Contributor Author

@eriknordmark I don't face any issue while launching three VMs and three native containers at the same time on this PR. https://zedcontrol.alpha.zededa.net/edge-nodes/223c376d-242c-4da2-8487-253c6d599c07/details/status. Could you please share the error log messages so that I can reproduce the issue locally?

@eriknordmark
Contributor

@eriknordmark I don't face any issue while launching three VMs and three native containers at the same time on this PR. https://zedcontrol.alpha.zededa.net/edge-nodes/223c376d-242c-4da2-8487-253c6d599c07/details/status. Could you please share the error log messages so that I can reproduce the issue locally?

I sent you a pointer to the Kibana logs when running xen-amd64.

I kicked off a run with the kvm image from your PR and today I see some new qmp errors (didn't see those a week ago)
FailureInfo: Error OperState. Error Details: [{'description': "Giving up waiting to connect to QEMU Monitor Protocol socket /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp from VM 3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1, error: [attempt 1] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: ''); [attempt 2] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: ''); [attempt 3] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: '')", 'severity': 'SEVERITY_ERROR', 'timestamp': '2024-09-30T09:17:07.254882353Z', 'entities': [], 'retryCondition': ''}]

Do you want a kibana saved search for those as well? The kvm error is for a device called sc-supermicro-zc2 in the alpha cluster.

@roja-zededa
Contributor Author

@eriknordmark I don't face any issue while launching three VMs and three native containers at the same time on this PR. https://zedcontrol.alpha.zededa.net/edge-nodes/223c376d-242c-4da2-8487-253c6d599c07/details/status. Could you please share the error log messages so that I can reproduce the issue locally?

I sent you a pointer to the Kibana logs when running xen-amd64.

I kicked off a run with the kvm image from your PR and today I see some new qmp errors (didn't see those a week ago) FailureInfo: Error OperState. Error Details: [{'description': "Giving up waiting to connect to QEMU Monitor Protocol socket /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp from VM 3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1, error: [attempt 1] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: ''); [attempt 2] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: ''); [attempt 3] qmp status failed for QMP socket '/run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp': err: 'dial unix /run/hypervisor/kvm/3c9007f2-4907-4722-ac23-0ed61a0111fe.1.1/qmp: connect: connection refused'; (JSON response: '')", 'severity': 'SEVERITY_ERROR', 'timestamp': '2024-09-30T09:17:07.254882353Z', 'entities': [], 'retryCondition': ''}]

Do you want a kibana saved search for those as well? The kvm error is for a device called sc-supermicro-zc2 in the alpha cluster.

@eriknordmark I don't think my PR is responsible for the sc-supermicro-zc2 failure, as I have seen a similar error in PR #4259 as well.
Thanks for sending the pointers w.r.t. Xen-amd64, I'll take a look at it.

Contributor

@eriknordmark eriknordmark left a comment

It does look like some container (the alpine-cont we use in ztest) can have issues with a 9p mount when run on the xen hypervisor. But let's debug that in parallel with merging this (assuming no test issues discovered overnight for the kvm image).

I've seen a panic in the guest VM kernel (in p9_xen_create) which looks like an old guest kernel (missing a check for addr == NULL which is in the 6.1 kernel). So not an issue with this PR.

@eriknordmark eriknordmark merged commit 8776e5d into lf-edge:master Oct 1, 2024
69 of 77 checks passed
@rouming rouming mentioned this pull request Oct 1, 2024
9 tasks