checkbox remote needs a way to resume a test session #22

Closed
beliaev-maksim opened this issue Oct 21, 2022 · 1 comment · Fixed by #859
Labels
bug Something isn't working FromLaunchpad Importance: Critical Triaged Triaged (in Jira and GH)

Comments

@beliaev-maksim

This issue was migrated from https://bugs.launchpad.net/checkbox-ng/+bug/1936477

Summary

Status: Confirmed
Created on: 2021-07-16 08:29:07
Heat: 28
Importance: Critical
Security related: False

Description

It is far too easy for the CDTS master to lose connection with the SUT. When this happens, you end up back at the main selection screen, and the previous test session is completely lost.

Even trying to select tests to rerun them can cause the session to be lost.

When Checkbox is run locally on the SUT, it can detect interrupted sessions and lets you resume them.

Steps to reproduce

  1. Install checkbox on the host and the device under test (DUT).
  2. Make sure the checkbox service is launched on the DUT:

systemctl status checkbox-ng.service
● checkbox-ng.service - Checkbox Remote Service
     Loaded: loaded (/lib/systemd/system/checkbox-ng.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-11-10 10:01:37 CST; 5 days ago
   Main PID: 1455 (checkbox-cli)
      Tasks: 1 (limit: 19051)
     Memory: 28.1M
     CGroup: /system.slice/checkbox-ng.service
             └─1455 /usr/bin/python3 /usr/bin/checkbox-cli service

(Please note: the service name might be different if you use CDTS or Checkbox as a snap. For CDTS for instance, it's snap.cdts.service.service)

  3. On the host, connect to the DUT:

$ checkbox-cli remote <DUT_IP>

  4. Select a test plan and start the test run (for instance, an automated test plan or a stress test plan such as client-cert-iot-server-20-04-[automated|stress]).

  5. While testing is ongoing, restart your host computer.

  6. After restart, try connecting back to the DUT:

$ checkbox-cli remote <DUT_IP>

Expected result

Checkbox asks whether you want to resume the session.

Actual result

Checkbox starts afresh, showing you the list of available test plans in order to start a new test run.

Attachments

sessions.tar.xz
202110-29555-sessions.tar.gz

Tags:
['checkbox', 'checkbox-session-resume']

@beliaev-maksim

This thread was migrated from launchpad.net

https://launchpad.net/~pieq wrote on 2021-11-03 07:54:15:

The following just happened to me using Checkbox remote through the cdts snap:

cdts 0.8 180 20.04/stable ce-certification-qa classic
checkbox20 1.23 621 latest/stable ce-certification-qa
core20 20210928 1169 latest/stable canonical✓ base

I installed a new image to test on the DUT, installed checkbox20 and cdts snaps, then ran cdts.odm-certification <IP> from my host, where <IP> is the IP of my DUT. I then selected the test plan I wanted to run (20.04 Server automated tests), and proceeded.

At the end of the testing session, I saw the Re-Run screen. I selected a few jobs that I wanted to re-run, then pressed R. While this happened, my host froze and I had to reboot it. When I tried to reconnect to the DUT using cdts.odm-certification <IP>, I was welcomed by the startup screen (test plan selection), as if there was no session to be resumed.

In the sessions directory, I can see a lot of sessions, even though I only selected a test plan once:

$ ls /var/tmp/checkbox-ng/sessions/
session_title-2021-11-03T02.17.43.session throwaway-2021-11-03T03.02.33.session throwaway-2021-11-03T03.40.22.session throwaway-2021-11-03T04.17.30.session throwaway-2021-11-03T05.02.01.session throwaway-2021-11-03T06.46.57.session
session_title-2021-11-03T02.18.15.session throwaway-2021-11-03T03.05.43.session throwaway-2021-11-03T03.43.31.session throwaway-2021-11-03T04.20.27.session throwaway-2021-11-03T05.06.04.session throwaway-2021-11-03T06.51.06.session
session_title-2021-11-03T02.22.29.session throwaway-2021-11-03T03.08.52.session throwaway-2021-11-03T03.46.41.session throwaway-2021-11-03T04.22.25.session throwaway-2021-11-03T05.09.57.session throwaway-2021-11-03T06.55.14.session
session_title-2021-11-03T02.23.22.session throwaway-2021-11-03T03.11.57.session throwaway-2021-11-03T03.49.50.session throwaway-2021-11-03T04.26.25.session throwaway-2021-11-03T05.13.53.session throwaway-2021-11-03T06.59.22.session
session_title-2021-11-03T07.36.00.session throwaway-2021-11-03T03.15.07.session throwaway-2021-11-03T03.53.00.session throwaway-2021-11-03T04.30.23.session throwaway-2021-11-03T05.17.46.session throwaway-2021-11-03T07.03.24.session
session_title-2021-11-03T07.36.11.session throwaway-2021-11-03T03.18.16.session throwaway-2021-11-03T03.56.11.session throwaway-2021-11-03T04.34.17.session throwaway-2021-11-03T05.21.40.session throwaway-2021-11-03T07.07.34.session
session_title-2021-11-03T07.40.47.session throwaway-2021-11-03T03.21.28.session throwaway-2021-11-03T03.59.19.session throwaway-2021-11-03T04.38.15.session throwaway-2021-11-03T05.25.33.session throwaway-2021-11-03T07.11.42.session
throwaway-2021-11-03T02.22.48.session throwaway-2021-11-03T03.24.32.session throwaway-2021-11-03T04.02.23.session throwaway-2021-11-03T04.42.12.session throwaway-2021-11-03T05.29.28.session throwaway-2021-11-03T07.15.50.session
throwaway-2021-11-03T02.49.54.session throwaway-2021-11-03T03.27.44.session throwaway-2021-11-03T04.05.31.session throwaway-2021-11-03T04.46.06.session throwaway-2021-11-03T06.30.26.session
throwaway-2021-11-03T02.53.05.session throwaway-2021-11-03T03.30.54.session throwaway-2021-11-03T04.08.36.session throwaway-2021-11-03T04.50.04.session throwaway-2021-11-03T06.34.34.session
throwaway-2021-11-03T02.56.13.session throwaway-2021-11-03T03.34.03.session throwaway-2021-11-03T04.11.38.session throwaway-2021-11-03T04.53.58.session throwaway-2021-11-03T06.38.42.session
throwaway-2021-11-03T02.59.23.session throwaway-2021-11-03T03.37.13.session throwaway-2021-11-03T04.14.35.session throwaway-2021-11-03T04.58.01.session throwaway-2021-11-03T06.42.48.session

I attached all the sessions into sessions.tar.xz.
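
A listing like the one above is easier to triage once it is summarized. The following is only a sketch, not Checkbox code: it assumes every entry follows the `<title>-<YYYY-MM-DDTHH.MM.SS>.session` pattern seen in the listing, and `parse_session_name` and `count_by_title` are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime

SUFFIX = ".session"
TIMESTAMP_LENGTH = len("2021-11-03T02.17.43")  # 19 characters


def parse_session_name(name):
    """Split '<title>-<timestamp>.session' into (title, datetime).

    Assumes the naming pattern visible in the directory listing.
    """
    stem = name.rstrip("/")  # 'ls -F' style listings append a slash
    if stem.endswith(SUFFIX):
        stem = stem[: -len(SUFFIX)]
    timestamp = stem[-TIMESTAMP_LENGTH:]
    title = stem[: -(TIMESTAMP_LENGTH + 1)]  # also drop the joining '-'
    return title, datetime.strptime(timestamp, "%Y-%m-%dT%H.%M.%S")


def count_by_title(names):
    """Count how many session directories share each title."""
    return Counter(parse_session_name(name)[0] for name in names)
```

Feeding it the output of `os.listdir("/var/tmp/checkbox-ng/sessions/")` would show how many `throwaway` sessions exist compared with `session_title` ones.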

How can I wrap up the testing session, and generate the submission.tar.xz?

https://launchpad.net/~pieq wrote on 2021-11-15 03:22:57:

Changing importance to Critical, because it potentially leads to a lot of time wasted by QA:

  • if the remote (host) has a connection issue during the test run, the whole test session is lost, because checkbox remote does not offer to resume the session afterwards.
  • if the client (DUT) dies during the test, the session cannot be resumed either (this happened to me during a stress test, where the stress-ng-test-for-class-cpu-memory Checkbox job killed the checkbox-cli process).

In both cases, a lot of time is spent understanding what's going on, trying to salvage whatever can be salvaged, and re-organizing the test runs to isolate the job that will mess up the entire session, so it can be run separately – or not at all.

It is therefore very important that Checkbox remote gets the "resume session" feature that a local Checkbox run already has. It could be presented as a list of available, not-yet-finished sessions, from which the user selects the one to resume and then presses Enter.
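
The selection screen proposed here could look roughly like the sketch below. It is not Checkbox code: `build_resume_menu` and `pick_session` are hypothetical names, and how Checkbox would decide that a session is unfinished is deliberately left out, so the caller is assumed to pass in only resumable sessions.

```python
def build_resume_menu(session_names):
    """Return the menu lines: one numbered entry per session, then a 'new session' entry."""
    lines = ["Resume an unfinished session?"]
    for index, name in enumerate(session_names, start=1):
        lines.append(f"  {index}. {name}")
    lines.append("  0. Start a new session")
    return lines


def pick_session(session_names, choice):
    """Map the typed choice to a session name, or None to start a new session.

    Raises ValueError if the choice is not a number or is out of range.
    """
    number = int(choice.strip())  # raises ValueError on non-numeric input
    if number == 0:
        return None
    if 1 <= number <= len(session_names):
        return session_names[number - 1]
    raise ValueError(f"no session numbered {number}")
```

Keeping "start a new session" as an explicit entry means the current behaviour stays available, but it is no longer the silent default.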

P.S.: A full, wrapped-up submission is mandatory in order to be able to generate an HTML report that will be shared with project managers and customers.

https://launchpad.net/~pieq wrote on 2021-12-01 08:10:09:

It happened to me again today, with another customer project, this time when running the firmware/fwts_desktop_diagnosis job.

The job crashes the DUT with this error:

[13337.339760] Call Trace:
[13337.370962] __schedule+0x2e3/0x740
[13337.414642] ? __internal_add_timer+0x2d/0x40
[13337.468617] schedule+0x42/0xb0
[13337.508036] schedule_timeout+0x8a/0x160
[13337.556810] ? __next_timer_interrupt+0xe0/0xe0
[13337.612764] rcu_gp_kthread+0x48d/0x990
[13337.660294] kthread+0x104/0x140
[13337.700541] ? kfree_call_rcu+0x20/0x20
[13337.748070] ? kthread_park+0x90/0x90
[13337.793519] ret_from_fork+0x35/0x40
[13355.988221] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [systemd-udevd:611]

On the Host, I see:

Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...
Reconnecting...

I manually shut down and restart the DUT. Then the Host presents me... the list of available test plans. I have the following sessions:

$ ll -latrh
total 28K
drwxrwxrwx 4 root root 4.0K Dec 1 07:19 throwaway-2021-12-01T07.19.53.session/
drwxrwxrwx 4 root root 4.0K Dec 1 07:22 checkbox-slave-2021-12-01T07.18.58.session/
drwxrwxrwx 4 root root 4.0K Dec 1 07:22 checkbox-slave-2021-12-01T07.22.06.session/
drwxrwxrwx 3 root root 4.0K Dec 1 07:30 ../
drwxrwxrwx 4 root root 4.0K Dec 1 07:31 checkbox-slave-2021-12-01T07.22.11.session/
drwxrwxrwx 7 root root 4.0K Dec 1 07:57 ./
drwxrwxrwx 4 root root 4.0K Dec 1 07:58 checkbox-slave-2021-12-01T07.57.59.session/

but checkbox remote seems to be ignoring them.

https://launchpad.net/~pieq wrote on 2022-03-18 09:06:49:

Annnnnd it happened again while running the RTC battery test on a project device.

On my remote, I'm using

cdts 0.9 278 20.04/stable ce-certification-qa classic
checkbox20 1.25 811 latest/stable ce-certification-qa -

And on the project (conroe), I'm using a checkbox snap made for the project with the same checkbox20 snap.

The RTC battery test stops the DUT, then the DUT wakes up on its own after 30 seconds. While it is booting, I can see "Reconnecting..." on my remote device, and once DUT finishes booting... the "Select test plan" screen appears on my remote device.

$ ls -latrh /var/tmp/checkbox-ng/sessions/
total 24K
drwxrwxrwx 3 root root 4.0K Mar 18 08:38 ..
drwxrwxrwx 4 root root 4.0K Mar 18 08:38 throwaway-2022-03-18T08.38.51.session
drwxrwxrwx 4 root root 4.0K Mar 18 08:45 checkbox-slave-2022-03-18T08.38.28.session
drwxrwxrwx 4 root root 4.0K Mar 18 08:51 checkbox-slave-2022-03-18T08.45.09.session
drwxrwxrwx 6 root root 4.0K Mar 18 08:55 .
drwxrwxrwx 4 root root 4.0K Mar 18 08:55 checkbox-slave-2022-03-18T08.55.40.session

checkbox-slave-2022-03-18T08.45.09.session is the unfinished session I started, and checkbox-slave-2022-03-18T08.55.40.session is the new session started by checkbox-ng after the device reboots, I guess... Attaching the content of /var/tmp/checkbox-ng/sessions/.
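
The guess above can be turned into a simple heuristic for finding the session to salvage. This is a sketch under assumptions, not how Checkbox works: it assumes the service creates a fresh `checkbox-slave-` session when it restarts, so the interrupted session is the second newest one, and `interrupted_candidate` is a hypothetical helper name. Because the timestamps in the names are zero-padded and ordered from year down to second, sorting the names alphabetically also sorts them chronologically.

```python
def interrupted_candidate(names, prefix="checkbox-slave-"):
    """Return the second-newest session with the given prefix, or None.

    Heuristic only: assumes the newest session was created after the reboot
    and the one before it is the session that was interrupted.
    """
    sessions = sorted(name.rstrip("/") for name in names if name.startswith(prefix))
    if len(sessions) < 2:
        return None
    return sessions[-2]


listing = [
    "throwaway-2022-03-18T08.38.51.session",
    "checkbox-slave-2022-03-18T08.38.28.session",
    "checkbox-slave-2022-03-18T08.45.09.session",
    "checkbox-slave-2022-03-18T08.55.40.session",
]
print(interrupted_candidate(listing))
```

For the listing in this comment, the heuristic picks the same session that was identified by hand.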

@pieqq pieqq closed this as completed Nov 28, 2022
@pieqq pieqq reopened this Nov 28, 2022
@beliaev-maksim beliaev-maksim added the bug Something isn't working label Nov 28, 2022
@pieqq pieqq changed the title LP1936477: [remote] checkbox needs a way to resume a test session [remote] checkbox needs a way to resume a test session Mar 7, 2023
@pieqq pieqq changed the title [remote] checkbox needs a way to resume a test session checkbox remote needs a way to resume a test session Mar 7, 2023
@pieqq pieqq added the Triaged Triaged (in Jira and GH) label Apr 19, 2023
kissiel pushed a commit that referenced this issue May 1, 2023
* Add connection test for kuiper

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Update format and name of rules

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Move the rule out of the if statement to make the condition shorter

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Set `SHARED` to share source instance across rules

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Print error logs for failed tests

- add error logs
- change log level to debug
- improve log words
- redirect stdout to stderr when test fails
 

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Add `print_error_log()`

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Add `print_error_log()` in utils

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Update Vault's status check

Signed-off-by: Mengyi <mengyi.wang@canonical.com>

* Capture service errors on checkbox test failure (#24)

* Capture service errors on checkbox test failure

Signed-off-by: Mengyi Wang <mengyi.wang@canonical.com>
pieqq pushed a commit that referenced this issue Jan 12, 2024
Fix: wrong name of after-suspend-eeprom-automated