Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve server/client stuck at the test end before results-exchange #1527

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

davidBar-On
Copy link
Contributor

@davidBar-On davidBar-On commented Jun 3, 2023

A suggested fix for the issue that server and client got stuck at the end of the test, because the client did not receive EXCHANGE_RESULTS state, based on @ffolkes1911 test results and discussion starting at this comment.

The issue was caused on a cellular network, so the fix seem to be important for any iperf3 use over cellular network. As the changes are for the end of the test, they are probably related also to the multi-thread version.

The root cause of the issue was that the server sent (reverse mode) only about 60KB from each 128KB TCP packet. Therefore, the last read by the client did not receive a full 128KB packet. Since the read was in blocking-mode the client got stuck and was not able to respond to the server. Therefore, the server also got stuck while waiting for the client's reply. It is not clear whether the EXCHANGE_RESULTS was lost or whether it was just delayed, but the fix handles both cases.

The main changes done:

  1. Client does not set the TCP streams to blocking mode at TEST_END. This is to allow the client to receive non-full late packets.
  2. Allow --rcv-timeout in the client in sending mode - used at the end of the test to allow timeout for exchange-results messages read.
  3. Server does not close the test streams at TEST_END. This is because select() returns immediately when monitoring closed sockets, and the client's select() timeout is not effective in this case.
  4. Client does not monitor "write sockets" at TEST_END. Otherwise, even when no input is received from the server, client's select() for late packets will return immediately (as the write sockets are available) which will not allow using the --rcv-timeout.
  5. When client sends IPERF_DONE or CLIENT_TERMINATE, it sends additional 3 null bytes. This will make sure that in case the server is waiting for the exchanged-results JSON length (4 bytes), it will not get stuck in the read command (server will then fail because it doesn't get a legal JSON).
  6. Cancel client timers at TEST_END, as they are redundant at that point and may interrupt with the --rcv-timeout , because select() may return after a timer expires and before the receive timer expired.

@davidBar-On davidBar-On force-pushed the issue-819-server-stuck-after-printing-results branch from a3df342 to dbe8d68 Compare December 10, 2023 12:55
@davidBar-On
Copy link
Contributor Author

Re-implementation for the Multi-thread iperf3. I am not sure whether the original problem still happen with multi-thread. Some of the changes may be worth implementing in any case, so if needed I can submit a PR only with these changes:

  • Removed write_set as it became redundant with multi-thread.
  • Added timers canceling function, instead of canceling the timers one by one in several places.
  • Cancel client timers at TEST_END, as they are redundant at that point and may interrupt with the --rcv-timeout, because select() may return after a timer expires and before the receive timer expired.
  • Added sp->socket = -1 in two places, which makes the code cleaner.
  • Added some debug messages that may be useful in general.

The PR also include the code changes to resolve the original issue:

  • Allow --rcv-timeout in the client in sending mode - used at the end of the test to allow timeout for exchange-results messages read.
  • When client sends IPERF_DONE or CLIENT_TERMINATE, it sends additional 3 null bytes. This will make sure that in case the server is waiting for the exchanged-results JSON length (4 bytes), it will not get stuck in the read command (server will then fail because it doesn't get a legal JSON).

@@ -562,7 +545,6 @@ iperf_run_server(struct iperf_test *test)
}

memcpy(&read_set, &test->read_set, sizeof(fd_set));
memcpy(&write_set, &test->write_set, sizeof(fd_set));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this means that write_set may have an uninitialized values on line 581. Should you just remove the variable entirely since it it unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Forgot to remove all occurances of write_set... Now removed (with rebase).

@davidBar-On davidBar-On force-pushed the issue-819-server-stuck-after-printing-results branch from dbe8d68 to fcf1c46 Compare May 10, 2024 06:50
@@ -580,8 +583,9 @@ iperf_run_client(struct iperf_test * test)

/* Begin calculating CPU utilization */
cpu_util(NULL);
rcv_timeout_value_in_us = (test->settings->rcv_timeout.secs * SEC_TO_US) + test->settings->rcv_timeout.usecs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change creates a scenario with an early timeout.
Given:

  • test->mode == SENDER
  • rcv_timeout_value_in_us > 0
  • test->state == TEST_END (or EXCHANGE_RESULTS or DISPLAY_RESULTS)

This implies that rcv_timeout_us = 0. Then on line 642, the if statement will evaluate to something like if (t_usecs > 0); always being true since t_usecs will pretty much always be greater than 0.

It might make more sense to split the blocks so the correct timeout is used:
(I've also renamed rcv_timeout_value_in_us -> end_rcv_timeout and rcv_timeout_us -> running_rcv_timeout to hopefully make their use more clear)

    if (result < 0 && errno != EINTR) {
  	    i_errno = IESELECT;
	    goto cleanup_and_fail;
    } else if ( result == 0 && (running_rcv_timeout > 0 && test->state == TEST_RUNNING)) {
        /*
            * If nothing was received in non-reverse running state
            * then probably something got stuck - either client,
            * server or network, and test should be terminated./
            */
        iperf_time_now(&now);
        if (iperf_time_diff(&now, &last_receive_time, &diff_time) == 0) {
            t_usecs = iperf_time_in_usecs(&diff_time);
            if (t_usecs > running_rcv_timeout) {
                /* Idle timeout if no new blocks received */
                if (test->blocks_received == last_receive_blocks) {
                    i_errno = IENOMSG;
                    goto cleanup_and_fail;
                }
            }

        }
    } else if (result == 0 && (end_rcv_timeout > 0 && (test->state == TEST_END 
                                        || test->state == EXCHANGE_RESULTS 
                                        || test->state == DISPLAY_RESULTS))) {
        iperf_time_now(&now);
        if (iperf_time_diff(&now, &last_receive_time, &diff_time) == 0) {
            t_usecs = iperf_time_in_usecs(&diff_time);
            if (t_usecs > end_rcv_timeout) {
                /* Idle timeout if no new blocks received */
                if (test->blocks_received == last_receive_blocks) {
                    i_errno = IENOMSG;
                    goto cleanup_and_fail;
                }
            }

        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only rcv_timeout_us value is used for timeout (in the select statement). The value of rcv_timeout_value_in_us is only used to initialize rcv_timeout_us and to indicate for for the ending states whether receive timeout was requested (as there is no "timeout requested" setting). Therefore, the current code is correct in that respect.

I agree that probably rcv_timeout_value_in_us may have a better name, like rcv_timeout_setting_us but I won't make a change just for that.

@RizziMau
Copy link

Hi,
I am experiencing a similar problem with iperf3 version 3.16 in a product that tests the UDP data throughput of mobile networks.

The data transfers are performed between:

  • an Android phone running iperf3 in client mode
  • a Linux server running iperf3 in server mode

The data transfers occur every X minutes, alternating between client-to-server and server-to-client directions.

Sometimes, the iperf3 server hangs, and all subsequent tests fail with the message
iperf3: error - the server is busy running a test. Try again later.

When the iperf3 server hangs:

  • the last transfer was in reverse mode (server-to-client direction)
  • the transfer was completed without errors on the client side
  • the client has received the report from the server and printed it
  • the server hangs while printing the report

The issue is not systematic but occurs after several hours.
The only solution I have found is to kill and restart the iperf3 server.

I have also tried the "--rcv-timeout 30000" option, but it did not resolve the issue.

The iperf3 server is started with the following command:
iperf3 --server --interval 0 -p 5202 -1 --rcv-timeout 30000
Note that the "-1" option causes iperf3 to exit after one transfer, but a daemon restarts it after 2 seconds.

The iperf3 client command (on the Android phone) is:
iperf3 --forceflush -c x.x.x.x -V -p 5202 -u -t 15 -i 5 -fK -4 -b 5000000 -l 1200 -P 4 -O 0
(x.x.x.x is the IP address of the server)

Since I need a solution, I am available to experiment with other iperf3 command options or to run a debug version of iperf3.

Thank you.

@MattCatz
Copy link
Contributor

@RizziMau does it happen if you don't use --forceflush? That would indicate there is some deadlock condition with the print_mutex. https://github.com/esnet/iperf/blob/master/src/iperf_api.c#L5119

On a side note, with your work around, you may be interested in this PR. Using systemd socket based activation would remove your 2 second down time between one-off tests. #1171

@RizziMau
Copy link

@RizziMau does it happen if you don't use --forceflush? That would indicate there is some deadlock condition with the print_mutex. https://github.com/esnet/iperf/blob/master/src/iperf_api.c#L5119

@MattCatz
Currently I'm using the "--forceflush" option on the client-side only; could this client option have impacts on the server behaviour?
In any case, I am removing the "--forceflush" option from the client command to see if anything changes (I will update you).

As an additional test, does it make sense to try the "--forceflush" option on the server command?

@MattCatz
Copy link
Contributor

@RizziMau does it happen if you don't use --forceflush? That would indicate there is some deadlock condition with the print_mutex. https://github.com/esnet/iperf/blob/master/src/iperf_api.c#L5119

@MattCatz Currently I'm using the "--forceflush" option on the client-side only; could this client option have impacts on the server behaviour? In any case, I am removing the "--forceflush" option from the client command to see if anything changes (I will update you).

As an additional test, does it make sense to try the "--forceflush" option on the server command?

oh duhh I misunderstood your original comment. You are right, --forceflush is not pushed to the server. This sounds different enough that you might consider opening a separate issue. What might also be helpful is including 'the server hangs while printing the report'.

@davidBar-On
Copy link
Contributor Author

@RizziMau, I also suggest to run the server using --debug=3 option. The debug messages may help to better understand were the server got stuck, by comparing successful run output to the output when it got stuck.

You may also try building the server from PR #1734, which adds more debug information about the server's state changes. This may be important, as one of the possible reasons for the problem is that the server is waiting for a state change message from the client.

@RizziMau
Copy link

Thank you for the suggestions:
@MattCatz I have opened the new issue #1735
@davidBar-On I have attached to the issue #1735 the iperf3 logs produced with the --debug=3 option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants