Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_asyncio: test_start_tls_server_1() fails randomly #76639

Closed
pitrou opened this issue Dec 30, 2017 · 34 comments
Closed

test_asyncio: test_start_tls_server_1() fails randomly #76639

pitrou opened this issue Dec 30, 2017 · 34 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 (EOL) end of life topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Copy link
Member

pitrou commented Dec 30, 2017

BPO 32458
Nosy @terryjreedy, @vstinner, @tiran, @ned-deily, @asvetlov, @1st1, @miss-islington
PRs
  • bpo-32458: Temporarily mask start-tls proactor test on Windows #5054
  • bpo-32458: Increase TLS timeout in test_asyncio #7157
  • bpo-32458: Further tune start_tls tests #7166
  • [3.7] bpo-32458: Further tune start_tls tests (GH-7166) #7170
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2018-05-29.20:02:20.072>
    created_at = <Date 2017-12-30.19:20:50.946>
    labels = ['3.8', 'type-bug', '3.7', 'expert-asyncio']
    title = 'test_asyncio: test_start_tls_server_1() fails randomly'
    updated_at = <Date 2018-05-29.20:02:20.071>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2018-05-29.20:02:20.071>
    actor = 'vstinner'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2018-05-29.20:02:20.072>
    closer = 'vstinner'
    components = ['asyncio']
    creation = <Date 2017-12-30.19:20:50.946>
    creator = 'pitrou'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32458
    keywords = ['patch']
    message_count = 34.0
    messages = ['309244', '309254', '309255', '309256', '309521', '310405', '313507', '317468', '317638', '317639', '317640', '317643', '317751', '317755', '317758', '317759', '317760', '317761', '317762', '317807', '317808', '317809', '317817', '317833', '317862', '317871', '317872', '317873', '317898', '317907', '317918', '317920', '318055', '318080']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'vstinner', 'christian.heimes', 'ned.deily', 'asvetlov', 'yselivanov', 'miss-islington']
    pr_nums = ['5054', '7157', '7166', '7170']
    priority = None
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32458'
    versions = ['Python 3.7', 'Python 3.8']

    @pitrou
    Copy link
    Member Author

    pitrou commented Dec 30, 2017

    It seems test_asyncio fails sporadically on AppVeyor:
    https://ci.appveyor.com/project/python/cpython/build/3.7.0a0.10081#L2650

    ======================================================================
    ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLS)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\projects\cpython\lib\test\test_asyncio\test_sslproto.py", line 284, in test_start_tls_server_1
        asyncio.wait_for(main(), loop=self.loop, timeout=10))
      File "C:\projects\cpython\lib\asyncio\base_events.py", line 440, in run_until_complete
        return future.result()
      File "C:\projects\cpython\lib\asyncio\tasks.py", line 398, in wait_for
        raise futures.TimeoutError()
    concurrent.futures._base.TimeoutError

    ======================================================================
    FAIL: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLS)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\projects\cpython\lib\test\test_asyncio\functional.py", line 43, in tearDown
        self.fail('unexpected calls to loop.call_exception_handler()')
    AssertionError: unexpected calls to loop.call_exception_handler()

    @pitrou pitrou added stdlib Python modules in the Lib dir tests Tests in the Lib/test dir 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error labels Dec 30, 2017
    @1st1
    Copy link
    Member

    1st1 commented Dec 30, 2017

    I'm leaving on a two-weeks vacation today. To avoid risking breaking the workflow, I'll mask this tests on AppVeyor. I'll investigate this when I get back.

    @1st1 1st1 added topic-asyncio and removed stdlib Python modules in the Lib dir tests Tests in the Lib/test dir labels Dec 30, 2017
    @1st1 1st1 self-assigned this Dec 30, 2017
    @1st1
    Copy link
    Member

    1st1 commented Dec 30, 2017

    New changeset 0c36bed by Yury Selivanov in branch 'master':
    bpo-32458: Temporarily mask start-tls proactor test on Windows (bpo-5054)
    0c36bed

    @1st1
    Copy link
    Member

    1st1 commented Dec 30, 2017

    Please keep this issue open.

    @terryjreedy
    Copy link
    Member

    After fresh update and debug rebuild on master(3.7), with Win10 patched last night, I got
    ======================================================================
    FAIL: test_call_later (test.test_asyncio.test_events.ProactorEventLoopTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "F:\dev\3x\lib\test\test_asyncio\test_events.py", line 290, in test_call_later
        self.assertTrue(0.08 <= t1-t0 <= 0.8, t1-t0)
    AssertionError: False is not true : 0.07800000000861473

    I have not seen this before. Four subsequent runs passed.

    @asvetlov
    Copy link
    Contributor

    Hmm, the scheduled call was executed sooner than expected.
    Actually we had 0.1 sec delay, checked for at least 0.08 but actual call time was 0.078

    BTW I recall an issue with Windows when call_later() was executed sooner already.

    @ned-deily
    Copy link
    Member

    Note, I think this failure is still happening occasionally on the windows7 buildbot; for example:

    http://buildbot.python.org/all/#/builders/111/builds/68/steps/3/logs/stdio
    http://buildbot.python.org/all/#/builders/111/builds/85/steps/3/logs/stdio

    @vstinner
    Copy link
    Member

    test_start_tls_server_1() just failed on my Linux. It likely depends on the system load.

    @vstinner vstinner changed the title test_asyncio failures on Windows test_asyncio: test_start_tls_server_1() fails randomly May 23, 2018
    @vstinner
    Copy link
    Member

    It would be nice to fix this bug before Python 3.7.0 final: either skip the test, or fix it. Flaky tests can be very annoying. For example, the full test suite is run to build a Red Hat package. If a single test fails, the package build fails and should be retried whereas it's slow.

    I understood that fixing the root cause might require to rewrite the test, so I don't expect a quick fix on this test. The issue is open since last December... Maybe we can skip the test but fix it in Python 3.7.1?

    @1st1
    Copy link
    Member

    1st1 commented May 24, 2018

    I'm OK to skip it for now. Writing functional tests is super hard because some buildbots are super slow and unpredictable.

    @vstinner
    Copy link
    Member

    I'm OK to skip it for now. Writing functional tests is super hard because some buildbots are super slow and unpredictable.

    Would you mind to write a PR to skip the PR? So you will feel guilty and will try to remind to fix it!

    @1st1
    Copy link
    Member

    1st1 commented May 24, 2018

    Not at my computer right now, can do it tomorrow.

    @ned-deily
    Copy link
    Member

    Yury, are you still planning to address this?

    @1st1
    Copy link
    Member

    1st1 commented May 26, 2018

    Last time I looked into this I couldn't reproduce the failures on my Windows 10 VM, so it seems like an AppVeyor-specific problem. I'll take another look on Monday.

    Andrew, if you have a Windows environment, could you please try to run this test?

    @ned-deily
    Copy link
    Member

    Last time I looked into this I couldn't reproduce the failures on my Windows 10 VM, so it seems like an AppVeyor-specific problem. I'll take another look on Monday.

    But in msg317468 Victor asserts that it sometimes fails on Linux, too?

    @1st1
    Copy link
    Member

    1st1 commented May 26, 2018

    But in msg317468 Victor asserts that it sometimes fails on Linux, too?

    Hm, I missed that. I'll definitely take a look.

    @tiran
    Copy link
    Member

    tiran commented May 26, 2018

    It's failing reproducible with OpenSSL 1.1.1 and TLS 1.3 enabled. I haven't seen it failing with TLS 1.2 yet.

    @terryjreedy
    Copy link
    Member

    I have not seen test_asyncio fail on AppVeyor since about Sunday.
    https://ci.appveyor.com/project/python/cpython/history
    shows only a couple of failures, which could very well be real.
    I have seen test_asyncio fail on Travis since Sunday.

    With my local repository 32-bit debug build of master branch:

    If I run python -m test.test_asyncio -v,
    test_sock_sendfile_not_regular_file (test.test_asyncio.test_proactor_events.ProactorEventLoopUnixSockSendfileTests) ... (and the test process) aborts with a failed debug assertion:
    minkernel\crts\ucrt\src\appcrt\lowio\osfinfo.cpp, line 257
    fh >= 0 && (unsigned)fh < (unsigned)_nhandle

    If I run python -m test test_asyncio, I get
    F:\dev\3x\lib\asyncio\sslproto.py:320: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x052DA4D0>
      source=self)
    F:\dev\3x\lib\asyncio\sslproto.py:320: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x0476A118>
      source=self)
    Exception in callback _ProactorReadPipeTransport._loop_reading__data_received(<_OverlappedFuture cancelled>)
    handle: <Handle _ProactorReadPipeTransport._loop_reading__data_received(<_OverlappedFuture cancelled>)>
    Traceback (most recent call last):
      File "F:\dev\3x\lib\asyncio\events.py", line 88, in _run
        self._context.run(self._callback, *self._args)
      File "F:\dev\3x\lib\asyncio\proactor_events.py", line 222, in _loop_reading__data_received
        self._closing)
    AssertionError
    test_asyncio passed in 35 sec
    1 test OK.

    With -v, above is scattered in verbose output, but test_start_tls_server_1 passed both times.

    Specifically, I ran the following 10 times with no failures.
    f:\dev\3x>python -m test -v -m test_start_tls_server_1 test_asyncio
    ...
    test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests) ... ok
    test_start_tls_server_1 (test.test_asyncio.test_sslproto.SelectorStartTLSTests) ... ok

    @terryjreedy
    Copy link
    Member

    I presume the above is using the month-old openssl-bin-1.1.0h.

    @1st1
    Copy link
    Member

    1st1 commented May 27, 2018

    Terry, you somehow deleted Christian from the nosy list.

    Christian,

    It's failing reproducible with OpenSSL 1.1.1 and TLS 1.3 enabled. I haven't seen it failing with TLS 1.2 yet.

    On Linux or Windows?

    @tiran
    Copy link
    Member

    tiran commented May 27, 2018

    Linux

    @terryjreedy
    Copy link
    Member

    'deleted somehow': Christian added himself while I had this page open, so when I submitted without refreshing, the nosy list did not include him. When I saw the red warning, I did not notice the nosy list change; I should have.

    @1st1
    Copy link
    Member

    1st1 commented May 27, 2018

    Christian, I believe bd17a55 fixes the issue with OpenSSL 1.1.1 (socket.shutdown was missing and asyncio's protocol didn't receive EOF, instead the connection was terminated (as expected though))

    @vstinner
    Copy link
    Member

    I just reproduced the issue on Linux. Open 3 terminals and run the commands in parallel:

    (1) ./python -m test test_asyncio -m test_start_tls_server_1 -F
    (2) ./python -m test -j16 -r
    (3) ./python -m test -j16 -r

    It's a race condition which doesn't depend on the OS, but on the system load.

    @1st1
    Copy link
    Member

    1st1 commented May 28, 2018

    It's a race condition which doesn't depend on the OS, but on the system load.

    Yeah, I agree. The current timeout for the test is 5 seconds and it tries to transfer 2mb of data, which isn't that much actually. I'll bump the timeout and reduce the amount of data transferred in #7130

    Thing is, these are the very first *functional* tests for asyncio. Before it was mostly tested with mocks. So we'll need some adjustment period to learn how slow/loaded the CI servers are and what kind of timeouts are safe to use.

    @vstinner
    Copy link
    Member

    Yeah, I agree. The current timeout for the test is 5 seconds and it tries to transfer 2mb of data, which isn't that much actually. I'll bump the timeout and reduce the amount of data transferred in #7130

    What is the purpose of a timeout of 5 seconds? Some buildbots are really slow.

    If the purpose is to detect when the test hangs: use a timeout of 5 minutes, or at least 1 minute (60 seconds).

    @1st1
    Copy link
    Member

    1st1 commented May 28, 2018

    If the purpose is to detect when the test hangs: use a timeout of 5 minutes, or at least 1 minute (60 seconds).

    OK, sounds good, will do that.

    @vstinner
    Copy link
    Member

    OK, sounds good, will do that.

    Race condition! I created #7157 to increase the timeout from 5 seconds to 60 seconds.

    @1st1
    Copy link
    Member

    1st1 commented May 28, 2018

    #7130 is in. It includes many tests for TLS tests as well as bugfixes to asyncio code. Hopefully this all will make tests more stable.

    @1st1
    Copy link
    Member

    1st1 commented May 28, 2018

    New changeset 8267ea2 by Yury Selivanov in branch 'master':
    bpo-32458: Further tune start_tls tests (bpo-7166)
    8267ea2

    @vstinner
    Copy link
    Member

    While Yury seems unable to reproduce the bug, it's easy for me to reproduce it on Linux. After 2 hours of debugging, I found the root issue: a race condition not in the test, but in asyncio itself! => bpo-33674 "asyncio: race condition in SSLProtocol".

    @miss-islington
    Copy link
    Contributor

    New changeset 4b82846 by Miss Islington (bot) in branch '3.7':
    bpo-32458: Further tune start_tls tests (GH-7166)
    4b82846

    @1st1
    Copy link
    Member

    1st1 commented May 29, 2018

    Should we close this issue now?

    @vstinner
    Copy link
    Member

    I'm not 100% sure that all issues are fixed, but the tests seem much more reliable yet. I close the issue. If the bug reoccurs, I will reopen the issue or open a new one.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 (EOL) end of life topic-asyncio type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants