This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Flaky] test_operator_gpu.test_convolution_multiple_streams #14289

Closed
junrushao opened this issue Feb 28, 2019 · 6 comments

Comments

@junrushao
Member

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-gpu/branches/PR-14192/runs/5/nodes/273/log/?start=0

======================================================================
FAIL: test_operator_gpu.test_convolution_multiple_streams
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 570, in test_convolution_multiple_streams
    {'MXNET_GPU_WORKER_NSTREAMS' : num_streams, 'MXNET_ENGINE_TYPE' : engine})
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 540, in _test_in_separate_process
    assert p.exitcode == 0, "Non-zero exit code %d from %s()." % (p.exitcode, func.__name__)
AssertionError: Non-zero exit code 255 from _conv_with_num_streams().

-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=997087021 to reproduce.
--------------------- >> end captured logging << ---------------------
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, Flaky

@junrushao
Member Author

@mxnet-label-bot add [Test, Flaky]

@DickJC123
Contributor

This appears to be a problem with how resources are released during NaiveEngine shutdown. Few tests in our CI run with the NaiveEngine, so this new test I added is flushing out the problem. I saw segfaults (exit code 255) on CentOS and supplied a fix: 790a998.
It looks like there are more issues to work out here, though.
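
For readers unfamiliar with the pattern in the traceback, here is a minimal sketch of running a workload in a separate process with engine-related environment variables set, then checking the exit code. It is not the test's actual helper: it uses `subprocess` rather than whatever `_test_in_separate_process` does internally, and `run_in_separate_process`, `CHILD_CODE`, and the stream count of 2 are hypothetical names and values chosen for illustration. The child here only checks that it sees the settings; it does not import MXNet.

```python
import os
import subprocess
import sys


def run_in_separate_process(code, env_overrides):
    """Run a Python snippet in a fresh interpreter and return its exit code.

    Hypothetical helper for this sketch, not part of the MXNet test suite.
    """
    env = dict(os.environ)
    # Environment variable values must be strings.
    env.update({key: str(value) for key, value in env_overrides.items()})
    result = subprocess.run([sys.executable, "-c", code], env=env)
    return result.returncode


# Stand-in for the real workload: the child only confirms it sees the settings.
CHILD_CODE = (
    "import os, sys\n"
    "ok = os.environ.get('MXNET_ENGINE_TYPE') == 'NaiveEngine'\n"
    "sys.exit(0 if ok else 1)\n"
)

exit_code = run_in_separate_process(
    CHILD_CODE,
    {"MXNET_GPU_WORKER_NSTREAMS": 2, "MXNET_ENGINE_TYPE": "NaiveEngine"},
)
assert exit_code == 0, "Non-zero exit code %d from child process." % exit_code
```

Because the parent only sees the child's exit code, a failure during the child's teardown would show up as a non-zero code even when the workload itself finished.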

@piyushghai
Contributor

There's another duplicate issue: #14329.

I suggest closing this one and carrying on the conversation in #14329.

@junrushao
Member Author

I am closing this issue and moving the discussion to #14329.
