This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI][NightlyTestsForBinaries] Test Large Tensor: GPU Failing #14981

Open
perdasilva opened this issue May 17, 2019 · 6 comments · Fixed by #17450

@perdasilva
Contributor

perdasilva commented May 17, 2019

Description

Test Large Tensor: GPU step is failing with:

======================================================================
ERROR: test_large_array.test_ndarray_random_randint
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/nightly/test_large_array.py", line 70, in test_ndarray_random_randint
    assert a.__gt__(low) & a.__lt__(high)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 336, in __gt__
    return greater(self, other)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 3376, in greater
    _internal._lesser_scalar)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2704, in _ufunc_helper
    return fn_array(lhs, rhs)
  File "<string>", line 46, in broadcast_greater
  File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/work/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:39:26] /work/mxnet/src/io/../operator/elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node  at 1-th input: expected int32, got int64
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3c) [0x7fa0e59e8b3c]
  [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)::{lambda(std::vector<int, std::allocator<int> > const&, unsigned long, char const*)#1}::operator()(std::vector<int, std::allocator<int> > const&, unsigned long, char const*) const+0x62d) [0x7fa0e8c6866d]
  [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)+0x2f3) [0x7fa0e8f963a3]
  [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseType<2l, 1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)+0x34d) [0x7fa0e8f968ed]
  [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<bool (nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*), bool (*)(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*&&, std::vector<int, std::allocator<int> >*&&)+0x1d) [0x7fa0e8bb909d]
  [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x6a5) [0x7fa0e8c28e35]
  [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x10b) [0x7fa0e8c0f52b]
  [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x1c9) [0x7fa0e8a8a479]
  [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeEx+0x8f) [0x7fa0e8a8a97f]


-------------------- >> begin captured logging << --------------------
tests.python.unittest.common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2073509752 to reproduce.
--------------------- >> end captured logging << ---------------------

See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/312/pipeline/144 for the full log.
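
Below is a minimal, hedged sketch (not the actual nightly test) of how this kind of int32/int64 mismatch can arise and one way to work around it, assuming an MXNet 1.x build with int64 tensor support; the names low_nd, low64, and high64 are illustrative.

import mxnet as mx

low, high = 0, 1000000
a = mx.nd.random.randint(low, high, shape=(5,), dtype='int64')  # int64 samples

# Comparing against an int32 NDArray routes through broadcast_greater,
# whose type inference requires both inputs to share a dtype; mixing
# int32 and int64 trips the "expected int32, got int64" check in the log.
low_nd = mx.nd.array([low]).astype('int32')
# a > low_nd  # raises MXNetError in affected builds

# Workaround: cast both operands to the same dtype before comparing.
low64 = low_nd.astype('int64')
high64 = mx.nd.array([high]).astype('int64')
in_range = (a >= low64) * (a < high64)   # elementwise AND via multiply
assert in_range.asnumpy().all()

Casting both operands to a common dtype before comparing sidesteps the ElemwiseType check that produced the error above.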

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, CI

@vdantu
Contributor

vdantu commented May 19, 2019

@mxnet-label-bot add [test]
@apeforest

@roywei
Member

roywei commented May 21, 2019

@roywei
Member

roywei commented Jun 4, 2019

Actually, we can't close this yet: the test was fixed but went back to failing after #15059. There is a similar OOM issue in #14980.

@roywei
Member

roywei commented Jun 4, 2019

Currently, both the CPU and GPU tests have been disabled due to the same memory issue. After a discussion with @access2rohit and @apeforest, we can try a few things:

  1. change to P3 instances here https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/JenkinsfileForBinaries#L82
  2. further increase shared memory to 50G (a quick way to check the container's available shared memory is sketched after this list)
  3. stop running the large tensor test in parallel with other tests.

We are having problems testing the above solutions on CI machines that have multiple jobs running in parallel.
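
As a small aid for option 2, here is a hedged sketch (not part of the MXNet CI scripts) that reports how much shared memory is free inside the test container; the helper name shm_free_gib is made up for illustration.

import os

def shm_free_gib(path='/dev/shm'):
    # statvfs reports free blocks and block size for the mount backing `path`.
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024 ** 3)

if __name__ == '__main__':
    # Print the free shared memory so OOM-style failures are easier to triage.
    print('free shared memory: %.1f GiB' % shm_free_gib('/dev/shm'))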

@roywei
Member

roywei commented Jun 6, 2019

The test failed even with 200G of shared memory on a P3.2x instance, so we need another approach for testing large tensors.
