
[CI][NightlyTestsForBinaries] Test Large Tensor: CPU killing node instance #14980

Open
perdasilva opened this issue May 17, 2019 · 13 comments · Fixed by #17450
Comments

@perdasilva
Contributor

perdasilva commented May 17, 2019

Description

It seems Test Large Tensor: CPU is killing the underlying CI node somehow:

build.py: 2019-05-16 06:17:45,292Z INFO Started container: ad87c775febf
+ NOSE_COVERAGE_ARGUMENTS='--with-coverage --cover-inclusive --cover-xml --cover-branches --cover-package=mxnet'
+ NOSE_TIMER_ARGUMENTS='--with-timer --timer-ok 1 --timer-warning 15 --timer-filter warning,error'
+ CI_CUDA_COMPUTE_CAPABILITIES='-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70'
+ CI_CMAKE_CUDA_ARCH_BIN=52,70
+ set +x
+ export PYTHONPATH=./python/
+ PYTHONPATH=./python/
+ nosetests-3.4 tests/nightly/test_large_array.py
....[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1786473439 to reproduce.
Cannot contact mxnetlinux-cpu_8gdgtj05sa: java.lang.InterruptedException

Complete log:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/312/pipeline/
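One way to reproduce this locally without risking the host (not something described in this thread; the 24 GiB cap below is an assumed value) is to limit the test process's address space so the large allocations in test_large_array.py fail with MemoryError instead of triggering the node's OOM killer:

# Hypothetical local-repro helper, not part of the CI scripts.
# RLIMIT_AS is inherited by the child process, so the nosetests run
# hits MemoryError rather than exhausting the node's memory.
import resource
import subprocess

LIMIT_BYTES = 24 * 1024 ** 3  # assumed cap, below the node's physical RAM

resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))
subprocess.run(["nosetests-3.4", "tests/nightly/test_large_array.py"], check=True)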

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test

@perdasilva changed the title to "[CI][NightlyTestsForBinaries] Test Large Tensor: CPU killing node instance" May 17, 2019
@vdantu
Contributor

vdantu commented May 19, 2019

@mxnet-label-bot add [test]
@apeforest : Would you be able to help out here?

@perdasilva
Contributor Author

perdasilva commented May 20, 2019

Just had this failure in branch v1.5.x. Seems like this error could be related to #14981 and might be fixed by #14990. I'm running a test to verify.

@Chancebair
Contributor

@ChaiBapchya @apeforest @access2rohit @anirudh2290 @larroy

There has been discussion that this test needs to be refactored so that several gigabytes of memory are not needed to run it. Pedro brought this up in an email with the subject "Tests with large inputs and rationalize resource usage. Better testing strategies..."

Could someone please take the lead on this? The test is currently disabled in the Jenkins steps.
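One possible shape for such a refactor is a run-time memory guard that skips the heaviest cases when the node cannot accommodate them; a minimal sketch, assuming psutil is available in the test image (it is not referenced in this thread) and an illustrative 20 GiB threshold:

# Sketch only: a decorator that skips a test unless enough memory is free.
# psutil, the threshold, and the test below are assumptions for illustration.
import functools
import unittest

import psutil

def requires_free_memory(min_gib):
    """Skip the decorated test at run time if less than min_gib GiB of memory is available."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            available_gib = psutil.virtual_memory().available / 1024 ** 3
            if available_gib < min_gib:
                raise unittest.SkipTest(
                    "needs %d GiB free, only %.1f GiB available" % (min_gib, available_gib))
            return test_func(*args, **kwargs)
        return wrapper
    return decorator

@requires_free_memory(20)
def test_large_ndarray():  # hypothetical test name
    import mxnet as mx
    a = mx.nd.ones(shape=(100000000, 50))  # ~5e9 elements, roughly 20 GB of float32
    assert a.shape == (100000000, 50)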

@apeforest
Contributor

@larroy @Chancebair Any suggestions for updating the tests? We still need to test large tensors on CPU. Please advise what the best practice is.

@larroy
Contributor

larroy commented May 24, 2019

@apeforest I'm talking to @access2rohit to understand how we can test this better. I will update this.

@larroy
Contributor

larroy commented May 29, 2019

The discussion has moved to the dev list.

@larroy
Contributor

larroy commented May 29, 2019

@Chancebair @perdasilva Can we rerun the linked CI run? Is the failure related to the test itself, or to something else? (What's the root cause?)

@perdasilva
Contributor Author

I think the tests were causing the machine to run out of RAM and crash. Should I re-run it?

@larroy
Contributor

larroy commented May 30, 2019

How much RAM do we have on that machine? If you are confident it's not going to cause problems in your fleet, I'd say let's run it and see if the problem persists.

@perdasilva
Contributor Author

I think the CPU instances are c5.4xlarge, so 32 GB. There's been a PR to disable the CPU tests. I would suggest re-enabling them in a PR, copying the NightlyTestsForBinaries Jenkinsfile content into the PR's Jenkinsfile, and seeing if it works. I'm sorry I can't do it at the moment as I'm on sick leave =S

@access2rohit
Contributor

access2rohit commented May 4, 2020

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-cpu/detail/PR-18146/42/pipeline#step-96-log-126

@szha The link shared above has the flag -DMSHADOW_INT64_TENSOR_SIZE=0, which is not a large tensor build. Currently, large tensor tests run only in the nightly pipeline (not in the CI that runs on every PR). Is it possible that you copied a different link?
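For context, whether a given libmxnet is a large tensor build can be checked through runtime feature detection; a minimal check, assuming MXNet 1.5+ where mx.runtime.Features exposes an INT64_TENSOR_SIZE flag:

# Sketch: query the runtime features of the installed libmxnet.
# Assumes MXNet >= 1.5; INT64_TENSOR_SIZE should be enabled only when the
# library was built with -DMSHADOW_INT64_TENSOR_SIZE=1.
import mxnet as mx

features = mx.runtime.Features()
print(features.is_enabled("INT64_TENSOR_SIZE"))  # False for a default (non large tensor) build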
