ARM cluster move #611

rvagg · 2017-02-01T21:41:28Z

Sad news folks, I have to physically move the ARM cluster today and the internet connection where it's moving to isn't properly installed yet! I have a temporary connection ready but I'm paying by the GB for it so I can't hook up normal test runs.

There are some ARMv7 and ARMv8 machines in Jenkins (and ARMv7 in release) that aren't physically in the same place (i.e. they are hosted at Scaleway and miniNodes) so they won't be impacted.

So here's what I'm going to do:

ARM tests and releases via Jenkins will be unavailable from around 4am UTC / 8pm Pacific for at least a few hours, possibly longer (up to 24 hours) as I hack together the temporary connection.
I'll unhook the relevant machines from Jenkins so nobody should be obviously impacted, you just won't get the complete runs you normally do.
When reconnected, I'll only be connecting release machines until I have a proper connection or decide that it'll be too long to wait and come up with an alternative.

I'll keep this thread updated as I make progress.

@nodejs/collaborators

mikeal · 2017-02-01T23:00:37Z

Will we get a new picture of the cluster once it is in its new home?

rvagg · 2017-02-02T13:18:12Z

release machines are back online for now, armv6 and armv8, others are off, no ETA on proper connection yet

Trott · 2017-02-02T17:48:28Z

So with no ETA for the return of the Raspberry Pi cluster:

For the jobs that are stalled waiting for the Raspberry Pi farm, will they kick off tomorrow or whenever the farm comes back online? Or probably not and the jobs should just be canceled now?

Land stuff without Raspberry Pi test results in CI? Or wait for the Raspberry Pi cluster to come back?

/cc @nodejs/ctc

joaocgreis · 2017-02-02T17:55:43Z

armv7-ubuntu1404 and armv8-ubuntu1404 were removed by @rvagg from the node-test-commit-arm job but node-test-commit-arm-fanned was left in place, possibly forgotten. I think it's better to cancel. I'll look for a way to remove the whole job.

EDIT: Just disabling the job worked, it is properly skipped by node-test-commit. Also disabled git-rpi-clean.

rvagg · 2017-02-02T21:52:47Z

Sorry, I thought I removed node-test-commit-arm-fanned. There shouldn't be any queued jobs, if there are then I've messed up!

rvagg · 2017-02-03T23:41:04Z

Thursday the 9th is the date I've been given for finalising this internet connection. Apparently there are some technical challenges (also I think some administrative incompetence but that's to be expected when dealing with large telcos!).

rvagg · 2017-02-09T02:33:24Z

Bad news .. I've been notified there are network problems in the area (monopoly government-provided internet infrastructure, yay) and it's been deferred for another week. If it goes through then it should be up on the 16th of this month.

thefourtheye · 2017-02-09T06:24:43Z

We have three releases and we might get RCs out soon. Should we hold them till this setup is back up? We cannot release binaries without testing them, right?

italoacasas · 2017-02-09T15:30:50Z

I have two questions:

Is there something I (we) can do to help right now?
Is there something we can prepare (plan) in case this happens again in the future, for example because of a storm?

mhdawson · 2017-02-09T17:00:24Z

The LTS releases are planned for Feb 21st so availability on the 16th may not affect those directly. It may affect planned RCs; in that case the question would be whether the changes going in that we wanted validated through the RC are ARM-only or can be adequately covered by use on other platforms.

In terms of testing for the Current release, I wonder if the binaries could be tested manually by somebody with access to the release machine logging in and running the tests. That might take a while to run though, since it would be on the single machine instead of fanned like it is in the regular jobs.

Trott · 2017-02-09T17:27:20Z

I think it's OK to release RCs without testing in the ARM cluster in this situation. Maybe explain/apologize in the release announcement.

An actual release (as opposed to an RC) might be different....

Fishrock123 · 2017-02-13T17:20:25Z

Same, RCs/Betas should be fine.

rvagg · 2017-02-16T06:29:38Z

AAAAND we're back up online again on a new stable connection that's quite a bit faster than the old one as a bonus. Working my way through everything but I'm pretty sure I've got most things in place already so it should be working as it used to before the move. Please let me know if you encounter anything that doesn't seem right.

Regarding RCs and nightlies, I think that it got screwed up after a reconnect of my temporary connection where a new dynamic IP got assigned which messed up the iptables rules on both Jenkins machines. They were working just not connecting! Ooops!

joaocgreis · 2017-02-16T15:58:30Z

Jobs seem to be running well! There are still 3 slaves offline and the DNS for the jump host is not updated, but this is not urgent. However, we have some tests failing:

test-dgram-address is failing consistently for master on RPi 1 and 2 (master test runs: 1, 2, 3)
v7.x-staging seems to have the same problem plus test-npm-install on all 3 RPis
v6.x-staging and v4.x-staging ~~are still running at this moment~~ seem good

rvagg · 2017-02-16T22:43:07Z

Thanks to @Trott for jumping on test-dgram-address @ nodejs/node#11432, looks like that'll be addressed soon. Full green run @ https://ci.nodejs.org/job/node-test-binary-arm/6241/

I've taken three Pi's offline, suspecting corrupted filesystems or dodgy SD cards, some of the failures were because of that. I'll address them as soon as I can and bring them back online.

rvagg · 2017-02-16T23:24:51Z

Failures on test-requireio_arm-ubuntu1404-arm64_xgene-2 are interesting, e.g. https://ci.nodejs.org/job/node-test-commit-arm/7806/nodes=armv8-ubuntu1404/ and correlate with disconnection notifications that we keep on getting for just this machine and they date back pretty far (prior to the move). I was tinkering on that box last night trying to understand it but I have no idea what's going on. There's nothing special about it, in fact it's the least special of the 3 XGene machines (one runs the NFS for the Pi's and does release builds, another serves as a jump host for SSH, this one just runs test builds and nothing else!). Something about Jenkins keeps on disconnecting and reconnecting, perhaps it's a Java problem..

Anyone got ideas for debugging this? @joaocgreis, @jbergstroem?

joaocgreis · 2017-02-21T11:51:45Z

@rvagg It's strange that it's just that one machine. I have no solution, but perhaps you can try a different ping interval from the slave side. This is used for Windows:

build/setup/windows/resources/jenkins.bat

Line 5 in d2d5dd2

    
           java -Dhudson.remoting.Launcher.pingIntervalSec=10 -jar slave.jar -jnlpUrl https://ci.nodejs.org/computer/{{ server_id }}/slave-agent.jnlp -secret {{ server_secret }}

(the main thing that clearly fixed Windows was the ping interval from the master side, but this was left in place in all Windows slaves so at least it doesn't hurt).

rvagg · 2017-02-22T05:57:10Z

https://ci.nodejs.org/computer/test-requireio_arm-ubuntu1404-arm64_xgene-2/builds

I tweaked the job slightly after posting the above and you can see that it's mostly green since then. It now downloads slave.jar before starting, each time, under the theory that having an updated slave.jar would be good ... but tbh I don't know if that's been a problem at all.

Kernel logs are still full of:

[511493.450658] init: jenkins main process (8821) terminated with status 255
[511493.450681] init: jenkins main process ended, respawning
[511499.729117] init: jenkins main process (8852) terminated with status 255
[511499.729139] init: jenkins main process ended, respawning
[511505.963897] init: jenkins main process (8883) terminated with status 255
[511505.963921] init: jenkins main process ended, respawning

Failures are less frequent now but they still happen. I've implemented the extended ping interval thing just now so let's see if that helps at all.
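For anyone wanting to quantify the respawn loop in the kernel log above, here is a minimal Python sketch. It assumes the log lines look exactly like the three "terminated with status" lines quoted in this comment; the function name `respawn_intervals` and the `SAMPLE` constant are made up for illustration and are not part of any Jenkins or upstart tooling.

```python
import re

# Matches lines such as:
# [511493.450658] init: jenkins main process (8821) terminated with status 255
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] init: jenkins main process "
    r"\((?P<pid>\d+)\) terminated with status (?P<status>\d+)"
)


def respawn_intervals(log_text):
    """Return the gaps in seconds between consecutive process terminations."""
    times = [float(m.group("ts")) for m in LINE_RE.finditer(log_text)]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# Log excerpt copied from the comment above.
SAMPLE = """\
[511493.450658] init: jenkins main process (8821) terminated with status 255
[511493.450681] init: jenkins main process ended, respawning
[511499.729117] init: jenkins main process (8852) terminated with status 255
[511499.729139] init: jenkins main process ended, respawning
[511505.963897] init: jenkins main process (8883) terminated with status 255
[511505.963921] init: jenkins main process ended, respawning
"""

intervals = respawn_intervals(SAMPLE)
print(intervals)
```

On the excerpt above the gaps come out at a little over six seconds each, which suggests the process dies almost immediately after every respawn rather than failing under load.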

jbergstroem · 2017-02-22T12:42:10Z

@rvagg do the exits correlate with anything interesting in the logs?

rvagg · 2017-02-22T21:44:44Z

@jbergstroem well, when I look at the actual times, it would correlate with anything that's happening on the machine:

[Wed Feb 22 13:43:04 2017] init: jenkins main process (26312) terminated with status 255
[Wed Feb 22 13:43:04 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:10 2017] init: jenkins main process (26343) terminated with status 255
[Wed Feb 22 13:43:10 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:16 2017] init: jenkins main process (26374) terminated with status 255
[Wed Feb 22 13:43:16 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:23 2017] init: jenkins main process (26405) terminated with status 255
[Wed Feb 22 13:43:23 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:29 2017] init: jenkins main process (26436) terminated with status 255
[Wed Feb 22 13:43:29 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:35 2017] init: jenkins main process (26467) terminated with status 255
[Wed Feb 22 13:43:35 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:42 2017] init: jenkins main process (26498) terminated with status 255
[Wed Feb 22 13:43:42 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:53 2017] init: jenkins main process (26529) terminated with status 255
[Wed Feb 22 13:43:53 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:59 2017] init: jenkins main process (26560) terminated with status 255
[Wed Feb 22 13:43:59 2017] init: jenkins main process ended, respawning

(who knew dmesg had a -T eh?)

basically it's constantly happening. Going to have to run this manually and see if I can get anything from it.

rvagg · 2017-02-23T02:31:13Z

captured a failure, not sure if this is the failure, relevant log portions after connect are here: https://gist.github.com/rvagg/8eeb20b0fe7cf289601593ebff5bb827

There's a problem with child processes not being cleaned up properly which seems to cause Jenkins grief (never seen this before elsewhere) and then when it tries to reconnect it gets the kind of error you get when a node is already connected and it keeps on looping from there, which is similar behaviour to what I'm seeing with it running under upstart.

I'm trying out disabling the process tree killer as per https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller to see if that helps, perhaps this is an architecture thing (i.e. this thing is "native code").
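As a rough illustration of how the two tweaks discussed in this thread (the agent-side ping interval and disabling the process tree killer) could be combined on one launch command, here is a hedged Python sketch. The `pingIntervalSec` property and the `-jar slave.jar -jnlpUrl ... -secret ...` arguments are taken from the Windows command quoted earlier; the `hudson.util.ProcessTreeKiller.disable` property is my reading of the linked wiki page and should be checked against it. The function name, the placeholder URL, and the placeholder secret are hypothetical.

```python
def build_agent_command(jnlp_url, secret, ping_interval_sec=10,
                        disable_tree_killer=True):
    """Build the argv list for launching a Jenkins agent with the tweaks above."""
    cmd = [
        "java",
        # Agent-side ping interval, as used in the Windows jenkins.bat.
        f"-Dhudson.remoting.Launcher.pingIntervalSec={ping_interval_sec}",
    ]
    if disable_tree_killer:
        # Assumed property name, per the ProcessTreeKiller wiki page.
        cmd.append("-Dhudson.util.ProcessTreeKiller.disable=true")
    cmd += ["-jar", "slave.jar", "-jnlpUrl", jnlp_url, "-secret", secret]
    return cmd


# Placeholder values only; a real node URL and secret are needed in practice.
cmd = build_agent_command(
    "https://ci.example.invalid/computer/example-node/slave-agent.jnlp",
    "EXAMPLE_SECRET",
)
print(" ".join(cmd))
```

Building the command as a list rather than one string keeps it usable with `subprocess.run(cmd)` without any shell quoting concerns.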

maclover7 · 2017-11-07T23:24:19Z

ping @rvagg -- can this be closed?

joaocgreis mentioned this issue Feb 3, 2017

Include shared library file in `binary/binary.tar.[xz,gz] in Win and arm tests #596

Closed

Fishrock123 mentioned this issue Feb 13, 2017

Proposal - 7.6.0 (Current) nodejs/node#11185

Merged

cjihrig mentioned this issue Feb 16, 2017

dgram: fix possibly deoptimizing use of arguments nodejs/node#11242

Closed

2 tasks

gibfahn mentioned this issue Feb 19, 2017

test: refactor test-http-response-splitting nodejs/node#11429

Closed

2 tasks

maclover7 added the infra label Nov 7, 2017

rvagg closed this as completed Nov 8, 2017
