CC doesn't always stop #420

Closed
fraenkel opened this issue Aug 27, 2015 · 10 comments

Comments

@fraenkel

We have been hitting situations where CCDB holds CC hostage and prevents the kills from actually stopping CC. Instead, we see monit restart CC and then run into trouble, because we eventually run out of DB connections while the previous CC is still holding many of them.

Our monit log shows the following:

[2015-08-20 03:01:10+0000] ------------ STARTING cloud_controller_ng_ctl at Thu Aug 20 03:01:10 UTC 2015 --------------
--
[2015-08-20 03:02:26+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid: 13685 
[2015-08-20 03:02:27+0000] ------------ STARTING cloud_controller_ng_ctl at Thu Aug 20 03:02:27 UTC 2015 --------------
--
[2015-08-25 11:53:23+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid: 7548 
[2015-08-25 11:53:23+0000] .Stopped
--

Notice the Killing at 03:02:26 with no Stopped or Timed Out following it.
We haven't been able to find any additional details other than the CC ctl.log which only shows the above.
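
For anyone trying to confirm the same state, one quick check is to look at pg_stat_activity on the CCDB side and see which sessions the old CC is still holding. This is only an illustrative sketch; the host, user, and database names below are assumptions and will differ per deployment.

# Illustrative sketch only; DB host, user, and database name are assumptions.
# Lists the CCDB connections still open, which should include the old CC's sessions.
psql -h ccdb.example.internal -U ccadmin -d ccdb -c "
  SELECT pid, usename, state, query_start, left(query, 60) AS query
    FROM pg_stat_activity
   WHERE datname = 'ccdb'
   ORDER BY query_start;"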

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/102210084.

@fraenkel fraenkel changed the title CC should kill -9 on stop after kill doesn't succeed CC should handle kill failures Aug 28, 2015
@fraenkel fraenkel changed the title CC should handle kill failures CC doesn't always stop Aug 28, 2015
@Gerg
Member

Gerg commented Aug 31, 2015

Are there long-running queries that are holding the connection between the CC and the DB? If so, do you know what they are? This will help us reproduce the issue.

@nastacio

nastacio commented Sep 1, 2015

Yes, the queries are for /v2/events/..., on a DB with roughly 6 million records.

In the DB, this is an example of one of many queries taking minutes to complete:
SELECT * FROM "events" WHERE (timestamp >= ('2015-07-17 12:37:04.000000+0000')) ORDER BY "timestamp" ASC, "id" ASC LIMIT 100 OFFSET 8200
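
For context, OFFSET-based paging like this forces the database to read and discard LIMIT + OFFSET rows (8,300 here) for every page, which gets slower the deeper the page. A hedged way to see where the time goes, if you can reach the DB directly (connection details below are assumptions):

# Illustrative only; host, user, and database name are assumptions.
psql -h ccdb.example.internal -U ccadmin -d ccdb -c "
  EXPLAIN ANALYZE
  SELECT * FROM events
   WHERE timestamp >= '2015-07-17 12:37:04.000000+0000'
   ORDER BY timestamp ASC, id ASC
   LIMIT 100 OFFSET 8200;"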

@zrob
Contributor

zrob commented Sep 15, 2015

Hey, we've been doing a bit of investigation over here.

We found that, according to this line in utils.sh, a kill -9 should occur if a regular kill does not stop the CC pid within 25 seconds, so we're not sure why that isn't being triggered. What's interesting is that your timestamps show the start happening 1 second after the kill is sent, so utils.sh clearly never got a chance to wait the 25 seconds.
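
For reference, here is a simplified sketch of how a kill-and-wait helper of this kind typically works. It is not the exact utils.sh source; the 25-second timeout and the log wording are taken from the description above, and everything else is illustrative.

# Simplified sketch, not the actual cf-release utils.sh:
# send SIGTERM, wait up to 25 seconds for the pid to exit, then escalate to kill -9.
kill_and_wait() {
  local pidfile=$1
  local timeout=${2:-25}
  local pid
  pid=$(head -1 "$pidfile")

  echo "Killing $pidfile: $pid"
  kill "$pid"

  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed Out"
      kill -9 "$pid"          # escalate after the grace period
      break
    fi
    printf "."
    sleep 1
    waited=$((waited + 1))
  done

  echo "Stopped"
  rm -f "$pidfile"
}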

We tried reproducing this by adding the following query to an endpoint and slamming it: App.db.fetch("SELECT CURRENT_TIMESTAMP; SELECT pg_sleep(100); SELECT CURRENT_TIMESTAMP").first. Then, with those connections in flight, we did a monit stop on the CC process. This failed to reproduce the issue; we just saw it log the .Stopped line.

We also tried coming at it from the other side by lowering the max_connections for postgres to a very small number to see if we could max out connections that way. We were still unable to reproduce.
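
For completeness, a rough sketch of the two attempts above from the shell, going straight at the database rather than through a CC endpoint. The hosts, credentials, file paths, and job names are assumptions and will vary by deployment.

# Illustrative sketch of the two reproduction attempts; all names are assumptions.

# 1. Hold CCDB connections open with long-running queries, then try stopping CC.
for _ in $(seq 1 20); do
  psql -h ccdb.example.internal -U ccadmin -d ccdb -c "SELECT pg_sleep(100);" &
done
monit stop cloud_controller_ng

# 2. Shrink the connection limit on the postgres side (needs a restart to apply).
echo "max_connections = 10" >> /var/vcap/store/postgres/postgresql.conf
monit restart postgres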

We're a little stumped on what to attempt next to reproduce this. Any thoughts?

Thanks,
@zrob & @jberkhahn

@nastacio

nastacio commented Oct 3, 2015

After more debugging, the only logical conclusion is that utils.sh is exiting on the call to the regular kill.
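
One way that can happen (an assumption about the failure mode, not a confirmed root cause): if the ctl script runs under set -e and the plain kill returns non-zero, for example because the pid in the pidfile is stale or owned by another user, the script dies right there and never reaches the wait loop or the kill -9.

# Illustration only; an assumed failure mode, not a confirmed root cause.
set -e

pid=$(head -1 /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid)
kill "$pid"    # if this fails (stale pid, wrong owner), set -e aborts the script here
echo "never reached on failure: no 25-second wait, no kill -9"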

@dieucao

dieucao commented Dec 1, 2015

@fraenkel We just merged this fix cloudfoundry-attic/shared-release-packages#4
Do you think that could be the issue you're seeing?

@fraenkel
Author

fraenkel commented Dec 1, 2015

This is actually on the other end. The process never stops.

@dieucao

dieucao commented Dec 1, 2015

@fraenkel Do you have any ideas on what else we could look at to try to reproduce the issue?

@fraenkel
Author

fraenkel commented Dec 2, 2015

So I mimicked the behavior by commenting out all the kills in common/utils.sh, then touched a file in cloud controller to force a redeploy, which caused the following:

Started updating job api_z1 > api_z1/0. Failed: Action Failed get_task: Task 33d32633-4477-4f59-67f5-497da593a95e result: Applying: Reloading jobSupervisor: Failed to reload monit: before=1449014744 after=1449014744 (00:01:37)

Error 450001: Action Failed get_task: Task 33d32633-4477-4f59-67f5-497da593a95e result: Applying: Reloading jobSupervisor: Failed to reload monit: before=1449014744 after=1449014744

If you then go back onto api_z1, monit stop will fail but monit start will succeed.
You will now have 2 cloud controllers running, and the VM is in a bad way.

This causes the same behavior we saw, except in our case it happened with the scripts unmodified.
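
A quick way to spot that state on the api VM (the pidfile path matches the log above; everything else is illustrative):

# Illustrative check for the "2 cloud controllers" state described above.
PIDFILE=/var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
echo "pid in pidfile: $(cat "$PIDFILE" 2>/dev/null)"
echo "running cloud_controller_ng processes:"
pgrep -f cloud_controller_ng
# More than one pid above, or a running pid that does not match the pidfile,
# means a stale CC is still alive and still holding its DB connections.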

@zrob
Contributor

zrob commented Feb 1, 2017

Closing this issue as it looks like it was addressed in the story referenced above.

@zrob zrob closed this as completed Feb 1, 2017