CC doesn't always stop #420

Closed
fraenkel opened this issue Aug 27, 2015 · 10 comments

Comments

@fraenkel

We have been hitting situations where CCDB holds CC hostage and prevents the kills from actually stopping CC. Instead, we see monit restart CC and then run into trouble, because we eventually run out of DB connections while the previous CC is still holding many of them.

Our monit log shows the following:

[2015-08-20 03:01:10+0000] ------------ STARTING cloud_controller_ng_ctl at Thu Aug 20 03:01:10 UTC 2015 --------------
--
[2015-08-20 03:02:26+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid: 13685 
[2015-08-20 03:02:27+0000] ------------ STARTING cloud_controller_ng_ctl at Thu Aug 20 03:02:27 UTC 2015 --------------
--
[2015-08-25 11:53:23+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid: 7548 
[2015-08-25 11:53:23+0000] .Stopped
--

Notice the Killing at 03:02:26 with no Stopped or Timed Out following it.
We haven't been able to find any additional details other than the CC ctl.log which only shows the above.
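
For anyone trying to confirm the same state, one quick check is to look at pg_stat_activity on the CCDB side and see which sessions the old CC is still holding. This is only an illustrative sketch; the host, user, and database names below are assumptions and will differ per deployment.

# Illustrative sketch only; DB host, user, and database name are assumptions.
# Lists the CCDB connections still open, which should include the old CC's sessions.
psql -h ccdb.example.internal -U ccadmin -d ccdb -c "
  SELECT pid, usename, state, query_start, left(query, 60) AS query
    FROM pg_stat_activity
   WHERE datname = 'ccdb'
   ORDER BY query_start;"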

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/102210084.

@fraenkel fraenkel changed the title CC should kill -9 on stop after kill doesn't succeed CC should handle kill failures Aug 28, 2015
@fraenkel fraenkel changed the title CC should handle kill failures CC doesn't always stop Aug 28, 2015
@Gerg
Member

Gerg commented Aug 31, 2015

Are there long-running queries that are holding the connection between the CC and the DB? If so, do you know what they are? This will help us reproduce the issue.

@nastacio

nastacio commented Sep 1, 2015

Yes, the queries are for /v2/events/..., on a DB with roughly 6 million records.

In the DB, this is an example of one of many queries taking minutes to complete:
SELECT * FROM "events" WHERE (timestamp >= ('2015-07-17 12:37:04.000000+0000')) ORDER BY "timestamp" ASC, "id" ASC LIMIT 100 OFFSET 8200
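
For context, OFFSET-based paging like this forces the database to read and discard LIMIT + OFFSET rows (8,300 here) for every page, which gets slower the deeper the page. A hedged way to see where the time goes, if you can reach the DB directly (connection details below are assumptions):

# Illustrative only; host, user, and database name are assumptions.
psql -h ccdb.example.internal -U ccadmin -d ccdb -c "
  EXPLAIN ANALYZE
  SELECT * FROM events
   WHERE timestamp >= '2015-07-17 12:37:04.000000+0000'
   ORDER BY timestamp ASC, id ASC
   LIMIT 100 OFFSET 8200;"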

@zrob
Contributor

zrob commented Sep 15, 2015

Hey, we've been doing a bit of investigation over here.

We found that, according to this line in utils.sh, a kill -9 should occur if a regular kill does not stop the CC pid within 25 seconds, so we're not sure why that isn't being triggered. What's interesting is that your timestamps show the start happening 1 second after the kill is sent, so utils.sh clearly never got a chance to wait the 25 seconds.
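
For reference, here is a simplified sketch of how a kill-and-wait helper of this kind typically works. It is not the exact utils.sh source; the 25-second timeout and the log wording are taken from the description above, and everything else is illustrative.

# Simplified sketch, not the actual cf-release utils.sh:
# send SIGTERM, wait up to 25 seconds for the pid to exit, then escalate to kill -9.
kill_and_wait() {
  local pidfile=$1
  local timeout=${2:-25}
  local pid
  pid=$(head -1 "$pidfile")

  echo "Killing $pidfile: $pid"
  kill "$pid"

  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed Out"
      kill -9 "$pid"          # escalate after the grace period
      break
    fi
    printf "."
    sleep 1
    waited=$((waited + 1))
  done

  echo "Stopped"
  rm -f "$pidfile"
}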

We tried reproducing this by adding the following query to an endpoint and slamming it: App.db.fetch("SELECT CURRENT_TIMESTAMP; SELECT pg_sleep(100); SELECT CURRENT_TIMESTAMP").first. Then, with those connections in flight, we did a monit stop on the CC process. This failed to reproduce the issue; we just saw it log the .Stopped line.

We also tried coming at it from the other side by lowering the max_connections for postgres to a very small number to see if we could max out connections that way. We were still unable to reproduce.
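
For completeness, a rough sketch of the two attempts above from the shell, going straight at the database rather than through a CC endpoint. The hosts, credentials, file paths, and job names are assumptions and will vary by deployment.

# Illustrative sketch of the two reproduction attempts; all names are assumptions.

# 1. Hold CCDB connections open with long-running queries, then try stopping CC.
for _ in $(seq 1 20); do
  psql -h ccdb.example.internal -U ccadmin -d ccdb -c "SELECT pg_sleep(100);" &
done
monit stop cloud_controller_ng

# 2. Shrink the connection limit on the postgres side (needs a restart to apply).
echo "max_connections = 10" >> /var/vcap/store/postgres/postgresql.conf
monit restart postgres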

We're a little stumped on what to attempt next to reproduce this. Any thoughts?

Thanks,
@zrob & @jberkhahn

@nastacio

nastacio commented Oct 3, 2015

After more debugging, the only logical conclusion is that utils.sh is exiting on the call to the regular kill.
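
One way that can happen (an assumption about the failure mode, not a confirmed root cause): if the ctl script runs under set -e and the plain kill returns non-zero, for example because the pid in the pidfile is stale or owned by another user, the script dies right there and never reaches the wait loop or the kill -9.

# Illustration only; an assumed failure mode, not a confirmed root cause.
set -e

pid=$(head -1 /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid)
kill "$pid"    # if this fails (stale pid, wrong owner), set -e aborts the script here
echo "never reached on failure: no 25-second wait, no kill -9"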

@dieucao

dieucao commented Dec 1, 2015

@fraenkel We just merged this fix cloudfoundry-attic/shared-release-packages#4
Do you think that could be the issue you're seeing?

@fraenkel
Author

fraenkel commented Dec 1, 2015

This is actually on the other end. The process never stops.

@dieucao

dieucao commented Dec 1, 2015

@fraenkel Do you have any ideas on what else we could look at to try to reproduce the issue?

@fraenkel
Author

fraenkel commented Dec 2, 2015

So I mimicked the behavior by commenting out all the kills in common/utils.sh, then touched a file in cloud controller to force a redeploy, which caused the following:

Started updating job api_z1 > api_z1/0. Failed: Action Failed get_task: Task 33d32633-4477-4f59-67f5-497da593a95e result: Applying: Reloading jobSupervisor: Failed to reload monit: before=1449014744 after=1449014744 (00:01:37)

Error 450001: Action Failed get_task: Task 33d32633-4477-4f59-67f5-497da593a95e result: Applying: Reloading jobSupervisor: Failed to reload monit: before=1449014744 after=1449014744

If you then go back onto api_z1, monit stop will fail but monit start will succeed.
You will now have 2 cloud controllers running, and the VM is in a bad way.

This causes the same behavior we saw, except in our case it happened with the scripts unmodified.
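
A quick way to spot that state on the api VM (the pidfile path matches the log above; everything else is illustrative):

# Illustrative check for the "2 cloud controllers" state described above.
PIDFILE=/var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
echo "pid in pidfile: $(cat "$PIDFILE" 2>/dev/null)"
echo "running cloud_controller_ng processes:"
pgrep -f cloud_controller_ng
# More than one pid above, or a running pid that does not match the pidfile,
# means a stale CC is still alive and still holding its DB connections.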

@zrob
Contributor

zrob commented Feb 1, 2017

Closing this issue as it looks like it was addressed in the story referenced above.

@zrob zrob closed this as completed Feb 1, 2017