Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanner: change healthcheck interval to a longer time instead of every 5 mins #1817

Closed
hengfengli opened this issue Mar 4, 2020 · 3 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@hengfengli
Copy link
Contributor

hengfengli commented Mar 4, 2020

Client

spanner/v1.2.1

Expected behavior

The session will not be expired within an hour so that we don't need to check too frequent for keeping sessions alive. The health checks should be scheduled for every ~50 mins. However, in Go client library, the interval is a random value time.Now() + [interval*0.5, interval*1.5) so that we should set the health interval to 30 mins and the random range will become [15 mins, 45 mins).

Actual behavior

The current health check is every 5 mins. This might overload the backend because of making too many GetSession() calls.

Related code

  • var DefaultSessionPoolConfig = SessionPoolConfig{
    MinOpened: 100,
    MaxOpened: numChannels * 100,
    MaxBurst: 10,
    WriteSessions: 0.2,
    HealthCheckWorkers: 10,
    HealthCheckInterval: 5 * time.Minute,
    }
  • // scheduledHCLocked schedules next healthcheck on session s with the assumption
    // that hc.mu is being held.
    func (hc *healthChecker) scheduledHCLocked(s *session) {
    // The next healthcheck will be scheduled after
    // [interval*0.5, interval*1.5) ns.
    nsFromNow := rand.Int63n(int64(hc.interval)) + int64(hc.interval)/2
    s.setNextCheck(time.Now().Add(time.Duration(nsFromNow)))
    if hi := s.getHcIndex(); hi != -1 {
    // Session is still being tracked by healthcheck workers.
    heap.Fix(&hc.queue, hi)
    }
    }
  • func (hc *healthChecker) healthCheck(s *session) {
    defer hc.markDone(s)
    if !s.pool.isValid() {
    // Session pool is closed, perform a garbage collection.
    s.destroy(false)
    return
    }
    if err := s.ping(); isSessionNotFoundError(err) {
    // Ping failed, destroy the session.
    s.destroy(false)
    }
    }
@hengfengli hengfengli added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 4, 2020
@hengfengli hengfengli self-assigned this Mar 4, 2020
@snehashah16
Copy link

However, in Go client library, the interval is a random value time.Now() + [interval0.5, interval1.5) so that we should set the health interval to 30 mins and the random range will become [15 mins, 45 mins).

Is your fix going to run this every [15-45) mins or can we push it out even further ?
Example:

  • if the healthcheck.interval is set by the customer, respect it and run it every interval cadence
  • if it is not set, run it at every [50 - 55] mins interval.

I dont see the value in this wide interval of [0.5 - 1.5)*interval

@hengfengli
Copy link
Contributor Author

hengfengli commented Mar 5, 2020

@snehashah16 E.g., a health check for a session is done at 10:00am and then the next check will happen at a random value between 10:15am and 10:45am. This will spread requests across the time span instead of all traffic going to a single time point.

I dont see the value in this wide interval of [0.5 - 1.5)*interval

Sorry, I don't quite understand it. What do you mean by this?

I got what you mean now. I need to check the history and try to understand the initial reason why it is set in this way.

My guess is that e.g., if there are 3 million sessions that need to maintain, it would be better to spread this traffic across a wide interval. If all requests happen in [50-55], it might cause a peak traffic in this short time period.

@skuruppu Any thoughts?

@tbpg tbpg added the api: spanner Issues related to the Spanner API. label Mar 5, 2020
gopherbot pushed a commit that referenced this issue Mar 18, 2020
This includes the following changes:

* adjust the default interval to 50 minutes.
* the first healthcheck is scheduled to
[interval*0.2, interval*1.1), i.e., [10, 55) mins.
* the non-first healthchecks are scheduled to
[interval*0.9, interval*1.1), so the new range will become [45, 55)
mins.
* add a separately sourced random generator in session pool.

Fixes #1817

Change-Id: I7dc612063815279b2f6a3b2b24c17ae6d52c14a2
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/53252
Reviewed-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Knut Olav Løite <koloite@gmail.com>
@hengfengli
Copy link
Contributor Author

FYI: a new CL has been merged for this issue: https://code-review.googlesource.com/c/gocloud/+/53252, which updates the health interval to [45, 55) mins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

3 participants