etcd v3 as storage backend for APIServer #44

Closed
30 tasks done
hongchaodeng opened this issue Jul 22, 2016 · 62 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status

Comments

@hongchaodeng
Contributor

hongchaodeng commented Jul 22, 2016

Description

This feature was merged into v1.3 and is an alpha feature behind a flag. Creating a feature for tracking through the rest of the process.

etcd v3 is a new API supported by etcd v3.0.0+. This new API has a number of advantages, but changes need to happen both externally (turning etcd v3 on by default, writing migration docs) and internally (testing, continued storage improvements) to take full advantage of it.

Why is this a feature? It slightly changes the operation of Kubernetes clusters and impacts the work of SIG Scalability and SIG API Machinery. It is already an alpha feature today, with etcd v3 available as a non-default backend, but work remains to move it from alpha to a well-tested, complete feature.

Progress Tracker

FEATURE_STATUS: Stable

@timothysc
Member

/cc @kubernetes/sig-scalability

@philips philips added this to the v1.4 milestone Jul 23, 2016
@wojtek-t
Member

To be honest, I wouldn't say the current status of it is "alpha".

To clarify, it is technically possible to configure apiserver to use etcd3, and to start this etcd3, but it is not supported at all by our startup scripts, so one would need to do it on one's own. In my opinion, that support is a necessary step before we can say this feature is in alpha.
What is more, we don't have any regular testing set up (not even mentioning the upgrade tests).

So to be honest, I wouldn't say this feature is in alpha.

@lavalamp - if you don't agree with ^^

@philips philips added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Aug 3, 2016
@timothysc
Member

timothysc commented Aug 11, 2016

Enablement has occurred as of - kubernetes/kubernetes#29399

We are currently following up with v3client enablement, which is in progress.

Also, please add @wojtek-t and @timothysc as assignees as well. Many folks are involved in this effort. We should also label team/SIG-Scalability.

@philips philips added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Aug 11, 2016
@philips
Contributor

philips commented Aug 11, 2016

done. thanks for the update @timothysc


@alex-mohr

@wojtek-t Is the current plan to switch to etcd3 for 1.4?

@wojtek-t
Member

@alex-mohr - that's a very good question. Here is what we have:

  • we have an e2e suite that is running with etcd3 underneath
  • the first version of the etcd2 -> etcd3 migration PR is already in flight and LGTMed, so it should be in ~tomorrow
  • we have some relatively simple, but real migration test as part of my PR
  • I did some manual testing and it seems to work

What we are missing:

  • there is no rollback, because we are blocked on a rollback tool that currently @hongchaodeng and @xiang90 are working on
  • we are missing some details in upgrade scripts (we are losing event TTLs currently)
  • we are missing some GKE bits, but those are super minor changes (I already have those in my local branch and just need to test those)

So we won't have everything by the end of this week, but the missing things will be pretty small. So I guess this is a question mostly to @lavalamp if he will be OK with cherrypicking those small changes ~next week.

@xiang90

xiang90 commented Aug 17, 2016

we are missing some details in upgrade scripts (we are losing event TTLs currently)

I believe @hongchaodeng is also working on this. @hongchaodeng told me that we are not losing events, just that the moved events do not have TTL attached properly. We need to fix that part by attaching them to a lease correctly on the first bootstrap after migration.
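
For readers unfamiliar with the difference: etcd v2 stores a TTL on each key, while etcd v3 expires keys through leases, so a migrated key only expires if it is attached to a lease. The following is a minimal, self-contained sketch of that idea using a toy in-memory model. It is not the actual migration tool discussed in this thread, and `V2Node`, `V3Store`, `grant`, `put`, and `migrate` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class V2Node:
    """A key as read from a v2-style store (hypothetical shape)."""
    key: str
    value: str
    ttl_seconds: Optional[int] = None  # remaining TTL; None means no expiry


class V3Store:
    """Toy in-memory stand-in for a v3-style keyspace with leases."""

    def __init__(self) -> None:
        # key -> (value, lease id or None)
        self.kv: Dict[str, Tuple[str, Optional[int]]] = {}
        # lease id -> TTL in seconds
        self.leases: Dict[int, int] = {}
        self._next_lease_id = 1

    def grant(self, ttl_seconds: int) -> int:
        """Create a lease with the given TTL and return its id."""
        lease_id = self._next_lease_id
        self._next_lease_id += 1
        self.leases[lease_id] = ttl_seconds
        return lease_id

    def put(self, key: str, value: str, lease_id: Optional[int] = None) -> None:
        """Store a key, optionally attached to an existing lease."""
        if lease_id is not None and lease_id not in self.leases:
            raise KeyError(f"unknown lease {lease_id}")
        self.kv[key] = (value, lease_id)


def migrate(nodes: Iterable[V2Node], store: V3Store) -> None:
    """Copy nodes into the v3-style store, turning each TTL into a lease."""
    for node in nodes:
        lease_id = None
        if node.ttl_seconds is not None:
            # Without this step the key would be copied but never expire.
            lease_id = store.grant(node.ttl_seconds)
        store.put(node.key, node.value, lease_id)
```

In this model, a key copied with `lease_id=None` corresponds to the situation described above: the event survives migration but no longer expires.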

@xiang90

xiang90 commented Aug 17, 2016

/cc @timothysc Anything from you? Are we able to switch the unit tests/e2e tests to etcd3 within the next few weeks?

@wojtek-t
Member

@hongchaodeng told me that we are not losing events, just that the moved events do not have TTL attached properly

Yes - that's what I meant.

And yes - it's definitely doable, but it's not yet done. I mentioned things that we are missing.

@wojtek-t
Member

e2e tests are already fixed - we have a Jenkins suite running with etcd3 underneath and using v3 client.

@timothysc
Member

timothysc commented Aug 17, 2016

So the entire test suite has etcd3 as a backend. The work item/question I've been having is whether to enable clientv3 by default for the 1.4 release, and at this point I'm going to say NO for several reasons.

  • Currently we don't have security setup on clientv3 connections plumbed through
  • Unit + Integration Testing infrastructure needs cleaning and removal of old client and config references, it's messy.
  • Instead of creating a temporary bridge in unit tests we can wait a cycle and enable clientv3 across the board.

So for this release cycle b/c we are straddling v2/v3 support it seems prudent to release with etcd3 as the backend, using the v2 client as the default. This has passed all the tests and there are no known issues at this time.

The only question I have is how many releases will we support the straddling of v2/v3? (one?)

This should be a discussion topic for @kubernetes/sig-scalability tomorrow. /cc @jbeda

@lavalamp
Member

Yes. v2 client until we have migration (both ways) at least.


@wojtek-t
Member

I definitely agree that migration both ways is a blocker.

However, I don't agree that unit tests are a blocker - if we have e2e & integration tests working, then I don't think unit tests are that important.

The reason why I would like to have this is that we had a discussion and we don't want to launch both the protobuf format for storage and etcd3 in a single release. And we will definitely need the former to support 5000 nodes.

@wojtek-t
Member

wojtek-t commented Aug 18, 2016

So let me summarize the missing things here:

  • Support for etcd migration kubernetes#30500 for initial version of migration
  • rollback support in migration script
  • attaching events (the only objects with TTL) to a lease (TTLs are not migratable)
  • more extensive correctness testing for migration
  • scalability testing
  • migration of unit tests to use v3 client
  • migration of integration tests to use v3 client
  • some small GKE bits

In my opinion unit/integration tests migration is not a blocker, all others actually are.

@timothysc
Member

timothysc commented Aug 18, 2016

I was able to get the unit tests cleaned up; I'll have a PR shortly.

@xiang90

xiang90 commented Aug 19, 2016

@hongchaodeng Can you please merge #44 (comment) into the tracking list and also update the current status?

@hongchaodeng
Contributor Author

@xiang90
Updated tracking list.

@wojtek-t
Can you clarify "GKE bits" as a list of action items?

@wojtek-t
Member

@hongchaodeng - GKE bits is our internal stuff that we need to change. This is more to remind us what we need to do.

@philips philips added this to the v1.5 milestone Aug 25, 2016
@timothysc timothysc modified the milestones: v1.6, next-milestone Jan 5, 2017
@timothysc
Member

etcd3 is now the default in 1.6 (https://groups.google.com/forum/#!topic/kubernetes-dev/WMPK5fyOorE), we will need to be careful to outline installation and upgrade instructions for the broader community. Luckily @kubernetes/sig-cluster-lifecycle-misc has already defaulted to etcd3 in their release.

/cc @hongchaodeng @xiang90

@hongchaodeng
Contributor Author

Great! Thanks @timothysc

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Jan 11, 2017
Automatic merge from submit-queue (batch tested with PRs 39714, 39646)

use etcd2 as storage-backend for federation until federation features are completely tested with etcd3

**What this PR does / why we need it**: move federation etcd to etcd3

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #39594

**Special notes for your reviewer**: here is the [link](kubernetes/enhancements#44 (comment)) to announcement making etcd3 as default

**Release note**:

```release-note
```
@idvoretskyi
Member

@timothysc can you confirm that this feature can be labelled as "stable" for 1.6?

@lavalamp
Member

I don't feel like "alpha, beta, stable" is a good paradigm for thinking about "features" like this one. It's on by default, which we wouldn't do if we thought there were any bugs. Chances of us rolling it back again rather than fixing any bugs is super low. Go ahead and call that stable if you want, I guess.

@idvoretskyi idvoretskyi added the stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status label Jan 12, 2017
@idvoretskyi
Member

@lavalamp yes, thank you.

mikefaille referenced this issue in kelseyhightower/kubernetes-cluster-federation Feb 7, 2017
@timothysc
Member

@mml @wojtek-t what else do you think needs to be done here other than docs? Which I believe are still pending...

kubernetes/website#2172 (comment)

@idvoretskyi
Member

@timothysc @hongchaodeng @wojtek-t please, provide us with the release notes and documentation PR or link at https://docs.google.com/spreadsheets/d/1nspIeRVNjAQHRslHQD1-6gPv99OcYZLMezrBe3Pfhhg/edit#gid=0

Also, please, select the valid checkpoints at the Progress Tracker.

@timothysc
Member

@idvoretskyi @hongchaodeng @wojtek-t

It basically looks like we are just waiting on a docs push.

This seems to have stalled: kubernetes/website#2172. It looks like we need to pull the documentation together here.

@timothysc
Member

@idvoretskyi where did you want the release notes, besides the one generated by the PR itself?

@idvoretskyi
Member

@timothysc we require a human-friendly, one-line release note; please add it to the spreadsheet above.

@timothysc
Member

Updated spreadsheet with release note.

@wojtek-t
Member

Thanks for creating release note - it lgtm.

@timothysc
Member

docs PR: kubernetes/website#2763

@jaredbhatti

Please don't merge docs PRs for 1.6 until 1.6 is actually ready :(

@philips
Contributor

philips commented Apr 10, 2017

Now that v1.6 is out is this feature done and closable?

@idvoretskyi
Member

@philips if no progress is expected - yes.

perotinus pushed a commit to kubernetes-retired/cluster-registry that referenced this issue Sep 2, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2017
@xiang90

xiang90 commented Dec 23, 2017

this is done. closing.

@xiang90 xiang90 closed this as completed Dec 23, 2017
ingvagabund pushed a commit to ingvagabund/enhancements that referenced this issue Apr 2, 2020
Enhancements also facilitate public feature roadmap
howardjohn pushed a commit to howardjohn/enhancements that referenced this issue Oct 21, 2022