Propose project roadmap #15499

Closed

Member

@serathius serathius commented Mar 17, 2023

Member Author

serathius commented Mar 17, 2023

cc @spzala @lavacat @jmhbnz @chaochn47 @tjungblu @mitake for feedback

Member

@jmhbnz jmhbnz left a comment

Looks good to me.

Two things I think would be nice to add:

  • Section to mention feature tracking, with reference to feature tracking board if we are still going to use it for future releases: https://github.com/etcd-io/etcd/projects/1
  • Once known, we should add links to each item so it links through to the issue or milestone directly.

## Milestones

* [P0] Etcd releases are qualified by rigorous robustness testing
* [P0] Etcd can reliably detect data corruption
Member

How do the three P1 action items above fit into the Milestones section here?

etcd apply code should be easy to understand and validate correctness
etcd can reliably detect data corruption (hash is linearizable)
(This P1 action item is mentioned as P0)

etcd recovery from data inconsistency procedures are documented and tested

How does the stalled write due to slow disk fit into the Milestone? @ahrtr

What about lease redesign? #14094

Member Author

etcd apply code should be easy to understand and validate correctness

I would want to avoid touching random parts of the apply code without better testing and a clear goal. Removal of the v2 API and the cleanup that follows should already improve the situation and take priority.

etcd can reliably detect data corruption (hash is linearizable)

This should be covered by "Etcd can reliably detect data corruption", though that item may need a rewrite or clarification. I might downgrade it back to P1, as it doesn't help with the v3.4 release.

etcd recovery from data inconsistency procedures are documented and tested.

Documentation is done; we just need to add a test. I think we should file an issue and mark it as important, but the goal itself should be tracked as part of the improvements to testing.

How does the stalled write due to slow disk fit into the Milestone? @ahrtr

This is a somewhat new effort that is still not well defined. For me it comes under reliability, which is important, but as it relates to hardware failures it's not something etcd has tackled yet. However, with the recently reported #15498, I would want to propose "etcd is resilient to hardware failures" soon.

What about lease redesign?

Correctness should be our top priority; however, leases have been broken for a long time and no one cared (K8s doesn't either). So I would treat it as second priority to the KV API.
At some point we could consider a larger effort, "etcd APIs are high quality and have consistent behavior", that would encompass leases.

Contributor

Is the scope for this item clear? IIRC there was a discussion on corruption detection per key/value, then there were some discussions around merkle trees and partitioning the keyspace.

Member Author

Is the scope for this item clear? IIRC there was a discussion on corruption detection per key/value, then there were some discussions around merkle trees and partitioning the keyspace.

If you mean the corruption detection scope, then no. I didn't have time to define it. It's a pretty large issue to tackle and there are multiple ways to approach it. The main challenge is balancing breaking changes and short-term vs long-term improvements. I have a couple of ideas that I discussed with @ptabor, but didn't have time to write them down, as I want to focus on finishing robustness tests first (not too long).

Happy to make the scope clearer if someone is interested in working on it. I would still like to encourage people to work on robustness tests first, as they also help the v3.4 and v3.5 releases.

Is there a tracking issue for the changes to catch corruption per key/value?

Member Author

PING @ahrtr @ptabor for feedback


This document defines high level goals for project.

## Milestones
Contributor

We should be careful about redefining meaning of the words.

So far in etcd we were using milestones = future minor release: v3.5, v3.6.
Here we are calling a focus area we want to invest in a milestone.

Let's use consistent terms. My proposal is to think about this as a 3-level hierarchy:

  1. Milestones -> are milestones as defined in https://github.com/etcd-io/etcd/milestones. Let's keep them as publicly visible releases (might be patch).

  2. Efforts

    • are bugs that track progress on multiple issues that need to be addressed with a common objective
    • [or] (for bigger efforts) projects: https://github.com/etcd-io/etcd/projects.
      Still I would represent project as umbrella issue -> as it seems that project cannot be assigned to a milestone.
  3. Issues - for individual work items.

Now the question remains:
If we have a tool to dynamically track the milestones with attached efforts / items, do we need to redundantly track it in a markdown doc?

And I would say we don't. I assume that the purpose of the doc is different. It's a statement of intent about what we want to focus on in the following releases. And thanks to being submitted by maintainers and reviewed, it forces them to be on the same page (as opposed to an individual maintainer assigning an issue to a milestone). But if that's the goal, let's call it out explicitly in the preamble to this doc.

Then let's have:

Milestones:

release-v3.6

The main focus of v3.6 is the reduction of technical debt. The explicit goal is to avoid new features.
The focus will be on:

  1. deprecate/decommission experimental / legacy code:
    • decommission storage v2 (link to a tracking bug)
    • experimental features are graduated or removed (link to a tracking bug)
  2. ...

release-v3.5.x

The same as release-v3.4.x.

release-v3.4.x

The release focuses on stability. Etcd maintainers are going to backport:

  • critical/important vulnerabilities
  • backportable test coverage (including robustness tests)
  • critical correctness or robustness issue fixes
  • non-invasive flags/warnings that ease the transition to the next release.

Member

ahrtr commented Mar 22, 2023

I think this PR mixed up three concepts/things:

  1. etcd's core value as a K/V datastore;
  2. priorities for now and in the near future.
    I agree that data inconsistency and corruption issues should be our top priority. Currently I want to focus on "Branch page items link to already released pages" (bbolt#402) and "Invalid db page layout" (#15498).
  3. roadmap
    A roadmap tells people what will be included in future releases, e.g. 3.6, 3.7, etc. I don't think we need a roadmap for stable releases (3.4 and 3.5), because we don't add new features to stable releases; we just fix issues, including bugs and CVEs. Note that the items included in the roadmap aren't necessarily priorities for now; they are just the direction of the project. Actually there is no rush to add a roadmap for now, because we all know we are still focusing on qualifying our stable releases.

What should be included in future releases? This should be open to discussion. See my proposal (on top of @ptabor's). Note that adding tests of any kind is a continuous effort and is always welcome.

3.6.0

  1. Remove the dependency on the experimental gRPC API: "Remove the dependency on grpc-go's experimental API (e.g. resolver & balancer)" (#15145). Note that we are also currently stuck on bumping gRPC because of this dependency, see "dependency: bump google.golang.org/grpc from 1.51.0 to 1.52.0" (#15131).
  2. Partially resolve "All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU." (#15247), e.g. via "etcdserver: add watchdog to detect stalled writes" (#15440). I lean towards refactoring leases in 3.7 or 4.0, after we have more confidence in the project (etcd).
  3. Watch stream starvation. Note that although it's an issue, fixing it may introduce a huge change, see "Replace Cmux" (#15510).

3.7.0 (or 4.0)

  1. Refactor leases to completely resolve issues like "All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU." (#15247). It's a big and challenging task. Raft is used just like a network transport channel instead of a consensus mechanism, see "[Lease] Refactor lease renew request via raft" (#14094).
  2. Clean up legacy protobuf libraries, see "Protobuf: cleanup both golang/protobuf and gogo/protobuf" (#14533). It's an important but not urgent task.
  3. Evaluate and integrate the new raft feature ("raft: support asynchronous storage writes", raft#8) into etcdserver.
  4. ...

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius serathius changed the title Propose project roadmap and update status of postmortem action items Propose project roadmap Mar 23, 2023
Member Author

For now I have split the postmortem update out into #15552.

@serathius serathius closed this Jul 20, 2023