Propose project roadmap #15499

serathius · 2023-03-17T15:30:47Z

Format based on https://github.com/etcd-io/etcd/blob/release-3.5/ROADMAP.md
cc @ahrtr @ptabor

serathius · 2023-03-17T15:52:43Z

cc @spzala @lavacat @jmhbnz @chaochn47 @tjungblu @mitake for feedback

jmhbnz

Looks good to me.

Two things I think would be nice to add:

* A section to mention feature tracking, with a reference to the feature tracking board if we are still going to use it for future releases: https://github.com/etcd-io/etcd/projects/1
* Once known, we should add links to each item so it links through to the issue or milestone directly.

chaochn47 · 2023-03-17T20:46:37Z

ROADMAP.md

+## Milestones
+
+* [P0] Etcd releases are qualified by rigorous robustness testing
+* [P0] Etcd can reliably detect data corruption


How do the three P1 action items above fit into the Milestones section here?

* etcd apply code should be easy to understand and validate correctness
* etcd can reliably detect data corruption (hash is linearizable) (this P1 action item is mentioned as P0)
* etcd recovery from data inconsistency procedures are documented and tested

How does the stalled write due to slow disk fit into the Milestone? @ahrtr

What about lease redesign? #14094

> etcd apply code should be easy to understand and validate correctness

I would want to avoid touching random parts of the apply code without better testing and a clear goal. Removal of the v2 API and the following cleanup should already improve the situation and take priority.

> etcd can reliably detect data corruption (hash is linearizable)

This should be covered by "Etcd can reliably detect data corruption"; maybe it requires a rewrite/clarification. I might downgrade it back to P1 as it doesn't help with the v3.4 release.

> etcd recovery from data inconsistency procedures are documented and tested.

Documentation is done; we just need to add a test. I think we should file an issue marked as important, but the goal itself should be tracked as part of improvements to testing.

> How does the stalled write due to slow disk fit into the Milestone? @ahrtr

This is a somewhat new effort that is still not well defined. For me it comes under reliability, which is important, but as it relates to hardware failures it's not something etcd has tackled yet. However, with the recently reported #15498, I would want to propose "etcd is resilient to hardware failures" soon.

> What about lease redesign?

Correctness should be our top priority; however, leases have been broken for a long time and no one cared (K8s also doesn't). As such, I would treat it as second priority to the KV API.
At some point we could consider a larger effort, "etcd APIs are high quality and have consistent behavior", that would encompass leases.

Is the scope for this item clear? IIRC there was a discussion on corruption detection per key/value, then there were some discussions around merkle trees and partitioning the keyspace.

> Is the scope for this item clear? IIRC there was a discussion on corruption detection per key/value, then there were some discussions around merkle trees and partitioning the keyspace.

If you mean the corruption detection scope, then no. I haven't had time to define it. It's a pretty large issue to tackle and there are multiple ways to approach it. The main challenge is balancing breaking changes and short-term vs long-term improvements. I have a couple of ideas that I discussed with @ptabor, but I haven't had time to write them down as I want to focus on finishing robustness tests first (not too long).

Happy to make the scope clearer if someone is interested in working on it. I would still like to encourage people to work on robustness tests first, as they also help the v3.4 and v3.5 releases.

Is there any tracking issue for changes to catch corruption per key/value?

serathius · 2023-03-20T13:29:39Z

PING @ahrtr @ptabor for feedback

ptabor · 2023-03-21T10:43:58Z

ROADMAP.md

+
+This document defines high level goals for project.
+
+## Milestones


We should be careful about redefining the meaning of words.

So far in etcd we have used milestone = future minor release: v3.5, v3.6.
Here we are using "milestone" to name a focus area we want to invest in.

Let's use consistent terms. My proposal is to think about this as a 3-level hierarchy:

* Milestones: milestones as defined in https://github.com/etcd-io/etcd/milestones. Let's keep them as publicly visible releases (might be patch).
* Efforts: either
  * bugs that track progress on multiple issues that need to be addressed with a common objective, or
  * (for bigger efforts) projects: https://github.com/etcd-io/etcd/projects. Still, I would represent a project as an umbrella issue, as it seems that a project cannot be assigned to a milestone.
* Issues: individual work items.

Now the question remains:
If we have a tool to dynamically track the milestones with attached efforts / items, do we need to redundantly track it in a markdown doc?

And I would say we don't. I assume that the purpose of the doc is different: it's a statement of intent about what we want to focus on in the following releases. And thanks to being submitted by maintainers and reviewed, it forces them to be on the same page (as opposed to an individual maintainer assigning an issue to a milestone). But if that's the goal, let's call it out explicitly in the preamble to this doc.

Then let's have:

Milestones:

release-v3.6

The main focus of v3.6 is the reduction of technical debt. The explicit goal is to avoid new features.
The focus will be on:

deprecate/decommission experimental / legacy code:

decommission storage v2 (link to a tracking bug)

experimental features are graduated or removed (link to a tracking bug)

...

release-v3.5.x

The same as release-v3.4.x.

release-v3.4.x

The release focuses on stability. Etcd maintainers are going to backport:

critical/important vulnerabilities

backportable test coverage (including robustness tests)

critical correctness or robustness issue fixes

non-invasive flags/warnings that enable an easier transition to the next release.

ahrtr · 2023-03-22T08:51:57Z

I think this PR mixes up three concepts/things:

1. etcd's core value as a K/V datastore.
2. Priorities for now and in the near future.
   I agree that data inconsistency or corruption related issues should be our top priority. Currently I want to focus on "Branch page items link to already released pages" (bbolt#402) and "Invalid db page layout" (#15498).
3. Roadmap.
   A roadmap tells people what will be included in future releases, e.g. 3.6, 3.7, etc. I don't think we need a roadmap for stable releases (3.4 and 3.5), because we don't add new features to stable releases; we just fix issues, including bugs and CVEs. Note that the items included in the roadmap aren't necessarily priorities for now; they are just the direction of the project. Actually there is no rush to add a roadmap for now, because we all know we are still focusing on qualifying our stable releases.

What should be included in future releases? This should be open to discussion. See my proposal (on top of @ptabor's). Note that adding testing of any kind is a continuous effort and is always welcome.

3.6.0

* Remove the dependency on the experimental gRPC API: "Remove the dependency on grpc-go's experimental API (e.g. resolver & balancer)" #15145. Note that we are currently also stuck on bumping gRPC due to the dependency on the experimental API; see "dependency: bump google.golang.org/grpc from 1.51.0 to 1.52.0" #15131.
* Partially resolve "All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU." #15247, e.g. via "etcdserver: add watchdog to detect stalled writes" #15440. I lean toward refactoring leases in 3.7 or 4.0, after we have more confidence in the project (etcd).
* Watch stream starvation. Note that although it's an issue, it may introduce a huge change in "Replace Cmux" #15510.

3.7.0 (or 4.0)

* Refactor leases to completely resolve issues like "All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU." #15247. It's a big and challenging task. Raft is used just like a network transport channel instead of a consensus mechanism; see "[Lease] Refactor lease renew request via raft" #14094.
* Clean up legacy protobuf libraries; see "Protobuf: cleanup both golang/protobuf and gogo/protobuf" #14533. It's an important but not urgent task.
* Evaluate and integrate the new raft feature ("raft: support asynchronous storage writes" raft#8) into etcdserver.
* ...


serathius · 2023-03-23T10:33:04Z

For now I have separated out the postmortem update into #15552

serathius · 2023-07-20T07:54:23Z

Closing for https://docs.google.com/document/d/1KFrLeKyHvNDv4b0GswYU3_zbeSATgCanT5dzCID4oJY/edit?usp=sharing

jmhbnz approved these changes Mar 17, 2023


chaochn47 reviewed Mar 17, 2023


ptabor reviewed Mar 21, 2023


Propose project roadmap

1764b55

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius force-pushed the roadmap branch from b0ef2e2 to 1764b55 Compare March 23, 2023 10:32

serathius changed the title ~~Propose project roadmap and update status of postmortem action items~~ Propose project roadmap Mar 23, 2023

mitake mentioned this pull request Apr 11, 2023

WIP, RFC *: a new option for size based compaction #7782

Closed

serathius closed this Jul 20, 2023

jmhbnz left a comment

chaochn47 Mar 17, 2023

serathius Mar 18, 2023

tjungblu Mar 20, 2023

serathius Mar 20, 2023

geetasg Apr 13, 2023
