Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

Conversation

robskillington
Copy link
Collaborator

@robskillington robskillington commented Oct 2, 2020

What this PR does / why we need it:

Allows for writes to leaving shards to count towards consistency and adds read consistency level unstrict all. Combined these settings allow for clusters under stress to perform a node join without removing the availability of the leaving node. Since this can cause issues where writes made to unavailable nodes count towards consistency the read unstrict all consistency level is made available so you can attempt always to try to read from all replicas. This means you won't optimistically short-circuit once reading from majority of nodes (with unstrict majority) and will always give the cluster a chance to read from all replicas (so if a write during a node join with write to a leaving shard that count towards consistency leaves a write on two nodes but the shard that is given up after leaving drops the data and the joining node didn't persist it, you are most likely to always read it from the single shard that still has the data).

Only advised to use this when the cluster is under stress, and also advised to be using repairs in this situation.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

Copy link
Collaborator

@arnikola arnikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm w nits

src/dbnode/client/config.go Outdated Show resolved Hide resolved
src/dbnode/client/types.go Show resolved Hide resolved
Copy link
Collaborator

@rallen090 rallen090 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robskillington robskillington merged commit 81c3e19 into master Oct 2, 2020
@robskillington robskillington deleted the r/configurable-write-shards-leaving-count-consistency-read-unstrict-all branch October 2, 2020 05:54
ChrisChinchilla pushed a commit that referenced this pull request Oct 6, 2020
commit 47c219a
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Tue Oct 6 13:31:37 2020 +0300

    [dbnode] GetNamespaceFn returns public Namespace interface (#2696)

commit ac2ef9b
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Tue Oct 6 11:11:50 2020 +0300

    [coordinator] Store metrics type into the annotation (#2628)

    * Rename existing protobuf TimeSeries.type field to m3_type to avoid collision

    * Add new Prometheus protobuf fields

    * Rename the internal MetricsType to M3MetricsType

    * Implement conversions

    * Write metrics type as annotation payload

    * Avoid reusing annotation slices

    * Fix test

    * Introduce metric family type

    * Address review feedback

    * Revert "Introduce metric family type"

    This reverts commit d108b4f.

    * Introduce annotation.Payload.handle_value_resets field

    * Minor changes according to PR feedback

commit 2df33bf
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Tue Oct 6 09:23:48 2020 +0300

    [dbnode] Extended namespace options (#2644)

    * [dbnode] Extended namespace options

    * Add ExtendedOptions to integration tests

    * Fix build

    * Fix build (OptionsToProto)

    * go.sum

    * Imports

    * Allow updating extended namespace options

    * Lint

    * Fixed volumeIndex increment

    * Revert "Fixed volumeIndex increment"

    This reverts commit 2e0342b.

    * Fix TestLocalTypeWithAggregatedNamespace

    * Added protection against golang/protobuf/jsonpb use

    * More protection against non-gogo protobuf package

    * Make namespace tests pass

    * Reduce reptitiveness of protobuf.Any construction

    * Extract ExtendedOptions test infra to x package

    * Add copyright

    * Move test ExtendedOptionsConverter to the packages where it is needed

    * xtest.NewExtendedOptionsProto returns error

    Co-authored-by: Gediminas <gediminas@chronosphere.io>

commit 84f5b98
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Mon Oct 5 14:36:45 2020 -0700

    [query] Implemented the Graphite `applyByNode` function (#2654)

commit 08117d2
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Mon Oct 5 13:28:07 2020 -0700

    [query] Fix snapping bug affecting "moving" function (movingMedian, movingAverage, etc.) (#2694)

commit db4e9a3
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Mon Oct 5 12:49:06 2020 -0700

    [query] ParseTime function supports 95%+ of `from / until` time formats (#2621)

commit 89f6fcc
Author: Gediminas Guoba <gediminas@chronosphere.io>
Date:   Mon Oct 5 21:36:32 2020 +0300

    [largetiles] Fixed volumeIndex increment (#2681)

    * Fixed volumeIndex increment

    * minor refactoring

    Co-authored-by: Linas Medžiūnas <linasm@users.noreply.github.com>
    Co-authored-by: Rob Skillington <rob.skillington@gmail.com>

commit 1df3eed
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Mon Oct 5 20:45:50 2020 +0300

    [dbnode] Fix granularity of LeaseManager.UpdateOpenLeases (#2695)

commit 5bcb1ba
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Fri Oct 2 16:59:22 2020 +0300

    [dbnode] Support dynamically registered background processes in mediator (#2634)

    * [dbnode] Support dynamically registered background processes in mediator

    * [dbnode] Pass background processes via RunOptions

    * Add a comment for BackgroundProcess

    * Add integration test TestRunCustomBackgroundProcess

    * Include backgroundProcess in TestDatabaseMediatorOpenClose

    * Move the source of BackgroundProcess asynchrony outside for consistency/easier testing

    * Fix test

    * Formatting

    * Update BackgroundProcess doc

    * go mod tidy

    * Verify reporting in TestRunCustomBackgroundProcess

    * Address review feedback

    * Resurrect BackgroundProcess.Start()

    * Enforce mediator.RegisterBackgroundProcess before Open

    * Remove locking from mediator.Report

    * Fix comment

commit ef9ba0a
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Fri Oct 2 02:39:33 2020 -0400

    [changelog] Add changelog for 0.15.16 (#2688)

commit 81c3e19
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Fri Oct 2 01:54:34 2020 -0400

    [coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all (#2687)

commit a700d56
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Thu Oct 1 23:09:09 2020 -0400

    [coordinator] Continue write request when context is cancelled (#2682)

commit 526da79
Author: Ryan Allen <rallen090@gmail.com>
Date:   Thu Oct 1 15:48:37 2020 -0400

    [query] - Include FetchQuery in InspectSeries arguments (#2685)

commit 3dedba5
Author: arnikola <artem@chronosphere.io>
Date:   Thu Oct 1 15:18:27 2020 -0400

    [query] Additional testing on Prometheus engine path (#2686)

    * [query] BlockMetadata no longer required for Prometheus engine path

    * Revert code change

    * Exposes ApplyRangeWarnings

commit f996e2d
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Thu Oct 1 21:49:33 2020 +0300

    [dbnode] Large tile aggregation improvements (#2668)

    * [dbnode] Large tile aggregation improvements

    * Fix mocking

    * Log block start and volume on error

    * Infra for reverse index sharing

    * StorageOptions.AfterNamespaceCreatedFn

    * Adjust test timing

    * Do not fail on fs.ErrCheckpointFileNotFound

    * mockgen

    * Improve handling of shared reverse index

    * Improve Shard.AggregateTiles UT coverage

    * Remove unused reverseIndex from test

    * Address review feedback (partial)

    * Rearrange imports

commit d5fbe4b
Author: Bo Du <bo@chronosphere.io>
Date:   Thu Oct 1 10:12:38 2020 -0600

    [dbnode] No empty TSDB snapshots (#2666)

commit 06cca59
Author: Asaf Mesika <amesika@logz.io>
Date:   Thu Oct 1 12:18:21 2020 +0300

    [docs] Add short description of 2019 monitorama talk (#2515)

    * Add short description of 2019 monitorama talk

    * Update docs/overview/media.md

    Co-authored-by: Chris Chinchilla <chris@chrischinchilla.com>

    Co-authored-by: Chris Chinchilla <chris@chrischinchilla.com>
    Co-authored-by: Chris Chinchilla <chris@chronosphere.io>

commit 68ba5bb
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Wed Sep 30 12:18:42 2020 -0700

    [query] Implemented the Graphite `interpolate` function  (#2650)

commit 77455be
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Wed Sep 30 07:41:27 2020 -0700

    [query] Implemented the Graphite `grep` function (#2655)

commit 43ff200
Author: nate <nbroyles@gmail.com>
Date:   Tue Sep 29 18:00:14 2020 -0400

    [query] Update /database/create endpoint to support creating an aggregated namespace (#2670)

commit 41ac33c
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Tue Sep 29 12:37:01 2020 -0700

    Update debug/dump cmd with Cluster-Environment-Name header (#2672)

    This is needed since the dump includes the PlacmentSource

commit 503d68d
Author: nate <nbroyles@gmail.com>
Date:   Tue Sep 29 15:10:00 2020 -0400

    [docs] Add Fileset Migration docs to Operational Guide nav (#2680)

commit ac8259a
Author: nate <nbroyles@gmail.com>
Date:   Tue Sep 29 14:44:18 2020 -0400

    [dbnode][query] Add ability to set and retrieve AggregationOptions via namespace APis (#2661)

commit 946161b
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Tue Sep 29 10:54:43 2020 -0700

    Only acquire writeState exclusive lock when resetting connections (#2678)

    This is a shared lock across all goroutines, so ideally we limit the
    amount of sync points. There is no need to acquire the exclusive lock
    when flushing a connection. Instead the shared read lock can be acquired
    and individual connection exclusive locks can be acquired to ensure only
    a single writer is using the connection at a time.

    This is part of a broader fix to prevent deadlocks in the m3msg communication.
    Before this change and adding write timeouts, the producer and consumer could deadlock
    and stop all messages from flowing. If a producer was sending more messages than
    it was acking, it would eventually fill up the TCP buffers and begin to block
    writes. If the producer was flushing a connection and holding the exclusive lock,
    it would end up blocking with lock. Since the lock was held, the ACKing process
    could not clear the buffer and the flush would block forever.

commit 192f611
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Tue Sep 29 08:19:58 2020 -0700

    Fix data race in TestSeekerManagerCacheShardIndices (#2679)

commit c717d40
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Mon Sep 28 23:27:09 2020 -0400

    [documentation] Add documentation for disk read bytes per second limit (#2677)

commit 3c81774
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Mon Sep 28 11:25:08 2020 -0700

    Add a configurable write timeout for consumer ACKs (#2675)

    Without a timeout this can potentially block forever.

    The producer and consumer can potentially dead lock trying to send/receive messages:

    Producer -> msg -> Consumer. Consumer is not attempting to read since it's trying to ACK.
    Consumer -> ack -> Producer. Producer is not attempting to read since it might be stuck trying to get a lock on the connection.

    For backwards compatibility this defaults to 0. We should set it to a sane default (5s) in the future.

commit 80c9f9e
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Mon Sep 28 09:39:19 2020 -0700

    Add queue processing metrics (#2671)

    * Add queue processing metrics

    I found these metrics useful when tracking down an issue with the queue
    processor. Additionally, added an optional consumer label for the
    message writer metrics so you can easily tell which downstream consumer
    is causing issues.

    Unfortunately, setting the consumer label is a little awkward since it
    only makes since for the replicated shard writer, which only has a single
    consumer instance at one time. For the shared shard writer, the empty string
    is used.

commit 66b0b24
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Mon Sep 28 03:51:22 2020 -0700

    [query] Implemented the Graphite `sortByMinima` function

commit a7b499d
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Sun Sep 27 20:12:44 2020 -0700

    [query] Implemented the Graphite `cumulative` function

commit 13290be
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Fri Sep 25 16:34:55 2020 -0700

    Add metrics for bytes added/removed to the m3msg buffer (#2656)

    Helpful for understanding how to size the buffer to survive a m3aggregator
    deploy. Due to the graceful shutdown of a m3agregator, it can take almost a
    minute between connections closing on the old instances and connections
    start taking traffic on the new instances. If a coordinator takes 10MB/s
    of traffic, it only takes 10s to fill up a 100MB buffer and start dropping requests.

commit 06d23e2
Author: Matt Schallert <mattschallert@gmail.com>
Date:   Fri Sep 25 15:06:20 2020 -0700

    [deps] grpc v1.27 -> v1.29 (#2667)

    We're stuck in a tough spot with our gRPC version. We are on a
    relatively old version, and gRPC frequently introduces breaking changes
    to APIs that are experimental. Our own query code, and the version of
    etcd we depend on, use experimental APIs which were later deprecated.

    We can't upgrade too far as we won't be able to import etcd. However
    v1.29 is far more compatible with other parts of the ecosystem, and
    makes it easier to integrate M3 in other environments (as well as
    integrate newer tools into M3).

commit 5b95438
Author: Ryan Hall <ryanhall07@gmail.com>
Date:   Fri Sep 25 13:53:08 2020 -0700

    [msg] Move writer README to top level msg README  (#2669)

commit 38bc187
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Fri Sep 25 14:12:05 2020 -0400

    [query] Add metrics for remote storage backends (#2657)

commit 21cceab
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Fri Sep 25 10:42:52 2020 -0400

    [dbnode] Add example for using the Go DB node client to query/write metrics (#2658)

commit 41b4bc4
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Fri Sep 25 10:42:00 2020 -0400

    [query] Add ability to enable optimized Graphite fanout if all agg namespaces have all the data (#2665)

commit aefca00
Author: Linas Medžiūnas <linasm@users.noreply.github.com>
Date:   Fri Sep 25 09:09:46 2020 +0300

    [dbnode] Large tiles aggregation flow v2 (#2643)

    * Refactor and cleanup

    * Refactor interfaces to more closely match design

    * Update frame iterator read from a encoding.ReaderIterator

    * Removing unnecessary files

    * az

    * Add utility to apply tile calculations on data.

    * test fix

    * Added concurrency

    * Concurrency logging

    * [dbnode] A noop AggregateTiles thrift RPC

    * Add AggregateTilesRequest.rangeType

    * sourceNameSpace / targetNameSpace

    * Drop AggregateTilesRequest.shardId

    * A partial implementation of AggregateTiles

    * Open DataFileSetReader and iterate through it

    * Decompress the data read

    * Add explicit FileSetType

    * [dbnode] Add OrderedByIndex option for DataFileSetReader.Open

    * Remove dbShard.TagsFromSeriesID

    * Regenerate mocks

    * Unit tests

    * Mockgen

    * Fix test

    * Resurrect rpc_mock.go

    * Remove accidentally committed files

    * Trigger build

    * Add step parameter

    * Write aggregated data to other namespace

    * Fix tests

    * Introduced AggregateTilesOptions

    * Minor improvements

    * Cleanup

    * PR response

    * Add headers

    * Remove unnecessary stuff.

    * [dbnode] A noop AggregateTiles thrift RPC

    * Add AggregateTilesRequest.rangeType

    * sourceNameSpace / targetNameSpace

    * Drop AggregateTilesRequest.shardId

    * A partial implementation of AggregateTiles

    * Open DataFileSetReader and iterate through it

    * Decompress the data read

    * Add explicit FileSetType

    * Remove dbShard.TagsFromSeriesID

    * Regenerate mocks

    * Unit tests

    * Mockgen

    * Fix test

    * Resurrect rpc_mock.go

    * Remove accidentally committed files

    * Trigger build

    * Add step parameter

    * Write aggregated data to other namespace

    * Fix tests

    * Introduced AggregateTilesOptions

    * Minor improvements

    * Cleanup

    * [dbnode] Integrate arrow iterators into tile aggregation

    * Fix close error after EOF

    * Can already close the SeriesBlockIterator

    * Update to use concurrent iteration and prefer single metadata

    * [dbnode] Cross block series reader

    * Assert on OrderedByIndex

    * Tests

    * Mocks

    * Dont test just the happy path

    * Compute and validate block time frames

    * [dbnode] Integration test for large tiles (#2478)

    * [dbnode] Create a virtual reverse index for a computed namespace

    * Return processedBlockCount from AggregateTiles

    * Improve error handling

    * Validate AggregateTilesOptions

    * Unnest read locks

    * Use default instead of constant

    * Fix test

    * minor refactoring

    * Addressed review feedback

    * Legal stuff

    * Refactor recorder

    * Allow using flat buffers rather than arrow

    * [dbnode] persist manager for large tiles

    * revert of .ci

    * minor

    * Adding better comparisons for arrow vs flat

    * Some fixes for query_data_files

    * An option to read from all shards

    * Fix large_tiles_test

    * Fix TestDatabaseAggregateTiles

    * Read data ordered by index

    * Generate mocks

    * Fix TestAggregateTiles

    * Group Read() results by id

    * Remodel CrossBlockReader as an Iterator

    * Mockgen

    * Erase slice contents before draining them

    * Resolve merge conflicts

    * Align with master

    * Integrate CrossBlockReader

    * Make a defensive copy of dataFileSetReaders

    * avoid panics

    * Improve TestNamespaceAggregateTiles

    * Added TODO on TestAggregateTiles

    * Align query_data_files

    * Mockgen

    * Added cross block iterator to be able to read multiple BlockRecords.
    Also removed concurrency from tile iterators and cleaned up utility

    * Add HandleCounterResets to AggregateTilesOptions

    * Additional tests and cleanup.

    * [dbnode] Large Tiles fs.writer experimental implementation

    * Implement DownsampleCounterResets

    * Improve readablitiy

    * Use pointer arguments to get the results

    * Reduce code duplication

    * Refine comments

    * Remove dependency on SeriesBlockFrame

    * [dbnode] Add OrderedByIndex option for DataFileSetReader.Open (#2465)

    * [dbnode] Cross-block series reader (#2481)

    * Fix build

    * Integrate DownsampleCounterResets

    * Introduce DownsampledValue struct

    * Update DownsampleCounterResets integration

    * Preallocate capacity for downsampledValues

    * Large tiles parrallel indexing.

    * Checkpoint fixed

    * Successful write/fetch with some hardcoded values.

    * Some FIXME solved

    * [dbnode] AggregateTiles RPC - minimal E2E flow (#2466)

    * minor fixes

    * codegen fix

    * Address feedback from PR 2477

    * TestShardAggregateTiles using 2 readers

    * Fix large_tiles_test.go

    * integration test fix

    * Bug fix and test

    * [large tiles] Fix refcounting in CrossBlockReader

    * Workaround for negative reference count

    * Integration test fix

    * [large-tiles] Try detect double finalize

    * [dbnode] Large tiles concurrency test

    * Batch writer is used for waster writes

    * race fix

    * Fix compilation error

    * Comment out some noise

    * Fix data races on time field (bootstrapManager, flushManager)

    * Fix misplaced wd.Add

    * Close context used for aggregation (in test code)

    * Close encoders during large tile writes

    * removing some debug code

    * Close series in test code

    * Move series.Close() after err check (test code)

    * Update to series frame API

    * Additional tests

    * PR

    * work on integ test

    * tags close

    * [large-tiles] Fix management of pooled objects

    * Fix query_data_files tool

    * Use mock checked bytes in cross_block_reader_test.go

    * query test

    * Query in place of fetch

    * test fix

    * Bug reproduced

    * Heavier concurrency test

    * Fix session related races in large_tiles_test.go

    * Fix string conversion

    * Use mocks for pooled objects in TestShardAggregateTiles

    * Add build integration tag

    * test fix

    * minor refactoring

    * increased the amount of series to 5k

    * Remove a noop

    * Log the finish of namespace aggregation

    * some minor refactorings

    * Streaming aggregation, reusing resources for each series

    * Do not allocate minHeapEntries

    * Cleanup

    * Fix query_data_files

    * Fix build

    * go.sum

    * Align with StreamingWriter API changes

    * Add FIXME WRT segment.Tail finalizing

    * Exclude query_data_files

    * Exclude counter_resets_downsampler.go

    * Remove arrow related code

    * Cleanup

    * Use explicit EncodedTags type

    * Rename processedBlockCount to processedTileCount

    * Fix build

    * Exclude read_index_ids changes

    * Address review feedback

    * Increase fetch timeout to stabilize TestAggregationAndQueryingAtHighConcurrency

    * Abort the writer in case of any error during aggregation

    Co-authored-by: Artem <artem@chronosphere.io>
    Co-authored-by: Gediminas <gediminas@chronosphere.io>

commit ac530fa
Author: Matt Schallert <mattschallert@gmail.com>
Date:   Thu Sep 24 14:07:46 2020 -0700

    [deps] Bump tchannel v1.12 -> v1.14 (#2659)

commit 62a31fc
Author: Rob Skillington <rob.skillington@gmail.com>
Date:   Thu Sep 24 15:06:15 2020 -0400

    [dbnode] Better defaults for block retriever config (#2664)

commit 767a517
Author: nate <nbroyles@gmail.com>
Date:   Thu Sep 24 10:11:56 2020 -0400

    [dbnode] Load and cache info files during bootstrap  (#2598)

commit 4d81e4a
Author: teddywahle <69990143+teddywahle@users.noreply.github.com>
Date:   Wed Sep 23 12:43:21 2020 -0700

    [query] Implemented the Graphite `highest` and `lowest` functions (#2623)

commit b4575f6
Author: arnikola <artem@chronosphere.io>
Date:   Wed Sep 23 10:07:11 2020 -0400

    [query] Add resolution exceeds query range warning (#2429)

Signed-off-by: ChrisChinchilla <chris@chronosphere.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants