Multiple writers for the same SST caused by closing a shard #990

Open
Rachelint opened this issue Jun 12, 2023 · 2 comments
Labels: bug (Something isn't working)

Comments

@Rachelint (Contributor)

Describe this problem

A shard may be moved away from a node when the process panics, or for any other reason. All operations related to such a shard should be stopped before the move (especially write operations).
However, background jobs (flush and compaction, both of which are writes) are currently not stopped properly. This causes a serious bug: multiple writers for one SST.

Server version

CeresDB Server
Version: 1.2.2
Git commit: 2e20665
Git branch: main
Opt level: 3
Rustc version: 1.69.0-nightly
Target: aarch64-apple-darwin
Build date: 2023-06-12T13:01:03.592984000Z

Steps to reproduce

Hard to reproduce; if you must, the following steps may work:

  • Set up a CeresDB cluster with CeresMeta.
  • Manually trigger a compaction/flush job for a specific table of a shard on one node.
  • Manually move the shard to another node via CeresMeta before the compaction/flush job finishes.
  • Manually trigger a compaction/flush job for the same table of the shard on the new node.

Expected behavior

No response

Additional Information

No response

Rachelint added the bug label on Jun 12, 2023
ShiKaiWi self-assigned this on Jun 16, 2023
@ShiKaiWi (Member)

After #998, updates following the closing of a shard will be forbidden. However, some SSTs may still be being written when the shard is closed, and these SSTs may share the same ids as new SSTs created on the new node, leading to multiple writers on the same SST.

Let's fix this problem in another PR. @baojinri

Rachelint pushed a commit that referenced this issue Jun 19, 2023
## Rationale
Part of #990.

Some background jobs are still allowed to execute, and this leads to
data corruption when a table is migrated between nodes, because of
multiple writers for the same table.

## Detailed Changes
Introduce a flag called `invalid` in the table data to denote whether
the serial executor is still valid. The flag is protected together with
the `TableOpSerialExecutor` in the table data, and the
`TableOpSerialExecutor` won't be acquired if the flag is set. In other
words, any table operation that requires the `TableOpSerialExecutor`
(updating the manifest, altering the table, and so on) can't be executed
after the flag is set. Finally, the flag is set when the table is closed.
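
The fix described above hinges on one rule: no table operation runs without first acquiring the `TableOpSerialExecutor`, and acquisition is refused once the table has been closed. Below is a minimal Rust sketch of that idea; the type and method names around the executor (`TableData`, `acquire_serial_exec`, `invalidate`) are hypothetical stand-ins for illustration, not the actual CeresDB code.

```rust
use std::sync::{Arc, Mutex, MutexGuard};

/// Hypothetical stand-in for the engine's per-table serial executor.
struct TableOpSerialExecutor {
    /// Set when the table is closed; no further table operations may run.
    invalid: bool,
}

/// Hypothetical table data holding the executor behind a lock, so the flag
/// is checked and flipped under the same protection as the executor itself.
struct TableData {
    serial_exec: Mutex<TableOpSerialExecutor>,
}

impl TableData {
    /// Try to acquire the serial executor for a table operation
    /// (write, flush, manifest update, alter table, ...).
    /// Returns `None` once the table has been closed.
    fn acquire_serial_exec(&self) -> Option<MutexGuard<'_, TableOpSerialExecutor>> {
        let guard = self.serial_exec.lock().unwrap();
        if guard.invalid {
            None
        } else {
            Some(guard)
        }
    }

    /// Called when the shard/table is closed: mark the executor invalid so
    /// any later background job fails to acquire it.
    fn invalidate(&self) {
        self.serial_exec.lock().unwrap().invalid = true;
    }
}

fn main() {
    let table = Arc::new(TableData {
        serial_exec: Mutex::new(TableOpSerialExecutor { invalid: false }),
    });

    // A background job (e.g. a flush) must acquire the executor before writing.
    assert!(table.acquire_serial_exec().is_some());

    // Closing the table flips the flag; subsequent acquisitions fail, so no
    // new writer can touch the table's SSTs on this node.
    table.invalidate();
    assert!(table.acquire_serial_exec().is_none());
}
```

In this sketch the flag lives behind the same lock as the executor, so a background job either already holds the executor when the close begins (and the close waits for the lock) or it observes the flag and backs off.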
MichaelLeeHZ pushed a commit to MichaelLeeHZ/ceresdb that referenced this issue Jun 21, 2023
ShiKaiWi added a commit that referenced this issue Jun 26, 2023
This reverts commit 85eb0b7.

## Rationale
The changes introduced by #998 are not reasonable. Another fix will
address #990.

## Detailed Changes
Revert #998
ShiKaiWi pushed a commit that referenced this issue Jul 7, 2023
## Rationale
Part of #990.

## Detailed Changes
- Abstract an `IdAllocator` in the `common_util` library
- Use `IdAllocator` to allocate SST ids
- Persist the `IdAllocator`'s `max id` to the manifest

## Test Plan
- Add a new unit test for `IdAllocator`
- Manual compatibility test: generate SSTs with an old ceresdb-server, then
deploy a new ceresdb-server with this changeset, write new data to generate
new SSTs, and check whether the SST ids are correct.
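
The key point of this change is that SST ids are only handed out from a range that has already been reserved in the manifest, so after a restart or a shard move the table resumes above the persisted bound and cannot reuse an id that might already back an SST. A rough sketch of such an allocator follows; the names (`MaxIdStore`, `alloc_step`) are assumptions for illustration, not the real `IdAllocator` API from `common_util`.

```rust
use std::sync::Mutex;

/// Hypothetical persistence hook; in CeresDB the upper bound would be
/// recorded in the table manifest.
trait MaxIdStore {
    fn persist_max_id(&self, max_id: u64) -> Result<(), String>;
}

struct Inner {
    next_id: u64,
    /// Ids below this bound have already been reserved via the store.
    persisted_max_id: u64,
}

/// Sketch of an id allocator: ids are only handed out below a persisted
/// upper bound, so a restart or shard move resumes above that bound and
/// can never reuse an id that might already back an SST.
struct IdAllocator<S: MaxIdStore> {
    inner: Mutex<Inner>,
    store: S,
    /// How many ids to reserve per persisted step.
    alloc_step: u64,
}

impl<S: MaxIdStore> IdAllocator<S> {
    fn new(last_persisted_max_id: u64, alloc_step: u64, store: S) -> Self {
        Self {
            inner: Mutex::new(Inner {
                // Resume strictly above anything that may have been allocated before.
                next_id: last_persisted_max_id,
                persisted_max_id: last_persisted_max_id,
            }),
            store,
            alloc_step,
        }
    }

    /// Allocate the next SST id, persisting a new upper bound first when the
    /// current reservation is exhausted.
    fn alloc_id(&self) -> Result<u64, String> {
        let mut inner = self.inner.lock().unwrap();
        if inner.next_id >= inner.persisted_max_id {
            let new_max = inner.next_id + self.alloc_step;
            // Persist before handing out any id from the new range.
            self.store.persist_max_id(new_max)?;
            inner.persisted_max_id = new_max;
        }
        let id = inner.next_id;
        inner.next_id += 1;
        Ok(id)
    }
}
```

In this sketch the bound is persisted in steps of `alloc_step` rather than once per id, so a manifest write is not needed on every SST creation; at most `alloc_step` ids are skipped after a crash, which is harmless for uniqueness.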
@ShiKaiWi (Member) commented Jul 7, 2023

> After #998, updates following the closing of a shard will be forbidden. However, some SSTs may still be being written when the shard is closed, and these SSTs may share the same ids as new SSTs created on the new node, leading to multiple writers on the same SST.
>
> Let's fix this problem in another PR. @baojinri

#1009 has fixed the problem. However, #998 didn't actually achieve its goal of preventing updates after a table is closed, and it has been reverted. I guess I'll submit another changeset to make everything work.

dust1 pushed a commit to dust1/ceresdb that referenced this issue Aug 9, 2023
dust1 pushed a commit to dust1/ceresdb that referenced this issue Aug 9, 2023