
Region stall never recovers #4475

Closed
v0y4g3r opened this issue Jul 31, 2024 · 1 comment · Fixed by #4476
Labels
C-bug Category Bugs

Comments


v0y4g3r commented Jul 31, 2024

What type of bug is this?

Locking issue, Performance issue

What subsystems are affected?

Storage Engine

Minimal reproduce step

Ingest a large amount of data into partitioned tables with multiple regions.

What did you expect to see?

Data ingestion is expected to recover when the flush finishes.

What did you see instead?

Region writes stall forever, which can be observed from the greptime_mito_write_stall_total gauge.

What operating system did you use?

NA

What version of GreptimeDB did you use?

0.9.0

Relevant log output and stack trace

No response

v0y4g3r added the C-bug Category Bugs label Jul 31, 2024

evenyag commented Jul 31, 2024

I added some logs:

2024-07-31T07:59:56.617890172Z stdout F 2024-07-31T07:59:56.617805Z  INFO mito2::flush: Successfully flush memtables, region: 4398046511110(1024, 6), reason: EngineFull, files: [FileId(9c28caa2-8a00-4fd8-aefc-092a1996434d)], cost: 1.6947113599999999s
2024-07-31T07:59:56.61790186Z stdout F 2024-07-31T07:59:56.617822Z  INFO mito2::flush: Applying RegionEdit { files_to_add: [FileMeta { region_id: 4398046511110(1024, 6), file_id: FileId(9c28caa2-8a00-4fd8-aefc-092a1996434d), time_range: (1686444120000000000::Nanosecond, 1686595620000000000::Nanosecond), level: 0, file_size: 16235235, available_indexes: [InvertedIndex], index_file_size: 6230724, num_rows: 2000000, num_row_groups: 20 }], files_to_remove: [], compaction_time_window: None, flushed_entry_id: Some(6503), flushed_sequence: Some(18959992) } to region 4398046511110(1024, 6)
2024-07-31T07:59:56.643776318Z stdout F 2024-07-31T07:59:56.643699Z  INFO mito2::worker::handle_flush: Region 4398046511110(1024, 6) flush finished, tries to bump wal to 6503

2024-07-31T07:59:56.643791976Z stdout F 2024-07-31T07:59:56.643741Z  INFO mito2::worker::handle_write: Worker handle stalled requests, worker: 0, num_requests: 0
2024-07-31T07:59:56.650303335Z stdout F 2024-07-31T07:59:56.650247Z  INFO mito2::worker::handle_write: Worker handle stalled requests, worker: 1, num_requests: 0
2024-07-31T07:59:58.592379556Z stdout F 2024-07-31T07:59:58.592320Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 1
2024-07-31T07:59:58.605009141Z stdout F 2024-07-31T07:59:58.604949Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 2
2024-07-31T07:59:58.612022871Z stdout F 2024-07-31T07:59:58.611974Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 3
2024-07-31T07:59:58.616561729Z stdout F 2024-07-31T07:59:58.616509Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 4
2024-07-31T07:59:58.744621587Z stdout F 2024-07-31T07:59:58.744559Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 5
2024-07-31T07:59:58.763180051Z stdout F 2024-07-31T07:59:58.763076Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 6
2024-07-31T07:59:59.065750274Z stdout F 2024-07-31T07:59:59.065700Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 7
2024-07-31T07:59:59.077848538Z stdout F 2024-07-31T07:59:59.077806Z  INFO mito2::worker::handle_write: Stall write requests, worker: 0, total_requests: 8

2024-07-31T07:59:59.873759492Z stdout F 2024-07-31T07:59:59.873501Z  INFO mito2::memtable: Reduce write buffer to 866444281
2024-07-31T08:00:02.285816949Z stdout F 2024-07-31T08:00:02.285623Z  INFO mito2::memtable: Reduce write buffer to 658996402
2024-07-31T08:00:07.831858378Z stdout F 2024-07-31T08:00:07.831632Z  INFO mito2::memtable: Reduce write buffer to 451546912

When the flush finishes, the stalled requests are processed before the memtable is released (I guess the flush task is the one releasing the memtable). At that point the global write buffer size is still high, so the worker stalls those write requests again. If all writers are blocked on this worker and no other worker is handling write requests, nothing ever retries the stalled queue, so the worker stalls those requests forever.
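The ordering bug described above can be sketched with a minimal model. This is illustrative only: the `Worker` struct, field names, and method names are hypothetical and do not match the real mito2 API; the point is the order of "retry stalled requests" versus "release the flushed memtable", and one plausible fix of swapping that order.

```rust
// Hypothetical minimal model of the stall logic (names are illustrative,
// not the real GreptimeDB mito2 API). A write stalls when the global
// write buffer is over the limit; stalled requests are only retried when
// a worker explicitly handles the stalled queue again.
struct Worker {
    buffer_bytes: usize,        // global write buffer usage (shared in reality)
    limit: usize,               // write buffer limit
    stalled: Vec<&'static str>, // queued (stalled) write requests
}

impl Worker {
    fn handle_write(&mut self, req: &'static str) {
        if self.buffer_bytes > self.limit {
            // Stall: wait for a flush to free memory.
            self.stalled.push(req);
        }
        // Otherwise the write is accepted (omitted here).
    }

    fn retry_stalled(&mut self) {
        let pending = std::mem::take(&mut self.stalled);
        for req in pending {
            self.handle_write(req); // re-stalls if the buffer is still high
        }
    }

    // Order observed in the issue: stalled requests are retried *before*
    // the flushed memtable is released, so the buffer is still over the
    // limit and every request is stalled again. With no later retry
    // trigger, the queue is stuck forever.
    fn on_flush_finished_buggy(&mut self, freed: usize) {
        self.retry_stalled();       // buffer still high -> all re-stalled
        self.buffer_bytes -= freed; // memtable released too late
    }

    // Plausible fix: release the memtable first, then retry.
    fn on_flush_finished_fixed(&mut self, freed: usize) {
        self.buffer_bytes -= freed;
        self.retry_stalled();
    }
}

fn main() {
    let mut w = Worker { buffer_bytes: 120, limit: 100, stalled: vec![] };
    w.handle_write("insert-1");
    w.on_flush_finished_buggy(50);
    // Buffer is now under the limit, yet the request is still stalled:
    println!("buggy: stalled = {}", w.stalled.len()); // 1

    let mut w = Worker { buffer_bytes: 120, limit: 100, stalled: vec![] };
    w.handle_write("insert-1");
    w.on_flush_finished_fixed(50);
    println!("fixed: stalled = {}", w.stalled.len()); // 0
}
```

Running the model, the buggy ordering leaves the request stalled even though the buffer dropped below the limit, while the fixed ordering drains the queue.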
