ledger: report catchpoint writing only when it actually started #5413
Conversation
@@ -1625,14 +1625,6 @@ func (au *accountUpdates) prepareCommit(dcc *deferredCommitContext) error {
	// verify version correctness : all the entries in the au.versions[1:offset+1] should have the *same* version, and the committedUpTo should be enforcing that.
	if au.versions[1] != au.versions[offset] {
		au.accountsMu.RUnlock()
This code is the same as catchpointtracker's handleUnorderedCommitOrError, which is called after commitRound / prepareCommit for all trackers in case of errors.
@@ -469,16 +469,13 @@ func (ct *catchpointTracker) produceCommittingTask(committedRound basics.Round,
	dcr.catchpointFirstStage = true

	if ct.enableGeneratingCatchpointFiles {
		// store non-zero ( all ones ) into the catchpointWriting atomic variable to indicate that a catchpoint is being written ( or, queued to be written )
the fix: moved to prepareCommit
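A minimal sketch of the relocated flag set, using toy stand-in types rather than the real ledger code (only the names prepareCommit, catchpointWriting, enableGeneratingCatchpointFiles, and catchpointFirstStage come from this diff; everything else is illustrative):

package sketch

import "sync/atomic"

// Toy stand-ins for the tracker and the deferred commit context.
type tracker struct {
	enableGeneratingCatchpointFiles bool
	catchpointWriting               int32
}

type commitContext struct {
	catchpointFirstStage bool
}

// After the move, the flag is stored here rather than in the task producer,
// so it is only ever set for a commit task that is actually being executed
// and can always be cleared by the same commit flow.
func (t *tracker) prepareCommit(dcc *commitContext) error {
	if t.enableGeneratingCatchpointFiles && dcc.catchpointFirstStage {
		atomic.StoreInt32(&t.catchpointWriting, 1)
	}
	return nil
}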
Codecov Report
@@             Coverage Diff              @@
##           master    #5413      +/-   ##
==========================================
- Coverage   55.40%   55.39%   -0.02%
==========================================
  Files         452      452
  Lines       63855    63854       -1
==========================================
- Hits        35379    35370       -9
- Misses      26044    26052       +8
  Partials     2432     2432
... and 4 files with indirect coverage changes
	}
}
if err != nil {
If another tracker overwrites the err with nil, it won't be passed to handleUnorderedCommitOrError?
Maybe nest the error handling right inside the iteration? Is this crazy?
for _, lt := range tr.trackers {
	err := lt.prepareCommit(dcc)
	if err != nil {
		tr.log.Errorf(err.Error())
		for _, lt := range tr.trackers {
			lt.handleUnorderedCommitOrError(dcc)
		}
		tr.mu.RUnlock()
		return err
	}
}
I had this before and did not really like it, so I moved it out of the outer loop.
Current approach is pretty straightforward to follow now.
Updated per Chris' review.
This patch changes how a communication boolean shared between the catchup service and the catchpoint file generation service is manipulated, in order to prevent a stall when slow catchup coincides with catchpoint file generation, related to our new design of dropping commit tasks if a commit is already in progress (#5214). This boolean is now set and unset in the commit flow solely by the catchpoint tracker, and is no longer set in the commit scheduler or unset by other trackers.
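To make that ownership explicit, here is a compact toy sketch of the pattern (handleUnorderedCommitOrError is the error hook discussed in this review; the other names are illustrative stand-ins, not the real go-algorand API):

package sketch

import "sync/atomic"

// catchpointFlagOwner stands in for the catchpoint tracker, the sole owner of
// the shared flag after this patch.
type catchpointFlagOwner struct {
	catchpointWriting int32
}

// Set only when the tracker itself starts the commit work (prepareCommit in the PR).
func (o *catchpointFlagOwner) startWriting() {
	atomic.StoreInt32(&o.catchpointWriting, 1)
}

// Cleared by the same tracker once the catchpoint data file work completes...
func (o *catchpointFlagOwner) finishWriting() {
	atomic.StoreInt32(&o.catchpointWriting, 0)
}

// ...and on the error/abort path, so neither the scheduler nor other trackers
// ever need to touch the flag.
func (o *catchpointFlagOwner) handleUnorderedCommitOrError() {
	atomic.StoreInt32(&o.catchpointWriting, 0)
}

// The catchup service only reads the flag to decide whether to pause.
func (o *catchpointFlagOwner) isWritingCatchpointDataFile() bool {
	return atomic.LoadInt32(&o.catchpointWriting) != 0
}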
I think the logic managing this atomic variable is not the easiest to follow, but it works.
Summary
The original implementation was setting this flag in the task producer, which created a gap between setting the flag and actually writing the catchpoint in commitSyncer.
Later, #5214 introduced the ability to skip tasks if the queue is full. This caused an issue with the catchup service, which was reading the flag via IsWritingCatchpointDataFile() and stopping the catchup. Because of that, the ledger was not receiving new blocks and was unable to schedule a new commit task, yet still had the catchpoint writing flag set from the previously discarded task.
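For illustration, a runnable toy of the failure mode described above; every name here is a hypothetical stand-in for the real scheduler and flag (only IsWritingCatchpointDataFile mirrors the real accessor's name):

package main

import (
	"fmt"
	"sync/atomic"
)

// catchpointWriting models the shared atomic flag.
var catchpointWriting int32

// taskQueue models the bounded commit-task queue; when it is full the
// producer silently drops the task (the behaviour introduced by #5214).
var taskQueue = make(chan struct{}, 1)

// produceCommittingTask models the old behaviour: the flag was set at enqueue
// time, before the task was guaranteed to run.
func produceCommittingTask(generateCatchpoint bool) {
	if generateCatchpoint {
		atomic.StoreInt32(&catchpointWriting, 1) // set eagerly
	}
	select {
	case taskQueue <- struct{}{}:
		// task enqueued; the consumer would eventually clear the flag
	default:
		// queue full: the task is dropped, but the flag stays set,
		// so the catchup service keeps seeing "writing in progress"
	}
}

func isWritingCatchpointDataFile() bool {
	return atomic.LoadInt32(&catchpointWriting) != 0
}

func main() {
	taskQueue <- struct{}{}                    // a commit is already queued
	produceCommittingTask(true)                // this task is dropped, flag never cleared
	fmt.Println(isWritingCatchpointDataFile()) // true: catchup would stall here
}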
Test Plan
Existing tests.
Manual catchup with catchpoints enabled up to round 170k.