(7/7) FileManager.java Cleanup and Audit #3690

julianknutsen · 2019-11-25T23:41:05Z

2208003

Motivation

There was suspicion of a corruption bug inside FileManager. In order to unblock a review, I went through and audited the code and cleaned it up to make the concurrency more obvious.

At the end of the analysis, there was no corruption bug regarding the TODO in FileManager.java:75, but the interactions between the persistable field and the savePending flag were hard to understand. This has been updated with comments and an AtomicReference to make it clearer.

In some interleavings, two writeToFile calls for the exact same data could have occurred, and this remains the case, since references are passed in whose underlying data can change while saveToFile is running on another thread. That is why many implementations use a concurrent data structure. It is worth pointing out, though, that not all do, and that could be the source of some strange bugs.

Analysis

The comment was concerned with a situation where an in-progress saveFileTask would allow a UserThread saveLater call to schedule another write. There are two reasons this is OK here:

Reference writes are atomic
saveToFile is synchronized

In the event that an in-progress saveLater call spawned another saveFileTask, only one would be allowed to write the file at a time.

It is possible that the second call could write the file first, but since all callers share the same persistable reference, both saveToFile calls will write the latest data.
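A minimal sketch of the second point, assuming a hypothetical `SynchronizedWriterSketch` class rather than the real FileManager. Two threads call a synchronized `saveToFile` with the same reference, and a counter records how many callers were ever inside the method at once. The sleep only stands in for the disk write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for FileManager; only the locking behavior is modeled.
class SynchronizedWriterSketch {
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger maxActive = new AtomicInteger();
    private final AtomicInteger writes = new AtomicInteger();

    // synchronized: a second caller blocks until the first one returns.
    synchronized void saveToFile(Object persistable) {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(20); // placeholder for the actual file write
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        writes.incrementAndGet();
        active.decrementAndGet();
    }

    int maxConcurrentWriters() {
        return maxActive.get();
    }

    int totalWrites() {
        return writes.get();
    }

    // Starts two competing writers that share one persistable reference.
    static SynchronizedWriterSketch runTwoWriters() {
        SynchronizedWriterSketch manager = new SynchronizedWriterSketch();
        Object persistable = new Object();
        Thread first = new Thread(() -> manager.saveToFile(persistable));
        Thread second = new Thread(() -> manager.saveToFile(persistable));
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return manager;
    }
}
```

Both writes complete, but never at the same moment. Which thread goes first is not defined, which matches the note above that the second call may write before the first.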

Bug Fix

There was one strange behavior that was fixed in this PR. If the first saveLater call had a large delay it would override any future saveLater delays until the original one was finished. This was because the first saveLater set savePending so all future saveLater calls returned early without scheduling a thread.

The update causes all requests to spawn a task so if the second saveLater call has a shorter delay, it will run and batch with the first call. The second task will finally get scheduled and it will immediately exit since there is no work to do.
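The updated behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the class `SaveLaterSketch`, the field `nextWrite`, and the counters are hypothetical names, and a single AtomicReference stands in for the pending-write state.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for FileManager; names and structure are assumptions.
class SaveLaterSketch<T> {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    // Object waiting to be written; null means there is no pending work.
    private final AtomicReference<T> nextWrite = new AtomicReference<>();
    private final AtomicReference<T> lastWritten = new AtomicReference<>();
    private final AtomicInteger writes = new AtomicInteger();
    private final AtomicInteger emptyRuns = new AtomicInteger();

    // Every request schedules a task, so a short delay is never masked
    // by an earlier request that asked for a long delay.
    void saveLater(T persistable, long delayMs) {
        nextWrite.set(persistable);
        Runnable task = this::runSaveTask;
        executor.schedule(task, delayMs, TimeUnit.MILLISECONDS);
    }

    private void runSaveTask() {
        // Atomically claim the pending object; a later task finds null.
        T toWrite = nextWrite.getAndSet(null);
        if (toWrite == null) {
            emptyRuns.incrementAndGet(); // nothing to do, exit immediately
            return;
        }
        saveToFile(toWrite);
    }

    private synchronized void saveToFile(T persistable) {
        lastWritten.set(persistable); // placeholder for the real file write
        writes.incrementAndGet();
    }

    // Lets already scheduled tasks run, then stops the executor.
    boolean shutdownAndWait() {
        executor.shutdown();
        try {
            return executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int writes() { return writes.get(); }
    int emptyRuns() { return emptyRuns.get(); }
    T lastWritten() { return lastWritten.get(); }
}
```

With `saveLater(a, 300)` followed by `saveLater(b, 10)`, the short-delay task runs first and performs the single write. The long-delay task runs later, finds nothing pending, and returns early.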

Future Work

I think if the end goal is to have all writes completely thread-safe, you would need to do something like have PersistableEnvelope define a clone() function that all subclasses implement. The FileManager could then just clone the object prior to passing it off to the writing thread and have some guarantees.

That is way outside the scope of this work, but it may be a good piece for someone else to pick up.
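As a rough illustration of the clone idea, here is a sketch with hypothetical types. `CopyableEnvelope`, `ItemListEnvelope`, and `copy()` are assumptions made for this example; the real PersistableEnvelope defines no such method today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: stands in for a PersistableEnvelope with a clone-style method.
interface CopyableEnvelope {
    CopyableEnvelope copy();
}

// Hypothetical persistable backed by a plain, non-thread-safe list.
class ItemListEnvelope implements CopyableEnvelope {
    final List<String> items;

    ItemListEnvelope(List<String> items) {
        this.items = items;
    }

    @Override
    public ItemListEnvelope copy() {
        // Copies the list so later changes to the original are not visible.
        return new ItemListEnvelope(new ArrayList<>(items));
    }
}

class SnapshotSketch {
    // Returns {size seen by the writer, size of the live object}.
    static int[] snapshotThenMutate() {
        ItemListEnvelope live =
                new ItemListEnvelope(new ArrayList<>(Arrays.asList("a", "b")));
        // The manager would take this snapshot before handing off to the writer thread.
        ItemListEnvelope snapshot = live.copy();
        // The caller keeps mutating the live object afterwards.
        live.items.add("c");
        return new int[] { snapshot.items.size(), live.items.size() };
    }
}
```

The writer only ever sees the snapshot, so a concurrent change to the live object cannot alter what is written. The copy costs time and memory proportional to the size of the object.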

Instead of using a subclass that overwrites a value, utilize Guice to inject the real value of 10000 in the app and let the tests overwrite it with their own.

Remove unused imports and clean up some access modifiers now that the final test structure is complete

Previously, this interface was called each time an item was changed. This required listeners to understand performance implications of multiple adds or removes in a short time span. Instead, give each listener the ability to process a list of added or removed entries, which can help them avoid performance issues. This patch is just a refactor. Each listener is called once for each ProtectedStorageEntry. Future patches will change this.

Minor performance overhead for constructing MapEntry and Collections of one element, but keeps the code cleaner and all removes can still use the same logic to remove from map, delete from data store, signal listeners, etc. The MapEntry type is used instead of Pair since it will require fewer operations when this is eventually used in the removeExpiredEntries path.

…batch All current users still call this one-at-a-time. But, it gives the ability for the expire code path to remove in a batch.

This will cause HashMapChangedListeners to receive just one onRemoved() call for the expire work instead of multiple onRemoved() calls for each item. This required a bit of updating for the remove validation in tests so that it correctly compares onRemoved with multiple items.
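A small sketch of the batching idea, using hypothetical names (`BatchRemovedListener`, `BatchRemoveSketch`). The real HashMapChangedListener signature and the P2PDataStorage remove path may differ.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical listener shape that receives all removed entries at once.
interface BatchRemovedListener<V> {
    void onRemoved(Collection<V> removedEntries);
}

// Hypothetical store; only the map removal and the listener signal are modeled.
class BatchRemoveSketch<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final List<BatchRemovedListener<V>> listeners = new ArrayList<>();

    void put(K key, V value) {
        map.put(key, value);
    }

    void addListener(BatchRemovedListener<V> listener) {
        listeners.add(listener);
    }

    int size() {
        return map.size();
    }

    // Removes every key first, then signals each listener once at the end.
    void removeAll(Collection<K> keys) {
        List<V> removed = new ArrayList<>();
        for (K key : keys) {
            V value = map.remove(key);
            if (value != null) {
                removed.add(value);
            }
        }
        if (!removed.isEmpty()) {
            for (BatchRemovedListener<V> listener : listeners) {
                listener.onRemoved(removed);
            }
        }
    }
}
```

A caller that still removes one item at a time would pass a collection of one element, which keeps a single code path for both cases.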

…ch removes bisq-network#3143 identified an issue that tempProposals listeners were being signaled once for each item that was removed during the P2PDataStore operation that expired old TempProposal objects. Some of the listeners are very expensive (ProposalListPresentation::updateLists()) which results in large UI performance issues. Now that the infrastructure is in place to receive updates from the P2PDataStore in a batch, the ProposalService can apply all of the removes received from the P2PDataStore at once. This results in only 1 onChanged() callback for each listener. The end result is that updateLists() is only called once and the performance problems are reduced. This removes the need for bisq-network#3148 and those interfaces will be removed in the next patch.

Now that the only user of this interface has been removed, go ahead and delete it. This is a partial revert of f5d75c4 that includes the code that was added into ProposalService that subscribed to the P2PDataStore.

Write a test that shows the incorrect behavior for bisq-network#3629, the hashmap is rebuilt from disk using the 20-byte key instead of the 32-byte key.

Addresses the first half of bisq-network#3629 by ensuring that the reconstructed HashMap always has the 32-byte key for each payload. It turns out, the TempProposalStore persists the ProtectedStorageEntrys on-disk as a List and doesn't persist the key at all. Then, on reconstruction, it creates the 20-byte key for its internal map. The fix is to update the TempProposalStore to use the 32-byte key instead. This means that all writes, reads, and reconstruction of the TempProposalStore use the 32-byte key, which matches perfectly with the in-memory map of the P2PDataStorage that expects 32-byte keys. Important to note that until all seednodes receive this update, nodes will continue to have both the 20-byte and 32-byte keys in their HashMap.

Addresses the second half of bisq-network#3629 by using the HashMap, not the protectedDataStore to generate the known keys in the requestData path. This won't have any bandwidth reduction until all seednodes have the update and only have the 32-byte key in their HashMap. fixes bisq-network#3629

The only user has been migrated to getMap(). Delete it so future development doesn't have the same 20-byte vs 32-byte key issue.

In order to implement remove-before-add behavior, we need a way to verify that the SequenceNumberMap was the only item updated.

It is possible to receive a RemoveData or RemoveMailboxData message before the relevant AddData, but the current code does not handle it. This results in internal state updates and signal handlers being called when an Add is received with a lower sequence number than a previously seen Remove. Minor test validation changes to allow tests to specify that only the SequenceNumberMap should be written during an operation.
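The remove-before-add case can be pictured with a heavily simplified model. `RemoveBeforeAddSketch` and its string keys are hypothetical and not the P2PDataStorage implementation. Only the strictly-lower sequence number case described above is modeled; how equal sequence numbers are handled is left out on purpose.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a sequence number map plus payload map.
class RemoveBeforeAddSketch {
    private final Map<String, Integer> sequenceNumberMap = new HashMap<>();
    private final Map<String, String> payloads = new HashMap<>();

    // A remove records its sequence number even when no payload is stored yet.
    boolean remove(String key, int sequenceNumber) {
        Integer seen = sequenceNumberMap.get(key);
        if (seen != null && sequenceNumber < seen) {
            return false; // stale remove, nothing changes
        }
        sequenceNumberMap.put(key, sequenceNumber);
        payloads.remove(key);
        return true;
    }

    // An add older than a recorded sequence number updates no state.
    boolean add(String key, String payload, int sequenceNumber) {
        Integer seen = sequenceNumberMap.get(key);
        if (seen != null && sequenceNumber < seen) {
            return false; // the remove was seen first, ignore the late add
        }
        sequenceNumberMap.put(key, sequenceNumber);
        payloads.put(key, payload);
        return true;
    }

    boolean contains(String key) {
        return payloads.containsKey(key);
    }
}
```

Here a remove with sequence number 5 that arrives first causes a late add with sequence number 4 to be ignored, while an add with sequence number 6 is accepted.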

Now that we have introduced remove-before-add, we need a way to validate that the SequenceNumberMap was written, but nothing else. Add this feature to the validation path.

In order to aid in propagation of remove() messages, broadcast them in the event the remove is seen before the add.

Now that there are cases where the SequenceNumberMap and Broadcast are called, but no other internal state is updated, the existing helper functions conflate too many decisions. Remove them in favor of explicitly defining each state change expected.

Fix a bug introduced in d484617 that did not properly handle a valid use case for duplicate sequence numbers. For in-memory-only ProtectedStoragePayloads, the client nodes need a way to reconstruct the Payloads after startup from peer and seed nodes. This involves sending a ProtectedStorageEntry with a sequence number that is equal to the last one the client had already seen. This patch adds tests to confirm the bug and fix as well as the changes necessary to allow adding of Payloads that were previously seen, but removed during a restart.

Although the code was correct, it was hard to understand the relationship between the to-be-written object and the savePending flag. Trade two dependent atomics for one and comment the code to make it more clear for the next reader.

Fix a bug in the FileManager where a saveLater called with a low delay won't execute until the delay specified by a previous saveLater call. The trade off here is the execution of a task that returns early vs. losing the requested delay.

Only one caller after deadcode removal.

julianknutsen · 2019-11-25T23:48:36Z

@freimair @chimp1984 This is what I found in the day of auditing.

I agree that some users pass in references that are not thread-safe, but fixing that requires a real design and implementation plan that is outside of the scope of my current work. I've outlined a potential solution in the Future Work section for anyone else who is interested or assigned to fix that technical debt.

Feel free to take this or trash it. I'm also happy to just close my open PR that fixes the persistence bug in TempProposalStore. The amount of overhead maintaining these patches for weeks isn't worth my time if they are known issues and we don't have the resources to fix them.

chimp1984 · 2019-11-26T00:36:21Z

Thanks @julianknutsen for looking into it and fixing issues there. I agree that a more extensive improvement is outside of the current scope. Cloning each data structure might become expensive for larger data structures like DaoState or TradeStatistics.

For a larger refactoring in that area we should probably consider using a database solution instead of the simple file-based approach.

freimair

utAck

ripcurlx

utACK

ripcurlx · 2019-11-26T13:32:47Z

For documentation purpose. Reviews happened at:

julianknutsen added 22 commits November 19, 2019 08:30

[PR COMMENTS] Make maxSequenceNumberBeforePurge final

617585d

Instead of using a subclass that overwrites a value, utilize Guice to inject the real value of 10000 in the app and let the tests overwrite it with their own.

[TESTS] Clean up 'Analyze Code' warnings

3bd67ba

Remove unused imports and clean up some access modifiers now that the final test structure is complete

Change removeFromMapAndDataStore to signal listeners at the end in a …

489b25a

…batch All current users still call this one-at-a-time. But, it gives the ability for the expire code path to remove in a batch.

Remove HashmapChangedListener::onBatch operations

a8139f3

Now that the only user of this interface has been removed, go ahead and delete it. This is a partial revert of f5d75c4 that includes the code that was added into ProposalService that subscribed to the P2PDataStore.

[TESTS] Regression test for bisq-network#3629

849155a

Write a test that shows the incorrect behavior for bisq-network#3629, the hashmap is rebuilt from disk using the 20-byte key instead of the 32-byte key.

[DEAD CODE] Remove getProtectedDataStoreMap

793e84d

The only user has been migrated to getMap(). Delete it so future development doesn't have the same 20-byte vs 32-byte key issue.

[TESTS] Allow tests to validate SequenceNumberMap write separately

526aee5

In order to implement remove-before-add behavior, we need a way to verify that the SequenceNumberMap was the only item updated.

[TESTS] Allow remove() verification to be more flexible

931c1f4

Now that we have introduced remove-before-add, we need a way to validate that the SequenceNumberMap was written, but nothing else. Add this feature to the validation path.

Broadcast remove-before-add messages to P2P network

0472ffc

In order to aid in propagation of remove() messages, broadcast them in the event the remove is seen before the add.

Clean up AtomicBoolean usage in FileManager

2208003

Although the code was correct, it was hard to understand the relationship between the to-be-written object and the savePending flag. Trade two dependent atomics for one and comment the code to make it more clear for the next reader.

[DEADCODE] Clean up FileManager.java

1895802

[REFACTOR] Inline saveNowInternal

685824b

Only one caller after deadcode removal.

julianknutsen marked this pull request as ready for review November 26, 2019 01:53

julianknutsen requested a review from ripcurlx as a code owner November 26, 2019 01:53

julianknutsen mentioned this pull request Nov 26, 2019

For Cycle 8 bisq-network/compensation#413

Closed

freimair approved these changes Nov 26, 2019

freimair mentioned this pull request Nov 26, 2019

For Cycle 8 bisq-network/compensation#411

Closed

ripcurlx approved these changes Nov 26, 2019

ripcurlx merged commit 66b2306 into bisq-network:master Nov 26, 2019

julianknutsen deleted the filemanager-bug branch November 26, 2019 16:59

ripcurlx mentioned this pull request Dec 13, 2019

For Cycle 8 bisq-network/compensation#439

Closed
