Enhance `test_index_all` to print state & status #3210

portante · 2023-01-24T03:08:03Z

We also move the state/status checks to 45 seconds past the minute since the cron jobs run on the minute and we give them 3/4 of the minute to run before we perform a check (makes it so that in the server logs we get overlapping API calls with the cron job logs making easier to read the logs).

Fixes #3209.

portante

This is a Draft PR on which I have requested reviews because I'd like folks to see the test change in case anybody else has ideas on how to solve this problem.

dbutenhof · 2023-01-24T14:50:55Z

lib/pbench/server/database/models/datasets.py

@@ -1062,6 +1062,9 @@ def setvalue(dataset: Dataset, key: str, value: Any, user_id: Optional[str] = No
                ) from e
            return

+        Database.db_session.connection(
+            execution_options={"isolation_level": "SERIALIZABLE"}


I'm still (despite spending too long tracking down the SQLAlchemy descriptions) not entirely certain of the implications for our code. But that's partly because SQLAlchemy's default automatic transaction model is a bit odd (and I believe at the heart of some of the semantic problems between our production and unit test modes). Still, this seems at least a worthwhile attempt.

I added some extra diagnostics in sync.py to try to verify the suspected overlap when things go south; I've been unable to actually make it turn south despite repeatedly running it basically all day, easily dozens of time. Which makes me somewhat uncomfortable about assuming any "fix"; however I don't see this hurting and it's probably worthwhile. (I'm actually writing this partly just to do something while I wait, but also hoping that actually committing this message will force it to fail the next time...)

FYI: Here's the detailed footprint, finally captured in my own manual testing with some additional Sync.update diagnostics (with logger boilerplate stripped to make it more readable). When Sync.update fetches the current operations, the value is reported along with the parameters: when it finishes, it logs what it actually stores:

unpack starting sync linpack_mock_2020.02.28T19.10.55: did Operation.UNPACK, enable: [<Operation.INDEX: 4>]: current ['BACKUP', 'UNPACK'] backup starting sync linpack_mock_2020.02.28T19.10.55: did Operation.BACKUP, enable: None: current ['BACKUP', 'UNPACK'] unpack finishing sync linpack_mock_2020.02.28T19.10.55: did Operation.UNPACK, enable: [<Operation.INDEX: 4>], storing {'BACKUP', 'INDEX'} backup finishing sync linpack_mock_2020.02.28T19.10.55: did Operation.BACKUP, enable: None, storing {'UNPACK'}

The final two lines are the most critical: backup and unpack, overlapping, each modify the original ['BACKUP', 'UNPACK'] list to remove their own operation, and unpack adds 'INDEX'... so unpack stores 'BACKUP', 'INDEX' and then backup stores 'UNPACK', exactly as suspected.

I'm not convinced that the "isolation_level" is really what we want, and in fact this seems to be permanently changing the setting, globally, for our single session as soon as we try the first Metadata.setvalue?? Which is odd.

After re-running the functional tests a bazillion-trillion-gazillion times over the past few days to clearly reproduce and diagnose, I decided to spend some more time trying to get a nested transaction proper update working for Metadata.setvalue ... I've current got a sequence that SQLAlchemy finds acceptable in both PostgreSQL and sqlite3 (this turns out to be simpler but a bit more subtle than I'd thought from my previous head-butting encounters) and I'm trying to reproduce the problem now that it's "fixed".

If it works, I'll post a PR, and you can back out this serialization hack. (I hope.) Of course the worst case is that I can't reproduce it but it's not truly fixed ... the race isn't easy to provoke, although like any race it happens often enough when you're not looking ...

FYI: just passing through the many open PR tabs, and thought I'd update here.

After lots of experimentation and research, I suspect SERIALIZABLE probably is our best bet: because while I think we don't need that level, the only other level sqlite3 supports may be too lax. I still don't think this is the right change, however, as it changes our connection dynamically on the first Metadata.setvalue call, which may be a bad idea (and then repeatedly re-sets it to that value on each subsequent call). We should just set this statically on the engine when we first initialize it.

I'm also entirely convinced that simply upping the integrity level (while necessary) isn't nearly enough. I've completely rewritten Sync, but I'm still wrestling with unit tests prior to live testing. Hopefully I'll have good news soon... but at least I should have news. Either way, I think you should back out this change.

npalaska

I think looks good so far.

portante · 2023-02-07T16:21:10Z

Proper fix is in #3232.

portante added bug Server Functional Testing labels Jan 24, 2023

portante added this to the v0.72 milestone Jan 24, 2023

portante requested a review from dbutenhof January 24, 2023 03:08

portante self-assigned this Jan 24, 2023

portante commented Jan 24, 2023

View reviewed changes

portante requested review from npalaska and siddardh-ra January 24, 2023 03:09

portante mentioned this pull request Jan 24, 2023

Rework database modules and class names #3191

Merged

dbutenhof previously approved these changes Jan 24, 2023

View reviewed changes

portante dismissed dbutenhof’s stale review via 7e50d1e January 25, 2023 03:39

portante force-pushed the fix-test-index-all branch from 4295a0a to 7e50d1e Compare January 25, 2023 03:39

npalaska reviewed Jan 25, 2023

View reviewed changes

portante force-pushed the fix-test-index-all branch 3 times, most recently from 38ad651 to 17f1bf9 Compare February 5, 2023 15:04

Shot in the dark: use serializable isolation level

31f15b5

portante force-pushed the fix-test-index-all branch from 17f1bf9 to 31f15b5 Compare February 7, 2023 16:20

portante closed this Feb 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance `test_index_all` to print state & status #3210

Enhance `test_index_all` to print state & status #3210

portante commented Jan 24, 2023

portante left a comment

dbutenhof Jan 24, 2023

dbutenhof Jan 26, 2023

dbutenhof Jan 27, 2023

dbutenhof Feb 2, 2023

npalaska left a comment

portante commented Feb 7, 2023

Enhance test_index_all to print state & status #3210

Enhance test_index_all to print state & status #3210

Conversation

portante commented Jan 24, 2023

portante left a comment

Choose a reason for hiding this comment

dbutenhof Jan 24, 2023

Choose a reason for hiding this comment

dbutenhof Jan 26, 2023

Choose a reason for hiding this comment

dbutenhof Jan 27, 2023

Choose a reason for hiding this comment

dbutenhof Feb 2, 2023

Choose a reason for hiding this comment

npalaska left a comment

Choose a reason for hiding this comment

portante commented Feb 7, 2023

Enhance `test_index_all` to print state & status #3210

Enhance `test_index_all` to print state & status #3210