Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance test_index_all to print state & status #3210

Closed

Conversation

portante
Copy link
Member

We also move the state/status checks to 45 seconds past the minute since the cron jobs run on the minute and we give them 3/4 of the minute to run before we perform a check (makes it so that in the server logs we get overlapping API calls with the cron job logs making easier to read the logs).


Fixes #3209.

@portante portante added this to the v0.72 milestone Jan 24, 2023
@portante portante requested a review from dbutenhof January 24, 2023 03:08
@portante portante self-assigned this Jan 24, 2023
Copy link
Member Author

@portante portante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Draft PR on which I have requested reviews because I'd like folks to see the test change in case anybody else has ideas on how to solve this problem.

dbutenhof
dbutenhof previously approved these changes Jan 24, 2023
@@ -1062,6 +1062,9 @@ def setvalue(dataset: Dataset, key: str, value: Any, user_id: Optional[str] = No
) from e
return

Database.db_session.connection(
execution_options={"isolation_level": "SERIALIZABLE"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still (despite spending too long tracking down the SQLAlchemy descriptions) not entirely certain of the implications for our code. But that's partly because SQLAlchemy's default automatic transaction model is a bit odd (and I believe at the heart of some of the semantic problems between our production and unit test modes). Still, this seems at least a worthwhile attempt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some extra diagnostics in sync.py to try to verify the suspected overlap when things go south; I've been unable to actually make it turn south despite repeatedly running it basically all day, easily dozens of time. Which makes me somewhat uncomfortable about assuming any "fix"; however I don't see this hurting and it's probably worthwhile. (I'm actually writing this partly just to do something while I wait, but also hoping that actually committing this message will force it to fail the next time...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Here's the detailed footprint, finally captured in my own manual testing with some additional Sync.update diagnostics (with logger boilerplate stripped to make it more readable). When Sync.update fetches the current operations, the value is reported along with the parameters: when it finishes, it logs what it actually stores:

unpack starting sync linpack_mock_2020.02.28T19.10.55: did Operation.UNPACK, enable: [<Operation.INDEX: 4>]: current ['BACKUP', 'UNPACK']
backup starting sync linpack_mock_2020.02.28T19.10.55: did Operation.BACKUP, enable: None: current ['BACKUP', 'UNPACK']
unpack finishing sync linpack_mock_2020.02.28T19.10.55: did Operation.UNPACK, enable: [<Operation.INDEX: 4>], storing {'BACKUP', 'INDEX'}
backup finishing sync linpack_mock_2020.02.28T19.10.55: did Operation.BACKUP, enable: None, storing {'UNPACK'}

The final two lines are the most critical: backup and unpack, overlapping, each modify the original ['BACKUP', 'UNPACK'] list to remove their own operation, and unpack adds 'INDEX'... so unpack stores 'BACKUP', 'INDEX' and then backup stores 'UNPACK', exactly as suspected.

I'm not convinced that the "isolation_level" is really what we want, and in fact this seems to be permanently changing the setting, globally, for our single session as soon as we try the first Metadata.setvalue?? Which is odd.

After re-running the functional tests a bazillion-trillion-gazillion times over the past few days to clearly reproduce and diagnose, I decided to spend some more time trying to get a nested transaction proper update working for Metadata.setvalue ... I've current got a sequence that SQLAlchemy finds acceptable in both PostgreSQL and sqlite3 (this turns out to be simpler but a bit more subtle than I'd thought from my previous head-butting encounters) and I'm trying to reproduce the problem now that it's "fixed".

If it works, I'll post a PR, and you can back out this serialization hack. (I hope.) Of course the worst case is that I can't reproduce it but it's not truly fixed ... the race isn't easy to provoke, although like any race it happens often enough when you're not looking ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: just passing through the many open PR tabs, and thought I'd update here.

After lots of experimentation and research, I suspect SERIALIZABLE probably is our best bet: because while I think we don't need that level, the only other level sqlite3 supports may be too lax. I still don't think this is the right change, however, as it changes our connection dynamically on the first Metadata.setvalue call, which may be a bad idea (and then repeatedly re-sets it to that value on each subsequent call). We should just set this statically on the engine when we first initialize it.

I'm also entirely convinced that simply upping the integrity level (while necessary) isn't nearly enough. I've completely rewritten Sync, but I'm still wrestling with unit tests prior to live testing. Hopefully I'll have good news soon... but at least I should have news. Either way, I think you should back out this change.

Copy link
Member

@npalaska npalaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think looks good so far.

@portante portante force-pushed the fix-test-index-all branch 3 times, most recently from 38ad651 to 17f1bf9 Compare February 5, 2023 15:04
@portante portante closed this Feb 7, 2023
@portante
Copy link
Member Author

portante commented Feb 7, 2023

Proper fix is in #3232.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Functional test for indexing, test_index_all, fails with missing unpack status
3 participants