
Prevent pending run_pending_tasks of future::Cache from causing busy loop in schedule_write_op #415

Merged
tatsuya6502 merged 5 commits into main from avoid-async-scheduler-busy-loop on Apr 11, 2024

Conversation

@tatsuya6502 (Member) commented on Apr 10, 2024

Fixes #412.

Summary

This PR prevents the async runtime's schedulers from going into infinite busy loops in an internal schedule_write_op method when there are pending run_pending_tasks calls on the future::Cache.

  • This bug was introduced in v0.12.0 when the background threads were removed from future::Cache.
  • This bug can occur when the run_pending_tasks method is called by user code while the cache is receiving a high number of concurrent write operations such as insert, get_with or invalidate.
  • When it occurs, the schedule_write_op method spins in a busy loop forever, causing high CPU usage and starving all other async tasks.

The fix replaces an async_lock::RwLock used in the schedule_write_op method with an event notification mechanism provided by the event-listener crate. (Note that event-listener is used internally by async_lock::RwLock too.)

Other Changes

This PR also does the following:

  • Bumps the moka version to v0.12.6.
  • Upgrades the async-lock crate used by future::Cache to the latest version, 3.3. (This is not related to the bug.)

The Root Cause

When a cache write operation such as insert is performed on future::Cache, the schedule_write_op method is called internally to send a write operation log to a channel. Later, an internal do_run_pending_tasks method is called and receives the log from the channel. The channel is bounded, so it can only hold a fixed number of items (logs). When the channel is full, schedule_write_op fails to send a log to the channel and retries until it succeeds.
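To illustrate, here is a simplified sketch of that retry, not moka's actual code; the channel type, the `WriteOp` variants, and the function name are assumptions for the example:

```rust
use crossbeam_channel::{Sender, TrySendError};

/// Hypothetical, simplified write operation log entry.
enum WriteOp {
    Insert, // the real variants would carry key/value data
    Remove,
}

/// Conceptual sketch of the retry: keep calling `try_send` until the bounded
/// channel accepts the log. Without a way to park the task, this loop spins
/// for as long as the channel stays full.
async fn schedule_write_op_sketch(ch: &Sender<WriteOp>, mut op: WriteOp) {
    loop {
        match ch.try_send(op) {
            Ok(()) => return,
            // Channel is full; take the op back and retry.
            Err(TrySendError::Full(returned_op)) => op = returned_op,
            // Receiver side is gone; nothing more to do in this sketch.
            Err(TrySendError::Disconnected(_)) => return,
        }
    }
}
```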

future::Cache used an async_lock::RwLock to keep the retry loop from taking all CPU time. The loop waits for a read lock to become available, so that the scheduler threads of the async runtime can do other work, including running do_run_pending_tasks. do_run_pending_tasks acquires an exclusive write lock at the beginning, so the retry loop keeps waiting while do_run_pending_tasks is running.

The problem with the current implementation is that the retry loop keeps spinning when do_run_pending_tasks is not running, because no write lock is held. The retry loop can therefore occupy a scheduler thread and never yield to other async tasks until do_run_pending_tasks starts. If enough retry loops are spinning in parallel, they occupy all the scheduler threads, preventing do_run_pending_tasks from ever starting! The occupied schedulers then keep the retry loops spinning forever.
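A minimal sketch of that pre-fix waiting strategy, assuming an illustrative `write_op_lock` field (the real field name in moka may differ):

```rust
use async_lock::RwLock;

/// Sketch of the old waiting strategy used inside the retry loop.
/// `do_run_pending_tasks` holds the write half of the lock while it drains
/// the channel.
async fn wait_until_channel_drained(write_op_lock: &RwLock<()>) {
    // If the write lock is held, this await parks the task, letting the
    // scheduler run other futures, including do_run_pending_tasks.
    //
    // If the write lock is NOT held, the read lock is granted immediately and
    // the surrounding retry loop spins again without yielding -- the busy
    // loop described above.
    let _read_guard = write_op_lock.read().await;
}
```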

The Fix

This PR fixes the problem by replacing the RwLock with an event notification mechanism provided by the event-listener crate. The retry loop in schedule_write_op waits for an event to arrive regardless of whether do_run_pending_tasks is running. This ensures the async schedulers are never occupied by the retry loop in schedule_write_op, so they can run do_run_pending_tasks whenever needed.
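A sketch of the fixed retry loop under the same illustrative assumptions (a crossbeam_channel bounded channel and an Event named `write_op_ch_ready`; the real names in moka may differ):

```rust
use crossbeam_channel::{Sender, TrySendError};
use event_listener::Event;

/// Sketch of the fixed retry loop: park on an `Event` instead of awaiting a
/// read lock, so the task always yields while the channel is full, whether or
/// not do_run_pending_tasks is currently running.
async fn schedule_write_op_fixed_sketch<T>(
    ch: &Sender<T>,
    write_op_ch_ready: &Event,
    mut op: T,
) {
    loop {
        match ch.try_send(op) {
            Ok(()) => return,
            Err(TrySendError::Full(returned_op)) => {
                op = returned_op;
                // Register the listener first, then re-check the channel, so
                // a notification fired between try_send and listen() is not
                // lost.
                let listener = write_op_ch_ready.listen();
                if !ch.is_full() {
                    continue;
                }
                // Park until a notification arrives (see the conditions
                // listed below).
                listener.await;
            }
            Err(TrySendError::Disconnected(_)) => return,
        }
    }
}
```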

The event is sent when one of the following conditions is met:

  • The do_run_pending_tasks method has removed some logs from the channel.
  • The Housekeeper's run_pending_tasks or try_run_pending_tasks method has released the lock on a Mutex called current_task.
    • current_task is used to ensure that only one run_pending_tasks or try_run_pending_tasks call runs at a time.
    • run_pending_tasks and try_run_pending_tasks call do_run_pending_tasks.

The latter is needed to make the retry loop spin once more. This may start do_run_pending_tasks again, because the retry loop calls try_run_pending_tasks if necessary before resending the log to the channel (see the notify-side sketch after the note below).

NOTE: You may wonder why the spinning retry loop cannot start do_run_pending_tasks if the loop itself calls try_run_pending_tasks. This is because, if any run_pending_tasks calls are waiting for the current_task lock to become available, one of them is given the right to acquire the lock, and no other caller can take it. Unless an async scheduler runs exactly the task that holds that right, do_run_pending_tasks does not start. See the starved lock in the description of #412 for the details.
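For completeness, the notify side of the sketch above could look like this; the struct and method names here are illustrative, not moka's actual API:

```rust
use event_listener::Event;

/// Illustrative holder for the notification event.
struct HousekeeperSketch {
    write_op_ch_ready: Event,
}

impl HousekeeperSketch {
    /// Called after do_run_pending_tasks has removed some logs from the
    /// channel: wake every schedule_write_op retry loop that is parked.
    fn notify_channel_has_room(&self) {
        self.write_op_ch_ready.notify(usize::MAX);
    }

    /// Called after run_pending_tasks / try_run_pending_tasks releases the
    /// `current_task` mutex, so a parked retry loop spins once more and can
    /// call try_run_pending_tasks itself if needed.
    fn notify_current_task_released(&self) {
        self.write_op_ch_ready.notify(usize::MAX);
    }
}
```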

Tests

Prevent the async runtime's schedulers from going into infinite busy loops when there
are pending `run_pending_tasks` calls on the `future::Cache`. This is done by
replacing a `RwLock` used in an internal `schedule_write_op` method with an event
notification.
@tatsuya6502 tatsuya6502 self-assigned this Apr 10, 2024
@tatsuya6502 tatsuya6502 added the bug Something isn't working label Apr 10, 2024
@tatsuya6502 tatsuya6502 added this to the v0.12.6 milestone Apr 10, 2024
@tatsuya6502 (Member, Author) commented:

The CI job for the Miri tests is failing. It is not related to this PR; we will track it in #414.

Review comment on src/future/base_cache.rs (outdated, resolved)
@tatsuya6502 tatsuya6502 changed the title Prevent the busy loop of async schedulers Prevent pending run_pending_tasks of future::Cache from causing busy loop in schedule_write_op Apr 10, 2024
Review comment on src/future/base_cache.rs (outdated, resolved)
Simplify the implementation of an internal `notify_write_op_ch_is_ready` method.
@tatsuya6502 (Member, Author) left a comment:

Merging.

@tatsuya6502 tatsuya6502 merged commit 2f23e5c into main Apr 11, 2024
46 of 47 checks passed
@tatsuya6502 tatsuya6502 deleted the avoid-async-scheduler-busy-loop branch April 11, 2024 10:22
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pending run_pending_tasks may cause busy loop in schedule_write_op
1 participant