Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Add Error Status Field for Diskless Syncs #878

Open
naglera opened this issue Aug 8, 2024 · 6 comments
Open

Feature Request: Add Error Status Field for Diskless Syncs #878

naglera opened this issue Aug 8, 2024 · 6 comments
Labels
major-decision-pending Major decision pending by TSC team

Comments

@naglera
Copy link
Contributor

naglera commented Aug 8, 2024

The problem/use-case that the feature addresses

Currently, Valkey has a lastbgsave_status field that tracks the status of disk-based bgsave. However, there is no equivalent field or status indicator for diskless sync operations. This lack of visibility into diskless sync errors makes it difficult to monitor and troubleshoot issues related to these operations.

Description of the feature

Introduce a new field or status indicator, tentatively named lastbgsave_diskless_status, to track the status of diskless sync operations. This field should be updated with an appropriate error code or message whenever an error occurs during the diskless sync process.

Alternatives you've considered

  1. Logging errors: Instead of introducing a new field, errors during diskless sync operations could be logged. However, this approach would require parsing logs to identify and monitor errors, which can be less efficient than having a dedicated status field.

  2. Reusing lastbgsave_status: Another alternative would be to reuse the existing lastbgsave_status field for both disk-based and diskless sync operations. However, this could lead to confusion and make it harder to distinguish between different types of errors. It also may make tests which already uses the metric to do wrong assertions.

@enjoy-binbin
Copy link
Member

I mentioned here valkey-io/valkey-doc#158 wanting to add these fields, and now seems like a good time to do it.

@valkey-io/core-team please take a look at the doc PR link, and see if we want to add the diskless related fields.

@hwware
Copy link
Member

hwware commented Aug 12, 2024

I just want to confirm with you:

  1. the Diskless Sync meaning when repl-diskless-sync is set to yes, the primary send rdb to replica status?
  2. with which condition, there is error status? Could you please list most situations?
  3. the name lastbgsave_diskless_status is not properly, suggest to repl-diskless-sync-status or something else because for diskless-sync, there is no save file on disk

@naglera
Copy link
Contributor Author

naglera commented Aug 13, 2024

  1. Yes, you are correct. When repl-diskless-sync is set to yes, the primary sends the RDB file directly to the replica's socket, without saving it to disk on the primary side.
  2. There are several situations where an error status can occur during the diskless sync process:
    • Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
    • Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
    • Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
    • Some issue at the replica side that prevents it from receiving or storing the RDB data.
  3. I agree. repl-diskless-sync-status is a better name for the status variable.

@hwware
Copy link
Member

hwware commented Aug 14, 2024

    • Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
    • Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
    • Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
    • Some issue at the replica side that prevents it from receiving or storing the RDB data.

For case 2 and 4, I think it makes sense to add repl-diskless-sync-status.
But for case 1 and 3, I am not familiar with this kind of case. @enjoy-binbin How about you?

@enjoy-binbin
Copy link
Member

I think both cases can happend, as long as the situations will cause diskless-sync fail, we should set the repl-diskless-sync-status.

@enjoy-binbin
Copy link
Member

@valkey-io/core-team please take a look and check if this needs to be fit into 8.0

Here are the fields we have now and their definitions, we can see we are mixing disk-based RDB and diskless RDB in some fields, and rdb_last_bgsave_status does not include the diskless RDB

  • rdb_changes_since_last_save: Number of changes since the last RDB file save
  • rdb_bgsave_in_progress: Flag indicating a RDB save is on-going, including a diskless replication RDB save
  • rdb_last_save_time: Epoch-based timestamp of last successful RDB file save
  • rdb_last_bgsave_status: Status of the last RDB file save operation
  • rdb_last_bgsave_time_sec: Duration of the last RDB save operation in seconds, including a diskless replication RDB save
  • rdb_current_bgsave_time_sec: Duration of the on-going RDB save operation if any, including a diskless replication RDB save

@enjoy-binbin enjoy-binbin added the major-decision-pending Major decision pending by TSC team label Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major-decision-pending Major decision pending by TSC team
Projects
None yet
Development

No branches or pull requests

3 participants