Feature Request: Add Error Status Field for Diskless Syncs #878

naglera · 2024-08-08T09:20:37Z

The problem/use-case that the feature addresses

Currently, Valkey has a lastbgsave_status field that tracks the status of disk-based bgsave. However, there is no equivalent field or status indicator for diskless sync operations. This lack of visibility into diskless sync errors makes it difficult to monitor and troubleshoot issues related to these operations.

Description of the feature

Introduce a new field or status indicator, tentatively named lastbgsave_diskless_status, to track the status of diskless sync operations. This field should be updated with an appropriate error code or message whenever an error occurs during the diskless sync process.
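
As a rough illustration of how a monitoring script might consume such a field, here is a minimal sketch. It assumes the field would be exposed in INFO output in the usual `key:value` form and would report `ok` or `err` like the existing bgsave status. The name `lastbgsave_diskless_status` is only the tentative name from this request and does not exist in Valkey today, and the sample text is made up for the example.

```python
def parse_info(text):
    """Parse INFO-style 'key:value' lines into a dict, skipping blank lines and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def diskless_sync_failed(fields, field_name="lastbgsave_diskless_status"):
    """Return True or False when the (hypothetical) field is present, None when the server does not report it."""
    status = fields.get(field_name)
    if status is None:
        return None
    return status != "ok"


# Illustrative sample only: the second field is hypothetical and not part of Valkey.
sample = "# Persistence\r\nrdb_last_bgsave_status:ok\r\nlastbgsave_diskless_status:err\r\n"
fields = parse_info(sample)
print(diskless_sync_failed(fields))
```

Returning None when the field is missing lets the same check run against servers that do not have the field, instead of treating a missing field as a failure.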

Alternatives you've considered

Logging errors: Instead of introducing a new field, errors during diskless sync operations could be logged. However, this approach would require parsing logs to identify and monitor errors, which can be less efficient than having a dedicated status field.
Reusing lastbgsave_status: Another alternative would be to reuse the existing lastbgsave_status field for both disk-based and diskless sync operations. However, this could cause confusion and make it harder to distinguish between the two types of errors. It could also cause existing tests that already rely on this metric to make wrong assertions.


enjoy-binbin · 2024-08-08T10:08:42Z

I mentioned here valkey-io/valkey-doc#158 wanting to add these fields, and now seems like a good time to do it.

@valkey-io/core-team please take a look at the doc PR link, and see if we want to add the diskless related fields.

hwware · 2024-08-12T16:18:43Z

I just want to confirm with you:

Does Diskless Sync here mean the status of the primary sending the RDB to the replica when repl-diskless-sync is set to yes?
Under which conditions would there be an error status? Could you please list the most common situations?
The name lastbgsave_diskless_status is not appropriate. I suggest repl-diskless-sync-status or something similar, because with diskless sync no file is saved to disk.

naglera · 2024-08-13T14:51:43Z

Yes, you are correct. When repl-diskless-sync is set to yes, the primary sends the RDB file directly to the replica's socket, without saving it to disk on the primary side.
There are several situations where an error status can occur during the diskless sync process:
- Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
- Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica.
- Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
- Some issue at the replica side that prevents it from receiving or storing the RDB data.
I agree. repl-diskless-sync-status is a better name for the status variable.

hwware · 2024-08-14T19:49:03Z

- Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
- Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica.
- Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
- Some issue at the replica side that prevents it from receiving or storing the RDB data.

For case 2 and 4, I think it makes sense to add repl-diskless-sync-status.
But for case 1 and 3, I am not familiar with this kind of case. @enjoy-binbin How about you?

enjoy-binbin · 2024-08-15T04:45:49Z

I think both cases can happen. As long as a situation causes the diskless sync to fail, we should set repl-diskless-sync-status.

enjoy-binbin · 2024-08-26T15:39:27Z

@valkey-io/core-team please take a look and check whether this needs to fit into 8.0.

Here are the fields we have now and their definitions. We can see that some fields mix disk-based RDB and diskless RDB, and that rdb_last_bgsave_status does not include the diskless RDB:

rdb_changes_since_last_save: Number of changes since the last RDB file save
rdb_bgsave_in_progress: Flag indicating a RDB save is on-going, including a diskless replication RDB save
rdb_last_save_time: Epoch-based timestamp of last successful RDB file save
rdb_last_bgsave_status: Status of the last RDB file save operation
rdb_last_bgsave_time_sec: Duration of the last RDB save operation in seconds, including a diskless replication RDB save
rdb_current_bgsave_time_sec: Duration of the on-going RDB save operation if any, including a diskless replication RDB save
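
To make the mixing easier to see, here is a small sketch that splits these fields by whether their definition above explicitly mentions a diskless replication RDB save. The field names come from the list above; the sample values and the helper name `split_by_scope` are made up for illustration.

```python
# Fields whose definition above says "including a diskless replication RDB save".
MENTIONS_DISKLESS = {
    "rdb_bgsave_in_progress",
    "rdb_last_bgsave_time_sec",
    "rdb_current_bgsave_time_sec",
}


def split_by_scope(info_text):
    """Split rdb_* fields into (definition mentions diskless, definition does not)."""
    with_diskless, without_diskless = {}, {}
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep or not key.startswith("rdb_"):
            continue  # skip blank lines, section headers, and unrelated fields
        target = with_diskless if key in MENTIONS_DISKLESS else without_diskless
        target[key] = value
    return with_diskless, without_diskless


# Illustrative values only, not output captured from a real server.
sample_info = """# Persistence
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1724686767
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:1
rdb_current_bgsave_time_sec:-1
"""

with_diskless, without_diskless = split_by_scope(sample_info)
print(sorted(with_diskless))
print(sorted(without_diskless))
```

In this split, rdb_last_bgsave_status lands in the second group, so a monitor that watches only that field would not notice a failed diskless sync.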

naglera mentioned this issue Aug 8, 2024

Dual channel replication should not update lastbgsave_status when transfer error #811

Merged

enjoy-binbin added the major-decision-pending Major decision pending by TSC team label Aug 26, 2024
