Abnormal vulnerability detector database size during re-downloading. #21663

Closed · MiguelazoDS opened this issue on Jan 30, 2024 · 6 comments · Fixed by #22360
Labels: level/subtask, type/enhancement (New feature or request)

@MiguelazoDS (Member) commented:

Description

It was found that, after interrupting and resuming the feed processing, the database grows far larger than expected. This behavior needs to be evaluated because the database may be corrupted, which could compromise the entire scanning process.

Expected behavior

As of now, the database feed size is around 4.5 GB. Interrupting and resuming the feed processing shouldn't cause the size to exceed that value.


Actual behavior

After interrupting and resuming the feed processing, the database feed size grows beyond 4.5 GB.

Steps to reproduce

  • Install manager 4.8.0 (62a5a67).
  • Let the feed processing start.
  • Interrupt the process before it ends (wazuh-control stop). Note that the logs do not report how far the processing has progressed.
  • Restart the manager. It will re-download the feed:
      • At 89k, the size is almost 4.5 GB.
      • At 200k, the size is 5.0 GB.
      • At the end, the size is 5.6 GB.

DoD

  • Analyze the root cause of this behavior.
  • Identify the reason.
  • Propose a fix if needed.
@GabrielEValenzuela self-assigned this on Feb 14, 2024
@sebasfalcone (Member) commented:

Moved ETA

  • Issue requires more investigation

@sebasfalcone (Member) commented on Feb 20, 2024:

On hold until we have more time to investigate.

This issue has more depth than expected: it seems RocksDB has some issues with the deletion of databases that hold a high volume of data.

@GabrielEValenzuela will append a report with all these findings to the issue.

@GabrielEValenzuela (Member) commented:

Investigation of disk space management issue in RocksDB

Abstract

This report investigates a concerning issue observed in RocksDB, a popular embedded database library developed by Facebook. The issue pertains to the ineffective management of disk space, wherein the delete operation fails to free up disk space, consequently leading to an increase in the database size. This report aims to analyze the root cause of the issue and propose potential solutions to mitigate its impact.

Introduction

RocksDB is a high-performance, persistent key-value store designed for fast storage systems such as flash drives and hard disk drives. It is widely used in applications ranging from web servers to distributed systems due to its efficiency and reliability. However, recent observations have revealed a critical issue regarding disk space management within RocksDB; for example, questions on Stack Overflow such as [1], which involves Apache Kafka as a service, point to the same issue.

Problem Statement

The primary problem identified in RocksDB involves the inefficiency of the delete operation in reclaiming disk space. Despite deleting data entries from the database, the corresponding disk space occupied by these entries remains allocated, leading to an accumulation of unused space and an increase in the overall database size over time. Furthermore, new data inserted into the database does not appear to overwrite the previously deleted entries, exacerbating the disk space utilization issue.
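
To make the symptom concrete, here is a minimal, self-contained sketch (not taken from the Wazuh code base; the path, key names, and sizes are illustrative) that uses RocksDB's size properties to show that point deletes only add tombstones, so the on-disk size does not shrink until a compaction runs:

#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <cassert>
#include <iostream>
#include <memory>
#include <string>

int main()
{
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* rawDb {nullptr};
    auto status = rocksdb::DB::Open(options, "/tmp/tombstone_demo", &rawDb);
    assert(status.ok());
    std::unique_ptr<rocksdb::DB> db {rawDb};

    // Insert a batch of keys and flush them to SST files.
    for (int i = 0; i < 100000; ++i)
    {
        db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i), std::string(256, 'x'));
    }
    db->Flush(rocksdb::FlushOptions());

    // Delete every key. This only writes tombstones.
    for (int i = 0; i < 100000; ++i)
    {
        db->Delete(rocksdb::WriteOptions(), "key" + std::to_string(i));
    }
    db->Flush(rocksdb::FlushOptions());

    // The SST files still hold the original values plus the tombstones,
    // so the total size grows even though the logical content is now empty.
    uint64_t sstSize {0};
    db->GetIntProperty(rocksdb::DB::Properties::kTotalSstFilesSize, &sstSize);
    std::cout << "SST size after deletes: " << sstSize << " bytes\n";

    // Only after a full compaction is the space actually reclaimed.
    db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
    db->GetIntProperty(rocksdb::DB::Properties::kTotalSstFilesSize, &sstSize);
    std::cout << "SST size after compaction: " << sstSize << " bytes\n";

    return 0;
}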

Findings

Based on the investigation, the inefficient disk space management in RocksDB may be attributed to several factors, including suboptimal delete operations, inefficient compaction strategies, and potential limitations in managing disk space utilization. Here's a summary of the key findings and potential solutions; a short sketch illustrating points 1 to 4 follows the list:

  1. Use of the DeleteRange function: Using DeleteRange instead of performing a range scan and deleting keys one by one records the deletion of a whole key range as a single range tombstone. This native operation speeds up deletion and improves write performance, making it preferable on performance-sensitive write paths. Currently, this call throws a "CF not supported" error in our setup.

  2. Compaction Strategies: Implementing trigger compaction on deletes can help mitigate the performance skew caused by excessive tombstones in SST files. By tracking long close-by tombstones and triggering compaction when necessary, the database can efficiently manage tombstone cleanup and improve range scan performance.

  3. Explicit Deletion of Iterators: Ensuring that iterators are explicitly deleted when they go out of scope can prevent memory leaks and optimize resource utilization within the database. [2] [3]

  4. Management of max_bytes_for_level_base Variable: Adjusting the max_bytes_for_level_base variable to optimize space and write amplification during compactions can improve overall performance and prevent excessive compactions that degrade performance over time. [4]

  5. Consideration of Probability in Delete Operations: Since the issue occurs with low probability, using standard delete operations in C++ may serve as a reliable alternative, considering the potential trade-offs in performance and resource utilization.
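
The sketch below is illustrative only, not Wazuh code: the database path, key prefixes, window sizes, and level size are assumptions chosen to show how points 1 to 4 map onto the public RocksDB C++ API.

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

#include <cassert>
#include <memory>

int main()
{
    rocksdb::Options options;
    options.create_if_missing = true;

    // (4) Tune the target size of level 1 so compactions keep space
    //     amplification under control. 256 MB is an example value.
    options.max_bytes_for_level_base = 256 * 1024 * 1024;

    // (2) Compact SST files that accumulate many tombstones: here, a file is
    //     marked for compaction once 10k deletions fall inside a 100k-entry window.
    options.table_properties_collector_factories.emplace_back(
        rocksdb::NewCompactOnDeletionCollectorFactory(/*sliding_window_size=*/100000,
                                                      /*deletion_trigger=*/10000));

    rocksdb::DB* rawDb {nullptr};
    auto status = rocksdb::DB::Open(options, "/tmp/feed_db_example", &rawDb);
    assert(status.ok());
    std::unique_ptr<rocksdb::DB> db {rawDb};

    // (1) Delete a whole key range with a single range tombstone instead of
    //     scanning the range and issuing one point delete per key.
    status = db->DeleteRange(rocksdb::WriteOptions(),
                             db->DefaultColumnFamily(),
                             "cve-2020-", "cve-2021-");
    assert(status.ok());

    // (3) Wrap iterators in a smart pointer so they are always released when
    //     they go out of scope.
    {
        std::unique_ptr<rocksdb::Iterator> it {db->NewIterator(rocksdb::ReadOptions())};
        for (it->SeekToFirst(); it->Valid(); it->Next())
        {
            // ... process it->key() / it->value() ...
        }
    }

    return 0;
}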

Conclusion

In conclusion, addressing disk space management in RocksDB likely requires a combination of optimized delete operations, refined compaction strategies, and fine-tuned configuration parameters that balance space utilization and performance. Implementing the suggested solutions can help mitigate the observed issues and improve RocksDB's overall disk space management efficiency and reliability. While this is being addressed, we will continue our research by bringing the issue to Meta's team of developers so that we can get a first-hand answer.


@sebasfalcone (Member) commented:

@GabrielEValenzuela

Let's retry this investigation with these changes: 7714e2a

@GabrielEValenzuela (Member) commented:

Investigation of disk space management issue in RocksDB - New analysis

Abstract

This is a follow-up to the previous report on ineffective disk space management in RocksDB, where the delete operation failed to free disk space and the database size kept growing.

Introduction

After the merging of the changes referenced above, we ran a test to check whether the problem was solved.

Findings

The size problem was solved successfully; we leave the proof in this comment.

Before restart: (screenshot, 2024-03-14)

After restart: (screenshot)

Conclusion

We can close this issue and open a new one related to the hash of the downloaded files:

// Just process the new file if the hash is different from the last one.
auto inputFileHash {Utils::asciiToHex(Utils::hashFile(outputFilePath))};
if (context.spUpdaterBaseContext->downloadedFileHash != inputFileHash)
{
    // Store new hash.
    context.spUpdaterBaseContext->downloadedFileHash = std::move(inputFileHash);

    // Download finished: Insert path into context.
    context.data.at("paths").push_back(outputFilePath.string());

    return;
}

logDebug2(WM_CONTENTUPDATER,
          "File '%s' didn't change from last download so it won't be published",
          outputFilePath.string().c_str());

If the download process exits prematurely, it should restart with the same file and only store the hash once the process finishes successfully.
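
A possible shape for that change, sketched below with hypothetical names (candidateFileHash and the surrounding flow are assumptions for illustration, not the actual content_manager code), is to keep the new hash aside and persist it only after the whole update chain has finished successfully:

// Hypothetical sketch: defer persisting the hash until the update succeeds.
auto inputFileHash {Utils::asciiToHex(Utils::hashFile(outputFilePath))};
if (context.spUpdaterBaseContext->downloadedFileHash == inputFileHash)
{
    logDebug2(WM_CONTENTUPDATER,
              "File '%s' didn't change from last download so it won't be published",
              outputFilePath.string().c_str());
    return;
}

// Remember the candidate hash, but do not persist it yet (hypothetical field).
context.candidateFileHash = std::move(inputFileHash);

// Download finished: Insert path into context.
context.data.at("paths").push_back(outputFilePath.string());

// ... and only once every later stage of the chain has succeeded:
// context.spUpdaterBaseContext->downloadedFileHash = std::move(context.candidateFileHash);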

@sebasfalcone (Member) commented:

Great analysis Gabi! Good to know this issue was solved 😅

Conclusion

Issues generated from the analysis
