[Create-Workload Enhancements] Rearchitect Create-Workload Feature #587

Open
IanHoang opened this issue Jul 17, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@IanHoang
Collaborator

IanHoang commented Jul 17, 2024

Overview

This is an issue based off one of the proposed priorities in this RFC: #395

Background

As of now, OSB's create-workload is a monolith that uses two modules of functions to create a custom workload. It was designed to be a quick and easy way to build custom workloads from small corpora. While this approach has worked in the past, there is increasing demand for building custom workloads based on complex workloads, and more users are using this feature to do so.

Users of this feature have mentioned that the create-workload code is currently difficult to extend and maintain and, for newcomers to OSB, difficult to follow and interpret.

We should rearchitect the code to be more organized and scalable, which in turn will make it easier to extend and maintain. This work will also serve as the foundation for future development, such as extracting a random sampling of the documents and repairing incomplete workloads.

Proposed Design

While the existing approach is considered modular, create-workload in its current state is unwieldy. We have gathered feedback from users who have extended the feature and have used the feature to build custom workloads based on complex production workloads that are up to 10TB. Based on the feedback received, we should rearchitect create-workload to have the following components:

  • CustomWorkload will be a composite class that orchestrates components such as IndexExtractor, CorporaExtractor, RetryController / RepairController, and TemplateProcessor. It can use helper functions to assist with processing queries provided by the user.
  • IndexExtractor will be a class that extracts index mappings and will not need an abstract class
  • CorporaExtractor will be an abstract class that extracts data corpora. Subclasses will be one layer deep in inheritance and will include SequentialExtractor and eventually a ConcurrentExtractor, which will provide methods that extract data sequentially and concurrently, respectively. Since users have also expressed concern over Python's ability to do concurrency, we could also build in support for extractors written in different languages (such as PythonCorporaExtractor or GoCorporaExtractor). This will need to be thoroughly thought out and is out of scope for now.
  • RetryController / RepairController will be a class that handles incomplete workloads, or index and data corpora extractions that failed, and offers users ways to retry the extraction process. This could become more involved if users want a way to retry create-workload failures by specifying the indices that failed, in which case the controller would load serialized objects and retry. This is out of scope for now.
  • There can be additional classes in a helper file that will serve to do miscellaneous, small tasks such as processing queries or processing workload templates.
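
As a rough illustration of how the components listed above could fit together, here is a minimal Python sketch. It is not the existing OSB code or a settled design: the method names (`extract`, `build`), the client methods (`get_mapping`, `scan`), and the in-memory `FakeClient` are all hypothetical and exist only to make the example self-contained. RetryController / RepairController and TemplateProcessor are left out because they are either out of scope or not yet specified.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class FakeClient:
    """Hypothetical in-memory stand-in for a cluster client (not a real OSB API)."""

    def __init__(self, mappings: Dict[str, Any], documents: Dict[str, List[dict]]) -> None:
        self._mappings = mappings
        self._documents = documents

    def get_mapping(self, index: str) -> Any:
        return self._mappings[index]

    def scan(self, index: str) -> Iterable[dict]:
        yield from self._documents[index]


class IndexExtractor:
    """Concrete class that extracts index mappings; no abstract base class."""

    def __init__(self, client: Any) -> None:
        self.client = client

    def extract(self, index: str) -> Any:
        return self.client.get_mapping(index)


class CorporaExtractor(ABC):
    """Abstract base class for extracting data corpora."""

    def __init__(self, client: Any) -> None:
        self.client = client

    @abstractmethod
    def extract(self, index: str) -> List[dict]:
        """Return the documents of the given index."""


class SequentialExtractor(CorporaExtractor):
    """One-layer-deep subclass that reads documents one after another."""

    def extract(self, index: str) -> List[dict]:
        return list(self.client.scan(index))


class CustomWorkload:
    """Composite class that orchestrates the extractors."""

    def __init__(self, index_extractor: IndexExtractor,
                 corpora_extractor: CorporaExtractor) -> None:
        self.index_extractor = index_extractor
        self.corpora_extractor = corpora_extractor

    def build(self, indices: Iterable[str]) -> Dict[str, Dict[str, Any]]:
        # Collect mappings and documents per index into one structure.
        workload: Dict[str, Dict[str, Any]] = {}
        for index in indices:
            workload[index] = {
                "mappings": self.index_extractor.extract(index),
                "documents": self.corpora_extractor.extract(index),
            }
        return workload
```

Because CustomWorkload depends only on the CorporaExtractor interface, a future ConcurrentExtractor could be passed in without changing the orchestration code, which is the extensibility benefit this proposal is after.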

Proposed priority

The current structure also makes it difficult for newcomers to understand the code. This approach would promote encapsulation and abstraction, making create-workload more organized and scalable as well as easier to extend and maintain.

@IanHoang IanHoang added enhancement New feature or request untriaged and removed untriaged labels Jul 17, 2024
@IanHoang IanHoang self-assigned this Jul 17, 2024
@gkamat gkamat changed the title [Create-Workload Enhancements] Reorganize Create-Worklload Feature [Create-Workload Enhancements] Reorganize Create-Workload Feature Jul 18, 2024
@IanHoang IanHoang moved this from 🆕 New to 🏗 In progress in Engineering Effectiveness Board Jul 30, 2024
@IanHoang IanHoang changed the title [Create-Workload Enhancements] Reorganize Create-Workload Feature [Create-Workload Enhancements] Rearchitect Create-Workload Feature Aug 9, 2024
@IanHoang
Collaborator Author

IanHoang commented Aug 9, 2024

Received feedback to add support for pbzip2 compression now that OSB supports it. Will create a separate PR for it.

@gkamat gkamat moved this to 🏗 In progress in OpenSearch Benchmark Roadmap Sep 13, 2024
@IanHoang IanHoang moved this from 🏗 In progress to Backlog in Engineering Effectiveness Board Sep 25, 2024
@gkamat
Collaborator

gkamat commented Oct 15, 2024

@IanHoang, it may be helpful to add some child tasks to this issue, since there are multiple items here.

Projects
Status: 📦 Backlog
Status: 🏗 In progress
Development

No branches or pull requests

2 participants