docs: add documentation for kaggle scen (#448)
* init for bg & quickstart for kaggle docs

* Add documentation for the environment configuration in the Kaggle scenario.

* add some descriptions in documents

* remove useless docs

* ci issue

---------

Co-authored-by: TPLin22 <tplin2@163.com>
WinstonLiyt and TPLin22 authored Oct 23, 2024
1 parent 162d191 commit 5531b17
Showing 3 changed files with 174 additions and 21 deletions.
19 changes: 10 additions & 9 deletions docs/scens/catalog.rst
@@ -34,13 +34,14 @@ The supported scenarios are listed below:


.. toctree::
:maxdepth: 1
:caption: Doctree:
:hidden:

data_agent_fin
data_copilot_fin
model_agent_fin
model_agent_med
model_copilot_general
:maxdepth: 1
:caption: Doctree:
:hidden:

data_agent_fin
data_copilot_fin
model_agent_fin
model_agent_med
model_copilot_general
kaggle_agent

143 changes: 143 additions & 0 deletions docs/scens/kaggle_agent.rst
@@ -0,0 +1,143 @@
.. _kaggle_agent:

=======================
Kaggle Agent
=======================

**🤖 Automated Feature Engineering & Model Tuning Evolution**
------------------------------------------------------------------------------------------

📖 Background
~~~~~~~~~~~~~~
In the landscape of data science competitions, Kaggle serves as the ultimate arena where data enthusiasts harness the power of algorithms to tackle real-world challenges.
The Kaggle Agent stands as a pivotal tool, empowering participants to seamlessly integrate cutting-edge models and datasets, transforming raw data into actionable insights.

By utilizing the **Kaggle Agent**, data scientists can craft innovative solutions that not only uncover hidden patterns but also drive significant advancements in predictive accuracy and model robustness.


🌟 Introduction
~~~~~~~~~~~~~~~~

In this scenario, our automated system proposes hypotheses, chooses actions, implements code, conducts validation, and uses feedback in a continuous, iterative process.

The goal is to automatically optimize performance metrics on the validation set or the Kaggle leaderboard, ultimately discovering the most effective features and models through autonomous research and development.

The steps are outlined below:

**Step 1: Hypothesis Generation 🔍**

- Generate and propose initial hypotheses based on previous experiment analysis and domain expertise, with thorough reasoning and justification.

**Step 2: Experiment Creation ✨**

- Transform the hypothesis into a task.
- Choose a specific action within feature engineering or model tuning.
- Develop, define, and implement a new feature or model, including its name, description, and formulation.

**Step 3: Model/Feature Implementation 👨‍💻**

- Implement the model code based on the detailed description.
- Evolve the model iteratively as a developer would, ensuring accuracy and efficiency.

**Step 4: Validation on Test Set or Kaggle 📉**

- Validate the newly developed model using the test set or Kaggle dataset.
- Assess the model's effectiveness and performance based on the validation results.

**Step 5: Feedback Analysis 🔍**

- Analyze validation results to assess performance.
- Use insights to refine hypotheses and enhance the model.

**Step 6: Hypothesis Refinement ♻️**

- Adjust hypotheses based on validation feedback.
- Iterate the process to continuously improve the model.
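
The six steps above form a loop. A minimal Python sketch of that control flow follows, under stated assumptions: every name in it (``Trace``, ``propose_hypothesis``, ``build_experiment``, ``implement``, ``validate``, ``run_loop``) is hypothetical and is not RD-Agent's API, and the scoring is a toy stand-in for real validation.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """History of (hypothesis, score) pairs that feeds the next proposal."""

    history: list = field(default_factory=list)

    def best_score(self):
        return max((score for _, score in self.history), default=float("-inf"))


def propose_hypothesis(trace, step):
    # Steps 1 and 6: a real system would consult the trace (for example via
    # an LLM); this stand-in simply alternates between the two action types.
    action = "feature engineering" if step % 2 == 0 else "model tuning"
    return {"action": action, "idea": f"idea-{step}"}


def build_experiment(hypothesis):
    # Step 2: turn the hypothesis into a concrete task description.
    return {"task": f"{hypothesis['action']}: {hypothesis['idea']}"}


def implement(experiment):
    # Step 3: stand-in for code generation; returns a callable "model".
    return lambda: (len(experiment["task"]) % 7) / 10.0


def validate(model):
    # Step 4: stand-in for scoring on a validation set or leaderboard.
    return model()


def run_loop(n_iterations):
    trace = Trace()
    for step in range(n_iterations):
        hypothesis = propose_hypothesis(trace, step)
        experiment = build_experiment(hypothesis)
        model = implement(experiment)
        score = validate(model)
        # Step 5: record feedback so later proposals can build on it.
        trace.history.append((hypothesis, score))
    return trace


trace = run_loop(4)
print(len(trace.history), trace.best_score())
```

The only state carried between iterations is the trace, which mirrors the way feedback from one round informs the hypotheses of the next.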

⚡ Quick Start
~~~~~~~~~~~~~~~~~

Please refer to the installation section in :doc:`../installation_and_configuration` to set up your system dependencies.

You can try our demo by running the following commands:

- 🐍 Create a Conda Environment

  - Create a new conda environment with Python (3.10 and 3.11 are well tested in our CI):

    .. code-block:: sh

        conda create -n rdagent python=3.10

  - Activate the environment:

    .. code-block:: sh

        conda activate rdagent

- 📦 Install the RDAgent

  - You can install the RDAgent package from PyPI:

    .. code-block:: sh

        pip install rdagent

- 🚀 Run the Application

  - You can directly run the application by using the following command:

    .. code-block:: sh

        python3 rdagent/app/kaggle/loop.py --competition [your competition name]

🛠️ Usage of modules
~~~~~~~~~~~~~~~~~~~~~

.. _Env Config:

- **Env Config**

The following environment variables can be set in the ``.env`` file to customize the application's behavior:

.. autopydantic_settings:: rdagent.app.kaggle.conf.KaggleBasePropSetting
    :settings-show-field-summary: False
    :exclude-members: Config

.. autopydantic_settings:: rdagent.components.coder.factor_coder.config.FactorImplementSettings
    :settings-show-field-summary: False
    :members: coder_use_cache, data_folder, data_folder_debug, file_based_execution_timeout, select_method, select_threshold, max_loop, knowledge_base_path, new_knowledge_base_path
    :exclude-members: Config, fail_task_trial_limit, v1_query_former_trace_limit, v1_query_similar_success_limit, v2_query_component_limit, v2_query_error_limit, v2_query_former_trace_limit, v2_error_summary, v2_knowledge_sampler, v2_add_fail_attempt_to_latest_successful_execution
    :no-index:
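
One of the Kaggle settings, ``if_action_choosing_based_on_UCB``, is described in the configuration as enabling a decision mechanism based on the UCB algorithm. The project's own implementation is not shown here. The following is a minimal sketch of standard UCB1 choosing between two actions; the function name ``ucb1_select``, the reward values, and the exploration constant are all illustrative assumptions.

```python
import math


def ucb1_select(counts, total_rewards, exploration=math.sqrt(2)):
    """Return the index of the action with the highest UCB1 score."""
    # Try every action once before applying the formula.
    for index, count in enumerate(counts):
        if count == 0:
            return index
    total_pulls = sum(counts)
    scores = [
        total_rewards[i] / counts[i]
        + exploration * math.sqrt(math.log(total_pulls) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


actions = ["feature engineering", "model tuning"]
counts = [0, 0]
total_rewards = [0.0, 0.0]
# Hypothetical improvement observed after each experiment of a given type.
observed_gain = {"feature engineering": 0.2, "model tuning": 0.6}

for _ in range(20):
    chosen = ucb1_select(counts, total_rewards)
    counts[chosen] += 1
    total_rewards[chosen] += observed_gain[actions[chosen]]

print(dict(zip(actions, counts)))
```

The score adds an exploration bonus to each action's average reward, so an action that has been tried less often is still revisited occasionally even when its average is lower.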

📋 Competition List Available
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+------------------------------------+----------------+--------------+------------------------------+
| **Competition Name**               | **Task**       | **Modality** | **ID**                       |
+====================================+================+==============+==============================+
| Media Campaign Cost Dataset        | Regression     | Tabular      | playground-series-s3e11      |
+------------------------------------+----------------+--------------+------------------------------+
| Wild Blueberry Yield Dataset       | Regression     | Tabular      | playground-series-s3e14      |
+------------------------------------+----------------+--------------+------------------------------+
| Crab Age Dataset                   | Regression     | Tabular      | playground-series-s3e16      |
+------------------------------------+----------------+--------------+------------------------------+
| Flood Prediction Dataset           | Regression     | Tabular      | playground-series-s4e5       |
+------------------------------------+----------------+--------------+------------------------------+
| Used Car Prices                    | Regression     | Tabular      | playground-series-s4e9       |
+------------------------------------+----------------+--------------+------------------------------+
| Cirrhosis Outcomes                 | Multi-Class    | Tabular      | playground-series-s3e26      |
+------------------------------------+----------------+--------------+------------------------------+
| San Francisco Crime Classification | Multi-Class    | Tabular      | sf-crime                     |
+------------------------------------+----------------+--------------+------------------------------+
| Poisonous Mushrooms                | Classification | Tabular      | playground-series-s4e8       |
+------------------------------------+----------------+--------------+------------------------------+
| Spaceship Titanic                  | Classification | Tabular      | spaceship-titanic            |
+------------------------------------+----------------+--------------+------------------------------+
| Forest Cover Type Prediction       | Classification | Tabular      | forest-cover-type-prediction |
+------------------------------------+----------------+--------------+------------------------------+
| Digit Recognizer                   | Classification | Image        | digit-recognizer             |
+------------------------------------+----------------+--------------+------------------------------+
| To be continued ...                                                                                |
+------------------------------------+----------------+--------------+------------------------------+

33 changes: 21 additions & 12 deletions rdagent/app/kaggle/conf.py
@@ -1,5 +1,3 @@
-from pathlib import Path
-
 from pydantic_settings import BaseSettings

 from rdagent.components.workflow.conf import BasePropSetting
@@ -16,13 +14,6 @@ class Config:
     scen: str = "rdagent.scenarios.kaggle.experiment.scenario.KGScenario"
     """Scenario class for data mining model"""

-    knowledge_base: str = ""  # TODO enable this line to use the knowledge base
-    # knowledge_base: str = "rdagent.scenarios.kaggle.knowledge_management.graph.KGKnowledgeGraph"
-    """Knowledge base class"""
-
-    knowledge_base_path: str = "kg_graph.pkl"
-    """Knowledge base path"""
-
     hypothesis_gen: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesisGen"
     """Hypothesis generation class"""

@@ -51,22 +42,40 @@ class Config:
     """Number of evolutions"""

     competition: str = ""
+    """Kaggle competition name, e.g., 'sf-crime'"""

     local_data_path: str = "/data/userdata/share/kaggle"
+    """Folder storing Kaggle competition data"""

+    if_action_choosing_based_on_UCB: bool = False
+    """Enable decision mechanism based on UCB algorithm"""
+
     domain_knowledge_path: str = "/data/userdata/share/kaggle/domain_knowledge"
+    """Folder storing domain knowledge files in .case format"""

-    rag_path: str = "git_ignore_folder/rag"
+    rag_path: str = "git_ignore_folder/kaggle_vector_base.pkl"
+    """Base version of vector-based RAG"""

-    if_action_choosing_based_on_UCB: bool = False
+    if_using_vector_rag: bool = False
+    """Enable basic vector-based RAG"""

     if_using_graph_rag: bool = False
+    """Enable advanced graph-based RAG"""

-    if_using_vector_rag: bool = False
+    # Conditionally set the knowledge_base based on the use of graph RAG
+    knowledge_base: str = (
+        "rdagent.scenarios.kaggle.knowledge_management.graph.KGKnowledgeGraph" if if_using_graph_rag else ""
+    )
+    """Knowledge base class, uses 'KGKnowledgeGraph' when advanced graph-based RAG is enabled, otherwise empty."""

+    knowledge_base_path: str = "kg_graph.pkl"
+    """Advanced version of graph-based RAG"""
+
     auto_submit: bool = True
+    """Automatically upload and submit each experiment result to Kaggle platform"""

     mini_case: bool = False
+    """Enable mini-case study for experiments"""


 KAGGLE_IMPLEMENT_SETTING = KaggleBasePropSetting()
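
One detail of the `knowledge_base` default may be worth checking. In Python, a conditional expression in a class body is evaluated once, when the class is defined, so it sees the default value of `if_using_graph_rag` rather than any value supplied later. The plain-Python sketch below illustrates this general language behaviour with a hypothetical `Settings` class. It does not use pydantic and makes no claim about how RD-Agent resolves the setting at runtime; the property is shown only as one possible alternative.

```python
class Settings:
    """Hypothetical stand-in for a settings class (not pydantic)."""

    use_graph_rag: bool = False
    # Evaluated once, while the class body runs: use_graph_rag is False here.
    knowledge_base: str = "KGKnowledgeGraph" if use_graph_rag else ""

    def __init__(self, use_graph_rag=False):
        self.use_graph_rag = use_graph_rag

    @property
    def resolved_knowledge_base(self) -> str:
        # Evaluated on access, so it follows the instance's current flag.
        return "KGKnowledgeGraph" if self.use_graph_rag else ""


settings = Settings(use_graph_rag=True)
print(repr(settings.knowledge_base))           # class-level default
print(repr(settings.resolved_knowledge_base))  # derived from the instance flag
```

If the same behaviour applies to the settings class in this commit, enabling graph RAG alone would leave `knowledge_base` empty unless that field is also set explicitly.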
