Merge pull request #187 from SylphAI-Inc/main
[V0.2.0] official + classification documentation
Sylph-AI authored Aug 21, 2024
2 parents b4b33dc + e7357a0 commit ffbba03
Showing 79 changed files with 2,641 additions and 4,664 deletions.
58 changes: 51 additions & 7 deletions README.md
@@ -3,10 +3,18 @@
<img alt="AdalFlow logo" src="docs/source/_static/images/adalflow-logo.png" style="width: 100%;">
</h4> -->



<h4 align="center">
<img alt="AdalFlow logo" src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/adalflow-logo.png" style="width: 100%;">
</h4>

<h2>
<p align="center">
⚡ The Library to Build and Auto-optimize LLM Applications ⚡
</p>
</h2>


<p align="center">
<a href="https://colab.research.google.com/drive/1TKw_JHE42Z_AWo8UuRYZCO2iuMgyslTZ?usp=sharing">
@@ -54,17 +62,43 @@



<h2>
<p align="center">
⚡ The Library to Build and Auto-optimize LLM Applications ⚡
</p>
</h2>



# Why AdalFlow?

Embracing a design philosophy similar to PyTorch, AdalFlow is powerful, light, modular, and robust.
# Why AdalFlow

1. Embracing a design pattern similar to PyTorch, AdalFlow is powerful, light, modular, and robust.
AdalFlow provides `Model-agnostic` building blocks for LLM task pipelines, ranging from RAG and Agents to classical NLP tasks like text classification and named entity recognition (see the minimal sketch after this list). It is easy to reach high performance even with just basic manual prompting.
2. AdalFlow provides a unified auto-differentiative framework for both zero-shot prompt optimization and few-shot optimization. It advances existing auto-optimization research, including ``Text-Grad`` and ``DSPy``.
Through our research on ``Text-Grad 2.0`` and ``Learn-to-Reason Few-shot In-Context Learning``, the AdalFlow ``Trainer`` achieves the highest accuracy while being the most token-efficient.
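
To make the first point concrete, here is a minimal sketch of a model-agnostic pipeline. It is illustrative only: it assumes the ``Generator`` core component and the ``OpenAIClient`` model client exported by the ``adalflow`` package, a ``gpt-3.5-turbo`` model, and an ``OPENAI_API_KEY`` in the environment.

```python
from adalflow.core import Generator
from adalflow.components.model_client import OpenAIClient

# One component wraps the prompt template, the model client, and output
# handling; swapping model_client switches providers without touching the code.
qa = Generator(
    model_client=OpenAIClient(),  # assumption: any supported client works here
    model_kwargs={"model": "gpt-3.5-turbo"},
    template=r"<SYS> You are a helpful assistant. </SYS> User: {{input_str}}",
)

output = qa(prompt_kwargs={"input_str": "What is retrieval-augmented generation?"})
print(output.data)
```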

<!-- It advances existing auto-optimization research, including Text-Grad and DsPy. Through our research, Text-Grad 2.0, and Learn-to-Reason Few-shot In-Context Learning, AdalFlow Trainer achieves the highest accuracy while being the most token-efficient. -->

<!-- AdalFlow not only helps developers build model-agnostic LLM task pipelines with full control over prompts and output processing, but it also auto-optimizes these pipelines to achieve SOTA accuracy. -->
<!-- Embracing a design pattern similar to PyTorch, AdalFlow is powerful, light, modular, and robust. -->

Here is our optimization demonstration on a text classification task:
<!-- <p align="center">
<img src="docs/source/_static/images/classification_training_map.png" alt="AdalFlow Auto-optimization" style="width: 80%;">
</p>
<p align="center">
<img src="docs/source/_static/images/classification_opt_prompt.png" alt="AdalFlow Auto-optimization" style="width: 80%;">
</p> -->

<p align="center" style="background-color: #f0f0f0;">
<img src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/classification_training_map.png" style="width: 80%;" alt="AdalFlow Auto-optimization">
</p>

<p align="center" style="background-color: #f0f0f0;">
<img src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/classification_opt_prompt.png" alt="AdalFlow Optimized Prompt" style="width: 80%;">
</p>


Among all libraries, we achieved the highest accuracy with manual prompting (starting at 82%) and the highest accuracy after optimization.

Further reading: [Optimize Classification](https://adalflow.sylph.ai/use_cases/classification.html)

## Light, Modular, and Model-agnostic Task Pipeline

@@ -178,6 +212,16 @@ AdalFlow is named in honor of [Ada Lovelace](https://en.wikipedia.org/wiki/Ada_L

[![contributors](https://contrib.rocks/image?repo=SylphAI-Inc/LightRAG&max=2000)](https://github.com/SylphAI-Inc/LightRAG/graphs/contributors)

# Acknowledgements

Many existing works greatly inspired this project! Here is a non-exhaustive list:

- 📚 [PyTorch](https://github.com/pytorch/pytorch/) for the design philosophy and the design patterns of ``Component``, ``Parameter``, and ``Sequential``.
- 📚 [Micrograd](https://github.com/karpathy/micrograd): a tiny autograd engine that inspired our auto-differentiative architecture.
- 📚 [Text-Grad](https://github.com/zou-group/textgrad) for the ``Textual Gradient Descent`` text optimizer.
- 📚 [DSPy](https://github.com/stanfordnlp/dspy) for inspiring the ``__{input/output}__fields`` in our ``DataClass`` and the bootstrap few-shot optimizer.
- 📚 [OPRO](https://github.com/google-deepmind/opro) for the idea of adding past text instructions along with their accuracy to the text optimizer.

# Citation

```bibtex
12 changes: 12 additions & 0 deletions adalflow/CHANGELOG.md
@@ -1,3 +1,15 @@
## [0.2.0] - 2024-08-20
### Added
- Qdrant retriever.

### Improved
- Add "mixed" training in ``Trainer`` to do demo and text optimization both in each step.
- ``DemoOptimizer``, allow to config if the input fields are included or excluded in the demonstration.
- Added ``sequential`` and ``mix`` in the ``optimization_order`` in the ``Trainer`` to support the mixed training.
- Added ``resume_from_ckpt`` in the ``Trainer.fit``.
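
A hedged sketch of how these new options combine; ``my_adal_task`` and ``train_data`` are hypothetical placeholders for your own ``AdalComponent`` task and dataset, and the ``Trainer`` import path is assumed:

```python
from adalflow.optim.trainer import Trainer  # assumed import path

# my_adal_task: hypothetical placeholder for your AdalComponent instance.
trainer = Trainer(
    adaltask=my_adal_task,
    optimization_order="mix",  # interleave demo and text optimization each step
)

trainer.fit(
    train_dataset=train_data,               # hypothetical placeholder dataset
    resume_from_ckpt="ckpt/last_run.json",  # hypothetical checkpoint path
)
```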

### Fixed
- Wrong import in the ``react`` agent.
## [0.2.0.beta.3] - 2024-08-16
### Fixed
- missing `diskcache` package in the dependencies.
2 changes: 1 addition & 1 deletion adalflow/adalflow/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.2.0-beta.3"
__version__ = "0.2.0"

from adalflow.core.component import Component, fun_to_component
from adalflow.core.container import Sequential
4 changes: 3 additions & 1 deletion adalflow/adalflow/components/agent/react.py
@@ -18,13 +18,15 @@
    FunctionExpression,
)
from adalflow.core.model_client import ModelClient
from lighadalflowtrag.utils.logger import printc
from adalflow.utils.logger import printc


log = logging.getLogger(__name__)

__all__ = ["DEFAULT_REACT_AGENT_SYSTEM_PROMPT", "ReActAgent"]

# TODO: test react agent

DEFAULT_REACT_AGENT_SYSTEM_PROMPT = r"""<SYS>
{# role/task description #}
You are a helpful assistant.
6 changes: 6 additions & 0 deletions adalflow/adalflow/components/retriever/__init__.py
@@ -22,12 +22,18 @@
    OptionalPackages.SQLALCHEMY,
)

QdrantRetriever = LazyImport(
    "adalflow.components.retriever.qdrant_retriever.QdrantRetriever",
    OptionalPackages.QDRANT,
)

__all__ = [
    "BM25Retriever",
    "LLMRetriever",
    "FAISSRetriever",
    "RerankerRetriever",
    "PostgresRetriever",
    "QdrantRetriever",
    "split_text_by_word_fn",
    "split_text_by_word_fn_then_lower_tokenized",
]
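
A brief, hedged note on what the lazy registration buys (inferred from the same ``LazyImport`` pattern used for the other optional retrievers):

```python
# Importing the package-level symbol is cheap: qdrant_client is not loaded yet.
from adalflow.components.retriever import QdrantRetriever

# The optional `qdrant-client` package is only resolved when the class is
# actually used; if it is missing, the LazyImport raises an install hint then.
```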
159 changes: 159 additions & 0 deletions adalflow/adalflow/components/retriever/qdrant_retriever.py
@@ -0,0 +1,159 @@
"""Leverage a Qdrant collection to retrieve documents."""

from typing import List, Optional, Any
from qdrant_client import QdrantClient, models

from adalflow.core.retriever import (
Retriever,
)
from adalflow.core.embedder import Embedder

from adalflow.core.types import (
RetrieverOutput,
RetrieverStrQueryType,
RetrieverStrQueriesType,
Document,
)


class QdrantRetriever(Retriever[Any, RetrieverStrQueryType]):
__doc__ = r"""Use a Qdrant collection to retrieve documents.
Args:
collection_name (str): the collection name in Qdrant.
client (QdrantClient): An instance of qdrant_client.QdrantClient.
embedder (Embedder): An instance of Embedder.
top_k (Optional[int], optional): top k documents to fetch. Defaults to 10.
vector_name (Optional[str], optional): the name of the vector in the collection. Defaults to None.
text_key (str, optional): the key in the payload that contains the text. Defaults to "text".
metadata_key (str, optional): the key in the payload that contains the metadata. Defaults to "meta_data".
filter (Optional[models.Filter], optional): the filter to apply to the query. Defaults to None.
References:
[1] Qdrant: https://qdrant.tech/
[2] Documentation: https://qdrant.tech/documentation/
"""

def __init__(
self,
collection_name: str,
client: QdrantClient,
embedder: Embedder,
top_k: Optional[int] = 10,
vector_name: Optional[str] = None,
text_key: str = "text",
metadata_key: str = "meta_data",
filter: Optional[models.Filter] = None,
):
super().__init__()
self._top_k = top_k
self._collection_name = collection_name
self._client = client
self._embedder = embedder
self._text_key = text_key
self._metadata_key = metadata_key
self._filter = filter

self._vector_name = vector_name or self._get_first_vector_name()

def reset_index(self):
if self._client.collection_exists(self._collection_name):
self._client.delete_collection(self._collection_name)

def call(
self,
input: RetrieverStrQueriesType,
top_k: Optional[int] = None,
**kwargs,
) -> List[RetrieverOutput]:
top_k = top_k or self._top_k
queries: List[str] = input if isinstance(input, list) else [input]

queries_embeddings = self._embedder(queries)

query_requests: List[models.QueryRequest] = []
for idx, query in enumerate(queries):
query_embedding = queries_embeddings.data[idx].embedding
query_requests.append(
models.QueryRequest(
query=query_embedding,
limit=top_k,
using=self._vector_name,
with_payload=True,
with_vector=True,
filter=self._filter,
**kwargs,
)
)

results = self._client.query_batch_points(
self._collection_name, requests=query_requests
)
retrieved_outputs: List[RetrieverOutput] = []
for result in results:
out = self._points_to_output(
result.points,
query,
self._text_key,
self._metadata_key,
self._vector_name,
)
retrieved_outputs.append(out)

return retrieved_outputs

def _get_first_vector_name(self) -> Optional[str]:
vectors = self._client.get_collection(
self._collection_name
).config.params.vectors

if not isinstance(vectors, dict):
# The collection only has the default, unnamed vector
return None

first_vector_name = list(vectors.keys())[0]

# The collection has multiple vectors. Could also include the falsy unnamed vector - Empty string("")
return first_vector_name or None

@classmethod
def _points_to_output(
cls,
points: List[models.ScoredPoint],
query: str,
text_key: str,
metadata_key: str,
vector_name: Optional[str],
) -> RetrieverOutput:
doc_indices = [point.id for point in points]
doc_scores = [point.score for point in points]
documents = [
cls._doc_from_point(point, text_key, metadata_key, vector_name)
for point in points
]
return RetrieverOutput(
doc_indices=doc_indices,
doc_scores=doc_scores,
query=query,
documents=documents,
)

@classmethod
def _doc_from_point(
cls,
point: models.ScoredPoint,
text_key: str,
metadata_key: str,
vector_name: Optional[str] = None,
) -> Document:
vector = point.vector
if isinstance(vector, dict):
vector = vector[vector_name]

payload = point.payload.copy()
return Document(
id=point.id,
text=payload.get(text_key, ""),
meta_data=payload.get(metadata_key, {}),
vector=vector,
)
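
A minimal usage sketch for the new retriever. It is hedged and illustrative: it assumes a Qdrant instance at ``localhost:6333``, an existing ``documents`` collection whose points carry a ``text`` payload field, an OpenAI embedder (any supported ``Embedder`` setup works), and an ``OPENAI_API_KEY`` in the environment.

```python
from qdrant_client import QdrantClient

from adalflow.core.embedder import Embedder
from adalflow.components.model_client import OpenAIClient
from adalflow.components.retriever import QdrantRetriever

# Assumptions: a running Qdrant instance and a pre-populated "documents"
# collection whose payload stores the raw text under the "text" key.
client = QdrantClient(url="http://localhost:6333")
embedder = Embedder(
    model_client=OpenAIClient(),
    model_kwargs={"model": "text-embedding-3-small"},
)

retriever = QdrantRetriever(
    collection_name="documents",
    client=client,
    embedder=embedder,
    top_k=5,
)

# A single string query returns a list with one RetrieverOutput.
outputs = retriever("What is retrieval-augmented generation?")
for doc in outputs[0].documents:
    print(doc.text)
```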
36 changes: 34 additions & 2 deletions adalflow/adalflow/core/functional.py
@@ -190,6 +190,11 @@ def check_data_class_field_args_zero(cls):
    )


def check_if_class_field_args_zero_exists(cls):
    """Check whether the class type has a first type argument (``__args__[0]``)."""
    return hasattr(cls, "__args__") and len(cls.__args__) > 0 and cls.__args__[0]


def check_data_class_field_args_one(cls):
    """Check if the field is a dataclass."""
    return (

@@ -200,6 +205,11 @@ def check_data_class_field_args_one(cls):
    )


def check_if_class_field_args_one_exists(cls):
    """Check whether the class type has a second type argument (``__args__[1]``)."""
    return hasattr(cls, "__args__") and len(cls.__args__) > 1 and cls.__args__[1]


def dataclass_obj_from_dict(cls: Type[object], data: Dict[str, object]) -> Any:
r"""Convert a dictionary to a dataclass object.
@@ -236,30 +246,44 @@ class TrecDataList:
"""
log.debug(f"Dataclass: {cls}, Data: {data}")
if data is None:
return None

if is_dataclass(cls) or is_potential_dataclass(
cls
): # Optional[Address] will be false, and true for each check

log.debug(
f"{is_dataclass(cls)} of {cls}, {is_potential_dataclass(cls)} of {cls}"
)
# Ensure the data is a dictionary
if not isinstance(data, dict):
raise ValueError(
f"Expected data of type dict for {cls}, but got {type(data).__name__}"
)
cls_type = extract_dataclass_type(cls)
fieldtypes = {f.name: f.type for f in cls_type.__dataclass_fields__.values()}
return cls_type(

restored_data = cls_type(
**{
key: dataclass_obj_from_dict(fieldtypes[key], value)
for key, value in data.items()
}
)
return restored_data
elif isinstance(data, (list, tuple)):
log.debug(f"List or Tuple: {cls}, {data}")
restored_data = []
for item in data:
if check_data_class_field_args_zero(cls):
# restore the value to its dataclass type
restored_data.append(dataclass_obj_from_dict(cls.__args__[0], item))
else:

elif check_if_class_field_args_zero_exists(cls):
# Use the original data [Any]
restored_data.append(dataclass_obj_from_dict(cls.__args__[0], item))

else:
restored_data.append(item)
return restored_data

Expand All @@ -270,6 +294,10 @@ class TrecDataList:
if check_data_class_field_args_zero(cls):
# restore the value to its dataclass type
restored_data.add(dataclass_obj_from_dict(cls.__args__[0], item))
elif check_if_class_field_args_zero_exists(cls):
# Use the original data [Any]
restored_data.add(dataclass_obj_from_dict(cls.__args__[0], item))

else:
# Use the original data [Any]
restored_data.add(item)
Expand All @@ -280,6 +308,10 @@ class TrecDataList:
for key, value in data.items():
if check_data_class_field_args_one(cls):
# restore the value to its dataclass type
data[key] = dataclass_obj_from_dict(cls.__args__[1], value)
elif check_if_class_field_args_one_exists(cls):
# Use the original data [Any]

data[key] = dataclass_obj_from_dict(cls.__args__[1], value)
else:
# Use the original data [Any]
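
To make the restoration flow concrete, here is a hedged end-to-end sketch. The dataclasses are illustrative, and it assumes ``dataclass_obj_from_dict`` is importable from ``adalflow.core.functional`` as shown above. ``List[Point]`` items are restored through ``__args__[0]`` because ``Point`` is a dataclass, while ``List[Any]`` only passes the new existence check (``List[Any].__args__`` is ``(typing.Any,)``, which is truthy), so its items flow through unchanged.

```python
from dataclasses import dataclass
from typing import Any, List

from adalflow.core.functional import dataclass_obj_from_dict  # assumed import


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Route:
    name: str
    points: List[Point]  # items restored via the dataclass type argument
    tags: List[Any]      # passes only the args-zero-exists check; items kept as-is


data = {
    "name": "loop",
    "points": [{"x": 1, "y": 2}, {"x": 3, "y": 4}],
    "tags": ["scenic", 3],
}
route = dataclass_obj_from_dict(Route, data)
print(route)
# Route(name='loop', points=[Point(x=1, y=2), Point(x=3, y=4)], tags=['scenic', 3])
```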