Merge pull request #187 from SylphAI-Inc/main
[V0.2.0] official + classification documentation
Sylph-AI authored Aug 21, 2024
2 parents b4b33dc + e7357a0 commit ffbba03
Showing 79 changed files with 2,641 additions and 4,664 deletions.
58 changes: 51 additions & 7 deletions README.md
@@ -3,10 +3,18 @@
<img alt="AdalFlow logo" src="docs/source/_static/images/adalflow-logo.png" style="width: 100%;">
</h4> -->



<h4 align="center">
<img alt="AdalFlow logo" src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/adalflow-logo.png" style="width: 100%;">
</h4>

<h2>
<p align="center">
⚡ The Library to Build and Auto-optimize LLM Applications ⚡
</p>
</h2>


<p align="center">
<a href="https://colab.research.google.com/drive/1TKw_JHE42Z_AWo8UuRYZCO2iuMgyslTZ?usp=sharing">
@@ -54,17 +62,43 @@



<h2>
<p align="center">
⚡ The Library to Build and Auto-optimize LLM Applications ⚡
</p>
</h2>



# Why AdalFlow?

Embracing a design philosophy similar to PyTorch, AdalFlow is powerful, light, modular, and robust.
# Why AdalFlow

1. Embracing a design pattern similar to PyTorch, AdalFlow is powerful, light, modular, and robust.
AdalFlow provides `Model-agnostic` building blocks for LLM task pipelines, ranging from RAG and Agents to classical NLP tasks like text classification and named entity recognition (see the minimal sketch after this list). It is easy to reach high performance even with just basic manual prompting.
2. AdalFlow provides a unified auto-differentiative framework for both zero-shot prompt optimization and few-shot optimization. It advances existing auto-optimization research, including ``Text-Grad`` and ``DSPy``.
Through our research on ``Text-Grad 2.0`` and ``Learn-to-Reason Few-shot In-Context Learning``, the AdalFlow ``Trainer`` achieves the highest accuracy while being the most token-efficient.
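
To make the first point concrete, here is a minimal sketch of a model-agnostic pipeline. It is illustrative only: it assumes the ``Generator`` core component and the ``OpenAIClient`` model client exported by the ``adalflow`` package, a ``gpt-3.5-turbo`` model, and an ``OPENAI_API_KEY`` in the environment.

```python
from adalflow.core import Generator
from adalflow.components.model_client import OpenAIClient

# One component wraps the prompt template, the model client, and output
# handling; swapping model_client switches providers without touching the code.
qa = Generator(
    model_client=OpenAIClient(),  # assumption: any supported client works here
    model_kwargs={"model": "gpt-3.5-turbo"},
    template=r"<SYS> You are a helpful assistant. </SYS> User: {{input_str}}",
)

output = qa(prompt_kwargs={"input_str": "What is retrieval-augmented generation?"})
print(output.data)
```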

<!-- It advances existing auto-optimization research, including Text-Grad and DsPy. Through our research, Text-Grad 2.0, and Learn-to-Reason Few-shot In-Context Learning, AdalFlow Trainer achieves the highest accuracy while being the most token-efficient. -->

<!-- AdalFlow not only helps developers build model-agnostic LLM task pipelines with full control over prompts and output processing, but it also auto-optimizes these pipelines to achieve SOTA accuracy. -->
<!-- Embracing a design pattern similar to PyTorch, AdalFlow is powerful, light, modular, and robust. -->

Here is our optimization demonstration on a text classification task:
<!-- <p align="center">
<img src="docs/source/_static/images/classification_training_map.png" alt="AdalFlow Auto-optimization" style="width: 80%;">
</p>
<p align="center">
<img src="docs/source/_static/images/classification_opt_prompt.png" alt="AdalFlow Auto-optimization" style="width: 80%;">
</p> -->

<p align="center" style="background-color: #f0f0f0;">
<img src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/classification_training_map.png" style="width: 80%;" alt="AdalFlow Auto-optimization">
</p>

<p align="center" style="background-color: #f0f0f0;">
<img src="https://raw.githubusercontent.com/SylphAI-Inc/LightRAG/main/docs/source/_static/images/classification_opt_prompt.png" alt="AdalFlow Optimized Prompt" style="width: 80%;">
</p>


Among all libraries, we achieved the highest accuracy with manual prompting (starting at 82%) and the highest accuracy after optimization.

Further reading: [Optimize Classification](https://adalflow.sylph.ai/use_cases/classification.html)

## Light, Modular, and Model-agnostic Task Pipeline

@@ -178,6 +212,16 @@ AdalFlow is named in honor of [Ada Lovelace](https://en.wikipedia.org/wiki/Ada_L

[![contributors](https://contrib.rocks/image?repo=SylphAI-Inc/LightRAG&max=2000)](https://github.com/SylphAI-Inc/LightRAG/graphs/contributors)

# Acknowledgements

Many existing works greatly inspired this project! Here is a non-exhaustive list:

- 📚 [PyTorch](https://github.com/pytorch/pytorch/) for the design philosophy and the design patterns of ``Component``, ``Parameter``, and ``Sequential``.
- 📚 [Micrograd](https://github.com/karpathy/micrograd): a tiny autograd engine that inspired our auto-differentiative architecture.
- 📚 [Text-Grad](https://github.com/zou-group/textgrad) for the ``Textual Gradient Descent`` text optimizer.
- 📚 [DSPy](https://github.com/stanfordnlp/dspy) for inspiring the ``__{input/output}__fields`` in our ``DataClass`` and the bootstrap few-shot optimizer.
- 📚 [OPRO](https://github.com/google-deepmind/opro) for the idea of adding past text instructions along with their accuracy to the text optimizer.

# Citation

```bibtex
12 changes: 12 additions & 0 deletions adalflow/CHANGELOG.md
@@ -1,3 +1,15 @@
## [0.2.0] - 2024-08-20
### Added
- Qdrant retriever.

### Improved
- Add "mixed" training in ``Trainer`` to do demo and text optimization both in each step.
- ``DemoOptimizer``, allow to config if the input fields are included or excluded in the demonstration.
- Added ``sequential`` and ``mix`` in the ``optimization_order`` in the ``Trainer`` to support the mixed training.
- Added ``resume_from_ckpt`` in the ``Trainer.fit``.
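
A hedged sketch of how these new options combine; ``my_adal_task`` and ``train_data`` are hypothetical placeholders for your own ``AdalComponent`` task and dataset, and the ``Trainer`` import path is assumed:

```python
from adalflow.optim.trainer import Trainer  # assumed import path

# my_adal_task: hypothetical placeholder for your AdalComponent instance.
trainer = Trainer(
    adaltask=my_adal_task,
    optimization_order="mix",  # interleave demo and text optimization each step
)

trainer.fit(
    train_dataset=train_data,               # hypothetical placeholder dataset
    resume_from_ckpt="ckpt/last_run.json",  # hypothetical checkpoint path
)
```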

### Fixed
- Wrong import in the ``react`` agent.
## [0.2.0.beta.3] - 2024-08-16
### Fixed
- missing `diskcache` package in the dependencies.
2 changes: 1 addition & 1 deletion adalflow/adalflow/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.2.0-beta.3"
__version__ = "0.2.0"

from adalflow.core.component import Component, fun_to_component
from adalflow.core.container import Sequential
4 changes: 3 additions & 1 deletion adalflow/adalflow/components/agent/react.py
@@ -18,13 +18,15 @@
    FunctionExpression,
)
from adalflow.core.model_client import ModelClient
from lighadalflowtrag.utils.logger import printc
from adalflow.utils.logger import printc


log = logging.getLogger(__name__)

__all__ = ["DEFAULT_REACT_AGENT_SYSTEM_PROMPT", "ReActAgent"]

# TODO: test react agent

DEFAULT_REACT_AGENT_SYSTEM_PROMPT = r"""<SYS>
{# role/task description #}
You are a helpful assistant.
6 changes: 6 additions & 0 deletions adalflow/adalflow/components/retriever/__init__.py
@@ -22,12 +22,18 @@
    OptionalPackages.SQLALCHEMY,
)

QdrantRetriever = LazyImport(
    "adalflow.components.retriever.qdrant_retriever.QdrantRetriever",
    OptionalPackages.QDRANT,
)

__all__ = [
    "BM25Retriever",
    "LLMRetriever",
    "FAISSRetriever",
    "RerankerRetriever",
    "PostgresRetriever",
    "QdrantRetriever",
    "split_text_by_word_fn",
    "split_text_by_word_fn_then_lower_tokenized",
]
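
A brief, hedged note on what the lazy registration buys (inferred from the same ``LazyImport`` pattern used for the other optional retrievers):

```python
# Importing the package-level symbol is cheap: qdrant_client is not loaded yet.
from adalflow.components.retriever import QdrantRetriever

# The optional `qdrant-client` package is only resolved when the class is
# actually used; if it is missing, the LazyImport raises an install hint then.
```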
159 changes: 159 additions & 0 deletions adalflow/adalflow/components/retriever/qdrant_retriever.py
@@ -0,0 +1,159 @@
"""Leverage a Qdrant collection to retrieve documents."""

from typing import List, Optional, Any
from qdrant_client import QdrantClient, models

from adalflow.core.retriever import (
Retriever,
)
from adalflow.core.embedder import Embedder

from adalflow.core.types import (
RetrieverOutput,
RetrieverStrQueryType,
RetrieverStrQueriesType,
Document,
)


class QdrantRetriever(Retriever[Any, RetrieverStrQueryType]):
__doc__ = r"""Use a Qdrant collection to retrieve documents.
Args:
collection_name (str): the collection name in Qdrant.
client (QdrantClient): An instance of qdrant_client.QdrantClient.
embedder (Embedder): An instance of Embedder.
top_k (Optional[int], optional): top k documents to fetch. Defaults to 10.
vector_name (Optional[str], optional): the name of the vector in the collection. Defaults to None.
text_key (str, optional): the key in the payload that contains the text. Defaults to "text".
metadata_key (str, optional): the key in the payload that contains the metadata. Defaults to "meta_data".
filter (Optional[models.Filter], optional): the filter to apply to the query. Defaults to None.
References:
[1] Qdrant: https://qdrant.tech/
[2] Documentation: https://qdrant.tech/documentation/
"""

def __init__(
self,
collection_name: str,
client: QdrantClient,
embedder: Embedder,
top_k: Optional[int] = 10,
vector_name: Optional[str] = None,
text_key: str = "text",
metadata_key: str = "meta_data",
filter: Optional[models.Filter] = None,
):
super().__init__()
self._top_k = top_k
self._collection_name = collection_name
self._client = client
self._embedder = embedder
self._text_key = text_key
self._metadata_key = metadata_key
self._filter = filter

self._vector_name = vector_name or self._get_first_vector_name()

def reset_index(self):
if self._client.collection_exists(self._collection_name):
self._client.delete_collection(self._collection_name)

def call(
self,
input: RetrieverStrQueriesType,
top_k: Optional[int] = None,
**kwargs,
) -> List[RetrieverOutput]:
top_k = top_k or self._top_k
queries: List[str] = input if isinstance(input, list) else [input]

queries_embeddings = self._embedder(queries)

query_requests: List[models.QueryRequest] = []
for idx, query in enumerate(queries):
query_embedding = queries_embeddings.data[idx].embedding
query_requests.append(
models.QueryRequest(
query=query_embedding,
limit=top_k,
using=self._vector_name,
with_payload=True,
with_vector=True,
filter=self._filter,
**kwargs,
)
)

results = self._client.query_batch_points(
self._collection_name, requests=query_requests
)
retrieved_outputs: List[RetrieverOutput] = []
for result in results:
out = self._points_to_output(
result.points,
query,
self._text_key,
self._metadata_key,
self._vector_name,
)
retrieved_outputs.append(out)

return retrieved_outputs

def _get_first_vector_name(self) -> Optional[str]:
vectors = self._client.get_collection(
self._collection_name
).config.params.vectors

if not isinstance(vectors, dict):
# The collection only has the default, unnamed vector
return None

first_vector_name = list(vectors.keys())[0]

# The collection has multiple vectors. Could also include the falsy unnamed vector - Empty string("")
return first_vector_name or None

@classmethod
def _points_to_output(
cls,
points: List[models.ScoredPoint],
query: str,
text_key: str,
metadata_key: str,
vector_name: Optional[str],
) -> RetrieverOutput:
doc_indices = [point.id for point in points]
doc_scores = [point.score for point in points]
documents = [
cls._doc_from_point(point, text_key, metadata_key, vector_name)
for point in points
]
return RetrieverOutput(
doc_indices=doc_indices,
doc_scores=doc_scores,
query=query,
documents=documents,
)

@classmethod
def _doc_from_point(
cls,
point: models.ScoredPoint,
text_key: str,
metadata_key: str,
vector_name: Optional[str] = None,
) -> Document:
vector = point.vector
if isinstance(vector, dict):
vector = vector[vector_name]

payload = point.payload.copy()
return Document(
id=point.id,
text=payload.get(text_key, ""),
meta_data=payload.get(metadata_key, {}),
vector=vector,
)
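
A minimal usage sketch for the new retriever. It is hedged and illustrative: it assumes a Qdrant instance at ``localhost:6333``, an existing ``documents`` collection whose points carry a ``text`` payload field, an OpenAI embedder (any supported ``Embedder`` setup works), and an ``OPENAI_API_KEY`` in the environment.

```python
from qdrant_client import QdrantClient

from adalflow.core.embedder import Embedder
from adalflow.components.model_client import OpenAIClient
from adalflow.components.retriever import QdrantRetriever

# Assumptions: a running Qdrant instance and a pre-populated "documents"
# collection whose payload stores the raw text under the "text" key.
client = QdrantClient(url="http://localhost:6333")
embedder = Embedder(
    model_client=OpenAIClient(),
    model_kwargs={"model": "text-embedding-3-small"},
)

retriever = QdrantRetriever(
    collection_name="documents",
    client=client,
    embedder=embedder,
    top_k=5,
)

# A single string query returns a list with one RetrieverOutput.
outputs = retriever("What is retrieval-augmented generation?")
for doc in outputs[0].documents:
    print(doc.text)
```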
36 changes: 34 additions & 2 deletions adalflow/adalflow/core/functional.py
@@ -190,6 +190,11 @@ def check_data_class_field_args_zero(cls):
    )


def check_if_class_field_args_zero_exists(cls):
    """Check whether the class type has a first type argument (``__args__[0]``)."""
    return hasattr(cls, "__args__") and len(cls.__args__) > 0 and cls.__args__[0]


def check_data_class_field_args_one(cls):
    """Check if the field is a dataclass."""
    return (

@@ -200,6 +205,11 @@ def check_data_class_field_args_one(cls):
    )


def check_if_class_field_args_one_exists(cls):
    """Check whether the class type has a second type argument (``__args__[1]``)."""
    return hasattr(cls, "__args__") and len(cls.__args__) > 1 and cls.__args__[1]


def dataclass_obj_from_dict(cls: Type[object], data: Dict[str, object]) -> Any:
r"""Convert a dictionary to a dataclass object.
@@ -236,30 +246,44 @@ class TrecDataList:
"""
log.debug(f"Dataclass: {cls}, Data: {data}")
if data is None:
return None

if is_dataclass(cls) or is_potential_dataclass(
cls
): # Optional[Address] will be false, and true for each check

log.debug(
f"{is_dataclass(cls)} of {cls}, {is_potential_dataclass(cls)} of {cls}"
)
# Ensure the data is a dictionary
if not isinstance(data, dict):
raise ValueError(
f"Expected data of type dict for {cls}, but got {type(data).__name__}"
)
cls_type = extract_dataclass_type(cls)
fieldtypes = {f.name: f.type for f in cls_type.__dataclass_fields__.values()}
return cls_type(

restored_data = cls_type(
**{
key: dataclass_obj_from_dict(fieldtypes[key], value)
for key, value in data.items()
}
)
return restored_data
elif isinstance(data, (list, tuple)):
log.debug(f"List or Tuple: {cls}, {data}")
restored_data = []
for item in data:
if check_data_class_field_args_zero(cls):
# restore the value to its dataclass type
restored_data.append(dataclass_obj_from_dict(cls.__args__[0], item))
else:

elif check_if_class_field_args_zero_exists(cls):
# Use the original data [Any]
restored_data.append(dataclass_obj_from_dict(cls.__args__[0], item))

else:
restored_data.append(item)
return restored_data

Expand All @@ -270,6 +294,10 @@ class TrecDataList:
if check_data_class_field_args_zero(cls):
# restore the value to its dataclass type
restored_data.add(dataclass_obj_from_dict(cls.__args__[0], item))
elif check_if_class_field_args_zero_exists(cls):
# Use the original data [Any]
restored_data.add(dataclass_obj_from_dict(cls.__args__[0], item))

else:
# Use the original data [Any]
restored_data.add(item)
Expand All @@ -280,6 +308,10 @@ class TrecDataList:
for key, value in data.items():
if check_data_class_field_args_one(cls):
# restore the value to its dataclass type
data[key] = dataclass_obj_from_dict(cls.__args__[1], value)
elif check_if_class_field_args_one_exists(cls):
# Use the original data [Any]

data[key] = dataclass_obj_from_dict(cls.__args__[1], value)
else:
# Use the original data [Any]
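
To make the restoration flow concrete, here is a hedged end-to-end sketch. The dataclasses are illustrative, and it assumes ``dataclass_obj_from_dict`` is importable from ``adalflow.core.functional`` as shown above. ``List[Point]`` items are restored through ``__args__[0]`` because ``Point`` is a dataclass, while ``List[Any]`` only passes the new existence check (``List[Any].__args__`` is ``(typing.Any,)``, which is truthy), so its items flow through unchanged.

```python
from dataclasses import dataclass
from typing import Any, List

from adalflow.core.functional import dataclass_obj_from_dict  # assumed import


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Route:
    name: str
    points: List[Point]  # items restored via the dataclass type argument
    tags: List[Any]      # passes only the args-zero-exists check; items kept as-is


data = {
    "name": "loop",
    "points": [{"x": 1, "y": 2}, {"x": 3, "y": 4}],
    "tags": ["scenic", 3],
}
route = dataclass_obj_from_dict(Route, data)
print(route)
# Route(name='loop', points=[Point(x=1, y=2), Point(x=3, y=4)], tags=['scenic', 3])
```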