feat: Add ConditionalRouter Haystack 2.x component #6147

Merged
merged 41 commits into from
Nov 23, 2023

vblagoje
Member

@vblagoje vblagoje commented Oct 21, 2023

Why:

  • Enable generic, expressive conditional routing in pipelines by introducing a new Router component.
  • The Router component directs the flow of data by evaluating route conditions and selecting the matching route from a set of alternatives.
  • Fixes #6109 (Add Conditional Routing in Haystack 2.x Pipelines).

What:

  • Added a new Router class to haystack/preview/components/routers.
  • Updated __init__.py to include the new Router class.

How can it be used:

  • Import and utilize the Router component to manage and route connections in your pipelines.
  • Here is an example:

In this example, we create a Router component with two routes. The first route will be selected if the number of streams is less than 2, and will output the query variable. The second route will be selected if the number of streams is 2 or more, and will output the streams variable. We also specify the routing variables, which are query and streams. These variables need to be provided in the pipeline run() method. Routing variables can be used in route conditions and as route output values.

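from typing import List

# Import path assumed from the module location named under "What" above.
from haystack.preview.components.routers import Router

routes = [
  {"condition": "len(streams) < 2", "output": "query", "output_type": str},
  {"condition": "len(streams) >= 2", "output": "streams", "output_type": List[int]}
]

router = Router(routes=routes, routing_variables=["query", "streams"])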


# the second route from above should be selected
kwargs = {"streams": [1, 2, 3], "query": "test"}
result = router.run(**kwargs)
assert result == {"streams": [1, 2, 3]}

# the first route from above should be selected
kwargs = {"streams": [1], "query": "test"}
result = router.run(**kwargs)
assert result == {"query": "test"}

How did you test it:

  • Unit tests were added to ensure that the new Router component works correctly at the component level. A real-world example is available in this colab.

Notes for reviewer:

  • This is not a final version but more of a start of a conversation in the direction of expressive conditional routing in Haystack 2.x. DO NOT INTEGRATE

@vblagoje vblagoje requested a review from a team as a code owner October 21, 2023 18:36
@vblagoje vblagoje requested review from ZanSara and removed request for a team October 21, 2023 18:36
@vblagoje vblagoje added 2.x Related to Haystack v2.0 and removed topic:tests labels Oct 21, 2023
@github-actions github-actions bot added the type:documentation Improvements on the docs label Oct 21, 2023
@vblagoje vblagoje requested a review from a team as a code owner October 21, 2023 18:41
@vblagoje vblagoje requested review from dfokina and removed request for a team October 21, 2023 18:41
@vblagoje
Member Author

@ZanSara @masci This is not a final solution but the start of a conversation toward adding such a conditional routing component.

@ZanSara
Contributor

ZanSara commented Oct 23, 2023

Hey @vblagoje, this seems like a solid start! I have a few questions:

  • I see you've addressed the case where no route is selected by raising an exception. I think it would be better to just drop the value with a warning, or at least to let the user choose not to raise an exception.

  • Is it possible to output on two connections at the same time? If the input matches more than one rule, what happens?

  • Is it possible that the rules have non-overlapping variables? For example, what happens if I have two rules where one checks for streams and another for files (just an example). In this case, are both inputs mandatory, or are they alternatives?

  • I am also assuming that the conditions can only be applied to the "whole" inputs, regardless of their types, and that lists can't be split with this component (something like what FileTypeClassifier does). Is that the case?

In general though, looking promising!

@vblagoje
Member Author

vblagoje commented Oct 23, 2023

Hey @ZanSara, I don't know what the right answers to these questions are, but we can involve the community to home in on the details.

Some ideas:

  • I agree with your suggestion to provide a flexible way to handle unmatched routes. Introducing an unmatched_route_behavior parameter with options like 'warn', 'error', or 'drop' would let users dictate how the Router behaves in such scenarios.
  • Currently, the design allows a single matched route per input to ensure deterministic routing. However, there is nothing wrong with accommodating multiple matched routes if users ask for it. Multiple routes would "fire" in such cases.
  • Rule evaluation is designed to be flexible. If a route's boolean expression only checks whether a variable is truthy, that variable can be optional, so whether inputs are mandatory depends on the boolean logic of the routes. Again, we can involve the community.
  • Whatever you can do with a boolean expression and variable references should be allowed, right? So if you have access to the replies from GPTGenerator, you should be able to access the first ChatMessage and check its role, for example.

For the last bullet point, consider the #6138 use case. Here we need to put a Router after GPTGenerator to check whether an LLM message response is a function call and, if so, route the message to ServiceContainer to handle it. It would make what used to be a complex and verbose function invocation super simple and easy to understand, while isolating responsibilities to exactly where they belong.
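
To make the first idea above more concrete, here is a minimal sketch of how an unmatched_route_behavior option could behave. Only the option names 'warn', 'error', and 'drop' come from this discussion; the select_route helper, the callable conditions, and the tuple-shaped routes are hypothetical stand-ins and not the Router code in this PR.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)

# Hypothetical route shape for this sketch: (condition, name of the variable to output).
Route = Tuple[Callable[[Dict[str, Any]], bool], str]


def select_route(routes: List[Route], inputs: Dict[str, Any],
                 unmatched_route_behavior: str = "error") -> Dict[str, Any]:
    """Return the output of the first matching route, or handle the no-match case."""
    if unmatched_route_behavior not in ("warn", "error", "drop"):
        raise ValueError(f"Unknown behavior: {unmatched_route_behavior}")
    for condition, output_name in routes:
        if condition(inputs):
            # First match wins, which keeps routing deterministic.
            return {output_name: inputs[output_name]}
    if unmatched_route_behavior == "error":
        raise RuntimeError("No route fired")
    if unmatched_route_behavior == "warn":
        logger.warning("No route fired; dropping inputs %s", inputs)
    return {}  # 'warn' and 'drop' both emit nothing


routes = [
    (lambda v: len(v["streams"]) < 2, "query"),
    (lambda v: len(v["streams"]) >= 2, "streams"),
]
result = select_route(routes, {"streams": [1, 2, 3], "query": "test"})
```

With 'drop' or 'warn' the caller gets an empty dictionary instead of an exception, so downstream components simply receive no input.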

@ZanSara
Contributor

ZanSara commented Oct 23, 2023

For the last bullet point, consider the #6138 use case. Here we need to put a Router after GPTGenerator to check whether an LLM message response is a function call and, if so, route the message to ServiceContainer to handle it. It would make what used to be a complex and verbose function invocation super simple and easy to understand, while isolating responsibilities to exactly where they belong.

Actually this use case is very interesting. I don't know if it's possible, but imagine this scenario: I query the LLM with n=2 (for whatever reason), so I get two answers. If one is a function call and the other isn't, are both outputs going to carry one of these replies each? That would require unpacking the list. I think your current implementation does not support this case yet.

Not a requirement, I'm just trying to define the use cases that are supported and those that aren't 😊
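
For illustration only, the n=2 scenario could be sketched like this. It shows list unpacking, which the current implementation does not do; the replies are plain dictionaries standing in for LLM messages, and split_replies and the two output names are invented for this example.

```python
from typing import Any, Callable, Dict, List


def split_replies(
    replies: List[Dict[str, Any]],
    is_function_call: Callable[[Dict[str, Any]], bool],
) -> Dict[str, List[Dict[str, Any]]]:
    """Partition replies across two outputs instead of routing the list as a whole."""
    outputs: Dict[str, List[Dict[str, Any]]] = {"function_calls": [], "plain_replies": []}
    for reply in replies:
        key = "function_calls" if is_function_call(reply) else "plain_replies"
        outputs[key].append(reply)
    return outputs


# Two replies from one query: one is a function call, the other is not.
replies = [
    {"content": "Some function payload", "finish_reason": "function_call"},
    {"content": "Some plain answer", "finish_reason": "stop"},
]
result = split_replies(replies, lambda r: r["finish_reason"] == "function_call")
```

Each output then carries only the replies that satisfied its condition, which is different from selecting one route for the whole list.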

@vblagoje
Member Author

@masci @silvanocerza @ZanSara, I came across a compact MIT-licensed library, asteval, that meets our requirements precisely, providing a safe alternative to using eval. It iteratively traverses the AST, executing operations directly. We can set up the Interpreter the way we want (see the example below) in our Router component. Here's a snippet demonstrating its use:

import contextlib
import sys
from asteval import Interpreter

aeval = Interpreter(
    minimal=True, 
    use_numpy=False,
    user_symbols={"x": [1,2,3], "y": 2},
    max_statement_length=10
)

# this context manager is totally optional but could be useful for invalid user expressions
@contextlib.contextmanager
def limited_recursion(recursion_limit):
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(recursion_limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)

with limited_recursion(50):
    result = aeval.eval("len(x) > y")
    if len(aeval.error) > 0:
        for err in aeval.error:
            print(err.get_error())
    else:
        print(result)

Take this snippet and run it yourself (pip install asteval first). Set a breakpoint in the on_compare method and in a few other interesting places, such as the run and eval methods.

Please let me know if we should proceed with our experiments using this approach.

@ZanSara
Contributor

ZanSara commented Oct 26, 2023

Hey @vblagoje, while asteval looks safer than direct eval, I believe @masci's idea was rather about using something much simpler: for example something that looks like the document store's filters.

Right now your example would not work there because we're checking len(). However, such operators can be added to the filtering syntax for this component. I also believe we won't need many of them other than len, especially at the beginning.
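
As a rough sketch of what a filter-style condition with a len operator might look like: the dictionary layout, the "transform" key, the operator names, and evaluate_condition are all assumptions made for illustration. They are neither the document store filter syntax nor anything in this PR.

```python
import operator
from typing import Any, Dict

# Hypothetical comparison operators supported by this sketch.
COMPARATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def evaluate_condition(condition: Dict[str, Any], variables: Dict[str, Any]) -> bool:
    """Evaluate a {"field", "operator", "value"} condition with an optional "len" transform."""
    value = variables[condition["field"]]
    if condition.get("transform") == "len":
        value = len(value)
    return COMPARATORS[condition["operator"]](value, condition["value"])


# Equivalent of the expression "len(streams) >= 2".
condition = {"field": "streams", "transform": "len", "operator": ">=", "value": 2}
fires = evaluate_condition(condition, {"streams": [1, 2, 3], "query": "test"})
```

Nothing is evaluated as code here, which is the appeal of this style; the cost is that every new operator has to be added to the syntax by hand.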

@vblagoje
Member Author

@ZanSara the example above was intentionally simple to show how we can make this work; maybe that biased you. I think the community needs rich boolean expressions to make this component useful. I know I did for the use case I encountered: the need to route messages based on certain ChatMessage properties, as was the case with ServiceComponent.
Here is the updated code sample (recursion circuit breaker omitted for brevity):

from asteval import Interpreter

from haystack.preview.dataclasses import ChatMessage, ChatRole

function_call_message = ChatMessage.from_assistant("Some function payload")
function_call_message.metadata.update({'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'function_call'})

messages = [ChatMessage.from_user("What's the weather like in Berlin?"),
            function_call_message]


aeval = Interpreter(minimal=True,
                    use_numpy=False,
                    user_symbols={"messages": messages},
                    max_statement_length=100)

result = aeval.eval('messages[-1].metadata["finish_reason"] == "function_call"')
print(result)

Contributor

@masci masci left a comment

Overall looks good to me, left a couple of comments regarding code documentation.

One last question: I see there's some additional complexity to make output_name optional. I think that always passing output_name would simplify both the code and how we teach this feature, but I'm not sure how big of a burden this would be from the UX perspective. How did you evaluate the tradeoff?

@vblagoje
Member Author

vblagoje commented Nov 15, 2023

Overall looks good to me, left a couple of comments regarding code documentation.

One last question: I see there's some additional complexity to make output_name optional. I think that always passing output_name would simplify both the code and how we teach this feature, but I'm not sure how big of a burden this would be from the UX perspective. How did you evaluate the tradeoff?

Yeah, exactly @masci - I went on to always use output_name in my code tests. However, that's a sample of 1 and it would be great to have others take a look at the colab and play with it to get a sense. That would be an essential piece of information to conclude this PR.

vblagoje and others added 5 commits November 17, 2023 14:13
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
@masci
Contributor

masci commented Nov 17, 2023

I would give precedence to simplicity: for now we can make the parameter mandatory, and if we get negative feedback around UX we know we can change it later.

@vblagoje
Member Author

@masci it should be ready now, but please check whether I have overlooked something in this slightly refactored version, which is now much simpler. The UX colab has not changed, as I had intuitively used all four fields as if they were mandatory. Please also check whether the docs are easy to digest.

Contributor

@ZanSara ZanSara left a comment

Overall this seems good to me, but there is a serious issue with type serialization that we need to fix before this component can be used in a pipeline.

When serializing a list, asserting that serialize_type(List[int]) == "typing.List" is not sufficient: canals needs to know that it's a list of int. We need a way to store that information as well, or the deserialized pipeline will fail to reconnect.
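
One possible way to keep the element type, sketched with typing.get_args from the standard library. The serialize_type_sketch name and the exact output format (module-qualified arguments) are assumptions for this example and not the code in this PR; it has only been thought through for List and Dict style generics, not for Union or Optional.

```python
import typing
from typing import Any, Dict, List


def serialize_type_sketch(target: Any) -> str:
    """Serialize a type to a string, keeping generic arguments such as the int in List[int]."""
    args = typing.get_args(target)
    if not args:
        if isinstance(target, type):
            # Plain classes, e.g. str -> "builtins.str".
            return f"{target.__module__}.{target.__qualname__}"
        # Bare generics, e.g. typing.List -> "typing.List".
        return str(target)
    # str(List[int]) is "typing.List[int]"; keep only the part before "[".
    base = str(target).split("[", 1)[0]
    inner = ", ".join(serialize_type_sketch(arg) for arg in args)
    return f"{base}[{inner}]"


serialized = serialize_type_sketch(List[int])
nested = serialize_type_sketch(Dict[str, List[int]])
```

The recursion is what lets nested generics keep every level of their arguments.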

"""Exception raised when there is an error parsing or evaluating the condition expression in ConditionalRouter."""


def serialize_type(target: Any) -> str:
Contributor

Let's move this one (and its sibling function deserialize_type) into an external module, so it can be reused. I think it will be handy for other components too.

except Exception as e:
    raise RouteConditionException(f"Error evaluating condition for route '{route}': {e}") from e

raise NoRouteSelectedException(f"No route fired. Routes: {self.routes}")
Contributor

This is interesting: why fail instead of dropping the input (with a loud log if necessary)? I'd say we should at least give the option to either fail or drop the value.

Member Author

I don't know; it's just a simple solution for now; let's put it in the hands of users, and we'll see what they say. If we add an option, it is yet another variable to turn on/off, describe, test, confuse people, and from my perspective - unnecessary.

Comment on lines +115 to +128
routes = [
    {
        "condition": "{{streams|length > 2}}",
        "output": "{{streams}}",
        "output_name": "enough_streams",
        "output_type": List[int],
    },
    {
        "condition": "{{streams|length <= 2}}",
        "output": "{{streams}}",
        "output_name": "insufficient_streams",
        "output_type": List[int],
    },
]
Contributor

If I understand correctly, these are dictionaries with four fixed keys. How about a small dataclass to help with code completion?
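
Something along these lines, as a sketch only. The Route class name is hypothetical; the four field names and the example values are the ones used in the routes under review.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Route:
    """Hypothetical typed wrapper for the four route keys."""

    condition: str
    output: str
    output_name: str
    output_type: Any


routes = [
    Route("{{streams|length > 2}}", "{{streams}}", "enough_streams", List[int]),
    Route("{{streams|length <= 2}}", "{{streams}}", "insufficient_streams", List[int]),
]

# Convert back to the plain dictionaries the component accepts today.
route_dicts = [dict(vars(route)) for route in routes]
```

A missing or misspelled key then fails at construction time instead of surfacing later during routing.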

Member Author

Then @silvanocerza will tell me - why did you make a data class for this thing 🤣 🤣 Perhaps in the next iteration, final release!


def test_output_type_serialization(self):
    assert serialize_type(str) == "builtins.str"
    assert serialize_type(List[int]) == "typing.List"
Contributor

I'm afraid this won't be enough to deserialize it into a type that Canals can use for a connection. We definitely need to preserve the int as well.

Member Author

@ZanSara I can adjust the code to serialize List[int] into the string typing.List[int], but what about deserialization? Is deserializing into typing.List enough?
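
To illustrate the question, here is a hedged sketch of a deserializer that rebuilds the full generic by parsing the string. The function names and the accepted string format (for example "typing.List[int]" or "typing.List[builtins.int]") are assumptions for this example, not the implementation in this PR.

```python
import importlib
from typing import Any, Dict, List


def split_top_level(args: str) -> List[str]:
    """Split 'a, b[c, d]' on commas that are not inside brackets."""
    parts, depth, current = [], 0, ""
    for char in args:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += char
    if current.strip():
        parts.append(current.strip())
    return parts


def deserialize_type_sketch(type_str: str) -> Any:
    """Rebuild a type from a string such as 'builtins.str' or 'typing.List[int]'."""
    if "[" in type_str:
        base_str, rest = type_str.split("[", 1)
        args_str = rest[:-1]  # drop the trailing "]"
        base = deserialize_type_sketch(base_str)
        args = tuple(deserialize_type_sketch(a) for a in split_top_level(args_str))
        return base[args] if len(args) > 1 else base[args[0]]
    module_name, _, name = type_str.rpartition(".")
    # Unqualified names such as "int" are assumed to be builtins.
    return getattr(importlib.import_module(module_name or "builtins"), name)


restored = deserialize_type_sketch("typing.List[int]")
bare = deserialize_type_sketch("typing.List")
```

Here restored compares equal to List[int] while bare does not, so a deserializer that stops at typing.List loses the information needed to match connections by type.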

Member Author

Ok, all should be covered now with 7eb0943

@vblagoje
Member Author

Please have another pass @ZanSara and @masci
See the unit tests in 7eb0943, as we now cover generics and nested generics. I agree that we should isolate this type (de)serialization, add more tests, and develop it independently, but please let's do that after this PR has been integrated. I'm not sure I covered all the possible generics serialization cases, but certainly many work now.

Contributor

@masci masci left a comment

LGTM, I think custom serialization for the types is the way to go here. Manually parsing the source code in deserialize_type is not ideal, but honestly I couldn't come up with a better alternative.

The rest was already good. Thanks for incorporating the feedback about making output_name non-optional; I can confirm the code is simpler now. Let's re-evaluate later whether optional is better.

Contributor

@dfokina dfokina left a comment

Pushed a tiny docstring update from my side, all good 👍

@vblagoje
Member Author

@ZanSara Let's integrate this one and then iron out these kinks during beta as we continue to play with ConditionalRouter

@vblagoje vblagoje merged commit b557f30 into main Nov 23, 2023
22 checks passed
@vblagoje vblagoje deleted the connection_router_v2 branch November 23, 2023 09:28
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

Add Conditional Routing in Haystack 2.x Pipelines
6 participants