feat: Move vectorize to Astra DB Component #3766

Merged
@@ -6,8 +6,8 @@


class AstraVectorizeComponent(Component):
display_name: str = "Astra Vectorize"
description: str = "Configuration options for Astra Vectorize server-side embeddings."
display_name: str = "Astra Vectorize [DEPRECATED]"
description: str = "Configuration options for Astra Vectorize server-side embeddings. This component is deprecated. Please use the Astra DB Component directly."
documentation: str = "https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html"
icon = "AstraDB"
name = "AstraVectorize"
228 changes: 212 additions & 16 deletions src/backend/base/langflow/components/vectorstores/AstraDB.py
@@ -2,7 +2,7 @@

from langflow.base.vectorstores.model import LCVectorStoreComponent, check_cached_vector_store
from langflow.helpers import docs_to_data
from langflow.inputs import DictInput, FloatInput
from langflow.inputs import DictInput, FloatInput, MessageTextInput
from langflow.io import (
BoolInput,
DataInput,
@@ -23,6 +23,40 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
name = "AstraDB"
icon: str = "AstraDB"

VECTORIZE_PROVIDERS_MAPPING = {
"Azure OpenAI": ["azureOpenAI", ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]],
"Hugging Face - Dedicated": ["huggingfaceDedicated", ["endpoint-defined-model"]],
"Hugging Face - Serverless": [
"huggingface",
[
"sentence-transformers/all-MiniLM-L6-v2",
"intfloat/multilingual-e5-large",
"intfloat/multilingual-e5-large-instruct",
"BAAI/bge-small-en-v1.5",
"BAAI/bge-base-en-v1.5",
"BAAI/bge-large-en-v1.5",
],
],
"Jina AI": [
"jinaAI",
[
"jina-embeddings-v2-base-en",
"jina-embeddings-v2-base-de",
"jina-embeddings-v2-base-es",
"jina-embeddings-v2-base-code",
"jina-embeddings-v2-base-zh",
],
],
"Mistral AI": ["mistral", ["mistral-embed"]],
"NVIDIA": ["nvidia", ["NV-Embed-QA"]],
"OpenAI": ["openai", ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]],
"Upstage": ["upstageAI", ["solar-embedding-1-large"]],
"Voyage AI": [
"voyageAI",
["voyage-large-2-instruct", "voyage-law-2", "voyage-code-2", "voyage-large-2", "voyage-2"],
],
}

inputs = [
StrInput(
name="collection_name",
@@ -59,6 +93,20 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
info="Optional namespace within Astra DB to use for the collection.",
advanced=True,
),
DropdownInput(
Collaborator

Should we default to one or the other?

Collaborator Author

Good point! For compatibility, I suppose defaulting to Embeddings makes sense so we don't break existing flows that were using it. (We are currently breaking Vectorize-based flows, as mentioned...)

Side note: you may be wondering why it's a dropdown rather than a boolean input / toggle. I tried to use the switch, but for whatever reason, disabling the flag didn't trigger the call to update_build_config. That's something I want to bring up with people on the Langflow team.

name="embedding_service",
display_name="Embedding Model or Astra Vectorize",
info="Determines whether to use Astra Vectorize for the collection.",
options=["Embedding Model", "Astra Vectorize"],
real_time_refresh=True,
value="Embedding Model",
),
HandleInput(
name="embedding",
display_name="Embedding Model",
input_types=["Embeddings"],
info="Allows an embedding model configuration.",
),
DropdownInput(
name="metric",
display_name="Metric",
@@ -110,12 +158,6 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
info="Optional list of metadata fields to include in the indexing.",
advanced=True,
),
HandleInput(
Contributor

Users who are using this input will update Langflow, refresh the component, and find it broken.
I think we need to keep it backwards compatible.

Collaborator Author

@nicoloboschi I definitely see your point here, but I did maintain the embeddings support. The broken compatibility would be if someone had a flow that was using the AstraVectorize component as input to this input, right? In this PR, I remove that component entirely since it's built in now. Are you suggesting we should keep the separate component as well, for backwards compatibility purposes?

I think in other cases the backwards compatibility is maintained...

Collaborator

I think that was mentioned, yeah: keep the separate component around for backwards compatibility. On the other hand... how many people have a flow using vectorize right now? If we can significantly reduce confusion for future vectorize users by removing this separate component, it may be worth breaking backwards compatibility in this case.

(disclaimer: I, of course, would never advocate for breaking backwards-compatibility)

name="embedding",
display_name="Embedding or Astra Vectorize",
input_types=["Embeddings", "dict"],
info="Allows either an embedding model or an Astra Vectorize configuration.", # TODO: This should be optional, but need to refactor langchain-astradb first.
),
StrInput(
name="metadata_indexing_exclude",
display_name="Metadata Indexing Exclude",
@@ -160,7 +202,159 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
]

@check_cached_vector_store
def build_vector_store(self):
def insert_in_dict(self, build_config, field_name, new_parameters):
# Insert the new key-value pair after the found key
for new_field_name, new_parameter in new_parameters.items():
# Get all the items as a list of tuples (key, value)
items = list(build_config.items())

# Find the index of the key to insert after
for i, (key, value) in enumerate(items):
if key == field_name:
break

items.insert(i + 1, (new_field_name, new_parameter))

# Clear the original dictionary and update with the modified items
build_config.clear()
build_config.update(items)

return build_config

def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None):
if field_name == "embedding_service":
if field_value == "Astra Vectorize":
for field in ["embedding"]:
if field in build_config:
del build_config[field]

new_parameter = DropdownInput(
name="provider",
display_name="Vectorize Provider",
options=self.VECTORIZE_PROVIDERS_MAPPING.keys(),
value="",
required=True,
real_time_refresh=True,
).to_dict()

self.insert_in_dict(build_config, "embedding_service", {"provider": new_parameter})
else:
for field in [
"provider",
"z_00_model_name",
"z_01_model_parameters",
"z_02_api_key_name",
"z_03_provider_api_key",
"z_04_authentication",
]:
if field in build_config:
del build_config[field]

new_parameter = HandleInput(
name="embedding",
display_name="Embedding Model",
input_types=["Embeddings"],
info="Allows an embedding model configuration.",
).to_dict()

self.insert_in_dict(build_config, "embedding_service", {"embedding": new_parameter})

elif field_name == "provider":
for field in [
"z_00_model_name",
"z_01_model_parameters",
"z_02_api_key_name",
"z_03_provider_api_key",
"z_04_authentication",
]:
if field in build_config:
del build_config[field]

model_options = self.VECTORIZE_PROVIDERS_MAPPING[field_value][1]

new_parameter_0 = DropdownInput(
name="z_00_model_name",
display_name="Model Name",
info=f"The embedding model to use for the selected provider. Each provider has a different set of models "
f"available (full list at https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html):\n\n{', '.join(model_options)}",
options=model_options,
required=True,
).to_dict()

new_parameter_1 = DictInput(
name="z_01_model_parameters",
display_name="Model Parameters",
is_list=True,
).to_dict()

new_parameter_2 = MessageTextInput(
name="z_02_api_key_name",
display_name="API Key name",
info="The name of the embeddings provider API key stored on Astra. If set, it will override the 'ProviderKey' in the authentication parameters.",
).to_dict()

new_parameter_3 = SecretStrInput(
name="z_03_provider_api_key",
display_name="Provider API Key",
info="An alternative to the Astra Authentication that passes an API key for the provider with each request to Astra DB. This may be used when Vectorize is configured for the collection, but no corresponding provider secret is stored within Astra's key management system.",
).to_dict()

new_parameter_4 = DictInput(
name="z_04_authentication",
display_name="Authentication parameters",
is_list=True,
).to_dict()

self.insert_in_dict(
build_config,
"provider",
{
"z_00_model_name": new_parameter_0,
"z_01_model_parameters": new_parameter_1,
"z_02_api_key_name": new_parameter_2,
"z_03_provider_api_key": new_parameter_3,
"z_04_authentication": new_parameter_4,
},
)

return build_config

def build_vectorize_options(self, **kwargs):
# Attribute names must match the dynamic inputs created in update_build_config
for attribute in [
"provider",
"z_00_model_name",
"z_01_model_parameters",
"z_02_api_key_name",
"z_03_provider_api_key",
"z_04_authentication",
]:
if not hasattr(self, attribute):
setattr(self, attribute, None)

# Fetch values from kwargs if any self.* attributes are None
provider_value = self.VECTORIZE_PROVIDERS_MAPPING.get(self.provider, [None])[0] or kwargs.get("provider")
authentication = {**(self.z_04_authentication or kwargs.get("z_04_authentication", {}))}

api_key_name = self.z_02_api_key_name or kwargs.get("z_02_api_key_name")
provider_key_name = self.z_03_provider_api_key or kwargs.get("z_03_provider_api_key")
if provider_key_name:
authentication["providerKey"] = provider_key_name
# The stored API key name, if set, overrides the provider key
if api_key_name:
authentication["providerKey"] = api_key_name

return {
# must match astrapy.info.CollectionVectorServiceOptions
"collection_vector_service_options": {
"provider": provider_value,
"modelName": self.z_00_model_name or kwargs.get("z_00_model_name"),
"authentication": authentication,
"parameters": self.z_01_model_parameters or kwargs.get("z_01_model_parameters", {}),
},
"collection_embedding_api_key": self.z_03_provider_api_key or kwargs.get("z_03_provider_api_key"),
}

@check_cached_vector_store
def build_vector_store(self, vectorize_options=None):
Collaborator

Curious about the parameter addition here. Is it only used for testing purposes? In the main path, there's no scenario where this wouldn't be None, right?

Collaborator Author

That's correct! I left a comment elsewhere in this review explaining a similar change, but if there's a better or easier way that I missed, let me know, because I didn't like it either. The goal was basically to allow the tests to execute successfully with pytest (and, for what it's worth, the 5 test_astra_vectorize tests do in fact work for me). I would love it if, rather than relying on these optional parameters, the tests simulated how the UI executes the code. The dynamic inputs are the challenge...

Collaborator

Hm yeah it's very possible we don't yet have any tests that use dynamic inputs, and thus haven't had to support a way to do that. I'll pull this and play around a bit this afternoon just to try as well
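
For readers following the thread, the dictionary that a test could pass as `vectorize_options` has the two-key shape returned by `build_vectorize_options` in the diff. The following is a minimal standalone sketch of that shape and of the key precedence; `make_vectorize_options` and the reduced `PROVIDERS` table are hypothetical names used only for illustration and are not part of langflow.

```python
# Standalone sketch; nothing here imports or calls langflow.
# PROVIDERS is a two-entry subset of the PR's VECTORIZE_PROVIDERS_MAPPING.
PROVIDERS = {"OpenAI": "openai", "NVIDIA": "nvidia"}


def make_vectorize_options(
    provider,
    model_name,
    api_key_name=None,
    provider_api_key=None,
    authentication=None,
    parameters=None,
):
    """Build a dict shaped like the return value of build_vectorize_options."""
    auth = dict(authentication or {})
    if provider_api_key:
        auth["providerKey"] = provider_api_key
    if api_key_name:
        # As in the diff, the stored key name overrides the raw provider key.
        auth["providerKey"] = api_key_name
    return {
        "collection_vector_service_options": {
            "provider": PROVIDERS.get(provider),
            "modelName": model_name,
            "authentication": auth,
            "parameters": dict(parameters or {}),
        },
        "collection_embedding_api_key": provider_api_key,
    }


options = make_vectorize_options(
    "OpenAI", "text-embedding-3-small", api_key_name="my-openai-key"
)
```

A test could then hand a dictionary like `options` to `build_vector_store(vectorize_options=...)`, which is the only path where the new parameter is not None.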

try:
from langchain_astradb import AstraDBVectorStore
from langchain_astradb.utils.astradb import SetupMode
@@ -178,22 +372,22 @@ def build_vector_store(self):
except KeyError:
raise ValueError(f"Invalid setup mode: {self.setup_mode}")

if not isinstance(self.embedding, dict):
if self.embedding:
embedding_dict = {"embedding": self.embedding}
else:
from astrapy.info import CollectionVectorServiceOptions

dict_options = self.embedding.get("collection_vector_service_options", {})
dict_options = vectorize_options or self.build_vectorize_options()
dict_options["authentication"] = {
k: v for k, v in dict_options.get("authentication", {}).items() if k and v
}
dict_options["parameters"] = {k: v for k, v in dict_options.get("parameters", {}).items() if k and v}

embedding_dict = {
"collection_vector_service_options": CollectionVectorServiceOptions.from_dict(dict_options)
"collection_vector_service_options": CollectionVectorServiceOptions.from_dict(
dict_options.get("collection_vector_service_options", {})
),
}
collection_embedding_api_key = self.embedding.get("collection_embedding_api_key")
if collection_embedding_api_key:
embedding_dict["collection_embedding_api_key"] = collection_embedding_api_key

vector_store_kwargs = {
**embedding_dict,
@@ -223,6 +417,7 @@ def build_vector_store(self):
raise ValueError(f"Error initializing AstraDBVectorStore: {str(e)}") from e

self._add_documents_to_vector_store(vector_store)

return vector_store

def _add_documents_to_vector_store(self, vector_store):
@@ -262,8 +457,9 @@ def _build_search_args(self):
args["filter"] = clean_filter
return args

def search_documents(self) -> list[Data]:
vector_store = self.build_vector_store()
def search_documents(self, vector_store=None) -> list[Data]:
if not vector_store:
vector_store = self.build_vector_store()
Collaborator

I don't think we need this change - the check_cached decorator handles this

Collaborator Author

The only reason I made this change was to make it easier to write a test. The problem with the tests is that I wasn't sure how to build the component with parameters that weren't in the initial configuration. In the UI, the component's inputs are updated dynamically based on the value of the dropdown, but is there a way to perform that same computation programmatically? Something like:

component.update_build_configuration(embedding_service = "Astra Vectorize")

So I allowed the vector store object to be passed in optionally for the purposes of the test, but that would never be used in the happy path in the component. Let me know if you see a better way to do that.
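
One way to picture the computation being asked about is to run the dropdown logic against a plain dictionary, with no UI involved. The sketch below is a simplified standalone variant of what `insert_in_dict` and the `embedding_service` branch of `update_build_config` do in the diff; `insert_after` and `switch_embedding_service` are hypothetical helpers, and the field dictionaries are placeholders rather than real langflow input definitions.

```python
# Standalone sketch; it mimics the diff's behaviour on a plain dict
# and does not import or call langflow.

def insert_after(config: dict, anchor: str, new_fields: dict) -> dict:
    """Place new_fields right after anchor, keeping the other keys in order."""
    items = list(config.items())
    keys = [key for key, _ in items]
    # Fall back to appending at the end if the anchor is missing.
    pos = keys.index(anchor) + 1 if anchor in keys else len(items)
    items[pos:pos] = list(new_fields.items())
    config.clear()
    config.update(items)
    return config


def switch_embedding_service(config: dict, choice: str) -> dict:
    """Swap the 'embedding' handle for a 'provider' dropdown, or back."""
    if choice == "Astra Vectorize":
        config.pop("embedding", None)
        return insert_after(config, "embedding_service", {"provider": {"value": ""}})
    config.pop("provider", None)
    return insert_after(
        config, "embedding_service", {"embedding": {"input_types": ["Embeddings"]}}
    )


config = {
    "collection_name": {},
    "embedding_service": {"value": "Embedding Model"},
    "embedding": {},
    "metric": {},
}
switch_embedding_service(config, "Astra Vectorize")
vectorize_keys = list(config)
# vectorize_keys is now:
# ['collection_name', 'embedding_service', 'provider', 'metric']
```

A test built this way would exercise the same field swapping that the dropdown triggers, which might remove the need for the optional `vector_store` argument; whether the real component can be driven like this is an assumption, not something the PR confirms.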


logger.debug(f"Search input: {self.search_input}")
logger.debug(f"Search type: {self.search_type}")
3 changes: 3 additions & 0 deletions src/backend/base/langflow/components/vectorstores/__init__.py
@@ -0,0 +1,3 @@
from .AstraDB import AstraVectorStoreComponent

__all__ = ["AstraVectorStoreComponent"]