feat: Move vectorize to Astra DB Component #3766
@@ -2,7 +2,7 @@
from langflow.base.vectorstores.model import LCVectorStoreComponent, check_cached_vector_store
from langflow.helpers import docs_to_data
-from langflow.inputs import DictInput, FloatInput
+from langflow.inputs import DictInput, FloatInput, MessageTextInput
from langflow.io import (
    BoolInput,
    DataInput,
@@ -23,6 +23,40 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
    name = "AstraDB"
    icon: str = "AstraDB"
    VECTORIZE_PROVIDERS_MAPPING = {
        "Azure OpenAI": ["azureOpenAI", ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]],
        "Hugging Face - Dedicated": ["huggingfaceDedicated", ["endpoint-defined-model"]],
        "Hugging Face - Serverless": [
            "huggingface",
            [
                "sentence-transformers/all-MiniLM-L6-v2",
                "intfloat/multilingual-e5-large",
                "intfloat/multilingual-e5-large-instruct",
                "BAAI/bge-small-en-v1.5",
                "BAAI/bge-base-en-v1.5",
                "BAAI/bge-large-en-v1.5",
            ],
        ],
        "Jina AI": [
            "jinaAI",
            [
                "jina-embeddings-v2-base-en",
                "jina-embeddings-v2-base-de",
                "jina-embeddings-v2-base-es",
                "jina-embeddings-v2-base-code",
                "jina-embeddings-v2-base-zh",
            ],
        ],
        "Mistral AI": ["mistral", ["mistral-embed"]],
        "NVIDIA": ["nvidia", ["NV-Embed-QA"]],
        "OpenAI": ["openai", ["text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002"]],
        "Upstage": ["upstageAI", ["solar-embedding-1-large"]],
        "Voyage AI": [
            "voyageAI",
            ["voyage-large-2-instruct", "voyage-law-2", "voyage-code-2", "voyage-large-2", "voyage-2"],
        ],
    }

    inputs = [
        StrInput(
            name="collection_name",
@@ -59,6 +93,20 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
            info="Optional namespace within Astra DB to use for the collection.",
            advanced=True,
        ),
        DropdownInput(
            name="embedding_service",
            display_name="Embedding Model or Astra Vectorize",
            info="Determines whether to use Astra Vectorize for the collection.",
            options=["Embedding Model", "Astra Vectorize"],
            real_time_refresh=True,
            value="Embedding Model",
        ),
        HandleInput(
            name="embedding",
            display_name="Embedding Model",
            input_types=["Embeddings"],
            info="Allows an embedding model configuration.",
        ),
        DropdownInput(
            name="metric",
            display_name="Metric",
@@ -110,12 +158,6 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
            info="Optional list of metadata fields to include in the indexing.",
            advanced=True,
        ),
-        HandleInput(
-            name="embedding",
-            display_name="Embedding or Astra Vectorize",
-            input_types=["Embeddings", "dict"],
-            info="Allows either an embedding model or an Astra Vectorize configuration.",  # TODO: This should be optional, but need to refactor langchain-astradb first.
-        ),

Review comment: Users that are using this input will update langflow, refresh the component, and it will be broken.

Reply: @nicoloboschi I definitely see your point here - but I did maintain the embeddings support. The broken compatibility would be if someone had a flow that was using the AstraVectorize component as input to this input, right? In this PR, I remove that component entirely since it's built in now. Are you suggesting we should keep the separate component as well, for backwards-compatibility purposes? I think in the other cases backwards compatibility is maintained.

Reply: I think that was mentioned, yeah - keep the separate component around for backwards compatibility. On the other hand, how many people have a flow using vectorize right now? If we can significantly reduce confusion for future vectorize users by removing this separate component, it may be worth breaking backwards compatibility in this case. (Disclaimer: I, of course, would never advocate for breaking backwards compatibility.)

        StrInput(
            name="metadata_indexing_exclude",
            display_name="Metadata Indexing Exclude",
@@ -160,7 +202,159 @@ class AstraVectorStoreComponent(LCVectorStoreComponent):
    ]

-    @check_cached_vector_store
-    def build_vector_store(self):
    def insert_in_dict(self, build_config, field_name, new_parameters):
        # Insert the new key-value pair after the found key
        for new_field_name, new_parameter in new_parameters.items():
            # Get all the items as a list of tuples (key, value)
            items = list(build_config.items())

            # Find the index of the key to insert after
            for i, (key, value) in enumerate(items):
                if key == field_name:
                    break

            items.insert(i + 1, (new_field_name, new_parameter))

            # Clear the original dictionary and update with the modified items
            build_config.clear()
            build_config.update(items)

        return build_config
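As a small illustration of what insert_in_dict does (not part of the diff; the build_config entries below are hypothetical and simplified):

# Illustration only - hypothetical, simplified build_config entries.
build_config = {
    "collection_name": {"display_name": "Collection Name"},
    "embedding_service": {"display_name": "Embedding Model or Astra Vectorize"},
    "metric": {"display_name": "Metric"},
}
new_field = {"provider": {"display_name": "Vectorize Provider"}}
# Inserts "provider" immediately after "embedding_service", preserving the order of the other keys.
build_config = component.insert_in_dict(build_config, "embedding_service", new_field)
# Key order is now: collection_name, embedding_service, provider, metric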
    def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None):
        if field_name == "embedding_service":
            if field_value == "Astra Vectorize":
                for field in ["embedding"]:
                    if field in build_config:
                        del build_config[field]

                new_parameter = DropdownInput(
                    name="provider",
                    display_name="Vectorize Provider",
                    options=self.VECTORIZE_PROVIDERS_MAPPING.keys(),
                    value="",
                    required=True,
                    real_time_refresh=True,
                ).to_dict()

                self.insert_in_dict(build_config, "embedding_service", {"provider": new_parameter})
            else:
                for field in [
                    "provider",
                    "z_00_model_name",
                    "z_01_model_parameters",
                    "z_02_api_key_name",
                    "z_03_provider_api_key",
                    "z_04_authentication",
                ]:
                    if field in build_config:
                        del build_config[field]

                new_parameter = HandleInput(
                    name="embedding",
                    display_name="Embedding Model",
                    input_types=["Embeddings"],
                    info="Allows an embedding model configuration.",
                ).to_dict()

                self.insert_in_dict(build_config, "embedding_service", {"embedding": new_parameter})
        elif field_name == "provider":
            for field in [
                "z_00_model_name",
                "z_01_model_parameters",
                "z_02_api_key_name",
                "z_03_provider_api_key",
                "z_04_authentication",
            ]:
                if field in build_config:
                    del build_config[field]

            model_options = self.VECTORIZE_PROVIDERS_MAPPING[field_value][1]

            new_parameter_0 = DropdownInput(
                name="z_00_model_name",
                display_name="Model Name",
                info=f"The embedding model to use for the selected provider. Each provider has a different set of models "
                f"available (full list at https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html):\n\n{', '.join(model_options)}",
                options=model_options,
                required=True,
            ).to_dict()

            new_parameter_1 = DictInput(
                name="z_01_model_parameters",
                display_name="Model Parameters",
                is_list=True,
            ).to_dict()

            new_parameter_2 = MessageTextInput(
                name="z_02_api_key_name",
                display_name="API Key name",
                info="The name of the embeddings provider API key stored on Astra. If set, it will override the 'ProviderKey' in the authentication parameters.",
            ).to_dict()

            new_parameter_3 = SecretStrInput(
                name="z_03_provider_api_key",
                display_name="Provider API Key",
                info="An alternative to the Astra Authentication that passes an API key for the provider with each request to Astra DB. This may be used when Vectorize is configured for the collection, but no corresponding provider secret is stored within Astra's key management system.",
            ).to_dict()

            new_parameter_4 = DictInput(
                name="z_04_authentication",
                display_name="Authentication parameters",
                is_list=True,
            ).to_dict()

            self.insert_in_dict(
                build_config,
                "provider",
                {
                    "z_00_model_name": new_parameter_0,
                    "z_01_model_parameters": new_parameter_1,
                    "z_02_api_key_name": new_parameter_2,
                    "z_03_provider_api_key": new_parameter_3,
                    "z_04_authentication": new_parameter_4,
                },
            )

        return build_config
    def build_vectorize_options(self, **kwargs):
        for attribute in [
            "provider",
            "z_00_api_key_name",
            "z_01_model_name",
            "z_02_authentication",
            "z_03_provider_api_key",
            "z_04_model_parameters",
        ]:
            if not hasattr(self, attribute):
                setattr(self, attribute, None)

        # Fetch values from kwargs if any self.* attributes are None
        provider_value = self.VECTORIZE_PROVIDERS_MAPPING.get(self.provider, [None])[0] or kwargs.get("provider")
        authentication = {**(self.z_02_authentication or kwargs.get("z_02_authentication", {}))}

        api_key_name = self.z_00_api_key_name or kwargs.get("z_00_api_key_name")
        provider_key_name = self.z_03_provider_api_key or kwargs.get("z_03_provider_api_key")
        if provider_key_name:
            authentication["providerKey"] = provider_key_name
        if api_key_name:
            authentication["providerKey"] = api_key_name

        return {
            # must match astrapy.info.CollectionVectorServiceOptions
            "collection_vector_service_options": {
                "provider": provider_value,
                "modelName": self.z_01_model_name or kwargs.get("z_01_model_name"),
                "authentication": authentication,
                "parameters": self.z_04_model_parameters or kwargs.get("z_04_model_parameters", {}),
            },
            "collection_embedding_api_key": self.z_03_provider_api_key or kwargs.get("z_03_provider_api_key"),
        }
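For illustration, if the dynamic fields were filled in for the NVIDIA provider with an Astra-stored key name, the dictionary returned above would look roughly like this (not part of the diff; all values are made up):

# Illustration only - not part of this PR's diff; values are hypothetical.
example_vectorize_options = {
    "collection_vector_service_options": {
        "provider": "nvidia",          # from VECTORIZE_PROVIDERS_MAPPING["NVIDIA"][0]
        "modelName": "NV-Embed-QA",
        "authentication": {"providerKey": "my-astra-kms-key-name"},
        "parameters": {},
    },
    "collection_embedding_api_key": None,  # only set when a provider API key is passed per request
}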
    @check_cached_vector_store
    def build_vector_store(self, vectorize_options=None):

Review comment: Curious about the parameter addition here. Is it only used for testing purposes? In the main path, there's no scenario where this wouldn't be None.

Reply: That's correct! I left a comment up above for a similar reason why it was done, but if there's a better / easier way I missed, let me know, because I didn't like it either. The goal was basically to allow the tests to execute successfully with pytest (and, for what it's worth, the 5 test_astra_vectorize tests do in fact work for me), but I would love it if, rather than having these optional parameters, it simulated more closely how the UI executes the code. The dynamic inputs are the challenge...

Reply: Hm, yeah, it's very possible we don't yet have any tests that use dynamic inputs, and thus haven't had to support a way to do that. I'll pull this and play around a bit this afternoon just to try as well.
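As a rough illustration of the testing use discussed in this thread (not part of the diff; it assumes a component instance already configured with a token, API endpoint, and collection name), a test might bypass the dynamic UI fields by building the options dict directly:

# Illustration only - not part of this PR's diff.
# Assumes component is an AstraVectorStoreComponent already configured with credentials.
options = component.build_vectorize_options(
    provider="nvidia",               # hypothetical provider choice
    z_01_model_name="NV-Embed-QA",   # hypothetical model choice
)
vector_store = component.build_vector_store(vectorize_options=options)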
        try:
            from langchain_astradb import AstraDBVectorStore
            from langchain_astradb.utils.astradb import SetupMode
@@ -178,22 +372,22 @@ def build_vector_store(self):
        except KeyError:
            raise ValueError(f"Invalid setup mode: {self.setup_mode}")

-        if not isinstance(self.embedding, dict):
+        if self.embedding:
            embedding_dict = {"embedding": self.embedding}
        else:
            from astrapy.info import CollectionVectorServiceOptions

-            dict_options = self.embedding.get("collection_vector_service_options", {})
+            dict_options = vectorize_options or self.build_vectorize_options()
            dict_options["authentication"] = {
                k: v for k, v in dict_options.get("authentication", {}).items() if k and v
            }
            dict_options["parameters"] = {k: v for k, v in dict_options.get("parameters", {}).items() if k and v}

            embedding_dict = {
-                "collection_vector_service_options": CollectionVectorServiceOptions.from_dict(dict_options)
+                "collection_vector_service_options": CollectionVectorServiceOptions.from_dict(
+                    dict_options.get("collection_vector_service_options", {})
+                ),
            }
            collection_embedding_api_key = self.embedding.get("collection_embedding_api_key")
            if collection_embedding_api_key:
                embedding_dict["collection_embedding_api_key"] = collection_embedding_api_key

        vector_store_kwargs = {
            **embedding_dict,
@@ -223,6 +417,7 @@ def build_vector_store(self):
            raise ValueError(f"Error initializing AstraDBVectorStore: {str(e)}") from e

        self._add_documents_to_vector_store(vector_store)

        return vector_store

    def _add_documents_to_vector_store(self, vector_store):
@@ -262,8 +457,9 @@ def _build_search_args(self):
            args["filter"] = clean_filter
        return args

-    def search_documents(self) -> list[Data]:
-        vector_store = self.build_vector_store()
+    def search_documents(self, vector_store=None) -> list[Data]:
+        if not vector_store:
+            vector_store = self.build_vector_store()

Review comment: I don't think we need this change - the check_cached decorator handles this.

Reply: So, the only reason I made this change was to more easily allow a test to be created. The problem with the tests is that I wasn't sure how I could build the component with parameters that weren't in the initial configuration - in the UI, it'll dynamically update the component based on the value of the dropdown, but is there a way to programmatically perform that same computation? So I allowed an optional inclusion of the vector store object for the purposes of the test, but that would never be used in the happy path in the component. Let me know, though, if you see a better way to do that.
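A hedged sketch of the kind of programmatic build-config update the comment is asking about (illustrative only, not from the PR or the thread; how a test obtains build_config is exactly the open question):

# Illustration only - not part of this PR's diff.
# Assumes component is an AstraVectorStoreComponent and build_config is its current
# build-config dict, however the test obtains it.
build_config = component.update_build_config(build_config, "Astra Vectorize", "embedding_service")
build_config = component.update_build_config(build_config, "NVIDIA", "provider")
# After these calls, build_config should contain the "provider" field plus the
# z_00_model_name ... z_04_authentication fields inserted by update_build_config above.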
        logger.debug(f"Search input: {self.search_input}")
        logger.debug(f"Search type: {self.search_type}")
@@ -0,0 +1,3 @@
from .AstraDB import AstraVectorStoreComponent

__all__ = ["AstraVectorStoreComponent"]
Review comment: Should we default to one or the other?

Reply: Good point! For compatibility, I suppose defaulting to Embeddings makes sense, so we don't break existing flows that were using it. (We are currently breaking Vectorize-based flows, as mentioned...)

Side note: you may be wondering why it's a dropdown rather than a boolean input / toggle. I tried to use the switch, but for whatever reason, disabling the flag didn't trigger the call to update_build_config - something I want to bring up with people on the langflow team...
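For reference, a minimal sketch of the toggle-style input the side note says was tried (illustrative only; the field name is hypothetical, and the PR instead ships the DropdownInput with value="Embedding Model" as the default):

# Illustration only - the alternative toggle described in the side note, not what the PR ships.
BoolInput(
    name="use_astra_vectorize",   # hypothetical field name
    display_name="Use Astra Vectorize",
    value=False,                  # default to the plain embedding-model path
    real_time_refresh=True,       # per the comment, toggling off did not re-trigger update_build_config
)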