[Obs AI Assistant] Add uuid to knowledge base entries to avoid overwriting accidentally #191043

Merged

Conversation

sorenlouv
Member

@sorenlouv sorenlouv commented Aug 22, 2024

Closes #184069

The Problem
The LLM decides the identifier (both _id and doc_id) for knowledge base entries. The _id must be globally unique in Elasticsearch, but the LLM can easily pick the same id for different users, thereby overwriting one user's learning with another user's learning.

Solution
The LLM should not pick the _id. With this PR a UUID is generated for new entries. This means the LLM will only be able to create new KB entries - it will not be able to update existing ones.
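
A minimal sketch of the idea, not the code from this PR: the caller supplies only the content and the server assigns the id. The `KbEntry` shape, the `createEntry` helper, and the in-memory `Map` standing in for the Elasticsearch index are all hypothetical, and Node's built-in `randomUUID` is assumed as the UUID source.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical, simplified entry shape (not the actual Kibana type).
interface KbEntry {
  id: string;
  title: string;
  text: string;
  user: string;
}

// In-memory stand-in for the Elasticsearch index, keyed by document _id.
const index = new Map<string, KbEntry>();

// The caller (for example the LLM) supplies the content, never the id.
function createEntry(input: Omit<KbEntry, 'id'>): KbEntry {
  const entry: KbEntry = { id: randomUUID(), ...input };
  index.set(entry.id, entry);
  return entry;
}

// Two users store an entry with the same title; neither overwrites the other.
const a = createEntry({ title: 'favorite_color', text: 'Blue', user: 'alice' });
const b = createEntry({ title: 'favorite_color', text: 'Red', user: 'bob' });
```

Because the id is never taken from the caller, this path can only create entries, which matches the trade-off described above.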

doc_id has been removed and replaced with a title property. Title is simply a human-readable string; it is not used to identify KB entries.
To retain backwards compatibility, we will display the doc_id if title is not available.
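
The display fallback could look roughly like the sketch below. `StoredEntry` and `getDisplayTitle` are hypothetical names, and returning an empty string when neither field is present is an assumption rather than something this PR specifies.

```typescript
// Hypothetical shape: entries written before this change may only carry `doc_id`.
interface StoredEntry {
  id: string;
  title?: string;
  doc_id?: string;
}

// Prefer `title`; fall back to the legacy `doc_id` for older entries.
// The final '' fallback is an assumption made for this sketch.
function getDisplayTitle(entry: StoredEntry): string {
  return entry.title ?? entry.doc_id ?? '';
}
```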

@sorenlouv sorenlouv requested a review from a team as a code owner August 22, 2024 04:22
@botelastic botelastic bot added ci:project-deploy-observability Create an Observability project Team:Obs AI Assistant Observability AI Assistant labels Aug 22, 2024
@obltmachine

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from 614ee57 to e627b85 Compare August 22, 2024 04:26
@sorenlouv sorenlouv added release_note:fix v8.16.0 and removed ci:project-deploy-observability Create an Observability project labels Aug 22, 2024
@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Aug 22, 2024
@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from 0cc07e8 to a9ed9e9 Compare August 27, 2024 18:41
@dgieselaar
Member

I've not looked through the code so maybe you took this into account, but we also have the documents that we pre-load into the knowledge base. Those should not have dynamically generated uuids, but predetermined IDs.

@@ -79,9 +79,10 @@ export type ConversationUpdateRequest = ConversationRequestBase & {

 export interface KnowledgeBaseEntry {
   '@timestamp': string;
-  id: string;
+  id: string; // unique ID
+  doc_id?: string; // human readable ID generated by the LLM and used by the LLM to lookup and update existing entries. TODO: rename `doc_id` to `lookup_id`
Member Author

@sorenlouv sorenlouv Aug 27, 2024

id is globally unique, doc_id is only unique per user. Multiple entries can be assigned the same doc_id if they are created for different users.
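
To illustrate the distinction at this stage of the review, here is a small sketch in which lookups by doc_id are always scoped to a user. The `Entry` shape, the sample data, and `findByDocId` are hypothetical and only mirror the comment above.

```typescript
// Hypothetical entry shape for this revision of the PR.
interface Entry {
  id: string; // globally unique
  doc_id: string; // unique only within one user's entries
  user: string;
  text: string;
}

const entries: Entry[] = [
  { id: 'uuid-1', doc_id: 'favorite_color', user: 'alice', text: 'Blue' },
  { id: 'uuid-2', doc_id: 'favorite_color', user: 'bob', text: 'Red' },
];

// A doc_id lookup is only meaningful together with the user it belongs to.
function findByDocId(user: string, docId: string): Entry | undefined {
  return entries.find((e) => e.user === user && e.doc_id === docId);
}
```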

Comment on lines -102 to +103
- doc_id?: string;
+ id?: string;
Member Author

@sorenlouv sorenlouv Aug 27, 2024

doc_id can be used by the LLM to look up entries. I see no reason to expand that concept to instructions. Instructions can still have predetermined IDs; they do not have to be UUIDs. See the lens docs for an example of this.

@sorenlouv sorenlouv force-pushed the add-uuid-to-kb-entries-to-avoid-overwriting branch from a9ed9e9 to 14854d2 Compare August 27, 2024 20:44
@@ -42,7 +42,7 @@ const chatCompleteBaseRt = t.type({
]),
instructions: t.array(
t.intersection([
- t.partial({ doc_id: t.string }),
+ t.partial({ id: t.string }),
Member Author

@sorenlouv sorenlouv Aug 27, 2024

It's still possible to overwrite existing instructions by specifying the id
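
A rough sketch of that behaviour, under the assumption that an instruction keeps a caller-supplied id when one is given and otherwise receives a generated one. `Instruction`, `saveInstruction`, and the in-memory `Map` are hypothetical and not taken from the PR.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical instruction shape and in-memory store.
interface Instruction {
  id: string;
  text: string;
}

const instructions = new Map<string, Instruction>();

// A provided id addresses an existing instruction (overwrite);
// without one, a new instruction is created under a fresh UUID.
function saveInstruction(input: { id?: string; text: string }): Instruction {
  const instruction: Instruction = { id: input.id ?? randomUUID(), text: input.text };
  instructions.set(instruction.id, instruction);
  return instruction;
}

const first = saveInstruction({ id: 'lens', text: 'v1' });
const updated = saveInstruction({ id: 'lens', text: 'v2' });
const fresh = saveInstruction({ text: 'no id given' });
```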

Comment on lines +38 to +41
keyword: {
type: 'keyword',
ignore_above: 256,
},
Member Author

@sorenlouv sorenlouv Aug 27, 2024

Adding a nested keyword sub-field in order to be able to sort on it. Using a keyword sub-field is recommended over fielddata because it is more performant (it should have been used for doc_id as well).
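
For context, an Elasticsearch multi-field mapping of this kind, and a sort that targets the sub-field, might look like the sketch below. The parent field name `title` and its `text` type are assumptions; only the `keyword` block comes from the diff above.

```typescript
// Assumed parent field: a `text` field named `title` with a `keyword` sub-field.
const properties = {
  title: {
    type: 'text',
    fields: {
      keyword: {
        type: 'keyword',
        ignore_above: 256,
      },
    },
  },
} as const;

// Sorting targets the sub-field rather than the analyzed text field,
// so fielddata does not need to be enabled on `title`.
const sort = [{ 'title.keyword': { order: 'asc' } }] as const;
```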

this.dependencies.logger.debug(
`Adding ${operations.length} operations to queue. Queue size now: ${this._queue.length})`
);
this._queue.push(...operations);
Member Author

@sorenlouv sorenlouv Aug 27, 2024

AFAICT we had a bug here before: because this._queue.push was called conditionally, operations were not added to the queue when isModelReady=true. This meant that anything imported after the model had been set up was being dropped 😱

In general I hope we can get rid of the queue, or separate the queuing logic from the knowledge base. Having the queue embedded makes it more complex to work with the KB than it needs to be.
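
The bug described above can be reproduced in a small stand-alone sketch. `ImportQueue`, its method names, and the `indexed` array are hypothetical; the real service is more involved, and this only contrasts a conditional push with an unconditional one.

```typescript
type Operation = { id: string };

class ImportQueue {
  readonly queue: Operation[] = [];
  readonly indexed: Operation[] = [];

  constructor(private readonly isModelReady: boolean) {}

  // Behaviour described as buggy: operations are only queued while the
  // model is NOT ready, so anything arriving later is silently dropped.
  enqueueConditionally(operations: Operation[]): void {
    if (!this.isModelReady) {
      this.queue.push(...operations);
    }
    this.processIfReady();
  }

  // Fixed behaviour: always queue, then process if the model is ready.
  enqueue(operations: Operation[]): void {
    this.queue.push(...operations);
    this.processIfReady();
  }

  private processIfReady(): void {
    if (!this.isModelReady) return;
    // Drain the queue; in the real service this would index the documents.
    this.indexed.push(...this.queue.splice(0));
  }
}

const buggy = new ImportQueue(true);
buggy.enqueueConditionally([{ id: 'a' }]);

const fixed = new ImportQueue(true);
fixed.enqueue([{ id: 'a' }]);
```

With the model already ready, `buggy.indexed` stays empty while `fixed.indexed` receives the operation.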

@sorenlouv
Member Author

I've not looked through the code so maybe you took this into account, but we also have the documents that we pre-load into the knowledge base. Those should not have dynamically generated uuids, but predetermined IDs.

@dgieselaar Perhaps see this comment #191043 (comment)

@sorenlouv sorenlouv requested a review from a team as a code owner November 6, 2024 23:25
@sorenlouv sorenlouv enabled auto-merge (squash) November 6, 2024 23:26
@@ -151,7 +151,6 @@ export default function ({ getService }: FtrProviderContext) {
'fleet:update_agent_tags:retry',
'fleet:upgrade_action:retry',
'logs-data-telemetry',
'observabilityAIAssistant:indexQueuedDocumentsTaskType',
Member Author

The task was removed. It is no longer needed.

@elasticmachine
Contributor

⏳ Build in-progress

  • Buildkite Build
  • Commit: 732f4ae
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-191043-732f4ae5430d

@sorenlouv sorenlouv merged commit 7c92a10 into elastic:main Nov 7, 2024
26 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11719652062

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 7, 2024
…iting accidentally (elastic#191043)

Co-authored-by: Sandra G <neptunian@users.noreply.github.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 7c92a10)
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@sorenlouv sorenlouv deleted the add-uuid-to-kb-entries-to-avoid-overwriting branch November 7, 2024 09:55
kibanamachine added a commit that referenced this pull request Nov 7, 2024
…overwriting accidentally (#191043) (#199263)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Obs AI Assistant] Add uuid to knowledge base entries to avoid
overwriting accidentally
(#191043)](#191043)

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

Co-authored-by: Søren Louv-Jansen <soren.louv@elastic.co>
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) ci:project-deploy-observability Create an Observability project release_note:fix Team:Obs AI Assistant Observability AI Assistant v8.17.0 v9.0.0
Development

Successfully merging this pull request may close these issues.

[Obs AI Assistant] Knowledge base entries with the same name overwrites each other without warning
8 participants