Add the beginnings of AI Semantic conventions #483
Co-authored-by: Nathan Slaughter <28688390+nslaughter@users.noreply.github.com>
We have many customers that would benefit from these semantic conventions getting merged in the short term even if in Experimental status.
It's a great start. I think the placeholders (TODOs and empty files) should be removed and added at a later date. I'd also reduce the list down to the essentials in order to get a PR approved, and then we can add more in future PRs. For example, the OpenAI list can be greatly reduced by eliminating the deprecated Chat API and combining ChatCompletions into one list for streaming and non-streaming.
It's also important to have metrics defined. We started a draft some time ago for OpenAI: https://github.com/lmolkova/semantic-conventions/tree/openai/docs/openai. Feel free to cherry-pick.
I'm happy to help any way I can to get this into main. I can open a PR against your branch with some updates if that helps.
<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.completion` | string | The full response string from an LLM. If the LLM responds with a more complex output like a JSON object made up of several pieces (such as OpenAI's message choices), this field is the content of the response. If the LLM produces multiple responses, then this field is left blank, and each response is instead captured in an attribute determined by the specific LLM technology semantic convention for responses. | `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
In openai, you have completion_tokens, prompt_tokens, etc. Is that not generally applicable here?
On multiple responses from LLM, if these are captured as events (see my earlier suggestion) then this could be handled by adding multiple events to the Span.
Unfortunately, not every LLM supports this in its response. For example, Anthropic's client SDK has a separate `count_tokens` function that you pass your prompt and/or response to in order to get this information.
Perhaps this could be done as an optional attribute, since the reality is that most people are using OpenAI.
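A minimal sketch of treating token counts as optional attributes, as suggested above. It assumes an OpenAI-style `usage` dict with `prompt_tokens` / `completion_tokens` / `total_tokens` fields; the `llm.usage.*` keys are hypothetical names, not attributes defined in this PR. When a provider does not report usage, the attributes are simply omitted.

```python
from typing import Any, Dict, Optional


def usage_attributes(usage: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Map a provider's token usage block to span attributes, if present.

    The `llm.usage.*` keys are hypothetical, for illustration only.
    """
    attrs: Dict[str, int] = {}
    if not usage:
        # No usage reported (e.g. counting needs a separate call), so the
        # optional attributes are left off the span entirely.
        return attrs
    for field in ("prompt_tokens", "completion_tokens", "total_tokens"):
        value = usage.get(field)
        if isinstance(value, int):
            attrs[f"llm.usage.{field}"] = value
    return attrs
```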
1. Data privacy concerns. End users of LLM applications may input sensitive information or personally identifiable information (PII) that they do not wish to be sent to a telemetry backend.
2. Data size concerns. Although there is no specified limit to the size of an attribute, there are practical limitations in programming languages and telemetry systems. Some LLMs allow for extremely large context windows that end users may take full advantage of.
By default, these configurations SHOULD capture inputs and outputs.
Should these inputs and outputs be added as Events instead of directly to the span? They aren't directly used for querying, and Events in some systems have higher limits on attribute size.
I would disagree with that. Inputs and outputs are definitely used for querying, such as:
"For a system doing text -> json, show me all groups of inputs and outputs where we failed to parse a json response"
Or:
"Group inputs by feedback responses"
Or:
"For input , show all grouped outputs"
While a backend could in theory assemble these from span events, I think it's far more likely that a tracing backend would just look for this data directly on the spans. I also don't think it fits the conceptual model for span events, as there's not really a meaningful timestamp to assign to this data; it'd have to be contrived or zeroed out.
It's common for backends to limit attribute length.
E.g.
In addition to backend limitations, attribute values will stay in memory until spans are exported and may significantly increase otel memory consumption.
Events have the same limitations, so logs seem the only reasonable option given verbosity and the ability to export them right away.
It's still possible to query logs/events (as long as they are in the same backend).
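A minimal sketch of the "export content as logs right away" idea, using only Python's standard `logging` module. The logger name, the `event_name` field, and passing trace/span ids by hand are all assumptions for illustration; a real instrumentation would presumably rely on an OpenTelemetry logs bridge for correlation, which is not shown. The ids below are illustrative values only.

```python
import logging
from typing import List

logger = logging.getLogger("llm.content")  # hypothetical logger name


def emit_completion_log(trace_id: str, span_id: str, completion: str) -> None:
    # Emit the (potentially large) completion as a log record correlated with
    # the span, instead of holding it as a span attribute until export.
    logger.info(
        completion,
        extra={"trace_id": trace_id, "span_id": span_id, "event_name": "llm.completion"},
    )


class ListHandler(logging.Handler):
    """Collects records in memory so the example can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
emit_completion_log(
    "4bf92f3577b34da6a3ce929d0e0e4736",
    "00f067aa0ba902b7",
    "Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!",
)
```

The span itself then only carries small, queryable attributes, while the content can be shipped as soon as it is produced.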
docs/ai/openai.md
|---|---|---|---|---|
| `llm.openai.messages.<index>.role` | string | The assigned role for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
| `llm.openai.messages.<index>.message` | string | The message for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `You are an AI system that tells jokes about OpenTelemetry.` | Required |
| `llm.openai.messages.<index>.name` | string | If present, the message for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `You are an AI system that tells jokes about OpenTelemetry.` | Required |
This is redundant description and example with line above.
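A minimal sketch of flattening a request's messages into the indexed `llm.openai.messages.<index>.*` attributes from the table above. It assumes OpenAI-style message dicts with `role`, `content`, and an optional `name`, and reads the `name` row as the participant name rather than the message text (per the comment above); that reading is an assumption, not what the row currently says.

```python
from typing import Dict, List


def message_attributes(messages: List[Dict[str, str]]) -> Dict[str, str]:
    """Flatten chat messages into llm.openai.messages.<index>.* attributes."""
    attrs: Dict[str, str] = {}
    for index, msg in enumerate(messages):  # <index> starts at 0
        prefix = f"llm.openai.messages.{index}"
        attrs[f"{prefix}.role"] = msg["role"]
        attrs[f"{prefix}.message"] = msg["content"]
        # `name` is optional in the request, so it is only recorded when present.
        if "name" in msg:
            attrs[f"{prefix}.name"] = msg["name"]
    return attrs
```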
docs/ai/openai.md
| `llm.openai.functions.<index>.name` | string | If present, name of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `get_weather_forecast` | Required |
| `llm.openai.functions.<index>.parameters` | string | If present, JSON-encoded string of the parameter object of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object", "properties": {}}` | Required |
| `llm.openai.functions.<index>.description` | string | If present, description of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `Gets the weather forecast.` | Required |
| `llm.openai.n` | int | If present, the number of messages an OpenAI request responds with. | `2` | Recommended |
If using Span Events, this won't be needed.
docs/ai/openai.md
<!-- semconv llm.openai(tag=llm-response-tech-specific) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
We should consider using Span Events instead of "indexed" attributes here.
Why would span events make more sense here than attributes?
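One way to see the difference: with events, each choice gets its own fixed set of attribute keys, so the number of choices is not baked into the key names (and `llm.openai.n` becomes unnecessary). A minimal sketch, assuming OpenAI-style `choices` entries; `SpanEvent` is a stand-in for what the OpenTelemetry Python API's `span.add_event(name, attributes)` takes, and the `llm.openai.choice` event name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SpanEvent:
    """Stand-in for a span event: a name plus its own attribute set."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


def choices_to_events(choices: List[Dict[str, Any]]) -> List[SpanEvent]:
    # One event per choice: attribute keys stay the same for every choice,
    # instead of llm.openai.choices.0.*, llm.openai.choices.1.*, and so on.
    events = []
    for choice in choices:
        message = choice["message"]
        events.append(
            SpanEvent(
                name="llm.openai.choice",  # hypothetical event name
                attributes={
                    "index": choice["index"],
                    "role": message["role"],
                    "content": message["content"],
                    "finish_reason": choice["finish_reason"],
                },
            )
        )
    return events
```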
docs/ai/openai.md
| `llm.openai.choices.<index>.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
| `llm.openai.choices.<index>.content` | string | The content for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
| `llm.openai.choices.<index>.function_call.name` | string | If exists, the name of a function call for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `get_weather_report` | Required |
| `llm.openai.choices.<index>.function_call.arguments` | string | If exists, the arguments to call a function call with for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object", "properties": {"some":"data"}}` | Required |
Again, these could be Span Events with a type attribute of function.
docs/ai/openai.md
<!-- semconv llm.openai(tag=llm-response-tech-specific-chunk) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.delta.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
I'm not sure I completely understand the use case, but this seems like it would be an awful lot of attributes for each stream delta (really, one for every token?). Instead of having a separate set of attributes for streaming, why not just combine them with ChatCompletions and add an attribute that says it was a "Stream"?
I think for the short term, to get a PR approved, I'd focus this list on just ChatCompletions. Chat is deprecated and limited to older models. And it will be much simpler to start if it's just one list for both non-streaming and streaming.
Do you mean the completions endpoint? I initially added a section there because (at the time) GPT-3.5-turbo-instruct was added. The docs are a little confusing, though, as the endpoint is considered legacy, but the model is quite new.
Happy to remove it for now, though.
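A minimal sketch of the "one list for streaming and non-streaming" suggestion: fold the streamed delta chunks back into one record per choice, in the same shape as a non-streaming response, plus a flag marking it as streamed. It assumes chunks shaped like OpenAI chat completion stream chunks (`choices[].index`, `choices[].delta`, `choices[].finish_reason`); the `stream` key is a hypothetical flag, not an attribute from this PR.

```python
from collections import defaultdict
from typing import Any, Dict, List


def assemble_stream(chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold streamed delta chunks into one record per choice index."""
    content: Dict[int, List[str]] = defaultdict(list)
    roles: Dict[int, str] = {}
    finish: Dict[int, str] = {}
    for chunk in chunks:
        for choice in chunk["choices"]:
            idx = choice["index"]
            delta = choice.get("delta") or {}
            if "role" in delta:  # role typically arrives only in the first chunk
                roles[idx] = delta["role"]
            if delta.get("content"):
                content[idx].append(delta["content"])
            if choice.get("finish_reason"):
                finish[idx] = choice["finish_reason"]
    choices = [
        {
            "index": idx,
            "role": roles.get(idx),
            "content": "".join(content[idx]),
            "finish_reason": finish.get(idx),
        }
        for idx in sorted(set(roles) | set(content) | set(finish))
    ]
    # Same shape as a non-streaming response, plus a flag saying it was streamed.
    return {"stream": True, "choices": choices}
```

This keeps the attribute count independent of the number of tokens streamed.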
We should push as much as possible to find a common set of attributes. But if you look at other areas like Database semantic conventions, there is a pattern for including vendor specific additions that build on the core set. So yes, I'd expect some specific conventions for openai, watsonx, etc. For this PR, I'd focus on a small set to start and we can add more via further PRs. It will be at "Experimental" level so changes will be expected.
Yeah, I'd prefer to keep the scope smaller here. As far as I'm aware, once you're past OpenAI/Anthropic/Cohere there are very few end users for other commercial options. Open source is trickier, since a fine-tuned model can emit just about anything in any format, so the generic attributes are about as good as we can get for now.
@drewby Feel free to PR against my branch! I have time to address things and get this over the hump, but the more contributions, the better 🙂
## Configuration
Instrumentations for LLMs MUST offer the ability to turn off capture of raw inputs to LLM requests and the completion response text for LLM responses. This is for two primary reasons:
In other semconvs, we control this with the Opt-in requirement level.
Opt-in attributes are always off by default and instrumentations MAY provide configuration.
Given the privacy, verbosity and consistency reasons, I believe we should do the same here.
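A minimal sketch of what opt-in content capture could look like in an instrumentation: `llm.prompt` and `llm.completion` are only produced when capture is explicitly enabled, and values are truncated to address the size concern. The `capture_content` and `max_length` parameters and the 4096 default are hypothetical; this PR does not define any configuration names or limits. Truncation here counts characters, not bytes.

```python
from typing import Dict, Optional

# Hypothetical default; no length limit is specified by this PR.
DEFAULT_MAX_LENGTH = 4096


def content_attributes(
    prompt: str,
    completion: Optional[str],
    capture_content: bool = False,  # opt-in: off unless explicitly enabled
    max_length: int = DEFAULT_MAX_LENGTH,
) -> Dict[str, str]:
    """Return llm.prompt / llm.completion only when content capture is enabled."""
    if not capture_content:
        return {}
    attrs = {"llm.prompt": prompt[:max_length]}
    if completion is not None:
        attrs["llm.completion"] = completion[:max_length]
    return attrs
```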
I have concerns around:
- capturing extensive amounts of data by default
- fitting it into potentially strictly limited attribute values
- capturing sensitive data (by default)
- capturing contents - we never capture contents of HTTP requests/responses, DB responses (even queries are controversial), messaging payloads, etc and we do not have a good approach for it in OTel.
I suggest starting with the noncontroversial part that does not include prompts/completions and then evolving it to potentially include contents.
JFYI: we've been baking something around Azure OpenAI that's consistent with the current stuff in OTel semconv in case you want to take a look - https://github.com/open-telemetry/semantic-conventions/pull/513/files
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.model` | string | The name of the LLM a request is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model used. If the LLM is a fine-tuned custom model, the value SHOULD have a more specific name than the base model that's been fine-tuned. | `gpt-4` | Required |
| `llm.prompt` | string | The full prompt string sent to an LLM in a request. If the LLM accepts a more complex input like a JSON object made up of several pieces (such as OpenAI's different message types), this field is that entire JSON object encoded as a string. | `\n\nHuman:You are an AI assistant that tells jokes. Can you tell me a joke about OpenTelemetry?\n\nAssistant:` | Required |
Given its verbosity and the fact that it contains sensitive and private data, this attribute should be opt-in.
<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.completion` | string | The full response string from an LLM. If the LLM responds with a more complex output like a JSON object made up of several pieces (such as OpenAI's message choices), this field is the content of the response. If the LLM produces multiple responses, then this field is left blank, and each response is instead captured in an attribute determined by the specific LLM technology semantic convention for responses. | `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
For the same reasons as `llm.prompt`, this should be opt-in (and probably an event/log).
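A minimal sketch of how the `llm.prompt` and `llm.completion` values described in the tables could be derived, assuming OpenAI-style request messages and response `choices`. It reads "left blank" as "not set" (returning `None`), which is an interpretation, and it only computes the values; whether they are recorded at all would sit behind the opt-in discussed above.

```python
import json
from typing import Any, Dict, List, Optional


def prompt_value(messages: List[Dict[str, Any]]) -> str:
    # Structured input (e.g. OpenAI chat messages) is encoded as one JSON string.
    return json.dumps(messages)


def completion_value(choices: List[Dict[str, Any]]) -> Optional[str]:
    # A single response fills llm.completion; with several responses the field
    # is left unset and each response goes into vendor-specific attributes.
    if len(choices) != 1:
        return None
    return choices[0]["message"]["content"]
```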
docs/ai/openai.md
| `llm.openai.choices.<index>.finish_reason` | string | The reason the OpenAI model stopped generating tokens for this chunk. | `stop` | Recommended |
| `llm.openai.id` | string | The unique identifier for the chat completion. | `chatcmpl-123` | Recommended |
| `llm.openai.created` | int | The UNIX timestamp (in seconds) of when the completion was created. | `1677652288` | Recommended |
| `llm.openai.model` | string | The name of the model used for the completion. | `gpt-3.5-turbo` | Recommended |
Should be covered by `llm.model`, so not necessary?
docs/ai/openai.md
| `llm.openai.choices.<index>.delta.function_call.arguments` | string | If exists, the arguments to call a function call with for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object",` | Required |
| `llm.openai.choices.<index>.finish_reason` | string | The reason the OpenAI model stopped generating tokens for this chunk. | `stop` | Recommended |
| `llm.openai.id` | string | The unique identifier for the chat completion. | `chatcmpl-123` | Recommended |
| `llm.openai.created` | int | The UNIX timestamp (in seconds) of when the completion was created. | `1677652288` | Recommended |
would be a good timestamp for the log/event
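If `created` becomes the log/event timestamp, note that it is in whole seconds while OpenTelemetry timestamps are expressed as nanoseconds since the Unix epoch. A minimal conversion sketch (the function name is hypothetical), using the example value from the table above:

```python
from datetime import datetime, timezone


def created_to_event_time_ns(created: int) -> int:
    """Convert OpenAI's `created` (Unix seconds) to nanoseconds since the epoch."""
    return created * 1_000_000_000


ts_ns = created_to_event_time_ns(1677652288)
# Human-readable UTC form, e.g. for checking the value in a backend.
created_utc = datetime.fromtimestamp(1677652288, tz=timezone.utc)
```

Second-level precision means several events created in the same second would share a timestamp.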
docs/ai/openai.md
<!-- semconv llm.openai(tag=llm-response-tech-specific-chunk) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.delta.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
instead of representing the whole response as one span, would it perhaps be better to represent each completion as an individual span and avoid having indexed attributes?
I would also like to request a review from @mikeldking from Arize. The Arize team started the Open-Inference Spec initiative, which includes Semantic Conventions for Traces.
Thanks for the nomination @sudivate - this initiative is very much something we've been looking for and have a fair amount of learnings from our implementation of the OpenInference semantic conventions. Will follow along and try to give informed feedback as I see it. Exciting progress!
Hi @drewby, and others: I saw you mentioned adding metrics to this PR, but they're specific to OpenAI. I thought the metric conventions, just like the tracing conventions in this PR, need to be split into common ones plus vendor-specific ones. Will this be updated later? Also, I originally thought this PR was mainly about tracing, but now that I see metrics for OpenAI have been added, will this PR also cover metrics?
@cartermp would love to take this over
@nirga go for it! I don't have any staged changes, so feel free to carry on from here. My main TODO was to redefine the request/response as logs.
@nirga, would a call make sense to sync up on scope for this? We may also want to have more discussion in a Slack thread in the SIG channel for semantic conventions. I'm normally in Japan time, but will be in the US for two weeks starting 12/14 and will have more time through the end of the year.
We could focus a PR on tracing first, but metrics would also be useful to have some common data model / semantic conventions.
@drewby I’ll ping you on slack
<!-- endsemconv -->
### Metric: `llm.openai.chat_completions.duration`
I don't see a duration attribute on the chat completion object at https://platform.openai.com/docs/api-reference/chat/object; am I missing anything?
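Presumably a duration metric like this would be measured by the instrumentation around the client call rather than read from the response object. A minimal sketch under that assumption, with a plain list standing in for a histogram instrument; `timed_call` is a hypothetical helper, not part of any SDK.

```python
import time
from typing import Any, Callable, List


def timed_call(fn: Callable[[], Any], durations: List[float]) -> Any:
    """Run an LLM request and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    try:
        return fn()
    finally:
        # Recorded even if the request raises, so failed calls are measured too.
        durations.append(time.perf_counter() - start)
```

`time.perf_counter` is monotonic, so the measurement is not affected by system clock adjustments during the request.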
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
^ continued in #639
FYI, a first version of this is now merged with #825 |
Fixes #327
Changes
As mentioned in #327, this introduces semantic conventions for modern AI systems. While there's a lot of machine learning stuff that doesn't involve LLMs and vector DBs, adoption of this tech is so high, and still growing, that it's a good one to start with. Furthermore, with projects like OpenLLMetry likely moving into the CNCF space, there's no time like the present to get started here.
Merge requirement checklist