Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the beginnings of AI Semantic conventions #483

Closed
wants to merge 25 commits into from

Conversation

cartermp
Copy link
Contributor

@cartermp cartermp commented Nov 1, 2023

Fixes #327

Changes

As mentioned in #327, this introduces semantic conventions for modern AI systems. While there's a lot of machine learning stuff that doesn't involve LLMs and vector DBs, the sheer adoption of this tech is so high and growing that it's a good one to start with. Furthermore, with projects like OpenLLMetry likely moving into the CNCF space, there's no better time like the present to get started here.

Merge requirement checklist

@cartermp cartermp requested review from a team November 1, 2023 18:05
docs/ai/anthropic.md Outdated Show resolved Hide resolved
Co-authored-by: Nathan Slaughter <28688390+nslaughter@users.noreply.github.com>
Copy link
Member

@drewby drewby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have many customers that would benefit from these semantic conventions getting merged in the short term even if in Experimental status.

Its a great start. I think the placesholders (todo and empty files) would need to be removed and added at a later date. I'd also reduce the the list done to the essentials in order to get a PR approved and then we can add more in furture PRs. For example, the OpenAI list can be greatly reduced by eliminating the deprecated Chat api and combine ChatCompletions into one list for streaming and non-streaming.

Its also important to have metrics defined as well. We started a draft sometime ago for openai: https://github.com/lmolkova/semantic-conventions/tree/openai/docs/openai. Feel free to cherry pick.

I'm happy to help anyway I can, to get this main. I can do a PR to your branch with some updates if that helps.

<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.completion` | string | The full response string from an LLM. If the LLM responds with a more complex output like a JSON object made up of several pieces (such as OpenAI's message choices), this field is the content of the response. If the LLM produces multiple responses, then this field is left blank, and each response is instead captured in an attribute determined by the specific LLM technology semantic convention for responses.| `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In openai, you have completion_tokens, prompt_tokens, etc. Is that not generally applicable here?

On multiple responses from LLM, if these are captured as events (see my earlier suggestion) then this could be handled by adding multiple events to the Span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not every LLM supports this in their response. For example, in anthropic's client SDK they have a separate count_tokens function that you use to pass your prompt and/or response to to get this information.

Perhaps this could be done as an optional attribute, since the reality is that most people are using OpenAI.

1. Data privacy concerns. End users of LLM applications may input sensitive information or personally identifiable information (PII) that they do not wish to be sent to a telemetry backend.
2. Data size concerns. Although there is no specified limit to the size of an attribute, there are practical limitations in programming languages and telemety systems. Some LLMs allow for extremely large context windows that end users may take full advantage of.

By default, these configurations SHOULD capture inputs and outputs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these inputs and outputs be added as Events instead of directly to the span? They aren't directly used for query and Events in some systems have higher limits on attribute size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would disagree with that. Inputs and outputs are definitely used for querying, such as:

"For a system doing text -> json, show me all groups of inputs and outputs where we failed to parse a json response"

Or:

"Group inputs by feedback responses"

Or:

"For input , show all grouped outputs"

While a backend could in theory assemble these from span events, I think it's far more likely that a tracing backend would just look for this data directly on the spans. I also don't think it fits the conceptual model for span events, as there's not really a meaningful timestamp to assign to this data - it'd have to be contrived or zereod out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's common for backends to have limitations of attribute length

E.g.

In addition to backend limitations, attribute values will stay in memory until spans are exported and may significantly increase otel memory consumption.
Events have the same limitations, so logs seem the only reasonable option given verbosity and the ability to export them right away.

It's still possible to query logs/events (as long as they are in the same backend).

|---|---|---|---|---|
| `llm.openai.messages.<index>.role` | string | The assigned role for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
| `llm.openai.messages.<index>.message` | string | The message for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `You are an AI system that tells jokes about OpenTelemetry.` | Required |
| `llm.openai.messages.<index>.name` | string | If present, the message for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `You are an AI system that tells jokes about OpenTelemetry.` | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant description and example with line above.

| `llm.openai.functions.<index>.name` | string | If present, name of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `get_weather_forecast` | Required |
| `llm.openai.functions.<index>.parameters` | string | If present, JSON-encoded string of the parameter object of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object", "properties": {}}` | Required |
| `llm.openai.functions.<index>.description` | string | If present, description of an OpenAI function for a given OpenAI request, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `Gets the weather forecast.` | Required |
| `llm.openai.n` | int | If present, the number of messages an OpenAI request responds with. | `2` | Recommended |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using Span Events, this won't be needed.

<!-- semconv llm.openai(tag=llm-response-tech-specific) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider using Span Events instead of "indexed" attributes here.

Copy link
Contributor

@nirga nirga Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would span events make more sense here than attributes?

| `llm.openai.choices.<index>.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
| `llm.openai.choices.<index>.content` | string | The content for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
| `llm.openai.choices.<index>.function_call.name` | string | If exists, the name of a function call for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `get_weather_report` | Required |
| `llm.openai.choices.<index>.function_call.arguments` | string | If exists, the arguments to call a function call with for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object", "properties": {"some":"data"}}` | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, these could be Span Events with a type attribute of function.

<!-- semconv llm.openai(tag=llm-response-tech-specific-chunk) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.delta.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I compeletely understand the use case, but this seems like it be an awful lot of attributes for each stream delta (really, one for every token?). Instead of having a seperate set of attributes for Streaming, why not just combine with ChatCompletions with an attribute that says it was a "Stream"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the short-term to get a PR approved, I'd focus this list on just ChatCompletions. Chat is deperecated to older models. And it will be much simpler to start if its just one list for not streaming and streaming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the completions endpoint? I initially added a section there because (at the time) GPT-3.5-turbo-instruct was added. The docs are a little confusing, though, as the endpoint is considered legacy, but the model is quite new.

Happy to remove it for now, though.

@morningspace
Copy link

morningspace commented Nov 13, 2023

Hi @cartermp @drewby happen to come across this PR. I am seeing there are vendor specific convention for openai, anthropic, etc. Just curious, if it would also cover watsonx, in case there's anything specific?

@drewby
Copy link
Member

drewby commented Nov 13, 2023

Hi @cartermp @drewby happen to come across this PR. I am seeing there are vendor specific convention for openai, anthropic, etc. Just curious, if it would also cover watsonx, in case there's anything specific?

We should push as much as possible to find a common set of attributes. But if you look at other areas like Database semantic conventions, there is a pattern for including vendor specific additions that build on the core set. So yes, I'd expect some specific conventions for openai, watsonx, etc.

For this PR, I'd focus on a small set to start and we can add more via further PRs. It will be at "Experimental" level so changes will be expected.

@cartermp
Copy link
Contributor Author

Yeah, I'd prefer to keep the scope smaller here. As far as I'm aware, once you're past OpenAI/Anthropic/Cohere there's very few end-users for other commercial options. Open Source is tricker since a fine-tuned model can emit just about anything in any format, so the generic attributes is about as good as we could get for now.

@cartermp
Copy link
Contributor Author

@drewby Feel free to PR against my branch! I have time to address things and get this over the hump, but the more contributions, the better 🙂


## Configuration

Instrumentations for LLMs MUST offer the ability to turn off capture of raw inputs to LLM requests and the completion response text for LLM responses. This is for two primary reasons:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other semconvs we control it with Opt-in requirement level.

Opt-in attributes are always off by default and instrumentations MAY provide configuration.
Given the privacy, verbosity and consistency reasons, I believe we should do the same here.

Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns around:

  • capturing extensive amounts of data by default
  • fitting it into potentially strictly limited attribute values
  • capturing sensitive data (by default)
  • capturing contents - we never capture contents of HTTP requests/responses, DB responses (even queries are controversial), messaging payloads, etc and we do not have a good approach for it in OTel.

I suggest starting with noncontroversial part that does not include prompt/completions and then evolving it to potentially include contents.

JFYI: we've been baking something around Azure OpenAI that's consistent with the current stuff in OTel semconv in case you want to take a look - https://github.com/open-telemetry/semantic-conventions/pull/513/files

| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.model` | string | The name of the LLM a request is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model used. If the LLM is a fine-tuned custom model, the value SHOULD have a more specific name than the base model that's been fine-tuned. | `gpt-4` | Required |
| `llm.prompt` | string | The full prompt string sent to an LLM in a request. If the LLM accepts a more complex input like a JSON object made up of several pieces (such as OpenAI's different message types), this field is that entire JSON object encoded as a string. | `\n\nHuman:You are an AI assistant that tells jokes. Can you tell me a joke about OpenTelemetry?\n\nAssistant:` | Required |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the verbosity and that it contain sensitive and private data, this attribute should be opt-in

<!-- semconv ai(tag=llm-response) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.completion` | string | The full response string from an LLM. If the LLM responds with a more complex output like a JSON object made up of several pieces (such as OpenAI's message choices), this field is the content of the response. If the LLM produces multiple responses, then this field is left blank, and each response is instead captured in an attribute determined by the specific LLM technology semantic convention for responses.| `Why did the developer stop using OpenTelemetry? Because they couldn't trace their steps!` | Required |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the same reasons as propmt, this should be opt-in (and probably an event/log)

| `llm.openai.choices.<index>.finish_reason` | string | The reason the OpenAI model stopped generating tokens for this chunk. | `stop` | Recommended |
| `llm.openai.id` | string | The unique identifier for the chat completion. | `chatcmpl-123` | Recommended |
| `llm.openai.created` | int | The UNIX timestamp (in seconds) if when the completion was created. | `1677652288` | Recommended |
| `llm.openai.model` | string | The name of the model used for the completion. | `gpt-3.5-turbo` | Recommended |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be covered with llm.model and not necessarry?

| `llm.openai.choices.<index>.delta.function_call.arguments` | string | If exists, the arguments to call a function call with for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `{"type": "object",` | Required |
| `llm.openai.choices.<index>.finish_reason` | string | The reason the OpenAI model stopped generating tokens for this chunk. | `stop` | Recommended |
| `llm.openai.id` | string | The unique identifier for the chat completion. | `chatcmpl-123` | Recommended |
| `llm.openai.created` | int | The UNIX timestamp (in seconds) if when the completion was created. | `1677652288` | Recommended |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be a good timestamp for the log/event

<!-- semconv llm.openai(tag=llm-response-tech-specific-chunk) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `llm.openai.choices.<index>.delta.role` | string | The assigned role for a given OpenAI response, denoted by `<index>`. The value for `<index>` starts with 0, where 0 is the first message. | `system` | Required |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of representing the whole response as one span, would it perhaps be better to represent each completion as an individual span and avoid having indexed attributes?

Copy link

linux-foundation-easycla bot commented Nov 15, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@cartermp
Copy link
Contributor Author

@drewby I pulled in yours and @lmolkova's work. Looks like you need to sign the CLA though!

@drewby
Copy link
Member

drewby commented Nov 15, 2023

@drewby I pulled in yours and @lmolkova's work. Looks like you need to sign the CLA though!

Signed.

@sudivate
Copy link

I would also like to request a review from @mikeldking from Arize . Arize team started with the Open-Inference Spec initiative which includes Semantic Conventions for Traces

@mikeldking
Copy link

I would also like to request a review from @mikeldking from Arize . Arize team started with the Open-Inference Spec initiative which includes Semantic Conventions for Traces

Thanks for the nomination @sudivate - this initiative is very much something we've been looking for and have a fair amount of learnings from our implementation of the OpenInference semantic conventions. Will follow along and try to give informed feedback as I see it. Exciting progress!

@morningspace
Copy link

Hi @drewby, also others, I saw you mentioned adding metrics to this PR, but it's specifically to OpenAI, while generally, I thought the conventions for metrics, just as tracing does in this PR, needs to be categorized into common stuff, plus vendor specific stuff. Will this be updated later?

Besides that, I originally thought this PR is mainly for tracing, but now that I saw the metrics for OpenAI is also added, will this PR also cover metrics?

@cartermp
Copy link
Contributor Author

cartermp commented Dec 9, 2023

Unfortunately (as evidenced by my activity here), I don't really have the time/space to make reasonable progress on this PR anymore. @drewby @lmolkova please feel free to take over or start anew. I'm more than happy to offer a drive-by review.

@nirga
Copy link
Contributor

nirga commented Dec 9, 2023

@cartermp would love to take this over

@cartermp
Copy link
Contributor Author

cartermp commented Dec 9, 2023

@nirga go for it! I don't have any staged changes, so feel free to carry on from here. My main TODO was to redefine the request/response as logs.

@drewby
Copy link
Member

drewby commented Dec 11, 2023

@cartermp would love to take this over

@nirga, would a call make sense to sync up on scope for this? We may also want to have more discussion in a Slack thread in the SIG channel for semantic conventions.

I'm normally in Japan time, but will be in the US for two weeks starting 12/14 and will have more time through the end of the year.

@drewby
Copy link
Member

drewby commented Dec 11, 2023

Hi @drewby, also others, I saw you mentioned adding metrics to this PR, but it's specifically to OpenAI, while generally, I thought the conventions for metrics, just as tracing does in this PR, needs to be categorized into common stuff, plus vendor specific stuff. Will this be updated later?

Besides that, I originally thought this PR is mainly for tracing, but now that I saw the metrics for OpenAI is also added, will this PR also cover metrics?

We could focus a PR on tracing first, but metrics would also be useful to have some common data model / semantic conventions.

@nirga
Copy link
Contributor

nirga commented Dec 11, 2023

@drewby I’ll ping you on slack

<!-- endsemconv -->


### Metric: `llm.openai.chat_completions.duration`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see chat_completions has the duration attribute at https://platform.openai.com/docs/api-reference/chat/object, am I missing anything?

@nirga nirga mentioned this pull request Jan 12, 2024
3 tasks
@nirga
Copy link
Contributor

nirga commented Jan 12, 2024

@drewby @lmolkova @gyliu513 @mikeldking and others I might have missed -
I'm continuing @cartermp's great work in #639. Let's get this merged :)

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 28, 2024
Copy link

github-actions bot commented Feb 4, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 4, 2024
@arminru
Copy link
Member

arminru commented Feb 5, 2024

^ continued in #639

@nirga nirga mentioned this pull request Mar 19, 2024
3 tasks
@nirga
Copy link
Contributor

nirga commented Apr 19, 2024

FYI, a first version of this is now merged with #825

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Introduce semantic conventions for modern AI (LLMs, vector databases, etc.)