
Add token usage tracking #872

Closed
wants to merge 3 commits

Conversation

marcominerva
Contributor

Motivation and Context (Why the change? What's the scenario?)

This PR adds a new TokenUsage property to MemoryAnswer to hold information about token usage.

High level description (Approach, Design)

Token usage is calculated in the SearchClient.AskAsync method using the configured tokenizer.

- MemoryAnswer.cs: Imported TokenUsage class and added TokenUsage property.
- SearchClient.cs: Refactored GenerateAnswer method, added prompt creation, token count logic, and updated RenderFactTemplate call.
- TokenUsage.cs: Created TokenUsage class to track token counts.
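
For illustration, the new surface looks roughly like this (property names here are indicative; the PR may use different ones):

public class TokenUsage
{
    // Tokens in the rendered RAG prompt, counted with the configured tokenizer.
    public int InputTokenCount { get; set; }

    // Tokens in the generated answer, counted with the configured tokenizer.
    public int OutputTokenCount { get; set; }
}

public class MemoryAnswer
{
    // ...existing properties...

    // Populated by SearchClient.AskAsync using the configured tokenizer.
    public TokenUsage? TokenUsage { get; set; }
}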
@marcominerva marcominerva requested a review from dluc as a code owner October 30, 2024 16:22
@dluc
Collaborator

dluc commented Oct 30, 2024

I think we should use actual usage reports from the AI services when available, to avoid the risk of returning incorrect information, particularly when token count is important for billing.

I would also design the report as a list, because multiple AI calls are involved, potentially to different models and different services.

For models where the report data is unavailable, we can fall back to tokenizers, using distinct keys such as estimated_input_tokens.

Something like:

{
  "requests": [
    {
      "id": "A0E4C1D0-0D1A-4D3B-8D3D-3D0D1A0E4C1D",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "GPT-4o-mini",
      "usage": {
        "input_tokens": 123,
        "output_tokens": 456,
        "output_reasoning_tokens": 50
      }
    },
    {
      "id": "C1D0B1E4-4D3B-0D1A-8D3D-E4C1D3D0D1A0",
      "date": "2024-10-30T12:00:00Z",
      "service": "Azure OpenAI",
      "model": "text-embedding-ada-002",
      "usage": {
        "prompt_tokens": 123,
        "total_tokens": 123
      }
    },
    {
      "date": "2024-10-30T12:00:00Z",
      "service": "LlamaSharp",
      "model": "llama2",
      "usage": {
        "estimated_input_tokens": 123,
        "estimated_output_tokens": 456
      }
    }
  ]
}
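
One possible C# mapping for that payload, just to illustrate the shape (class and property names are hypothetical):

public class TokenUsageReport
{
    public List<TokenUsageRequest> Requests { get; set; } = new();
}

public class TokenUsageRequest
{
    public string? Id { get; set; }
    public DateTimeOffset Date { get; set; }
    public string Service { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "LlamaSharp"
    public string Model { get; set; } = string.Empty;     // e.g. "GPT-4o-mini", "llama2"

    // Exact counts reported by the service; null when the service doesn't return them.
    public int? InputTokens { get; set; }
    public int? OutputTokens { get; set; }
    public int? OutputReasoningTokens { get; set; }

    // Fallback counts measured with a local tokenizer (serialized with estimated_* keys).
    public int? EstimatedInputTokens { get; set; }
    public int? EstimatedOutputTokens { get; set; }
}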

@marcominerva
Contributor Author

I think we should use actual usage reports from the AI services when available

In fact, I initially started with that kind of implementation in my branch https://github.com/marcominerva/kernel-memory/tree/token_usage (see in particular https://github.com/marcominerva/kernel-memory/blob/5a3a77f62a2a22d88fe85ad5efb8426731e9d4a5/extensions/AzureOpenAI/AzureOpenAITextGenerator.cs#L148-L150), but then I found some blocking issues like microsoft/semantic-kernel#9420.

So, I thought we could start with the "manual" approach and then progressively replace it with the actual usage reports from the different services.

Token usage is one of the features my customers request most, and having a value such as the one proposed in this PR, even if not 100% accurate, is much better than not having a value at all 😄. Moreover, at the moment they use this exact approach when they need to get an idea of token usage, with the problem that the actual prompt is "hidden", and the only place its size is shown is in the logger:

if (this._log.IsEnabled(LogLevel.Debug))
{
    this._log.LogDebug("Running RAG prompt, size: {0} tokens, requesting max {1} tokens",
        this._textGenerator.CountTokens(prompt),
        this._config.AnswerTokens);
    this._log.LogSensitive("Prompt: {0}", prompt);
}
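
Roughly, the idea in this PR is to surface that same tokenizer count on the answer instead of only logging it; something along these lines (illustrative, reusing the property names sketched in the description above):

int promptTokenCount = this._textGenerator.CountTokens(prompt);

// ...generate the answer text as usual...

answer.TokenUsage = new TokenUsage
{
    InputTokenCount = promptTokenCount,
    // answerText: the generated text once streaming has completed (hypothetical local variable)
    OutputTokenCount = this._textGenerator.CountTokens(answerText)
};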

@dluc
Collaborator

dluc commented Oct 31, 2024


I would use the service data when available and include the tokenizer optionally. Using the tokenizer is a performance concern, so it would be nice if we could turn it off.

Something like this:

PR 1

  • applies only to OpenAI generators, text and embeddings
  • define a common class for metrics, avoiding dictionaries (a rough sketch follows this list)
  • add token metrics to Search and Ask responses, organized as a list of calls
  • each call includes
    • timestamp
    • name of the service e.g. "Azure OpenAI", "Ollama", etc.
    • model type e.g. "TextEmbedding", "Text", etc.
    • name of the model/deployment used
    • "tokens in metric" reported by the service - NULL if not available (key: "service_tokens_in")
    • "tokens out metric" reported by the service - NULL if not available (key: "service_tokens_out")
    • other token metrics reported by the service if available, e.g. reasoning tokens (key: "service_reasoning_tokens")
    • "token in" measured by the tokenizer (key: "tokenizer_tokens_in")
    • "token out" measured by the tokenizer (key: "tokenizer_tokens_out")

PR 2

  • apply to all AI generators: Azure OpenAI, Ollama, LLamaSharp, Anthropic
  • add global option "enable token metrics"
  • add global option "include token metrics measured by tokenizer" - enabled by default

PR 3

  • add the report also to the pipeline object, to track ingestion cost. Support updates; don't lose previous metrics.
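
A rough sketch of what the PR 1 metrics class and the PR 2 options could look like (names are indicative only, not a final API):

public class TokenUsageCall
{
    public DateTimeOffset Timestamp { get; set; }

    public string ServiceName { get; set; } = string.Empty;   // e.g. "Azure OpenAI", "Ollama"
    public string ModelType { get; set; } = string.Empty;     // e.g. "TextEmbedding", "Text"
    public string ModelName { get; set; } = string.Empty;     // model or deployment name

    // Metrics reported by the service; null when not available.
    public int? ServiceTokensIn { get; set; }
    public int? ServiceTokensOut { get; set; }
    public int? ServiceReasoningTokens { get; set; }

    // Metrics measured locally by the tokenizer (optional, see PR 2).
    public int? TokenizerTokensIn { get; set; }
    public int? TokenizerTokensOut { get; set; }
}

// PR 2: global switches.
public class TokenMetricsOptions
{
    public bool EnableTokenMetrics { get; set; }

    // Tokenizer-based measurement has a performance cost; enabled by default per the plan above.
    public bool IncludeTokenizerMetrics { get; set; } = true;
}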


@nurkmez2

Any updates? 👍

@marcominerva
Contributor Author

I plan to continue working on this feature next week.

@nurkmez2

Hi @marcominerva
Any updates?
Thanks

@marcominerva
Contributor Author

Sorry, unfortunately over the past few weeks I have had a lot of unexpected work, so I had to delay this activity. I hope to resume it soon.

@marcominerva
Contributor Author

@nurkmez2 I have opened a new PR with the approach suggested by @dluc: #947. I think this one can be safely closed.

@marcominerva marcominerva deleted the token_usage2 branch December 19, 2024 08:22