
x-pack/metricbeat/module/openai: Add new module #41516

Merged · 49 commits · merged into elastic:main on Dec 13, 2024
Conversation

@shmsr (Member) commented Nov 4, 2024

Proposed commit message

Implement a new module for OpenAI usage collection. The module queries https://api.openai.com/v1/usage by default (the URL is configurable, e.g. for proxies) and collects the limited set of usage metrics emitted by this undocumented endpoint.

Example of how the usage endpoint emits metrics:

Given timestamps t0, t1, t2, ... tn in ascending order:

  • At t0 (first collection):
   usage_metrics_1: *
  • At t1 (after new API usage):
   usage_metrics_1: *
   usage_metrics_2: *
  • At t2 (continuous collection):
   usage_metrics_1: *
   usage_metrics_2: *
   usage_metrics_3: *

and so on.

Example response:

{
  "object": "list",
  "data": [
    {
      "organization_id": "org-xxx",
      "organization_name": "Personal",
      "aggregation_timestamp": 1725389580,
      "n_requests": 1,
      "operation": "completion",
      "snapshot_id": "gpt-4o-mini-2024-07-18",
      "n_context_tokens_total": 62,
      "n_generated_tokens_total": 21,
      "email": null,
      "api_key_id": null,
      "api_key_name": null,
      "api_key_redacted": null,
      "api_key_type": null,
      "project_id": null,
      "project_name": null,
      "request_type": ""
    },
    {
      "organization_id": "org-xxx",
      "organization_name": "Personal",
      "aggregation_timestamp": 1725389640,
      "n_requests": 1,
      "operation": "completion",
      "snapshot_id": "gpt-4o-mini-2024-07-18",
      "n_context_tokens_total": 97,
      "n_generated_tokens_total": 17,
      "email": null,
      "api_key_id": null,
      "api_key_name": null,
      "api_key_redacted": null,
      "api_key_type": null,
      "project_id": null,
      "project_name": null,
      "request_type": ""
    }
  ],
  "tpm_data": [
    {
      "organization_id": "org-xxx",
      "organization_name": "Personal",
      "day_timestamp": 1725321600,
      "snapshot_id": "gpt-4o-mini-2024-07-18",
      "operation": "completion",
      "p90_context_tpm": 97,
      "p90_generated_tpm": 21,
      "p90_provisioned_context_tpm": 0,
      "p90_provisioned_generated_tpm": 0,
      "max_context_tpm": 97,
      "max_generated_tpm": 21,
      "max_provisioned_context_tpm": 0,
      "max_provisioned_generated_tpm": 0
    }
  ],
  "ft_data": [],
  "dalle_api_data": [],
  "whisper_api_data": [],
  "tts_api_data": [],
  "assistant_code_interpreter_data": [],
  "retrieval_storage_data": []
}

Usage appears at the endpoint shortly after the API is used. So if the module collects in real time, multiple times a day, it collects duplicates, which is bad for both storage and analytics of the usage data.

It's better to collect from time.Now() (in UTC) minus 24h so that we get the full usage for the past day (in UTC) and avoid duplication. That's why I have introduced a realtime config option and set it to false: collection is delayed by 24h, so we effectively get daily data. realtime: true works like any other normal collection, where metrics are fetched at set intervals. Our recommendation is to keep realtime: false.

As this is a Metricbeat module, there is no existing package that supports storing a cursor. So, to avoid pulling already-pulled data, the last collected timestamp is stored per API key. The storage logic is documented in code comments. We use new custom code to persist this state so that collection can resume from the next available date.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Check the state store
  • Validate with usage dashboard of OpenAI

How to test this PR locally

  • Run Metricbeat with your OpenAI API key and verify that usage metrics are collected.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 4, 2024
@mergify mergify bot assigned shmsr Nov 4, 2024
mergify bot (Contributor) commented Nov 4, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the minor version digit

mergify bot (Contributor) commented Nov 4, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 4, 2024
@shmsr shmsr added Module: openai Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team labels Nov 4, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 4, 2024
@shmsr shmsr requested a review from a team November 5, 2024 07:51
@shmsr (Member, Author) commented Nov 5, 2024

I'm getting hit by this error: #41174 (comment), hence the CI is failing. Everything else is okay.

@shmsr (Member, Author) commented Nov 6, 2024

To continue my testing and avoid the Limit of total fields [10000] has been exceeded error until it is fixed for 8.15.x and older, I am using setup.template.fields, where I point to a new fields file that contains only the ECS fields and the openai fields from fields.yml and nothing else.

See this: https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-template.html

This has unblocked me for now, but we definitely need a proper fix for this.
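For reference, a sketch of that workaround in metricbeat.yml (the file name is illustrative; `setup.template.fields` and `setup.template.overwrite` are the documented settings):

```yaml
# Point the index template at a trimmed fields file containing only the
# ECS fields and the openai fields, to stay under the total-fields limit.
setup.template.fields: "${path.config}/fields-openai-only.yml"
setup.template.overwrite: true
```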

@shmsr shmsr marked this pull request as ready for review November 12, 2024 07:28
@shmsr shmsr requested a review from a team as a code owner November 12, 2024 07:28
@shmsr shmsr requested review from AndersonQ and belimawr November 12, 2024 07:28
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 12, 2024
@elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@shmsr (Member, Author) commented Nov 12, 2024

I've explained the collection mechanism in the PR description itself; the rest should be self-explanatory from the code. Please let me know if anything needs further clarification.

Resolved review threads on:
  • x-pack/metricbeat/module/openai/usage/client.go
  • x-pack/metricbeat/module/openai/usage/config.go
  • x-pack/metricbeat/module/openai/usage/usage.go
}
],
"ft_data": [],
"dalle_api_data": [],
A Member reviewer commented:

It'd be good to have data for each data set

@shmsr (Member, Author) replied:

Yeah, I tried generating ft (fine-tuning) data, but it doesn't seem to work. Since OpenAI leaves this API undocumented, I couldn't find a single source with sample data, and I'm not even sure those fields are ever populated in this endpoint's response. I'll add data for dalle_api_data.

# - "k2: v2"
## Rate Limiting Configuration
# rate_limit:
# limit: 60 # requests per second
A Contributor reviewer commented:

Is this to be changed to 12 as well? Why have we changed the limit from 60 to 12? I thought 60 was the agreed-upon limit.

@shmsr (Member, Author) replied Dec 10, 2024:

I was re-testing everything from scratch today, thoroughly, and noticed requests firing at a slower rate than expected. My understanding of limit and burst was confused, and I had put incorrect values there, which I have now corrected.

This part of the doc needs to be regenerated with make update; I will run that. The rest of the doc files are already updated.

The rate limiter works as follows:

  • limit: 12 means one request every 12 seconds (60 seconds / 5 requests = 12 seconds per request)
  • burst: 1 means only 1 request can be made in a burst

This ensures we never exceed 5 requests per minute.

So nothing changed. It's just that it wasn't configured properly by default. Rate limit is still 5 req/ min as per OpenAI.

@shmsr (Member, Author) commented Dec 10, 2024

I hope I've addressed all major review comments. Now I'll begin thorough testing to check that:

  • Data matches OpenAI's dashboard
  • Cursor management works correctly (across Metricbeat restarts)
  • Visualizations similar to OpenAI's own dashboard can be created
  • Any other bugs or enhancements I might need to incorporate

Thanks to all the reviewers!

@shmsr (Member, Author) commented Dec 10, 2024

So far, everything looks good in testing. I ran it for a few hours today and collected all of my (limited) OpenAI API usage over a 4-month period. The data matched, and I also found a case where OpenAI's own usage dashboard doesn't show some data even though it is present in the JSON returned by the usage API. Our dashboard shows it correctly, which is a good thing.

Here's a basic sample dashboard which has panels similar to that of OpenAI's usage dashboard.

[screenshot: sample Kibana dashboard with OpenAI usage panels]

@shmsr (Member, Author) commented Dec 10, 2024

I think we are ready to merge now unless there are more comments.

@shmsr (Member, Author) commented Dec 11, 2024

@ishleenk17 / @devamanv Let me know if you have any comments. Also @muthu-mps, do you have any comments w.r.t. Azure OpenAI vs. this module?

shmsr and others added 2 commits December 12, 2024 12:23
@bmorelli25 (Member) commented:

run docs-build

@ishleenk17 (Contributor) left a review:

Changes look good. Once CI passes, we are GTG!

@shmsr shmsr requested a review from a team as a code owner December 12, 2024 08:56
@shmsr
Copy link
Member Author

shmsr commented Dec 12, 2024

Updated the CODEOWNERS too.

cc: @lalit-satapathy Can you please approve as well?

@lalit-satapathy (Contributor) left a review:

LGTM on the CODEOWNERS changes.

@shmsr shmsr merged commit 93b018a into elastic:main Dec 13, 2024
32 checks passed
mergify bot pushed a commit that referenced this pull request Dec 13, 2024