
server: metrics endpoint #5850

Closed
phymbert opened this issue Mar 3, 2024 · 2 comments · Fixed by #5937
Assignees: phymbert
Labels: bug (Something isn't working), server/webui

Comments

phymbert (Collaborator) commented Mar 3, 2024

Issues

  • The server /metrics endpoint shares the same task type as /health: TASK_TYPE_METRICS. This means the metrics are reset on both calls.
  • The Process-Start-Time-Unix HTTP response header is not set.
  • The metrics llamacpp:prompt_tokens_seconds and llamacpp:predicted_tokens_seconds are per slot, while the server actually processes llamacpp:prompt_tokens_seconds * n_slots.

Proposal

  • Add a data param to TASK_TYPE_METRICS so that the metrics bucket is reset only in /metrics
  • Add llamacpp:prompt_tokens_seconds_total and llamacpp:predicted_tokens_seconds_total (a minimal sketch follows this list)
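
A minimal C++ sketch of what the proposal could look like, assuming hypothetical struct and function names (this is not the actual llama.cpp server code): the bucket values feed the per-window rate gauges and are reset only by the /metrics task, while the *_total counters are monotonic and never reset.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

struct server_metrics {
    // bucket: reset on each /metrics scrape
    std::uint64_t n_prompt_tokens_processed = 0;
    std::uint64_t n_tokens_predicted        = 0;
    // totals: monotonic, aggregated across all slots
    std::uint64_t n_prompt_tokens_processed_total = 0;
    std::uint64_t n_tokens_predicted_total        = 0;

    void on_prompt_eval(std::uint64_t n_tokens) {
        n_prompt_tokens_processed       += n_tokens;
        n_prompt_tokens_processed_total += n_tokens;
    }

    void on_prediction(std::uint64_t n_tokens) {
        n_tokens_predicted       += n_tokens;
        n_tokens_predicted_total += n_tokens;
    }

    // called only by the /metrics task, never by /health
    void reset_bucket() {
        n_prompt_tokens_processed = 0;
        n_tokens_predicted        = 0;
    }
};

// Prometheus text exposition of the totals, using the metric names
// from the proposal above.
std::string render_totals(const server_metrics & m) {
    std::ostringstream out;
    out << "# TYPE llamacpp:prompt_tokens_seconds_total counter\n"
        << "llamacpp:prompt_tokens_seconds_total " << m.n_prompt_tokens_processed_total << "\n"
        << "# TYPE llamacpp:predicted_tokens_seconds_total counter\n"
        << "llamacpp:predicted_tokens_seconds_total " << m.n_tokens_predicted_total << "\n";
    return out.str();
}
```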
@phymbert phymbert added bug Something isn't working server/webui labels Mar 3, 2024
@phymbert phymbert self-assigned this Mar 3, 2024
ngxson (Collaborator) commented Mar 3, 2024

I have an idea to improve /health, but I'm not sure whether it's currently possible:

If the user only wants to call /health to check whether the server is busy doing something, then this call should not block. Why? Because if the call is blocked, then by the time it finishes the server is free again, meaning health would never report "busy".

This applies when include_slots is not set, i.e. the user just wants to see how many slots are idle / processing. I think this check can be done easily with a mutex (see the sketch below). Only when include_slots is set do we push the request to the queue.
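
A minimal sketch of that idea, using atomic counters rather than a mutex for simplicity; slot_state and handle_health are hypothetical names, not the actual server code.

```cpp
#include <atomic>
#include <string>

struct slot_state {
    std::atomic<int> n_idle{0};
    std::atomic<int> n_processing{0};
};

// Answers directly from the counters without going through the task queue,
// so the call returns immediately even while llama_decode() is running.
// Only when include_slots is requested would the handler fall back to
// posting a task to the queue for the full per-slot details.
void handle_health(const slot_state & state, int & status, std::string & body) {
    const int idle       = state.n_idle.load();
    const int processing = state.n_processing.load();
    // illustrative policy: 503 when no slot is available
    status = idle > 0 ? 200 : 503;
    body   = "{\"slots_idle\": "        + std::to_string(idle) +
             ", \"slots_processing\": " + std::to_string(processing) + "}";
}
```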

phymbert (Collaborator, Author) commented Mar 3, 2024

It blocks only during prompt/image processing (llama_decode); it works as expected while slots are waiting for the next token. So, as @ggerganov explained earlier in #5851, we are fine with it. I will just fix the metrics bucket being reset on /health instead of /metrics, add a few more *_total metrics, and fix the KV cache use ratio being cast to int instead of float in the JSON output (illustrated below).
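
An illustrative sketch of that last cast fix, assuming the nlohmann::json library; the function and field names here are hypothetical.

```cpp
#include <nlohmann/json.hpp>

nlohmann::json kv_cache_metrics(int n_used_cells, int n_total_cells) {
    // Bug: integer division truncates the use ratio to 0 or 1.
    // const int ratio = n_used_cells / n_total_cells;
    // Fix: compute the ratio in floating point so JSON serializes a float.
    const float ratio = float(n_used_cells) / float(n_total_cells);
    return nlohmann::json {
        {"kv_cache_used_cells",  n_used_cells},
        {"kv_cache_usage_ratio", ratio},
    };
}
```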

phymbert added a commit that referenced this issue Mar 8, 2024
…s_predicted_seconds_total, reset bucket only on /metrics. Fix values cast to int. Add Process-Start-Time-Unix header. (#5937)

Closes #5850