Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket #606

Merged
18 commits merged into GoogleCloudPlatform:main on Apr 26, 2024

Conversation

@gongmax (Collaborator) commented Apr 23, 2024

Mistral is now a gated model, which breaks our RAG QSS (quick start solution). As a short-term mitigation, we now fetch the model's cached weights from a GCS bucket.

Tested by:

kubectl port-forward deployment/mistral-7b-instruct -n rag 9090:8080
Forwarding from 127.0.0.1:9090 -> 8080
Forwarding from [::1]:9090 -> 8080
Handling connection for 9090
...

and

$ export USER_PROMPT="How to deploy a container on K8s?"
$ curl 127.0.0.1:9090/generate -X POST -H 'Content-Type: application/json' --data-binary @- <<EOF
{
    "inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n$USER_PROMPT[/INST]",
    "parameters": {"max_new_tokens": 400}
}
EOF

{"generated_text":" To deploy a container on Kubernetes (K8s), you can follow these general steps:\n\n1. Create a Kubernetes cluster: You can use a cloud provider like Google Cloud Platform, Amazon Web Services, or Microsoft Azure to create a Kubernetes cluster. Alternatively, you can set up a cluster on-premises using tools like Minikube or Kubeadm.\n2. Create a Docker image: You will need to create a Docker image of your application. This image should contain all the necessary dependencies and configuration files for your application to run.\n3. Create a Kubernetes deployment: A deployment is a way to manage a group of replicas of your application. You can create a deployment using the kubectl command-line tool or a YAML file.\n4. Create a Kubernetes service: A service is a way to expose your application to the outside world. You can create a service using the kubectl command-line tool or a YAML file.\n5. Test your deployment: Once your deployment and service are created, you can test your application by accessing the service from outside the cluster.\n\nIt's important to note that these are general steps and the specifics may vary depending on your application and environment. Additionally, it's recommended to have a good understanding of Kubernetes concepts and best practices before deploying an application."}

@andrewsykim (Collaborator)

/gcbrun

@gongmax (Collaborator Author) commented Apr 24, 2024

`test rag` failed in CI.

@gongmax (Collaborator Author) commented Apr 24, 2024

From the pod log of deployment mistral-7b-instruct, I can see

2024-04-24T00:58:40.862898Z INFO download: text_generation_launcher: Successfully downloaded weights.

which indicates the pod got past the weight-download step.

@gongmax (Collaborator Author) commented Apr 24, 2024

Looks like the test failed before the shard was ready. I saw a bunch of `Waiting for shard to be ready...` messages before the cluster was cleaned up.

@gongmax (Collaborator Author) commented Apr 24, 2024

The deployment mistral-7b-instruct was created around 17:58 and the test call was made around 18:19. Is that too short a window for the shard to become ready? For context, in my local test it took around 1h for the shard to be ready after the pod was created. Is this a performance degradation introduced by the cached weights? What was the typical time for the shard to become ready previously?

@gongmax (Collaborator Author) commented Apr 24, 2024

Can we wait longer before making the test prompt call to confirm?

@andrewsykim (Collaborator) commented Apr 24, 2024

> For context, in my local test it took around 1h for the shard to be ready after the pod was created. Is this a performance degradation introduced by the cached weights?

This seems really long. I don't think this was the previous start-up time.

The cloudbuild step has a `kubectl wait` command that waits for the Pod to become ready: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/cloudbuild.yaml#L230. If the pod is considered "ready" even though the model isn't serving yet, we need to add a readiness probe for the pod.

Feel free to directly change the cloudbuild.yaml in this PR to change the behavior of the test.
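
For reference, a probe along these lines would make that `kubectl wait` meaningful. This is only a sketch against the Terraform-defined deployment; the /health path, port 8080, and the threshold values are assumptions about the TGI container, not values taken from this PR:

# Hypothetical readiness probe for the TGI serving container, so that
# `kubectl wait --for=condition=Ready` only passes once the model can serve.
readiness_probe {
  http_get {
    path = "/health"  # assumed TGI health route
    port = 8080       # assumed serving port
  }
  initial_delay_seconds = 60
  period_seconds        = 30
  failure_threshold     = 120  # tolerate up to ~1h of shard warm-up
}

With a probe like this, the cloudbuild wait blocks until the model is actually serving rather than until the container merely starts.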

@gongmax (Collaborator Author) commented Apr 24, 2024

Increased the timeout. @andrewsykim please kick off the CI once you get a chance. Thanks!

@andrewsykim (Collaborator)

@gongmax you should be able to run it yourself by commenting `/gcbrun`.

@andrewsykim (Collaborator)

/gcbrun

@andrewsykim (Collaborator)

@gongmax from the build logs it looks like the mistral pod is "ready":

pod/mistral-7b-instruct-6c54ddc768-mrql6 condition met

Can you try reproducing the whole RAG quick start solution to see why it's still returning 500? It's possible the pod is "ready" but not actually ready to serve because the mistral pod has no readiness probe

@gongmax (Collaborator Author) commented Apr 24, 2024

> Can you try reproducing the whole RAG quick start solution to see why it's still returning 500? It's possible the pod is "ready" but not actually ready to serve because the mistral pod has no readiness probe

I followed the "Installation" and "Launch the frontend chat interface" parts of the README and tested via the frontend; all the requests returned 200.

@andrewsykim (Collaborator)

> I followed the "Installation" and "Launch the frontend chat interface" parts of the README and tested via the frontend; all the requests returned 200.

Did you send a prompt to the frontend? That's the part of the test that is failing

@andrewsykim (Collaborator)

How long did you wait for the frontend to be up and running before you queried it? It seems likely we're trying to query the model before it's ready to serve because there's no readiness probe. But this is kind of weird because I would expect the cached weights to load faster than what we had before.

@gongmax (Collaborator Author) commented Apr 24, 2024

> Did you send a prompt to the frontend? That's the part of the test that is failing

Yes:

[screenshot: frontend chat response returning HTTP 200]

It's 200, though it has some warnings.

@gongmax (Collaborator Author) commented Apr 24, 2024

> How long did you wait for the frontend to be up and running before you queried it? It seems likely we're trying to query the model before it's ready to serve because there's no readiness probe. But this is kind of weird because I would expect the cached weights to load faster than what we had before.

Yes, that's what I mentioned before. From my local log, I can see `Shard ready in 3966.331834209s`, so I had to wait around 1h before I could query it.

@andrewsykim (Collaborator)

There's an error at the bottom of that response: "missing 2 required positional arguments: params and orig". Is that related?

I believe @blackzlq made a change recently to show the prompt response and the error code for easier debuggability.

@andrewsykim (Collaborator)

> Yes, that's what I mentioned before. From my local log, I can see `Shard ready in 3966.331834209s`, so I had to wait around 1h before I could query it.

This is a really long startup time; is it expected? Either way, if we expect a longer start-up, we should add a readiness probe so that the `kubectl wait` command waits until the model is ready to serve. Can we add a readiness probe in this PR?

@andrewsykim (Collaborator)

> There's an error at the bottom of that response: "missing 2 required positional arguments: params and orig". Is that related?

Ignore this; the error is probably because you didn't run the notebook to generate vector embeddings locally.

@gongmax (Collaborator Author) commented Apr 25, 2024

/gcbrun

7 similar comments

Resolved review threads:
infrastructure/platform.tfvars (outdated)
tutorials-and-examples/hf-tgi/main.tf
tutorials-and-examples/hf-tgi/main.tf
@gongmax (Collaborator Author) commented Apr 26, 2024

/gcbrun

1 similar comment

@roberthbailey merged commit 57415f4 into GoogleCloudPlatform:main on Apr 26, 2024. 4 checks passed.
@@ -72,6 +79,15 @@ resource "kubernetes_deployment" "inference_deployment" {
      }

      spec {
        init_container {
          name  = "download-model"
          image = "google/cloud-sdk:473.0.0-alpine"
Collaborator:
@gongmax if possible it would be good to use an image hosted on GCR or AR.

@gongmax (Collaborator Author):

Will investigate and address in a follow-up PR.
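
For anyone following along, here is a minimal sketch of what such an init container can look like in the Terraform deployment. The bucket name, volume name, and mount path are illustrative placeholders, not the PR's actual values:

init_container {
  name = "download-model"
  # Per the review comment above, a GCR-hosted equivalent such as
  # gcr.io/google.com/cloudsdktool/cloud-sdk could replace this Docker Hub image.
  image = "google/cloud-sdk:473.0.0-alpine"

  # Copy the cached weights from a (hypothetical) GCS bucket into a volume
  # shared with the serving container, so TGI skips the gated HF download.
  command = ["sh", "-c", "gsutil -m cp -r gs://example-model-cache/Mistral-7B-Instruct-v0.1 /data"]

  volume_mount {
    name       = "model-storage"
    mount_path = "/data"
  }
}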

gongmax added a commit that referenced this pull request on Apr 26, 2024:

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).
gongmax added a commit that referenced this pull request on Apr 26, 2024 (#621):

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).
alpha-amundson pushed a commit that referenced this pull request on May 3, 2024:

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).