Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket #606

Merged
18 commits merged into GoogleCloudPlatform:main on Apr 26, 2024

Conversation

@gongmax (Collaborator) commented Apr 23, 2024

Mistral is now a gated model, which breaks our RAG QSS (quick start solution). As a short-term mitigation, we now fetch the model's cached weights from a GCS bucket.

Tested by:

kubectl port-forward deployment/mistral-7b-instruct -n rag 9090:8080
Forwarding from 127.0.0.1:9090 -> 8080
Forwarding from [::1]:9090 -> 8080
Handling connection for 9090
...

and

$ export USER_PROMPT="How to deploy a container on K8s?"
$ curl 127.0.0.1:9090/generate -X POST -H 'Content-Type: application/json' --data-binary @- <<EOF
{
    "inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n$USER_PROMPT[/INST]",
    "parameters": {"max_new_tokens": 400}
}
EOF

{"generated_text":" To deploy a container on Kubernetes (K8s), you can follow these general steps:\n\n1. Create a Kubernetes cluster: You can use a cloud provider like Google Cloud Platform, Amazon Web Services, or Microsoft Azure to create a Kubernetes cluster. Alternatively, you can set up a cluster on-premises using tools like Minikube or Kubeadm.\n2. Create a Docker image: You will need to create a Docker image of your application. This image should contain all the necessary dependencies and configuration files for your application to run.\n3. Create a Kubernetes deployment: A deployment is a way to manage a group of replicas of your application. You can create a deployment using the kubectl command-line tool or a YAML file.\n4. Create a Kubernetes service: A service is a way to expose your application to the outside world. You can create a service using the kubectl command-line tool or a YAML file.\n5. Test your deployment: Once your deployment and service are created, you can test your application by accessing the service from outside the cluster.\n\nIt's important to note that these are general steps and the specifics may vary depending on your application and environment. Additionally, it's recommended to have a good understanding of Kubernetes concepts and best practices before deploying an application."}

@andrewsykim (Collaborator)

/gcbrun

@gongmax (Collaborator Author) commented Apr 24, 2024

`test rag` failed in CI.

@gongmax (Collaborator Author) commented Apr 24, 2024

From the pod log of deployment mistral-7b-instruct, I can see

2024-04-24T00:58:40.862898Z INFO download: text_generation_launcher: Successfully downloaded weights.

which indicates the pod got past the weight-download step.

@gongmax (Collaborator Author) commented Apr 24, 2024

Looks like the test failed before the shard was ready. I saw a bunch of `Waiting for shard to be ready...` messages before the cluster was cleaned up.

@gongmax (Collaborator Author) commented Apr 24, 2024

The deployment mistral-7b-instruct was created around 17:58 and the test call was made around 18:19. Is that too short a window for the shard to become ready? For context, in my local test it took around 1h for the shard to be ready after the pod was created. Is this a performance degradation introduced by the cached weights? What was the typical time for the shard to become ready previously?

@gongmax (Collaborator Author) commented Apr 24, 2024

Can we wait longer before making the test prompt call to confirm?

@andrewsykim (Collaborator) commented Apr 24, 2024

> For context, in my local test it took around 1h for the shard to be ready after the pod was created. Is this a performance degradation introduced by the cached weights?

This seems really long. I don't think this was the previous start-up time.

The cloudbuild step has a `kubectl wait` command that waits for the Pod to become ready: https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/cloudbuild.yaml#L230. If the pod is considered "ready" even though the model isn't serving yet, we need to add a readiness probe for the pod.

Feel free to directly change the cloudbuild.yaml in this PR to change the behavior of the test.
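
For reference, a probe along these lines would make that `kubectl wait` meaningful. This is only a sketch against the Terraform-defined deployment; the /health path, port 8080, and the threshold values are assumptions about the TGI container, not values taken from this PR:

# Hypothetical readiness probe for the TGI serving container, so that
# `kubectl wait --for=condition=Ready` only passes once the model can serve.
readiness_probe {
  http_get {
    path = "/health"  # assumed TGI health route
    port = 8080       # assumed serving port
  }
  initial_delay_seconds = 60
  period_seconds        = 30
  failure_threshold     = 120  # tolerate up to ~1h of shard warm-up
}

With a probe like this, the cloudbuild wait blocks until the model is actually serving rather than until the container merely starts.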

@gongmax (Collaborator Author) commented Apr 24, 2024

Increased the timeout. @andrewsykim please kick off the CI once you get a chance. Thanks!

@andrewsykim (Collaborator)

@gongmax you should be able to run it yourself by commenting `/gcbrun`.

@andrewsykim (Collaborator)

/gcbrun

@andrewsykim (Collaborator)

@gongmax from the build logs it looks like the mistral pod is "ready":

pod/mistral-7b-instruct-6c54ddc768-mrql6 condition met

Can you try reproducing the whole RAG quick start solution to see why it's still returning 500? It's possible the pod is "ready" but not actually ready to serve because the mistral pod has no readiness probe

@gongmax (Collaborator Author) commented Apr 24, 2024

> Can you try reproducing the whole RAG quick start solution to see why it's still returning 500? It's possible the pod is "ready" but not actually ready to serve because the mistral pod has no readiness probe

I followed the "Installation" and "Launch the frontend chat interface" parts of the README and tested via the frontend; all the requests returned 200.

@andrewsykim (Collaborator)

> I followed the "Installation" and "Launch the frontend chat interface" parts of the README and tested via the frontend; all the requests returned 200.

Did you send a prompt to the frontend? That's the part of the test that is failing

@andrewsykim (Collaborator)

How long did you wait for the frontend to be up and running before you queried it? It seems likely we're trying to query the model before it's ready to serve because there's no readiness probe. But this is kind of weird because I would expect the cached weights to load faster than what we had before.

@gongmax (Collaborator Author) commented Apr 24, 2024

> Did you send a prompt to the frontend? That's the part of the test that is failing

Yes:

[screenshot: frontend chat response returning HTTP 200]

It's 200, though it has some warnings.

@gongmax (Collaborator Author) commented Apr 24, 2024

> How long did you wait for the frontend to be up and running before you queried it? It seems likely we're trying to query the model before it's ready to serve because there's no readiness probe. But this is kind of weird because I would expect the cached weights to load faster than what we had before.

Yes, that's what I mentioned before. From my local log, I can see `Shard ready in 3966.331834209s`, so I had to wait around 1h before I could query it.

@andrewsykim (Collaborator)

There's an error at the bottom of that response: "missing 2 required positional arguments: params and orig". Is that related?

I believe @blackzlq made a change recently to show the prompt response and the error code for easier debuggability.

@andrewsykim (Collaborator)

> Yes, that's what I mentioned before. From my local log, I can see `Shard ready in 3966.331834209s`, so I had to wait around 1h before I could query it.

This is a really long startup time; is it expected? Either way, if we expect a longer start-up, we should add a readiness probe so that the `kubectl wait` command waits until the model is ready to serve. Can we add a readiness probe in this PR?

@andrewsykim (Collaborator)

> There's an error at the bottom of that response: "missing 2 required positional arguments: params and orig". Is that related?

Ignore this; the error is probably because you didn't run the notebook to generate vector embeddings locally.

@gongmax (Collaborator Author) commented Apr 25, 2024

/gcbrun

7 similar comments

Resolved review threads:
infrastructure/platform.tfvars (outdated)
tutorials-and-examples/hf-tgi/main.tf
tutorials-and-examples/hf-tgi/main.tf
@gongmax (Collaborator Author) commented Apr 26, 2024

/gcbrun

1 similar comment

@roberthbailey merged commit 57415f4 into GoogleCloudPlatform:main on Apr 26, 2024. 4 checks passed.
@@ -72,6 +79,15 @@ resource "kubernetes_deployment" "inference_deployment" {
      }

      spec {
        init_container {
          name  = "download-model"
          image = "google/cloud-sdk:473.0.0-alpine"
Collaborator:
@gongmax if possible it would be good to use an image hosted on GCR or AR.

@gongmax (Collaborator Author):

Will investigate and address in a follow-up PR.
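
For anyone following along, here is a minimal sketch of what such an init container can look like in the Terraform deployment. The bucket name, volume name, and mount path are illustrative placeholders, not the PR's actual values:

init_container {
  name = "download-model"
  # Per the review comment above, a GCR-hosted equivalent such as
  # gcr.io/google.com/cloudsdktool/cloud-sdk could replace this Docker Hub image.
  image = "google/cloud-sdk:473.0.0-alpine"

  # Copy the cached weights from a (hypothetical) GCS bucket into a volume
  # shared with the serving container, so TGI skips the gated HF download.
  command = ["sh", "-c", "gsutil -m cp -r gs://example-model-cache/Mistral-7B-Instruct-v0.1 /data"]

  volume_mount {
    name       = "model-storage"
    mount_path = "/data"
  }
}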

gongmax added a commit that referenced this pull request on Apr 26, 2024:

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).
gongmax added a commit that referenced this pull request on Apr 26, 2024 (#621):

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).
alpha-amundson pushed a commit that referenced this pull request on May 3, 2024:

Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket (#606)

Fetch the cached weights for Mistral-7B-Instruct-v0.1 in an init container.
Also increase ephemeral storage (boot disk size).