
Intermittent Cloud Datastore Error: ServiceUnavailable: 503 #3128

Closed
speedplane opened this issue Mar 10, 2017 · 33 comments
Assignees
Labels
api: datastore Issues related to the Datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. status: awaiting information type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@speedplane

I am getting intermittent ServiceUnavailable: 503 error messages when doing datastore operations with the google cloud client library. The stack trace is below. I am running Python 2.7 on Ubuntu, within a Docker container on a small Kubernetes cluster.

Traceback (most recent call last):
  ...
  File "/home/speedplane/DA/djangoappengine/db_gcl/compiler.py", line 345, in get_matching_pk
    self.client.get_multi(self.included_pks, **config)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/client.py", line 307, in get_multi
    transaction_id=transaction and transaction.id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/client.py", line 124, in _extended_lookup
    transaction_id=transaction_id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 519, in lookup
    lookup_response = self._datastore_api.lookup(project, lookup_request)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 304, in lookup
    return self._stub.Lookup(request_pb)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 260, in _grpc_catch_rendezvous
    raise error_class(exc.details())
ServiceUnavailable: 503 {"created":"@1489111389.496403106","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1489111389.496341023","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,"referenced_errors":[{"created":"@1489111389.496338171","description":"Socket closed","fd":25,"file":"src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv4:173.194.192.95:443"}]}]}

For ease of reading, that last line is expanded:

ServiceUnavailable: 503 {
 "created":"@1489111389.496403106",
 "description":"Endpoint read failed",
 "file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c",
 "file_line":1851,
 "grpc_status":14,"occurred_during_write":0,
 "referenced_errors":[{"created":"@1489111389.496341023",
   "description":"Secure read failed",
   "file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,
   "referenced_errors":[{"created":"@1489111389.496338171",
     "description":"Socket closed",
     "fd":25,"file":"src/core/lib/iomgr/tcp_posix.c",
    "file_line":249,"target_address":"ipv4:173.194.192.95:443"}]
  }]
}

I am running this within a Docker container orchestrated by Kubernetes. My Dockerfile inherits from ubuntu:trusty. Below is version information:

# lsb_release
LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
# python --version
Python 2.7.6
@cobookman

Is there a chance that this is related: #2896

@cobookman

cobookman commented Mar 10, 2017

Also, the "Socket closed" error makes it seem like gRPC had its connection to Datastore terminated and needs to re-open it. In issue #2896, it's mentioned that the Python datastore client library doesn't implement retry logic (yet). Issue #2694 was created to track adding retry logic to all client libraries.
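
Until client-side retries land, a caller can wrap individual Datastore calls in a small retry loop for transient errors. This is a minimal sketch, not the library's mechanism; `retry_transient` and its backoff parameters are illustrative, and in real use `transient` would be `google.cloud.exceptions.ServiceUnavailable`:

```python
import time


def retry_transient(call, attempts=3, base_delay=0.5, transient=(Exception,)):
    """Invoke call(), retrying with exponential backoff on transient errors.

    `transient` is the tuple of exception types worth retrying.
    """
    for attempt in range(attempts):
        try:
            return call()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the last error
            time.sleep(base_delay * (2 ** attempt))


# Demo: a stand-in operation that fails twice, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("503 Endpoint read failed")
    return "ok"

result = retry_transient(flaky, attempts=5, base_delay=0.01, transient=(IOError,))
```

In the issue's context the wrapped call would be something like `lambda: client.get_multi(keys)`; non-idempotent operations (commits) need more care, as discussed later in the thread.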

@lukesneeringer
Contributor

Are you getting the error at 30 minute intervals like the author of #2896?

@lukesneeringer lukesneeringer added api: datastore Issues related to the Datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. status: acknowledged labels Mar 10, 2017
@speedplane
Author

speedplane commented Mar 10, 2017

It's happening much more regularly than once every 30 minutes. Below are logs with some timings.

[three screenshots of application logs showing repeated ServiceUnavailable: 503 errors, with timestamps]

@lukesneeringer
Contributor

Interesting that the second log set shows so much more frequent failures than the first one.

@daspecster
Contributor

daspecster commented Mar 10, 2017

@speedplane if this was asked already somewhere, sorry for the repeat question, but is this script using multiprocessing or any kind of threading?

My thought is that either it's an upstream issue (in which case I feel like more people would be chiming in) or there's something in your architecture that is unhappy.

@speedplane
Author

speedplane commented Mar 10, 2017

I saw that discussion. I don't use multiprocessing. I don't think there is threading, but just in case, I make sure all my clients are thread-local using the function below.

Perhaps I should have mentioned this earlier: this call is being made across projects. It is made from an instance within a project named ocr-api, but is trying to grab data from a project named docketupdate.

Also, I have a cloud support package, and I can try opening a case if that would help.

import threading
_datastore_client = threading.local()
def datastore_client(project=None):
    '''Return a thread-unique datastore client for the given project.'''
    # Our thread-local datastore client cache.
    global _datastore_client
    if not getattr(_datastore_client, 'clients', None):
        _datastore_client.clients = {}
    if project not in _datastore_client.clients:
        # Create a new datastore client and save it in the cache.
        _datastore_client.clients[project] = datastore.client.Client(project) \
            if project else datastore.client.Client()
    return _datastore_client.clients[project]
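
The caching pattern above can be exercised with a stub factory to confirm that the same thread and project share one client while a new thread builds its own. This is a sketch; `StubClient` stands in for `datastore.client.Client`:

```python
import threading

_local = threading.local()


class StubClient(object):
    """Stand-in for datastore.client.Client that counts constructions."""
    created = 0

    def __init__(self, project=None):
        StubClient.created += 1
        self.project = project


def get_client(project=None):
    """Return a per-thread, per-project cached StubClient."""
    if not getattr(_local, "clients", None):
        _local.clients = {}
    if project not in _local.clients:
        _local.clients[project] = StubClient(project)
    return _local.clients[project]


# Same thread + project -> cached instance; a second thread gets its own.
a = get_client("ocr-api")
b = get_client("ocr-api")

other = []
t = threading.Thread(target=lambda: other.append(get_client("ocr-api")))
t.start()
t.join()
```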

I can add my own retry logic and see if it helps, but given how often this is occurring, it would be preferable if the underlying issue could be addressed. I added a third screenshot of timestamps above.

@cobookman

cobookman commented Mar 10, 2017 via email

@daspecster
Contributor

@speedplane the project information is helpful. I believe there was an issue a long time ago about cross-project access not being supported well, but I can't seem to find it.

@dhermes @tseaver might remember?

@speedplane
Author

Done. See case ID 12283187.

@speedplane
Author

speedplane commented Mar 13, 2017

I stopped receiving these altogether on March 10 at around 3:30PM ET. I did not redeploy or change any settings in my app since then. Did something change on the backend?

@lukesneeringer
Contributor

Not sure. Perhaps?

@cobookman

@speedplane just double checking that you are still not seeing the errors.

@lukesneeringer
Contributor

Closing, but feel free to reopen if the status changes.

@speedplane
Author

Can you please reopen? This is coming back with a vengeance. Same error, but a slightly different code path; see the stack trace below:

  File "/home/speedplane/DA/djangoappengine/db_gcl/compiler.py", line 152, in fetch
    for entity in results:
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/iterator.py", line 218, in _items_iter
    for page in self._page_iter(increment=False):
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/iterator.py", line 247, in _page_iter
    page = self._next_page()
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/query.py", line 489, in _next_page
    transaction_id=transaction and transaction.id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 392, in run_query
    return self._datastore_api.run_query(project, request)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_gax.py", line 128, in run_query
    return self._stub.RunQuery(request_pb)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_gax.py", line 67, in _grpc_catch_rendezvous
    raise error_class(exc.details())
ServiceUnavailable: 503 Endpoint read failed

@daspecster
Contributor

@speedplane I don't see anything on the status dashboard about an outage.
https://status.cloud.google.com

It does seem strange that it's ServiceUnavailable. Maybe there are communication problems between Kubernetes and Datastore that somehow don't get reported? /cc @lukesneeringer

@lukesneeringer
Contributor

@speedplane Let me make sure I understand -- it was gone with the grpc 1.2 upgrade, and now it is back? But with a slightly different error? Is it still intermittent?

@dhermes
Contributor

dhermes commented Mar 30, 2017

@speedplane Seeing one UNAVAILABLE a day probably isn't a bug (other than the absence of a retry feature). Is this happening LOTS (as you indicated the issue was happening before)?

Aside: @lukesneeringer doesn't GAPIC have a retry feature?

@lukesneeringer
Contributor

It does.

@speedplane
Author

speedplane commented Apr 12, 2017

@dhermes @daspecster @lukesneeringer I am seeing this much more often than once a day, I see this roughly once a minute. I have a feeling that when I said the issue was gone, it wasn't, but only that I was not collecting the events properly. What other information would be helpful here?

Not sure if this is helpful, but I am installing the SDK files from within my Dockerfile with the following command:

RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
  echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
  curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add - && \
  apt-get update && apt-get install -y google-cloud-sdk

@vsoch

vsoch commented Aug 10, 2017

+1, I am also seeing this issue using the python datastore client. I went too far forward in my terminal to get the exact trace, but it is a 503 ServiceUnavailable thing. For me, it reliably happens when I am "starting up" the client after being inactive for a bit, and the second try always works. If it's the case that datastore is run on an apps instance, not knowing anything else I would guess it has to do with waking it up / initializing it for use.

@chemelnucfin chemelnucfin added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jan 9, 2018
@chemelnucfin
Contributor

@speedplane @vsoch Is this still a problem? I noticed there has been a change in the datastore client and that retry is in the autogen client.

@vsoch

vsoch commented Jan 30, 2018

Ah, I can't give additional feedback on this - we wound up switching to a different database.

@speedplane
Author

speedplane commented Jan 30, 2018 via email

@chemelnucfin
Contributor

@vsoch Thank you for taking the time to respond. Perhaps in the future you can give our product another try. I apologize for any inconvenience this may have caused.

@speedplane From #4766 it seems that getting a 503 GOAWAY message from the server side is a normal part of the process. I believe retries are being built into our clients, which will hopefully eliminate those errors. I will keep track of our progress on that. Thank you for your response.

@tseaver
Contributor

tseaver commented Feb 20, 2018

The google-cloud-datastore 1.5.0 release uses newly-generated GAPIC code, which includes built-in retries for idempotent methods. @speedplane, can you please try to reproduce with that version?

@theacodes
Contributor

I'm going to go ahead and close this, as the latest release includes significant retry plumbing. If it's still an issue, feel free to comment and we'll re-open.

@tgensol

tgensol commented May 3, 2018

Hi guys,

I am using google-cloud-datastore on AWS Lambda (serverless), and sometimes I get the error too:

503 Transport closed: ServiceUnavailable
Traceback (most recent call last):
  File "/var/task/handler.py", line 177, in create_user
    raise e
  File "/var/task/handler.py", line 163, in create_user
    datastore_client.put(task)
  File "/var/task/google/cloud/datastore/client.py", line 404, in put
    self.put_multi(entities=[entity])
  File "/var/task/google/cloud/datastore/client.py", line 431, in put_multi
    current.commit()
  File "/var/task/google/cloud/datastore/batch.py", line 273, in commit
    self._commit()
  File "/var/task/google/cloud/datastore/batch.py", line 249, in _commit
    self.project, mode, self._mutations, transaction=self._id)
  File "/var/task/google/cloud/datastore_v1/gapic/datastore_client.py", line 426, in commit
    request, retry=retry, timeout=timeout, metadata=metadata)
  File "/var/task/google/api_core/gapic_v1/method.py", line 139, in __call__
    return wrapped_func(*args, **kwargs)
  File "/var/task/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/var/task/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/var/task/google/api_core/timeout.py", line 206, in func_with_timeout
    return func(*args, **kwargs)
  File "/var/task/google/api_core/grpc_helpers.py", line 56, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 Transport closed

I am using google-cloud-core 0.28.1 and google-cloud-datastore 1.6.0. Thanks!

@tseaver
Contributor

tseaver commented May 3, 2018

@tgensol Hmm, your error is propagated from a Commit call, which is not idempotent. I don't think we have any expected automatic retries for that case. @jonparrott can you confirm?

@theacodes
Contributor

Yeah, it's generally not safe for us to retry a commit operation blindly. If you are certain it's safe to retry, you can do so using our retry helper:

from google.api_core import exceptions
from google.api_core import retry

retry_commit = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
    deadline=60)

...

retry_commit(datastore_client.put)(task)

@speedplane
Author

I've dealt with this same problem. Using the retry handler directly can be difficult because the calls are often buried under middleware code. Instead, I've monkey-patched the default retry handler settings:

def patch_datastore_retry():
    try:
        from google.cloud.datastore_v1.gapic import datastore_client_config
    except ImportError:
        return
    dstore_config = datastore_client_config.config
    # Sanity check that this structure is what we think it is.
    if "interfaces" not in dstore_config:
        raise Exception("No interfaces in dstore_config: %s" % dstore_config)
    if "google.datastore.v1.Datastore" not in dstore_config["interfaces"]:
        raise Exception("No datastore in dstore_config: %s" % dstore_config)
    # Update the structure in case it's already been loaded.
    dstore_config["interfaces"]["google.datastore.v1.Datastore"].update({
        # Add new settings here; see datastore_client_config for the format.
    })

@theacodes
Contributor

Well, I certainly can't endorse that, but I'm glad it works for you.

@EhsanKia

For what it's worth, I have an App Engine Flask project, and I used to have datastore_client = datastore.Client() as a global, creating the client once for the lifetime of the instance (that's how it's done in the sample apps).

As with above, the connection would eventually break and the client would get corrupted, so I've switched to instantiating a new client every time, since I don't call it very often. I would have rather implemented a getter that checks whether the client is working and reloads it if it isn't, but there doesn't seem to be a clean way of doing that?
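
A self-healing getter along those lines could be sketched as a wrapper that rebuilds the client after a failed call. This is an illustration, not library API; `ReconnectingClient` and the `factory` parameter are hypothetical names, and a stub client stands in for `datastore.Client`:

```python
class ReconnectingClient(object):
    """Wrap a client factory; rebuild the client once after a failed call."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None

    def call(self, method_name, *args, **kwargs):
        if self._client is None:
            self._client = self._factory()
        try:
            return getattr(self._client, method_name)(*args, **kwargs)
        except Exception:
            # Drop the (possibly corrupted) client and retry once fresh.
            # A real version would catch only transient errors like
            # ServiceUnavailable, not bare Exception.
            self._client = self._factory()
            return getattr(self._client, method_name)(*args, **kwargs)


# Demo with a stub client whose first instance always fails.
class StubClient(object):
    instances = 0

    def __init__(self):
        StubClient.instances += 1
        self.broken = StubClient.instances == 1

    def get(self, key):
        if self.broken:
            raise IOError("503 Transport closed")
        return {"key": key}


wrapper = ReconnectingClient(StubClient)
entity = wrapper.call("get", "task-1")
```

This avoids paying the client-construction cost on every request while still recovering from a broken connection, at the price of one failed call per breakage.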
