
Intermittent Cloud Datastore Error: ServiceUnavailable: 503 #3128

Closed
speedplane opened this issue Mar 10, 2017 · 33 comments
Assignees
Labels
api: datastore Issues related to the Datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. status: awaiting information type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@speedplane

I am getting intermittent ServiceUnavailable: 503 error messages when doing datastore operations with the google cloud client library. The stack trace is below. I am running Python 2.7 on Ubuntu, within a Docker container on a small Kubernetes cluster.

Traceback (most recent call last):
  ...
  File "/home/speedplane/DA/djangoappengine/db_gcl/compiler.py", line 345, in get_matching_pk
    self.client.get_multi(self.included_pks, **config)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/client.py", line 307, in get_multi
    transaction_id=transaction and transaction.id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/client.py", line 124, in _extended_lookup
    transaction_id=transaction_id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 519, in lookup
    lookup_response = self._datastore_api.lookup(project, lookup_request)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 304, in lookup
    return self._stub.Lookup(request_pb)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 260, in _grpc_catch_rendezvous
    raise error_class(exc.details())
ServiceUnavailable: 503 {"created":"@1489111389.496403106","description":"Endpoint read failed","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c","file_line":1851,"grpc_status":14,"occurred_during_write":0,"referenced_errors":[{"created":"@1489111389.496341023","description":"Secure read failed","file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,"referenced_errors":[{"created":"@1489111389.496338171","description":"Socket closed","fd":25,"file":"src/core/lib/iomgr/tcp_posix.c","file_line":249,"target_address":"ipv4:173.194.192.95:443"}]}]}

For ease of reading, that last line is expanded:

ServiceUnavailable: 503 {
 "created":"@1489111389.496403106",
 "description":"Endpoint read failed",
 "file":"src/core/ext/transport/chttp2/transport/chttp2_transport.c",
 "file_line":1851,
 "grpc_status":14,"occurred_during_write":0,
 "referenced_errors":[{"created":"@1489111389.496341023",
   "description":"Secure read failed",
   "file":"src/core/lib/security/transport/secure_endpoint.c","file_line":166,
   "referenced_errors":[{"created":"@1489111389.496338171",
     "description":"Socket closed",
     "fd":25,"file":"src/core/lib/iomgr/tcp_posix.c",
    "file_line":249,"target_address":"ipv4:173.194.192.95:443"}]
  }]
}

I am running this within a Docker container orchestrated by Kubernetes. My Dockerfile inherits from ubuntu:trusty. Below is version information:

# lsb_release
LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
# python --version
Python 2.7.6
@cobookman

Is there a chance that this is related: #2896

@cobookman

cobookman commented Mar 10, 2017

Also, the "Socket closed" error makes it seem like gRPC had its connection to Datastore terminated and needs to re-open it. In issue #2896, it's mentioned that the Python datastore client library doesn't implement retry logic (yet). Issue #2694 was created to track adding retry logic to all client libraries.
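
Until client-side retries land, a caller can wrap individual Datastore calls in a small retry loop for transient errors. This is a minimal sketch, not the library's mechanism; `retry_transient` and its backoff parameters are illustrative, and in real use `transient` would be `google.cloud.exceptions.ServiceUnavailable`:

```python
import time


def retry_transient(call, attempts=3, base_delay=0.5, transient=(Exception,)):
    """Invoke call(), retrying with exponential backoff on transient errors.

    `transient` is the tuple of exception types worth retrying.
    """
    for attempt in range(attempts):
        try:
            return call()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the last error
            time.sleep(base_delay * (2 ** attempt))


# Demo: a stand-in operation that fails twice, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("503 Endpoint read failed")
    return "ok"

result = retry_transient(flaky, attempts=5, base_delay=0.01, transient=(IOError,))
```

In the issue's context the wrapped call would be something like `lambda: client.get_multi(keys)`; non-idempotent operations (commits) need more care, as discussed later in the thread.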

@lukesneeringer
Contributor

Are you getting the error at 30 minute intervals like the author of #2896?

@lukesneeringer lukesneeringer added api: datastore Issues related to the Datastore API. priority: p2 Moderately-important priority. Fix may not be included in next release. status: acknowledged labels Mar 10, 2017
@speedplane
Author

speedplane commented Mar 10, 2017

It's happening much more regularly than once every 30 minutes. Below are logs with some timings.

[three screenshots of application logs showing repeated ServiceUnavailable: 503 errors, with timestamps]

@lukesneeringer
Contributor

Interesting that the second log set shows so much more frequent failures than the first one.

@daspecster
Contributor

daspecster commented Mar 10, 2017

@speedplane if this was asked already somewhere, sorry for the repeat question, but is this script using multiprocessing or any kind of threading?

My thought is that either it's an upstream issue (in which case I feel like more people would be chiming in) or there's something in your architecture that is unhappy.

@speedplane
Author

speedplane commented Mar 10, 2017

I saw that discussion. I don't use multiprocessing. I don't think there is threading, but just in case, I make sure all my clients are thread-local using the function below.

Perhaps I should have mentioned this earlier: this call is being made across projects. It is made from an instance within a project named ocr-api, but is trying to grab data from a project named docketupdate.

Also, I have a cloud support package, and I can try opening a case if that would help.

import threading
_datastore_client = threading.local()
def datastore_client(project=None):
    '''Return a thread-unique datastore client for the given project.'''
    # Our thread-local datastore client cache.
    global _datastore_client
    if not getattr(_datastore_client, 'clients', None):
        _datastore_client.clients = {}
    if project not in _datastore_client.clients:
        # Create a new datastore client and save it in the cache.
        _datastore_client.clients[project] = datastore.client.Client(project) \
            if project else datastore.client.Client()
    return _datastore_client.clients[project]
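
The caching pattern above can be exercised with a stub factory to confirm that the same thread and project share one client while a new thread builds its own. This is a sketch; `StubClient` stands in for `datastore.client.Client`:

```python
import threading

_local = threading.local()


class StubClient(object):
    """Stand-in for datastore.client.Client that counts constructions."""
    created = 0

    def __init__(self, project=None):
        StubClient.created += 1
        self.project = project


def get_client(project=None):
    """Return a per-thread, per-project cached StubClient."""
    if not getattr(_local, "clients", None):
        _local.clients = {}
    if project not in _local.clients:
        _local.clients[project] = StubClient(project)
    return _local.clients[project]


# Same thread + project -> cached instance; a second thread gets its own.
a = get_client("ocr-api")
b = get_client("ocr-api")

other = []
t = threading.Thread(target=lambda: other.append(get_client("ocr-api")))
t.start()
t.join()
```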

I can add my own retry logic and see if it helps, but given how often this is occurring, it would be preferable if the underlying issue could be addressed. I added a third screenshot of timestamps above.

@cobookman

cobookman commented Mar 10, 2017 via email

@daspecster
Contributor

@speedplane the project information is helpful. I believe there was an issue a long time ago about cross-project access not being supported well, but I can't seem to find it.

@dhermes @tseaver might remember?

@speedplane
Author

Done. See case ID 12283187.

@speedplane
Author

speedplane commented Mar 13, 2017

I stopped receiving these altogether on March 10 at around 3:30PM ET. I did not redeploy or change any settings in my app since then. Did something change on the backend?

@lukesneeringer
Contributor

Not sure. Perhaps?

@cobookman

@speedplane just double checking that you are still not seeing the errors.

@lukesneeringer
Contributor

Closing, but feel free to reopen if the status changes.

@speedplane
Author

Can you please reopen? This is coming back with a vengeance. Same error, but a slightly different code path; see the stack trace below:

  File "/home/speedplane/DA/djangoappengine/db_gcl/compiler.py", line 152, in fetch
    for entity in results:
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/iterator.py", line 218, in _items_iter
    for page in self._page_iter(increment=False):
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/iterator.py", line 247, in _page_iter
    page = self._next_page()
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/query.py", line 489, in _next_page
    transaction_id=transaction and transaction.id,
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_http.py", line 392, in run_query
    return self._datastore_api.run_query(project, request)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_gax.py", line 128, in run_query
    return self._stub.RunQuery(request_pb)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/_gax.py", line 67, in _grpc_catch_rendezvous
    raise error_class(exc.details())
ServiceUnavailable: 503 Endpoint read failed

@daspecster
Contributor

@speedplane I don't see anything on the status dashboard about an outage.
https://status.cloud.google.com

It does seem strange that it's ServiceUnavailable. Maybe there are communication problems between Kubernetes and Datastore that somehow don't get reported? /cc @lukesneeringer

@lukesneeringer
Contributor

@speedplane Let me make sure I understand -- it was gone with the grpc 1.2 upgrade, and now it is back? But with a slightly different error? Is it still intermittent?

@dhermes
Contributor

dhermes commented Mar 30, 2017

@speedplane Seeing one UNAVAILABLE a day probably isn't a bug (other than the absence of a retry feature). Is this happening LOTS (as you indicated the issue was happening before)?

Aside: @lukesneeringer doesn't GAPIC have a retry feature?

@lukesneeringer
Contributor

It does.

@speedplane
Author

speedplane commented Apr 12, 2017

@dhermes @daspecster @lukesneeringer I am seeing this much more often than once a day, I see this roughly once a minute. I have a feeling that when I said the issue was gone, it wasn't, but only that I was not collecting the events properly. What other information would be helpful here?

Not sure if this is helpful, but I am installing the SDK files from within my Dockerfile with the following command:

RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
  echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
  curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add - && \
  apt-get update && apt-get install -y google-cloud-sdk

@vsoch

vsoch commented Aug 10, 2017

+1, I am also seeing this issue using the python datastore client. I went too far forward in my terminal to get the exact trace, but it is a 503 ServiceUnavailable thing. For me, it reliably happens when I am "starting up" the client after being inactive for a bit, and the second try always works. If it's the case that datastore is run on an apps instance, not knowing anything else I would guess it has to do with waking it up / initializing it for use.

@chemelnucfin chemelnucfin added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jan 9, 2018
@chemelnucfin
Contributor

@speedplane @vsoch Is this still a problem? I noticed there has been a change in the datastore client and that retry is in the autogen client.

@vsoch

vsoch commented Jan 30, 2018

Ah, I can't give additional feedback on this - we wound up switching to a different database.

@speedplane
Author

speedplane commented Jan 30, 2018 via email

@chemelnucfin
Contributor

@vsoch Thank you for taking the time to respond. Perhaps in the future you can give our product another try. I apologize for any inconvenience this may have caused.

@speedplane From #4766 it seems that getting a 503 GOAWAY message from the server side is a normal part of the process. I believe retries are being built into our clients, which will hopefully eliminate those errors. I will keep track of our progress on that. Thank you for your response.

@tseaver
Contributor

tseaver commented Feb 20, 2018

The google-cloud-datastore 1.5.0 release uses newly-generated GAPIC code, which includes built-in retries for idempotent methods. @speedplane, can you please try to reproduce with that version?

@theacodes
Contributor

I'm going to go ahead and close this, as the latest release includes significant retry plumbing. If it's still an issue, feel free to comment and we'll re-open.

@tgensol

tgensol commented May 3, 2018

Hi guys,

I am using google-cloud-datastore on AWS Lambda (serverless), and sometimes I get the error too:

503 Transport closed: ServiceUnavailable
Traceback (most recent call last):
  File "/var/task/handler.py", line 177, in create_user
    raise e
  File "/var/task/handler.py", line 163, in create_user
    datastore_client.put(task)
  File "/var/task/google/cloud/datastore/client.py", line 404, in put
    self.put_multi(entities=[entity])
  File "/var/task/google/cloud/datastore/client.py", line 431, in put_multi
    current.commit()
  File "/var/task/google/cloud/datastore/batch.py", line 273, in commit
    self._commit()
  File "/var/task/google/cloud/datastore/batch.py", line 249, in _commit
    self.project, mode, self._mutations, transaction=self._id)
  File "/var/task/google/cloud/datastore_v1/gapic/datastore_client.py", line 426, in commit
    request, retry=retry, timeout=timeout, metadata=metadata)
  File "/var/task/google/api_core/gapic_v1/method.py", line 139, in __call__
    return wrapped_func(*args, **kwargs)
  File "/var/task/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/var/task/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/var/task/google/api_core/timeout.py", line 206, in func_with_timeout
    return func(*args, **kwargs)
  File "/var/task/google/api_core/grpc_helpers.py", line 56, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 Transport closed

I am using google-cloud-core 0.28.1 and google-cloud-datastore 1.6.0. Thanks!

@tseaver
Contributor

tseaver commented May 3, 2018

@tgensol Hmm, your error is propagated from a Commit call, which is not idempotent. I don't think we have any expected automatic retries for that case. @jonparrott can you confirm?

@theacodes
Contributor

Yeah, it's generally not safe for us to retry a commit operation blindly. If you are certain it's safe to retry, you can do so using our retry helper:

from google.api_core import exceptions
from google.api_core import retry

retry_commit = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
    deadline=60)

...

retry_commit(datastore_client.put)(task)

@speedplane
Author

I've dealt with this same problem. Using the retry handler directly can be difficult because the calls are often buried under middleware code. Instead, I've monkey-patched the default retry handler settings:

def patch_datastore_retry():
    try:
        from google.cloud.datastore_v1.gapic import datastore_client_config
    except ImportError:
        return
    dstore_config = datastore_client_config.config
    # Sanity check that this structure is what we think it is.
    if "interfaces" not in dstore_config:
        raise Exception("No interfaces in dstore_config: %s" % dstore_config)
    if "google.datastore.v1.Datastore" not in dstore_config["interfaces"]:
        raise Exception("No datastore in dstore_config: %s" % dstore_config)
    # Update the structure in case it's already been loaded.
    dstore_config["interfaces"]["google.datastore.v1.Datastore"].update({
        # Add new settings here; see datastore_client_config for the format.
    })

@theacodes
Contributor

Well, I certainly can't endorse that, but I'm glad it works for you.

@EhsanKia

For what it's worth, I have an App Engine Flask project, and I used to have datastore_client = datastore.Client() as a global, creating the client once for the lifetime of the instance (that's how it's done in the sample apps).

As with above, the connection would eventually break and the client would get corrupted, so I've switched to instantiating a new client every time, since I don't call it very often. I would have rather implemented a getter that checks whether the client is working and reloads it if it isn't, but there doesn't seem to be a clean way of doing that?
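
A self-healing getter along those lines could be sketched as a wrapper that rebuilds the client after a failed call. This is an illustration, not library API; `ReconnectingClient` and the `factory` parameter are hypothetical names, and a stub client stands in for `datastore.Client`:

```python
class ReconnectingClient(object):
    """Wrap a client factory; rebuild the client once after a failed call."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None

    def call(self, method_name, *args, **kwargs):
        if self._client is None:
            self._client = self._factory()
        try:
            return getattr(self._client, method_name)(*args, **kwargs)
        except Exception:
            # Drop the (possibly corrupted) client and retry once fresh.
            # A real version would catch only transient errors like
            # ServiceUnavailable, not bare Exception.
            self._client = self._factory()
            return getattr(self._client, method_name)(*args, **kwargs)


# Demo with a stub client whose first instance always fails.
class StubClient(object):
    instances = 0

    def __init__(self):
        StubClient.instances += 1
        self.broken = StubClient.instances == 1

    def get(self, key):
        if self.broken:
            raise IOError("503 Transport closed")
        return {"key": key}


wrapper = ReconnectingClient(StubClient)
entity = wrapper.call("get", "task-1")
```

This avoids paying the client-construction cost on every request while still recovering from a broken connection, at the price of one failed call per breakage.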
