"Active listeners" not active if at least one is waiting for an SDS update #6904

Closed
PiotrSikora opened this issue May 11, 2019 · 28 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements

Comments

@PiotrSikora
Contributor

Description:

All listeners from LDS are marked as "active", but they don't even listen() if at least one of them is waiting for an SDS update.

I believe there are 2 issues here, really:

  1. All listeners are marked as "active" instead of "warming",
  2. All listeners are blocked on a single listener in a warming state.

Repro steps / Admin and Stats Output:

Start "fake" xDS server that will accept connection, but will never respond:

$ nc -l -p 9903 &

Start Envoy:

$ bazel-bin/source/exe/envoy-static -c /tmp/configs/config_not_warming.yaml -l trace --service-node node --service-cluster cluster

Verify that listeners from LDS are configured, and marked as active and not warming:

$ curl -s 127.0.0.1:9901/listeners
["0.0.0.0:9080","0.0.0.0:9443"]
$ curl -s 127.0.0.1:9901/config_dump | jq -r ".configs[].dynamic_active_listeners[]? | .listener.address"
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9080
  }
}
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9443
  }
}
$ curl -s 127.0.0.1:9901/config_dump | jq -r ".configs[].dynamic_warming_listeners[]? | .listener.address"
<no output>

Try to connect to both listeners, discover that neither is listening:

$ curl 127.0.0.1:9080
curl: (7) Failed to connect to 127.0.0.1 port 9080: Connection refused
$ curl 127.0.0.1:9443
curl: (7) Failed to connect to 127.0.0.1 port 9443: Connection refused
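The listener manager gauges are another place to look at the same state (this check was not part of the original report, so no output is shown here; listener_manager.total_listeners_active and listener_manager.total_listeners_warming are the relevant gauges):

$ curl -s 127.0.0.1:9901/stats | grep -E "listener_manager.total_listeners_(active|warming)"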

Config:

$ cat /tmp/configs/config_not_warming.yaml
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901

static_resources:
  clusters:
  - name: default_cluster
    connect_timeout: 1s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: default_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9902
  - name: xds_cluster
    connect_timeout: 1s
    type: STATIC
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9903

dynamic_resources:
  lds_config:
    path: /tmp/configs/listeners_not_warming.yaml
$ cat /tmp/configs/listeners_not_warming.yaml
version_info: "0"
resources:
- '@type': type.googleapis.com/envoy.api.v2.Listener
  name: http_listener
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9080
  filter_chains:
  - filters:
    - name: envoy.tcp_proxy
      config:
        stat_prefix: stats
        cluster: default_cluster

- '@type': type.googleapis.com/envoy.api.v2.Listener
  name: https_listener
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9443
  filter_chains:
  - filters:
    - name: envoy.tcp_proxy
      config:
        stat_prefix: stats
        cluster: default_cluster
    tls_context:
      common_tls_context:
        tls_certificate_sds_secret_configs:
        - name: default_secret
          sds_config:
            api_config_source:
              api_type: GRPC
              grpc_services:
                envoy_grpc:
                  cluster_name: xds_cluster

Logs:

[info][main] [source/server/server.cc:203] initializing epoch 0 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
[info][main] [source/server/server.cc:205] statically linked extensions:                                                         
[info][main] [source/server/server.cc:207]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.wasm_access_log                                                                                                                           
[info][main] [source/server/server.cc:210]   filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash,envoy.wasm
[info][main] [source/server/server.cc:213]   filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[info][main] [source/server/server.cc:216]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[info][main] [source/server/server.cc:218]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[info][main] [source/server/server.cc:220]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.zipkin
[info][main] [source/server/server.cc:223]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[info][main] [source/server/server.cc:226]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[info][main] [source/server/server.cc:228]   wasm: envoy.wasm                     
[info][main] [source/server/server.cc:275] admin address: 127.0.0.1:9901            
[debug][main] [source/server/overload_manager_impl.cc:171] No overload action configured for envoy.overload_actions.stop_accepting_connections.
[info][config] [source/server/configuration_impl.cc:50] loading 0 static secret(s)
[info][config] [source/server/configuration_impl.cc:56] loading 2 cluster(s)         
[debug][grpc] [source/common/grpc/google_async_client_impl.cc:42] completionThread running                 
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:793] adding TLS initial cluster default_cluster
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:793] adding TLS initial cluster xds_cluster
[trace][upstream] [source/common/upstream/upstream_impl.cc:933] Local locality:   
[debug][upstream] [source/common/upstream/upstream_impl.cc:729] initializing secondary cluster default_cluster completed
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:938] membership update for TLS cluster default_cluster added 1 removed 0
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:91] cm init: init complete: cluster=default_cluster primary=0 secondary=0
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:63] cm init: adding: cluster=default_cluster primary=0 secondary=0
[trace][upstream] [source/common/upstream/upstream_impl.cc:933] Local locality:
[debug][upstream] [source/common/upstream/upstream_impl.cc:729] initializing secondary cluster xds_cluster completed
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:938] membership update for TLS cluster xds_cluster added 1 removed 0
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:91] cm init: init complete: cluster=xds_cluster primary=0 secondary=0
[debug][upstream] [source/common/upstream/cluster_manager_impl.cc:63] cm init: adding: cluster=xds_cluster primary=0 secondary=0
[info][upstream] [source/common/upstream/cluster_manager_impl.cc:136] cm init: all clusters initialized
[info][config] [source/server/configuration_impl.cc:65] loading 0 listener(s)
[info][config] [source/server/configuration_impl.cc:90] loading tracing configuration
[info][config] [source/server/configuration_impl.cc:110] loading stats sink configuration
[debug][file] [source/common/filesystem/inotify/watcher_impl.cc:49] added watch for directory: '/tmp/configs' file: 'listeners_not_warming.yaml' fd: 1
[info][main] [source/server/server.cc:466] all clusters initialized. initializing init manager
[debug][config] [bazel-out/k8-fastbuild/bin/source/common/config/_virtual_includes/filesystem_subscription_lib/common/config/filesystem_subscription_impl.h:58] Filesystem config refresh for /tmp/configs/listeners_not_warming.yaml
[warning][misc] [source/common/protobuf/utility.cc:129] Using deprecated option 'envoy.api.v2.listener.Filter.config'. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/master/DEPRECATED.md for details.
[warning][misc] [source/common/protobuf/utility.cc:129] Using deprecated option 'envoy.api.v2.listener.Filter.config'. This configuration will be removed from Envoy soon. Please see https://github.com/envoyproxy/envoy/blob/master/DEPRECATED.md for details.
[debug][config] [source/server/listener_manager_impl.cc:729] begin add/update listener: name=http_listener hash=5551506821223866441
[debug][config] [source/server/listener_manager_impl.cc:55]   filter #0:
[debug][config] [source/server/listener_manager_impl.cc:56]     name: envoy.tcp_proxy
[debug][config] [source/server/listener_manager_impl.cc:59]   config: {"cluster":"default_cluster","stat_prefix":"stats"}
[debug][config] [source/server/listener_manager_impl.cc:617] add active listener: name=http_listener, hash=5551506821223866441, address=0.0.0.0:9080
[info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'http_listener'
[debug][config] [source/server/listener_manager_impl.cc:729] begin add/update listener: name=https_listener hash=16855263350469956370
[debug][config] [source/common/config/grpc_mux_impl.cc:73] gRPC mux subscribe for type.googleapis.com/envoy.api.v2.auth.Secret
[trace][config] [bazel-out/k8-fastbuild/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:111] draining discovery requests 1
[debug][config] [source/common/config/grpc_mux_impl.cc:34] No stream available to sendDiscoveryRequest for type.googleapis.com/envoy.api.v2.auth.Secret
[debug][config] [bazel-out/k8-fastbuild/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:53] Establishing new gRPC bidi stream for rpc StreamSecrets(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);

[debug][router] [source/common/router/router.cc:320] [C0][S7940596661014428145] cluster 'xds_cluster' match for URL '/envoy.service.discovery.v2.SecretDiscoveryService/StreamSecrets'
[debug][router] [source/common/router/router.cc:381] [C0][S7940596661014428145] router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v2.SecretDiscoveryService/StreamSecrets'
':authority', 'xds_cluster'
':scheme', 'http'
'te', 'trailers'
'content-type', 'application/grpc'
'x-envoy-internal', 'true'
'x-forwarded-for', '10.138.0.2'

[debug][client] [source/common/http/codec_client.cc:26] [C0] connecting
[debug][connection] [source/common/network/connection_impl.cc:639] [C0] connecting to 127.0.0.1:9903
[debug][connection] [source/common/network/connection_impl.cc:648] [C0] connection in progress
[debug][http2] [source/common/http/http2/codec_impl.cc:721] [C0] setting stream-level initial window size to 268435456
[debug][http2] [source/common/http/http2/codec_impl.cc:743] [C0] updating connection-level initial window size to 268435456
[debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[trace][config] [bazel-out/k8-fastbuild/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:111] draining discovery requests 1
[trace][config] [source/common/config/grpc_mux_impl.cc:59] Sending DiscoveryRequest for type.googleapis.com/envoy.api.v2.auth.Secret: node {
  id: "node"
  cluster: "cluster"
  build_version: "5f7bf108a93e962bf21dce7bbdfd9294d747cc71/1.10.0-dev/Modified/DEBUG/BoringSSL"
}
resource_names: "default_secret"
type_url: "type.googleapis.com/envoy.api.v2.auth.Secret"

[trace][router] [source/common/router/router.cc:1018] [C0][S7940596661014428145] buffering 162 bytes
[debug][config] [source/server/listener_manager_impl.cc:55]   filter #0:                                                                                                                                                                                      
[debug][config] [source/server/listener_manager_impl.cc:56]     name: envoy.tcp_proxy
[debug][config] [source/server/listener_manager_impl.cc:59]   config: {"cluster":"default_cluster","stat_prefix":"stats"}
[debug][config] [source/server/listener_manager_impl.cc:617] add active listener: name=https_listener, hash=16855263350469956370, address=0.0.0.0:9443
[info][upstream] [source/server/lds_api.cc:73] lds: add/update listener 'https_listener'
[debug][config] [bazel-out/k8-fastbuild/bin/source/common/config/_virtual_includes/filesystem_subscription_lib/common/config/filesystem_subscription_impl.h:70] Filesystem config update accepted for /tmp/configs/listeners_not_warming.yaml: version_info: "0"
resources {
  [type.googleapis.com/envoy.api.v2.Listener] {
    name: "http_listener"
    address {
      socket_address {
        address: "0.0.0.0"
        port_value: 9080
      }
    }
    filter_chains {
      filters {
        name: "envoy.tcp_proxy"
        config {
          fields {
            key: "cluster"
            value {
              string_value: "default_cluster"
            }
          }
          fields {
            key: "stat_prefix"
            value {
              string_value: "stats"
            }
          }
        }
      }
    }
  }
}
resources {
  [type.googleapis.com/envoy.api.v2.Listener] {
    name: "https_listener"
    address {
      socket_address {
        address: "0.0.0.0"
        port_value: 9443
      }
    }
    filter_chains {
      tls_context {
        common_tls_context {
          tls_certificate_sds_secret_configs {
            name: "default_secret"
            sds_config {
              api_config_source {
                api_type: GRPC
                grpc_services {
                  envoy_grpc {
                    cluster_name: "xds_cluster"
                  }
                }
              }
            }
          }
        }
      }
      filters {
        name: "envoy.tcp_proxy"
        config {
          fields {
            key: "cluster"
            value {
              string_value: "default_cluster"
            }
          }
          fields {
            key: "stat_prefix"
            value {
              string_value: "stats"
            }
          }
        }
      }
    }
  }
}

[info][main] [source/server/server.cc:492] starting main dispatch loop
[debug][file] [source/common/filesystem/inotify/watcher_impl.cc:74] notification: fd: 1 mask: 20 file: listeners_not_warming.yaml
[debug][file] [source/common/filesystem/inotify/watcher_impl.cc:74] notification: fd: 1 mask: 1 file: listeners_not_warming.yaml
[debug][file] [source/common/filesystem/inotify/watcher_impl.cc:74] notification: fd: 1 mask: 10 file: listeners_not_warming.yaml
[trace][connection] [source/common/network/connection_impl.cc:440] [C0] socket event: 2
[trace][connection] [source/common/network/connection_impl.cc:508] [C0] write ready
[debug][connection] [source/common/network/connection_impl.cc:517] [C0] connected
[debug][client] [source/common/http/codec_client.cc:64] [C0] connected
[debug][pool] [source/common/http/http2/conn_pool.cc:83] [C0] creating stream
[debug][router] [source/common/router/router.cc:1128] [C0][S7940596661014428145] pool ready
[trace][http2] [source/common/http/http2/codec_impl.cc:554] [C0] send data: bytes=24
[trace][connection] [source/common/network/connection_impl.cc:376] [C0] writing 24 bytes, end_stream false
[trace][http2] [source/common/http/http2/codec_impl.cc:554] [C0] send data: bytes=21
[trace][connection] [source/common/network/connection_impl.cc:376] [C0] writing 21 bytes, end_stream false
[trace][http2] [source/common/http/http2/codec_impl.cc:508] [C0] sent frame type=4
[trace][http2] [source/common/http/http2/codec_impl.cc:554] [C0] send data: bytes=13
[trace][connection] [source/common/network/connection_impl.cc:376] [C0] writing 13 bytes, end_stream false
[trace][http2] [source/common/http/http2/codec_impl.cc:508] [C0] sent frame type=8
[trace][http2] [source/common/http/http2/codec_impl.cc:554] [C0] send data: bytes=132
[trace][connection] [source/common/network/connection_impl.cc:376] [C0] writing 132 bytes, end_stream false
[trace][http2] [source/common/http/http2/codec_impl.cc:508] [C0] sent frame type=1
[trace][connection] [source/common/network/connection_impl.cc:376] [C0] writing 171 bytes, end_stream false
[trace][http2] [source/common/http/http2/codec_impl.cc:508] [C0] sent frame type=0
[trace][connection] [source/common/network/connection_impl.cc:508] [C0] write ready
[trace][connection] [source/common/network/raw_buffer_socket.cc:62] [C0] write returns: 361
[trace][connection] [source/common/network/connection_impl.cc:440] [C0] socket event: 2
[trace][connection] [source/common/network/connection_impl.cc:508] [C0] write ready
[debug][main] [source/server/server.cc:145] flushing stats

cc @mattklein123 @htuch @duderino @costinm @jplevyak @silentdai @philrud @howardjohn @wattli @quanjielin @JimmyCYJ

@PiotrSikora
Contributor Author

For the record, initial_fetch_timeout from #6048 mitigates this issue: all listeners start listening after the configured timeout, with the listener that is still waiting for SDS rejecting connections because it uses the NotReadySslSocket transport socket.
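For reference, a sketch of where that option would go in the listener config above; initial_fetch_timeout sits on the config source itself, alongside api_config_source, and the 5s value is just an example, not a recommendation:

          sds_config:
            initial_fetch_timeout: 5s   # example value only
            api_config_source:
              api_type: GRPC
              grpc_services:
                envoy_grpc:
                  cluster_name: xds_cluster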

This can be tracked in stats by looking at downstream_context_secrets_not_ready, e.g. listener.0.0.0.0_9443.server_ssl_socket_factory.downstream_context_secrets_not_ready.
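For example (not captured in the original report), against the admin address used in the repro config:

$ curl -s 127.0.0.1:9901/stats | grep downstream_context_secrets_not_ready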

mattklein123 added the question (Questions that are neither investigations, bugs, nor enhancements) label May 13, 2019
@mattklein123
Member

@PiotrSikora where do you see the listener marked as "active"? That might be a bug or something we can expose better in stats/admin/etc.

The 2nd part AFAICT is WAI, at least as expected today. The way Envoy init works is that everything goes through the init process before listeners are started and actually accept connections. This is so that we make sure the server is fully initialized prior to accepting traffic.

We could obviously relax this so that listeners start taking traffic as their dependencies are initialized, but this would be a non-trivial code change. Perhaps open an explicit enhancement ticket on this part?

@PiotrSikora
Contributor Author

@PiotrSikora where do you see the listener marked as "active"? That might be a bug or something we can expose better in stats/admin/etc.

Those listeners are listed under dynamic_active_listeners and not dynamic_warming_listeners in the /config_dump:

$ curl -s 127.0.0.1:9901/config_dump | jq -r ".configs[].dynamic_active_listeners[]? | .listener.address"
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9080
  }
}
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9443
  }
}

Is there a better place to look for the status? In any case, the /config_dump seems like a bug.

The 2nd part AFAICT is WAI, at least as expected today. The way Envoy init works is that everything goes through the init process before listeners are started and actually accept connections. This is so that we make sure the server is fully initialized prior to accepting traffic.

We could obviously relax this so that listeners start taking traffic as their dependencies are initialized, but this would be a non-trivial code change. Perhaps open an explicit enhancement ticket on this part?

Ah, I incorrectly assumed that this was already the case, thanks!

But wouldn't this be effectively the enhancement ticket?

@mattklein123
Member

In any case, the /config_dump seems like a bug.

It's an artifact of how warming works pre and post init. I agree we can do better here.

if (workers_started_) {
  new_listener->debugLog("add warming listener");
  warming_listeners_.emplace_back(std::move(new_listener));
} else {
  new_listener->debugLog("add active listener");
  active_listeners_.emplace_back(std::move(new_listener));
}

But wouldn't this be effectively the enhancement ticket?

It's hard for viewers to look at a giant ticket like this and grok what is going on. Now that we understand both issues, I would suggest opening two detailed tickets, one for each sub-issue, that we can then action, as they are (mostly) independent issues.

@htuch
Member

htuch commented May 14, 2019

Having listeners warm independently also lines up better with the goal of handling partial xDS update failures (e.g. having a partial RDS update be able to unblock a bunch of listeners). So, +1 to this.

@stale

stale bot commented Jun 13, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale (stalebot believes this issue/PR has not been touched recently) label Jun 13, 2019
@lambdai
Contributor

lambdai commented Jun 13, 2019

This might happen if the listener is "added" but not "replaced". If there is no draining listener at this port, it's nice to have the listen() happen before the listener is ready to serve: the benefit is that incoming connections are queued instead of rejected.

Maybe we should give listeners in this state a new name instead of "active".
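To illustrate the queueing point with a standalone example (plain POSIX sockets, not Envoy code): once a socket is bound and listen()ed, the kernel completes incoming TCP handshakes and holds them in the backlog even if the process never calls accept(), so clients connect successfully instead of getting "connection refused".

// Toy demonstration: bind + listen, but never accept.
// Clients that connect while this runs are queued in the kernel backlog
// rather than rejected. Not Envoy code; only illustrates the TCP behavior.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(9080);               // example port
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    perror("bind");
    return 1;
  }
  listen(fd, 128);  // connections are now queued by the kernel...
  pause();          // ...even though we never call accept().
  close(fd);
  return 0;
}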

stale bot removed the stale (stalebot believes this issue/PR has not been touched recently) label Jun 13, 2019
@incfly
Contributor

incfly commented Jun 28, 2019

@PiotrSikora asked me to take a look. I went over the code and chatted with @silentdai; here are some thoughts.

(Scoped only to why active_listeners_ are actually not really active, and to the confusion in config_dump. I agree it's better to have a separate issue for independent listener updates; I will file that.)

  • It's unclear to us why the decision on whether to add to active_listeners_ or warming_listeners_ depends on workers_started_... Couldn't we just always add to warming_listeners_ if the listener doesn't already exist?
    Then, later on, when listeners get warmed, they can still be moved to the active_listeners_ list.

if (workers_started_) {
  new_listener->debugLog("add warming listener");
  warming_listeners_.emplace_back(std::move(new_listener));
} else {
  new_listener->debugLog("add active listener");
  active_listeners_.emplace_back(std::move(new_listener));
}

There is some discussion on the original PR:

https://github.com/envoyproxy/envoy/pull/1215/files#r126001538

  • We know that before workers are started, Envoy uses the global init manager; after that, each listener uses its own init manager. But that seems independent of whether the listener is added to the warming list.

  • If we just always add listeners to warming_listeners_ in that line, and adjust the listener manager wherever it finds/updates the lists, things should work? I'm thinking of just changing that line and seeing whether any legitimate test case breaks (see the sketch below).
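A standalone toy sketch of that idea (hypothetical names, not the actual ListenerManagerImpl code): always place new listeners in the warming list, and only promote them to active once their init targets are ready.

// Toy model only: the real Envoy code also handles draining, updates, and
// per-worker state. This only illustrates "always warm first, then promote".
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct ToyListener {
  std::string name;
  bool warmed = false;
};

class ToyListenerManager {
public:
  // Proposed behavior: new listeners always start in the warming list,
  // regardless of whether workers have started.
  void addOrUpdateListener(std::unique_ptr<ToyListener> l) {
    warming_.push_back(std::move(l));
  }

  // Called when a listener's init targets (e.g. its SDS secrets) are ready.
  void onListenerWarmed(const std::string& name) {
    auto it = std::find_if(warming_.begin(), warming_.end(),
                           [&](const auto& l) { return l->name == name; });
    if (it == warming_.end()) {
      return;
    }
    (*it)->warmed = true;
    active_.push_back(std::move(*it));
    warming_.erase(it);
  }

private:
  std::vector<std::unique_ptr<ToyListener>> warming_;
  std::vector<std::unique_ptr<ToyListener>> active_;
};

int main() {
  ToyListenerManager mgr;
  auto l = std::make_unique<ToyListener>();
  l->name = "https_listener";
  mgr.addOrUpdateListener(std::move(l));      // sits in warming...
  mgr.onListenerWarmed("https_listener");     // ...until e.g. its SDS secret arrives
  return 0;
}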

@mattklein123
Please let us know if the above thinking is correct, or whether we're missing some historical/tricky context, thanks!

@htuch
Member

htuch commented Jun 28, 2019

I think @asraa also bumped into something in this area the other day while developing a dynamic xDS fuzzer; there is some interaction between worker startup and listener warming that can lead to a crash in the early initialization phases.

@asraa
Contributor

asraa commented Jun 28, 2019

Yes! Sort of. I set up a crash by removing a listener before worker threads are started.

I think this happens because:

  1. The server's init manager has a RunHelper watcher that contains a callback to start worker threads.
  2. The server's init manager hasn't reached an Initialized state, so it never notifies the watcher to run that callback; hence the worker threads aren't started.
  3. Then, I ask to stop a listener, and drainListener crashes with an ASSERT(thread_).

I'll set up a separate issue detailing the test case.

@lambdai
Contributor

lambdai commented Jun 28, 2019

@asraa An xDS fuzzer would be helpful! I am making changes to the listener code, and tests are breaking (including crashes and expectations that aren't reached). Even with the crashing tests, it could be that the test cases access an invalid address in a way that prod code wouldn't.
I would be much more confident in my code if I could spam Envoy with all kinds of configs.

@mattklein123
Member

@mattklein123
Please let us know if the above thinking is correct, or whether we're missing some historical/tricky context, thanks!

I would need to go back through all this code. At a high level, I agree that during the first init/boot we can/should be using the warming/active lists. I think I probably did it the way that it is initially just because it was simpler for some reason, possibly due to global init manager processing. Unfortunately it's been long enough that I would have to go through and refresh my memory again. I'm happy to help review any changes in this area if someone wants to tackle this.

@incfly
Contributor

incfly commented Jul 1, 2019

Thanks @mattklein123 for your feedback.

I took another round of code scanning today; so far I haven't found anything that prevents us from always adding listeners to warming_listeners_. @asraa I think this might fix the draining listeners issue as well, since listeners waiting in the warming list won't be drained...

I'll try that approach, see what breaks, and get back to this thread if anything is unclear to me.

@lambdai
Contributor

lambdai commented Jul 2, 2019

I see the same issue: the global init manager is not yet initializing when the registered targets are already ready.

My solution is to add a ready flag to the init target, so that if the target is already ready when initialize() is called, the init manager still gets the ready() notification.

It might be reasonable to make ready an internal state of the init target. @mattklein123 Do you think an init target that supports invoking ready() before initialization would be useful?

@mattklein123
Member

It might be reasonable to make ready an internal state of the init target. @mattklein123 Do you think an init target that supports invoking ready() before initialization would be useful?

IIRC the way this is handled today is that the target should just immediately call the ready callback if it's ready. I believe this is done elsewhere. If that is not working I think we should debug why?

@lambdai
Contributor

lambdai commented Jul 2, 2019

* any time. Calling it before initialization begins or after initialization has already ended
* will have no effect.

If the comment is true, the init manager never knows whether ready() was invoked beforehand.

I can add another Target impl, as so far I am the only one who really needs it. I believe this logic could simplify the existing code as well.

@mattklein123
Member

I'm not sure that comment is correct. IIRC a target can just call ready() from within the initialize callback, which presumably it can do if it's already ready. cc @mergeconflict who recently worked on this code and should be able to comment.

@mergeconflict
Member

@mattklein123: @silentdai is right. If initialize hasn't yet been called on a given target, it won't have a watcher_handle_ (set here and checked here), so nothing will happen. I sort of like @silentdai's idea of adding a ready_ flag, so that if the target becomes ready before initialize is called, it will immediately call the init manager's watcher back rather than doing nothing. It's a change in semantics, but one which nobody should be adversely affected by. I think I'd prefer that instead of adding a new target impl.
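A minimal standalone sketch of the ready_-flag semantics being discussed (toy types, not the actual Init::TargetImpl API): if ready() arrives before initialize(), remember it and notify the watcher as soon as initialize() is called.

// Toy model of an init target that tolerates ready() before initialize().
// Double notification and thread safety are intentionally ignored here.
#include <functional>
#include <utility>

class ToyTarget {
public:
  // The init manager calls this when initialization begins.
  void initialize(std::function<void()> watcher_ready_callback) {
    watcher_cb_ = std::move(watcher_ready_callback);
    if (ready_) {
      // The target became ready before initialize(); notify immediately.
      watcher_cb_();
    }
  }

  // The target's dependency (e.g. an xDS subscription) calls this when done.
  void ready() {
    ready_ = true;
    if (watcher_cb_) {
      watcher_cb_();
    }
  }

private:
  bool ready_ = false;
  std::function<void()> watcher_cb_;
};

int main() {
  ToyTarget target;
  target.ready();                                        // dependency finished early
  target.initialize([] { /* start workers, etc. */ });   // watcher is still notified
  return 0;
}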

@mattklein123
Member

Sorry what I'm suggesting is to just immediately call ready() within the call stack of the initialize() call. That will work, right?

@mergeconflict
Member

Sorry what I'm suggesting is to just immediately call ready() within the call stack of the initialize() call. That will work, right?

Oh, my bad, I forgot to address your comment. That's one of the use cases I originally intended, and it does work (see e.g. this test), but I'm assuming that @silentdai has some use case where he has to call ready() before initialize() is called. Is that right?

@incfly
Contributor

incfly commented Jul 3, 2019

Back to the original issue: I talked with @silentdai and now have a better understanding.

  • I learned that the current implementation is meant to ensure that Envoy's initial config is fully initialized before serving any traffic, thus starting from a known good state. That's probably why we use the global init manager to start workers, and add listeners to active_listeners_ to block worker startup. So, as @mattklein123 mentioned, this is working as intended right now. And I kind of agree: starting from a known good state is reasonable.
  • However, in the context of a very dynamic xDS server or a fuzz test, what "initial good config" means is very obscure. A fuzz test adds and removes listeners quickly, and xDS can change a lot before workers start; which initial config Envoy should start from is hard to tell...
  • Also note that once workers are started, each listener uses its own warming/per-listener init manager, so this should not block and hang the entire Envoy; the impact of this problem should be limited to the initial loading phase.

I'm not sure if we should proceed with my proposal above. If not, we still need to clarify config_dump, e.g. with special logic for printing active_listeners_ so users aren't confused.
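Until then, one way to see both lists side by side from the existing config_dump is a jq one-liner along the lines of the commands earlier in this issue (untested against every config_dump shape, so treat it as a sketch):

$ curl -s 127.0.0.1:9901/config_dump | jq -r '.configs[] | {active: [.dynamic_active_listeners[]? | .listener.name], warming: [.dynamic_warming_listeners[]? | .listener.name]}'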

@mattklein123 @PiotrSikora @htuch WDYT?

@mattklein123
Member

@mattklein123 @PiotrSikora @htuch WDYT?

We definitely have to block server start so we must use the global init manager. That can't change. However it would definitely be possible to make the code more complex and switch to using both the warming/active lists, and make sure that we take care of removal correctly during global init. This is definitely a bug and we should handle it. (This is the issue @asraa is seeing.)

Personally, I would probably focus on two separate things here:

  1. Let's clarify config_dump so that it's more clear what is going on during init.
  2. Let's fix the fast add/remove bugs somehow.

I would have to spend more time going back through the code to suggest specific fixes, but my advice is to iterate and let's come up with simple solutions to (1) and (2) above and go from there?

@lambdai
Contributor

lambdai commented Jul 3, 2019

@mergeconflict Yes, I have one use case. I have a subscription callback (more precisely, the filter chain update) as a dependency of listener warm-up, which indicates it should be registered as an init target.
I don't want to couple the subscription with listener initialization (which is only late-initialized at server startup, but not in other cases). A new target implementation would work like a charm.

@incfly
Contributor

incfly commented Jul 11, 2019

Closing this issue, since the two problems raised in it are both resolved:

  1. Config dump is misleading.
  2. Listener initialization is already independent after the workers are started, and initializing all listeners together before workers start is WAI behavior.

@PiotrSikora
Contributor Author

2. ...initializing all listeners together before workers start is WAI behavior.

Why is this WAI?

@mattklein123
Member

Why is this WAI?

It's WAI as the code was written today. Please open up a fresh issue on making this part better so we can keep things straightforward?

@incfly
Contributor

incfly commented Jul 12, 2019

Just want to make sure I understand what we agree on:

  1. Envoy goes ahead and starts the workers even if not every initial listener is warmed.
  2. Changing the implementation to add initial listeners to the warming list just to avoid confusion, but still blocking worker/Envoy startup until those first-received listeners are ready/warmed.

(1) makes a real behavior change; (2) only makes the code/implementation easier to follow.

@mattklein123
Member

Correct. If we want to track (1) please open a new issue on that.
