- Author(s): Sanjay Pujare (@sanjaypujare)
- Approver: markdroth
- Status: Implemented
- Implemented in: C-core
- Last updated: 2023-03-19
- Discussion at: https://groups.google.com/g/grpc-io/c/F84rRVNgml4
This design specifies a mechanism to implement session affinity where backend
services maintain a local state per "session" and RPCs of a session are always
routed to the same backend (as long as it is in a valid state) that is
assigned to that session. The valid state for a backend typically includes the
`HEALTHY` and `DRAINING` states, with special semantics for the `DRAINING`
state. This specification is based on and is compatible with stateful session
persistence as implemented in Envoy. See Envoy PRs 17848 and 18207.
Some users of proxyless gRPC need to use "stateful session affinity" in gRPC. In this usage scenario, backend services maintain local state for RPC sessions and once a particular "session" is assigned to a particular backend (by the load-balancer), all future RPCs of that session must be routed to the same backend.
This can be achieved by using HTTP cookies in the following manner. A gRPC client application sends the first RPC in a session without a cookie in the RPC. gRPC routes that RPC to some backend B1 based on the configured LB policy. When gRPC receives the response for that first RPC from B1, it attaches a configured cookie to the response before delivering the response to the client application. The cookie encodes some identifying information about B1 (for example its IP-address). The client application saves the cookie and uses that cookie in all future RPCs of that session (e.g. by using a separate cookie jar for each session). gRPC uses the cookie and gets the encoded backend information to route the RPC to the same backend B1 if it is still reachable.
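For illustration, using the sample cookie configuration shown later in this document (cookie name `global-session-cookie`, path `/Package1.Service2/Method3`, ttl 120s), the exchange could look roughly as follows. The exact cookie attributes and value formatting are determined by the Envoy-compatible cookie implementation; the trace below is only an assumed example, with backend B1 at `10.0.0.1:50051`.

```
First RPC of the session (no cookie yet), routed by the configured LB policy:
  :path: /Package1.Service2/Method3

Response as delivered to the client application (set-cookie attached by gRPC;
the value is the base64-encoded backend address "10.0.0.1:50051"):
  set-cookie: global-session-cookie=MTAuMC4wLjE6NTAwNTE=; Max-Age=120; Path=/Package1.Service2/Method3

Subsequent RPCs of the session (cookie supplied by the client application),
routed back to B1:
  :path: /Package1.Service2/Method3
  cookie: global-session-cookie=MTAuMC4wLjE6NTAwNTE=
```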
With stateful session affinity, session draining of backends is an important
requirement. In this use case, the user wants to stop or restart service
backends (say, to optimize resources or to perform upgrades). However, a
terminating backend first needs to "drain" all assigned sessions before
exiting. In this state, which we call `DRAINING`, the terminating backend is
only sent RPCs of its existing assigned sessions and receives no new traffic.
Once all assigned sessions have finished (or a timeout triggers), the backend
application exits, at which point the backend is removed from the endpoints
list (EDS). This design includes support for session draining.
The goal of this design is to make it consistent with the Envoy implementation so it will interoperate with it. For example, one can create an xDS configuration for Envoy and the exact same configuration will work with proxyless gRPC. Similarly, an application using some cookie implementation that works with Envoy will continue to work with proxyless gRPC without any modifications.
The design involves the following parts.
The `EdsUpdate` struct from the `XdsClient` will be enhanced to include the
endpoint health status for each endpoint. The `XdsClient` will also be modified
to include endpoints in the `DRAINING` state in the `EdsUpdate` reported to the
watchers.
The feature is enabled and configured using the cookie-based stateful session
extension, which is an HTTP filter in the `HttpConnectionManager` for the
target service. A sample configuration is as follows:
```
{ // HttpConnectionManager
  http_filters: [
    {
      name: "envoy.filters.http.stateful_session"
      typed_config: {
        "@type": "type.googleapis.com/envoy.extensions.filters.http.stateful_session.v3.StatefulSession"
        session_state: {
          name: "envoy.http.stateful_session.cookie"
          typed_config: {
            "@type": "type.googleapis.com/envoy.extensions.http.stateful_session.cookie.v3.CookieBasedSessionState"
            cookie: {
              name: "global-session-cookie"
              path: "/Package1.Service2/Method3"
              ttl: "120s"
            }
          }
        }
      }
    }
  ]
}
```
The fields and their validation are described below. In case of failed validation, the config will be NACKed.
- The `typed_config` inside the `session_state` struct must be of the type
  `type.googleapis.com/envoy.extensions.http.stateful_session.cookie.v3.CookieBasedSessionState`.
- `name` is the name of the cookie used by this feature. It must not be empty.
- `path` is the path for which this cookie is applied. It is optional; if not
  present, the default is the root path `/`, in which case all paths are
  considered. Note: stateful session affinity will not work if the cookie's
  path does not match the request's path. For example, if a given route matches
  requests with path `/service/foo` but the corresponding per-route filter
  config specifies a cookie path of `/service/bar`, then the cookie will never
  be set.
- `ttl` is the time-to-live for the cookie. It is set on the cookie but is not
  enforced by gRPC. `ttl` is optional and, if present, must be non-negative. If
  not present, the default is 0.
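As an illustration of these rules, the following Go sketch shows the validation and defaulting described above. The names `cookieConfig` and `validateCookieConfig` are hypothetical; the actual validation is performed by the gRPC implementation when it processes the filter config.

```go
// Hypothetical sketch of the validation rules above; cookieConfig and
// validateCookieConfig are illustrative names, not a gRPC API.
package statefulsession

import (
	"errors"
	"time"
)

type cookieConfig struct {
	Name string        // cookie name; must be non-empty
	Path string        // optional; defaults to "/" (all paths considered)
	TTL  time.Duration // optional; must be non-negative, defaults to 0
}

func validateCookieConfig(c cookieConfig) (cookieConfig, error) {
	if c.Name == "" {
		// An empty cookie name fails validation, so the config is NACKed.
		return cookieConfig{}, errors.New("cookie name must not be empty")
	}
	if c.Path == "" {
		// Default is the root path, in which case all request paths are considered.
		c.Path = "/"
	}
	if c.TTL < 0 {
		// A negative ttl fails validation, so the config is NACKed.
		return cookieConfig{}, errors.New("cookie ttl must be non-negative")
	}
	return c, nil
}
```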
This new filter will be added to the HTTP Filter Registry and processed as
described in A39: xDS HTTP Filter Support. Specifically, when this filter is
present in the configuration, gRPC will create the appropriate client filter
(aka interceptor in Java/Go) and install it in the channel to process data
plane RPCs. We call this filter `StatefulSessionFilter`. It will copy the 3
configuration values (`name`, `ttl`, and `path`) into the filter object.
Validation and processing are further described in the Stateful session
section.
Note that StatefulSessionPerRoute will be supported as Filter Config
Overrides. For example, let's say the top-level config (in
`HttpConnectionManager.http_filters`) contains a certain `StatefulSession`
configuration. This configuration can either be overridden or disabled by a
StatefulSessionPerRoute config for a particular virtual host or route, through
a merged configuration as described in A39 Filter Config Overrides.

The filter is described in more detail in the section `StatefulSessionFilter`
Processing.
Even with stateful session affinity enabled, gRPC will only send an RPC to a
valid backend (host) as defined by `common_lb_config.override_host_status`.
When the backend state is invalid, gRPC will use the configured load balancing
policy to pick the backend.
As described above, the `XdsClient` will be modified to include endpoints with
the health status `DRAINING` in addition to `HEALTHY` or `UNKNOWN`, and will
include the health status in the update. As a result, gRPC will only consider
the `UNKNOWN`, `HEALTHY`, or `DRAINING` states that are specified in
`common_lb_config.override_host_status` and ignore all other values. We
considered the alternative of NACKing (i.e. rejecting) the CDS update when
unsupported values are present, but that was rejected because of the difficulty
of using such configuration in mixed deployments.
The `override_host_status` value will be included both in the config of the
`xds_override_host_experimental` policy and in the `DiscoveryMechanism` inner
message of the config of the `xds_cluster_resolver_experimental` policy.
A channel configured for stateful session affinity will have the
`StatefulSessionFilter` installed (an interceptor in Java/Go). This filter will
process incoming RPCs and set an appropriate LB pick value which will be passed
to the load balancing policy's Pick method.

When an RPC arrives on a configured channel, the Config Selector processes it
as described in A31: gRPC xDS Config Selector Design. This includes the
`StatefulSessionFilter`, which is installed as one of the filters in the
channel. Note that the filter maintains a context for an RPC and processes both
the outgoing request and incoming response of the RPC within that context. Also
note that the filter is initialized with the 3 configuration values (`name`,
`ttl`, and `path`) described above.
The filter performs the following steps:
- If the incoming RPC "path" does not match the `path` configuration value,
  skip any further processing (including response processing) and just pass the
  RPC through (to the next filter). For example, if the `path` configuration
  value is `/Package1.Service2` and the RPC method is
  `/Package2.Service1/Method3`, then just pass the RPC through. Note: for path
  matching, the rules described in RFC 6265 section 5.1.4 apply (see the
  illustrative sketch after this list).
- Search the RPC headers (metadata) for header(s) named `Cookie` and get the
  set of all the cookies present. If the set does not contain a cookie whose
  name matches the `name` configuration value of the filter, then pass the RPC
  through. If more than one cookie has that `name`, just pick the first one.
  Get the value of the cookie and base64-decode it to get the `override-host`
  for the RPC. This value should be a syntactically valid IP:port address; if
  not, log a local warning and pass the RPC through. For a valid
  `override-host` value, set it in the RPC context as the value of the
  `upstreamAddress` state variable, which will be used in the response
  processing of that RPC.
- In case of a valid `override-host` value, pass this value as a
  "pick-argument" to the load balancer's pick method as described in A31. For
  example:
  - in Java, use `CallOptions` to pass this value.
  - in Go, pass it via `Context`.
  - in C-core, this will be via the `GRPC_CONTEXT_SERVICE_CONFIG_CALL_DATA`
    call context element, whose API will be extended to allow the filter to add
    a new call element.
- In the response path, if the filter was not skipped based on the RPC path,
  get the value of `upstreamAddress`. Also get the peer address for the RPC;
  let's call it `hostAddress` here. For example, in Java you get the peer
  address through the `Grpc.TRANSPORT_ATTR_REMOTE_ADDR` attribute of the
  current ClientCall. If `upstreamAddress` is not set, or `upstreamAddress` and
  `hostAddress` are different, then create a cookie using our cookie config,
  which has the `name`, `ttl`, and `path` values. Set the value of the cookie
  to be the base64-encoded string value of `hostAddress`. Add this cookie to
  the response headers. The logic in pseudo-code is:

  ```
  if (!upstreamAddress.has_value() || hostAddress != upstreamAddress) {
    String encoded_address = base64_encode(hostAddress);
    Create a Cookie from our cookie configuration;
    Set cookie's value as encoded_address;
    Add the cookie header to the response;
  }
  ```
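To make the above steps concrete, here is a hedged Go sketch of the cookie handling the filter performs: RFC 6265 path matching, extraction and validation of the `override-host` from the request's cookies, and construction of the `Set-Cookie` value for the response. This is not the C-core implementation; the helper names and the reuse of `net/http` cookie parsing are illustrative assumptions.

```go
// Illustrative sketch only; not the actual gRPC implementation.
package statefulsession

import (
	"encoding/base64"
	"fmt"
	"net"
	"net/http"
	"strings"
	"time"
)

// pathMatches implements cookie path matching per RFC 6265 section 5.1.4.
func pathMatches(requestPath, cookiePath string) bool {
	if requestPath == cookiePath {
		return true
	}
	if strings.HasPrefix(requestPath, cookiePath) {
		return strings.HasSuffix(cookiePath, "/") || requestPath[len(cookiePath)] == '/'
	}
	return false
}

// overrideHostFromCookie finds the first cookie with the configured name,
// base64-decodes its value, and requires a syntactically valid IP:port,
// mirroring the request-side step above. An empty result means "pass the RPC through".
func overrideHostFromCookie(cookieHeaders []string, cookieName string) (string, error) {
	// Reuse net/http's cookie parsing by wrapping the headers in a fake request.
	req := http.Request{Header: http.Header{"Cookie": cookieHeaders}}
	for _, c := range req.Cookies() {
		if c.Name != cookieName {
			continue
		}
		decoded, err := base64.StdEncoding.DecodeString(c.Value)
		if err != nil {
			return "", err
		}
		if _, _, err := net.SplitHostPort(string(decoded)); err != nil {
			return "", fmt.Errorf("cookie value %q is not a valid IP:port", decoded)
		}
		return string(decoded), nil
	}
	return "", nil
}

// buildSetCookie builds the Set-Cookie header added in the response path when
// upstreamAddress is unset or differs from the peer (host) address. The exact
// attributes (Max-Age, Path) are assumed here; the authoritative behavior is
// whatever the Envoy-compatible cookie implementation produces.
func buildSetCookie(cookieName, cookiePath, hostAddress string, ttl time.Duration) string {
	cookie := http.Cookie{
		Name:   cookieName,
		Value:  base64.StdEncoding.EncodeToString([]byte(hostAddress)),
		Path:   cookiePath,
		MaxAge: int(ttl.Seconds()),
	}
	return cookie.String()
}
```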
After the `StatefulSessionFilter` has passed the `override-host` value to the
load balancer as a "pick-argument", the load balancer uses this value to route
the RPC appropriately. The required logic will be implemented in a new and
separate LB policy at an appropriate place in the hierarchy.
RPC load reporting happens in the `xds_cluster_impl_experimental` policy, and
we do want all RPCs to be included in the load reports. Hence the new policy
needs to be just below `xds_cluster_impl_experimental` as its child policy.

Note that the new policy will be beneath the `priority_experimental` policy,
which means that if we pick a host for a session while we are failed over to a
lower priority and then a higher priority comes back online, we will switch
hosts at that point; this behavior is different from Envoy.
Let's call the new policy `xds_override_host_experimental`. This policy
contains subchannel management, endpoint management, and RPC routing logic as
follows:
- Maintain a map; let's call it `address_map`. The key is (IP:port) and the
  value is the tuple `[subchannel, connectivity-state, EDS health-status, list
  of other equivalent addresses]`. When a host (endpoint) has multiple
  addresses (e.g. IPv6 vs IPv4, or due to multiple network interfaces), they
  are said to be equivalent addresses. Whenever we get an updated list of
  addresses from the resolver, we create an entry in the map for each address
  and populate the list of equivalent addresses and the EDS health-status. The
  wrapped subchannel and the `connectivity-state` are updated based on
  subchannel creation/destruction or connection/disconnection events from the
  child policy. The equivalent-address-list computation on each resolver update
  is performed as shown in the following pseudo-code (an illustrative sketch of
  this per-address state appears after the picker discussion below):
  ```
  // Resolver update contains a list of EquivalentAddressGroups,
  // each of which has a list of addresses and a health status.
  all_addresses = set()
  for eag in resolver_update:
    for address in eag.addresses:
      lb_policy.address_map[address].equivalent_addresses = eag.addresses
      lb_policy.address_map[address].health_status = eag.health_status
      all_addresses.insert(address)
  // Now remove equivalencies for any address not found in the current update.
  for address, entry in lb_policy.address_map:
    if address not in all_addresses:
      entry.equivalent_addresses.clear()
  ```
- The `xds_override_host_experimental` policy will always remove addresses in
  state `DRAINING` when passing an update down to its child policy, regardless
  of the value of `override_host_status`.
- When the policy gets an update that changes an address's health status to
  `DRAINING`, if (and only if) the `override_host_status` set includes
  `DRAINING`, it will avoid closing the connection on that subchannel, even
  when the child policy tries to do so. Instead, this policy will maintain the
  connection until either (a) the connection closes due to the server closing
  it or a network failure, or (b) the policy receives a subsequent EDS update
  that completely removes the endpoint.
- Whenever a new subchannel is created (by the child policy that is routing an
  RPC; see below), update the entry in `address_map` (keyed by the subchannel
  address, i.e. the peer address) with the new subchannel value in the tuple
  `[subchannel, connectivity-state, EDS health-status, list of other equivalent
  addresses]`.
  - In Java and Go, we may have to wait for the subchannel to become `READY`
    in order to know which specific address the subchannel is connected to
    before we add it to the map.
  - In C-core and Node, LB policies create a new subchannel for every address
    every time there is a new address list. In the map, we will just replace an
    existing entry with the most recent subchannel created for a given address,
    i.e. the new subchannel will replace the older one for that address.
- Whenever a subchannel is shut down, remove the associated entries from
  `address_map`.
  - This is achieved by wrapping the subchannel to intercept subchannel
    shutdown; the wrapper is returned by `createSubchannel`.
- The policy's subchannel picker pseudo-code is as follows. `policy_config` is
  the `OverrideHostLoadBalancingPolicyConfig` object for this policy. Note:
  there needs to be synchronization between the map updates described above and
  the following picker logic. Ideally this would be a global MRSW lock, where
  the update code holds the lock for Write and a picker holds the lock for Read
  while scanning the map.
  ```
  if override_host is set in pick arguments:
    entry = lb_policy.address_map[override_host]
    if entry found:
      idle_subchannel = None
      found_connecting = False
      if entry.subchannel is set AND
          entry.health_status is in policy_config.override_host_status:
        if entry.subchannel.connectivity_state == READY:
          return entry.subchannel as pick result
        elif entry.subchannel.connectivity_state == IDLE:
          idle_subchannel = entry.subchannel
        elif entry.subchannel.connectivity_state == CONNECTING:
          found_connecting = True
      // Java-only, for now: check equivalent addresses
      for address in entry.equivalent_addresses:
        other_entry = lb_policy.address_map[address]
        if other_entry.subchannel is set AND
            other_entry.health_status is in policy_config.override_host_status:
          if other_entry.subchannel.connectivity_state == READY:
            return other_entry.subchannel as pick result
          elif other_entry.subchannel.connectivity_state == IDLE:
            idle_subchannel = other_entry.subchannel
          elif other_entry.subchannel.connectivity_state == CONNECTING:
            found_connecting = True
      // No READY subchannel found. If we found an IDLE subchannel,
      // trigger a connection attempt and queue the pick until that attempt
      // completes.
      if idle_subchannel is not None:
        hop into control plane to trigger connection attempt for idle_subchannel
        return queue as pick result
      // No READY or IDLE subchannels. If we found a CONNECTING
      // subchannel, queue the pick and wait for the connection attempt
      // to complete.
      if found_connecting:
        return queue as pick result
  // override_host not set or did not find a matching subchannel,
  // so delegate to the child picker
  return child_picker.Pick()
  ```
In the above logic we prefer a subchannel in the `READY` state for a different
equivalent address instead of waiting for the subchannel for the original
`override-host` address to become `READY` (from one of the `IDLE`,
`CONNECTING`, or `TRANSIENT_FAILURE` states), because we assume that the
equivalent address points to the same host, thereby maintaining session
affinity with the pick.

If we do not find a `READY` subchannel and find an `IDLE` one, we trigger a
connection attempt on that subchannel and return queue as the pick result
(i.e. the RPC stays buffered). If we instead find a `CONNECTING` subchannel, we
just return queue as the pick result, i.e. the RPC stays buffered until the
connection attempt completes. If we do not find a subchannel, we just delegate
to the child policy.
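For illustration, the per-address state and the reader/writer lock described above could be modeled as in the following Go sketch. The type and field names are hypothetical and not part of gRPC; they simply mirror the tuple `[subchannel, connectivity-state, EDS health-status, list of other equivalent addresses]`.

```go
// Hypothetical model of the xds_override_host_experimental policy's
// address_map entry; not an actual gRPC API.
package overridehost

import "sync"

type connectivityState int

const (
	idle connectivityState = iota
	connecting
	ready
	transientFailure
)

type healthStatus int

const (
	unknown healthStatus = iota
	healthy
	draining
)

// addressEntry is the value stored in address_map for each "IP:port" key.
type addressEntry struct {
	subchannel          interface{}       // wrapped subchannel, if any
	connectivity        connectivityState // last known connectivity state
	health              healthStatus      // EDS health status from the latest update
	equivalentAddresses []string          // other addresses of the same endpoint
}

// overrideHostPolicy holds the map plus the MRSW lock used to synchronize
// resolver/subchannel updates (writers) with the picker (readers).
type overrideHostPolicy struct {
	mu         sync.RWMutex
	addressMap map[string]*addressEntry // key is "IP:port"
}
```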
Note that we unconditionally create the `xds_override_host_experimental` policy
as the child of `xds_cluster_impl_experimental` even if the feature is not
configured (in which case, it will be a no-op). A new config for the
`xds_override_host_experimental` policy will be defined as follows:
```
// Configuration for the override_host LB policy.
message OverrideHostLoadBalancingPolicyConfig {
  enum HealthStatus {
    UNKNOWN = 0;
    HEALTHY = 1;
    DRAINING = 3;
  }

  // valid health status for hosts that are considered when using
  // xds_override_host_experimental policy.
  // Default is [UNKNOWN, HEALTHY]
  repeated HealthStatus override_host_status = 1;

  repeated LoadBalancingConfig child_policy = 2;
}
```
The `xds_cluster_resolver_experimental` config will be modified so that the
`DiscoveryMechanism` inner message contains an additional
`override_host_status` field of type `repeated HealthStatus`.

The `cds_experimental` policy will copy the
`common_lb_config.override_host_status` value it gets in the CDS update from
the `XdsClient` to the appropriate `DiscoveryMechanism` entry's
`override_host_status` field in the config of the
`xds_cluster_resolver_experimental` policy. If the field is unset, we treat it
the same as if it was explicitly set to the default set, which is
[`UNKNOWN`, `HEALTHY`].

The `xds_cluster_resolver_experimental` LB policy will copy the
`override_host_status` value from its config's appropriate `DiscoveryMechanism`
entry into the config for the `xds_override_host_experimental` policy. The
diagram below shows how the configuration is passed down the hierarchy all the
way from `cds_experimental` to `xds_override_host_experimental`.
One of the existing policies (`xds_wrr_locality_experimental` or
`ring_hash_experimental`) is created as the child policy of
`xds_override_host_experimental`, as shown in the diagram below.

To propagate `common_lb_config.override_host_status`, the following changes are
required:
- The `XdsClient` will parse the `common_lb_config.override_host_status` field
  from the `Cluster` resource and copy the value to a new
  `override_host_status` field in the CDS update that it sends to the watchers.
- The `cds_experimental` policy (which receives the CDS update) will copy the
  value of `override_host_status` into the appropriate `DiscoveryMechanism`
  entry's `override_host_status` field in the config of the
  `xds_cluster_resolver_experimental` policy.
- When the `xds_cluster_resolver_experimental` policy generates the config for
  each priority, it will configure the `xds_override_host_experimental` policy
  underneath the `xds_cluster_impl_experimental` policy. The generated config
  for the `xds_override_host_experimental` policy will include the
  `override_host_status` field from the `DiscoveryMechanism` associated with
  that priority.
Note that if stateful session affinity and retries (gRFC A44) are both configured, all retry attempts will be routed to the same endpoint.
During initial development, this feature will be enabled by the
`GRPC_EXPERIMENTAL_XDS_ENABLE_OVERRIDE_HOST` environment variable. This
environment variable protection will be removed once the feature has proven
stable. The environment variable protection will specifically apply to the
following:
- processing of the cookie-based stateful session extension HTTP filter. Note:
  `XdsClient` will only accept the filter when the environment variable is set;
  otherwise it will respond as it responds today.
- processing of `common_lb_config.override_host_status` in the CDS update
- inclusion of `DRAINING` state endpoints in the `EdsUpdate` reported to
  watchers
If the environment variable is unset or not true, none of the above will be enabled.
The stateful session affinity requirement is not satisfied by the ring-hash LB
policy, because any time a backend host is removed or added, the hash mapping
of 1/N client sessions is affected (where N is the total number of backends).
That would lead to dropped requests in some cases, which is unacceptable. This
feature avoids the problem by using HTTP cookies as described above.
We are planning to implement the design in gRPC C-core; there are no current
plans to implement it in any other language. Wrapped languages should be able
to use the feature provided they can do cookie management in their client code.
Cookies are rarely used with gRPC, so the implementations should also include a
recipe for cookie management. This could involve illustrative examples using
cookie-jar implementations that are available in various languages such as
Java, Go, and Node.js. Wherever possible, the implementations may also include
a reference implementation (which could be interceptor based) that does cookie
management using cookie jars. If based on interceptors, the interceptor is
instantiated for a session; it processes the `Set-Cookie` header in a response
to save the cookie, and uses that cookie in subsequent requests.
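As one possible shape for such a recipe, the following Go sketch shows a per-session unary client interceptor that stores cookies from `set-cookie` response headers in a `net/http/cookiejar` and attaches them to subsequent requests of the same session. This is illustrative only and not part of any gRPC release; the function name and the use of a pseudo-URL for the jar are assumptions.

```go
// Illustrative per-session cookie-jar interceptor; not an official gRPC API.
package session

import (
	"context"
	"net/http"
	"net/http/cookiejar"
	"net/url"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// NewSessionCookieInterceptor returns a unary client interceptor that keeps
// cookies for one logical session. target is only used as the pseudo-URL under
// which the cookie jar stores cookies (e.g. "https://myservice.example.com").
func NewSessionCookieInterceptor(target string) (grpc.UnaryClientInterceptor, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	base, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		// The RPC method (e.g. "/Package1.Service2/Method3") acts as the URL
		// path for cookie path matching.
		reqURL := *base
		reqURL.Path = method
		// Attach any cookies previously stored for this session.
		if cookies := jar.Cookies(&reqURL); len(cookies) > 0 {
			var pairs []string
			for _, c := range cookies {
				pairs = append(pairs, c.Name+"="+c.Value)
			}
			ctx = metadata.AppendToOutgoingContext(ctx, "cookie", strings.Join(pairs, "; "))
		}
		// Capture response headers so we can see any set-cookie values.
		var header metadata.MD
		opts = append(opts, grpc.Header(&header))
		err := invoker(ctx, method, req, reply, cc, opts...)
		// Save cookies returned in the response (e.g. the stateful session cookie).
		if setCookies := header.Get("set-cookie"); len(setCookies) > 0 {
			resp := http.Response{Header: http.Header{"Set-Cookie": setCookies}}
			jar.SetCookies(&reqURL, resp.Cookies())
		}
		return err
	}, nil
}
```

A client would create one such interceptor per logical session and install it via `grpc.WithUnaryInterceptor` when creating the channel for that session, so that each session gets its own cookie jar.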