Ephemeral Resource Attributes #208
Changes from all commits
3e6a2be
e06b96c
381fb82
7852cf3
25e17a5
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,205 @@ | ||
# Ephemeral Resource Attributes | ||
|
||
Define a new type of resource attribute, ephemeral resources, which are allowed to change over the lifetime of the process. Existing resources are redefined as permanent resources, which must be present at SDK initialization and cannot be changed. | ||
|
||
## Motivation | ||
|
||
Server applications, which OpenTelemetry was initially designed around, simultaneously handle many unrelated transactions. In other types of applications, such as client applications, all events and transactions within the process are associated with a single user or activity. These applications often include "global" concepts which are important to telemetry. Examples include session ID, language preference, time zone, and location data. These concepts must be represented as attributes in order to correctly report the state of client applications. | ||
|
||
Since the state being recorded is global to the process, it matches our concept of a resource attribute, as resources are applied to all telemetry emitted by the SDK. However, unlike our current concept of a resource attribute, these attributes may change their value over the life of the application. This OTEP proposes a mechanism for extending the concept of a resource, in order to efficiently and accurately record these attributes while still preserving the immutability requirements of currently defined resource attributes. | ||
|
||
## Explanation | ||
|
||
There are two types of resource attributes, **permanent** and **ephemeral**. Attributes which are labeled as permanent in the semantic conventions must be present when the SDK is initialized. They cannot be added or updated afterwards. | ||
|
||
Resources are managed via a ResourceProvider. Setting an attribute on a ResourceProvider will cause that attribute value to be included in the resource attached to any signal generated in the future. Spans which have already been started, along with any telemetry which has already been passed to the export pipeline, will not have the new attribute value. Optionally, a check can be added to ensure that permanent resources are not modified after the SDK has started. | ||
If the nested attributes proposal is accepted, then one way to simplify ephemeral resources validation is to have just one attribute called

I don't see how this would simplify things? You then still have an attribute that needs special handling. Whether it is by name or with an explicit label would not make things more/less simple, would it? The nested attributes proposal also does not require SDKs to implement them. If we want ephemeral attributes to depend on that, it would mean that SDKs could also not implement ephemeral attributes.

If I understand the proposal correctly, it requires that the permanent attributes be marked so in the semantic conventions. This is the part that will not be required if we limit the special handling to only one attribute with a known name. Consider the following resource. The
Anyway, this is an optimization step. Let's ignore this initially until the larger proposal gets acceptance.

Marking something in the semantic conventions is just that: A convention. If we want something to be conventionally ephemeral, we still need to have a note about that in the semantic conventions one way or another.

I agree that it would simplify things if ephemeral resources were kept separate from other resources. A validator is also something which can be run in development, but disabled in production, which would work as an optimization.

Also, one aside on nested attributes: my assumption is that attribute values wouldn't be merged, they would be replaced. In other words, there is still only a single string key per attribute, but with the option of storing an object, map, or array as the value for that attribute. If you set a new value for the key, it would throw the old value away.
||
|
||
|
||
## Internal details | ||
|
||
### ResourceProvider | ||
|
||
#### NewResourceProvider([resource], [validator]) ResourceProvider | ||
|
||
NewResourceProvider instantiates an implementation of the ResourceProvider interface. As arguments, it optionally takes an initial set of resource attributes and a validator. | ||
|
||
The ResourceProvider interface has the following methods: | ||
|
||
#### MergeResource(resource) | ||
|
||
MergeResource creates a new resource, representing the union of the resource parameter and the resource contained within the Provider. The ResourceProvider holds a reference to the new resource. | ||
|
||
#### SetAttribute(key, value) | ||
|
||
SetAttribute functions the same as MergeResource, but only adds a single attribute. | ||
|
||
#### GetResource() Resource | ||
|
||
GetResource returns a reference to the current resource held by the ResourceProvider. | ||
|
||
#### FreezePermanent() | ||
|
||
FreezePermanent is called by the SDK once it has been started. After FreezePermanent has been called, any calls to MergeResource or SetAttribute will only be applied if the validator accepts the input. | ||
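
The validator contract described above can be sketched in Python as a plain predicate over attribute keys. This is only an illustration under assumptions: the OTEP does not fix a validator signature, and the names `Validator`, `new_deny_list_validator`, and `new_allow_list_validator` are hypothetical stand-ins for the deny-list and allow-list validators used in the usage example later in this document.

```python
from typing import Callable, Iterable

# Hypothetical type: a validator receives an attribute key and returns True
# when that key may still be set after FreezePermanent has been called.
Validator = Callable[[str], bool]


def new_deny_list_validator(permanent_keys: Iterable[str]) -> Validator:
    """Reject every key listed as permanent; accept all other keys."""
    denied = frozenset(permanent_keys)
    return lambda key: key not in denied


def new_allow_list_validator(ephemeral_keys: Iterable[str]) -> Validator:
    """Accept only keys explicitly listed as ephemeral; reject all others."""
    allowed = frozenset(ephemeral_keys)
    return lambda key: key in allowed


# Illustrative key lists; which keys are permanent is decided elsewhere.
deny = new_deny_list_validator(["service.name", "host.name"])
allow = new_allow_list_validator(["session.id"])
```

The two variants differ in their default: a deny list lets unknown keys through, while an allow list blocks them, which keeps the list short when only a handful of ephemeral keys exist.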
|
||
#### Implementation Notes | ||
|
||
For multithreaded systems, a lock should be used to queue all calls to MergeResource and SetAttribute. But the resource reference held by the ResourceProvider should be updated atomically, to prevent calls to GetResource from being blocked. | ||
|
||
### SDK Changes | ||
|
||
NewTraceProvider, NewMetricsProvider, and NewLogProvider now take a ResourceProvider as a parameter. For backwards compatibility, the Resource parameter remains functional. If both a resource and a resource provider are passed as parameters, the resource is merged into the ResourceProvider, then discarded. | ||
|
||
FreezePermanent is then called by the provider. | ||
|
||
Internally, providers hold a reference to the ResourceProvider, rather than a specific resource. When creating a signal, such as a span, metric, or log, GetResource() is called to obtain a reference to the correct resource to attach to the signal. | ||
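
A minimal Python sketch of this lookup-at-creation behavior follows. The class and method names (`ResourceProvider`, `TracerProvider`, `Span`, `start_span`) are illustrative stand-ins rather than an existing SDK API, and locking, validation, and freezing are left out to keep the focus on when the resource is read.

```python
from types import MappingProxyType
from typing import Mapping


class ResourceProvider:
    """Stand-in provider: holds a reference to an immutable resource."""

    def __init__(self, attributes: Mapping[str, str]):
        self._resource = MappingProxyType(dict(attributes))

    def get_resource(self) -> Mapping[str, str]:
        return self._resource

    def set_attribute(self, key: str, value: str) -> None:
        # Build a new read-only mapping and replace the held reference;
        # previously returned resources are left untouched.
        self._resource = MappingProxyType({**self._resource, key: value})


class Span:
    def __init__(self, name: str, resource: Mapping[str, str]):
        self.name = name
        self.resource = resource


class TracerProvider:
    """Holds the ResourceProvider rather than one specific resource."""

    def __init__(self, resource_provider: ResourceProvider):
        self._resource_provider = resource_provider

    def start_span(self, name: str) -> Span:
        # The resource is looked up when the signal is created.
        return Span(name, self._resource_provider.get_resource())


provider = ResourceProvider({"service.name": "example-service", "session.id": "s1"})
tracer_provider = TracerProvider(provider)

first = tracer_provider.start_span("before-change")
provider.set_attribute("session.id", "s2")
second = tracer_provider.start_span("after-change")
```

In this sketch `first` keeps the resource it was created with, while `second` sees the updated `session.id`, matching the rule that already-started spans do not receive new attribute values.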
|
||
## Trade-offs and mitigations | ||
|
||
This change should be fully backwards compatible, with one potential exception: fingerprinting. It is possible that an analysis tool which accepts OTLP may identify individual services by creating an identifier by hashing all of the resource attributes. | ||
There is another issue: Exporters right now may be implemented to assume they only ever deal with spans with the same resource. With this proposal, they could receive a batch of mixed spans. Also there may be exporters for protocols that only support a single resource per connected agent. They would then probably need to stamp the ephemeral attributes on every single telemetry item. Similar issues may apply to span processors. (And possibly samplers that receive a resource in their constructor, but I don't think that will be a problem in practice open-telemetry/opentelemetry-specification#1658)

Actually, exporters must deal with more than one resource already, which is what made this change so simple!

There is an issue open for that: open-telemetry/opentelemetry-specification#1690

Ok, I agree this should be clarified. My understanding is that a BatchSpanProcessor may be shared across multiple SDKs within the same process, and that is done in order to have different sets of resources for different sub-processes. So there is no guarantee that all spans in a batch have the same resource. I know that @MSNev has examples of this pattern. But, I think that this pattern is extremely rare, so it doesn't surprise me that Dynatrace and other exporters could take a shortcut without anyone noticing.

Our examples are (currently) built on our internal (not OpenTelemetry) SDKs on clients where multiple teams provide different components to the same "view" (page etc) and need / want to report telemetry to their own backends. And in some runtimes we have a single batching system which is shared, rather than having each component on the view creating its own SDK instance with all of the overhead and batching mechanisms. This reduces the runtime impact on resources for the client (CPU, Memory, etc).
||
|
||
In this case, it is recommended that these systems modify their behavior, and choose a subset of permanent resources to use as a hash identifier. | ||
That might be a pretty big deal for some, if they only allow storing one set of resource attributes per hash.

@open-telemetry/specs-approvers Please take a look - I suspect we may need a lot of eyes, in case somebody relies on this right now.

It seems crazy to me to use a resource hash as an identifier, given that there is no requirement that the items within it would uniquely identify a service... But I'm throwing it out there as a possibility, just to cover all the bases.

You should be using something that doesn't exist yet, instead of hashing the whole resources: open-telemetry/opentelemetry-specification#1034 (EDIT: To clarify: We don't do/need this at Dynatrace, I don't know anybody who does. Just a side note)

Yes I agree! There are various attributes which could count as a unique identifier. We could clarify in the spec which ones are currently defined. One possibility: by default, the SDK could generate a unique ID every time it starts, which would be a reliable identifier because we generate it ourselves. However, this identifier would not be stable across restarts. So there are limits to what can be provided without user input.
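
The recommendation to hash only a subset of permanent resources could look roughly like the Python sketch below. The choice of `service.name`, `service.namespace`, and `service.instance.id` as identity keys is an assumption made for illustration, as are the `IDENTITY_KEYS` and `service_fingerprint` names; the OTEP does not say which permanent attributes a backend should use.

```python
import hashlib
import json
from typing import Iterable, Mapping

# Assumption: these keys are treated as permanent and identifying.
IDENTITY_KEYS = ("service.name", "service.namespace", "service.instance.id")


def service_fingerprint(
    resource: Mapping[str, str],
    identity_keys: Iterable[str] = IDENTITY_KEYS,
) -> str:
    """Hash only the chosen permanent attributes of a resource."""
    subset = {k: resource[k] for k in identity_keys if k in resource}
    # Sorted keys and fixed separators give a stable byte encoding.
    encoded = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


session_one = {"service.name": "example-service", "session.id": "s1"}
session_two = {"service.name": "example-service", "session.id": "s2"}
other_service = {"service.name": "other-service", "session.id": "s1"}
```

With this approach `session_one` and `session_two` produce the same fingerprint even though `session.id` changed, while `other_service` produces a different one.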
||
|
||
## Prior art and alternatives | ||
|
||
An alternative to ephemeral resources would be to create span, metrics, and log processors which attach these ephemeral attributes to every instance of every signal. This would not require a modification to the specification. | ||
|
||
There are two problems with this approach. One is that the duplication of attributes is very inefficient. This is a problem on clients, which have limited network bandwidth. This problem is compounded by a lack of support for gzip and other compression algorithms in the browser. | ||
It would be great to quantify this. How inefficient is it? A benchmark demonstrating this would be a strong argument in favour of the proposed approach.

If processors can change scope attributes, they might be a good candidate to solve this as well.
I'm not an expert on browser stuff but can you expand on this? On its surface it seems wrong since gzipped static resources show up everywhere on the internet and there are js implementations of gzip (like this). This stackoverflow post suggests that a part of it is because a browser client can't know if the server can accept gzipped data, but OTLP requires gzip support.

Yes, it's uncommon because clients may not know if the server can accept compressed data. It is not clear to me if the gzip support in the OTLP spec refers to responses only (common for web services to provide) or requests as well (uncommon). I think there is also a danger of an attack on the server - compressed data could be expanded to a very large content. And lastly gzip compression is not native to browsers, so there is CPU overhead, which is important to consider for impact on user experience, especially when sending data while the page is unloading. Aside from that, I think that session ID specifically does not belong on individual signals. The session is a context for many signals in a given time period; it does not vary from signal to signal.

Never mind on the OTLP gzip support, I see it says that clients MAY gzip the content.

@tigrannajaryan Regarding the limited network bandwidth, the sendBeacon() API has a payload limit of 64KB. Assuming session.id attribute that looks like this when sent over the wire
It adds 79 bytes per each signal. The number of spans/events per export will depend on the type of application and which instrumentations are present. But assuming that 100 is plausible, this adds almost 8kB to the payload. This will further increase if we add additional context attributes (user attributes, URL etc.).

This is incorrect. The CompressionStreams API provides a native solution for this and is supported in Chromium-based browsers already.
Also benefits from caching. In a network-constrained situation the cost of retrieving the additional code is paid once and the result cached. Conditional requests and etags are your friend.
The additional code for compression is only needed to export telemetry and does not need to be loaded at the same time as the code enabling instrumentation. Deferring until an export is required can increase the time-to-export but would not impact time-to-interaction or any other user-focused timing.
This is inverted.

@Aneurysm9 sorry my bad, I meant

The cost to generate, serialize, and compress that many spans is also not a synchronous process that takes x milliseconds, but many small processes which each take a small fraction of X. It is most important to ensure that each individual step doesn't impact user experience. With the example of 100 spans otlp -> protobuf -> pako on the pixel 4a given, the whole process is 4.393ms but you have 2 chances to yield to the event loop to ensure user experience is not affected.

Have missed it but I generally don't consider new browser features as a solution unless usage% is >90% (and well, safari has a monopoly on ios so....) (also 90% is probably low considering how much RUM products are asked for IE11 support but they already have a miserable experience due to using IE in current year so making it optional is worth consideration)
There is one but - not when user is leaving the page, tho generally you don't have 100 spans then, making it a question of how much do you want to maintain 2 different code paths (a sync one and an "async" one)

@tigrannajaryan i just want to emphasize what @scheler said, that the purpose of this OTEP is not to avoid compression or gain efficiency, but to extend our data model in a way that correctly represents these attributes. If we don't want to extend the current Resource concept, we could add a new concept, call it ProccessScope or something similar, and have it work in effectively the same manner. Personally, I'd prefer we extend resources over adding a new scope. But I prefer both over an approach that makes it impossible to cleanly implement RUM using OpenTelemetry. In other words, I'm against "just tack on the process scope as span/event attributes" the same way I'd be opposed to "just tack on the instrumentation scope as span/event attributes." In both cases, yes it would "work." But it would create a headache for implementers and confusion for users. We should strive for a clean data model, where everything is explained just by looking at the data structure.

In situations where at least one of the ephemeral attributes changes very often, telemetry items are created between the changes and there are lots of permanent attributes, attaching them to the telemetry items ("signal instance") could even be more efficient. Generally, I wonder how many ephemeral attributes we expect relative to permanent ones.

We are not expecting large numbers of ephemeral attributes, nor are we expecting them to change with great frequency. The expectation is that there would be between 1 and 10 ephemeral attributes set on a client, which may update after 15 minutes of inactivity, after the application reawakens, or in response to a change in user or user settings.
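
The payload estimate quoted in the thread above (79 bytes per signal, 100 signals per export, a 64KB sendBeacon() limit) can be checked with a few lines of Python. The figures are taken from the discussion; reading "64KB" as 65,536 bytes and assuming the attribute has the same encoded size when recorded once on the resource are assumptions of this sketch.

```python
BYTES_PER_SESSION_ID = 79      # per-signal cost quoted in the discussion
SIGNALS_PER_EXPORT = 100       # batch size assumed in the discussion
SEND_BEACON_LIMIT = 64 * 1024  # assumption: "64KB" read as 65,536 bytes

# Cost when session.id is stamped on every span or event in the export.
per_signal_total = BYTES_PER_SESSION_ID * SIGNALS_PER_EXPORT

# Cost when session.id is recorded once on the resource
# (assumes the same encoded size as on a signal).
as_resource_total = BYTES_PER_SESSION_ID

share_of_limit = per_signal_total / SEND_BEACON_LIMIT

print(f"per-signal: {per_signal_total} bytes, "
      f"on resource: {as_resource_total} bytes, "
      f"share of beacon limit: {share_of_limit:.1%}")
```

Under these assumptions the per-signal approach spends 7,900 bytes, roughly 12% of the beacon budget, on one attribute, and the cost grows linearly with each additional ephemeral attribute.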
||
|
||
The second problem is that it becomes difficult to distinguish between ephemeral resources and other types of attributes. | ||
Is it needed to distinguish them by type? Usually the attribute keys should be all you need. E.g. if you have a

In the browser, the overhead of applying the As far as the need to differentiate, putting data in the proper envelope helps backend systems use it more effectively. You might ask, why have resources at all in OTLP? Why not simply apply resources as attributes on every span and event? Besides the inefficiency, it would make life very difficult for backend systems which want to apply different analysis to resources and span attributes.

Citation needed 😃

Please see this thread (#208 (comment)) for a lengthy discussion on data limitations in the browser. I don't think these arguments apply to #207, that proposal would be helpful imho. Just not a solution for ephemeral resources, since many of the events which need these resources happen when there is no trace present.
||
|
||
## Open questions | ||
|
||
The primary open question is whether any common backends are hashing the resource to obtain a service identifier. | ||
|
||
## Future possibilities | ||
|
||
Ephemeral resource attributes will be a critical feature for implementing RUM/client instrumentation in OpenTelemetry. | ||
|
||
Other application domains may discover that they have process-wide state which affects their performance or otherwise changes code execution, which would be valuable to record as an ephemeral resource. For example, applications may have a drain or shutdown phase which affects the behavior of the application. The ability to identify telemetry data which occurs during this phase may be valuable to some end users. | ||
|
||
## Example Usage | ||
|
||
Pseudocode example of a ResourceProvider in use. The resource provider is loaded with all available permanent resources, then passed to a TraceProvider. The ResourceProvider is also passed to a session manager, which updates an ephemeral resource in the background. | ||
|
||
```
var resources = {"service.name": "example-service"};

// Example of a deny list validator.
var validator = NewDenyListValidator(PERMANENT_RESOURCE_KEYS);

// Example of an allow list validator, shown as an
// alternative to the deny list validator above.
// This is useful for browser environments
// where loading a deny list would be too costly.
var validator = NewAllowListValidator(["session.id"]);

// The ResourceProvider is initialized with
// a dictionary of resources and a validator.
var resourceProvider = NewResourceProvider(resources, validator);

// The resourceProvider can be passed to resource detectors
// to populate additional permanent resources.
DetectContainerResources(resourceProvider);

// The TraceProvider now takes a ResourceProvider.
// The TraceProvider calls FreezePermanent on the ResourceProvider.
// After this point, it is no longer possible to update or add
// additional permanent resources.
var traceProvider = NewTraceProvider(resourceProvider);

// Whenever the SessionManager starts a new session
// it updates the ResourceProvider with a new session id.
sessionManager.OnChange(
  func(sessionID){
    resourceProvider.SetAttribute("session.id", sessionID);
  }
);
```
|
||
## Example Implementation | ||
|
||
Pseudocode examples for a possible Validator and ResourceProvider implementation. Attention is placed on making the ResourceProvider thread safe, without introducing any locking or synchronization overhead to `GetResource`, which is the only ResourceProvider method on the hot path for OpenTelemetry instrumentation. | ||
|
||
```
// Example of a simple validator.
class DenyListValidator{

  // Attribute keys can be stored in any
  // data structure with a fast implementation
  // for detecting set membership.
  Set denyList

  // Validate returns false for keys on the deny list.
  Validate(key){
    if(this.denyList.Contains(key)){
      return false;
    }
    return true;
  }
}

// Example of a thread-safe ResourceProvider
class ResourceProvider{
  *Resource resource
  Validator validator
  bool isFrozen
  Lock lock

  // GetResource is on the hot path and never takes the lock.
  GetResource(){
    return this.resource;
  }

  SetAttribute(key, value){
    // All methods on a ResourceProvider which perform mutations
    // must share a lock.
    this.lock.Acquire();

    // Only perform validation after the ResourceProvider
    // has been frozen.
    if(this.isFrozen && !this.validator.Validate(key)){
      this.lock.Release();
      return;
    }

    // Because Resource objects are immutable, it is safe to call
    // SetAttribute without locking GetResource
    var mergedResource = this.resource.SetAttribute(key, value);

    // Because ResourceProvider only stores a reference to
    // a Resource, that reference can be replaced using an
    // atomic swap operation. This approach allows GetResource
    // to remain lock-free.
    AtomicSwap(this.resource, mergedResource)

    this.lock.Release();
  }

  // MergeResource essentially has the same implementation as SetAttribute.
  MergeResource(resource){
    this.lock.Acquire();

    if(this.isFrozen){
      // Every key in the resource must be validated
      foreach(resource, key, value){
        if(!this.validator.Validate(key)){
          this.lock.Release();
          return;
        }
      }
    }

    var mergedResource = this.resource.Merge(resource)
    AtomicSwap(this.resource, mergedResource)

    this.lock.Release();
  }

  FreezePermanent(){
    this.lock.Acquire();
    this.isFrozen = true;
    this.lock.Release();
  }
}
```
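
For readers who want something executable, the pseudocode above can be approximated in Python as follows. This is a sketch under stated assumptions, not a reference implementation: the snake_case method names, the boolean return values, and the choice to accept every update when no validator is supplied are all additions for illustration, and the lock-free read relies on CPython replacing an attribute reference in a single step.

```python
import threading
from types import MappingProxyType
from typing import Callable, Mapping, Optional


class ResourceProvider:
    def __init__(
        self,
        attributes: Optional[Mapping[str, str]] = None,
        validator: Optional[Callable[[str], bool]] = None,
    ):
        # The resource is a read-only mapping that is never mutated in place.
        self._resource = MappingProxyType(dict(attributes or {}))
        self._validator = validator
        self._frozen = False
        self._lock = threading.Lock()  # shared by all mutating methods

    def get_resource(self) -> Mapping[str, str]:
        # Hot path: reads one reference and takes no lock.
        return self._resource

    def merge_resource(self, attributes: Mapping[str, str]) -> bool:
        """Merge attributes; return False if the update was rejected."""
        with self._lock:
            # Validation only applies once the provider has been frozen.
            if self._frozen and self._validator is not None:
                if not all(self._validator(key) for key in attributes):
                    return False
            # Build a new resource, then swap the reference.
            merged = MappingProxyType({**self._resource, **attributes})
            self._resource = merged
            return True

    def set_attribute(self, key: str, value: str) -> bool:
        return self.merge_resource({key: value})

    def freeze_permanent(self) -> None:
        with self._lock:
            self._frozen = True
```

Writers serialize on the lock, while readers always see either the old or the new mapping and never a partially updated one, because each update builds a fresh mapping before replacing the reference.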
I have proposed a somewhat similar OTEP #207
If #207 was implemented you could store your ephemeral resource attributes on the Context, and replace the active context when they change. Please check if #207 would also cover your use case.

Thanks! That looks like a good proposal, but the context scope still presumes a transactional scope within a server handling many independent transactions.
For clients, all telemetry emitted, including logs which are not bounded by a span, is related. That is why the resource scope appears to be the correct one for things like this.

I think there is a continuum of use cases here, where some are better addressed by this OTEP and others better by #207. If one added the possibility to set a new context as root context (where the default is the empty context), we could have something that applies to everything.
Though the browser usually only has one thread of execution of which everything is a child context (I believe), so you probably would only need to set the attributes you want as active before starting your root spans, and it would stick.

That might work... but it might be better to keep the concept of a "process scope" and a "context scope" separate. I see these attributes as more similar to resources and instrumentation scopes - they represent the environment the transaction is occurring within.
Because contexts are immutable, and there are no rules as to when child contexts may be created, there would be synchronization issues between when ephemeral resources are updated and when they would be applied, if they only change the root context and thus only affect transactions which start from a new root context.