Improving resiliency through disk buffering capability #2285

Closed
pmm-sumo opened this issue Dec 14, 2020 · 14 comments
@pmm-sumo
Contributor

Is your feature request related to a problem? Please describe.

Currently, the collector buffers all data in memory, which can lead to data loss if the buffer gets overfilled (e.g. during temporary exporter connection issues) or if the process crashes suddenly.

Describe the solution you'd like

It should be possible to enable a disk buffering capability. For example, the exporter could buffer incoming data on disk (up to a defined limit) and clear the buffer as data gets sent out. After a restart, a recovery process could send out the buffered data.

Perhaps this capability would be a good fit for the queued_retry helper, since it already manages send queues?
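For illustration, a minimal sketch of what such a disk-backed send queue could look like; the `Queue` interface and the `Put`/`Poll`/`ErrQueueFull` names are hypothetical, not the collector's actual API:

```go
// Hypothetical sketch of a disk-backed send queue for an exporter; the names
// and signatures are illustrative, not the collector's actual API.
package diskbuffer

import "errors"

// ErrQueueFull is returned once the on-disk buffer reaches its configured limit.
var ErrQueueFull = errors.New("disk buffer is full")

// Queue buffers serialized export requests on disk up to a size limit and
// replays unsent items after a restart.
type Queue interface {
	// Put appends one serialized request to the buffer, returning
	// ErrQueueFull when the configured size limit is reached.
	Put(item []byte) error
	// Poll removes and returns the oldest buffered request, reporting
	// false when the buffer is empty.
	Poll() ([]byte, bool)
	// Close flushes queue metadata so a later restart can resume from
	// the unsent items.
	Close() error
}
```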

Additional context
Gitter discussion

@tigrannajaryan
Member

Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

@pmm-sumo
Contributor Author

> Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

The processor is deprecated, but is that also true for the helper?

@dashpole
Contributor

@pmm-sumo I would love to collaborate on this. There are some folks at Google who are interested in solving this problem as well. cc @erain

@tigrannajaryan is this something that can be allowed for GA?

@tigrannajaryan
Member

> Note that queued_retry is deprecated. We need to think where the persistent queue should fit. Perhaps queued_retry instead of being deprecated needs to be re-worked and persistence added.

> The processor is deprecated, but is that also true for the helper?

The helper is here to stay. We don't plan to deprecate it. However, if implemented in the helper, each exporter will have its own persistent queue, which may or may not be desirable (it will likely mean persistent data is duplicated if more than one exporter is used).

@tigrannajaryan
Member

> @pmm-sumo I would love to collaborate on this. There are some folks at Google who are interested in solving this problem as well. cc @erain

> @tigrannajaryan is this something that can be allowed for GA?

It depends on the implementation. If it requires major changes to existing code and may destabilize the Collector right before GA, then it may not be desirable. But it should not be an issue if it is released immediately after GA, e.g. in 1.1.

If the implementation is a separate component that doesn't touch existing code then I see no problem with allowing it for GA.

Either way, let's see a design document first and we can discuss.

@pmm-sumo
Contributor Author

pmm-sumo commented Jan 4, 2021

Thank you for the clarification, @tigrannajaryan. I agree that we need a design doc first.

@dashpole and @erain, did you perhaps start working on or researching this? Perhaps we could mention this during the next SIG meeting (if anyone else would like to participate too) and plan a separate meeting to discuss it? I think we might need more than a few minutes to share our ideas, discuss options, and decide how to split the work (even if we just start with creating a design doc). What do you think?

@dashpole
Contributor

dashpole commented Jan 5, 2021

Thus far, we have mostly been collecting use-cases, but have also explored a few different possible solutions that we'd be happy to share.

I added it to the next Collector SIG meeting agenda.

@djaglowski
Member

The recent stanza contribution may include some useful code to bootstrap this effort. I'm not deeply familiar with this part of the codebase, but it implements buffering and flushing as independent capabilities.

This code is basically dead at this point, since stanza previously made use of it as part of its output operators, which have been removed due to redundancy with the collector. If this code appears to be useful at all, I believe it would be fine to pull it out of the new repository and adapt as necessary without worrying about duplication.
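For context, a loose sketch (assumed, not stanza's actual API) of what "buffering and flushing as independent capabilities" could look like, so that a disk-backed buffer can be swapped in without changing how batches are flushed downstream:

```go
// Loose sketch of buffering and flushing modeled as independent capabilities;
// the interfaces below are illustrative, not stanza's actual types.
package bufferflush

import "context"

// Buffer accumulates entries and hands them back in batches; implementations
// may keep entries in memory or on disk.
type Buffer interface {
	Add(ctx context.Context, entry []byte) error
	ReadBatch(ctx context.Context, max int) ([][]byte, error)
}

// Flusher sends a batch to the next consumer; only after a successful flush
// should the batch be removed from the buffer.
type Flusher interface {
	Flush(ctx context.Context, batch [][]byte) error
}
```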

@pmm-sumo
Contributor Author

Perhaps a related issue (when considering the format used for buffering): open-telemetry/opentelemetry-specification#1443

@tigrannajaryan
Member

@pmm-sumo please review open-telemetry/opentelemetry-collector-contrib#3087 from the perspective of using it in the disk buffer.

@pmm-sumo
Contributor Author

I have prepared a design doc with some ideas on how this could be achieved. cc @dashpole

@SkrPaoWang

How long can the buffered data be stored in memory?

@pmm-sumo
Contributor Author

> How long can the buffered data be stored in memory?

As mentioned in the other issue, this is controlled by the number of batches via the queued_retry configuration. I am working on a PoC of a generic buffering solution and should have an update soon.
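For reference, a hedged sketch of the kind of queue settings involved; the field names follow the collector's queued_retry options but may differ between versions, and the limit is a batch count rather than a retention time:

```go
// Hedged sketch of the exporter helper's sending-queue settings; field names
// follow the queued_retry options but may differ between collector versions.
package exporterqueue

// QueueSettings bounds the buffered data by a number of batches, not by time.
type QueueSettings struct {
	// Enabled turns the sending queue on or off.
	Enabled bool `mapstructure:"enabled"`
	// NumConsumers is how many goroutines drain the queue concurrently.
	NumConsumers int `mapstructure:"num_consumers"`
	// QueueSize is the maximum number of batches held before new data is
	// dropped (or, with persistence, kept on disk).
	QueueSize int `mapstructure:"queue_size"`
}
```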

tigrannajaryan pushed a commit that referenced this issue Aug 4, 2021
As discussed during the SIG, we want to move the storage extension to core, starting with the interface (this PR) so the persistent buffer implementation (#2285) can use it.

**Link to tracking Issue:** #3424 

**Testing:** Just the interface, no tests

**Documentation:** README.md with API

cc @djaglowski @tigrannajaryan
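For illustration, a rough sketch of the kind of key/value client such a storage extension could expose to components like a persistent queue; the exact interface is defined in the collector repo and may differ in detail:

```go
// Rough sketch of a storage extension's key/value client; the actual
// interface lives in the collector repo and may differ in detail.
package storage

import "context"

// Client is a named key/value store handed out by the storage extension so
// components such as a persistent queue can keep state across restarts.
type Client interface {
	// Get returns the stored value for key, or nil if the key is absent.
	Get(ctx context.Context, key string) ([]byte, error)
	// Set stores value under key, overwriting any previous value.
	Set(ctx context.Context, key string, value []byte) error
	// Delete removes key; deleting a missing key is not an error.
	Delete(ctx context.Context, key string) error
	// Close releases any resources held by the client.
	Close(ctx context.Context) error
}
```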
tigrannajaryan pushed a commit that referenced this issue Sep 10, 2021
…3274)

Persistent queue implementation within queued_retry, aimed at being compatible with Jaeger's [BoundedQueue](https://github.com/jaegertracing/jaeger/blob/master/pkg/queue/bounded_queue.go) interface (providing a simple replacement) and backed by the [file storage extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage) for storing the WAL.

Currently, running the persistent queue requires an OpenTelemetry Collector Contrib build with the `enable_unstable` build tag.

**Link to tracking Issue:** #2285 

[Design doc](https://docs.google.com/document/d/1Y4vNthCGdYI61ezeAzL5dXWgiZ73y9eSjIDitk3zXsU/edit#)

**Testing:** Unit Tests and manual testing, more to come

**Documentation:** README.md updated, including an example
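To make the WAL idea concrete, a minimal sketch (assumed, not the actual persistent queue code) of how such a queue could store items through a key/value storage client using monotonically increasing read/write indexes:

```go
// Minimal sketch of a WAL-style persistent queue on top of a key/value
// storage client; this is illustrative, not the collector's implementation.
package persistentqueue

import (
	"context"
	"encoding/binary"
	"fmt"
)

// kvClient mirrors the storage client sketched earlier in this thread.
type kvClient interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}

type persistentQueue struct {
	client     kvClient
	readIndex  uint64 // next item to dispatch
	writeIndex uint64 // next free slot
}

// put appends one serialized request under the current write index and
// persists the advanced index so it survives a restart.
func (q *persistentQueue) put(ctx context.Context, item []byte) error {
	if err := q.client.Set(ctx, fmt.Sprintf("item-%d", q.writeIndex), item); err != nil {
		return err
	}
	q.writeIndex++
	return q.client.Set(ctx, "write-index", encodeIndex(q.writeIndex))
}

// poll returns the oldest stored request and removes it once read.
func (q *persistentQueue) poll(ctx context.Context) ([]byte, error) {
	if q.readIndex == q.writeIndex {
		return nil, nil // queue is empty
	}
	key := fmt.Sprintf("item-%d", q.readIndex)
	item, err := q.client.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	q.readIndex++
	if err := q.client.Set(ctx, "read-index", encodeIndex(q.readIndex)); err != nil {
		return nil, err
	}
	return item, q.client.Delete(ctx, key)
}

func encodeIndex(i uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, i)
	return buf
}
```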
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
…-telemetry#2285)

* Fix IPv6 handling errors in semconv.NetAttributesFromHTTPRequest

fixes open-telemetry#2283

* Enter PR number in CHANGELOG

* Remove unnecessary creation and then assignment

Standardize order of checks for IP, Name, Port

* Assume happy path when parsing host and port

i.e. assume net.SplitHostPort(input) will succeed

* Get rid of uint64 for port

* Fix git merge of main by adding back strings import

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Tyler Yahn <codingalias@gmail.com>
@pmm-sumo
Contributor Author

pmm-sumo commented Feb 7, 2022

Closing this since the capability is now available through the experimental queued_retry implementation.

@pmm-sumo pmm-sumo closed this as completed Feb 7, 2022
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024