[Infrastructure Monitoring] Better data generation #119491

jasonrhodes · 2021-11-23T15:28:28Z

Epic for organizing work on how to generate data for development and testing. We will flesh this out over time.

elasticmachine · 2021-11-23T15:28:30Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

matschaffer · 2021-11-25T07:22:10Z

Next steps (given sync with @miltonhultgren):

- Get this PR green & merge
- Write up doc of plan/use-case (@miltonhultgren)
- Figure out next problem space that needs a generator (could be logs, metrics or SM, i.e. Stack Monitoring)
  - open question: how to make time series data more "interesting"; it's just a flat line right now
  - open question: where do we handle mappings for logs & metrics?
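
As one possible angle on the "flat line" question, here is a minimal sketch of a reproducible series that layers a sine wave and seeded noise on top of a baseline. All names (`mulberry32`, `SeriesOptions`, `interestingSeries`) are hypothetical and not part of Synthtrace or Kibana; the seeded generator is only an assumption about how tests might keep the data deterministic.

```typescript
// Hypothetical helpers for illustration only; not an existing Kibana/Synthtrace API.

// Small seeded PRNG so the generated data is identical on every test run.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

interface SeriesOptions {
  baseline: number; // value the series oscillates around
  amplitude: number; // height of the periodic wave
  periodMs: number; // length of one full wave
  noise: number; // maximum random deviation added to each point
  seed: number; // PRNG seed, for reproducibility
}

interface Point {
  timestamp: number;
  value: number;
}

// Produces one point every `stepMs` in [from, to): baseline + sine wave + jitter.
function interestingSeries(from: number, to: number, stepMs: number, opts: SeriesOptions): Point[] {
  const rand = mulberry32(opts.seed);
  const points: Point[] = [];
  for (let t = from; t < to; t += stepMs) {
    const wave = opts.amplitude * Math.sin((2 * Math.PI * (t - from)) / opts.periodMs);
    const jitter = (rand() * 2 - 1) * opts.noise;
    points.push({ timestamp: t, value: opts.baseline + wave + jitter });
  }
  return points;
}

// Usage: one hour of data at one-minute resolution.
const series = interestingSeries(0, 3_600_000, 60_000, {
  baseline: 50,
  amplitude: 20,
  periodMs: 3_600_000,
  noise: 5,
  seed: 42,
});
```

Because the noise is seeded, a test can assert on exact values if it needs to, while charts built from the data no longer look like a flat line.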

matschaffer · 2021-11-25T07:29:55Z

Thinking that metrics UI could be a good next target given the efforts around alerting on high cardinality group-bys - cc @Zacqary

miltonhultgren · 2021-11-25T07:35:14Z

One more thought from syncing with @matschaffer is that one good POC would be to rewrite one of the Stack Monitoring E2E/integration tests using synthtrace generated data.

miltonhultgren · 2021-11-25T09:25:10Z

Just a ping on more problems that could benefit from data generation tooling #119658

miltonhultgren · 2021-11-29T12:08:17Z

Dropping this link here to Mat's notes https://docs.google.com/document/d/1lImDQTih61ufW3gDuY1FAYLUpZTJjj58sVbC993PSGA/edit#heading=h.m95jascdig79
In the notes I've also added a link to https://github.com/weltenwort/kibana/tree/add-kbn-test-data-generator/packages/kbn-test-data-generator

miltonhultgren · 2021-11-29T12:32:37Z

I got into this topic after struggling to write tests for our API because of "missing" test data. As I wrote to Jason:

We do have a bunch of archives, but it is not clear what data they contain or how relevant that data is for the thing I want to test. Often they are quite narrow. At the same time, it feels hard to wire up Metricbeat to something and use es_archiver to create a new archive on the fly, and there is also the concern of how much bloat we can put into a git repo.
Once the archive is there, it is also not easy to change: you need to regenerate it. Similarly, if all you wanted was a small tweak to the mapping, you still have to copy-paste the whole thing.

Additionally, es_archiver currently doesn't support data streams, although this will likely get fixed in time (#69061).

While there have been many attempts to solve part of this problem, we haven't really landed on something that we can make a roadmap around.

I would like to have a tool to easily generate data with different mappings, and it would be nice to be able to use that in tests instead of relying on es_archiver.
A one-liner goal I thought of is to "generate metrics X, Y and Z, based on log events between date A and date B with this (regular|irregular) frequency, and use these mappings when creating and storing them".

What the current tools seem to have in common is the idea of defining a time range in which "events" should happen, with some frequency and the possibility of spikes (by having overlapping time ranges with different event frequencies), and then some layer that turns these "events" into Elasticsearch documents.
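
To make that shared idea concrete, here is a rough sketch of the two layers: time windows that emit events at a fixed interval (overlapping windows produce a spike), and a separate function that turns each event into a document. The names `EventWindow`, `eventsFor` and `toDocument`, and the index name, are hypothetical; this is not Synthtrace's actual API.

```typescript
// Hypothetical types and helpers for illustration only.

interface EventWindow {
  from: number; // inclusive start, epoch millis
  to: number; // exclusive end, epoch millis
  intervalMs: number; // one event every intervalMs
}

interface GeneratedEvent {
  timestamp: number;
}

// Layer 1: expand every window into events, then merge them in time order.
// Windows may overlap; the overlap simply has a higher event density (a spike).
function eventsFor(windows: EventWindow[]): GeneratedEvent[] {
  const events: GeneratedEvent[] = [];
  for (const w of windows) {
    for (let t = w.from; t < w.to; t += w.intervalMs) {
      events.push({ timestamp: t });
    }
  }
  return events.sort((a, b) => a.timestamp - b.timestamp);
}

// Layer 2: turn an event into something shaped like an Elasticsearch document.
function toDocument(event: GeneratedEvent, index: string): { _index: string; _source: Record<string, unknown> } {
  return {
    _index: index,
    _source: { '@timestamp': new Date(event.timestamp).toISOString() },
  };
}

// Usage: ten minutes of one event per minute, with a one-minute spike
// of one event every ten seconds in the middle.
const generated = eventsFor([
  { from: 0, to: 600_000, intervalMs: 60_000 },
  { from: 300_000, to: 360_000, intervalMs: 10_000 },
]);
const docs = generated.map((e) => toDocument(e, 'example-metrics'));
```

Keeping the two layers separate means the same event stream could feed different document shapes, which is where the mapping question comes in.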

One thing that most tools miss, though, is a connection to the underlying mappings of those documents. Synthtrace, for example, loads the index + mappings with es_archiver before inserting its documents.
@weltenwort built a tool that puts most of its focus on that part of the problem, which might also help us generate data for different schema versions, since we would generate it from the mappings of that version. Synthtrace does have some "hooks" where we could perhaps inject this kind of code (bootstrap and the document generator class itself).
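
As a rough illustration of what "generate data from the mappings" could mean, the sketch below walks an Elasticsearch-style `properties` tree and emits a placeholder value per field type. It is not how @weltenwort's tool or Synthtrace works; `FieldMapping`, `generateFromMapping`, `valueFor` and the sample mapping are all hypothetical, and only a handful of field types are handled.

```typescript
// Hypothetical sketch; covers only a few Elasticsearch field types.

interface FieldMapping {
  type?: string;
  properties?: Record<string, FieldMapping>;
}

// Picks a fixed placeholder value for a leaf field based on its mapped type.
function valueFor(type: string | undefined, name: string, timestamp: number): unknown {
  switch (type) {
    case 'date':
      return new Date(timestamp).toISOString();
    case 'keyword':
    case 'text':
      return `${name}-value`;
    case 'long':
    case 'integer':
      return 1;
    case 'float':
    case 'double':
      return 0.5;
    case 'boolean':
      return true;
    default:
      // Fail loudly so unsupported types are noticed rather than silently skipped.
      throw new Error(`Unsupported field type: ${type}`);
  }
}

// Recursively builds a document whose shape follows the mapping.
function generateFromMapping(mapping: FieldMapping, timestamp: number): Record<string, unknown> {
  const doc: Record<string, unknown> = {};
  const properties = mapping.properties || {};
  for (const name of Object.keys(properties)) {
    const field = properties[name];
    doc[name] = field.properties
      ? generateFromMapping(field, timestamp)
      : valueFor(field.type, name, timestamp);
  }
  return doc;
}

// Usage with a made-up mapping; a real one would come from the schema version under test.
const sampleMapping: FieldMapping = {
  properties: {
    '@timestamp': { type: 'date' },
    host: { properties: { name: { type: 'keyword' } } },
    cpu: { properties: { pct: { type: 'double' } } },
  },
};
const sampleDoc = generateFromMapping(sampleMapping, 0);
```

Swapping in a different version of the mapping would then change the generated documents without touching the generator itself.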

My hope is that by defining our problems more clearly we can set some goals to reach. The Synthtrace route seems promising so far, also given that in the future we'll likely want to use APM data in our own tests as well.

What could the next steps be? What have we missed so far?

@elastic/infra-monitoring-ui

jasonrhodes · 2021-12-08T17:18:23Z

Can we do an hour sync to present an overall set of findings here, and try to jumpstart this effort in a good direction? I want us to invest in this, but like you all are saying (I think), we need clear goals and to choose the ones that will have the highest ROI for us.

matschaffer · 2021-12-08T22:58:35Z

I wouldn't mind showing off the stack monitoring simulation stuff so far. It's basic but it looks promising.

miltonhultgren · 2021-12-16T10:51:52Z

@jasonrhodes Let's book something!

My vote is for focusing on the issue of "data generation from mapping", since much of the work in Stack Monitoring would benefit from being able to take 1 of the 3 different mappings we have (which are moving towards a single mapping) and generate data from that, run tests and check that things work.
Beyond that, all of our initiatives around curated views installed from integrations will likely come with mappings shipped. Being able to grab one of those mappings, generate a bunch of data and install the curated view Saved Object and run tests would be easier than setting up Fleet for each test.

jasonrhodes · 2021-12-16T14:59:56Z

Please book an hour for after the new year, it can be during my "meeting block" / "focus time" if that works but likely it'll be before that anyway in order to work with other calendars. @matschaffer maybe if you can weigh in on some good times to target and work with @miltonhultgren to get an hour on the calendar? Thanks, all!

smith · 2022-06-16T01:34:25Z

Closing this for now. If we put effort into improving data generation while creating or updating tests perhaps we can evolve to the right solution.

jasonrhodes added the "Team:Infra Monitoring UI - DEPRECATED" label (label description: "DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label") Nov 23, 2021

smith closed this as completed Jun 16, 2022
