# Data segmentation API framework #1539
https://community.k6.io/t/how-to-distribute-vus-across-different-scenarios-with-k6/49/11 is another very common use case we have to take into account when designing this API. Again, having the above interfaces composable would be key, since then we should easily be able to make a clean helper function that solves the following use case: […]
Here's a use case that we should take into account when we implement this: https://community.k6.io/t/unique-test-data-per-vu-without-reserving-data-upfront/1136/5 Basically, something like a […]
**Data segmentation proposal**

We should strive to implement it by separating the concerns of loading, parsing and accessing data. The flow of data is as follows, composed of different parts that can be swapped out to meet the required properties:
Constraints:
k6 responsibilities:
User responsibilities:
| Username | Password | ... |
|----------|----------|-----|
| Data     | Data     | ... |
| Data     | Data     | ... |
| ...      | ...      | ... |
The following API is proposed, based on the Consumption Patterns described below:
```javascript
let table = new SharedTable("some name", function() {
    // Load data from "Source"
    ...
    return {
        columns: ['Username', 'Password'],
        rows: [...]
    }
}, {
    rowSelection: 'sequential'|'random'|'unique',
    updateRow: 'each-iteration'|'each-access'|'once',
    whenOutOfRows: 'abort-vu'|'continue-in-cycle'|'use-last-row',
    segment: true|false|Number
});

export default function() {
    let row = table.selectRow();
    console.log(row[0], row['Username']);

    // Accessing a *column value* according to the specified data Consumption Pattern could also be
    // handled by proxying the access through the index operator, but I think this might be too much
    // magic and could introduce unnecessary cognitive burden on a user trying to understand someone
    // else's script, or?
    console.log(table[0], table['Username']);
}
```
A `SharedTable` could also be constructed from a `SharedArray` or other Array-like object:
```javascript
let arrayLikeObject = ...;
let table = SharedTable.from(arrayLikeObject, ['Column 1', 'Column 2', ...]);
```
**Consumption Patterns**
Different testing use cases call for differences in the desired consumption of data when parameterizing actions in a test. The user should have control of some parameters for how data should be consumed when requested by a VU. These parameters are heavily influenced by the options available in LoadRunner:
- Row selection pattern: Controls how to select data rows during the test when multiple VUs are running (and each VU often running multiple iterations). There are three options:
  - Sequential: data rows are consumed sequentially from Source
  - Random: data rows are consumed randomly from Source
  - Unique: data rows are selected uniquely from Source
- Update row: Controls when a VU should update the data row it's consuming data from. Again, there are three options:
  - Each iteration: update the data row selected by the VU at the start of each iteration
  - Each access: update the data row selected by the VU on each access to the data row
  - Once: update the data row only once
- When out of rows: For some combinations of 1) and 2), namely "Unique+Each iteration" and "Unique+Each access", there's a third option. When the VU runs out of unique data rows, what should happen? There are three options (again :)):
  - Abort VU: when a VU runs out of data rows (after the last row has been accessed) it is aborted and stops executing any further iterations
  - Continue in cyclical selection pattern: the VU will continue executing iterations but will recycle data rows in a cyclical pattern
  - Continue with last row: continue executing the VU, but each data row access would now always return the last row before we ran out of rows
- Segmentation: Sometimes, when data files are really big, it can make sense to segment the data when running tests spanning multiple load gen machines. There are three options (what's up with the 3 options? 😄):
  - `true`: segment data proportionally according to execution segments (see the sketch after this list)
  - `false`: don't segment data; all data is available on all load gen machines
  - Number: the user decides how many rows each VU should have access to (I think deciding per VU would make the most sense from a user perspective, rather than per load gen machine, which is something the user might be completely abstracted away from, say in k6 Cloud)
The combination of these parameters defines what data rows (from the `SharedTable`) are to be selected by each VU for each iteration and each access/row selection (within an iteration):
- Sequential+Each iteration: each VU starts from the top of the `SharedTable`, the first row, and iterates over rows top to bottom, selecting a new row at the start of each iteration.
- Sequential+Each access: each VU starts from the top of the `SharedTable`, the first row, and iterates over rows top to bottom, selecting a new row every time the VU requests data.
- Sequential+Once: each VU always uses data from the first data row. No other data rows would be used.
- Random+Each iteration: each VU selects a random row for each iteration; each call to `table.selectRow()` returns the same row for the duration of a full VU iteration.
- Random+Each access: each VU selects a random row each time a call to `table.selectRow()` is made.
- Random+Once: each VU selects a random row once, in the first call to `table.selectRow()`, and any subsequent call to `table.selectRow()` selects the same row for all VU iterations.
- Unique+Each iteration: each VU selects a unique (previously unused) row for each iteration; each call to `table.selectRow()` returns the same row for the duration of a full VU iteration. When out of rows, the `whenOutOfRows` option tells us what to do.
- Unique+Each access: each VU selects a unique (previously unused) row each time a call to `table.selectRow()` is made. When out of rows, the `whenOutOfRows` option tells us what to do.
- Unique+Once: each VU selects a unique (previously unused) row once, in the first call to `table.selectRow()`, and any subsequent call to `table.selectRow()` selects the same row for all VU iterations.
I agree with this, but I'm also very confused... 😅 The proposed […]. Moreover, in terms of the first 2 parts, it seems like it's duplicating `SharedArray`:

```javascript
import { SharedArray } from 'k6/data';

const data = new SharedArray('some name', function () {
    return [
        ['we can have', 'arrays here'],
        { 'but': 'this is an object' },
        'we can have anything, as long as',
        'the top level is an array and elements are JSON-encodable',
        42
    ];
});

export default () => {
    console.log(data[1]['but']);
    console.log(JSON.stringify(data, null, 4));
};
```
So, if we want a clean JS API with separation of concerns (i.e. a composable API instead of a mega-object that does everything), and if we want to support these use cases: […]
It seems to me that the only missing piece in k6 currently is some sort of iterators or generators to facilitate these data access patterns. We already have the data storage (`SharedArray`) […].

Something from the "k6 responsibilities" section also seems wrong to me:
"Aborting a VU" is far from a simple thing - it doesn't really make sense in all executor types. It seems reasonably simple to conceptualize and maybe even implement in the Aborting VUs in the arrival-rate executors makes even less sense, since VUs are not the main thing there, the iteration pacing and iteration numbers are. VUs are simply workers, the substrate iterations are executed on at the specified arrival rate. Aborting a VU doesn't make any sense, it will just reduce the pool of workers, it won't stop the iteration pacing... 😕 In general, tying data segmentation too closely to VUs isn't the best solution. It should be possible to do, when that makes sense, but always doing it will just repeat some of the same problems relying on
> Update row

Do we really need this? Maybe in some high-level wrapper, but "once" and "each iteration" can be boiled down to "each access", where we "access" the iterator only once, or only at the start of the iteration, and then we cache the result.
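For illustration, a minimal sketch (all names are hypothetical) of how "once" and "each iteration" could be thin wrappers over a single "each access" iterator plus caching:

```javascript
// Wrap any iterator so callers can either always advance ("each access")
// or reuse the last fetched value ("once" / "each iteration").
function makeAccessor(iter) {
    let cached;
    return {
        next() { // "each access": always advance the underlying iterator
            cached = iter.next().value;
            return cached;
        },
        current() { // reuse the cached value; fetch it on first use
            return cached !== undefined ? cached : this.next();
        },
    };
}

// "Each iteration" then just means calling next() once at the start of the
// default function and current() for all subsequent reads in that iteration.
```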
> When out of rows

I mentioned above why "Abort VU" is not always possible, while "Continue in cyclical selection pattern" and "Continue with last row" can probably be implemented as simple JS wrappers around the generic (segmented or not) iterators (e.g. by modulo division, and by just caching the last value and returning it when we run out).
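A rough sketch of those two wrappers (hypothetical helpers, not an actual k6 API):

```javascript
// "Continue in cyclical selection pattern": modulo over the underlying array.
function* cyclic(arr) {
    for (let i = 0; ; i++) {
        yield arr[i % arr.length];
    }
}

// "Continue with last row": cache the last value and keep returning it
// once the wrapped iterable runs out.
function* repeatLast(iterable) {
    let last;
    for (const value of iterable) {
        last = value;
        yield value;
    }
    while (true) {
        yield last;
    }
}
```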
> Segmentation

Again, same problems with tying this too closely to VUs - it doesn't really work well with arrival-rate or […] executors.
Besides the problems of tying these use cases too closely to VUs that I already mentioned above, my other problem is that not all of these combinations make sense. For example, in what situation would someone use […]? And again, baking both […] into the same object […].

So far I've only disagreed with the […]
To get back to basics, let's start with an example. Say that we have a list of 5 elements: […]. Say that we also have 2 separate "actors", […].
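To make the options concrete, here's an illustrative (non-k6) sketch of the two obvious ways to split such a 5-element list between the 2 actors:

```javascript
const data = ['a', 'b', 'c', 'd', 'e']; // a hypothetical 5-element list

// Contiguous chunks: actor 0 gets ['a', 'b', 'c'], actor 1 gets ['d', 'e'].
function chunk(arr, actor, totalActors) {
    const size = Math.ceil(arr.length / totalActors);
    return arr.slice(actor * size, (actor + 1) * size);
}

// Striping (roughly what execution segment sequences enable for iterations):
// actor 0 gets ['a', 'c', 'e'], actor 1 gets ['b', 'd'].
function stripe(arr, actor, totalActors) {
    return arr.filter((_, i) => i % totalActors === actor);
}
```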
So yeah, I'll use "actors" instead of VUs. Based on the user's specific use case, these "actors" might need new data elements once per VU, or multiple times per iteration, or anything in between, and we need to be flexible enough to support all of these use cases. The […].

The other consideration is the pattern in which […]
To explore the […]
My suggestion for an MVP version of this feature is to: […]
I'm probably missing something important here, but I think we should start as simple as possible (and iterators are pretty simple) and try to satisfy as many use cases as possible with as little code as possible, even if that simple code requires a lot of JS boilerplate initially. We can always simplify the boilerplate later with JS or Go wrappers, as long as the foundation is sound.
I purposefully didn't want to touch the topics of loading, processing or streaming of data. I see those as separate topics. I wanted to focus on how to control how data is consumed during a test, as that is a frequent topic when we talk to users and customers. That data is loaded and parsed naively, i.e. not streamed, memory-mapped or whatever, is fine initially IMO.
The proposal certainly does allow for simple data structures to be consumed according to the specified patterns - not by segmenting the data per se between load gen nodes, but via the actual access pattern, which is the more interesting of the two from a user perspective, I'd say.
Yes, it's on purpose very close to […].
My proposal actually started out based on following the JS iteration protocols […].
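For reference, a minimal example of those protocols (see [1] at the end of this comment):

```javascript
// An object that is Iterable (has Symbol.iterator) and returns an Iterator
// (an object whose next() yields { value, done }).
const counter = {
    [Symbol.iterator]() {
        let i = 0;
        return {
            next() {
                return i < 3 ? { value: i++, done: false }
                             : { value: undefined, done: true };
            },
        };
    },
};

for (const n of counter) {
    console.log(n); // 0, 1, 2
}
```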
This is a really good point. I do think giving the user control of how to handle the "we're out of data" situation is important, but you're completely right that we shouldn't tie it to a VU per se. It should probably be thought of more as "should more iterations be run in this k6 process when we run out of data", and whether that would involve spawning more VUs or continuing with existing VUs is irrelevant.
From a user perspective it's needed, yes, but in the first abstraction level of whatever API we agree on, maybe not (we can probably implement it as you point out). Whatever the MVP of this feature will be, the important thing IMO is that the API is at an abstraction level that's useful to users, so I think having an API on a similar abstraction level as proposed with `SharedTable` […].

As I struggled to come up with an iterators-based proposal, I'd love to see what we can come up with that would allow us to build a […].

**Sequential+Each iteration**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = data[Symbol.iterator](); // Extending SharedArray with Iterable and/or Iterator protocol support
export default function() {
let row = iter.next().value; // Each iteration a new row will be used
...
}
```

**Sequential+Each access**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = data[Symbol.iterator]();
export default function() {
// Call `iter.next()` each time a new row is needed
let row = iter.next().value;
...
row = iter.next().value;
...
}
```

**Sequential+Once**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let row = data[0]; // First value (and same for last value or whatever index)
export default function() {
// Use `row` throughout test, it will be the same every time
}
```

**Random+Each iteration**

```javascript
import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
export default function() {
let row = randomItem(data); // Each iteration a new random row will be used
...
}
```

**Random+Each access**

```javascript
import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
export default function() {
// Call `randomItem(data)` each time a new random row is needed
let row = randomItem(data);
...
row = randomItem(data);
...
}
```

**Random+Once**

```javascript
import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let row = randomItem(data); // First value (and same for last value or whatever index)
export default function() {
// Use `row` throughout test, it will be the same every time
}
```

**Unique+Each iteration**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]
export default function() {
let obj = iter.next(); // Each iteration a new unique row will be used
if (obj.done) {
// Handle accordingly, eg. `abortFurtherIterations()`, `abortTest()` and similar future APIs.
}
let row = obj.value;
...
}
```

**Unique+Each access**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]
export default function() {
// Call `iter.next()` each time a new row is needed
let row = iter.next().value;
...
row = iter.next().value;
...
}
```

**Unique+Once**

```javascript
let data = new SharedArray("some name", function() {
... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]
let row = iter.next().value; // Unique across VUs
export default function() {
let row = iter.next().value; // Unique across iterations
}
```

With this kind of API we still need to figure out how to segment data according to different scopes like test, scenario, VU or iteration, so that the iterators only need to concern themselves with iterating over the given slice of data according to their specific access pattern. I suppose that as long as each k6 process (i.e. load gen) in a test has all the data, it can then segment it according to the user's desired scoping (test, scenario, VU or iteration) by using execution segments and the new `k6/execution` API:

```javascript
import exec from 'k6/execution';

let data = new SharedArray("some name", function() {
... // All data is always loaded here
});
export default function() {
const scenarioStats = exec.getScenarioStats();
let iter = UniqueIterator.from(
// Would calculate the appropriate segment of data to use based on execution segment for the current load gen + scenario and scenario iteration offset.
DataSegment.from(data, {
scope: 'scenario',
offset: scenarioStats.iteration
})
);
let obj = iter.next(); // Would give a scenario-level unique row per iteration (if only one row is consumed per iteration)
if (obj.done) {
... // Handle when-out-of-rows case
}
}
```

...but this feels very low-level.

A tangent, but how does the […]?

[1] - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols
I won't comment on the on-topic discussion here as it will take me a few hours to dig into and properly respond 😅, but to address the last question by @robingustafsson: there are several counters introduced in #1863, one of which is the […].
@imiric Ah, great, and based on process-local information (i.e. execution segments) by the looks of it 👍
I mostly collected a list of use cases to better understand the demand for the various points of the feature. I also tried to apply a basic proposal that I think is far from perfect, but it has the advantage of putting on the table a lot of limits that we can consider as a checklist for any solution we pick.

**Use Cases**

The most requested features, according to the previous table, are the incremental index per VU or Iteration. Strictly sequential, randomness and Once have a lower demand. This is highly expected considering the work done for the Execution API and the old context variables `__VU` and `__ITER`. Per VU/Iteration with a not-strictly-by-one sequence is already supported by the system with the introduction of the Execution API. It would mean that we could already support the most required cases by designing the consumption and/or iterators API based on the values returned by the Execution API.

My lack of knowledge of the entire ecosystem, which is at the moment mostly based on the forum and the repo's issues, could give me a partial and/or wrong vision of our requirements.

**Basic Proposal**

For the first iteration, the proposal is to write […]:

```javascript
import { Table } from './table.js'
import { UniqueIterator } from 'k6/data'
// Iterator should be optional and if it isn't provided,
// the Iterable's Iterator (builtin or custom) will be used.
var iter = new UniqueIterator()
var table = new Table(
['first name', 'last name', 'age'],
[
['Joe0', 'Doe0', '26'],
['Joe1', 'Doe1', '27'],
['Joe2', 'Doe2', '28']
],
iter)
// If init context then per VU
//let user = table.next()
export default function() {
// if not init context then per Iteration
let user = table.next()
console.log(`Name: ${user['first name']}, Surname: ${user['last name']}`)
}
```

**Table API**

The Table API should be responsible for resolving the rows from the Source by getting an index from the Iterator; it would then map the row into an Object/Map, associating its values with the relative headers.
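A minimal sketch of that resolution-and-mapping responsibility (assumed semantics, not an actual implementation):

```javascript
// Table.next() asks the Iterator for an index, resolves the row from the
// Source, and maps it into an object keyed by the headers.
class Table {
    constructor(headers, rows, iter) {
        this.headers = headers;
        this.rows = rows;
        // Assumed default: a simple incrementing counter if no Iterator is given.
        this.iter = iter || (function* () { let i = 0; while (true) yield i++; })();
    }

    next() {
        const index = this.iter.next().value % this.rows.length;
        const row = this.rows[index];
        // ['Joe0', 'Doe0', '26'] -> { 'first name': 'Joe0', 'last name': 'Doe0', 'age': '26' }
        return Object.fromEntries(this.headers.map((h, i) => [h, row[i]]));
    }
}
```

With the example data above, the first call to `table.next()` would return `{ 'first name': 'Joe0', 'last name': 'Doe0', 'age': '26' }`.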
**Global and scoped Iterator**

The idea has been inspired by this doc sentence, so the concept is to get the right counter based on the scope of the caller: if the caller is in the init context, the per-VU counter would be used; otherwise, the per-Iteration counter. It has the downside of returning different values without any correlation if the […].

The previous idea doesn't cover an eventual per-Test iterator. I think we could use a global default instance of the […].

**Out of rows**

I think the API should cover the cyclical selection as the default, and either fire a callback or export a boolean API for checking if the iterator is out of rows. In this way, the user is free to apply any possible logic when it hits the out-of-rows state:

```javascript
export default function() {
// true if the latest item has been returned
if (table.isOutOfRows()) {
console.log('Out of rows')
return; // or exec.test.abort if it's unexpected
} else {
let user = table.next()
console.log(`Name: ${user['first name']}, Surname: ${user['last name']}`)
}
}
```

**Once**

Done this way, the init context couldn't be used for the Once case, so a workaround like the following must be applied:

```javascript
var firstRow // or last
export default function() {
// fetch the row only once, on the first use
if (firstRow === undefined) {
firstRow = table.next()
}
console.log(`Name: ${firstRow['first name']}, Surname: ${firstRow['last name']}`)
}
```

**Streaming API**

This is not yet covered by me, but the feeling is that the counter-based solution could be used for achieving an incremental reading of the stream. Ideally, the API should read the stream until it has enough data to return a value for the requested index.

**Counters' limits**

- Each Access: not supported by global counters, because they can't be increased "on-call"; they follow the test's life-cycle, so if the […].
- Chunks: as reported in the previous comment by @na--, and mostly for the same reason as Each Access, the counters don't support sequential chunks of data distributed across the Actors.
- VUID: Global […]. The […].
- Reset Index: resetting the index for re-starting is not supported with global counters (it could be fixed locally, but I don't think so in distributed cases).
- Test Scope: we don't have a counter that goes across scenarios, so maintaining the sequence across scenarios wouldn't be supported. Fixing it would require an additional counter.
- Perfect Sequence: as reported in the docs, the global counters could not respect a perfect sequence, creating unexpected holes.

**Open questions**

[…]
I did not understand what the idea is here, sorry. Can you expand, possibly with a script sample with some comments?
Is the idea here that if that check isn't used and […]?
I am even more confused by this example.
This in fact is not possible (or at least will require a lot of synchronization between k6 instances) if you have a VU-based scenario - […]
As mentioned above, this has even more problems. In practice this will require that each "getting" of an item is synchronized. But I will also argue that this case is really... not possible in the strictest sense of the word. The problems are that: […]
So, at least for me, this use case is very badly defined, and if we are okay with the two types of holes it creates, I guess we are fine with using global IDs as well :). And if not, this basically requires a multi-instance database of some sort that gives you the next item(s) - arguably something that can be done with running […].

**Next steps (IMO):**

All in all, I think we should really just make how-to guides for all the cases currently supported, possibly building some helper functions around them to test out APIs. Some of the "workarounds" for unsupported cases should probably also have full examples so they can be better evaluated. Some of those are already in comments; we just need to add them to the documentation IMO. There is some prior code by me that definitely needs more work but can be used as an idea.
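A hedged sketch of one such how-to example, using the real `k6/execution` counters (the `./users.json` file and its contents are hypothetical):

```javascript
import { SharedArray } from 'k6/data';
import exec from 'k6/execution';

const data = new SharedArray('users', function () {
    // Assumed: a JSON file containing an array of { username, password } objects.
    return JSON.parse(open('./users.json'));
});

export default function () {
    // A unique row per iteration across the whole test run, falling back to
    // the "continue in a cyclical pattern" behavior when we run out of rows.
    const user = data[exec.scenario.iterationInTest % data.length];
    console.log(user.username);
}
```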
We now have support for partitioning work (i.e. VUs and iterations) between multiple k6 instances, via the `executionSegment` and `executionSegmentSequence` options, originally described in #997 and subsequently evolved in #1007. In the end, we had to implement striping even in the initial version (thus the need for `executionSegmentSequence` 😞), because some executors like `ramping-vus` and the `arrival-rate` ones needed it for optimal performance.

The good news from that extra effort, though, is that we now have most of the things we need to tackle data segmentation/partitioning between multiple k6 instances, without any external runtime scheduling between them. We need to refactor and improve some things in the initial implementation, like #1499, #1427, and #1386, but the rough building blocks are already here... 🎉 I'm making this issue as a place to discuss this effort, so that I can close #997, given that most of it is done.
Of course, we don't need to start implementing this right away. We "just" have to figure out how its JS API and options should look... 😅 This will allow us to start implementing things like streaming data support (#592), shared read-only memory (#532), a CSV API (#1021), a JSONPath API (#992), and an XML parsing API with XPath support. I don't think binary data handling (#1020) is going to be affected by this, but it probably deserves some thought as well.
If we have a clear idea of how the data segmentation should work, we can start implementing the issues above without the complicated data segmentation in their first versions, knowing we'd be able to add it at a later point, hopefully without having to completely refactor everything again. Currently, I think we can split the process like this: […]
[…] `iter.next()` returns the next item in a segmented and/or striped fashion. This, combined with making sure that the new APIs from point 2 are compatible with these iterators, will immediately allow users to have data segmentation, albeit with a little bit of JS work and some minor loss of performance.
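As an illustration of that last point, a rough sketch (not an actual API) of an iterator whose `next()` returns items in a striped fashion, given this instance's offset in the segment sequence and the total instance count:

```javascript
function stripedIterator(arr, offset, instances) {
    let i = offset;
    return {
        next() {
            if (i >= arr.length) {
                return { done: true, value: undefined };
            }
            const value = arr[i];
            i += instances; // skip over the rows owned by the other instances
            return { done: false, value };
        },
    };
}

// Instance 0 of 2 would see rows 0, 2, 4, ...; instance 1 would see rows 1, 3, 5, ...
const iter = stripedIterator(['a', 'b', 'c', 'd', 'e'], 0, 2);
console.log(iter.next().value); // 'a'
console.log(iter.next().value); // 'c'
```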
returns the next item in a segmented and/or striped fashion. This, combined making sure that the new APIs from point 2. are compatible with these iterators, since this will immediately alow users to have data segmentation, albeit with a little bit of JS work and some minor loss of performance.The text was updated successfully, but these errors were encountered: