[API] Add CSV bulk indexing support to Kibana API #6844
Conversation
@epixa I thought you might be interested in reviewing this since it seems like the inverse of the CSV export API you're building. But let me know if you already have too much on your plate and I can find someone else to look at it.
jenkins, test it
responseStream.write('[');
csv.pipe(parser);
hi(parser)
As far as I can tell, none of this is really specific to the CSV import feature, so I think almost all of this logic should be pulled out of this handler and into a standalone library. The library should basically have one purpose: given a stream of docs, send those docs off to the bulk API and return a stream response. The route handler would then simply be responsible for parsing the CSV payload, sending it through the import function, and piping the response back to the client.
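A minimal sketch of the kind of split being suggested, assuming a highland-based helper; the names (`bulkIndexDocs`, `parseCsv`, `resultsToJsonArray`) and the option values are hypothetical, not code from this PR:

```js
const hi = require('highland');

// Hypothetical standalone library: given a stream of docs, batch them,
// send each batch to the Elasticsearch bulk API, and return a stream of results.
function bulkIndexDocs(docStream, { client, index, batchSize = 1000 }) {
  return hi(docStream)
    .batch(batchSize)                     // group docs into bulk-sized arrays
    .flatMap((docs) => {
      const body = [];
      docs.forEach((doc) => {
        body.push({ index: { _index: index } });
        body.push(doc);
      });
      return hi(client.bulk({ body }));   // highland can wrap the client's promise
    });
}

// The route handler would then only parse the CSV payload, run it through the
// import function, and pipe the response back to the client.
function handler(request, reply) {
  const docStream = parseCsv(request.payload);            // hypothetical CSV parsing step
  const results = bulkIndexDocs(docStream, {
    client: request.server.plugins.elasticsearch.client,  // assumed client location
    index: request.params.index,
  });
  reply(results.pipe(resultsToJsonArray()));              // hypothetical serialization step
}
```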
This would also have the added benefit of allowing more thorough unit testing of the individual parts of the stream handling process, which will help the clarity and reliability of the streaming code.
I had the exact same thought, actually. I'd also like the _bulk input to support JSON files (or even JSON request bodies) in addition to CSVs, which the library approach would make easier.
However, my tendency is to way over-engineer things (see: the first 3 implementations of the ingest API), so this is my attempt to keep to a simple MVP that's good enough to power the CSV upload UI.
I feel you on the whole over-engineering thing, but I don't think this falls into that category. Over-engineering is making things more complicated than they need to be, but I actually think having all of this code strung across the request handler makes it more complicated than if it were pulled out into a library with clear boundaries. It also makes it impossible to unit test, which I think is a bug even in an MVP.
To elaborate on that last point: MVP should describe the minimum capabilities of the product rather than the quality of code that powers it, and it's generally hard to make the case that code is quality if it can't be unit tested.
👍 to not trying to build a third-party API in this initial implementation, but that shouldn't mean we don't write tests to cover the code paths we create. Again, I go back to my product vs. code point from before: the vast majority of our code isn't publicly consumable (or at least we don't want people to use it directly), but that doesn't mean we don't benefit from testing it. Those ~24 code paths aren't accidental; we're not talking about accounting for every possible input type, we're just talking about testing the code paths that are written.
Things like batch size, parallelization, and back pressure are handled by highland because of the way you've coded this, but any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up/configured the way they need to be rather than hoping our tests will catch the unexpected effects of breaking them.
That doesn't mean we need to re-test all of the capabilities of highland, though. If we are comfortable assuming that highland will handle those things for us so long as we set them or configure them properly (which I am, and it sounds like you are), then we would only need to assert that we're configuring them properly. Those assertions also help codify important decisions that are made at implementation time, like the choice to do a lower batch size.
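For illustration only (not actual test code from this PR), the kind of configuration assertion being described might look something like this, assuming a hypothetical import module that exposes its highland settings:

```js
const expect = require('expect.js');
const { importOptions } = require('../lib/csv_import'); // hypothetical module and export

describe('csv import stream configuration', () => {
  it('uses the agreed-upon bulk batch size', () => {
    // codifies the implementation-time decision to use a lower batch size
    expect(importOptions.batchSize).to.be(1000); // value is hypothetical
  });

  it('limits the number of parallel bulk requests', () => {
    expect(importOptions.parallelism).to.be(2);  // value is hypothetical
  });
});
```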
As for the rest of those things, I didn't mean to imply that you weren't testing for them, just that my 24 code paths number wasn't even considering those things. Many of those things are definitely best verified through functional tests just as many others are best verified through unit tests. You probably are verifying a bunch of them already.
To be honest, I think a boatload of context is being lost in this comment chain. Want to zoom tomorrow? Maybe we could bring in some other people to get some fresh perspectives as well?
Would be happy to Zoom. However, I would like you to provide some concrete examples of the types of tests you'd like to see, as there's a lot of hand-waving going on here.
> any developer at any time could make even trivial changes to this file that could break any or all of those things in subtle ways that could have effects we couldn't possibly anticipate through automated testing. That's why we should assert that these things are being set up/configured the way they need to be rather than hoping our tests will catch the unexpected effects of breaking them.
And I'd also like concrete examples of ways in which a developer could make trivial changes that break everything, where a unit test would save them rather than the developer just changing the unit test along with the code, if the subtle breakages are in fact impossible to anticipate and the developer is incapable of detecting them when testing their changes.
Sure, I can come up with some examples using this code before we chat about it.
I'm not sure what you meant by the follow-up comment, though. Developers will always be able to modify tests, and doing so will always mean that the code isn't working the way it was before.
It sounds like you want tests to verify the value of certain configuration options, like batch size, to prevent bugs that "we couldn't possibly anticipate" caused by a developer changing that value. What I'm saying is that if the bug is in fact impossible to anticipate, a unit test isn't going to prevent the developer from changing that value.
But I could be totally wrong about what you mean, so let's table that discussion until we have some concrete examples of unit tests that we can dissect.
"type": "double", | ||
"doc_values": true | ||
}, | ||
"CountryFull": { |
You can drop this and all of the other `"type": "text"` mappings since the `_default_` dynamic template will catch them.
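For reference, a `_default_` dynamic template along these lines is what catches the string fields; this is a sketch of the general shape (using the ES 5.x text/keyword multi-field convention), not necessarily the exact mapping in this PR:

```json
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": {
                "keyword": { "type": "keyword", "ignore_above": 256 }
              }
            }
          }
        }
      ]
    }
  }
}
```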
Ah, that's right, ES creates text and keyword fields out of strings by default now, doesn't it?
You can still drop all of these "text" fields. Even outside of Elasticsearch's defaults, you have a `_default_` mapping that will catch all of the text fields anyway.
See my other comment about this.
I'm not sure why GitHub collapsed the one longer thread we have going, so I'm linking it here for other reviewers to add their thoughts: #6844 (diff). If we add anything more to it, we should probably do so on this main PR comment thread so they don't continue to disappear into the void.
It's because it was a comment on code that has changed, so I guess GitHub decides it's resolved. But yeah, in general anything large in scope should go in the main thread or it gets lost.
jenkins, test it
@Bargs can you merge master on this? That should fix the tests.
Hmmm, I rebased but it's still busted; I'll have to look into it when I get some time.
Tests are passing again. @epixa and I are going to discuss the unit testing more on Friday. @rashidkpc in the meantime, could you give this a review as well?
"_default_": { | ||
"dynamic_templates": [ | ||
{ | ||
"string_fields": { |
This will catch all of your text/string fields; there's no reason to define them explicitly.
I need the explicit mappings, otherwise I'll get mapping exceptions when indexing the data. For instance, some of the ZipCode values are actually strings, but Elasticsearch will map the field as a long if the first value it sees looks like a number. So ZipCode needs to be explicitly mapped as a string.
Some of the fields might not strictly require explicit mappings, but to make the tests as robust and repeatable as possible, I went ahead and mapped everything.
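For example (an illustrative snippet, not the exact fixture mapping from this PR), an explicit string mapping keeps a value like "01234" from being dynamically mapped as a long based on the first ZipCode Elasticsearch sees:

```json
{
  "ZipCode": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword" }
    }
  }
}
```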
LGTM. Lee's discovery test (with @Bargs help): in this case I started Elasticsearch and Kibana with
bdd.describe('optional parameters', function () {
  bdd.it('should accept a custom delimiter query string param for parsing the CSV', function () {
    return request.post('/kibana/names/_bulk?delimiter=|')
    return request.post('/kibana/names/_data?delimiter=|')
Just noticed this. The `|` could be passed in as `%7C` because of URL encoding. Mind adding a test to make sure that works?
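One possible shape for that test (illustrative only, reusing the request helper from the existing functional tests; the payload, index name, and assertion helpers are assumptions):

```js
bdd.it('should accept a URL-encoded custom delimiter', function () {
  return request.post('/kibana/names/_data?delimiter=%7C') // %7C is the URL-encoded "|"
    .send('Name|Age\nAnne|32\n')                           // hypothetical pipe-delimited CSV body
    .expect(200);
});
```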
I made the following request:
And got the following response:
I have a few questions/comments:
Authentication error example, request + response:
Currently the API (really Elasticsearch) auto-assigns IDs for each row in the CSV. Would it be useful for the API (and by extension, the UI) to expose a parameter that would allow the user to specify an ID column?
General note about query string parameters on this API: this API is named `_data`. A couple of thoughts:
`kbn-version` is actually unrelated to the API; it's a header we added to protect against XSRF.
Setting the
I think that's a good point, but it's a little tricky since the response is streamed to the client. Making the response an object would make it a bit harder to parse in a streaming manner. The way I think about it, the array is a container for 1 or more response objects. It could just as easily be 1 or more JSON objects separated by newlines, but I wanted it to be valid JSON so clients can consume it easily if they don't care about reading the streaming results as they arrive. So if we were to add a property in the future, it would be added to the response objects, not the array.
Exactly.
So, a couple of things here:
For the sake of streaming, I really like the simplicity of treating each response object in the array as its own independent chunk, and I think most errors that can occur really are line-specific. But it does seem to make sense to return a 403 if, for instance, the user has no write permissions on the index they're trying to add data to. That would require pre-checking the user's permissions, though, which we don't really do anywhere else, and it seems like more of a cross-cutting concern that should be handled in the security plugin rather than adding security-specific code to the API route handler itself. I dunno, I'm pretty torn on this one.
Actually, the API creates an ID from the filename and line number: https://github.com/elastic/kibana/pull/6844/files#diff-acac6333b1d8bc4acac4ec269419e41aR51 I do think the ability to specify an ID column would be useful, but that's something else I'd like to take a wait-and-see approach on. Having the line number is really useful in the CSV upload UI so that I can point users to the exact line that caused an error.
I like that idea, since there are other params like
@Bargs and I chatted out-of-issue about the structure of the API response. I'll try to summarize: given that this is an internal-only API at the moment, we'll go with whatever is simplest to generate (from the server side) and consume (from the client side). So we'll keep the response structure as-is: a top-level array containing response objects. If/when we release this API publicly, we'll need to think harder about the implications of that structure, per points 3 and 4 in my comment above.
Ah, I missed that. Thanks for clarifying.
The wait-and-see approach about letting users specify an ID column sounds good!
…when other data types are supported in the future
@ycombinator just pushed some updates.
I think that covers everything we agreed to update. Let me know how we're looking.
jenkins, test it
LGTM.
Creates the API necessary for #6541
Required by: #6845
This PR creates a streaming API endpoint that parses CSV files and indexes each row in the file as a document in Elasticsearch.
Endpoint
`/api/kibana/{index}/_data`

The request payload can be either raw CSV data, or `multipart/form-data` with the CSV file attached under a `csv` key.
key.Query String Parameters
- `delimiter` - (optional) String - a custom delimiter. Defaults to `,`
- `pipeline` - (optional) Boolean - if true, documents are sent through the index's corresponding pipeline (based on the Kibana ingest config convention of appending `kibana-` to the index name), if one exists (see the example request below)
to the index name), if one existsResponse
In order to support large files without having to implement a polling mechanism or websockets, the API streams a chunked response back to the browser. This makes realtime progress updates a possibility and helps prevent timeouts between the backend and the browser. The response payload is a JSON array of "result objects". Each result object represents the results of a portion (or for a small file, potentially the entirety) of the parsing and indexing the API has completed so far. The result object schema looks like the following:
The errors object may have the following keys:

- `index` - Array of objects - contains any indexing errors returned by Elasticsearch, per document
- `other` - Array of strings - contains CSV parse errors, non-indexing ES errors, and any other misc. errors

To make this more concrete, here are a couple of sample responses:
With an indexing error:
With a parsing error:
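The original sample responses aren't reproduced above. As a purely illustrative sketch of the described schema (field names other than `errors`, `index`, and `other` are assumptions, as is the error shape), a streamed response containing an indexing error and a parse error might look roughly like:

```js
// Illustrative only; not one of the actual sample responses from this PR.
const exampleStreamedResponse = [
  {
    created: 98,                  // hypothetical count of rows indexed in this chunk
    errors: {
      index: [
        // per-document indexing errors as returned by Elasticsearch (exact shape assumed)
        { _id: 'names.csv:42', error: 'mapper_parsing_exception ...' },
      ],
      other: [
        // CSV parse errors, non-indexing ES errors, and misc errors, as strings
        'Invalid closing quote at line 57',
      ],
    },
  },
];
```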