Attachment ingest processor: add resource_name field #64389
Pinging @elastic/es-core-features (:Core/Features/Ingest) |
@elasticmachine ok to test |
@yangyaofei, thank you for opening this PR. It's true that we have a check that prohibits files with encodings other than UTF8. Is that necessary for this PR? I believe that non-ascii characters are supported by the ingest-attachment plugin so long as they're encoded in UTF8. |
@danhermann Hi, the purpose of this PR is not to add a non-UTF-8 file to the repo. That file is only a test fixture for the main change, which is passing a file name field to Tika through the plugin so that Tika can recognize the file. For example, a GBK-encoded text file is not recognized correctly without the file name; once the file name is passed to Tika, the encoding is recognized correctly. To test that recognition, I added the non-UTF-8 test text file.
@elasticmachine update branch |
@yangyaofei, can you try adding the following line:
right after
|
@danhermann Done |
@danhermann Is there anything I can do to help get this merged, or should I just wait?
@yangyaofei, sorry for the delay. Now that it's compiling and passing the tests, I need to review the code. I'll try to get that done in the next week or so. |
@yangyaofei, thank you for your contribution here. It looks good and I think we can get it merged with a few minor changes as noted below.
...ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java
@danhermann done with that |
@elasticmachine update branch |
Thank you, @yangyaofei. It looks good. I noticed one other thing about the tests that should be changed and then it can be merged. The suggestions below aren't ordered very well, but essentially, I would like to leave the existing unit tests unchanged so that they test the behavior of the processor when |
...t-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
…st/attachment/AttachmentProcessorTests.java Co-authored-by: Dan Hermann <danhermann@users.noreply.github.com>
Great, the remaining test failures are unrelated, so I can get this merged. This is a very useful addition to the attachment processor, so thanks again for the contribution, @yangyaofei! |
Contributes to #5198 Relates to elastic/elasticsearch#64389
Contributes to #5198 Relates to elastic/elasticsearch#64389 Co-authored-by: Steve Gordon <sgordon@hotmail.co.uk>
Hey @yangyaofei! I'm Faith from Elastic's Community team. I want to encourage you to check out Elastic's Contributor Program, where you can earn points towards rewards for code contributions like this one. |
In the current ingest-attachment plugin, a text file cannot be read properly if its encoding is not UTF-8 and it contains non-ASCII characters.
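To illustrate the underlying problem, here is a minimal sketch that uses only the Java standard library and does not involve Tika or the plugin. The class and method names are hypothetical, and it assumes the JDK ships the GBK charset (the usual case for a full JDK). It shows that GBK-encoded bytes are not valid UTF-8, so a reader that assumes UTF-8 cannot recover the text:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the ingest-attachment plugin.
final class EncodingMismatchDemo {

    /** Strictly decodes the bytes with the given charset; returns null if they are not valid in it. */
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            return null; // the bytes do not form valid text in this charset
        }
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this JDK.
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] gbkBytes = original.getBytes(gbk);

        // Decoding as UTF-8 fails because the byte sequence is malformed UTF-8.
        System.out.println("as UTF-8: " + decodeStrict(gbkBytes, StandardCharsets.UTF_8));
        // Decoding with the correct charset round-trips the text.
        System.out.println("as GBK matches: " + original.equals(decodeStrict(gbkBytes, gbk)));
    }
}
```

The point of the sketch is only that the correct charset has to be known or detected before decoding; how Tika performs that detection is not shown here.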
I looked into Tika, which ingest-attachment uses, and found that if we tell Tika the file's name, it can recognize the file better. So I added an attachment option, `file_name`: if a field is defined as `file_name`, its value is sent to Tika to improve the result.
One thing does not look right, though: `gradle check`. I wrote unit tests that read text files in different encodings, but there seems to be a rule against committing non-UTF-8 files.
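As a rough analogy for why a file name helps detection, the sketch below uses the standard library's `URLConnection.guessContentTypeFromName`, which guesses a MIME type from a name alone. This is not Tika's API and not the code in this PR; the class and method names are hypothetical and the example only shows the general idea of name-based detection:

```java
import java.net.URLConnection;

// Hypothetical demo class; an analogy only, not Tika and not the plugin code.
final class NameBasedDetectionSketch {

    /** Guesses a MIME type from the file name alone; may return null if the name gives no usable hint. */
    static String guessTypeFromName(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    public static void main(String[] args) {
        // With an extension, the name by itself is enough to suggest a type.
        System.out.println(guessTypeFromName("notes.txt"));
        // Without an extension, the name carries no hint.
        System.out.println(guessTypeFromName("notes"));
    }
}
```

A detector that knows the content is plain text can then concentrate on working out the character encoding, which is the kind of improvement described above.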