Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attachment ingest processor: add resource_name field #64389

Merged
merged 18 commits into from
Dec 7, 2020

Conversation

yangyaofei
Copy link
Contributor

@yangyaofei yangyaofei commented Oct 30, 2020

In the current plugin: ingest-attachment, the text file cannot be read properly if the encode is not utf-8
and contain some non-ascii characters.

I study a little about Tika which is used in ingest-attachment. Then I find out if we can tell Tika the file's name, it can recognize the file better. So I add an attachment options file_name, if there is a field defined as file_name, then this name will sent to Tika to improve the result.

But there is something not looks well. That's the gradle check. I wrote the unit test for reading different text using
different encoding. But seems there is a role to not commit no-utf8 things.

@martijnvg martijnvg added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Oct 30, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Oct 30, 2020
@danhermann danhermann self-requested a review November 5, 2020 15:04
@danhermann
Copy link
Contributor

@elasticmachine ok to test

@danhermann
Copy link
Contributor

But there is something not looks well. That's the gradle check. I wrote the unit test for reading different text using
different encoding. But seems there is a role to not commit no-utf8 things.

@yangyaofei, thank you for opening this PR. It's true that we have a check that prohibits files with encodings other than UTF8. Is that necessary for this PR? I believe that non-ascii characters are supported by the ingest-attachment plugin so long as they're encoded in UTF8.

@yangyaofei
Copy link
Contributor Author

yangyaofei commented Nov 6, 2020

@danhermann hi, this PR is not for add non-utf8 file to the repo. it's a test file to test the main purpose , which is adding filename field to Tika through plugin to let tika recognize test file. For example. A GBK encoded text file cannot be recognized right . Before adding the filename to Tika, the encode can be recognized right

And To test that recognition, I add the non-utf8 test text file.

@danhermann
Copy link
Contributor

@elasticmachine update branch

@danhermann
Copy link
Contributor

@danhermann hi, this PR is not for add non-utf8 file to the repo. it's a test file to test the main purpose , which is adding filename field to Tika through plugin to let tika recognize test file. For example. A GBK encoded text file cannot be recognized right . Before adding the filename to Tika, the encode can be recognized right

And To test that recognition, I add the non-utf8 test text file.

@yangyaofei, can you try adding the following line:

exclude '**/text-cjk-*.txt'

right after

to exempt the new text files from the check that they're in UTF8 format?

@yangyaofei
Copy link
Contributor Author

@danhermann Done

@yangyaofei
Copy link
Contributor Author

@danhermann Is there anything I can do to merge this ? Or just wait ?

@danhermann
Copy link
Contributor

@danhermann Is there anything I can do to merge this ? Or just wait ?

@yangyaofei, sorry for the delay. Now that it's compiling and passing the tests, I need to review the code. I'll try to get that done in the next week or so.

Copy link
Contributor

@danhermann danhermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yangyaofei, thank you for your contribution here. It looks good and I think we can get it merged with a few minor changes as noted below.

@yangyaofei
Copy link
Contributor Author

@danhermann done with that

@danhermann
Copy link
Contributor

@elasticmachine update branch

@danhermann
Copy link
Contributor

danhermann commented Dec 7, 2020

@danhermann done with that

Thank you, @yangyaofei. It looks good. I noticed one other thing about the tests that should be changed and then it can be merged. The suggestions below aren't ordered very well, but essentially, I would like to leave the existing unit tests unchanged so that they test the behavior of the processor when resource_name is unspecified. Your new tests would be added to a new test method that is the only one that specifies resource_name to test the behavior of the processor with the new option that you added.

yangyaofei and others added 4 commits December 7, 2020 22:45
…st/attachment/AttachmentProcessorTests.java

Co-authored-by: Dan Hermann <danhermann@users.noreply.github.com>
…st/attachment/AttachmentProcessorTests.java

Co-authored-by: Dan Hermann <danhermann@users.noreply.github.com>
…st/attachment/AttachmentProcessorTests.java

Co-authored-by: Dan Hermann <danhermann@users.noreply.github.com>
…st/attachment/AttachmentProcessorTests.java

Co-authored-by: Dan Hermann <danhermann@users.noreply.github.com>
@danhermann danhermann changed the title Plugin ingest-attachment add filename field Plugin ingest-attachment add resource_name field Dec 7, 2020
@danhermann danhermann changed the title Plugin ingest-attachment add resource_name field Attachment ingest processor: add resource_name field Dec 7, 2020
@danhermann
Copy link
Contributor

Great, the remaining test failures are unrelated, so I can get this merged. This is a very useful addition to the attachment processor, so thanks again for the contribution, @yangyaofei!

@danhermann danhermann merged commit 0f84763 into elastic:master Dec 7, 2020
danhermann pushed a commit to danhermann/elasticsearch that referenced this pull request Dec 14, 2020
stevejgordon added a commit to elastic/elasticsearch-net that referenced this pull request Dec 22, 2020
stevejgordon added a commit to elastic/elasticsearch-net that referenced this pull request Jan 14, 2021
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jan 14, 2021
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jan 14, 2021
stevejgordon added a commit to elastic/elasticsearch-net that referenced this pull request Jan 14, 2021
Contributes to #5198
Relates to elastic/elasticsearch#64389

Co-authored-by: Steve Gordon <sgordon@hotmail.co.uk>
stevejgordon added a commit to elastic/elasticsearch-net that referenced this pull request Jan 14, 2021
Contributes to #5198
Relates to elastic/elasticsearch#64389

Co-authored-by: Steve Gordon <sgordon@hotmail.co.uk>
@fw-at-elastic
Copy link

Hey @yangyaofei! I'm Faith from Elastic's Community team. I want to encourage you to check out Elastic's Contributor Program, where you can earn points towards rewards for code contributions like this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants