Improve importing images from Microsoft Word #4291

Merged
merged 41 commits into from
Oct 16, 2020

Conversation

Comandeer
Member

@Comandeer Comandeer commented Sep 20, 2020

What is the purpose of this pull request?

Bug fix

Does your PR contain necessary tests?

All patches that change the editor code must include tests. You can always read more
on PR testing,
how to set the testing environment and
how to create tests
in the official CKEditor documentation.

This PR contains

  • Unit tests
  • Manual tests

Did you follow the CKEditor 4 code style guide?

Your code should follow the guidelines from the CKEditor 4 code style guide which helps keep the entire codebase consistent.

  • PR is consistent with the code style guide

What is the proposed changelog entry for this pull request?

Fixed Issues:

* [#2800](https://github.com/ckeditor/ckeditor4/issues/2800): Fixed: No images are imported from Microsoft Word if there is at least one image in an unsupported format.

API Changes:

* [#3782](https://github.com/ckeditor/ckeditor4/issues/3782): Merged [`CKEDITOR.plugins.pastetools.filters.word.images`](https://ckeditor.com/docs/ckeditor4/latest/api/CKEDITOR_plugins_pastetools_filters_word_images.html) into [`CKEDITOR.plugins.pastetools.filters.image`](https://ckeditor.com/docs/ckeditor4/latest/api/CKEDITOR_plugins_pastetools_filters_image.html).

What changes did you make?

Ok, this was a really tough one:

  1. I've implemented a very primitive RTF parser to get rid of headers and footers before further parsing the document.
  2. Now all images are extracted, even the ones in unknown formats. Thanks to that, we can at least partially render images in pasted content (the ones in unknown formats are still broken, but the rest are rendered correctly). Support for these formats is tracked in Add support for EMF, WMF and other image formats #4290.
  3. Some images are inserted several times: in different formats (e.g. PNG & WMF) or as an additional version for non-Word readers. The non-Word versions are dismissed, and an image is not extracted if image data already exists for its id.
  4. Some images are WordArt shapes and they are not included in HTML source as img tags. Fortunately they are easily filtered by using \defshp.
  5. Images inserted more than once have duplicated unique ids, so I had to differentiate between that case and the same image stored in different formats.
  6. I've merged the whole CKEDITOR.plugins.pastetools.filters.word.images into CKEDITOR.plugins.pastetools.filters.image. It's a potentially breaking change; however, it touches only private members of the API.
  7. I've got rid of our old extracting based on regexes and I've switched to the parser I prepared in 1.
  8. I've made CKEDITOR.pasteFilters the real alias of CKEDITOR.plugins.pastetools.filters.
  9. I've added and modified a lot of API docs.
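Points 1 and 7 above replace regex-based extraction with a small parser. The sketch below shows the general idea only: walk the RTF text, track brace depth, and drop whole groups that start with a given control word. The names `findGroupEnd`, `startsWithControlWord` and `removeRtfGroups` are hypothetical and are not the helpers added in this PR; the sketch also assumes the input contains no raw `\bin` data with unescaped braces.

```javascript
// Hypothetical sketch, not the PR's implementation.

// Returns the index of the "}" that closes the group opened at `start`, or -1.
function findGroupEnd( rtf, start ) {
	var depth = 0;

	for ( var i = start; i < rtf.length; i++ ) {
		var ch = rtf.charAt( i );

		if ( ch === '\\' ) {
			i++; // Skip the character after a backslash ("\{", "\}", "\\").
		} else if ( ch === '{' ) {
			depth++;
		} else if ( ch === '}' && --depth === 0 ) {
			return i;
		}
	}

	return -1; // Unbalanced braces.
}

// Checks whether the group opened at `index` starts with one of the control words.
function startsWithControlWord( rtf, index, controlWords ) {
	for ( var i = 0; i < controlWords.length; i++ ) {
		var prefix = '{\\' + controlWords[ i ],
			next = rtf.charAt( index + prefix.length );

		// The word must end here, so "header" does not match "\headerl".
		if ( rtf.substr( index, prefix.length ) === prefix && !/[a-z]/i.test( next ) ) {
			return true;
		}
	}

	return false;
}

// Returns `rtf` without the groups that start with any of `controlWords`.
function removeRtfGroups( rtf, controlWords ) {
	var result = '',
		i = 0;

	while ( i < rtf.length ) {
		var ch = rtf.charAt( i );

		if ( ch === '\\' ) {
			// Copy the backslash together with the character that follows it.
			result += rtf.substr( i, 2 );
			i += 2;
		} else if ( ch === '{' && startsWithControlWord( rtf, i, controlWords ) ) {
			var end = findGroupEnd( rtf, i );

			if ( end === -1 ) {
				return result + rtf.slice( i ); // Leave malformed input untouched.
			}

			i = end + 1; // Skip the whole group, nested groups included.
		} else {
			result += ch;
			i++;
		}
	}

	return result;
}

var sample = '{\\rtf1{\\header Page top}Body{\\footer {\\b nested} end}}';
console.log( removeRtfGroups( sample, [ 'header', 'footer' ] ) ); // {\rtf1Body}
```

Counting braces rather than matching with a regex is what lets nested groups inside a header or footer be removed in one pass.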

Which issues does your PR resolve?

Closes #2800.
Closes #3782.
Closes #4297.

@f1ames f1ames self-requested a review September 22, 2020 11:36
@f1ames f1ames self-assigned this Sep 22, 2020
Contributor

@f1ames f1ames left a comment

The Fix

I see it still doesn't work with some of the samples added in the initial issue (tested on Chrome 85 on Windows 10):

  1. Images not imported from word #2800 (comment) - still no images at all
  2. Images not imported from word #2800 (comment) - works fine 👍
  3. Images not imported from word #2800 (comment) - works almost fine (some images are missing, see below).
  4. Images not imported from word #2800 (comment) - first image doesn't show up.

Related to the 3rd point above: pages 20 and 21 result in broken images. I suspect these are .emz files because there were some related errors in the console, but it would be good to check the exact cause.

Code

I would be for making this RTF parser a more generic thing (moving it to pastetools, maybe?) and reworking the removeHeadersAndFooters() and removeMatchedGroup() methods (see comments in the code).

Manual Tests

It might be good to add another manual test which covers all formats (shapes, WMF, EMF, PNG, JPG, GIF and also some unsupported ones, BMP or something?) to see how it works with complex content in multiple different formats. Or just add the original .docx files from the issue itself (mentioned at the beginning of this review).

Others

From what I see you can paste header/footer content explicitly, but I assume it is intended as it doesn't happen during regular copy/paste?

Could you also review other related tickets mentioned in #2800 to see if it solves other issues too?

Other issues which mention Paste From MS Word and images problems: #3972, #3937, #3782, #3781, #2675, #2516, #1345, #1134

tests/plugins/pastefromword/manual/imagesunsupported.md
plugins/pastefromword/filter/default.js
@Comandeer Comandeer self-assigned this Sep 23, 2020
@Comandeer
Member Author

I'm wondering if we can't introduce a new error that will be displayed when an image can't be inserted because its format is unsupported. WDYT?

@f1ames
Contributor

f1ames commented Sep 23, 2020

I'm wondering if we can't introduce a new error that will be displayed when an image can't be inserted because its format is unsupported. WDYT?

It would be more descriptive than some random errors for sure. I wonder how much information this error can provide to guide the user and be descriptive enough... Probably the file extension? File names are just random strings generated by Word and not related to the original file names, AFAIR?

@Comandeer
Member Author

I've rebased onto latest major as it seems that this PR is going to be even bigger than I thought, especially now when the whole mechanism of extracting images is rewritten.

@Comandeer
Member Author

From what I see you can paste header/footer content explicitly, but I assume it is intended as it doesn't happen during regular copy/paste?

Yes, it just works this way, I didn't change anything here.

I would be for making this RTF parser a more generic thing (moving it to pastetools, maybe?) and reworking the removeHeadersAndFooters() and removeMatchedGroup() methods (see comments in the code).

Done 👍

@Comandeer
Member Author

  1. #2800 (comment) - still no images at all

Fixed. It seems that if an image is inserted more than once, every copy has the same unique id. According to our newly introduced error message, there is only one image not rendered correctly in this document. It is probably the list marker, because I don't see any other incorrectly rendered image 🤔
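The thread does not spell out how a repeated image is told apart from the same image stored in another format. One possible heuristic, sketched here purely as an assumption, is to treat "same id, different type" as an alternative format and "same id, same type" as a second occurrence. The `pickImages` function and the `{ id, type }` shape are hypothetical.

```javascript
// Hypothetical sketch: `images` is a list of { id, type } objects in document order.
function pickImages( images ) {
	var keptTypes = {}, // id -> type of the image kept for that id
		kept = [];

	for ( var i = 0; i < images.length; i++ ) {
		var image = images[ i ],
			known = Object.prototype.hasOwnProperty.call( keptTypes, image.id );

		// Assumption: same id but a different type means an alternative
		// format of an image that was already kept, so it is skipped.
		if ( known && keptTypes[ image.id ] !== image.type ) {
			continue;
		}

		// New id, or same id and same type: kept as a separate occurrence.
		keptTypes[ image.id ] = image.type;
		kept.push( image );
	}

	return kept;
}
```

With this rule, a PNG followed by its WMF twin yields one image, while the same PNG inserted twice yields two.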

  3. #2800 (comment) - works almost fine (some images are missing, see below).

Yup, these two images are in EMF format, currently unsupported. Covered by our new error message.

  4. #2800 (comment) - first image doesn't show up.

The first image is also in EMF format.
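For illustration, the picture format can be read from the control words inside an RTF `\pict` group (`\pngblip`, `\jpegblip`, `\emfblip`, `\wmetafileN`). The sketch below is an assumption-laden outline, not the PR's code: `getPictType`, `reportUnsupported` and the `report` callback are hypothetical, and the list of supported types is assumed. The error code and payload shape mirror the `pastetools-unsupported-image` log line quoted later in this thread.

```javascript
// Picture-type control words from the RTF format, mapped to MIME types.
var RTF_IMAGE_TYPES = [
	{ pattern: /\\pngblip/, type: 'image/png' },
	{ pattern: /\\jpegblip/, type: 'image/jpeg' },
	{ pattern: /\\emfblip/, type: 'image/emf' },
	{ pattern: /\\wmetafile\d/, type: 'image/wmf' }
];

// Assumption for this sketch: only these formats can be rendered.
var SUPPORTED_TYPES = [ 'image/png', 'image/jpeg' ];

// Returns the MIME type of a "{\pict ...}" group or "unknown".
function getPictType( pictGroup ) {
	for ( var i = 0; i < RTF_IMAGE_TYPES.length; i++ ) {
		if ( RTF_IMAGE_TYPES[ i ].pattern.test( pictGroup ) ) {
			return RTF_IMAGE_TYPES[ i ].type;
		}
	}

	return 'unknown';
}

// Calls `report( code, data )` once for every picture in an unsupported format.
function reportUnsupported( pictGroups, report ) {
	for ( var i = 0; i < pictGroups.length; i++ ) {
		var type = getPictType( pictGroups[ i ] );

		if ( SUPPORTED_TYPES.indexOf( type ) === -1 ) {
			report( 'pastetools-unsupported-image', { type: type, index: i } );
		}
	}
}
```

In the editor, the `report` argument would presumably be something like `CKEDITOR.error`, which accepts an error code and additional data.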

@Comandeer
Member Author

Could you also review other related tickets mentioned in #2800 to see if it solves other issues too?

Only #3782 is also fixed by this PR.

@Comandeer
Member Author

OK, I've updated the PR's description to sum up the major changes here.

@f1ames f1ames self-assigned this Sep 30, 2020
Contributor

@f1ames f1ames left a comment

The feature

Haven't found any issues 👏 Also the error message is pretty descriptive and should help users to understand better what's going on 👍

  1. Images not imported from word #2800 (comment) - works fine, first image is WMF (error shows correctly) 👍
  2. Images not imported from word #2800 (comment) - works fine 👍
  3. Images not imported from word #2800 (comment) - works fine (apart from IE11, see below), two EMF images 👍
  4. Images not imported from word #2800 (comment) - works fine, error for the first images is shown 👍

As for 3. and IE11, there are no images for me. Since the changes in this PR touch only browsers with the Clipboard API, I assume it worked the same before, but it would be good to confirm that.

The code

Haven't looked at pastetools/filter/*.js closely yet, but since the rest looks good I expect some minor polishing only.

Manual tests

Since the changes touch paste tools (which are used by PfLO too, as also mentioned here), maybe we could add new PfLO manual tests or tag the existing ones, so that we are sure to check it during the testing phase too?

It would be good if we could add resize plugin (or just increase editor height), because now testing is pretty painful with longer content.

Unit tests

One test fails in IE11 and IE10 (probably IE9 and IE8 too, but haven't checked):
image

Again regarding Libre Office, I guess handling images is mainly covered in tests/plugins/pastetools/filter/image.js test, but maybe some dedicated PfLO test could be useful here? Unless, we have all cases already covered in existing PfLO test? 🤔

Docs

I see documentation builds fine and API docs are generated correctly. I have one doubt regarding @removed annotation (see review comments).

Others

I think we don't need this file - https://github.com/ckeditor/ckeditor4/blob/afc87ba0463c74ec76ec5b0f4a64fc60c0c0cd43/tests/plugins/pastefromword/generated/_fixtures/ImagesExtraction/DuplicatedImage/~%24plicatedImage.docx

@Comandeer Comandeer self-assigned this Oct 2, 2020
@Comandeer
Member Author

The images in your review comment were not linked correctly, but I've just checked all the linked docs in the issue and they seem to work correctly. In case of IE11 it works the same as on major. I've also added some tests for PfLO.

@Comandeer Comandeer requested a review from f1ames October 2, 2020 15:43
@f1ames f1ames self-assigned this Oct 6, 2020
Contributor

@f1ames f1ames left a comment

Really impressive job on this one 👍

And thanks to the RTF fixtures you unlocked a new achievement (I suppose for the first time ever 🤔): more than 1 million additions in a single PR 😱

image

Some very minor things to polish in the code itself and we are good to go 🔥

plugins/pastetools/filter/common.js
plugins/pastetools/filter/image.js
tests/plugins/pastetools/filter/rtf.js
@f1ames
Contributor

f1ames commented Oct 9, 2020

@ckeditor/qa-team could you take a look and test this PR too to check if everything works fine without any regressions?

@Comandeer
Member Author

Ok, I've added examples to API docs and added some unit tests for RTF helpers.

@Comandeer Comandeer requested a review from f1ames October 10, 2020 21:12
@FilipTokarski
Member

Steps:

  1. Open http://localhost:1030/tests/plugins/pastefromword/manual/imagesduplicated
  2. Download attached docx file, open it and in both images change layout options to wrap text around images
  3. ctrl+a and copy
  4. Paste into editor
[CKEDITOR] Error code: pastetools-unsupported-image. {type: "image/wmf", index: 0}

paste_1

Notes:

  • if you change options of only one image, it works ok
  • seems to be ok on major branch

@FilipTokarski
Member

Ok, I didn't find anything new apart from this issue mentioned above. The rest seems to be ok 👍

@Comandeer Comandeer self-assigned this Oct 14, 2020
@Comandeer
Member Author

Word treats images with changed wrapping options as shapes and adds an additional \pict group to them, which causes the issue. Removing the additional \pict (the one enclosed inside the \shprslt group) seems to do the trick.
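A rough sketch of that idea, assuming a simplified stand-in for Word's shape markup: walk the RTF, keep a stack of open groups, and collect only the `\pict` groups that have no `\shprslt` ancestor. The helper names (`getGroupName`, `isInside`, `extractPictGroups`) are hypothetical and do not come from the PR.

```javascript
// Returns the first control word of the group opened at `braceIndex`,
// e.g. "pict" for "{\pict..." and "shpinst" for "{\*\shpinst...".
function getGroupName( rtf, braceIndex ) {
	var match = /^\{(?:\\\*)?\\([a-z]+)/i.exec( rtf.substr( braceIndex, 40 ) );

	return match ? match[ 1 ] : null;
}

// Checks whether any currently open group has the given name.
function isInside( stack, name ) {
	for ( var i = 0; i < stack.length; i++ ) {
		if ( stack[ i ].name === name ) {
			return true;
		}
	}

	return false;
}

// Collects "{\pict ...}" groups, skipping those nested inside "\shprslt".
function extractPictGroups( rtf ) {
	var stack = [], // Open groups as { name, start }.
		picts = [];

	for ( var i = 0; i < rtf.length; i++ ) {
		var ch = rtf.charAt( i );

		if ( ch === '\\' ) {
			i++; // Skip the character after a backslash.
		} else if ( ch === '{' ) {
			stack.push( { name: getGroupName( rtf, i ), start: i } );
		} else if ( ch === '}' && stack.length ) {
			var group = stack.pop();

			if ( group.name === 'pict' && !isInside( stack, 'shprslt' ) ) {
				picts.push( rtf.slice( group.start, i + 1 ) );
			}
		}
	}

	return picts;
}
```

Filtering while extracting avoids a separate removal pass, although removing the `\shprslt` content first, as described in the comment, would presumably give the same result.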

@Comandeer Comandeer removed their assignment Oct 14, 2020
@f1ames f1ames self-assigned this Oct 15, 2020
Contributor

@f1ames f1ames left a comment

LGTM 👍 👏 🎉

@f1ames
Contributor

f1ames commented Oct 15, 2020

@FilipTokarski could you also verify on your side that everything works fine?

@FilipTokarski
Member

@Comandeer Could you check if you can ctrl+a ctrl+c and paste content of the file from this comment into manual test on Chrome? It's strange because it works fine for @f1ames and I'm getting:

paste_3

@Comandeer
Member Author

@FilipTokarski, it works for me. From what I see on the recording, it seems that you test it on some old version 🤔 Even if images weren't rendered correctly, there should be more specific errors (e.g. about unsupported image formats or incorrect image extraction).

@f1ames f1ames self-assigned this Oct 16, 2020
@f1ames
Contributor

f1ames commented Oct 16, 2020

Works for me too. Together with @FilipTokarski we have checked whether the issue comes from e.g. different software/OS versions, but that doesn't seem to be the cause. @FilipTokarski also did a fresh checkout, but it still doesn't work for him.

I asked @FilipTokarski to use our getclipboard.html dev tool to post the raw HTML/RTF and the processed data that result from pasting this document. Maybe it will give us some hint about what's going on.

@FilipTokarski
Member

I used getclipboard.html with this file:
paste_sample.docx

Results:

  1. Raw HTML Data Received: Data from: dataTransfer.getData( 'text/html', true )
    https://gist.github.com/FilipTokarski/f806d9fa094588743c2aaa78c91aa3be
  2. Raw RTF Data Received: dataTransfer.getData( 'text/rtf', true ) was empty
  3. After paste processing: https://gist.github.com/FilipTokarski/94446d78e443c71540b2c5fb9c68e3c5
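The raw data listed above can be collected with a few lines of code. This is a minimal sketch, assuming an object that exposes `getData( type, getNative )` the way the calls quoted in the results do; `dumpClipboard` is a hypothetical name and is not part of getclipboard.html.

```javascript
// Reads the raw HTML and RTF clipboard data; empty values are reported as null.
function dumpClipboard( dataTransfer ) {
	var types = [ 'text/html', 'text/rtf' ],
		result = {};

	for ( var i = 0; i < types.length; i++ ) {
		// The second argument asks for native (unprocessed) data.
		var data = dataTransfer.getData( types[ i ], true );

		result[ types[ i ] ] = data ? data : null;
	}

	return result;
}
```

Since this PR extracts image data by parsing RTF, an empty `text/rtf` entry would presumably leave nothing to extract images from, which may be relevant to the results above.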

Member

@FilipTokarski FilipTokarski left a comment

After talking to @f1ames, we concluded that this is most probably an edge case related to my environment and not caused directly by the changes in this PR. In the future, however, we should closely monitor any bug reports concerning pasting from Word, as I suspect that sooner or later someone might stumble upon a similar problem.
