Encoding - Characters converted to UTF8-hex #17

Kalpens · 2019-10-21T12:46:32Z

Version 1.4.0
Original text in .msg body:

Char-å-Char
Char-Å-Char
Char-ø-Char
Char-Ø-Char
Char-æ-Char
Char-Æ-Char

After calling parseMsg on the file and inspecting BodyHTML and ConvertedBodyHTML of OutlookMessage, both values are null. BodyRtf does contain the text, but the characters have been converted to UTF-8 hex escapes, so the body in the returned string contains the following:

Char-\'c3\'a5-Char
Char-\'c3\'85-Char
Char-\'c3\'b8-Char
Char-\'c3\'98-Char
Char-\'c3\'a6-Char
Char-\'c3\'86-Char

Each ' is preceded by a backslash, i.e. these are RTF \'hh hex escapes.
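
For reference, the byte pairs in these escapes are the UTF-8 encodings of the original characters (c3 a5 is å, c3 98 is Ø). Below is a minimal Java sketch of decoding such escapes. It assumes the input contains only plain text and `\'hh` escapes, with no other RTF control words. The names `RtfHexEscapes` and `decode` are hypothetical, and this is not how outlook-message-parser or rtf-to-html actually parse RTF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RtfHexEscapes {
    // One or more consecutive \'hh escapes; a run is decoded as a unit so
    // that multi-byte sequences (such as UTF-8 pairs) stay together.
    private static final Pattern RUN = Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    /** Replaces every run of \'hh escapes with the text those bytes encode in the given charset. */
    static String decode(String rtfText, Charset charset) {
        Matcher m = RUN.matcher(rtfText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is exactly four chars: backslash, apostrophe, two hex digits.
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            String decoded = new String(bytes.toByteArray(), charset);
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Char-\\'c3\\'98-Char", StandardCharsets.UTF_8));
    }
}
```

The charset is passed in explicitly because the same escapes decode differently depending on the code page the RTF header declares.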

If I try to convert the RTF extracted from the .msg using the recently forked library "rtf-to-html" (with either RTF2HTMLConverterRFCCompliant or RTF2HTMLConverterClassic), I get the following exception:

Exception in thread "main" java.nio.charset.UnsupportedCharsetException: 65001
    at org.bbottema.rtftohtml.impl.util.CharsetHelper.findCharset(CharsetHelper.java:19)
    at org.bbottema.rtftohtml.impl.RTF2HTMLConverterRFCCompliant.rtf2html(RTF2HTMLConverterRFCCompliant.java:112)

If I use RTF2HTMLConverterJEditorPane, I am able to convert the RTF to HTML, but the result contains some encoding issues. To partially solve them, I first encode the resulting string to a byte array using "Cp1252" and then decode that byte array as a "UTF-8" string. After this I get almost all the results I wanted:

Char-å-Char
Char-Å-Char
Char-ø-Char
Char-#-Char
Char-æ-Char
Char-Æ-Char

As you can see, I am able to convert most of the characters to the correct encoding, except for the Ø character.
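
The Cp1252-to-UTF-8 round trip described above can be sketched in Java roughly as follows. The names `MojibakeRepair` and `fix` are hypothetical. The round trip can only recover a character when every original byte survived the earlier mis-decoding, so a byte that was dropped or substituted before this step cannot be restored by it. That may be what happens to Ø here, but this is an assumption and has not been verified against the converter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Re-encodes text that was wrongly decoded as Cp1252 back to its raw
     * bytes, then decodes those bytes as UTF-8.
     */
    static String fix(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(CP1252);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00A5 is what the UTF-8 bytes c3 a5 look like when read as Cp1252.
        System.out.println(fix("Char-\u00C3\u00A5-Char"));
    }
}
```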

My current solution is to go back to version 1.1.16, where ConvertedBodyHTML is not null. I retrieve that HTML string, encode it to a byte array using "Cp1252", and then decode the byte array as a "UTF-8" string. This way I don't use the newly forked "rtf-to-html" and can get the HTML from OutlookMessageParser itself.

Is there some other workaround to make the newest version of Outlook-Message-Parser work?

bbottema · 2019-10-21T18:49:56Z

Hmm, you should have the exact same result with the classic converter from the rtf-to-html library, which was just a lift and shift for backwards compatibility. I wonder what changed since 1.1.16 to that.

Wait a minute. 1.1.16?? That's from 2017, did you type that correctly? All the versions including 1.1.17 after that don't work?

Kalpens · 2019-10-22T06:20:16Z

> Hmm, you should have the exact same result with the classic converter from the rtf-to-html library, which was just a lift and shift for backwards compatibility. I wonder what changed since 1.1.16 to that.
>
> Wait a minute. 1.1.16?? That's from 2017, did you type that correctly? All the versions including 1.1.17 after that don't work?

Yes, all the way back to 1.1.16

If I switch up to 1.1.17 or 1.1.18, I get an exception similar to the one I get when calling the RTF-to-HTML conversion manually:

Exception in thread "main" java.nio.charset.UnsupportedCharsetException: 65001
    at org.simplejavamail.outlookmessageparser.rtf.util.CharsetHelper.findCharset(CharsetHelper.java:19)
    at org.simplejavamail.outlookmessageparser.rtf.SimpleRTF2HTMLConverter.extractCodepage(SimpleRTF2HTMLConverter.java:44)
    at org.simplejavamail.outlookmessageparser.rtf.SimpleRTF2HTMLConverter.rtf2html(SimpleRTF2HTMLConverter.java:22)

In versions 1.1.19, 1.1.21, 1.2.1, 1.3.0 and 1.4.0, the msg parser returns null from both HTML methods.

In version 1.1.20, the msg parser again logs a similar exception during parsing, but it still completes, and both HTML values are null.

[main] ERROR org.simplejavamail.outlookmessageparser.model.OutlookMessage - Error occurred while extracting compressed RTF from source msg
java.nio.charset.UnsupportedCharsetException: 65001
    at org.simplejavamail.outlookmessageparser.rtf.util.CharsetHelper.findCharset(CharsetHelper.java:19) ~[outlook-message-parser-1.1.20.jar:?]
    at org.simplejavamail.outlookmessageparser.rtf.SimpleRTF2HTMLConverter.extractCodepage(SimpleRTF2HTMLConverter.java:44) ~[outlook-message-parser-1.1.20.jar:?]
    at org.simplejavamail.outlookmessageparser.rtf.SimpleRTF2HTMLConverter.rtf2html(SimpleRTF2HTMLConverter.java:22) ~[outlook-message-parser-1.1.20.jar:?]

bbottema · 2019-10-22T06:51:29Z

wow

Ok, is it possible for you to share your .msg? Then I can start tracing down which change breaks it.

Kalpens · 2019-10-22T07:00:42Z

> wow
>
> Ok, is it possible for you to share your .msg? Then I can start tracing down which change breaks it.

Sure, here is a link to the .msg file.
https://tiny.cc/bbzxez

bbottema · 2019-10-22T08:42:03Z

Fix released in 1.4.1. Thanks for the report!
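
The stack traces in this thread fail inside a charset lookup for 65001, which is the Windows code page identifier for UTF-8 and which the JVM in those traces evidently did not accept as a charset name. Below is a minimal sketch of a more tolerant lookup, assuming the code page arrives as an integer. `CodePageLookup` and `forCodePage` are hypothetical names, and this is not the actual CharsetHelper code from rtf-to-html.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class CodePageLookup {
    /** Maps a numeric Windows code page to a Java Charset, special-casing 65001 as UTF-8. */
    static Charset forCodePage(int codePage) {
        if (codePage == 65001) {
            return StandardCharsets.UTF_8;
        }
        // Try the common naming schemes the JVM may know the code page by.
        String[] candidates = {"cp" + codePage, "windows-" + codePage, String.valueOf(codePage)};
        for (String name : candidates) {
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException ignored) {
                // Not a syntactically legal charset name; try the next candidate.
            }
        }
        throw new UnsupportedCharsetException(String.valueOf(codePage));
    }

    public static void main(String[] args) {
        System.out.println(forCodePage(65001));
        System.out.println(forCodePage(1252));
    }
}
```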

bbottema added bug need user input labels Oct 21, 2019

bbottema pushed a commit to bbottema/rtf-to-html that referenced this issue Oct 22, 2019

#1 (bbottema/outlook-message-parser/issues/17): Restored support for UTF-8's legacy name (cp)65001 (20a8f47)

bbottema pushed a commit that referenced this issue Oct 22, 2019

#17: Updated rtf-to-html dependency which fixed support for older UTF-8 character set name (cb709e3)

bbottema closed this as completed Oct 22, 2019

bbottema removed the need user input label Oct 22, 2019
