Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for iso-8859-1 #149

Open
qLag opened this issue Nov 25, 2020 · 7 comments
Open

Support for iso-8859-1 #149

qLag opened this issue Nov 25, 2020 · 7 comments

Comments

@qLag
Copy link

qLag commented Nov 25, 2020

I try to parse an XML that comes from an iso-8859-1 API (somes strings have french accents).

Unfortunately, tikXml seems only to work with UTF-8.

I tried to use a TypeConverter :

`class StringUT8Converter : TypeConverter {

override fun read(value: String): String {
    return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
}

override fun write(value: String): String {
    return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)

}

}
`
but it doesn't work.

Do you think you can include other encodings than UTF-8 (for poor old webservices 😝 )?

Thx

@qLag
Copy link
Author

qLag commented Nov 25, 2020

I didn't try but is it not possible to use buffer.read(_index_, this.charset) instead of buffer.readUtf!(_index_) in all the file XmlReader (with a default charset = Charsets.UTF_8)

And give the the possibility to define a custom Charset with TikXmlConfig that will be used in TikXml.java :
XmlReader reader = XmlReader.of(source, config.charset);

@reline
Copy link
Contributor

reline commented Nov 25, 2020

@qLag It is possible, Okio's API allows you to provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString()

I think the only issue is skipping the leading BOM for each charset. This is the current implementation.

private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
  // Look for UTF-8 BOM sequence 0xEFBBBF and skip it
  if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
    source.skip(3);
  }
  ...
}

Not sure if this is the most optimal way to support skipping the BOM for each charset, but here's how OkHttp does it for several UTF charsets.
https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265

Edit:
FWIW, Moshi doesn't skip the BOM, you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another avenue of approach.

@reline
Copy link
Contributor

reline commented Nov 26, 2020

I made a draft here #150, needs unit tests but I went ahead and started the leg work.

@qLag
Copy link
Author

qLag commented Nov 30, 2020

Hi reline,
Thank for your support :) I will please to test your feature when it gets ready 👍

@reline
Copy link
Contributor

reline commented Dec 3, 2020

@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs.
I'd like to get more feedback from the maintainers now.

@qLag
Copy link
Author

qLag commented Dec 16, 2020

Hi reline,

I tried your draft using this line in Gradle :
implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'

And this in my code :
val tikXml = TikXml.Builder() .charset(Charsets.ISO_8859_1) .exceptionOnUnreadXml(false) .build()

And... it works great ! 👍 😊 🎉
I needed to add these lines too in my build.gralde to make it work :
packagingOptions { exclude 'META-INF/gradle/incremental.annotation.processors' }

Its a really good new. How can we proceed now to be included in Tickaroo/tikXML ?
Thanks again :)

Qlag

@reline
Copy link
Contributor

reline commented Dec 22, 2020

@qLag Glad that worked for you!

I updated the PR with some unit tests, only significant difference I made was fixing the XML declaration when writing in charsets other than UTF-8.

- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);

Is anyone available to review it? @sockeqwe @Bodo1981

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants