Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs2-data-xml streaming parser #25

Closed
wants to merge 11 commits into from

Conversation

armanbilge
Copy link
Member

Might close #4 and #3.

Currently failing:

==> X org.http4s.scalaxml.ScalaXmlSuite.parse omitted charset and 8-Bit MIME Entity  0.018s munit.ComparisonFailException: /workspace/http4s-scala-xml/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala:147
146:      .withBodyStream(body)
147:    msg.as[Elem].map(_ \\ "hello" \@ "name").assertEquals(name)
148:  }
values are not the same
=> Obtained
G�nther
=> Diff (- obtained, + expected)
-G�nther
+Günther
==> X org.http4s.scalaxml.ScalaXmlSuite.parse omitted charset and 16-Bit MIME Entity  0.006s org.http4s.MalformedMessageBodyFailure: Malformed message body: Invalid XML
...
Caused by: fs2.data.xml.XmlException: unexpected character '�'
==> X org.http4s.scalaxml.ScalaXmlSuite.parse non-utf charset  0.005s munit.ComparisonFailException: /workspace/http4s-scala-xml/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala:147
146:      .withBodyStream(body)
147:    msg.as[Elem].map(_ \\ "hello" \@ "name").assertEquals(name)
148:  }
values are not the same
=> Obtained

=> Diff (- obtained, + expected)
-
+문재인

This was linked to issues Jun 10, 2022
ThisBuild / tlBaseVersion := "0.23"
ThisBuild / tlMimaPreviousVersions ++= (0 to 11).map(y => s"0.23.$y").toSet
ThisBuild / tlBaseVersion := "0.24"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't decide whether or not to bump the base version. We can make this change binary-compatibly, it's more about semantics.

Comment on lines 33 to 34
@deprecated("0.23.12", "This factory is no longer used")
protected def saxFactory: SAXParserFactory
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the (semantically) breaking change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing SAX will also change the types of errors that are raised.

Comment on lines 64 to 65
case e: Elem => Right(e)
case _ => Left(MalformedMessageBodyFailure("Invalid XML"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this charade, maybe this should become an EntityEncoder[F, Document] instead?

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks compelling as an 0.24. It doesn't emit any partial documents, but it does avoid collecting the entire body into a Chunk[Byte]. It also opens the door to a decoder of streaming events.

I do wonder how the performance compares to SAX. My guess is not favorably, but perhaps similar?

Comment on lines 33 to 34
@deprecated("0.23.12", "This factory is no longer used")
protected def saxFactory: SAXParserFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing SAX will also change the types of errors that are raised.

@armanbilge armanbilge changed the base branch from series/0.23 to series/0.24 June 11, 2022 18:13
Comment on lines 46 to 51
implicit def xmlEvents[F[_]](implicit F: Concurrent[F]): EntityDecoder[F, Stream[F, XmlEvent]] = {
import EntityDecoder._
decodeBy(MediaType.text.xml, MediaType.text.html, MediaType.application.xml) { msg =>
val source = new InputSource()
msg.charset.foreach(cs => source.setEncoding(cs.nioCharset.name))
DecodeResult.successT(msg.bodyText.through(fs2.data.xml.events()))
}
}
Copy link
Member Author

@armanbilge armanbilge Jun 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There, now we can decode as a stream of events too 😁

@rossabaker
Copy link
Member

There's a lot to like here:

  • It's the only way to get incremental events right now. The new decoder seems like a better SAX.
  • It's the only thing that works with Concurrent, without naughty thread blocking. (Mark the loadXML call as blocking #28).
  • It's the only Scala.JS implementation. (I guess scala-xml doesn't support it? I haven't checked.)

There are also reasons I hesitate to retire the 0.23 impl:

Both of these problems could be me being ignorant.

One way forward is to have a parallel SAX and an fs2-data implementation here and let people pick their pros and cons.

Another way forward is to have an http4s-fs2-data repo, which is an alternate way to do XML. An advantage there is that we could implement the other formats that fs2-data supports. I do have a favorable impression of the library! But is there really enough XML demand in http4s to support two XML libraries? (Three, if we count scala-xml-1?)

/cc @sjenna, who is maintaining http4s-scala-xml-1 and might be interested in the proceedings...

@armanbilge
Copy link
Member Author

  • It's the only Scala.JS implementation. (I guess scala-xml doesn't support it? I haven't checked.)

scala-xml itself cross-compiles for Scala.js. The problem is that nobody has ported the Java-lib parser to Scala.js. So it's parser-less.

There are also reasons I hesitate to retire the 0.23 impl:

There's also those test failures I mentioned in #25 (comment) and gnieh/fs2-data#332. I don't understand RFCs well enough to appreciate if they are important or fixable.

But is there really enough XML demand in http4s to support two XML libraries? (Three, if we count scala-xml-1?)

I don't care enough to maintain it 😂 this was low-hanging fruit, and somebody had wanted to use XML with http4s-dom in the browser.

BTW, it's not obvious to me why we need a separate repo for scala-xml-1, how different is it? Since it has a different artifact name but same versioning we could theoretically publish side-by-side from a single set of sources.

@rossabaker
Copy link
Member

I don't know that scala-xml-1 is going to live to see http4s-1.0. But they're getting mostly the same Steward bumps, and backporting #28 (I'm calling it a bugfix) was a pain. That repo is mainly Jenna preserving a cohesive stack after we jumped out ahead of some other libraries (i.e., Twirl). Note that this PR pulls in a scala-xml-2 artifact, and would need to live in a wholly separate module if scala-xml-1 moved back here.

I haven't had time to dig into that encoding failure you hit yet.

@armanbilge
Copy link
Member Author

armanbilge commented Jun 12, 2022

Note that this PR pulls in a scala-xml-2 artifact, and would need to live in a wholly separate module if scala-xml-1 moved back here.

Not just this PR, this repo already pulls in scala-xml-2 anyway :) http4s-scala-xml-1 and http4s-scala-xml are already separate artifacts, but both in the 0.23 series. If their sources are similar enough we can keep one copy and cross-compile it as two separate projects with the separate scala-xml dependencies, all within a single repository.

Seems like supporting fs2-data-xml parsing for scala-xml-1 would basically require this file:
https://github.com/satabin/fs2-data/blob/b2ef5a52b17285cb5c6ae65d0082904723a2b4c5/xml/scala-xml/src/main/scala/fs2/data/xml/scalaXml/package.scala

@@ -46,7 +46,7 @@ trait ElemInstances {
implicit def xmlEvents[F[_]](implicit F: Concurrent[F]): EntityDecoder[F, Stream[F, XmlEvent]] =
EntityDecoder.decodeBy(MediaType.text.xml, MediaType.text.html, MediaType.application.xml) {
msg =>
DecodeResult.successT(msg.bodyText.through(fs2.data.xml.events()))
DecodeResult.successT(msg.bodyText.through(fs2.data.xml.events(includeComments = true)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to make the new round-trip test pass. But is it actually what we want?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I conflated normalization in the generator with normalization in the test in a way I probably shouldn't have. I guess the question is whether comments are part of normalized output. I think it belongs, perhaps with an option to turn it off.

@armanbilge
Copy link
Member Author

==> X org.http4s.scalaxml.ScalaXmlSuite.round trips utf-8  1.431s munit.FailException: Failing seed: wh9v0HPz0cj_gFFHwFNDktHBXQTfCaJD4KAgkQc4xYE=
You can reproduce this failure by adding the following override to your suite:

  override def scalaCheckInitialSeed = "wh9v0HPz0cj_gFFHwFNDktHBXQTfCaJD4KAgkQc4xYE="

Exception raised on property evaluation.
> ARG_0: <Ẵ줐샃뗧饜孫 悊頃ふ퉞="ꨍ邭䋒↏᲎ừ" 듸괎:ʿक턻뽜="촏"/>
> Exception: org.http4s.MalformedMessageBodyFailure: Malformed message body: Invalid XML

@rossabaker
Copy link
Member

So thinking about where to put things...

JDK/Xerces fs2-data aalto
Exists ⚙️
Fast ⚙️ 🏆
Compliant ⚙️
Non-blocking
Streaming ⚙️
JVM
JS

fs2-data has the most upside. If we make an http4s-fs2-data, it's probably the one people will want to use in the long run. If we continue here and make it http4s-scala-xml-0.24, there will be regression in certain facets today, but reason to hope it dominates Xerces in the near future.

Finishing Aalto would not be hard and probably the easiest way to dominate Xerces, and real hard to beat as a JVM only solution. fs2-data could eventually eat its lunch too, but it will be a higher performance bar.

@rossabaker
Copy link
Member

My best argument for a separate http4s-fs2-data is that it would be a natural home for CBOR support. fs2-data is much more than XML, and some of it is relevant. It might also be less discoverable than, like, http4s-cbor. (I don't actively use CBOR... this is mostly hypothetical to see if it makes anyone jump up and raise their hands and yell "ooh!")

@armanbilge
Copy link
Member Author

I do love your fantastic tables! ❤️

Agree, an http4s-fs2-data could make good sense. Specifically to XML, it gives a chance to "incubate" while performance and compliance improve, without being a regression as it would here. More generally, all the other fun integrations like cbor, csv, what-have-you :)

Short term, it will need a bit more activation energy and bike-shedding to get off the ground 😂

@rossabaker
Copy link
Member

I could set up a repo and cherry-pick your commits without much fuss. The slightly more annoying part would be publishing the generators somewhere, as I really don't want multiple copies. I could always move those to a rural Typelevel project.

@rossabaker
Copy link
Member

https://github.com/typelevel/scalacheck-xml

@rossabaker rossabaker changed the base branch from series/0.24 to series/0.23 June 17, 2022 04:55
@rossabaker
Copy link
Member

Change to base branch will help me rebuild this over in http4s-fs2-data, after I, like, sleep and stuff.

@rossabaker
Copy link
Member

Migrated to http4s/http4s-fs2-data#3.

@rossabaker rossabaker closed this Jun 17, 2022
@armanbilge armanbilge deleted the feature/fs2-data-xml-streaming branch June 17, 2022 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fs2-data streaming xml parser? Cross-build for Scala.js
2 participants