Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace java.net.HTTPUrlConnection with java.net.http.HttpClient (500USD Bounty) #150

Closed
lihaoyi opened this issue Jan 24, 2024 · 32 comments · Fixed by #158
Closed

Replace java.net.HTTPUrlConnection with java.net.http.HttpClient (500USD Bounty) #150

lihaoyi opened this issue Jan 24, 2024 · 32 comments · Fixed by #158

Comments

@lihaoyi
Copy link
Member

lihaoyi commented Jan 24, 2024

This would fix #123, and put us in a better state going forward.

HttpClient was standardized on Java 11, which is probably old enough we can rely on it already

HttpClient provides an async API, and we should expose it as requests.get.async, requests.post.async, etc.. Apart from that, the rest of the API and test suite should be unchanged. All existing tests should pass, with any unavoidable failures or necessary changes in the tests called out in the ticket

This should also resolve #123

To incentivize contribution, I'm putting a 500USD bounty on resolving this ticket. This is payable via bank transfer, and at my discretion in case of ambiguity.

@lihaoyi lihaoyi changed the title Replace java.net.HTTPUrlConnection with java.net.http.HttpClient Replace java.net.HTTPUrlConnection with java.net.http.HttpClient (500USD Bounty) May 14, 2024
@nafg
Copy link
Contributor

nafg commented May 14, 2024

I'll take a stab at this now

@nafg
Copy link
Contributor

nafg commented May 15, 2024

HTTPUrlConnection has you write the request body by exposing an OutputStream, and RequestBlob is designed around this.

OTOH with HttpClient you pass a BodyPublisher, which is an abstraction analogous to RequestBlob other than the headers, that has helpers given you have a byte array or InputStream or other common cases. Otherwise it's based on java.util.concurrent.Flow.Publisher so you can wrap one of those or extend it.

Can I change the RequestBlob API, or do we have to work with it? In particular I'm not sure how to adapt it to a BodyPublisher without either loading it all into memory or starting another thread.

TBH I'm not sure why it has to take something as powerful as Publisher which can have multiple subscribers.

@lihaoyi
Copy link
Member Author

lihaoyi commented May 15, 2024

@nafg does HttpResponse.BodyHandlers.ofInputStream do what we want? At at glance it seems to do the right thing - streaming the data and making it available to whoever reads the stream as it downloads.

@nafg
Copy link
Contributor

nafg commented May 15, 2024

@nafg does HttpResponse.BodyHandlers.ofInputStream do what we want? At at glance it seems to do the right thing - streaming the data and making it available to whoever reads the stream as it downloads.

I guess you mean BodyPublisher (BodyHandler would be for the response body). It does "the right thing" as you say, but I don't know how to adapt a RequestBlob to it (without reading it into memory or starting another thread).

@nafg
Copy link
Contributor

nafg commented May 15, 2024

As I said, the cleanest implementation would be to completely break the RequestBlob API. (Of course we would still support implicitly adapting the same types of values.)

@nafg
Copy link
Contributor

nafg commented May 15, 2024

Maybe I could just create a connected PipedOutputStream and PipedInputStream and pass the former to data.write and the latter to BodyPublisher.ofInputStream. But where would write be called?

@lihaoyi
Copy link
Member Author

lihaoyi commented May 15, 2024

Sorry, I mixed up BodyHandler and BodyPublisher.

If I understand the problem right, it is that RequestBlob exposes a push-based def write(out: java.io.OutputStream): Unit for people to implement, whereas HttpRequest.BodyPublisher.ofInputStream requires a pull-based Supplier[? extends InputStream], and the pull-based approach is fundamentally harder to implement that the original push-based approach and thus incompatible.

Making RequestBlob pull-based rather than push-based is a no-go. Pull-based APIs are much harder to implement than pull-based APIs, and it would mean the Writables coming from e.g. uPickle or Scalatags or other libraries would not be compatible with this API. Making uPickle, Scalatags, etc. support pull-based serialization would be a full rewrite and is unfeasible

Spawning a thread is an option, but an expensive one. What if we instead use the Java HTTPClient's sendAsync, and then use the calling thread to call RequestBlob#write and block on it? The calling thread isn't doing anything anyway when using sendAsync, so it costs us nothing to use it to call .write. That way we don't need to pay the cost of a new thread while the calling thread sits idle

I'm actually not sure exactly how the async concurrency story works with java.net.http.HttpClient. Does it need an Executor, ExecutionContext, or Threadpool of some sort? We need to sort that out as part of this ticket so we can understand what the consequences of this replacement will be

@nafg
Copy link
Contributor

nafg commented May 15, 2024

I'm actually not sure exactly how the async concurrency story works with java.net.http.HttpClient. Does it need an Executor, ExecutionContext, or Threadpool of some sort? We need to sort that out as part of this ticket so we can understand what the consequences of this replacement will be

Yeah, the builder takes an Executor IIRC

@nafg
Copy link
Contributor

nafg commented May 15, 2024

Spawning a thread is an option, but an expensive one. What if we instead use the Java HTTPClient's sendAsync, and then use the calling thread to call RequestBlob#write and block on it? The calling thread isn't doing anything anyway when using sendAsync, so it costs us nothing to use it to call .write. That way we don't need to pay the cost of a new thread while the calling thread sits idle

Does that mean that we give it an InputStream that will block until we write to the OutputStream? Not sure I follow.

@lihaoyi
Copy link
Member Author

lihaoyi commented May 15, 2024

I'm not totally sure myself TBH. But if you said you can solve the problem by spawning a thread, it seems like you should be able to solve the problem by using sendAsync and taking advantage of the calling thread.

@nafg
Copy link
Contributor

nafg commented May 16, 2024

Do I need to handle chunkedUpload==false

@nafg
Copy link
Contributor

nafg commented May 16, 2024

I don't see an equivalent of connection.getResponseMessage

@lihaoyi
Copy link
Member Author

lihaoyi commented May 16, 2024

@nafg You'll have to dig through the code and tell me what the tradeoffs and necessary changes are

@nafg
Copy link
Contributor

nafg commented May 16, 2024

Re response message (reason phrase) see https://stackoverflow.com/a/63576759/333643

Some points from there: HTTP 2 and some HTTP 1 servers don't provide it, HttpClient has no API to get it, and it doesn't really have any purpose.

Two options:

  1. Preserve compatibility using a table of strings per status code
  2. Remove the field from case class Response

@nafg
Copy link
Contributor

nafg commented May 16, 2024

I guess we'll need more breaking changes or fake compatibility: https://stackoverflow.com/q/53617574/333643 (can't control keep-alive)

@nafg
Copy link
Contributor

nafg commented May 16, 2024

It doesn't let you set Content-Length manually; it should be determined by the BodyPublisher. Which means RequestBlob can't define it among an arbitrary list of headers.

If I can make breaking changes as desired the most natural replacement would be

trait RequestBlob {
  def bodyPublisher: BodyPublisher
  def contentType: String
}

There are less breaking ways to do it though.

I guess that's really the theme of all my questions. Is this supposed to be "preserve the interface, swap the implementation" or is this going to be a V2 designed around HttpClient? If it's somewhere in the middle then I have to consult on all the specifics I guess.

@lihaoyi
Copy link
Member Author

lihaoyi commented May 16, 2024

I guess that's really the theme of all my questions. Is this supposed to be "preserve the interface, swap the implementation" or is this going to be a V2 designed around HttpClient? If it's somewhere in the middle then I have to consult on all the specifics I guess.

I think it's somewhere in the middle:

  1. Breaking changes are fine, especially ones that the Java 11 HTTP design explicitly decided were low priority enough to not support. Requests-Scala isn't 1.0.0 yet, so we are not promising people that we'll keep things strictly compatible, so it's just a best effort thing and if users need to tweak code in some edge cases to update that's fine.

  2. But integration with uPickle, Scalatags, etc. via geny.Writable should not break, if at all possible. Being able to stream stuff seamlessly between the different libraries is a big deal, and not a capability I would give up lightly. Swapping over to geny.Readable to require it read from an InputStream is not feasible, because Readable is tremendously harder to implement than Writable

At least for now, it seems we have "use sendAsync and use the calling thread to call .write" as a path forward for keeping (2) working. If that doesn't work, then we'll have to reconsider: maybe we don't want to do this at all, maybe we want to pull in some third-party HTTP library, maybe something else. But we can cross that bridge when we reach it

@nafg
Copy link
Contributor

nafg commented May 20, 2024

I have all tests passing except gzipError which does requests.head("https://api.github.com/users/lihaoyi").

For some reason it returns a 400. I don't know why, and I don't know what it has to do with gzip. Any thoughts?

@lihaoyi
Copy link
Member Author

lihaoyi commented May 21, 2024

@nafg not sure, you'll have to dig into it to take a look. Maybe try setting up a proxy to intercept the requests and compare the difference between the old and new?

@nafg
Copy link
Contributor

nafg commented May 23, 2024

Just logging this here to remember it. I used webhook.site to see what the gzipError test was doing different. I saved each as HAR JSON and compared with https://jsoncompare.com/#!/diff/id=a5efdd49952712265bd06d4e8be477cb&fullscreen&sortAlphabetically/ and got this:

image

@nafg
Copy link
Contributor

nafg commented May 23, 2024

so:

  1. The old version includes cache-control: no-cache
  2. The old version includes content-length: while the new version does content-length: 0
  3. The old version includes pragma: no-cache
  4. The new version includes transfer-encoding: chunked

I'm guessing the issue (if any of the above and not something more subtle) is (4).

But I still don't know what the purpose of the test is.

@nafg
Copy link
Contributor

nafg commented May 23, 2024

Ok it's from c65a2be and seems to be part of #95 trying to fix #37.

So basically it's trying to assert that HEAD requests that accept gzipped encoding don't throw an exception trying to read a non-existent body.

@nafg
Copy link
Contributor

nafg commented May 24, 2024

So using a BodyPublisher with a fixed length when contentLength is known (as would be correct anyway) fixes the 400.

But it creates another issue, which I think is a bug in the current behavior that now crashes: Content-Length should be the size after compression.

Since there's no way to know the length after compression without compressing it, that basically means that known-content length RequestBlobs must be written into memory, which means there's no point in them specifying their content length at all. They should just say if they should be streamed or not.

@lihaoyi
Copy link
Member Author

lihaoyi commented May 25, 2024

The gzip + content length thing seems like it might need an API change to make it work.

What if we defaulted to yes streaming + no content length, and provided a Compress.GzipBuffered that did the compression in memory and thus was yes content length + no streaming?

I don't see an easy way to make the decision automatically in a principled fashion, so maybe the best we can do is let the user choose

@nafg
Copy link
Contributor

nafg commented May 26, 2024

The gzip + content length thing seems like it might need an API change to make it work.

What if we defaulted to yes streaming + no content length, and provided a Compress.GzipBuffered that did the compression in memory and thus was yes content length + no streaming?

I don't see an easy way to make the decision automatically in a principled fashion, so maybe the best we can do is let the user choose

I didn't fully understand that.

I thought we needed an API change anyway and that was okay. I'll implement my last idea in the near future and push my code up. I think that should help us communicate ;)

@nafg
Copy link
Contributor

nafg commented May 27, 2024

Regarding the last issue, what I did for now is that the request body is fixed-length if it's empty or has a known length and is uncompressed. If it is compressed or its length is unknown then it will be streamed.

In both cases the request is first initiated with sendAsync and then in the calling thread we call .write just as was done until now (as you suggested).

(Their API is a bit high level but IIUC fixed-length would mean it includes a Content-Length header and streaming means it will be Transfer-Encoding: chunked.)

This means that uploading even a large file without compression will load it all (compressed) into memory (using the provided implicit conversion). I don't know if there is alternative (other than perhaps picking a maximum size above which we always stream, or else saying that certain RequestBlobs such as files always should stream). Similar for adapting an arbitrary Writable that defines a contentLength.

@lihaoyi
Copy link
Member Author

lihaoyi commented Jun 21, 2024

@nafg I cut 0.9.0-RC1. Will leave it for a few weeks before cutting 0.9.0 final.

If you email me your bank transfer details I can close out the bounty

@nafg
Copy link
Contributor

nafg commented Jun 24, 2024

I don't know your email address

@lihaoyi
Copy link
Member Author

lihaoyi commented Jun 24, 2024 via email

@sideeffffect
Copy link

@nafg https://github.com/lihaoyi
image

in case the email address is censored, because I see this as the response
image

@lihaoyi
Copy link
Member Author

lihaoyi commented Jun 24, 2024

huh, github censors emails???

haoyi.sg@gmail.com

@sideeffffect
Copy link

Maybe only in email responses? 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants