ClientPayloadError: Response payload is not completed #396

rrod515 · 2023-10-08T20:41:02Z

Describe the bug

When running models with larger data on an Intel-based Mac, the following error occurs:
ClientPayloadError: Response payload is not completed

Describe your system

OS: macOS 10.15.7 (19H2026)
CPU: 2.4 GHz 8-Core Intel Core i9
C++: clang++ Apple clang version 11.0.0 (clang-1100.0.33.16)
Python: Anaconda conda 4.14.0

Steps/Code to Reproduce

Code Sample, a copy-pastable example

import stan
import numpy as np

schools_code = """
data {
  int<lower=0> J;         // number of schools
  array[J] real y;              // estimated treatment effects
  array[J] real<lower=0> sigma; // standard error of effect estimates
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector[J] eta;          // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}
"""

schools_data = {"J": 40000,
                "y": np.repeat([28,  8, -3,  7, -1,  1, 18, 12], 5000),
                "sigma": np.repeat([15, 10, 16, 11,  9, 11, 10, 18], 5000)}

posterior = stan.build(schools_code, data=schools_data)
fit = posterior.sample(num_chains=4, num_samples=1000)
eta = fit["eta"]  # array with shape (40000, 4000): J rows, num_chains * num_samples columns
df = fit.to_frame()  # pandas DataFrame; requires pandas
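
For a rough sense of scale, the following minimal sketch counts how many scalar values the saved draws contain for the model above. It assumes each draw carries every parameter and transformed parameter plus seven sampler diagnostics; `values_in_fit` is a hypothetical helper, not part of PyStan, and the count ignores warmup draws and any serialization overhead, so it is best read as a lower bound.

```python
# Rough count of the scalar values in the saved draws of the schools model.
# Assumption: every draw holds mu, tau, eta[J], theta[J] and seven sampler
# diagnostics (lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__,
# divergent__, energy__). Warmup draws are not counted.

def values_in_fit(J: int, num_chains: int, num_samples: int,
                  n_sampler_params: int = 7) -> int:
    """Return the number of scalar values across all saved draws."""
    n_model_values = 2 + 2 * J  # mu, tau, eta[J], theta[J]
    n_draws = num_chains * num_samples
    return (n_model_values + n_sampler_params) * n_draws


small = values_in_fit(J=8, num_chains=4, num_samples=1000)
large = values_in_fit(J=40000, num_chains=4, num_samples=1000)
print(f"J=8:     {small:,} values")
print(f"J=40000: {large:,} values")
```

Under these assumptions the J=40000 case holds several thousand times as many values as the original eight-schools example, which is consistent with the error only appearing for larger data.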

wm1995 · 2023-11-29T17:01:55Z

I've had this same issue on an M1 Mac - I've been validating my inference using simulation-based calibration, and if I simulate a dataset of 100 samples and use a model with ~400 params, about 75% of my runs fail with this error.

OS: Sonoma 14.0
CPU: M1 Pro
Compiler: clang version 15.0.0
Python: 3.10.8
Pystan: 3.7.0
Httpstan: 4.10.1 (compiled from source)
aiohttp: 3.8.4

The full output and error traceback is:

Messages received during sampling:
  Gradient evaluation took 0.00027 seconds
  1000 transitions using 10 leapfrog steps per transition would take 2.7 seconds.
  Adjust your expectations accordingly!
Sampling: 100% (37591/37591)
Traceback (most recent call last):
  File "project/scripts/fit_sbc.py", line 53, in <module>
    fit = sampler.sample(
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 89, in sample
    return self.hmc_nuts_diag_e_adapt(num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 108, in hmc_nuts_diag_e_adapt
    return self._create_fit(function=function, num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 313, in _create_fit
    return asyncio.run(go())
  File "project/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "project/base_events.py", line 649, in run_until_complete
    return future.result()
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 238, in go
    resp = await client.get(f"/{fit_name}")
  File "project/venv/lib/python3.10/site-packages/stan/common.py", line 48, in get
    return HTTPResponse(status=resp.status, content=await resp.read())
  File "project/venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 375, in read
    block = await self.readany()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 397, in readany
    await self._wait("readany")
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

riddell-stan · 2023-12-01T19:30:46Z

This isn't an error I've seen before. Sounds like some kind of memory or buffer size limit might be being exceeded.

wm1995 · 2023-12-04T11:38:43Z

I wonder whether this is linked to aio-libs/aiohttp#4581 and occurs for large requests that don't get a response in time. I'm not sure, though.

wm1995 · 2023-12-06T12:02:43Z

I used a packet sniffer to inspect the traffic going back and forth between Pystan and httpstan; I also had a go at using pdb to try to catch the error as it happened, but the httpstan server had been torn down at that point.

For my example, the fit works for 6000 samples, but fails for 48000 samples.

76878	869.712981	127.0.0.1	127.0.0.1	HTTP	213	GET /v1/models/ex6bdm7x/fits/5nifthnq HTTP/1.1 
76879	869.713043	127.0.0.1	127.0.0.1	TCP	56	49954 → 49955 [ACK] Seq=4957307 Ack=2237498 Win=7215 Len=0 TSval=1115740592 TSecr=2511616970
76880	870.393448	127.0.0.1	127.0.0.1	TCP	215	49954 → 49955 [PSH, ACK] Seq=4957307 Ack=2237498 Win=7215 Len=159 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]
76881	870.393478	127.0.0.1	127.0.0.1	TCP	16388	49954 → 49955 [ACK] Seq=4957466 Ack=2237498 Win=7215 Len=16332 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]

There's a 0.7 second delay between when the GET request for the results is issued and when the data begins to be streamed. The response header (in packet 76880) from httpstan is

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 328436942
Date: Mon, 04 Dec 2023 15:31:50 GMT
Server: Python/3.10 aiohttp/3.8.4

and indicates that ~300MB of data is being streamed (though the packet sniffer seemingly only picks up ~150MB of that). The data streaming begins in 76881 and is complete in 0.3 seconds. I can't see any sort of TCP teardown, but that could well be because the sniffer missed the packets.

For the 48000 sample run (where the ClientPayloadError is raised), the situation is different:

756363	8238.629726	127.0.0.1	127.0.0.1	HTTP	213	GET /v1/models/ex6bdm7x/fits/kxk3uhsc HTTP/1.1 
756364	8238.629795	127.0.0.1	127.0.0.1	TCP	56	50357 → 50358 [ACK] Seq=43239183 Ack=19398529 Win=346816 Len=0 TSval=3598381091 TSecr=247335874
756369	8247.179848	127.0.0.1	127.0.0.1	TCP	216	50357 → 50358 [PSH, ACK] Seq=43239183 Ack=19398529 Win=346816 Len=160 TSval=3598389641 TSecr=247335874 [TCP segment of a reassembled PDU]
756370	8247.179893	127.0.0.1	127.0.0.1	TCP	56	50358 → 50357 [ACK] Seq=19398529 Ack=43239343 Win=285376 Len=0 TSval=247344424 TSecr=3598389641
756371	8247.181520	127.0.0.1	127.0.0.1	HTTP	56	HTTP/1.1 200 OK 
756372	8247.181541	127.0.0.1	127.0.0.1	TCP	56	50358 → 50357 [ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344426 TSecr=3598389643
756373	8247.320002	127.0.0.1	127.0.0.1	TCP	56	50358 → 50357 [FIN, ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344564 TSecr=3598389643
756374	8247.320090	127.0.0.1	127.0.0.1	TCP	56	50357 → 50358 [ACK] Seq=43239344 Ack=19398530 Win=346816 Len=0 TSval=3598389781 TSecr=247344564

There's a delay of about 8.5 seconds before the server responds, but the connection is torn down by httpstan before any data is streamed (I can't see a standalone FIN packet, but the FIN-ACK from Pystan is packet 756373). The header returned in packet 756369 is

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 3144360715
Date: Mon, 04 Dec 2023 17:34:47 GMT
Server: Python/3.10 aiohttp/3.8.4

indicating that ~3GB of data would be streamed.
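
The numbers from the two captures can be put side by side with a short sketch. The timestamps and `Content-Length` values are copied from the packets and headers above; the comparison against the largest signed 32-bit integer is only an observation about the sizes, and nothing in this thread establishes it as the cause.

```python
# Figures copied from the two packet captures above.
runs = {
    "6000 samples": {
        "get_time": 869.712981,      # packet 76878 (GET)
        "header_time": 870.393448,   # packet 76880 (response header)
        "content_length": 328436942,
    },
    "48000 samples": {
        "get_time": 8238.629726,     # packet 756363 (GET)
        "header_time": 8247.179848,  # packet 756369 (response header)
        "content_length": 3144360715,
    },
}

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer, 2147483647


def summarise(run):
    """Return (delay in seconds, size in MiB, whether size exceeds INT32_MAX)."""
    delay = run["header_time"] - run["get_time"]
    size_mib = run["content_length"] / 2**20
    exceeds_int32 = run["content_length"] > INT32_MAX
    return delay, size_mib, exceeds_int32


for name, run in runs.items():
    delay, size_mib, exceeds_int32 = summarise(run)
    print(f"{name}: delay {delay:.2f} s, {size_mib:.0f} MiB, "
          f"exceeds signed 32-bit max: {exceeds_int32}")
```

Only the failing run declares a body larger than 2147483647 bytes, which may be worth keeping in mind when looking for a size or buffer limit.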

wm1995 · 2023-12-06T17:34:25Z

It looks like this is fundamentally a httpstan issue (as I have the same issue when using curl to interact with httpstan directly), so I've opened stan-dev/httpstan#652 to summarise what I've found.
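
A check along the same lines as the curl test can be sketched with only the Python standard library: request a fit from httpstan directly and compare the declared `Content-Length` with the number of bytes that actually arrive. `fetch_and_count` is a hypothetical helper, and the host, port, and path in the usage comment are placeholders (the path is the one seen in the capture above; the port httpstan listens on will differ per run).

```python
import http.client


def fetch_and_count(host: str, port: int, path: str, chunk_size: int = 1 << 20):
    """GET `path` and return (declared Content-Length, bytes actually received)."""
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        header = resp.getheader("Content-Length")
        declared = int(header) if header is not None else None
        received = 0
        try:
            # Read in chunks so a multi-gigabyte body is never held in memory.
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                received += len(chunk)
        except http.client.IncompleteRead as exc:
            # An early close may surface as IncompleteRead or as an empty
            # read, depending on the Python version; both are handled here.
            received += len(exc.partial)
        return declared, received
    finally:
        conn.close()


# Usage (placeholder port; the path is taken from the capture above):
# declared, received = fetch_and_count(
#     "127.0.0.1", 8080, "/v1/models/ex6bdm7x/fits/kxk3uhsc")
# print(f"declared={declared} received={received}")
```

If `received` comes back smaller than `declared` without PyStan or aiohttp in the loop, that points at the server side rather than the client.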

riddell-stan · 2023-12-08T11:50:08Z

Closing this in favor of stan-dev/httpstan#652. As @wm1995 says, this is an httpstan issue.

rrod515 added the bug label Oct 8, 2023

wm1995 mentioned this issue Dec 6, 2023

httpstan fails to give a response for large outputs stan-dev/httpstan#652

Open

riddell-stan closed this as completed Dec 8, 2023
