starting a snapshot listener from afar causes problems #8451

Open
michaelAtCoalesce opened this issue Aug 22, 2024 · 24 comments

Comments

@michaelAtCoalesce

michaelAtCoalesce commented Aug 22, 2024

Operating System

windows

Environment (if applicable)

chrome

Firebase SDK Version

10.13.0

Firebase SDK Product(s)

Firestore

Project Tooling

create-react-app example

Detailed Problem Description

i have a collection with ~50 megabytes of data across ~1500 documents.

when i try to start a listener while in north america (connecting to US firestore), it takes only 10 seconds to start the listener, and it completes 100% of the time.

image

when i turn on my india VPN (same machine, same code, only difference is routing through india), the listener never even completes.

i immediately get these errors -

image

on some machines from APAC region connecting to US firestore, i also get really poor behavior, and it never actually succeeds.

image

for what it's worth - the connection FROM india to united states should easily be able to handle this...

image

Steps and code to reproduce issue

if someone wants to email me i can send them the info for the recreate. it's a few lines of code.

@michaelAtCoalesce added the new and question labels Aug 22, 2024
@google-oss-bot
Contributor

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

@jbalidiong added the api: firestore and needs-attention labels and removed the needs-triage and new labels Aug 22, 2024
@michaelAtCoalesce
Author

michaelAtCoalesce commented Aug 22, 2024

based on previous issues @dconeybe might be a good person to look at this? Someone can email me at 'mx2323 <@> gmail.com' and I'll jump on a call to recreate.

It's very concerning behavior so we'd like to get this looked at ASAP.

@michaelAtCoalesce
Author

any updates? this is causing our application to not load.

@wu-hui
Contributor

wu-hui commented Aug 23, 2024

Hey @michaelAtCoalesce ,

I suspect the issue here is that the bidirectional stream between the SDK and the backend does not work well when the network is not stable, especially when you need to load a lot of data over the wire. There are several things you can try:

  1. Can you create a test Firestore database in Asia to see if things improve?
  2. Try always turning on long polling (https://firebase.google.com/docs/reference/js/firestore_.firestoresettings.md#firestoresettingsexperimentalforcelongpolling) to see if that helps.
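For readers following along, a minimal sketch of the second suggestion. `experimentalForceLongPolling` and `experimentalAutoDetectLongPolling` are real `FirestoreSettings` fields in the JS SDK; the `longPollingSettings` helper and the `Transport` type are hypothetical names used only for illustration, and `firebaseConfig` in the usage comment is assumed to be your own project config.

```typescript
// Subset of the Firebase JS SDK's FirestoreSettings that controls long polling.
interface LongPollingSettings {
  experimentalForceLongPolling?: boolean;
  experimentalAutoDetectLongPolling?: boolean;
}

// Hypothetical switch for this sketch: which transport behaviour to ask for.
type Transport = "auto" | "force-long-polling";

// Hypothetical helper: the two flags cannot both be enabled,
// so exactly one of them is set.
function longPollingSettings(transport: Transport): LongPollingSettings {
  return transport === "force-long-polling"
    ? { experimentalForceLongPolling: true }
    : { experimentalAutoDetectLongPolling: true };
}

// Usage with the modular SDK (not executed in this sketch):
//   import { initializeApp } from "firebase/app";
//   import { initializeFirestore } from "firebase/firestore";
//   const app = initializeApp(firebaseConfig);
//   const db = initializeFirestore(app, longPollingSettings("force-long-polling"));
```

The settings have to be passed to `initializeFirestore` before the first use of the database instance, so this would normally live next to the app's Firebase bootstrap code.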

@sampajano Do you have some other suggestions/ideas?

@wu-hui self-assigned this Aug 23, 2024
@michaelAtCoalesce
Author

michaelAtCoalesce commented Aug 26, 2024

hi @wu-hui,

  1. yes, this is definitely the case. as i said above, when i turn off the VPN (so connect from nearby) - the reliability is 100%
  2. i have forced long polling, but i am noticing that i see slower load times. more consistent, but slower load times.

the data here is only on the order of tens of megabytes - but it can take over a minute (and sometimes not load at all), whereas a closer connection will take ~10 seconds.

in conclusion:
the speed test shows that this should work, so it seems there is work to do here for firestore to reliably support this kind of a connection... it shouldn't take over a minute to load when it takes 10 seconds in the ideal case, and even the forced long polling is slow.

i can reliably recreate this issue 100% of the time, within seconds. happy to hop on a call and share recreate details (or do it over email). you can email me at mx2323 <@> gmail.com

@michaelAtCoalesce
Author

why was the needs attention label removed and a needs-info label added? i believe ive given the information required and this is causing our production to not load.

@DellaBitta
Contributor

My mistake! I must have been looking at a stale page that I had loaded yesterday, sorry!

@michaelAtCoalesce
Author

from nearby: 11 seconds
from afar: 226 seconds

im uploading some firestore debug level logs of the degenerative case here.

aec2-38-34-123-154.ngrok-free.app-1724779566040.log
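The thread does not show how these timings were taken. One way to collect comparable numbers is sketched below, assuming the measurement of interest is "time until the first snapshot callback fires". `timeToFirstSnapshot`, `Subscribe` and `Unsubscribe` are hypothetical names; the clock is injectable so the helper can be exercised without a network.

```typescript
type Unsubscribe = () => void;
// Anything shaped like onSnapshot: takes a callback, returns an unsubscribe function.
type Subscribe<T> = (onNext: (value: T) => void) => Unsubscribe;

// Hypothetical helper: reports the elapsed milliseconds between subscribing
// and the first delivered value; later values are ignored.
function timeToFirstSnapshot<T>(
  subscribe: Subscribe<T>,
  onFirst: (elapsedMs: number, first: T) => void,
  now: () => number = Date.now,
): Unsubscribe {
  const startedAt = now();
  let reported = false;
  return subscribe((value) => {
    if (reported) return;
    reported = true;
    onFirst(now() - startedAt, value);
  });
}

// Usage with the modular SDK (not executed in this sketch; "items" is a placeholder):
//   import { collection, onSnapshot } from "firebase/firestore";
//   timeToFirstSnapshot(
//     (next) => onSnapshot(collection(db, "items"), next),
//     (ms) => console.log(`first snapshot after ${ms} ms`),
//   );
```

Running the same page with and without the VPN, and with and without forced long polling, would give four numbers that can be compared directly.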

@michaelAtCoalesce
Author

any updates?

@MarkDuckworth
Contributor

@michaelAtCoalesce, Thank you for providing the logs. I reviewed them and I don't see a clear indication of an issue in the SDK, however I have forwarded this to our backend team for review. Googlers see b/361143373

For what it's worth, on this behavior, you may get more frequent updates if you open a Firebase or Google Cloud support ticket rather than a GitHub issue on the SDK. However, we will update this GH issue when we learn more.

@michaelAtCoalesce
Author

michaelAtCoalesce commented Aug 28, 2024

@MarkDuckworth thanks for the update. wanted to add another data point in here. it appears to happen more frequently on windows. i have anecdotally noticed in my recreate case that on chrome on windows the default implementation is more likely to fail than chrome on macOS. if nothing else, it appears that windows is at least 3x as slow.

something appears to happen where the default implementation will start up, download a bit, then just hang there for tens of seconds or minutes and not do anything. when i turn on the experimental long polling option, it immediately goes back to working.

@michaelAtCoalesce
Author

michaelAtCoalesce commented Sep 10, 2024

it's been 2 weeks.. any updates?

i was told by firebase support to try to paginate the snapshots. i tried that, and the performance did not improve and sometimes the listeners do not ever start still. this appears to happen more frequently on windows. it appears that this issue appears even when the user is nearby the firestore location.

on my mac it'll take 12 seconds; the exact same page on a windows e2-standard-2 instance has 400 errors, 404 errors, and sometimes takes over 3 minutes for the same test case.

image
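The pagination that support suggested is not shown in the thread. A minimal sketch of cursor-based paging is given below, assuming each page is fetched with a one-off read rather than a listener. `fetchAllPages`, `Page` and `FetchPage` are hypothetical names; the Firestore calls in the usage comment (`query`, `orderBy`, `startAfter`, `limit`, `getDocs`) are the SDK's query-cursor API.

```typescript
// One page of results plus the cursor to resume from (null when exhausted).
interface Page<T, C> {
  items: T[];
  nextCursor: C | null;
}

type FetchPage<T, C> = (cursor: C | null, pageSize: number) => Promise<Page<T, C>>;

// Hypothetical helper: pulls pages one after another until the source
// reports no further cursor, and returns everything in order.
async function fetchAllPages<T, C>(
  fetchPage: FetchPage<T, C>,
  pageSize: number,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: C | null = null;
  do {
    const page: Page<T, C> = await fetchPage(cursor, pageSize);
    all.push(...page.items);
    cursor = page.nextCursor;
  } while (cursor !== null);
  return all;
}

// A Firestore-backed fetchPage could look like this (not executed in this sketch;
// "items" and "updatedAt" are placeholder names):
//   const base = query(collection(db, "items"), orderBy("updatedAt"));
//   const q = cursor ? query(base, startAfter(cursor), limit(pageSize))
//                    : query(base, limit(pageSize));
//   const snap = await getDocs(q);
//   const last = snap.docs[snap.docs.length - 1] ?? null;
//   return { items: snap.docs, nextCursor: snap.size < pageSize ? null : last };
```

Paging keeps each individual response small, but as the comment above reports, it did not help in this case, which suggests the bottleneck was not the size of a single response.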

@vdemko001

Any updates?
I'm getting the same error on Firebase SDK 7.24.0
image

@mx2323

mx2323 commented Oct 4, 2024

@MarkDuckworth I think the enabling of web channel in 10.14.0 is improving the performance of my standalone test case. Sounds like with large datasets and chunking there may have been a corruption causing memory issues as a symptom that was fixed.

We are still having an issue where inexplicably firestore will take 50 seconds within our app to listen on windows, but in the standalone test case on windows it’ll take 15 seconds, consistently. Our users frequently open and close listeners as part of their workflow so the performance is important for them.

once I get some more data will open a ticket for that

@MarkDuckworth
Contributor

@mx2323, thanks for the feedback on 10.14.0. It's good to hear it is helping you out.

Regarding the 50 second delay, is it consistently 50 seconds? Take a look at #8474, it could be related to that. Although we have only seen that delay at ~45 seconds. Your logs in #8451 (comment) don't show similarity, but if you're seeing a consistent 45-50 second delay, it may be worth getting logs covering this timespan.

@mx2323

mx2323 commented Oct 4, 2024

ok I’ll follow up soon. headed on a flight and will be unavailable.

@michaelAtCoalesce
Author

michaelAtCoalesce commented Oct 4, 2024

OK - debug logs attached

  1. standalone recreate on windows india vm - 40 seconds
    standalone-windows-40seconds.log

  2. in-app recreate on windows india vm - 75 seconds
    windows-insideapp-75seconds.log

image

I think the absolute numbers changed here because of the added debug logging.

these are on the same machine, same browser too. just that one is a standalone one that does nothing else, the other is a version of our app that is very stripped down. i took a look at the logs with the 75 second issue, it looks like it spends ~25 seconds of that doing pretty much nothing with "detecting buffered proxy" resulting in "The Operation Could not be completed" a few times ... then "detected no buffering proxy" prints and it immediately starts downloading and working as it should.

any explanation on why the same machine would sometimes say 'detect buffering proxy' and then other times not? what's odd is that in the standalone app, it's never detecting a buffered proxy, but once the same code is running in our app, it'll detect a buffered proxy.

also - is there a log level that doesnt print as much but also tells us key information about whether a buffering proxy was detected? the logs when turned on slow things down so much it'd be good to have closer to the actual situation and less verbose logging.

i think what this is telling us is that this proxy detection code isn't working correctly on this machine

@michaelAtCoalesce
Author

michaelAtCoalesce commented Oct 16, 2024

Is there someone I can send a recreate to? This behavior is very problematic for us and it’s been a month and a half. It’s a 20 megabyte collection with 2000 documents; this is well within what should be supported.

@wu-hui
Contributor

wu-hui commented Oct 17, 2024

One way to share the reproduction is to create a private repo, and invite me and @sampajano to join.

Please provide clear instructions on how to reproduce, especially since this seems to involve some VM setup and VPNs.

Also, we have no control over VPNs or the public internet, and a speed test does not necessarily translate to the actual connection. We will do our best to look into this, but it is certainly possible that this won't lead us anywhere.

@google-oss-bot
Contributor

Hey @michaelAtCoalesce. We need more information to resolve this issue but there hasn't been an update in 5 weekdays. I'm marking the issue as stale and if there are no new updates in the next 5 days I will close it automatically.

If you have more information that will help us get to the bottom of this, just add a comment!

@google-oss-bot
Contributor

Since there haven't been any recent updates here, I am going to close this issue.

@michaelAtCoalesce if you're still experiencing this problem and want to continue the discussion just leave a comment here and we are happy to re-open this.

@michaelAtCoalesce
Author

michaelAtCoalesce commented Nov 8, 2024

what we noticed and had filed a ticket for originally was that the client sdk was very unreliable in windows chrome browsers with slightly larger datasets.

We noticed that the backend Firestore client has a preferRest option that was much faster and more reliable. We noticed that although the rest api was faster, we still had issues where concurrent requests within the same backend process were slow. I filed a ticket about this here googleapis/nodejs-firestore#2215. Ultimately I think the unzipping of hundreds of megabytes of data from the rest api is just slow and hogs the main thread because forking processes and executing on the same machine was parallel and fast.

After all this investigation, what we decided to do is reverse engineer with wireshark what the backend was doing with preferRest (since the runQuery docs are not helpful https://firebase.google.com/docs/firestore/reference/rest/v1beta1/projects.databases.documents/runQuery) and call that via the frontend directly. Then we have changed every write in our app to store server timestamps, and we changed our snapshot listener to filter for all values higher than the values seen than what we pulled down in the starting snapshot. We also implemented a separate collection so we can have deletes.
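The exact request captured with wireshark is not shared in the thread. As a rough sketch, the documented `runQuery` REST call has the shape below. `buildRunQueryRequest` and `RunQueryRequest` are hypothetical names, the `(default)` database ID and ascending sort are assumptions, and authentication (for example an `Authorization: Bearer` header carrying a Firebase ID token) is left to the caller.

```typescript
interface RunQueryRequest {
  url: string;
  method: "POST";
  body: string;
}

// Hypothetical helper: builds a runQuery request for a whole collection,
// ordered by one field. Assumes the "(default)" database.
function buildRunQueryRequest(
  projectId: string,
  collectionId: string,
  orderByField: string,
): RunQueryRequest {
  const parent = `projects/${projectId}/databases/(default)/documents`;
  const body = {
    structuredQuery: {
      from: [{ collectionId }],
      orderBy: [{ field: { fieldPath: orderByField }, direction: "ASCENDING" }],
    },
  };
  return {
    url: `https://firestore.googleapis.com/v1/${parent}:runQuery`,
    method: "POST",
    body: JSON.stringify(body),
  };
}

// Usage (not executed in this sketch; idToken is assumed to come from Firebase Auth):
//   const req = buildRunQueryRequest("my-project", "items", "updatedAt");
//   const res = await fetch(req.url, {
//     method: req.method,
//     headers: { "Content-Type": "application/json", Authorization: `Bearer ${idToken}` },
//     body: req.body,
//   });
```

The response is a JSON array of results in the REST value encoding (`stringValue`, `timestampValue`, and so on), so a decoding step is needed before the data matches what the client SDK would return.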

So far in our extensive testing the result has more reliable and consistent performance than the Firestore client web sdk. It's unfortunate that we had to do this, but I repeatedly reached out to Firestore team members over many months and sent logs and offered multiple times to jump on a call with no luck.

@wu-hui
Contributor

wu-hui commented Nov 13, 2024

Thanks for the update, it is great that you found a way that works for you.

I have a couple of questions:

  1. runQuery does not support realtime queries, how do you use snapshot listener still?
  2. About "Then we have changed every write in our app to store server timestamps, and we changed our snapshot listener to filter for all values higher than the values seen than what we pulled down in the starting snapshot.". Does this mean you eventually get the improvements by having fewer documents sent to the client?

@michaelAtCoalesce
Author

  1. we still use the client SDK for realtime updates. we go through the data we pulled down via rest API, then set a filter on the firestore sdk listener to start after the highest timestamp.
  2. no, its the same number of documents. what we do is pull down our own snapshot of the data via the REST api (because the client SDK performance is so unreliable on windows), then use the client SDK to listen to all changes starting from the highest "updatedAt" value.
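The approach described in the two answers above can be sketched roughly as follows. This is an illustration under assumptions, not the author's code: `Doc`, `highestUpdatedAt` and `applyChanges` are hypothetical names, timestamps are simplified to millisecond numbers, and deletes are assumed to arrive as a list of IDs from the separate deletes collection.

```typescript
// Simplified document: an ID plus the server timestamp written on every update.
interface Doc {
  id: string;
  updatedAt: number; // milliseconds; stands in for a Firestore server timestamp
}

// Cursor for the realtime listener: the newest timestamp seen in the REST snapshot.
function highestUpdatedAt(docs: Doc[]): number {
  return docs.reduce((max, d) => Math.max(max, d.updatedAt), 0);
}

// Folds listener results into the local copy without mutating the input map.
function applyChanges(
  base: Map<string, Doc>,
  changed: Doc[],
  deletedIds: string[],
): Map<string, Doc> {
  const next = new Map(base);
  for (const d of changed) next.set(d.id, d);
  for (const id of deletedIds) next.delete(id);
  return next;
}

// Wiring with the client SDK (not executed in this sketch; "items" is a placeholder):
//   const cursor = highestUpdatedAt(initialDocs);
//   onSnapshot(
//     query(collection(db, "items"), where("updatedAt", ">", cursor)),
//     (snap) => { /* map snap.docs to Doc[] and call applyChanges */ },
//   );
```

The listener then only has to deliver documents written after the REST snapshot, so its initial result set is small regardless of how large the collection is.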
