Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with using a direct DB connection instead of an API for retrieve data for SSR #783

Closed
beguene opened this issue Jul 12, 2022 · 11 comments

Comments

@beguene
Copy link
Collaborator

beguene commented Jul 12, 2022

For most of our pages, the explorer server code is doing one or more (up to 6) api calls to https://stacks-node-api.mainnet.stacks.co/ server-side which slows down dramatically our first server response.
We should avoid this pattern because it puts too much load on the explorer web server and users would experience degraded performance on the explorer as traffic increases (even if the api endpoint can handle it).
Example: 1000 users, each user requests the /txid page once, the explorer server itself requests 6 calls for each user and wait for each user for all the 6 requests to complete before returning a response to the user. That's 6000 api call that can't really leverage user caching as the cache would invalidate more often with more users, it consumes way more memory than necessary etc...
Had those api call been made client side those 6000 api call could be cached per user.
Reducing the load on the explorer server means it will be able to handle more with the same config or the config can be reduced to handle the same number of users.
(Note that nothing changes for the api because the api endpoint will still have to handle those 1000 * 6 requests)

My PR reduces the number of server-side api calls to a maximum of 1. This solution is a quick one and will give us good performance without much change in the code.

Ideally we should fetch the data directly from the db server side with a direct read-only connection to the same db that the api server is using. This way the explorer server will get all the data really fast and will return a response quickly. Besides we can leverage the SSR of nextjs and have great SEO.

This task is to experiment with simple node db connection to the postgresql and check the performance and the amount of code that need to change.
Note that those changes could be done in a progressive manner (say page per page) without changing the rest of the explorer code.

@andresgalante
Copy link
Member

We might be investigating this approach in the next weeks.

@CharlieC3 do you think that there might be a security concern if we follow this route?

@CharlieC3
Copy link
Member

Ideally we should fetch the data directly from the db server side with a direct read-only connection to the same db that the api server is using. This way the explorer server will get all the data really fast and will return a response quickly. Besides we can leverage the SSR of nextjs and have great SEO.

Are there any reports showing the API is a cause of performance issues in the Explorer? The API is specifically designed to be as fast as possible at retrieving data from the DB with the Explorer being a prime usecase. What makes us believe the Explorer can do it better? Not trying to poke holes in this, just trying to understand the thought process around this 😄

Additionally during times of high traffic (e.g. during a mint), it's been shown that the root cause/bottleneck is queries piling up in the database, rather than the API service itself. If this happens, the Explorer is going to experience the same slowdown whether it routes requests through the API or directly to the backend database.

The only way around this is to create a separate private API + DB deployment for Explorer SSR's, and at that point it would be easier to simply route SSR traffic through the private API anyway.

I think reducing the amount of SSRs from 6 down to 1 in itself should yield massive performance improvements by an order of magnitude under heavy load, but it's not clear to me that there would be any benefit to accessing the API database directly.

Maybe @zone117x or @rafaelcr can weigh in here if they have additional perspective I'm missing.

@rafaelcr
Copy link

Completely agree with what @CharlieC3 just described.

Another important aspect is that we use cache headers in the API very heavily. If the explorer connected directly to the API DB, we would not get any help from CF to serve this data to clients.

We'll need to accelerate work on this issue on the API so we can serve as much cached requests as we can: hirosystems/stacks-blockchain-api#1230

@beguene
Copy link
Collaborator Author

beguene commented Jul 14, 2022

Thanks @CharlieC3 and @rafaelcr for those details, this is really helpful.

I was not implying that there were any performance issue with API (which is actually really fast) but that doing SSR with 6 api requests per user request on the explorer server just to show 1 simple html page with light data is not the correct approach.
I was just exploring several alternative solutions and their respective trade-offs. The solutions are the following:

  1. Doing 1 request server side to show the minimum info (tx id etc..) and do the other 5 requests client side and render progressively (what this PR is doing)
    This has minimal change in our code base, gives us decent perf and good UX has the main content is displayed on first response

  2. fetch all the data that those 6 requests currently fetch (tx data, accountBalance, contract, mempool transactions, transactionsForAddress) in 1 query (either db or direct api with a specific endpoint with the data already structured for the page)
    This is costlier in dev time and costlier in devops time, we don't leverage the work done in the api but the explorer will be really fast on first page load. But we still have to do client side fetch for most pages to get fresh data (either http get or websocket)

  3. fetching all data client-side and render progressively
    There would be a no impact on the SEO if we change the title on the client side as googlebot can parse the title correctly ref.
    Everything would be really fast with minimum amount of work. We will leverage all the work done to scale the api. We will also remove all those react-query data hydration done server side which is costly. We basically don't do SSR with external data.

@andresgalante In light of @CharlieC3's and @rafaelcr's comments and the high dev cost involved with 2) I think we should move to solution 3) whose performance would be close to 2)
It's just moving the one api call from the explorer server to the client and handling the empty components with skeletons and progressive rendering (basically continuing what I have done in 1) )

@zone117x
Copy link
Member

fetch all the data that those 6 requests currently fetch (tx data, accountBalance, contract, mempool transactions, transactionsForAddress) in 1 query (either db or direct api with a specific endpoint with the data already structured for the page)

Note that the typical bottlenecks we see are around postgres server load. So changing the way this data is fetched by creating a new a single API endpoint (which would still require multiple sql queries), or by making the Explorer perform multiple direct sql queries -- the performance floor will not change.

fetching all data client-side and render progressively

This makes much more sense to me, and seems like a better user experience. For example, if I'm loading a page for the purpose of checking tx status, I don't want to wait around for all the other data to be queried before anything is visible.

It would also help users when providing feedback, instead of something like "account page is slow", users could report specific issues like "the account balance section of this page is slow to render."

It also increases the amount of cache-hits (speeds up page rendering), since user web browsers generally cache data for a longer period of time than Cloudflare. With the assumption that the Explorer SSR fetches are not performing http caching, which means all those requests are at the mercy of Cloudflare's sporadic cache-eviction behavior.

@beguene
Copy link
Collaborator Author

beguene commented Jul 14, 2022

Note that the typical bottlenecks we see are around postgres server load.

This is good to know. We can then confidently eliminate solution 2)

It's seems that we are all converging to a full client-side solution (3))

@andresgalante unless you have any argument for considering solution 2) or 1), I will pursue my work on moving everything client-side and we can say bye bye to SSR 😄

@andresgalante
Copy link
Member

I am not concerned about moving to CSR.

Even though at the moment SEO is not something we care about, we might in the future. Not a concern, just something to keep in mind.

If this decision can wait a week, I'd like Heidar to be involved.

@beguene @CharlieC3 @zone117x @rafaelcr thanks a lot for the input and help here 🙇

@beguene
Copy link
Collaborator Author

beguene commented Jul 15, 2022

@andresgalante
I already factored in SEO and this won't be much of a problem, as I linked in my comment before googlebot pretty much "sees" the page the way a normal user would see the page even if it's heavily rendered client-side.
"Google is able to render the entire page and read the DOM, thereby indexing dynamically generated content."
Example
https://blockstream.info/tx/84cd0fea1f604046c1a4712e2d089cac1efaf095e2808d35bf475ffc95b0e0b4 populates meta and title dynamically.
Etherscan does not even change the title per tx

If this decision can wait a week, I'd like Heidar to be involved.

In our last meeting we already discussed this and he was ok with moving to client-side as long as we are consistent across pages and in the UX.

I don't think this can wait. We must get this code ready as soon as possible.

We currently put unnecessary load on the api, do not leverage caching, make rate-limiting difficult and the explorer have poor performance during heavy spike of traffic (and even during low traffic). The explorer becomes the single bottleneck. If you have many users requesting a page from the explorer at the same time, the explorer server is currently caching all the data for all the users in the server.
I suspect all the api caching would be useless in time of high traffic or when users refresh the page multiple times(which is often the case for blockchain related event): as the number of user requesting an explorer page increases the more likely the previous cache data in the explorer server would be invalidated and therefore would force an api request.

To give you an idea of the urgency

Compare this
https://explorer.stacks.co/block/0x73ed5130cb150480aed6ac207b9ae933653e386319b41be960f5cb818ef41f2b?chain=mainnet
prod environment and main branch deployed
First Contentful Paint: 12.0s
Fully Loaded Time: 14.1s

Report: https://gtmetrix.com/reports/explorer.stacks.co/qhW6fOvb/

with this

https://explorer.staging.blockstack.xyz/block/0x73ed5130cb150480aed6ac207b9ae933653e386319b41be960f5cb818ef41f2b?chain=mainnet

staging environment (similar to prod environment) and my PR branch fix/improve-perf with only 1 request server side among other changes
First Contentful Paint: 765ms
Fully Loaded Time: 2.9s

Or

https://explorer.stacks.co/txid/0x28d5c2d5a9d070b378aa061a9a18eddf676d7747e9b232a34f538487e83e0302
Report: https://gtmetrix.com/reports/explorer.staging.blockstack.xyz/45ml2B2S/

First Contentful Paint: 3.7s
Fully Loaded Time: 5.0s

https://explorer.staging.blockstack.xyz/txid/0x28d5c2d5a9d070b378aa061a9a18eddf676d7747e9b232a34f538487e83e0302
https://gtmetrix.com/reports/explorer.stacks.co/t5zLoo02/

First Contentful Paint: 1.0s
Fully Loaded Time: 2.6s

I don't think a 12 seconds waiting time for a simple block page is acceptable. Many users would think (incorrectly) that the blockchain is slow while it's actually the explorer which is slow.

I am fine with not deploying to production right now, but I don't understand why we would not continue working on this issue now.
Given that you agree with CSR, can you tell me what are your concerns here ?

@beguene
Copy link
Collaborator Author

beguene commented Jul 15, 2022

With pictures

Before
block-page-before

After
block-page-after

@andresgalante
Copy link
Member

andresgalante commented Jul 15, 2022

I don't think this can wait. We must get this code ready as soon as possible.

You are right, it's not acceptable. Let's ship when it's done 👍 🚀

@CharlieC3
Copy link
Member

If you have many users requesting a page from the explorer at the same time, the explorer server is currently caching all the data for all the users in the server.
I suspect all the api caching would be useless in time of high traffic or when users refresh the page multiple times...

Just adding more context for the current setup. Generally at all times there will be multiple load-balanced instances of the Explorer running to serve requests, especially when traffic is high. So even if they do have support for a server-side cache, it's not being shared across all Explorer instances. If someone reloads their address page, they'd likely hit a different Explorer instance, which may not have cached data to serve them.

Moving the caching efforts client-side should open the door to resolve this since we'd then be able to leverage their browser's cache and perhaps Cloudflare's cache. But if there's still any server-side caching going on, we should look into leveraging a middle caching layer (e.g. Cloudflare), or implementing support for a shared cache view for the Explorer with something like Redis or memcached.

@beguene beguene closed this as not planned Won't fix, can't repro, duplicate, stale Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants