[Issue #1506] Business Analytics Data Storage ADR #1503
Conversation
- API application metrics
- API infrastructure metrics from Cloudwatch

We will not be importing all of these types of data immediately. On the 0 - 6 month timeframe, we will only be importing the smaller datasets (thousands of records). By 2 - 5 years we will be importing all of these types of data, and our data size will be quite large (many millions of records). The desired solutions have different cost/performance characteristics in those time ranges, and we will need to evaluate those differences.
One small note on this: I imagine by the 6 month mark we might be ingesting analytics and infrastructure metrics. We'd certainly want to start doing that by the 12-month mark.
I think the biggest outstanding question for those sources of data is whether we want to ingest point level data (e.g. individual page views, clicks, and API calls) or if we'd do some level of aggregation before loading it into our data warehouse.
Just to "show our work" around volume of data metrics (a runnable sketch of this arithmetic follows the list):
- ~250k users in the past 7 days (based on latest GA metrics for grants.gov)
- ~4 average page views per user per week (conservative estimate for sizing purposes)
- 52 weeks per year
- ~52 million page view records per year
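For anyone who wants to re-run it, here's that estimate as a quick Python sketch (the inputs are the rough estimates above, not measured values):

```python
# Back-of-the-envelope sizing for page view records, using the rough
# estimates from the list above (not measured values).
weekly_users = 250_000            # ~users in the past 7 days, per GA
page_views_per_user_per_week = 4  # conservative estimate for sizing
weeks_per_year = 52

yearly_page_views = weekly_users * page_views_per_user_per_week * weeks_per_year
print(f"{yearly_page_views:,} page view records per year")  # 52,000,000
```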
This number could grow geometrically though if we also want to track things like impressions and clicks and tie those to sessions and active devices.
At 0 - 6 months, S3 is a reasonable choice due to our small data sizes. Past that point, performance issues with large data sizes make S3 a non-ideal choice.

At 2 - 5 years, S3's performance issues require the introduction of another query / compute layer like AWS Athena or AWS Redshift.
These kinds of timeline-based comparisons are super helpful @coilysiren !!
- S3
- Redshift
- Postgres
- Snowflake
Would it be worth adding Aurora to this list of options per @bretthrosenblatt's comments in Slack?
Aurora=Postgres in this respect
I'd recommend both, actually
Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?
@widal001 AWS RDS Aurora PostgreSQL is its full name, and it's our main database, yes
> Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?

Purposely confusing. There is RDS Postgres, which is cloud-hosted Postgres, and Postgres-compatible Aurora, which is a Postgres-like interface on an Aurora db engine and storage cluster (what you're using).
Can you provide a citation for that? When I looked into pricing, what I saw was:
I'm abstracting compute cost as 0 because both databases are "Serverless" and billed as such
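To make the storage-only framing concrete, here's a hedged comparison sketch. The Postgres rate is the $0.115/GB-month figure cited later in this thread; the S3 and Redshift managed-storage rates are approximate us-east-1 list prices I'm assuming for illustration, so re-check them against the AWS pricing pages:

```python
# Rough storage-only cost comparison, ignoring compute (both databases are
# "Serverless" and billed as such). Rates are approximate us-east-1 list
# prices assumed for illustration; verify against the AWS pricing pages.
RATES_PER_GB_MONTH = {
    "S3 (Standard)": 0.023,              # assumed list price
    "Redshift (managed storage)": 0.024,  # assumed list price
    "RDS Postgres": 0.115,                # figure cited in this thread
}

for size_gb in (10, 1_000, 100_000):  # thousands of records -> multi-TB
    print(f"--- {size_gb:,} GB ---")
    for service, rate in RATES_PER_GB_MONTH.items():
        print(f"{service}: ${size_gb * rate:,.2f}/month")
```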
We plan on using it with an ELT tool, which should be more flexible.
I don't think we are going to be doing enforcement or validation of most of this data. I believe the plan is to use the ELT tool to pull the data as-is.
@coilysiren
It's a business analytics data warehouse getting filled by an ELT process; it's definitely going to be down 80% of the time. By "down" I mean nothing would access it: no queries, keepalive pings, cleanup actions, etc., and the scale time would be acceptable. We used it for a POC at CACI and the only way we'd be sure it was down was manually, and under load we'd hang for 1-2 mins waiting for it to scale to expected performance.
@coilysiren The 'source of truth' comment threw me off...different meaning for me
One last thought. If this is not intended to be a live analytics platform but just preparing analytical data for downstream usage, then Redshift doesn't make any sense. The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.
@acouch for the decision criteria in the ticket, you wrote
Has it been documented from a security impact point of view that we want the public to have direct access to the analytics database? Because that's the implication of adding this as a requirement here. It's a non-traditional requirement, so I wanted to double-check that I'm understanding this correctly.
Given that Redshift is optimized for OLAP queries, I don't understand how this could possibly be the case. Everything I'm reading online says that Redshift is built for the data warehouse use case, but it sounds like you're saying that's wrong?
It is designed for OLAP and handles analytics much better, but what you're talking about is mainly in the storage layer. Because it's a column store, every bit of data is fully normalized and compresses far better, and queries are spread across MPP nodes, so they are dramatically faster on large data sets. The only thing you're missing is the large data sets. Given the sizes discussed here you're unlikely to see much of a difference, and may even be slower, due to the storage and transaction advantage in Aurora. In any case, Redshift is a better solution for a warehouse, but you can also just use a data lake.
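A toy sketch of the column-store compression point, using synthetic data (this illustrates the general effect of grouping similar values together, not Redshift's or Aurora's actual storage engines):

```python
import json
import zlib

# The same synthetic "page view" records serialized row-wise vs
# column-wise, then compressed. Column-wise layouts put similar values
# next to each other, which typically compresses much smaller.
rows = [
    {"page": f"/grants/{i % 100}", "status": 200,
     "device": "mobile" if i % 3 else "desktop"}
    for i in range(10_000)
]

row_wise = json.dumps(rows).encode()
columns = {key: [row[key] for row in rows] for key in rows[0]}
col_wise = json.dumps(columns).encode()

print("row-wise compressed bytes:", len(zlib.compress(row_wise)))
print("col-wise compressed bytes:", len(zlib.compress(col_wise)))
```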
Just wanted to add a quick note here: when we expose our metrics and underlying data it will be through an analytics API, so the public wouldn't have direct access to the OLAP database. We will also most likely be limiting public access to analytics data to aggregates rather than point-level data for things like site traffic, API calls, etc.
I confirmed the scaling issue for Redshift serverless. You need to set the base RPU (Redshift Processing Unit) to whatever is required for the ETL process, and then if you want the ability to scale up you can set a range. The scaling is concurrency scaling (read only), and is driven primarily by the number of queries, not their complexity. If you want the opposite you can do so in the Redshift API (increase RPU before ETL, reduce after, or even spin up a temp cluster), or you can just use a price ratio and AWS uses ML to control the workloads. Not sure if this would be ultimately cheaper than a reserved instance though. Looking at a minimum of $2.88/hr (just RPU), max likely $46/hr, based on projected workloads (assuming a 128 RPU max, based on average usage reporting).
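For transparency, those hourly figures are consistent with a rate of about $0.36 per RPU-hour, which I'm inferring from the numbers in the comment rather than taking from AWS's pricing page; a quick sketch:

```python
# Redshift Serverless cost sketch. The per-RPU-hour rate is inferred from
# the figures above ($2.88/hr at a base of 8 RPU); confirm the actual
# per-region rate against the AWS pricing page before relying on it.
RATE_PER_RPU_HOUR = 0.36

for rpus in (8, 128):  # assumed base RPU and projected max
    print(f"{rpus} RPU -> ${rpus * RATE_PER_RPU_HOUR:.2f}/hr")
# 8 RPU -> $2.88/hr, 128 RPU -> $46.08/hr
```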
This seems like less of a "scaling issue" and more "the fine-grained details about how it scales". Thanks for going to get those details!
[Data is hosted in Postgres at $0.115 per GB-month](https://aws.amazon.com/rds/postgresql/pricing/), higher than S3 and Redshift.

In every time range, Postgres loses to Redshift due to Postgres being an OLTP database built for real-time data processing. In the 2 - 5 year range, Postgres also loses due to its high GB-month hosting cost. There is, however, evidence that Postgres is faster than Redshift, and that Redshift's OLAP advantages don't kick in until the data grows to terabytes in size.
Pros of Postgres include:
- Open source
- Keeps the stack simple, as we already use it
- Easier to set up locally for open source devs or local dev, and for other teams to adopt
- I'm still skeptical that we will ever need to run queries that require an OLAP database
- We could use the same database for our open source BI tool (Metabase/Superset use Postgres)
Given the evidence and discussion, I've changed the decision in this ADR to instead be in favor of Postgres.
Summary
Fixes #1506
Time to review: 20 mins