
[Issue #1506] Business Analytics Data Storage ADR #1503

Merged
merged 8 commits into main from coilysiren-patch-2
Apr 1, 2024

Conversation

Collaborator
@coilysiren coilysiren commented Mar 19, 2024

Summary

Fixes #1506

Time to review: 20 mins

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 19, 2024
@coilysiren coilysiren changed the title BI Storage ADR Business Analytics Data Storage ADR Mar 19, 2024
- API application metrics
- API infrastructure metrics from Cloudwatch

We will not be importing all of these types of data immediately. On the 0 - 6 month timeframe, we will only be importing the smaller datasets (thousands of records). By 2 - 5 years we will be importing all of these types of data, and our data size will be quite large (many millions of records). The desired solutions have different cost/performance characteristics in those time ranges, and we will need to evaluate those differences.
@widal001 widal001 Mar 20, 2024

One small note on this: I imagine by the 6 month mark we might be ingesting analytics and infrastructure metrics. We'd certainly want to start doing that by the 12-month mark.

I think the biggest outstanding question for those sources of data is whether we want to ingest point level data (e.g. individual page views, clicks, and API calls) or if we'd do some level of aggregation before loading it into our data warehouse.
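For illustration, a pre-load aggregation step of the kind described might look like the following (the table and column names are hypothetical, and SQLite stands in for the warehouse):

```python
import sqlite3

# Illustrative only: collapse raw page-view events into daily per-page
# counts before loading, instead of shipping every point-level record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, viewed_at TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/search", "2024-03-19"),
        ("u1", "/search", "2024-03-19"),
        ("u2", "/opportunity/123", "2024-03-19"),
        ("u2", "/search", "2024-03-20"),
    ],
)

# Point-level: 4 rows. Aggregated: one row per (day, page).
daily = conn.execute(
    """
    SELECT viewed_at AS day, page, COUNT(*) AS views
    FROM page_views
    GROUP BY viewed_at, page
    ORDER BY day, page
    """
).fetchall()
print(daily)
# [('2024-03-19', '/opportunity/123', 1), ('2024-03-19', '/search', 2), ('2024-03-20', '/search', 1)]
```

The trade-off is the one named above: aggregation shrinks volume dramatically but discards the ability to tie events back to sessions later.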

Just to "show our work" around volume of data metrics:

  • ~250k users in the past 7 days (based on latest GA metrics for grants.gov)
  • ~4 average page views per user per week (conservative estimate for sizing purposes)
  • 52 weeks per year
  • ~52 million page view records per year
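That arithmetic can be checked in a couple of lines (the figures are the estimates above, not measurements):

```python
# Back-of-envelope sizing from the bullet points above.
users_per_week = 250_000       # ~users in the past 7 days (latest GA metrics)
views_per_user_per_week = 4    # conservative estimate for sizing purposes
weeks_per_year = 52

page_views_per_year = users_per_week * views_per_user_per_week * weeks_per_year
print(f"{page_views_per_year:,}")  # 52,000,000
```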

Collaborator

This number could grow geometrically though if we also want to track things like impressions and clicks and tie those to sessions and active devices.

Comment on lines +51 to +53
At 0 - 6 months, S3 is a reasonable choice due to our small data sizes. Past that point, performance issues with large data sizes make S3 a non-ideal choice.

At 2 - 5 years, S3's performance issues require the introduction of another query / compute layer like AWS Athena or AWS Redshift.
Collaborator

These kinds of timeline-based comparisons are super helpful @coilysiren !!

- S3
- Redshift
- Postgres
- Snowflake
Collaborator

Would it be worth adding Aurora to this list of options per @bretthrosenblatt's comments in Slack?

Collaborator

Aurora=Postgres in this respect

Collaborator

I'd recommend both, actually

Collaborator

Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?

Collaborator Author

@widal001 AWS RDS Aurora PostgreSQL is its full name, and it's our main database, yes

Collaborator

Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?

Purposely confusing. There is RDS Postgres, which is cloud-hosted Postgres, and Postgres-compatible Aurora, which is a Postgres-like interface on an Aurora db engine and storage cluster (what you're using)

@bretthrosenblatt
Collaborator

  1. Until you reach volume Redshift will be far slower than Aurora (I've used both). Redshift begins to excel at hundreds of terabytes, not millions of rows

  2. When you do reach critical volume, Redshift is considerably more expensive than Aurora, so one strategy is to use an Aurora cluster optimized for analytics and offload to Redshift as needed

  3. If you implement Redshift in a similar manner to Aurora...as an application datastore under SQLAlchemy/Alembic, it will provide very little value, as data will be locked in a simple/rigid state

  4. If you have expertise in either Aurora or Redshift, exploiting the capabilities of either/both is simple, as is taking advantage of the strengths of each. If you're not using any of the capabilities present in Aurora, then there will likely be no difference between the two

  5. Redshift can't be used as a source of truth, so any data sets requiring enforcement/validation would best be filtered through Aurora first and then moved to Redshift if necessary

@coilysiren
Collaborator Author

@bretthrosenblatt

When you do reach critical volume, Redshift is considerably more expensive than Aurora, so one strategy it to use an Aurora cluster optimized for analytics and offload to Redshift as needed

Can you provide a citation for that? When I looked into pricing, what I saw was:

I'm abstracting compute cost as 0 because both databases are "Serverless" and billed as such

If you implement Redshift in a similar manner to Aurora...as an application datastore under SQLAlchemy/Alembic, it will provide very little value, as data will be locked in a simple/rigid state

We plan on using it with an ELT tool, which should be more flexible.

Redshift can't be used as a source of truth, so any data sets requiring enforcement/validation would best be filtered through Aurora first and then moved to Redshift if necessary

I don't think we are going to be doing enforcement or validation of most of this data. I believe the plan is to use the ELT tool to pull the data as-is.

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@coilysiren
There's no benefit to using serverless in a prod environment unless you're sure it would be down at least 80% of the time (completely idle); otherwise it stays at the high range and costs more. For Redshift you'd likely have a much larger instance, and pay for Spectrum as well. For any kind of commercial usage you need multiple nodes for scaling, so cost goes up dramatically, especially if you have automated analytics effectively running all or most of the time.
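A rough sketch of the break-even logic behind the 80% figure (the 5x cost multiplier is an assumption chosen for illustration, not an AWS price):

```python
# Hedged sketch: serverless bills only for active hours, but at a premium
# over an equivalently sized provisioned instance that runs 24/7.
# Assumption (not from AWS pricing): the premium is roughly 5x per active hour.
serverless_cost_per_active_hour = 5.0  # relative units
provisioned_cost_per_hour = 1.0        # relative units, billed around the clock

# Serverless is cheaper only when:
#   active_fraction * 5.0 < 1.0  =>  active_fraction < 0.2
break_even_idle_fraction = 1 - provisioned_cost_per_hour / serverless_cost_per_active_hour
print(break_even_idle_fraction)  # 0.8 -> workload must be idle >= 80% of the time
```

Under that (assumed) multiplier, the workload has to be idle at least 80% of the time before serverless wins on cost, which matches the claim above.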

@coilysiren
Collaborator Author

There's no benefit to using serverless in a prod environment unless you're sure it would be down at least 80% of the time

It's a business analytics data warehouse getting filled by an ELT process; it's definitely going to be down 80% of the time.

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@coilysiren

It's a business analytics data warehouse getting filled by an ELT process; it's definitely going to be down 80% of the time.

By "down" I mean nothing would access it: no queries, keepalive pings, cleanup actions, etc., and the scale-up time would have to be acceptable. We used it for a POC at CACI, and the only way we could be sure it was down was manually; under load we'd hang for 1-2 mins waiting for it to scale to expected performance.

@bretthrosenblatt
Collaborator

@coilysiren
I don't think we are going to be doing enforcement or validation of most of this data. I believe the plan is to use the ELT tool to pull the data as-is

The 'source of truth' comment threw me off...different meaning for me

@bretthrosenblatt
Collaborator

One last thought. If this is not intended to be a live analytics platform but just preparing analytical data for downstream usage, then Redshift doesn't make any sense. The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

@coilysiren
Collaborator Author

@acouch for the decision criteria in the ticket, you wrote

replicable for outside users (ie can a member of the public run reports)

Has it been documented from a security impact point of view that we want the public to have direct access to the analytics database? Because that's the implication of adding this as a requirement here. It's a non-traditional requirement, so I wanted to double-check that I'm understanding this correctly.

@coilysiren
Collaborator Author

@bretthrosenblatt

The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

Given that Redshift is optimized for OLAP queries, I don't understand how this could possibly be the case. Everything I'm reading online says that Redshift is built for the data warehouse use case, but it sounds like you're saying that's wrong?

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@bretthrosenblatt

The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

Given that Redshift is optimized for OLAP queries, I don't understand how this could possibly be the case. Everything I'm reading online says that Redshift is built for the data warehouse use case, but it sounds like you're saying that's wrong?

It is designed for OLAP and handles analytics much better, but what you're talking about is mainly in the storage layer. Because it's a column store, every bit of data is fully normalized and compresses far better, and queries are spread across MPP nodes, so they are dramatically faster on large data sets. The only thing you're missing are the large data sets. Given the sizes discussed here you're unlikely to see much of a difference, and may even be slower, due to the storage and transaction advantage in Aurora.

In any case, Redshift is a better solution for a warehouse, but you can also just use a data lake.

@widal001
Collaborator

@acouch for the decision criteria in the ticket, you wrote

replicable for outside users (ie can a member of the public run reports)

Has it been documented from a security impact point of view that we want the public to have direct access to the analytics database? Because that's the implication of adding this as a requirement here. It's a non-traditional requirement, so I wanted to double-check that I'm understanding this correctly.

Just wanted to add a quick note here: when we expose our metrics and underlying data it will be through an analytics API, so the public wouldn't have direct access to the OLAP database. We will also most likely be limiting public access to analytics data to aggregates rather than point-level data for things like site traffic, API calls, etc.

@coilysiren coilysiren marked this pull request as ready for review March 20, 2024 19:47
@bretthrosenblatt
Collaborator

I confirmed the scaling issue for Redshift serverless. You need to set the base RPU (Redshift Processing Unit) to whatever is required for the ETL process, and then if you want the ability to scale up you can set a range. The scaling is concurrency scaling (read only), and is driven primarily by the number of queries, not their complexity. If you want the opposite you can do so in the Redshift API (increase RPU before ETL, reduce after, or even spin up a temp cluster), or you can just use a price ratio and AWS uses ML to control the workloads. Not sure if this would ultimately be cheaper than a reserved instance though. Looking at a minimum of $2.88/hr (just RPU), max likely $46/hr, based on projected workloads (assuming a 128 RPU max, based on average usage reporting).
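The quoted hourly figures are consistent with a flat per-RPU-hour rate (the rate below is back-solved from the numbers above, not taken from a price sheet; check current AWS pricing for your region):

```python
# Reproducing the quoted hourly figures from an assumed flat rate.
rate_per_rpu_hour = 0.36  # assumed $/RPU-hour, back-solved from the comment
base_rpu = 8              # minimum base capacity for the ETL process
max_rpu = 128             # projected ceiling from average usage reporting

print(round(base_rpu * rate_per_rpu_hour, 2))  # 2.88 -> $/hr at the floor
print(round(max_rpu * rate_per_rpu_hour, 2))   # 46.08 -> $/hr at full scale
```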

@coilysiren
Collaborator Author

coilysiren commented Mar 21, 2024

I confirmed the scaling issue for Redshift serverless [...]

This seems like less of a "scaling issue" and more "the fine-grained details of how it scales".

Thanks for going to get those details!

@coilysiren coilysiren changed the title Business Analytics Data Storage ADR [Issue #1506] Business Analytics Data Storage ADR Mar 22, 2024

[Data is hosted in Postgres at $0.115 per GB-month](https://aws.amazon.com/rds/postgresql/pricing/), higher than S3 and Redshift.

In every time range, Postgres loses to Redshift due to Postgres being an OLTP database built for real-time data processing. In the 2 - 5 year range, Postgres also loses due to its high GB-month hosting cost. There is, however, evidence that Postgres is faster than Redshift, and that Redshift's OLAP advantages don't kick in until the data grows to terabytes in size.
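To compare the storage costs concretely, a quick sketch (only the Postgres rate comes from the ADR text; the S3 and Redshift rates are assumed typical published figures, and the dataset size is hypothetical):

```python
# Monthly storage cost at a given GB-month rate.
rates_per_gb_month = {
    "postgres_rds": 0.115,      # from the ADR text
    "s3_standard": 0.023,       # assumed typical published rate
    "redshift_managed": 0.024,  # assumed typical published rate
}

data_gb = 100  # hypothetical dataset size
for store, rate in rates_per_gb_month.items():
    print(f"{store}: ${data_gb * rate:.2f}/month")
```

At small data sizes the absolute monthly difference is modest, which is part of why the timeline framing above matters.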
Collaborator

Pros of postgres include:

  • Open Source
  • Keep stack simplified as we already use it
  • Easier to setup locally for open source devs or local dev and for other teams to adopt
  • I'm still skeptical that we will ever need to run queries that require an OLAP database
  • We could use the same database for our open source BI tool (Metabase/Superset use postgres)

@coilysiren
Collaborator Author

Given the evidence and discussion, I've changed the decision in this ADR to instead be in favor of Postgres

@coilysiren coilysiren requested a review from acouch March 25, 2024 17:47
@coilysiren coilysiren merged commit 306210f into main Apr 1, 2024
2 checks passed
@coilysiren coilysiren deleted the coilysiren-patch-2 branch April 1, 2024 18:07