
[Issue #1506] Business Analytics Data Storage ADR #1503

Merged
merged 8 commits into main from coilysiren-patch-2
Apr 1, 2024

Conversation

Collaborator
@coilysiren coilysiren commented Mar 19, 2024

Summary

Fixes #1506

Time to review: 20 mins

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 19, 2024
@coilysiren coilysiren changed the title BI Storage ADR Business Analytics Data Storage ADR Mar 19, 2024
- API application metrics
- API infrastructure metrics from Cloudwatch

We will not be importing all of these types of data immediately. On the 0 - 6 month timeframe, we will only be importing the smaller datasets (thousands of records). By 2 - 5 years we will be importing all of these types of data, and our data size will be quite large (many millions of records). The desired solutions have different cost/performance characteristics in those time ranges, and we will need to evaluate those differences.
@widal001 widal001 Mar 20, 2024

One small note on this: I imagine by the 6 month mark we might be ingesting analytics and infrastructure metrics. We'd certainly want to start doing that by the 12-month mark.

I think the biggest outstanding question for those sources of data is whether we want to ingest point level data (e.g. individual page views, clicks, and API calls) or if we'd do some level of aggregation before loading it into our data warehouse.
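For illustration, a pre-load aggregation step of the kind described might look like the following (the table and column names are hypothetical, and SQLite stands in for the warehouse):

```python
import sqlite3

# Illustrative only: collapse raw page-view events into daily per-page
# counts before loading, instead of shipping every point-level record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, viewed_at TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/search", "2024-03-19"),
        ("u1", "/search", "2024-03-19"),
        ("u2", "/opportunity/123", "2024-03-19"),
        ("u2", "/search", "2024-03-20"),
    ],
)

# Point-level: 4 rows. Aggregated: one row per (day, page).
daily = conn.execute(
    """
    SELECT viewed_at AS day, page, COUNT(*) AS views
    FROM page_views
    GROUP BY viewed_at, page
    ORDER BY day, page
    """
).fetchall()
print(daily)
# [('2024-03-19', '/opportunity/123', 1), ('2024-03-19', '/search', 2), ('2024-03-20', '/search', 1)]
```

The trade-off is the one named above: aggregation shrinks volume dramatically but discards the ability to tie events back to sessions later.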

Just to "show our work" around volume of data metrics:

  • ~250k users in the past 7 days (based on latest GA metrics for grants.gov)
  • ~4 average page views per user per week (conservative estimate for sizing purposes)
  • 52 weeks per year
  • ~52 million page view records per year
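That arithmetic can be checked in a couple of lines (the figures are the estimates above, not measurements):

```python
# Back-of-envelope sizing from the bullet points above.
users_per_week = 250_000       # ~users in the past 7 days (latest GA metrics)
views_per_user_per_week = 4    # conservative estimate for sizing purposes
weeks_per_year = 52

page_views_per_year = users_per_week * views_per_user_per_week * weeks_per_year
print(f"{page_views_per_year:,}")  # 52,000,000
```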

Collaborator

This number could grow geometrically though if we also want to track things like impressions and clicks and tie those to sessions and active devices.

Comment on lines +51 to +53
At 0 - 6 months, S3 is a reasonable choice due to our small data sizes. Past that point, performance issues with large data sizes make S3 a non-ideal choice.

At 2 - 5 years, S3's performance issues require the introduction of another query / compute layer like AWS Athena or AWS Redshift.
Collaborator

These kinds of timeline-based comparisons are super helpful @coilysiren !!

- S3
- Redshift
- Postgres
- Snowflake
Collaborator

Would it be worth adding Aurora to this list of options per @bretthrosenblatt's comments in Slack?

Collaborator

Aurora=Postgres in this respect

Collaborator

I'd recommend both, actually

Collaborator

Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?

Collaborator Author

@widal001 AWS RDS Aurora PostgreSQL is its full name, and it's our main database, yes

Collaborator

Gotcha, I wasn't sure if postgres meant RDS -- that's what we're using for our main database right?

Purposely confusing. There is RDS Postgres, which is cloud-hosted Postgres, and Postgres-compatible Aurora, which is a Postgres-like interface on an Aurora db engine and storage cluster (what you're using)

@bretthrosenblatt
Collaborator

  1. Until you reach volume Redshift will be far slower than Aurora (I've used both). Redshift begins to excel at hundreds of terabytes, not millions of rows

  2. When you do reach critical volume, Redshift is considerably more expensive than Aurora, so one strategy is to use an Aurora cluster optimized for analytics and offload to Redshift as needed

  3. If you implement Redshift in a similar manner to Aurora...as an application datastore under SQLAlchemy/Alembic, it will provide very little value, as data will be locked in a simple/rigid state

  4. If you have expertise in either Aurora or Redshift, exploiting the capabilities of either/both is simple, as is taking advantage of the strengths of each. If you're not using any of the capabilities present in Aurora, then there will likely be no difference between the two

  5. Redshift can't be used as a source of truth, so any data sets requiring enforcement/validation would best be filtered through Aurora first and then moved to Redshift if necessary

@coilysiren
Collaborator Author

@bretthrosenblatt

When you do reach critical volume, Redshift is considerably more expensive than Aurora, so one strategy it to use an Aurora cluster optimized for analytics and offload to Redshift as needed

Can you provide a citation for that? When I looked into pricing, what I saw was:

I'm abstracting compute cost as 0 because both databases are "Serverless" and billed as such

If you implement Redshift in a similar manner to Aurora...as an application datastore under SQLAlchemy/Alembic, it will provide very little value, as data will be locked in a simple/rigid state

We plan on using it with an ELT tool, which should be more flexible.

Redshift can't be used as a source of truth, so any data sets requiring enforcement/validation would best be filtered through Aurora first and then moved to Redshift if necessary

I don't think we are going to be doing enforcement or validation of most of this data. I believe the plan is to use the ELT tool to pull the data as-is.

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@coilysiren
There's no benefit to using serverless in a prod environment unless you're sure it would be down at least 80% of the time (completely idle); otherwise it stays at the high range and costs more. For Redshift you'd likely have a much larger instance, and pay for Spectrum as well. For any kind of commercial usage you need multiple nodes for scaling, so cost goes up dramatically, especially if you have automated analytics effectively running all or most of the time.
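A rough sketch of the break-even logic behind the 80% figure (the 5x cost multiplier is an assumption chosen for illustration, not an AWS price):

```python
# Hedged sketch: serverless bills only for active hours, but at a premium
# over an equivalently sized provisioned instance that runs 24/7.
# Assumption (not from AWS pricing): the premium is roughly 5x per active hour.
serverless_cost_per_active_hour = 5.0  # relative units
provisioned_cost_per_hour = 1.0        # relative units, billed around the clock

# Serverless is cheaper only when:
#   active_fraction * 5.0 < 1.0  =>  active_fraction < 0.2
break_even_idle_fraction = 1 - provisioned_cost_per_hour / serverless_cost_per_active_hour
print(break_even_idle_fraction)  # 0.8 -> workload must be idle >= 80% of the time
```

Under that (assumed) multiplier, the workload has to be idle at least 80% of the time before serverless wins on cost, which matches the claim above.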

@coilysiren
Collaborator Author

There's no benefit to using serverless in a prod environment unless you're sure it would be down at least 80% of the time

It's a business analytics data warehouse getting filled by an ELT process; it's definitely going to be down 80% of the time.

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@coilysiren

It's a business analytics data warehouse getting filled by an ELT process; it's definitely going to be down 80% of the time.

By "down" I mean nothing would access it: no queries, keepalive pings, cleanup actions, etc., and the scale-up time would have to be acceptable. We used it for a POC at CACI, and the only way we could be sure it was down was manually; under load we'd hang for 1-2 mins waiting for it to scale to expected performance.

@bretthrosenblatt
Collaborator

@coilysiren
I don't think we are going to be doing enforcement or validation of most of this data. I believe the plan is to use the ELT tool to pull the data as-is

The 'source of truth' comment threw me off...different meaning for me

@bretthrosenblatt
Collaborator

One last thought. If this is not intended to be a live analytics platform but just preparing analytical data for downstream usage, then Redshift doesn't make any sense. The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

@coilysiren
Collaborator Author

@acouch for the decision criteria in the ticket, you wrote

replicable for outside users (ie can a member of the public run reports)

Has it been documented from a security impact point of view that we want the public to have direct access to the analytics database? Because that's the implication of adding this as a requirement here. It's a non-traditional requirement, so I wanted to double-check that I'm understanding this correctly.

@coilysiren
Collaborator Author

@bretthrosenblatt

The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

Given that Redshift is optimized for OLAP queries, I don't understand how this could possibly be the case. Everything I'm reading online says that Redshift is built for the data warehouse use case, but it sounds like you're saying that's wrong?

@bretthrosenblatt
Collaborator

bretthrosenblatt commented Mar 20, 2024

@bretthrosenblatt

The potential performance advantage would be irrelevant, so I don't see how a separate platform would be of added benefit to what you already have.

Given that Redshift is optimized for OLAP queries, I don't understand how this could possibly be the case. Everything I'm reading online says that Redshift is built for the data warehouse use case, but it sounds like you're saying that's wrong?

It is designed for OLAP and handles analytics much better, but what you're talking about is mainly in the storage layer. Because it's a column store, every bit of data is fully normalized and compresses far better, and queries are spread across MPP nodes, so they are dramatically faster on large data sets. The only thing you're missing are the large data sets. Given the sizes discussed here you're unlikely to see much of a difference, and may even be slower, due to the storage and transaction advantage in Aurora.

In any case, Redshift is a better solution for a warehouse, but you can also just use a data lake.

@widal001
Collaborator

@acouch for the decision criteria in the ticket, you wrote

replicable for outside users (ie can a member of the public run reports)

Has it been documented from a security impact point of view that we want the public to have direct access to the analytics database? Because that's the implication of adding this as a requirement here. It's a non-traditional requirement, so I wanted to double-check that I'm understanding this correctly.

Just wanted to add a quick note here: when we expose our metrics and underlying data it will be through an analytics API, so the public wouldn't have direct access to the OLAP database. We will also most likely be limiting public access to analytics data to aggregates rather than point-level data for things like site traffic, API calls, etc.

@coilysiren coilysiren marked this pull request as ready for review March 20, 2024 19:47
@bretthrosenblatt
Collaborator

I confirmed the scaling issue for Redshift serverless. You need to set the base RPU (Redshift Processing Unit) to whatever is required for the ETL process, and then if you want the ability to scale up you can set a range. The scaling is concurrency scaling (read only), and is driven primarily by the number of queries, not their complexity. If you want the opposite you can do so in the Redshift API (increase RPU before ETL, reduce after, or even spin up a temp cluster), or you can just use a price ratio and AWS uses ML to control the workloads. Not sure if this would ultimately be cheaper than a reserved instance though. Looking at a minimum of $2.88/hr (just RPU), max likely $46/hr, based on projected workloads (assuming a 128 RPU max, based on average usage reporting).
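The quoted hourly figures are consistent with a flat per-RPU-hour rate (the rate below is back-solved from the numbers above, not taken from a price sheet; check current AWS pricing for your region):

```python
# Reproducing the quoted hourly figures from an assumed flat rate.
rate_per_rpu_hour = 0.36  # assumed $/RPU-hour, back-solved from the comment
base_rpu = 8              # minimum base capacity for the ETL process
max_rpu = 128             # projected ceiling from average usage reporting

print(round(base_rpu * rate_per_rpu_hour, 2))  # 2.88 -> $/hr at the floor
print(round(max_rpu * rate_per_rpu_hour, 2))   # 46.08 -> $/hr at full scale
```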

@coilysiren
Collaborator Author

coilysiren commented Mar 21, 2024

I confirmed the scaling issue for Redshift serverless [...]

This seems like less of a "scaling issue" and more "the fine-grained details of how it scales".

Thanks for going to get those details!

@coilysiren coilysiren changed the title Business Analytics Data Storage ADR [Issue #1506] Business Analytics Data Storage ADR Mar 22, 2024

[Data is hosted in Postgres at $0.115 per GB-month](https://aws.amazon.com/rds/postgresql/pricing/), higher than S3 and Redshift.

In every time range, Postgres loses to Redshift due to Postgres being an OLTP database built for real-time data processing. In the 2 - 5 year range, Postgres also loses due to its high GB-month hosting cost. There is, however, evidence that Postgres is faster than Redshift, and that Redshift's OLAP advantages don't kick in until the data grows to terabytes in size.
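To compare the storage costs concretely, a quick sketch (only the Postgres rate comes from the ADR text; the S3 and Redshift rates are assumed typical published figures, and the dataset size is hypothetical):

```python
# Monthly storage cost at a given GB-month rate.
rates_per_gb_month = {
    "postgres_rds": 0.115,      # from the ADR text
    "s3_standard": 0.023,       # assumed typical published rate
    "redshift_managed": 0.024,  # assumed typical published rate
}

data_gb = 100  # hypothetical dataset size
for store, rate in rates_per_gb_month.items():
    print(f"{store}: ${data_gb * rate:.2f}/month")
```

At small data sizes the absolute monthly difference is modest, which is part of why the timeline framing above matters.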
Collaborator

Pros of postgres include:

  • Open Source
  • Keep stack simplified as we already use it
  • Easier to setup locally for open source devs or local dev and for other teams to adopt
  • I'm still skeptical that we will ever need to run queries that require an OLAP database
  • We could use the same database for our open source BI tool (Metabase/Superset use postgres)

@coilysiren
Collaborator Author

Given the evidence and discussion, I've changed the decision in this ADR to instead be in favor of Postgres

@coilysiren coilysiren requested a review from acouch March 25, 2024 17:47
@coilysiren coilysiren merged commit 306210f into main Apr 1, 2024
2 checks passed
@coilysiren coilysiren deleted the coilysiren-patch-2 branch April 1, 2024 18:07