
Add ability to perform accuracy checks / repair sink data #843

Open
bmtcril opened this issue May 31, 2024 · 4 comments

Comments

@bmtcril
Contributor

bmtcril commented May 31, 2024

Building on the work in openedx/aspects-dbt#102, I think we should add some checks that an operator can run to see whether Aspects matches the source-of-truth tables for important data. This would require connecting ClickHouse to MySQL and running some potentially large queries, but the checks would be expected to run manually or during low-traffic windows to avoid causing performance issues.

Some high level thoughts:

  • We should be able to run dbt test at any point to check the data in ClickHouse, provided the tests in the above issue are written in a performant way. My past experience with OLAP databases and dbt test makes me think we want some kind of "fudge factor" for the fact that data is constantly flowing in: either only checking data more than X minutes old, or allowing some slop in the acceptable test conditions to account for it
  • We should be able to check event sink source tables fairly easily (give or take all of the xblocks in a course) by comparing counts between MySQL and ClickHouse tables, since they are effectively mirrors. We should be able to check course publish times to make sure we have the freshest courses.
  • For event sinks we should also be able to optionally "repair" those tables by triggering sink dumps for known missing IDs
  • We should be able to compare these items for users fairly easily: currently enrolled, passing/failing, enrollment track, current grade, current email address (if PII is on)
  • I don't think we should focus on engagement, as most of those events are transient and have no analogous source of truth
  • We may be able to identify and report outage windows by aggregating timestamps in the missing data, directing operators to places where logs may need to be replayed. For example, if there are 15 incorrect or missing enrollment rows in ClickHouse, we should be able to pull the CourseEnrollment created datetime or the CourseEnrollmentHistory rows and check when updates were missed
  • We could run in different check modes, like "last x days", "random sample", and "by course" to limit the amount of data processed
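The count-comparison and "repair missing IDs" steps above could be sketched as follows. This is a minimal, hypothetical illustration of the reconciliation logic only: in a real check the two ID sets and counts would come from queries against MySQL (source of truth) and ClickHouse (sink); here they are in-memory stand-ins, and the function names are invented for this sketch.

```python
def find_missing_ids(source_ids, sink_ids):
    """Return IDs present in the source of truth but absent from the sink.

    These are the rows we would re-trigger sink dumps for.
    """
    return sorted(set(source_ids) - set(sink_ids))


def counts_match(source_count, sink_count, slop=0):
    """Compare row counts, allowing `slop` rows of drift for in-flight data.

    `slop` is the "fudge factor" discussed above: data is constantly
    flowing in, so an exact match may be too strict.
    """
    return abs(source_count - sink_count) <= slop


# Stand-ins for SELECT id FROM ... results on each side.
mysql_enrollment_ids = {1, 2, 3, 4, 5}
clickhouse_enrollment_ids = {1, 2, 4}

missing = find_missing_ids(mysql_enrollment_ids, clickhouse_enrollment_ids)
print(missing)                        # IDs to backfill: [3, 5]
print(counts_match(5, 3, slop=1))     # drift too large: False
print(counts_match(5, 5))             # exact match: True
```

The same shape would apply to any of the event-sink mirror tables; only the queries that produce the ID sets would differ.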

This ticket is just to start the conversation, but once we decide on a list of features it can be converted into an epic with a number of sub-tasks.

@bmtcril
Contributor Author

bmtcril commented May 31, 2024

@Ian2012 @saraburns1 @pomegranited I'd love your thoughts and feedback on this and the accompanying aspects-dbt ticket. I think it's the next major chunk of work we can do while waiting on feedback from v1.

@saraburns1
Contributor

Sounds cool!

I think the sink repair and outage reports should be follow-ups, since getting the connections set up and doing all the validations should be the first priority. They could be a separate epic or just fast-follow tickets.

Also wanted to bring up a probably rare case: we should think about what happens if MySQL is missing data that exists in ClickHouse (e.g. if a SQL transaction failed after the data was already sent downstream). Would we delete the ClickHouse data, since SQL is the source of truth? Could there be mismatched data from a failed update (in which case ClickHouse might be more accurate)?

@pomegranited
Contributor

Another point to add is data bootstrapping:

Suppose I have an existing Open edX instance with lots of users / courses / enrollments, and I add Aspects to the deployment. The sinks would catch any newly triggered events, so they would be partially populated, but they would need to be repaired to receive the old data.

@bmtcril
Contributor Author

bmtcril commented Jun 20, 2024

@pomegranited right! It should be able to either run, or generate, the correct management commands to backfill the sinks where needed. One nit to keep in mind is data older than the TTL, which we shouldn't alert or backfill on.
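The TTL guard mentioned here could look something like the sketch below: before generating backfill commands, drop any missing rows whose source timestamp is already past the ClickHouse table's TTL, since they would be deleted again immediately. The `ttl_days` value and the `(id, created)` record shape are illustrative assumptions, not anything defined in Aspects.

```python
from datetime import datetime, timedelta, timezone


def backfill_candidates(missing_rows, ttl_days=365, now=None):
    """Return (id, created) pairs recent enough to be worth backfilling.

    Rows older than the assumed TTL are skipped: re-dumping them would
    just produce data that ClickHouse expires again, and we shouldn't
    alert on them either.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ttl_days)
    return [(row_id, created) for row_id, created in missing_rows
            if created >= cutoff]


now = datetime(2024, 6, 20, tzinfo=timezone.utc)
rows = [
    (101, datetime(2024, 6, 1, tzinfo=timezone.utc)),  # recent: backfill
    (102, datetime(2022, 1, 1, tzinfo=timezone.utc)),  # past TTL: skip
]
print(backfill_candidates(rows, ttl_days=365, now=now))  # only ID 101 survives
```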
