-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Figures pipeline performance improvement #429
Conversation
903b064
to
4cd78a7
Compare
This PR adds `figures.course`, a new module that has the `Course` class The goal of this class is to simplify Figures code This class provides data specific to and associated with a course. Our initial version provides only the essential data needed for our pipeline performance improvement
6142aa2
to
439063a
Compare
This PR provides alternate execution to create the same data as before * See the module docstring in `figures.pipeline.enrollment_metrics_next` for details * This PR contains a new module and tests to support the new module
439063a
to
f6a177d
Compare
* Default workflow is the old progress calculator, enrollment data collector (in `figures.pipeline.enrollment_metrics`) * New workflow calls `figures.pipeline.enrollment_metrics_next'` * Updates tests to handle both conditions and default when `ed_next` is not set
* Updated daily metrics tests to exercise new code * Added new helper function, `fake_course_key` so we don't have to create a CourseOverview to get a couse id when we just need a valid course key * Also lint fixed figures/management/commands/backfill_figures_daily_metrics.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnbaldwin I've added two optional nitpicks and one question for celery task scheudling.
The pull request looks good, thanks!
def extract(self, course_id, date_for, **_kwargs): | ||
""" | ||
defaults = dict( | ||
def extract(self, course_id, date_for, ed_next=False, **_kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: ed_next
-> use_next_enrollment_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I will add the ed_next=False
default. But want to keep the name short. It is temporary until it replaces the existing system, I made sure to have all of these checks use the same variable name to make it simple to grep the code for this and the longer use_next_enrollment_data
made the source look messier. I was spending too much time making code pretty to fix the quite long variable name. This felt like a waste of time and the short name, being short lived was a reasonable trade-off.
edit: Hopefully sooner rather than later, I'm going to overhaul the pipeline code and task scheduling to be more plugable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, looks like I already set ed_next=False
as default. I agree, the name isn't great, but I did that abbreviation on purpose. Having you have a nit about it is a bonus. Once we accept "Workflow Next" as the new workflow for Tahoe, we can clean house on the old workflow, reduce a ton of code and this will go away too
@@ -75,11 +75,6 @@ def handle(self, *args, **options): | |||
metrics_func(**kwargs) | |||
else: | |||
metrics_func.delay(**kwargs) # pragma: no cover | |||
# except Exception as e: # pylint: disable=bare-except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: also remove # try:
please:
# try: |
populate_daily_metrics_for_site(site_id=site.id, | ||
date_for=date_for, | ||
ed_next=True, | ||
force_update=force_update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Celery queue question: This would run populate_daily_metrics_for_site
synchronously. Do we want this behavior?
populate_daily_metrics_for_site.delay()
should be easier to work with in terms of deployments and queue impact.
We have the group()
and delay()
pattern used in figures:
Lines 456 to 468 in 2f08796
@shared_task | |
def run_figures_monthly_metrics(): | |
""" | |
Populate monthly metrics for all sites. | |
""" | |
if waffle.switch_is_active(WAFFLE_DISABLE_PIPELINE): | |
logger.info('Figures pipeline is disabled due to %s being active.', | |
WAFFLE_DISABLE_PIPELINE) | |
return | |
logger.info('Starting figures.tasks.run_figures_monthly_metrics...') | |
all_sites_jobs = group(populate_monthly_metrics_for_site.s(site.id) for site in get_sites()) | |
all_sites_jobs.delay() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OmarIthawi I would prefer to run the tasks synchronously. We have not historically because when I originally implemented the celery tasks, I tried it synchronously with very simple vanilla tasks on standalone edx-platform and they would just "poof" disappear. No logs, no exceptions, nothing. Just drop out of existence. So we carried them on as serial executions since.
Soon in my queue is to drastically improve Figures backfill / data repair toolkit and health checking, and improve the Figures Celery docker devsite environment.
Once we have those plus documentation and training the platform team on how to use it. THEN we can optimize Figures tasks to run in parallel as you mention
* Removed dead code
This PR contains code to improve Figures pipeline performance for collecting daily metrics
This PR follows on #427
What is in this PR?
figures.course
- A new module to simplify retrieving data about a specific course. See commit, 839b322figures.pipeline.enrollment_metrics_next
- A new module to support an alternate workflow to collect per-enrollment metrics and provide aggregate progress metrics. See commit, 4dcbd75figures.pipeline.course_daily_metrics
to allow the deployment to chose which enrollment data collection and reporting to use. See commit, 76aea12figures.tasks
to provide new tasks to run this alternate workflow. See commit, 2f08796See also #428
What needs to follow on as a new PR