[AIRFLOW-5644] Simplify TriggerDagRunOperator usage #6317
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master    #6317    +/-   ##
=========================================
- Coverage   80.57%    80.3%    -0.27%
=========================================
  Files         626      626
  Lines       36237    36223      -14
=========================================
- Hits        29198    29090     -108
- Misses       7039     7133      +94
```
Continue to review full report at Codecov.
```python
@@ -511,18 +510,6 @@ def check_failure(context, test_case=self):
            start_date=DEFAULT_DATE, end_date=DEFAULT_DATE, ignore_ti_state=True)
        self.assertTrue(data['called'])

    def test_trigger_dagrun(self):
        def trigga(_, obj):
```
I liked this name 😅

Overall, I like the proposed simplification ✅
LGTM
```python
    run_id = "trig__{}".format(self.execution_date.isoformat())
elif isinstance(self.execution_date, str):
    run_id = "trig__{}".format(self.execution_date)
    self.execution_date = timezone.parse(self.execution_date)  # trigger_dag() expects datetime
```
Wouldn't it be better to set the execution date on line 58, where you are setting it the first time?

```python
if isinstance(execution_date, str):
    self.execution_date = timezone.parse(execution_date)
else:
    self.execution_date = execution_date
```

IMO `self.execution_date = timezone.parse(self.execution_date)` is a kind of validation, so it should be done in the constructor, even if it will then be called more often than in `execute`. Then in `execute` you can just do:

```python
if self.execution_date is None:
    self.execution_date = timezone.utcnow()

run_id = "trig__{}".format(self.execution_date.isoformat())
```

WDYT?
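For readability, here is a consolidated sketch of that suggestion (class and argument names are assumed for illustration, this is not the merged code): parse and validate once in the constructor, fall back to "now" in `execute()`.

```python
from airflow.utils import timezone


class TriggerDagRunOperatorSketch:
    """Hypothetical stripped-down operator illustrating the suggestion above."""

    def __init__(self, execution_date=None):
        if isinstance(execution_date, str):
            # Eager validation: fail at DAG-definition time on unparseable input.
            self.execution_date = timezone.parse(execution_date)
        else:
            self.execution_date = execution_date

    def execute(self, context):
        if self.execution_date is None:
            self.execution_date = timezone.utcnow()
        run_id = "trig__{}".format(self.execution_date.isoformat())
        return run_id
```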
I'd like to use it as it is much shorter and simpler. However, the `execution_date` can be templated. For example, `{{ execution_date }}` will fail in `timezone.parse()`. So we have to save it first, wait for `execute()` to be called and all variables to be templated, and only then can we call `timezone.parse()` on the `execution_date` :(
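A small illustration of that objection (a hypothetical snippet, not from the PR): at DAG-definition time a templated value is still the raw Jinja string, so eager parsing fails before `execute()` has a chance to render it.

```python
from airflow.utils import timezone

execution_date = "{{ execution_date }}"  # still a raw Jinja template at constructor time

try:
    timezone.parse(execution_date)  # pendulum cannot parse the unrendered template
except Exception as err:
    print("parsing must wait until execute():", err)
```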
Greetings! I'm curious to know if you folks knew this change reduced functionality. Specifically, we have workflows where the `python_callable` inspects the incoming conf and decides whether to trigger the downstream DAG at all. I am not sure how this can be done after the change. In general, having the convenience of an arbitrary Python callable to hook into and modify behavior based on incoming conf is very valuable. For a practical example, this task would trigger a DAG only if a flag was set in the conf; this flag can vary between DAG runs, but the same DAG can model both behaviors:

```python
step1 = SomeOperator()
step2 = AnotherOperator()


def check_and_trigger(context, dag_run_obj):
    payload = context["dag_run"].conf
    if not payload["should_trigger"]:
        return False
    dag_run_obj.payload = payload["downstream_payload"]
    return dag_run_obj


maybe_trigger_bar_dag = TriggerDagRunOperator(
    task_id="maybe_trigger_bar_dag",
    trigger_dag_id="bar",
    python_callable=check_and_trigger,
)

step1 >> step2 >> maybe_trigger_bar_dag
```

In our use case, the DAG itself is static but takes in a few parameters via `dag_run.conf`. Please let me know if I can explain things further. I was unable to find the motivation for this change apart from the Jira ticket linked, so please do point me to more reading if it exists, so I can gain context.
Hi @Sharadh,

The main motivation for this change was code clarity. Back then, I found the `python_callable` plus `DagRunOrder` construct unintuitive and hard to read.

To achieve the same, I suggest splitting your task into two: (1) a PythonOperator which skips everything downstream if the flag is not set, and (2) a TriggerDagRunOperator which passes the payload on to the triggered DAG:

```python
from airflow.exceptions import AirflowSkipException

step1 = SomeOperator()
step2 = AnotherOperator()


def _should_trigger(dag_run, **_):
    if not dag_run.conf["should_trigger"]:
        raise AirflowSkipException("should_trigger set to False")


should_trigger = PythonOperator(
    task_id="should_trigger",
    python_callable=_should_trigger,
    provide_context=True,
)

trigger_bar_dag = TriggerDagRunOperator(
    task_id="trigger_bar_dag",
    trigger_dag_id="bar",
    conf={"downstream_payload": "{{ dag_run.conf['downstream_payload'] }}"},
)

step1 >> step2 >> should_trigger >> trigger_bar_dag
```

Since you're passing the payload to the to-be-triggered DAG via the `conf` argument, which is templated, the value from the current run's `dag_run.conf` is forwarded at runtime. The `AirflowSkipException` skips the trigger task whenever the flag is not set.

Does this clarify?
Thanks for the quick reply @BasPH, and especially for taking the time to sketch out the workaround / new style. That does clarify - I'm assuming that if we need to add values, we can mutate the conf inside the callable, something like:

```python
def _should_trigger(dag_run, **_):
    if not dag_run.conf["should_trigger"]:
        raise AirflowSkipException("should_trigger set to False")
    dag_run.conf["downstream_payload"]["trigger_checked_at"] = datetime.now()
```

I do also appreciate - from a design perspective - that the code now is simpler; I daresay it's cleaner for us to separate out the logic of if / how to trigger (which is purely business logic) from the trigger itself (which is purely an Airflow construct). I guess the old code was conflating a bit of both.

Is there some sort of release notes for the new `TriggerDagRunOperator`?
Unless I overlooked something, the changes to TriggerDagRunOperator merged here are missing from 1.10.11 - https://github.com/apache/airflow/blob/1.10.11/airflow/operators/dagrun_operator.py still contains the old `python_callable`-based implementation.
Yes @shippy, these changes are not yet released. As noted in https://issues.apache.org/jira/browse/AIRFLOW-5644, they will be included in 2.0.
@BasPH Could you clarify why setting the run_id is considered abusing the behavior? I'm working with Airflow to build DAGs that are externally triggered once per patient, not on a time interval. It's very helpful to be able to search the run_ids for a patient ID, so each externally triggered DAG run sets the run_id to a patient ID. I was hoping to be able to set the run_id with this operator, but now it sounds like I'm going down the wrong path.
Setting DAG Run ids via the old TriggerDagRunOperator feels like one big hack, both from a code perspective (the Python callable to provide is odd to read) and a logical perspective (I don't think it's desirable to have users edit the run id). To understand your use case, I'd like to know how exactly you're searching for patient-specific DAG runs?
In the DagRun model list view there is an option to search by run_id. Since we set the run_id when the DAG is triggered externally, this field can be used when we need to check on progress for a specific case. This seems to work alright, but there are some downsides.

So I'm also exploring options for moving away from using run_ids and just adding this info to the run conf. But I think this is going to require updating the DagRun model views, or a custom view plugin.
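For what it's worth, a minimal sketch of querying runs programmatically rather than through the UI (the dag_id and patient ID below are assumed values): the metadata DB can be filtered on run_id substrings via the ORM.

```python
from airflow.models import DagRun
from airflow.settings import Session

session = Session()
# Find externally triggered runs whose run_id embeds a patient ID.
runs = (
    session.query(DagRun)
    .filter(DagRun.dag_id == "patient_workflow")     # assumed dag_id
    .filter(DagRun.run_id.contains("patient-1234"))  # assumed patient ID
    .all()
)
session.close()
```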
I don't think so. TriggerDagRunOperator should allow a user to set the run_id the same way the Airflow CLI lets the user set one when triggering a DAG. In my case, I use the TriggerDagRunOperator to trigger 30K DAG runs daily, and it would be very annoying to see them having unidentifiable run_ids. Could you please look into this again?
Hi Sharadh,

Did you manage to find a way to edit dag_run.conf to add further inputs before passing it to the TriggerDagRunOperator conf parameter? I have been trying this and can't seem to find a solution.

Thanks,
I am in the same boat. We use the run_id to identify runs, and when there are so many runs it is very useful. It is particularly useful in the Tree View, where the run_id shows up on hover over a run without having to go somewhere else. Why couldn't the run_id be added as a parameter?
Hi @BasPH, I realize this is an old thread now, but we just migrated to Airflow 2.0.0. Similar to @Sharadh, is there a way to dynamically pass a dag_run_obj or conf object to the newly triggered DAG? There does not seem to be a way to do this after the removal of the python_callable function. Is "downstream_payload" below a unique variable that passes the current DAG's dag_run object to the newly triggered DAG?
Here is my current code (with some explanation of the flow and approach):
Any help or advice would be appreciated! Thank you.
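For reference, a minimal sketch of the conf pass-through BasPH described earlier in this thread, using the Airflow 2.0 import path (`bar` and the key names are assumptions carried over from that example): `conf` is a templated field, so values from the current run's `dag_run.conf` can be forwarded at runtime.

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_bar_dag = TriggerDagRunOperator(
    task_id="trigger_bar_dag",
    trigger_dag_id="bar",
    # Rendered at runtime: forwards a value from the triggering run's conf.
    conf={"downstream_payload": "{{ dag_run.conf['downstream_payload'] }}"},
)
```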
Late to the party here, but I agree this change is causing pain when upgrading to 2.x. @mathee06, if you recall, what was your final approach? I will likely move the logic to a new task before the trigger, and possibly to a new task in the downstream triggered DAG.
Description
This PR refactors the TriggerDagRunOperator to provide a much more intuitive behaviour: it now has a `conf` argument to which a dict can be provided as configuration for the triggered DagRun. A before/after sketch is given below.
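The original before/after snippets did not survive extraction; below is a reconstructed sketch based on the examples discussed in this thread (task ids and payload values are assumptions, and the two halves target the old and new operator respectively, so they would not run against the same Airflow version).

```python
from airflow.operators.dagrun_operator import TriggerDagRunOperator


# Before: behaviour hidden inside a python_callable returning a DagRunOrder-style object.
def _modify_dro(context, dag_run_obj):
    dag_run_obj.payload = {"message": "Hello world"}
    return dag_run_obj


trigger_before = TriggerDagRunOperator(
    task_id="trigger",
    trigger_dag_id="target_dag",
    python_callable=_modify_dro,
)

# After: configuration is passed directly via the (templated) conf argument.
trigger_after = TriggerDagRunOperator(
    task_id="trigger",
    trigger_dag_id="target_dag",
    conf={"message": "Hello world"},
)
```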
It removes the `python_callable` argument and is thus not backwards compatible, so it should be merged in Airflow 2.0. Also (I think), people might have "abused" this weird DagRunOrder class to set their DagRun id. This PR removes that possibility.

Tests
TriggerDagRunOperator tests were extracted from core.py and placed in a dedicated test_dagrun_operator.py file. I added additional tests for validating correct behaviour.
These tests were a bit tricky because they rely on passing state via the database, but the triggered DAG file is also read from disk somewhere in the code. To make the tests idempotent and not rely on external files (i.e. example DAGs), the `setUp()` writes a small DAG to a temporary file, which is used throughout the tests, and in the `tearDown()` all state is removed from the DB.
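A minimal sketch of that test arrangement (assumed structure, not the actual test_dagrun_operator.py): `setUp()` writes a tiny DAG to a temporary file, and `tearDown()` wipes the state the test wrote to the metadata DB.

```python
import tempfile
import textwrap
import unittest

from airflow.models import DagRun, TaskInstance
from airflow.utils.db import create_session


class TestTriggerDagRunOperator(unittest.TestCase):
    def setUp(self):
        # Write a minimal target DAG to a temporary file so the tests do not
        # depend on the example DAGs shipped with Airflow.
        self.dag_file = tempfile.NamedTemporaryFile(mode="w", suffix=".py")
        self.dag_file.write(textwrap.dedent(
            """
            from airflow.models import DAG
            from airflow.operators.dummy_operator import DummyOperator
            from airflow.utils import timezone

            dag = DAG(
                dag_id="trigger_target",  # assumed id
                default_args={"start_date": timezone.datetime(2019, 1, 1)},
                schedule_interval=None,
            )
            DummyOperator(task_id="do_nothing", dag=dag)
            """
        ))
        self.dag_file.flush()

    def tearDown(self):
        # Remove all state the test wrote to the metadata database.
        with create_session() as session:
            session.query(DagRun).delete()
            session.query(TaskInstance).delete()
        self.dag_file.close()
```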