
AIP-45 breaks follow-on mini scheduler for mapped tasks #23838

Closed
ashb opened this issue May 20, 2022 · 8 comments · Fixed by #24772
Labels: area:dynamic-task-mapping AIP-42, area:Scheduler including HA (high availability) scheduler, kind:bug This is a clearly a bug
Milestone: Airflow 2.4.0

Comments

ashb (Member) commented May 20, 2022

I've just noticed that this causes a problem for the follow-on mini scheduler for mapped tasks. I guess that code path wasn't sufficiently unit tested.

DAG

import csv
import io
import os
import json
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.models.xcom_arg import XComArg
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.operators.s3 import S3ListOperator


with DAG(dag_id='mapped_s3', start_date=datetime(2022, 5, 19)) as dag:
    files = S3ListOperator(
        task_id="get_inputs",
        bucket="airflow-summit-2022",
        prefix="data_provider_a/{{ data_interval_end | ds }}/",
        delimiter='/',
        do_xcom_push=True,
    )

    @task
    def csv_to_json(aws_conn_id, input_bucket, key, output_bucket):
        hook = S3Hook(aws_conn_id=aws_conn_id)

        csv_data = hook.read_key(key, input_bucket)
        reader = csv.DictReader(io.StringIO(csv_data))

        output = io.BytesIO()

        for row in reader:
            output.write(json.dumps(row, indent=None).encode('utf-8'))
            output.write(b"\n")

        output.seek(0, os.SEEK_SET)
        hook.load_file_obj(output, key=key.replace(".csv", ".json"), bucket_name=output_bucket)

    csv_to_json.partial(
        aws_conn_id="aws_default", input_bucket=files.bucket, output_bucket="airflow-summit-2022-processed"
    ).expand(key=XComArg(files))

Error:

  File "/home/ash/code/airflow/airflow/airflow/jobs/local_task_job.py", line 253, in _run_mini_scheduler_on_child_tasks
    info = dag_run.task_instance_scheduling_decisions(session)
  File "/home/ash/code/airflow/airflow/airflow/utils/session.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/home/ash/code/airflow/airflow/airflow/models/dagrun.py", line 658, in task_instance_scheduling_decisions
    schedulable_tis, changed_tis, expansion_happened = self._get_ready_tis(
  File "/home/ash/code/airflow/airflow/airflow/models/dagrun.py", line 714, in _get_ready_tis
    expanded_tis, _ = schedulable.task.expand_mapped_task(self.run_id, session=session)
  File "/home/ash/code/airflow/airflow/airflow/models/mappedoperator.py", line 609, in expand_mapped_task
    operator.mul, self._resolve_map_lengths(run_id, session=session).values()
  File "/home/ash/code/airflow/airflow/airflow/models/mappedoperator.py", line 591, in _resolve_map_lengths
    expansion_kwargs = self._get_expansion_kwargs()
  File "/home/ash/code/airflow/airflow/airflow/models/mappedoperator.py", line 526, in _get_expansion_kwargs
    return getattr(self, self._expansion_kwargs_attr)
AttributeError: 'MappedOperator' object has no attribute 'mapped_op_kwargs'

Originally posted by @ashb in #21877 (comment)

ashb (Member, Author) commented May 20, 2022

FYI @pingzh

I can look at this later, right now I've got to write my talk for the Summit :)

pingzh (Contributor) commented May 20, 2022

> FYI @pingzh
>
> I can look at this later, right now I've got to write my talk for the Summit :)

Thanks @ashb. I can also take a look. Looking forward to your talk.

It is strange that the scheduler can schedule this task: it means dagrun.task_instance_scheduling_decisions succeeds when called on the deserialized DAG, yet the same call fails in _run_mini_scheduler_on_child_tasks.

pingzh (Contributor) commented May 20, 2022

Update:

It only happens for a new DagRun. If you clear an existing DagRun, it does not throw that error.

ashb (Member, Author) commented May 21, 2022

That might be because the task is already expanded?

I've not looked at this at all yet; I just noticed main was broken but (thankfully for my talk) 2.3 is fine.

potiuk (Member) commented May 21, 2022

> I've not looked at this at all - just noticed main was broken but (thankfully for my talk) 2.3 is fine

Your talk was the main reason we waited to merge it until we had branched off for 2.3 :).

@uranusjr uranusjr added kind:bug This is a clearly a bug area:Scheduler including HA (high availability) scheduler area:dynamic-task-mapping AIP-42 labels May 23, 2022
pingzh (Contributor) commented Jun 1, 2022

@ashb do you mind taking a look at this? thanks

@ashb ashb added this to the Airflow 2.4.0 milestone Jun 17, 2022
ashb (Member, Author) commented Jun 17, 2022

@pingzh Yeah, I or @uranusjr will take a look before we start closing out 2.4

uranusjr (Member) commented

The most immediate problem is we’re trying to expand a deserialised mapped operator. This does not work; only a “real” mapped operator can be properly expanded. We should fix this, and figure out a way to prevent this from happening again.
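The failure mode described above can be illustrated with a minimal sketch. These are hypothetical, stripped-down classes (not Airflow's real implementation): the idea is that the *name* of the attribute that expansion looks up survives serialisation, but the attribute itself only exists on the concrete decorated subclass, so a plain deserialised mapped operator hits the AttributeError shown in the traceback.

```python
# Hypothetical sketch of the expansion lookup; class names mirror the
# traceback but the bodies are simplified for illustration only.

class MappedOperator:
    # The attribute *name* used for the lookup is a class-level constant,
    # so it is still present after a serialise/deserialise round trip...
    _expansion_kwargs_attr = "mapped_op_kwargs"

    def _get_expansion_kwargs(self):
        return getattr(self, self._expansion_kwargs_attr)


class DecoratedMappedOperator(MappedOperator):
    # ...but the attribute itself is only set on the "real" subclass
    # created by the @task decorator.
    def __init__(self, kwargs):
        self.mapped_op_kwargs = kwargs


real = DecoratedMappedOperator({"key": ["a.csv", "b.csv"]})
print(real._get_expansion_kwargs())  # works: {'key': ['a.csv', 'b.csv']}

deserialised = MappedOperator()  # the round trip loses the subclass
try:
    deserialised._get_expansion_kwargs()
except AttributeError as exc:
    # 'MappedOperator' object has no attribute 'mapped_op_kwargs'
    print(exc)
```

Under this reading, the mini scheduler fails because it runs scheduling decisions against the deserialised DAG, where only the base class exists.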

4 participants