Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed interfering force of index. #25404

Merged
merged 1 commit into from
Aug 2, 2022

Conversation

dene14
Copy link
Contributor

@dene14 dene14 commented Jul 29, 2022

Forcing index in mysql dialect causing unnecessary complexity to process the query and adds a lot to the load when DAG executed frequently enough (hourly instead of daily) and having a lot of tasks in it (every additional task increment query time significantly).

Below a few explain plans collected on mysql 8.0.28.

  • daily execution, 4 tasks in it, forcing index is set, but plan optimizer decides to use range, execution time: 1.5s
    image
  • daily execution, 4 tasks in it, forcing index is NOT set, plan optimizer decides to use range, execution time: 1.5s
    image

Summary: when range fits in range_optimizer_max_mem_size, optimizer decides to use range scan instead of the optimization provided by hint and request executed fast enough (less than one second in both cases)

  • hourly execution, 96 tasks in it, forcing index is set, plan optimizer follows the hint, execution time: 15s
    image
  • hourly execution, 96 tasks in it, forcing index is NOT set, plan optimizer does a better job, execution time: 8s
    image

Summary: when range DOESN'T fit in range_optimizer_max_mem_size, optimizer follows the hint which produces much more expensive plan than w/o this hint.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jul 29, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 29, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@uranusjr
Copy link
Member

uranusjr commented Aug 1, 2022

Hmm, this line was actually introduced in #2021 to fix a performance issue in the first place. Unfortunately context is sparse on things this old, but I wonder what changed between then and now that the “normal” case is not slower now—we did change TaskInstance’s primary key in 2.2 to use run_id instead of execution_date, and also added an explicit foreign key to DagRun (which I think is relevant because this query joins that table). Do you think any of these would explain the difference?

@dene14
Copy link
Contributor Author

dene14 commented Aug 1, 2022

@uranusjr very could be...

also, I've just opened #25448 with a bit more details, this adjustment should produce much bigger impact on performance.

@potiuk
Copy link
Member

potiuk commented Aug 2, 2022

Yeah. I am all for removing such hints, especially that they really work in specific cases and specific problems and specific DB structure which seems to be already way changed since the time it was needed.

@potiuk potiuk merged commit 6778503 into apache:main Aug 2, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Aug 2, 2022

Awesome work, congrats on your first merged pull request!

@dene14 dene14 deleted the interfering_index_forcing branch August 2, 2022 15:10
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Aug 15, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.3.4 milestone Aug 15, 2022
ephraimbuddy pushed a commit that referenced this pull request Aug 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants