Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update TPCH #1094

Merged
merged 18 commits into from
Oct 18, 2023
Merged

Update TPCH #1094

merged 18 commits into from
Oct 18, 2023

Conversation

mrocklin
Copy link
Member

This moves the TPC-H benchmarks outside of the tests/benchmarks directory, and treats it as more of a standalone system.

Then, this also rewrites some of the relevant conftest.py file in a way that, I think, makes it simpler to manipulate (that might be subjective though).

Some notable changes:

  1. There are local and scale fixtures that control datasets and should eventually control clusters that are used
  2. pyspark uses the dask infrastructure (it's exactly the same)
  3. Cleaned up some very old config settings in cluster_kwargs.yaml

However, I suspect that this approach screws with existing tests because they import things from conftest, which is now at root level. Things will still need to be moved around to make this work. Mostly right now I'm looking for if this makes sense to do, or if there are major reasons why not to do it.

To give more motivation, if TPC-H benchmarks work, people will want to come here and look at them. If they're highly intertwined with the existing benchmark machinery then I think that they will be hard to understand and run. I would like to make the tpc-h benchmarks more accessible to people who are unfamiliar with this repository.

@mrocklin
Copy link
Member Author

@milesgranger I put the Spark stuff in here (sorry for mixing PRs)

Things work locally, but I'm getting the following when running remotely:

E                   Caused by: com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))

If you're still around and have some time can I ask you to try running?

@mrocklin
Copy link
Member Author

Should be just

py.test --benchmark tpch/test_pyspark.py

@mrocklin
Copy link
Member Author

You'll need to pip install -e your local coiled/python-api directory (although maybe this is what's broken)

@mrocklin
Copy link
Member Author

Oh, I'm maybe dumb. One sec.

@mrocklin
Copy link
Member Author

Yeah, I was dumb. Still not working, but that particular issue is gone.

@mrocklin
Copy link
Member Author

OK, things are working. I'll post results in the notebook PR shortly.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@mrocklin
Copy link
Member Author

To run benchmarks today (@phofl was curious)

  1. Clone this repo, install as indicated in the README
  2. Also install the following pyspark grpcio grpcio-status openjdk protobuf
  3. py.test --benchmark tpch or alternatively just a sample py.test --benchmark tpch/test_dask.py
  4. Run through the tpch/visualize.ipynb notebook to render things. Note that this takes the most recent entries from each of dask/spark/polars/duckdb.

@mrocklin
Copy link
Member Author

Oh, and for Spark you'll need https://github.com/coiled/platform/pull/3530

@mrocklin mrocklin changed the title WIP - Pull TPC-H out of tests/benchmarks Update TPCH Oct 17, 2023
@mrocklin
Copy link
Member Author

OK, I did enough TPCH work here that I decided to unwind the larger structural changes of this PR and save that for another day (tomorrow?)

Changes in this PR now include:

  • A bunch of stuff to TPCH that I think are good, and that I think people are likely happy to trust me on (but happy for review!)
  • Modernization of cluster_kwrags.yaml . Mostly using newer keywords
  • Fixing a couple deprecation warnings

I think that this is ready for review. I'm hopeful that, if people are fine ignoring TPCH stuff, that it's easy to get in.

@mrocklin
Copy link
Member Author

@fjetter I would like to get this in soon. It is a large change to TPCH and likely to contain conflicts. Should I just merge? I suspect that we'll need migrations. Can I ask you or someone on your team to help review quickly?

Copy link
Contributor

@milesgranger milesgranger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, for what it's worth. 👍
Maybe some others want to look? (@crusaderky @phofl)

@mrocklin
Copy link
Member Author

Cool. I'm mostly curious if there is something that we have to do with database migrations, given that I've moved some of the tests around.

@milesgranger
Copy link
Contributor

Cool. I'm mostly curious if there is something that we have to do with database migrations, given that I've moved some of the tests around.

My understanding is that can be a follow up, I can try to figure that one out.

@fjetter fjetter merged commit 8f1072b into main Oct 18, 2023
15 checks passed
@fjetter
Copy link
Member

fjetter commented Oct 18, 2023

@milesgranger I trust you to follow up with the migrations. Let me know if you need help. #1099 (comment) suggests a starting point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants