TPC-H: Dask vs PySpark, and PySpark, Polars and DuckDB single node #1044

Merged (60 commits) on Oct 13, 2023

Conversation

@milesgranger (Contributor) commented Oct 3, 2023

Merged #971 and #1027 together and added tests comparing each framework per TPC-H query; for now, the first 7 queries as implemented in #971.

First run: https://cloud.coiled.io/clusters/284888/information?account=dask-engineering&tab=Metrics
Note that dask runs first, then pyspark for the same query.
One can pretty easily determine start/stop of each based on tasks being present in the first chart.

Initial impression is PySpark spills to disk less / uses less memory, and Dask is generally faster.

Dashboard artifact:
https://github.com/coiled/benchmarks/suites/16815018862/artifacts/960398346

@mrocklin (Member) commented Oct 3, 2023

Initial impression is PySpark spills to disk less / uses less memory, and Dask is generally faster.

Surprising, but encouraging! I'm interested in digging in.

One can pretty easily determine start/stop of each based on tasks being present in the first chart.

I wonder if we can improve on this. I'm not very familiar with how A/B tests work, but if we had two branches with the same test names like test_tpch.py::test_query_1 but where one implementation was with Dask and the other was with Spark, then would the normal A/B test charts work?

Right now looking at the charts I think I'm seeing that Dask is a lot slower on more queries than Spark is. It's a little hard to tell though (I'm looking at the x-axis range on charts like these). Is that the right way to interpret these?

[Screenshot 2023-10-03 at 7:21:32 AM: benchmark duration charts]

@milesgranger (Contributor, Author) commented Oct 3, 2023

I think I'm seeing that Dask is a lot slower on more queries than Spark is.

Ah, you are indeed correct. From the 8- and 20-node clusters I found that Dask was faster on 3 of the 7 queries (1, 2, and 6).

@mrocklin (Member) commented Oct 3, 2023

Also, while I'm asking for things: I'll also be very curious how this changes at different scale levels. Common wisdom today says "Dask is ok at medium scales, but for large scale you really have to use Spark". I'm curious how true this is today.

@milesgranger force-pushed the milesgranger/tpch-benchmarks branch from 51ba62a to c0db646 on October 4, 2023 at 07:32
@milesgranger force-pushed the milesgranger/tpch-benchmarks branch from e5ae393 to f308a10 on October 13, 2023 at 12:51
@phofl (Contributor) commented Oct 13, 2023

There might be value here in avoiding future work.

I was looking at this from the other direction. We will have to update them when we eventually want to use them (and, more importantly, we have to actually remember this when adding new queries), so it's better to copy them as needed.

@mrocklin (Member)

I don't have a strong preference

@milesgranger (Contributor, Author)

We will have to update them when we eventually want to use them (and more importantly we have to actually remember this when adding new queries), so better to copy as needed

Just remove the pytest.mark.skip? This was my original approach: implement all the queries, then limit to however many have been implemented in Dask. But here, 1) the work is already done, 2) we only need to remove the skip marker, and 3) maybe I'm missing something. :)
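
For illustration only (the query number and fixture names below are made up, not from this PR), un-skipping such a test later is a one-line change:

import pytest


# Hypothetical placeholder; deleting the skip marker is all that is needed
# once a Dask implementation of the query exists.
@pytest.mark.skip(reason="not yet implemented for Dask")
def test_query_8(client, dataset_path):
    ...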

@phofl (Contributor) commented Oct 13, 2023

These queries change regularly in the polars repo, so I wanted to avoid keeping stuff in sync that we are not using anyway.

@mrocklin (Member)

I think that Patrick is saying that by the time we implement these queries it's decently likely that the upstream implementations will also have changed, so we'll have to go back and look at them anyway.

If so, then I guess this depends on how much work there is to modify the upstream queries, and how frequently they change.

@mrocklin (Member)

I'm curious if anyone has thoughts on moving the entire tpch/ directory to root level, rather than having it in tests/benchmarks/tpch/. I think that this is actually two questions:

  1. Will this screw up any particular infrastructure that we have today, like benchmark plots?
  2. Do people agree with this aesthetic?

@milesgranger (Contributor, Author)

These queries change regularly in the polars repo, so I wanted to avoid keeping stuff in sync that we are not using anyway.

I don't understand; then just don't update them? They're being skipped anyway. Anyhow, it appears you feel strongly about this, so I'll remove them.

@phofl (Contributor) commented Oct 13, 2023

It's likely that folks would just add our implementation without looking at the upstream changes. I want to avoid the cognitive load here since copy-paste should be quick at a later point.

Do people agree with this aesthetic?

No preference one way or the other

@mrocklin (Member)

Anything stopping us from merging? I'm sensitive to having this PR stay around over the weekend.

@milesgranger (Contributor, Author)

Nothing from my end stopping a merge. 👍

@fjetter (Member) left a review comment

I have some suggestions for follow-up work, but I would like to not do those in this PR. There is one question regarding the configuration of the Dask TPC-H benchmarks and queuing that we should address. This PR should not change any other benchmarks, IMO.

If that is a no-op, we're good to go from my POV

Comment on lines +73 to +80
    skip_benchmarks = pytest.mark.skip(reason="need --tpch-non-dask option to run")
    for item in items:
        if not config.getoption("--tpch-non-dask") and not (
            str(item.path).startswith(
                str(TEST_DIR / "benchmarks" / "tpch" / "test_dask")
            )
        ):
            item.add_marker(skip_benchmarks)
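
For reference, the --tpch-non-dask flag that this hook checks would be registered through pytest's pytest_addoption hook; the following is a minimal sketch, with help text and defaults that are assumptions rather than copies of the PR's conftest.py:

def pytest_addoption(parser):
    # Opt-in flag: when it is not passed, only the Dask TPC-H tests run and
    # everything else receives the skip marker added in the loop above.
    parser.addoption(
        "--tpch-non-dask",
        action="store_true",
        default=False,
        help="Also run the non-Dask (PySpark/Polars/DuckDB) TPC-H benchmarks",
    )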
Review comment (Member):

Please open a follow-up ticket for this. We don't need it for this PR, but I want this to run somewhat regularly (every commit, once a day, etc.).

Comment on lines +107 to +136
        def _():
            spark = get_or_create_spark(f"query{module.__name__}")

            # scale1000 stored as timestamp[ns] which spark parquet
            # can't use natively.
            if ENABLED_DATASET == "scale 1000":
                module.query = fix_timestamp_ns_columns(module.query)

            module.setup(spark)  # read spark tables query will select from
            if hasattr(module, "ddl"):
                spark.sql(module.ddl)  # might create temp view
            q_final = spark.sql(module.query)  # benchmark query
            try:
                # trigger materialization of df
                return q_final.toJSON().collect()
            finally:
                spark.catalog.clearCache()
                spark.sparkContext.stop()
                spark.stop()

        return await asyncio.to_thread(_)

    if not is_local:
        rows = tpch_pyspark_client.run_on_scheduler(_run_tpch)
    else:
        rows = asyncio.run(_run_tpch(None))  # running locally
    print(f"Received {len(rows)} rows")


class SparkMaster(SchedulerPlugin):
Review comment (Member):

I would also suggest as a follow-up to

  • Move the plugins into a spark.py submodule or something like this. That makes it easier to import if one runs a couple of ad-hoc tests, at least until we have this in coiled.
  • I would refactor the run_tpch_pyspark function slightly to allow it to be used as run_spark_query(dask_client, query), for similar reasons (see the sketch below).
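
A rough sketch of what that run_spark_query refactor could look like, reusing pieces already in this PR (get_or_create_spark, and running on the scheduler via run_on_scheduler); the exact signature and the optional setup callback are illustrative assumptions, not code from the PR:

import asyncio


def run_spark_query(dask_client, query, setup=None):
    """Hypothetical refactor of run_tpch_pyspark: run one Spark SQL query string
    on the Dask scheduler and return its rows as JSON."""

    async def _run(dask_scheduler):
        def _():
            spark = get_or_create_spark("adhoc-query")  # helper from this PR
            try:
                if setup is not None:
                    setup(spark)  # e.g. register the TPC-H tables / temp views
                return spark.sql(query).toJSON().collect()
            finally:
                spark.catalog.clearCache()
                spark.sparkContext.stop()
                spark.stop()

        return await asyncio.to_thread(_)

    return dask_client.run_on_scheduler(_run)

A test could then call something like run_spark_query(tpch_pyspark_client, module.query) without the per-module plumbing.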

Review comment (Member):

I think that for pyspark we merge it as-is, fix up spark outside of this repo, and then make this file like the others, where the query is just defined in the test.

Review comment (Member):

Future work regardless though.

@@ -526,8 +585,9 @@ def tpch_cluster(request, dask_env_variables, cluster_kwargs, github_cluster_tag
         **cluster_kwargs["tpch"],
     )
     dump_cluster_kwargs(kwargs, f"tpch.{module}")
-    with Cluster(**kwargs) as cluster:
-        yield cluster
+    with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}):
+        with Cluster(**kwargs) as cluster:
+            yield cluster
Review comment (Member):

I'm a bit confused. This should be on main already, shouldn't it?

Review comment (Member):

Ok, I think this is a rendering issue

@pytest.fixture(scope="module")
def tpch_cluster(request, dask_env_variables, cluster_kwargs, github_cluster_tags):
    module = os.path.basename(request.fspath).split(".")[0]
    module = module.replace("test_", "")
    kwargs = dict(
        name=f"{module}-{uuid.uuid4().hex[:8]}",
        environ=dask_env_variables,
        tags=github_cluster_tags,
        **cluster_kwargs["tpch"],
    )
    dump_cluster_kwargs(kwargs, f"tpch.{module}")
    # worker-saturation "inf" disables scheduler-side task queuing for these clusters
    with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}):
        with Cluster(**kwargs) as cluster:
            yield cluster

Comment on lines +74 to +83
def fix_timestamp_ns_columns(query):
    """
    scale100 stores l_shipdate/o_orderdate as timestamp[us]
    scale1000 stores l_shipdate/o_orderdate as timestamp[ns] which gives:
        Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
    so we set spark.sql.legacy.parquet.nanosAsLong and then convert to timestamp.
    """
    for name in ("l_shipdate", "o_orderdate"):
        query = re.sub(rf"\b{name}\b", f"to_timestamp(cast({name} as string))", query)
    return query
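
For illustration, assuming fix_timestamp_ns_columns from the excerpt above is in scope, here is what the rewrite does to a made-up query fragment:

query = "select l_shipdate from lineitem where l_shipdate <= date '1998-09-02'"
print(fix_timestamp_ns_columns(query))
# select to_timestamp(cast(l_shipdate as string)) from lineitem
#   where to_timestamp(cast(l_shipdate as string)) <= date '1998-09-02'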
Review comment (Member):

@milesgranger please also open an issue for this. I don't feel comfortable with this string casting when we're comparing ourselves to Spark. If that means we have to regenerate the dataset, that's unfortunate, but I wouldn't want us to bias the comparison against Spark too severely. At the very least, we should confirm on smaller / other data what the impact here is and whether it can be ignored.
