GPU device watermark metrics #11457

zpuller · 2024-09-10T18:09:27Z

This PR builds on NVIDIA/spark-rapids-jni#2392 to add a metric to spark to capture the highest amount of device memory we have allocated over the lifespan of a task (as a Spark accumulator).

I have a corresponding change prepared for https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html#metrics

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller · 2024-09-24T17:07:58Z

The JNI PR was just merged so I think this PR won't build until the next SNAPSHOT jar is produced for the JNI

revans2

Just a few nits around naming.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala

Signed-off-by: Zach Puller <zpuller@nvidia.com>

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala

Signed-off-by: Zach Puller <zpuller@nvidia.com>

gerashegalov

LGTM

jihoonson · 2024-09-25T17:15:43Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala

+  }
+
+  override def add(v: jl.Long): Unit = {
+    _value += v


nit: Although I understand this function will be called once on task completion, I wonder if it would be more straightforward to understand if this were _value = _value.max(v). Is HighWatermarkAccumulator not just a max accumulator?

Yeah, I don't have a strong opinion about this. I think, yes, it is a max accumulator. Do others have thoughts on this?

/** * Takes the inputs and accumulates. */ def add(v: IN): Unit

The above snippet shows the doc of AccumulatorV2.add(). Based on that, it seems reasonable to compute a max in this function rather than a sum.

I think my comment is a nitpick though, and do not want to block this PR unnecessarily as long as it is clear in the code that this function must be called only once on task completion.

My only 2c for keeping it as is that we can probably consolidate such metrics into a single generic implementation potentially.

I agree with the sentiment that this is not meant to be an additive metric. I guess I am more leaning towards add setting the value, and throwing if it already had set it. E.g. we don't have a case where we want increment, and we don't think max makes sense within this scope (only at merge time). I'd throw to be sure we don't shoot ourselves in the foot.

Yes the API is not ideal, but it is what we have to work with. Also if you think of it as a collection of measurements, then add is appending a new measurement instead of adding a number to a single value. The fact that we reduce it to a single value to represent the collection is separate.

But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.

That's a good point @revans2 on the exception being a bad idea.

I am back at what does add mean here. I think it means what @jihoonson suggested, it's a max, that's how you add it because the input is not meant to be additive, it is meant to be the max, always.

Ok I'm going to leave it as is for now for expediency given that I have the approvals and I'm still not totally sure I want to change it, but definitely open to it in a follow up PR. I will be continuing work on related metrics so I'll have my eye on it.

But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.

Throwing an exception sounds reasonable to me because we will be able to catch such cases if Spark ever uses it in a different way than we expect. If we are not 100% sure whether Spark does or not in all cases, do we need the testing suggested above anyway? If add() is ever called more than once, computing a sum instead of a max in it will likely mess up the metric.

That is why I approved the max version. Not the sum version. These metrics can get to be kind of complicated because they are produced on the worker and then sent back to the driver. On the driver they are written out ot the log, but then also something has to put them into the UI. I don't think that they use the accumulator for that, and that is where my questions are. I just have not looked at it enough. A max of a max is not a big deal, in reporting what we want to see if it does happen to be called multiple times. But doing a sum of a max might cause issues. They would likely be minor issues but all the same. If you want to learn what the code is doing then lets read it and do some experiments. If we want to ship it to production, then lets be defensive with the code that we ship.

zpuller · 2024-09-26T16:28:24Z

build

sameerz marked this pull request as draft September 11, 2024 00:05

zpuller mentioned this pull request Sep 17, 2024

Device max memory watermark tracking NVIDIA/spark-rapids-jni#2392

Merged

add max memory watermark metric

b6a9fd7

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller force-pushed the device_metrics branch from 4697e40 to b6a9fd7 Compare September 24, 2024 16:38

zpuller marked this pull request as ready for review September 24, 2024 16:38

zpuller changed the title ~~[DRAFT] Device metrics~~ GPU device watermark metrics Sep 24, 2024

zpuller requested a review from revans2 September 24, 2024 17:08

revans2 previously approved these changes Sep 24, 2024

View reviewed changes

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala Outdated Show resolved Hide resolved

pr comments

ebb0803

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller dismissed revans2’s stale review via ebb0803 September 24, 2024 17:32

revans2 previously approved these changes Sep 24, 2024

View reviewed changes

gerashegalov reviewed Sep 24, 2024

View reviewed changes

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala Show resolved Hide resolved

add comment about the accumulator usage

4e81016

Signed-off-by: Zach Puller <zpuller@nvidia.com>

zpuller dismissed revans2’s stale review via 4e81016 September 24, 2024 17:41

formatting

2a27b4c

Signed-off-by: Zach Puller <zpuller@nvidia.com>

revans2 approved these changes Sep 24, 2024

View reviewed changes

abellina approved these changes Sep 24, 2024

View reviewed changes

gerashegalov approved these changes Sep 24, 2024

View reviewed changes

jihoonson reviewed Sep 25, 2024

View reviewed changes

zpuller merged commit c047707 into NVIDIA:branch-24.10 Sep 26, 2024
45 checks passed

sameerz added the task Work required that improves the product but is not user facing label Oct 1, 2024

zpuller deleted the device_metrics branch October 4, 2024 17:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU device watermark metrics #11457

GPU device watermark metrics #11457

zpuller commented Sep 10, 2024 •

edited

Loading

zpuller commented Sep 24, 2024

revans2 left a comment

gerashegalov left a comment

jihoonson Sep 25, 2024

zpuller Sep 25, 2024

jihoonson Sep 25, 2024 •

edited

Loading

gerashegalov Sep 25, 2024

abellina Sep 25, 2024

revans2 Sep 26, 2024

abellina Sep 26, 2024

zpuller Sep 26, 2024

jihoonson Sep 26, 2024

revans2 Sep 26, 2024

zpuller commented Sep 26, 2024

GPU device watermark metrics #11457

GPU device watermark metrics #11457

Conversation

zpuller commented Sep 10, 2024 • edited Loading

zpuller commented Sep 24, 2024

revans2 left a comment

Choose a reason for hiding this comment

gerashegalov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jihoonson Sep 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zpuller commented Sep 26, 2024

zpuller commented Sep 10, 2024 •

edited

Loading

jihoonson Sep 25, 2024 •

edited

Loading