-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU device watermark metrics #11457
GPU device watermark metrics #11457
Conversation
Signed-off-by: Zach Puller <zpuller@nvidia.com>
4697e40
to
b6a9fd7
Compare
The JNI PR was just merged so I think this PR won't build until the next SNAPSHOT jar is produced for the JNI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits around naming.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Zach Puller <zpuller@nvidia.com>
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuTaskMetrics.scala
Show resolved
Hide resolved
Signed-off-by: Zach Puller <zpuller@nvidia.com>
Signed-off-by: Zach Puller <zpuller@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
|
||
override def add(v: jl.Long): Unit = { | ||
_value += v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Although I understand this function will be called once on task completion, I wonder if it would be more straightforward to understand if this were _value = _value.max(v)
. Is HighWatermarkAccumulator
not just a max accumulator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't have a strong opinion about this. I think, yes, it is a max accumulator. Do others have thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/**
* Takes the inputs and accumulates.
*/
def add(v: IN): Unit
The above snippet shows the doc of AccumulatorV2.add()
. Based on that, it seems reasonable to compute a max in this function rather than a sum.
I think my comment is a nitpick though, and do not want to block this PR unnecessarily as long as it is clear in the code that this function must be called only once on task completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only 2c for keeping it as is that we can probably consolidate such metrics into a single generic implementation potentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the sentiment that this is not meant to be an additive metric. I guess I am more leaning towards add
setting the value, and throwing if it already had set it. E.g. we don't have a case where we want increment, and we don't think max makes sense within this scope (only at merge time). I'd throw to be sure we don't shoot ourselves in the foot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the API is not ideal, but it is what we have to work with. Also if you think of it as a collection of measurements, then add
is appending a new measurement instead of adding a number to a single value. The fact that we reduce it to a single value to represent the collection is separate.
But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point @revans2 on the exception being a bad idea.
I am back at what does add
mean here. I think it means what @jihoonson suggested, it's a max, that's how you add it because the input is not meant to be additive, it is meant to be the max, always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'm going to leave it as is for now for expediency given that I have the approvals and I'm still not totally sure I want to change it, but definitely open to it in a follow up PR. I will be continuing work on related metrics so I'll have my eye on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But either way I am not a fan of throwing an exception if the value is already set. This makes assumptions about how the metric is going to be used by Spark that might not hold true in all cases. It has a reset API after all. If you really want to do it, then go ahead, but we need to do testing on all of our supported platforms to make sure it is correct.
Throwing an exception sounds reasonable to me because we will be able to catch such cases if Spark ever uses it in a different way than we expect. If we are not 100% sure whether Spark does or not in all cases, do we need the testing suggested above anyway? If add()
is ever called more than once, computing a sum instead of a max in it will likely mess up the metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is why I approved the max version. Not the sum version. These metrics can get to be kind of complicated because they are produced on the worker and then sent back to the driver. On the driver they are written out ot the log, but then also something has to put them into the UI. I don't think that they use the accumulator for that, and that is where my questions are. I just have not looked at it enough. A max of a max is not a big deal, in reporting what we want to see if it does happen to be called multiple times. But doing a sum of a max might cause issues. They would likely be minor issues but all the same. If you want to learn what the code is doing then lets read it and do some experiments. If we want to ship it to production, then lets be defensive with the code that we ship.
build |
This PR builds on NVIDIA/spark-rapids-jni#2392 to add a metric to spark to capture the highest amount of device memory we have allocated over the lifespan of a task (as a Spark accumulator).
I have a corresponding change prepared for https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html#metrics