avoid calling UDF when the row being updated to is the same as existing row #12201
Comments
Do we have the assumption that Does it have anything to do with the other expressions in the same |
Calling other services in a UDF has a significant impact on performance. @jon-chuang once added a feature that parallelizes these blocking function calls using a thread pool. It can be enabled through the `io_threads` argument of `@udf`:

```python
import time

from risingwave.udf import udf

@udf(input_types=["INT"], result_type="INT", io_threads=32)
def wait_concurrent(x):
    time.sleep(0.01)  # stands in for a blocking external call
    return 0
```

This could help reduce the latency introduced by external calls. As for the number of calls, if necessary, we can provide a batch evaluation API, so that users receive a batch of inputs at a time and can call external services in batch style. |
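A minimal sketch of what batch-style evaluation could look like from the user's side. This is not an existing RisingWave API: `eval_batch` and `call_service_batch` are hypothetical names, and the sketch assumes the external service accepts a list of inputs and returns results in the same order. Deduplicating inside the batch is one way to cut the number of billed calls.

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def eval_batch(
    inputs: Sequence[K],
    call_service_batch: Callable[[List[K]], List[V]],  # hypothetical service client
) -> List[V]:
    """Evaluate a whole batch with one service call over the distinct inputs."""
    # dict.fromkeys keeps the first-seen order while dropping duplicates.
    distinct = list(dict.fromkeys(inputs))
    results = call_service_batch(distinct) if distinct else []
    lookup = dict(zip(distinct, results))
    # Fan the results back out to the original row order.
    return [lookup[x] for x in inputs]
```

Note that this only makes sense if the function is deterministic within a batch, which ties into the purity question raised later in the thread.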
Thanks, I see. That solution can work for calling external services. For this particular user, computation overhead is one of the primary reasons. If that is the case, do we have no other way to avoid calling the UDF when the data is the same? I wonder whether compacting stream chunks before calling the UDF is enough to solve that, or whether it turns out to be a quite difficult problem. |
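To make the question concrete, here is a rough sketch of dropping no-op updates before they reach the UDF. It is an illustration only, not RisingWave's implementation: the chunk is modeled as a plain list of `(op, values)` pairs, the op strings `"+"`, `"-"`, `"U-"`, `"U+"` and the name `drop_noop_updates` are assumptions, and only adjacent `U-`/`U+` pairs are considered.

```python
from typing import List, Tuple

# (op, values); op is one of "+", "-", "U-", "U+" (assumed encoding)
Row = Tuple[str, tuple]


def drop_noop_updates(chunk: List[Row]) -> List[Row]:
    """Remove adjacent U-/U+ pairs whose old and new rows are identical."""
    out: List[Row] = []
    i = 0
    while i < len(chunk):
        op, values = chunk[i]
        is_pair = op == "U-" and i + 1 < len(chunk) and chunk[i + 1][0] == "U+"
        if is_pair:
            if chunk[i + 1][1] != values:
                # A real update: keep both halves of the pair.
                out.extend(chunk[i : i + 2])
            # Otherwise the row is updated to itself, so both halves are dropped.
            i += 2
        else:
            out.append(chunk[i])
            i += 1
    return out
```

Rows removed this way never reach the UDF, so neither the old nor the new value is evaluated for them.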
Oh, you reminded me: we don't actually compact the chunk before sending it to the UDF. That is, if a chunk has capacity 1024 but cardinality 1, all 1024 rows are sent instead of the one valid row. I'm not sure whether most input chunks are high-cardinality or low-cardinality in the real world, but it's worth fixing anyway. I'll create an issue and fix it soon. |
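The capacity-versus-cardinality point can be sketched as follows. This is a simplified model, not the actual executor code: a chunk column is a plain list, the visibility bitmap is a list of booleans, and `eval_visible_only` is a hypothetical helper. Only visible rows are passed to the UDF, and the results are scattered back to their original positions.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def eval_visible_only(
    column: Sequence[T],
    visibility: Sequence[bool],
    udf: Callable[[T], R],
) -> List[Optional[R]]:
    """Call `udf` only on visible rows; invisible rows get None."""
    # Compact: keep the positions of the rows that are actually valid.
    visible_idx = [i for i, visible in enumerate(visibility) if visible]
    results = [udf(column[i]) for i in visible_idx]
    # Scatter the results back so the output has the chunk's full capacity.
    out: List[Optional[R]] = [None] * len(column)
    for i, r in zip(visible_idx, results):
        out[i] = r
    return out
```

With a chunk of capacity 1024 and a single visible row, the UDF is invoked once rather than 1024 times.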
related #11070 |
We have three levels of optimization to reduce the rows processed by the UDF
|
As long as a UDF or even multiple UDFs are used in a
Is it a knob that users need to turn on/off when
Is it completely managed inside users' UDF code? I guess I just failed to understand why
is not taken due to what limitation. |
After some consideration, I prefer to treat this as a brand-new issue. Sorry for bringing up #11070... The cost of a UDF call is much, much higher than that of a built-in function. I think we can do more specific optimizations here:
|
@wangrunji0408 Can you please take this? |
is going to cover the original intention of the issue and does even more |
Are we actually assuming the UDF to be pure in this way? |
Even for non-pure (aka. non-deterministic) functions, this still makes sense to me 🤔 From my understanding, most non-deterministic functions are not intended to be non-deterministic like |
Noted down the conclusion of the huddle just now: Will do 1 & 2, but not 3. |
Random idea: Should we introduce a new streaming operation |
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned. |
IIUC, if we introduce something like a
This actually reminds me of the design of general OverWindow, in which we already achieved (1), and we plan to do the optimization (2) (done by #19056). |
Users report that their UDF is particularly expensive, e.g. in terms of computation overhead, or because it calls other services that charge by the number of calls.
It is an MV in the end, but I also wonder whether the solution changes when it is a sink in the end.
I'd prefer an idea of a viable solution first. The time of implementation can be discussed later.