Deadlock: concurrent cloned spark sessions #47
@wvwwvwwv thanks for this! I've been playing around with adjusting the internals of the client implementation. I have been trying to switch to references where possible so you don't need to constantly clone the session. I'll change the existing
Thanks for fixing this so quickly! One question that might not be related to this issue. I noticed that SparkSession is boxed when passed to DataFrame, and I guess that's because you wanted to avoid a memcpy. But as far as I understand, DataFrame consumes SparkSession, so SparkSession won't be mem-copied; instead, DataFrame will be constructed directly on top of the supplied SparkSession if the current stack allows (I didn't check at the assembly level, so I'm not 100% sure). Apart from that, heap allocation is usually (unless certain conditions are met) quite expensive compared to a memcpy, so I think avoiding the use of
Don't get me wrong, I was just curious when I saw
Therefore, in my humble opinion, it would be nice to compare the performance of memcpy vs. Box. What's your opinion?
Yeah, great question! I initially modeled some of the components of this project after DataFusion. Specifically, I was looking at how they create a dataframe from their concept of the
Doing a similar
With
This project isn't using nearly the same level of futures allocations, so I wasn't running into the stack overflow errors DataFusion hit when they switched; I was just taking some pointers from their lessons learned. I am working on adjusting some aspects of the
Thanks for the answer! SparkSession being 1424 bytes (oh...) totally justifies heap allocation. And I also think that using
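The size trade-off above can be checked directly with std::mem::size_of. This is a minimal sketch, not the crate's actual types: BigSession is a hypothetical stand-in for SparkSession, padded to the 1424 bytes reported in this thread, so the numbers line up with the discussion.

```rust
use std::mem::size_of;

// Hypothetical stand-in for SparkSession; the thread reports it is 1424 bytes.
struct BigSession {
    _data: [u8; 1424],
}

// Holding the session by value: moving this struct memcpys all 1424 bytes.
struct ByValue {
    _session: BigSession,
}

// Holding the session behind a Box: moving this copies only a pointer,
// at the cost of one heap allocation when the Box is created.
struct Boxed {
    _session: Box<BigSession>,
}

fn main() {
    assert_eq!(size_of::<ByValue>(), 1424);
    assert_eq!(size_of::<Boxed>(), size_of::<usize>());
    println!(
        "ByValue: {} bytes, Boxed: {} bytes",
        size_of::<ByValue>(),
        size_of::<Boxed>()
    );
}
```

Whether the pointer-sized move outweighs the allocation cost depends on how often the value is moved versus how often it is constructed, which is exactly the benchmark being proposed here.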
Hi, thanks a lot for this nice crate!
I'd like to report a deadlock issue when a spark session is cloned and used concurrently.
#46 demonstrates a possible workflow leading to a deadlock.
The gist is, every place where
#[allow(clippy::await_holding_lock)]
is used poses a risk of deadlock when a spark session is cloned and used concurrently: when a task is suspended while holding a lock, another task will wait for the lock to be released without yielding to the executor.
This is a very common, dangerous asynchronous programming pattern that should be avoided at all costs.
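The mechanism behind the deadlock can be illustrated with std alone. This sketch simulates the two tasks synchronously (no real executor is spun up): task A holds a std::sync::RwLock write guard at the point where it would be suspended by .await, and try_write shows that task B, scheduled on the same executor thread, could never acquire the lock, so blocking on it would hang both tasks.

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);

    // Simulates task A: it takes the write lock and is then suspended at an
    // .await point *without* releasing the guard.
    let guard = lock.write().unwrap();

    // Simulates task B on the same executor thread: the lock is unavailable,
    // and since task A can only resume on this very thread, a blocking wait
    // here would never return - that is the deadlock.
    assert!(lock.try_write().is_err());

    // If task A drops the guard before awaiting, task B can proceed.
    drop(guard);
    assert!(lock.try_write().is_ok());
}
```

An asynchronous lock (e.g. tokio's) avoids this because a task waiting on the lock yields back to the executor, letting the guard holder resume and release it.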
Therefore, I would suggest that we either remove the clone method from SparkSession or replace the RwLock with an asynchronous lock. What do you think of this?