
Deadlock: concurrent cloned spark sessions #47

Closed
wvwwvwwv opened this issue Jun 11, 2024 · 4 comments · Fixed by #48

Comments


wvwwvwwv commented Jun 11, 2024

Hi, thanks a lot for this nice crate!

I'd like to report a deadlock issue when a spark session is cloned and used concurrently.

#46 demonstrates a possible workflow leading to a deadlock.

The gist is that every place where #[allow(clippy::await_holding_lock)] is used can lead to a deadlock when a spark session is cloned and used concurrently.

-> When a task is suspended while holding a lock, another task will wait for the lock to be released without yielding to the executor.
-> This is a very common "dangerous" asynchronous programming pattern that should always be avoided at all costs (see the sketch below).
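
To make this concrete, here is a minimal sketch of the pattern (illustrative code only, not the crate's actual internals): a synchronous parking_lot guard is held across an await point, and two concurrent uses of the same lock never make progress.

```rust
use std::sync::Arc;
use parking_lot::RwLock;

// The guard from the synchronous RwLock is still held when the task is
// suspended at the `.await`. Another task that then calls `state.write()`
// blocks its thread instead of yielding, so the holder never resumes and
// the guard is never released.
async fn dangerous(state: Arc<RwLock<String>>) {
    let guard = state.write();   // synchronous acquire, no await
    do_some_io().await;          // task suspends while still holding `guard`
    println!("{}", *guard);      // guard is only dropped here
}

// Stand-in for an awaited gRPC call or similar.
async fn do_some_io() {
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(String::from("spark")));
    // The second future blocks the only thread on `state.write()` while the
    // first one is suspended holding the guard: this program hangs forever.
    tokio::join!(dangerous(state.clone()), dangerous(state));
}
```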

Therefore, I would suggest that we either remove the clone method from SparkSession or replace the RwLock with an asynchronous lock. What do you think?

sjrusso8 (Owner) commented

@wvwwvwwv thanks for this! I've been playing around with adjusting the internals of the client implementation.

I have been trying to switch to references where possible so you don't need to constantly clone the session. I'll change the existing RwLock from parking_lot to tokio::sync::RwLock and incorporate your test.
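
For reference, a minimal sketch (placeholder code, not the crate's actual internals) of the same flow with tokio::sync::RwLock: acquiring the lock is itself an await point, so a waiting task yields back to the executor instead of blocking the thread.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// With an async lock, a task that is waiting for the guard yields to the
// executor, so the current holder can resume and release it.
async fn safe(state: Arc<RwLock<String>>) {
    let guard = state.write().await;   // yields while waiting for the lock
    do_some_io().await;                // other tasks keep making progress
    println!("{}", *guard);
}

async fn do_some_io() {
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(String::from("spark")));
    // Both calls complete, even on a single-threaded runtime.
    tokio::join!(safe(state.clone()), safe(state));
}
```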


wvwwvwwv commented Jun 11, 2024

Thanks for fixing this so quickly!

One question that might not be related to this issue.

I noticed that SparkSession is boxed when passed to DataFrame, and I guess that's because you wanted to avoid a memcpy. But as far as I understand, DataFrame 'consumes' the SparkSession, so it won't be mem-copied; instead, the DataFrame will be constructed directly on top of the supplied SparkSession if the stack layout allows it (I didn't check this at the assembly level, so I'm not 100% sure). Apart from that, heap allocation is usually (unless certain conditions are met) quite expensive compared to a memcpy, so I think avoiding the use of Box would be more beneficial in general.

-> Don't get me wrong, I was just curious when I saw Arc<SparkConnectServiceClient> and Box<SparkSession> because the use of heap allocation was frowned upon in my previous company :-(

Therefore, in my humble opinion, it would be nice to compare the performance of memcpy vs Box. What's your opinion?
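
For what it's worth, a rough microbenchmark along those lines could look like the sketch below; BigSession is a stand-in sized roughly like SparkSession, not the real struct, and the numbers will of course depend on the allocator and optimizer.

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for a large, session-sized value (~1.4 KiB).
#[derive(Clone)]
struct BigSession {
    buffer: [u8; 1400],
}

fn main() {
    let proto = BigSession { buffer: [0u8; 1400] };
    const N: usize = 1_000_000;

    // Move the value around by value: roughly a memcpy per iteration.
    let start = Instant::now();
    for _ in 0..N {
        black_box(black_box(&proto).clone());
    }
    println!("by value: {:?}", start.elapsed());

    // Box the value: one heap allocation per iteration plus the copy.
    let start = Instant::now();
    for _ in 0..N {
        black_box(Box::new(black_box(&proto).clone()));
    }
    println!("boxed:    {:?}", start.elapsed());
}
```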

sjrusso8 (Owner) commented

Yeah, great question!

I initially modeled some of the components of this project after DataFusion. Specifically, I was looking at how they create a dataframe from their concept of a SessionContext. That's where I saw they were using Box<> as a way to help reduce the size of allocations. The change to Box was made in DataFusion PR #10033.

Doing a similar size_of test to see what the total allocation of a DataFrame comes out to gives something like this:

With Box<SparkSession>:

  • size_of::<DataFrame>: 232

With SparkSession:

  • size_of::<DataFrame>: 1424
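
As a rough illustration of why the boxed version is so much smaller (stand-in types, not the real DataFrame/SparkSession definitions): an inline field contributes its full size to the containing struct, while a boxed field contributes only a pointer.

```rust
use std::mem::size_of;

// Stand-in for a large session struct.
#[allow(dead_code)]
struct BigSession {
    buffer: [u8; 1400],
}

// Session stored inline: its full 1400 bytes count toward the struct size.
#[allow(dead_code)]
struct DataFrameInline {
    session: BigSession,
    plan: u64,
}

// Session stored behind a Box: only a pointer is stored inline.
#[allow(dead_code)]
struct DataFrameBoxed {
    session: Box<BigSession>,
    plan: u64,
}

fn main() {
    println!("inline: {}", size_of::<DataFrameInline>()); // ~1408
    println!("boxed:  {}", size_of::<DataFrameBoxed>());  // 16 on 64-bit
}
```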

This project isn't using nearly the same level of futures allocations, so I wasn't running into the stack overflow errors DataFusion was hitting before they switched. I was just taking some pointers from their lessons learned.

I am working on adjusting some aspects of the DataFrame so it doesn't take ownership of the SparkSession when it is created. I think users would expect to use the SparkSession multiple times in one app without needing to call .clone() every time they want to use it again. The DataFusion SessionContext uses &self for nearly all its methods and can be used many times. I originally used clone() as a way to just get things working, but I have started to rework some of the internals now.
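
As a loose sketch of that direction (illustrative types only, not the actual spark-connect-rs API), a dataframe handle that borrows the session lets one session be reused without cloning:

```rust
// Illustrative only: the real SparkSession/DataFrame carry clients, plans, etc.
struct SparkSession {
    // client, config, ...
}

#[allow(dead_code)]
struct DataFrame<'a> {
    session: &'a SparkSession,
    plan: String,
}

impl SparkSession {
    // Takes &self, so the caller keeps ownership of the session.
    fn sql<'a>(&'a self, query: &str) -> DataFrame<'a> {
        DataFrame { session: self, plan: query.to_owned() }
    }
}

fn main() {
    let spark = SparkSession {};
    // The same session builds multiple dataframes without any .clone().
    let _df1 = spark.sql("select 1 as id");
    let _df2 = spark.sql("select 2 as id");
}
```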

wvwwvwwv (Author) commented

Thanks for the answer! SparkSession being 1424 bytes (oh...) totally justifies the heap allocation.

And I also think that using &self makes more sense, given that a single DataFrame or SparkSession is often used multiple times to create different DataFrames.
