
Deadlock: concurrent cloned spark sessions #47

Closed
wvwwvwwv opened this issue Jun 11, 2024 · 4 comments · Fixed by #48

Comments


wvwwvwwv commented Jun 11, 2024

Hi, thanks a lot for this nice crate!

I'd like to report a deadlock issue when a spark session is cloned and used concurrently.

#46 demonstrates a possible workflow leading to a deadlock.

The gist is that every place where #[allow(clippy::await_holding_lock)] is used can lead to a deadlock when a spark session is cloned and used concurrently.

-> When a task is suspended while holding a lock, another task will wait for the lock to be released without yielding to the executor.
-> This is a very common "dangerous" asynchronous programming pattern that should always be avoided at all costs (see the sketch below).
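
To make this concrete, here is a minimal sketch of the pattern (illustrative code only, not the crate's actual internals): a synchronous parking_lot guard is held across an await point, and two concurrent uses of the same lock never make progress.

```rust
use std::sync::Arc;
use parking_lot::RwLock;

// The guard from the synchronous RwLock is still held when the task is
// suspended at the `.await`. Another task that then calls `state.write()`
// blocks its thread instead of yielding, so the holder never resumes and
// the guard is never released.
async fn dangerous(state: Arc<RwLock<String>>) {
    let guard = state.write();   // synchronous acquire, no await
    do_some_io().await;          // task suspends while still holding `guard`
    println!("{}", *guard);      // guard is only dropped here
}

// Stand-in for an awaited gRPC call or similar.
async fn do_some_io() {
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(String::from("spark")));
    // The second future blocks the only thread on `state.write()` while the
    // first one is suspended holding the guard: this program hangs forever.
    tokio::join!(dangerous(state.clone()), dangerous(state));
}
```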

Therefore, I would suggest that we either remove the clone method from SparkSession or replace the RwLock with an asynchronous lock. What do you think?

sjrusso8 (Owner) commented

@wvwwvwwv thanks for this! I've been playing around with adjusting the internals of the client implementation.

I have been trying to switch to references where possible so you don't need to constantly clone the session. I'll change the existing RwLock from parking_lot to tokio::sync::RwLock and incorporate your test.
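
For reference, a minimal sketch (placeholder code, not the crate's actual internals) of the same flow with tokio::sync::RwLock: acquiring the lock is itself an await point, so a waiting task yields back to the executor instead of blocking the thread.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// With an async lock, a task that is waiting for the guard yields to the
// executor, so the current holder can resume and release it.
async fn safe(state: Arc<RwLock<String>>) {
    let guard = state.write().await;   // yields while waiting for the lock
    do_some_io().await;                // other tasks keep making progress
    println!("{}", *guard);
}

async fn do_some_io() {
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(String::from("spark")));
    // Both calls complete, even on a single-threaded runtime.
    tokio::join!(safe(state.clone()), safe(state));
}
```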


wvwwvwwv commented Jun 11, 2024

Thanks for fixing this so quickly!

One question that might not be related to this issue.

I noticed that SparkSession is boxed when passed to DataFrame, and I guess that's because you wanted to avoid a memcpy. But as far as I understand, DataFrame 'consumes' the SparkSession, so it won't be mem-copied; instead, the DataFrame will be constructed directly on top of the supplied SparkSession if the stack layout allows it (I didn't check this at the assembly level, so I'm not 100% sure). Apart from that, heap allocation is usually (unless certain conditions are met) quite expensive compared to a memcpy, so I think avoiding the use of Box would be more beneficial in general.

-> Don't get me wrong, I was just curious when I saw Arc<SparkConnectServiceClient> and Box<SparkSession> because the use of heap allocation was frowned upon in my previous company :-(

Therefore, in my humble opinion, it would be nice to compare the performance of memcpy vs Box. What's your opinion?
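
For what it's worth, a rough microbenchmark along those lines could look like the sketch below; BigSession is a stand-in sized roughly like SparkSession, not the real struct, and the numbers will of course depend on the allocator and optimizer.

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for a large, session-sized value (~1.4 KiB).
#[derive(Clone)]
struct BigSession {
    buffer: [u8; 1400],
}

fn main() {
    let proto = BigSession { buffer: [0u8; 1400] };
    const N: usize = 1_000_000;

    // Move the value around by value: roughly a memcpy per iteration.
    let start = Instant::now();
    for _ in 0..N {
        black_box(black_box(&proto).clone());
    }
    println!("by value: {:?}", start.elapsed());

    // Box the value: one heap allocation per iteration plus the copy.
    let start = Instant::now();
    for _ in 0..N {
        black_box(Box::new(black_box(&proto).clone()));
    }
    println!("boxed:    {:?}", start.elapsed());
}
```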

sjrusso8 (Owner) commented

Yeah, great question!

I initially modeled some of the components of this project after DataFusion. Specifically, I was looking at how they create a dataframe from their concept of a SessionContext. That's where I saw they were using Box<> as a way to help reduce the size of allocations. The change to Box was made in DataFusion PR #10033.

Doing a similar size_of test to see what the total allocation of a DataFrame comes out to gives something like this:

With Box<SparkSession>:

  • size_of::<DataFrame>: 232

With SparkSession:

  • size_of::<DataFrame>: 1424
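
As a rough illustration of why the boxed version is so much smaller (stand-in types, not the real DataFrame/SparkSession definitions): an inline field contributes its full size to the containing struct, while a boxed field contributes only a pointer.

```rust
use std::mem::size_of;

// Stand-in for a large session struct.
#[allow(dead_code)]
struct BigSession {
    buffer: [u8; 1400],
}

// Session stored inline: its full 1400 bytes count toward the struct size.
#[allow(dead_code)]
struct DataFrameInline {
    session: BigSession,
    plan: u64,
}

// Session stored behind a Box: only a pointer is stored inline.
#[allow(dead_code)]
struct DataFrameBoxed {
    session: Box<BigSession>,
    plan: u64,
}

fn main() {
    println!("inline: {}", size_of::<DataFrameInline>()); // ~1408
    println!("boxed:  {}", size_of::<DataFrameBoxed>());  // 16 on 64-bit
}
```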

This project isn't using nearly the same level of futures allocations, so I wasn't running into the stack overflow errors DataFusion was hitting before they switched. I was just taking some pointers from their lessons learned.

I am working on adjusting some aspects of the DataFrame so it doesn't take ownership of the SparkSession when it is created. I think users would expect to use the SparkSession multiple times in one app without needing to call .clone() every time they want to use it again. The DataFusion SessionContext uses &self for nearly all its methods and can be used many times. I originally used clone() as a way to just get things working, but I have started to rework some of the internals now.
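
As a loose sketch of that direction (illustrative types only, not the actual spark-connect-rs API), a dataframe handle that borrows the session lets one session be reused without cloning:

```rust
// Illustrative only: the real SparkSession/DataFrame carry clients, plans, etc.
struct SparkSession {
    // client, config, ...
}

#[allow(dead_code)]
struct DataFrame<'a> {
    session: &'a SparkSession,
    plan: String,
}

impl SparkSession {
    // Takes &self, so the caller keeps ownership of the session.
    fn sql<'a>(&'a self, query: &str) -> DataFrame<'a> {
        DataFrame { session: self, plan: query.to_owned() }
    }
}

fn main() {
    let spark = SparkSession {};
    // The same session builds multiple dataframes without any .clone().
    let _df1 = spark.sql("select 1 as id");
    let _df2 = spark.sql("select 2 as id");
}
```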

wvwwvwwv (Author) commented

Thanks for the answer! SparkSession being 1424 bytes (oh...) totally justifies the heap allocation.

And I also think that using &self makes more sense, given that a single DataFrame or SparkSession is often used multiple times to create different DataFrames.
