Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing data races in subscriptions #880

Merged

Conversation

aivcec
Copy link
Contributor

@aivcec aivcec commented Nov 3, 2019

Current implementation of WebSocketTransport is not thread-safe so I took the steps to identify shared resources and possible race condition scenarios. A lot of this was done using ideas from the past pr.

I wrote a couple of tests that help illustrate the problems. If you run them without ThreadSanitizer, sometimes they will pass with a success, but there is a chance that you will get an EXC_BAD_ACCESS error due to race conditions. In order to identify the issues check the ThreadSanitizer in the ApolloWebSocket/Test phase and run the tests that way.

Also, make sure to only run the tests in the StarWarsSubscriptionTests file, as the StarWarsWebSocketTests uses InMemoryNormalizedCache which causes TS problems on its own. I did not address those issues in this PR, this PR is only about making WebSocketTransport thread-safe.

Its important to note that transport methods can be called from two sources. Transport methods can be called from the outside/client (A) who can add subscribers, cancel them and close the connection. He can call the transport from any thread/queue he wants. The underlying socket (Starscream), however, only calls the transport from the specified callback queue (B) which defaults to Main. It is from that queue that all WebSocketDelegate methods are called from.

I will go through each set of shared resources and how I fixed the data races:

1) sequenceNumber variable

This variable is used to enumerate all subscribers. The data race arises if two subscribers are concurrently being added to the transport (both try to fetch and increment the counter).

Solution: I added a simple AtomicCounter class that NSLocks the current thread upon incrementing this value.

2) Subscription arrays (subscribers and subscriptions)

These arrays are being accessed/modified when:
A) A new subscriber is being added, disconnected or the client is closing connection
B) The underlying socket sends a new message, an error or unsubscribes a subscriber

Solution: I think it is a bad idea to process starscream responses on the main queue so I added a separate serial processing queue to handle those tasks. This queue is set as a callbackQueue on the Websocket so all B) calls are done serially. To fix data races that stem from A) I made those perform on the same queue manually as well (sendHelper, unsubscribe, closeConnection).

3) Message queue array

This was fairly similar to the issue above and was solved by performing all write() calls on the callbackQueue as well. Some were made explicit, some are being done through the WebSocketDelegate/callbackQueue (B).

4) Various state variables (reconnect, error)

TSan was also reporting issues with concurrent access to these variables so I wrapped each of them with an NSLock (Atomic class). There are some other state variables (acked etc.) but they don't seem to be an issue atm as they are always called from the processing queue (from WebSocketDelegate methods).

As far as I can tell this fixes all the current concurrency issues in ApolloWebSocket, but I'm interested if you have other ideas for how to solve these issues. Some of these could be avoided by different architecture of the WebSocketTransport.

I also made some changes to MockWebSocket to support the addition of callback queue property.

@apollo-cla
Copy link

@aivcec: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/

@aivcec aivcec changed the title Fixing data races in subscriptions Fixing data races in supscriptions Nov 3, 2019
@aivcec aivcec changed the title Fixing data races in supscriptions Fixing data races in subscriptions Nov 3, 2019
Copy link
Contributor

@designatednerd designatednerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THANK YOU for looking into this, I know this has been a mess for a while. I have a bunch of questions, mostly around the testing stuff.

Tests/ApolloWebsocketTests/StarWarsSubscriptionTests.swift Outdated Show resolved Hide resolved
expectation.expectedFulfillmentCount = 2

let sub1 = client.subscribe(subscription: firstSubscription) { _ in }
let sub2 = client.subscribe(subscription: secondSubscription) { _ in }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a post of a review, then failing either of these if something comes through since they're both being cancelled immediately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that if you think it looks better, but I think it wouldn't really matter.

I feel like these tests only make sense if TSAN is turned on anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not particularly concerned about the way it looks, but it'd be nice to have a way to test these without TSAN turn on since right now we're not able to turn it on for the library as a whole (yet)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^still thinking on this - is this something where we should at least be able to say "These should never get called, even without TSAN on?'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see what you mean. Kinda misunderstood you before.

I will perform a mutation after calling cancel to confirm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I'm trying to do:

let firstSubscription = ReviewAddedSubscription(episode: .empire)
let secondSubscription = ReviewAddedSubscription(episode: .empire)
let expectation = self.expectation(description: "Subscriptions cancelled")
expectation.expectedFulfillmentCount = 2
    
let sub1 = client.subscribe(subscription: firstSubscription) { _ in
  XCTFail("Unexpected subscription response")
}
let sub2 = client.subscribe(subscription: secondSubscription) { _ in
  XCTFail("Unexpected subscription response")
}

let queue = DispatchQueue(label: "com.apollographql.testing", attributes: .concurrent)

queue.async {
  sub1.cancel()
  expectation.fulfill()
}
queue.async {
  sub2.cancel()
  expectation.fulfill()
}
queue.async(flags: .barrier) {
  _ = self.client.perform(mutation: CreateReviewForEpisodeMutation(episode: .empire, review: ReviewInput(stars: 5, commentary: "The greatest movie ever!")))
}

waitForExpectations(timeout: 10, handler: nil)

But what happens is that final mutation sometimes triggers a subscription response in another test - testMultipleSubscriptions. Its callbacks get called too many times and the whole test fails. Testing this way can be achieved if the test sleeps after waitForExpectations for a while, but I feel thats just messy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think waiting for expectations is totally reasonable - you also can do an inverted expectation where the test fails if it actually gets called - so instead of having the explicit XCTFails you can just have an invertedExpectation.fulfill(). You'd probably only need to wait that one 1-2 seconds, since the perform mutation should happen a lot faster than that.

Utilities/Atomic.swift Outdated Show resolved Hide resolved
Utilities/AtomicCounter.swift Outdated Show resolved Hide resolved
Utilities/AtomicCounter.swift Outdated Show resolved Hide resolved
Sources/ApolloWebSocket/ApolloWebSocket.swift Show resolved Hide resolved
}
}

extension Atomic where T == Int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!

Copy link
Contributor

@designatednerd designatednerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it. Ship it!

@designatednerd designatednerd added this to the 0.20.0 milestone Nov 11, 2019
@designatednerd
Copy link
Contributor

This will go out with 0.20.0, so keep an eye out for that release (hopefully later today)

@designatednerd designatednerd merged commit 08ae91a into apollographql:master Nov 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants