C++ Client Crashes on ClientConductor::onInterServiceTimeout #371

Closed
goglusid opened this issue Jul 3, 2017 · 14 comments

goglusid commented Jul 3, 2017

When the following stack of functions is executed while the C++ client still holds pointers into the log buffers, the client crashes.

Here is how it can happen:

Thread #1: calls Publication::tryClaim
Thread #1: uses the BufferClaim...
Thread #2 [ConductorThread]: detects a timeout and executes the stack below.
Thread #1: calls BufferClaim::commit()

Obviously, I'm debugging here, so I hit the 5-second timeout.

That being said, to be thread safe it seems the MemoryMappedFiles need to be managed as lingering resources, like the subscriptions' images.

aeron::util::MemoryMappedFile::cleanUp() Line 206
aeron::util::MemoryMappedFile::~MemoryMappedFile() Line 219
std::_Ref_count<aeron::util::MemoryMappedFile>::_Destroy() Line 578 + 0x23 bytes
std::_Ref_count_base::_Decref() Line 538
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile>>>::_Destroy(std::shared_ptr<aeron::util::MemoryMappedFile> * _First=0x00549310, std::shared_ptr<aeron::util::MemoryMappedFile> * _Last=0x00549318) Line 1885 + 0x40 bytes
std::vector<std::shared_ptr<aeron::util::MemoryMappedFile>,std::allocator<std::shared_ptr<aeron::util::MemoryMappedFile>>>::_Tidy() Line 1952
aeron::LogBuffers::~LogBuffers() Line 84 + 0x56 bytes
aeron::LogBuffers::`scalar deleting destructor'() + 0xf bytes
std::_Ref_count_obj<aeron::LogBuffers>::_Destroy() Line 1327
std::_Ref_count_base::_Decref() Line 538
aeron::ClientConductor::PublicationStateDefn::~PublicationStateDefn() + 0x65 bytes
std::vector<aeron::ClientConductor::PublicationStateDefn,std::allocator<aeron::ClientConductor::PublicationStateDefn>>::clear() Line 1616 + 0x64 bytes
aeron::ClientConductor::onInterServiceTimeout(__int64 now=1499112928196) Line 548
aeron::ClientConductor::onHeartbeatCheckTimeouts() Line 303
aeron::concurrent::AgentRunner<aeron::ClientConductor,aeron::concurrent::SleepingIdleStrategy>::run() Line 64 + 0x2e bytes
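
For reference, the racy usage pattern on Thread #1 boils down to the sketch below. This is not code from my application; Publication, BufferClaim, tryClaim, buffer, offset and commit are the real Aeron C++ API, while the wrapper function, its parameters and the putBytes call are illustrative and may differ slightly from the exact signatures.

#include <cstdint>
#include <memory>

#include <Aeron.h>  // Aeron C++ client; include path assumed from a typical build setup

using namespace aeron;
using namespace aeron::concurrent::logbuffer;

// Thread #1: claim space, write the message, then commit. If the conductor
// thread (Thread #2) hits the inter-service timeout between tryClaim() and
// commit(), it unmaps the log buffers and commit() writes into freed memory.
void publishOne(std::shared_ptr<Publication> publication, const std::uint8_t *data, util::index_t length)
{
    BufferClaim bufferClaim;

    if (publication->tryClaim(length, bufferClaim) > 0)
    {
        bufferClaim.buffer().putBytes(bufferClaim.offset(), data, length);
        bufferClaim.commit();  // <-- crashes if the mapping was already cleaned up
    }
}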


goglusid commented Jul 3, 2017

If we were to delay the call to MemoryMappedFile::cleanUp() by X ms after ClientConductor::onInterServiceTimeout actually fires, then we could avoid this crash.

This period (X ms) would represent the maximum execution time of a single call to Publication::offer, or the time elapsed between calling Publication::tryClaim and BufferClaim::commit.
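
To make the idea concrete, here is a rough sketch of such delayed reclamation. All names here are hypothetical and only assume aeron::LogBuffers; this is not the Aeron implementation.

#include <algorithm>
#include <memory>
#include <vector>

#include <LogBuffers.h>  // aeron::LogBuffers (header name assumed)

// Hypothetical: instead of destroying the LogBuffers (and unmapping the files)
// inside onInterServiceTimeout, park them with a timestamp and only let them
// go once X ms have elapsed.
struct LingeringLogBuffers
{
    long long timeOfLinger;
    std::shared_ptr<aeron::LogBuffers> logBuffers;
};

std::vector<LingeringLogBuffers> lingeringResources;

void lingerLogBuffers(long long nowMs, std::shared_ptr<aeron::LogBuffers> logBuffers)
{
    lingeringResources.push_back({ nowMs, std::move(logBuffers) });
}

void drainLingeringResources(long long nowMs, long long resourceLingerMs)
{
    // Destroying the shared_ptr here finally unmaps the files, X ms after the timeout.
    lingeringResources.erase(
        std::remove_if(lingeringResources.begin(), lingeringResources.end(),
            [&](const LingeringLogBuffers &entry)
            {
                return nowMs > entry.timeOfLinger + resourceLingerMs;
            }),
        lingeringResources.end());
}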


goglusid commented Jul 3, 2017

For reference, this is the code I mean when I talk about managing an Image's log buffers as lingering resources:

void ClientConductor::onUnavailableImage(
    std::int32_t streamId,
    std::int64_t correlationId)
{
    const long long now = m_epochClock();
    std::lock_guard<std::recursive_mutex> lock(m_adminLock);

    std::for_each(m_subscriptions.begin(), m_subscriptions.end(),
        [&](const SubscriptionStateDefn &entry)
        {
            if (streamId == entry.m_streamId)
            {
                std::shared_ptr<Subscription> subscription = entry.m_subscription.lock();

                if (nullptr != subscription)
                {
                    std::pair<Image*, int> result = subscription->removeImage(correlationId);
                    Image* oldArray = result.first;
                    const int index = result.second;

                    if (nullptr != oldArray)
                    {
                        lingerResource(now, oldArray[index].logBuffers());  // <-- the lingering in question
                        lingerResource(now, oldArray);
                        entry.m_onUnavailableImageHandler(oldArray[index]);
                    }
                }
            }
        });
}

@tmontgomery

cc @mjpt777

Lingering doesn't solve the underlying issue. The same thing exists in the Java version, I do believe. Lingering simply moves the time horizon. At its heart this is a race between the munmap due to the inter service timeout and the BufferClaim commit/abort operations.


mjpt777 commented Jul 3, 2017

The Java code does not call the unavailable handlers when a forced close happens. I've also just pushed a change that will linger the resources for 1ms on a normal close and 1s on an inter service timeout.

tmontgomery self-assigned this Jul 3, 2017
@tmontgomery

I will reflect this in C++ in the next couple of days, if not sooner. Also, I want to add the agent invoker type option to the C++ API soon.


goglusid commented Jul 3, 2017

I agree that lingering only reduces the probability of hitting this issue.

If we were to store a smart pointer to the LogBuffers in the Publication instance returned by findPublication, then the application would control the lifetime of the log buffers without a possible race. Am I missing anything here?
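
In other words, something like the following. This is a hedged sketch, not the actual aeron::Publication class; LogBuffers here is just a stand-in and the constructor/member names are illustrative.

#include <memory>

class LogBuffers;  // stand-in for aeron::LogBuffers

// Illustrative only: if the Publication owns a shared_ptr to its LogBuffers,
// the mapping cannot be unmapped while the application still holds the
// Publication, no matter what the conductor thread decides to clean up.
class Publication
{
public:
    explicit Publication(std::shared_ptr<LogBuffers> logBuffers /*, ... */) :
        m_logBuffers(std::move(logBuffers))
    {
    }

    // offer()/tryClaim() operate on m_logBuffers, which stays mapped for the
    // lifetime of this object, i.e. until Publication::~Publication.

private:
    std::shared_ptr<LogBuffers> m_logBuffers;
};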

@tmontgomery

@goglusid Hmmm. Very very good point. That might work. Will give it a think. Yeah, that might be a nice way to handle it. Might also be usable for Java as well. Keep it around until Publication.close.


goglusid commented Jul 4, 2017

@tmontgomery I meant keep it around until Publication::~Publication

@tmontgomery

Agreed. Was thinking about Java as well, which requires an explicit close of the Publication instead of it simply going out of scope.

tmontgomery added a commit that referenced this issue Jul 5, 2017
…ivePublication to keep mapping around while in scope. For #371. Updated naming and layout for subscriber position in available image.
@tmontgomery

@goglusid go ahead and see about this now. The Publication (and ExclusivePublication) have a shared_ptr to the LogBuffers. So, this should be cleaner now.
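
For anyone else hitting this, the application-side pattern is then simply to keep the shared_ptr returned by findPublication alive while offers/claims are in flight. A minimal sketch, assuming the registration id comes from Aeron::addPublication elsewhere and the usual include path for the client header:

#include <cstdint>
#include <memory>
#include <thread>

#include <Aeron.h>

using namespace aeron;

// Assumed: publicationId comes from aeron.addPublication(channel, streamId).
std::shared_ptr<Publication> awaitPublication(Aeron &aeron, std::int64_t publicationId)
{
    std::shared_ptr<Publication> publication = aeron.findPublication(publicationId);

    while (!publication)
    {
        std::this_thread::yield();  // not yet registered with the media driver
        publication = aeron.findPublication(publicationId);
    }

    // This shared_ptr<Publication> now (transitively) keeps the LogBuffers mapped,
    // so tryClaim()/commit() on it no longer race with the conductor's cleanup.
    return publication;
}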


goglusid commented Jul 6, 2017

@tmontgomery Your awesomeness knows no bounds! ;p Problem solved. Thanks :D

goglusid closed this as completed Jul 6, 2017
@tmontgomery

Thanks! No worries! We'll be making some other changes in this area shortly as well.


goglusid commented Jul 6, 2017

@tmontgomery Could you please elaborate a bit on the other changes in this area?

@tmontgomery

Experimenting with reference counting the mappings for #365 so multiple mappings are not needed. Also want to add the agent invoker style thread control to C++. And also change the mapping flags.
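
To picture the first item, a reference-counted mapping cache could look roughly like the sketch below. Everything here is hypothetical apart from aeron::util::MemoryMappedFile and mapExisting, whose header path and exact signature are assumed.

#include <map>
#include <memory>
#include <string>

#include <util/MemoryMappedFile.h>  // aeron::util::MemoryMappedFile (header path assumed)

// Hypothetical sketch of "reference counting the mappings": hand out shared_ptrs
// to one mapping per file, so a second request reuses the live mapping instead
// of mmap'ing the same file again.
class MappingCache
{
public:
    std::shared_ptr<aeron::util::MemoryMappedFile> acquire(const std::string &filename)
    {
        auto it = m_mappings.find(filename);
        if (it != m_mappings.end())
        {
            if (auto existing = it->second.lock())
            {
                return existing;  // mapping still alive somewhere: reuse it
            }
        }

        auto mapping = aeron::util::MemoryMappedFile::mapExisting(filename.c_str());
        m_mappings[filename] = mapping;  // weak_ptr: the cache itself does not pin the mapping
        return mapping;
    }

private:
    std::map<std::string, std::weak_ptr<aeron::util::MemoryMappedFile>> m_mappings;
};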
