Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debugging erratic DDS behaviour #171

Open
rdghickman opened this issue Aug 12, 2021 · 6 comments
Open

Debugging erratic DDS behaviour #171

rdghickman opened this issue Aug 12, 2021 · 6 comments

Comments

@rdghickman
Copy link

Until very recently, the system I was maintaining was running OSPL 6.7180404OSS. Part of this has been rebuilt recently against 6.9.210323OSS and I have begun to see some extremely erratic behaviour that almost certainly borders on broken, however I am clueless as to how to debug such a scenario.

Essentially there is a few C++ apps, one of which is publishing to a topic. There are other Java apps that use the supplied JNI layer that subscribe to this topic. When problems arose, I turned to a quick tool that was written to show all publishers and subscribers for a given topic and I observed the following:

  • After the publisher is started, I see 1 publisher;
  • After the first subscriber is started, I see 1 publisher and 1 subscriber;
  • After the second subscriber is started, I see 1 publisher and 1 subscriber for a while, then I see 2 subscribers and 0 publishers;
  • After the third subscriber is started, I see anything from 1-3 subscribers with 0 publishers or 1 publisher with 0 subscribers.

This feels like the whole messaging layer is malformed somehow, however there are no ospl error logs. The only things of note in the info logs are missed heartbeats, which seems to occur when new publishers/subscribers are started.

I am looking to see if anyone would put a finger on any particular area to investigate if anyone had happened to see such behaviour before. At the moment given the limited tools and knowledge at my disposal, it may have to abort back to 6.7 unless I can ascertain what the actual issue may be.

@vivekpandey02
Copy link

Dear Richard,

May i know your ospl configuraion? Please share us your previous and latest ospl configuration.

With best regards,
Vivek Pandey
ADLINK Technology

@rdghickman
Copy link
Author

The configuration is as follows (which has not changed between 6.7 and 6.9):

<OpenSplice>
    <Domain>
        <Name>rdghickman</Name>
        <Id>224</Id>
        <SingleProcess>true</SingleProcess>
        <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
        <Service name="ddsi2">
            <Command>ddsi2</Command>
        </Service>
        <Service enabled="false" name="cmsoap">
            <Command>cmsoap</Command>
        </Service>
        <Report verbosity="INFO"/>
    </Domain>
    <DDSI2Service name="ddsi2">
        <General>
            <NetworkInterfaceAddress>eth0</NetworkInterfaceAddress>
            <AllowMulticast>true</AllowMulticast>
            <EnableMulticastLoopback>true</EnableMulticastLoopback>
            <CoexistWithNativeNetworking>false</CoexistWithNativeNetworking>
        </General>
        <Compatibility>
            <!-- see the release notes and/or the OpenSplice configurator on DDSI interoperability -->
            <StandardsConformance>lax</StandardsConformance>
            <!-- the following one is necessary only for TwinOaks CoreDX DDS compatibility -->
            <!-- <ExplicitlyPublishQosSetToDefault>true</ExplicitlyPublishQosSetToDefault> -->
        </Compatibility>
        <Discovery>
            <MaxAutoParticipantIndex>39</MaxAutoParticipantIndex>
        </Discovery>
    </DDSI2Service>
    <TunerService name="cmsoap">
        <Server>
            <PortNr>Auto</PortNr>
        </Server>
    </TunerService>
</OpenSplice>

This runs on x64 Linux built with GCC 10.

After further analysis our current belief is that this only affects Java applications. C++ applications seem fine which may indicate an issue with the 6.9 JNI layer or its configuration, but it is difficult to be certain.

@vivekpandey02
Copy link

It seems wired to me why it is not working because we didn't find any such problem in the OSPL 6.9.210323OSS.

You could try with our Vortex Opensplice DDS Commercial 6.11.x ( https://www.adlinktech.com/en/vortex-data-distribution-service-dds-software-evaluation ).

With best regards,
Vivek

@rdghickman
Copy link
Author

Have now confirmed that reverting to 6.7.180404OSS with the same code appears to fix this issue, even with GCC10, so we will have to abandon 6.9.210323OSS at this time and look for another solution.

@rdghickman
Copy link
Author

After extensive analysis I believe I have actually found the problem. I don't believe it is actually a 6.7 vs 6.9 issue, but it is a subtle mis-use of DDS from the Java side which seems to create a catastrophic problem inside DDS. Unfortunately OpenSplice does not show up with any warnings or errors that I can find, which resulted in our assumption that the upgrade of OpenSplice caused the failure. While it appeared to occur in 6.9 and not 6.7, what is required is a Java application using LivelinessChangedStatus, which crept in during the shift to 6.9. Allow me to clarify.

We have a wrapper around the very old Java 5 DDS API, one of which is a reader-type class which supports a liveliness callback:

public LivelinessOpenspliceDdsReader(Subscriber subscriber, Topic<T> topic, DdsReceiver<R> receiver, MessageConverter<T, R> converter,
        DdsLivelinessListener livelinessListener, QosPolicy.ForDataReader... policies)
    {
        super(subscriber, topic, receiver, converter, Arrays.asList(DataAvailableStatus.class, LivelinessChangedStatus.class), policies);
        this.livelinessListener = Objects.requireNonNull(livelinessListener);
    }

    @Override
    public void onLivelinessChanged(LivelinessChangedEvent<T> status)
    {
        super.onLivelinessChanged(status);
        livelinessListener.livelinessChanged(status.getStatus().getAliveCount());
    }

This code appears innocent enough and is completely valid Java however this causes what seems to be a massive problem. The base class does the following:

protected BaseOpenspliceDdsReader(Subscriber subscriber, Topic<T> topic, DdsReceiver<R> receiver, MessageConverter<T, R> converter,
        Collection<Class<? extends Status>> status, QosPolicy.ForDataReader... policies)
    {
        this.subscriber = Objects.requireNonNull(subscriber);
        this.converter = Objects.requireNonNull(converter);
        this.receiver = Objects.requireNonNull(receiver);
        this.dataReader = subscriber.createDataReader(topic,
            subscriber.copyFromTopicQos(subscriber.getDefaultDataReaderQos().withPolicies(policies),
                topic.getQos()), this, status);
    }

It appears that createDataReader ends up calling down into native code, with the status variable in this case including LivelinessChangedStatus.class. This completes the constructor for the base class but not the constructor for the child class. The native code appears to then callback into the Java side, hitting the void onLivelinessChanged(LivelinessChangedEvent<T> status) method in the child class, however since the child class has not finished its constructor, the livelinessListener member variable has not yet been initialised.

It appears this is causing some significant problem inside the DDS layer at some point, however there are no warnings or errors. If the LivelinessChangedStatus.class is removed then no problems are seen because presumably the void onLivelinessChanged(LivelinessChangedEvent<T> status) is not being accessed and generating what is most likely a NullPointerException.

Given that the Java written is valid and there does not appear to be anything stopping JNI from calling into an object's methods before initialisation is complete, may I perhaps suggest the following:

  • Make it extremely clear in some appropriate documentation that the methods of a DataReaderListener<TYPE> may be called, even during initialisation;
  • Add at least some manner of warning or error to the OSPL log files if this kind of callback failure occurs to aid in tracking down the problem.

@vivekpandey02
Copy link

Dear Richard,

Thank you for your comment. I really appreciate your suggestion. We will test the case and consider your workaround, if required.

With best regards,
Vivek K Pandey
Solutions Architect
ADLINK Technology, SAT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants