
[Tests] "python pulsar_test.py" failed with segmentation fault and core dump #11635

Closed
lhotari opened this issue Aug 11, 2021 · 10 comments · Fixed by #12427
Assignees: BewareMyPower
Labels: type/bug

lhotari (Member) commented Aug 11, 2021

Describe the bug

The logs contained this line

............../run-unit-tests.sh: line 85:  2489 Segmentation fault      (core dumped) python pulsar_test.py
.
.
.
Error: Process completed with exit code 139.

This happened in https://github.com/apache/pulsar/pull/11387/checks?check_run_id=3297639124

To Reproduce

Unknown.

Expected behavior

Tests should not crash with segmentation fault.

Additional context

Logs of the build are available at https://gist.github.com/lhotari/5241f83ed6c348b3d2c91ce5da33c1d0 . "test-logs.zip" is at https://transfer.sh/1sa6OMj/test-logs.zip .

BewareMyPower (Contributor):

I also ran into this problem recently. It's flaky, and the cause is currently hard to pin down. Maybe a temporary workaround would be to add retry logic around python pulsar_test.py? I'm not sure how easy that is. In addition, I found that test-logs.zip only contains the gtest logs (C++ tests) and the broker's logs (unfortunately, immediateFlush is false by default). The Python test logs can only be found on the CI workflow page. Maybe it would be better to change immediateFlush from false to true when running the C++/Python tests.

From the last few lines of the Python test output

[persistent://public/default/partitioned_topic_name_test-partition-0, partitioned_topic_name_test_sub, 0] Closed consumer 0
2021-08-11 06:29:09.454 INFO  [139946459391744] ConsumerImpl:930 | [persistent://public/default/partitioned_topic_name_test-partition-1, partitioned_topic_name_test_sub, 1] Closed consumer 1
2021-08-11 06:29:09.454 INFO  [139946459391744] ConsumerImpl:930 | [persistent://public/default/partitioned_topic_name_test-partition-2, partitioned_topic_name_test_sub, 2] Closed consumer 2
2021-08-11 06:29:09.455 INFO  [139946459391744] ClientConnection:1504 | [127.0.0.1:56176 -> 127.0.0.1:6650] Connection closed

and standalone logs

06:29:09.444 [Thread-380] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:56176] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/partitioned_topic_name_test-partition-1}, client=/127.0.0.1:56176, producerName=standalone-0-355, producerId=1}
06:29:09.444 [Thread-380] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:56176] persistent://public/default/partitioned_topic_name_test-partition-0 configured with schema false
06:29:09.445 [Thread-380] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:56176] Created new producer: Producer{topic=PersistentTopic{topic=persistent://public/default/partitioned_topic_name_test-partition-0}, client=/127.0.0.1:56176,

we can see the topic is partitioned_topic_name_test. See

def test_get_partitioned_topic_name(self):
    client = Client(self.serviceUrl)
    url1 = self.adminUrl + '/admin/v2/persistent/public/default/partitioned_topic_name_test/partitions'
    doHttpPut(url1, '3')
    partitions = ['persistent://public/default/partitioned_topic_name_test-partition-0',
                  'persistent://public/default/partitioned_topic_name_test-partition-1',
                  'persistent://public/default/partitioned_topic_name_test-partition-2']
    self.assertEqual(client.get_topic_partitions('persistent://public/default/partitioned_topic_name_test'),
                     partitions)

    consumer = client.subscribe('persistent://public/default/partitioned_topic_name_test',
                                'partitioned_topic_name_test_sub',
                                consumer_type=ConsumerType.Shared)
    producer = client.create_producer('persistent://public/default/partitioned_topic_name_test')
    producer.send(b'hello')

    msg = consumer.receive(TM)
    self.assertTrue(msg.topic_name() in partitions)

    client.close()

But it looks like there's no problem with the test code itself, and usually a rerun passes.

zbentley (Contributor):

If it's the same segfault I mentioned here, that may provide a means to more predictably reproduce it.

Might be a totally unrelated issue though.

lhotari (Member, Author) commented Sep 2, 2021

This might be fixed by #11887. Closing; we can reopen if it appears again in the future.

lhotari closed this as completed Sep 2, 2021
lhotari (Member, Author) commented Sep 10, 2021

The problem remains, so I'm reopening the issue.

It happens quite often. The most recent occurrence is https://github.com/apache/pulsar/pull/11994/checks?check_run_id=3564315253 .
Logs are at https://transfer.sh/z7Pa5K/logs_347931.zip and https://transfer.sh/lJtmwo/test-logs%20%283%29.zip (also contains broker logs).

@BewareMyPower would you mind checking what the problem is?

lhotari reopened this Sep 10, 2021
BewareMyPower (Contributor):

Okay, I'll take a look soon.

BewareMyPower (Contributor):

I might not have enough time for this issue in the near future, so I've assigned it to myself for now to avoid forgetting about it.

BewareMyPower self-assigned this Sep 10, 2021
BewareMyPower (Contributor):

It looks like this bug can be reproduced easily in a Docker container:

2021-09-12 03:36:21.326 INFO  [140646740616960] ConnectionPool:96 | Created connection for fakeServiceUrl
2021-09-12 03:36:21.326 ERROR [140646740616960] ClientConnection:502 | [<none> -> fakeServiceUrl] Invalid Url, unable to parse: system:0 Success
2021-09-12 03:36:21.326 INFO  [140646740616960] ClientConnection:1499 | [<none> -> fakeServiceUrl] Connection closed
2021-09-12 03:36:21.326 INFO  [140646740616960] ClientConnection:255 | [<none> -> fakeServiceUrl] Destroyed connection
2021-09-12 03:36:21.326 ERROR [140646740616960] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/connect-error-topic -- ConnectError
.2021-09-12 03:36:21.327 INFO  [140646740616960] ClientConnection:181 | [<none> -> pulsar://192.0.2.1:1234] Create ClientConnection, timeout=1000
2021-09-12 03:36:21.327 INFO  [140646740616960] ConnectionPool:96 | Created connection for pulsar://192.0.2.1:1234
Segmentation fault

I hope the cause can be figured out soon.
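
For context, the failing sequence in that log is: create a client against an unparsable/unreachable service URL, attempt to create a producer (the partition-metadata lookup then fails with ConnectError), and continue with the next connection attempt. A rough C++ equivalent of that sequence is sketched below; the Python tests drive the same C++ client underneath. The service URL and topic name are copied from the log above, and this is only an illustration, not the actual test code.

#include <iostream>
#include <pulsar/Client.h>

int main() {
    // "fakeServiceUrl" is taken from the log above: a service URL the client
    // cannot parse or connect to.
    pulsar::Client client("fakeServiceUrl");

    pulsar::Producer producer;
    // Creating the producer triggers the partition-metadata lookup that fails
    // with ConnectError in the log.
    pulsar::Result res =
        client.createProducer("persistent://public/default/connect-error-topic", producer);
    if (res != pulsar::ResultOk) {
        std::cerr << "createProducer failed, result=" << static_cast<int>(res) << std::endl;
    }

    // In the crashing run, the segfault shows up after this point, around the
    // next connection attempt (pulsar://192.0.2.1:1234).
    client.close();
    return 0;
}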

BewareMyPower (Contributor) commented Oct 19, 2021

I reproduced it in my local environment and analyzed the core dump using gdb.

Here is the most important part of the stack trace:

#0  0x00007f655e5e8d0d in std::__atomic_base<long>::load (__m=std::memory_order_seq_cst, this=0xb8) at /usr/include/c++/5/bits/atomic_base.h:396
#1  std::__atomic_base<long>::operator long (this=0xb8) at /usr/include/c++/5/bits/atomic_base.h:259
#2  0x00007f655e5e6348 in boost::asio::detail::task_io_service::run (this=0x0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:136
#3  0x00007f655e5e6cc4 in boost::asio::io_service::run (this=0x1f11990) at /usr/include/boost/asio/impl/io_service.ipp:59
#4  0x00007f655e5e32d0 in pulsar::ExecutorService::startWorker (this=0x1f11130, io_service=std::shared_ptr (count 2, weak 0) 0x1e8ba00) at /pulsar/pulsar-client-cpp/lib/ExecutorService.cc:37
#5  0x00007f655e5f1680 in std::_Mem_fn_base<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>), true>::operator()<std::shared_ptr<boost::asio::io_service>&, void> (this=0x1c45ff8, __object=0x1f11130)
    at /usr/include/c++/5/functional:600
#6  0x00007f655e5f1574 in std::_Bind<std::_Mem_fn<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>)> (pulsar::ExecutorService*, std::shared_ptr<boost::asio::io_service>)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0x1c45ff8, __args=<unknown type in /pulsar/pulsar-client-cpp/python/_pulsar.so, CU 0x60498d, DIE 0x64b00a>) at /usr/include/c++/5/functional:1074
#7  0x00007f655e5f1469 in std::_Bind<std::_Mem_fn<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>)> (pulsar::ExecutorService*, std::shared_ptr<boost::asio::io_service>)>::operator()<, void>() (this=0x1c45ff8)
    at /usr/include/c++/5/functional:1133
#8  0x00007f655e5f142e in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>)> (pulsar::ExecutorService*, std::shared_ptr<boost::asio::io_service>)> ()>::_M_invoke<>(std::_Index_tuple<>)
    (this=0x1c45ff8) at /usr/include/c++/5/functional:1531
#9  0x00007f655e5f132e in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>)> (pulsar::ExecutorService*, std::shared_ptr<boost::asio::io_service>)> ()>::operator()() (this=0x1c45ff8)
    at /usr/include/c++/5/functional:1520
#10 0x00007f655e5f1194 in std::thread::_Impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (pulsar::ExecutorService::*)(std::shared_ptr<boost::asio::io_service>)> (pulsar::ExecutorService*, std::shared_ptr<boost::asio::io_service>)> ()> >::_M_run() (
    this=0x1c45fe0) at /usr/include/c++/5/thread:115
#11 0x00007f655e866bd0 in execute_native_thread_routine () from /pulsar/pulsar-client-cpp/python/_pulsar.so
#12 0x00007f655fbab6ba in start_thread (arg=0x7f65558f4700) at pthread_create.c:333
#13 0x00007f655f8e151d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Note frame #2 in particular:

#2  0x00007f655e5e6348 in boost::asio::detail::task_io_service::run (this=0x0, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:136

We can see from this=0x0 that the pointer is null, while from frame #3 we can see the io_service object itself is not null. It looks like the internal impl_ field of the io_service has somehow already been released.

I'll work on refactoring the ExecutorService implementation and see whether the segfault still happens.
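
For illustration only, one way to rule out this kind of lifetime race is to let the worker thread itself hold shared ownership of the io_service (together with a work guard), so the io_service internals cannot be released while run() is still executing. The Executor class below is just a sketch of that idea, not necessarily the shape the actual refactoring will take:

#include <memory>
#include <thread>
#include <boost/asio.hpp>

class Executor {
 public:
  Executor()
      : io_(std::make_shared<boost::asio::io_service>()),
        work_(std::make_shared<boost::asio::io_service::work>(*io_)) {
    // Capture the shared_ptr by value: the worker thread keeps the io_service
    // alive on its own, independent of the Executor's lifetime.
    auto io = io_;
    worker_ = std::thread([io] { io->run(); });
  }

  ~Executor() {
    work_.reset();  // drop the work guard so run() can return
    io_->stop();    // and ask the event loop to stop
    if (worker_.joinable()) {
      worker_.join();
    }
  }

 private:
  std::shared_ptr<boost::asio::io_service> io_;
  std::shared_ptr<boost::asio::io_service::work> work_;
  std::thread worker_;
};

Whatever the final refactoring looks like, the key invariant is the same: the io_service must outlive every thread that is still inside run(), which is exactly what the crashing stack shows being violated.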

BewareMyPower (Contributor):

I've rerun the Python tests in the Docker container 7 times after applying my patch:

  • Tue Oct 19 15:08:20 UTC 2021
  • Tue Oct 19 15:12:35 UTC 2021
  • Tue Oct 19 15:15:04 UTC 2021
  • Tue Oct 19 15:17:37 UTC 2021
  • Tue Oct 19 15:20:42 UTC 2021
  • Tue Oct 19 15:24:13 UTC 2021
  • Tue Oct 19 15:27:04 UTC 2021

No segmentation fault happened. I'll open a PR soon.

lhotari (Member, Author) commented Oct 19, 2021

> I've rerun the Python tests in the Docker container 7 times after applying my patch: [...] No segmentation fault happened. I'll open a PR soon.

Nice! Good work @BewareMyPower. I'm looking forward to your PR.
