
Push local pool memory to a global list at thread termination #1621

Merged
merged 1 commit into ponylang:master on Mar 4, 2017

Conversation

Praetonus (Member):

This change adds a global list of free blocks to the pool allocator, similar to the existing global list of free lists. When a runtime thread terminates, it will push its local free blocks to the global list, which can then be retrieved by other threads.

The goal of this change is to avoid memory leaks when starting and stopping the runtime multiple times in a program, for example in the compiler test suite.

This pool allocator cleanup functionality is exposed to user threads through the new pony_unregister_thread function.

PS: I expect that this change will introduce a performance penalty at runtime termination, especially for long-running programs with lots of memory in free lists. I haven't done any real measurements, so I don't have a precise estimate of the slowdown. If that's an issue, I can modify the patch to add a runtime option for pool cleanup.
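For context, here is a minimal sketch of how a user thread might call the new entry point. The PR itself only specifies pony_unregister_thread; the header, threading boilerplate, and worker body below are assumptions for illustration.

#include <pony.h>     /* assumed to declare pony_unregister_thread() */
#include <pthread.h>
#include <stddef.h>

static void* worker(void* arg)
{
  (void)arg;
  /* ... work that allocates and frees through the pool allocator ... */

  /* Before the thread exits, push its local free blocks and free lists
   * to the global lists so other threads can reuse or release them. */
  pony_unregister_thread();
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  pthread_join(&t, NULL);
  return 0;
}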

 {
-  PONY_ABA_PROTECTED(T) ret = {NULL, 0};
+  PONY_ABA_PROTECTED_PTR(T) ret = {NULL, 0};
Contributor:

The interlocked exchange seems wrong. Such a heavyweight operation for a load? I can't see anywhere that it would be necessary.

Contributor:

Similarly, big_store uses a CAS where a store should be sufficient.

Member Author:

This is because x86 doesn't have a plain load/store instruction that operates on 16 bytes. The only instruction that can work on an operand of that size is cmpxchg16b; if we used two plain movs instead, there would be a time window where one thread could modify one half of the value while another thread was still processing the other half.
For reference, both GCC and Clang also generate cmpxchg16b for atomic loads/stores of 16-byte objects.

I've tried to reduce the overhead introduced by these operations by using non-atomic loads/stores whenever possible (for example, at pool.c:465). Also, the fast bailout path of both pool_pull and pool_block_pull doesn't use any big atomics or hardware synchronisation.
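For illustration, a 16-byte atomic load lowered through cmpxchg16b behaves roughly like the following sketch. The type and function names here are invented, not the runtime's actual bigatomic implementation.

#include <stddef.h>
#include <stdint.h>

typedef struct aba_ptr_t
{
  void* object;      /* the pointer itself */
  uintptr_t counter; /* ABA tag, bumped on every successful update */
} aba_ptr_t;

static aba_ptr_t bigatomic_load_sketch(aba_ptr_t* p)
{
  /* A CAS whose expected and desired values are equal never changes *p,
   * but always leaves the current 16-byte contents in `expected`: on
   * success *p already held {NULL, 0}, and on failure `expected` is
   * refreshed with the current value. Two separate 8-byte movs could
   * instead observe one half updated and the other half stale. */
  aba_ptr_t expected = {NULL, 0};
  aba_ptr_t desired = expected;
  /* GCC/Clang lower this to lock cmpxchg16b on x86-64 (with -mcx16). */
  __atomic_compare_exchange(p, &expected, &desired, false,
    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
  return expected;
}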

Contributor:

What I'm thinking is that we are using bigatomic_load to "initialise" a 16-byte ABA-protected pointer before a CAS operation. If we load the two 8-byte chunks independently (cheaply!), we will either get a matched pair (which may succeed during the CAS, and everything is fine) or an unmatched pair.

If we get an unmatched pair, the loads could be in either order. If we read a stale ABA and a current pointer, we will do a second loop on the CAS, which is no more expensive than the initial atomic read. If we read a current ABA and a stale pointer... same result!

The only way to provoke an error would be if 2^64 writes happened between the ABA read and the pointer read, and the 2^64th write wrote back the old pointer.

Is there another error condition that I'm missing?
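For illustration, the suggested pattern might look like this sketch of a push operation. All names are invented; the runtime's real pool code differs.

#include <stdbool.h>
#include <stdint.h>

typedef struct node_t { struct node_t* next; } node_t;

typedef struct aba_ptr_t
{
  node_t* object;    /* list head */
  uintptr_t counter; /* ABA tag, bumped on every successful push */
} aba_ptr_t;

static void list_push(aba_ptr_t* head, node_t* node)
{
  aba_ptr_t cmp, xchg;

  /* Two cheap 8-byte loads: no lock prefix, no cmpxchg16b. If another
   * thread updates head between them, the pair is stale or torn and the
   * CAS below simply fails. */
  __atomic_load(&head->object, &cmp.object, __ATOMIC_RELAXED);
  __atomic_load(&head->counter, &cmp.counter, __ATOMIC_RELAXED);

  do
  {
    node->next = cmp.object;
    xchg.object = node;
    xchg.counter = cmp.counter + 1;
    /* 16-byte CAS; GCC/Clang lower this to lock cmpxchg16b with -mcx16.
     * On failure, cmp is refreshed with the current 16-byte value, so the
     * relaxed loads only seed the first iteration. */
  } while(!__atomic_compare_exchange(head, &cmp, &xchg, false,
    __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}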

Member Author:

That would work, yes. I think it will only be possible in pool_push/pull and not in pool_block_push/pull, because in the latter the CAS can be on any element of the list and, if we fail, we always retry from the start of the list. I'll update the PR.

Member Author:

It turns out it is also possible in pool_block_push/pull, but with an additional branch to avoid the CAS when we don't need it. Would that be an interesting tradeoff?

Member Author:

After thinking about it, these functions won't be used very often, so it probably won't make a difference. I'll include it in the change.

@@ -239,6 +239,7 @@ DECLARE_THREAD_FN(ponyint_asio_backend_dispatch)
   close(b->wakeup);
   ponyint_messageq_destroy(&b->q);
   POOL_FREE(asio_backend_t, b);
+  pony_unregister_thread();
Contributor:

I like the explicit unregister approach.

@Praetonus force-pushed the pool-cleanup-block branch 2 times, most recently from d6ef876 to ba28fee on March 2, 2017 at 17:34
Praetonus pushed a commit to Praetonus/ponyc that referenced this pull request on Mar 2, 2017
This is a change originally suggested by @sylvanc in ponylang#1621. On x86,
this results in fewer occurrences of the expensive cmpxchg16b
instruction.
sylvanc pushed a commit that referenced this pull request on Mar 3, 2017
This is a change originally suggested by @sylvanc in #1621. On x86,
this results in fewer occurrences of the expensive cmpxchg16b
instruction.
@@ -53,7 +53,7 @@ void ponyint_mpmcq_destroy(mpmcq_t* q)
 {
   atomic_store_explicit(&q->head, NULL, memory_order_relaxed);
 #ifdef PLATFORM_IS_X86
-  PONY_ABA_PROTECTED(mpmcq_node_t*) tail = bigatomic_load_explicit(&q->tail,
+  PONY_ABA_PROTECTED_PTR(mpmcq_node_t) tail = bigatomic_load_explicit(&q->tail,
Contributor:

bigatomic_load_explicit isn't needed here either; only the .object field is used.
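For illustration, the simplification being requested might look like this sketch on a cut-down type (the real code uses the runtime's PONY_ABA_PROTECTED_PTR machinery; everything here is assumed):

#include <stdint.h>

typedef struct mpmcq_node_t mpmcq_node_t; /* opaque for this sketch */

typedef struct aba_tail_t
{
  mpmcq_node_t* object;  /* tail pointer, the only half destroy needs */
  uintptr_t counter;     /* ABA tag, irrelevant once single-threaded */
} aba_tail_t;

static mpmcq_node_t* tail_for_destroy(aba_tail_t* tail)
{
  /* Destruction runs single-threaded, so a plain 8-byte field read
   * replaces the 16-byte atomic load. */
  return tail->object;
}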

@@ -40,7 +40,7 @@ void ponyint_mpmcq_init(mpmcq_t* q)
 
   atomic_store_explicit(&q->head, node, memory_order_relaxed);
 #ifdef PLATFORM_IS_X86
-  PONY_ABA_PROTECTED(mpmcq_node_t*) tail;
+  PONY_ABA_PROTECTED_PTR(mpmcq_node_t) tail;
Contributor:

No need for bigatomic_store_explicit here. There is no concurrency issue when initialising the queue.
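The same reasoning applies at init time. Reusing the illustrative aba_tail_t from the sketch above, plain stores suffice because no other thread can observe the queue before initialisation returns:

static void tail_init(aba_tail_t* tail, mpmcq_node_t* stub)
{
  /* Initialisation is single-threaded: plain stores, no
   * bigatomic_store_explicit required. */
  tail->object = stub;
  tail->counter = 0;
}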

@sylvanc (Contributor) commented Mar 3, 2017:

Sorry @Praetonus, I just noticed two more bigatomic ops we don't need.

This change adds a global list of free blocks to the pool allocator,
similar to the existing global list of free lists. When a runtime
thread terminates, it will push its local free blocks to the global
list, which can then be retrieved by other threads.

The goal of this change is to avoid memory leaks when starting and
stopping the runtime multiple times in a program, for example in the
compiler test suite.

This pool allocator cleanup functionality is exposed to user threads
through the new pony_unregister_thread function.
@Praetonus (Member Author):

I've updated the PR.

@sylvanc (Contributor) commented Mar 4, 2017:

Excellent, thanks @Praetonus!

@sylvanc merged commit 43a6346 into ponylang:master on Mar 4, 2017
@Praetonus deleted the pool-cleanup-block branch on March 14, 2017 at 22:55