Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

gapisback · 2023-04-14T23:29:15Z

The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits.

Another driving factor for implementing free-fragment list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point. We would frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).

Design Overview

The main design changes to manage small-fragments are follows:

Managing memory allocation / free using `platform_memfrag{}` fragments

Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to
the cacheline.
- Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request.
- Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket, if the fragment's size is one in a small set of free-fragments being tracked.
Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient
for current benchmarking requirements.)

A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself.
New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size.
Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request.

For example, a request from a client for 150 bytes will be rounded-up to a cacheline boundary,
i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes).

An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory.
All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free().
In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method.
- An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'.
This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ...
Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls.

Cautionary Note

If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic *, and its sizeof() will be used as the 'size' of the fragment to free.

This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if
we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.)

Fingerprint Object Management

Managing memory for fingerprint arrays was particularly problematic.

This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.

The requirements from the new interfaces are handled as follows:

Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations.
Various accessor methods are added to manage memory for fingerprint arrays through this object.
E.g.,
- fingerprint_init() - Will allocate required fingerprint for 'ntuples'.
- fingerprint_deinit() - Will dismantle object and free the memory
- fingerprint_start() - Returns start of fingerprint array's memory
- fingerprint_nth() - Returns n'th element of fingerprint

Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics.

When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to
platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments.
(Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.)

Test changes

Miscellaneous

Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.)
Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces:
- platform_apis_test:
- splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly.
Enhanced various diagnostics, asserts, tracing
Improved memory usage stats gathering and reporting
Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list

netlify · 2023-04-14T23:29:21Z

✅ Deploy Preview for splinterdb canceled.

Name	Link
🔨 Latest commit	`9281c83`
🔍 Latest deploy log	https://app.netlify.com/sites/splinterdb/deploys/65bb050c366a1500096d0cad

gapisback · 2023-04-14T23:35:06Z

include/splinterdb/splinterdb.h

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
 // Close a splinterdb
 //
 // This will flush all data to disk and release all resources
-void
+int


To percolate errors found by shm-destroy, if large-fragments not free are still found hanging around.

Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

0 on success,

a positive integer when all data has been persisted but not all resources were able to be released, and

a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

gapisback · 2023-04-14T23:35:17Z

src/PackedArray.c

@@ -380,7 +380,6 @@ void PACKEDARRAY_JOIN(__PackedArray_unpack_, PACKEDARRAY_IMPL_BITS_PER_ITEM)(con
 #include "poison.h"

 #define PACKEDARRAY_MALLOC(size) platform_malloc(size)
-#define PACKEDARRAY_FREE(p) platform_free(p)


Unused interface.

gapisback · 2023-04-14T23:36:07Z

src/btree.c

-      platform_assert(req->num_tuples < req->max_tuples);
-      req->fingerprint_arr[req->num_tuples] =
+      platform_assert(btree_pack_can_fit_tuple(req));
+      fingerprint_start(&req->fingerprint)[req->num_tuples] =


Here's where you will start to see the use of fingerprint object and its accessor / interfaces.

gapisback · 2023-04-14T23:37:05Z

src/btree.h

-   uint32       *fingerprint_arr; // IN/OUT: hashes of the keys in the tree
+   hash_fn       hash;        // hash function used for calculating filter_hash
+   unsigned int  seed;        // seed used for calculating filter_hash
+   fp_hdr        fingerprint; // IN/OUT: hashes of the keys in the tree


The in-place char * array is now replaced by the fingerprint object, which carries inside of it platform_memfrag{} handle to track allocate memory fragment's size and to free it reliably.

gapisback · 2023-04-14T23:38:20Z

src/btree.h

-      req->fingerprint_arr =
-         TYPED_ARRAY_ZALLOC(hid, req->fingerprint_arr, max_tuples);
+
+      fingerprint_init(&req->fingerprint, hid, max_tuples); // Allocates memory


Inline memory allocation on old L345 is, henceforth, replaced by init()'ing the fingerprint object ... And so on ...

gapisback · 2023-04-14T23:39:03Z

src/btree.h

                   "Unable to allocate memory for %lu tuples",
                   max_tuples);
-      if (!req->fingerprint_arr) {
+      if (fingerprint_is_empty(&req->fingerprint)) {


You can no longer check for NULL array ptr, to detect OOM. You must consult the is_empty() method to figure out if there is memory or not.

gapisback · 2023-04-14T23:39:26Z

src/btree.h

-   if (req->fingerprint_arr) {
-      platform_free(hid, req->fingerprint_arr);
+   if (!fingerprint_is_empty(&req->fingerprint)) {
+      fingerprint_deinit(hid, &req->fingerprint);


deinit() will free memory.

gapisback · 2023-04-14T23:40:35Z

src/clockcache.c

   if (!cc->lookup) {
      goto alloc_error;
   }
+   cc->lookup_size = memfrag_size(&memfrag_cc_lookup);


Here's the first instance of a pair of init / deinit calls, which now need to communicate the size of memory fragment allocated by init().

Like it's done on this line, few common structures now gain a new size field to track the memory fragment's size. These structures are of the kind where they are allocated / init'ed in one function and much later the deinit() method is called in a separate function.

gapisback · 2023-04-14T23:41:28Z

src/clockcache.c

   if (cc->lookup) {
-      platform_free(cc->heap_id, cc->lookup);
+      memfrag_init_size(mf, cc->lookup, cc->lookup_size);
+      platform_free(cc->heap_id, mf);


free() needs to be told the fragment's size correctly. This is obtained from the size field stashed away when init() was done.

gapisback · 2023-04-14T23:42:00Z

src/clockcache.c

   }
   if (cc->entry) {
-      platform_free(cc->heap_id, cc->entry);
+      memfrag_init_size(mf, cc->entry, cc->entry_size);
+      platform_free(cc->heap_id, mf);


Same pattern of changes continues. This will appear in many more instances ...

gapisback · 2023-04-14T23:45:08Z

src/memtable.c

+   platform_memfrag  memfrag_ctxt;
+   platform_memfrag *mf = &memfrag_ctxt;
+   memfrag_init_size(mf, ctxt, ctxt->mt_ctxt_size);
+   platform_free(hid, mf);


NOTE: This does look like verbose multi-line repeat code.

@rtjohnso - I did consider whether to add a packaged macro, say, memfrag_init(), to which you supply the addr / init. Inside the body, we can declare a hidden structures and platform_memfrag *mf, do the setup, and pass-it as the 2nd arg.

Can be done ... probably .. did not try it too hard. Wanted to get this into review, and then I expect anyway to get comments on this approach.

We can re-discuss the coding impact this approach has ... and re-visit during review.

gapisback · 2023-04-14T23:46:43Z

src/platform_linux/laio.c

@@ -287,7 +291,11 @@ io_handle_deinit(laio_handle *io)
   }
   platform_assert(status == 0);

-   platform_free(io->heap_id, io->req);
+   platform_memfrag  memfrag = {.addr = io->req, .size = io->req_size};


NOTE to myself: This should go away. Currently, memfrag_init_size() is a #define and this struct is exposed in platform.h. Rework this so the fields are hidden, and only memfrag_init_size() is exposed to client code.

This will prevent such naked assignments.

gapisback · 2023-04-14T23:47:18Z

src/platform_linux/platform.c

@@ -84,13 +84,16 @@ platform_heap_create(platform_module_id    UNUSED_PARAM(module_id),
   return STATUS_OK;
 }

-void
+platform_status


Percolating errors upstream from platform_shmdestroy; see L91 below.

gapisback · 2023-04-14T23:47:55Z

src/platform_linux/platform.c

   platform_histo_handle hh;
-   hh = TYPED_MANUAL_MALLOC(
+   hh = TYPED_ARRAY_MALLOC(


Equivalent calls.

Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

gapisback · 2023-04-14T23:49:07Z

src/platform_linux/platform.h

   ({                                                                          \
      debug_assert((n) >= sizeof(*(v)));                                       \
      (typeof(v))platform_aligned_malloc(hid,                                  \
                                         PLATFORM_CACHELINE_SIZE,              \
                                         (n),                                  \
+                                         (mf),                                 \


Allocation now receives and returns a platform_memfrag{} *, So, macro's call changes.

gapisback · 2023-04-14T23:50:30Z

src/platform_linux/platform.h

@@ -368,13 +371,13 @@ extern platform_heap_id     Heap_id;
   ({                                                                          \
      debug_assert((n) >= sizeof(*(v)));                                       \
      (typeof(v))platform_aligned_malloc(                                      \
-         hid, (a), (n), STRINGIFY(v), __func__, __FILE__, __LINE__);           \
+         hid, (a), (n), NULL, STRINGIFY(v), __func__, __FILE__, __LINE__);     \


For most of the consumers, this is good-enough. I could have changed this to require all callers to also declare an on-stack platform_memfrag{}, but that would be more code changes.

The one 'minor' issue with this is that we might incorrectly free a smaller-sized fragment . But that's not a huge loss, so I went with current solution.

gapisback · 2023-04-14T23:52:01Z

src/platform_linux/platform.h

+                      "Attempt to free a NULL ptr from '%s', line=%d",         \
+                      __func__,                                                \
+                      __LINE__);                                               \
+      if (IS_MEM_FRAG(p)) {                                                    \


This change is key ... and needs understanding. Please review carefully.

gapisback · 2023-04-14T23:55:06Z

src/platform_linux/platform.h

+         const size_t _reqd =                                                  \
+            (_size + platform_alignment(PLATFORM_CACHELINE_SIZE, _size));      \
+         platform_free_mem((hid), (p), _reqd, STRINGIFY(p));                   \
+         (p) = NULL;                                                           \


@rtjohnso - This line and L907 below is what makes it necessary for clients calling free() to do two things:

platform_memfrag memfrag; platform_memfrag *mf; ... Do the initialization ... platform_free(mf);

I would have liked to skip the mf and simply pass-in &memfrag to free(), but there is a compiler error.

I think this can be fixed with some rework ... but I ran out of energy. Let's review if this can be improved.

gapisback · 2023-04-14T23:57:03Z

src/platform_linux/platform_inline.h

+platform_do_realloc(const platform_heap_id heap_id,
+                    const size_t           oldsize,
+                    void                  *ptr,     // IN
+                    size_t                *newsize, // IN/OUT


Reallocation now returns the *newsize, so clients like writable buffer resize can record the new fragment's size in its buffer_capacity field.

This, then, allows writable_buffer_deinit() to correctly supply the newly realloc'ed fragment's size to free.

gapisback · 2023-04-14T23:58:23Z

src/platform_linux/platform_inline.h

-   void *retptr = (heap_id ? splinter_shm_alloc(heap_id, required, objname)
-                           : aligned_alloc(alignment, required));
+   void *retptr = NULL;
+   if (heap_id == PROCESS_PRIVATE_HEAP_ID) {


Clarified usage of the semantic of heap_id; NULL means process-private-heap-ID, so go thru old malloc()-style code-flow.

gapisback · 2023-04-14T23:59:09Z

src/platform_linux/shmem.c

+   int         frag_allocated_to_pid; // Allocated to this OS-pid
+   int         frag_freed_by_pid;     // OS-pid that freed this fragment
+   threadid    frag_freed_by_tid;     // Splinter thread-ID that freed this
+   int         frag_line;


No changed, indented and aligned fields for readability.

Updated: Nothing changed ... only indentation changes ...

gapisback · 2023-04-15T00:00:47Z

src/platform_linux/shmem.c

-#   define SHM_LARGE_FRAG_SIZE (90 * KiB)
-#else
-#   define SHM_LARGE_FRAG_SIZE (38 * KiB)
-#endif // SPLINTER_DEBUG


Diff in this limit was causing unit-tests to fail in debug builds and pass in release build.

This artifact is leftover from my poc-dev days ... no longer should be needed to separate out limits.

Have stabilized on 32K as lower limit for large fragments in both builds. Moved to shmem.h

gapisback · 2023-04-15T00:01:39Z

src/platform_linux/shmem.c

+typedef struct free_frag_hdr {
+   struct free_frag_hdr *free_frag_next;
+   size_t                free_frag_size;
+} free_frag_hdr;


Used to chain free'd-fragments that are returned to the free-list. This tiny struct lives at the head of each free fragment.

Min frag-size is 64 bytes, so we have room.

gapisback · 2023-04-15T00:02:50Z

src/platform_linux/shmem.c

+ * can print these after shared segment has been destroyed.
+ * ------------------------------------------------------------------------
+ */
+typedef struct shminfo_usage_stats {


Consolidated all metrics, usage-stats into this common struct.

It will be update in-place nested in the shmem control block.

And will also be used to return metrics when shared memory is being dismantled.

Works much better this way!!

gapisback · 2023-04-15T00:04:05Z

src/platform_linux/shmem.c

@@ -198,13 +309,13 @@ platform_shm_hip(platform_heap_id hid)
 static inline void
 shm_lock_mem_frags(shmem_info *shminfo)
 {
-   platform_spin_lock(&shminfo->shm_mem_frags_lock);
+   platform_mutex_lock(&shminfo->shm_mem_frags_mutex);


Better solution.... Otherwise, for some cases of new large_inserts_stress_test, we were simply burning up 100% CPU.

In all workloads, pthread-semaphore is less than 5%, sometimes even 1-2% as seen in perf top.

gapisback · 2023-04-15T00:05:03Z

src/platform_linux/shmem.c

+platform_save_usage_stats(shminfo_usage_stats *usage, shmem_info *shminfo)
+{
+   *usage                          = shminfo->usage;
+   usage->large_frags_found_in_use = platform_trace_large_frags(shminfo);


Much simpler rather than the line-by-line copy that old code from L303 onwards was doing.

gapisback · 2023-04-15T00:06:15Z

src/platform_linux/shmem.c

+   } else {
+      // Try to satisfy small memory fragments based on requested size, from
+      // cached list of free-fragments.
+      retptr = platform_shm_find_frag(shminfo, size, objname, func, file, line);


New find method to find small free fragments, thru its free-list-by-size.

gapisback · 2023-04-15T00:07:31Z

src/platform_linux/shmem.c

+{
+   ((free_frag_hdr *)ptr)->free_frag_next = *here;
+   ((free_frag_hdr *)ptr)->free_frag_size = size;
+   *here                                  = ptr;


Inserting free'd- fragment at the head of its free-list

Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c, with appropriate changes to the build Makefile and test files, to pickup new file.

Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c . Make appropriate changes to the build Makefile and test files, to pickup new file.

Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for upcoming small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() ... No other code-/logic-changes are done with this commit.

Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit.

This commit refactors shared memory usage stats fields to drive-off shminfo_usage_stats{} struct entirely. Add platform_save_usage_stats(), used by platform_shm_print_usage(). This refactoring paves the way for upcoming PR #569 which is adding more memory-usage stats fields.

gapisback · 2024-01-25T06:58:43Z

@rtjohnso - The final part-3 shared memory support change-set is now ready for review.

The suggested order in which to review these diffs is:

src/platform_linux/platform.h
src/platform_linux/platform_inline.h
src/platform_linux/platform_types.h
src/util.h, src/util.c
src/platform_linux/shmem.h
src/platform_linux/shmem.c
src/trunk.h, src/trunk.c
src/routing_filter.c
Then the rest of the files.
Good luck!

rtjohnso · 2024-01-27T07:29:36Z

I think the current memfrag interface is leaky and not general.

I think the interface should look like this:

platform_status
platform_alloc(memfrag *mf, // OUT
               int size);

platform_status
platform_realloc(memfrag *mf, // IN/OUT
                 int newsize);

platform_status
platform_free(memfrag *mf); // IN

void *
memfrag_get_pointer(memfrag *mf);

(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.)

The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road.

As for names, I would advocate renaming memfrag to memory_allocation.

gapisback · 2024-01-27T16:16:14Z

HI, @rtjohnso --

Thanks for your initial approach on reworking the interfaces.

I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction.

I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times.

How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces.

(As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.)

…ject mgmt The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits. Another driving factor for implementing some free-list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point, and we'd frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle). The main design changes to manage small-fragments are follows: Managing memory allocation / free using platform_memfrag{} fragments: - Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to the cacheline. - Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request. - Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket. - Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient for current benchmarking requirements.) A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself. New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size. - Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request. For example, a request from a client for 150 bytes will be rounded-up a cacheline boundary, i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes). - An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory. - All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free(). - In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method. An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'. This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ... - Cautionary Note: If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic <struct> *, and its sizeof() will be used as the 'size' of the fragment to free. This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.) - Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls. Fingerprint Object Management: Managing memory for fingerprint arrays was particularly problematic. This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process. The requirements from the new interfaces are handled as follows: - Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations. - Various accessor methods are added to manage memory for fingerprint arrays through this object. E.g., - fingerprint_init() - Will allocate required fingerprint for 'ntuples'. - fingerprint_deinit() - Will dismantle object and free the memory - fingerprint_start() - Returns start of fingerprint array's memory - fingerprint_nth() - Returns n'th element of fingerprint Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics. - When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments. (Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.) Test changes: Miscellaneous: - Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.) - Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces: - platform_apis_test: - splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly. - Enhanced various diagnostics, asserts, tracing - Improved memory usage stats gathering and reporting - Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list - Add diagram for large-free fragment tracking.

This commit reworks the interfaces on the lines discussed in this discussion thread: #615 void * platform_alloc(memfrag *mf, int size, ...); void * platform_realloc(memfrag *mf, int newsize); void platform_free(memfrag *mf); // IN Currently, the return from `platform_free` still remains as void. Changing it to platform_status will mean plumbing the return handling to all callers. Also C-system call `free()` is defined as `void`. So changing `platform_free` to return platform_status will be a bit inconsistent.

gapisback · 2024-02-01T02:45:43Z

@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set.

You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616 )

gapisback · 2024-02-01T21:33:47Z

@rtjohnso : Fyi -- I want to log this one ASAN-instability the most recent round of CI-jobs ran into, as I am not going to remember all this later.

Here is the state of affairs and results of my investigations.

CI Job no. 109 (main-pr-asan) job failed with this error:

 build/release-asan/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

build/release-asan/bin/driver_test: splinterdb_build_version 9281c83f

Dispatch test splinter_test

Attempt to create shared segment of size 8589934592 bytes.

Created shared memory of size 8589934592 bytes (8 GiB), shmid=8617984.

Completed setup of shared memory of size 8589934592 bytes (8 GiB), shmaddr=0x7f6924570000, shmid=8617984, available memory = 8589894272 bytes (~7.99 GiB).

filter-index-size: 256 is too small, setting to 512

Running splinter_test with 1 caches

splinter_test: SplinterDB performance test started with 1 tables

splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...

Thread 2 inserting  37% complete for table 0 ... =================================================================

==2666==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f68fba16f80 at pc 0x7f6b276fef50 bp 0x7f69044aa080 sp 0x7f69044a9828

READ of size 589 at 0x7f68fba16f80 thread T1

Thread 2 inserting  42% complete for table 0 ... OS-pid=2666, OS-tid=2669, Thread-ID=3, Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

./test.sh: line 115:  2666 Aborted                 "$@"

make: *** [Makefile:558: run-tests] Error 134

Upon re-run this asan job no. 109.1 succeeded.
Attempted to manually re-run this specific test multiple times on my Nimbus-VM,but could not reproduce the ASAN heap-buffer-overflow error. Ran the exact test with different combinations (one run, 4 concurrent runs with the exact same params, 4 concurrent executions with increasing thread-count up to --num-insert-threads 8 --num-lookup-threads 8, and similar stress load on the VM), but could not repro the problem outside CI.

The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)

./driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 8 --num-lookup-threads 8 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts.

NOTE: In the original failure in CI, hard to tell exactly, but it seems like the thread ID 2 ran into the ASAN memory over flow and soon after, thread ID=3 ran into this assertion a few lines later:

OS-pid=2666, OS-tid=2669, Thread-ID=3,  Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion.

I suspect that there is something lurking there that popped up in the CI-run.

I cannot explain how / whether / if this assertion tripping is caused by the ASAN heap-buffer-overflow error or if they are even related. Unfortunately, I could not repro the ASAN issue outside CI, so have to give up on this investigation now.

The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier.

rtjohnso

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I assume once we get these nailed down, then most of the changes in the rest of the code will be relatively straightforward updates to the new apis.

Or is there anything else major?

Let's get the new apis sorted and then I can review the whole PR.

rtjohnso · 2024-01-26T08:39:07Z

include/splinterdb/splinterdb.h

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
 // Close a splinterdb
 //
 // This will flush all data to disk and release all resources
-void
+int


Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

0 on success,

a positive integer when all data has been persisted but not all resources were able to be released, and

a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

rtjohnso · 2024-02-02T00:41:29Z

src/btree.c

@@ -3167,8 +3173,8 @@ btree_pack_loop(btree_pack_req *req,       // IN/OUT
   log_trace_key(tuple_key, "btree_pack_loop (bottom)");

   if (req->hash) {
-      platform_assert(req->num_tuples < req->max_tuples);
-      req->fingerprint_arr[req->num_tuples] =
+      platform_assert(btree_pack_can_fit_tuple(req));


rtjohnso · 2024-02-02T09:28:39Z

src/platform_linux/platform.c

   platform_histo_handle hh;
-   hh = TYPED_MANUAL_MALLOC(
+   hh = TYPED_ARRAY_MALLOC(


Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

rtjohnso · 2024-02-02T09:39:47Z

src/platform_linux/platform.h

-#define TYPED_ALIGNED_ZALLOC(hid, a, v, n)                                     \
+
+#define TYPED_ALIGNED_MALLOC(hid, a, v, n)                                     \
+   TYPED_ALIGNED_MALLOC_MF(&memfrag_##v, hid, a, v, n)


I hate this hard-coding of this naming convention, and I think it doesn't solve an important problem.

Let me propose something that might be more useful.

#define TYPED_ALIGNED_MALLOC_AUTO(hid, a, v, n) \ platform_memfrag v_##memfrag __attribute__(cleanup(platform_free)); TYPED_ALIGNED_MALLOC_MF(&v_##memfrag, hid, a, v, n)

This allocates memory that will automatically get freed when the function exits the scope of the allocation.

Furthermore, it alleviates the user of the responsibility to declare a memfrag at all.

rtjohnso · 2024-02-02T09:41:18Z

src/platform_linux/platform.h

+ * Utility macro to test if an argument to platform_free() is a
+ * platform_memfrag *.
+ */
+#define IS_MEM_FRAG(x)                                                         \


This will go away, right?

rtjohnso · 2024-02-02T09:48:44Z

src/platform_linux/platform.h

+                                       __FILE__,                               \
+                                       __LINE__);                              \
+      _mf->addr = NULL;                                                        \
+      _mf->size = 0;                                                           \
   } while (0)

 // Convenience function to free something volatile
 static inline void
 platform_free_volatile_from_heap(platform_heap_id heap_id,
                                 volatile void   *ptr,


This will change to accept a memfrag, right?

rtjohnso · 2024-02-02T09:49:23Z

src/platform_linux/platform_inline.h

@@ -40,10 +40,11 @@ platform_checksum_is_equal(checksum128 left, checksum128 right)
 static void
 platform_free_from_heap(platform_heap_id UNUSED_PARAM(heap_id),
                        void            *ptr,
+                        const size_t     size,


This will be modified to take a memfrag, right?

rtjohnso · 2024-02-02T09:49:56Z

src/platform_linux/platform.h

-      platform_free_volatile_from_heap(                                        \
-         id, (p), STRINGIFY(p), __func__, __FILE__, __LINE__);                 \
-      (p) = NULL;                                                              \
+      debug_assert(((p) != NULL),                                              \


Once you change platform_free_volatile_from_heap to take a memfrag, you won't need all this type checking.

rtjohnso · 2024-02-02T09:50:56Z

src/platform_linux/platform_inline.h

+   void *retptr = NULL;
+   if (heap_id == PROCESS_PRIVATE_HEAP_ID) {
+      retptr = aligned_alloc(alignment, required);
+      if (memfrag) {


Get rid of the if. Always require a memfrag parameter.

rtjohnso · 2024-02-02T09:51:45Z

src/platform_linux/platform_inline.h

 static inline void
 platform_free_from_heap(platform_heap_id heap_id,
                        void            *ptr,
+                        const size_t     size,


Take a memfrag instead.

gapisback · 2024-02-02T14:27:34Z

@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself to get lots of comments as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, then that will reduce a bunch of rework rounds on me.

Let's get the new apis sorted and then I can review the whole PR.

Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.)

Once I go over all the changes, I will be better able to answer this question of yours:

Or is there anything else major?

... for which the answer now is, I don't think so, off-hand.

rtjohnso · 2024-02-02T18:49:59Z

I left a few comments on the fingerprint array code already.

I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together.

rtjohnso · 2024-02-02T20:42:35Z

I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code.

So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer.

commit 369bf55 Author: Rob Johnson <rob@robjohnson.io> Date: Wed May 22 16:53:01 2024 -0700 Robj/onetrust scriptid (#627) update to Broadcom's OneTrust script ID commit 655988f Author: Rob Johnson <rob@robjohnson.io> Date: Wed May 22 13:16:55 2024 -0700 Robj/deukyeon deadlock (#626) fix several compact_bundle bugs when index node has split commit 05654ab Author: Rob Johnson <rob@robjohnson.io> Date: Sat May 18 17:11:11 2024 -0700 Robj/print node (#625) cleanup trunk printing and add node ids commit 6a4afab Author: Rob Johnson <rob@robjohnson.io> Date: Thu May 16 13:30:23 2024 -0700 Robj/fallocate blockdev fix (#623) don't fallocate on block devices commit ac426e4 Author: Rob Johnson <rob@robjohnson.io> Date: Thu May 16 10:43:48 2024 -0700 Robj/leaf split fix (#624) Fix bug where a leaf split that resulted in a single leaf would cause an assertion failure in a subsequent compaction. The cause of the bug was that the compaction code detected node splits by checking whether the upper-bound key of the node had changed since the compaction request was created. However, in a leaf split into 1 leaf, the upper bound key doesn't change, but the leaf does get rebundled, meaning that the bundle in the compaction request is no longer live. This meant that the compaction thought it was seeing a dead bundle in a node that had not split. This should only occur if the bundle gets flushed from a parent to a child between the enqueuing of the compaction and its execution. But that can only occur if the node is not a leaf. Hence the compaction code asserted that the node was not a leaf in this case. To fix the bug, we improve the way node splits are detected. Every node has a unique id in its header. IDs change whenever a node is split, which makes detecting splits trivial. commit a46705b Author: Rob Johnson <rob@robjohnson.io> Date: Wed Apr 17 14:10:54 2024 -0700 Fix O_DIRECT and multi-threaded io_context issues (#621) * enable --set-O_DIRECT flag in large_inserts_stress_test * workaround linux aio oddity: with O_DIRECT, io_getevents with a NULL timeout may still return 0 even though there are in-flight IOs. * switch to per-process io contexts commit 9359c9a Author: Rob Johnson <rob@robjohnson.io> Date: Thu Feb 29 08:38:51 2024 -0800 Robj/badge (#619) add badge to README.md commit 77f8fc9 Author: Rob Johnson <rob@robjohnson.io> Date: Wed Feb 28 01:19:01 2024 -0800 Set up github actions CI (#617) * fix some memory leaks * fix some build-system bugs and add a memory-safety assert * remove databases during tests to reduce disk usage * Create run-tests.yml commit 2eacefa Author: Aditya Gurajada <gapisback@gmail.com> Date: Fri Jan 26 15:36:52 2024 -0800 (#599) Replace calls to getpid() with platform'ized platform_getpid() This commit cleans-up the `#include <unistd.h>` from several .c files that needed access to get pid(). Instead, change the code to invoke platform_getpid(), now defined in platform_inline.h commit 23f5e4f Author: Aditya Gurajada <agurajada@vmware.com> Date: Wed Jan 24 05:55:16 2024 -0800 Refactor to save / print shmem usage stats using common struct. This commit refactors shared memory usage stats fields to drive-off shminfo_usage_stats{} struct entirely. Add platform_save_usage_stats(), used by platform_shm_print_usage(). This refactoring paves the way for upcoming PR #569 which is adding more memory-usage stats fields. commit 4c0a225 Author: Aditya Gurajada <agurajada@vmware.com> Date: Sun Dec 10 11:49:54 2023 -0800 shmem.c: Rename variables relating to large-fragment handling. Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit. commit 634df21 Author: Rob Johnson <rob@robjohnson.io> Date: Wed Jan 24 04:24:36 2024 -0800 Robj/fallocate (#601) catch fallocate failure commit ef74125 Author: Aditya Gurajada <agurajada@vmware.com> Date: Sat Dec 9 22:30:15 2023 -0800 (#604) Cache page-size in local variables to reduce multiple lookups. In debug-build test runs, profiling shows interfaces like clockcache_page_size(), clockcache_config_page_size() bubbling up to the top of 'perf top' output. This commit replaces multiple calls to lookup functions that retrieve the page-size by caching the page-size once per function where it's used multiple times. The affected interfaces are: btree_page_size(), clockcache_page_size(), cache_page_size(), cache_config_page_size(), trunk_page_size() and a few similar ones. These changes add up to saving few seconds of test-execution (out of few mins of run-time) in debug-build mode, esp for BTree-related tests. commit 3e259ca Author: Aditya Gurajada <agurajada@vmware.com> Date: Sun Dec 10 07:58:21 2023 -0800 test.sh: Enable running by named-function w/o shared memory. This commit adds minor improvements / bug-fixes to test.sh: - The capability to run individual test-function(s) by name was not working without the "--use-shmem" flag. Rework the parameter parsing in all test-functions so that we can now invoke this driver script as: $ INCLUDE_SLOW_TESTS=true ./test.sh run_slower_unit_tests This will run individually named test-functions with default memory configuration. - run_other_driver_tests() had a bug where tests run by this function were not honoring '--use-shmem' arg. Fix this so that cache_test, log_test, filter_test can now also be run with "--use-shmem" enabled. - Introduce Use_shmem global to parse-out --use-shmem arg. Rework minion test-functions to drive off of global variable. - Update elapsed-time tracking to separately track the test execution run-times w/o and w/ shared memory configured. commit c340a0e Author: Aditya Gurajada <agurajada@vmware.com> Date: Sun Dec 10 06:43:01 2023 -0800 Rename large_inserts_bugs_stress_test.c -> large_inserts_stress_test.c Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c . Make appropriate changes to the build Makefile and test files, to pickup new file. commit ae4636f Author: Aditya Gurajada <agurajada@vmware.com> Date: Thu Dec 7 15:08:05 2023 -0800 CI: Bump timeout from 3h to 4h. shmem-tests cause debug test runs to take longer. After addition of new large_inserts_stress_test, being done as part of PR #569 (free/memory mgmt support for shared memory), CI-debug jobs are timing out at current timeout=3h. Bump this timeout limit to 4h, to see if test-jobs complete. commit c169d5e Author: Gabe Rosenhouse <grosenhouse@vmware.com> Date: Mon Nov 20 15:51:24 2023 -0800 Another fix for an old CI problem (#602) This field is documented as optional and defaulting to "latest" [0] but for some reason, setting it explicitly seems to matter [1]. [0]: https://github.com/concourse/registry-image-resource#source-configuration [1]: https://vmware.slack.com/archives/CEUC18KQA/p1689754476227109?thread_ts=1689753436.440569&cid=CEUC18KQA commit c64f68d Author: Aditya Gurajada <agurajada@vmware.com> Date: Mon Oct 10 15:22:07 2022 -0700 Support for multi-process execution, with processes using shared memory This commit extends core shared memory support to now allow for a multi-process execution model, where multiple processes can now attach to Splinter shared memory. Core thread-specific concurrency primitives are modified, slightly, to now also support a multi-process execution model. - This commit sets up the stage to support fork()'ed or other OS-processes running with --use-shmem option, where each process will [in future] masquerade as a Splinter thread. A core change needed to move to that execution model is to support thread-specific IO-context structures. Otherwise, if an/other OS-process tries to do IO using AIO-context established by the main thread (i.e. by the process that started up SplinterDB), we will immediately run into hard IO-system call errors. This commit: - Performs an io_setup() for each thread / process - Manages this AIO-context tightly bound to Splinter's thread context - Does required book-keeping to keep this IO-context state kosher in conjunction with thread registration / de-registration. - Updates existing io_apis_test to deal with thread-specific IO context handles. An alternative could be localize this change-in-behaviour (of setting up thread-specific IO-context structs) only when the process-model of execution comes around. That execution model requires configuring SplinterDB with shared-memory support. But, just by looking at --use-shmem (or corresponding config setting), we cannot be sure that the process-model will be used or if we are just re-running rest of the test suites with shared-segment enabled. So, without trying to further complicate this choice-making, with this commit we will always set up thread-specific AIO-context structures. Collection of lower-level changes to move to this execution model: - platform_buffer_init() that mmap()s' memory for the buffer cache will now use MAP_SHARED (v/s MAP_PRIVATE). The issue is that some parts of structures, e.g. buffer cache, are allocated using mmap(). The flags for this were MAP_PRIVATE, which means this memory is only accessible to the main process that set up Splinter. All child threads work on a COW-version of this mapped memory. So the changes done by the child process to the BTree in the buffer cache are not visible to the parent process. - Convert synchronization primitives to be shared across processes. This commit reworks core synchronization APIs to use interfaces that allow the sync-hook across child processes. This affects: - platform_mutex_init() - platform_spinlock_init() - platform_condvar_init() - platform_semaphore_init() - Add corresponding API-exerciser unit-tests for sanity coverage - Now that we have thread-specific IO-context setup, as part of thread register / deregister, we now also do io_register_thread(), io_deregister_thread(). This is basically book-keeping state of the thread w.r.t IO setup & context. Testing changes added: - Support --fork-child to test execution options. Some new tests will honor this argument, and will exercise activity using a forked-process execution model. - New test splinterdb_forked_child_test added: This covers the cases to show that IO errors could be repro'ed when running Splinter activity from a forked child process. Many other cases are added to this framework to exercise different cases of forked process doing SplinterDB activity. Much code/dev stabilization was achieved through this single new test. - Add case test_seq_key_seq_values_inserts_forked to large_inserts_stress test. - Existing functional io_apis_test to run with --fork-child option, thereby creating the scenario(s) of forked processes exercising the basic IO APIs. - Add new & extended tests to test.sh, for extended coverage using shared-memory and multi-process execution. - Add support for --wait-for-gdb and wait_for_gdb_hook() function. To debug forked child processes, add support for new command-line flag: --wait-for-gdb . And add a looping function where we can set a breakpoint, wait_for_gdb_hook(). Use this facility in splinterdb_forked_child_test.c, which has helped debug errors seen while running test_multiple_forked_process_doing_IOs(). Changes arising from review comments: Mostly cleanup: - splinterdb.c: Redefine testing-accessor methods to return correct <data type> *, rather than void * - platform.c: Clean-up error handling in platform_condvar_init(), using goto labels. Add missing pthread_mutex_destroy() in one case. - platform_shmcreate() will now return heap-ID as start address of allocated shared segment. Adjust platform_heap_id_to_shmaddr() appropriately. - rename tests/splinterdb_test_apis.h -> src/splinterdb_tests_private.h - Rename test config 'num-forked-processes' -> 'num-processes'. Adjust tests accordingly. commit 42799b1 Author: Aditya Gurajada <agurajada@vmware.com> Date: Mon Sep 26 09:45:25 2022 -0700 Core changes to support running Splinter with allocated shared memory. Support to run SplinterDB with shared memory configured for most memory allocation is an -EXPERIMENTAL- feature added with this commit. This commit brings in basic support to create a shared memory segment and to redirect all memory allocation primitives to shared memory. Currently, we only support a simplistic memory mgmt; i.e. only-allocs, and a very simplistic handling of free() of the very last memory piece allocated. With shared segments of 1-2 GiB we can run all functional and unit tests. The high-points of the changes are: - External configuration: splinterdb_config{} gains a few new visible fields to configure and troubleshoot shared memory configuration. - Boolean: use_shmem: Default is OFF - size_t : shmem_size: - The main driving change is the re-deployment of platform_heap_id 'hid' arg that appears in all memory-related interfaces. If Splinter is configured for shared memory use, 'hid' will be an opaque handle to the shared segment. Most memory allocation will be redirected to new shmem-based alloc() / free() interfaces. - Formalize usages of PROCESS_PRIVATE_HEAP_ID: A small number of clients that wish to repeatedly allocate large chunks of memory tend to cause OOMs. The memory allocated by these clients is not shared across threads / processes. For such usages, introduce PROCESS_PRIVATE_HEAP_ID as an alias to NULL, defaulting to allocating memory from the heap. - Manage handling of heap-ID to platform_get_heap_id() to correctly return the handle to shared memory. (Otherwise, it would return NULL by default.) - BTree pack allocates large fingerprint-array. This also causes large tests to run into OOMs. For threaded execution, it's ok if the memory for this array is allocated from the heap. But for multi-process execution, when one process (thread) allocates this finger print array, another thread may pick up the task to compact a bundle and will try to free this memory. So, this memory has to come from shared memory. To cope with such repeated allocations of large chunks of memory to build fingerprint, a small scheme for recycling such "free"-large-memory chunks scheme is supported by shmem module. Applied this technique to recycle memory allocated for iterators also. They tend to be big'gish, so can also cause shmem-OOMs. - All existing functional and unit-tests have been enhanced to now support "--use-shmem" argument. This will create Splinter with shared memory configured, and tests are run in this mode. This change brings-in quite a good coverage of existing testing for this new feature. - New test: large_inserts_bugs_stress_test -- added to cover the primary use-case of concurrent insert performance benchmarking (that this feature is driving in prior integration effort). - test.sh enhanced to run different classes of test with the "--use-shmem" option. - Diagnostis & Troubleshooting: - Shmem-based alloc/free interfaces extended to print name of object and other call-site info, to better pinpoint source code-flow leading to memory issues. - Add shared memory usage metrics, including for large-fragment handling. Report summary-line of metrics when Splinter is shutdown. Print stats on close. - Add various utility diagnostic helper methods to validate that addresses within shared memory are valid. Unit-tests and some asserts use these. - minor #include cleanups Changes arising through review cycle and stabilization v/s /main: - In test.sh/run_slower_unit_tests(), re-enable execution of large_inserts_bugs_stress_test, but bracketted under "set +e" / "set -e" settings. If this test fails in CI (as it does randomly), hopefully, this SET toggling will allow the rest of the script to still run. CI job should not fail immediately. (Some deeper stabilization is needed for these test cases.) - Purged the heap_handle * in shmem.h/.c module and through the rest of the Splinter code. Only heap-ID is a valid handle anymore. - Fix race condition bug in platform_shm_alloc() - Added Micro-optimization to recycle last-allocated frag being freed. - Add config_parse_use_shmem() as parsing interface to see if "--use-shmem" was supplied. Apply to many unit-/functional-tests. Rework unit-tests to use config_parse_use_shmem() to support --use-shmem parsing. Re-enable large_inserts_bugs_stress_test execution. commit 2fb4d7c Author: Deukyeon Hwang <deukyeon@users.noreply.github.com> Date: Tue Aug 15 17:19:03 2023 -0700 Fix the compile error on platform_open_log_file() (#596) commit 4679bb7 Author: Rob Johnson <rob@robjohnson.io> Date: Tue Aug 15 16:15:54 2023 -0700 remove btree rough count stuff, since it is unused (#594) commit 3bf7023 Author: Rob Johnson <rob@robjohnson.io> Date: Wed Jul 26 18:53:26 2023 -0700 Bidirectional Iterators (#588) * tweak iterator api to make it easier to add bidirectionality * debugging btree reverse iteration * reduce time we hold locks during btree_split_child_leaf * further refine the locking in btree node splits and fix reverse iterator bug * btree iterator init at key other than min and add btree_iterator_seek * splinterdb_iterator_prev implmentated and working * clang formatting * improve the trunk iterator logic * corrections for pull request * more pull request fixes * assert fix * more pull request feedback * iterator stress test, bug fixes, formatting * final bit of pr feedback * formatting --------- Co-authored-by: Evan West <evan.ts.west@gmail.com> commit 950df20 Author: Rob Johnson <rob@robjohnson.io> Date: Tue Jul 25 16:27:33 2023 -0700 allow merge callbacks to be NULL (#577) In that case, splinterdb_update() is not supported. commit a7547cd Author: deukyeon <deukyeon@users.noreply.github.com> Date: Tue Jul 25 15:15:05 2023 -0700 (#580) Cleanup some bool stuff (#584) * Add the header file for _Bool * converting bool to bool32 * move stdbool include --------- Co-authored-by: Rob Johnson <rob@robjohnson.io> commit a3e9469 Author: deukyeon <deukyeon@users.noreply.github.com> Date: Mon Jul 24 18:02:50 2023 -0700 Set the addresses of log for super block if it is. (#582) Co-authored-by: Alex Conway <aconway@vmware.com> commit 3cec342 Author: Gabe Rosenhouse <grosenhouse@vmware.com> Date: Mon Jul 24 17:12:20 2023 -0700 CI fix: use new version of registry-image resource (#593) commit d2e8369 Author: Evan West <evan.ts.west@gmail.com> Date: Thu Jul 13 20:22:03 2023 +0000 fix formatting in trunk.c commit fad27b5 Author: Evan West <evan.ts.west@gmail.com> Date: Fri Jul 7 18:38:55 2023 +0000 remove local_max_key and fix filter assertion commit b6dafdf Author: Gabe Rosenhouse <grosenhouse@vmware.com> Date: Mon Jun 26 11:13:59 2023 -0700 CI: switch PR resource to maintained one (#587) commit 6a2348c Author: Rob Johnson <rob@robjohnson.io> Date: Sun Apr 30 13:04:41 2023 -0700 Robj/memtable race fix (#574) * Memtable Generation Bugfix Fixes a bug where memtable_maybe_rotate_and_get_insert_lock would speculatively increment the memtable generation even when the next memtable was not yet ready. This would cause concurrent lookup threads to attempt to access that memtable, resulting in errors. This fix requires the insert threads to wait until the next memtable is ready before finalizing the current one. * abstract memtable and trunk root-addr locking apis --------- Co-authored-by: Alex Conway <aconway@vmware.com> commit 8c639a0 Author: Rob Johnson <rob@robjohnson.io> Date: Mon Apr 24 21:47:01 2023 -0700 fix next_req node-split bug in trunk (#575) commit 1e8f790 Author: deukyeon <deukyeon@users.noreply.github.com> Date: Mon Apr 24 19:36:07 2023 -0700 (#546) Fix the segmentation fault after splinterdb_stats_reset() (#547) Previously, when trunk_stats_reset() was called by splinterdb_stats_reset(), the entire statistics of a trunk, including the histogram handles, were reset to zero. commit fa990cf Author: Alex Conway <aconway@vmware.com> Date: Tue Dec 20 21:19:00 2022 +0000 Copy-on-Write Trunk This changeset implements copy-on-write for trunk nodes, which includes several high-level changes. This PR still needs to be rebased onto main, but the the purpose is to discuss high- and low-level design decisions. Changes in this PR: Trunk root lock. A distributed RW lock is used to access/change the current root. Flush from root. Flushes proceed from the root and cascade immediately rather than being triggered at the beginning of trunk_compact_bundle. Copy-on-write. Trunk nodes cannot be modified directly, and instead are change via a copy-on-write of the root-to-node path together with a change of the root node. Garbage Collection for unlinked branches and filters. After a copy-on-write, the nodes on the old path will be unreferenced. This PR does not GC the trunk nodes themselves, but it includes a GC path to dereference the replaced branches and filters. platform_batch_rwlock. Replaces distributed locks using dummy cache pages with a batched distributed RW lock implementation in platform.[ch]. commit b5a283b Author: Gabe Rosenhouse <grosenhouse@vmware.com> Date: Sat Apr 22 20:10:34 2023 -0700 Update CONTRIBUTING.md to describe ok-to-test PR label (#573) commit d991bac Author: Gabe Rosenhouse <grosenhouse@vmware.com> Date: Thu Apr 20 14:25:56 2023 -0700 CI requires an "ok-to-test" label before running PRs (#572) commit 77ab353 Author: gapisback <89874928+gapisback@users.noreply.github.com> Date: Thu Apr 20 14:12:23 2023 -0700 CI: Bump timeout from 2h to 3h. shmem-tests cause debug test runs to take longer. (#566) In-flight stabilization of shared memory support in Splinter is bringin along with tons more additional tests. We are effetively running most of the existing twice; once w/o and once w/ shared memory configured. Debug-build test runs are timing out at 2 hours. Bump timeout to 3hs, and once stabilized, we can look into dropping this to 2h. commit b2245ac Author: Aditya Gurajada <agurajada@vmware.com> Date: Wed Apr 5 16:29:13 2023 -0700 Fix bug in output formatted by size_to_str() helper. Fractional portion of value formatted by size_to_str() was incorrect. We were losing the scale for things which were supposed to be "xx.07", we were reporting "xx.7", which is incorrect. commit 8a04854 Author: Aditya Gurajada <agurajada@vmware.com> Date: Mon Mar 20 15:26:29 2023 -0700 (#548) Use _Bool for boolean fields in external config struct. In SplinterDB's public splinterdb_config{} config, we have few fields defined as 'bool' which is typedef'ed to int32 on our side. This creates compatibility problems when linking this library with other s/w which may have defined 'bool' as 1-byte field. (Offsets of fields in the splinterdb_config{} struct following 1st field defined as 'bool' changes across dot-oh's.) This commit slightly adjusts the typedefs of boolean fields in external structs to now use _Bool. This should reduce the risk of such incompatibilities. Change return type of methods in public_util.h to _Bool . Relocate typedef int32 bool to private platform_linux/platform.h so it's used only on Splinter-side. Cleaned-up few instances around use of bool type for code hygiene: - Minor adjustment to routing_filter_is_value_found() returning bool. - Stray references to use of 0/1 for boolean values with FALSE/TRUE. commit f3c92ef Author: Aditya Gurajada <agurajada@vmware.com> Date: Tue Apr 4 15:51:05 2023 -0700 (#561) Fix bug in routing_filter_prefetch(), causing assertion to trip. This commit fixes a simple arithmetic error in routing_filter_prefetch() while computing next page's address. The bug results in a debug-assert in clockcache_get_internal(), or an unending hang in clockcache_get() code-flow using release binary. A new test case test_issue_458_mini_destroy_unused_debug_assert has been added which reproduces the problem. However, this case still runs into another failure (being tracked separately), so this case is currently being skipped. commit 89f09b3 Author: Gabriel Rosenhouse <grosenhouse@vmware.com> Date: Wed Mar 22 20:22:05 2023 -0700 CI: use gcc for ASAN jobs commit 9037ebe Author: Aditya Gurajada <agurajada@vmware.com> Date: Wed Mar 22 10:54:35 2023 -0700 (#554) Fixes to get couple of tests running cleanly in ASAN-builds This commit fixes minor errors in 2 tests (io_apis_test, filter_test) to get them running cleanly in ASAN-builds. commit ea7203a Author: Aditya Gurajada <agurajada@vmware.com> Date: Wed Mar 22 14:25:29 2023 -0700 (#554) Enhance test.sh to run a sub-set of tests named by their driving function. This commit now allows running as "test.sh <fn-name>" interface, where the name of the driving function executing a batch of tests can be run independently, without having to go through full execution of all tests. This helps developers shorten their fix-dev-test cycle, especially when validating quick-fixes for long-running tests, like ASAN / MSAN builds. commit a5c821c Author: Gabriel Rosenhouse <grosenhouse@vmware.com> Date: Tue Mar 21 11:00:01 2023 -0700 CI: temporarily cover the shmem branch revert this once it merges commit 5dd7535 Author: Gabriel Rosenhouse <grosenhouse@vmware.com> Date: Tue Mar 21 10:54:58 2023 -0700 CI: run msan and asan tests on all PRs commit 98f5ca1 Author: Gabriel Rosenhouse <grosenhouse@vmware.com> Date: Tue Mar 21 10:59:29 2023 -0700 CI: fixup for multi-branch work commit 6be0461 Author: Gabriel Rosenhouse <grosenhouse@vmware.com> Date: Mon Mar 20 15:16:00 2023 -0700 CI: refactor config to enable coverage of multiple branches commit d9fcc40 Author: Aditya Gurajada <agurajada@vmware.com> Date: Mon Jan 9 17:55:52 2023 -0800 Identify Memtable v/s Branch page types via BTree-print routines. This commit extends BTree-print routines to also report the page type, whether it's a branch or a memtable BTree. As the structures and print methods are shared between two objects, this extra information will help in diagnostics. Trunk nodes are likewise identified. Extend btree_print_tree() to receive page_type arg. Minor fix in trunk_print_pivots() to align outputs for pivot key's string. commit 2ea30f4 Author: Aditya Gurajada <agurajada@vmware.com> Date: Fri Jan 6 12:54:11 2023 -0800 (#500) Move hook-related global vars to task_system{} struct. This commit removes the dependency of task system structures on global variables declared in task.c . The hook-related variables are now moved as members of the task_system{} struct. This removes accessing potentially stale values when task-system is destroyed and re-created. Also, TASK_MAX_HOOKS is now decreased from 8 to 4. This change largely has no functional impact, and is mostly a test-stabilization fix. commit 4b6e0b1 Author: Aditya Gurajada <agurajada@vmware.com> Date: Mon Feb 6 17:18:13 2023 -0800 Add initial support for message logging levels, used in unit-tests This commit does some clean-up and normalizes the behaviour of interfaces to control outputs from C-unit tests. The main goal is to reduce voluminous output generated by few unit-tests that exercise print diagnostic code (which otherwise crashes browsers when viewing test-run outputs in CI). An additional benefit of this rework is that we now have a way to run unit-tests to see output generated at different verbosity levels. - By default, unit test execution remains silent and only error messags will be printed. ctests' main() takes care of setting this up. - set_log_streams_for_tests() becomes the single-interface that unit test code has to invoke, when needed to change the test output's verbosity level. - Small collection of MSG_LEVEL_ levels added to ctest.h Test execution examples: Run with env-var to see diff outputs: VERBOSE=0 (or unset env-var): Default; silent output VERBOSE=3 : See error messages VERBOSE=6 : See info and error messages VERBOSE=7 : See all messages; mainly intended to collect debug output commit 1f09113 Author: gapisback <89874928+gapisback@users.noreply.github.com> Date: Thu Jan 19 13:42:32 2023 -0800 Fix-up indentation of multi-line comments to conform to coding standards. (#535) This commit fixes up several comments in btree.c to conform to the style we have followed elsewhere for multi-line comments. No code logic changes are done with this commit. commit caeaeb1 Author: Aditya Gurajada <agurajada@vmware.com> Date: Fri Dec 23 10:26:27 2022 -0800 (#513) Add set_log_streams_for_*() fns to manage unit-test outputs. This commit refactors existing chunks of code that exists in different unit- test sources to manage output file handles to a common function. This is defined in new file unit/unit_tests_common.c ; set_log_streams_for_tests() Tests that check error-raising behaviour will now need to call set_log_streams_for_error_tests(), to manage output streams. Minor correction to TEST_DB_NAME; change it to conform to the r.e. defined in .gitignore to suppress listing this in 'git status' output. commit 7e85a29 Author: Aditya Gurajada <agurajada@vmware.com> Date: Thu Dec 22 10:44:04 2022 -0800 (#511) Add fns to print 'size' as human-readable string w/ unit-specifiers This commit adds couple of utility functions to snprintf(), in an output buffer, the 'size' unit to a human-readable string with unit-specifiers. - size_to_str() - Convert 'size' to a string in an output buffer - size_to_fmtstr() - Same as above, using user-specified format-string. Useful to generate output enclosed in, e.g., '(%s)'. - Add size_str(), size_fmstr() caller-macros to simplify calling these formatting functions. These macros declare on-stack buffers used to format the output string. size_str() provided by Rob Johnson, to greatly simplify the usage. Add utility bytes-to-Units conversion macros. Add unit tests to exercise these interfaces. Apply these utility fns in couple of stats-printing and BTree print-methods, to display size values as human-friendly unit specifiers. commit 84484df Author: Aditya Gurajada <agurajada@vmware.com> Date: Fri Jan 6 10:47:16 2023 -0800 (#499) Minor cleanup of INVALID_TID, and MAX_THREADS in task-system This commit applies some minor clean-up to task system as a follow-on to the larger rework done under PR #497. Consistently use STATUS_BUSY, as a way to report when all concurrent threads are in-use. Minor changes are done to task system unit-test code & cleanup of comments. commit 60c7910 Author: Aditya Gurajada <agurajada@vmware.com> Date: Wed Dec 21 16:32:08 2022 -0800 (#507) Rework of platform_buffer_create()/destroy() to init/deinit() interfaces. This commit reworks the buffer_handle{} interfaces to now become platform_buffer_init() and platform_buffer_deinit(). Structures that need a buffer_handle{}, declare a nested sub-struct, which will go through this init / deinit interface to allocate / free memory using existing mmap() interfaces. This removes the need for an input 'heap_id / heap_handle' arg to allocate and free memory. This change does not functionally change anything in these methods. Added small unit-test, platform_apis_test, to exercise these changes. Cleanup structures and fns that used to take 'heap_handle *' which now become unused with this rework. Tighten up backout / error handling in clockcache_init() and deinit() code-flow. Co-authored by Rob Johnson, who reworked the entire interfaces as implemented above, to remove the dependency on 'hid' argument.

vmwclabot added the cla-not-required label Apr 14, 2023

gapisback commented Apr 14, 2023

View reviewed changes

gapisback commented Apr 15, 2023

View reviewed changes

gapisback mentioned this pull request Jan 24, 2024

shmem.c: Rename variables relating to large-fragment handling. #609

Merged

gapisback mentioned this pull request Jan 24, 2024

Refactor to save / print shmem usage stats using common struct. #610

Merged

gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 6c24747 to 33a95f2 Compare January 25, 2024 06:45

gapisback removed request for rosenhouse and ajhconway January 25, 2024 07:01

This was referenced Jan 25, 2024

Stabilization changes for platform_free() and free-list fragment management. #565

Closed

Rework fingerprint array as derived object from memfrag{} object. #564

Closed

gapisback mentioned this pull request Jan 31, 2024

Stabilization of part-3 of shared-memory support, based on review comments. #616

Draft

gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 3588f3e to 9281c83 Compare February 1, 2024 02:42

rtjohnso requested changes Feb 2, 2024

View reviewed changes

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Are you sure you want to change the base?

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Conversation

gapisback commented Apr 14, 2023 • edited Loading

Design Overview

Managing memory allocation / free using platform_memfrag{} fragments

Cautionary Note

Fingerprint Object Management

Test changes

netlify bot commented Apr 14, 2023 • edited Loading

✅ Deploy Preview for splinterdb canceled.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback Apr 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback Apr 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback commented Jan 25, 2024

rtjohnso commented Jan 27, 2024

gapisback commented Jan 27, 2024

gapisback commented Feb 1, 2024

gapisback commented Feb 1, 2024 • edited Loading

rtjohnso left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback commented Feb 2, 2024

rtjohnso commented Feb 2, 2024

rtjohnso commented Feb 2, 2024

gapisback commented Apr 14, 2023 •

edited

Loading

Managing memory allocation / free using `platform_memfrag{}` fragments

netlify bot commented Apr 14, 2023 •

edited

Loading

gapisback Apr 14, 2023 •

edited

Loading

gapisback Apr 14, 2023 •

edited

Loading

gapisback commented Feb 1, 2024 •

edited

Loading