partition_allocator: Make topic_memory_per_partition memory group aware
Previously `topic_memory_per_partition` was little more than a rough guess,
and its value was far too large. This was partly because we lacked a better
estimate, but also because, obviously, not all memory can be used by
partitions.

After some analysis via metrics and the memory profiler we now have a much
better idea of the real value. Hence, we lower the value aggressively.

At the same time we make the partition_allocator memory limit check
memory group aware. It no longer compares `topic_memory_per_partition`
against the total memory as if partitions could use all of it. Instead it
compares against the memory reserved for partitions via the memory groups.

We use some heuristics (explained in more detail in the code comment) to
guard against cases where we would make partition density worse.

The new defaults are:

 - topic_memory_per_partition: 200KiB (this already assumes some
   to-be-merged optimizations in the seastar metrics stack). It's still
   fairly conservative; the real value is probably closer to 150KiB.
 - topic_partitions_max_memory_allocation_share: 10% - together with the
   above this effectively doubles the partition density compared to the
   old calculation with 4MiB for TMPP (see the worked example below).
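
As a rough illustration of the density change (assuming an 8GiB shard;
these numbers are not from this commit):

  old limit:  8GiB / 4MiB           = 2048 partitions
  new limit: (8GiB * 10%) / 200KiB  = ~4194 partitions

i.e. roughly double the density even though only 10% of memory is now
reserved for partitions.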
StephanDollberg committed Dec 18, 2024
1 parent b76ba02 commit f820490
Showing 11 changed files with 277 additions and 40 deletions.
14 changes: 14 additions & 0 deletions src/v/cluster/BUILD
@@ -215,6 +215,18 @@ redpanda_cc_library(
],
)

redpanda_cc_library(
name = "topic_memory_per_partition_default",
hdrs = [
"scheduling/topic_memory_per_partition_default.h",
],
include_prefix = "cluster",
visibility = ["//visibility:public"],
deps = [
"//src/v/base",
],
)

# TODO the following headers are the only headers in redpanda which are excluded
# from clang-tidy. if you remove the exclusion then you'll observe a tangled web
# of header dependencies. after many hours those have not yet been resolved, so
@@ -617,7 +629,9 @@ redpanda_cc_library(
"version.h",
],
implementation_deps = [
":topic_memory_per_partition_default",
"//src/v/features:enterprise_feature_messages",
"//src/v/resource_mgmt:memory_groups",
],
include_prefix = "cluster",
visibility = ["//visibility:public"],
98 changes: 81 additions & 17 deletions src/v/cluster/scheduling/partition_allocator.cc
@@ -15,12 +15,14 @@
#include "cluster/logger.h"
#include "cluster/members_table.h"
#include "cluster/scheduling/constraints.h"
#include "cluster/scheduling/topic_memory_per_partition_default.h"
#include "cluster/scheduling/types.h"
#include "cluster/types.h"
#include "config/configuration.h"
#include "features/feature_table.h"
#include "model/metadata.h"
#include "random/generators.h"
#include "resource_mgmt/memory_groups.h"
#include "ssx/async_algorithm.h"
#include "utils/human.h"

@@ -79,6 +81,53 @@ allocation_constraints partition_allocator::default_constraints() {
return req;
}

bool guess_is_memory_group_aware_memory_limit() {
// Here we are trying to guess whether `topic_memory_per_partition` is
// memory group aware.
//
// Originally the property was chosen such that, with
// `topic_partitions_per_shard` maxed out, we wouldn't run into OOM issues on
// hosts with 4 or 8GiB per shard.
//
// These days, based on memory profiler and metrics analysis, we have a better
// understanding of how much memory a partition actually uses at rest.
// Hence we now set it to a more realistic value in the XXXKiB range.
//
// At the same time we also made the memory_groups reserve a percentage of
// the total memory space for partitions. Instead of dividing the whole
// memory space by `topic_memory_per_partition`, we now divide only the
// reserved amount.
//
// Because of the latter, making this change isn't trivial. Users might have
// lowered the `topic_memory_per_partition` property to gain higher
// partition density. If we only compared against the reserved memory space
// when checking memory limits, such users could actually end up with less
// partition density. Hence, we need to guess whether the property was set
// before this change or after. If it was set before, we just use the old
// rules (while still reserving a part of the memory) when checking memory
// limits so as not to break existing behaviour.
//
// Note that getting this guess wrong isn't immediately fatal. It only
// becomes relevant when creating new topics or modifying existing topics,
// and at that point users can increase the memory reservation if needed.

// not overridden, so definitely memory group aware
if (!config::shard_local_cfg().topic_memory_per_partition.is_overriden()) {
return true;
}

auto value = config::shard_local_cfg().topic_memory_per_partition.value();

// The previous default was 4MiB. The new one is more than 10x smaller. We
// assume it's unlikely someone would have lowered the value that far and
// hence guess that any value larger than twice the new default is an old
// value. Equally, in the new world it's unlikely anybody would increase
// the value (at all, really).

return value < 2 * ORIGINAL_MEMORY_GROUP_AWARE_TMPP;
}

std::error_code partition_allocator::check_memory_limits(
uint64_t new_partitions_replicas_requested,
uint64_t proposed_total_partitions,
@@ -88,23 +137,38 @@ std::error_code partition_allocator::check_memory_limits(
auto memory_per_partition_replica
= config::shard_local_cfg().topic_memory_per_partition();

if (
memory_per_partition_replica.has_value()
&& memory_per_partition_replica.value() > 0) {
const uint64_t memory_limit = effective_cluster_memory
/ memory_per_partition_replica.value();

if (proposed_total_partitions > memory_limit) {
vlog(
clusterlog.warn,
"Refusing to create {} new partitions as total partition count "
"{} "
"would exceed memory limit {}",
new_partitions_replicas_requested,
proposed_total_partitions,
memory_limit);
return errc::topic_invalid_partitions_memory_limit;
}
if (!memory_per_partition_replica.has_value()) {
return errc::success;
}

bool is_memory_aware = guess_is_memory_group_aware_memory_limit();

uint64_t partition_memory;
if (is_memory_aware) {
partition_memory = effective_cluster_memory
* memory_groups().partitions_max_memory_share();
} else {
partition_memory = effective_cluster_memory;
}

uint64_t memory_limit = partition_memory
/ memory_per_partition_replica.value();

if (proposed_total_partitions > memory_limit) {
vlog(
clusterlog.warn,
"Refusing to create {} new partitions as total partition count "
"{} "
"would exceed memory limit of {} partitions. Cluster partition "
"memory: {} - "
"required memory per partition: {} - memory group aware: {}",
new_partitions_replicas_requested,
proposed_total_partitions,
memory_limit,
partition_memory,
memory_per_partition_replica.value(),
is_memory_aware);
return errc::topic_invalid_partitions_memory_limit;
}

return errc::success;
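Below is a minimal, self-contained sketch (not part of the commit) that
condenses the new guess-plus-limit logic from the diff above; the hard-coded
10% partition share and the helper names are illustrative assumptions, not
the real config bindings or memory_groups API.

#include <cstdint>
#include <iostream>
#include <optional>

namespace {

constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;
constexpr uint64_t GiB = 1024 * MiB;

// New default for topic_memory_per_partition, as in the diff above.
constexpr uint64_t new_default_tmpp = 200 * KiB;

// Mirrors guess_is_memory_group_aware_memory_limit(): an unset property is
// memory group aware; an override at or above twice the new default is
// assumed to predate the memory-group-aware check.
bool guess_is_memory_group_aware(std::optional<uint64_t> tmpp_override) {
    return !tmpp_override.has_value()
           || *tmpp_override < 2 * new_default_tmpp;
}

uint64_t max_partitions(
  uint64_t cluster_memory, uint64_t tmpp, bool memory_group_aware) {
    // Assumed 10% partition share
    // (topic_partitions_max_memory_allocation_share).
    const uint64_t partition_memory = memory_group_aware ? cluster_memory / 10
                                                         : cluster_memory;
    return partition_memory / tmpp;
}

} // namespace

int main() {
    const uint64_t cluster_memory = 8 * GiB;
    // Legacy override at the old 4MiB default: treated as not memory group
    // aware and checked against all memory -> 2048 partitions.
    std::cout << "legacy: "
              << max_partitions(
                   cluster_memory, 4 * MiB, guess_is_memory_group_aware(4 * MiB))
              << "\n";
    // No override: new 200KiB default against the 10% reservation
    // -> ~4194 partitions.
    std::cout << "aware:  "
              << max_partitions(
                   cluster_memory,
                   new_default_tmpp,
                   guess_is_memory_group_aware(std::nullopt))
              << "\n";
}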
32 changes: 32 additions & 0 deletions src/v/cluster/scheduling/topic_memory_per_partition_default.h
@@ -0,0 +1,32 @@
/*
* Copyright 2024 Redpanda Data, Inc.
*
* Use of this software is governed by the Business Source License
* included in the file licenses/BSL.md
*
* As of the Change Date specified in that file, in accordance with
* the Business Source License, use of this software will be governed
* by the Apache License, Version 2.0
*/
#pragma once

#include "base/units.h"

#include <cstddef>

namespace cluster {

// Default value for `topic_memory_per_partition`. Kept in a constant here so
// that it can be referred to more easily.
inline constexpr size_t DEFAULT_TOPIC_MEMORY_PER_PARTITION = 200_KiB;

// DO NOT CHANGE
//
// Original default for `topic_memory_per_partition` from when we made it
// memory group aware. This constant is used by the heuristic that determines
// whether the user had overridden the default while it was still non-memory
// group aware. Hence, this value should NOT be changed and must stay the same
// even if the (above) default is changed.
inline constexpr size_t ORIGINAL_MEMORY_GROUP_AWARE_TMPP = 200_KiB;

} // namespace cluster
4 changes: 4 additions & 0 deletions src/v/cluster/tests/BUILD
@@ -323,13 +323,17 @@ redpanda_cc_btest(
"partition_allocator_tests.cc",
],
deps = [
"//src/v/base",
"//src/v/cluster",
"//src/v/cluster:topic_memory_per_partition_default",
"//src/v/cluster/tests:partition_allocator_fixture",
"//src/v/config",
"//src/v/model",
"//src/v/raft",
"//src/v/raft:fundamental",
"//src/v/random:fast_prng",
"//src/v/random:generators",
"//src/v/resource_mgmt:memory_groups",
"//src/v/test_utils:seastar_boost",
"@abseil-cpp//absl/container:flat_hash_set",
"@boost//:test",
23 changes: 15 additions & 8 deletions src/v/cluster/tests/partition_allocator_fixture.h
@@ -28,13 +28,14 @@

#include <seastar/core/chunked_fifo.hh>

#include <cstdint>
#include <limits>

struct partition_allocator_fixture {
static constexpr uint32_t partitions_per_shard = 1000;
static constexpr uint32_t gb_per_core = 5;

partition_allocator_fixture()
: partition_allocator_fixture(std::nullopt, std::nullopt) {}
: partition_allocator_fixture(std::nullopt, std::nullopt, 1000) {}

~partition_allocator_fixture() {
_allocator.stop().get();
Expand All @@ -53,7 +54,7 @@ struct partition_allocator_fixture {
std::move(rack),
model::broker_properties{
.cores = core_count,
.available_memory_gb = 5 * core_count,
.available_memory_gb = gb_per_core * core_count,
.available_disk_gb = 10 * core_count});

auto ec = members.local().apply(
@@ -142,27 +143,32 @@

fast_prng prng;

uint32_t partitions_per_shard;

protected:
explicit partition_allocator_fixture(
std::optional<size_t> memory_per_partition,
std::optional<int32_t> fds_per_partition) {
std::optional<int32_t> fds_per_partition,
uint32_t partitions_per_shard)
: partitions_per_shard(partitions_per_shard) {
members.start().get();
features.start().get();
_allocator
.start_single(
std::ref(members),
std::ref(features),
config::mock_binding<std::optional<size_t>>(memory_per_partition),
config::mock_binding<std::optional<int32_t>>(fds_per_partition),
config::mock_binding<uint32_t>(uint32_t{partitions_per_shard}),
partitions_reserve_shard0.bind(),
kafka_internal_topics.bind(),
config::mock_binding<bool>(true))
.get();
ss::smp::invoke_on_all([] {
ss::smp::invoke_on_all([memory_per_partition] {
config::shard_local_cfg()
.get("partition_autobalancing_mode")
.set_value(model::partition_autobalancing_mode::node_add);
config::shard_local_cfg().topic_memory_per_partition.set_value(
memory_per_partition);
}).get();
}
};
@@ -172,7 +178,8 @@ struct partition_allocator_memory_limited_fixture
static constexpr size_t memory_per_partition
= std::numeric_limits<size_t>::max();
partition_allocator_memory_limited_fixture()
: partition_allocator_fixture(memory_per_partition, std::nullopt) {}
: partition_allocator_fixture(memory_per_partition, std::nullopt, 10000) {
}
};

struct partition_allocator_fd_limited_fixture
@@ -181,5 +188,5 @@ struct partition_allocator_fd_limited_fixture
= std::numeric_limits<int32_t>::max();

partition_allocator_fd_limited_fixture()
: partition_allocator_fixture(std::nullopt, fds_per_partition) {}
: partition_allocator_fixture(std::nullopt, fds_per_partition, 10000) {}
};