
Stop partition node not generated #602

Open
as51340 opened this issue Nov 6, 2024 · 4 comments

Comments

@as51340
Contributor

as51340 commented Nov 6, 2024

Hi,

Sometimes, at the end of the test, the operation which should stop the disruption isn't generated. Example from the logs:

2024-11-05 15:17:58,885{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:start-partition-node	[:isolated {"n1" #{"n2" "n5" "n6" "n4" "n3"}, "n2" #{"n1"}, "n5" #{"n1"}, "n6" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}}]
.
.
.
.
.
2024-11-05 15:34:08,356{GMT}	INFO	[jepsen worker 4] jepsen.generator.interpreter: Healing cluster.
2024-11-05 15:34:08,357{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-ring	nil
2024-11-05 15:34:08,498{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-ring	:network-healed
2024-11-05 15:34:08,498{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-halves	nil
2024-11-05 15:34:08,610{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-halves	:network-healed
2024-11-05 15:34:08,610{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-node	nil
2024-11-05 15:34:08,732{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-partition-node	:network-healed
2024-11-05 15:34:08,732{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:heal-node	nil
2024-11-05 15:34:08,732{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:heal-node	:not-started
2024-11-05 15:34:08,732{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-network-disruption	nil
2024-11-05 15:34:08,847{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:stop-network-disruption	[:reliable {"n1" nil, "n2" nil, "n3" nil, "n4" nil, "n5" nil, "n6" nil}]

You can see that at 15:17:58,885 start-partition-node was generated, but I don't see a corresponding stop-partition-node until the final generator started.

This is the code I have for the nemesis:

(ns memgraph.high-availability.create.nemesis
  "Memgraph nemesis for HA cluster"
  (:require [jepsen
             [nemesis :as nemesis]
             [generator :as gen]
             [net :as net]
             [util :as util]
             [control :as c]]
            [jepsen.nemesis.combined :as nemesis-combined]
            [clojure.tools.logging :refer [info]]
            [clojure.string :as str]
            [memgraph.high-availability.utils :as hautils]
            [memgraph
             [support :as s]
             [query :as mgquery]
             [utils :as utils]]))

(defn node-start-stopper
  "Takes a targeting function which, given a list of nodes, returns a single
  node or collection of nodes to affect, and two functions `(start! test node)`
  invoked on nemesis start, and `(stop! test node nodes-config)` invoked on nemesis stop.
  Also takes a nodes-config map needed to restart nodes.
  Returns a nemesis which responds to :start and :stop by running the start!
  and stop! fns on each of the given nodes. During `start!` and `stop!`, binds
  the `jepsen.control` session to the given node, so you can just call `(c/exec
  ...)`.

  The targeter can take either (targeter test nodes) or, if that fails,
  (targeter nodes).

  Re-selects a fresh node (or nodes) for each start--if targeter returns nil,
  skips the start. The return values from the start and stop fns will become
  the :values of the returned :info operations from the nemesis, e.g.:

      {:value {:n1 [:killed \"java\"]}}"
  [targeter start! stop! nodes-config]
  (let [nodes (atom nil)]
    (reify nemesis/Nemesis
      (setup! [this _] this)

      (invoke! [_ test op]
        (locking nodes
          (assoc op :type :info, :value
                 (case (:f op)
                   :start (let [ns (:nodes test)
                                ns (try (targeter test ns)
                                        (catch clojure.lang.ArityException _
                                          (targeter ns)))
                                ns (util/coll ns)]
                            (if ns
                              (if (compare-and-set! nodes nil ns)
                                (c/on-many ns (start! test c/*host*))
                                (str "nemesis already disrupting "
                                     (pr-str @nodes)))
                              :no-target))
                   :stop (if-let [ns @nodes]
                           (let [value (c/on-many ns (stop! test c/*host* nodes-config))]
                             (reset! nodes nil)
                             value)
                           :not-started)))))

      (teardown! [_ _]))))

(defn main-instance?
  [instance]
  (= (:role instance) "main"))

(defn get-current-main
  "Returns the name of the main instance. If there is no main instance or more than one main, throws exception."
  [instances]
  (let [main-instances (filter main-instance? instances)
        num-mains (count main-instances)
        main-instance
        (cond (= num-mains 1) (first main-instances)
              (= num-mains 0) nil
              :else (throw (Exception. "Expected at most one main instance.")))
        main-instance-name (if (nil? main-instance) nil (:name main-instance))]
    main-instance-name))

(defn leader?
  [instance]
  (= (:role instance) "leader"))

(defn extract-coord-name
  "We could use any server. We cannot just use name because in Memgraph coordinator's name is written as 'coordinator_{id}'."
  [bolt-server]
  (first (str/split bolt-server #":")))

(defn get-current-leader
  "Returns the name of the current leader. If there is no leader or more than one leader, throws exception."
  [instances]
  (let [leaders (filter leader? instances)
        num-leaders (count leaders)
        leader
        (cond (= num-leaders 1) (first leaders)
              (= num-leaders 0) nil
              :else (throw (Exception. "Expected at most one leader.")))
        leader-name (if (nil? leader) nil (extract-coord-name (:bolt_server leader)))]
    leader-name))

(defn choose-node-to-kill-on-coord
  "Chooses between the current main and the current leader. If there are no clear MAIN and LEADER instance in the cluster, we choose random node. We always connect to
  coordinator 'n4' (free choice). If n4 is down, we choose random instance to kill"
  [ns coord]
  (let [conn (utils/open-bolt coord)]
    (try
      (utils/with-session conn session
        (let [instances (->> (mgquery/get-all-instances session) (reduce conj []))
              main (get-current-main instances)
              leader (get-current-leader instances)
              node-to-kill (cond
                             (and (nil? main) (nil? leader)) (rand-nth ns)
                             (nil? main) leader
                             (nil? leader) main
                             :else (rand-nth [main leader]))
              node-desc (cond
                          (hautils/data-instance? node-to-kill) "current main"
                          (hautils/coord-instance? node-to-kill) "current leader"
                          :else "random node")]

          (info "Killing" node-desc ":" node-to-kill)

          node-to-kill))
      (catch org.neo4j.driver.exceptions.ServiceUnavailableException _e
        nil))))

(defn choose-node-to-kill
  [ns]
  (let [coordinators ["n4" "n5" "n6"]
        chooser (partial choose-node-to-kill-on-coord ns)]

    (some chooser coordinators)))

(defn node-killer
  "Responds to :start by killing a random node."
  [nodes-config]
  (node-start-stopper choose-node-to-kill
                      s/stop-node!
                      s/start-memgraph-node!
                      nodes-config))

(defn network-disruptor
  "Responds to :start by disrupting network on chosen nodes and to :stop by restoring network configuration."
  [db]
  (nemesis-combined/packet-nemesis db))

(defn full-nemesis
  "Can kill, restart all processess and initiate network partitions."
  [db nodes-config]
  (nemesis/compose
   {{:kill-node    :start
     :heal-node :stop} (node-killer nodes-config)

    {:start-partition-halves :start
     :stop-partition-halves :stop} (nemesis/partition-random-halves)

    {:start-partition-ring :start
     :stop-partition-ring :stop} (nemesis/partition-majorities-ring)

    {:start-partition-node :start
     :stop-partition-node :stop} (nemesis/partition-random-node)

    {:start-network-disruption :start-packet
     :stop-network-disruption :stop-packet} (network-disruptor db)}))

(defn nemesis-events
  "Create a random sequence of nemesis events. Disruptions last [10-60] seconds, and the system remains undisrupted for some time afterwards."
  [nodes-config]
  (let [events [[{:type :info :f :start-partition-halves}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-halves}
                 (gen/sleep 5)]

                [{:type :info :f :start-partition-ring}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-ring}
                 (gen/sleep 5)]

                [{:type :info :f :start-partition-node}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-node}
                 (gen/sleep 5)]

                [{:type :info :f :kill-node}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :heal-node}
                 (gen/sleep 5)]

                [{:type :info :f :start-network-disruption :value [(keys nodes-config) net/all-packet-behaviors]}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-network-disruption}
                 (gen/sleep 5)]]]

    (mapcat identity (repeatedly #(rand-nth events)))))

(defn create
  "Create a map which contains a nemesis configuration for running HA create test."
  [db nodes-config]
  {:nemesis (full-nemesis db nodes-config)
   :generator (gen/phases
               (gen/sleep 5) ; Enough time for cluster setup to finish
               (nemesis-events nodes-config))
   :final-generator (map utils/op [:stop-partition-ring :stop-partition-halves :stop-partition-node :heal-node :stop-network-disruption])})

@aphyr
Collaborator

aphyr commented Nov 6, 2024

You can see that at 15:17:58,885 start-partition-node was generated, but I don't see a corresponding stop-partition-node until the final generator started.

Is that... not what you expected? The final generator's job is specifically to emit those operations in case a partition is still ongoing.
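
For reference, a minimal sketch of what that final generator presumably expands to (an assumption: utils/op is taken to wrap a keyword as a bare :info op). Each stop is issued once regardless of whether a matching start ever ran, which is why :heal-node can answer :not-started as in the log above:

;; Sketch only - assumes utils/op builds a plain {:type :info, :f kw} op.
[{:type :info, :f :stop-partition-ring}
 {:type :info, :f :stop-partition-halves}
 {:type :info, :f :stop-partition-node}
 {:type :info, :f :heal-node}
 {:type :info, :f :stop-network-disruption}]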

@as51340
Contributor Author

as51340 commented Nov 6, 2024

I was expecting that after rand([10, 60]) seconds stop-partition-node would be generated, because of:

(defn nemesis-events
  "Create a random sequence of nemesis events. Disruptions last [10-60] seconds, and the system remains undisrupted for some time afterwards."
  [nodes-config]
  (let [events [[{:type :info :f :start-partition-halves}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-halves}
                 (gen/sleep 5)]

                [{:type :info :f :start-partition-ring}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-ring}
                 (gen/sleep 5)]

                [{:type :info :f :start-partition-node}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-partition-node}
                 (gen/sleep 5)]

                [{:type :info :f :kill-node}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :heal-node}
                 (gen/sleep 5)]

                [{:type :info :f :start-network-disruption :value [(keys nodes-config) net/all-packet-behaviors]}
                 (gen/sleep (+ 10 (rand-int 51))) ; [10, 60]
                 {:type :info :f :stop-network-disruption}
                 (gen/sleep 5)]]]

since I do start-partition-node, then sleep, then stop-partition-node.

However, there is no stop-partition-node from 15:17:58 until the end.
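
A quick way to sanity-check the generator in isolation is to realize the head of the (infinite, lazy) sequence in a REPL; the node map below is just a hypothetical stand-in for the real nodes-config:

;; Sketch: each picked event vector contributes four elements in order -
;; start op, sleep, stop op, 5 s sleep - so in the sequence itself every
;; start is followed by its stop before the next disruption is chosen.
(take 8 (nemesis-events {"n1" {}, "n2" {}}))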

@aphyr
Copy link
Collaborator

aphyr commented Nov 6, 2024

I'm afraid debugging this is a bit beyond the scope of what I can offer as far as unpaid support. There's a lot of code here, I don't know what much of it is calling in to, or what's calling it. This might involve interactions with whatever top-level generator you're wrapping the nemesis generator in. Or it might be because the nemesis itself seems to go and do... maybe blocking IO? Really can't say, sorry!
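
For what it's worth, the usual wrapping looks roughly like the sketch below (an assumption; the actual test map isn't shown in this issue). How nemesis ops interleave with client ops, and where the run is cut off, depends on this outer composition rather than on nemesis-events alone:

;; Sketch only - client-generator, db, nodes-config, and time-limit are
;; hypothetical placeholders, not names from this repository.
(let [nem (create db nodes-config)]
  {:nemesis   (:nemesis nem)
   :generator (gen/phases
               (->> client-generator
                    (gen/nemesis (:generator nem))
                    (gen/time-limit time-limit))
               (gen/nemesis (:final-generator nem)))})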

@as51340
Contributor Author

as51340 commented Nov 7, 2024

Understood, thanks!
