
IP allocation design


Basic idea: We divide the IP address space between all nodes; we call this process 'reservation'. This allows nodes to allocate and free individual IPs locally in a very efficient manner, without synchronous interaction with other nodes. It also provides some inherent partition tolerance.

The key problems to solve are:

  • How do nodes get hold of reservations when they start up or are running out of IPs?
  • How do we avoid leaks when nodes die?

Our approach is founded on the following ideas:

  1. CRDT. Nodes share a CRDT that maps node ids to reservations. Nodes only ever update their own entries, except for erecting tombstones for failed nodes. This makes the data structure inherently convergent.
  2. Gossipping. Updates to the data structure are communicated between nodes through gossipping, which scales well and copes well with changing topologies and partitions. The gossipping piggybacks on the weave router's peer gossiping and is therefore topology-aware, which decreases message loss and duplication. [1]
  3. Protocol. Nodes move reservations between each other through a mini messaging protocol. The protocol is conjoined with the CRDT gossiping, thus ensuring that the recipient has up to date information about the sender. [2]
  4. Buddies. Reservation is based on a buddy scheme, which ensures that lost reservations can be reclaimed by a uniquely identifiable node.

[1] We expect gossipping to propagate information to all nodes eventually, provided the rate of topology change is below a certain threshold.

[2] The messaging is async, with no delivery or ordering guarantees.
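
As a concrete illustration of the CRDT from point 1, here is a minimal sketch of the shared data structure in Go. All names and fields are assumptions made for this page, not the actual weave code; the later snippets reuse these types.

```go
package ipam // illustrative sketch package; later snippets assume these types

// Reservation is a contiguous range of IPs, [Start, End).
type Reservation struct {
	Start, End uint32 // IPv4 addresses as unsigned integers
}

// Entry is one node's slice of the shared map: the ranges it has reserved,
// plus that node's slot of the vector clock, advanced only by the owner.
type Entry struct {
	Clock        uint64        // this node's vector clock slot
	Reservations []Reservation // ranges this node currently claims
	HasFree      bool          // "has some free IPs" flag (see reservation below)
	Tombstone    bool          // set when this node has been declared dead
}

// Map is the CRDT gossipped between nodes: node id -> that node's entry.
type Map map[string]Entry
```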

CRDT

The CRDT is implemented with the help of a vector clock over all map entries. When a node updates its own map entry, the clock for its entry is advanced. This allows gossipees to merge received entries monotonically with their own, i.e. they will overwrite entries they hold which are older than those received.
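
A hedged sketch of that merge rule, building on the types above; the single per-entry clock stands in for that entry's slot of the vector clock:

```go
// Merge folds received entries into ours. Since an entry is only ever
// advanced by its owning node, keeping whichever copy has the higher clock
// is monotonic and order-independent, hence convergent.
func (m Map) Merge(received Map) {
	for node, theirs := range received {
		ours, ok := m[node]
		if !ok || theirs.Clock > ours.Clock {
			m[node] = theirs // their copy is newer, or new to us
		}
		// otherwise our copy is at least as recent; keep it
	}
}
```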

A crucial property of this data structure and the protocol we describe below is that, despite nodes potentially having wildly different ideas of what IP ranges are reserved by other nodes, no two nodes will simultaneously think that they hold a reservation for the same range.

Piggybacking on that property, we also have a guarantee that any range not reserved by any node has a unique inheritor node, i.e. its buddy.

The CRDT does not need to be consistent, e.g. it is ok for it to contain multiple entries containing the same reservations. Consequently we can gossip individual entries rather than the entire map.

TODO figure out exactly what should be gossipped when; we need to ensure that all entries get propagated everywhere eventually, with a propagation delay of no worse than log(number-of-nodes) and per-node message rates and volumes no worse than log(number-of-nodes).

buddies

There are several suitable representations for reservations.

In the simplest case, think of the IP space as a ring, with reservations being represented as segments of the ring labelled by the node holding the reservation. Gaps in the ring then represent lost reservations. The buddy of a lost reservation is the node holding the reservation just "below" / counter-clockwise to the gap.
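
A sketch of that ring view, assuming the notion of labelled segments above; the Segment type and helper are purely illustrative:

```go
// Segment is one arc of the ring: an IP range plus the node labelled as
// holding it; Owner == "" marks a gap, i.e. a lost reservation.
type Segment struct {
	Start, End uint32 // [Start, End)
	Owner      string
}

// buddyOfGap returns the owner of the nearest labelled segment
// counter-clockwise of the gap at index i; segments must be sorted by Start
// and collectively cover the whole ring.
func buddyOfGap(segments []Segment, i int) string {
	for n := 1; n <= len(segments); n++ {
		prev := segments[(i-n+len(segments))%len(segments)]
		if prev.Owner != "" {
			return prev.Owner
		}
	}
	return "" // nothing is owned anywhere, e.g. before initialisation
}
```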

For a possibly more compact representation, which also happens to neatly align with sub-nets and CIDR notation, think of the IP space as a binary tree, with reservations being represented as leaves labelled by the node holding the reservation for all IPs below it. Unlabelled leaves then represent lost reservations. Non-leaf nodes get the label of the left-most labelled leaf of their children. The buddy of a lost reservation is the node identified by the label on the sibling of the unlabelled leaf corresponding to the lost reservation.
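
A corresponding sketch of the tree view; again the types and helpers are illustrative assumptions, not the actual implementation:

```go
// TreeNode is a node of the binary tree over the IP space. A leaf either
// carries the id of the node holding the whole CIDR block below it, or is
// unlabelled (a lost reservation).
type TreeNode struct {
	Owner       string    // non-empty only on labelled leaves
	Left, Right *TreeNode // both nil on leaves
}

// label returns the label of the left-most labelled leaf under t, or "".
func label(t *TreeNode) string {
	if t == nil {
		return ""
	}
	if t.Owner != "" {
		return t.Owner
	}
	if l := label(t.Left); l != "" {
		return l
	}
	return label(t.Right)
}

// buddy returns the buddy of a lost (unlabelled) leaf: the label on the
// sibling subtree, given the leaf's parent.
func buddy(parent, lost *TreeNode) string {
	if parent.Left == lost {
		return label(parent.Right)
	}
	return label(parent.Left)
}
```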

initialisation

Nodes are given an IP range when starting up. We assume that each node is given the same range, i.e. the range from which all nodes are meant to make their reservations.

TODO deal with nodes being given different ranges and multiple ranges; the effect should be that all ranges specified anywhere are used for reservations everywhere.

We need to deal with concurrent start-up. We do this by electing a leader - the node with the lowest id - that claims the entire range for itself.

When a node starts up, it initialises the CRDT map to contain just an empty entry for itself. It then waits for map updates. If it receives an update containing entries for other nodes that do claim to own some IP range, then it proceeds with normal allocation [see below]. Otherwise, if, after a while [1], no such update has been received (i.e. we've either received no update or only updates with all node entries being empty [2]), and the node is the node with the lowest id, then the node claims the entire IP range for itself, setting its map entry accordingly.

[1] This time can be quite short; it needs to be just long enough for any alive nodes to gossip their state. That is, unless we want to cater for situations where a partition has occurred and all nodes on our side were subsequently stopped, or where we are trying to add a new node to a network of existing nodes which is presently unreachable. We could conceivably have two start-up modes for nodes: one which permits the above range grabbing, and another where we simply wait for a non-empty map. This mode selection could happen based on whether any IP ranges were specified on start-up. TODO work this out.

[2] Ordinarily nodes will always hold a reservation, i.e. once a node has managed to get hold of a reservation it is never going to give it away entirely. Therefore the situation of all node entries being empty only arises on formation of a new network.
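
A sketch of that start-up decision, assuming the types from the earlier snippets and that it runs once the wait described in [1] has elapsed; the helper and parameter names are assumptions:

```go
// maybeClaimWholeRange claims the entire configured range for ourselves,
// but only if no other node claims any range and we have the lowest id.
func maybeClaimWholeRange(m Map, self string, universe Reservation) {
	for id, e := range m {
		if id != self && len(e.Reservations) > 0 {
			return // others own ranges: proceed with normal allocation instead
		}
	}
	if self != lowestID(m) {
		return // not the prospective leader: keep waiting for map updates
	}
	entry := m[self]
	entry.Clock++
	entry.Reservations = []Reservation{universe}
	entry.HasFree = true
	m[self] = entry
}

// lowestID returns the smallest node id present in the map.
func lowestID(m Map) string {
	low := ""
	for id := range m {
		if low == "" || id < low {
			low = id
		}
	}
	return low
}
```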

Note that the chosen leader can be a node other than the one with the lowest id. This happens when a node with a lower id starts up just as another node has made the decision to become leader. That is ok; the new node would not itself claim to be leader until some time has passed - long enough to receive an update from the existing leader, which preempts its own claim.

Failures:

  • prospective leader dies before sending map -> detect node failure (see below)
  • repeated appearance and sudden death (after sending map) of nodes with lower ids - perhaps make the MSB of the id a time-stamp? Then the max delay is bounded by the clock difference.

TODO We probably want to allow for a "node recovery mode", where a node rejoins the network after the IP allocator (and/or perhaps other parts of the weave infrastructure) has been restarted. A node could learn what IPs are locally allocated by inspecting the local containers, and then learn from the received map what its previous reservations were.

reservation

When a node wants more IPs it contacts another node, asking it for a donation of some ranges. The criteria for selecting that node are:

  • proximity
  • number of available IPs
  • closeness of available ranges to ours, so that we can reduce fragmentation by merging the new ranges with existing ones

TODO: figure out what selection criteria we should use. This also has an impact on what additional data about nodes we need to include in the CRDT, and the update frequency of that.

For now, we select the node which a) has some free ranges, and b) holds the maximum number of ranges for which we are the buddy. This requires the addition of just a boolean "has some free IPs" flag to the CRDT, and increases the chances that the selected node will be able to donate some buddy ranges to us, which is good for reducing fragmentation.
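
A sketch of that selection rule, assuming a hypothetical buddyCount helper derived from the ring/tree view described above:

```go
// chooseDonor picks, among nodes advertising free IPs, the one holding the
// largest number of ranges for which we are the buddy; "" means no node
// currently has free IPs.
func chooseDonor(m Map, self string, buddyCount func(donor string) int) string {
	best, bestScore := "", -1
	for id, e := range m {
		if id == self || e.Tombstone || !e.HasFree {
			continue
		}
		if score := buddyCount(id); score > bestScore {
			best, bestScore = id, score
		}
	}
	return best
}
```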

The chosen node replies with a list of ranges that collectively add up to approximately half of the free IPs in that node's reservations, selected with a preference for a) buddies of the requester and b) small ranges. Special case: if the node only has one free IP, it still donates that.
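
And a sketch of the donor's side, assuming its free sub-ranges have already been sorted to put buddies of the requester first and smaller ranges before larger ones:

```go
// chooseDonation hands over ranges, in the given preference order, until
// roughly half of the free IPs are covered; with a single free IP left,
// that one IP is donated anyway.
func chooseDonation(free []Reservation) []Reservation {
	total := uint32(0)
	for _, r := range free {
		total += r.End - r.Start
	}
	if total == 0 {
		return nil // nothing to donate; requester will try another node
	}
	target := total / 2
	if target == 0 {
		target = 1 // special case: only one free IP, donate it regardless
	}
	var donation []Reservation
	given := uint32(0)
	for _, r := range free {
		if given >= target {
			break
		}
		donation = append(donation, r)
		given += r.End - r.Start
	}
	return donation
}
```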

If the returned list is empty, the requester repeats the node selection process, contacting a different node this time (since the original requestee will now show as having no free IPs). Otherwise the requester updates its map entry to add the donated ranges.

When the map indicates that no node has any free ranges, we wait for a map update that indicates a change of conditions.

Failures:

  • request delayed --> requester gives up after a while and tries the next best node. Any subsequent response is ignored. -> we've leaked some IPs. See below.
  • request lost
  • target node dead
  • target node unreachable --> requester gives up after a while and tries the next best node.
  • response delayed --> requester should ignore it -> we've leaked some IPs. See below.
  • response lost
  • requester node dead
  • requester node unreachable --> we've leaked some IPs. See below.

TODO: handle case where we run out of "next best nodes". Probably we should just sleep and try again, essentially waiting for either better luck next time (i.e. less message loss/delay), some node recovery / partition-healing, or a map update.

TODO: might it be worth considering proactive donation, perhaps based on gossiped "time until I run out of IPs based on current allocation rate"?

node shutdown

We could have no node shutdown protocol at all and instead handle this the same way as node failure (see below). But failure detection is comparatively expensive and takes time, during which the IP ranges of the failed node are not usable. So the protocol here is purely an optimisation.

When a node leaves, it donates its IP ranges to one other node by first removing its own map entry (this entails erecting a tombstone, see below), and then sending a message to the selected node, containing all the reservations. The target node is chosen based on its ability to merge the reservation IP ranges with its own, i.e. it is a buddy to the maximum number of our ranges.

After sending the message, the node terminates - it does not wait for any response. Overall, very little work is performed by the node on shutdown, so as not to unduly delay it.
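
A sketch of that hand-over, assuming the earlier types plus hypothetical bestBuddy and send hooks:

```go
// shutdown tombstones our own entry, then fire-and-forgets all our
// reservations to the node that is buddy to the most of them.
func shutdown(m Map, self string, bestBuddy func([]Reservation) string,
	send func(to string, ranges []Reservation)) {

	entry := m[self]
	ranges := entry.Reservations

	// remove our entry by erecting a tombstone (max clock wins every merge)
	entry.Tombstone = true
	entry.Clock = ^uint64(0)
	entry.Reservations = nil
	entry.HasFree = false
	m[self] = entry

	if target := bestBuddy(ranges); target != "" {
		send(target, ranges) // no response is awaited; the node terminates now
	}
}
```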

Failures:

  • message lost
  • target node dead
  • target node unreachable --> we've leaked some IPs. See below.

reclaiming leaked IPs

A node's knowledge of its own reservations is authoritative. By examining the map, a node can tell which IP ranges are currently unclaimed and for which of these it is the buddy. Nodes wait for a while for these ranges to be claimed by somebody, in order to give the reservation and node shutdown protocols enough time to conduct range transfers and the result of that to propagate. If that doesn't happen then they claim the range for themselves, updating their map entry accordingly.

NB: "waiting for a while" is more complex than it may first appear. We need to be able to reclaim ranges even when their buddies keep changing due to reservations getting transferred between nodes. One way to cope with that is for each node to maintain age information on the unclaimed IPs, updating that information whenever receiving/performing a map update. Unclaimed IPs of the same age can be coalesced into ranges for compact representation. Any such ranges beyond a certain age for which the node is the buddy can be reclaimed by that node. This creates a dual trigger: reclaiming can happen when a node acquires a reservation or through the passage of time.

Failures:

  • range reclaimed too early --> as a result at some point some node will attempt to construct a map that has some IP range allocated to two nodes. That actually isn't the end of the world (yet). We could ask both nodes to talk to each other and work out which node should have which sub-ranges of the overlapping range. As long as the nodes haven't actually handed out the same IP then everything is fine. And if there is a genuine conflict then we could pause or kill one of the offending containers.

TODO: Could we end up in a situation where a reservation for a particular range moves rapidly between nodes, with none of the ownership claims ever making it to the buddy node of that range, thus resulting in that buddy node reclaiming the range in error? Seems unlikely, but can we quantify this somehow?

node failure

If we haven't heard from a node for a while [1] then any other node can mark that node as dead by removing its entry from the map. Deletion is performed by erecting a tombstone - setting the vector clock slot of the map entry to the max value, thus ensuring that the map entry will always be chosen during a CRDT merge operation.

[1] This requires some heart-beating, which gossipping tends to incorporate anyway. The duration to wait before declaring a node as dead must be long enough to span likely partitioning periods.
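
A sketch of tombstone erection, assuming the Map/Entry types from the earlier snippets (the separate Tombstone flag is an assumption; the max clock value alone would also do):

```go
// markDead tombstones a peer's entry. Setting the clock to the maximum value
// guarantees the tombstoned entry wins every subsequent CRDT merge.
func markDead(m Map, node string) {
	m[node] = Entry{
		Clock:     ^uint64(0), // max uint64: always chosen during a merge
		Tombstone: true,
		// no reservations: the dead node's ranges become unclaimed and will
		// be reclaimed by their buddies as described above
	}
}
```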

Once a node has been marked as dead, its IP ranges are effectively leaked and can be reclaimed as explained above.

NB Bad things will happen if a node rejoins with the same id after it has been declared dead. A partition rejoin is one scenario where that can happen. We can at least detect that - the node will receive a map update with a tombstone for itself.

TODO We may want some way of garbage collecting tombstones.

Can we do better?...

It would be nice if we were able to identify a single node to perform this failure marking, so we don't needlessly perform the same cleanup operation on multiple nodes. Various options for identifying a single node:

  1. node with the lowest id overall - this is rather unbalanced
  2. node with the "next lowest" id compared to dead node - this can still be rather unbalanced, especially if ids have the time-stamp as the MSB.
  3. node which is the buddy of the lowest IP range among the merged IP ranges of all dead nodes becomes the re-claimer for all the dead nodes with IPs in that range - quite complicated, especially since we need to account for the fact that different nodes may declare another node as dead at slightly different points in time.

Here's another idea... when marking a node as dead, we move its reservations into another part of the CRDT, which collects reservations from all dead nodes. Alive nodes can then remove buddy ranges from that data structure and claim them as their own. It should be possible to construct a convergent data structure for this, but I haven't quite figured it out yet.
