This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Is Docker Swarm a viable way to achieve HA for orchestrator? #1177

Open
sbrattla opened this issue May 29, 2020 · 13 comments

Comments

@sbrattla

I've just recently found out about orchestrator, and I have been testing it out trying to get familiar with it.

There are multiple ways to run orchestrator, and raft appears to be the simplest way to achieve HA. However, right now I'm running orchestrator as a (single task/instance) service in a Docker Swarm. Is there anything which speaks against this way of achieving HA for orchestrator?

In essence, the HA part is delegated to Docker Swarm, and orchestrator runs with the default SQLite3 backend. The database directory is bind mounted from a host directory (which in turn is a mounted network share), so data will persist.

Ignoring any performance aspects of this setup (underlying storage for database is a network share), is there anything about this setup which would/could break orchestrator?

@shlomi-noach
Collaborator

Thank you for this question. I'm not very familiar with docker swarm. Does a mounted network share guarantee persistence of the sqlite database file? Does docker swarm maintain copies of that file? If so, how are these copies synchronized?

@sbrattla
Author

sbrattla commented May 29, 2020

I'm not that familiar with SQLite3, but I assumed that only a single instance of orchestrator should ever write to a given SQLite3 database. I therefore deployed orchestrator as a single-instance service (replicas set to 1).

The network (NFS) share is mounted on /mnt/storage/orchestrator/var/lib/orchestrator on each of the Docker Swarm hosts. Docker Swarm will bind mount the host directory /mnt/storage/orchestrator/var/lib/orchestrator on /var/lib/orchestrator inside the container (or task, as Docker Swarm terms it).

Docker Swarm does not in any way interfere with this setup. All reads and writes to /var/lib/orchestrator are passed directly on to the mounted host directory.

This setup does assume that the storage is HA, but granted this is already in place, Docker Swarm should ensure that orchestrator always runs somewhere.
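
For illustration, a minimal stack definition along those lines would look something like the sketch below. The image name and published port are just placeholders here; the paths and the single replica are the ones described above.

```yaml
version: "3.7"

services:
  orchestrator:
    # Placeholder image; substitute whatever orchestrator image/version you actually use.
    image: openarkcode/orchestrator:latest
    deploy:
      mode: replicated
      replicas: 1              # never more than one task, since a single SQLite file backs it
      restart_policy:
        condition: any         # reschedule the task if it (or its host) dies
    ports:
      - "3000:3000"            # orchestrator's default HTTP port
    volumes:
      # Host path is the NFS mount present on every Swarm node, bind mounted
      # into the task at orchestrator's default data directory.
      - /mnt/storage/orchestrator/var/lib/orchestrator:/var/lib/orchestrator
```

Deployed with docker stack deploy -c orchestrator.yml orchestrator, Swarm keeps exactly one task running and reschedules it onto another node if the current one fails.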

My question is probably more a question of reusing existing infrastructure (Docker Swarm), as I'd really like to avoid having to create separate VMs for just running orchestrator.

@shlomi-noach
Collaborator

This setup does assume that the storage is HA, but granted this is already in place

That was my question. Right, so if storage is guaranteed to be HA, and you can guarantee a single instance of orchestrator running at any time (you're correct, and no more than one orchestrator node should use the same backend database/file), then:

  • The setup is hopefully sound (no split brains or write conflicts) -- but take a look at my questions below.
  • It will take time, upon orchestrator crash/loss, to bring up a new orchestrator node. Take this time into consideration. Also, a node which has just started up waits for at least 3 discovery cycles before running failovers. If you correlate a crash scenario with an orchestrator loss (e.g. the network goes down) then you should add all that time to the recovery process.

I wonder about docker swarm cross-DC, and how it handles DC network partitioning. What if the orchestrator node gets network isolated; if/how does docker swarm know to start up a new node in another DC; and once the original DC is back online, how does docker swarm know to take it offline, and whether there will be a time period where two orchestrator nodes will actually mount the same backend.

@sbrattla
Author

Assume orchestrator runs on node A in a 3-node swarm with nodes A, B and C. If A gets isolated from B and C, then B and C have quorum. However, orchestrator will still run on A. New tasks cannot be scheduled on A, but existing tasks will keep running. So we still have orchestrator running on A.

However, B and C will also discover that orchestrator is not running, and will schedule orchestrator on the remaining "healthy" nodes of the swarm. So, this is probably where it breaks.

If the swarm broke (let's say due to network issues), then the storage likely also broke. In other words, you could have a split brain situation on the storage level.

So, the issue here is not so much that multiple instances of orchestrator run. That would also be the case with orchestrator running in raft mode. The issue is that you could have a split brain situation on the storage level, and that could result in havoc when the network normalises again.

Although not what I was originally hoping for, does this make sense?

@shlomi-noach
Collaborator

Right. Change of plans. I never considered the case where two orchestrator nodes would talk to the same SQLite backend database file, but it's actually supposedly covered by virtue of the leader election process in a shared database setup. That's how orchestrator works with MySQL in a non-raft setup.

SQLite actually protects against multiple processes writing to the same backend database by acquiring a file lock. Now, it remains to clarify whether the shared mount supports file-level locks, and whether SQLite is able to use that lock.
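
One crude way to smoke-test that (not conclusive, but cheap): SQLite's default locking is built on POSIX fcntl advisory locks, so you could hold such a lock on a file on the NFS mount from one Swarm host and check whether a second host blocks on it. A sketch, with the path being just an example:

```python
import fcntl
import time

# Example path on the shared NFS mount (adjust to your setup).
LOCK_PATH = "/mnt/storage/orchestrator/var/lib/orchestrator/lock-test"

with open(LOCK_PATH, "w") as f:
    # Exclusive POSIX (fcntl) advisory lock -- the same primitive SQLite
    # uses for its database locking. Blocks until the lock is granted.
    fcntl.lockf(f, fcntl.LOCK_EX)
    print("lock acquired, holding for 60 seconds ...")
    time.sleep(60)
# Lock is released when the file is closed.
```

If a second copy of the script, run on a different host, acquires the lock immediately while the first one is still holding it, the mount is not honouring the locks and SQLite's protection is void.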

If not, things will break.

If yes, then the next question is the matter of storage split brain. If storage cannot handle split brains and cannot correctly override changes once network split is alleviated, things will break.

@sbrattla
Author

OK, so true HA for orchestrator would be hard to get up and running with Docker Swarm if split-brain handling in the underlying storage is uncertain.

I also considered running orchestrator in raft mode in Docker Swarm, but with that come a few new challenges:

  • orchestrator must be given ip.or.fqdn.of.orchestrator for all orchestrator instances up front. While it is possible to generate a predictable hostname for an instance n in Docker Swarm, these hostnames can only be used inside each container. Different instances of a given service cannot resolve each other using the generated hostnames.
  • It is possible to resolve all IP addresses for a service from within any container of that service using tasks.[service_name]. However, this is a dynamic list of IP addresses which changes as you scale up and down a service. For this to be useful, orchestrator would have to support looking up peer nodes via DNS. That probably comes with a whole range of new challenges.

So, I believe the only way to run orchestrator in a Docker Swarm right now is to run orchestrator as a single instance and have an underlying storage which is somehow capable of resolving split-brain situations.

@shlomi-noach
Collaborator

Agreed on the limitations of preset hostnames/IPs. Possibly I will address that.

@hkotka

hkotka commented May 30, 2020

I was trying to set up Orchestrator in Docker Swarm in raft mode and thought it worked well with a setup where, in my case, 3 orchestrator services were deployed as separate services (orchestrator01-03) instead of one service with 3 replicas. This way the initial service discovery is predictable.
However, an issue which turned out to be a blocker was exactly that hostname/IP discovery. If one Orchestrator node crashes for whatever reason and Swarm re-creates the instance and allocates a new IP address for it, the Orchestrator nodes that did not crash will never find the newly created Orchestrator instance, since hostname-to-IP resolution is only done when Orchestrator initially starts. I did not find a way around this and had to move the architecture away from Swarm.

It would be amazing if Orchestrator tried to re-resolve FQDNs if it loses the connection to other Raft nodes.

@shlomi-noach
Collaborator

Related: #253

@sbrattla
Author

sbrattla commented Jun 3, 2020

So, in order for orchestrator to be able to run on top of Docker Swarm, I assume some combination of raft and dynamic discovery of orchestrator peers would have to be supported.

Docker Swarm allows any task to discover the IPs of its replicas using a DNS lookup on tasks.[servicename]. This will return one or more IP addresses. The list of IP addresses includes the IP address of the querying task.

If the IP address of the querying task should be filtered out of the DNS answer, one could possibly do this by gathering the IP addresses associated with the network interfaces local to that task (a task can be a member of multiple networks). Any IP address local to the task could then be excluded from the list of IP addresses returned in the DNS answer.
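
A rough Python sketch of that filtering step (the service name is just an example, and I'm assuming the usual Swarm behaviour where the container's own hostname resolves to the task's IPs on its attached networks):

```python
import socket

SERVICE_NAME = "orchestrator"  # example service name

def swarm_peer_ips(service):
    """Resolve tasks.<service> via Swarm's internal DNS and drop our own addresses."""
    task_ips = {info[4][0] for info in socket.getaddrinfo("tasks." + service, None)}
    # Inside a task, the container hostname normally resolves to the task's
    # own IP(s); use those to exclude the querying task from the peer list.
    own_ips = {info[4][0] for info in socket.getaddrinfo(socket.gethostname(), None)}
    return task_ips - own_ips

if __name__ == "__main__":
    print(sorted(swarm_peer_ips(SERVICE_NAME)))
```

Running something like this periodically, or on loss of a raft connection, would give an up-to-date peer list as the service scales.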

Tasks may come and go as a service is scaled up and down, but I guess this holds true for any raft topology. That is, a raft topology will continuously have to agree on the peers available, and which of the peers to elect as a leader.

So, the list of peers would not be limited to static IPs, but would also support lookup of hostnames that could resolve to multiple IP addresses. The set of IP addresses which the provided hostname resolves to could change as the service is scaled up and down.

I'm not too familiar with Kubernetes, but if Kubernetes supports a similar type of service discovery, this may open the door to HA-enabling orchestrator on Kubernetes as well?

@sbrattla
Author

sbrattla commented Jun 4, 2020

@shlomi-noach I can see that some work has been done in hashicorp/raft to move away from IP addresses and use server IDs to manage the peers in a raft topology. I am not familiar with all the details of this, but maybe that is a step in the right direction.

hashicorp/raft#236

Assuming server IDs would be unique to each instance of orchestrator (e.g. generated upon scheduling / startup of a task replica), I'm not sure how "gone" service replicas in Docker Swarm (gone due to rescheduling / scaling / moving / dead hosts etc etc) would be cleaned up though. Would they just linger as "gone" peers in the list of orchestrator peers?

Before I go on with this, is all this anything you would consider implementing in orchestrator? That is, HA enabling orchestrator by letting it run on top of Docker Swarm, Kubernetes or some other container orchestration tool?

@shlomi-noach
Collaborator

@sbrattla correct. That's why I linked #253, which discusses the use of IDs, and also mentions hashicorp/raft#236.
See some followup discussion in that issue.

I'm not sure how "gone" service replicas in Docker Swarm (gone due to rescheduling / scaling / moving / dead hosts etc etc) would be cleaned up though.

I don't know yet. I'm not familiar with Docker swarm.

Before I go on with this, is all this anything you would consider implementing in orchestrator? That is, HA enabling orchestrator by letting it run on top of Docker Swarm, Kubernetes or some other container orchestration tool?

orchestrator is already known to run on Kubernetes, see for example https://github.com/presslabs/mysql-operator; there are other implementations I've heard of. I can perhaps fill in some details in the next weeks.

The general answer is: "yes, I want to solve that", but that depends on my priorities. So I can't make promises, unfortunately.

@sbrattla
Author

sbrattla commented Jun 5, 2020

Thanks @shlomi-noach. I'm trying to follow up on #253, but running each orchestrator node behind its own service in Docker Swarm unfortunately seems to break because...

  • RaftBind needs the IP or FQDN of the orchestrator instance. I cannot set RaftBind to the service FQDN, because orchestrator seems to want to bind to that IP address (which fails, since the IP belongs to the service / load balancer).
  • I can provide the FQDN of the service fronting each orchestrator in RaftNodes, but again, this conflicts with the above, since the FQDN provided in RaftBind must also appear in RaftNodes (see the config sketch below).
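
For reference, the kind of raft configuration I'm describing would look roughly like this (service names are illustrative, along the lines of the per-node services @hkotka described, and ports are left at their defaults):

```json
{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "orchestrator01",
  "RaftNodes": ["orchestrator01", "orchestrator02", "orchestrator03"]
}
```

The problem is that orchestrator01 here resolves to the Swarm service VIP rather than to an address the task itself owns, so orchestrator cannot bind to it, yet that same name is what has to appear in RaftNodes on the other nodes.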

Anyway, I'm grateful for your efforts, and very much hoping that enough people are interested in this feature for you to want to allocate some of your time to it :-)
