
Multiple clusters management in EKS #616

Merged (13 commits) on Jul 3, 2019

Conversation

@aylei (Contributor) commented Jun 29, 2019

What problem does this PR solve?

Support multiple TiDB clusters management in EKS with some updates per #575

What is changed and how does it work?

The control plane (Kubernetes master and tidb-operator) and the data plane (node pools and helm release) are separated into two different modules: tidb-operator and tidb-cluster. A top-level module composes one tidb-operator module and one or more tidb-cluster modules to support multiple cluster management.

See the README of the top-level module, the README of the tidb-operator module, and the README of the tidb-cluster module for details.
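
For illustration, a minimal sketch of how such a composition might look; the module names, paths, and variables below are assumptions for illustration, not the exact code in this PR:

```
# Hypothetical top-level main.tf: one control-plane module plus N data-plane modules.
module "tidb_operator" {
  source = "./tidb-operator"   # EKS cluster + tidb-operator helm release
  region = var.region
}

module "default_cluster" {
  source       = "./tidb-cluster"   # node pools + tidb-cluster helm release
  cluster_name = "default"
  # assumed output wiring from the operator module into the cluster module
  eks          = module.tidb_operator.eks
}
```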

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    • 1 EKS + 1 TiDB cluster
    • 1 EKS + 2 TiDB clusters
    • 1 EKS + 3 TiDB clusters

Code changes

  • Has Terraform script changes
  • Has document changes

Related changes

  • Need to update the documentation

Does this PR introduce a user-facing change?:

The terraform scripts support managing multiple TiDB clusters in one EKS cluster.

Limitations

  • Terraform does not support looping over modules, and I cannot figure out how to define multiple TiDB clusters in .tfvars, so editing the cluster.tf directly is currently unavoidable for multiple cluster management (see the sketch below);
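
Since Terraform (at the time of this PR) cannot apply count or for_each to module blocks, each additional TiDB cluster means another hand-written module block. A hedged sketch of such an edit, with block and variable names assumed for illustration:

```
# Hypothetical extra block added to the clusters .tf file for a second TiDB cluster.
module "cluster-2" {
  source       = "./tidb-cluster"
  cluster_name = "cluster-2"
  pd_count     = 3
  tikv_count   = 3
  tidb_count   = 2
}
```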

@tennix @jlerche @gregwebs PTAL, most of the work was done by @tennix in bd2343f; I did some cleanup and reorganization of the code in the follow-up commits.

@jlerche I changed the aws-tutorial.tfvars to keep compatibility, hope this won't break your tutorial😅

tennix and others added 3 commits June 17, 2019 20:05
@aylei aylei requested review from gregwebs, jlerche and tennix June 29, 2019 14:55
aylei added 2 commits June 29, 2019 22:57
deploy/aws/README.md (outdated review thread, resolved)
deploy/aws/eks/local.tf (outdated review thread, resolved)
@@ -0,0 +1,103 @@
apiVersion: apiextensions.k8s.io/v1beta1
Contributor

Can we symlink this instead of copying?

@aylei (Contributor, Author) Jul 2, 2019

No, this module is supposed to be workable outside this repository. But it is definitely not a good idea to keep a copy here.

How about kubectl apply -f https://raw.githubusercontent.com/pingcap/tidb-operator/v1.0.0-beta.3/manifests/crd.yaml?

Contributor

Why does this need to work outside this repository? I have been referencing manifests with symlinks in the GCP terraform. The deployment doesn't work without a good way to reference our charts and manifests.

The suggested update is problematic because it will fall out of sync. Better would be to have a variable "manifests" which could be set to ../../manifests or https://raw.githubusercontent.com/pingcap/tidb-operator/v1.0.0-beta.3.

Does it help to move the manifests directory under the deploy directory?
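
A rough sketch of the variable being suggested here; the name, default, and usage shown are assumptions for illustration:

```
# Hypothetical "manifests" variable: a local path by default, overridable with a URL.
variable "manifests" {
  description = "Base path or URL that the CRD and provisioner manifests are applied from"
  default     = "../../manifests"
}

# e.g. referenced from a local-exec provisioner:
#   kubectl apply -f ${var.manifests}/crd.yaml
```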

@aylei (Contributor, Author) Jul 2, 2019

> Why does this need to work outside this repository? I have been referencing manifests with symlinks in the GCP terraform. The deployment doesn't work without a good way to reference our charts and manifests.

Cases that use these modules outside this repository:

  • terraform stores state in the working directory, so users may copy this module to manage different EKS instances (one of our colleagues tried this and got hurt by the symlink);
  • for advanced users, it is possible to compose the tidb-operator module and tidb-cluster module into their own terraform scripts. A self-contained module is easier to compose in this case.

> Does it help to move the manifests directory under the deploy directory?

Yes, to some extent. But the manifest directory is still necessary because the ebs-gp2 storage class and local-volume-provisioner.yaml are dedicated to AWS. I've considered using a small overlay YAML to customize the base local-volume-provisioner.yaml via kustomize, but this has a low priority.

Contributor

With respect to state, it seems that users are solving a problem by copying and in response we are copying :)
I want to get away from copying as a solution to anything. The way I solve this is by having users instantiate a module, rather than just changing a terraform variables file. A user instantiates the module in a directory representing their environment (staging, production). The usability is about the same for one instantiation, but when you do multiple, everything is much better.

Contributor

I think in our case we are missing a nice mechanism to depend on our manifest files. The only way I can see how to do it is to use the file function to read the file in (and we can write it out to a new local file to avoid memory consumption). But then I think we still end up needing the entire git repo, so it doesn't really seem better than a symlink.
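
For reference, one way the pattern described above could look, using the local_file resource; the paths here are assumptions for illustration:

```
# Read the manifest from the repository with file() and write it out next to the
# module so kubectl can apply it; this still requires the whole git repo to be present.
resource "local_file" "crd_manifest" {
  content  = file("${path.module}/../../manifests/crd.yaml")
  filename = "${path.module}/rendered/crd.yaml"
}
```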

aylei and others added 3 commits July 2, 2019 14:05
@aylei (Contributor, Author) commented Jul 2, 2019

@tennix @gregwebs I've removed the customized eks module for better maintainability, and addressed all the review comments, PTAL again

aylei added 3 commits July 3, 2019 00:10
@@ -38,6 +38,8 @@ spec:
- key: dedicated
operator: Exists
effect: "NoSchedule"
nodeSelector:
localVolume: "true"
Contributor

local-ssd would be a slightly better name. GKE adds this tag automatically now: cloud.google.com/gke-local-ssd

default_cluster_cluster_name = "aws_tutorial"
Member

The cluster name should not contain underscores.

@tennix (Member) left a comment

Rest LGTM

}
```

## Multiple Cluster Management
Contributor

Suggested change:
- ## Multiple Cluster Management
+ ## Multiple TiDB Cluster Management

@aylei aylei requested review from gregwebs and tennix July 3, 2019 01:16
@aylei (Contributor, Author) commented Jul 3, 2019

@tennix @gregwebs I've updated this PR according to your comments, PTAL, thanks!

@gregwebs (Contributor) commented Jul 3, 2019

As per my comments above, I would like to demonstrate usage that is safe for multiple environments. The good news is we have already created a module, we just need to show how to use it as one!

The only change we need to make is that all the usages of file need to use path.module, i.e. file("${path.module}/...").

Then the usage as a module is just a new directory with a file:

module "staging" {
  source = "../"
}

The problem with this, then, is the suggestion to create multiple TiDB clusters by editing clusters.tf. I suggest we have the user instantiate the tidb-cluster modules themselves.

We can move all our top-level terraform to a separate directory, perhaps called "vpc-setup". Then we have a file deploy/aws/staging/main.tf with contents:

module "staging" {
  source = "../vpc-setup"
}

module "default-cluster" {
  source = "../tidb-cluster"
  ...
}

@gregwebs (Contributor) commented Jul 3, 2019

Great work on this, the above comment shows how easy it is to start changing things around now that things have been properly modularized :)

@aylei (Contributor, Author) commented Jul 3, 2019

Yes, and (luckily) all the file() invocations and the references to files in local-exec handle directories properly now😄

The example is great and I'm going to add it to the README👍

@gregwebs (Contributor) commented Jul 3, 2019

Glad that makes sense. I think if we add it to the README but also explain how to modify the vpc-setup module directly it will lead to problems where someone will have already modified a module directly. I don't know how to fix the state file at that point to move it to the multiple environment setup.

@aylei (Contributor, Author) commented Jul 3, 2019

> I suggest we have the user instantiate the tidb-cluster modules themselves.

IMHO, this UX is more applicable for advanced users with hands-on terraform experience, because after all a little (maybe a lot of) glue code is necessary.

> I think if we add it to the README but also explain how to modify the vpc-setup module directly it will lead to problems where someone will have already modified a module directly. I don't know how to fix the state file at that point to move it to the multiple environment setup.

We should document that users should avoid reusing the top-level module because it is not modularized. Then the state of the top-level module will be stored in /deploy/aws/, and the state of custom modules will be stored in, say, /deploy/aws/staging. And there should be no tf state in /deploy/aws/tidb-operator and /deploy/aws/tidb-cluster because they are not supposed to be applied directly.

@gregwebs (Contributor) commented Jul 3, 2019

> IMHO, this UX is more applicable for advanced users with hands-on terraform experience, because after all a little (maybe a lot of) glue code is necessary.

The glue code is almost entirely default variables. But those are all just used in clusters.tf so we can move the default values into the clusters module itself.

We can also provide a default environment to use. So the user only needs to cd default.
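
For illustration, moving the defaults into the tidb-cluster module might look like the sketch below; the variable names and values are assumptions, not the actual module contents:

```
# Hypothetical variables.tf inside the tidb-cluster module: defaults live here,
# so an environment only overrides what it actually needs.
variable "pd_count" {
  default = 3
}

variable "tikv_count" {
  default = 3
}

variable "tidb_count" {
  default = 2
}
```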

@gregwebs (Contributor) commented Jul 3, 2019

> Then the state of the top-level module will be stored in /deploy/aws/, and the state of custom modules will be stored in, say, /deploy/aws/staging.

Yes, but the user has already made edits to the top-level module directly. This will end up affecting deploy/aws/staging, which will cause problems. The user now needs to move their top-level usage to the environment-based approach, but I don't even know how to do this without destroying the existing top-level environment first.

@aylei (Contributor, Author) commented Jul 3, 2019

I mean, the top-level module is indeed an example of composing the sub-modules. Whether this module is located in /deploy/aws or /deploy/aws/default makes no difference in usage, except that the top-level one may seem special because it is located in the ... top level.

@aylei (Contributor, Author) commented Jul 3, 2019

Let me demonstrate, to check whether I understand you correctly now:

Maybe a better UX is moving the VPC setup to a separate sub-module, and moving the current top-level module to the ./default directory. Then we will have the following directory layout:

./tidb-cluster
./tidb-operator
./vpc
./default

Then:

  • if the users want to provision more TiDB clusters in the default module, they edit the tf scripts in that module;
  • if the users want to provision a new EKS cluster and TiDB clusters in it, they create a new module like staging:
./tidb-cluster
./tidb-operator
./vpc
./default
./staging

Does this make sense?
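
As a hedged sketch of what such a staging module might contain under this layout; the module wiring and variable names below are assumptions for illustration:

```
# Hypothetical deploy/aws/staging/main.tf composing the proposed sub-modules.
module "vpc" {
  source = "../vpc"
}

module "tidb_operator" {
  source = "../tidb-operator"
  # assumed wiring: the operator module consumes the VPC outputs
  vpc_id  = module.vpc.vpc_id
  subnets = module.vpc.subnets
}

module "staging_cluster" {
  source       = "../tidb-cluster"
  cluster_name = "staging"
}
```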

@aylei (Contributor, Author) commented Jul 3, 2019

I'd like to move the discussion about UX and the recommended usage of modules to a new issue, @gregwebs @tennix what do you think?

@tennix (Member) commented Jul 3, 2019

Agreed, I think we can merge this PR now and improve the UX in follow-up PRs.

@tennix (Member) left a comment

LGTM

@gregwebs (Contributor) commented Jul 3, 2019

@aylei yes, that is a good module organization.
