This small project demonstrates how to set up a load-balanced environment in Google Kubernetes Engine that runs a web application together with its database backend, step by step right from the start.
It looks like a simple case, but believe me, there were quite a few unexpected obstacles along the way...
You need an account on the Google Cloud Platform (obviously), and you should create a project that will enclose all the resources and entities we'll create (and separate them from your other things).
Then create a Service Account within this project (Menu / IAM & Admin / Service Accounts), and generate a private key for this account that our mechanisms will use later: click the three-dot icon to the right of the service account, choose 'Create key' and save the file, e.g. as service_account.json.
Then grant this account the roles it needs in your project:
- Go to IAM & Admin / IAM
- Choose the service account, click its 'Edit' on the right
- 'Add another role', choose 'Kubernetes Engine' / 'Kubernetes Engine Admin'
- 'Add another role', choose 'Service Accounts' / 'Service Account User'
- 'Add another role', choose 'Storage' / 'Storage Admin'
- Save
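If you prefer the CLI to the web console, the same role bindings can be added with gcloud; this is just a sketch, and the service account name 'lb-demo-sa' is a made-up example:
SA="lb-demo-sa@networksandbox-232012.iam.gserviceaccount.com"   # example name, use your own
gcloud projects add-iam-policy-binding networksandbox-232012 --member="serviceAccount:${SA}" --role="roles/container.admin"
gcloud projects add-iam-policy-binding networksandbox-232012 --member="serviceAccount:${SA}" --role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding networksandbox-232012 --member="serviceAccount:${SA}" --role="roles/storage.admin"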
Then transfer that service_account.json to this machine and tell the gcloud CLI to use it:
gcloud auth activate-service-account --key-file=service_account.json
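If later gcloud commands complain about a missing default project, set it too (the project ID is the one used throughout this demo):
gcloud config set project networksandbox-232012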
Then you can check the result (gcloud info), or actually test whether it indeed works: gcloud container clusters list
If you got error messages, then something is still wrong, but an empty list is completely normal if you don't have any clusters created yet. (We'll change that soon :) ...)
We'll need to manipulate Docker images, so Docker must be installed, enabled and started.
The docker package in the CentOS repo is way too obsolete (as of now: 1.13.1), so install the latest stable (as of now: 19.03.1) from the Docker repo instead.
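A minimal sketch of that installation on CentOS 7 (the package names are the standard Docker CE ones; versions will drift over time):
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable --now docker   # enable and start the daemon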
Packages needed:
- docker, kubectl
- python2-google-auth, python2-libcloud, python2-crypto, python2-openshift, python-netaddr, python-docker-py
A cluster consists of a bunch of hosts that run all the Kubernetes stuff, so first we'll need some instances to do this.
Fortunately we don't need to do this right from the instance level, because the GKE infrastructure will do all this for us:
- Provisioning the instances
- Choosing an already Kubernetes-aware OS image for them
- Configuring the cluster store, etc.
- Registering the new instances as parts of the cluster
All we need is to define the parameters of the instances and of the cluster:
For the instances:
- Machine type
- Disk size
For the cluster:
- Initial number of nodes
- Minimal / maximal number of nodes (if we want autoscaling)
- Whether we want auto-repair and auto-upgrade functionality
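Just to see how these parameters map to the plain gcloud CLI, where all of this could even be done in one command (the autoscaling limits below are made up for the example; the Ansible route needs the workaround described below):
gcloud container clusters create lb-demo-cluster --zone=europe-west1-c \
    --machine-type=f1-micro --disk-size=10 --num-nodes=4 \
    --enable-autoscaling --min-nodes=2 --max-nodes=6 \
    --no-enable-autorepair --no-enable-autoupgrade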
It's that simple! Well, almost...
In addition to the Cluster there is another entity: the Node Pool. As its name suggests, it contains (and manages) a bunch of nodes, so actually all those 'for the cluster' parameters belong to a Node Pool, and such Node Pools (note the plural) belong to a Cluster. In fact, the Cluster adds very little extra to those Pools...
And there is a small limitation here that causes some inconvenience for us.
Creating a Cluster involves storing a lot of information in its Pools, so we can't create a Cluster without a (default) Pool. But as of the current state of the Ansible module gcp_container_cluster, not all Pool parameters can be configured through it. And we can't create the Pool first, without the Cluster, so we've got a chicken-and-egg problem here.
As of now, the only solution seems to be:
- Create the Cluster with a minimal Pool
- Create a new Pool with all the features, and assign it to the Cluster
- Dispose of that first, implicitly created minimal 'default' Pool
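Expressed as plain gcloud commands, the sequence would look like this (only an illustration, the pool name is a made-up example; the playbook does the equivalent through the Ansible GCP modules):
# 1. cluster with a minimal implicit pool (1 x g1-small)
gcloud container clusters create lb-demo-cluster --zone=europe-west1-c --num-nodes=1 --machine-type=g1-small --disk-size=10
# 2. the real pool with all the features
gcloud container node-pools create lb-demo-pool --cluster=lb-demo-cluster --zone=europe-west1-c --num-nodes=4 --machine-type=f1-micro --disk-size=10
# 3. drop the implicitly created 'default-pool'
gcloud container node-pools delete default-pool --cluster=lb-demo-cluster --zone=europe-west1-c --quiet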
This may change in the future, as the Cluster API documentation marks the nodeConfig and initialNodeCount fields of the Cluster as deprecated, and recommends specifying the initial Pool as part of the Cluster specification (nodePools[]), with the same full specification model as any Pools that we would add afterwards.
So this is an Ansible limitation: the latest changes of the GKE API haven't yet been tracked in the corresponding module gcp_container_cluster.
(Others have also run into this problem, only at that time the new GKE API may not have existed yet.)
As for that initial 'minimal pool', GKE also has a peculiar restriction: if we chose the smallest machine type (f1-micro), it would require at least 3 of them to form a Node Pool, so we have to choose the next smallest (g1-small), of which one is enough.
If the script cannot dump the cluster services but returns a terse Unauthorized error, then check on the web console whether the cluster could indeed start up, or whether it is stuck in a 'Pods unschedulable' state.
If we drop a node pool while an auto-upgrade is in progress, we'll get that 'Pods unschedulable' state above. That's why the auto-upgrade option is disabled in the creator playbook.
We have an Ansible playbook for that: ./run.sh 0_create_cluster.yaml
When we want to manage the cluster manually, we'll use the CLI tool kubectl, which needs access to the cluster, so we must tell it to ask gcloud for credentials.
This information must be described in ~/.kube/config in a rather elaborate syntax, but gcloud can do that for us:
gcloud container clusters get-credentials --zone=europe-west1-c lb-demo-cluster
Then, to check that we can actually access the cluster: kubectl cluster-info
NOTE: This ~/.kube/config is only needed for kubectl, as the playbooks access and use the credentials directly.
First of all, we need a persistent disk that will store all our data, and it shall be initialised to contain the (empty) database files.
- Create the disk
- Create a temporary VM, attach the disk, create filesystem on it
- Shut down and destroy the VM
This is done by 1_create_db_disk.yaml.
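Done by hand, the same would look roughly like this (a sketch; the disk and VM names are just examples, the ext4 filesystem is an assumption, and the playbook parametrises all of these):
gcloud compute disks create lb-demo-db-disk --zone=europe-west1-c --size=10GB
gcloud compute instances create disk-formatter --zone=europe-west1-c --machine-type=f1-micro --disk=name=lb-demo-db-disk,device-name=dbdisk
gcloud compute ssh disk-formatter --zone=europe-west1-c --command="sudo mkfs.ext4 /dev/disk/by-id/google-dbdisk"
gcloud compute instances delete disk-formatter --zone=europe-west1-c --quiet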
- Generate random passwords for the (soon to be created) root and app users of the DB
- Deploy the MariaDB to the cluster (see the embedded resource definition in the .yaml)
- Create the MariaDB service
- Check out the App sources (to have the DB setup .sql scripts)
- Customise the DB setup scripts and the Frontend configs
- Remotely access the DB service and execute the DB setup scripts
This is done by 2_mariadb_deployment.yaml
NOTE: It may take some time (even minutes) until the service gets its external IP assigned; see the output of kubectl get service mariadb-server
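For a quick manual cross-check of what the playbook sets up, the deployment-plus-service part could be approximated from the CLI like this (a rough sketch only; the real playbook uses an embedded resource definition that also wires in the persistent disk, while here the data would live only inside the pod):
kubectl create deployment mariadb-server --image=mariadb
kubectl set env deployment/mariadb-server MYSQL_ROOT_PASSWORD="$(<db.root.password)"
kubectl expose deployment mariadb-server --type=LoadBalancer --port=3306
kubectl get service mariadb-server --watch   # wait until EXTERNAL-IP changes from <pending>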
The randomly generated passwords are stored in the local files db.root.password and db.user.password, so if we want to check the DB service manually, we can connect to it:
mysql -u root --password="$(<db.root.password)" -h <the service external IP address>
The frontend is a standalone web application with our specific code and content, so we must create an image for it.
To push a local image to the GCR registry, the officially recommended way is to do it via docker:
- Create our local image
- Add a tag that refers to our project registry:
docker tag lb_demo_frontend eu.gcr.io/networksandbox-232012/lb_demo_frontend
- Push the image:
docker push eu.gcr.io/networksandbox-232012/lb_demo_frontend
Pushing needs some credentials, so the documentation recommends configuring docker to use gcloud as a credential store: gcloud auth configure-docker --quiet, but DON'T do it yet. It wouldn't work as expected, so we'll have a workaround and therefore we won't need it.
(Btw, this command would just create/update ~/.docker/config.json with { "credHelpers": { "gcr.io": "gcloud", ... } }, so it doesn't deal with the credentials themselves, it only configures how to get them.)
We are working as a plain, non-root user, so just saying docker whatever will only give us error messages about not being able to write /var/run/docker.sock.
There is a doc on how to make Docker accessible to non-root users, and there is another doc about why not to do that.
Starting from CentOS 7, the OS-supported docker packages follow the 'root-only' discipline and say that if you want to make docker available to non-root users, you should configure sudo for them, because that is at least audited, while the implicit (and needed) root-exec abilities of docker aren't.
At first glance there isn't much difference between docker whatever and sudo docker whatever, and for those commands that don't access the GCR repo (e.g. pull, build, tag) it indeed works.
For push, however, it doesn't, because the credential handling doesn't work across sudo.
If we told gcloud to tell docker to use gcloud as the credential source (that gcloud auth configure-docker above), it would create/update the .docker/config.json in our home folder.
Then when we said sudo docker push ..., the docker command would run as root and use the home folder of root, and thus wouldn't even consider our config.json.
This could be circumvented by telling docker where to look for its config: sudo docker --config "$HOME/.docker" push ..., and it almost works. Almost...
The docker client can connect to the socket, talk to the docker daemon, prepare the images to send, but then it fails:
denied: Token exchange failed for project 'networksandbox-232012'. Caller does not have permission 'storage.buckets.get'. To configure permissions, follow instructions at: https://cloud.google.com/container-registry/docs/access-control
The problem is this:
- docker push is running as root
- It starts talking with the GCR server, and reaches the authentication phase
- It calls out to gcloud for credentials, still as root
- So gcloud is also running as root
- It tries to get its active config and such information from ~/.config/gcloud/
- That refers to the home folder of root, and not to ours
- There certainly are no credentials there for our logged-in service account
(It took about two hours of debugging the scripts and tracing docker with strace, but I finally managed to catch it :D !)
What docker actually does when asking gcloud for credentials is something like this: echo "https://eu.gcr.io" | gcloud auth docker-helper get, and it expects the credentials in .json format: { "Secret": "...", "Username": "_dcgcloud_token" }
That is the username and password docker uses to access the server eu.gcr.io, so we may as well log in there with these credentials: sudo docker login -u "_dcgcloud_token" -p "..." eu.gcr.io
The token has a limited validity period, so the login should be performed right before the sudo docker push ..., but then all the sudo docker ... commands work fine, and finally we can log out as well: sudo docker logout https://eu.gcr.io
NOTE: This token is the same as that in the structure returned by gcloud config config-helper --format=json, so the playbook uses that one.
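Wired together in shell, the fresh-token login could look like this (a sketch that assumes the jq tool is available for the JSON parsing; the playbook extracts the token with Ansible's own JSON handling instead):
TOKEN="$(gcloud config config-helper --format=json | jq -r '.credential.access_token')"
echo "$TOKEN" | sudo docker login -u _dcgcloud_token --password-stdin eu.gcr.io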
So, the process goes like this:
- Create the frontend image (see details below)
- Add a tag that refers to our project registry:
sudo docker tag lb_demo_frontend eu.gcr.io/networksandbox-232012/lb_demo_frontend
- Log out of any previous logins:
sudo docker logout eu.gcr.io
- Get a fresh token:
echo "https://eu.gcr.io" | gcloud auth docker-helper get
- Login to the server:
sudo docker login -u "_dcgcloud_token" -p "..." eu.gcr.io
- Push the image:
sudo docker push eu.gcr.io/networksandbox-232012/lb_demo_frontend
- Logout:
sudo docker logout eu.gcr.io
This is done by 3_create_frontend_image.yaml, and then its result can be checked:
gcloud container images list --repository=eu.gcr.io/networksandbox-232012
The standalone docker command would be sudo docker image build -t frontend:latest .
Of course, it can be integrated into our playbooks, so this is also in 3_create_frontend_image.yaml.
Having reached this point, no surprises are left. With the DB we have already deployed pods to the cluster (using an image) and created a service to make it available, and this step is even simpler (it needs no customisation of files, etc.):
4_frontend_deployment.yaml
And now we have the cluster up and running, at full functionality :D !
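A quick smoke test (the frontend service name isn't fixed here, check kubectl get services for the actual one):
kubectl get services   # note the EXTERNAL-IP of the frontend service
curl -s http://<the frontend service external IP address>/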
- Creating the cluster with the implicit pool (1 g1-small with a 10GB disk, no autoscaling, no auto-repair, no auto-upgrade): 2.5 minutes
- Creating the real pool (4 f1-micro instances with 10GB disks): 5.5 minutes
- Deleting the implicit pool: 2 minutes
It seems that the time overhead caused by the implicit pool (creating and destroying it) is about 3 minutes. That's not negligible, but not insufferable either.
There is a bug; it was fixed and the fix was merged on Jun 5, 2019. '2.8.1' was released on Jun 7 but doesn't contain the fix, and neither does '2.8.2', so it'll probably land in '2.8.3'.
Until then, apply this workaround.
It's a .yaml file with the following information, where fullname = "gke_{{ project_id }}_{{ zone }}_{{ cluster_name }}":
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: "{{ cluster.masterAuth.clusterCaCertificate }}"
    server: "https://{{ cluster.endpoint }}"
  name: "{{ fullname }}"
contexts:
- context:
    cluster: "{{ fullname }}"
    user: "{{ fullname }}"
  name: "{{ fullname }}"
current-context: "{{ fullname }}"
kind: Config
preferences: {}
users:
- name: "{{ fullname }}"
  user:
    auth-provider:
      config:
        cmd-args: config config-helper --format=json
        cmd-path: /usr/lib64/google-cloud-sdk/bin/gcloud
        expiry-key: '{.credential.token_expiry}'
        token-key: '{.credential.access_token}'
      name: gcp
So the actual auth credentials are provided by gcloud config config-helper --format=json, which produces:
{
  "configuration": {
    "active_configuration": "default",
    "properties": {
      "core": {
        "account": "...@....",
        "disable_usage_reporting": "True",
        "project": "networksandbox-232012"
      }
    }
  },
  "credential": {
    "access_token": "...",
    "token_expiry": "2019-07-23T10:54:19Z"
  },
  "sentinels": {
    "config_sentinel": ".../.config/gcloud/config_sentinel"
  }
}
That access_token is the same as the one returned by echo "https://gcr.io" | gcloud auth docker-helper get