
Statefulsets for cinder: allow multi-AZ deployments, spread pods across zones #44798

Merged
merged 1 commit on May 9, 2017

Conversation

zetaab
Member

@zetaab zetaab commented Apr 22, 2017

What this PR does / why we need it: Currently, if we do not specify an availability zone in the cinder storageclass, the volume is provisioned to a zone called nova. However, as mentioned in the linked issue, we have a situation where we want to spread a StatefulSet across 3 different zones. Currently this is not possible with StatefulSets and the cinder storageclass. In this new solution, if we leave the zone empty, the algorithm chooses the zone for the cinder volume in a similar style to the AWS and GCE storageclass solutions.

Which issue this PR fixes: fixes #44735

Special notes for your reviewer:

example:

```
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: all
provisioner: kubernetes.io/cinder
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: galera
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
  clusterIP: None
  selector:
    app: mysql
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      containers:
      - name: mysql
        image: adfinissygroup/k8s-mariadb-galera-centos:v002
        imagePullPolicy: Always
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        volumeMounts:
        - name: storage
          mountPath: /data
        readinessProbe:
          exec:
            command:
            - /usr/share/container-scripts/mysql/readiness-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
  volumeClaimTemplates:
  - metadata:
      name: storage
      annotations:
        volume.beta.kubernetes.io/storage-class: all
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 12Gi
```

If this example is deployed, it will automatically create one replica per AZ. This helps us a lot in building HA databases.

The current cinder storageclass is not ideal for StatefulSets. Let's assume the cinder storageclass is defined to be in a zone called nova; because labels are not added to the PV, pods can be started in any zone. The problem is that, at least in our OpenStack, it is not possible to use a cinder volume located in zone x from zone y. However, should we have the possibility to choose between cross-zone cinder mounts or not? In my opinion it is not a good way of doing things to mount a volume from a zone other than the one where the pod is located (it means more network traffic between zones). What do you think? The new solution does not allow that anymore (should we have the possibility to allow it? That would mean removing the labels from the PV).
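For illustration only (not from the PR text): after this change a dynamically provisioned PV carries the zone label, which is what keeps the pod in the same zone as its volume. The name, volume ID and zone value below are hypothetical; the label key is the failure-domain zone label used at the time.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example                                  # hypothetical PV name
  labels:
    # zone label added for the provisioned volume; the scheduler keeps the
    # pod that uses this PV in the same zone
    failure-domain.beta.kubernetes.io/zone: zone-1   # hypothetical zone
spec:
  capacity:
    storage: 12Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  cinder:
    volumeID: 11111111-2222-3333-4444-555555555555   # hypothetical
    fsType: ext4
```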

There might still be some things that need to be fixed in this change, and I need help with that. Some parts of the code are not perfect.

Issues I am thinking about (I need some help with these):

  1. Can everybody see in OpenStack which AZ their servers are in? Could there be an access policy that hides that? If the AZ is not found in the server specs, I have no idea how the code behaves.
  2. In the GetAllZones() function, is it really necessary to create a new service client with openstack.NewComputeV2, or could I somehow reuse the existing one?
  3. This fetches all servers from an OpenStack tenant (project). However, in some cases Kubernetes may be deployed only to specific zones. If the kube servers are located, for instance, in zone-1 and there are other servers in the same tenant in zone-2, a cinder volume might be provisioned to zone-2 even though no pod can start there, because Kubernetes has no nodes in zone-2. Could we have a better way to fetch the Kubernetes nodes' zones? Currently that information is not added to Kubernetes node labels automatically in OpenStack (I think it should be). I have added those labels manually to the nodes. If that zone information is not added to the nodes, the new solution does not start stateful pods at all, because it cannot place them.

cc @rootfs @anguslees @jsafrane

Default behaviour in the cinder storageclass is changed: if availability is not specified, the zone is chosen by an algorithm. This makes it possible to spread stateful pods across many zones.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 22, 2017
@k8s-reviewable

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 22, 2017
@k8s-ci-robot
Contributor

Hi @zetaab. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. release-note-label-needed labels Apr 22, 2017
@zetaab
Member Author

zetaab commented Apr 22, 2017

Actually I was wrong; the cloud provider does add the zone labels nowadays:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L284
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/openstack/openstack.go#L488

So in the GetAllZones() function, should I instead filter through the nodes and check those labels, rather than calling the OpenStack API? If so, which library should I use, and how?
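Not what the PR does today (it lists Nova servers), but a rough sketch of the node-label idea asked about here, written against a recent client-go for readability; the package name and error handling are illustrative only.

```
package cinderzones

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// zoneLabel is the well-known zone label the cloud provider sets on nodes.
const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

// zonesFromNodeLabels collects the distinct availability zones of the cluster's
// nodes from their labels instead of asking the Nova API for every server in
// the tenant, so only zones that actually contain Kubernetes nodes are returned.
func zonesFromNodeLabels(client kubernetes.Interface) (map[string]struct{}, error) {
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	zones := make(map[string]struct{})
	for _, node := range nodes.Items {
		if zone, ok := node.Labels[zoneLabel]; ok && zone != "" {
			zones[zone] = struct{}{}
		}
	}
	if len(zones) == 0 {
		return nil, fmt.Errorf("no nodes carry the %s label", zoneLabel)
	}
	return zones, nil
}
```

Inside a volume plugin the clientset would come from the plugin host (e.g. its GetKubeClient()) rather than being constructed separately.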

@thockin thockin assigned rootfs and unassigned thockin and roberthbailey Apr 25, 2017
@zetaab
Member Author

zetaab commented Apr 25, 2017

@liggitt could you take a look, it looks like rootfs has been away for a while.

@liggitt
Member

liggitt commented Apr 25, 2017

> @liggitt could you take a look, it looks like rootfs has been away for a while.

I'm not the right person to review this... I'd suggest some intersection of @kubernetes/sig-storage-pr-reviews and @kubernetes/sig-openstack-pr-reviews

@liggitt liggitt added area/provider/openstack Issues or PRs related to openstack provider sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Apr 25, 2017
@0xmichalis
Contributor

> I'm not the right person to review this... I'd suggest some intersection of @kubernetes/sig-storage-pr-reviews and @kubernetes/sig-openstack-pr-reviews

And @kubernetes/sig-apps-api-reviews since this affects statefulsets

@rootfs
Contributor

rootfs commented Apr 27, 2017

Leaving the default zone as nova is a problem. AWS/GCE use a heuristic that hashes the PVC name to a zone. Cinder could use the same idea for this issue.
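For readers who have not seen that heuristic, here is a deliberately simplified sketch of the idea: hash the claim's base name to pick a starting zone and add the StatefulSet ordinal so consecutive replicas land in consecutive zones. The real logic is ChooseZoneForVolume in the Kubernetes volume code; the hashing details below are illustrative, not the upstream implementation.

```
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneSketch illustrates the PVC-name hashing idea: hash the claim's
// base name to pick a starting zone, and for StatefulSet claims such as
// "storage-mysql-0", "storage-mysql-1", ... add the trailing ordinal so
// consecutive replicas land in consecutive zones.
func chooseZoneSketch(zones []string, pvcName string) string {
	sort.Strings(zones) // deterministic zone order

	base, ordinal := pvcName, 0
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			base, ordinal = pvcName[:i], n
		}
	}

	h := fnv.New32a()
	h.Write([]byte(base))
	return zones[(h.Sum32()+uint32(ordinal))%uint32(len(zones))]
}

func main() {
	zones := []string{"zone-1", "zone-2", "zone-3"}
	for i := 0; i < 3; i++ {
		claim := fmt.Sprintf("storage-mysql-%d", i)
		fmt.Println(claim, "->", chooseZoneSketch(zones, claim))
	}
}
```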

@rootfs
Contributor

rootfs commented Apr 27, 2017

also see #41598

@zetaab
Member Author

zetaab commented Apr 27, 2017

Hi @rootfs, this PR actually uses the same heuristic that is used in AWS/GCE. So if you do not specify an AZ for the cinder storageclass, it uses that same algorithm to choose the zone. By using that algorithm and labeling the PVs, we can spread pods across multiple zones.

@rootfs
Contributor

rootfs commented Apr 27, 2017

@zetaab I see it now; cinder_util.go is modified twice, in the 1st and 3rd commits. Can you merge these two commits?

Please also use camel case for consistency.

@kubernetes/sig-openstack-pr-reviews please review gophercloud update.

@smarterclayton
Contributor

@justinsb for drive by on az spreading of volumes

@kow3ns
Member

kow3ns commented Apr 27, 2017

@rootfs @justinsb It uses the same function, ChooseZoneForVolume, that the GCE and AWS cloud providers use to implement the round-robin balancing. It will have, for better or worse, the same benefits and limitations.
@zetaab If your volumes are zonal resources, be aware of the scheduling implications this has with respect to spreading, affinity, and anti-affinity (a concrete illustration follows below).
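To illustrate that interaction (not part of this PR): a hard zone anti-affinity rule on the pod template uses the same zone topology key as the PV label, so if the volume's zone and the anti-affinity constraint cannot both be satisfied, the pod simply stays Pending. The snippet below is illustrative, using the zone label key of that era.

```
# inside the StatefulSet pod template spec
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: mysql
      topologyKey: failure-domain.beta.kubernetes.io/zone
```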

@zetaab
Member Author

zetaab commented Apr 28, 2017

@rootfs The 1st and 3rd commits are now merged. I did not fully get where I am missing camel case?

@zetaab
Member Author

zetaab commented Apr 28, 2017

@kow3ns Yes, the volumes are zonal resources, and that is why I added labels to the PVs. After the PV has a zone label, the pod is automatically provisioned in the same zone (I tested it by shutting down all nodes in a zone, and after that the pod could not be provisioned because there were no available nodes with those labels).

The bad thing about the current (pre-PR) behaviour is that dynamically provisioned PVs do not get labels with cinder. In our case that means a PV can be provisioned in, for example, zone x, while the pod itself is started in zone y. Because our cinder is a zonal resource, that does not work at all. Cross-zone attach can be allowed in cinder, so in some use cases people can use cinder resources from another zone. In my opinion that is a bad way of doing things; it creates unnecessary network traffic between zones.

If people really want to provision a PV in zone x and run the pod in zone y, they would need to remove the PV labels. But as I see it, that is a really bad way of doing things. What do you think? Should we have a way to configure whether the code adds PV labels or not?

```
glog.V(2).Infof("Successfully created cinder volume %s", name)
return name, volSizeGB, nil

volume_id, volume_az, errr := cloud.CreateVolume(name, volSizeGB, vtype, availability, c.options.CloudTags)
```
Contributor

nit: volumeId, volumeAZ

@rootfs
Contributor

rootfs commented Apr 28, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 28, 2017
@rootfs
Contributor

rootfs commented Apr 28, 2017

Cinder fix lgtm.

Anybody from @kubernetes/sig-openstack-pr-reviews want to check gophercloud change?

@xsgordon @anguslees

@k8s-github-robot k8s-github-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 28, 2017
@zetaab
Member Author

zetaab commented May 4, 2017

@pospispa I can update it, it should be something like:

availability: Availability Zone. If not specified, the zone is chosen by an algorithm from the zones where the Kubernetes cluster has a node.

zetaab/kubernetes.github.io@b5b384d

I will do PR after this is merged
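For comparison with the StorageClass at the top of this PR, which omits the parameter, pinning volumes to a single zone still works by setting availability explicitly; the zone name below is just an example.

```
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: cinder-zone-1
provisioner: kubernetes.io/cinder
parameters:
  availability: zone-1
```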

pospispa referenced this pull request in zetaab/kubernetes.github.io May 4, 2017
zetaab referenced this pull request in zetaab/kubernetes.github.io May 4, 2017
update aws+gce availability texts, the algorithm is not random
@dims
Member

dims commented May 5, 2017

/lgtm

@dims
Member

dims commented May 5, 2017

/assign @jsafrane @mikedanese

@mikedanese
Member

/approve

@zetaab
Member Author

zetaab commented May 8, 2017

@jsafrane could you please approve

@jsafrane
Member

jsafrane commented May 9, 2017

/approve

@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, jsafrane, mikedanese, pospispa, zetaab

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2017
@k8s-github-robot

Automatic merge from submit-queue

@pospispa
Contributor

@zetaab please, will you create the corresponding documentation PR for this PR?

@zetaab
Member Author

zetaab commented May 10, 2017

@pospispa it's done already: kubernetes/website#3693

pospispa pushed a commit to pospispa/origin that referenced this pull request May 12, 2017
…in Storage Class is not Configured

Backport of Kubernetes PR #44798 (kubernetes/kubernetes#44798).

In case the availability parameter is not configured in a cinder Storage Class, the cinder volume is always provisioned in the nova availability zone. That is incorrect.

Now the cinder volume is provisioned in a zone chosen by an algorithm from the set of zones available in the cluster.

Positive side effect: cinder volumes for individual pods in a StatefulSet are provisioned in distinct zones. This increases the StatefulSet's resilience.
deads2k pushed a commit to deads2k/kubernetes that referenced this pull request May 22, 2017
…in Storage Class is not Configured

Files changed:
pkg/cloudprovider/providers/openstack/openstack_test.go
pkg/cloudprovider/providers/openstack/openstack_volumes.go
pkg/cloudprovider/providers/rackspace/rackspace.go
pkg/volume/cinder/BUILD
pkg/volume/cinder/attacher_test.go
pkg/volume/cinder/cinder.go
pkg/volume/cinder/cinder_test.go
pkg/volume/cinder/cinder_util.go
smarterclayton pushed a commit to smarterclayton/kubernetes that referenced this pull request May 23, 2017
…in Storage Class is not Configured

deads2k pushed a commit to deads2k/kubernetes that referenced this pull request Jun 7, 2017
…in Storage Class is not Configured

deads2k pushed a commit to openshift/kubernetes that referenced this pull request Jun 7, 2017
…in Storage Class is not Configured

dims pushed a commit to dims/kubernetes that referenced this pull request Feb 8, 2018
Automatic merge from submit-queue

Statefulsets for cinder: allow multi-AZ deployments, spread pods across zones

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/provider/openstack: Issues or PRs related to openstack provider.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Openstack cinder storageclass should support also random zone names.