Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Prototype] Batch job admission and flexible resource allocation #1

Closed
wants to merge 3 commits into from

Conversation

k82cn
Copy link
Owner

@k82cn k82cn commented May 16, 2017

@timothysc
Copy link

thx, I'll run through it today.

Copy link

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it still very much a WIP so I stopped after a certain point.

General take aways:

  1. Should probably find an easier way to reuse the Job resource.
  2. Should be able to stand on it's own and be built as a set of add-on extensions (containers).

"k8s.io/kubernetes/pkg/api"
)

// +genclient=true

// QueueJob represents the configuration of a QueueJob.
type QueueJob struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're not intending to change the main API right?
This is either a candidate for api-aggregation of TPRs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that queues can grow unbounded and be several times the capacity of the system. All of this data for standard API's is stored in etcd, which on unbounded queues would OOM and bring down a cluster.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree; maybe we can make "queuejob" small enough to contains more jobs in the queue.

"k8s.io/kubernetes/pkg/api/v1"
)

// +genclient=true

// QueueJob represents the configuration of a QueueJob.
type QueueJob struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a separate object from existing Jobs? Why couldn't we slightly augment the existing Jobs to meet be reusable? I remember this coming up in conversation or the doc, but I don't recall the reason.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the design doc, we'd like to have a wrapper object for all API objects. But re-think it today, maybe convert Job, RC, RS into an internal object seems OK.

}

// QueueJobStatus is xxx
type QueueJobStatus struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't know why this is necessary.

limitations under the License.
*/

package app

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you build this into the repo vs. a separate component?
There is little chance that folks would allow another controller into the main repo.

See: https://github.com/heptio/eventrouter on example for stand-alone controllers.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, good point; it seems controller-manager is the component to handle this.

}

func (bjc *BatchJobControllerOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringVarP(&bjc.ApiServer, "api-server", "m", "localhost:8080", "The address of api-server.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's run as a separate introspective component you don't need this, see other example.

@k82cn k82cn changed the title [WIP][Prototype] Batch job admission and flexible resource allocation [Prototype] Batch job admission and flexible resource allocation May 23, 2017
@k82cn k82cn force-pushed the k8s_batchjob branch 9 times, most recently from 3c82edd to 60fd24d Compare May 24, 2017 13:39
k82cn pushed a commit that referenced this pull request May 26, 2017
…mance

Automatic merge from submit-queue (batch tested with PRs 38505, 41785, 46315)

Only retrieve relevant volumes

**What this PR does / why we need it**:

Improves performance for Cinder volume attach/detach calls. 

Currently when Cinder volumes are attached or detached, functions try to retrieve details about the volume from the Nova API. Because some only have the volume name not its UUID, they use the list function in gophercloud to iterate over all volumes to find a match. This incurs severe performance problems on OpenStack projects with lots of volumes (sometimes thousands) since it needs to send a new request when the current page does not contain a match. A better way of doing this is use the `?name=XXX` query parameter to refine the results.

**Which issue this PR fixes**:

kubernetes#26404

**Special notes for your reviewer**:

There were 2 ways of addressing this problem:

1. Use the `name` query parameter
2. Instead of using the list function, switch to using volume UUIDs and use the GET function instead. You'd need to change the signature of a few functions though, such as [`DeleteVolume`](https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/cinder/cinder.go#L49), so I'm not sure how backwards compatible that is.

Since #1 does effectively the same as #2, I went with it because it ensures BC.

One assumption that is made is that the `volumeName` being retrieved matches exactly the name of the volume in Cinder. I'm not sure how accurate that is, but I see no reason why cloud providers would want to append/prefix things arbitrarily. 

**Release note**:
```release-note
Improves performance of Cinder volume attach/detach operations
```
@k82cn k82cn force-pushed the k8s_batchjob branch 5 times, most recently from b76fb9d to 2b4c82d Compare June 1, 2017 08:15
@k82cn k82cn force-pushed the k8s_batchjob branch 5 times, most recently from 917a296 to e3fa564 Compare June 3, 2017 12:03
@k82cn
Copy link
Owner Author

k82cn commented Jun 5, 2017

@davidopp, @erictune, @foxish, @timothysc

just a quick update on batchjob proposal:

  1. Did a change based on the design: I used "JobQuota" instead of the wrapper object "QueueJob"
  • for policy, it only need to know the resource requirements; and we also need to parse QueueJob's sub API object to get that, for example, if a RC in QueueJob, we need to get RC's replicas (both creation and scale up)
  • update job controller to create JobQuota when job created; the customer can create any object/TPR themselves, just create a JobQuota to share the resources
  1. Currently, the overall code of controller is there; fixing bugs to make it work. Please help to review the interface of policy
  2. I'm thinking how to add Admission to avoid order dependency, e.g. ResourceQuotaAdmission should be the latest Admission

I'd like to demo it in sig meeting when it's ready; if any comments, please let me know.

k82cn pushed a commit that referenced this pull request Jun 16, 2017
Automatic merge from submit-queue (batch tested with PRs 47523, 47438, 47550, 47450, 47612)

Move slow PV test to slow suite.

See [testgrid](https://k8s-testgrid.appspot.com/google-gce#gce&width=5&graph-metrics=test-duration-minutes).

#1
@k82cn
Copy link
Owner Author

k82cn commented Aug 8, 2017

Close this PR because of doc changed; but we'll re-use part of this, e.g. DRF.

@k82cn k82cn closed this Aug 8, 2017
k82cn pushed a commit that referenced this pull request Feb 16, 2018
k82cn pushed a commit that referenced this pull request Feb 23, 2018
k82cn pushed a commit that referenced this pull request Jun 23, 2018
Adding recent upstream changes to k8s.
k82cn pushed a commit that referenced this pull request Sep 26, 2018
k82cn pushed a commit that referenced this pull request Jan 28, 2019
update from kubernetes master
k82cn pushed a commit that referenced this pull request Feb 14, 2019
update from base repository
k82cn pushed a commit that referenced this pull request Apr 6, 2020
k82cn pushed a commit that referenced this pull request Mar 10, 2021
Sharing the same connection for multiple streams should have worked,
but ran into unexpected timeouts:

I0227 08:07:49.754263   80029 portproxy.go:109] container "mock" in pod csi-mock-volumes-4037-2061/csi-mockplugin-0 is running
E0227 08:07:49.779359   80029 portproxy.go:178] prepare forwarding csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: dialer failed: unable to upgrade connection: pod not found ("csi-mockplugin-0_csi-mock-volumes-4037-2061")
I0227 08:07:50.782705   80029 portproxy.go:109] container "mock" in pod csi-mock-volumes-4037-2061/csi-mockplugin-0 is running
I0227 08:07:50.809326   80029 portproxy.go:125] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: starting connection polling
I0227 08:07:50.909544   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #0, 0 open
I0227 08:07:50.912436   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #0
I0227 08:07:50.912503   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #0
I0227 08:07:50.913161   80029 portproxy.go:322] forward connection #0 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
E0227 08:07:50.913324   80029 portproxy.go:242] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: an error occurred connecting to the remote port: error forwarding port 9000 to pod 66662ea1ab30b4193dac0102c49be840971d337c802cc0c8bbc074214522bd13, uid : failed to execute portforward in network namespace "/var/run/netns/cni-c15e4e36-dad9-8316-c301-33af9dad5717": failed to dial 9000: dial tcp4 127.0.0.1:9000: connect: connection refused
I0227 08:07:50.913371   80029 portproxy.go:340] forward connection #0 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:07:50.913487   80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I0227 08:07:51.009519   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #1, 0 open
I0227 08:07:51.011912   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #1
I0227 08:07:51.011973   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #1
I0227 08:07:51.013677   80029 portproxy.go:322] forward connection #1 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:07:51.013720   80029 portproxy.go:340] forward connection #1 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:07:51.013794   80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
E0227 08:07:51.017026   80029 portproxy.go:242] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: an error occurred connecting to the remote port: error forwarding port 9000 to pod 66662ea1ab30b4193dac0102c49be840971d337c802cc0c8bbc074214522bd13, uid : failed to execute portforward in network namespace "/var/run/netns/cni-c15e4e36-dad9-8316-c301-33af9dad5717": failed to dial 9000: dial tcp4 127.0.0.1:9000: connect: connection refused
I0227 08:07:51.109515   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #2, 0 open
I0227 08:07:51.111479   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #2
I0227 08:07:51.111519   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #2
I0227 08:07:51.209519   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:07:51.766305   80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{"ready":{"value":true}},"Error":"","FullError":null}
I0227 08:07:51.768304   80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/GetPluginInfo","Request":{},"Response":{"name":"csi-mock-csi-mock-volumes-4037","vendor_version":"0.3.0","manifest":{"url":"https://k8s.io/kubernetes/test/e2e/storage/drivers/csi-test/mock"}},"Error":"","FullError":null}
I0227 08:07:51.770494   80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Identity/GetPluginCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"VolumeExpansion":{"type":1}}},{"Type":{"Service":{"type":2}}}]},"Error":"","FullError":null}
I0227 08:07:51.772899   80029 csi.go:377] gRPC call: {"Method":"/csi.v1.Controller/ControllerGetCapabilities","Request":{},"Response":{"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":10}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":8}}},{"Type":{"Rpc":{"type":7}}},{"Type":{"Rpc":{"type":12}}},{"Type":{"Rpc":{"type":11}}},{"Type":{"Rpc":{"type":9}}}]},"Error":"","FullError":null}
I0227 08:08:21.209901   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:08:21.209980   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:08:51.211522   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:08:51.211566   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection #3, 1 open
I0227 08:08:51.213451   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection #3
I0227 08:08:51.213498   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection #3
I0227 08:08:51.309540   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#4, 2 open
I0227 08:08:52.215358   80029 portproxy.go:322] forward connection #3 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:08:52.215475   80029 portproxy.go:340] forward connection #3 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:09:21.310003   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:09:21.310086   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#4, 1 open
I0227 08:09:51.311854   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:09:51.311908   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#4, 1 open
I0227 08:09:51.314415   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection kubernetes#4
I0227 08:09:51.314497   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection kubernetes#4
I0227 08:09:51.409527   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#5, 2 open
I0227 08:09:52.326203   80029 portproxy.go:322] forward connection kubernetes#4 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:09:52.326277   80029 portproxy.go:340] forward connection kubernetes#4 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:10:21.409892   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:10:21.409954   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#5, 1 open
I0227 08:10:51.411455   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:10:51.411557   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#5, 1 open
I0227 08:10:51.413229   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection kubernetes#5
I0227 08:10:51.413274   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection kubernetes#5
I0227 08:10:51.509508   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#6, 2 open
I0227 08:10:52.414862   80029 portproxy.go:322] forward connection kubernetes#5 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:10:52.414931   80029 portproxy.go:340] forward connection kubernetes#5 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:11:21.509879   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:11:21.509934   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#6, 1 open
I0227 08:11:51.511519   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:11:51.511568   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#6, 1 open
I0227 08:11:51.513519   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection kubernetes#6
I0227 08:11:51.513571   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection kubernetes#6
I0227 08:11:51.609504   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#7, 2 open
I0227 08:11:52.517799   80029 portproxy.go:322] forward connection kubernetes#6 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:11:52.517918   80029 portproxy.go:340] forward connection kubernetes#6 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
I0227 08:12:21.609856   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:12:21.609909   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#7, 1 open
I0227 08:12:51.611494   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating data stream: Timeout occurred
I0227 08:12:51.611555   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#7, 1 open
I0227 08:12:51.613289   80029 portproxy.go:155] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: created a new connection kubernetes#7
I0227 08:12:51.613343   80029 portproxy.go:286] forward listener for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: got a new connection kubernetes#7
I0227 08:12:51.709535   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#8, 2 open
I0227 08:12:52.615858   80029 portproxy.go:322] forward connection kubernetes#7 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: remote side closed the stream
I0227 08:12:52.615989   80029 portproxy.go:340] forward connection kubernetes#7 for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: closing our side
W0227 08:12:52.616116   80029 server.go:669] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I0227 08:13:21.709934   80029 portproxy.go:151] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: no connection: error creating error stream: Timeout occurred
I0227 08:13:21.709997   80029 portproxy.go:148] port forwarding for csi-mock-volumes-4037-2061/csi-mockplugin-0:9000: trying to create a new connection kubernetes#8, 1 open
Feb 27 08:13:30.916: FAIL: Failed to register CSIDriver csi-mock-csi-mock-volumes-4037
Unexpected error:
    <*errors.errorString | 0xc002666220>: {
        s: "error waiting for CSI driver csi-mock-csi-mock-volumes-4037 registration on node kind-worker2: timed out waiting for the condition",
    }
    error waiting for CSI driver csi-mock-csi-mock-volumes-4037 registration on node kind-worker2: timed out waiting for the condition
occurred
@k82cn k82cn deleted the k8s_batchjob branch May 16, 2022 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants