-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only report jobs done once their state has been reported #899
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Normally the state gets reported instantly so this is already true 99% of the time. However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true
samclark
approved these changes
Mar 24, 2022
GROpenSourceRO
pushed a commit
that referenced
this pull request
Apr 22, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true * Enable dotnet and npm build internally (#19) * Added make target for dotnet that works internally * Added dotnet tests to tests-e2e target * Handle end-of-line symbols in a cross-platform manner * Moved dotnet build to separate make target * Get protoc via Maven * Load Maven URL from environment variables * Single Dockerfile for building proto * Cleaned up tests make target * Run unit tests in docker containers * Removed unused armada-test docker network * Run e2e tests and dotnet target in containers * Bump go version to 1.16 for consitency * Optionally run all go commands in containers * Mount GOPROXY and GOPRIVATE into go containers * Comments * Run npm in docker containers * Get go version string correctly from containers * Added missing .SubmitServer * Generate legacy job submitted events when submitting to Pulsar * Have cancel and reprioritise endpoints generate set messages * Always run builds in containers * Moved armadactl tests into e2e * Moved pulsar e2e tests into separate directory * Renamed directory * Updated e2e tests target to reflect new directories * Updated tests-e2e-no-setup target * Commented in tests * Removed npm environment variables values * Commented in tests * Removed commented-out code * ARMADA-1028 Events proto updates (#20) * Moved terminal flag into individual errors * Generate JobErrors instead of JobRunErrors on JobFailed API message * Updated to reflect moving terminal flag in errors * Added JobErrors to JobIdFromEvent * Added config files for local use to .gitignore * Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected * Handle all api messages * Populate ObjectMeta info for errors * Include container name with container errors * Create EventId type (#21) * Return a concrete type from new to enable comparison * Added event id type * Create utility for sniffing Pulsar events (#22) * Added program to print events * Write eventsprinter as a cobra app * Take pulsar.Message instead of pulsar.ConsumerMessage * Filter out non-control messages * Fix bug associated with creating nil events * Include CancelledJob in list of messages indicating job failure * Print job ids * Improved testing (#23) * Added missing events to JobIdFromEvent * write test output to disk, convert test output to junit format * Added go-junit-report as dependency * Spin up postgres for e2e tests * Write html test report if possible * Removed problematic -e flag * Test submitting a job with errors, test cancelling jobs * Ignore test_reports * Improve Pulsar consumer retry logic (#24) * Do not return an error on messages requiring no action * Ack messages after processing * Fail test immediately on error * Propagate errors correctly * Add code to detect (possibly nested) network errors * Return immediately on nil in IsNetworkError * Improved error handling and retry logic * Added missing parentheses * Consider context.DeadlineExceeded a network error * Improved failure and retry logic * Removed seek from Pulsar setup * Removed multierror from GetActiveJobIds * Added tests for non-network errors * Only ack on successfully processing a sequence * Set Pulsar message key * keyshared sub (#27) * Merge In changes from public github (#28) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> * fix infinite loop (#29) * lowe case returned jobIds (#30) * ARMADA-995: Ingester from pulsar-> lookout database (#26) * initial impl of lookout ingester * added pod/container error * fixes after testing * goimports * doc * doc * fixes * remove unneeded code * add docker file- move tests * Fix event sequence number update (#31) * Enable CI builds (#25) * Added Jenkinsfile * Fix syntax errors * Changed var to def * Removed need to clone armada-ci * Removed armada-ci dependency * Use writeFile instead of echo * Removed -it * Get version correctly * Added debug printouts * Printouts * Set PWD to one valid on the host * Set PWD in sh * Set PWD consistently * Make proto * Exit on proto no matching * Added goimports as dependency * Import ordering * Check error code * Check error * Added code checks make target * Go mod tidy * Aded code checks stage * Updated line endings * Escape $ * Added download make target * Set GOPATH such that it is available on the host * Removed unnecessary mkdir * Added build tag to avoid compiling tools into binary * Automatically install tools * Run goimports for pkg/events * Whitespace * Fixed line endings * Added templify to list of tools * Comment out proto stage during development * Renamed junit_report to junit-report for consistency * Run go-junit-report in docker * Require kind, run kubectl in docker * Do not check for kind, to allow downloading it using make download * Add kind to list of tools * Fixed import for go-junit-report * Added tests-teardown target, avoid returning error code from tests-e2e-teardown * Run teardown targets before tests * Removed unnecessary cleanup stage * Set shorter OperationTimeout, bundle stack trace with errors * Commented out tests failing due to Docker setup * Added gox to list of tools * Run gox in docker * Comments * Whitespace * Whitespace * Whitespace * Commented out tests during development * Commented in all stages * Enabled junit plugin * Removed junit2html * Updated paths * Updated PWD * Removed -set-exit-code from go-junit-report * For Pulsar, only expose IPv4 to fix issue with Pulsar client library * Added go-swagger and grpc-gateway to tools * Move proto post-processing from proto.sh into makefile for performance * Fix go-imports * Comment in Pulsar tests * Update Jenkinsfile * Fix ineffassign * Fix typo NoError -> Error * Only build if tests succeed * Delete Jenkinsfile * Import ordering * Timeout retrying processing an event sequence (#32) * Add dependent targets to tests-e2e-no-setup * Only ack messages if sequence is empty or if at least 1 event was processed * Timeout processing a sequence after 5 minutes * Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies * Comments * Always sleep on making no progress * Rename events module to armadaevents (#33) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Lookout Ingester fixes following testing (#35) * fixes following testing * fixes following testing * Use PGX for Database Conections to Lookout (#915) * use pgx * removed log lines * fix import order * fix test * fix test * Add retries to Lookout Ingester (#36) * fixes following testing * fixes following testing * added database error handling * import order * code review comments * code review comments * cause doesn't work * import order * Set default tolerations (#34) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Add default tolerations for e2e tests * Test for default tolerations * Improved job vonersion code * Add job conversion tests * Use job conversion code in eventutil * Test expected tolerations * More verbose printing * Lookout ingester produces compressed proto (#37) * add compressed proto * wip * go imports * fixed config- added todo * Remove changed generated files * Comments * Remove changed generated files * Remove changed generated files Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk> Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> Co-authored-by: Chris Martin <council_tax@hotmail.com>
severinson
added a commit
that referenced
this pull request
Apr 26, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Sync changes made internally 220322-220419 to GRPub (#15) * Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true * Enable dotnet and npm build internally (#19) * Added make target for dotnet that works internally * Added dotnet tests to tests-e2e target * Handle end-of-line symbols in a cross-platform manner * Moved dotnet build to separate make target * Get protoc via Maven * Load Maven URL from environment variables * Single Dockerfile for building proto * Cleaned up tests make target * Run unit tests in docker containers * Removed unused armada-test docker network * Run e2e tests and dotnet target in containers * Bump go version to 1.16 for consitency * Optionally run all go commands in containers * Mount GOPROXY and GOPRIVATE into go containers * Comments * Run npm in docker containers * Get go version string correctly from containers * Added missing .SubmitServer * Generate legacy job submitted events when submitting to Pulsar * Have cancel and reprioritise endpoints generate set messages * Always run builds in containers * Moved armadactl tests into e2e * Moved pulsar e2e tests into separate directory * Renamed directory * Updated e2e tests target to reflect new directories * Updated tests-e2e-no-setup target * Commented in tests * Removed npm environment variables values * Commented in tests * Removed commented-out code * ARMADA-1028 Events proto updates (#20) * Moved terminal flag into individual errors * Generate JobErrors instead of JobRunErrors on JobFailed API message * Updated to reflect moving terminal flag in errors * Added JobErrors to JobIdFromEvent * Added config files for local use to .gitignore * Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected * Handle all api messages * Populate ObjectMeta info for errors * Include container name with container errors * Create EventId type (#21) * Return a concrete type from new to enable comparison * Added event id type * Create utility for sniffing Pulsar events (#22) * Added program to print events * Write eventsprinter as a cobra app * Take pulsar.Message instead of pulsar.ConsumerMessage * Filter out non-control messages * Fix bug associated with creating nil events * Include CancelledJob in list of messages indicating job failure * Print job ids * Improved testing (#23) * Added missing events to JobIdFromEvent * write test output to disk, convert test output to junit format * Added go-junit-report as dependency * Spin up postgres for e2e tests * Write html test report if possible * Removed problematic -e flag * Test submitting a job with errors, test cancelling jobs * Ignore test_reports * Improve Pulsar consumer retry logic (#24) * Do not return an error on messages requiring no action * Ack messages after processing * Fail test immediately on error * Propagate errors correctly * Add code to detect (possibly nested) network errors * Return immediately on nil in IsNetworkError * Improved error handling and retry logic * Added missing parentheses * Consider context.DeadlineExceeded a network error * Improved failure and retry logic * Removed seek from Pulsar setup * Removed multierror from GetActiveJobIds * Added tests for non-network errors * Only ack on successfully processing a sequence * Set Pulsar message key * keyshared sub (#27) * Merge In changes from public github (#28) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> * fix infinite loop (#29) * lowe case returned jobIds (#30) * ARMADA-995: Ingester from pulsar-> lookout database (#26) * initial impl of lookout ingester * added pod/container error * fixes after testing * goimports * doc * doc * fixes * remove unneeded code * add docker file- move tests * Fix event sequence number update (#31) * Enable CI builds (#25) * Added Jenkinsfile * Fix syntax errors * Changed var to def * Removed need to clone armada-ci * Removed armada-ci dependency * Use writeFile instead of echo * Removed -it * Get version correctly * Added debug printouts * Printouts * Set PWD to one valid on the host * Set PWD in sh * Set PWD consistently * Make proto * Exit on proto no matching * Added goimports as dependency * Import ordering * Check error code * Check error * Added code checks make target * Go mod tidy * Aded code checks stage * Updated line endings * Escape $ * Added download make target * Set GOPATH such that it is available on the host * Removed unnecessary mkdir * Added build tag to avoid compiling tools into binary * Automatically install tools * Run goimports for pkg/events * Whitespace * Fixed line endings * Added templify to list of tools * Comment out proto stage during development * Renamed junit_report to junit-report for consistency * Run go-junit-report in docker * Require kind, run kubectl in docker * Do not check for kind, to allow downloading it using make download * Add kind to list of tools * Fixed import for go-junit-report * Added tests-teardown target, avoid returning error code from tests-e2e-teardown * Run teardown targets before tests * Removed unnecessary cleanup stage * Set shorter OperationTimeout, bundle stack trace with errors * Commented out tests failing due to Docker setup * Added gox to list of tools * Run gox in docker * Comments * Whitespace * Whitespace * Whitespace * Commented out tests during development * Commented in all stages * Enabled junit plugin * Removed junit2html * Updated paths * Updated PWD * Removed -set-exit-code from go-junit-report * For Pulsar, only expose IPv4 to fix issue with Pulsar client library * Added go-swagger and grpc-gateway to tools * Move proto post-processing from proto.sh into makefile for performance * Fix go-imports * Comment in Pulsar tests * Update Jenkinsfile * Fix ineffassign * Fix typo NoError -> Error * Only build if tests succeed * Delete Jenkinsfile * Import ordering * Timeout retrying processing an event sequence (#32) * Add dependent targets to tests-e2e-no-setup * Only ack messages if sequence is empty or if at least 1 event was processed * Timeout processing a sequence after 5 minutes * Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies * Comments * Always sleep on making no progress * Rename events module to armadaevents (#33) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Lookout Ingester fixes following testing (#35) * fixes following testing * fixes following testing * Use PGX for Database Conections to Lookout (#915) * use pgx * removed log lines * fix import order * fix test * fix test * Add retries to Lookout Ingester (#36) * fixes following testing * fixes following testing * added database error handling * import order * code review comments * code review comments * cause doesn't work * import order * Set default tolerations (#34) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Add default tolerations for e2e tests * Test for default tolerations * Improved job vonersion code * Add job conversion tests * Use job conversion code in eventutil * Test expected tolerations * More verbose printing * Lookout ingester produces compressed proto (#37) * add compressed proto * wip * go imports * fixed config- added todo * Remove changed generated files * Comments * Remove changed generated files * Remove changed generated files Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk> Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> Co-authored-by: Chris Martin <council_tax@hotmail.com> * goimports * removed duplicated tests * fix duplicate gopath * fix duplicate gopath * Minor fixes * Regenerate dotnet * Update circleci to reflect makefile changes * Separate e2e-test job * Renamed e2e test job for consistency * Run e2e tests * Increase test VM size * Download tools for build job * Enable integration tests in circleci * Store junit test report * Whitespace * Circleci cleanup * Update job name * Typo * Improved dependency caching * Remove unnecessary make download calls * Print GOPATH * Print GOPATH * Change go cache directories * Only download with GO_TEST_CMD * Always check dependencies * Go mod tidy after make download * Use large resource class * Refactored * Use xlarge resource class * Removed unused dependency * Specify container versions * Remove unused jobs * Enable DLC * Print logs * Fix order of printing logs * Bump e2e test instance resource class * Revert e2e test instance resource class Co-authored-by: Albin Severinson <Albin.Severinson@gresearch.co.uk> Co-authored-by: Chris Martin <Chris.Martin@gresearch.co.uk> Co-authored-by: JamesMurkin <jamesmurkin@hotmail.com> Co-authored-by: Chris Martin <council_tax@hotmail.com> Co-authored-by: Albin Severinson <larsalbins@latpoc32.maas>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Normally the state gets reported instantly so this is already true 99% of the time.
However if reporting the state goes wrong, we shouldn't report the job as done
In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too
This will only really impact edge cases and most of the time this will already be true