omnibus actual concurrency and major refactor #1530

Merged May 16, 2024 (96 commits)

@technillogue technillogue commented Feb 12, 2024

this should be reviewable at last. I'm mostly interested in whether the changes are comprehensible/legible and what comments I can add. there's a grab-bag of smaller changes (a /ready route, predictor.log, cancellation fixes, etc.) plus the core change of moving worker into runner. I'm very open to reconsidering that (e.g. calling it worker instead and maybe moving some of it into http), but not right now. once this is merged, the plan is to gradually cut small changes from the async branch to merge into main, and review those changes more thoroughly.

original description: I originally tried to split up my work in #1499 and #1508 as "refactor runner + add concurrency" and "fix uploads/downloads" but ended up interleaving these changes. this PR will just be the overall changeset for now, and hopefully as this coalesces more it'll be clear how to carve it up into separate changesets.

major points:

  • add concurrency to cog.yaml
  • use httpx for everything except URLFile, pull out all the client code from everywhere else
  • completely rework URLPath
  • very dirty hack to unblock us on large file uploads
  • merge worker into runner (maybe the other way around would be better?)
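
As a rough illustration of the first point, a configured concurrency limit can be enforced with a semaphore around each prediction. This is a minimal sketch under stated assumptions, not the code in this PR: `MAX_CONCURRENCY`, `run_prediction`, and `run_all` are hypothetical names, and the limit is hard-coded here rather than parsed from cog.yaml.

```python
import asyncio

MAX_CONCURRENCY = 2  # hypothetical; stands in for a value read from cog.yaml


async def run_prediction(sem, state, prediction_id):
    # Wait for a free slot, then "run" the prediction while holding it.
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for real async work
        state["active"] -= 1
    return prediction_id


async def run_all(n):
    # Start n predictions at once; the semaphore caps how many overlap.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_prediction(sem, state, f"p{i}") for i in range(n))
    )
    return results, state["peak"]
```

Calling `asyncio.run(run_all(5))` starts five predictions, but at most `MAX_CONCURRENCY` of them hold a slot at any moment; the rest wait inside `async with sem`.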

@yorickvP yorickvP left a comment

didn't look at the tests yet

pkg/config/config.go (outdated)
pyproject.toml (outdated)
pkg/config/validator.go (outdated)
python/cog/predictor.py (outdated)
python/cog/predictor.py (outdated)
python/cog/server/clients.py (outdated)
python/cog/server/clients.py (outdated)
python/cog/server/clients.py (outdated)
python/cog/server/http.py (outdated)
python/cog/server/clients.py (outdated)
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 8f7a594 to 3563178 Compare February 19, 2024 18:02
@technillogue technillogue marked this pull request as ready for review February 19, 2024 23:53
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 9bc6ece to f390777 Compare February 21, 2024 19:21
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from d965185 to 644d1cd Compare February 29, 2024 21:45
Signed-off-by: technillogue <technillogue@gmail.com>
@nickstenning nickstenning left a comment

This PR is (as advertised, to be fair) a bit of a mishmash of changes with different intents and scopes, and as such it's a bit hard to review.

If I understand it correctly the gist of the effort here is to update the Worker API to support the management of multiple predictions, indexed by their IDs. That seems sensible, but I don't understand why we moved 80% of Worker to PredictionRunner to do that.

I'm also more than a little bit suspicious of the fact that the CI checks on this branch are all green, especially given that by far the most substantive part of the cog test suite (test_worker.py) is apparently not running at all.
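
The idea of managing multiple predictions indexed by their IDs, as described in this review, can be pictured roughly as follows. This is a hedged sketch rather than the PR's actual `Worker` or `PredictionRunner` code: the `Runner` class, its `predict`/`cancel` methods, and `demo` are hypothetical names chosen for illustration. It also shows one way an idempotent endpoint could return the same result for a repeated ID.

```python
import asyncio


class Runner:
    """Hypothetical runner that tracks in-flight predictions by ID."""

    def __init__(self):
        self._tasks = {}  # prediction ID -> asyncio.Task

    def predict(self, prediction_id, payload):
        # Idempotent: a repeated ID returns the task already created for it.
        task = self._tasks.get(prediction_id)
        if task is None:
            task = asyncio.create_task(self._run(payload))
            self._tasks[prediction_id] = task
        return task

    def cancel(self, prediction_id):
        # Cancel only the named prediction; the others keep running.
        task = self._tasks.get(prediction_id)
        if task is None or task.done():
            return False
        task.cancel()
        return True

    async def _run(self, payload):
        await asyncio.sleep(0.01)  # placeholder for the model call
        return payload.upper()


async def demo():
    runner = Runner()
    first = runner.predict("a", "hello")
    again = runner.predict("a", "ignored")  # same ID, so same task
    other = runner.predict("b", "world")
    cancelled = runner.cancel("b")
    results = await asyncio.gather(first, other, return_exceptions=True)
    return first is again, cancelled, results
```

In `demo`, the second call with ID `"a"` hands back the existing task instead of starting new work, and cancelling `"b"` leaves `"a"` untouched.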

@technillogue technillogue reopened this May 10, 2024
@technillogue technillogue force-pushed the syl/more-refactor branch 7 times, most recently from bd69678 to 7098fde Compare May 16, 2024 20:09
@technillogue technillogue merged commit 0ebfc54 into async May 16, 2024
10 checks passed
@technillogue technillogue deleted the syl/more-refactor branch May 16, 2024 21:08
technillogue added a commit that referenced this pull request Jun 19, 2024
* add concurrency to config

* more descriptive names for predict functions

* don't cancel from signal handler if a loop is running. expose worker busy state to runner

* move handle_event_stream to PredictionEventHandler

* make setup and canceling work

* keep track of multiple runner prediction tasks to make idempotent endpoint return the same result and fix tests somewhat

* drop Runner._result, comments

* move create_event_handler into PredictionEventHandler.__init__

* break out Path.validate into value_to_path and inline get_filename and File.validate

* split out URLPath into BackwardsCompatibleDataURLTempFilePath and URLThatCanBeConvertedToPath with the download part of URLFile inlined

* let's make DataURLTempFilePath also use convert and move value_to_path back to Path.validate

* drop should_cancel

* prediction->request

* split up predict/inner/prediction_ctx into enter_predict/exit_predict/prediction_ctx/inner_async_predict/predict/good_predict as one way to do it. however, exposing all of those for runner predict enter/coro exit still sucks, but this is still an improvement

* bigish change: inline predict_and_handle_errors

* inline make_error_handler into setup

* move runner.setup into runner.Runner.setup

* add concurrency to config in go

* try explicitly using prediction_ctx __enter__ and __exit__

* relax setup argument requirement to str

* glom worker into runner

* add logging message

* fix prediction retry and improve logging

* split out handle_event

* use CURL_CA_BUNDLE for file upload

* dubious upload fix

* skip worker and webhook tests since those were erroring on removed imports. fix or xfail runner tests

* validate prediction response to raise errors, but return the unvalidated output to avoid converting urls to File/Path

* expose concurrency in healthcheck

* mediocre logging that works like print

* COG_DISABLE_CANCEL to ignore cancelations

* COG_CONCURRENCY_OVERRIDE

* add ready probe as an http route

* encode webhooks only after knowing they will be sent, and bail out of upload type checks early for strs

* don't validate outputs

* add AsyncConcatenateIterator

* should_exit is not actually used by http

* format

* codecov

* describe the remaining problems with this PR and add comments about cancelation and validation

* add a test

* fix test (#1669)

* fix config schema

* allow setting both max and target concurrency in cog.yaml (#1672)

* drop default_target (#1685)

---------
Signed-off-by: technillogue <technillogue@gmail.com>
Co-authored-by: Mattt <mattt@replicate.com>
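
For readers unfamiliar with the "add AsyncConcatenateIterator" item, the general idea of an async iterator whose chunks are meant to be joined into one output can be sketched as below. This is an assumption-laden illustration, not the class added by this commit: the wrapper shown here, `generate_tokens`, and `collect` are all hypothetical, and the real type may be only a typing marker.

```python
import asyncio


class AsyncConcatenateIterator:
    """Hypothetical wrapper: an async iterator whose chunks form one output."""

    def __init__(self, source):
        self._source = source

    def __aiter__(self):
        return self

    async def __anext__(self):
        # Delegate to the wrapped iterator; StopAsyncIteration passes through.
        return await self._source.__anext__()


async def generate_tokens():
    # Stand-in for a streaming predict function yielding text chunks.
    for token in ["con", "cat", "enate"]:
        await asyncio.sleep(0)
        yield token


async def collect(chunks):
    # A consumer joins the chunks instead of treating them as a list.
    return "".join([chunk async for chunk in chunks])
```

A consumer would then run something like `asyncio.run(collect(AsyncConcatenateIterator(generate_tokens())))` to obtain the joined string.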
technillogue added a commit that referenced this pull request Jun 19, 2024
technillogue added a commit that referenced this pull request Jun 19, 2024
technillogue added a commit that referenced this pull request Jun 19, 2024
technillogue added a commit that referenced this pull request Jul 3, 2024
technillogue added a commit that referenced this pull request Jul 18, 2024
mattt pushed a commit that referenced this pull request Jul 18, 2024
mattt pushed a commit that referenced this pull request Jul 19, 2024
4 participants