Set up reproducible Python environments with conda-lock #2968
Conversation
Force-pushed from 78d735d to 8ee276e
Here are some thoughts on the remaining questions:
I think this might prevent us from piping the nightly build STDOUT into a log file. If there is no way to stop micromamba from intercepting STDOUT we’ll have to find another way to capture the logs. We might be able to do this using Cloud Logging.
Looks like AWS is not interested in publishing awscli2 on PyPI. I'm assuming the same goes for conda. It looks like our Google Python packages were installed via other dependencies (gcsfs, pandas-gcp). I think we installed the CLI tool separately: "Sometimes it is helpful to download the logs and data outputs of nightly builds when debugging failures. To do this you'll need to set up the Google Cloud Software Development Kit (SDK). Install the gcloud utilities on your computer. There are several ways to do this. We recommend using conda or its faster sibling mamba. If you're not using conda environments, there are other ways to install the Google Cloud SDK explained in the link above."
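For what it's worth, a minimal sketch of what that conda-based install might look like, assuming the conda-forge package name google-cloud-sdk and an environment named pudl-dev (both assumptions on my part):

```bash
# Install the gcloud CLI from conda-forge into an existing env (assumed name: pudl-dev).
mamba install -n pudl-dev -c conda-forge google-cloud-sdk

# Authenticate, then pull down nightly build outputs for debugging, e.g.:
gcloud auth login
gcloud storage cp gs://<bucket>/<build-id>/pudl-etl.log .
```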
What are the benefits of installing from the catalystcoop.pudl package instead of from the repository directly? Distributing the package on PyPI and conda-forge seems like it will create unnecessary overhead.
I'm not sure! Happy to help debug.
Great point! It doesn't make much sense anymore to have optional dependencies if we treat PUDL like an application with pinned dependencies that produces data outputs. Does it make sense to keep the optional dependencies around until our main users no longer depend on installing PUDL as a package?
@@ -10,19 +10,15 @@ on:
- ready_for_review
We should probably rename this workflow now that tox is no more.
Yeah, I was just holding off since I think I have to change the main branch to do that right? And the name is insinuated into a bunch of different places.
I want to rename the update-lockfile action too since that's a bit generic.
Makefile (Outdated)

test/integration/glue_test.py

########################################################################################
# The github- prefixed targets are meant to be run by GitHub Actions
Could you elaborate on why there are different commands for local and github in this comment?
For the GitHub ones I'm having them install PUDL and deal with coverage before & after the run. But maybe we could do those things every time locally or on GitHub and it would be fine?
Got it. I say we keep it as is for now. Just run coverage on GitHub.
I think it would be better if the make targets are exactly what we do on CI - once we start having local-, github-, and targets with no prefix we lose the standardization benefit.
This would also let us have less code in the Makefile which is a plus.
The tricky thing is handling PUDL installation & coverage.
PUDL installation: this should be part of a separate devenv target, I think - something like
devenv: conda-env
	mamba run -n pudl-dev pip install -e .

conda-env: environments/conda-lock.yml
	conda-lock install -n pudl-dev $^
Coverage: on CI, we're doing coverage for unit/integration/docs separately, and combining them at the end. Locally, we might want to do coverage for any subset of those, which is the impetus behind the various local-pytest-ci targets. I think we should:
- stop running coverage erase within make - on CI runners it is unnecessary, and locally this forces a new target for every subset of tests we want to run coverage for.
- include the coverage args for all unit/integration/doc-build targets, for local/CI consistency. If you really want to skip coverage, you can always run the naked command (see the sketch just below for how the separate runs combine).
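To make the combining step concrete, here's a rough sketch of the flow I mean, with stand-in arguments (e.g. --cov=pudl is an assumption; the real covargs / pytest_args live in the Makefile):

```bash
# Each suite appends to the same .coverage data file...
coverage run --append -- "${CONDA_PREFIX}/bin/sphinx-build" -W -b html docs docs/_build/html
pytest --cov=pudl --cov-append --doctest-modules src/pudl test/unit
pytest --cov=pudl --cov-append test/integration

# ...and a single reporting step runs at the end, locally or on CI.
coverage report
```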
Here's what I have locally:
diff --git a/Makefile b/Makefile
index a86b30874..246ee37f5 100644
--- a/Makefile
+++ b/Makefile
@@ -14,6 +14,12 @@ pip_install_pudl := pip install --no-deps --editable ./
dagster:
dagster dev -m pudl.etl -m pudl.ferc_to_sqlite
+devenv: conda-env
+ mamba run -n pudl-dev pip install -e .
+
+conda-env: environments/conda-lock.yml
+ conda-lock install -n pudl-dev $^
+
########################################################################################
# Conda lockfile generation
########################################################################################
@@ -47,16 +53,19 @@ docs-clean:
docs-build: docs-clean
doc8 docs/ README.rst
- sphinx-build -W -b html docs docs/_build/html
+ coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
########################################################################################
# Generic pytest commands for local use, without test coverage
########################################################################################
pytest-unit:
- pytest --doctest-modules src/pudl test/unit
+ pytest ${pytest_args} --doctest-modules src/pudl test/unit
pytest-integration:
- pytest test/integration
+ pytest ${pytest_args} --etl-settings ${etl_fast_yml} test/integration
+
+pytest-integration-full:
+ pytest ${pytest_args} --etl-settings ${etl_full_yml} test/integration
pytest-validate:
pytest --live-dbs test/validate
@@ -65,41 +74,17 @@ pytest-validate:
########################################################################################
# More complex pytest commands for local use that collect test coverage
########################################################################################
-# Run unit & integration tests on 1-2 years of data and collect test coverage data.
-local-pytest-ci: docs-clean
- ${coverage_erase}
- doc8 docs/ README.rst
- coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
- pytest ${pytest_args} --doctest-modules src/pudl test/unit
- pytest ${pytest_args} --etl-settings ${etl_fast_yml} test/integration
- ${coverage_report}
-
-# Run unit & integration tests on ALL years of data and collect test coverage data.
-# NOTE: This will take 1+ hours to run and the PUDL DB will not be retained.
-local-pytest-ci-all-years: docs-clean
- ${coverage_erase}
- doc8 docs/ README.rst
- coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
- pytest ${pytest_args} --doctest-modules src/pudl test/unit
- pytest ${pytest_args} --etl-settings ${etl_full_yml} test/integration
- ${coverage_report}
-
# Run the full ETL, generating new FERC & PUDL SQLite DBs and EPA CEMS Parquet files.
# Then run the full integration tests and data validations on all years of data.
# NOTE: This will clobber your existing databases and takes hours to run!!!
# Backgrounding the data validation and integration tests and using wait allows them to
# run in parallel.
-nuke: docs-clean
- ${coverage_erase}
- doc8 docs/ README.rst
- coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
- pytest ${pytest_args} --doctest-modules src/pudl test/unit
- pytest ${pytest_args} \
- --etl-settings ${etl_fast_yml} \
- test/integration
+#
+
+nuke: docs-build pytest-unit
rm -f tox-nuke.log
coverage run ${covargs} -- \
- src/pudl/convert/ferc_to_sqlite.py \
+ src/pudl/ferc_to_sqlite/cli.py \
--logfile tox-nuke.log \
--clobber \
${gcs_cache_path} \
@@ -126,11 +111,7 @@ pytest-jupyter:
# Compare actual and expected number of rows in many tables:
pytest-minmax-rows:
- pytest --live-dbs \
- test/validate/epacamd_eia_test.py::test_minmax_rows \
- test/validate/ferc1_test.py::test_minmax_rows \
- test/validate/eia_test.py::test_minmax_rows \
- test/validate/mcoe_test.py::test_minmax_rows_mcoe
+ pytest --live-dbs test/validate -k minmax_rows
# Build the FERC 1 and PUDL DBs, ignoring foreign key constraints.
# Identify any plant or utility IDs in the DBs that haven't yet been mapped
Couple extra notes:
nuke:
- I think clobber has some unpleasant behavior with ferc_to_sqlite where it will delete the XBRL databases when reading the 2022 files, but... src/pudl/convert/ferc_to_sqlite.py doesn't even exist, so I think nuke was totally broken already
- should we be managing logfiles like this in dagster-land? seems a little clunky, but I also know logs in dagster are funky
pytest-minmax-rows: this is a shortcut to adding @pytest.mark.minmax to the minmax_rows tests. I sort of think this particular make target should be replaced by the built-in pytest test filtering tooling. Maybe the jupyter one too.
RUN useradd -Ums /bin/bash catalyst

ENV CONTAINER_HOME=/home/catalyst
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
I guess this is what Amazon wants us to do?!?
And who are we to question Lord Bezos?
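For reference, the sequence Amazon documents (which the Dockerfile RUN above is one piece of) looks roughly like this; the install-dir/bin-dir flags here are an assumption, not necessarily what our image uses:

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
# --install-dir / --bin-dir are optional; the defaults go under /usr/local.
./aws/install --install-dir /usr/local/aws-cli --bin-dir /usr/local/bin
rm -rf awscliv2.zip aws/
```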
There's the
There's a stale user-contributed package on
One thing with allowing dependabot to keep doing the updates is it'll make one PR for every individual dependency that changes, which will be good for isolating changes, but it'll be a lot of PRs, and every one of them will need the lockfile to be regenerated. I was thinking that we'd maybe run

With our ~500 direct+indirect dependencies, I don't think there will ever be a week in which none of them release a new version.

In removing the upper bounds I was thinking just of removing the ones where there's no known problem and the upper bound is just a safety measure that keeps the dependency from changing too much without some intervention & testing on our part. For

One thing I think we probably want to try and avoid is merge conflicts in
I think this should definitely be doable, in the same way that
I think we've really gone to an "application" model at this point, where
I think maybe we want to consolidate all the currently "optional" dependencies and make them required, but what do we want to do about dependencies that are for interactive local development? Like the Docker container doesn't need to have JupyterLab installed in it, but we all probably want it. Should we just leave that up to individuals? Or include them in the lockfile? Or I guess, don't include them in the lockfile but include them in the rendered environment files that we'll use to install the environment locally with the new
Oh, the downside of using
Force-pushed from 4e5589f to 32e8353
I thought dependabot only updates direct dependencies? If so, our dependabot PR frequency wouldn't change, right? If it opens a PR for every indirect dependency update I think we should update all the dependencies on a schedule.
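For the scheduled approach, I imagine the weekly job running something like the following (a sketch only; the platform list and exact flag spellings are assumptions based on what's in this PR):

```bash
# Re-solve everything from pyproject.toml into the multi-platform lockfile...
conda-lock lock --micromamba --file pyproject.toml \
  --platform linux-64 --platform osx-64 --platform osx-arm64 --platform win-64 \
  --lockfile environments/conda-lock.yml

# ...then re-render per-platform environment files for tools that can't read conda-lock.yml.
conda-lock render --kind env --platform linux-64 environments/conda-lock.yml
```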
I'm in favor of no more package releases. Can't think of a reason why someone would want to install PUDL given we're only distributing data now.
This sounds like a good plan to me. So have two targets:
The dependabot only looks at our direct dependencies (in

Would an
Force-pushed from bdc50fd to 2ae5658
Updating all of the dependencies weekly sounds simpler than opening multiple dependabot PRs 👍 When would the lock and environment files have merge conflicts? Contributor updates pyproject.toml on a feature branch,
Hmm, @jdangerx it looks like we got another stochastic failure on the hypothesis test:
It also seems like a lot of the dates that show up in these tests are suspiciously close to (within a few nanoseconds of) Unix time zero.
I think any time there are changes to

In the

I think we're trying to avoid a scenario where, say, we run a nightly build in an environment that's determined by a lockfile that doesn't actually correspond to the
Woohoo! Seems like everything works, the lockfile-updating schedule makes sense, and the only remaining question is whether to stick with make, right?
My thoughts:
- it's nice that make exists outside of the Python environment, so we can use it to standardize some of our env management stuff
- it's a little awkward to use make as a general task-runner because we'll have to make various .PHONY targets... but that seems OK. As long as we keep the task definitions short and split anything complicated out into a real Python script I think we won't run into maintainability hell.
So I think I've talked myself into "just use make, it's good enough and extremely standard." As such I've weighed in with my opinions on how to make in a slightly nicer way.
Other stuff:
- when renaming a workflow, the old one will be there forever unless you delete all the existing workflow runs... that's kind of annoying but I guess it's better than losing history.
Force-pushed from e852b6a to aaa8082
Force-pushed from 927c9a3 to 071766a
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##             dev   #2968   +/-   ##
=====================================
  Coverage   88.7%   88.7%
=====================================
  Files         90      90
  Lines      10994   10994
=====================================
  Hits        9758    9758
  Misses      1236    1236

☔ View full report in Codecov by Sentry.
Force-pushed from 071766a to 8d85019
Nothing blocking. I'm excited about make and conda-lock!
Responses to your Outstanding Questions
Logging
I'm not sure what the logs will look like with the parallel tests and the change to --attach "". I say we run a full build on this branch before we merge it into dev.
Docker stuff
For some reason, when I run update-conda-lockfile using workflow_dispatch the docker-build-test action is not kicked off. See #3022
Yeah this one is stumping me. I can take a look again tomorrow.
Installing the AWS CLI and Google Cloud SDK is hacky. Is AWS really only available via curl? gcloud can be installed similarly with curl, or via apt-get or from conda-forge (for OS X + Intel based Linux). What's the right approach? Also, how were we previously installing the Google Cloud SDK? I don't know what I changed to cause it not to be available any more.
I thought we decided to use conda to install the Google Cloud SDK and curl to install the AWS CLI?
There are many environment variables which are set both in the docker/.env file, and inside Dockerfile. Is that intentional? Or should one of those be the unique source of that information?
I think the env vars are set in both the .env and Dockerfile files so folks could overwrite the env var values in the .env file when using docker-compose locally. We haven't really kept docker-compose.yml and local_pudl_etl.sh up to date because we mostly test the image build and container execution using GitHub Actions and the GCP VM. I think we could probably remove these files.
Remaining dependency issues
Why does pip think that the recordlinkage version is 0.0.0 when it's installed with mamba?
I'm not sure. Is this causing problems?
--container-arg="/home/catalyst/env" \ | ||
--container-arg="--prefix" \ | ||
--container-arg="/home/mambauser/env" \ | ||
--container-arg="--attach" \ |
Conda swallows all of the logs if you don't include --no-capture-output but it sounds like --attach "" should work:

"Attach to stdin, stdout and/or stderr. -a "" for disabling stream redirection"

Maybe we run part of a full build to make sure the logs are coming through before we merge into dev?
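Something like this is what I'd expect the container command to boil down to (a sketch; the script path and env name are assumptions on my part):

```bash
# Run the ETL inside the locked env without micromamba redirecting the streams,
# so stdout/stderr can still be captured into a log file.
micromamba run --attach "" -n pudl-dev \
  bash ./docker/local_pudl_etl.sh 2>&1 | tee pudl-etl.log
```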
I did run one full build, I think it should be in the build outputs if you want to take a look at the logs and see if they look like you'd expect them to. I think using pytest-xdist does obscure some of the test logging, but also shaves a couple of hours off the nightly builds.
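(For anyone following along: pytest-xdist parallelism is the -n flag, so the parallel runs are presumably something along these lines; the exact test paths are an assumption.)

```bash
# Spread the slow test suites across all available cores.
pytest -n auto test/integration
pytest -n auto --live-dbs test/validate
```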
This is the output from the parallelized tests if you want to take a look:
gcloud storage cp gs://nightly-build-outputs.catalyst.coop/ece863d762c9a8790c7baff5397a0c6ee30609bb-conda-lockfile/pudl-etl.log .
These logs look good to me!
if: ${{ (github.event_name == 'push') }}
run: |
  echo "GITHUB_REF="${{ github.ref_name }} >> $GITHUB_ENV
It looks like you're setting GITHUB_REF = github.ref_name when github.event_name == "push" or "workflow_dispatch". Why not handle it in one action step?
They were different at some point while I was developing the setup, and it felt more legible to enumerate the 3 cases we're trying to cover. But I could go either way.
Ok! I'm down to keep it as is if it feels more legible.
# If running on push due to dependency changes, commit directly to the base
# branch of the existing PR. Don't trigger the workflow again if we're already
# running it as pudlbot (to avoid infinite recursion).
if: ${{ (github.event_name == 'push' && github.actor != 'pudlbot') }}
uses: stefanzweifel/git-auto-commit-action@v5
with:
  file_pattern: "environments/*"
  commit_message: "Update conda-lock.yml and rendered conda environment files."
What's stopping the action from running again on a push?
The second time, the action is being run by pudlbot. This felt a little hacky, but it worked. Is there a more elegant way to do it?
I looked at the docs for the push trigger again and it doesn't look like there is a way to filter based on who triggered the push.

Would it make sense to test this condition and abort earlier in the workflow?
* Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Switch to using conda-forge version of recordlinkage v0.16 * Update conda-lock.yml now that all dependencies are available on conda-forge * Consolidate conda env files under environments/ dir * Add a GitHub action to relock dependencies * Quote the pip install command * Remove pip install of pudl from environment.yml * Rename workflow * Only build lockfile from pyproject.toml, don't install extras. * Just install conda-lock, not pudl, before running conda-lock. * install conda-lock with pip * Move all remaining dev-environment.yml deps to pyproject.toml * Add other platforms; make draft PR against dev. * Comment out dev base branch for now. * Remove pandas extras and recordlinkage deps from pyproject.toml * Use conda-lock --micromamba rather than --mamba * Don't specify grpcio, or specific recordlinkage version * Render platform-specific environment files in github action * Fix paths relative to environments directory * Add some comment notes to workflow * Render environment for Read The Docs. * Use environment not explicit rendered lockfile * Add readthedocs specific sphinx extension * Don't render explicit conda env for RTD since it can't read it. * Build linux-aarch64 lockfile. Use conda-lock.yml in workflows. * Comment out non-working linux-aarch64 platform for now. * Switch to using rendered lockfiles. * Remove deprecated environment files * Switch to using a micromamba docker image * Install git into the docker image. * Use micromamba and unrendered multi-platform lockfile. * Add main category to micromamba environment creation. * Use conda-lock not base as env name * Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Add linux-aarch64 platform back into conda-lock settings.
- Remove deprecated API_KEY_EIA envvar - Add .gitignore in new environments directory - Create Makefile and migrat tox.ini logic into it - Replace spaces in Makefile with tabs - Install pudl via pip install --no-deps to avoid contaminating the locked environment - Move pip install and coverage XML logic from GHA workflow into the Makefile - Increase the minimum versions of some dependencies. - Move update-lockfile GHA workflow logic into Makefile - Attempt to run slow tests in parallel using "wait" to prevent premature exit of the shell
* Use micromamba not conda in Dockerfile CMD, also use pip install --no-deps * Use micromamba not conda in command passed to build container * Use default mambauser rather than catalyst in docker container * Remove --no-capture-output which isn't supported by micromamba. Is this a problem? * Remove uninterpolated vars in .env and more --no-capture-output * Separate ETL and pytest commands. * Stop trying to run tests in parallel. Sigh. * Add google cloud sdk to conda environment. * Install Google Cloud SDK from conda-forge. * Add back in the making of required directories. Oops. * Attempt to have micromamba run pass through output * Use prettier to standardize formatting of lockfiles. * Add dagster (server startup) target to Makefile * Update conda lockfile and rerender environment files * Attempt to trigger update-lockfile when pyproject.toml is changed. * Remove non-required upper bounds on dependency versions. * Use correct branch name for update-lockfile trigger. * Fix incorrect nbformat minimum version * Update Makefile in response to PR feedback
* Remove dagster-postgres version to avoid PyPI conda-forge conflict. * Break up and distribute the nuke target. * Resolve issues with pandera dependency. Reorganize makefile. * Add triggers and commit/PR for workflow_dispatch, pull_request, schedule and set appropriate GITHUB_REF values for each case. * Use push instead of pull_request to trigger on path. This avoids re-locking the dependencies every single time you push to a PR that had a change to pyproject.toml *somewhere* in it. * Also trigger based on path if .github/workflows/update-lockfile.yml changes. * Update conda-lock.yml and rendered conda environment files.
* Move previous dev, test, and datasette optional dependencies into the required dependencies to simplify application installation. * Test make nuke; parallelize --live-dbs tests * Move prettier into conda-only dependencies * Update conda-lock.yml and rendered conda environment files. * Remove action test file hashlog * Remove merge markers. * Remove transitive astroid dependency that's now correctly included in solve. * Use the real immature library version of dagster-postgres (0.21.6) rather than the accidentally packaged 1.5.6 version found in conda. We'll need to keep an eye out for when dagster-postgres graduates to the stable versioning and update it. This is a bit of a mess because of some broken automation in the conda packaging for dagster which has now been fixed. * Update "make pudl" to remove the old PUDL DB and reinitialize with alembic, rather than writing to the DB that already exists. * Fixed some groupby.agg() deprecation warnings. * Fix dagster-postgres version (again). * Update username in path to settings file * Avoid bugs in ferc_to_sqlite --clobber; don't use cache_dir for pip install. * Make FERC extraction output removal more specific. * Bump numpy and numba minimum versions. * Bump black version in pre-commit * Bump ruff pre-commit hook version
* Rename tox-pytest and update-lockfile workflows. * Make scheduled PRs against dev rather than conda-lockfile branch. * Update to pyarrow 14 and grpcio 1.59 b/c security * Update release notes
Force-pushed from 5574498 to 08f318f
* Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Switch to using conda-forge version of recordlinkage v0.16 * Update conda-lock.yml now that all dependencies are available on conda-forge * Consolidate conda env files under environments/ dir * Add a GitHub action to relock dependencies * Quote the pip install command * Remove pip install of pudl from environment.yml * Rename workflow * Only build lockfile from pyproject.toml, don't install extras. * Just install conda-lock, not pudl, before running conda-lock. * install conda-lock with pip * Move all remaining dev-environment.yml deps to pyproject.toml * Add other platforms; make draft PR against dev. * Comment out dev base branch for now. * Remove pandas extras and recordlinkage deps from pyproject.toml * Use conda-lock --micromamba rather than --mamba * Don't specify grpcio, or specific recordlinkage version * Render platform-specific environment files in github action * Fix paths relative to environments directory * Add some comment notes to workflow * Render environment for Read The Docs. * Use environment not explicit rendered lockfile * Add readthedocs specific sphinx extension * Don't render explicit conda env for RTD since it can't read it. * Build linux-aarch64 lockfile. Use conda-lock.yml in workflows. * Comment out non-working linux-aarch64 platform for now. * Switch to using rendered lockfiles. * Remove deprecated environment files * Switch to using a micromamba docker image * Install git into the docker image. * Use micromamba and unrendered multi-platform lockfile. * Add main category to micromamba environment creation. * Use conda-lock not base as env name * Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Add linux-aarch64 platform back into conda-lock settings. - Remove deprecated API_KEY_EIA envvar - Add .gitignore in new environments directory - Create Makefile and migrat tox.ini logic into it - Replace spaces in Makefile with tabs - Install pudl via pip install --no-deps to avoid contaminating the locked environment - Move pip install and coverage XML logic from GHA workflow into the Makefile - Increase the minimum versions of some dependencies. - Move update-lockfile GHA workflow logic into Makefile - Attempt to run slow tests in parallel using "wait" to prevent premature exit of the shell * Use micromamba not conda in Dockerfile CMD, also use pip install --no-deps * Use micromamba not conda in command passed to build container * Use default mambauser rather than catalyst in docker container * Remove --no-capture-output which isn't supported by micromamba. Is this a problem? * Remove uninterpolated vars in .env and more --no-capture-output * Separate ETL and pytest commands. * Stop trying to run tests in parallel. Sigh. * Add google cloud sdk to conda environment. * Install Google Cloud SDK from conda-forge. * Add back in the making of required directories. Oops. * Attempt to have micromamba run pass through output * Use prettier to standardize formatting of lockfiles. * Add dagster (server startup) target to Makefile * Update conda lockfile and rerender environment files * Attempt to trigger update-lockfile when pyproject.toml is changed. * Remove non-required upper bounds on dependency versions. * Use correct branch name for update-lockfile trigger. 
* Fix incorrect nbformat minimum version * Update Makefile in response to PR feedback * Remove dagster-postgres version to avoid PyPI conda-forge conflict. * Break up and distribute the nuke target. * Resolve issues with pandera dependency. Reorganize makefile. * Add triggers and commit/PR for workflow_dispatch, pull_request, schedule and set appropriate GITHUB_REF values for each case. * Use push instead of pull_request to trigger on path. This avoids re-locking the dependencies every single time you push to a PR that had a change to pyproject.toml *somewhere* in it. * Also trigger based on path if .github/workflows/update-lockfile.yml changes. * Update conda-lock.yml and rendered conda environment files. * Move previous dev, test, and datasette optional dependencies into the required dependencies to simplify application installation. * Test make nuke; parallelize --live-dbs tests * Move prettier into conda-only dependencies * Update conda-lock.yml and rendered conda environment files. * Remove action test file hashlog * Remove merge markers. * Remove transitive astroid dependency that's now correctly included in solve. * Use the real immature library version of dagster-postgres (0.21.6) rather than the accidentally packaged 1.5.6 version found in conda. We'll need to keep an eye out for when dagster-postgres graduates to the stable versioning and update it. This is a bit of a mess because of some broken automation in the conda packaging for dagster which has now been fixed. * Update "make pudl" to remove the old PUDL DB and reinitialize with alembic, rather than writing to the DB that already exists. * Fixed some groupby.agg() deprecation warnings. * Fix dagster-postgres version (again). * Update username in path to settings file * Avoid bugs in ferc_to_sqlite --clobber; don't use cache_dir for pip install. * Make FERC extraction output removal more specific. * Bump numpy and numba minimum versions. * Bump black version in pre-commit * Bump ruff pre-commit hook version * Rename tox-pytest and update-lockfile workflows. * Make scheduled PRs against dev rather than conda-lockfile branch. * Update to pyarrow 14 and grpcio 1.59 b/c security * Update release notes * Add CI targets in Makefile. Docs cleanup. * Update conda-lock.yml and rendered conda environment files. --------- Co-authored-by: zaneselvans <zaneselvans@users.noreply.github.com>
* Set up reproducible Python environments with conda-lock (#2968) * Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Switch to using conda-forge version of recordlinkage v0.16 * Update conda-lock.yml now that all dependencies are available on conda-forge * Consolidate conda env files under environments/ dir * Add a GitHub action to relock dependencies * Quote the pip install command * Remove pip install of pudl from environment.yml * Rename workflow * Only build lockfile from pyproject.toml, don't install extras. * Just install conda-lock, not pudl, before running conda-lock. * install conda-lock with pip * Move all remaining dev-environment.yml deps to pyproject.toml * Add other platforms; make draft PR against dev. * Comment out dev base branch for now. * Remove pandas extras and recordlinkage deps from pyproject.toml * Use conda-lock --micromamba rather than --mamba * Don't specify grpcio, or specific recordlinkage version * Render platform-specific environment files in github action * Fix paths relative to environments directory * Add some comment notes to workflow * Render environment for Read The Docs. * Use environment not explicit rendered lockfile * Add readthedocs specific sphinx extension * Don't render explicit conda env for RTD since it can't read it. * Build linux-aarch64 lockfile. Use conda-lock.yml in workflows. * Comment out non-working linux-aarch64 platform for now. * Switch to using rendered lockfiles. * Remove deprecated environment files * Switch to using a micromamba docker image * Install git into the docker image. * Use micromamba and unrendered multi-platform lockfile. * Add main category to micromamba environment creation. * Use conda-lock not base as env name * Add a conda-lock setup for discussion. * Move python-snappy into project.dependencies in pyproject.toml * Remove sphinx-autoapi from pypi deps, and no longer required snappy-python * Add linux-aarch64 platform back into conda-lock settings. - Remove deprecated API_KEY_EIA envvar - Add .gitignore in new environments directory - Create Makefile and migrat tox.ini logic into it - Replace spaces in Makefile with tabs - Install pudl via pip install --no-deps to avoid contaminating the locked environment - Move pip install and coverage XML logic from GHA workflow into the Makefile - Increase the minimum versions of some dependencies. - Move update-lockfile GHA workflow logic into Makefile - Attempt to run slow tests in parallel using "wait" to prevent premature exit of the shell * Use micromamba not conda in Dockerfile CMD, also use pip install --no-deps * Use micromamba not conda in command passed to build container * Use default mambauser rather than catalyst in docker container * Remove --no-capture-output which isn't supported by micromamba. Is this a problem? * Remove uninterpolated vars in .env and more --no-capture-output * Separate ETL and pytest commands. * Stop trying to run tests in parallel. Sigh. * Add google cloud sdk to conda environment. * Install Google Cloud SDK from conda-forge. * Add back in the making of required directories. Oops. * Attempt to have micromamba run pass through output * Use prettier to standardize formatting of lockfiles. * Add dagster (server startup) target to Makefile * Update conda lockfile and rerender environment files * Attempt to trigger update-lockfile when pyproject.toml is changed. 
* Remove non-required upper bounds on dependency versions. * Use correct branch name for update-lockfile trigger. * Fix incorrect nbformat minimum version * Update Makefile in response to PR feedback * Remove dagster-postgres version to avoid PyPI conda-forge conflict. * Break up and distribute the nuke target. * Resolve issues with pandera dependency. Reorganize makefile. * Add triggers and commit/PR for workflow_dispatch, pull_request, schedule and set appropriate GITHUB_REF values for each case. * Use push instead of pull_request to trigger on path. This avoids re-locking the dependencies every single time you push to a PR that had a change to pyproject.toml *somewhere* in it. * Also trigger based on path if .github/workflows/update-lockfile.yml changes. * Update conda-lock.yml and rendered conda environment files. * Move previous dev, test, and datasette optional dependencies into the required dependencies to simplify application installation. * Test make nuke; parallelize --live-dbs tests * Move prettier into conda-only dependencies * Update conda-lock.yml and rendered conda environment files. * Remove action test file hashlog * Remove merge markers. * Remove transitive astroid dependency that's now correctly included in solve. * Use the real immature library version of dagster-postgres (0.21.6) rather than the accidentally packaged 1.5.6 version found in conda. We'll need to keep an eye out for when dagster-postgres graduates to the stable versioning and update it. This is a bit of a mess because of some broken automation in the conda packaging for dagster which has now been fixed. * Update "make pudl" to remove the old PUDL DB and reinitialize with alembic, rather than writing to the DB that already exists. * Fixed some groupby.agg() deprecation warnings. * Fix dagster-postgres version (again). * Update username in path to settings file * Avoid bugs in ferc_to_sqlite --clobber; don't use cache_dir for pip install. * Make FERC extraction output removal more specific. * Bump numpy and numba minimum versions. * Bump black version in pre-commit * Bump ruff pre-commit hook version * Rename tox-pytest and update-lockfile workflows. * Make scheduled PRs against dev rather than conda-lockfile branch. * Update to pyarrow 14 and grpcio 1.59 b/c security * Update release notes * Add CI targets in Makefile. Docs cleanup. * Update conda-lock.yml and rendered conda environment files. --------- Co-authored-by: zaneselvans <zaneselvans@users.noreply.github.com> * Revert temporary change to .pre-commit.yml * Update conda-lock.yml and rendered conda environment files. * Remove no longer required create args. --------- Co-authored-by: zaneselvans <zaneselvans@users.noreply.github.com>
PR Overview
This PR switches our dependency management system to use platform-specific lockfiles produced by conda-lock to generate reproducible Python environments, hopefully increasing the stability of our nightly builds, and ensuring that we can reproduce a given set of data products at a later date if need be, given the commit hash of our own repository, and the exact versions of all our software dependencies.
Conda Lockfile
- We use pyproject.toml as the single source specifying all of our software dependencies.
- We generate a lockfile, environments/conda-lock.yml, which specifies exact versions and hashes for the linux-64, linux-aarch64, osx-64, osx-arm64, and win-64 platforms.
- All of the environment files are now consolidated under the environments/ directory.
- When using conda-lock or micromamba (which can parse the full conda-lock.yml) we refer to the full, multi-platform conda-lock.yml.
- When only conda or mamba are available, we have to specify the platform-specific environment files.
- There's a new GitHub Actions workflow, update-lockfile, that will eventually run on a schedule, against the dev branch, and will re-lock the dependencies, and re-render the environment files under environments/ to update them with the new exact versions (assuming all tests pass).
- The Docker image is now based on mambaorg/micromamba, which comes with the tiny micromamba package manager installed, and not much else.
- micromamba can directly read the conda-lock.yml file and install the packages very rapidly. It's designed for this kind of locked environment reproduction.
- Some things were previously labeled doc, but this didn't match the directory name and other references which were labeled docs, and this PR switches everything to refer to docs (which is also what we use in the cheshire template repo).
- The packages we previously had to install with pip (recordlinkage and libsnappy) are now available via conda, so we don't need to have a pip install section in any of our environment files.
- Conda-only dependencies are specified in pyproject.toml in the section dedicated to conda-lock. This includes e.g. python itself, sqlite, nodejs, and pandoc.
- conda-lock appears unable to parse the "extras" format sometimes used with pip dependencies, so rather than being able to use pandas[performance,gcp]>=2,<2.2 you actually have to specify the packages that would have been installed as a result of including those extras.
- grpcio still has to be specified directly in pyproject.toml because it has binary incompatibilities on MacOS that even conda isn't solving.
- We no longer publish to PyPI / conda-forge because we're not testing a built package anymore. From now on PUDL is an application that gets deployed from a repository in a locked conda environment.
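To make the lockfile-vs-rendered-file distinction concrete, environment creation looks roughly like this (a sketch; the env name pudl-dev and the rendered file name are assumptions):

```bash
# micromamba (and conda-lock) can consume the multi-platform lockfile directly:
micromamba create -n pudl-dev --category main --file environments/conda-lock.yml

# plain conda/mamba need one of the rendered, platform-specific environment files:
conda env create -n pudl-dev --file environments/conda-linux-64.lock.yml
```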
Switching from Tox to Makefile
- Since we now create the conda environment based on the lockfile, the environment produced by Tox would be both redundant, and likely install other versions of dependencies that aren't part of the locked environment.
- If we're installing catalystcoop.pudl from the repository rather than from a built package, then all Tox is doing is storing logic about how to run the tests and various other tasks.
- So we've replaced Tox with a Makefile and brought as much of the logic from the GHA workflows into the Makefile as possible, so that it can be easily run either locally or in CI.
- Most of the Makefile targets are "phony" in that they do not correspond directly to a filesystem output whose creation date indicates which commands need to be re-run, but in general the setup replicates the "collection of code snippets" that we had stored in tox.ini. We can accumulate additional reusable snippets in there if we want to standardize other tasks across all of our development environments.
Repository vs. Package Installation & optional dependencies aka "extras"
- If we're only installing catalystcoop.pudl from the repository, then the way we've historically split out our dev, test, docs, etc. dependencies no longer makes sense. Almost everywhere we're installing all dependencies directly from the explicit multi-platform conda-lock.yml file.
- The exception is local, interactive development tools like pygraphviz, dagster-webserver, jupyterlab, etc. These have been split out into a very small dev list of optional dependencies that is installed locally when you run make install-pudl but that don't get installed in the Docker container.
Tasks
Outstanding Questions
Logging
- micromamba doesn't have a direct analog to conda's --no-capture-output and I'm not sure how its stream handling / attachment args correspond (if at all), so for now I've removed --no-capture-output from the command that's passed into the container. But I'm not sure what the implications of that are.
Docker stuff
- For some reason, when I run update-conda-lockfile using workflow_dispatch the docker-build-test action is not kicked off. See Update Lockfile #3022
- Installing the AWS CLI and Google Cloud SDK is hacky. Is AWS really only available via curl? gcloud can be installed similarly with curl, or via apt-get or from conda-forge (for OS X + Intel based Linux). What's the right approach? Also, how were we previously installing the Google Cloud SDK? I don't know what I changed to cause it not to be available any more.
- There are many environment variables which are set both in the docker/.env file, and inside Dockerfile. Is that intentional? Or should one of those be the unique source of that information?
Remaining dependency issues
- Why does pip think that the recordlinkage version is 0.0.0 when it's installed with mamba?
PR Checklist