Migrate to pydantic >= 2.0 #613

Merged · 38 commits · Jul 13, 2024

Conversation

@Vectorrent (Contributor)

This is a first pass at upgrading Pydantic to 2.7.3, as discussed in #597. Fortunately, Pydantic is barely used in Hivemind, so this should be fairly straightforward.

So far, I have extended the BaseModel class in order to define a Config object with arbitrary_types_allowed = True. From what I can tell, this was the simplest way to deal with our validator's regex handling. The __get_validators__ method was removed in 2.0, and this function needed some work.
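
For context, a minimal sketch of that approach (names here are illustrative, not the actual hivemind classes; in v2, the __get_validators__ hook is replaced by __get_pydantic_core_schema__):

from pydantic import BaseModel, ConfigDict

class PermissiveModel(BaseModel):
    # shared base class: let pydantic accept field types it has no
    # built-in schema for, such as custom regex-validated strings
    model_config = ConfigDict(arbitrary_types_allowed=True)

class ExampleRecord(PermissiveModel):  # hypothetical subclass
    payload: bytes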

I also removed the _patch_schema calls, because 1) field.required was changed to field.is_required in 2.0, and 2) field.is_required is now read-only. I couldn't figure out how to set it... and the code seems to work without it. So is it really necessary here?
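
For reference, v2 computes required-ness from the annotation and default, exposing it as the read-only FieldInfo.is_required(); a small standalone example (unrelated to hivemind's actual models):

from pydantic import BaseModel

class Point(BaseModel):
    x: int      # no default, so required
    y: int = 0  # has a default, so optional

print(Point.model_fields["x"].is_required())  # True
print(Point.model_fields["y"].is_required())  # False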

DO NOT MERGE; this is an initial attempt. Within the next few days, I will test this in a multi-node environment, and I will test it with Petals as well.

Until then, this version of the code seems to work well. I wanted to open the PR to get us all started.

@Vectorrent (Contributor Author)

This PR is as far as I can reasonably take it. I fixed the linting errors introduced by my changes (but not the ones already existing in the master branch), and I reverted Pydantic to 2.5.3 (to make it compatible with the CI environment). That said, Hivemind is also compatible with 2.7.3.

I tested this change with a multi-node training run, and it works well. I also tested it with Petals and was able to run inference there as well.

I'd say this is ready for merge.

@Vectorrent (Contributor Author)

It looks like #612 might fix the errors we're seeing in tests here.

@justheuristic self-requested a review on June 6, 2024, 08:36
@justheuristic (Member)

NB: We'll need to update hivemind to fix incompatibility with the latest torch update so we can test this (see failed tests). I intend to do this as soon as I have some bandwidth, but if someone's willing to do this earlier, please do.

@Vectorrent (Contributor Author)

It appears that torch.cuda.amp was deprecated and everything was moved to torch.amp, so I fixed that.
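
For reference, the rename mostly amounts to passing the device type explicitly; a minimal sketch (the exact call sites in hivemind may differ):

import torch

# Deprecated spelling:
#   scaler = torch.cuda.amp.GradScaler()
#   with torch.cuda.amp.autocast():
#       ...
# Current spelling (torch >= 2.3):
scaler = torch.amp.GradScaler("cuda")
with torch.amp.autocast(device_type="cuda"):
    pass  # forward/backward pass goes here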

Strangely, some of these tests install torch 2.3.1, while others install 1.9.0 (and thus fail). Not sure if that is intentional, but I'll keep digging.

@samsja (Contributor) commented Jun 7, 2024

> It looks like #612 might fix the errors we're seeing in tests here.

Yes, it should fix all CI errors except the black/isort one.

@samsja (Contributor) commented Jun 7, 2024

> NB: We'll need to update hivemind to fix incompatibility with the latest torch update so we can test this (see failed tests). I intend to do this as soon as I have some bandwidth, but if someone's willing to do this earlier, please do.

FYI: #612

@Vectorrent (Contributor Author)

It looks like my PR and @samsja's are addressing the same bug, so we should only merge one.

That said, I have no idea why these tests are failing right now while theirs worked. All I'm seeing is "Error: The operation was canceled." Tests are not failing with my code; they are simply hanging forever and getting cancelled after a while.

Both PRs still have the same bug, relating to the Albert test:

#15 20.13   × python setup.py egg_info did not run successfully.
#15 20.13   │ exit code: 1
#15 20.13   ╰─> [6 lines of output]
#15 20.13       Traceback (most recent call last):
#15 20.13         File "<string>", line 2, in <module>
#15 20.13         File "<pip-setuptools-caller>", line 34, in <module>
#15 20.13         File "/tmp/pip-install-_8jw_ha0/pathtools_37faf27303d54936a074efc743740ba1/setup.py", line 25, in <module>
#15 20.13           import imp
#15 20.13       ModuleNotFoundError: No module named 'imp'
#15 20.13       [end of output]

Apparently, the "imp" module was removed in Python 3.12, so I'm sure that's where the problem lies. I was able to reproduce this bug in Docker locally, so I'll spend some time offline troubleshooting.
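
For anyone hitting the same thing: imp was deprecated since Python 3.4 and removed in 3.12. The importlib equivalent of the common imp.load_source pattern looks roughly like this (a sketch, not a patch to pathtools):

import importlib.util

def load_source(name: str, path: str):
    # modern replacement for imp.load_source(name, path)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module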

@Vectorrent (Contributor Author)

It seems that tests/test_optimizer.py is hanging at test_progress_tracker(). Since that code references BytesWithPublicKey, and my Pydantic changes touched that code, it's probably related. I just don't know why it hangs without doing anything.

If anyone with Pydantic experience wants to jump in, I'd appreciate it.

@samsja (Contributor) commented Jun 7, 2024

> It seems that tests/test_optimizer.py is hanging at test_progress_tracker(). Since that code references BytesWithPublicKey, and my Pydantic changes touched that code, it's probably related. I just don't know why it hangs without doing anything.
>
> If anyone with Pydantic experience wants to jump in, I'd appreciate it.

I just tried your PR with pydantic 2.5.3 and 2.7.3. It seems to be working fine; at least, the test is not hanging for me. Happy to help if you give me a hint on how to reproduce the hanging behavior.

@Vectorrent (Contributor Author)

The code is not hanging anymore, but we are getting some failures relating to the validator. From what I can tell, match_regex() is never executing, so it makes sense that we're getting errors about missing fields. I don't know Pydantic well enough to know how to fix this.

The easiest way to reproduce is with the docker-compose.yml I added. Just cd into that directory and run two commands:

docker compose build
docker compose up

That will execute the tests specified in docker-compose.yml. You can change that as needed.
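
One hypothesis worth checking: in Pydantic v2, a validator only runs if it is registered with the field_validator decorator; a method that merely exists on the model (as v1's __get_validators__ hook allowed) is silently ignored, which would match the "never executing" symptom. A hypothetical sketch, not hivemind's actual schema:

import re
from pydantic import BaseModel, field_validator

class ExampleSchema(BaseModel):  # hypothetical model
    key: bytes

    @field_validator("key", mode="before")
    @classmethod
    def match_regex(cls, value: bytes) -> bytes:
        # without the decorator above, this method is never called
        if not re.fullmatch(rb"[0-9a-f]+", value):
            raise ValueError("key does not match the expected pattern")
        return value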

@Vectorrent (Contributor Author) commented Jun 10, 2024

I reverted to the v1 API, since v2 would probably break Pydantic usage in Petals as well... and I don't want to deal with that.
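
Concretely, pydantic >= 2 ships the legacy 1.x API under the pydantic.v1 namespace, so the revert can be done with a guarded import; a sketch of the general pattern (the PR's exact imports may differ):

try:
    from pydantic.v1 import BaseModel, StrictFloat, conint  # pydantic >= 2
except ImportError:
    from pydantic import BaseModel, StrictFloat, conint  # pydantic < 2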

It seems that most tests are passing now. I have no idea why some of them are being cancelled, but it seems like they're timing out. And these failures are not even consistent: whereas Python 3.8 succeeded on the previous run, it timed out on this attempt. Not sure what's happening there, but it looks like a GitHub issue more than anything.

@Vectorrent (Contributor Author)

These tests keep failing with random, transient errors. What works on this run will fail on the next. Hivemind is just being unreliable right now.

@Vectorrent (Contributor Author)

I'd recommend merging if you're okay with this code. Tests keep failing, but they're failing with transient errors that were probably not introduced by the Pydantic update:

test_averaging.py::test_allreduce_once[0-0] PASSED                       [ 23%]
test_averaging.py::test_allreduce_once[0-1] PASSED                       [ 23%]
test_averaging.py::test_allreduce_once[0-2] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-0] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-1] PASSED                       [ 24%]
test_averaging.py::test_allreduce_once[1-2] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-0] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-1] PASSED                       [ 25%]
test_averaging.py::test_allreduce_once[2-2] PASSED                       [ 25%]

Stuff like this is just a quirk of Hivemind. all_reduce can execute successfully 9 times in a row before randomly hanging forever (until the test times out). There is a deeper issue here, but one that should probably be fixed in a different PR.

@justheuristic (Member) left a comment:

LGTM

@mryab (Member) left a comment:

Thanks for the contribution! However, before merging, please reduce the diff of the PR to make sure all changes are Pydantic-related.

Review threads (outdated, resolved): LICENSE, docker-compose.yml

@mryab (Member) left a comment:

Thank you again for the pull request! I believe we are close to merging, but there is still a small list of changes that need to be made before this.

Review threads (outdated, resolved): tests/test_utils/p2p_daemon.py, tests/test_dht_schema.py, examples/albert/requirements.txt, benchmarks/benchmark_optimizer.py

requirements.txt (outdated diff):

@@ -12,5 +12,5 @@ configargparse>=1.2.3
 py-multihash>=0.2.3
 multiaddr @ git+https://github.com/multiformats/py-multiaddr.git@e01dbd38f2c0464c0f78b556691d655265018cce
 cryptography>=3.4.6
-pydantic>=1.8.1,<2.0
+pydantic>=2.5.3

Member (commenting on the pydantic pin above):

This version is quite recent; can you provide any reasoning behind how it was chosen? Maybe we can bump to just 2.0? Ideally, we should even keep backwards compatibility with older versions.

@Vectorrent (Contributor Author):

I chose this version because it was the highest version I could use that was still compatible with all of the tests. Remember: the whole reason we're upgrading Pydantic is that the old version has been conflicting with other dependencies in other projects.

We can revert to 2.0.0 if you still want me to do that, though 2.5.3 seems to be working fine.

Contributor:

IMO, if the code is still compatible with Pydantic v1, it might be worth just requiring pydantic>1.8.1.

Review thread (outdated, resolved): .github/workflows/run-tests.yml

@Vectorrent (Contributor Author)

Opened #619 and #620 to fix these errors again.

@Vectorrent (Contributor Author)

@mryab - is there anything else you need from me? Between the 3 open PRs, we should be able to get past these failing tests.

@samsja (Contributor) commented Jul 5, 2024

> @mryab - is there anything else you need from me? Between the 3 open PRs, we should be able to get past these failing tests.

I have been using this PR for a couple of weeks; it seems to be stable.

Review thread (outdated, resolved): requirements.txt

codecov bot commented Jul 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.87%. Comparing base (d20e810) to head (6df0176).
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #613      +/-   ##
==========================================
+ Coverage   85.39%   85.87%   +0.47%     
==========================================
  Files          81       81              
  Lines        8006     8014       +8     
==========================================
+ Hits         6837     6882      +45     
+ Misses       1169     1132      -37     

Files                                Coverage Δ
hivemind/dht/schema.py               100.00% <100.00%> (ø)
hivemind/optim/progress_tracker.py   97.80% <100.00%> (ø)

... and 3 files with indirect coverage changes

@mryab changed the title "migration to pydantic > 2.0" to "Migrate to pydantic > 2.0" on Jul 13, 2024
@mryab changed the title "Migrate to pydantic > 2.0" to "Migrate to pydantic >= 2.0" on Jul 13, 2024
@mryab merged commit 128ee90 into learning-at-home:master on Jul 13, 2024
13 checks passed