Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict PyTorch version not to be more advanced than that used in Elasticsearch #479

Merged
merged 7 commits into from
Jul 7, 2022

Conversation

davidkyle
Copy link
Member

PyTorch recently released version 1.12 which contains breaking changes to torchscript models making them incompatible with earlier versions.

The Elastic stack uses version 1.11 of the PyTorch C++ library (libtorch), a model traced to torchscript format using a v1.12 Python PyTorch install cannot be evaluated in libtorch v1.11.

Anyone using Eland to import a model into Elastic today would pull the latest version of PyTorch and find the model cannot be evaluated in Elasticsearch. Opening the model errors with the message:

Attempted to read a PyTorch file with version 10, but the maximum supported version for reading is 9. Your PyTorch installation may be too old.

This change to the requirements prevents Eland using v1.12 of PyTorch.

HuggingFace transformers has also pinned the torch requirement to <1.12 in huggingface/transformers@5a3d0cb

The sentence-transformers and transformers versions have also been tighten to a range known to be compatible with PyTorch 1.11.

@davidkyle davidkyle added the topic:NLP Issue or PR about NLP model support and eland_import_hub_model label Jun 30, 2022
@droberts195
Copy link
Contributor

It sounds like the README needs updating too. Currently it says:

  • Supports Elasticsearch clusters that are 7.11+, recommended 7.14 or later for all features to work. Make sure your Eland major version matches the major version of your Elasticsearch cluster.

But it sounds like actually the requirement is that your Eland distribution uses the same PyTorch version as the ML C++ code, and to avoid having some really complicated compatibility matrix the easiest way to express that is to say Eland and Elasticsearch must be of the same minor version (any patch level within that minor).

@davidkyle
Copy link
Member Author

davidkyle commented Jul 4, 2022

The CI failure was due to a change in the Python client after _infer API was renamed in the Elasticsearch 8.3 release.

elastic/elasticsearch-py@0e16cd6

As the the _infer API is in tech preview and the old name is supported but deprecated it is ok to rename but it means a 8.3 version of Eland must be used with an 8.3 Python client and Elasticsearch cluster.

Given that upgrading PyTorch has already enforced this requirement and may happen again on a later upgrade I've updated the README to state that if using the PyTorch features the minor versions of Elasticsearch and Eland must be the same.

Update
Actually it is Python client that must be >=v8.3

@davidkyle
Copy link
Member Author

I pushed c97d641 which bumps the required Elasticsearch Python client to version 8.3.

Eland does not make any claims for compatibility with older versions of the Python client and the latest is always preferred

@davidkyle
Copy link
Member Author

davidkyle commented Jul 4, 2022

clients-ci is now failing against pre 8.3 nodes (obviously) which do not know about the _infer API change. I'm not sure why the tests pass for Python 3.10 I'm assuming the NLP tests are skipped for 3.10

 'no handler found for uri [/_ml/trained_models/distilbert-base-uncased/_infer?timeout=60s] and method [POST]')

This can be fixed using the REST compatibility layer to call the old deprecated endpoint but it would still leave the problem where the current version of PyTorch is not compatible with the latest Elasticsearch release and cannot be upgraded without breaking BWC at some point.

Returning to the comment above:

avoid having some really complicated compatibility matrix the easiest way to express that is to say Eland and Elasticsearch must be of the same minor version

We want to avoid the situation where some parts of Eland work with earlier versions of Elasticsearch and others do not. Tightly coupling the compatible versions of Eland, Elasticsearch and the Elasticsearch Python client at a minor level will guarantee compatibility. However, this is a change to the current policy and I'm happy to discuss it.

My proposal is this:

  1. Document that both the major and minor versions of Eland should match the Elasticsearch cluster it is running against
  2. Pin the Python client to the same major and minor version of Eland and update with every release
  3. Update extra requirements (torch, etc.) as and when required

@droberts195
Copy link
Contributor

My proposal is this:

  1. Document that both the major and minor versions of Eland should match the Elasticsearch cluster it is running against
  2. Pin the Python client to the same major and minor version of Eland and update with every release
  3. Update extra requirements (torch, etc.) as and when required

I agree with 2 and 3. On reflection I think my suggestion to require the minor version matches was probably too strict.

With Kibana and Elasticsearch the rule is that you have to upgrade Elasticsearch first and Kibana afterwards. This means that you can upgrade with no downtime, as the older version of Kibana continues to work with the newer version of Elasticsearch in the mixed version state during the upgrade.

I think we should specify a similar constraint for Eland: you have to upgrade Elasticsearch before Eland and an older Eland should work with a newer Elasticsearch during the upgrade period.

With trained models it sounds like we've got an additional hard break at the 8.3 boundary where we've changed the infer API. This is OK while we're in technical preview but going forwards we should aim to stick to the rule that older Eland works with newer Elasticsearch within the same major version.

@davidkyle
Copy link
Member Author

On reflection the allowing Eland to work with different versions of the same major Elasticsearch cluster will be required.

A developer who is using version 8.2 of Eland where the torch has not been pinned to <1.12 will automatically pull the latest torch package when the create a venv, install the requirements or build the docker image. That version is not compatible with the latest Elasticsearch release.

In 8.3 torch will be pinned to a compatible version so we want everyone to upgrade to 8.3 if they are using eland_import_hub_model. Even if they are running against an older version of Elasticsearch e.g. 8.1.0 we want the developer to upgrade to Eland 8.3.

Therefore Eland 8.3 must be compatible with 8.0, 8.1, and 8.2. which can be achieved by using the deprecated deployment/_infer which is supported throughout 8.

Given that, points 2 & 3 still stand:

  1. Pin the Python client to the same major and minor version of Eland and update with every release
  2. Update extra requirements (torch, etc.) as and when required

Everyone using the eland_import_hub_model script should upgrade to Eland 8.3 and we will make it backwards compatible with earlier 8.x versions and future compatible with later 8.x versions.

# Conflicts:
#	eland/ml/pytorch/_pytorch_model.py
@droberts195
Copy link
Contributor

we will make it backwards compatible with earlier 8.x versions and future compatible with later 8.x versions

Is it the case that if you trace a model using PyTorch 1.11 then PyTorch 1.9 cannot understand it? If that is the case then it's going to be very hard to make Eland compatible with older stack versions because it will have to dynamically get the appropriate version of PyTorch after finding out what the server version is.

@davidkyle
Copy link
Member Author

Is it the case that if you trace a model using PyTorch 1.11 then PyTorch 1.9 cannot understand it?

Anecdotally we know that models traced in PyTorch 1.11 can run in libtorch 1.9 as that is what many people have been doing. But it's worth verifying so I spun up a cloud cluster of Elasticsearch version 8.0.1 and uploaded a NER model using the changes in this PR (with PyTorch pinned to 1.11). Evaluating the model did work 😁 so PyTorch 1.11 is backward compatible with 1.9, the breaking change is in 1.12.

@droberts195
Copy link
Contributor

the breaking change is in 1.12

But then we're setting ourselves up for a problem the next time we upgrade PyTorch if we say Eland 8.3 and above is going to be compatible with all Elasticsearch 8.x versions.

We could say Eland 8.3 is compatible with all Elasticsearch 8.x versions and then never release another version of Eland. Or we could never upgrade the version of PyTorch used within Eland again in 8.x, so that by 8.last Elasticsearch might be on PyTorch 1.17 but Eland would still be on PyTorch 1.11.

But neither of those options sounds ideal. We could be forced into doing a quick upgrade by a major security bug for example.

This is why I think it would be better to define a compatibility policy for Eland and Elasticsearch that accounts for the possibility of breaking changes in PyTorch.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@davidkyle davidkyle merged commit 0eb36fa into elastic:main Jul 7, 2022
@davidkyle davidkyle deleted the no-future-versions branch July 7, 2022 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:NLP Issue or PR about NLP model support and eland_import_hub_model
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants