
Update ML AI Plugin To Support Keras 3 #1223

Merged
merged 23 commits into CCSI-Toolset:master from bpaul4:update-keras on Jul 2, 2024

Conversation

bpaul4
Contributor

@bpaul4 bpaul4 commented May 21, 2024

Fixes/Addresses:

Updates syntax in the ML AI Plugin examples, methods, and tests to support Keras 3, which replaces the standard H5 (.h5) file format with a new Keras (.keras) file format and deprecates the legacy SavedModel (folder) model type to an inference-only TFSM loading path. SavedModel files can still be loaded, but they cannot contain a custom layer that is detectable by FOQUS. (A minimal loading sketch follows the change list below.)

Summary/Motivation:

Changes proposed in this PR:

  • Reorganize ML AI example file structure into supported Keras, deprecated (legacy) Keras, and non-Keras trainer examples
  • Update examples to Keras 3 syntax and regenerate example files (files for non-Keras trainers were regenerated as well to ensure compatibility with the latest versions)
  • Update node method structure and tests to confirm compatibility with Keras 3 (latest TensorFlow, which is currently 2.16.1)
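As a rough illustration of the two loading paths involved (this is not FOQUS code, and the file names are hypothetical), a native .keras file round-trips fully in Keras 3, while a legacy SavedModel directory can only be wrapped as an inference-only TFSMLayer, which is why custom layers inside it are no longer detectable:

import keras

# Native Keras 3 format: full save/load round-trip; custom layers can be
# restored by passing custom_objects to load_model.
model = keras.Sequential(
    [keras.Input(shape=(4,)), keras.layers.Dense(8, activation="relu"), keras.layers.Dense(1)]
)
model.save("example_model.keras")
reloaded = keras.saving.load_model("example_model.keras")

# Legacy SavedModel directory (hypothetical path): Keras 3 exposes it only as
# an inference-only layer, so its internal layer structure, including any
# custom layers, is not recoverable.
tfsm_layer = keras.layers.TFSMLayer(
    "example_savedmodel_dir", call_endpoint="serving_default"
)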

Legal Acknowledgement

By contributing to this software project, I agree to the following terms and conditions for my contribution:

  1. I agree my contributions are submitted under the copyright and license terms described in the LICENSE.md file at the top level of this directory.
  2. I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

@bpaul4 bpaul4 added the ML/AI (Machine Learning/Artificial Intelligence) label May 21, 2024
@bpaul4 bpaul4 self-assigned this May 21, 2024
@bpaul4 bpaul4 linked an issue May 21, 2024 that may be closed by this pull request
@ksbeattie ksbeattie added the Priority:High (High Priority Issue or PR) label May 28, 2024
@bpaul4 bpaul4 marked this pull request as ready for review June 4, 2024 17:50
@bpaul4 bpaul4 requested review from lbianchi-lbl and franflame June 4, 2024 17:50
@bpaul4 bpaul4 changed the title from "[WIP] Update ML AI Plugin To Support Keras 3" to "Update ML AI Plugin To Support Keras 3" on Jun 4, 2024
@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 4, 2024

@bpaul4 to enable the ML/AI tests on versions of Python other than 3.10, you can try to add items to the job matrix include section:

    name: pytest (py${{ matrix.python-version }}/${{ matrix.os }})
    needs: [spell-check, code-formatting, pylint]
    runs-on: ${{ matrix.os-version }}
    strategy:
      fail-fast: false
      matrix:
        python-version:
          - '3.8'
          - '3.9'
          - '3.10'
          - '3.11'
          - '3.12'
        os:
          - linux
          - win64
          - macos-x86_64
          - macos
        include:
          - os: macos-x86_64
            os-version: macos-13
          - os: macos
            os-version: macos-14
          - os: linux
            os-version: ubuntu-20.04
          - os: win64
            os-version: windows-2019
          - python-version: '3.10'  # avoid uploading coverage for full matrix
            use_coverage: true
          - python-version: '3.10'  # this is to avoid installing optional dependencies in all environments
            optional-dependencies: -r requirements-mlai.txt

        include:
          - os: macos-x86_64
            os-version: macos-13
          - os: macos
            os-version: macos-14
          - os: linux
            os-version: ubuntu-20.04
          - os: win64
            os-version: windows-2019
          - python-version: '3.10'  # avoid uploading coverage for full matrix
            use_coverage: true
          - python-version: '3.10'  # this is to avoid installing optional dependencies in all environments
            optional-dependencies: -r requirements-mlai.txt
+         - python-version: '3.9'
+           optional-dependencies: -r requirements-mlai.txt
+         - python-version: '3.11'
+           optional-dependencies: -r requirements-mlai.txt

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl per the log, it appears that the "savedmodel" zip file is copied over and extracted to the test directory, but is not picked up by the model_files fixture in the test module itself. It picks up all other models and all Python files, but not the "savedmodel" folder. I'm currently investigating why, as this doesn't occur locally, only in CI runs.


@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 5, 2024

@bpaul4 thanks for looking into this. When running locally, have you tried running the tests in a freshly created directory? e.g.

mkdir testing-foqus-1223
cd testing-foqus-1223
git clone https://github.com/bpaul4/FOQUS
cd FOQUS
git switch update-keras
# activate environment
# run tests

One key difference between local development and the CI environment is that the latter always starts from a "clean slate", which might be one of the factors that results in the different behavior that we're seeing.

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl I did some more debugging, and it looks like in the fixture below

@pytest.fixture(scope="session")
def model_files(
    foqus_ml_ai_models_dir: Path,
    install_ml_ai_model_files,
    suffixes: Tuple[str] = (".py", ".keras", ".h5", ".json", ".pt", ".pkl"),
) -> List[Path]:
    paths = []
    for path in sorted(foqus_ml_ai_models_dir.glob("*")):
        if all(
            [
                ((path.is_file() and path.suffix in suffixes) or path.is_dir()),
                path.stat().st_size > 0,
                path.name != "__init__.py",
            ]
        ):
            paths.append(path)
    return paths

the "savedmodel" folder is picked up when path.stat().st_size > 0 and path.name != "__init__.py" are excluded. Assuming the first one is the cause, is there a reason it would exclude the folder and is it safe to remove this condition?

Update: just saw your prior comment, I haven't tried that but can try that next.

@lbianchi-lbl
Contributor

(quoting @bpaul4's comment above about the model_files fixture and the st_size condition)

Great point @bpaul4. If path points to a directory, path.stat().st_size should be 4096 and thus greater than zero. However, I've just checked on both Linux and Windows and, while this is true on Linux, on Windows the same snippet returns 0 🙃 I think you can go ahead and remove that condition and push the changes, and we can check if that resolves the failure.
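For reference, a minimal sketch of the fixture with that condition dropped (this is simply the snippet quoted above minus the st_size check, and not necessarily the exact change that was pushed):

from pathlib import Path
from typing import List, Tuple

import pytest


@pytest.fixture(scope="session")
def model_files(
    foqus_ml_ai_models_dir: Path,
    install_ml_ai_model_files,
    suffixes: Tuple[str] = (".py", ".keras", ".h5", ".json", ".pt", ".pkl"),
) -> List[Path]:
    paths = []
    for path in sorted(foqus_ml_ai_models_dir.glob("*")):
        if all(
            [
                # keep directories (e.g. the extracted SavedModel folder) as well
                # as files with a recognized suffix
                ((path.is_file() and path.suffix in suffixes) or path.is_dir()),
                # the previous `path.stat().st_size > 0` check is omitted here,
                # since st_size is 0 for directories on Windows, which silently
                # dropped the "savedmodel" folder
                path.name != "__init__.py",
            ]
        ):
            paths.append(path)
    return paths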

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl everything is green, so I think it's ready for your review per your availability. Thank you for your suggestions!

The changes within the folder examples/other_files/ML_AI_Plugin/ are mostly files being moved around. I moved old TensorFlow files into a "deprecated" folder, regenerated all of the models with the latest packages (both TensorFlow and non-TensorFlow models), and moved the current models into "supported Keras" and "other" folders. The only big change is that the example without a custom layer is now also the mea column model, as the autothermal reformer model couldn't readily be regenerated.

.github/workflows/checks.yml (outdated review thread, resolved)
@lbianchi-lbl
Contributor

@bpaul4 comparing the GHA logs for a passing test (above, Windows/Python 3.10) and failing test (below, macOS/Python 3.10), it looks like something between TestImports::test_import_tensorflow_success and TestPymodelMLAI::test_build_and_run_as_expected_1 is hanging indefinitely on macOS, causing the GHA step to time out after the built-in limit of 6 hours:

[GHA log screenshots: passing Windows/Python 3.10 run and failing macOS/Python 3.10 run]

Do you have any idea why this is happening? What type of test is TestPymodelMLAI::test_build_and_run_as_expected_1? (e.g. is it a GUI test?)

Since this seems to be happening consistently for all macOS runners (both Intel and Apple Silicon), I wonder if it's possible for someone with a macOS machine to try to reproduce this locally.

@bpaul4
Contributor Author

bpaul4 commented Jun 12, 2024

@lbianchi-lbl thanks for investigating further. The TestPymodelMLAI::test_build_and_run_as_expected_ tests just call the ML AI model class, which is done directly as shown below to avoid calling GUI commands:

def test_build_and_run_as_expected_1(self, example_1):
    # test that the loaded models run with no issues without modifications
    # as in subsequent tests, an alias is created to preserve the fixture
    # no custom layer, so set keras_has_custom_layer to False
    pytest.importorskip("tensorflow", reason="tensorflow not installed")
    test_pymodel = pymodel_ml_ai(
        example_1, trainer="keras", keras_has_custom_layer=False
    )
    test_pymodel.run()

It can't be due to just importing one of the machine learning packages or executing the models, as the foqus_lib/gui/tests/test_ml_ai.py::TestMLAIPluginFlowsheetRun tests and foqus_lib/unit_test/node_test.py::testImports tests are passing.

However, I see that

foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-keras_nn-tensorflow.keras] FAILED [ 27%]
foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-pytorch_nn-torch.nn] FAILED [ 28%]
foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-scikit_nn-sklearn.neural_network] FAILED [ 29%]

appears on the macOS runs but not in Windows or Linux, and this may be a more useful error to look into and see what's happening.

@lbianchi-lbl
Contributor

(quoting @bpaul4's comment above about the test_surrogate failures appearing only on macOS)

Ah, great catch, I didn't notice that. We're not getting more information about those failures because of the timeout, but I agree it's worth looking into in more detail. If you're OK with it, I'll push a change to the CI workflow file so that pytest exits after 3 failures, which should let us see the stack traces for those tests.

@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 12, 2024

@bpaul4 it looks like the failure in the ML/AI surrogate plugin is due to a timeout being reached by pytest-qt when waiting for the successful completion of the run() method. So it seems there might be some general ML/AI issue at play, since we're seeing problems for both the Pymodule and surrogate plugin parts of the FOQUS codebase.

Unfortunately, I'm not sure we can get much more information from CI runs, since (at least for the ML/AI surrogate) the output is redirected to the "pseudo-console" within the GUI so it's accessible to the user without having to check the logs. I think the next step would be for someone to run this locally and check visually and/or take a screenshot.

Is there any macOS-using generous soul who'd be willing to check? @ksbeattie @boverhof @kbuma @franflame It should be enough to check out the code in this branch, install the ML/AI dependencies, and run the tests. I can write a complete set of commands but it'll have to wait a bit.

UPDATE the following should work:

cd "$(mktemp -d)"
git clone https://github.com/CCSI-Toolset/FOQUS.git && cd FOQUS
git fetch origin pull/1223/merge:foqus-1223
git switch foqus-1223
conda create -c conda-forge -c CCSI-Toolset -n test-foqus-1223 python=3.10 psuade-lite=1.9 --yes
conda activate test-foqus-1223
pip install -r requirements-dev.txt -r requirements-mlai.txt

# run all tests, exiting after 3 failures
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir --maxfail=3

# only run GUI tests for the ML/AI surrogate plugins
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k test_surrogate

# only run the Pymodel non-GUI tests
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k node_test

@franflame
Contributor

I was able to use the above commands to set up a clean environment and did some manual testing. At least locally, pytorch_nn and scikit_nn are both working fine with no issues; only keras_nn doesn't complete as expected. I traced through keras_nn's execution by printing to the GUI log, and it appears that model.fit() is where it gets stuck.

@franflame
Contributor

franflame commented Jun 19, 2024

@bpaul4 @lbianchi-lbl Here's what I got from running the automated tests locally:
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir --maxfail=3
Exits after failing the tests for keras_nn, pytorch_nn, and scikit_nn.

pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k test_surrogate
Exits with three tests failing (timeout).

pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k node_test
Gets stuck on foqus_lib/unit_test/node_test.py::TestPymodelMLAI::test_build_and_run_as_expected_1 and doesn't seem to have a timeout(?); I had to manually exit after it was stuck for 25+ minutes.

Side note: from watching the GUI tests execute, the pytorch_nn and scikit_nn tests fail because the keras_nn test is still running, and a new process for pytorch_nn or scikit_nn can't be started while keras_nn hasn't finished. This is consistent with what I observed from manual testing, described in my previous comment.
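As an aside on the missing timeout noted above: one option (purely a suggestion sketch, assuming the pytest-timeout plugin were added to the dev requirements) would be to cap the hanging tests so they fail quickly instead of hitting the 6-hour GitHub Actions limit:

import pytest


# Hypothetical marker: fail the test after 10 minutes instead of hanging;
# requires the pytest-timeout plugin to be installed.
@pytest.mark.timeout(600)
def test_build_and_run_does_not_hang():
    ...  # build and run the model as in TestPymodelMLAI

The same cap can be applied globally from the command line with pytest --timeout=600.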

@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 25, 2024

  • macOS tests are failing with Tensorflow 2.16.1, which is the version that's required for the changes in this PR to work correctly
  • The failure seems to be caused by the execution hanging indefinitely when the .fit() method is called on any TF model
  • After some discussion, we're leaning toward skipping or xfailing the tests that are currently failing on macOS, and hoping that an upstream fix makes macOS work again with tensorflow>2.16.1 (a sketch of one possible marker follows this list)
  • @lbianchi-lbl will push the changes needed to skip/xfail these tests to this PR
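For illustration only (not necessarily the marker that was actually pushed), a platform-conditional xfail could look like this, with run=False so the hanging test body is not executed at all:

import sys

import pytest

# Hypothetical marker: TensorFlow 2.16.1 hangs during model training on macOS
# runners, so mark the affected tests as expected failures on that platform.
mark_macos_tf_hang = pytest.mark.xfail(
    sys.platform == "darwin",
    reason="TF 2.16.1 hangs on macOS runners; remove once fixed upstream",
    run=False,  # do not run the test body, since it hangs rather than fails
)


@mark_macos_tf_hang
def test_keras_model_runs():
    ...  # the real test would build and run the Keras model here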


codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 30.23256% with 30 lines in your changes missing coverage. Please review.

Project coverage is 37.88%. Comparing base (8da0631) to head (1531097).
Report is 20 commits behind head on master.

Files with missing lines | Patch % | Lines
foqus_lib/framework/graph/node.py | 28.57% | 22 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1223      +/-   ##
==========================================
- Coverage   38.54%   37.88%   -0.66%     
==========================================
  Files         164      164              
  Lines       37032    37048      +16     
  Branches     6132     6153      +21     
==========================================
- Hits        14274    14036     -238     
- Misses      21619    21835     +216     
- Partials     1139     1177      +38     


@ksbeattie ksbeattie merged commit 06fe4b7 into CCSI-Toolset:master Jul 2, 2024
29 of 31 checks passed
@bpaul4 bpaul4 deleted the update-keras branch December 3, 2024 13:59
Labels
ML/AI (Machine Learning/Artificial Intelligence), Priority:High (High Priority Issue or PR)
Development

Successfully merging this pull request may close these issues.

Resolve Keras File Format Issues
4 participants