[FEA]: Remove redundant bert hash files #850

dagardner-nv · 2023-04-05T23:52:59Z

Is this a new feature, an improvement, or a change to existing functionality?

Change

How would you describe the priority of this feature request

Low (would be nice)

Please provide a clear description of problem this feature solves

It looks like we have several copies of these files. I'm not sure whether git-lfs is smart enough to avoid storing redundant copies, but even if it is, we also have them stored in the morpheus/data dir as regular git files.

$ find ./models/ -name "bert-base-cased-hash.txt" -exec diff -s morpheus/data/bert-base-cased-hash.txt {} \;
Files morpheus/data/bert-base-cased-hash.txt and ./models/training-tuning-scripts/sid-models/resources/bert-base-cased-hash.txt are identical
Files morpheus/data/bert-base-cased-hash.txt and ./models/training-tuning-scripts/log-parsing-models/resources/bert-base-cased-hash.txt are identical
$ find ./models/ -name "bert-base-uncased-hash.txt" -exec diff -s morpheus/data/bert-base-uncased-hash.txt {} \;
Files morpheus/data/bert-base-uncased-hash.txt and ./models/training-tuning-scripts/phishing-models/resources/bert-base-uncased-hash.txt are identical
Files morpheus/data/bert-base-uncased-hash.txt and ./models/training-tuning-scripts/sid-models/resources/bert-base-uncased-hash.txt are identical
Files morpheus/data/bert-base-uncased-hash.txt and ./models/training-tuning-scripts/root-cause-models/resources/bert-base-uncased-hash.txt are identical
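The `find`/`diff` check above only compares files that share a known name. As a rough sketch of a more general check, the following Python groups files by a hash of their contents. The function name `find_duplicates` and the default `*.txt` pattern are hypothetical, chosen for illustration, and are not part of the Morpheus repo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root, pattern="*.txt"):
    """Group files under `root` matching `pattern` by the SHA-256 of their contents.

    Hypothetical helper for illustration only. Symlinks are skipped so that
    a link to a canonical copy is not reported as a redundant copy.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Print each group of identical files found under the current directory.
    for group in find_duplicates("."):
        print(" == ".join(str(p) for p in group))
```

Hashing whole files is fine for small vocabulary and hash tables like these; for very large files a chunked read would be the safer choice.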

Describe your ideal solution

Delete the redundant copies.

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

I agree to follow this project's Code of Conduct
I have searched the open feature requests and have found no duplicates for this feature request

dagardner-nv · 2023-04-06T16:09:49Z

Similarly we have two copies of bert-base-cased-vocab.txt:

$ diff -s ./models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt ./models/training-tuning-scripts/log-parsing-models/resources/bert-base-cased-vocab.txt
Files ./models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt and ./models/training-tuning-scripts/log-parsing-models/resources/bert-base-cased-vocab.txt are identical

This PR creates at least one test for each example containing custom stages. It currently covers only those examples which do not require additional packages. Part of #849.

* Moves the bert vocabulary files to the `morpheus/data` dir, so they no longer need to be fetched from LFS and are available to unit tests.
* Fixes type hints and removes a redundant method in `examples/log_parsing/inference.py`.
* Removes redundant copies of the `bert-base-cased-hash.txt` and `bert-base-uncased-hash.txt` files, replacing them with symlinks to the files in the `morpheus/data` dir. Fixes #850.
* Explicitly sets `encoding='UTF-8'` in `examples/log_parsing/postprocessing.py` as a work-around for issue #859.
* Adds `py::kw_only` to the Python bindings for `TensorMemory` and its subclasses to ensure parity with the Python impls.
* Sets `repr=False` for the `tensors` field of `TensorMemory`, which avoids a bug when printing caused by the value being assigned to `self._tensors`.
* Seeds cupy's random number generator in the `manual_seed` method.
* Fixes usage of the `reload_modules` fixture: requesting a reload of multiple modules should be done with `@pytest.mark.reload_modules([mod1, mod2])`, not by calling `reload_modules` twice.
* New test data in `tests/tests_data/log_parsing` is based upon the first 5 rows of data from `models/datasets/validation-data/log-parsing-validation-data-input.csv`.

Authors:
- David Gardner (https://github.com/dagardner-nv)

Approvers:
- Michael Demoret (https://github.com/mdemoret-nv)

URL: #885
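The PR description above says the redundant hash files were replaced with symlinks to the copies in the `morpheus/data` dir. A minimal sketch of that kind of replacement, assuming a POSIX filesystem, might look like the following. The helper name `replace_with_symlink` is hypothetical and this is not the script used in the PR.

```python
import os
from pathlib import Path


def replace_with_symlink(duplicate, canonical):
    """Replace `duplicate` with a relative symlink pointing at `canonical`.

    Hypothetical helper for illustration only. Refuses to act unless the two
    files have identical contents, so nothing unique is lost.
    """
    duplicate = Path(duplicate)
    canonical = Path(canonical)
    if duplicate.read_bytes() != canonical.read_bytes():
        raise ValueError(f"{duplicate} differs from {canonical}")
    # A relative target keeps the link valid wherever the repo is cloned.
    target = os.path.relpath(canonical.resolve(), start=duplicate.parent.resolve())
    duplicate.unlink()
    duplicate.symlink_to(target)
    return target
```

For example, calling it with `models/training-tuning-scripts/sid-models/resources/bert-base-cased-hash.txt` as the duplicate and `morpheus/data/bert-base-cased-hash.txt` as the canonical copy would leave a link in the `resources` dir pointing back up to `morpheus/data`. Git stores such a link as a symlink entry, so the file contents live in the repo only once.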

dagardner-nv added the feature request New feature or request label Apr 5, 2023

dagardner-nv self-assigned this Apr 5, 2023

github-actions bot added the Needs Triage Need team to review and classify label Apr 5, 2023

dagardner-nv removed the Needs Triage Need team to review and classify label Apr 6, 2023

dagardner-nv mentioned this issue Apr 13, 2023

Create tests for examples with custom stages #885

Merged

3 tasks

rapids-bot bot closed this as completed in #885 Apr 28, 2023
