Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fork bergamot-translator into this repostiory as inference #867

Merged
merged 467 commits into from
Oct 1, 2024

Conversation

nordzilla
Copy link
Contributor

@nordzilla nordzilla commented Sep 26, 2024

This PR forks the https://github.com/browsermt/bergamot-translator repository into this repository, self-contained within the inference-engine directory.

It also modifies it in the following ways:

  • Remove immediately unnecessary-for-this-project code.
  • Refactor and add scripts for build/test tasks.
  • Make tasks runnable in docker, listed in Taskfile.yaml
  • Modify the CODEOWNERS file for this repository.

Jerin Philip and others added 30 commits April 27, 2021 15:56
Fixes browsermt/bergamot-translator#101.
ResponseBuilder is called with empty histories to trigger a valid but
mostly-empty response.
* Control validating the config options via a boolean flag

 - parseOptions() function now validates the parsed options
   based on the validate argument

* Minor syntactic fix
* Bindings to load model and shortlist files as bytes
* Modified wasm test page for byte based loading of files
* Updates wasm README for byte loading based usage of TranslationModel
 - bergamot-models now contains lexical shortlist bin files as well
* Update marian-dev to the newest mac version

* Attempt windows workflow

* force workflow rerun

* Separate id

* Attempt 3 at github action

* Marian dev submodule now compiles with apple clang

* Updated ssplit version to something more recent

* Attempt to fix compile on wasm

* Do not compile subproject tests

* Fix emscripten compilation on Mac

* 99% on the way to windows compile

* Try with a different generator

* Build release not debug

* Revert CMakeLists.txt hacks

* Fix sse2 compilation failure

* MSVC settings for WIN32

* Add nodefaultlib LIBCMT

* Do not compile ssplit.cpp as it contains sys/mman.h

* Revert ab56b9aa4f4360b0ab98d5806658d4302f31db1d

* Update paths

* Set the build type to release if not set previously

* Attempt to build release with the windows workflow

* Attempt 5 at VS studio release build

* Attempt 6 at getting release build on MSVC generator

* The windows build is debug at the moment...

* fix ssplit for ubuntu 16.04

* Fix compilation with clang

* Compile on ubuntu16.04

* Explain what is going on

* Updated ssplit and workflow
 - This increases the inference speed while providing
   models as bytes to the translation engine
   (it wasn't needed while providing models as files)
* Safe transfer of bindings through typedefs

* Removing Translation* files and bringing in counterparts

* Remove previously commented out code

* Removing commented out include

* Absorb Translation* documentation

Co-authored-by: abhi-agg <66322306+abhi-agg@users.noreply.github.com>
…wasm test (#125)

* Improved script that patches wasm artifacts to enable wormhole

 - Made the regex pattern ignore multiple whitespaces b/w words of
   the matching pattern

* Fix for loading EN->DE vocabularies in wasm test page

 - Loading vocabularies for EN->DE was failing because of
   the new structure of bergamot-models
 - Changed "browsermt" to "mozilla"
 - The upstream browsermt/bergamot-translator builds the wasm artifacts
   in top level build folder now
* Enable worker file system
* Avoid node.js-code in emscripten glue-code
* Enable worker file system
* Avoid node.js-code in emscripten glue-code
* Fix busy loop in windows

* Nick wants the while loop gone

* Fix continue leftover

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
* Adding bytearray option

* collapse intermediate for bytearray apps

* Removing service-cli-bytearray

* Removing the bergamot bytearray app

* Bumping updates to brt collapsing apps

* Reasonable defaults and hard check when cmd enabled

* Update documentation for flags

* Bump brt with MKL check and skip

* Bumping BRT with MKL_FOUND instead of USE_MKL

* Bumping BRT with no mkl enforce

* Bumping BRT with ssse3 output

* Let's try disabling OpenBLAS

* Trying to disable apple accelerate

* Using WASM compatible BLAS can enable intgemm

* Adding a CMake -L to see what exactly is the diff

* Revert "Let's try disabling OpenBLAS"

This reverts commit 9a6b9bc53bf7dec956889f6e0b7047e5388e1b7e.

* Revert "Using WASM compatible BLAS can enable intgemm"

This reverts commit 936a592e18431c279e6c5952a278d012d18ff295.

* Restricting mac tests through tags and on GitHub CI

* Using only check-bytearray

* Bumping BRT with change of default behaviour
* First draft of faithful translation

* Comments explaining pre and post

* Comments on response_builder

* Updating bergamot-translator-tests with new outputs

* Cosmetic changes in response target text construction

* Replacing &(x[0]) -> x.data() to avoid illegal indices

* Removing nullptr given both branches init pointer with legal values

* pre, post -> gap(i) addressing review comments

Functions which were pre and post before are subsumed by gap(i), and the
algorithm in ResponseBuilder adjusted to fix.

`x = nullptr` is back, should be harmless.

* Updating brt with paragraph outputs

* Bumping brt with updated outputs, buffer text at begin as well

* Bumping BRT with sync after bytearray collapse merge

* Pointing BRT to main after merge

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
* first attempt to enable vocabs pass as byte arrays

* pass vocabs bytes as AlignedMemory

* add vocabIndices to avoid double loading

* small fix on parameter names and documentation

* fix windows build plus tiny update on documentation

* update marian-dev submodule

* move validate model bytearray in BatchTranslator

* small refactors on validateBinaryModel()

* switch vocab memories to std::vector<marian::Ptr<AlignedMemory>>

* update marian-dev submodule

* replace marian::Ptr to std::shared_ptr for vocab memories

* add note for vocab memories
* Update ssplit submodule, removing absl

* Fix ssplit variables

* Update ssplit branch

* Fix emscripten compilaiton

* Update tests
… for WASM (#138)

* Change WASM_COMPATIBLE_SOURCE=OFF by default

The default was WASN_COMPATIBLE_SOURCE=ON COMPILE_WASM=OFF which is a
testing configuration, not a sensible default for native or wasm.

* Always USE_WASM_COMPATIBLE_SOURCE with COMPILE_WASM

* Set CMP0077 to fix variable handling
 - This is required in the extension while using wasm module in a worker environment
 - Added "-g2" flag furing linking step
@nordzilla nordzilla force-pushed the fork-bergamot branch 2 times, most recently from f769460 to 6243ad1 Compare September 26, 2024 19:10
@nordzilla nordzilla marked this pull request as ready for review September 26, 2024 19:19
@gregtatum
Copy link
Member

Locally the commits look like this:

c62bea0 2024-09-25 Erik Nordin (HEAD -> fork-bergamot, origin/fork-bergamot) Add review groups to CODEOWNERS
07e3216 2024-09-26 Erik Nordin Move build-wasm script to inference-engine/scripts directory
cf23bf7 2024-09-26 Erik Nordin Add clean script to inference-engine
bb47eed 2024-09-26 Erik Nordin Add unit-tests script to inference-engine
00f4a30 2024-09-26 Erik Nordin Add build-local script to inference-engine
b6906d9 2024-09-26 Erik Nordin Remove unneded doc code
1019fdd 2024-09-26 Erik Nordin Remove unneeded CLI code
27e85d2 2024-09-20 Erik Nordin Remove unneeded Python code
37d0113 2024-09-20 Erik Nordin Remove .circleci and .github files
3da08c9 2024-09-26 Erik Nordin Remove bergamot-translator-tests dependency
cad3963 2024-09-19 Erik Nordin Rename inference-engine/3rd_party/marian-nmt
bbb8442 2024-09-19 Erik Nordin Move inference-engine git submodules to the repository root
285cf6b 2024-09-26 Erik Nordin Add 'inference-engine/' from commit '9271618ebbdc5d21ac4dc4df9e72beb7ce644774'
77479b3 2024-09-24 Greg Tatum (origin/main, origin/HEAD, main) Update the statistics class to use data attributes and use Statistics in merge-mono.py (#853)

Then the merge commit looks like:

➤ g show 285cf6b
commit 285cf6ba4d1ff96e018944bf5abf348a3945a873
Merge: 77479b3 9271618
Author: Erik Nordin <enordin@mozilla.com>
Date:   Thu Sep 26 12:54:32 2024 -0500

    Add 'inference-engine/' from commit '9271618ebbdc5d21ac4dc4df9e72beb7ce644774'

    git-subtree-dir: inference-engine
    git-subtree-mainline: 77479b3a77703a281bdf8313e168ecf697a376e5
    git-subtree-split: 9271618ebbdc5d21ac4dc4df9e72beb7ce644774

Copy link
Member

@gregtatum gregtatum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I manually scanned through every file, did a clean git clone, and initiated the git submodules. I didn't do a full build using the task commands as I'm currently working on docker stuff, but I figure if it's broken we can follow-up with fixes.

@@ -0,0 +1,50 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: The general rule of thumb is that if a script starts to get complicated, it should be written in python. These seem to be fairly simple, and are following the practices of the existing repo being ported over, so I think we're still fine to keep these as bash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to convert them to Python!

It sounds nicer anyway. I kept them as bash to match the pre-existing build-wasm.sh.

But after the merge, I wouldn't mind doing a bash-to-Python cleanup.

I would rather have a Python ecosystem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a build script and it's pretty much an industry standard to write such scripts in bash. I suggest we leave them alone unless there is some really complex logic and it's getting hard to maintain them.

@gregtatum
Copy link
Member

Let's make sure to get @eu9ene's approval on this before landing.

Copy link
Collaborator

@eu9ene eu9ene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a lot of code. I didn't look at it much but I hope we'll be able to delete everything except the wasm bindings and related classes.

There are quite a lot of small structural suggestions that should be addressed before we can merge to main. Since we decided to share documentation and lining infrastructure we should unify all this right away as it was one of the reasons why we're moving it to this repo.

.github/CODEOWNERS Show resolved Hide resolved
.gitmodules Outdated Show resolved Hide resolved
Taskfile.yml Outdated Show resolved Hide resolved
inference-engine/.clang-format Outdated Show resolved Hide resolved
@@ -0,0 +1,30 @@
# vim temporary files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have .gitignore in the root, we can merge this one there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gitignore is designed to be able to add such a file in any subdirectory within a repository, so I didn't see an issue with keeping it here.

But I don't have a strong preference, so I'm happy to merge them into the root-level .gitignore file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this, and I really feel it's cleaner to keep the .gitignore separate.

At least for now. I'd like to get this merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this version? Who will maintain it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we currently mark the version that gets pulled into Firefox.

https://searchfox.org/mozilla-central/rev/9fa446ad77af13847a7da250135fc58b1a1bd5b9/toolkit/components/translations/bergamot-translator/moz.yaml#21-27

We likely won't need this in the future, but I don't feel comfortable removing it right now.

I would rather remove only things that are obviously unneeded, and continue to build/remove more nuanced things as this project progresses.

inference-engine/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old example. It should use our models from firefox-translations-models They are already in bergamot-translator format. Also, I suggest we move this example to the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They probably should use our models, but the purpose of this PR is to merge this repository into our repository.

There is plenty of work still to be done.

I would rather merge now, rather than continuing to work on a separate branch that would accumulate an even larger pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support the general feedback here from Erik to merge now, and follow-up with cleanups. The initial merge is hard, so it's easier to work iteratively.

@@ -0,0 +1,50 @@
#!/bin/bash
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a build script and it's pretty much an industry standard to write such scripts in bash. I suggest we leave them alone unless there is some really complex logic and it's getting hard to maintain them.

inference-engine/wasm/README.md Outdated Show resolved Hide resolved
@nordzilla
Copy link
Contributor Author

nordzilla commented Sep 27, 2024

Thanks for all the feedback.

I'll get this PR cleaned up further today, so that we can get this merged.

There's quite a lot of code. I didn't look at it much but I hope we'll be able to delete everything except the wasm bindings and related classes.

This is the ultimate goal.

  • The first stage is to get all of the currently used components into our repository.
  • The second stage is to start stripping out the HTML-related code.
  • The third stage is to start stripping out the segmentation-related code.

But I need it all there to start, so that I can build up a good testing strategy, and ensure that regressions do not occur along the way.

There are quite a lot of small structural suggestions that should be addressed before we can merge to main. Since we decided to share documentation and lining infrastructure we should unify all this right away as it was one of the reasons why we're moving it to this repo.

I appreciate the feedback!

This is precisely why I would like to merge this into main sooner, rather than later, to ensure that we are all aligned on this kind of thing before continuing to make divergent changes on an independent branch.

@nordzilla
Copy link
Contributor Author

I made a few of the changes, resolving the relevant comments for them.

I would prefer to follow up with the rest after this content is merged.

@nordzilla nordzilla requested a review from eu9ene September 30, 2024 21:22
The LICENSE file is exactly the same as the root-level LICENSE file.
@nordzilla
Copy link
Contributor Author

Made a few more of the requested changes.

I'd like to get this merged today and continue with more follow-ups/other work as PRs to the main branch.

Copy link
Collaborator

@eu9ene eu9ene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a lot tidier now, great job cleaning it up Erik!

I left only one comment about adding a TODO for the modules. I think we will need to unify them in consequent PRs so that the begamot module is not duplicated and we are sure we use the same version for training and inference.

.gitmodules Show resolved Hide resolved
@nordzilla nordzilla merged commit 3974ccc into main Oct 1, 2024
4 checks passed
@nordzilla nordzilla deleted the fork-bergamot branch October 1, 2024 17:44
@nordzilla nordzilla changed the title For bergamot-translator into this repostiory as inference-engine Fork bergamot-translator into this repostiory as inference Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.