Use pretrained OPUS-MT models as teachers and backward models (and other changes) #117

TommiNieminen · 2023-05-03T19:41:10Z

Hi,

As requested by @andrenatal, here are the current changes made to the pipeline in connection with the GreenNLP project. The main change is the possibility to use OPUS-MT models as teachers and backward models (including multilingual models). The changes also make it possible to use multiple teachers.

In addition to download scripts for models and corpora, I've added some OPUS-MT-specific preprocessing scripts to the pipeline, as OPUS-MT models do not use Marian's inbuilt SentencePiece support. Some of the pre- and post-processing has been added directly to the Snakefile.

I've also added Snakemake configs for supercomputers from CSC (Finnish HPC provider), and made some changes to the containerization.

-Tommi

…ns-training pipeline:
- Added download scripts and rules for downloading Tatoeba-Challenge data and models.
- Modified training rules to accept downloaded Tatoeba-Challenge models as teachers and backward models.
- Modified containerization to include conda environments inside the container (to abide by CSC's conda deprecation).
- Added subword segmentation rules to marian-specific rules (since the default pipeline uses Marian's integrated sentencepiece support and Tatoeba-Challenge models don't).

NOTE: The pipeline is still a work in progress, and it may fail for some Tatoeba-Challenge models due to subtle differences in the model make-up.

…es (this might be fixed in newer marian versions)

Added note about changing CSC account

Fixed opusmt-teacher value to URL as it should be

…models using marian sentencepiece integration)

Fixed swahili code in Flores importer

Merging multi-teacher work

…ion)

eu9ene · 2023-05-03T23:40:16Z

pipeline/clean/clean-mono.sh

@@ -42,19 +42,21 @@ else
        | pigz >"${output_prefix}.${lang}.monofix.gz"
 fi

+# disabled bue to errors in langid, need to debug sometime


should we just add a config setting to disable it? I don't think we had problems with it for our training
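A minimal sketch of what such a switch could look like, assuming it reaches the script as an environment variable. `LANGID_ENABLED` and `langid_step` are hypothetical names, not existing pipeline settings. The helper runs whatever filter command it is given when the switch is on and passes the stream through untouched when it is off, so the rest of the cleaning chain stays the same either way.

```shell
#!/bin/bash
# Sketch only: LANGID_ENABLED and langid_step are hypothetical names,
# not part of the existing pipeline.

# Run the given filter command on stdin when language identification is
# enabled (the default); otherwise copy stdin to stdout unchanged.
langid_step() {
  if [[ "${LANGID_ENABLED:-true}" == "true" ]]; then
    "$@"
  else
    cat
  fi
}

# Usage idea (the filter command is a placeholder):
#   pigz -dc in.gz | langid_step some_langid_filter | pigz > out.gz
```

Defaulting to `true` would keep the current behaviour for existing training configs, and only the OPUS-MT test configs would need to turn it off.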

eu9ene · 2023-05-03T23:40:35Z

pipeline/clean/clean-corpus.sh

-echo "### Language identification"
-test -s "${output_prefix}.${SRC}${TRG}.langid.gz" ||
-  pigz -dc "${output_prefix}.${SRC}${TRG}.rule-based.gz" |
+#Disabled language identification due to crashes, maybe due to small corpus, try with bigger one


should we just add a config setting to disable it?

eu9ene · 2023-05-03T23:42:45Z

pipeline/translate/decoder.yml

+quiet-translation: true
+#precision: float16


we use this one for faster encoding. Should we just add another decoder config for opus models and add a config option for this?
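Roughly what I have in mind, as a sketch under assumptions: the profile argument and the `decoder.opusmt.yml` file are hypothetical, and only `pipeline/translate/decoder.yml` exists in this PR. The idea is that the default decoder config keeps `precision: float16`, while OPUS models get their own file.

```shell
#!/bin/bash
# Sketch only: the profile names and decoder.opusmt.yml are hypothetical;
# only pipeline/translate/decoder.yml appears in this PR.

# Print the decoder config path to use for the given profile.
select_decoder_config() {
  local profile="${1:-default}"
  case "${profile}" in
    opusmt)
      # Separate config for OPUS-MT models (e.g. without float16).
      echo "pipeline/translate/decoder.opusmt.yml"
      ;;
    default)
      # Existing config, unchanged.
      echo "pipeline/translate/decoder.yml"
      ;;
    *)
      echo "unknown decoder profile: ${profile}" >&2
      return 1
      ;;
  esac
}
```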

eu9ene · 2023-05-03T23:43:53Z

profiles/local/config.yaml

@@ -1,20 +1,20 @@
 verbose: false
 use-conda: true
-resources: gpu=8
-cores: all
+resources: gpu=0


I believe the idea for this config was to train locally on a machine with GPUs

eu9ene · 2023-05-03T23:51:58Z

@TommiNieminen thank you for contributing! I'm happy to see that you used the pipeline. Using pre-trained OPUS models will be valuable for us.

I see that it's a draft but I briefly looked at it. You made great use of singularity images, configs and snakemake profiles. From the code standpoint, I would suggest preserving the capability to train from scratch using the original bergamot recipe unless we want to fix some bugs and improve it. We can add some config settings and conditional operators for this to distinguish from OPUS use case. I'm not working actively on the project though and @andrenatal should have more context on our direction.

TommiNieminen · 2023-05-04T07:09:45Z

Thanks for the review, @eu9ene. I pushed a quick draft now in case you want to use the OPUS-MT parts of the pipeline for your Taskcluster workflow, but as you noticed, there are some loose ends in the code that I need to clean up.

I saw that elsewhere you were discussing whether to maintain Snakemake support in addition to Taskcluster. On my part, I'm going to keep working with the Snakemake workflow, since I'll be working in an HPC environment with SLURM. So I will be maintaining a Snakemake fork at least for some years, although I expect it to eventually diverge quite a lot from your work (I will be mostly working on retrieval-augmented MT).

bhearsum · 2023-05-10T23:20:27Z

> Thanks for the review, @eu9ene. I pushed a quick draft now in case you want to use the OPUS-MT parts of the pipeline for your Taskcluster workflow, but as you noticed, there are some loose ends in the code that I need to clean up.
>
> I saw that elsewhere you were discussing whether to maintain Snakemake support in addition to Taskcluster. On my part, I'm going to keep working with the Snakemake workflow, since I'll be working in an HPC environment with SLURM. So I will be maintaining a Snakemake fork at least for some years, although I expect it to eventually diverge quite a lot from your work (I will be mostly working on retrieval-augmented MT).

👋. I'm the person who's overseeing the migration into Taskcluster. @andrenatal and I spoke about this a bit earlier today. We agreed that for the time being we're going to maintain support for both Slurm/Snakemake and Taskcluster. I don't think we know what will happen in the medium or long term yet, though (and Andre is probably better positioned to talk about it anyways...).

I merged the first part of this work just now, which has caused merge conflicts here due to some minor changes in the pipeline scripts (that Taskcluster is also using). If you want a hand unbitrotting those please let me know and I'd be happy to do so.

andrenatal · 2023-05-11T07:45:09Z

Hi @TommiNieminen. Would you please join our Matrix room so we can coordinate a way to meet: https://matrix.to/#/#firefoxtranslations:mozilla.org? You can find me there under the handle anatal

…opus-mt training pipeline

Added tensorflow-rocm to bi-cleaner env to get it working on lumi

Added instructions for using Snakemake without non-containerized conda installation.

Formatting changes.

TommiNieminen · 2023-06-01T14:35:45Z

Sorry about the delay in getting this moving forward, I made some more modifications to the pipeline which I wanted to include. There are many scripts where the number and order of parameters have changed, so the Taskcluster setup probably won't work without changes.

I also noticed that the bicleaner-ai processing had stopped working at some point, because the latest bicleaner-ai models are no longer available as GitHub releases (they are now on Hugging Face). I changed the bicleaner-ai model download to load the v1 models, which are still on GitHub. The v2 models available through Hugging Face are not significantly better than the v1 models (I confirmed this with the bicleaner-ai developers), so it's not urgent to update the bicleaner-ai model downloader to use Hugging Face. I also updated the bicleaner-ai version in the conda env configuration to the latest one.

andrenatal · 2023-06-01T19:10:35Z

That's awesome @TommiNieminen, thanks for that!! I'll review it and we can chat about it tomorrow.

bhearsum

> There are many scripts where the number and order of parameters have changed, so the Taskcluster setup probably won't work without changes.

Don't worry about this. Most of your changes affect things we haven't set up in Taskcluster yet, and I'll deal with any fallout that comes from this PR.

It also looks like this should apply cleanly on top of #125 - so that shouldn't disrupt this PR if it lands first. I'll try to hold off merging anything that would conflict until you finish up here.

bhearsum · 2023-06-01T19:43:56Z

configs/config.opusmt-multilingual-test.yml

+  src: en
+  trg: sw
+  src_three_letter: eng
+  trg_three_letter: swa


Putting this here instead of calling mtdata every time we need it is a nice improvement!

bhearsum · 2023-06-01T19:46:06Z

configs/config.opusmt-test.yml

+
+experiment:
+  name: opusmt
+  src: en


I'll note that the existing test config uses ru -> en. I think this is to ensure that the reverse locale handling in certain pipeline scripts works OK. Probably not crucial - just calling it out.

eu9ene

@TommiNieminen @bhearsum I did another pass at reviewing this. Here are my thoughts:

What I'm not concerned about:

- adding new profiles
- adding new configs
- adding new pipeline steps
- installation scripts
- bicleaner fixes
- backward compatible changes inside the pipeline directory

What I'm concerned about:

- (!) backward incompatible changes inside the pipeline directory. This means we'll have to modify the Taskcluster scripts and retest the training from scratch, and I don't think it's the right time to do that
- modifying existing configs including the training one (ideally, everything related to the OPUS use case should be in separate configs)
- readme
- changed default singularity image
- complexity of the Snakefile (I don't have any good suggestions here except maybe adding a new file for the OPUS use case. Maybe it's fine and we can live with it since we're moving to TC.) - not critical

On one hand, we need this merged to start experimenting with the pre-trained OPUS models and @TommiNieminen did a great job of making this work (thanks again!). On the other hand, I see a risk of breaking our main training procedure due to the complexity and cost of retesting. Especially since we are still testing the integration with Taskcluster, which will serve as our main training scheduler soon.

I believe, to move forward with this, we should address the backward compatibility issues to at least keep our current pipeline that was migrated to Taskcluster working. Even if there are bugs in the Snakefile, that would not be critical. Then we can start experimenting with OPUS models using Snakemake and gradually migrate it to Taskcluster if it proves to be useful.

I'm also not sure whether @TommiNieminen has time to address those concerns. Let me know what you think.

eu9ene · 2023-08-04T22:00:10Z

profiles/local/config.yaml

 cache: false
 reason: true
 config:
  # install dependencies on a local machine
  - deps=true
  # root path to a folder with data, models and logs
-  - root=/data
+  - root=/home/tommin/greennlp/data


please revert values in this config to the original ones, since the idea is to test things on a GPU machine locally

eu9ene · 2023-08-04T22:03:10Z

Makefile

 CONDA_PATH=../mambaforge
 SNAKEMAKE_OUTPUT_CACHE=../cache
-PROFILE=local
+#PROFILE=local


let's keep the default values here as before

eu9ene · 2023-08-04T22:03:16Z

Makefile

@@ -4,12 +4,13 @@
 SHELL=/bin/bash

 ### 1. change these settings or override with env variables
-CONFIG=configs/config.prod.yml
+CONFIG=configs/config.opusmt-multimodel-test.yml


let's keep the default values here as before

eu9ene · 2023-08-04T22:05:02Z

README.md

@@ -1,3 +1,64 @@
+# OPUS-MT integration


Please move the extra docs section related to the OPUS integration to another place. Ideally, we can create a docs folder and keep it there, with a link from the main readme. The readme is already quite long.

eu9ene · 2023-08-04T22:06:09Z

configs/config.prod.yml

@@ -63,55 +63,7 @@ marian-args:
 datasets:
  # parallel training corpus
  train:
-    - opus_ada83/v1


let's keep this config as it was and revert to the original values. This is an example of a real training set

eu9ene · 2023-08-04T22:18:57Z

Snakefile

@@ -13,48 +14,72 @@ min_version("6.6.1")

 ### configuration

-container: 'Singularity.sif'
+containerized: 'Ftt.sif'


revert to the original image

eu9ene · 2023-08-04T22:32:25Z

pipeline/translate/translate-nbest.sh

+    if [[ -f ${opus_vocab} ]]; then
+	vocab=${opus_vocab}
+	else
+	vocab=$3


It looks like the vocab variable will not be set if there are no opus vocabs in the model dir, so this is not backward compatible. Should we set it in the beginning like before and then override it?
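Something along these lines, as a sketch only: `resolve_vocab` is a hypothetical helper, not code from the PR. It starts from the vocab the caller passed (the old behaviour) and overrides it only if an OPUS vocab file really exists, so the variable is always set.

```shell
#!/bin/bash
# Sketch only: resolve_vocab is a hypothetical helper, not code from the PR.

# Print the vocab to use: start from the caller-supplied default and
# override it only when an OPUS-MT vocab file actually exists on disk.
resolve_vocab() {
  local default_vocab="$1"
  local opus_vocab="${2:-}"
  local vocab="${default_vocab}"
  if [[ -n "${opus_vocab}" && -f "${opus_vocab}" ]]; then
    vocab="${opus_vocab}"
  fi
  echo "${vocab}"
}

# Usage idea inside the script (argument position is whatever the script
# already uses for the vocab):
#   vocab="$(resolve_vocab "$2" "${opus_vocab}")"
```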

eu9ene · 2023-08-04T22:32:54Z

pipeline/translate/translate.sh

+    if [[ -f ${opus_vocab} ]]; then
+    	vocab=${opus_vocab}
+    else
+	vocab=$3


same here, doesn't look backward compatible

eu9ene · 2023-08-04T22:49:16Z

pipeline/translate/translate.sh

@@ -11,9 +11,19 @@ test -v MARIAN
 test -v WORKSPACE

 input=$1
-vocab=$2
-models=( "${@:3}" )
+output=$2


can we keep the arguments in the same order to make it backward compatible and not change the task cluster part?
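One possible shape for this, as a sketch under assumptions: the positional arguments stay exactly as before, and the new output path comes from an optional variable. `TRANSLATE_OUTPUT` and `parse_translate_args` are hypothetical names, and the `<input>.out` default is my assumption about where the previous version wrote its output.

```shell
#!/bin/bash
# Sketch only: TRANSLATE_OUTPUT and parse_translate_args are hypothetical
# names, and the "<input>.out" default is an assumption.

# Keep the original positional order (input, vocab, models...) and take the
# new output path from an optional environment variable instead.
parse_translate_args() {
  input="$1"
  vocab="$2"
  models=( "${@:3}" )
  output="${TRANSLATE_OUTPUT:-${input}.out}"
}
```

Existing callers (including the Taskcluster side) would keep working unchanged, and the OPUS rules could set the extra variable when they need a different output location.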

eu9ene · 2023-08-04T22:54:12Z

Snakefile

        conda: "envs/base.yml"
        threads: gpus_num * 2
        resources: gpu=gpus_num
        input:
-            rules.merge_devset.output, model=f'{teacher_base_dir}{{ens}}/{best_model}',
+            rules.merge_devset.output, model=f'{teacher_base_dir}0-{{ens}}/{best_model}',


why 0? I see that it was also added to the finetuned one. It's not clear to me why it was added.

…edited local-container profile to work with current Snakefile setup

eu9ene

I've tested it and it's ready to merge into a separate branch. I'll address my concerns with readme, configs and compatibility with TC afterward.

* Cleanup API: Refactor request on-complete transition (#80) * Handle empty translation requests Fixes https://github.com/browsermt/bergamot-translator/issues/101. ResponseBuilder is called with empty histories to trigger a valid but mostly-empty response. * Control validating the config options via a boolean flag (#116) * Control validating the config options via a boolean flag - parseOptions() function now validates the parsed options based on the validate argument * Minor syntactic fix * JS bindings for loading model and shortlist files as bytes (#117) * Bindings to load model and shortlist files as bytes * Modified wasm test page for byte based loading of files * Updates wasm README for byte loading based usage of TranslationModel * Make wasm test page work with bergamot-models repository - bergamot-models now contains lexical shortlist bin files as well * Better error logging for wasm test page * Update to marian-dev master * Full windows support with ssplit from browsermt, not a fork (#109) * Update marian-dev to the newest mac version * Attempt windows workflow * force workflow rerun * Separate id * Attempt 3 at github action * Marian dev submodule now compiles with apple clang * Updated ssplit version to something more recent * Attempt to fix compile on wasm * Do not compile subproject tests * Fix emscripten compilation on Mac * 99% on the way to windows compile * Try with a different generator * Build release not debug * Revert CMakeLists.txt hacks * Fix sse2 compilation failure * MSVC settings for WIN32 * Add nodefaultlib LIBCMT * Do not compile ssplit.cpp as it contains sys/mman.h * Revert ab56b9aa4f4360b0ab98d5806658d4302f31db1d * Update paths * Set the build type to release if not set previously * Attempt to build release with the windows workflow * Attempt 5 at VS studio release build * Attempt 6 at getting release build on MSVC generator * The windows build is debug at the moment... 
* fix ssplit for ubuntu 16.04 * Fix compilation with clang * Compile on ubuntu16.04 * Explain what is going on * Updated ssplit and workflow * Enabled gemm-precision in wasm test page - This increases the inference speed while providing models as bytes to the translation engine (it wasn't needed while providing models as files) * Updated wasm/README file with instructions for byte loading APIs * WASM Bindings collapse (#87) * Safe transfer of bindings through typedefs * Removing Translation* files and bringing in counterparts * Remove previously commented out code * Removing commented out include * Absorb Translation* documentation Co-authored-by: abhi-agg <66322306+abhi-agg@users.noreply.github.com> * Improve script to patch wasm artifacts and load EN->DE vocabulary in wasm test (#125) * Improved script that patches wasm artifacts to enable wormhole - Made the regex pattern ignore multiple whitespaces b/w words of the matching pattern * Fix for loading EN->DE vocabularies in wasm test page - Loading vocabularies for EN->DE was failing because of the new structure of bergamot-models * Improved wasm scripts and README (#128) * Minor README change - Changed "browsermt" to "mozilla" * Updating ci scripts for the latest upstream changes - The upstream browsermt/bergamot-translator builds the wasm artifacts in top level build folder now * Extension desired changes (#129) * Enable worker file system * Avoid node.js-code in emscripten glue-code * Extension desired changes (#129) * Enable worker file system * Avoid node.js-code in emscripten glue-code * Fix busy loop in windows (#131) * Fix busy loop in windows * Nick wants the while loop gone * Fix continue leftover Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * Making bytearray a commandline switch (#127) * Adding bytearray option * collapse intermediate for bytearray apps * Removing service-cli-bytearray * Removing the bergamot bytearray app * Bumping updates to brt collapsing apps * Reasonable defaults and hard 
check when cmd enabled * Update documentation for flags * Bump brt with MKL check and skip * Bumping BRT with MKL_FOUND instead of USE_MKL * Bumping BRT with no mkl enforce * Bumping BRT with ssse3 output * Let's try disabling OpenBLAS * Trying to disable apple accelerate * Using WASM compatible BLAS can enable intgemm * Adding a CMake -L to see what exactly is the diff * Revert "Let's try disabling OpenBLAS" This reverts commit 9a6b9bc53bf7dec956889f6e0b7047e5388e1b7e. * Revert "Using WASM compatible BLAS can enable intgemm" This reverts commit 936a592e18431c279e6c5952a278d012d18ff295. * Restricting mac tests through tags and on GitHub CI * Using only check-bytearray * Bumping BRT with change of default behaviour * Faithful to source-structure translation (#115) * First draft of faithful translation * Comments explaining pre and post * Comments on response_builder * Updating bergamot-translator-tests with new outputs * Cosmetic changes in response target text construction * Replacing &(x[0]) -> x.data() to avoid illegal indices * Removing nullptr given both branches init pointer with legal values * pre, post -> gap(i) addressing review comments Functions which were pre and post before are subsumed by gap(i), and the algorithm in ResponseBuilder adjusted to fix. `x = nullptr` is back, should be harmless. 
* Updating brt with paragraph outputs * Bumping brt with updated outputs, buffer text at begin as well * Bumping BRT with sync after bytearray collapse merge * Pointing BRT to main after merge Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * Enable vocabs pass as byte arrays (#122) * first attempt to enable vocabs pass as byte arrays * pass vocabs bytes as AlignedMemory * add vocabIndices to avoid double loading * small fix on parameter names and documentation * fix windows build plus tiny update on documentation * update marian-dev submodule * move validate model bytearray in BatchTranslator * small refactors on validateBinaryModel() * switch vocab memories to std::vector<marian::Ptr<AlignedMemory>> * update marian-dev submodule * replace marian::Ptr to std::shared_ptr for vocab memories * add note for vocab memories * Update ssplit submodule, removing absl (#132) * Update ssplit submodule, removing absl * Fix ssplit variables * Update ssplit branch * Fix emscripten compilaiton * Update tests * Minor rename: sentence_ranges -> annotation (#134) * Target master of ssplit-cpp * Remove unused used types TokenRanges, SentenceTokenRanges, UPtr (#137) * Change USE_WASM_COMPATIBLE_SOURCE =OFF by default on native, force on for WASM (#138) * Change WASM_COMPATIBLE_SOURCE=OFF by default The default was WASN_COMPATIBLE_SOURCE=ON COMPILE_WASM=OFF which is a testing configuration, not a sensible default for native or wasm. 
* Always USE_WASM_COMPATIBLE_SOURCE with COMPILE_WASM * Set CMP0077 to fix variable handling * Export "addOnPreMain" function from wasm module - This is required in the extension while using wasm module in a worker environment * Enable Debugging information in wasm module builds - Added "-g2" flag furing linking step * JS bindings for vocabularies as bytes * Updated wasm test page to pass vocabulary files as bytes * Refactoring TranslationModelBindings class - typdef AlignedMemory for code readability - Added documentation for one of the binding function * Avoid packaging vocab files into wasm binary in CI builds - We don't need to package vocab files into wasm binary any more as a sync with upstream enabled passing vocabs as bytes * Updated wasm README to update for passing vocabs as bytes - Updated Using JS APIs section to pass vocabs as bytes * Updated README to remove packaging steps for wasm compilation - We don't need to package model, shortlist or vocab files into wasm binary at build time * Updated CMakeLists.txt to remove packaging steps for wasm compilation - Removed PACKAGE_DIR cmake option - Removed Workerfs, FORCE_FILESYSTEM=1 in wasm builds -- File system support is not needed any more (since model, shortlist and vocabs are being passed as bytes now) * Bundle AlignedMemory inputs with MemoryBundle (#147) * Enabling ccache on github builds for Ubuntu (#95) * CI Changes to add tiny regression tests * Adding an inspect cache step * Removing ccache, pursue in another * Incorporating Nick's changes through submodule merge * Submodule now points to master * Restoring ccache enabled workflow file * Restoring ccache enabled CMakeLists * cache -> ccache typo fix * Moving CCACHE setup to GitHub runner file * Find also uses CCACHE dir * Updating CMakeLists not to override env * Cache compiler binary's contents * Changing a few names to trigger new build; Testing cache looks fun * USE_CCACHE=on, -L for inspection * Adding a ccache_cmd, but will only use in next 
commit * Using ccache_cmd * Removing " * Adding compiler hash script * Bunch of absolute paths * GITHUB_WORKSPACE typo * Nah, I'll keep -L and trigger another build * Trying something with compiler hash on cache key backup as well * builtin, bash it seems * Empty commit #1 * Move ccache stats to after compile * Reshuffling ccache vars * No comments * Updates to Github output set syntax * Empty Commit 1 * Empty Commit 2 * Empty commit 3 * /bin/bash -> bash; ccache_cmd for consistency * Adding ccache -s before and after build * Adding comments to compiler-hash script * Let's build cached and non-cached variants together for comparison * Fixing quotes, /bin/bash -> bash * Minor var/env adjustment * Adding ccache -z before the job * Reverting CMakeLists.txt without CCACHE * Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule * 5G -> 1G cache size * 1G -> 2G; Hyperparameter tuning * Refactor vocabs in Service (#143) Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * Rewrite annotation class to remove corner cases (#135) * Added cmake file to compute version information - Reads BERGAMOT_VERSION file for generating various strings for versioning * Import GetVersionFromFile cmake file in root level CMakeLists.txt * Modified wasm cmake file to include version information in built artifacts * Generate project version file for native builds - The header file exposes a function that provides version information for native binaries * Bumped version to 0.3.0 - This brings the version info in sync with the various releases of extension * Corrected the version number - To be in sync with versioning in mozilla/bergamot-translator repo * Marian submodule with unified loading (#157) * Collapsing TranslationRequest -> ResponseOptions (#139) * Rewriting batching for threadsafety (#155) This does make the batcher a critical section across job submission and cleaving though. 
If that becomes a problem, we should go back to incoming and outgoing queues with a batcher thread. Also removes blocking mode from native compiles. Note that translateMultiple no longer guarantees great batching. Guess we could lease the mutex from ThreadsafeBatcher and create a session. There is the risk that one sentence comes in at a time and each thread grabs one sentence at a time instead of better batching. Not sure what to do about that other than some sort of Nagle algorithm. Due to non-deterministic batching, even with one thread, the regression tests will go haywire. * Use binary lexical shortlist in documentation (#152) * Use binary lexical shortlist in documentation * MKL/AppleAccelerate note Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> Co-authored-by: Jerin Philip <jphilip@ed.ac.uk> * initialise MemoryBundle members (#167) * Adding clang-format and updating existing sources to adhere (#151) * Adding a first version of clang-format * Adding run-clang-format.py * Adding coding styles to workflow * Fix indentation on coding-styles workflow * run-clang-format.'py' * -style -> --style in python * Updating ColumnLimit: 120 * Format update with clang-format * Revert "Format update with clang-format" This reverts commit 5340b19eae8fcc91a2a79205e0b3dd65ad61ad4c. 
* Apply update after sync * Removing a few empty lines * Removing one more empty line * Removing empty in workflow file * Updating README with coding style instructions * clang-format-* provided in this repository doc update Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * Pin emsdk version to the same one used in Circle CI (#165) * GitHub action to push browsermt/main branch to mozilla/bergamot-translator every hour (#160) * Create push-browsermt-main-to-mozilla-main.yml * Update .github/workflows/push-browsermt-main-to-mozilla-main.yml Co-authored-by: Graeme <graemenail@gmail.com> * Tweaks * Fix yaml syntax * Parametrized the workflow based on @jerinphilip's example Co-authored-by: Graeme <graemenail@gmail.com> * Update tests * Bumping BRT for hotfixes (#169) * Bumping BRT for hotfixes * updating brt to point to main * Remove O(N^2) reallocation (#171) * Adding documentation action (#168) Adds a GitHub workflow that builds documentation from sources through doxygen through sphinx on push to the main branch or on push of any semantic version tags. The built documentation is deployed at https://github.com/browsermt/docs@gh-pages, which is rendered at https://browser.mt/docs/<suffix>, where <suffix> is 'main' or a tag vM.m.p corresponding to a semantic version. On pull request artifacts are uploaded for reviewers to inspect if need be. 
* Fix failures when loading text shortlist (#154) * Updating marian dev RelwithDebInfo -> Release (#178) * Updating marian dev RelwithDebInfo -> Release * Updating submodule to point to master * Single executable (#175) * Collapsing executables * Adding new test executable * Deleting old executable sources * Updating brt to operate with modes * cli-framework -> cli * Updating workflows to check for bergamot instead of bergamot-translator-app * Adding documentation * Making fn pure virtual * Shuffling apps into app namespace, alongside class documentation * Include app folder in documentation * BRT update service-cli -> native * parser.h: service-cli -> native * Updates to marian-integration.md * Cleanup: Remove templates, interface proper * change 4 to 2 cores for build instructions * service-cli -> native * Commenting the string constructor explanation * Not doing halfway interface / inheritance * Nick hates state, let's try this one * Revert "Nick hates state, let's try this one" This reverts commit e56db9f474b1906e62af0b06afb7c7d9e08ea9c8. * class -> struct before trying std::function stuff * oop -> functional? * Hints on what is happening * app::ftable -> app::REGISTRY * We have if-else and functions now. And we won't have test apps. * Doc linking to usage examples in brt * Remove unordered_map * Documentation updates * Fix warning * Deploy generated documentation only if browsermt (#179) * Including WASM documentation in sphinx build toc (#176) * Updating marian-dev: intgemm with env variable matmul switches (#187) * Remove addSentenceWithPriority (#186) * Update native (ubuntu, mac) workflows with ccache (#181) * Matrix is now more organized, Ubuntu 20.04-gcc9.3, Ubuntu-18.04-gcc7.5 is added. * ccache is extended to MacOS, and brings down CI run times to <5m when ccache works. * The compiler hash scripts are gone, ccache already covers most ground by default. The shell script is unnecessary. 
Cache works by preprocessor mode output of running the compiler with -E, which includes the necessary information. ccache-docs:How the cache works. * BRT if failed prints the final 20 lines of the test*.log to inspect what's going wrong without having to artifact download. * Pull request on any branch triggers workflow. * Push on main and ci-sandbox triggers workflow. * Replace resize with possible negative range with pop_back() (#189) * Consistent EMSDK version and parallel make jobs in README and github actions - Set EMSDK version to 2.0.9 to make it consistent everywhere in repo - Set parallel make jobs to 2 * CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193) * Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string * marian-dev cmake fix * Generate project.h in binary dir * We don't want people asking about extra spaces * Fixing if syntax with YAML var subsitution (#188) * Generating cmake configured project version (.js) file in build folder (#194) - Earlier this file was being generated in folder containing actual sources - Fixes https://github.com/browsermt/bergamot-translator/issues/161 * Partial test-apps and tolerance in evaluations (#184) * Partial test applications Previously service-cli was used to generate output and accomplish regression testing for all of: (1) translated-text (2) alignment tokens + scores (3) quality scores (4) indirectly annotation and tokenizations. The --mode native now only outputs a faithful to source translated text of the input source on stdin. Test apps are separated into testing only individual functionalities. This can help in independently testing ssplit-cpp, quality-scores for the quality estimation implementation etc. Separating numbers and text have the advantage of being able to compare one with tolerance using BLEU (text) and some allowed error-rates (numbers). 
* Removing #mac tag * Moving test apps to src/tests * Tests are always on for CI Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES. * Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE * Workaround for now; CMakeLists.txt horrors are starting to bite * BRT: use bergamot-test instead of bergamot now * This should fix issues: CMakeLists.txt has so many paths * Casing to camelCase and removing legacyServiceCli * removing leftover service-cli declaration, some doc updates * #pragma once is starting to look easier * All the more reasons to do #pragma once * Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID * BRT: Use --gemm-highest-arch instead of python script * Adding intgemm resolve here, where always(?) have intgemm on? * intgemm-resolve in default binary directory * BRT: Update to use intgemm-resolve * marian-dev: Reset to without --gemm-highest-precision Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com> * Removing alignments and quality-scores test-code (#196) * Removing alignments and quality-scores test-code * BRT: Update to main * Refactor wasm bindings to use consistent interface names as in native (#195) * Refactored wasm bindings code - Replaced TranslationModel, TranslationRequest and TranslationResult with Service, ResponseOptions and Response - Corresponding documentation changes - Names of the bindings files changed - Moved Vector<Response> definition in Response specific bindings file * Account for EOS in both source and target annotations (#190) * Load sentence-splitter (non-breaking prefixes) from ByteArray Service now allows loading Sentence-Splitter (non-breaking prefix file) from ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist), where first the ByteArray is checked if empty, if not fall back to loading from file-path. 
Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. Bonus refactoring to remove an extra layer that existed for no reason. * maxLengthBreak_ -> wrapStep bugfix (#200) * Change ResponseBuilder to accept callback instead of future (#142) * Change ResponseBuilder to accept callback Breaks things everywhere, now we follow the compiler to fix and convert the std::future -> callback. * More std::future -> callback * std::future out of service.{h,cpp} * compile is working, so is callback * Some reshuffling of args * Fixing merge error * Fixing signature conflicts out of merge * Fixing that test duct-taping future * Minor adjustment to get that future back * Add documentation for the new callback function * Applying clang-format after update * Using default responseOptions * Remove future references from documentation * translateMultiple only for WASM (#177) * BRT: update to main; fresh-failures hopefully * Converting test translateFromStdin to use callback * BRT: Add fresh #native and #wasm tags * future from promise, fix error * Adding #native to GitHub CI Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * Added public methods in Response class to return sentences - Refactored ByteRange struct and moved it to definition.h * JS bindings to return sentence byte ranges * Wasm: Enabled sentence byte ranges in the wasm test page - Use JS bindings to print all sentences individually on console * Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208) A cmake change has caused vcpkg to fail without much error message, which is causing windows workflow runs to fail. Details in the following link: * https://github.com/microsoft/vcpkg/issues/18718 To fix, we're going with a version bump in vcpkg. 
Seeing that run-vcpkg also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4 Playing with fire: vcpkg master commit * Added build instructions to run on other browsers - Disabled compiling with wormhole which is Firefox specific feature * Add a clang-tidy run (#214) Adds a clang-tidy run in addition to the existing clang-format checks. The clang-tidy checks are not enforced, but is potentially useful to point to during review. * Wasm test page using web workers now (#218) * Updated marian submodule to latest commit of master * Wasm builds without SharedArrayBuffer * Circle CI wasm artifacts for non-wormhole builds * BRT: Update sacrebleu to get tests back working (#217) Co-authored-by: Nikolay Bogoychev <nheart@gmail.com> * QualityEstimation: Preliminary Implementation (#197) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). 
An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com> Co-authored-by: Jerin Philip <jerinphilip@live.in> * Multiple TranslationModels Implementation (#210) For outbound translation, we require having multiple models in the inventory at the same time and abstracting the "how-to-translate" using a model out. Reorganization: TranslationModel + Service. The new entity which contains everything required to translate in one direction is `TranslationModel`. The how-to-translate blocking single-threaded mode of operation or async multi-threaded mode of operation is decoupled as `BlockingService` and `AsyncService`. There is a new regression-test using multiple models in conjunction added, also serving as a demonstration for using multiple models in Outbound Translation. WASM: WebAssembly due to the inability to use threads uses `BlockingService. Bindings are provided with a new API to work with a Service, and multiple TranslationModels which the client (JS extension) can inventory and maintain. 
Ownership of a given `TranslationModel` is shared while translations using the model are active in the internal mechanism. Config-Parsing: So far bergamot-translator has been hijacking marian's config-parsing mechanisms. However, in order to support multiple models, it has become impractical to continue this approach and a new config-parsing that is bergamot specific is provisioned for command-line applications constituting tests. The original marian config-parsing tooling is only associated with a subset of `TranslationModel` now. The new config-parsing for the library manages workers and other common options (tentatively). There is a known issue of: Inefficient placing of workspaces, leading to more memory usage than what's necessary. This is to be fixed trickling down from marian-dev in a later pull request. This PR also brings in BRT changes which fix speed-tests that were broken and also fixes some QE outputs which were different due to not using shortlist. * Adapted wasm test page for new Service interface (#224) - The new interface now supports running multiple TranslationModels * Wasm test page UI for translating b/w non-English language pairs (#231) * Updated Wasm test page UI for translating b/w non-English language pairs * Both "from" and "to" language dropdowns now allow non-English languages * Import matrix-multiply from a separate wasm module (#232) * Updated marian-dev submodule * Import wasm gemm from a separate wasm module - The fallback implementation of gemm is currently being imported dynamically for wasm target * Updated CI scripts and README to import GEMM from a separate wasm module * Setting model config to int8shiftAlphaAll in wasm test page * JS bindings for Quality Estimation (#239) * Quality Score bindings complete * Updated wasm test page to test the bindings - Word and sentence scores can be seen in browser console * Cache for translations (#227) Sets a cache to operate for each sentence that a TranslationModel process caching the 
corresponding marian::History for a {TranslationModel::Id, marian::Words} key. Cache is thus shared across multiple TranslationModels bound to the lifetime of a Service. Cache gracefully downgrades in the case of WebAssembly. * Set PR to any branch to trigger workflows (#230) * [ssplit-cpp] Enable position independent library when compiled from sources (#240) * EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243) * Update config "skip-cost" to enable log probabilities for QE scores (#247) - Updated wasm test page * Recover logging (#226) * Deprecate hardAlignment in favour of softAlignment (#250) * Updated marian submodule (#256) * Update ssplit cpp, pcre2 source compile to fix broken builds (#258) * Update ssplit cpp, pcre2 source compile to fix tests * Syncing with browsermt/ssplit-cpp * Removing accidental binary inclusion * Removing brt accidental update by git add -u * Fix windows workflow, vcpkg is broken use our cmake route * [ssplit-cpp] Try searching different library names for Windows * Fixes windows workflow for PCRE2 (#260) * Fix badge to point to this repo instead mozilla's (#261) * Make script run from any directory (#262) * Make script run from any directory * Import optimized gemm implementation (when available) for wasm target (#265) * Enable importing optimized gemm module for wasm - Updated emscripten generated JS code to -- import and use the optimized gemm module when available, otherwise use fallback gemm implementation * Added logging for gemm implementation being used for wasm target * HTML input (#253) Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl> Co-authored-by: Abhishek Aggarwal <aaggarwal@mozilla.com> * HTML handling improvements (#266) * Fix out-of-bounds error when determining alignment for whole word If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds. 
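The out-of-bounds case described above (scanning left from a continuation token at offset 0) can be sketched as follows. This is an illustration only, assuming that a token continues the previous word when it does not start with a space; `is_continuation` and `word_start` are hypothetical names, not functions from the code base.

```python
def is_continuation(token):
    # Assumption: a token that does not begin with a space continues the previous word.
    return not token.startswith(" ")


def word_start(tokens, index):
    """Walk left from `index` to the first token of the word that contains it.

    The `start > 0` guard is the point of the sketch: without it, a continuation
    token at offset 0 would push the index below zero (or wrap around to a huge
    value when the index is unsigned, as in C++).
    """
    start = index
    while start > 0 and is_continuation(tokens[start]):
        start -= 1
    return start


tokens = ["Hel", "lo", " wor", "ld"]
print(word_start(tokens, 1), word_start(tokens, 3))
```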
* Don't segfault if alignment info is not available When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token. * Partial fix for handling empty elements This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side. * Fix formatting * Updated marian-dev submodule * Updated configuration for html text translation to work in wasm test page (#269) * Updated translator configuration in wasm test page - Added alignment: soft * Set ResponseOptions::alignment to "true" - Had to be set for html text translation to work * More robust logic to import wasm gemm (#276) - Import optimized gemm implementation only if all the necessary functions are provided by it, othewise use the fallback gemm * Constrain mistune to fix docs CI (#278) * Additional logs in JS translation worker (#277) - Print source text received in the response - Print no. of block elements in the input * Proper arch setting on win32 (#275) * Proper arch detection on win32 * Whoops * Remove value length limit from HTML parser & interpolated alignments (#274) * Remove InterpolateAlignment And some code improvements * Replace the fixed value buffer with a std::string backing * Fix tests that had no alignment info These depended on the linear interpolation that I removed * Remove arbitrary limits on tag and attribute names This might also fix a bug caused by the eager lower casing of tag names, which could break <![CDATA , <style> and <script> * Remove equals() in favour of operator==() I trust the compiler can come up with better optimisations than I can. 
* Expose std::strings instead of their data Should save us some std::strlen() calls * Add & remove headers and no-longer-defined functions from header files * Remove all string buffers from xh_scanner It now directly refers to either the input stream or constant strings * Replace custom string_view with even lighter struct that's only used internally To the outside world we just expose std::string_view * Remove __builtin_sub_overflow for MSVC * ABORT if trying to restore HTML when no alignment info is available * Add test cases specifically for xh_scanner Both good for testing regression, and as a little example/reference for what behaviour to expect from it. * Add --html option to bergamot for tests This should make it easier to have some integration tests for HTML input * Add test and fix for empty inputs failing due to alignment check Co-authored-by: Jerin Philip <jerinphilip@live.in> * Disabled importing optimized gemm module (#282) - Until the optimized gemm module stops requiring Shared Array Buffer, we can't really use it in Firefox * Adding circle ci job to push the wasm artifacts to github releases (#280) * Adding circle ci job to push the wasm artifacts to github releases. * Updated config.yml * Increase HTML test coverage (#279) * Fix bug in HasAlignments check When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it. * Add more unit tests for xh_scanner Trying to increase that code coverage to 100% * Add test for whitespaces around attributes * Make accessing value(), attr_name() and tag_name() at the wrong time safer * Fix bug in <style> and <script> parsing The end tag was never found * Fix parsing of mix of valueless and quoteless attributes * Sync list of void tags with Firefox' implementation of outerHTML and innerHTML Also lets use their name for it: IsVoidTag instead of IsEmptyElement. Empty was a bit ambiguous. 
* Bring back support for processing instructions support in xh_scanner I noticed in https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 that these can be produced by innerHTML under some circumstances. * More permanent link * Use CamelCase for the internal functions I added * Rename *_PI to *_PROCESSING_INSTRUCTION Your IDE will do the typing for you anyway * Match symbol naming of the rest of code base CapitalCase for classes, camelCase for functions, snake_case for variables still. * Missed one 😴 * Change xhscanner's variable case also to camelCase * Partially fix case variables in html.cpp * Better command-line with isolation for both Services and co-located defaults and parsing (#252) * CLI Rework * Consolidate common tests, template specialize CLI * Remove remnant cache stuff * [BRT]: Run BRT with new cli * Formalizing bridge * Removing stuff from parsing and moving to TestSuite * Template includes, everything consolidating at tests * Inlining readFromStdin * Removing unnecessary headers * Checking in template implementation which was missing * Sane defaults, some catches at BRT * BRT: Install fixes * Updating marian-dev to point to main * Removing the enum indirection, using strings at one place, directly * Fix typo; * [BRT] test blocking service via native * Conservative defaults for workers and cache-mutex buckets in AsyncService * Create proper barriers for cmdline app * Build failure fixes * Moving common, common-impl to a familiar structure * Binary reorganization: async, blocking, wasm - async tests AsyncService - blocking tests BlockingService - wasm arranges tests for things that are Mozilla requirements. eg: - bytearray - multiple sentences in same translate request workflow. 
* [brt] updates to adapt to cli rework * [brt] updates to adapt to cli rework, all working * Empty commit, sync brt online and run GitHub CI * Switch for parser to have multiple mode or not * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Removing remnant faithful translation test from blocking/ * HTML transfer empty elements (#283) * Fix test case This should now be implemented * Remove FilterEmpty This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack. * Insert skipped empty source spans into target HTML Also refactor variable names to better match their contents and be more consistent with each other. This implementation passes all test cases, finally! * Fix remaining style changes * Move HTML formatting to its own section That code had become exact copies in three different places * CI: Circle CI config script update (#287) - Robust artifact presence check - Variable name refactoring - Storing only those artifacts that are required - Remove commit sha from the names of the Github Releases - Use BERGAMOT_VERSION file contents for Git Tag names * GitHub CI: Update YAML to run all tests on marian-full (#292) Previously there were #native tags and #wasm tags separating the two. There is now a clear separation between async, blocking and wasm. 
* HTML basic integration tests (#291) * Fix typo in BRT args on CI runs (#294) * Turn logging off by default, allow turning on via config/cmdline (#295) * Turn logging off by default, allow turning on via config/cmdline * No need to store config in member variable if things are decided at construction time * cache: threadsafety-fixes; optional stats collection (#245) * Make stats hits misses atomic to guard when mutex has multiple buckets * Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable * -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional * Make stats() call without enabling build fatal abort * Have alignments placed if HTML is on (#296) * HTML transfer script/style/etc elements (#285) * CI guaranteed example documentation (#300) * Convert marian-integration markdown to rst * Convert native run into a script, include in rst * Check with CI that the native running example works without fail * Defer model loading to parallel worker thread (#303) * Treat most HTML elements as word-breaking (#286) * First class pivot translation capability (#236) Translates a text from source-language to target-language through a pivot-language. Effectively runs models in series, while having the following additional benefits compared to when `Service::translate(...)` would be used repeatedly. 1. Consistency in sentences between source and target. Consistent creation of the alignment matrix for use in downstream tasks like tag-translation. 2. Efficient sentence-splitting (does not sentence-split twice, creating inconsistencies). 3. The `Response` generated can be used as if it were coming through `translate(...)`, eliminating any need for additional code for clients in JS or python or C++. `AsyncService::pivot(...)` is provisioned for C++ multi-threaded setting and `BlockingService::pivotMultiple(...)` provisioned for blocking use-case targeted at WebAssembly. 
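To make the "consistent alignment matrix" point concrete, here is a minimal sketch of how two soft alignments could be combined when translating through a pivot. It assumes each alignment is a row-stochastic matrix (one probability distribution per row) and that the combination is a plain matrix product; the commit message does not state that the library does exactly this, so treat it as an assumption.

```python
def compose_alignments(pivot_given_target, source_given_pivot):
    """Multiply a (target x pivot) matrix by a (pivot x source) matrix.

    Each row of the result is a distribution over source tokens for one
    target token, provided both inputs have rows that sum to 1.
    """
    n_source = len(source_given_pivot[0])
    combined = []
    for target_row in pivot_given_target:
        row = [0.0] * n_source
        for p, weight in enumerate(target_row):
            for s in range(n_source):
                row[s] += weight * source_given_pivot[p][s]
        combined.append(row)
    return combined


def rows_are_distributions(matrix, tol=1e-9):
    """Check that every row is non-negative and sums to 1 within `tol`."""
    return all(abs(sum(row) - 1.0) <= tol and min(row) >= 0.0 for row in matrix)


# Toy data: 2 source tokens, 3 pivot tokens, 2 target tokens.
source_given_pivot = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
pivot_given_target = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

combined = compose_alignments(pivot_given_target, source_given_pivot)
print(rows_are_distributions(combined))
```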
# [BRT]: Test additions, accompanying fixes For `AsyncService` for a test-case involving of en->es, es->en (same vocabulary, another one might be more coverage but is too much work). 1. Asserts the Alignment generated after pivoting is a probability distribution over source tokens given target. 2. Outputs the sentences going from en->en, which should stay consistent over continuous development to ensure nothing breaks. 3. An accuracy minimum of 70% of token matches from source to target calibrated on the standard bergamot input text is additionally present, ensuring that the English tokens at start and end match exactly. # HTML Pipeline This PR reworks the HTML translation pipeline to be outside response-construction via callbacks. * Accept XHTML-style self-closing void tags (#305) Allow the self-closing `/>` end for void tags. For non-void tags these were already "allowed" due to how the HTML parser works, but for elements where they actually occur, like `<br/>`, they caused a parse error. Support for them was not implemented since we only expect valid HTML5, e.g. the output of Firefox' Element.innerHTML. Use case: TranslateLocally uses Qt's HTML representation of rich text. That HTML uses self-closing tags like `<meta .../>` and `<br/>`. Implementing a string replace operation that would only match these elements without parsing HTML is tricky. Fixing it in bergamot-translator is not. Implementation: Currently `<img>` is marked as a void tag (an element which cannot have children or text, and therefore treated differently. Since void tags normally have no close tag, they are treated as immediately closed. The HTML parser we use reads `<img/>` as `<img></img>` which thus causes a problem since now we close an element that was never open, to begin with. This fix ignores the `TT_TAG_END` token from the parser when the tag name is that of a void tag. 
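The void-tag fix described above can be sketched as a small open-element tracker. This is a simplified illustration, not the parser in bergamot-translator: the token tuples, the `TAG_START`/`TEXT` names and the subset of void tags are assumptions; only the idea of ignoring the end-tag token for void tags comes from the commit message.

```python
# Illustrative subset of void elements; the real list is longer.
VOID_TAGS = {"br", "img", "meta", "hr", "input", "wbr"}


def remaining_open_elements(tokens):
    """Track open elements, ignoring end-tag tokens that belong to void tags."""
    stack = []
    for kind, name in tokens:
        if kind == "TAG_START":
            # Void tags are treated as opened and immediately closed.
            if name not in VOID_TAGS:
                stack.append(name)
        elif kind == "TAG_END":
            if name in VOID_TAGS:
                # `<br/>` is reported like `<br></br>`; skip the stray close.
                continue
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
    return stack


# Hypothetical token stream for `<p>a<br/>b</p>`.
tokens = [
    ("TAG_START", "p"),
    ("TEXT", "a"),
    ("TAG_START", "br"),
    ("TAG_END", "br"),
    ("TEXT", "b"),
    ("TAG_END", "p"),
]
print(remaining_open_elements(tokens))
```

Without the `continue` branch, the `TAG_END` for `br` would hit the mismatch check and raise, which mirrors the parse error the commit describes.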
* Streamline memory-bundle loads (#307) Provides an additional constructor which takes care of the bundle loading inside the boundary of the source here, when a configuration file is supplied from a client like translateLocally or python bindings. Once the config file is read, we have access to the information required to construct the MemoryBundle. - The command-line application supplied from here, app/bergamot is configured to use the fast-load path now. - Changes to binary-loading additionally revealed a bug in the example-run script used in docs and tied to CI and the fix is included. - Shortlist is made optional in the memory bundle, making changes to getModelMemoryFromConfig. Fixes #304. Fixes #306. See also: XapaJIaMnu/translateLocally#82. * Add API to trigger fast shutdown of AsyncService (#297) Add a way to AsyncService to shut down without finishing the full queue through `AsyncService::clear()`. The default behaviour is that `AsyncService::~AsyncService()` will wait for any pending translation requests to finish. One can call `AsyncService::clear()` before the calls to the destructor to ensure there is no work for the service to finish before the workers can stop and join. Marian batches that are already in progress will not stop. We are not trying to cause interrupts in threads or something that complex. However, these single batches often do not take that long to complete. Changes: - Add clear() to AsyncService - Add clear() to BatchingPool - Documentation See also: XapaJIaMnu/translateLocally#80 * Speed up Windows CI with ccache (#308) Use https://github.com/cristianadam/ccache/releases/ to speed up windows compilation. Remove /Zi as it is unsupported by ccache at the moment. This is a debug flag that was removed in upstream marian-dev https://github.com/browsermt/marian-dev/pull/43. However, the bergamot CMakeLists.txt which was originally taken from marian maintained this under MSCV. 
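The shutdown behaviour described for `AsyncService::clear()` in #297 above (pending requests are dropped, a batch already in progress still finishes) can be modelled with a toy queue. `TinyService` is a hypothetical, single-threaded stand-in written for this illustration; it does not reproduce the real class or its threading.

```python
from collections import deque


class TinyService:
    """Toy stand-in for a queueing translation service (not the real AsyncService)."""

    def __init__(self):
        self.pending = deque()
        self.in_progress = None
        self.finished = []

    def submit(self, request):
        self.pending.append(request)

    def start_next(self):
        """Move the oldest pending request into the in-progress slot."""
        if self.pending:
            self.in_progress = self.pending.popleft()

    def clear(self):
        """Drop everything that has not started yet; in-progress work is untouched."""
        dropped = len(self.pending)
        self.pending.clear()
        return dropped

    def shutdown(self):
        """Default behaviour: finish in-progress work, then drain whatever is pending."""
        while self.in_progress is not None or self.pending:
            if self.in_progress is None:
                self.start_next()
            self.finished.append(self.in_progress)
            self.in_progress = None


service = TinyService()
for name in ("a", "b", "c"):
    service.submit(name)
service.start_next()          # "a" is now in progress
dropped = service.clear()     # "b" and "c" are discarded
service.shutdown()
print(dropped, service.finished)
```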
* Remove unused compiler hash script (#309) * Batteries included python package (#310) Imports python bindings and associated sources incubated in https://github.com/jerinphilip/lemonade to bergamot-translator. Adds a pybind11 dependency for python bindings. Following the import, the python build is integrated into the existing CMake based build system here. There is a command-line application provided through python which provides the ability to fetch and prepare models from model-repositories (like browsermt/students or OPUS). Wheels built for a few common operating systems are provided via GitHub releases through automated actions configured to run at tagged semantic versions and pushes to main. The documentation for python is also integrated into our existing documentation setup. Previous documentation GitHub action is now configured to run behind python builds in Ubuntu 18.04 Python3.7, in order to pick up the packaged as a wheel bergamot module and the sphinx documentation using the python module. Formatting checks of black, isort with profile black and a pytype type checker is configured for the python component residing in this repository. * BRT: Update to fix QE download failures (#321) * Fix HTML with pivoting (#323) Previously BlockingService pivoting missed preproc and postproc for HTML leading to issues in WebAssembly API. This change adds fixes for the same, along with test coverage for the functionality over both async and blocking services. * Remove obsolete workflow transferring source across forks (#326) * Wasm/JS: Pivot translation API JS binding and test page update (#327) * emscripten: ccache and artefact upload (#325) Enables ccache for emscripten. The configuration uses pyiodide for a reference (https://github.com/pyodide/pyodide/pull/1805). Two workflows to run on macOS and Ubuntu, reduced to one on Ubuntu. As emscripten and the target is cross-platform, also macOS runners being limited - it makes sense to have this removed. 
Artefact upload is enabled in preparation for a scheduled release action that will upload the bergamot*.wasm and bergamot*.js files for consumption. * Consolidate release artefacts (#329) Merges the previous wasm.yml into python.yml; the combined file is renamed build.yml. python.yml already had version and pre-release jobs. These jobs downloaded artefacts from previously run jobs (python wheel builds). The newly attached emscripten build now uploads a WebAssembly binary and a JavaScript file as artefacts, which are fed into the release and pre-release jobs in addition to the existing python builds. * Increment version to v0.4.0 (#328) * Make default throw exception on abort for python (#333) This also allows conversion of existing aborts into runtime errors in python, providing informative messages to the user via pybind11's existing tooling. * Revert "Make default throw exception on abort for python (#333)" This reverts commit 97bd6e36dbdec3519133d91289d7fd31816cb09a. As discussed, we need messages for debugging in -fno-exceptions. * Revert "Revert "Make default throw exception on abort for python (#333)"" This reverts commit 62ff781ed4ea642912878145beaf3157123520fe. Sorry, I should have realized Jerin was only amending python and therefore this didn't break WASM. Apologies to Jerin on this. * JS/WASM: Re-enable importing optimized gemm module for (#336) - Re-enabled the code that imports the optimized gemm module for wasm when available * Print errors by default in WASM build (#343) * Remove BadHTML exception in favour of ABORT macro `ABORT()` gives us readable error messages, even when exception support is disabled. 
* Control marian exception global setting in tests through fixture * WASM: construct BlockingService with critical logging by default This log level is only used by ABORT() See also: - mozilla/firefox-translations#65, - mozilla/firefox-translations#68 - mozilla/firefox-translations#70 - mozilla/firefox-translations#56 * Add ability to load `.npz` models (#342) Changes `ABORT` on non `.bin` model to an additional check for a `.npz` extension. If `.bin`, the fast load path is activated by returning `AlignedMemory`. Otherwise, the return of empty `AlignedMemory` causes fallback to filesystem-based loads. BRT: A test that checks if translation using `.npz` is approximately similar to that of default CLI translation is checked in to ensure stability going ahead. Previously, we only supported `.bin` models' loading via a fast mmap path. While we had the underlying capability to load non `.bin` models, this was not exposed, encouraging fast loads. Loading `.npz` models are helpful for quick debugging and broader coverage of models available, which will enhance user experience at translateLocally and python bindings. Fixes #341. See also: XapaJIaMnu/translateLocally#89 * Allow per-input options (#346) Changes signature of BlockingService::{translate,pivot}Multiple functions to take per input options, so a mix of HTML and plaintext can be sent from the extension. Templating over testing is adjusted to allow for continuous evaluations by modifying the test code. Updates WebAssembly bindings to reflect the change in signature and the javascript test-page to work with the new bindings. This change lacks an accompanying test specific to the mixed HTML and plaintext inputs. 
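The per-input options change described above can be sketched as pairing each input with its own options object. This is an illustration under stated assumptions: `translate_multiple`, `fake_translate` and the `html` option key are hypothetical names, and the placeholder only labels the mode instead of translating anything.

```python
def fake_translate(text, options):
    """Placeholder for a real translation call; it only tags the processing mode."""
    mode = "html" if options.get("html", False) else "text"
    return f"[{mode}] {text}"


def translate_multiple(texts, options, translate_one=fake_translate):
    """Translate each input with its own options, so HTML and plaintext can be mixed."""
    if len(texts) != len(options):
        raise ValueError("need exactly one options object per input")
    return [translate_one(text, opts) for text, opts in zip(texts, options)]


results = translate_multiple(
    ["<p>Hello</p>", "Hello"],
    [{"html": True}, {"html": False}],
)
print(results)
```

Requiring the two lists to have equal length is a design choice of this sketch; it makes a mismatched batch fail early instead of silently reusing options.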
Fixes: #345 See also: mozilla/firefox-translations#94 Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl> * JS/WASM: Passing ResponseOptions for every item for translation batch api (#348) - Now translate() JS API accepts ResponseOptions per batch item - Fixed the logic to create vector<ResponseOption> * Update aligned vector following intgemm 1b8cbd6f611c21011325cfe0312940f0635dea33 (#334) Fixes memory leak ifdef for -fno-exceptions including clang-cl Move spacing back to intgemm upstream Co-authored-by: Jerin Philip <jerin.philip@research.iiit.ac.in> * Improve cache (#347) Hide `cache-mutex-buckets` from the user. Now configured to be equal to number of workers. Python bindings which had exposed these are modified to reflect the API change. `std::optional` enabled on cache, constructed only if enabled. Pointers used are replaced with an equivalent `std::optional.` Fixes: #317 * JS: Refactoring wasm test page (#354) * Free all the objects properly that were constructed for translation api * Refactored pivot detection mechanism * Create github release via CircleCI only for mozilla fork (#349) * Create github release via circleci only for mozilla fork - The extension uses mozilla fork for translator artifacts -- Hence create github release via circleci only when running in mozilla fork * Small refactoring in ci script * Bump version to 0.4.1 (#356) * Improve handling HTML special cases (#312) - Prefer spreading markup over a full word. - Ignore certain tags that are unlikely to be supposed to be translated, such as `<code>` and `<samp>`. - Never treat `<wbr>` as a space. - Allow for inconsistent cases in tag names. - Fix bug where void elements were inserted multiple times. - Better handling of whitespace around punctuation. - Ignore parsing `<noscript>` to be compatible with Firefox. - Improvements to documentation and readability of `HTML` and `Scanner` classes. 
Fixes: #313, #339
* Simplify cache config and bind for use in JS (#359)
  Deprecates cacheEnabled parameter to be replaced with cacheSize=0. Python bindings, Documentation in comments and tests updated to reflect this change.
  Exposes the fields corresponding to cache via embind as a value object. The equivalent object-based syntax in worker.js allows propagation from JS.
  Fixes: #351
  See also: mozilla/firefox-translations#96
* Embed quality-scores as HTML tag attributes (#358)
  Quality scores for HTML translation exposed as <font x-bergamot-sentence-score=""> and <font x-bergamot-word-score=""> tags in the HTML output.
  While this increases the size of the HTML returned, the resulting rendered HTML can easily be styled to show the scores. With Javascript or CSS, developers can easily have some interface based on these extra attributes.
  Also includes updates to the test page to show a proof-of-concept demonstration.
  Fixes: #355
* Enable dependabot to automate updating dependencies (#365)
  Following marian-nmt/marian-dev.
* Use right range and threshold for showing "bad" words/sentences (#370)
  * Use ln(0.5) as the threshold
  * Use right range for showing "bad" words/sentences
* Bump version to 0.4.2 (#371)
* Bump 3rd_party/marian-dev from `08b1544` to `7e67124` (#372)
  Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `08b1544` to `7e67124`.
  - [Commits](https://github.com/browsermt/marian-dev/compare/08b1544636fe13eaf1fbacb17c6fb050abfb8d42...7e67124ae0bc11b42f2e6373489831c9a2498499)
  ---
  updated-dependencies:
  - dependency-name: 3rd_party/marian-dev
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* JS: Reuse Model registry from firefox-translation-models for test page (#377)
  * JS: Reuse Model registry from firefox-translation-models repo for test page
    - https://github.com/mozilla/firefox-translations-models/blob/main/registry.json is reused
    - Removed existing registry
* JS: Using supervised QE models for available language pairs (#378)
  * JS: Refactored model loading
    - Passing single vocab memory via JS
  * JS: Use supervised QE models when available
  * Ran clang format
* Bump 3rd_party/marian-dev from `7e67124` to `844800e` (#382)
  Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `7e67124` to `844800e`.
  - [Release notes](https://github.com/browsermt/marian-dev/releases)
  - [Commits](https://github.com/browsermt/marian-dev/compare/7e67124ae0bc11b42f2e6373489831c9a2498499...844800efccba6e670250caac1735ca2c8c8e508e)
  ---
  updated-dependencies:
  - dependency-name: 3rd_party/marian-dev
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
* JS: Update languages & use Intl API for their display names (#379)
  Got the languages from registry.json, including non-prod models. Code now calls into `Intl.DisplayNames()`[1] to make life easier.
  [1] (http://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/DisplayNames/DisplayNames)
* JS: Fix swap button on test-page (#388)
* Docs: Pin Jinja2 to last known working version (#389)
  Fixes the docs workflow which is failing after pip is picking up Jinja 3.20. We only need >=2.3, this one sets it to 3.0.3 builds were successful last.
* Bump version to 0.4.3 (#392)
* Bump bergamot-translator-tests from `d03a9d3` to `7984d14` (#394)
  Bumps [bergamot-translator-tests](https://github.com/browsermt/bergamot-translator-tests) from `d03a9d3` to `7984d14`.
  - [Release notes](https://github.com/browsermt/bergamot-translator-tests/releases)
  - [Commits](https://github.com/browsermt/bergamot-translator-tests/compare/d03a9d316d40ba45c475018287971523666bf51e...7984d140aef00489699d0b7711fa942816224294)
  ---
  updated-dependencies:
  - dependency-name: bergamot-translator-tests
    dependency-type: direct:production
  ...
* Fix call to `isspace` (#396)
  Documentation is explicit about only calling it with unsigned char, and Windows runtime is checking this.
* Bump 3rd_party/ssplit-cpp from `a08d6bc` to `49fde6d` (#408)
  Bumps [3rd_party/ssplit-cpp](https://github.com/browsermt/ssplit-cpp) from `a08d6bc` to `49fde6d`.
  - [Release notes](https://github.com/browsermt/ssplit-cpp/releases)
  - [Commits](https://github.com/browsermt/ssplit-cpp/compare/a08d6bce20619a8475736832d5418458c14db9d4...49fde6df7ee9199aedb9571be800448192e3515c)
  ---
  updated-dependencies:
  - dependency-name: 3rd_party/ssplit-cpp
    dependency-type: direct:production
  ...
* Update and fix windows CI (#410)
  * Use a more vanilla windows workflow from translateLocally, remove the complicated lukka/*. Also removes port overrides in the overall upgrade.
  * Disable vcpkg binary caching
  * Remove PCRE library hacks after upstream ssplit improvements
* Upgrade emsdk to 3.1.8 (#414)
  * Rework WASM compilation options Necessary to work with newer versions of emscripten that are more picky about which option goes to the compiler, and which to the linker. Also took the opportunity to remove the need for the patching of the bergamot-translation-worker.js file, this can now easily be done through supported apis. Furthermore, I tried to downsize the generated javascript and wasm code a bit. Initial estimates show that bergamot-translator compiled with emscripten 3.0.0 runs at about 3x the speed of 2.0.9 (when using embedded intgemm). Speed-up when using mozIntGemm is less dramatic.
  * Updated marian-dev submodule
  * Revert changes specific to patching external gemm modules for wasm
  * Better Compilation and Link flags
    - Added "-O3" optimization flag for linking as well
    - "-g2" only for release and debug builds
    - "-g1" for release builds
    - Replaced deprecated "--bind" flag with "-lembind"
    - Removed redundant link flag
  * Upgraded emsdk to 3.1.8
  * Enclosed EXPORTED_FUNCTIONS values in a list
  * Fixed the remaining 2.0.9 reference in circle ci build script
  * Updated README
  Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
* Bump version to 0.4.4 (#415)
* Bump 3rd_party/marian-dev from `199201e` to `e88c1aa` (#416)
  Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `199201e` to `e88c1aa`.
  - [Release notes](https://github.com/browsermt/marian-dev/releases)
  - [Commits](https://github.com/browsermt/marian-dev/compare/199201eb89b2941afdadb14164e936d412f897ad...e88c1aa5d5c5622cb52c7df09fbb7c3d7f4b5b5a)
  ---
  updated-dependencies:
  - dependency-name: 3rd_party/marian-dev
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Set up python packaging for pypi distribution (#424)
  Old GitHub CI using Ubuntu and MacOS explicitly and building wheels have been removed in favour of the more portable pypa specified builds. These wheels should work just as well across a wider range of distributions.
  pybind11:CMakeLists.txt requires Development.Module instead of Development.* to avoid Embed from getting in the way of manylinux builds.
  manylinux_x86_64 builds are added for cp3.6 - 3.10. The linux build uses an old image via docker. Since the docker images are able to use shared ccache folder, builds quite fast on warm starts.
  ccache usage in setup.py is now triggered by an environment variable. This allows for builds not to fail if ccache not present.
  On tag pushes corresponding to versions, CI is configured to deliver built wheels to PyPI, reading from repository secrets.
  Improves setup.py including documentation and some formatting, and additional links to source.
  Fixes: #315
* Basic HTML property testing for WebAssembly (#425)
  Import https://gist.github.com/jelmervdl/a4c8b6b92ad88a885e1cbd51c6ad4902 and attach it to CI. NodeJS-14 is failing on trying to use the WebAssembly binary. So we use node-16 independently setup.
  This paves way for more complicated testing for WebAssembly bindings in the future.
* Bump version to 0.4.5 (#427)
* Python package: pyyaml >= 5.1 (#429)
  Fixes issue on Colab which says vanilla YAML intall (3.x) does not have yaml.FullLoader (https://stackoverflow.com/a/55553392/4565794).
  Fix a broken link for presentation in PyPI.
* Python: Work offline if models are available (#431)
  Try to check if models.json is downloaded first, if it is use it. If not, fall back to attempting to fetch it from the network.
  Fixes: #430
* MacOS Wheels (#432)
  * Remove trailing whitespace
  * Additional MacOS wheels: Wheels for python 3.6 to 3.10 with a minimum target of MacOS 10.9
  * Install bergamot package from wheel directory
  * Remove no-index as we need dependencies
  * update download path
  * try to update coding_styles workflow
  * Latest and greatest clang-format
* Bump qs and express in /wasm/test_page (#444)
  Bumps [qs](https://github.com/ljharb/qs) to 6.11.0 and updates ancestor dependency [express](https://github.com/expressjs/express). These dependencies need to be updated together.
  Updates `qs` from 6.7.0 to 6.11.0
  - [Release notes](https://github.com/ljharb/qs/releases)
  - [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
  - [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0)
  Updates `express` from 4.17.1 to 4.18.2
  - [Release notes](https://github.com/expressjs/express/releases)
  - [Changelog](https://github.com/expressjs/express/blob/master/History.md)
  - [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2)
  ---
  updated-dependencies:
  - dependency-name: qs
    dependency-type: indirect
  - dependency-name: express
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Arm updated (#443)
  * ARM Support using ruy and simd_utils
  * Adding ARM build on GitHub CI
  * Add workflow and successful build ssplit-cpp modified to get cross compiled android on GitHub CI working.
  * Client side fixes for int8 no shift on ARM [python]
  * Revert "Client side fixes for int8 no shift on ARM [python]" This reverts commit 020af05a8b1f4b4ef46373e6e61dcd32869fc1b1.
  * moving int8shift no-op inside the library
  * Bump 3rd-party/marian-dev
  * update the marian branch test
  * arm backend works
  * Latest and greatest clang-format
  Co-authored-by: Jerin Philip <jerinphilip@live.in>
* Apply security update and formatting
* Expand the node-test.js example code with documentation (#434)
  * Expand the node-test.js example code with documentation Is there a better way to document code than by providing an annotated & working example of it? Just listing all the exposed methods feels like giving people a box of bricks and expecting them to build a house with it.
  * Use @Jerin's feedback to simplify node-test.js explanations
  * Use native `console.assert` instead See #426 for an explanation
  * Fix comment
  Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
* More portable WASM demo (#437)
  * Replace most of the wasm demo page with code from the firefox extension This code should be more generic and copy/pastable into other projects. Maybe one day it will be an npm package?
  * Fix Ukrainian model support
  * Add quality estimation output Automatically enabled when the model(s) support it
  * Little "Translating…" indicator
  * Don't make Safari fail on something tiny
  * Rewire lots of async state to be able to predictably know when the translator is working or not Previously so much was lazy loaded that it was not easy to catch lack of SIMD support. Now I can just enable the interface only after it has properly loaded.
  * No need for a two-stage setup for the worker. Just promise to call `initialize()`!
  * More (correct) types and comments for code
  * Keyboard shortcuts for input area for bold, italic and underline. Enough to demo mark-up translation
  * Fix `delete()`
  * Move javascript glue code into its own npm package
  * Add nodejs support and test to package
  * More stand-alone build command …for now, not really used by anything I think
  * Ignore build packages
  * Use local filesystem for build so it is automatically cached
  * fix overflow on demo page But this might break the mobile demo? I'll have to check into that
  * Bring back integrity check, except for NodeJS for now
  * Make `build` part of `prepare` so we always make sure we build a complete package
  * Move worker code into its own folder This way I can mark it as a commonjs module which will help cause nodejs treat the files the same as WebWorkers do right now. Firefox doesn't implement `{type: 'module'}` yet for WebWorkers.
  * Add README
  * Fix paths
  * Add npm publish automation
  * Make sure webpack ignores node compatibility code
  * Add missing webpack:ignore around a worker
  * Default to getting models from S3
  * Separate "loading" and "translating" indicators
  * Bump npm package version
  * Add credits
  * Don't block on the worker loading
  * Not just Mozilla, but Bergamot!
  * Make individual translation requests cancelable
  * Swap button turns vertically when in skyscraper mode
  * Make it easier to debug errors from inside the worker
  * Don't bork on deleting a failed worker
  * Don't bork on calling translate() with a failed worker
  * Handle compilation error with more grace
  * `contenteditable=true` seems to work better with some browser extensions Looking at you, Vimium!
  * Clean up abort promise
  * Bump npm package version
  * Remove `workerUrl` option in favour of better webpack support With that option it was hard for Webpack to figure out dependencies, and it did not enter my worker script for rewriting. With the hardcoded url it does, and with a bit of `new webpack.DefinePlugin({'typeof self': JSON.stringify('object')}),` we can have webpack remove node-specific code on build!
  * Bump version Minor API change hehe
  Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
* Fix comp…

TommiNieminen and others added 20 commits March 8, 2023 10:55

reduced workspace, since Marian crashes training with larger workspaces (this might be fixed in newer marian versions)

583e56f

Update README.md

e98ab2b

Added note about changing CSC account

Update config.opusmt.yml

2c9f920

Fixed opusmt-teacher value to URL as it should be

added target language token addition for multilingual models

0254fe2

new test config for multilingual models

e339bdc

fixed data language pair reverse with tatoeba data

4e6d3bc

added config parameter for pretrained teacher model (only pretrained models using marian sentencepiece integration)

adae541

Update flores.sh

4e9fe65

Fixed swahili code in Flores importer

Working on using multiple teacher models, not ready for action yet

9773fbc

added profiles for csc mahti

663a463

Update README.md

f5572d8

multiteach additions

8656d40

more multiteacher changes

7ee2a58

multiple teachers added, monolingual src fixed

3b30bcc

fixed vocabs with multiteacher, other minor fixes

241480e

Merge branch 'develop'

a2f7b99

Merging multi-teacher work

fixed dummy mono src rules

fa7ded0

fixed model indices if no opus mt teachers

5efc025

added file for preinstalling snakemake envs (for easier containerization)

5bc051d

eu9ene reviewed May 3, 2023

Tommi Nieminen added 2 commits May 17, 2023 13:37

added profiles for lumi, support for amd gpus, fixing the broken non-opus-mt training pipeline

2d2bf76

both train from scratch and opus-mt teacher should work now

8b931dc

Update bicleaner-ai.yml

0219f05

Added tensorflow-rocm to bi-cleaner env to get it working on lumi

marco-c mentioned this pull request May 31, 2023

Add comparisons with other open source models mozilla/firefox-translations-evaluation#14

Closed

3 tasks

Tommi Nieminen and others added 5 commits June 1, 2023 12:22

lumi slurm fixes and bicleaner-ai bug fixing

11b69c6

Update README.md

888e1e1

Added instructions for using Snakemake without non-containerized conda installation.

merged develop to main

ad68e16

Update README.md

75a75a5

Formatting changes.

Merge remote-tracking branch 'upstream/main' into main

a6023b0

TommiNieminen marked this pull request as ready for review June 1, 2023 14:26

bhearsum reviewed Jun 1, 2023

bhearsum mentioned this pull request Jun 9, 2023

Replace DAG.pdf with a more accurate version #133

Merged

updated mtdata in base env

59e0f92

eu9ene requested changes Aug 4, 2023

Tommi Nieminen and others added 8 commits August 16, 2023 16:53

updated container to match envs

c820c48

added env variables required by new clean mono

668bfaf

Merge branch 'main' of github.com:GreenNLP/firefox-translations-training

4561c64

added separate bicleaner-ai env for lumi

8ab99b2

added lumi bicleaner env

aed8c17

added tensorflow to bicleaner-ai env

2c2a07d

fixed bicleaner-ai script bug and added a missing argument for train_spm

fdc8b35

singularity fixes: kenlm installation, added hunspell dict download, edited local-container profile to work with current Snakefile setup

c4f7c1a

This was referenced Aug 30, 2023

Add comparisons with other open source models #179

Closed

Support training a student model from an already existing teacher model #180

Closed

eu9ene changed the base branch from main to opusmt September 12, 2023 22:07

eu9ene approved these changes Sep 12, 2023

eu9ene changed the base branch from opusmt to opusmt2 September 12, 2023 22:35

eu9ene merged commit 4029eb4 into mozilla:opusmt2 Sep 12, 2023

eu9ene mentioned this pull request Nov 18, 2023

Move Snakemake code to a separate directory #266

Closed

TommiNieminen commented May 3, 2023

eu9ene May 3, 2023

eu9ene May 3, 2023

eu9ene May 3, 2023

eu9ene May 3, 2023

eu9ene commented May 3, 2023

TommiNieminen commented May 4, 2023

bhearsum commented May 10, 2023

andrenatal commented May 11, 2023

TommiNieminen commented Jun 1, 2023

andrenatal commented Jun 1, 2023

bhearsum left a comment

bhearsum Jun 1, 2023

bhearsum Jun 1, 2023

eu9ene left a comment (edited)

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene Aug 4, 2023

eu9ene left a comment

		quiet-translation: true
		#precision: float16
