Use pretrained OPUS-MT models as teachers and backward models (and other changes) #117

Merged
merged 43 commits into from
Sep 12, 2023

TommiNieminen

Hi,

As requested by @andrenatal, here are the current changes made to the pipeline in connection with the GreenNLP project. The main change is the possibility to use OPUS-MT models as teachers and backward models (including multilingual models). The changes also make it possible to use multiple teachers.

In addition to download scripts for models and corpora, I've added some OPUS-MT-specific preprocessing scripts to the pipeline, as OPUS-MT models do not use Marian's inbuilt SentencePiece support. Some of the pre- and post-processing has been added directly to the Snakefile.

I've also added Snakemake configs for supercomputers from CSC (Finnish HPC provider), and made some changes to the containerization.

-Tommi

TommiNieminen and others added 20 commits March 8, 2023 10:55
…ns-training pipeline:

- Added download scripts and rules for downloading Tatoeba-Challenge data and models.
- Modified training rules to accept downloaded Tatoeba-Challenge models as teachers and backward models.
- Modified containerization to include conda environments inside the container (to abide by CSC's conda deprecation).
- Added subword segmentation rules to Marian-specific rules (since the default pipeline uses Marian's integrated SentencePiece support and Tatoeba-Challenge models don't).
NOTE: The pipeline is still a work in progress, and it may fail for some Tatoeba-Challenge models due to subtle differences in the model make-up.
…es (this might be fixed in newer marian versions)
Added note about changing CSC account
Fixed opusmt-teacher value to URL as it should be
…models using marian sentencepiece integration)
Fixed swahili code in Flores importer
Merging multi-teacher work
@@ -42,19 +42,21 @@ else
| pigz >"${output_prefix}.${lang}.monofix.gz"
fi

# disabled due to errors in langid, need to debug sometime
Collaborator

should we just add a config setting to disable it? I don't think we had problems with it for our training
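
A switch like the one suggested here could be a variable that defaults to the current behavior and turns the step into a pass-through when disabled. This is only a minimal sketch: `LANGID_ENABLED`, `run_langid` and `langid_step` are hypothetical names that do not exist in the pipeline, and the stand-in filter just drops empty lines instead of running real language identification.

```shell
# Hypothetical toggle (assumption); defaults to enabled so existing
# behavior stays unchanged when the variable is not set.
LANGID_ENABLED="${LANGID_ENABLED:-true}"

# Stand-in for the real language identification filter (assumption):
# it only drops empty lines so the sketch is runnable on its own.
run_langid() {
  grep -v '^$' || true
}

# Filters stdin when the toggle is on, otherwise passes it through untouched.
langid_step() {
  if [ "${LANGID_ENABLED}" = "true" ]; then
    run_langid
  else
    cat
  fi
}
```

With this shape, the cleaning script would pipe its data through `langid_step` unconditionally, and a config value mapped to the variable decides whether any filtering happens.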

echo "### Language identification"
test -s "${output_prefix}.${SRC}${TRG}.langid.gz" ||
pigz -dc "${output_prefix}.${SRC}${TRG}.rule-based.gz" |
#Disabled language identification due to crashes, maybe due to small corpus, try with bigger one
Collaborator

should we just add a config setting to disable it?

quiet-translation: true
#precision: float16
Collaborator

we use this one for faster encoding. Should we just add another decoder config for opus models and add a config option for this?

@@ -1,20 +1,20 @@
verbose: false
use-conda: true
resources: gpu=8
cores: all
resources: gpu=0
Collaborator

I believe the idea for this config was to train locally on a machine with GPUs.

@eu9ene
Collaborator

eu9ene commented May 3, 2023

@TommiNieminen thank you for contributing! I'm happy to see that you used the pipeline. Using pre-trained OPUS models will be valuable for us.

I see that it's a draft, but I briefly looked at it. You made great use of Singularity images, configs and Snakemake profiles. From the code standpoint, I would suggest preserving the capability to train from scratch using the original Bergamot recipe unless we want to fix some bugs and improve it. We can add some config settings and conditional operators to distinguish this from the OPUS use case. I'm not working actively on the project, though, and @andrenatal should have more context on our direction.

@TommiNieminen
Author

Thanks for the review, @eu9ene. I pushed a quick draft now in case you want to use the OPUS-MT parts of the pipeline for your Taskcluster workflow, but as you noticed, there are some loose ends in the code that I need to clean up.

I saw that elsewhere you were discussing whether to maintain Snakemake support in addition to Taskcluster. For my part, I'm going to keep working with the Snakemake workflow, since I'll be working in an HPC environment with SLURM. So I will be maintaining a Snakemake fork at least for some years, although I expect it to eventually diverge quite a lot from your work (I will be mostly working on retrieval-augmented MT).

@bhearsum
Collaborator

👋. I'm the person who's overseeing the migration into Taskcluster. @andrenatal and I spoke about this a bit earlier today. We agreed that for the time being we're going to maintain support for both Slurm/Snakemake and Taskcluster. I don't think we know what will happen in the medium or long term yet, though (and Andre is probably better positioned to talk about it anyways...).

I merged the first part of this work just now, which has caused merge conflicts here due to some minor changes in the pipeline scripts (that Taskcluster is also using). If you want a hand unbitrotting those please let me know and I'd be happy to do so.

@andrenatal
Contributor

Hi @TommiNieminen. Would you please join our Matrix room so we can coordinate a way to meet: https://matrix.to/#/#firefoxtranslations:mozilla.org? You can find me there under the handle anatal.

Added tensorflow-rocm to bi-cleaner env to get it working on lumi
Tommi Nieminen and others added 5 commits June 1, 2023 12:22
@TommiNieminen TommiNieminen marked this pull request as ready for review June 1, 2023 14:26
@TommiNieminen
Author

Sorry about the delay in getting this moving forward; I made some more modifications to the pipeline that I wanted to include. There are many scripts where the number and order of parameters have changed, so the Taskcluster setup probably won't work without changes.

I also noticed that the bicleaner-ai processing had stopped working at some point. This was because the latest bicleaner-ai models are no longer available as GitHub releases (they are now on Hugging Face). I changed the bicleaner-ai model download to load the v1 models, which are still on GitHub. The v2 models available through Hugging Face are not significantly better than the v1 models (I confirmed this with the bicleaner-ai developers), so updating the bicleaner-ai model downloader to use Hugging Face is not urgent. I also updated the bicleaner-ai version in the conda env configuration to the latest one.

@andrenatal
Contributor

That's awesome @TommiNieminen, thanks for that!! I'll review it and we can chat about it tomorrow.

Collaborator

@bhearsum bhearsum left a comment

There are many scripts where the number and order of parameters have changed, so the Taskcluster setup probably won't work without changes.

Don't worry about this. Most of your changes affect things we haven't set-up in Taskcluster yet, and I'll deal with any fallout that comes from this PR.

It also looks like this should apply cleanly on top of #125 - so that shouldn't disrupt this PR if it lands first. I'll try to hold off merging anything that would conflict until you finish up here.

src: en
trg: sw
src_three_letter: eng
trg_three_letter: swa
Collaborator

Putting this here instead of calling mtdata every time we need it is a nice improvement!


experiment:
name: opusmt
src: en
Collaborator

I'll note that the existing test config uses ru -> en. I think this is to ensure that the reverse locale handling in certain pipeline scripts works OK. Probably not crucial - just calling it out.

Collaborator

@eu9ene eu9ene left a comment

@TommiNieminen @bhearsum I did another pass at reviewing this. Here are my thoughts:

What I'm not concerned about:

  • adding new profiles
  • adding new configs
  • adding new pipeline steps
  • installation scripts
  • bicleaner fixes
  • backward compatible changes inside the pipeline directory

What I'm concerned about:

  • (!) backward incompatible changes inside the pipeline directory. This means we'll have to modify the Taskcluster scripts and retest the training from scratch, and I don't think it's the right time to do that
  • modifying existing configs including the training one (ideally, everything related to OPUS use case should be in separate configs)
  • readme
  • changed default singularity image
  • complexity of the Snakefile (I don't have any good suggestions here except maybe adding a new file for OPUS use case. Maybe it's fine and we can live with it since we're moving to the TC.) - not critical

On one hand, we need this merged to start experimenting with the pre-trained OPUS models and @TommiNieminen did a great job of making this work (thanks again!). On the other hand, I see a risk of breaking our main training procedure due to the complexity and cost of retesting, especially since we are still testing the integration with Taskcluster, which will serve as our main training scheduler soon.

I believe that, to move forward with this, we should address the backward compatibility issues to at least keep our current pipeline that was migrated to Taskcluster working. Even if there are bugs in the Snakefile, that would not be critical. Then we can start experimenting with OPUS models using Snakemake and gradually migrate this to Taskcluster if it proves to be useful.

I'm also not sure whether @TommiNieminen has time to address those concerns. Let me know what you think.

cache: false
reason: true
config:
# install dependencies on a local machine
- deps=true
# root path to a folder with data, models and logs
- root=/data
- root=/home/tommin/greennlp/data
Collaborator

please revert values in this config to the original ones, since the idea is to test things on a GPU machine locally

CONDA_PATH=../mambaforge
SNAKEMAKE_OUTPUT_CACHE=../cache
PROFILE=local
#PROFILE=local
Collaborator

let's keep the default values here as before

@@ -4,12 +4,13 @@
SHELL=/bin/bash

### 1. change these settings or override with env variables
CONFIG=configs/config.prod.yml
CONFIG=configs/config.opusmt-multimodel-test.yml
Collaborator

let's keep the default values here as before

@@ -1,3 +1,64 @@
# OPUS-MT integration
Collaborator

Please move an extra docs section related to the OPUS integration to another place. Ideally, we can create a docs folder and keep it there with the link from the main readme. It's already quite long.

@@ -63,55 +63,7 @@ marian-args:
datasets:
# parallel training corpus
train:
- opus_ada83/v1
Collaborator

let's keep this config as it was and revert to the original values. This is an example of a real training set

@@ -13,48 +14,72 @@ min_version("6.6.1")

### configuration

container: 'Singularity.sif'
containerized: 'Ftt.sif'
Collaborator

revert to the original image

if [[ -f ${opus_vocab} ]]; then
vocab=${opus_vocab}
else
vocab=$3
Collaborator

It looks like the vocab variable will not be set if there are no opus vocabs in the model dir, so this is not backward compatible. Should we set it in the beginning like before and then override it?
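
The "set it first, then override" pattern described in this comment could look roughly like the following. It is a sketch under assumptions: `resolve_vocab` and its two parameters are illustrative names, not the script's actual interface, and the function only prints the chosen path.

```shell
# Picks the vocabulary path: start from the caller-supplied default, then
# override it only if an OPUS-MT vocabulary file actually exists.
# Function and parameter names are illustrative (assumption).
resolve_vocab() {
  local default_vocab="$1"
  local opus_vocab="$2"

  # Always set first, so the variable is defined on every code path.
  local vocab="${default_vocab}"

  # Override only when the OPUS-MT vocabulary file is present.
  if [ -f "${opus_vocab}" ]; then
    vocab="${opus_vocab}"
  fi

  printf '%s\n' "${vocab}"
}
```

Because the default is assigned unconditionally, callers that never ship an OPUS-MT vocabulary get the same value as they did before the OPUS-specific branch existed.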

if [[ -f ${opus_vocab} ]]; then
vocab=${opus_vocab}
else
vocab=$3
Collaborator

same here, doesn't look backward compatible

@@ -11,9 +11,19 @@ test -v MARIAN
test -v WORKSPACE

input=$1
vocab=$2
models=( "${@:3}" )
output=$2
Collaborator

can we keep the arguments in the same order to make it backward compatible and not change the task cluster part?
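
One possible way to keep the positional order intact is to leave `input`, `vocab` and the model list where they were and pass the new value through an optional variable. The sketch below assumes a hypothetical `OUTPUT` variable and a made-up `.out` fallback; neither is taken from the pipeline, and `describe_call` only echoes what it parsed.

```shell
# Keeps the original positional order (input, vocab, models...) and reads
# the new output path from an optional variable instead of inserting a new
# positional argument. OUTPUT and the ".out" fallback are assumptions.
describe_call() {
  local input="$1"
  local vocab="$2"
  shift 2                       # the remaining arguments are model paths

  # Fall back to a path derived from the input when OUTPUT is not set.
  local output="${OUTPUT:-${input}.out}"

  echo "input=${input}"
  echo "vocab=${vocab}"
  echo "output=${output}"
  local model
  for model in "$@"; do
    echo "model=${model}"
  done
}
```

Existing callers that pass only `input vocab model...` keep working unchanged, while new callers can set `OUTPUT` explicitly.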

conda: "envs/base.yml"
threads: gpus_num * 2
resources: gpu=gpus_num
input:
rules.merge_devset.output, model=f'{teacher_base_dir}{{ens}}/{best_model}',
rules.merge_devset.output, model=f'{teacher_base_dir}0-{{ens}}/{best_model}',
Collaborator

why 0? I see that it was also added to the finetuned one. It's not clear to me why it was added.

Collaborator

@eu9ene eu9ene left a comment

I've tested it and it's ready to merge into a separate branch. I'll address my concerns with readme, configs and compatibility with TC afterward.

@eu9ene eu9ene changed the base branch from opusmt to opusmt2 September 12, 2023 22:35
@eu9ene eu9ene merged commit 4029eb4 into mozilla:opusmt2 Sep 12, 2023
nordzilla added a commit that referenced this pull request Oct 1, 2024
* Cleanup API: Refactor request on-complete transition (#80)

* Handle empty translation requests

Fixes https://github.com/browsermt/bergamot-translator/issues/101.
ResponseBuilder is called with empty histories to trigger a valid but
mostly-empty response.

* Control validating the config options via a boolean flag (#116)

* Control validating the config options via a boolean flag

 - parseOptions() function now validates the parsed options
   based on the validate argument

* Minor syntactic fix

* JS bindings for loading model and shortlist files as bytes (#117)

* Bindings to load model and shortlist files as bytes
* Modified wasm test page for byte based loading of files
* Updates wasm README for byte loading based usage of TranslationModel

* Make wasm test page work with bergamot-models repository

 - bergamot-models now contains lexical shortlist bin files as well

* Better error logging for wasm test page

* Update to marian-dev master

* Full windows support with ssplit from browsermt, not a fork (#109)

* Update marian-dev to the newest mac version

* Attempt windows workflow

* force workflow rerun

* Separate id

* Attempt 3 at github action

* Marian dev submodule now compiles with apple clang

* Updated ssplit version to something more recent

* Attempt to fix compile on wasm

* Do not compile subproject tests

* Fix emscripten compilation on Mac

* 99% on the way to windows compile

* Try with a different generator

* Build release not debug

* Revert CMakeLists.txt hacks

* Fix sse2 compilation failure

* MSVC settings for WIN32

* Add nodefaultlib LIBCMT

* Do not compile ssplit.cpp as it contains sys/mman.h

* Revert ab56b9aa4f4360b0ab98d5806658d4302f31db1d

* Update paths

* Set the build type to release if not set previously

* Attempt to build release with the windows workflow

* Attempt 5 at VS studio release build

* Attempt 6 at getting release build on MSVC generator

* The windows build is debug at the moment...

* fix ssplit for ubuntu 16.04

* Fix compilation with clang

* Compile on ubuntu16.04

* Explain what is going on

* Updated ssplit and workflow

* Enabled gemm-precision in wasm test page

 - This increases the inference speed while providing
   models as bytes to the translation engine
   (it wasn't needed while providing models as files)

* Updated wasm/README file with instructions for byte loading APIs

* WASM Bindings collapse (#87)

* Safe transfer of bindings through typedefs

* Removing Translation* files and bringing in counterparts

* Remove previously commented out code

* Removing commented out include

* Absorb Translation* documentation

Co-authored-by: abhi-agg <66322306+abhi-agg@users.noreply.github.com>

* Improve script to patch wasm artifacts and load EN->DE vocabulary in wasm test (#125)

* Improved script that patches wasm artifacts to enable wormhole

 - Made the regex pattern ignore multiple whitespaces b/w words of
   the matching pattern

* Fix for loading EN->DE vocabularies in wasm test page

 - Loading vocabularies for EN->DE was failing because of
   the new structure of bergamot-models

* Improved wasm scripts and README (#128)

* Minor README change

 - Changed "browsermt" to "mozilla"

* Updating ci scripts for the latest upstream changes

 - The upstream browsermt/bergamot-translator builds the wasm artifacts
   in top level build folder now

* Extension desired changes (#129)

* Enable worker file system
* Avoid node.js-code in emscripten glue-code

* Fix busy loop in windows (#131)

* Fix busy loop in windows

* Nick wants the while loop gone

* Fix continue leftover

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Making bytearray a commandline switch (#127)

* Adding bytearray option

* collapse intermediate for bytearray apps

* Removing service-cli-bytearray

* Removing the bergamot bytearray app

* Bumping updates to brt collapsing apps

* Reasonable defaults and hard check when cmd enabled

* Update documentation for flags

* Bump brt with MKL check and skip

* Bumping BRT with MKL_FOUND instead of USE_MKL

* Bumping BRT with no mkl enforce

* Bumping BRT with ssse3 output

* Let's try disabling OpenBLAS

* Trying to disable apple accelerate

* Using WASM compatible BLAS can enable intgemm

* Adding a CMake -L to see what exactly is the diff

* Revert "Let's try disabling OpenBLAS"

This reverts commit 9a6b9bc53bf7dec956889f6e0b7047e5388e1b7e.

* Revert "Using WASM compatible BLAS can enable intgemm"

This reverts commit 936a592e18431c279e6c5952a278d012d18ff295.

* Restricting mac tests through tags and on GitHub CI

* Using only check-bytearray

* Bumping BRT with change of default behaviour

* Faithful to source-structure translation (#115)

* First draft of faithful translation

* Comments explaining pre and post

* Comments on response_builder

* Updating bergamot-translator-tests with new outputs

* Cosmetic changes in response target text construction

* Replacing &(x[0]) -> x.data() to avoid illegal indices

* Removing nullptr given both branches init pointer with legal values

* pre, post -> gap(i) addressing review comments

Functions which were pre and post before are subsumed by gap(i), and the
algorithm in ResponseBuilder adjusted to fix.

`x = nullptr` is back, should be harmless.

* Updating brt with paragraph outputs

* Bumping brt with updated outputs, buffer text at begin as well

* Bumping BRT with sync after bytearray collapse merge

* Pointing BRT to main after merge

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Enable vocabs pass as byte arrays (#122)

* first attempt to enable vocabs pass as byte arrays

* pass vocabs bytes as AlignedMemory

* add vocabIndices to avoid double loading

* small fix on parameter names and documentation

* fix windows build plus tiny update on documentation

* update marian-dev submodule

* move validate model bytearray in BatchTranslator

* small refactors on validateBinaryModel()

* switch vocab memories to std::vector<marian::Ptr<AlignedMemory>>

* update marian-dev submodule

* replace marian::Ptr to std::shared_ptr for vocab memories

* add note for vocab memories

* Update ssplit submodule, removing absl (#132)

* Update ssplit submodule, removing absl

* Fix ssplit variables

* Update ssplit branch

* Fix emscripten compilation

* Update tests

* Minor rename: sentence_ranges -> annotation (#134)

* Target master of ssplit-cpp

* Remove unused types TokenRanges, SentenceTokenRanges, UPtr (#137)

* Change USE_WASM_COMPATIBLE_SOURCE =OFF by default on native, force on for WASM (#138)

* Change WASM_COMPATIBLE_SOURCE=OFF by default

The default was WASM_COMPATIBLE_SOURCE=ON COMPILE_WASM=OFF which is a
testing configuration, not a sensible default for native or wasm.

* Always USE_WASM_COMPATIBLE_SOURCE with COMPILE_WASM

* Set CMP0077 to fix variable handling

* Export "addOnPreMain" function from wasm module

 - This is required in the extension while using wasm module in a worker environment

* Enable Debugging information in wasm module builds

 - Added "-g2" flag during linking step

* JS bindings for vocabularies as bytes

* Updated wasm test page to pass vocabulary files as bytes

* Refactoring TranslationModelBindings class

 - typedef AlignedMemory for code readability

 - Added documentation for one of the binding function

* Avoid packaging vocab files into wasm binary in CI builds

 - We don't need to package vocab files into wasm binary any more
   as a sync with upstream enabled passing vocabs as bytes

* Updated wasm README to update for passing vocabs as bytes

 - Updated Using JS APIs section to pass vocabs as bytes

* Updated README to remove packaging steps for wasm compilation

 - We don't need to package model, shortlist or vocab files into wasm
   binary at build time

* Updated CMakeLists.txt to remove packaging steps for wasm compilation

 - Removed PACKAGE_DIR cmake option
 - Removed Workerfs, FORCE_FILESYSTEM=1 in wasm builds
   -- File system support is not needed any more (since model,
     shortlist and vocabs are being passed as bytes now)

* Bundle AlignedMemory inputs with MemoryBundle (#147)

* Enabling ccache on github builds for Ubuntu (#95)

* CI Changes to add tiny regression tests

* Adding an inspect cache step

* Removing ccache, pursue in another

* Incorporating Nick's changes through submodule merge

* Submodule now points to master

* Restoring ccache enabled workflow file

* Restoring ccache enabled CMakeLists

* cache -> ccache typo fix

* Moving CCACHE setup to GitHub runner file

* Find also uses CCACHE dir

* Updating CMakeLists not to override env

* Cache compiler binary's contents

* Changing a few names to trigger new build; Testing cache looks fun

* USE_CCACHE=on, -L for inspection

* Adding a ccache_cmd, but will only use in next commit

* Using ccache_cmd

* Removing "

* Adding compiler hash script

* Bunch of absolute paths

* GITHUB_WORKSPACE typo

* Nah, I'll keep -L and trigger another build

* Trying something with compiler hash on cache key backup as well

* builtin, bash it seems

* Empty commit #1

* Move ccache stats to after compile

* Reshuffling ccache vars

* No comments

* Updates to Github output set syntax

* Empty Commit 1

* Empty Commit 2

* Empty commit 3

* /bin/bash -> bash; ccache_cmd for consistency

* Adding ccache -s before and after build

* Adding comments to compiler-hash script

* Let's build cached and non-cached variants together for comparison

* Fixing quotes, /bin/bash -> bash

* Minor var/env adjustment

* Adding ccache -z before the job

* Reverting CMakeLists.txt without CCACHE

* Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule

* 5G -> 1G cache size

* 1G -> 2G; Hyperparameter tuning

* Refactor vocabs in Service (#143)

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Rewrite annotation class to remove corner cases (#135)

* Added cmake file to compute version information

 - Reads BERGAMOT_VERSION file for generating various strings
   for versioning

* Import GetVersionFromFile cmake file in root level CMakeLists.txt

* Modified wasm cmake file to include version information in built artifacts

* Generate project version file for native builds

 - The header file exposes a function that provides version information
   for native binaries

* Bumped version to 0.3.0

 - This brings the version info in sync with the various releases
   of extension

* Corrected the version number

 - To be in sync with versioning in mozilla/bergamot-translator repo

* Marian submodule with unified loading (#157)

* Collapsing TranslationRequest -> ResponseOptions (#139)

* Rewriting batching for threadsafety (#155)

This does make the batcher a critical section across job submission and
cleaving though.  If that becomes a problem, we should go back to
incoming and outgoing queues with a batcher thread.

Also removes blocking mode from native compiles.

Note that translateMultiple no longer guarantees great batching.  Guess
we could lease the mutex from ThreadsafeBatcher and create a session.

There is the risk that one sentence comes in at a time and each thread
grabs one sentence at a time instead of better batching.  Not sure what
to do about that other than some sort of Nagle algorithm.

Due to non-deterministic batching, even with one thread, the regression
tests will go haywire.

* Use binary lexical shortlist in documentation (#152)

* Use binary lexical shortlist in documentation

* MKL/AppleAccelerate note

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>
Co-authored-by: Jerin Philip <jphilip@ed.ac.uk>

* initialise MemoryBundle members (#167)

* Adding clang-format and updating existing sources to adhere (#151)

* Adding a first version of clang-format

* Adding run-clang-format.py

* Adding coding styles to workflow

* Fix indentation on coding-styles workflow

* run-clang-format.'py'

* -style -> --style in python

* Updating ColumnLimit: 120

* Format update with clang-format

* Revert "Format update with clang-format"

This reverts commit 5340b19eae8fcc91a2a79205e0b3dd65ad61ad4c.

* Apply update after sync

* Removing a few empty lines

* Removing one more empty line

* Removing empty in workflow file

* Updating README with coding style instructions

* clang-format-* provided in this repository doc update

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Pin emsdk version to the same one used in Circle CI (#165)

* GitHub action to push browsermt/main branch to mozilla/bergamot-translator every hour (#160)

* Create push-browsermt-main-to-mozilla-main.yml

* Update .github/workflows/push-browsermt-main-to-mozilla-main.yml

Co-authored-by: Graeme <graemenail@gmail.com>

* Tweaks

* Fix yaml syntax

* Parametrized the workflow based on @jerinphilip's example

Co-authored-by: Graeme <graemenail@gmail.com>

* Update tests

* Bumping BRT for hotfixes (#169)

* Bumping BRT for hotfixes

* updating brt to point to main

* Remove O(N^2) reallocation (#171)

* Adding documentation action (#168)

Adds a GitHub workflow that builds documentation from sources through doxygen through sphinx on push to the main branch or on push of any semantic version tags. The built documentation is deployed at https://github.com/browsermt/docs@gh-pages, which is rendered at https://browser.mt/docs/<suffix>, where <suffix> is 'main' or a tag vM.m.p corresponding to a semantic version.

On pull request artifacts are uploaded for reviewers to inspect if need be.

* Fix failures when loading text shortlist (#154)

* Updating marian dev RelwithDebInfo -> Release (#178)

* Updating marian dev RelwithDebInfo -> Release

* Updating submodule to point to master

* Single executable (#175)

* Collapsing executables

* Adding new test executable

* Deleting old executable sources

* Updating brt to operate with modes

* cli-framework -> cli

* Updating workflows to check for bergamot instead of bergamot-translator-app

* Adding documentation

* Making fn pure virtual

* Shuffling apps into app namespace, alongside class documentation

* Include app folder in documentation

* BRT update service-cli -> native

* parser.h: service-cli -> native

* Updates to marian-integration.md

* Cleanup: Remove templates, interface proper

* change 4 to 2 cores for build instructions

* service-cli -> native

* Commenting the string constructor explanation

* Not doing halfway interface / inheritance

* Nick hates state, let's try this one

* Revert "Nick hates state, let's try this one"

This reverts commit e56db9f474b1906e62af0b06afb7c7d9e08ea9c8.

* class -> struct before trying std::function stuff

* oop -> functional?

* Hints on what is happening

* app::ftable -> app::REGISTRY

* We have if-else and functions now.

And we won't have test apps.

* Doc linking to usage examples in brt

* Remove unordered_map

* Documentation updates

* Fix warning

* Deploy generated documentation only if browsermt (#179)

* Including WASM documentation in sphinx build toc (#176)

* Updating marian-dev: intgemm with env variable matmul switches (#187)

* Remove addSentenceWithPriority (#186)

* Update native (ubuntu, mac) workflows with ccache (#181)

* Matrix is now more organized, Ubuntu 20.04-gcc9.3, Ubuntu-18.04-gcc7.5 is added.
* ccache is extended to MacOS, and brings down CI run times to <5m when
  ccache works.
* The compiler hash scripts are gone, ccache already covers most ground
  by default. The shell script is unnecessary. Cache works by preprocessor
  mode output of running the compiler with -E, which includes the
  necessary information. ccache-docs:How the cache works.
* BRT if failed prints the final 20 lines of the test*.log to inspect
  what's going wrong without having to artifact download.
* Pull request on any branch triggers workflow.
* Push on main and ci-sandbox triggers workflow.

* Replace resize with possible negative range with pop_back() (#189)

* Consistent EMSDK version and parallel make jobs in README and github actions

 - Set EMSDK version to 2.0.9 to make it consistent
   everywhere in repo
 - Set parallel make jobs to 2

* CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193)

* Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string

* marian-dev cmake fix

* Generate project.h in binary dir

* We don't want people asking about extra spaces

* Fixing if syntax with YAML var substitution (#188)

* Generating cmake configured project version (.js) file in build folder (#194)

 - Earlier this file was being generated in the folder containing
   the actual sources

 - Fixes https://github.com/browsermt/bergamot-translator/issues/161

* Partial test-apps and tolerance in evaluations (#184)

* Partial test applications

Previously service-cli was used to generate output and accomplish
regression testing for all of: (1) translated-text (2) alignment tokens
+ scores (3) quality scores (4) indirectly annotation and tokenizations.

The --mode native now only outputs the translated text, faithful to the
source, for the input read from stdin.

Test apps are separated into testing only individual functionalities.
This can help in independently testing ssplit-cpp, quality-scores for
the quality estimation implementation etc.

Separating numbers and text has the advantage of being able to compare
each with its own tolerance, using BLEU (text) and some allowed
error-rates (numbers).
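
A minimal sketch of the numeric half of such a comparison, assuming the scores have already been parsed into lists of floats. The function name `within_tolerance` and the default tolerances are hypothetical and are not taken from BRT.

```python
import math


def within_tolerance(expected, actual, rel_tol=1e-3, abs_tol=1e-6):
    """Return True if both lists have equal length and match element-wise within tolerance."""
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )
```

Text outputs would be compared separately (for example with a BLEU threshold), so small numeric drift across compilers or architectures does not fail the text test.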

* Removing #mac tag

* Moving test apps to src/tests

* Tests are always on for CI

Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES.

* Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE

* Workaround for now; CMakeLists.txt horrors are starting to bite

* BRT: use bergamot-test instead of bergamot now

* This should fix issues: CMakeLists.txt has so many paths

* Casing to camelCase and removing legacyServiceCli

* removing leftover service-cli declaration, some doc updates

* #pragma once is starting to look easier

* All the more reasons to do #pragma once

* Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID

* BRT: Use --gemm-highest-arch instead of python script

* Adding intgemm resolve here, where always(?) have intgemm on?

* intgemm-resolve in default binary directory

* BRT: Update to use intgemm-resolve

* marian-dev: Reset to without --gemm-highest-precision

Co-authored-by: Kenneth Heafield <kpu@users.noreply.github.com>

* Removing alignments and quality-scores test-code (#196)

* Removing alignments and quality-scores test-code
* BRT: Update to main

* Refactor wasm bindings to use consistent interface names as in native (#195)

* Refactored wasm bindings code
 - Replaced TranslationModel, TranslationRequest and TranslationResult
    with Service, ResponseOptions and Response
 - Corresponding documentation changes
 - Names of the bindings files changed
 - Moved Vector<Response> definition in Response specific bindings
   file

* Account for EOS in both source and target annotations (#190)

* Load sentence-splitter (non-breaking prefixes) from ByteArray

Service now allows loading Sentence-Splitter (non-breaking prefix file) from ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist): the ByteArray is checked first and, if it is empty, loading falls back to the file-path.

Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. 

Bonus refactoring to remove an extra layer that existed for no reason.

* maxLengthBreak_ -> wrapStep bugfix (#200)

* Change ResponseBuilder to accept callback instead of future (#142)

* Change ResponseBuilder to accept callback

Breaks things everywhere, now we follow the compiler to fix and convert
the std::future -> callback.
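
The shape of this change can be sketched in Python, assuming a callback-based entry point. `translate` here is a hypothetical stand-in (it only upper-cases its input) and is not the bergamot API; the point is how a caller that still wants a future can recover one from the callback.

```python
from concurrent.futures import Future


def translate(text, callback):
    """Stand-in for a callback-based API: hand the result to `callback` instead of returning it."""
    callback(text.upper())  # placeholder "translation", purely illustrative


def translate_as_future(text):
    """Recover a future from the callback API by fulfilling it inside the callback."""
    future = Future()
    translate(text, future.set_result)
    return future
```

The callback form leaves the choice of synchronisation to the caller, which matters for targets where blocking on a future is not an option.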

* More std::future -> callback

* std::future out of service.{h,cpp}

* compile is working, so is callback

* Some reshuffling of args

* Fixing merge error

* Fixing signature conflicts out of merge

* Fixing that test duct-taping future

* Minor adjustment to get that future back

* Add documentation for the new callback function

* Applying clang-format after update

* Using default responseOptions

* Remove future references from documentation

* translateMultiple only for WASM (#177)

* BRT: update to main; fresh-failures hopefully

* Converting test translateFromStdin to use callback

* BRT: Add fresh #native and #wasm tags

* future from promise, fix error

* Adding #native to GitHub CI

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Added public methods in Response class to return sentences

 - Refactored ByteRange struct and moved it to definition.h

* JS bindings to return sentence byte ranges

* Wasm: Enabled sentence byte ranges in the wasm test page

 - Use JS bindings to print all sentences individually on
   console

* Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208)

A cmake change has caused vcpkg to fail without much error message,
which is causing windows workflow runs to fail. Details in the following
link:

* https://github.com/microsoft/vcpkg/issues/18718

To fix, we're going with a version bump in vcpkg. Seeing that run-vcpkg
also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4.
Playing with fire: vcpkg master commit

* Added build instructions to run on other browsers

 - Disabled compiling with wormhole which is Firefox specific feature

* Add a clang-tidy run (#214)

Adds a clang-tidy run in addition to the existing clang-format checks.
The clang-tidy checks are not enforced, but are potentially useful to
point to during review.

* Wasm test page using web workers now (#218)

* Updated marian submodule to latest commit of master

* Wasm builds without SharedArrayBuffer

* Circle CI wasm artifacts for non-wormhole builds

* BRT: Update sacrebleu to get tests back working (#217)

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* QualityEstimation: Preliminary Implementation (#197)

Unifies quality estimation with an interface, refactors previously available
quality scores to fit this interface. Adds a new class of model with Logistic
Regression powering the predictions as an implementation of said interface.
QE now provides annotations on words, mapping subwords to words with
rule-based algorithms that work on space characters.

QualityEstimation
-----------------

Implementations of QE are bound together by a `QualityEstimator`
Interface. 

1. The log-probabilities from the machine-translation model re-interpreted
   as quality scores are crafted as an implementation of QualityEstimator.

2. A Logistic-Regression based model is added. This class of models is
   trained supervised with scores labeled by a human annotator.
   Handcrafted features - number of words, log probs from MT model and 
   statistics over the sequence are used to generate the numeric features.
   LogisticRegressor, Matrix (to hold features) are added.

The creation of an instance is switched by the `AlignedMemory` supplied
(be it loaded from the file-system or supplied as a parameter). An empty
AlignedMemory leads to quality scores from NMT while supplying weights
of a trained logistic-regression model in binary format as the contents
lead to an additional pass through the said model to provide more
refined scores.

Both the above now transform subwords into "words" using a heuristic
algorithm, scanning for spaces. This allows the client to work with "words"
to denote quality instead of subwords, as the former is more sensible to
the user.
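
A rough sketch of the space-scanning idea, assuming each subword arrives as the text it covers (with its leading space, if any) together with a log-probability. Averaging the subword scores per word is an assumption made for illustration, and `words_from_subwords` is a hypothetical name.

```python
def words_from_subwords(subwords, log_probs):
    """Group subwords into words at leading spaces; score each word by its mean log-prob."""
    words, scores = [], []
    for piece, log_prob in zip(subwords, log_probs):
        starts_word = piece.startswith(" ") or not words
        if starts_word:
            words.append(piece.strip())
            scores.append([log_prob])
        else:
            # Continuation of the current word: extend text and collect its score.
            words[-1] += piece
            scores[-1].append(log_prob)
    return [(word, sum(s) / len(s)) for word, s in zip(words, scores)]
```

For example, the pieces `"Hel"`, `"lo"`, `" wor"`, `"ld"` collapse into the two words `Hello` and `world`, each carrying one score.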

Testing
-------

1. BRT now has two new test apps to check the QE outputs in text
  (covers subword to words) and numbers domain (covers quality scores).
  These are tested with en-et models for which QualityEstimation is
  available now, on a new input to avoid architecture/compiler issues.
2. Unit test for LogisticRegression model is added.


Docs
----

Doxygen now supports MathJax properly, so the explanations of the
reductions applied to Logistic Regression to make computation more
efficient render correctly.

Co-authored-by: Felipe C. Dos Santos <felipe.santos.k@gmail.com>
Co-authored-by: Jerin Philip <jerinphilip@live.in>

* Multiple TranslationModels Implementation (#210)

For outbound translation, we require having multiple models in the
inventory at the same time and abstracting the "how-to-translate" 
using a model out.

Reorganization: TranslationModel + Service. The new entity which
contains everything required to translate in one direction is
`TranslationModel`. How to translate, whether in a blocking
single-threaded mode or an async multi-threaded mode of operation, is
decoupled into `BlockingService` and `AsyncService`. A new
regression-test using multiple models in conjunction is added, also
serving as a demonstration of using multiple models in Outbound
Translation.

WASM: WebAssembly, due to the inability to use threads, uses
`BlockingService`. Bindings are provided with a new API to work with a
Service, and multiple TranslationModels which the client (JS extension)
can inventory and maintain.  Ownership of a given `TranslationModel` is
shared while translations using the model are active in the internal
mechanism.

Config-Parsing: So far bergamot-translator has been hijacking marian's
config-parsing mechanisms. However, in order to support multiple models,
it has become impractical to continue this approach and a new
config-parsing that is bergamot specific is provisioned for
command-line applications constituting tests. The original marian
config-parsing tooling is only associated with a subset of
`TranslationModel` now. The new config-parsing for the library manages
workers and other common options (tentatively).

There is a known issue: inefficient placement of workspaces leads to
more memory usage than necessary. This is to be fixed, trickling
down from marian-dev, in a later pull request.

This PR also brings in BRT changes which fix speed-tests that were
broken and also fixes some QE outputs which were different due to not
using shortlist.

* Adapted wasm test page for new Service interface (#224)

- The new interface now supports running multiple TranslationModels

* Wasm test page UI for translating b/w non-English language pairs (#231)

* Updated Wasm test page UI for translating b/w non-English language pairs
* Both "from" and "to" language dropdowns now allow non-English languages

* Import matrix-multiply from a separate wasm module (#232)

* Updated marian-dev submodule
* Import wasm gemm from a separate wasm module
 - The fallback implementation of gemm is currently being imported dynamically
   for wasm target
* Updated CI scripts and README to import GEMM from a separate wasm module
* Setting model config to int8shiftAlphaAll in wasm test page

* JS bindings for Quality Estimation (#239)

* Quality Score bindings complete
* Updated wasm test page to test the bindings
  - Word and sentence scores can be seen in browser console

* Cache for translations (#227)

Sets up a cache that operates on each sentence a TranslationModel processes,
caching the corresponding marian::History under a {TranslationModel::Id, marian::Words}
key. The cache is thus shared across multiple TranslationModels and bound to the
lifetime of a Service. The cache gracefully downgrades in the case of WebAssembly.
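
A minimal sketch of a cache keyed this way, assuming a model id plus the sentence's token ids as the key. The class name `TranslationCache`, the method names and the LRU eviction policy are assumptions for illustration; the eviction strategy used in bergamot-translator is not described here.

```python
from collections import OrderedDict


class TranslationCache:
    """Tiny LRU cache keyed by (model_id, token ids); shared by all models of one service."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()

    def fetch(self, model_id, words):
        key = (model_id, tuple(words))
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def store(self, model_id, words, history):
        key = (model_id, tuple(words))
        self.entries[key] = history
        self.entries.move_to_end(key)
        while len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

Including the model id in the key is what lets one cache serve several models without a sentence translated by one model being returned for another.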

* Set PR to any branch to trigger workflows (#230)

* [ssplit-cpp] Enable position independent library when compiled from sources (#240)

* EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243)

* Update config "skip-cost" to enable log probabilities for QE scores (#247)

- Updated wasm test page

* Recover logging (#226)

* Deprecate hardAlignment in favour of softAlignment (#250)

* Updated marian submodule (#256)

* Update ssplit cpp, pcre2 source compile to fix broken builds (#258)

* Update ssplit cpp, pcre2 source compile to fix tests

* Syncing with browsermt/ssplit-cpp

* Removing accidental binary inclusion

* Removing brt accidental update by git add -u

* Fix windows workflow, vcpkg is broken use our cmake route

* [ssplit-cpp] Try searching different library names for Windows

* Fixes windows workflow for PCRE2  (#260)

* Fix badge to point to this repo instead mozilla's (#261)

* Make script run from any directory (#262)

* Make script run from any directory

* Import optimized gemm implementation (when available) for wasm target (#265)

* Enable importing optimized gemm module for wasm

 - Updated emscripten generated JS code to
   -- import and use the optimized gemm module when available, otherwise
     use fallback gemm implementation

* Added logging for gemm implementation being used for wasm target

* HTML input (#253)

Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>
Co-authored-by: Abhishek Aggarwal <aaggarwal@mozilla.com>

* HTML handling improvements (#266)

* Fix out-of-bounds error when determining alignment for whole word

If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds.

* Don't segfault if alignment info is not available

When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token.

* Partial fix for handling empty elements

This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side.

* Fix formatting

* Updated marian-dev submodule

* Updated configuration for html text translation to work in wasm test page (#269)

* Updated translator configuration in wasm test page
 - Added alignment: soft

* Set ResponseOptions::alignment to "true"
 - Had to be set for html text translation to work

* More robust logic to import wasm gemm (#276)

- Import optimized gemm implementation only if all the necessary functions
   are provided by it, otherwise use the fallback gemm
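
The selection logic can be sketched as below, with modules modelled as plain dictionaries of callables. The function names listed in `REQUIRED` are hypothetical placeholders, not the actual list of gemm exports.

```python
# Hypothetical export names; the real module's required functions are not listed here.
REQUIRED = ("int8_prepare_a", "int8_prepare_b", "int8_multiply")


def select_gemm(optimized, fallback):
    """Use the optimized module only if it provides every required function."""
    if optimized is not None and all(callable(optimized.get(name)) for name in REQUIRED):
        return optimized, "optimized"
    return fallback, "fallback"
```

Checking the full set up front avoids a half-initialised state where some calls go to the optimized module and others fail at translation time.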

* Constrain mistune to fix docs CI (#278)

* Additional logs in JS translation worker (#277)

- Print source text received in the response
 - Print no. of block elements in the input

* Proper arch setting on win32 (#275)

* Proper arch detection on win32

* Whoops

* Remove value length limit from HTML parser & interpolated alignments (#274)

* Remove InterpolateAlignment

And some code improvements

* Replace the fixed value buffer with a std::string backing

* Fix tests that had no alignment info

These depended on the linear interpolation that I removed

* Remove arbitrary limits on tag and attribute names

This might also fix a bug caused by the eager lower casing of tag names, which could break <![CDATA , <style> and <script>

* Remove equals() in favour of operator==()

I trust the compiler can come up with better optimisations than I can.

* Expose std::strings instead of their data

Should save us some std::strlen() calls

* Add & remove headers and no-longer-defined functions from header files

* Remove all string buffers from xh_scanner

It now directly refers to either the input stream or constant strings

* Replace custom string_view with even lighter struct that's only used internally

To the outside world we just expose std::string_view

* Remove __builtin_sub_overflow for MSVC

* ABORT if trying to restore HTML when no alignment info is available

* Add test cases specifically for xh_scanner

Both good for testing regression, and as a little example/reference for what behaviour to expect from it.

* Add --html option to bergamot for tests

This should make it easier to have some integration tests for HTML input

* Add test and fix for empty inputs failing due to alignment check

Co-authored-by: Jerin Philip <jerinphilip@live.in>

* Disabled importing optimized gemm module (#282)

- Until the optimized gemm module stops requiring
   Shared Array Buffer, we can't really use it in
   Firefox

* Adding circle ci job to push the wasm artifacts to github releases (#280)

* Adding circle ci job to push the wasm artifacts to github releases.
* Updated config.yml

* Increase HTML test coverage (#279)

* Fix bug in HasAlignments check

When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it.

* Add more unit tests for xh_scanner

Trying to increase that code coverage to 100%

* Add test for whitespaces around attributes

* Make accessing value(), attr_name() and tag_name() at the wrong time safer

* Fix bug in <style> and <script> parsing

The end tag was never found

* Fix parsing of mix of valueless and quoteless attributes

* Sync list of void tags with Firefox' implementation of outerHTML and innerHTML

Also let's use their name for it: IsVoidTag instead of IsEmptyElement. Empty was a bit ambiguous.

* Bring back support for processing instructions in xh_scanner

I noticed in https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 that these can be produced by innerHTML under some circumstances.

* More permanent link

* Use CamelCase for the internal functions I added

* Rename *_PI to *_PROCESSING_INSTRUCTION

Your IDE will do the typing for you anyway

* Match symbol naming of the rest of code base

CapitalCase for classes, camelCase for functions, snake_case for variables still.

* Missed one 😴

* Change xhscanner's variable case also to camelCase

* Partially fix case variables in html.cpp

* Better command-line with isolation for both Services and co-located defaults and parsing (#252)

* CLI Rework

* Consolidate common tests, template specialize CLI

* Remove remnant cache stuff

* [BRT]: Run BRT with new cli

* Formalizing bridge

* Removing stuff from parsing and moving to TestSuite

* Template includes, everything consolidating at tests

* Inlining readFromStdin

* Removing unnecessary headers

* Checking in template implementation which was missing

* Sane defaults, some catches at BRT

* BRT: Install fixes

* Updating marian-dev to point to main

* Removing the enum indirection, using strings at one place, directly

* Fix typo;

* [BRT] test blocking service via native

* Conservative defaults for workers and cache-mutex buckets in AsyncService

* Create proper barriers for cmdline app

* Build failure fixes

* Moving common, common-impl to a familiar structure

* Binary reorganization: async, blocking, wasm

- async tests AsyncService
- blocking tests BlockingService
- wasm arranges tests for things that are Mozilla requirements. eg:
    - bytearray
    - multiple sentences in same translate request workflow.

* [brt] updates to adapt to cli rework

* [brt] updates to adapt to cli rework, all working

* Empty commit, sync brt online and run GitHub CI

* Switch for parser to have multiple mode or not

* [brt]: Fix for --bergamot-mode being removed from CLI app

* [brt]: Fix for --bergamot-mode being removed from CLI app

* [brt]: Removing remnant faithful translation test from blocking/

* HTML transfer empty elements (#283)

* Fix test case

This should now be implemented

* Remove FilterEmpty

This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack.

* Insert skipped empty source spans into target HTML

Also refactor variable names to better match their contents and be more consistent with each other.
This implementation passes all test cases, finally!

* Fix remaining style changes

* Move HTML formatting to its own section

That code had become exact copies in three different places

* CI: Circle CI config script update (#287)

- Robust artifact presence check
 - Variable name refactoring
 - Storing only those artifacts that are required
 - Remove commit sha from the names of the Github Releases
 - Use BERGAMOT_VERSION file contents for Git Tag names

* GitHub CI: Update YAML to run all tests on marian-full (#292)

Previously there were #native tags and #wasm tags separating the two.
There is now a clear separation between async, blocking and wasm.

* HTML basic integration tests (#291)

* Fix typo in BRT args on CI runs (#294)

* Turn logging off by default, allow turning on via config/cmdline (#295)

* Turn logging off by default, allow turning on via config/cmdline
* No need to store config in member variable if things are decided at construction time

* cache: threadsafety-fixes; optional stats collection (#245)

* Make stats hits misses atomic to guard when mutex has multiple buckets
* Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable
* -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional
* Make stats() call without enabling build fatal abort

* Have alignments placed if HTML is on (#296)

* HTML transfer script/style/etc elements (#285)

* CI guaranteed example documentation (#300)

* Convert marian-integration markdown to rst
* Convert native run into a script, include in rst
* Check with CI that the native running example works without fail

* Defer model loading to parallel worker thread (#303)

* Treat most HTML elements as word-breaking (#286)

* First class pivot translation capability (#236)

Translates a text from source-language to target-language through a
pivot-language. Effectively runs models in series, while having the
following additional benefits compared to when `Service::translate(...)`
would be used repeatedly.

1. Consistency in sentences between source and target. Consistent
creation of the alignment matrix for use in downstream tasks like
tag-translation.

2. Efficient sentence-splitting (does not sentence-split twice, creating
inconsistencies).

3. The `Response` generated can be used as if it were coming through
`translate(...)`, eliminating any need for additional code for clients
in JS or python or C++.

`AsyncService::pivot(...)` is provisioned for C++ multi-threaded setting
and `BlockingService::pivotMultiple(...)` provisioned for blocking
use-case targeted at WebAssembly.

# [BRT]: Test additions, accompanying fixes

For `AsyncService`, a test-case involving en->es, es->en (same
vocabulary; another pair might give more coverage but is too much work):

1. Asserts the Alignment generated after pivoting is a probability
distribution over source tokens given target.

2. Outputs the sentences going from en->en, which should stay consistent
over continuous development to ensure nothing breaks.

3. An accuracy minimum of 70% of token matches from source to target
calibrated on the standard bergamot input text is additionally present,
ensuring that the English tokens at start and end match exactly.
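
The check in item 1 can be pictured with a small sketch. Composing the two alignment matrices by matrix multiplication is one natural way to obtain a source-given-target alignment through the pivot; it is shown here as an assumption, not as the method used in bergamot-translator, and both function names are hypothetical.

```python
def compose(target_to_pivot, pivot_to_source):
    """Combine two row-stochastic alignment matrices through the pivot tokens."""
    n_source = len(pivot_to_source[0])
    return [
        [sum(row[p] * pivot_to_source[p][s] for p in range(len(row))) for s in range(n_source)]
        for row in target_to_pivot
    ]


def is_distribution(rows, tol=1e-6):
    """Each row must be non-negative and sum to 1 (a distribution over source tokens)."""
    return all(abs(sum(r) - 1.0) <= tol and all(x >= 0 for x in r) for r in rows)
```

If every row of both inputs is a probability distribution, every row of the product is one too, which is the property the test asserts.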

# HTML Pipeline

This PR reworks the HTML translation pipeline to be outside
response-construction via callbacks.

* Accept XHTML-style self-closing void tags (#305)

Allow the self-closing `/>` end for void tags. For non-void tags these
were already "allowed" due to how the HTML parser works, but for
elements where they actually occur, like `<br/>`, they caused a parse
error. Support for them was not implemented since we only expect valid
HTML5, e.g. the output of Firefox' Element.innerHTML.

Use case: TranslateLocally uses Qt's HTML representation of rich text.
That HTML uses self-closing tags like `<meta .../>` and `<br/>`.
Implementing a string replace operation that would only match these
elements without parsing HTML is tricky. Fixing it in
bergamot-translator is not.

Implementation: Currently `<img>` is marked as a void tag (an element
which cannot have children or text), and is therefore treated differently.
Since void tags normally have no close tag, they are treated as
immediately closed. The HTML parser we use reads `<img/>` as
`<img></img>` which thus causes a problem since now we close an element
that was never open, to begin with.

This fix ignores the `TT_TAG_END` token from the parser when the tag
name is that of a void tag.
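
A simplified sketch of that rule, tracking only the stack of open elements for a stream of `(kind, name)` tokens. `TT_TAG_END` comes from the description above; `TT_TAG_START`, the contents of `VOID_TAGS` and the function name are assumptions made for the example.

```python
VOID_TAGS = {"br", "img", "meta", "hr", "input"}  # illustrative subset only


def open_elements_after(tokens):
    """Return the stack of still-open elements after consuming (kind, name) tokens."""
    stack = []
    for kind, name in tokens:
        if kind == "TT_TAG_START":
            if name not in VOID_TAGS:
                stack.append(name)  # void tags count as immediately closed
        elif kind == "TT_TAG_END":
            if name in VOID_TAGS:
                continue  # the fix: ignore end tokens for void tags
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
    return stack
```

With this rule, the tokens produced for `<p><br/></p>` leave an empty stack instead of raising on the synthetic `</br>`.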

* Streamline memory-bundle loads (#307)

Provides an additional constructor which takes care of the bundle
loading inside the boundary of the source here, when a configuration
file is supplied from a client like translateLocally or python bindings.
Once the config file is read, we have access to the information required
to construct the MemoryBundle.

 - The command-line application supplied from here, app/bergamot is
   configured to use the fast-load path now.
 - Changes to binary-loading additionally revealed a bug in the
   example-run script used in docs and tied to CI and the fix is
   included.
 - Shortlist is made optional in the memory bundle, making changes to
   getModelMemoryFromConfig.

Fixes #304.
Fixes #306.
See also: XapaJIaMnu/translateLocally#82.

* Add API to trigger fast shutdown of AsyncService (#297)

Add a way to AsyncService to shut down without finishing the full queue
through `AsyncService::clear()`. The default behaviour is that
`AsyncService::~AsyncService()` will wait for any pending translation
requests to finish.

One can call `AsyncService::clear()` before the calls to the destructor
to ensure there is no work for the service to finish before the workers
can stop and join. Marian batches that are already in progress will not
stop. We are not trying to cause interrupts in threads or something that
complex. However, these single batches often do not take that long to
complete.
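
The intended semantics can be sketched without threads, assuming a simple queue of pending requests plus one unit of work already in progress. `PendingQueue` and its methods are hypothetical names that model the behaviour described above, not the actual classes.

```python
from collections import deque


class PendingQueue:
    """Models 'drop what has not started, let what has started finish'."""

    def __init__(self):
        self.pending = deque()
        self.in_progress = None

    def enqueue(self, request):
        self.pending.append(request)

    def start_next(self):
        """A worker picks up the next request, or None when nothing is left."""
        self.in_progress = self.pending.popleft() if self.pending else None
        return self.in_progress

    def clear(self):
        """Drop all pending requests; work already in progress is untouched."""
        dropped = len(self.pending)
        self.pending.clear()
        return dropped
```

After `clear()`, a worker asking for more work gets nothing and can stop, which is what lets shutdown proceed quickly.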

Changes:

 - Add clear() to AsyncService
 - Add clear() to BatchingPool
 - Documentation

See also:  XapaJIaMnu/translateLocally#80

* Speed up Windows CI with ccache (#308)

Use https://github.com/cristianadam/ccache/releases/ to speed up windows
compilation.

Remove /Zi as it is unsupported by ccache at the moment. This is a debug
flag that was removed in upstream marian-dev
https://github.com/browsermt/marian-dev/pull/43. However, the bergamot
CMakeLists.txt which was originally taken from
marian maintained this under MSVC.

* Remove unused compiler hash script (#309)

* Batteries included python package (#310)

Imports python bindings and associated sources incubated in
https://github.com/jerinphilip/lemonade to bergamot-translator. Adds
 a pybind11 dependency for python bindings.

Following the import, the python build is integrated into the existing 
CMake based build system here. There is a command-line application 
provided through python which provides the ability to fetch and prepare 
models from model-repositories (like browsermt/students or OPUS).

Wheels built for a few common operating systems are provided via GitHub
releases through automated actions configured to run at tagged semantic
versions and pushes to main.

The documentation for python is also integrated into our existing
documentation setup. The previous documentation GitHub action is now
configured to run after the python builds on Ubuntu 18.04 with Python 3.7,
in order to pick up the bergamot module packaged as a wheel and build the
sphinx documentation that uses the python module.

Formatting checks with black, isort with profile black, and a pytype type
checker are configured for the python component residing in this repository.

* BRT: Update to fix QE download failures (#321)

* Fix HTML with pivoting (#323)

Previously BlockingService pivoting missed preproc and postproc for HTML
leading to issues in WebAssembly API. This change adds fixes for the
same, along with test coverage for the functionality over both async and
blocking services.

* Remove obsolete workflow transferring source across forks (#326)

* Wasm/JS: Pivot translation API JS binding and test page update (#327)

* emscripten: ccache and artefact upload (#325)

Enables ccache for emscripten. The configuration uses pyodide for a
reference (https://github.com/pyodide/pyodide/pull/1805).

The two workflows running on macOS and Ubuntu are reduced to one on
Ubuntu. As emscripten and its target are cross-platform, and macOS
runners are limited, it makes sense to remove the macOS one.

Upload artefact enabled in preparation for a release action to be
scheduled which will upload the bergamot*.wasm and bergamot*.js for
consumption.

* Consolidate release artefacts (#329)

Brings the previous wasm.yml into python.yml, and the new file is
renamed build.yml.

python.yml already had version and pre-release jobs. These jobs
downloaded artefacts from previously run jobs (python wheel builds). The
newly attached emscripten build now uploads artefacts of a WebAssembly
binary and javascript file which are fed into the release and
pre-release jobs in addition to the existing python builds.

* Increment version to v0.4.0 (#328)

* Make default throw exception on abort for python (#333)

This also allows conversion of exiting aborts into runtime errors in python,
providing informative messages to the user via pybind11's existing tooling.

* Revert "Make default throw exception on abort for python (#333)"

This reverts commit 97bd6e36dbdec3519133d91289d7fd31816cb09a.

As discussed, we need messages for debugging in -fno-exceptions.

* Revert "Revert "Make default throw exception on abort for python (#333)""

This reverts commit 62ff781ed4ea642912878145beaf3157123520fe.

Sorry I should have realized Jerin was only amending python and
therefore this didn't break WASM.

Apologies to Jerin on this.

* JS/WASM: Re-enable importing optimized gemm module for (#336)

- Re-enabled the code that imports optimized gemm module
   for wasm when available

* Print errors by default in WASM build (#343)

* Remove BadHTML exception in favour of ABORT macro
   `ABORT()` gives us readable error messages, even when exception support is disabled.
* Control marian exception global setting in tests through fixture
* WASM: construct BlockingService with critical logging by default
   This log level is only used by ABORT()

See also: 
- mozilla/firefox-translations#65, 
- mozilla/firefox-translations#68
- mozilla/firefox-translations#70 
- mozilla/firefox-translations#56

* Add ability to load `.npz` models (#342)

Changes `ABORT` on non `.bin` model to an additional check for a `.npz` 
extension. If `.bin`, the fast load path is activated by returning `AlignedMemory`. 
Otherwise, the return of empty `AlignedMemory` causes fallback to
filesystem-based loads.
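
The extension check can be sketched as follows. `get_model_memory` and its `read_binary` parameter are hypothetical, bytes stand in for `AlignedMemory`, and raising on any other extension is an assumption standing in for the remaining `ABORT`.

```python
import os


def get_model_memory(path, read_binary):
    """Return bytes for `.bin` models, empty bytes for `.npz`, and raise otherwise."""
    extension = os.path.splitext(path)[1]
    if extension == ".bin":
        return read_binary(path)  # fast path: memory is loaded up front
    if extension == ".npz":
        return b""  # empty memory signals a later filesystem-based load
    raise ValueError(f"unsupported model extension: {path}")
```

Returning empty memory rather than a separate flag keeps the caller's logic the same as for the other optional in-memory inputs.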

BRT: A test that checks if translation using `.npz` is approximately similar to 
that of default CLI translation is checked in to ensure stability going ahead.

Previously, we only supported `.bin` models' loading via a fast mmap 
path. While we had the underlying capability to load non `.bin` models, this 
was not exposed, encouraging fast loads. Loading `.npz` models is helpful
for quick debugging and broader coverage of models available, which will
enhance user experience at translateLocally and python bindings.


Fixes #341.
See also: XapaJIaMnu/translateLocally#89

* Allow per-input options (#346)

Changes signature of BlockingService::{translate,pivot}Multiple
functions to take per input options, so a mix of HTML and plaintext
can be sent from the extension. Templating over testing is adjusted
to allow for continuous evaluations by modifying the test code.

Updates WebAssembly bindings to reflect the change in signature
and the javascript test-page to work with the new bindings.

This change lacks an accompanying test specific to the mixed HTML
and plaintext inputs.

Fixes: #345
See also: mozilla/firefox-translations#94
Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>

* JS/WASM: Passing ResponseOptions for every item for translation batch api (#348)

- Now translate() JS API accepts ResponseOptions per batch item

 - Fixed the logic to create vector<ResponseOption>

* Update aligned vector following intgemm 1b8cbd6f611c21011325cfe0312940f0635dea33 (#334)

Fixes memory leak
ifdef for -fno-exceptions including clang-cl
Move spacing back to intgemm upstream

Co-authored-by: Jerin Philip <jerin.philip@research.iiit.ac.in>

* Improve cache (#347)

Hide `cache-mutex-buckets` from the user. Now configured to be equal to the
number of workers. Python bindings which had exposed these are modified to
reflect the API change. `std::optional` enabled on cache, constructed only if
enabled. Pointers used are replaced with an equivalent `std::optional`.

Fixes: #317

* JS: Refactoring wasm test page (#354)

* Free all the objects properly that were constructed for translation api
* Refactored pivot detection mechanism

* Create github release via CircleCI only for mozilla fork (#349)

* Create github release via circleci only for mozilla fork

 - The extension uses mozilla fork for translator artifacts
   -- Hence create github release via circleci only when
      running in mozilla fork

* Small refactoring in ci script

* Bump version to 0.4.1 (#356)

* Improve handling HTML special cases (#312)

- Prefer spreading markup over a full word.
- Ignore certain tags that are unlikely to be supposed to be translated,
  such as `<code>` and `<samp>`.
- Never treat `<wbr>` as a space.
- Allow for inconsistent cases in tag names.
- Fix bug where void elements were inserted multiple times.
- Better handling of whitespace around punctuation.
- Ignore parsing `<noscript>` to be compatible with Firefox.
- Improvements to documentation and readability of `HTML` and `Scanner`
  classes.

Fixes: #313, #339

* Simplify cache config and bind for use in JS (#359)

Deprecates cacheEnabled parameter to be replaced with cacheSize=0.
Python bindings, Documentation in comments and tests updated to reflect
this change.

Exposes the fields corresponding to cache via embind as a value object.
The equivalent object-based syntax in worker.js allows propagation
from JS.

Fixes: #351
See also: mozilla/firefox-translations#96

* Embed quality-scores as HTML tag attributes (#358)

Quality scores for HTML translation exposed as <font
x-bergamot-sentence-score=""> and <font x-bergamot-word-score=""> tags
in the HTML output. While this increases the size of the HTML returned,
the resulting rendered HTML can easily be styled to show the scores.
With Javascript or CSS, developers can easily have some interface based
on these extra attributes.

Also includes updates to the test page to show a proof-of-concept 
demonstration.
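
As a rough illustration of how a consumer might read these attributes back out of the returned HTML, here is a minimal Python sketch using only the standard library. The attribute names come from the description above; the sample markup, the class name `ScoreCollector` and the helper `collect_scores` are hypothetical and not part of the project.

```python
from html.parser import HTMLParser


class ScoreCollector(HTMLParser):
    """Collect quality scores from <font> tags in translated HTML."""

    def __init__(self):
        super().__init__()
        self.sentence_scores = []
        self.word_scores = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "font":
            return
        for name, value in attrs:
            if not value:
                continue  # skip empty or valueless attributes
            if name == "x-bergamot-sentence-score":
                self.sentence_scores.append(float(value))
            elif name == "x-bergamot-word-score":
                self.word_scores.append(float(value))


def collect_scores(html):
    """Return (sentence_scores, word_scores) found in the given HTML."""
    parser = ScoreCollector()
    parser.feed(html)
    parser.close()
    return parser.sentence_scores, parser.word_scores


# Hypothetical output; the real nesting of the tags may differ.
sample = (
    '<font x-bergamot-sentence-score="-0.3">'
    '<font x-bergamot-word-score="-0.1">Hallo</font> '
    '<font x-bergamot-word-score="-0.9">Welt</font>'
    "</font>"
)
sentence_scores, word_scores = collect_scores(sample)
```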

Fixes: #355

* Enable dependabot to automate updating dependencies (#365)

Following marian-nmt/marian-dev.

* Use right range and threshold for showing "bad" words/sentences (#370)

* Use ln(0.5) as the threshold
* Use right range for showing "bad" words/sentences
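
A minimal sketch of the thresholding idea in Python, assuming the scores are natural-log probabilities (so they are at most 0 and ln(0.5) corresponds to a probability of 0.5). The function names are hypothetical; the test page itself is written in JavaScript.

```python
import math

# Assumption: scores are natural-log probabilities, so ln(0.5) ~ -0.693.
BAD_THRESHOLD = math.log(0.5)


def is_bad(score):
    """Return True when a score falls strictly below the ln(0.5) threshold."""
    return score < BAD_THRESHOLD


def flag_bad(scores):
    """Return the indices of the scores considered "bad"."""
    return [i for i, s in enumerate(scores) if is_bad(s)]
```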

* Bump version to 0.4.2 (#371)

* Bump 3rd_party/marian-dev from `08b1544` to `7e67124` (#372)

Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `08b1544` to `7e67124`.
- [Commits](https://github.com/browsermt/marian-dev/compare/08b1544636fe13eaf1fbacb17c6fb050abfb8d42...7e67124ae0bc11b42f2e6373489831c9a2498499)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* JS: Reuse Model registry from firefox-translation-models for test page (#377)

* JS: Reuse Model registry from firefox-translation-models repo for test page

 - https://github.com/mozilla/firefox-translations-models/blob/main/registry.json
   is reused
 - Removed existing registry

* JS: Using supervised QE models for available language pairs (#378)

* JS: Refactored model loading
 - Passing single vocab memory via JS
* JS: Use supervised QE models when available
* Ran clang format

* Bump 3rd_party/marian-dev from `7e67124` to `844800e` (#382)

Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `7e67124` to `844800e`.
- [Release notes](https://github.com/browsermt/marian-dev/releases)
- [Commits](https://github.com/browsermt/marian-dev/compare/7e67124ae0bc11b42f2e6373489831c9a2498499...844800efccba6e670250caac1735ca2c8c8e508e)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* JS: Update languages & use Intl API for their display names (#379)

Got the languages from registry.json, including non-prod models. 
Code now calls into `Intl.DisplayNames()`[1] to make life easier.

[1] (http://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/DisplayNames/DisplayNames)

* JS: Fix swap button on test-page (#388)

* Docs: Pin Jinja2 to last known working version (#389)

Fixes the docs workflow, which has been failing since pip started picking up Jinja 3.20. 
We only need >=2.3; this pins it to 3.0.3, the last version with which builds were successful.

* Bump version to 0.4.3 (#392)

* Bump bergamot-translator-tests from `d03a9d3` to `7984d14` (#394)

Bumps [bergamot-translator-tests](https://github.com/browsermt/bergamot-translator-tests) from `d03a9d3` to `7984d14`.
- [Release notes](https://github.com/browsermt/bergamot-translator-tests/releases)
- [Commits](https://github.com/browsermt/bergamot-translator-tests/compare/d03a9d316d40ba45c475018287971523666bf51e...7984d140aef00489699d0b7711fa942816224294)

---
updated-dependencies:
- dependency-name: bergamot-translator-tests
  dependency-type: direct:production
...

* Fix call to `isspace` (#396)

Documentation is explicit about only calling it with unsigned char, and Windows runtime is checking this.

* Bump 3rd_party/ssplit-cpp from `a08d6bc` to `49fde6d` (#408)

Bumps [3rd_party/ssplit-cpp](https://github.com/browsermt/ssplit-cpp) from `a08d6bc` to `49fde6d`.
- [Release notes](https://github.com/browsermt/ssplit-cpp/releases)
- [Commits](https://github.com/browsermt/ssplit-cpp/compare/a08d6bce20619a8475736832d5418458c14db9d4...49fde6df7ee9199aedb9571be800448192e3515c)

---
updated-dependencies:
- dependency-name: 3rd_party/ssplit-cpp
  dependency-type: direct:production
...

* Update and fix windows CI (#410)

* Use a more vanilla windows workflow from translateLocally, remove the
complicated lukka/*. Also removes port overrides in the overall upgrade.
* Disable vcpkg binary caching
* Remove PCRE library hacks after upstream ssplit improvements

* Upgrade emsdk to 3.1.8 (#414)

* Rework WASM compilation options

Necessary to work with newer versions of emscripten that are more picky about which option goes to the compiler, and which to the linker. Also took the opportunity to remove the need for the patching of the bergamot-translation-worker.js file, this can now easily be done through supported apis. Furthermore, I tried to downsize the generated javascript and wasm code a bit.

Initial estimates show that bergamot-translator compiled with emscripten 3.0.0 runs at about 3x the speed of 2.0.9 (when using embedded intgemm). Speed-up when using mozIntGemm is less dramatic.

* Updated marian-dev submodule
* Revert changes specific to patching external gemm modules for wasm
* Better Compilation and Link flags

 - Added "-O3" optimization flag for linking as well
 - "-g2" only for release and debug builds
 - "-g1" for release builds
 - Replaced deprecated "--bind" flag with "-lembind"
 - Removed redundant link flag

* Upgraded emsdk to 3.1.8
* Enclosed EXPORTED_FUNCTIONS values in a list
* Fixed the remaining 2.0.9 reference in circle ci build script
* Updated README

Co-authored-by: Jelmer van der Linde <jelmer@ikhoefgeen.nl>

* Bump version to 0.4.4 (#415)

* Bump 3rd_party/marian-dev from `199201e` to `e88c1aa` (#416)

Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `199201e` to `e88c1aa`.
- [Release notes](https://github.com/browsermt/marian-dev/releases)
- [Commits](https://github.com/browsermt/marian-dev/compare/199201eb89b2941afdadb14164e936d412f897ad...e88c1aa5d5c5622cb52c7df09fbb7c3d7f4b5b5a)

---
updated-dependencies:
- dependency-name: 3rd_party/marian-dev
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Set up python packaging for pypi distribution (#424)

The old GitHub CI jobs that explicitly used Ubuntu and MacOS to build
wheels have been removed in favour of the more portable pypa-specified
builds. These wheels should work just as well across a wider range of
distributions.

pybind11:CMakeLists.txt requires Development.Module instead of
Development.* to avoid Embed from getting in the way of manylinux
builds.

manylinux_x86_64 builds are added for cp3.6 - 3.10. The linux build
uses an old image via docker.  Since the docker images are able to use
a shared ccache folder, it builds quite fast on warm starts.

ccache usage in setup.py is now triggered by an environment variable.
This allows builds not to fail if ccache is not present.
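
One way such an opt-in could look is sketched below. This is an assumption-laden illustration, not the project's setup.py: the variable name `USE_CCACHE` and the function `ccache_cmake_args` are hypothetical, and the silent fallback when ccache is missing is a choice made for this sketch. `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` are standard CMake variables.

```python
import os
import shutil


def ccache_cmake_args(environ=None):
    """Return CMake launcher flags when ccache is requested and installed.

    USE_CCACHE is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    if environ.get("USE_CCACHE", "0") != "1":
        return []  # not requested: plain build
    ccache = shutil.which("ccache")
    if ccache is None:
        return []  # requested but not installed: build without it
    return [
        f"-DCMAKE_C_COMPILER_LAUNCHER={ccache}",
        f"-DCMAKE_CXX_COMPILER_LAUNCHER={ccache}",
    ]
```

The returned list can be appended to whatever CMake arguments the build step already passes, so a machine without ccache still builds normally.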

On tag pushes corresponding to versions, CI is configured to deliver
built wheels to PyPI, reading from repository secrets.

Improves setup.py including documentation and some formatting, and
additional links to source.

Fixes: #315

* Basic HTML property testing for WebAssembly (#425)

Import
https://gist.github.com/jelmervdl/a4c8b6b92ad88a885e1cbd51c6ad4902 and
attach it to CI.  NodeJS-14 fails when trying to use the WebAssembly
binary, so we use an independently set up node-16.  This paves the way
for more complicated testing of WebAssembly bindings in the future.

* Bump version to 0.4.5 (#427)

* Python package: pyyaml >= 5.1 (#429)

Fixes an issue on Colab where the vanilla YAML install (3.x) does not have
yaml.FullLoader (https://stackoverflow.com/a/55553392/4565794).

Fix a broken link for presentation in PyPI.

* Python: Work offline if models are available (#431)

Check whether models.json has already been downloaded; if it has, use it. 
If not, fall back to attempting to fetch it from the network.
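
The local-first lookup can be sketched roughly as follows. This is a simplified illustration rather than the package's actual code: `load_models_index` and its `fetch` argument (any callable returning the JSON text from the network) are hypothetical names.

```python
import json
from pathlib import Path


def load_models_index(path, fetch):
    """Use the local models.json if present, otherwise fetch and store it.

    `fetch` is a hypothetical callable that returns the JSON text; it is
    only invoked when no local copy exists.
    """
    path = Path(path)
    if path.exists():
        # Offline path: a previously downloaded copy is good enough.
        return json.loads(path.read_text(encoding="utf-8"))
    text = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return json.loads(text)
```

Passing the fetcher in as an argument keeps the network call out of the function, which makes the offline branch easy to exercise in a test.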

Fixes: #430

* MacOS Wheels (#432)

* Remove trailing whitespace
* Additional MacOS wheels: Wheels for python 3.6 to 3.10 with a 
   minimum target of MacOS 10.9
* Install bergamot package from wheel directory
* Remove no-index as we need dependencies

* update download path

* try to update coding_styles workflow

* Latest and greatest clang-format

* Bump qs and express in /wasm/test_page (#444)

Bumps [qs](https://github.com/ljharb/qs) to 6.11.0 and updates ancestor dependency [express](https://github.com/expressjs/express). These dependencies need to be updated together.


Updates `qs` from 6.7.0 to 6.11.0
- [Release notes](https://github.com/ljharb/qs/releases)
- [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0)

Updates `express` from 4.17.1 to 4.18.2
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/master/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2)

---
updated-dependencies:
- dependency-name: qs
  dependency-type: indirect
- dependency-name: express
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Arm updated (#443)

* ARM Support using ruy and simd_utils

* Adding ARM build on GitHub CI

* Add workflow and successful build

ssplit-cpp modified to get cross compiled android on GitHub CI working.

* Client side fixes for int8 no shift on ARM [python]

* Revert "Client side fixes for int8 no shift on ARM [python]"

This reverts commit 020af05a8b1f4b4ef46373e6e61dcd32869fc1b1.

* moving int8shift no-op inside the library

* Bump 3rd-party/marian-dev

* update the marian branch test

* arm backend works

* Latest and greatest clang-format

Co-authored-by: Jerin Philip <jerinphilip@live.in>

* Apply security update and formatting

* Expand the node-test.js example code with documentation (#434)

* Expand the node-test.js example code with documentation

Is there a better way to document code than by providing an annotated & working example of it? Just listing all the exposed methods feels like giving people a box of bricks and expecting them to build a house with it.

* Use @Jerin's feedback to simplify node-test.js explanations

* Use native `console.assert` instead

See #426 for an explanation

* Fix comment

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* More portable WASM demo (#437)

* Replace most of the wasm demo page with code from the firefox extension

This code should be more generic and copy/pastable into other projects. Maybe one day it will be an npm package?

* Fix Ukrainian model support

* Add quality estimation output

Automatically enabled when the model(s) support it

* Little "Translating…" indicator

* Don't make Safari fail on something tiny

* Rewire lots of async state to be able to predictably know when the translator is working or not

Previously so much was lazy loaded that it was not easy to catch lack of SIMD support. Now I can just enable the interface only after it has properly loaded.

* No need for a two-stage setup for the worker. Just promise to call `initialize()`!

* More (correct) types and comments for code

* Keyboard shortcuts for input area for bold, italic and underline.

Enough to demo mark-up translation

* Fix `delete()`

* Move javascript glue code into its own npm package

* Add nodejs support and test to package

* More stand-alone build command

…for now, not really used by anything I think

* Ignore build packages

* Use local filesystem for build so it is automatically cached

* fix overflow on demo page

But this might break the mobile demo? I'll have to check into that

* Bring back integrity check, except for NodeJS for now

* Make `build` part of `prepare` so we always make sure we build a complete package

* Move worker code into its own folder

This way I can mark it as a commonjs module, which will help nodejs treat the files the same way WebWorkers do right now. Firefox doesn't implement `{type: 'module'}` yet for WebWorkers.

* Add README

* Fix paths

* Add npm publish automation

* Make sure webpack ignores node compatibility code

* Add missing webpack:ignore around a worker

* Default to getting models from S3

* Separate "loading" and "translating" indicators

* Bump npm package version

* Add credits

* Don't block on the worker loading

* Not just Mozilla, but Bergamot!

* Make individual translation requests cancelable

* Swap button turns vertically when in skyscraper mode

* Make it easier to debug errors from inside the worker

* Don't bork on deleting a failed worker

* Don't bork on calling translate() with a failed worker

* Handle compilation error with more grace

* `contenteditable=true` seems to work better with some browser extensions

Looking at you, Vimium!

* Clean up abort promise

* Bump npm package version

* Remove `workerUrl` option in favour of better webpack support

With that option it was hard for Webpack to figure out dependencies, and it did not enter my worker script for rewriting. With the hardcoded url it does, and with a bit of `new webpack.DefinePlugin({'typeof self': JSON.stringify('object')}),` we can have webpack remove node-specific code on build!

* Bump version

Minor API change hehe

Co-authored-by: Nikolay Bogoychev <nheart@gmail.com>

* Fix comp…