From a2dddd818c51df56e42ca2d79d3388ec756c2309 Mon Sep 17 00:00:00 2001 From: abhi-agg <66322306+abhi-agg@users.noreply.github.com> Date: Mon, 19 Oct 2020 13:49:38 +0200 Subject: [PATCH 001/442] Initial commit --- LICENSE | 373 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 1 + 2 files changed, 374 insertions(+) create mode 100644 LICENSE create mode 100644 README.md diff --git a/LICENSE b/LICENSE new file mode 100644 index 000000000..a612ad981 --- /dev/null +++ b/LICENSE @@ -0,0 +1,373 @@ +Mozilla Public License Version 2.0 +================================== + +1. Definitions +-------------- + +1.1. "Contributor" + means each individual or legal entity that creates, contributes to + the creation of, or owns Covered Software. + +1.2. "Contributor Version" + means the combination of the Contributions of others (if any) used + by a Contributor and that particular Contributor's Contribution. + +1.3. "Contribution" + means Covered Software of a particular Contributor. + +1.4. "Covered Software" + means Source Code Form to which the initial Contributor has attached + the notice in Exhibit A, the Executable Form of such Source Code + Form, and Modifications of such Source Code Form, in each case + including portions thereof. + +1.5. "Incompatible With Secondary Licenses" + means + + (a) that the initial Contributor has attached the notice described + in Exhibit B to the Covered Software; or + + (b) that the Covered Software was made available under the terms of + version 1.1 or earlier of the License, but not also under the + terms of a Secondary License. + +1.6. "Executable Form" + means any form of the work other than Source Code Form. + +1.7. "Larger Work" + means a work that combines Covered Software with other material, in + a separate file or files, that is not Covered Software. + +1.8. "License" + means this document. + +1.9. 
"Licensable" + means having the right to grant, to the maximum extent possible, + whether at the time of the initial grant or subsequently, any and + all of the rights conveyed by this License. + +1.10. "Modifications" + means any of the following: + + (a) any file in Source Code Form that results from an addition to, + deletion from, or modification of the contents of Covered + Software; or + + (b) any new file in Source Code Form that contains any Covered + Software. + +1.11. "Patent Claims" of a Contributor + means any patent claim(s), including without limitation, method, + process, and apparatus claims, in any patent Licensable by such + Contributor that would be infringed, but for the grant of the + License, by the making, using, selling, offering for sale, having + made, import, or transfer of either its Contributions or its + Contributor Version. + +1.12. "Secondary License" + means either the GNU General Public License, Version 2.0, the GNU + Lesser General Public License, Version 2.1, the GNU Affero General + Public License, Version 3.0, or any later versions of those + licenses. + +1.13. "Source Code Form" + means the form of the work preferred for making modifications. + +1.14. "You" (or "Your") + means an individual or a legal entity exercising rights under this + License. For legal entities, "You" includes any entity that + controls, is controlled by, or is under common control with You. For + purposes of this definition, "control" means (a) the power, direct + or indirect, to cause the direction or management of such entity, + whether by contract or otherwise, or (b) ownership of more than + fifty percent (50%) of the outstanding shares or beneficial + ownership of such entity. + +2. License Grants and Conditions +-------------------------------- + +2.1. 
Grants + +Each Contributor hereby grants You a world-wide, royalty-free, +non-exclusive license: + +(a) under intellectual property rights (other than patent or trademark) + Licensable by such Contributor to use, reproduce, make available, + modify, display, perform, distribute, and otherwise exploit its + Contributions, either on an unmodified basis, with Modifications, or + as part of a Larger Work; and + +(b) under Patent Claims of such Contributor to make, use, sell, offer + for sale, have made, import, and otherwise transfer either its + Contributions or its Contributor Version. + +2.2. Effective Date + +The licenses granted in Section 2.1 with respect to any Contribution +become effective for each Contribution on the date the Contributor first +distributes such Contribution. + +2.3. Limitations on Grant Scope + +The licenses granted in this Section 2 are the only rights granted under +this License. No additional rights or licenses will be implied from the +distribution or licensing of Covered Software under this License. +Notwithstanding Section 2.1(b) above, no patent license is granted by a +Contributor: + +(a) for any code that a Contributor has removed from Covered Software; + or + +(b) for infringements caused by: (i) Your and any other third party's + modifications of Covered Software, or (ii) the combination of its + Contributions with other software (except as part of its Contributor + Version); or + +(c) under Patent Claims infringed by Covered Software in the absence of + its Contributions. + +This License does not grant any rights in the trademarks, service marks, +or logos of any Contributor (except as may be necessary to comply with +the notice requirements in Section 3.4). + +2.4. 
Subsequent Licenses + +No Contributor makes additional grants as a result of Your choice to +distribute the Covered Software under a subsequent version of this +License (see Section 10.2) or under the terms of a Secondary License (if +permitted under the terms of Section 3.3). + +2.5. Representation + +Each Contributor represents that the Contributor believes its +Contributions are its original creation(s) or it has sufficient rights +to grant the rights to its Contributions conveyed by this License. + +2.6. Fair Use + +This License is not intended to limit any rights You have under +applicable copyright doctrines of fair use, fair dealing, or other +equivalents. + +2.7. Conditions + +Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted +in Section 2.1. + +3. Responsibilities +------------------- + +3.1. Distribution of Source Form + +All distribution of Covered Software in Source Code Form, including any +Modifications that You create or to which You contribute, must be under +the terms of this License. You must inform recipients that the Source +Code Form of the Covered Software is governed by the terms of this +License, and how they can obtain a copy of this License. You may not +attempt to alter or restrict the recipients' rights in the Source Code +Form. + +3.2. 
Distribution of Executable Form + +If You distribute Covered Software in Executable Form then: + +(a) such Covered Software must also be made available in Source Code + Form, as described in Section 3.1, and You must inform recipients of + the Executable Form how they can obtain a copy of such Source Code + Form by reasonable means in a timely manner, at a charge no more + than the cost of distribution to the recipient; and + +(b) You may distribute such Executable Form under the terms of this + License, or sublicense it under different terms, provided that the + license for the Executable Form does not attempt to limit or alter + the recipients' rights in the Source Code Form under this License. + +3.3. Distribution of a Larger Work + +You may create and distribute a Larger Work under terms of Your choice, +provided that You also comply with the requirements of this License for +the Covered Software. If the Larger Work is a combination of Covered +Software with a work governed by one or more Secondary Licenses, and the +Covered Software is not Incompatible With Secondary Licenses, this +License permits You to additionally distribute such Covered Software +under the terms of such Secondary License(s), so that the recipient of +the Larger Work may, at their option, further distribute the Covered +Software under the terms of either this License or such Secondary +License(s). + +3.4. Notices + +You may not remove or alter the substance of any license notices +(including copyright notices, patent notices, disclaimers of warranty, +or limitations of liability) contained within the Source Code Form of +the Covered Software, except that You may alter any license notices to +the extent required to remedy known factual inaccuracies. + +3.5. Application of Additional Terms + +You may choose to offer, and to charge a fee for, warranty, support, +indemnity or liability obligations to one or more recipients of Covered +Software. 
However, You may do so only on Your own behalf, and not on +behalf of any Contributor. You must make it absolutely clear that any +such warranty, support, indemnity, or liability obligation is offered by +You alone, and You hereby agree to indemnify every Contributor for any +liability incurred by such Contributor as a result of warranty, support, +indemnity or liability terms You offer. You may include additional +disclaimers of warranty and limitations of liability specific to any +jurisdiction. + +4. Inability to Comply Due to Statute or Regulation +--------------------------------------------------- + +If it is impossible for You to comply with any of the terms of this +License with respect to some or all of the Covered Software due to +statute, judicial order, or regulation then You must: (a) comply with +the terms of this License to the maximum extent possible; and (b) +describe the limitations and the code they affect. Such description must +be placed in a text file included with all distributions of the Covered +Software under this License. Except to the extent prohibited by statute +or regulation, such description must be sufficiently detailed for a +recipient of ordinary skill to be able to understand it. + +5. Termination +-------------- + +5.1. The rights granted under this License will terminate automatically +if You fail to comply with any of its terms. However, if You become +compliant, then the rights granted under this License from a particular +Contributor are reinstated (a) provisionally, unless and until such +Contributor explicitly and finally terminates Your grants, and (b) on an +ongoing basis, if such Contributor fails to notify You of the +non-compliance by some reasonable means prior to 60 days after You have +come back into compliance. 
Moreover, Your grants from a particular +Contributor are reinstated on an ongoing basis if such Contributor +notifies You of the non-compliance by some reasonable means, this is the +first time You have received notice of non-compliance with this License +from such Contributor, and You become compliant prior to 30 days after +Your receipt of the notice. + +5.2. If You initiate litigation against any entity by asserting a patent +infringement claim (excluding declaratory judgment actions, +counter-claims, and cross-claims) alleging that a Contributor Version +directly or indirectly infringes any patent, then the rights granted to +You by any and all Contributors for the Covered Software under Section +2.1 of this License shall terminate. + +5.3. In the event of termination under Sections 5.1 or 5.2 above, all +end user license agreements (excluding distributors and resellers) which +have been validly granted by You or Your distributors under this License +prior to termination shall survive termination. + +************************************************************************ +* * +* 6. Disclaimer of Warranty * +* ------------------------- * +* * +* Covered Software is provided under this License on an "as is" * +* basis, without warranty of any kind, either expressed, implied, or * +* statutory, including, without limitation, warranties that the * +* Covered Software is free of defects, merchantable, fit for a * +* particular purpose or non-infringing. The entire risk as to the * +* quality and performance of the Covered Software is with You. * +* Should any Covered Software prove defective in any respect, You * +* (not any Contributor) assume the cost of any necessary servicing, * +* repair, or correction. This disclaimer of warranty constitutes an * +* essential part of this License. No use of any Covered Software is * +* authorized under this License except under this disclaimer. 
* +* * +************************************************************************ + +************************************************************************ +* * +* 7. Limitation of Liability * +* -------------------------- * +* * +* Under no circumstances and under no legal theory, whether tort * +* (including negligence), contract, or otherwise, shall any * +* Contributor, or anyone who distributes Covered Software as * +* permitted above, be liable to You for any direct, indirect, * +* special, incidental, or consequential damages of any character * +* including, without limitation, damages for lost profits, loss of * +* goodwill, work stoppage, computer failure or malfunction, or any * +* and all other commercial damages or losses, even if such party * +* shall have been informed of the possibility of such damages. This * +* limitation of liability shall not apply to liability for death or * +* personal injury resulting from such party's negligence to the * +* extent applicable law prohibits such limitation. Some * +* jurisdictions do not allow the exclusion or limitation of * +* incidental or consequential damages, so this exclusion and * +* limitation may not apply to You. * +* * +************************************************************************ + +8. Litigation +------------- + +Any litigation relating to this License may be brought only in the +courts of a jurisdiction where the defendant maintains its principal +place of business and such litigation shall be governed by laws of that +jurisdiction, without reference to its conflict-of-law provisions. +Nothing in this Section shall prevent a party's ability to bring +cross-claims or counter-claims. + +9. Miscellaneous +---------------- + +This License represents the complete agreement concerning the subject +matter hereof. If any provision of this License is held to be +unenforceable, such provision shall be reformed only to the extent +necessary to make it enforceable. 
Any law or regulation which provides +that the language of a contract shall be construed against the drafter +shall not be used to construe this License against a Contributor. + +10. Versions of the License +--------------------------- + +10.1. New Versions + +Mozilla Foundation is the license steward. Except as provided in Section +10.3, no one other than the license steward has the right to modify or +publish new versions of this License. Each version will be given a +distinguishing version number. + +10.2. Effect of New Versions + +You may distribute the Covered Software under the terms of the version +of the License under which You originally received the Covered Software, +or under the terms of any subsequent version published by the license +steward. + +10.3. Modified Versions + +If you create software not governed by this License, and you want to +create a new license for such software, you may create and use a +modified version of this License if you rename the license and remove +any references to the name of the license steward (except to note that +such modified license differs from this License). + +10.4. Distributing Source Code Form that is Incompatible With Secondary +Licenses + +If You choose to distribute Source Code Form that is Incompatible With +Secondary Licenses under the terms of this version of the License, the +notice described in Exhibit B of this License must be attached. + +Exhibit A - Source Code Form License Notice +------------------------------------------- + + This Source Code Form is subject to the terms of the Mozilla Public + License, v. 2.0. If a copy of the MPL was not distributed with this + file, You can obtain one at http://mozilla.org/MPL/2.0/. + +If it is not possible or desirable to put the notice in a particular +file, then You may include the notice in a location (such as a LICENSE +file in a relevant directory) where a recipient would be likely to look +for such a notice. 
+ +You may add additional accurate notices of copyright ownership. + +Exhibit B - "Incompatible With Secondary Licenses" Notice + --------------------------------------------------------- + + This Source Code Form is "Incompatible With Secondary Licenses", as + defined by the Mozilla Public License, v. 2.0. diff --git a/README.md b/README.md new file mode 100644 index 000000000..254f2d790 --- /dev/null +++ b/README.md @@ -0,0 +1 @@ +# bergamot-translator \ No newline at end of file From ef2323c9520c8517f23399e373441044cf11787c Mon Sep 17 00:00:00 2001 From: abhi-agg <66322306+abhi-agg@users.noreply.github.com> Date: Thu, 29 Oct 2020 09:17:32 +0100 Subject: [PATCH 002/442] Unified api draft (#1) * Changed README file - Added a short introduction of this repository - More updates to come later * First draft of the unified API --- README.md | 4 +- doc/Unified_API.md | 212 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 215 insertions(+), 1 deletion(-) create mode 100644 doc/Unified_API.md diff --git a/README.md b/README.md index 254f2d790..dd3798232 100644 --- a/README.md +++ b/README.md @@ -1 +1,3 @@ -# bergamot-translator \ No newline at end of file +# Bergamot Translator + +Bergamot Translator provides a unified API for neural machine translation (based on the [Marian NMT](https://marian-nmt.github.io/) framework) as part of the [Bergamot](https://browser.mt/) project, which focuses on improving client-side machine translation in a web browser. diff --git a/doc/Unified_API.md b/doc/Unified_API.md new file mode 100644 index 000000000..e6a14301b --- /dev/null +++ b/doc/Unified_API.md @@ -0,0 +1,212 @@ +# Unified (C++) API of Bergamot Translator + +/* A translation model interface for translating plain UTF-8 encoded text (without any markup or emojis). The model supports translation from one source language to one target language. There can be different implementations of this interface.
*/ + +class **AbstractTranslationModel** { + + public: + + AbstractTranslationModel(); + + virtual ~AbstractTranslationModel() {} + + /* This method performs translation on a list of (UTF-8) texts and returns a list of results in the same order. Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text (without any markup or emojis). Additional information related to the translated text can be requested via TranslationRequest, which is applied equally to each text entry. The translated text corresponding to each text entry and the additional information (as specified in the TranslationRequest) is encapsulated and returned in TranslationResult. + The API splits each text entry into sentences internally, which are then translated independently of each other. The translated sentences are then joined together and returned in TranslationResult. + Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check the isAlignmentSupported() API). + */ + virtual std::vector<TranslationResult> translate(std::vector<std::string> texts, TranslationRequest request) = 0; + + /* Check if the model can provide alignment information between original and translated text. */ + virtual bool isAlignmentSupported() const = 0; +} + +/* This class specifies the additional information related to the translated text (e.g. quality of the translation) that can be requested to be included in the TranslationResult. These optional requests are set/unset independently of each other, i.e. setting any one of them doesn’t have the side effect of setting any of the others. */ + +class **TranslationRequest** { + + private: + + // Optional request. The granularity for which Quality scores of the translated text will be included in TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included.
+ QualityScoreGranularity qualityScore = QualityScoreGranularity::NONE; + + // Optional request. The type of the alignment between original and translated text that will be included in TranslationResult. By default (AlignmentType::NONE), alignment is not included. + AlignmentType alignmentType = AlignmentType::NONE; + + // Optional request. A true/false value will include/exclude the original text in the TranslationResult. By default (false), the original text is not included. + bool includeOriginalText = false; + + // Optional request. A true/false value will include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included. + bool includeSentenceMapping = false; + + public: + + explicit TranslationRequest(); + + ~TranslationRequest(); + + /* Set the granularity for which the Quality scores of translated text should be included in the TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included. */ + void setQualityScoreGranularity(QualityScoreGranularity granularity); + + /* Set the type of Alignment between original and translated text to be included in the TranslationResult. By default (AlignmentType::NONE), alignment is not included. */ + void setAlignmentType(AlignmentType alignmentType); + + /* Set to true/false to include/exclude the original text in the TranslationResult. By default (false), the original text is not included. */ + void includeOriginalText(bool originalText); + + /* Set to true/false to include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included.
*/ + void includeSentenceMapping(bool sentenceMapping); + + /* Return the granularity for which the Quality scores of the translated text will be included in TranslationResult. QualityScoreGranularity::NONE means the scores will not be included. */ + QualityScoreGranularity getQualityScoreGranularity() const; + + /* Return the type of Alignment between original and translated text that should be included in the TranslationResult. AlignmentType::NONE means the alignment will not be included. */ + AlignmentType getAlignmentType() const; + + /* Return whether the original text should be included in the TranslationResult. False means the original text will not be included. */ + bool includeOriginalText() const; + + /* Return whether the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text should be included in the TranslationResult. False means this information will not be included. */ + bool includeSentenceMapping() const; +} + +/* This class represents the result of translation on a TranslationRequest. */ + +class **TranslationResult** { + + private: + + // Original text (UTF-8) that was supposed to be translated; an optional result (it will be an empty string if not requested in TranslationRequest).
+ std::string originalText; + + // Translation (in UTF-8 format) of the originalText + std::string translatedText; + + // Quality score of the translated text at the granularity specified in TranslationRequest; an optional result (it will have no information if not requested in TranslationRequest) + QualityScore qualityScore; + + // Alignment information between original and translated text for the AlignmentType specified in TranslationRequest; an optional result (it will have no information if not requested in TranslationRequest) + Alignment alignment; + + // Information regarding how individual sentences of originalText map to corresponding translated sentences + // in joined translated text (translatedText); an optional result (it will be empty if not requested in TranslationRequest); + // An example: + // originalText (contains 2 sentences) = "What is your name? My name is Abc." + // translatedText (contains 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc." + // sentenceMappings = [ + // {"What is your name?", "Was ist dein Name?"}, // A pair of Sentence 1 of originalText (originalText[0]) and the corresponding translated sentence in translatedText (translatedText[0]) + // {"My name is Abc.", "Mein Name ist Abc."} // A pair of Sentence 2 of originalText (originalText[1]) and the corresponding translated sentence in translatedText (translatedText[1]) + // ] + // + std::vector<std::pair<std::string, std::string>> sentenceMappings; + + public: + // ToDo: Public Methods +} + +/* This class encapsulates the configuration that is required by a translation model to perform translation. This configuration includes the path to the model file and to the source and target language vocabulary files, along with other options.
*/ + +class **TranslationModelConfiguration** { + + private: + + // Path to the translation model file + const std::string modelPath; + + // Path to the source vocabulary file to be used by the model + const std::string sourceLanguageVocabPath; + + // Path to the target vocabulary file to be used by the model + const std::string targetLanguageVocabPath; + + // ToDo: Add all possible user configurable options (e.g. min batch size, max batch size) that are relevant for translation + + public: + + // Provide the path to the model file along with the source and target vocabulary files + TranslationModelConfiguration(const std::string& modelFilePath, + const std::string& sourceVocabPath, + const std::string& targetVocabPath); + + // Return the path of the model file + const std::string& getModelFilePath() const; + + // Return the path of the source language vocabulary file + const std::string& getSourceVocabularyPath() const; + + // Return the path of the target language vocabulary file + const std::string& getTargetVocabularyPath() const; +} + +// All possible granularities for which Quality Scores can be returned for translated (UTF-8) text + +enum class QualityScoreGranularity { + + WORD, + SENTENCE, + NONE, +} + +// All possible supported alignment types between a text and its translation + +enum class AlignmentType { + + SOFT, + NONE, +} + +// This class represents the Quality Scores for various spans of the translated text at a specific granularity + +class QualityScore { + + private: + + // Sections of a text for the Quality Scores + std::vector<std::string_view> textViews; + + // Quality Scores corresponding to each section of the text in textViews in the same order + std::vector<float> textScores; + + // Granularity of the text for the Quality scores above + QualityScoreGranularity textGranularity; + + public: + // ToDo: Public Methods +} + +// This class encapsulates a translated text, all the sections of the original text that align to this translated text and the corresponding
alignments for each of these sections of the original text. + +class Alignment { + + private: + + // A list of sections of a translated text + // An example: originalText = "What do you need" + // translatedText = "Was brauchst du" + // translatedTextViews = ["Was ", "brauchst", "du"] + std::vector<std::string_view> translatedTextViews; + + // Each ith entry of this container corresponds to a list of all the sections of the original text that align to the ith entry of translatedTextViews + // For the example above: + // translatedTextViews = ["Was ", "brauchst", "du"] + // originalTextViews = [ + // ["What"], // originalTextViews[0] = All sections of original text that align with translatedTextViews[0] i.e. "Was" + // ["you", "need"], // originalTextViews[1] = All sections of original text that align with translatedTextViews[1] i.e. "brauchst" + // ["you"] // originalTextViews[2] = All sections of original text that align with translatedTextViews[2] i.e. "du" + // ] + std::vector<std::vector<std::string_view>> originalTextViews; + + // Each ith entry of this container corresponds to the alignments of all the sections of the original text (ith entry of originalTextViews) that align to the ith entry of translatedTextViews + // For the example above: + // alignments = [ + // [0.90], // alignments[0] = Alignments of all sections of original text (i.e. originalTextViews[0]) to translatedTextViews[0] i.e. "Was" + // [0.3, 0.7], // alignments[1] = Alignments of all sections of original text (i.e. originalTextViews[1]) to translatedTextViews[1] i.e. "brauchst" + // [0.9] // alignments[2] = Alignments of all sections of original text (i.e. originalTextViews[2]) to translatedTextViews[2] i.e.
"du" + // ] + std::vector<std::vector<float>> alignments; + + // Type of the alignment between original and translated text above + AlignmentType alignmentType; + + public: + // ToDo: Public Methods +} From e5f3d51effc37c21de9350124e1c354744694ffa Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 3 Nov 2020 09:00:33 +0100 Subject: [PATCH 003/442] Basic skeleton code for the Unified API specification - Contains classes for the API specification (doc/Unified_API.md) - Things to be changed/decided later: Use of std::string_view to represent ranges Adding Alignment information Basic Setters and Getters for some of the classes --- CMakeLists.txt | 13 ++++ src/CMakeLists.txt | 1 + src/translator/AbstractTranslationModel.cpp | 8 +++ src/translator/AbstractTranslationModel.h | 52 +++++++++++++++ src/translator/CMakeLists.txt | 2 + src/translator/QualityScore.h | 36 ++++++++++ src/translator/TranslationRequest.h | 69 +++++++++++++++++++ src/translator/TranslationResult.h | 74 +++++++++++++++++++++ 8 files changed, 255 insertions(+) create mode 100644 CMakeLists.txt create mode 100644 src/CMakeLists.txt create mode 100644 src/translator/AbstractTranslationModel.cpp create mode 100644 src/translator/AbstractTranslationModel.h create mode 100644 src/translator/CMakeLists.txt create mode 100644 src/translator/QualityScore.h create mode 100644 src/translator/TranslationRequest.h create mode 100644 src/translator/TranslationResult.h diff --git a/CMakeLists.txt b/CMakeLists.txt new file mode 100644 index 000000000..d4890299b --- /dev/null +++ b/CMakeLists.txt @@ -0,0 +1,13 @@ +cmake_minimum_required(VERSION 3.5.1) + +if (POLICY CMP0074) + cmake_policy(SET CMP0074 NEW) # CMake 3.12 +endif () + +project(bergamot_translator CXX C) + +set(CMAKE_CXX_STANDARD 17) +set(CMAKE_CXX_STANDARD_REQUIRED ON) +set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") + +add_subdirectory(src) \ No newline at end of file diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt new file mode 100644
index 000000000..27fecc4bc --- /dev/null +++ b/src/CMakeLists.txt @@ -0,0 +1 @@ +add_subdirectory(translator) \ No newline at end of file diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp new file mode 100644 index 000000000..a180a710a --- /dev/null +++ b/src/translator/AbstractTranslationModel.cpp @@ -0,0 +1,8 @@ +/* + * AbstractTranslationModel.cpp + * + */ + +#include "AbstractTranslationModel.h" + +AbstractTranslationModel::~AbstractTranslationModel() {} diff --git a/src/translator/AbstractTranslationModel.h b/src/translator/AbstractTranslationModel.h new file mode 100644 index 000000000..6f013afb0 --- /dev/null +++ b/src/translator/AbstractTranslationModel.h @@ -0,0 +1,52 @@ +/* + * AbstractTranslationModel.h + * + * An interface for a translation model for translating plain (without any markup or emojis) UTF-8 encoded text. + * The model supports translation from 1 source language to 1 target language. There can be different implementations + * of this interface. + */ + +#ifndef SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_ +#define SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_ + +#include <future> +#include <string> +#include <vector> + +#include "TranslationRequest.h" +#include "TranslationResult.h" + +/* An interface for a translation model for translating plain (without any markup or emojis) UTF-8 encoded text. + * The model supports translation from 1 source language to 1 target language. + */ +class AbstractTranslationModel { +public: + + AbstractTranslationModel(); + + virtual ~AbstractTranslationModel(); + + /* This method performs translation on a list of (UTF-8 encoded) texts and returns a list of results in the same order. + * Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text + * (without any markup or emojis). Additional information related to the translated text can be requested via + * TranslationRequest, which is applied equally to each text entry.
+ * + * The translated text corresponding to each text entry and the additional information (as specified in the + * TranslationRequest) is encapsulated and returned in TranslationResult. + * + * The API splits each text entry into sentences internally, which are then translated independently of each other. + * The translated sentences are then joined together and returned in TranslationResult. + * Please refer to the TranslationRequest class to find out what additional information can be requested. + * The alignment information can only be requested if the model supports it (check the isAlignmentSupported() API). + * + * The texts argument will become empty after the execution of this API (each entry of the texts list will be moved to its + * corresponding TranslationResult object). + */ + virtual std::future<std::vector<TranslationResult>> translate( + std::vector<std::string> &&texts, TranslationRequest request) = 0; + + /* Check if the model can provide alignment information between original and translated text. */ + virtual bool isAlignmentSupported() const = 0; +}; + +#endif /* SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_ */ diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt new file mode 100644 index 000000000..bcd34da33 --- /dev/null +++ b/src/translator/CMakeLists.txt @@ -0,0 +1,2 @@ +include_directories(.) +add_library(bergamot-translator STATIC AbstractTranslationModel.cpp) \ No newline at end of file diff --git a/src/translator/QualityScore.h b/src/translator/QualityScore.h new file mode 100644 index 000000000..020aebc8e --- /dev/null +++ b/src/translator/QualityScore.h @@ -0,0 +1,36 @@ +/* + * QualityScore.h + * + */ + +#ifndef SRC_TRANSLATOR_QUALITYSCORE_H_ +#define SRC_TRANSLATOR_QUALITYSCORE_H_ + +#include <string> +#include <vector> + + +/* All possible Granularities for which Quality Scores can be returned for translated text.
*/ +enum class QualityScoreGranularity { + WORD, SENTENCE, NONE, +}; + +/* This class represents the Quality Scores for various spans of a translated text at a specific granularity. */ +class QualityScore { +private: + + // Sections of the translated text for the Quality Scores. + std::vector<std::string_view> textViews; + + // Quality Scores corresponding to each entry of textViews in the same order + std::vector<float> textScores; + + // Granularity of the text for the Quality scores above + QualityScoreGranularity textGranularity; + +public: + // ToDo: Public Methods +}; + + +#endif /* SRC_TRANSLATOR_QUALITYSCORE_H_ */ diff --git a/src/translator/TranslationRequest.h b/src/translator/TranslationRequest.h new file mode 100644 index 000000000..bdd56803a --- /dev/null +++ b/src/translator/TranslationRequest.h @@ -0,0 +1,69 @@ +/* + * TranslationRequest.h + * + * This file defines the translation request class to be used in the AbstractTranslationModel::translate() API. + */ + +#ifndef SRC_TRANSLATOR_TRANSLATIONREQUEST_H_ +#define SRC_TRANSLATOR_TRANSLATIONREQUEST_H_ + +#include "QualityScore.h" + +/* This class specifies the information related to the translated text (e.g. quality of the translation etc.) that + * can be included in the TranslationResult. These optional requests are set/unset independently of each other, i.e. setting + * any one of them doesn’t have the side effect of setting any of the others. + */ +class TranslationRequest { +private: + // The granularity for which Quality scores of the translated text will be included in TranslationResult. + // QualityScoreGranularity::NONE means the scores are not included in TranslationResult. + QualityScoreGranularity qualityScoreGranularity = QualityScoreGranularity::NONE; + + // A flag to include/exclude the information regarding how individual sentences of the original text map to + // corresponding translated sentences in the joined translated text in the TranslationResult.
+ // An example of sentence mappings: + // originalText (containing 2 sentences) = "What is your name? My name is Abc." + // translatedText (containing 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc." + // sentenceMappings = [ + // {"What is your name?", "Was ist dein Name?"}, // Pair(originalText[0],translatedText[0]) + // {"My name is Abc", "Mein Name ist Abc."} // Pair(originalText[1],translatedText[1]) + // ] + bool includeSentenceMapping = false; + +public: + explicit TranslationRequest(); + + ~TranslationRequest(); + + /* Set the granularity for which the Quality scores of translated text should be included in the TranslationResult. + * By default (QualityScoreGranularity::NONE), scores are not included. + */ + void setQualityScoreGranularity(QualityScoreGranularity granularity) { + qualityScoreGranularity = granularity; + } + + /* Set to true/false to include/exclude the information regarding how individual sentences of original text map to + * corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this + * information is not included. + */ + void sentenceMappingInResult(bool includeMapping) { + includeSentenceMapping = includeMapping; + } + + /* Return the granularity for which the Quality scores of the translated text will be included in TranslationResult. + * QualityScoreGranularity::NONE means the scores will not be included. + */ + QualityScoreGranularity getQualityScoreGranularity() const { + return qualityScoreGranularity; + } + + /* Return whether the information regarding how individual sentences of original text map to corresponding translated + * sentences in joined translated text will be included in the TranslationResult. By default (false) means this + * information will not be included. 
*/ + bool sentenceMappingInResult() const { + return includeSentenceMapping; + } +}; + +#endif /* SRC_TRANSLATOR_TRANSLATIONREQUEST_H_ */ diff --git a/src/translator/TranslationResult.h b/src/translator/TranslationResult.h new file mode 100644 index 000000000..33bad1b66 --- /dev/null +++ b/src/translator/TranslationResult.h @@ -0,0 +1,74 @@ +/* + * TranslationResult.h + * + * The class that represents the result of the AbstractTranslationModel::translate() API for each of its text entries and + * TranslationRequest. + */ + +#ifndef SRC_TRANSLATOR_TRANSLATIONRESULT_H_ +#define SRC_TRANSLATOR_TRANSLATIONRESULT_H_ + +#include <string> +#include <vector> + +#include "QualityScore.h" + +/* This class represents the result of the AbstractTranslationModel::translate() API for each of its text entries and + * TranslationRequest. + */ +class TranslationResult { +public: + typedef std::vector<std::pair<std::string_view, std::string_view>> SentenceMappings; + + TranslationResult(const std::string &original, const std::string &translation); + + TranslationResult(std::string &&original, std::string &&translation); + + /* Return the original text. */ + const std::string& getOriginalText() const { + return originalText; + } + + /* Return the translated text. */ + const std::string& getTranslatedText() const { + return translatedText; + } + + /* Return the Quality scores of the translated text. */ + const QualityScore& getQualityScore() const { + return qualityScore; + } + + /* Return the Sentence mappings (information regarding how individual sentences of originalText map to + * corresponding translated sentences in translatedText). + */ + const SentenceMappings& getSentenceMappings() const { + return sentenceMappings; + } + +private: + // Original text (in UTF-8 encoded format) that was supposed to be translated + std::string originalText; + + // Translation (in UTF-8 encoded format) of the originalText + std::string translatedText; + + // Quality score of the translated text at the granularity specified in TranslationRequest.
+ // It is an optional result (it will have no information if not requested in TranslationRequest) + QualityScore qualityScore; + + // Information regarding how individual sentences of originalText map to corresponding translated sentences + // in joined translated text (translatedText) + // An example of sentence mapping: + // originalText (contains 2 sentences) = "What is your name? My name is Abc." + // translatedText (contains 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc." + // sentenceMappings = [ + // {"What is your name?", "Was ist dein Name?"}, // Pair(originalText[0],translatedText[0]) + // {"My name is Abc", "Mein Name ist Abc."} // Pair(originalText[1],translatedText[1]) + // ] + // + // It is an optional result (it will be empty if not requested in TranslationRequest). + SentenceMappings sentenceMappings; +}; + +#endif /* SRC_TRANSLATOR_TRANSLATIONRESULT_H_ */ From cd90f89126d3a7040ebb181caa294744bdfa2d05 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 3 Nov 2020 15:33:10 +0100 Subject: [PATCH 004/442] Added TranslationModel class - This class is an implementation of AbstractTranslationModel interface - This is the main class that will implement the translate API - Contains dummy responses for now --- src/translator/AbstractTranslationModel.cpp | 7 ++ src/translator/AbstractTranslationModel.h | 7 ++ src/translator/CMakeLists.txt | 2 +- src/translator/TranslationModel.cpp | 29 ++++++++ src/translator/TranslationModel.h | 63 +++++++++++++++++ .../TranslationModelConfiguration.h | 68 +++++++++++++++++++ 6 files changed, 175 insertions(+), 1 deletion(-) create mode 100644 src/translator/TranslationModel.cpp create mode 100644 src/translator/TranslationModel.h create mode 100644 src/translator/TranslationModelConfiguration.h diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index a180a710a..2f4f05631 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ 
b/src/translator/AbstractTranslationModel.cpp @@ -2,7 +2,14 @@ * AbstractTranslationModel.cpp * */ +#include <memory> #include "AbstractTranslationModel.h" +#include "TranslationModel.h" AbstractTranslationModel::~AbstractTranslationModel() {} + +std::shared_ptr<AbstractTranslationModel> +AbstractTranslationModel::createInstance(const TranslationModelConfiguration& config) { + return std::make_shared<TranslationModel>(config); +} diff --git a/src/translator/AbstractTranslationModel.h b/src/translator/AbstractTranslationModel.h index 6f013afb0..77ad87e5b 100644 --- a/src/translator/AbstractTranslationModel.h +++ b/src/translator/AbstractTranslationModel.h @@ -12,7 +12,9 @@ #include <future> #include <string> #include <vector> +#include <memory> +#include "TranslationModelConfiguration.h" #include "TranslationRequest.h" #include "TranslationResult.h" @@ -22,6 +24,11 @@ class AbstractTranslationModel { public: + /* A Factory method to create and return an instance of AbstractTranslationModel implementation. The instance is + * created using translation model configuration (TranslationModelConfiguration). + */ + static std::shared_ptr<AbstractTranslationModel> createInstance(const TranslationModelConfiguration& config); + AbstractTranslationModel(); virtual ~AbstractTranslationModel(); diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index bcd34da33..b227decb8 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -1,2 +1,2 @@ include_directories(.)
-add_library(bergamot-translator STATIC AbstractTranslationModel.cpp) \ No newline at end of file +add_library(bergamot-translator STATIC AbstractTranslationModel.cpp TranslationModel.cpp) \ No newline at end of file diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp new file mode 100644 index 000000000..a309cdd3d --- /dev/null +++ b/src/translator/TranslationModel.cpp @@ -0,0 +1,29 @@ +/* + * TranslationModel.cpp + * + */ + +#include <future> +#include <vector> + +#include "TranslationModel.h" + +TranslationModel::TranslationModel(const TranslationModelConfiguration &configuration) : + modelConfiguration(configuration), AbstractTranslationModel() { +} + +TranslationModel::~TranslationModel() {} + +std::future<std::vector<TranslationResult>> TranslationModel::translate( + std::vector<std::string> &&texts, TranslationRequest request) { + //ToDo: Replace this code with the actual implementation + return std::async([]() { + std::vector<TranslationResult> results; + results.emplace_back(TranslationResult{"a","d"}); + return results; + }); +} + +bool TranslationModel::isAlignmentSupported() const { + return false; +} diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h new file mode 100644 index 000000000..14cbcbd8b --- /dev/null +++ b/src/translator/TranslationModel.h @@ -0,0 +1,63 @@ +/* + * TranslationModel.h + * + * An implementation of the AbstractTranslationModel interface. + */ + +#ifndef SRC_TRANSLATOR_TRANSLATIONMODEL_H_ +#define SRC_TRANSLATOR_TRANSLATIONMODEL_H_ + +#include <future> +#include <string> +#include <vector> + +#include "AbstractTranslationModel.h" +#include "TranslationModelConfiguration.h" + +/* A translation model that translates plain (without any markup or emojis) UTF-8 encoded text. + * This implementation supports translation from 1 source language to 1 target language. + */ +class TranslationModel: public AbstractTranslationModel { +public: + /* Construct the model using the model configuration.
The model configuration specifies options + * that are required by a translation model to perform translation. It stays constant during the + * lifetime of the model instance. Please refer to TranslationModelConfiguration class + * for details regarding configuration. + */ + TranslationModel(const TranslationModelConfiguration &modelConfiguration); + + ~TranslationModel(); + + /* This method performs translation on a list of UTF-8 encoded plain text (without any markup + * or emojis) and returns a list of results in the same order. The model supports translation + * from 1 source language to 1 target language. + * + * Each text entry can either be a word, a phrase, a sentence or a list of sentences. Additional + * information related to the translated text can be requested via TranslationRequest which is + * applied equally to each text entry. The translated text corresponding to each text entry and + * the additional information (as specified in the TranslationRequest) is encapsulated and + * returned in TranslationResult. + * + * The API splits each text entry into sentences internally, which are then translated + * independently of each other. The translated sentences are then joined back together and returned + * in TranslationResult. + * + * Please refer to the TranslationRequest class to find out what additional information can be + * requested. The alignment information can only be requested if the model supports it (check + * isAlignmentSupported() API). + * + * The texts argument will become empty after the execution of this API (each entry of the texts list + * will be moved to its corresponding TranslationResult object). + */ + std::future<std::vector<TranslationResult>> translate( + std::vector<std::string> &&texts, TranslationRequest request) override; + + /* Check if the model can provide alignment information between original and translated text.
*/ + bool isAlignmentSupported() const override; + +private: + // Model configuration + const TranslationModelConfiguration modelConfiguration; +}; + +#endif /* SRC_TRANSLATOR_TRANSLATIONMODEL_H_ */ diff --git a/src/translator/TranslationModelConfiguration.h b/src/translator/TranslationModelConfiguration.h new file mode 100644 index 000000000..8c6582454 --- /dev/null +++ b/src/translator/TranslationModelConfiguration.h @@ -0,0 +1,68 @@ +/* + * TranslationModelConfiguration.h + * + */ + +#ifndef SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ +#define SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ + +#include <string> + +/* This class encapsulates the configuration that is required by a translation model to perform + * translation. + */ +class TranslationModelConfiguration { +public: + + // Constructor + TranslationModelConfiguration(const std::string &modelFilePath, + const std::string &sourceVocabPath, + const std::string &targetVocabPath) : + modelPath(modelFilePath), + sourceLanguageVocabPath(sourceVocabPath), + targetLanguageVocabPath(targetVocabPath) { + } + + // Copy constructor + TranslationModelConfiguration(const TranslationModelConfiguration &rhs) : + modelPath(rhs.modelPath), + sourceLanguageVocabPath(rhs.sourceLanguageVocabPath), + targetLanguageVocabPath(rhs.targetLanguageVocabPath) { + } + + // Move constructor + TranslationModelConfiguration(TranslationModelConfiguration &&rhs) : + modelPath(std::move(rhs.modelPath)), + sourceLanguageVocabPath(std::move(rhs.sourceLanguageVocabPath)), + targetLanguageVocabPath(std::move(rhs.targetLanguageVocabPath)) { + } + + // Return the path of the model file + const std::string& getModelFilePath() const { + return modelPath; + } + + // Return the path of the source language vocabulary file + const std::string& getSourceVocabularyPath() const { + return sourceLanguageVocabPath; + } + + // Return the path of the target language vocabulary file + const std::string& getTargetVocabularyPath() const { + return
targetLanguageVocabPath; + } + +private: + // Path to the translation model file + const std::string modelPath; + + // Path to the source vocabulary file to be used by the model + const std::string sourceLanguageVocabPath; + + // Path to the target vocabulary file to be used by the model + const std::string targetLanguageVocabPath; + + // ToDo: Add other user configurable options (e.g. min batch size) +}; + +#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ */ From 468508d75d97ed628e5f9749bca00d195443a3ad Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 3 Nov 2020 18:09:07 +0100 Subject: [PATCH 005/442] Added constructor definitions - Added definitions that were absent in previous commits of unified API --- src/translator/AbstractTranslationModel.cpp | 1 - src/translator/AbstractTranslationModel.h | 12 +++++++----- src/translator/TranslationModel.cpp | 1 - src/translator/TranslationRequest.h | 9 +++++++-- src/translator/TranslationResult.h | 6 ++++-- 5 files changed, 18 insertions(+), 11 deletions(-) diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index 2f4f05631..39b359af4 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -7,7 +7,6 @@ #include "AbstractTranslationModel.h" #include "TranslationModel.h" -AbstractTranslationModel::~AbstractTranslationModel() {} std::shared_ptr<AbstractTranslationModel> AbstractTranslationModel::createInstance(const TranslationModelConfiguration& config) { diff --git a/src/translator/AbstractTranslationModel.h b/src/translator/AbstractTranslationModel.h index 77ad87e5b..ddadc07bf 100644 --- a/src/translator/AbstractTranslationModel.h +++ b/src/translator/AbstractTranslationModel.h @@ -24,14 +24,16 @@ class AbstractTranslationModel { public: - /* A Factory method to create and return an instance of AbstractTranslationModel implementation.
The instance is - * created using translation model configuration (TranslationModelConfiguration). + /* A Factory method to create and return an instance of an implementation of + * AbstractTranslationModel. The instance is created using translation model configuration + * (TranslationModelConfiguration). */ - static std::shared_ptr<AbstractTranslationModel> createInstance(const TranslationModelConfiguration& config); + static std::shared_ptr<AbstractTranslationModel> + createInstance(const TranslationModelConfiguration& config); - AbstractTranslationModel(); + AbstractTranslationModel() = default; - virtual ~AbstractTranslationModel(); + virtual ~AbstractTranslationModel() = default; /* This method performs translation on a list of (UTF-8 encoded) texts and returns a list of results in the same order. * Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index a309cdd3d..ed894e567 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -19,7 +19,6 @@ std::future<std::vector<TranslationResult>> TranslationModel::translate( //ToDo: Replace this code with the actual implementation return std::async([]() { std::vector<TranslationResult> results; - results.emplace_back(TranslationResult{"a","d"}); return results; }); } diff --git a/src/translator/TranslationRequest.h b/src/translator/TranslationRequest.h index bdd56803a..b19cc892d 100644 --- a/src/translator/TranslationRequest.h +++ b/src/translator/TranslationRequest.h @@ -31,9 +31,14 @@ class TranslationRequest { bool includeSentenceMapping = false; public: - explicit TranslationRequest(); + TranslationRequest() {} - ~TranslationRequest(); + TranslationRequest(const TranslationRequest& request) : + qualityScoreGranularity(request.qualityScoreGranularity), + includeSentenceMapping(request.includeSentenceMapping) { + } + + ~TranslationRequest() {} /* Set the granularity for which the Quality scores of translated text should be included
in the TranslationResult. * By default (QualityScoreGranularity::NONE), scores are not included. diff --git a/src/translator/TranslationResult.h b/src/translator/TranslationResult.h index 33bad1b66..4d231a89b 100644 --- a/src/translator/TranslationResult.h +++ b/src/translator/TranslationResult.h @@ -20,9 +20,11 @@ class TranslationResult { public: typedef std::vector<std::pair<std::string_view, std::string_view>> SentenceMappings; - TranslationResult(const std::string &original, const std::string &translation); + TranslationResult(const std::string &original, const std::string &translation) : + originalText(original), translatedText(translation) {} - TranslationResult(std::string &&original, std::string &&translation); + TranslationResult(std::string &&original, std::string &&translation) : + originalText(std::move(original)), translatedText(std::move(translation)) {} /* Return the original text. */ const std::string& getOriginalText() const { From 7a695a08cbca1b2b7e8610e11492f099dcdc1991 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 9 Nov 2020 12:01:54 +0100 Subject: [PATCH 006/442] Added "ugermann/ssplit-cpp" as a submodule --- .gitmodules | 3 +++ 3rd_party/ssplit-cpp | 1 + 2 files changed, 4 insertions(+) create mode 100644 .gitmodules create mode 160000 3rd_party/ssplit-cpp diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 000000000..c3d3b4dbb --- /dev/null +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "3rd_party/ssplit-cpp"] + path = 3rd_party/ssplit-cpp + url = https://github.com/ugermann/ssplit-cpp diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp new file mode 160000 index 000000000..f5d022992 --- /dev/null +++ b/3rd_party/ssplit-cpp @@ -0,0 +1 @@ +Subproject commit f5d022992f4a00c860eb809389748908bb85ffcf From e8716f7fd1b0b1d68e9fab304da320f63880a957 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 9 Nov 2020 12:02:51 +0100 Subject: [PATCH 007/442] Added "browsermt/marian-dev" as submodule --- .gitmodules | 3 +++ 3rd_party/marian-dev | 1 + 2 files
changed, 4 insertions(+) create mode 160000 3rd_party/marian-dev diff --git a/.gitmodules b/.gitmodules index c3d3b4dbb..d3bbf18d6 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,3 +1,6 @@ [submodule "3rd_party/ssplit-cpp"] path = 3rd_party/ssplit-cpp url = https://github.com/ugermann/ssplit-cpp +[submodule "3rd_party/marian-dev"] + path = 3rd_party/marian-dev + url = https://github.com/browsermt/marian-dev diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev new file mode 160000 index 000000000..69894793e --- /dev/null +++ b/3rd_party/marian-dev @@ -0,0 +1 @@ +Subproject commit 69894793ebd93256d824a1590924780a6d54cae8 From a220f915fc6915063b3ef5a2d3d3c6e8589df79d Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 11 Nov 2020 16:19:54 +0100 Subject: [PATCH 008/442] Compile marian submodule in the project - marian compiles successfully and is ready to be used in the project --- 3rd_party/CMakeLists.txt | 6 ++++++ CMakeLists.txt | 6 ++++++ 2 files changed, 12 insertions(+) create mode 100644 3rd_party/CMakeLists.txt diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt new file mode 100644 index 000000000..5a2b56e24 --- /dev/null +++ b/3rd_party/CMakeLists.txt @@ -0,0 +1,6 @@ +add_subdirectory(marian-dev) + +# Add include directories for marian target to be able to use it anywhere in the project without +# explicitly specifying its include directories. Once marian fixes this problem, it can be removed. 
+get_property(INCDIRS DIRECTORY marian-dev/src PROPERTY INCLUDE_DIRECTORIES) +target_include_directories(marian PUBLIC ${INCDIRS}) \ No newline at end of file diff --git a/CMakeLists.txt b/CMakeLists.txt index d4890299b..2b868299c 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -10,4 +10,10 @@ set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED ON) set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") +# Custom CMake options to compile marian (a 3rd party submodule) for this project +option(COMPILE_CUDA "Compile GPU version" OFF) +option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) +option(USE_STATIC_LIBS "Link statically against non-system libs" ON) + +add_subdirectory(3rd_party) add_subdirectory(src) \ No newline at end of file From 36911d39d5d155ac9a10d82c0e687caeb64895e1 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 11 Nov 2020 16:24:50 +0100 Subject: [PATCH 009/442] Link marian library in the project --- src/translator/CMakeLists.txt | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index b227decb8..c820c9309 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -1,2 +1,5 @@ -include_directories(.) 
-add_library(bergamot-translator STATIC AbstractTranslationModel.cpp TranslationModel.cpp) \ No newline at end of file +add_library(bergamot-translator STATIC + AbstractTranslationModel.cpp + TranslationModel.cpp) + +target_link_libraries(bergamot-translator marian) \ No newline at end of file From 358d76871fe6dce602a70cfd7608bd43443451f8 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 11 Nov 2020 17:18:12 +0100 Subject: [PATCH 010/442] Small change: Added New line endings --- 3rd_party/CMakeLists.txt | 2 +- CMakeLists.txt | 2 +- src/translator/CMakeLists.txt | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt index 5a2b56e24..97bf94e05 100644 --- a/3rd_party/CMakeLists.txt +++ b/3rd_party/CMakeLists.txt @@ -3,4 +3,4 @@ add_subdirectory(marian-dev) # Add include directories for marian target to be able to use it anywhere in the project without # explicitly specifying its include directories. Once marian fixes this problem, it can be removed. 
get_property(INCDIRS DIRECTORY marian-dev/src PROPERTY INCLUDE_DIRECTORIES) -target_include_directories(marian PUBLIC ${INCDIRS}) \ No newline at end of file +target_include_directories(marian PUBLIC ${INCDIRS}) diff --git a/CMakeLists.txt b/CMakeLists.txt index 2b868299c..6aaff4ee6 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -16,4 +16,4 @@ option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) option(USE_STATIC_LIBS "Link statically against non-system libs" ON) add_subdirectory(3rd_party) -add_subdirectory(src) \ No newline at end of file +add_subdirectory(src) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index c820c9309..ac8936645 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -2,4 +2,4 @@ add_library(bergamot-translator STATIC AbstractTranslationModel.cpp TranslationModel.cpp) -target_link_libraries(bergamot-translator marian) \ No newline at end of file +target_link_libraries(bergamot-translator marian) From 210c5a466a7e57acf56a0bcb17bcaa2d94b28a99 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 11 Nov 2020 17:52:27 +0100 Subject: [PATCH 011/442] Separated the public includes of the project from implementation - All interfaces are present in ROOT/src --- src/{translator => }/AbstractTranslationModel.h | 0 src/{translator => }/QualityScore.h | 0 src/{translator => }/TranslationModelConfiguration.h | 0 src/{translator => }/TranslationRequest.h | 0 src/{translator => }/TranslationResult.h | 0 src/translator/CMakeLists.txt | 4 ++++ 6 files changed, 4 insertions(+) rename src/{translator => }/AbstractTranslationModel.h (100%) rename src/{translator => }/QualityScore.h (100%) rename src/{translator => }/TranslationModelConfiguration.h (100%) rename src/{translator => }/TranslationRequest.h (100%) rename src/{translator => }/TranslationResult.h (100%) diff --git a/src/translator/AbstractTranslationModel.h b/src/AbstractTranslationModel.h similarity index 100% rename from 
src/translator/AbstractTranslationModel.h rename to src/AbstractTranslationModel.h diff --git a/src/translator/QualityScore.h b/src/QualityScore.h similarity index 100% rename from src/translator/QualityScore.h rename to src/QualityScore.h diff --git a/src/translator/TranslationModelConfiguration.h b/src/TranslationModelConfiguration.h similarity index 100% rename from src/translator/TranslationModelConfiguration.h rename to src/TranslationModelConfiguration.h diff --git a/src/translator/TranslationRequest.h b/src/TranslationRequest.h similarity index 100% rename from src/translator/TranslationRequest.h rename to src/TranslationRequest.h diff --git a/src/translator/TranslationResult.h b/src/TranslationResult.h similarity index 100% rename from src/translator/TranslationResult.h rename to src/TranslationResult.h diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index ac8936645..5e2b4d6e3 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -3,3 +3,7 @@ add_library(bergamot-translator STATIC TranslationModel.cpp) target_link_libraries(bergamot-translator marian) + +target_include_directories(bergamot-translator + PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} + PUBLIC ${CMAKE_SOURCE_DIR}/src) From 59c940090b9491874aae7e307953e09a8fc33eea Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 12 Nov 2020 10:23:47 +0100 Subject: [PATCH 012/442] Use marian::Options class internally for configuration options - Marian uses Options class everywhere as configuration options - Owing to this project's heavy dependency on Marian: -- Made the internal implementation files of the project work with marian::Options instead of TranslationModelConfiguration -- An Adaptor class to adapt TranslationModelConfiguration to marian::Options will be added in following commit --- src/translator/AbstractTranslationModel.cpp | 8 +++++++- src/translator/TranslationModel.cpp | 4 ++-- src/translator/TranslationModel.h | 15 ++++++++------- 3 files 
changed, 17 insertions(+), 10 deletions(-) diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index 39b359af4..0ad3971e6 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -4,11 +4,17 @@ */ #include <memory> +// All 3rd party includes +#include "common/options.h" + +// All local includes #include "AbstractTranslationModel.h" #include "TranslationModel.h" std::shared_ptr<AbstractTranslationModel> AbstractTranslationModel::createInstance(const TranslationModelConfiguration& config) { - return std::make_shared<TranslationModel>(config); + // ToDo: Write an adaptor for adapting TranslationModelConfiguration to marian::Options + auto options = std::make_shared<marian::Options>(); + return std::make_shared<TranslationModel>(options); } diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index ed894e567..099d930cd 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -8,8 +8,8 @@ #include "TranslationModel.h" -TranslationModel::TranslationModel(const TranslationModelConfiguration &configuration) : - modelConfiguration(configuration), AbstractTranslationModel() { +TranslationModel::TranslationModel(std::shared_ptr<marian::Options> options) : + configOptions(std::move(options)), AbstractTranslationModel() { } diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index 14cbcbd8b..5b75b8fec 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -11,6 +11,10 @@ #include <string> #include <vector> +// All 3rd party includes +#include "common/options.h" + +// All local project includes #include "AbstractTranslationModel.h" #include "TranslationModelConfiguration.h" @@ -19,12 +23,9 @@ */ class TranslationModel: public AbstractTranslationModel { public: - /* Construct the model using the model configuration.
The model configuration specifies options - * that are required by a translation model to perform translation. It stays constant during the - * lifetime of the model instance. Please refer to TranslationModelConfiguration class - * for details regarding configuration. + /* Construct the model using the model configuration options. */ - TranslationModel(const TranslationModelConfiguration &modelConfiguration); + TranslationModel(std::shared_ptr<marian::Options> options); ~TranslationModel(); @@ -56,8 +57,8 @@ class TranslationModel: public AbstractTranslationModel { bool isAlignmentSupported() const override; private: - // Model configuration - const TranslationModelConfiguration modelConfiguration; + // Model configuration options + std::shared_ptr<marian::Options> configOptions; }; #endif /* SRC_TRANSLATOR_TRANSLATIONMODEL_H_ */ From ce7312cfd4ff5f310abd649d7b57f6bf4ff109d5 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 12 Nov 2020 11:17:34 +0100 Subject: [PATCH 013/442] Added basic skeleton for Adaptor class - The class adapts the TranslationModelConfiguration to marian::Options - Returns a dummy marian::Options for now --- src/translator/AbstractTranslationModel.cpp | 5 +-- src/translator/CMakeLists.txt | 3 +- ...TranslationModelConfigToOptionsAdaptor.cpp | 17 ++++++++++ .../TranslationModelConfigToOptionsAdaptor.h | 32 +++++++++++++++++++ 4 files changed, 54 insertions(+), 3 deletions(-) create mode 100644 src/translator/TranslationModelConfigToOptionsAdaptor.cpp create mode 100644 src/translator/TranslationModelConfigToOptionsAdaptor.h diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index 0ad3971e6..afd62e7ec 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -10,11 +10,12 @@ // All local includes #include "AbstractTranslationModel.h" #include "TranslationModel.h" +#include "TranslationModelConfigToOptionsAdaptor.h" std::shared_ptr<AbstractTranslationModel>
AbstractTranslationModel::createInstance(const TranslationModelConfiguration& config) { - // ToDo: Write an adaptor for adapting TranslationModelConfiguration to marian::Options - auto options = std::make_shared<marian::Options>(); + TranslationModelConfigToOptionsAdaptor adaptor; + auto options = adaptor.adapt(config); return std::make_shared<TranslationModel>(options); } diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 5e2b4d6e3..c9a51df45 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -1,6 +1,7 @@ add_library(bergamot-translator STATIC AbstractTranslationModel.cpp - TranslationModel.cpp) + TranslationModel.cpp + TranslationModelConfigToOptionsAdaptor.cpp) target_link_libraries(bergamot-translator marian) diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.cpp b/src/translator/TranslationModelConfigToOptionsAdaptor.cpp new file mode 100644 index 000000000..3405a5fcf --- /dev/null +++ b/src/translator/TranslationModelConfigToOptionsAdaptor.cpp @@ -0,0 +1,17 @@ +/* + * TranslationModelConfigToOptionsAdaptor.cpp + * + */ +#include <memory> + +#include "TranslationModelConfigToOptionsAdaptor.h" + +TranslationModelConfigToOptionsAdaptor::TranslationModelConfigToOptionsAdaptor() {} + +TranslationModelConfigToOptionsAdaptor::~TranslationModelConfigToOptionsAdaptor() {} + +std::shared_ptr<marian::Options> +TranslationModelConfigToOptionsAdaptor::adapt(const TranslationModelConfiguration& config) { + // ToDo: Add actual implementation + return std::make_shared<marian::Options>(); +} diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.h b/src/translator/TranslationModelConfigToOptionsAdaptor.h new file mode 100644 index 000000000..309ea69c8 --- /dev/null +++ b/src/translator/TranslationModelConfigToOptionsAdaptor.h @@ -0,0 +1,32 @@ +/* + * This class adapts the TranslationModelConfiguration object to marian::Options object.
+ * marian::Options is a class that is specific to Marian and is used heavily inside it + * as configuration options (even for translation workflow). It makes sense to work with + * this class internally in the implementation files. + */ + +#ifndef SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ +#define SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ + +#include <memory> + +// All 3rd party includes +#include "common/options.h" + +// All local includes +#include "TranslationModelConfiguration.h" + + +class TranslationModelConfigToOptionsAdaptor { +public: + + TranslationModelConfigToOptionsAdaptor(); + + ~TranslationModelConfigToOptionsAdaptor(); + + /* Create an Options object from the translation model configuration object. + */ + std::shared_ptr<marian::Options> adapt(const TranslationModelConfiguration& config); +}; + +#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ */ From cd505c9286a8d48a2f5b5e91106b5073801b6b40 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 16 Nov 2020 13:09:42 +0100 Subject: [PATCH 014/442] Updated README with 'Build' and 'Use' instructions --- README.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/README.md b/README.md index dd3798232..fbbbe7b46 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,17 @@ # Bergamot Translator Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. + +## Build Instructions +``` +$ git clone https://github.com/browsermt/bergamot-translator +$ cd bergamot-translator +$ mkdir build +$ cd build +$ cmake ../ +$ make -j + +``` + +## Using Bergamot Translator +The build will generate the library that can be linked to any project. All the public header files are specified in the `src` folder.
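The adaptor introduced in PATCH 013 is still a stub. Its eventual job — mapping the public configuration object onto a Marian-style key/value options store — can be sketched with stand-in types (the `Config` fields, the option key names, and the flat `std::map` standing in for marian::Options are assumptions for illustration, not the real API):

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for TranslationModelConfiguration: three paths, mirroring the
// dummy app added in a later commit (field names are hypothetical).
struct Config {
  std::string modelPath;
  std::string sourceVocabPath;
  std::string targetVocabPath;
};

// Stand-in for marian::Options: a flat key -> value store.
using Options = std::map<std::string, std::string>;

// The adaptor is stateless; it only translates one representation into the
// other, so the rest of the implementation can deal in Options exclusively.
std::shared_ptr<Options> adapt(const Config &config) {
  auto options = std::make_shared<Options>();
  (*options)["models"] = config.modelPath;
  (*options)["vocabs"] = config.sourceVocabPath + " " + config.targetVocabPath;
  return options;
}
```

Keeping the conversion in one place means the public configuration type can evolve without touching any file that consumes the options object.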
From 9478a54628eae05fc3abd4c7fcb1104ce424c713 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 16 Nov 2020 15:14:50 +0100 Subject: [PATCH 015/442] Improved 3rd party header inclusion - Inclusion now contains explicit names of the 3rd party libraries --- src/translator/AbstractTranslationModel.cpp | 2 +- src/translator/CMakeLists.txt | 1 + src/translator/TranslationModel.h | 2 +- src/translator/TranslationModelConfigToOptionsAdaptor.h | 2 +- 4 files changed, 4 insertions(+), 3 deletions(-) diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index afd62e7ec..597c592d3 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -5,7 +5,7 @@ #include // All 3rd party includes -#include "common/options.h" +#include "3rd_party/marian-dev/src/common/options.h" // All local includes #include "AbstractTranslationModel.h" diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index c9a51df45..08a82fcb5 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -7,4 +7,5 @@ target_link_libraries(bergamot-translator marian) target_include_directories(bergamot-translator PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} + PRIVATE ${CMAKE_SOURCE_DIR} PUBLIC ${CMAKE_SOURCE_DIR}/src) diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index 5b75b8fec..587926516 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -12,7 +12,7 @@ #include // All 3rd party includes -#include "common/options.h" +#include "3rd_party/marian-dev/src/common/options.h" // All local project includes #include "AbstractTranslationModel.h" diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.h b/src/translator/TranslationModelConfigToOptionsAdaptor.h index 309ea69c8..1eba4cced 100644 --- a/src/translator/TranslationModelConfigToOptionsAdaptor.h +++ 
b/src/translator/TranslationModelConfigToOptionsAdaptor.h @@ -11,7 +11,7 @@ #include // All 3rd party includes -#include "common/options.h" +#include "3rd_party/marian-dev/src/common/options.h" // All local includes #include "TranslationModelConfiguration.h" From f8c9a6b0cce845ab8b6e1fe929cb7a6b260c72a4 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 16 Nov 2020 13:17:53 +0100 Subject: [PATCH 016/442] Added an application showing usage of bergamot translator - 'app' folder contains the application - The application uses dummy requests and responses for now --- CMakeLists.txt | 1 + app/CMakeLists.txt | 3 +++ app/main.cpp | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 39 insertions(+) create mode 100644 app/CMakeLists.txt create mode 100644 app/main.cpp diff --git a/CMakeLists.txt b/CMakeLists.txt index 6aaff4ee6..68a075d5c 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -17,3 +17,4 @@ option(USE_STATIC_LIBS "Link statically against non-system libs" ON) add_subdirectory(3rd_party) add_subdirectory(src) +add_subdirectory(app) diff --git a/app/CMakeLists.txt b/app/CMakeLists.txt new file mode 100644 index 000000000..f9698dc55 --- /dev/null +++ b/app/CMakeLists.txt @@ -0,0 +1,3 @@ +add_executable(bergamot-translator-app main.cpp) + +target_link_libraries(bergamot-translator-app PRIVATE bergamot-translator) diff --git a/app/main.cpp b/app/main.cpp new file mode 100644 index 000000000..dc808228f --- /dev/null +++ b/app/main.cpp @@ -0,0 +1,35 @@ +/* + * main.cpp + * + * An example application to demonstrate the use of Bergamot translator. 
+ * + */ + +#include + +#include "TranslationModelConfiguration.h" +#include "AbstractTranslationModel.h" +#include "TranslationRequest.h" +#include "TranslationResult.h" + + +int main(int argc, char** argv) { + + // Create an instance of AbstractTranslationModel with a dummy model configuration + TranslationModelConfiguration config("dummy_modelFilePath", + "dummy_sourceVocabPath", + "dummy_targetVocabPath"); + std::shared_ptr model = + AbstractTranslationModel::createInstance(config); + + // Call to translate a dummy (empty) texts with a dummy (empty) translation request + TranslationRequest req; + std::vector texts; + auto result = model->translate(std::move(texts), req); + + // Resolve the future and get the actual result + std::vector res = result.get(); + + std::cout << "Count is: " << res.size() << std::endl; + return 0; +} From 601bd527168d6f0cdef00b5226b912ffc59a5f69 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 19:08:46 +0000 Subject: [PATCH 017/442] Import sources from mts adaptation This first commit imports files from mts which was repurposed for bergamot translator from https://github.com/browsermt/mts/tree/nuke. 
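The app above calls `result.get()`, so `translate()` must hand back a `std::future` that some worker fulfils later. A minimal sketch of that promise/future contract (`translateAsync` and its echo "translation" are hypothetical; the real library fulfils the promise from its worker threads):

```cpp
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Returns a future immediately; a detached worker thread fulfils the paired
// promise once the (fake) translation finishes. get() on the future blocks
// until set_value() runs.
std::future<std::vector<std::string>>
translateAsync(std::vector<std::string> texts) {
  std::promise<std::vector<std::string>> promise;
  std::future<std::vector<std::string>> future = promise.get_future();
  std::thread([](std::promise<std::vector<std::string>> p,
                 std::vector<std::string> in) {
    // A real implementation would run the model here; we simply echo.
    for (auto &t : in) t = "translated: " + t;
    p.set_value(std::move(in));
  }, std::move(promise), std::move(texts)).detach();
  return future;
}
```

The shared state behind the promise/future pair outlives the detached thread, which is what lets the caller resolve the result safely after the worker exits.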
--- src/translator/batch_translator.cpp | 123 ++++++++++ src/translator/batch_translator.h | 51 ++++ src/translator/batcher.cpp | 54 +++++ src/translator/batcher.h | 35 +++ src/translator/definitions.h | 27 +++ src/translator/main.cpp | 92 ++++++++ src/translator/multifactor_priority.cpp | 7 + src/translator/multifactor_priority.h | 20 ++ src/translator/pcqueue.h | 299 ++++++++++++++++++++++++ src/translator/request.cpp | 93 ++++++++ src/translator/request.h | 114 +++++++++ src/translator/sanelogging.h | 44 ++++ src/translator/service.cpp | 99 ++++++++ src/translator/service.h | 44 ++++ src/translator/textops.cpp | 135 +++++++++++ src/translator/textops.h | 102 ++++++++ src/translator/timer.h | 32 +++ src/translator/translation_result.cpp | 97 ++++++++ src/translator/translation_result.h | 64 +++++ src/translator/utils.cpp | 31 +++ src/translator/utils.h | 20 ++ 21 files changed, 1583 insertions(+) create mode 100644 src/translator/batch_translator.cpp create mode 100644 src/translator/batch_translator.h create mode 100644 src/translator/batcher.cpp create mode 100644 src/translator/batcher.h create mode 100644 src/translator/definitions.h create mode 100644 src/translator/main.cpp create mode 100644 src/translator/multifactor_priority.cpp create mode 100644 src/translator/multifactor_priority.h create mode 100644 src/translator/pcqueue.h create mode 100644 src/translator/request.cpp create mode 100644 src/translator/request.h create mode 100644 src/translator/sanelogging.h create mode 100644 src/translator/service.cpp create mode 100644 src/translator/service.h create mode 100644 src/translator/textops.cpp create mode 100644 src/translator/textops.h create mode 100644 src/translator/timer.h create mode 100644 src/translator/translation_result.cpp create mode 100644 src/translator/translation_result.h create mode 100644 src/translator/utils.cpp create mode 100644 src/translator/utils.h diff --git a/src/translator/batch_translator.cpp 
b/src/translator/batch_translator.cpp new file mode 100644 index 000000000..f41fa590f --- /dev/null +++ b/src/translator/batch_translator.cpp @@ -0,0 +1,123 @@ +#include "batch_translator.h" +#include "common/logging.h" +#include "data/corpus.h" +#include "data/text_input.h" +#include "sanelogging.h" +#include "translator/beam_search.h" +#include "utils.h" + +namespace marian { +namespace bergamot { + +BatchTranslator::BatchTranslator(DeviceId const device, + PCQueue<PCItem> &pcqueue, Ptr<Options> options) + : device_(device), options_(options), pcqueue_(&pcqueue) { + + thread_ = std::thread([this] { this->mainloop(); }); +} + +void BatchTranslator::initGraph() { + vocabs_ = loadVocabularies(options_); + if (options_->hasAndNotEmpty("shortlist")) { + Ptr<data::ShortlistGenerator const> slgen; + int srcIdx = 0, trgIdx = 1; + bool shared_vcb = vocabs_.front() == vocabs_.back(); + slgen_ = New<data::LexicalShortlistGenerator>( + options_, vocabs_.front(), vocabs_.back(), srcIdx, trgIdx, shared_vcb); + } + + graph_ = New<ExpressionGraph>(true); // always optimize + auto prec = options_->get<std::vector<std::string>>("precision", {"float32"}); + graph_->setDefaultElementType(typeFromString(prec[0])); + graph_->setDevice(device_); + graph_->getBackend()->configureDevice(options_); + graph_->reserveWorkspaceMB(options_->get<size_t>("workspace")); + scorers_ = createScorers(options_); + for (auto scorer : scorers_) { + scorer->init(graph_); + if (slgen_) { + scorer->setShortlistGenerator(slgen_); + } + } + + graph_->forward(); +} + +void BatchTranslator::translate(RequestSentences &requestSentences, + Histories &histories) { + std::vector<data::SentenceTuple> batchVector; + + for (auto &sentence : requestSentences) { + data::SentenceTuple sentence_tuple(sentence.lineNumber()); + Segment segment = sentence.getUnderlyingSegment(); + sentence_tuple.push_back(segment); + batchVector.push_back(sentence_tuple); + } + + size_t batchSize = batchVector.size(); + std::vector<size_t> sentenceIds; + std::vector<int> maxDims; + for (auto &ex : batchVector) { + if (maxDims.size() < ex.size()) + maxDims.resize(ex.size(), 0); + for (size_t i = 0; i < ex.size(); ++i) { + if (ex[i].size() > (size_t)maxDims[i]) + maxDims[i] = (int)ex[i].size(); + } + sentenceIds.push_back(ex.getId()); + } + + typedef marian::data::SubBatch SubBatch; + typedef marian::data::CorpusBatch CorpusBatch; + + std::vector<Ptr<SubBatch>> subBatches; + for (size_t j = 0; j < maxDims.size(); ++j) { + subBatches.emplace_back(New<SubBatch>(batchSize, maxDims[j], vocabs_[j])); + } + + std::vector<size_t> words(maxDims.size(), 0); + for (size_t i = 0; i < batchSize; ++i) { + for (size_t j = 0; j < maxDims.size(); ++j) { + for (size_t k = 0; k < batchVector[i][j].size(); ++k) { + subBatches[j]->data()[k * batchSize + i] = batchVector[i][j][k]; + subBatches[j]->mask()[k * batchSize + i] = 1.f; + words[j]++; + } + } + } + + for (size_t j = 0; j < maxDims.size(); ++j) + subBatches[j]->setWords(words[j]); + + auto batch = Ptr<CorpusBatch>(new CorpusBatch(subBatches)); + batch->setSentenceIds(sentenceIds); + + auto trgVocab = vocabs_.back(); + auto search = New<BeamSearch>(options_, scorers_, trgVocab); + + histories = std::move(search->search(graph_, batch)); +} + +void BatchTranslator::mainloop() { + initGraph(); + + PCItem pcitem; + Histories histories; + + while (true) { + pcqueue_->ConsumeSwap(pcitem); + if (pcitem.isPoison()) { + return; + } else { + translate(pcitem.sentences, histories); + for (int i = 0; i < pcitem.sentences.size(); i++) { + pcitem.sentences[i].completeSentence(histories[i]); + } + } + } +} + +void BatchTranslator::join() { thread_.join(); } + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h new file mode 100644 index 000000000..638a1a971 --- /dev/null +++ b/src/translator/batch_translator.h @@ -0,0 +1,51 @@ +#ifndef SRC_BERGAMOT_BATCH_TRANSLATOR_H_ +#define SRC_BERGAMOT_BATCH_TRANSLATOR_H_ + +#include <string> +#include <thread> + +#include "common/utils.h" +#include "data/shortlist.h" +#include "definitions.h" +#include "pcqueue.h" +#include "request.h" +#include "translator/history.h" +#include "translator/scorers.h"
+ +namespace marian { +namespace bergamot { + +class BatchTranslator { + // Launches minimal marian-translation (only CPU at the moment) in individual + // threads. Constructor launches each worker thread running mainloop(). + // mainloop runs until it receives poison from the PCQueue. Threads are + // shut down in Service which calls join() on the threads. + +public: + BatchTranslator(DeviceId const device, PCQueue<PCItem> &pcqueue, + Ptr<Options> options); + void join(); + + // convenience function for logging. TODO(jerin) + std::string _identifier() { return "worker" + std::to_string(device_.no); } + +private: + void initGraph(); + void translate(RequestSentences &requestSentences, Histories &histories); + void mainloop(); + + Ptr<Options> options_; + + DeviceId device_; + std::vector<Ptr<Vocab const>> vocabs_; + Ptr<ExpressionGraph> graph_; + std::vector<Ptr<Scorer>> scorers_; + Ptr<data::ShortlistGenerator const> slgen_; + + PCQueue<PCItem> *pcqueue_; + std::thread thread_; +}; +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_BATCH_TRANSLATOR_H_ diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp new file mode 100644 index 000000000..471263df9 --- /dev/null +++ b/src/translator/batcher.cpp @@ -0,0 +1,54 @@ +#include "batcher.h" +#include "common/logging.h" +#include "sanelogging.h" +#include <cassert> + +namespace marian { +namespace bergamot { + +Batcher::Batcher(Ptr<Options> options) { + max_input_tokens_ = options->get<int>("max-input-tokens"); + bucket_.resize(options->get<int>("max-input-sentence-tokens") + 1); + ABORT_IF(max_input_tokens_ >= bucket_.size(), + "max-input-sentence-tokens cannot be greater than max-input-tokens, " + "batcher fail"); +} + +void Batcher::addSentenceWithPriority(RequestSentence &sentence) { + int bucket_id = sentence.numTokens(); + assert(bucket_id < bucket_.size()); + bucket_[bucket_id].insert(sentence); +} + +void Batcher::cleaveBatch(RequestSentences &sentences) { + // For now simply iterates on buckets and converts batches greedily. This + // has to be enhanced with optimizing over priority.
The baseline + // implementation should at least be as fast as marian's maxi-batch with full + // corpus size as maxi-batch size. + + int segments_added = 0; + int current_input_tokens = 0; + int padded_batch_size = 0; + int prev_padded_batch_size; + + for (int i = 0; i < bucket_.size(); i++) { + auto p = bucket_[i].begin(); + while (p != bucket_[i].end()) { + padded_batch_size = (segments_added + 1) * i; + if (padded_batch_size <= max_input_tokens_) { + auto q = p; + ++p; + current_input_tokens += i; + sentences.push_back(*q); + ++segments_added; + bucket_[i].erase(q); + prev_padded_batch_size = padded_batch_size; + } else { + return; + } + } + } +} + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/batcher.h b/src/translator/batcher.h new file mode 100644 index 000000000..b60b642c7 --- /dev/null +++ b/src/translator/batcher.h @@ -0,0 +1,35 @@ +#ifndef SRC_BERGAMOT_BATCHER_H_ +#define SRC_BERGAMOT_BATCHER_H_ + +#include "common/options.h" +#include "data/corpus_base.h" +#include "definitions.h" +#include "request.h" + +#include <set> +#include <vector> + +namespace marian { +namespace bergamot { +class Batcher { +public: + explicit Batcher(Ptr<Options> options); + + // RequestSentence incorporates (tentative) notions of priority with each + // sentence. This method inserts the sentence into the internal data-structure + // which maintains priority among sentences from multiple concurrent requests. + void addSentenceWithPriority(RequestSentence &sentence); + + // Loads sentences with sentences compiled from (tentatively) multiple + // requests optimizing for both padding and priority.
+ void cleaveBatch(RequestSentences &sentences); + +private: + unsigned int max_input_tokens_; + std::vector<std::set<RequestSentence>> bucket_; +}; + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_BATCHER_H_ diff --git a/src/translator/definitions.h b/src/translator/definitions.h new file mode 100644 index 000000000..35797a2b4 --- /dev/null +++ b/src/translator/definitions.h @@ -0,0 +1,27 @@ +#ifndef SRC_BERGAMOT_DEFINITIONS_H_ +#define SRC_BERGAMOT_DEFINITIONS_H_ + +#include "data/types.h" +#include "data/vocab_base.h" +#include <vector> + +namespace marian { +namespace bergamot { + +typedef marian::Words Segment; +typedef std::vector<Segment> Segments; +typedef std::vector<string_view> TokenRanges; +typedef std::vector<TokenRanges> SentenceTokenRanges; + +/** @brief Creates unique_ptr of any type, passes all arguments to any available + * constructor */ +template <class T, typename... Args> UPtr<T> UNew(Args &&... args) { + return UPtr<T>(new T(std::forward<Args>(args)...)); +} + +template <class T> UPtr<T> UNew(UPtr<T> p) { return UPtr<T>(p); } + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_DEFINITIONS_H_ diff --git a/src/translator/main.cpp b/src/translator/main.cpp new file mode 100644 index 000000000..b3fb3f116 --- /dev/null +++ b/src/translator/main.cpp @@ -0,0 +1,92 @@ +#include <cstdlib> +#include <iostream> +#include <sstream> + +#include "common/definitions.h" +#include "common/timer.h" +#include "common/utils.h" +#include "marian.h" +#include "translator/history.h" +#include "translator/output_collector.h" +#include "translator/output_printer.h" + +#include "service.h" + +void marian_decoder_minimal(const marian::Histories &histories, + marian::Ptr<marian::Vocab const> targetVocab, + marian::Ptr<marian::Options> options) { + + bool doNbest = options->get<bool>("n-best"); + + auto collector = + marian::New<marian::OutputCollector>(options->get<std::string>("output")); + + // There is a dependency of vocabs here.
+ auto printer = marian::New<marian::OutputPrinter>(options, targetVocab); + if (options->get<bool>("quiet-translation")) + collector->setPrintingStrategy(marian::New<marian::QuietPrinting>()); + + for (auto &history : histories) { + std::stringstream best1; + std::stringstream bestn; + printer->print(history, best1, bestn); + collector->Write((long)history->getLineNum(), best1.str(), bestn.str(), + doNbest); + } +} + +int main(int argc, char *argv[]) { + marian::ConfigParser cp(marian::cli::mode::translation); + + cp.addOption<std::string>( + "--ssplit-prefix-file", "Bergamot Options", + "File with nonbreaking prefixes for sentence splitting."); + + cp.addOption<std::string>("--ssplit-mode", "Server Options", + "[paragraph, sentence, wrapped_text]"); + + cp.addOption<int>( + "--max-input-sentence-tokens", "Bergamot Options", + "Maximum input tokens to be processed in a single sentence.", 128); + + cp.addOption<int>("--max-input-tokens", "Bergamot Options", + "Maximum input tokens in a batch. control for" + "Bergamot Queue", + 1024); + + cp.addOption<int>("--nbest", "Bergamot Options", + "NBest value used for decoding", 1); + + cp.addOption<bool>("--marian-decoder-alpha", "Bergamot Options", + "Run marian-decoder output printer code", false); + + // TODO(jerin): Add QE later.
+ // marian::qe::QualityEstimator::addOptions(cp); + + marian::timer::Timer decoderTimer; + + auto options = cp.parseOptions(argc, argv, true); + marian::bergamot::Service service(options); + + std::ostringstream std_input; + std_input << std::cin.rdbuf(); + std::string input = std_input.str(); + + LOG(info, "IO complete Translating input"); + auto translation_result_future = service.translate(std::move(input)); + translation_result_future.wait(); + auto translation_result = translation_result_future.get(); + if (options->get<bool>("marian-decoder-alpha")) { + marian_decoder_minimal(translation_result.getHistories(), + service.targetVocab(), options); + LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); + } else { + for (auto &p : translation_result.getSentenceMappings()) { + std::cout << "[src] " << p.first << "\n"; + std::cout << "[tgt] " << p.second << "\n"; + } + } + + service.stop(); + return 0; +} diff --git a/src/translator/multifactor_priority.cpp b/src/translator/multifactor_priority.cpp new file mode 100644 index 000000000..0f93a8148 --- /dev/null +++ b/src/translator/multifactor_priority.cpp @@ -0,0 +1,7 @@ +#include "multifactor_priority.h" + +namespace marian { +namespace bergamot { + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/multifactor_priority.h b/src/translator/multifactor_priority.h new file mode 100644 index 000000000..1e239f73b --- /dev/null +++ b/src/translator/multifactor_priority.h @@ -0,0 +1,20 @@ +#ifndef SRC_BERGAMOT_MULTIFACTOR_PRIORITY_H_ +#define SRC_BERGAMOT_MULTIFACTOR_PRIORITY_H_ + +#include "data/types.h" +#include "definitions.h" +#include "sys/time.h" + +namespace marian { +namespace bergamot { + +struct MultiFactorPriority { + int nice; /* user configurable priority, at a request */ + unsigned int Id; + /* What else should priority depend on?
*/ + double priority() { return Id; } +}; +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_MULTIFACTOR_PRIORITY_H_ diff --git a/src/translator/pcqueue.h b/src/translator/pcqueue.h new file mode 100644 index 000000000..512932560 --- /dev/null +++ b/src/translator/pcqueue.h @@ -0,0 +1,299 @@ +#ifndef SRC_BERGAMOT_PCQUEUE_H_ +#define SRC_BERGAMOT_PCQUEUE_H_ + +#include "common/logging.h" + +#include +#include +#include +#include +#include + +#ifdef __APPLE__ +#include +#include +#include +#include +#elif defined(__linux) +#include +#else +#include +#endif + +#if __GNUC__ >= 3 +#define UTIL_UNLIKELY(x) __builtin_expect(!!(x), 0) +#else +#define UTIL_UNLIKELY(x) (x) +#endif + +namespace marian { +namespace bergamot { + +/* OS X Maverick and Boost interprocess were doing "Function not implemented." + * So this is my own wrapper around the mach kernel APIs. + */ +#ifdef __APPLE__ + +class Semaphore { +public: + explicit Semaphore(int value) : task_(mach_task_self()) { + ABORT_IF(KERN_SUCCESS != + semaphore_create(task_, &back_, SYNC_POLICY_FIFO, value), + "Could not create semaphore"); + } + + ~Semaphore() { + if (KERN_SUCCESS != semaphore_destroy(task_, back_)) { + std::cerr << "Could not destroy semaphore" << std::endl; + abort(); + } + } + + void wait() { + ABORT_IF(KERN_SUCCESS != semaphore_wait(back_), Exception, + "Wait for semaphore failed"); + } + + void post() { + ABORT_IF(KERN_SUCCESS != semaphore_signal(back_), Exception, + "Could not post to semaphore"); + } + +private: + semaphore_t back_; + task_t task_; +}; + +inline void WaitSemaphore(Semaphore &semaphore) { semaphore.wait(); } + +#elif defined(__linux) + +class Semaphore { +public: + explicit Semaphore(unsigned int value) { + ABORT_IF(sem_init(&sem_, 0, value), "Could not create semaphore"); + } + + ~Semaphore() { + if (-1 == sem_destroy(&sem_)) { + std::cerr << "Could not destroy semaphore " << std::endl; + abort(); + } + } + + void wait() { + while (UTIL_UNLIKELY(-1 == 
sem_wait(&sem_))) { + ABORT_IF(errno != EINTR, "Wait for semaphore failed"); + } + } + + void post() { + ABORT_IF(-1 == sem_post(&sem_), "Could not post to semaphore"); + } + +private: + sem_t sem_; +}; + +inline void WaitSemaphore(Semaphore &semaphore) { semaphore.wait(); } + +#else +typedef boost::interprocess::interprocess_semaphore Semaphore; + +inline void WaitSemaphore(Semaphore &on) { + while (1) { + try { + on.wait(); + break; + } catch (boost::interprocess::interprocess_exception &e) { + if (e.get_native_error() != EINTR) { + throw; + } + } + } +} + +#endif // Apple + +/** + * Producer consumer queue safe for multiple producers and multiple consumers. + * T must be default constructable and have operator=. + * The value is copied twice for Consume(T &out) or three times for Consume(), + * so larger objects should be passed via pointer. + * Strong exception guarantee if operator= throws. Undefined if semaphores + * throw. + */ +template class PCQueue { +public: + explicit PCQueue(size_t size) + : empty_(size), used_(0), storage_(new T[size]), + end_(storage_.get() + size), produce_at_(storage_.get()), + consume_at_(storage_.get()) {} + + // Add a value to the queue. + void Produce(const T &val) { + WaitSemaphore(empty_); + { + std::lock_guard produce_lock(produce_at_mutex_); + try { + *produce_at_ = val; + } catch (...) { + empty_.post(); + throw; + } + if (++produce_at_ == end_) + produce_at_ = storage_.get(); + } + used_.post(); + } + + // Add a value to the queue, but swap it into place. + void ProduceSwap(T &val) { + WaitSemaphore(empty_); + { + std::lock_guard produce_lock(produce_at_mutex_); + try { + std::swap(*produce_at_, val); + } catch (...) { + empty_.post(); + throw; + } + if (++produce_at_ == end_) + produce_at_ = storage_.get(); + } + used_.post(); + } + + // Consume a value, assigning it to out. + T &Consume(T &out) { + WaitSemaphore(used_); + { + std::lock_guard consume_lock(consume_at_mutex_); + try { + out = *consume_at_; + } catch (...) 
{ + used_.post(); + throw; + } + if (++consume_at_ == end_) + consume_at_ = storage_.get(); + } + empty_.post(); + return out; + } + + // Consume a value, swapping it to out. + T &ConsumeSwap(T &out) { + WaitSemaphore(used_); + { + std::lock_guard consume_lock(consume_at_mutex_); + try { + std::swap(out, *consume_at_); + } catch (...) { + used_.post(); + throw; + } + if (++consume_at_ == end_) + consume_at_ = storage_.get(); + } + empty_.post(); + return out; + } + + // Convenience version of Consume that copies the value to return. + // The other version is faster. + T Consume() { + T ret; + Consume(ret); + return ret; + } + +private: + // Number of empty spaces in storage_. + Semaphore empty_; + // Number of occupied spaces in storage_. + Semaphore used_; + + std::unique_ptr storage_; + + T *const end_; + + // Index for next write in storage_. + T *produce_at_; + std::mutex produce_at_mutex_; + + // Index for next read from storage_. + T *consume_at_; + std::mutex consume_at_mutex_; +}; + +template struct UnboundedPage { + UnboundedPage() : next(nullptr) {} + UnboundedPage *next; + T entries[1023]; +}; + +template class UnboundedSingleQueue { +public: + UnboundedSingleQueue() : valid_(0) { + SetFilling(new UnboundedPage()); + SetReading(filling_); + } + + void Produce(T &&val) { + if (filling_current_ == filling_end_) { + UnboundedPage *next = new UnboundedPage(); + filling_->next = next; + SetFilling(next); + } + *(filling_current_++) = std::move(val); + valid_.post(); + } + + void Produce(const T &val) { Produce(T(val)); } + + T &Consume(T &out) { + WaitSemaphore(valid_); + if (reading_current_ == reading_end_) { + SetReading(reading_->next); + } + out = std::move(*(reading_current_++)); + return out; + } + + // Warning: very much a no-guarantees race-condition-rich implementation! + // But sufficient for our specific purpose: The single thread that consumes + // is also the only one that checks Empty, and knows that it's racing. 
+ bool Empty() const { return reading_current_ == filling_current_; } + +private: + void SetFilling(UnboundedPage<T> *to) { + filling_ = to; + filling_current_ = to->entries; + filling_end_ = filling_current_ + sizeof(to->entries) / sizeof(T); + } + void SetReading(UnboundedPage<T> *to) { + reading_.reset(to); + reading_current_ = to->entries; + reading_end_ = reading_current_ + sizeof(to->entries) / sizeof(T); + } + + Semaphore valid_; + + UnboundedPage<T> *filling_; + + std::unique_ptr<UnboundedPage<T>> reading_; + + T *filling_current_; + T *filling_end_; + T *reading_current_; + T *reading_end_; + + UnboundedSingleQueue(const UnboundedSingleQueue &) = delete; + UnboundedSingleQueue &operator=(const UnboundedSingleQueue &) = delete; +}; + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_PCQUEUE_H_ diff --git a/src/translator/request.cpp b/src/translator/request.cpp new file mode 100644 index 000000000..0d02c03ac --- /dev/null +++ b/src/translator/request.cpp @@ -0,0 +1,93 @@ +#include "request.h" + +#include "definitions.h" +#include "translation_result.h" + +#include "common/logging.h" + +#include <atomic> + +namespace marian { +namespace bergamot { + +Request::Request(unsigned int Id, int lineNumberBegin, + std::vector<Ptr<Vocab const>> &vocabs, std::string &&source, + Segments &&segments, + std::vector<TokenRanges> &&sourceAlignments, + std::promise<TranslationResult> translationResultPromise) + : Id_(Id), lineNumberBegin_(lineNumberBegin), vocabs_(&vocabs), + source_(std::move(source)), segments_(std::move(segments)), + sourceAlignments_(std::move(sourceAlignments)), + response_(std::move(translationResultPromise)) { + + counter_ = segments_.size(); + histories_.resize(segments_.size(), nullptr); +} + +size_t Request::lineNumberBegin() const { return lineNumberBegin_; } +size_t Request::numSegments() const { return segments_.size(); } + +size_t Request::segmentTokens(size_t index) const { + return (segments_[index].size()); +} + +Segment Request::getSegment(size_t index) const { return segments_[index]; } + +void
Request::processHistory(size_t index, Ptr<History> history) {
+  // Concurrently called by multiple workers as a history from translation is
+  // ready. The container storing histories is set with the value obtained.
+  histories_[index] = history;
+
+  // If this is the last history to arrive, completeRequest is called, which
+  // sets the value of the promise.
+  if (--counter_ == 0) {
+    completeRequest();
+  }
+}
+
+void Request::completeRequest() {
+  // Request no longer needs to hold the content, can transfer it to
+  // TranslationResult.
+  TranslationResult translation_result(std::move(source_), std::move(segments_),
+                                       std::move(sourceAlignments_),
+                                       std::move(histories_), *vocabs_);
+  LOG(info, "Last translation in. Closing request.");
+  response_.set_value(translation_result);
+}
+
+bool Request::operator<(const Request &b) const {
+  // Among Requests, only the sequence id is used for obtaining priority.
+  return Id_ < b.Id_;
+}
+
+RequestSentence::RequestSentence(size_t index, Ptr<Request> request)
+    : index_(index), request_(request) {}
+
+size_t RequestSentence::numTokens() const {
+  return (request_->segmentTokens(index_));
+}
+
+size_t RequestSentence::lineNumber() const {
+  return (request_->lineNumberBegin() + index_);
+}
+
+void RequestSentence::completeSentence(Ptr<History> history) {
+  // Relays completeSentence into request's processHistory, using index
+  // information.
+  request_->processHistory(index_, history);
+}
+
+Segment RequestSentence::getUnderlyingSegment() const {
+  return request_->getSegment(index_);
+}
+
+bool operator<(const RequestSentence &a, const RequestSentence &b) {
+  // Operator overload for usage in priority-queue / set.
+  if (a.request_ == b.request_) {
+    return a.index_ < b.index_;
+  }
+  return a.request_ < b.request_;
+}
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/request.h b/src/translator/request.h
new file mode 100644
index 000000000..6f268ba1c
--- /dev/null
+++ b/src/translator/request.h
@@ -0,0 +1,114 @@
+//
+// Defines:
+//
+// Request: holds the input blob of text, Segments (vector<Words>) which are
+// to go to the batching mechanism and alignments between the processed
+// segments and the input blob (sourceAlignments). In addition, Request takes
+// care of the barrier which fires when all the Segments in a request are done
+// translating by the workers (BatchTranslator). Request is to be extended with
+// notions of Priority (sequence, user-given).
+//
+// RequestSentence: is a tuple of (index, Request*). This provides the
+// batching mechanism access to the segment within the request. The backref to
+// Request allows triggering the barrier upon completion of the last
+// sentence by a worker.
+//
+// PCItem: is a vector of RequestSentences and a batchNumber, which is what the
+// PCQueue holds. The batches are constructed from segments returned by a
+// RequestSentence. Can be enhanced with paddingSize, countTokens eventually for
+// logging.
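The poison-item convention that the PCItem description below relies on — a default-constructed item with `batchNumber == -1` tells a consuming worker to exit — can be sketched with a minimal, self-contained stand-in (a hypothetical simplification: plain ints replace RequestSentences, and no marian types are used):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal stand-in mirroring the PCItem poison convention: a
// default-constructed item carries batchNumber == -1 and tells a
// consumer to stop. Hypothetical simplification of the real struct.
struct Item {
  int batchNumber;
  std::vector<int> sentences; // stands in for RequestSentences

  Item() : batchNumber(-1) {} // default-constructed == poison
  Item(int n, std::vector<int> &&s)
      : batchNumber(n), sentences(std::move(s)) {}

  bool isPoison() const { return batchNumber == -1; }
};

// A consumer drains items until it encounters poison, returning how
// many real batches it processed before stopping.
int consumeUntilPoison(const std::vector<Item> &queue) {
  int processed = 0;
  for (const auto &item : queue) {
    if (item.isPoison())
      break;
    ++processed;
  }
  return processed;
}
```

This matches how shutdown works elsewhere in the patch: one poison item is produced per worker, so every consuming thread wakes up and exits its loop.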
+
+#ifndef SRC_BERGAMOT_REQUEST_H_
+#define SRC_BERGAMOT_REQUEST_H_
+
+#include "definitions.h"
+#include "translation_result.h"
+
+#include "data/types.h"
+#include "translator/beam_search.h"
+
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+
+class Request {
+private:
+  unsigned int Id_;
+  int lineNumberBegin_;
+  std::string source_;
+  std::atomic<int> counter_;
+  std::vector<Ptr<Vocab const>> *vocabs_;
+
+  Segments segments_;
+  std::vector<TokenRanges> sourceAlignments_;
+  std::vector<Ptr<History>> histories_;
+
+  std::promise<TranslationResult> response_;
+
+public:
+  Request(unsigned int Id, int lineNumberBegin,
+          std::vector<Ptr<Vocab const>> &vocabs_, std::string &&source,
+          Segments &&segments, std::vector<TokenRanges> &&sourceAlignments,
+          std::promise<TranslationResult> translationResultPromise);
+
+  // Obtain the count of tokens in the segment corresponding to index. Used to
+  // insert sentences from multiple requests into the corresponding size
+  // bucket.
+  size_t segmentTokens(size_t index) const;
+
+  // Obtain number of segments in a request.
+  size_t numSegments() const;
+  size_t lineNumberBegin() const;
+
+  // Obtains segment corresponding to index to create a batch of segments among
+  // several requests.
+  Segment getSegment(size_t index) const;
+
+  // For notions of priority among requests (used to enable priority in
+  // Batcher).
+  bool operator<(const Request &request) const;
+
+  // Processes a history obtained after translating in a heterogeneous batch
+  // compiled from requests.
+  void processHistory(size_t index, Ptr<History> history);
+
+  // On completion of last segment, sets value of the promise.
+  void completeRequest();
+};
+
+class RequestSentence {
+private:
+  size_t index_;
+  Ptr<Request> request_;
+
+public:
+  RequestSentence(size_t, Ptr<Request>);
+  size_t numTokens() const;
+  size_t lineNumber() const;
+  Segment getUnderlyingSegment() const;
+  void completeSentence(Ptr<History> history);
+  friend bool operator<(const RequestSentence &a, const RequestSentence &b);
+};
+
+typedef std::vector<RequestSentence> RequestSentences;
+
+struct PCItem {
+  int batchNumber;
+  RequestSentences sentences;
+
+  // PCItem should be default constructible for PCQueue. Default constructed
+  // element is poison.
+  PCItem() : batchNumber(-1) {}
+
+  // PCItem constructor to construct a legit PCItem.
+  explicit PCItem(int batchNumber, RequestSentences &&sentences)
+      : batchNumber(batchNumber), sentences(std::move(sentences)) {}
+
+  // Convenience function to determine poison.
+  bool isPoison() { return (batchNumber == -1); }
+};
+
+} // namespace bergamot
+} // namespace marian
+
+#endif // SRC_BERGAMOT_REQUEST_H_
diff --git a/src/translator/sanelogging.h b/src/translator/sanelogging.h
new file mode 100644
index 000000000..21f70dda8
--- /dev/null
+++ b/src/translator/sanelogging.h
@@ -0,0 +1,44 @@
+#ifndef SRC_BERGAMOT_SANELOGGING_H_
+#define SRC_BERGAMOT_SANELOGGING_H_
+
+#include "spdlog/spdlog.h"
+#include
+
+namespace marian {
+
+#define PLOG(worker, level, ...)
+#define _PLOG(worker, level, ...) checkedPLog(worker, #level, __VA_ARGS__)
+
+template <class... Args>
+void checkedPLog(std::string logger, std::string level, Args...
args) {
+  Logger log = spdlog::get(logger);
+  if (!log) {
+    try {
+      log = spdlog::daily_logger_st(logger, "logs/" + logger + ".log");
+    } catch (const spdlog::spdlog_ex &ex) {
+      std::cout << "Log initialization failed: " << ex.what() << std::endl;
+    }
+  }
+
+  if (level == "trace")
+    log->trace(args...);
+  else if (level == "debug")
+    log->debug(args...);
+  else if (level == "info")
+    log->info(args...);
+  else if (level == "warn")
+    log->warn(args...);
+  else if (level == "error")
+    log->error(args...);
+  else if (level == "critical")
+    log->critical(args...);
+  else {
+    log->warn("Unknown log level '{}' for logger '{}'", level, logger);
+  }
+  // Not required when threads clean-exit.
+  log->flush();
+}
+
+} // namespace marian
+
+#endif // SRC_BERGAMOT_SANELOGGING_H_
diff --git a/src/translator/service.cpp b/src/translator/service.cpp
new file mode 100644
index 000000000..c9260812d
--- /dev/null
+++ b/src/translator/service.cpp
@@ -0,0 +1,99 @@
+#include "service.h"
+#include "definitions.h"
+#include "sanelogging.h"
+
+#include "utils.h"
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+
+Service::Service(Ptr<Options> options)
+    : requestId_(0), batchNumber_(0),
+      numWorkers_(options->get<int>("cpu-threads")), text_processor_(options),
+      batcher_(options), pcqueue_(2 * options->get<int>("cpu-threads")) {
+
+  vocabs_ = loadVocabularies(options);
+  workers_.reserve(numWorkers_);
+
+  for (int i = 0; i < numWorkers_; i++) {
+    marian::DeviceId deviceId(i, DeviceType::cpu);
+    workers_.emplace_back(deviceId, pcqueue_, options);
+  }
+}
+
+std::future<TranslationResult> Service::translateWithCopy(std::string input) {
+  return translate(std::move(input));
+}
+
+std::future<TranslationResult> Service::translate(std::string &&input) {
+  // Takes in a blob of text. Segments and std::vector<TokenRanges> are
+  // extracted from the input (blob of text) and used to construct a Request
+  // along with a promise. The promise's value is set by the worker completing
+  // the request.
+  //
+  // Batcher, which currently runs on the main thread, constructs batches out
+  // of a single request (at the moment) and adds them into a Producer-Consumer
+  // queue holding a bunch of requestSentences used to construct batches.
+  // TODO(jerin): Make asynchronous and compile from multiple requests.
+  //
+  // Returns the future corresponding to the promise.
+
+  Segments segments;
+  std::vector<TokenRanges> sourceAlignments;
+  text_processor_.query_to_segments(input, segments, sourceAlignments);
+
+  std::promise<TranslationResult> translationResultPromise;
+  auto future = translationResultPromise.get_future();
+
+  Ptr<Request> request = New<Request>(
+      requestId_++, /* lineNumberBegin = */ 0, vocabs_, std::move(input),
+      std::move(segments), std::move(sourceAlignments),
+      std::move(translationResultPromise));
+
+  for (size_t i = 0; i < request->numSegments(); i++) {
+    RequestSentence requestSentence(i, request);
+    batcher_.addSentenceWithPriority(requestSentence);
+  }
+
+  int numSentences;
+  do {
+    RequestSentences batchSentences;
+    batcher_.cleaveBatch(batchSentences);
+    numSentences = batchSentences.size();
+
+    if (numSentences > 0) {
+      PCItem pcitem(batchNumber_++, std::move(batchSentences));
+      pcqueue_.ProduceSwap(pcitem);
+    }
+
+    if (batchNumber_ % 500 == 0) {
+      LOG(info, "Queuing batch {}", batchNumber_);
+    }
+  } while (numSentences > 0);
+
+  return future;
+}
+
+void Service::stop() {
+  int counter = 0;
+  for (auto &worker : workers_) {
+    PCItem pcitem;
+    pcqueue_.ProduceSwap(pcitem);
+    ++counter;
+  }
+
+  counter = 0;
+  for (auto &worker : workers_) {
+    worker.join();
+    ++counter;
+  }
+
+  workers_.clear(); // Takes care of idempotency.
+}
+
+Service::~Service() { stop(); }
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/service.h b/src/translator/service.h
new file mode 100644
index 000000000..519975445
--- /dev/null
+++ b/src/translator/service.h
@@ -0,0 +1,44 @@
+#ifndef SRC_BERGAMOT_SERVICE_H_
+#define SRC_BERGAMOT_SERVICE_H_
+
+#include "batch_translator.h"
+#include "batcher.h"
+#include "pcqueue.h"
+#include "textops.h"
+#include "translation_result.h"
+
+#include
+#include
+
+#include "data/types.h"
+
+namespace marian {
+namespace bergamot {
+
+class Service {
+public:
+  explicit Service(Ptr<Options> options);
+  std::future<TranslationResult> translateWithCopy(std::string input);
+  std::future<TranslationResult> translate(std::string &&input);
+  void stop();
+  Ptr<Vocab const> sourceVocab() const { return vocabs_.front(); };
+  Ptr<Vocab const> targetVocab() const { return vocabs_.back(); };
+  ;
+  ~Service();
+
+private:
+  unsigned int requestId_;
+  unsigned int batchNumber_;
+  int numWorkers_;
+
+  std::vector<Ptr<Vocab const>> vocabs_;
+  TextProcessor text_processor_;
+  Batcher batcher_;
+  PCQueue<PCItem> pcqueue_;
+  std::vector<BatchTranslator> workers_;
+};
+
+} // namespace bergamot
+} // namespace marian
+
+#endif // SRC_BERGAMOT_SERVICE_H_
diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp
new file mode 100644
index 000000000..55f22dab8
--- /dev/null
+++ b/src/translator/textops.cpp
@@ -0,0 +1,135 @@
+#include "textops.h"
+#include "common/timer.h"
+#include "utils.h"
+#include
+#include
+#include
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+
+SentenceSplitter::SentenceSplitter(marian::Ptr<Options> options)
+    : options_(options) {
+
+  std::string smode_str = options_->get<std::string>("ssplit-mode", "");
+  mode_ = string2splitmode(smode_str);
+  std::string ssplit_prefix_file =
+      options_->get<std::string>("ssplit-prefix-file", "");
+
+  if (ssplit_prefix_file.size()) {
+    ssplit_prefix_file = marian::cli::interpolateEnvVars(ssplit_prefix_file);
+
+    LOG(info, "Loading protected prefixes for sentence splitting from {}",
+        ssplit_prefix_file);
+
+
ssplit_.load(ssplit_prefix_file);
+  } else {
+    LOG(warn, "Missing list of protected prefixes for sentence splitting. "
+              "Set with --ssplit-prefix-file.");
+  }
+}
+
+ug::ssplit::SentenceStream
+SentenceSplitter::createSentenceStream(const string_view &input) {
+  pcrecpp::StringPiece spiece(input.begin(), input.size());
+  return std::move(ug::ssplit::SentenceStream(spiece, this->ssplit_, mode_));
+}
+
+ug::ssplit::SentenceStream::splitmode
+SentenceSplitter::string2splitmode(const std::string &m) {
+  typedef ug::ssplit::SentenceStream::splitmode splitmode;
+  // @TODO: throw Exception on error
+  if (m == "sentence" || m == "Sentence")
+    return splitmode::one_sentence_per_line;
+  if (m == "paragraph" || m == "Paragraph")
+    return splitmode::one_paragraph_per_line;
+  if (m != "wrapped_text" && m != "WrappedText" && m != "wrappedText") {
+    LOG(warn, "Ignoring unknown text input format specification: {}.", m);
+  }
+  return splitmode::wrapped_text;
+}
+
+Tokenizer::Tokenizer(Ptr<Options> options) : inference_(true), addEOS_(false) {
+  vocabs_ = loadVocabularies(options);
+}
+
+Segment Tokenizer::tokenize(const string_view &snt, TokenRanges &tokenRanges) {
+  // TODO(jerin): Bunch of hardcode here, 1, 0, need to get rid of somehow.
+  return vocabs_[0]->encodePreservingSource(snt, tokenRanges, addEOS_,
+                                            inference_);
+}
+
+TextProcessor::TextProcessor(Ptr<Options> options)
+    : tokenizer_(options), sentence_splitter_(options) {
+  max_input_sentence_tokens_ = options->get<int>("max-input-sentence-tokens");
+  max_input_sentence_tokens_ =
+      max_input_sentence_tokens_ - 1; // Account for EOS
+  // Dirty assert, should do at configparse
+  assert(max_input_sentence_tokens_ > 0);
+}
+
+void TextProcessor::query_to_segments(const string_view &query,
+                                      Segments &segments,
+                                      std::vector<TokenRanges> &sourceRanges) {
+  auto buf = sentence_splitter_.createSentenceStream(query);
+  // pcrecpp::StringPiece snt;
+  string_view snt;
+
+  int sentencesProcessed{0};
+
+  while (buf >> snt) {
+    // LOG(info, "SNT: {}", snt);
+    string_view snt_string_view(snt.data(), snt.size());
+    TokenRanges snt_alignment;
+    timer::Timer spiece_timer;
+    Segment tokenized_sentence =
+        tokenizer_.tokenize(snt_string_view, snt_alignment);
+
+    // LOG(info, "Tokenization took {:.5f} seconds", spiece_timer.elapsed());
+    if (tokenized_sentence.size() > 0) {
+      if (tokenized_sentence.size() > max_input_sentence_tokens_) {
+        size_t offset;
+        for (offset = 0;
+             offset + max_input_sentence_tokens_ < tokenized_sentence.size();
+             offset += max_input_sentence_tokens_) {
+          auto start = tokenized_sentence.begin() + offset;
+          Segment segment(start, start + max_input_sentence_tokens_);
+          segment.push_back(tokenizer_.sourceEosId());
+          segments.push_back(segment);
+
+          auto astart = snt_alignment.begin() + offset;
+          TokenRanges segment_alignment(astart,
+                                        astart + max_input_sentence_tokens_);
+          sourceRanges.push_back(segment_alignment);
+        }
+
+        // The loop above always leaves a non-empty remainder; guard on the
+        // sentence length, not on max_input_sentence_tokens_, so that tails
+        // of sentences longer than twice the limit are not dropped.
+        if (offset < tokenized_sentence.size()) {
+          auto start = tokenized_sentence.begin() + offset;
+          Segment segment(start, tokenized_sentence.end());
+          segment.push_back(tokenizer_.sourceEosId());
+          segments.push_back(segment);
+
+          auto astart = snt_alignment.begin() + offset;
+          TokenRanges segment_alignment(astart, snt_alignment.end());
+
sourceRanges.push_back(segment_alignment);
+        }
+
+      } else {
+        timer::Timer push_timer;
+        tokenized_sentence.push_back(tokenizer_.sourceEosId());
+        segments.push_back(tokenized_sentence);
+        sourceRanges.push_back(snt_alignment);
+        // LOG(info, "Push took {:.5f} seconds", push_timer.elapsed());
+      }
+    }
+    ++sentencesProcessed;
+    if (sentencesProcessed % 10000 == 0) {
+      LOG(info, "Processed {}", sentencesProcessed);
+    }
+  }
+}
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/textops.h b/src/translator/textops.h
new file mode 100644
index 000000000..0b4ee6e5c
--- /dev/null
+++ b/src/translator/textops.h
@@ -0,0 +1,102 @@
+#ifndef SRC_BERGAMOT_TEXTOPS_H_
+#define SRC_BERGAMOT_TEXTOPS_H_
+
+#include "common/definitions.h"
+#include "common/logging.h"
+#include "common/options.h"
+#include "common/types.h" // missing in shortlist.h
+#include "common/utils.h"
+#include "data/sentencepiece_vocab.h"
+#include "data/shortlist.h"
+#include "definitions.h"
+#include "ssplit/ssplit.h"
+
+#include
+#include
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+
+class StringViewStream {
+private:
+  string_view text_;
+  string_view::iterator current_;
+
+public:
+  StringViewStream(const string_view &text) : text_(text) {
+    current_ = text_.begin();
+  }
+
+  bool operator>>(string_view &sentence_view) {
+    // Skip over newlines and leading whitespace; anything else is okay.
+    while (current_ != text_.end() &&
+           (*current_ == '\n' || *current_ == ' ' || *current_ == '\t')) {
+      ++current_;
+    }
+
+    string_view::iterator p = current_;
+    while (p != text_.end() && *p != '\n') {
+      ++p;
+    }
+
+    if (p == current_)
+      return false;
+
+    sentence_view = string_view(current_, p - current_);
+    current_ = p;
+    return true;
+  };
+};
+
+class SentenceSplitter {
+public:
+  explicit SentenceSplitter(Ptr<Options> options);
+  ug::ssplit::SentenceStream createSentenceStream(string_view const &input);
+
+private:
+  ug::ssplit::SentenceSplitter ssplit_;
+  Ptr<Options> options_;
+  ug::ssplit::SentenceStream::splitmode mode_;
+  ug::ssplit::SentenceStream::splitmode string2splitmode(const std::string &m);
+};
+
+class LineSplitter {
+public:
+  explicit LineSplitter(Ptr<Options> options) {
+    // Do nothing.
+  };
+  StringViewStream createSentenceStream(string_view const &input) {
+    return std::move(StringViewStream(input));
+  }
+};
+
+class Tokenizer {
+private:
+  std::vector<Ptr<Vocab const>> vocabs_;
+  bool inference_;
+  bool addEOS_;
+
+public:
+  explicit Tokenizer(Ptr<Options>);
+  Segment tokenize(const string_view &input, TokenRanges &tokenRanges);
+  Word sourceEosId() { return vocabs_.front()->getEosId(); };
+};
+
+class TextProcessor {
+private:
+  Tokenizer tokenizer_;
+  LineSplitter sentence_splitter_;
+  unsigned int max_input_sentence_tokens_;
+
+public:
+  explicit TextProcessor(Ptr<Options>);
+  void query_to_segments(const string_view &query, Segments &segments,
+                         std::vector<TokenRanges> &sourceRanges);
+};
+
+} // namespace bergamot
+} // namespace marian
+
+#endif // SRC_BERGAMOT_TEXTOPS_H_
diff --git a/src/translator/timer.h b/src/translator/timer.h
new file mode 100644
index 000000000..744038081
--- /dev/null
+++ b/src/translator/timer.h
@@ -0,0 +1,32 @@
+#ifndef __BERGAMOT_TIMER_H
+#define __BERGAMOT_TIMER_H
+
+// https://stackoverflow.com/a/19800231/4565794
+//
+// Careful: This won't work if the user changes the system time between
+// Timer() and the call to elapsed() if
+// !std::chrono::high_resolution_clock::is_steady -
which is the case on Linux!
+
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+class Timer {
+public:
+  Timer() : beg_(clock_::now()) {}
+  void reset() { beg_ = clock_::now(); }
+  double elapsed() const {
+    return std::chrono::duration_cast<second_>(clock_::now() - beg_).count();
+  }
+
+private:
+  typedef std::chrono::high_resolution_clock clock_;
+  typedef std::chrono::duration<double, std::ratio<1>> second_;
+  std::chrono::time_point<clock_> beg_;
+};
+
+} // namespace bergamot
+} // namespace marian
+
+#endif // __BERGAMOT_TIMER_H
diff --git a/src/translator/translation_result.cpp b/src/translator/translation_result.cpp
new file mode 100644
index 000000000..43b233eed
--- /dev/null
+++ b/src/translator/translation_result.cpp
@@ -0,0 +1,97 @@
+#include "translation_result.h"
+#include "common/logging.h"
+#include "data/alignment.h"
+
+#include
+
+namespace marian {
+namespace bergamot {
+
+TranslationResult::TranslationResult(std::string &&source, Segments &&segments,
+                                     std::vector<TokenRanges> &&sourceRanges,
+                                     Histories &&histories,
+                                     std::vector<Ptr<Vocab const>> &vocabs)
+    : source_(std::move(source)), sourceRanges_(std::move(sourceRanges)),
+      segments_(std::move(segments)), histories_(std::move(histories)),
+      vocabs_(&vocabs) {
+
+  // Process sourceMappings into sourceMappings_.
+  LOG(info, "Creating sourcemappings");
+  sourceMappings_.reserve(segments_.size());
+  for (size_t i = 0; i < segments_.size(); i++) {
+    string_view first = sourceRanges_[i].front();
+    string_view last = sourceRanges_[i].back();
+    int size = last.end() - first.begin();
+    sourceMappings_.emplace_back(first.data(), size);
+  }
+
+  // Compiles translations into a single std::string translation_.
+  // Current implementation uses += on std::string, multiple resizes.
+  // Stores ByteRanges as indices first, followed by conversion into
+  // string_views.
+  // TODO(jerin): Add token level string_views here as well.
+  LOG(info, "Decoding");
+  std::vector<std::pair<int, int>> translationRanges;
+  int offset{0}, end{0};
+  bool first{true};
+  for (auto &history : histories_) {
+    // TODO(jerin): Change hardcode of nBest = 1
+    NBestList onebest = history->nBest(1);
+
+    Result result = onebest[0]; // Expecting only one result.
+    Words words = std::get<0>(result);
+    std::string decoded = vocabs_->back()->decode(words);
+    // Compute the range before flipping the first-sentence flag; otherwise
+    // the first sentence's range is one byte too long.
+    if (first) {
+      first = false;
+    } else {
+      translation_ += " ";
+      ++offset; // account for the separating space
+    }
+
+    translation_ += decoded;
+    end = offset + decoded.size();
+    translationRanges.emplace_back(offset, end);
+    offset = end;
+  }
+
+  // Converting ByteRanges as indices into string_views.
+  LOG(info, "generating targetMappings");
+  targetMappings_.reserve(translationRanges.size());
+  for (auto &p : translationRanges) {
+    targetMappings_.emplace_back(&translation_[p.first], p.second - p.first);
+  }
+
+  // Finally, populate sentenceMappings_.
+  LOG(info, "generating SentenceMappings");
+  for (auto p = sourceMappings_.begin(), q = targetMappings_.begin();
+       p != sourceMappings_.end() && q != targetMappings_.end(); ++p, ++q) {
+    sentenceMappings_.emplace_back(*p, *q);
+  }
+}
+
+std::vector<int> TranslationResult::getAlignment(unsigned int index) {
+  Ptr<History> history = histories_[index];
+  NBestList onebest = history->nBest(1);
+  Result &result = onebest[0]; // Expecting only one result.
+  Words &words = std::get<0>(result);
+  auto &hypothesis = std::get<1>(result);
+
+  // soft alignment = P(src pos|trg pos) for each beam and batch index, stored
+  // in a flattened CPU-side array.
+  //
+  // Also used on QuickSAND boundary where beam and batch size is 1.
Then it is
+  // simply [t][s] -> P(s|t).
+  //
+  // typedef std::vector<std::vector<float>> SoftAlignment;
+  // [trg pos][beam depth * max src length * batch size]
+
+  auto softAlignment = hypothesis->tracebackAlignment();
+  auto hardAlignment = data::ConvertSoftAlignToHardAlign(softAlignment);
+  std::vector<int> alignment(words.size(), -1);
+  for (auto &p : hardAlignment) {
+    alignment[p.tgtPos] = p.srcPos;
+  }
+  return alignment;
+}
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h
new file mode 100644
index 000000000..b2cb393b9
--- /dev/null
+++ b/src/translator/translation_result.h
@@ -0,0 +1,64 @@
+#ifndef SRC_BERGAMOT_TRANSLATION_RESULT_H_
+#define SRC_BERGAMOT_TRANSLATION_RESULT_H_
+
+#include "data/types.h"
+#include "definitions.h"
+#include "translator/beam_search.h"
+
+#include
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+class TranslationResult {
+public:
+  TranslationResult(std::string &&source, Segments &&segments,
+                    std::vector<TokenRanges> &&sourceRanges,
+                    Histories &&histories,
+                    std::vector<Ptr<Vocab const>> &vocabs);
+
+  const Histories &getHistories() const { return histories_; }
+
+  // https://github.com/browsermt/bergamot-translator/blob/0200843ed7e5366f4143422c64fcd1837d9baca7/src/TranslationResult.h
+  const std::string &getOriginalText() const { return source_; }
+  const std::string &getTranslatedText() const { return translation_; }
+  typedef std::vector<std::pair<string_view, string_view>> SentenceMappings;
+  const SentenceMappings &getSentenceMappings() const {
+    return sentenceMappings_;
+  }
+
+  // Return the Quality scores of the translated text.
+  // Not implemented currently, commenting out.
+  // const QualityScore &getQualityScore() const { return qualityScore; }
+
+  // Provides a hard alignment between source and target words.
+  std::vector<int> getAlignment(unsigned int index);
+
+private:
+  std::string source_;
+  std::string translation_;
+
+  // Histories are currently required for interoperability with OutputPrinter
+  // and OutputCollector and hence comparisons with marian-decoder.
+  Histories histories_;
+
+  // Can be removed eventually.
+  Segments segments_;
+  std::vector<Ptr<Vocab const>> *vocabs_;
+
+  // string_views at the token level.
+  std::vector<TokenRanges> sourceRanges_;
+
+  // string_views at the sentence-level.
+  std::vector<string_view> sourceMappings_;
+  std::vector<string_view> targetMappings_;
+
+  // Adding the following to complete bergamot-translator spec, redundant with
+  // sourceMappings_ and targetMappings_.
+  SentenceMappings sentenceMappings_;
+};
+} // namespace bergamot
+} // namespace marian
+
+#endif // SRC_BERGAMOT_TRANSLATION_RESULT_H_
diff --git a/src/translator/utils.cpp b/src/translator/utils.cpp
new file mode 100644
index 000000000..ea4c5037c
--- /dev/null
+++ b/src/translator/utils.cpp
@@ -0,0 +1,31 @@
+#include "utils.h"
+
+#include
+
+namespace marian {
+namespace bergamot {
+
+std::vector<Ptr<Vocab const>> loadVocabularies(Ptr<Options> options) {
+  // @TODO: parallelize vocab loading for faster startup
+  auto vfiles = options->get<std::vector<std::string>>("vocabs");
+  // with the current setup, we need at least two vocabs: src and trg
+  ABORT_IF(vfiles.size() < 2, "Insufficient number of vocabularies.");
+  std::vector<Ptr<Vocab const>> vocabs(vfiles.size());
+  std::unordered_map<std::string, Ptr<Vocab>> vmap;
+  for (size_t i = 0; i < vocabs.size(); ++i) {
+    auto m = vmap.emplace(std::make_pair(vfiles[i], Ptr<Vocab>()));
+    if (m.second) { // new: load the vocab
+      m.first->second = New<Vocab>(options, i);
+      m.first->second->load(vfiles[i]);
+    }
+    vocabs[i] = m.first->second;
+  }
+  return vocabs;
+}
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/utils.h b/src/translator/utils.h
new file mode 100644
index 000000000..594d0cabd
--- /dev/null
+++ b/src/translator/utils.h
@@ -0,0 +1,20 @@
+#ifndef __BERGAMOT_UTILS_H
+#define __BERGAMOT_UTILS_H
+
+#include "common/options.h"
+#include "common/types.h"
+#include "data/vocab.h"
+#include "translator/history.h"
+
+#include
+#include
+
+namespace marian {
+namespace bergamot {
+
+std::vector<Ptr<Vocab const>> loadVocabularies(Ptr<Options> options);
+
+} // namespace bergamot
+} // namespace marian
+
+#endif // __BERGAMOT_UTILS_H

From d786f2554ea8cf362211d4766231b53745a97840 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Wed, 20 Jan 2021 19:14:34 +0000
Subject: [PATCH 018/442] Bumping marian with sentencepiece capable fork

Modifications to SentencePiece are necessary to provide token level
string_views. This commit changes marian to an alternate branch which has
the feature incorporated.

---
 3rd_party/marian-dev | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev
index 69894793e..96d5a712d 160000
--- a/3rd_party/marian-dev
+++ b/3rd_party/marian-dev
@@ -1 +1 @@
-Subproject commit 69894793ebd93256d824a1590924780a6d54cae8
+Subproject commit 96d5a712d3b8bc56f120ba5220365f955719f4d4

From bde90947285db4cad7da50ce89ac590dfb89dea3 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Wed, 20 Jan 2021 19:52:34 +0000
Subject: [PATCH 019/442] Updating CMakeLists to build main

CMakeLists have been modified with the necessary includes to add
browsermt/mts@nuke files to the bergamot-translator library. In addition,
this adds the ssplit dependency and the corresponding includes.

Intel MKL fails on compilation, unable to find libraries. To solve this,
3rd_party/CMakeLists.txt is modified with @ug's fixes to propagate
variables (EXT_LIBS, etc.) at a library level.
--- 3rd_party/CMakeLists.txt | 24 ++++++++++++++++++++++-- CMakeLists.txt | 5 +++++ src/translator/CMakeLists.txt | 22 ++++++++++++++++++++-- 3 files changed, 47 insertions(+), 4 deletions(-) diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt index 97bf94e05..a5aed0689 100644 --- a/3rd_party/CMakeLists.txt +++ b/3rd_party/CMakeLists.txt @@ -1,6 +1,26 @@ add_subdirectory(marian-dev) +add_subdirectory(ssplit-cpp) + +include_directories(ssplit-cpp/src) + +# Add include directories for marian target to be able to use it anywhere in the +# project without explicitly specifying its include directories. Once marian +# fixes this problem, it can be removed. -# Add include directories for marian target to be able to use it anywhere in the project without -# explicitly specifying its include directories. Once marian fixes this problem, it can be removed. get_property(INCDIRS DIRECTORY marian-dev/src PROPERTY INCLUDE_DIRECTORIES) target_include_directories(marian PUBLIC ${INCDIRS}) + + +get_property(INCLUDE_DIRECTORIES DIRECTORY . 
PROPERTY INCLUDE_DIRECTORIES) +set(INCLUDE_DIRECTORIES ${INCLUDE_DIRECTORIES} PARENT_SCOPE) + +# Required to enable MKL, at least +get_directory_property(EXT_LIBS DIRECTORY marian-dev DEFINITION EXT_LIBS) +set(EXT_LIBS ${EXT_LIBS} PARENT_SCOPE) + +# Compilation flags +get_directory_property(CMAKE_C_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_C_FLAGS) +get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX_FLAGS) +set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE) +set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) + diff --git a/CMakeLists.txt b/CMakeLists.txt index 68a075d5c..935cd1eab 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -14,7 +14,12 @@ set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") option(COMPILE_CUDA "Compile GPU version" OFF) option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) option(USE_STATIC_LIBS "Link statically against non-system libs" ON) +option(USE_MKL "Compile with MKL support" ON) add_subdirectory(3rd_party) + +# Adds the include directories set inside 3rd_party. 
+include_directories(${INCLUDE_DIRECTORIES})
+
 add_subdirectory(src)
 add_subdirectory(app)
diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt
index 08a82fcb5..16c99e7d6 100644
--- a/src/translator/CMakeLists.txt
+++ b/src/translator/CMakeLists.txt
@@ -1,11 +1,29 @@
 add_library(bergamot-translator STATIC
     AbstractTranslationModel.cpp
     TranslationModel.cpp
-    TranslationModelConfigToOptionsAdaptor.cpp)
+    TranslationModelConfigToOptionsAdaptor.cpp

-target_link_libraries(bergamot-translator marian)
+    # Following files added from browsermt/mts@nuke
+    textops.cpp
+    batch_translator.cpp
+    multifactor_priority.cpp
+    request.cpp
+    service.cpp
+    batcher.cpp
+    utils.cpp
+    translation_result.cpp
+)
+
+# Replacement app for marian-decoder from browsermt/mts@nuke
+add_executable(main main.cpp)
+set_target_properties(main PROPERTIES OUTPUT_NAME bergamot-cli RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}")
+target_compile_options(main PUBLIC ${ALL_WARNINGS})
+set(EXECUTABLES ${EXECUTABLES} main)
+target_link_libraries(main bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a)

+target_link_libraries(bergamot-translator marian)

 target_include_directories(bergamot-translator
     PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}
     PRIVATE ${CMAKE_SOURCE_DIR}
     PUBLIC ${CMAKE_SOURCE_DIR}/src)
+

From b25b2276e35cf7f0079a254793a042b714efdbcf Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Wed, 20 Jan 2021 20:10:19 +0000
Subject: [PATCH 020/442] Undoing LineSplitter, reverting SentenceSplitter.

A faster linesplitter added for benchmarks is removed in favour of @ug's
ssplit-cpp. NOTE: ssplit-cpp's regex-based implementation is slow for
one-line parses; this ideally needs to be improved in upstream ssplit-cpp
to trivially reduce to a faster newline-character-based split.
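The "trivially reduce to a faster newline-character-based split" mentioned in the commit message could look like the sketch below — a hedged illustration mirroring the removed StringViewStream, not the ssplit-cpp API (names are illustrative only):

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hedged sketch of a trivial newline-based splitter: skip leading
// newlines/whitespace, then take everything up to the next '\n'.
// No regex engine runs, and each result is a view into the input.
std::vector<std::string_view> splitLines(std::string_view text) {
  std::vector<std::string_view> out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    // Skip over newlines and leading whitespace.
    while (pos < text.size() &&
           (text[pos] == '\n' || text[pos] == ' ' || text[pos] == '\t'))
      ++pos;
    std::size_t end = text.find('\n', pos);
    if (end == std::string_view::npos)
      end = text.size();
    if (end > pos)
      out.push_back(text.substr(pos, end - pos));
    pos = end;
  }
  return out;
}
```

Like the removed StringViewStream, this returns views into the original buffer, so no per-sentence copies are made; the cost is a single linear scan.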
---
 src/translator/textops.cpp |  4 ++--
 src/translator/textops.h   | 43 +-------------------------------------
 2 files changed, 3 insertions(+), 44 deletions(-)

diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp
index 55f22dab8..add3b1026 100644
--- a/src/translator/textops.cpp
+++ b/src/translator/textops.cpp
@@ -74,8 +74,8 @@ void TextProcessor::query_to_segments(const string_view &query,
                                       Segments &segments,
                                       std::vector<TokenRanges> &sourceRanges) {
   auto buf = sentence_splitter_.createSentenceStream(query);
-  // pcrecpp::StringPiece snt;
-  string_view snt;
+  pcrecpp::StringPiece snt;
+  // string_view snt;

   int sentencesProcessed{0};

diff --git a/src/translator/textops.h b/src/translator/textops.h
index 0b4ee6e5c..5de54fdd5 100644
--- a/src/translator/textops.h
+++ b/src/translator/textops.h
@@ -19,37 +19,6 @@
 namespace marian {
 namespace bergamot {

-class StringViewStream {
-private:
-  string_view text_;
-  string_view::iterator current_;
-
-public:
-  StringViewStream(const string_view &text) : text_(text) {
-    current_ = text_.begin();
-  }
-
-  bool operator>>(string_view &sentence_view) {
-    // Skip over newlines and leading whitespace; anything else is okay.
-    while (current_ != text_.end() &&
-           (*current_ == '\n' || *current_ == ' ' || *current_ == '\t')) {
-      ++current_;
-    }
-
-    string_view::iterator p = current_;
-    while (p != text_.end() && *p != '\n') {
-      ++p;
-    }
-
-    if (p == current_)
-      return false;
-
-    sentence_view = string_view(current_, p - current_);
-    current_ = p;
-    return true;
-  };
-};
-
 class SentenceSplitter {
 public:
   explicit SentenceSplitter(Ptr<Options> options);
@@ -62,16 +31,6 @@ class SentenceSplitter {
   ug::ssplit::SentenceStream::splitmode string2splitmode(const std::string &m);
 };

-class LineSplitter {
-public:
-  explicit LineSplitter(Ptr<Options> options) {
-    // Do nothing.
- }; - StringViewStream createSentenceStream(string_view const &input) { - return std::move(StringViewStream(input)); - } -}; - class Tokenizer { private: std::vector> vocabs_; @@ -87,7 +46,7 @@ class Tokenizer { class TextProcessor { private: Tokenizer tokenizer_; - LineSplitter sentence_splitter_; + SentenceSplitter sentence_splitter_; unsigned int max_input_sentence_tokens_; public: From b3f1905a120caeff042faa4a7cc539e9fa495194 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 20:56:50 +0000 Subject: [PATCH 021/442] Adding documentation and example to service.h --- src/translator/service.h | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/src/translator/service.h b/src/translator/service.h index 519975445..8270a33a8 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -16,14 +16,27 @@ namespace marian { namespace bergamot { class Service { + + // Service exposes methods to translate an incoming blob of text to the + // Consumer of bergamot API. + // + // An example use of this API looks as follows: + // + // options = ...; + // service = Service(options); + // std::string input_blob = "Hello World"; + // std::future + // response = service.translate(std::move(input)_blob); + // response.wait(); + // TranslationResult result = response.get(); + public: explicit Service(Ptr options); std::future translateWithCopy(std::string input); std::future translate(std::string &&input); void stop(); - Ptr sourceVocab() const { return vocabs_.front(); }; - Ptr targetVocab() const { return vocabs_.back(); }; - ; + Ptr sourceVocab() const { return vocabs_.front(); } + Ptr targetVocab() const { return vocabs_.back(); } ~Service(); private: @@ -31,6 +44,15 @@ class Service { unsigned int batchNumber_; int numWorkers_; + // Consists of: + // 1. an instance of text-processing class (TextProcessor), + // 2. a Batcher // class which handles efficient batching by minimizing + // padding wasting compute. 
+ // 3. Multiple workers - which are instances of BatchTranslators are + // spawned in threads. The Batcher acts as a producer for a + // producer-consumer queue, with idle BatchTranslators requesting batches + // as they're ready. + std::vector> vocabs_; TextProcessor text_processor_; Batcher batcher_; From d3c707f73541879795940d113683ad72a1c2aa76 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 21:11:27 +0000 Subject: [PATCH 022/442] Enhancing service.h further --- src/translator/service.h | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/src/translator/service.h b/src/translator/service.h index 8270a33a8..982166eea 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -26,17 +26,22 @@ class Service { // service = Service(options); // std::string input_blob = "Hello World"; // std::future - // response = service.translate(std::move(input)_blob); + // response = service.translate(std::move(input_blob)); // response.wait(); // TranslationResult result = response.get(); public: explicit Service(Ptr options); + + // Constructs new string copying, calls translate internally. std::future translateWithCopy(std::string input); std::future translate(std::string &&input); + void stop(); + Ptr sourceVocab() const { return vocabs_.front(); } Ptr targetVocab() const { return vocabs_.back(); } + ~Service(); private: @@ -44,16 +49,23 @@ class Service { unsigned int batchNumber_; int numWorkers_; + // vocabs are used to construct a Request, which later uses it to construct + // TranslationResult (decode from words to string). + std::vector> vocabs_; + // Consists of: - // 1. an instance of text-processing class (TextProcessor), - // 2. a Batcher // class which handles efficient batching by minimizing + // + // 1. 
text-processing class (TextProcessor), which handles breaking a blob of + // text into sentences and providing them representated by finite + // vocabulary for further processing by hte neural machine translation. + // 2. a Batcher class which handles efficient batching by minimizing // padding wasting compute. // 3. Multiple workers - which are instances of BatchTranslators are - // spawned in threads. The Batcher acts as a producer for a - // producer-consumer queue, with idle BatchTranslators requesting batches - // as they're ready. + // spawned in separate threads. + // + // Batcher acts as a producer for a producer-consumer queue (pcqueue_), with + // idle BatchTranslators being consumers requesting batches as they're ready. - std::vector> vocabs_; TextProcessor text_processor_; Batcher batcher_; PCQueue pcqueue_; From 54a6c6ce8088ba1123d8f3e7a1518f367bad0cbb Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 21:18:20 +0000 Subject: [PATCH 023/442] Moving main (mts) to app/ Commit modifies the example test-code main-mts into the app folder, updating CMakeLists accordingly. 
--- app/CMakeLists.txt | 8 ++- app/main-mts.cpp | 58 ++++++++++++++++++++++ src/translator/CMakeLists.txt | 7 --- src/translator/main.cpp | 92 ----------------------------------- 4 files changed, 65 insertions(+), 100 deletions(-) create mode 100644 app/main-mts.cpp delete mode 100644 src/translator/main.cpp diff --git a/app/CMakeLists.txt b/app/CMakeLists.txt index f9698dc55..fcc03237e 100644 --- a/app/CMakeLists.txt +++ b/app/CMakeLists.txt @@ -1,3 +1,9 @@ add_executable(bergamot-translator-app main.cpp) - target_link_libraries(bergamot-translator-app PRIVATE bergamot-translator) + +# Replacement app for marian-decoder from browsermt/mts@nuke +add_executable(main main-mts.cpp) +set_target_properties(main PROPERTIES OUTPUT bergamot-cli RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}") +target_compile_options(main PUBLIC ${ALL_WARNINGS}) +set(EXECUTABLES ${EXECUTABLES} main) +target_link_libraries(main bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a) diff --git a/app/main-mts.cpp b/app/main-mts.cpp new file mode 100644 index 000000000..3de57b074 --- /dev/null +++ b/app/main-mts.cpp @@ -0,0 +1,58 @@ +#include +#include +#include + +#include "common/definitions.h" +#include "common/timer.h" +#include "common/utils.h" +#include "marian.h" +#include "translator/history.h" +#include "translator/output_collector.h" +#include "translator/output_printer.h" + +#include "translator/service.h" + +int main(int argc, char *argv[]) { + marian::ConfigParser cp(marian::cli::mode::translation); + + cp.addOption( + "--ssplit-prefix-file", "Bergamot Options", + "File with nonbreaking prefixes for sentence splitting."); + + cp.addOption("--ssplit-mode", "Server Options", + "[paragraph, sentence, wrapped_text]"); + + cp.addOption( + "--max-input-sentence-tokens", "Bergamot Options", + "Maximum input tokens to be processed in a single sentence.", 128); + + cp.addOption("--max-input-tokens", "Bergamot Options", + "Maximum input tokens in a batch. 
control for" + "Bergamot Queue", + 1024); + + // Launch service. + auto options = cp.parseOptions(argc, argv, true); + marian::bergamot::Service service(options); + + // Read a large input text blob from stdin + std::ostringstream std_input; + std_input << std::cin.rdbuf(); + std::string input = std_input.str(); + + LOG(info, "IO complete Translating input"); + // Wait on future until TranslationResult is complete + auto translation_result_future = service.translate(std::move(input)); + translation_result_future.wait(); + auto translation_result = translation_result_future.get(); + + // Obtain sentencemappings and print them as Proof of Concept. + for (auto &p : translation_result.getSentenceMappings()) { + std::cout << "[src] " << p.first << "\n"; + std::cout << "[tgt] " << p.second << "\n"; + } + + // Stop Service. + service.stop(); + return 0; +} diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 16c99e7d6..ce3193d41 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -14,13 +14,6 @@ add_library(bergamot-translator STATIC translation_result.cpp ) -# Replacement app for marian-decoder from browsermt/mts@nuke -add_executable(main main.cpp) -set_target_properties(main PROPERTIES OUTPUT bergamot-cli RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}") -target_compile_options(main PUBLIC ${ALL_WARNINGS}) -set(EXECUTABLES ${EXECUTABLES} main) -target_link_libraries(main bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a) - target_link_libraries(bergamot-translator marian) target_include_directories(bergamot-translator PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} diff --git a/src/translator/main.cpp b/src/translator/main.cpp deleted file mode 100644 index b3fb3f116..000000000 --- a/src/translator/main.cpp +++ /dev/null @@ -1,92 +0,0 @@ -#include -#include -#include - -#include "common/definitions.h" -#include "common/timer.h" -#include "common/utils.h" -#include "marian.h" -#include 
"translator/history.h" -#include "translator/output_collector.h" -#include "translator/output_printer.h" - -#include "service.h" - -void marian_decoder_minimal(const marian::Histories &histories, - marian::Ptr targetVocab, - marian::Ptr options) { - - bool doNbest = options->get("n-best"); - - auto collector = - marian::New(options->get("output")); - - // There is a dependency of vocabs here. - auto printer = marian::New(options, targetVocab); - if (options->get("quiet-translation")) - collector->setPrintingStrategy(marian::New()); - - for (auto &history : histories) { - std::stringstream best1; - std::stringstream bestn; - printer->print(history, best1, bestn); - collector->Write((long)history->getLineNum(), best1.str(), bestn.str(), - doNbest); - } -} - -int main(int argc, char *argv[]) { - marian::ConfigParser cp(marian::cli::mode::translation); - - cp.addOption( - "--ssplit-prefix-file", "Bergamot Options", - "File with nonbreaking prefixes for sentence splitting."); - - cp.addOption("--ssplit-mode", "Server Options", - "[paragraph, sentence, wrapped_text]"); - - cp.addOption( - "--max-input-sentence-tokens", "Bergamot Options", - "Maximum input tokens to be processed in a single sentence.", 128); - - cp.addOption("--max-input-tokens", "Bergamot Options", - "Maximum input tokens in a batch. control for" - "Bergamot Queue", - 1024); - - cp.addOption("--nbest", "Bergamot Options", - "NBest value used for decoding", 1); - - cp.addOption("--marian-decoder-alpha", "Bergamot Options", - "Run marian-decoder output printer code", false); - - // TODO(jerin): Add QE later. 
- // marian::qe::QualityEstimator::addOptions(cp); - - marian::timer::Timer decoderTimer; - - auto options = cp.parseOptions(argc, argv, true); - marian::bergamot::Service service(options); - - std::ostringstream std_input; - std_input << std::cin.rdbuf(); - std::string input = std_input.str(); - - LOG(info, "IO complete Translating input"); - auto translation_result_future = service.translate(std::move(input)); - translation_result_future.wait(); - auto translation_result = translation_result_future.get(); - if (options->get("marian-decoder-alpha")) { - marian_decoder_minimal(translation_result.getHistories(), - service.targetVocab(), options); - LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); - } else { - for (auto &p : translation_result.getSentenceMappings()) { - std::cout << "[src] " << p.first << "\n"; - std::cout << "[tgt] " << p.second << "\n"; - } - } - - service.stop(); - return 0; -} From caa03e1d9fbc62fe4295798a0ac668139ad30451 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 21:21:43 +0000 Subject: [PATCH 024/442] Removing unused timer.h --- src/translator/timer.h | 32 -------------------------------- 1 file changed, 32 deletions(-) delete mode 100644 src/translator/timer.h diff --git a/src/translator/timer.h b/src/translator/timer.h deleted file mode 100644 index 744038081..000000000 --- a/src/translator/timer.h +++ /dev/null @@ -1,32 +0,0 @@ -#ifndef __BERGAMOT_TIMER_H -#define __BERGAMOT_TIMER_H - -// https://stackoverflow.com/a/19800231/4565794 -// -// Careful: This won't work if the user changes his time between Timer() and -// the call to elapsed() if !std::chrono::high_resolution_clock::is_steady - -// which is the case on Linux! 
- -#include -#include - -namespace marian { -namespace bergamot { -class Timer { -public: - Timer() : beg_(clock_::now()) {} - void reset() { beg_ = clock_::now(); } - double elapsed() const { - return std::chrono::duration_cast - (clock_::now() - beg_).count(); } - -private: - typedef std::chrono::high_resolution_clock clock_; - typedef std::chrono::duration > second_; - std::chrono::time_point beg_; -}; - -} // namespace bergamot -} // namespace marian - -#endif // __BERGAMOT_TIMER_H From d6ec007df93ffac6c40bd8adc4db861d960ee1c1 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 20 Jan 2021 21:58:13 +0000 Subject: [PATCH 025/442] TranslationResult Docs Removed Alignments, too many questions and no concrete answers. Better off removing unused code. History is kept for now, for internal use. --- src/translator/translation_result.cpp | 25 ------------------------- src/translator/translation_result.h | 9 ++++----- 2 files changed, 4 insertions(+), 30 deletions(-) diff --git a/src/translator/translation_result.cpp b/src/translator/translation_result.cpp index 43b233eed..1c74314e3 100644 --- a/src/translator/translation_result.cpp +++ b/src/translator/translation_result.cpp @@ -68,30 +68,5 @@ TranslationResult::TranslationResult(std::string &&source, Segments &&segments, } } -std::vector TranslationResult::getAlignment(unsigned int index) { - Ptr history = histories_[index]; - NBestList onebest = history->nBest(1); - Result &result = onebest[0]; // Expecting only one result; - Words &words = std::get<0>(result); - auto &hypothesis = std::get<1>(result); - - // soft alignment = P(src pos|trg pos) for each beam and batch index, stored - // in a flattened CPU-side array - // - // Also used on QuickSAND boundary where beam and batch size is 1. 
Then it is - // simply [t][s] -> P(s|t) - // - // typedef std::vector> SoftAlignment; - // [trg pos][beam depth * max src length * batch size] - - auto softAlignment = hypothesis->tracebackAlignment(); - auto hardAlignment = data::ConvertSoftAlignToHardAlign(softAlignment); - std::vector alignment(words.size(), -1); - for (auto &p : hardAlignment) { - alignment[p.tgtPos] = p.srcPos; - } - return alignment; -} - } // namespace bergamot } // namespace marian diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h index b2cb393b9..27bfb370b 100644 --- a/src/translator/translation_result.h +++ b/src/translator/translation_result.h @@ -18,11 +18,9 @@ class TranslationResult { Histories &&histories, std::vector> &vocabs); - const Histories &getHistories() const { return histories_; } - - // https://github.com/browsermt/bergamot-translator/blob/0200843ed7e5366f4143422c64fcd1837d9baca7/src/TranslationResult.h const std::string &getOriginalText() const { return source_; } const std::string &getTranslatedText() const { return translation_; } + typedef std::vector> SentenceMappings; const SentenceMappings &getSentenceMappings() const { return sentenceMappings_; @@ -32,8 +30,8 @@ class TranslationResult { // Not implemented currently, commenting out. // const QualityScore &getQualityScore() const { return qualityScore; } - // Provides a hard alignment between source and target words. - std::vector getAlignment(unsigned int index); + // For development use to benchmark with marian-decoder. + const Histories &getHistories() const { return histories_; } private: std::string source_; @@ -41,6 +39,7 @@ class TranslationResult { // Histories are currently required for interoperability with OutputPrinter // and OutputCollector and hence comparisons with marian-decoder. + // Future hook to gain alignments. Histories histories_; // Can be removed eventually. 
From 4640ae409121bd01fb1f5eda3ee9764531ba6dc3 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 21 Jan 2021 00:29:53 +0000 Subject: [PATCH 026/442] Fixes copying around vocabs Vocabs was earlier loaded in each thread and copied several times. Modified this to be loaded only once in Service and reference used consistently later on. This change makes Tokenizer as a class rather moot, as there's only one private member and a function. Moved this into TextProcessor. SentenceSplitter, however remains a separate class. utils.{h,cpp} had only a single loadVocabularies function, which is at the moment required only in Service. Making loadVocabularies a function inside Service and getting rid of utils.*. --- src/translator/CMakeLists.txt | 1 - src/translator/batch_translator.cpp | 20 ++++++++++--------- src/translator/batch_translator.h | 4 ++-- src/translator/service.cpp | 29 ++++++++++++++++++++++----- src/translator/service.h | 6 ++++-- src/translator/textops.cpp | 26 ++++++++++-------------- src/translator/textops.h | 25 ++++++++--------------- src/translator/utils.cpp | 31 ----------------------------- src/translator/utils.h | 20 ------------------- 9 files changed, 60 insertions(+), 102 deletions(-) delete mode 100644 src/translator/utils.cpp delete mode 100644 src/translator/utils.h diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index ce3193d41..025ef3d9c 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -10,7 +10,6 @@ add_library(bergamot-translator STATIC request.cpp service.cpp batcher.cpp - utils.cpp translation_result.cpp ) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index f41fa590f..622162ca4 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -4,26 +4,27 @@ #include "data/text_input.h" #include "sanelogging.h" #include "translator/beam_search.h" -#include "utils.h" namespace marian { namespace bergamot { 
BatchTranslator::BatchTranslator(DeviceId const device, - PCQueue &pcqueue, Ptr options) - : device_(device), options_(options), pcqueue_(&pcqueue) { + PCQueue &pcqueue, + std::vector> &vocabs, + Ptr options) + : device_(device), options_(options), pcqueue_(&pcqueue), vocabs_(&vocabs) { thread_ = std::thread([this] { this->mainloop(); }); } void BatchTranslator::initGraph() { - vocabs_ = loadVocabularies(options_); if (options_->hasAndNotEmpty("shortlist")) { Ptr slgen; int srcIdx = 0, trgIdx = 1; - bool shared_vcb = vocabs_.front() == vocabs_.back(); - slgen_ = New( - options_, vocabs_.front(), vocabs_.back(), srcIdx, trgIdx, shared_vcb); + bool shared_vcb = vocabs_->front() == vocabs_->back(); + slgen_ = New(options_, vocabs_->front(), + vocabs_->back(), srcIdx, + trgIdx, shared_vcb); } graph_ = New(true); // always optimize @@ -72,7 +73,8 @@ void BatchTranslator::translate(RequestSentences &requestSentences, std::vector> subBatches; for (size_t j = 0; j < maxDims.size(); ++j) { - subBatches.emplace_back(New(batchSize, maxDims[j], vocabs_[j])); + subBatches.emplace_back( + New(batchSize, maxDims[j], vocabs_->at(j))); } std::vector words(maxDims.size(), 0); @@ -92,7 +94,7 @@ void BatchTranslator::translate(RequestSentences &requestSentences, auto batch = Ptr(new CorpusBatch(subBatches)); batch->setSentenceIds(sentenceIds); - auto trgVocab = vocabs_.back(); + auto trgVocab = vocabs_->back(); auto search = New(options_, scorers_, trgVocab); histories = std::move(search->search(graph_, batch)); diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index 638a1a971..069155efb 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -23,7 +23,7 @@ class BatchTranslator { public: BatchTranslator(DeviceId const device, PCQueue &pcqueue, - Ptr options); + std::vector> &vocabs, Ptr options); void join(); // convenience function for logging. 
TODO(jerin) @@ -37,7 +37,7 @@ class BatchTranslator { Ptr options_; DeviceId device_; - std::vector> vocabs_; + std::vector> *vocabs_; Ptr graph_; std::vector> scorers_; Ptr slgen_; diff --git a/src/translator/service.cpp b/src/translator/service.cpp index c9260812d..fa6e59767 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -2,7 +2,6 @@ #include "definitions.h" #include "sanelogging.h" -#include "utils.h" #include #include @@ -11,15 +10,16 @@ namespace bergamot { Service::Service(Ptr options) : requestId_(0), batchNumber_(0), - numWorkers_(options->get("cpu-threads")), text_processor_(options), - batcher_(options), pcqueue_(2 * options->get("cpu-threads")) { + numWorkers_(options->get("cpu-threads")), + vocabs_(std::move(loadVocabularies(options))), + text_processor_(vocabs_, options), batcher_(options), + pcqueue_(2 * options->get("cpu-threads")) { - vocabs_ = loadVocabularies(options); workers_.reserve(numWorkers_); for (int i = 0; i < numWorkers_; i++) { marian::DeviceId deviceId(i, DeviceType::cpu); - workers_.emplace_back(deviceId, pcqueue_, options); + workers_.emplace_back(deviceId, pcqueue_, vocabs_, options); } } @@ -95,5 +95,24 @@ void Service::stop() { Service::~Service() { stop(); } +// Internal function nobody used, only within service. 
+std::vector> loadVocabularies(Ptr options) { + // @TODO: parallelize vocab loading for faster startup + auto vfiles = options->get>("vocabs"); + // with the current setup, we need at least two vocabs: src and trg + ABORT_IF(vfiles.size() < 2, "Insufficient number of vocabularies."); + std::vector> vocabs(vfiles.size()); + std::unordered_map> vmap; + for (size_t i = 0; i < vocabs.size(); ++i) { + auto m = vmap.emplace(std::make_pair(vfiles[i], Ptr())); + if (m.second) { // new: load the vocab + m.first->second = New(options, i); + m.first->second->load(vfiles[i]); + } + vocabs[i] = m.first->second; + } + return vocabs; +} + } // namespace bergamot } // namespace marian diff --git a/src/translator/service.h b/src/translator/service.h index 982166eea..4069d1392 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -51,7 +51,7 @@ class Service { // vocabs are used to construct a Request, which later uses it to construct // TranslationResult (decode from words to string). - std::vector> vocabs_; + std::vector> vocabs_; // ORDER DEPENDENCY // Consists of: // @@ -66,12 +66,14 @@ class Service { // Batcher acts as a producer for a producer-consumer queue (pcqueue_), with // idle BatchTranslators being consumers requesting batches as they're ready. 
- TextProcessor text_processor_; + TextProcessor text_processor_; // ORDER DEPENDENCY Batcher batcher_; PCQueue pcqueue_; std::vector workers_; }; +std::vector> loadVocabularies(Ptr options); + } // namespace bergamot } // namespace marian diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp index add3b1026..80d262edb 100644 --- a/src/translator/textops.cpp +++ b/src/translator/textops.cpp @@ -1,6 +1,5 @@ #include "textops.h" #include "common/timer.h" -#include "utils.h" #include #include #include @@ -51,18 +50,16 @@ SentenceSplitter::string2splitmode(const std::string &m) { return splitmode::wrapped_text; } -Tokenizer::Tokenizer(Ptr options) : inference_(true), addEOS_(false) { - vocabs_ = loadVocabularies(options); -} - -Segment Tokenizer::tokenize(const string_view &snt, TokenRanges &tokenRanges) { +Segment TextProcessor::tokenize(const string_view &snt, + TokenRanges &tokenRanges) { // TODO(jerin): Bunch of hardcode here, 1, 0, need to get rid off somehow. - return vocabs_[0]->encodePreservingSource(snt, tokenRanges, addEOS_, - inference_); + return vocabs_->front()->encodePreservingSource( + snt, tokenRanges, /*addEOS=*/false, /*inference=*/true); } -TextProcessor::TextProcessor(Ptr options) - : tokenizer_(options), sentence_splitter_(options) { +TextProcessor::TextProcessor(std::vector> &vocabs, + Ptr options) + : vocabs_(&vocabs), sentence_splitter_(options) { max_input_sentence_tokens_ = options->get("max-input-sentence-tokens"); max_input_sentence_tokens_ = max_input_sentence_tokens_ - 1; // Account for EOS @@ -84,8 +81,7 @@ void TextProcessor::query_to_segments(const string_view &query, string_view snt_string_view(snt.data(), snt.size()); TokenRanges snt_alignment; timer::Timer spiece_timer; - Segment tokenized_sentence = - tokenizer_.tokenize(snt_string_view, snt_alignment); + Segment tokenized_sentence = tokenize(snt_string_view, snt_alignment); // LOG(info, "Tokenization took {:.5f} seconds", spiece_timer.elapsed()); if 
(tokenized_sentence.size() > 0) { @@ -96,7 +92,7 @@ void TextProcessor::query_to_segments(const string_view &query, offset += max_input_sentence_tokens_) { auto start = tokenized_sentence.begin() + offset; Segment segment(start, start + max_input_sentence_tokens_); - segment.push_back(tokenizer_.sourceEosId()); + segment.push_back(sourceEosId()); segments.push_back(segment); auto astart = snt_alignment.begin() + offset; @@ -108,7 +104,7 @@ void TextProcessor::query_to_segments(const string_view &query, if (offset < max_input_sentence_tokens_) { auto start = tokenized_sentence.begin() + offset; Segment segment(start, tokenized_sentence.end()); - segment.push_back(tokenizer_.sourceEosId()); + segment.push_back(sourceEosId()); segments.push_back(segment); auto astart = snt_alignment.begin() + offset; @@ -118,7 +114,7 @@ void TextProcessor::query_to_segments(const string_view &query, } else { timer::Timer push_timer; - tokenized_sentence.push_back(tokenizer_.sourceEosId()); + tokenized_sentence.push_back(sourceEosId()); segments.push_back(tokenized_sentence); sourceRanges.push_back(snt_alignment); // LOG(info, "Push took {:.5f} seconds", push_timer.elapsed()); diff --git a/src/translator/textops.h b/src/translator/textops.h index 5de54fdd5..5202f1b0c 100644 --- a/src/translator/textops.h +++ b/src/translator/textops.h @@ -31,28 +31,19 @@ class SentenceSplitter { ug::ssplit::SentenceStream::splitmode string2splitmode(const std::string &m); }; -class Tokenizer { -private: - std::vector> vocabs_; - bool inference_; - bool addEOS_; - +class TextProcessor { public: - explicit Tokenizer(Ptr); - Segment tokenize(const string_view &input, TokenRanges &tokenRanges); - Word sourceEosId() { return vocabs_.front()->getEosId(); }; -}; + explicit TextProcessor(std::vector> &vocabs, Ptr); + void query_to_segments(const string_view &query, Segments &segments, + std::vector &sourceRanges); -class TextProcessor { private: - Tokenizer tokenizer_; + Segment tokenize(const string_view 
&input, TokenRanges &tokenRanges); + Word sourceEosId() { return vocabs_->front()->getEosId(); } + + std::vector> *vocabs_; SentenceSplitter sentence_splitter_; unsigned int max_input_sentence_tokens_; - -public: - explicit TextProcessor(Ptr); - void query_to_segments(const string_view &query, Segments &segments, - std::vector &sourceRanges); }; } // namespace bergamot diff --git a/src/translator/utils.cpp b/src/translator/utils.cpp deleted file mode 100644 index ea4c5037c..000000000 --- a/src/translator/utils.cpp +++ /dev/null @@ -1,31 +0,0 @@ -#include "utils.h" - -#include - -namespace marian { -namespace bergamot { - - -std::vector> loadVocabularies( - Ptr options) { - // @TODO: parallelize vocab loading for faster startup - auto vfiles = options->get>("vocabs"); - // with the current setup, we need at least two vocabs: src and trg - ABORT_IF(vfiles.size() < 2, "Insufficient number of vocabularies."); - std::vector> vocabs(vfiles.size()); - std::unordered_map> vmap; - for (size_t i = 0; i < vocabs.size(); ++i) { - auto m = vmap.emplace(std::make_pair(vfiles[i], Ptr())); - if (m.second) { // new: load the vocab - m.first->second = New(options, i); - m.first->second->load(vfiles[i]); - } - vocabs[i] = m.first->second; - } - return vocabs; -} - - - -} // namespace bergamot -} // namespace marian diff --git a/src/translator/utils.h b/src/translator/utils.h deleted file mode 100644 index 594d0cabd..000000000 --- a/src/translator/utils.h +++ /dev/null @@ -1,20 +0,0 @@ -#ifndef __BERGAMOT_UTILS_H -#define __BERGAMOT_UTILS_H - -#include "common/options.h" -#include "common/types.h" -#include "data/vocab.h" -#include "translator/history.h" - -#include -#include - -namespace marian { -namespace bergamot { - -std::vector> loadVocabularies(Ptr options); - -} // namespace bergamot -} // namespace marian - -#endif // __BERGAMOT_UTILS_H From ea1a628cd2a894de7eed9d3a3111120612b49d53 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 21 Jan 2021 01:31:29 +0000 Subject: 
[PATCH 027/442] Neaten TextProcessor, add a bit of docs. - Truncating long sentences into those of a specified length for faster processing is now a separate function, for improved readability. - Changes doing push_back -> emplace_back at places to avoid copy. - query_to_segments is renamed as process. - Comments are added in an attempt to bring some sanity. --- src/translator/service.cpp | 2 +- src/translator/textops.cpp | 113 +++++++++++++++++-------------------- src/translator/textops.h | 25 +++++++- 3 files changed, 76 insertions(+), 64 deletions(-) diff --git a/src/translator/service.cpp b/src/translator/service.cpp index fa6e59767..4a5af301c 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -42,7 +42,7 @@ std::future Service::translate(std::string &&input) { Segments segments; std::vector sourceAlignments; - text_processor_.query_to_segments(input, segments, sourceAlignments); + text_processor_.process(input, segments, sourceAlignments); std::promise translationResultPromise; auto future = translationResultPromise.get_future(); diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp index 80d262edb..22b65e9c7 100644 --- a/src/translator/textops.cpp +++ b/src/translator/textops.cpp @@ -52,7 +52,6 @@ SentenceSplitter::string2splitmode(const std::string &m) { Segment TextProcessor::tokenize(const string_view &snt, TokenRanges &tokenRanges) { - // TODO(jerin): Bunch of hardcode here, 1, 0, need to get rid off somehow. 
return vocabs_->front()->encodePreservingSource( snt, tokenRanges, /*addEOS=*/false, /*inference=*/true); } @@ -60,70 +59,64 @@ Segment TextProcessor::tokenize(const string_view &snt, TextProcessor::TextProcessor(std::vector> &vocabs, Ptr options) : vocabs_(&vocabs), sentence_splitter_(options) { + max_input_sentence_tokens_ = options->get("max-input-sentence-tokens"); - max_input_sentence_tokens_ = - max_input_sentence_tokens_ - 1; // Account for EOS - // Dirty assert, should do at configparse - assert(max_input_sentence_tokens_ > 0); + max_input_sentence_tokens_ = max_input_sentence_tokens_ - 1; + ABORT_IF(max_input_sentence_tokens < 0, + "max-input-sentence-tokens cannot be < 0"); +} + +void TextProcessor::process(const string_view &query, Segments &segments, + std::vector &sourceRanges) { + + auto sentenceStream = sentence_splitter_.createSentenceStream(query); + pcrecpp::StringPiece sentenceStringPiece; + + while (sentenceStream >> sentenceStringPiece) { + string_view sentence(sentenceStringPiece.data(), + sentenceStringPiece.size()); + TokenRanges tokenRanges; + Segment segment = tokenize(sentence, tokenRanges); + + // There are some cases where SentencePiece or vocab returns no words + // after normalization. 0 prevents any empty entries from being added. + if (segment.size() > 0) { + // Truncate segment into max_input_size segments. 
+ truncate(segment, tokenRanges, segments, sourceRanges); + } + } } -void TextProcessor::query_to_segments(const string_view &query, - Segments &segments, - std::vector &sourceRanges) { - auto buf = sentence_splitter_.createSentenceStream(query); - pcrecpp::StringPiece snt; - // string_view snt; - - int sentencesProcessed{0}; - - while (buf >> snt) { - // LOG(info, "SNT: {}", snt); - string_view snt_string_view(snt.data(), snt.size()); - TokenRanges snt_alignment; - timer::Timer spiece_timer; - Segment tokenized_sentence = tokenize(snt_string_view, snt_alignment); - - // LOG(info, "Tokenization took {:.5f} seconds", spiece_timer.elapsed()); - if (tokenized_sentence.size() > 0) { - if (tokenized_sentence.size() > max_input_sentence_tokens_) { - int offset; - for (offset = 0; - offset + max_input_sentence_tokens_ < tokenized_sentence.size(); - offset += max_input_sentence_tokens_) { - auto start = tokenized_sentence.begin() + offset; - Segment segment(start, start + max_input_sentence_tokens_); - segment.push_back(sourceEosId()); - segments.push_back(segment); - - auto astart = snt_alignment.begin() + offset; - TokenRanges segment_alignment(astart, - astart + max_input_sentence_tokens_); - sourceRanges.push_back(segment_alignment); - } - - if (offset < max_input_sentence_tokens_) { - auto start = tokenized_sentence.begin() + offset; - Segment segment(start, tokenized_sentence.end()); - segment.push_back(sourceEosId()); - segments.push_back(segment); - - auto astart = snt_alignment.begin() + offset; - TokenRanges segment_alignment(astart, snt_alignment.end()); - sourceRanges.push_back(segment_alignment); - } - - } else { - timer::Timer push_timer; - tokenized_sentence.push_back(sourceEosId()); - segments.push_back(tokenized_sentence); - sourceRanges.push_back(snt_alignment); - // LOG(info, "Push took {:.5f} seconds", push_timer.elapsed()); - } +void TextProcessor::truncate(Segment &segment, TokenRanges &tokenRanges, + Segments &segments, + std::vector &sourceRanges) { 
+  if (segment.size() > max_input_sentence_tokens_) {
+    int offset;
+    // Loop as long as I can grab max_input_sentence_tokens_
+    for (offset = 0; offset + max_input_sentence_tokens_ < segment.size();
+         offset += max_input_sentence_tokens_) {
+      auto start = segment.begin() + offset;
+
+      segments.emplace_back(start, start + max_input_sentence_tokens_);
+      segments.back().push_back(sourceEosId());
+
+      auto astart = tokenRanges.begin() + offset;
+      sourceRanges.emplace_back(astart, astart + max_input_sentence_tokens_);
     }
-    ++sentencesProcessed;
-    if (sentencesProcessed % 10000 == 0) {
-      LOG(info, "Processed {}", sentencesProcessed);
+
+    if (offset < max_input_sentence_tokens_) {
+      auto start = segment.begin() + offset;
+      segments.emplace_back(start, segment.end());
+      segments.back().push_back(sourceEosId());
+
+      auto astart = tokenRanges.begin() + offset;
+      sourceRanges.emplace_back(astart, tokenRanges.end());
     }
+
+  } else {
+    segments.emplace_back(segment);
+    segments.back().push_back(sourceEosId());
+    sourceRanges.emplace_back(tokenRanges);
   }
 }

diff --git a/src/translator/textops.h b/src/translator/textops.h
index 5202f1b0c..e5c07b6b7 100644
--- a/src/translator/textops.h
+++ b/src/translator/textops.h
@@ -20,6 +20,10 @@ namespace marian {
 namespace bergamot {

 class SentenceSplitter {
+  // A wrapper around @ugermann's ssplit-cpp compiled from several places in
+  // mts. Constructed based on options. Used in TextProcessor below to create
+  // sentence-streams, which provide access to one sentence from a blob of
+  // text at a time.
 public:
   explicit SentenceSplitter(Ptr<Options> options);
   ug::ssplit::SentenceStream createSentenceStream(string_view const &input);
@@ -32,14 +36,29 @@ class SentenceSplitter {
 };

 class TextProcessor {
+  // TextProcessor handles loading the sentencepiece vocabulary and also
+  // contains an instance of sentence-splitter based on ssplit.
+  //
+  // Used in Service to convert an incoming blob of text to a vector of
+  // sentences (vector of words).
+  // In addition, the ByteRanges of the source-tokens in unnormalized text
+  // are provided as string_views.
 public:
   explicit TextProcessor(std::vector<Ptr<Vocab const>> &vocabs, Ptr<Options>);

-  void query_to_segments(const string_view &query, Segments &segments,
-                         std::vector<TokenRanges> &sourceRanges);
+
+  void process(const string_view &query, Segments &segments,
+               std::vector<TokenRanges> &sourceRanges);

 private:
+  // Tokenizes an input string and returns the corresponding Words. Loads the
+  // corresponding byte-ranges into tokenRanges.
   Segment tokenize(const string_view &input, TokenRanges &tokenRanges);
-  Word sourceEosId() { return vocabs_->front()->getEosId(); }
+
+  // Truncate a sentence into segments of at most max_input_sentence_tokens_.
+  void truncate(Segment &sentence, TokenRanges &tokenRanges,
+                Segments &segments, std::vector<TokenRanges> &sourceRanges);
+
+  // Shorthand, used only in truncate().
+  const Word sourceEosId() const { return vocabs_->front()->getEosId(); }

   std::vector<Ptr<Vocab const>> *vocabs_;
   SentenceSplitter sentence_splitter_;

From 9b18bd9ffcfcc3b918ef7533f1de128e7396d36a Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Thu, 21 Jan 2021 02:03:47 +0000
Subject: [PATCH 028/442] MTranslationResult, more comments

---
 src/translator/translation_result.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h
index 27bfb370b..fb5a42a09 100644
--- a/src/translator/translation_result.h
+++ b/src/translator/translation_result.h
@@ -18,9 +18,16 @@ class TranslationResult {
                     Histories &&histories,
                     std::vector<Ptr<Vocab const>> &vocabs);

+  // Returns const references to source and translated texts, for external
+  // consumption.
+  const std::string &getOriginalText() const { return source_; }
   const std::string &getTranslatedText() const { return translation_; }

+  // Mappings of string_views into source_ and translation_ are provided as
+  // pairs, for external consumption. Each entry corresponds
+  // to a (source-sentence, target-sentence) pair.
+  typedef std::vector<std::pair<string_view, string_view>> SentenceMappings;
   const SentenceMappings &getSentenceMappings() const {
     return sentenceMappings_;
   }
@@ -53,8 +60,9 @@ class TranslationResult {
   std::vector<string_view> sourceMappings_;
   std::vector<string_view> targetMappings_;

-  // Adding the following to complete bergamot-translator spec, redundant with
-  // sourceMappings_ and targetMappings_.
+  // Adding the following to complete the bergamot-translator spec; redundant
+  // while sourceMappings_ and targetMappings_ exist, or vice-versa.
+
   SentenceMappings sentenceMappings_;
 };
 } // namespace bergamot

From 12e7e2c650fdabfe9b815e2ad1e36a452d891318 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Thu, 21 Jan 2021 14:53:53 +0000
Subject: [PATCH 029/442] Fixing compile error; needs tests, CI

---
 src/translator/textops.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp
index 22b65e9c7..837ea7226 100644
--- a/src/translator/textops.cpp
+++ b/src/translator/textops.cpp
@@ -62,7 +62,7 @@ TextProcessor::TextProcessor(std::vector<Ptr<Vocab const>> &vocabs,
   max_input_sentence_tokens_ = options->get<int>("max-input-sentence-tokens");
   max_input_sentence_tokens_ = max_input_sentence_tokens_ - 1;

-  ABORT_IF(max_input_sentence_tokens < 0,
+  ABORT_IF(max_input_sentence_tokens_ < 0,
            "max-input-sentence-tokens cannot be < 0");
 }

From 80125e2789825b8ea05e2c7a71f398d69a538034 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Thu, 21 Jan 2021 14:54:30 +0000
Subject: [PATCH 030/442] Removing unused variable in batch_translator

---
 src/translator/batch_translator.cpp | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp
index 622162ca4..6380a00cc 100644
--- a/src/translator/batch_translator.cpp
+++ b/src/translator/batch_translator.cpp
@@ -19,7 +19,6 @@ BatchTranslator::BatchTranslator(DeviceId const device,

 void BatchTranslator::initGraph() {
   if (options_->hasAndNotEmpty("shortlist")) {
-    Ptr slgen;
     int srcIdx = 0, trgIdx = 1;
     bool shared_vcb = vocabs_->front() == vocabs_->back();
     slgen_ = New(options_, vocabs_->front(),

From 37143933a19c81cf3d2ce7785c77321aaab6e616 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Fri, 22 Jan 2021 11:29:32 +0000
Subject: [PATCH 031/442] CMakeLists improvements

Only the bergamot-translator library should be linked to the main target.
Any other library (marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a
pcre.a) should be linked to the bergamot-translator target inside the
src/translator folder.

---
 app/CMakeLists.txt            | 8 ++------
 src/translator/CMakeLists.txt | 2 +-
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/app/CMakeLists.txt b/app/CMakeLists.txt
index fcc03237e..6e71e9e27 100644
--- a/app/CMakeLists.txt
+++ b/app/CMakeLists.txt
@@ -1,9 +1,5 @@
 add_executable(bergamot-translator-app main.cpp)
 target_link_libraries(bergamot-translator-app PRIVATE bergamot-translator)

-# Replacement app for marian-decoder from browsermt/mts@nuke
-add_executable(main main-mts.cpp)
-set_target_properties(main PROPERTIES OUTPUT bergamot-cli RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}")
-target_compile_options(main PUBLIC ${ALL_WARNINGS})
-set(EXECUTABLES ${EXECUTABLES} main)
-target_link_libraries(main bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a)
+add_executable(service-cli main-mts.cpp)
+target_link_libraries(service-cli PRIVATE bergamot-translator)

diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt
index 025ef3d9c..24b9a7b85 100644
--- a/src/translator/CMakeLists.txt
+++ b/src/translator/CMakeLists.txt
@@ -13,7 +13,7 @@ add_library(bergamot-translator STATIC
   translation_result.cpp
 )

-target_link_libraries(bergamot-translator marian)
+target_link_libraries(bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a)

 target_include_directories(bergamot-translator
         PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}
         PRIVATE ${CMAKE_SOURCE_DIR}

From e75bd7eb57da3d0c407184d531911e95c1d2c23c Mon Sep
17 00:00:00 2001
From: Jerin Philip
Date: Fri, 22 Jan 2021 11:31:20 +0000
Subject: [PATCH 032/442] Adding vim temporary files to .gitignore

---
 .gitignore | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 .gitignore

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 000000000..e63aee1e1
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+# vim temporary files
+*.swp
+*.swo
+

From 3b6b9cd2bf2328a397366faa2305737240b8c854 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Fri, 22 Jan 2021 11:51:49 +0000
Subject: [PATCH 033/442] Updating README.md with instructions to run service-cli

---
 README.md | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index fbbbe7b46..52f60b287 100644
--- a/README.md
+++ b/README.md
@@ -13,5 +13,48 @@ $ make -j
 ```

-## Using Bergamot Translator
+## Usage
+
+### Bergamot Translator
+
 The build will generate the library that can be linked to any project. All
 the public header files are specified in the `src` folder.
+
+### `service-cli`
+
+An executable `service-cli` is generated by the build in the `app` folder and
+provides a command-line interface to the underlying translator. The models
+required to run the command line are available at
+[data.statmt.org/bergamot/models/](http://data.statmt.org/bergamot/models/).
+The following example uses an English-to-German tiny11 student model, available
+at:
+
+* [data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz](http://data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz)
+
+```bash
+MODEL_DIR=... # path to where the model files are.
+ARGS=(
+    -m $MODEL_DIR/model.intgemm.alphas.bin # Path to model file.
+    --vocabs
+      $MODEL_DIR/vocab.deen.spm # source-vocabulary
+      $MODEL_DIR/vocab.deen.spm # target-vocabulary
+
+    # The following increases speed through one-best decoding, shortlist and quantization.
+    --beam-size 1 --skip-cost --shortlist $MODEL_DIR/lex.s2t.gz 50 50 --int8shiftAlphaAll
+
+    # Number of CPU threads (workers to launch). Parallelizes over cores and improves speed.
+    --cpu-threads 4
+
+    # Hyperparameters: how many tokens to account for in a batch, and the
+    # maximum number of tokens in a sentence.
+    --max-input-sentence-tokens 1024 --max-input-tokens 1024
+
+    # Three modes are supported:
+    #   - sentence: One sentence per line.
+    #   - paragraph: One paragraph per line.
+    #   - wrapped text: Paragraphs are separated by an empty line.
+    --ssplit-mode paragraph
+
+)
+
+./app/service-cli "${ARGS[@]}" < path-to-input-file
+```

From c8fc004452d5a90fe9405fce65badb620080aa9e Mon Sep 17 00:00:00 2001
From: Abhishek Aggarwal
Date: Fri, 22 Jan 2021 12:44:08 +0100
Subject: [PATCH 034/442] Improved 3rd party header inclusion and library linking

---
 3rd_party/CMakeLists.txt      | 25 +++++--------------------
 src/translator/CMakeLists.txt |  3 ++-
 src/translator/textops.h      |  2 +-
 3 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt
index a5aed0689..6d5a5c926 100644
--- a/3rd_party/CMakeLists.txt
+++ b/3rd_party/CMakeLists.txt
@@ -1,26 +1,11 @@
 add_subdirectory(marian-dev)
 add_subdirectory(ssplit-cpp)
-include_directories(ssplit-cpp/src)
-
-# Add include directories for marian target to be able to use it anywhere in the
-# project without explicitly specifying its include directories. Once marian
-# fixes this problem, it can be removed.
-
+# Add include directories for 3rd party targets to be able to use them anywhere
+# in the project without explicitly specifying their include directories. Once
+# they fix this problem, this can be removed.
 get_property(INCDIRS DIRECTORY marian-dev/src PROPERTY INCLUDE_DIRECTORIES)
 target_include_directories(marian PUBLIC ${INCDIRS})
-
-get_property(INCLUDE_DIRECTORIES DIRECTORY .
PROPERTY INCLUDE_DIRECTORIES) -set(INCLUDE_DIRECTORIES ${INCLUDE_DIRECTORIES} PARENT_SCOPE) - -# Required to enable MKL, at least -get_directory_property(EXT_LIBS DIRECTORY marian-dev DEFINITION EXT_LIBS) -set(EXT_LIBS ${EXT_LIBS} PARENT_SCOPE) - -# Compilation flags -get_directory_property(CMAKE_C_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_C_FLAGS) -get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX_FLAGS) -set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE) -set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) - +get_property(INCLUDE_DIRECTORIES DIRECTORY ssplit-cpp/src PROPERTY INCLUDE_DIRECTORIES) +target_include_directories(ssplit PUBLIC ${INCLUDE_DIRECTORIES}) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 24b9a7b85..25dc77210 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -13,7 +13,8 @@ add_library(bergamot-translator STATIC translation_result.cpp ) -target_link_libraries(bergamot-translator marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a) +target_link_libraries(bergamot-translator marian ssplit) + target_include_directories(bergamot-translator PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} PRIVATE ${CMAKE_SOURCE_DIR} diff --git a/src/translator/textops.h b/src/translator/textops.h index e5c07b6b7..79a504013 100644 --- a/src/translator/textops.h +++ b/src/translator/textops.h @@ -9,7 +9,7 @@ #include "data/sentencepiece_vocab.h" #include "data/shortlist.h" #include "definitions.h" -#include "ssplit/ssplit.h" +#include "ssplit.h" #include #include From 1c3b656852641457a2675ffd9aa1aa3fa3dcfb3a Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Fri, 22 Jan 2021 15:53:19 +0100 Subject: [PATCH 035/442] Removed a redundant directory inclusion in CMakeFile --- src/translator/CMakeLists.txt | 1 - 1 file changed, 1 deletion(-) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 25dc77210..27158a786 100644 --- 
a/src/translator/CMakeLists.txt
+++ b/src/translator/CMakeLists.txt
@@ -16,7 +16,6 @@ add_library(bergamot-translator STATIC
 target_link_libraries(bergamot-translator marian ssplit)

 target_include_directories(bergamot-translator
-        PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}
         PRIVATE ${CMAKE_SOURCE_DIR}
         PUBLIC ${CMAKE_SOURCE_DIR}/src)

From 988e76baf973723daf694a1eed768bd20a86233d Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Fri, 22 Jan 2021 15:13:30 +0000
Subject: [PATCH 036/442] Removing Exception to fix Apple compile

---
 src/translator/pcqueue.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/translator/pcqueue.h b/src/translator/pcqueue.h
index 512932560..f0b354145 100644
--- a/src/translator/pcqueue.h
+++ b/src/translator/pcqueue.h
@@ -50,12 +50,12 @@ class Semaphore {
   }

   void wait() {
-    ABORT_IF(KERN_SUCCESS != semaphore_wait(back_), Exception,
+    ABORT_IF(KERN_SUCCESS != semaphore_wait(back_),
             "Wait for semaphore failed");
   }

   void post() {
-    ABORT_IF(KERN_SUCCESS != semaphore_signal(back_), Exception,
+    ABORT_IF(KERN_SUCCESS != semaphore_signal(back_),
             "Could not post to semaphore");
   }

From 7e2eb02e18cb029f599292f536b32964e854daf5 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Fri, 22 Jan 2021 18:17:10 +0000
Subject: [PATCH 037/442] CI and Associated Changes

Enables Mac and Ubuntu CPU-only builds through GitHub CI. CI scripts are
copied from marian-dev with the necessary changes. 3rd_party/marian-dev is
modified to meet C++17 requirements, with modifications for half_float.
--- .github/workflows/macos.yml | 59 +++++++++++++++ .github/workflows/ubuntu.yml | 124 ++++++++++++++++++++++++++++++++ .github/workflows/windows.yml | 130 ++++++++++++++++++++++++++++++++++ 3rd_party/CMakeLists.txt | 6 ++ 3rd_party/marian-dev | 2 +- 5 files changed, 320 insertions(+), 1 deletion(-) create mode 100644 .github/workflows/macos.yml create mode 100644 .github/workflows/ubuntu.yml create mode 100644 .github/workflows/windows.yml diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml new file mode 100644 index 000000000..4a34a3cd7 --- /dev/null +++ b/.github/workflows/macos.yml @@ -0,0 +1,59 @@ +name: MacOS + +on: + push: + branches: [ master ] + pull_request: + branches: [ master ] + +jobs: + build-macos: + name: MacOS CPU-only + runs-on: macos-10.15 + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + submodules: recursive + + - name: Install dependencies + run: brew install openblas protobuf + + # Openblas location is exported explicitly because openblas is keg-only, + # which means it was not symlinked into /usr/local/. + # CMake cannot find BLAS on GitHub runners if Marian is being compiled + # statically, hence USE_STATIC_LIBS=off + - name: Configure CMake + run: | + export LDFLAGS="-L/usr/local/opt/openblas/lib" + export CPPFLAGS="-I/usr/local/opt/openblas/include" + mkdir -p build + cd build + cmake .. 
\ + -DCOMPILE_CPU=on \ + -DCOMPILE_CUDA=off \ + -DCOMPILE_EXAMPLES=on \ + -DCOMPILE_SERVER=on \ + -DCOMPILE_TESTS=on \ + -DUSE_FBGEMM=on \ + -DUSE_SENTENCEPIECE=on \ + -DUSE_STATIC_LIBS=off + + - name: Compile + working-directory: build + run: make -j2 + + # Removing unit-tests, taken care of in browsermt/marian-dev + # - name: Run unit tests + # - working-directory: build + # - run: make test + + - name: Print versions + working-directory: build + run: | + ./marian --version + ./marian-decoder --version + ./marian-scorer --version + ./spm_encode --version + diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml new file mode 100644 index 000000000..88e72f780 --- /dev/null +++ b/.github/workflows/ubuntu.yml @@ -0,0 +1,124 @@ +name: Ubuntu + +on: + push: + branches: [ master ] + pull_request: + branches: [ master ] + +jobs: + build-ubuntu: + strategy: + matrix: + include: + # Ubuntu CPU-only build + - name: "Ubuntu CPU-only" + os: ubuntu-latest + cuda: "" + gcc: 7 + cpu: true + gpu: false + # GPU Builds are commented out, for bergamot-translator CI runs. 
+ # Ubuntu GPU-only build + # - name: "Ubuntu GPU-only" + # os: ubuntu-latest + # cuda: "10.2" + # gcc: 7 + # cpu: false + # gpu: true + # Ubuntu 20.04 supports CUDA 11+ + #- name: "Ubuntu 20.04 CUDA 11.0 gcc-9" + #os: ubuntu-20.04 + #cuda: "11.0" + #gcc: 9 + #cpu: false + #gpu: true + # Ubuntu 18.04 supports CUDA 10.1+ + # - name: "Ubuntu 18.04 CUDA 10.2 gcc-8" + # os: ubuntu-18.04 + # cuda: "10.2" + # gcc: 8 + # cpu: true + # gpu: true + # Ubuntu 16.04 supports CUDA 8+ + # - name: "Ubuntu 16.04 CUDA 9.2 gcc-7" + # os: ubuntu-16.04 + # cuda: "9.2" + # gcc: 7 + # cpu: true + # gpu: true + + runs-on: ${{ matrix.os }} + name: ${{ matrix.name }} + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + submodules: recursive + + # The following packages are already installed on GitHub-hosted runners: + # build-essential openssl libssl-dev + # No need to install libprotobuf{17,10,9v5} on Ubuntu {20,18,16}.04 because + # it is installed together with libprotobuf-dev + - name: Install dependencies + run: sudo apt-get install -y libgoogle-perftools-dev libprotobuf-dev protobuf-compiler + + # https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html + - name: Install MKL + run: | + wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB" | sudo apt-key add - + sudo sh -c "echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list" + sudo apt-get update -o Dir::Etc::sourcelist="/etc/apt/sources.list.d/intel-mkl.list" + sudo apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088 + if: matrix.cpu == true + + # The script simplifies installation of different versions of CUDA + - name: Install CUDA + run: ./3rd_party/marian-dev/scripts/ci/install_cuda_ubuntu.sh ${{ matrix.cuda }} + if: matrix.gpu == true + + # Boost is installed on GitHub-hosted runners in a non-standard location + # 
https://github.com/actions/virtual-environments/issues/687#issuecomment-610471671 + - name: Configure CMake + run: | + mkdir -p build + cd build + CC=/usr/bin/gcc-${{ matrix.gcc }} CXX=/usr/bin/g++-${{ matrix.gcc }} CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }} \ + cmake .. \ + -DBoost_ARCHITECTURE=-x64 \ + -DBOOST_INCLUDEDIR=$BOOST_ROOT_1_72_0/include \ + -DBOOST_LIBRARYDIR=$BOOST_ROOT_1_72_0/lib \ + -DBOOST_ROOT=$BOOST_ROOT_1_72_0 \ + -DCMAKE_BUILD_TYPE=Release \ + -DCOMPILE_CPU=${{ matrix.cpu }} \ + -DCOMPILE_CUDA=${{ matrix.gpu }} \ + -DCOMPILE_EXAMPLES=on \ + -DCOMPILE_SERVER=on \ + -DCOMPILE_TESTS=on \ + -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-${{ matrix.cuda }} \ + -DUSE_FBGEMM=${{ matrix.cpu }} \ + -DUSE_SENTENCEPIECE=on \ + -DUSE_STATIC_LIBS=on \ + + - name: Compile + working-directory: build + run: make -j2 + + # Removing unit-tests, taken care of in browsermt/marian-dev + # TODO: add a flag to CMake to compile unit tests only on CPU + # - name: Run unit tests + # working-directory: build + # run: make test + # # GitHub-hosted VMs do not have GPUs, so can not be run in CUDA builds + # if: matrix.gpu == false + + - name: Print versions + working-directory: build + run: | + ./marian --version + ./marian-decoder --version + ./marian-scorer --version + ./marian-server --version + ./spm_encode --version + diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml new file mode 100644 index 000000000..cc0b1bef5 --- /dev/null +++ b/.github/workflows/windows.yml @@ -0,0 +1,130 @@ +name: Windows + +on: + push: + branches: [ master ] + pull_request: + branches: [ master ] + +env: + MKL_URL: "https://romang.blob.core.windows.net/mariandev/ci/mkl-2020.1-windows-static.zip" + +jobs: + build-windows: + strategy: + matrix: + include: + # Windows CPU-only build + - name: "Windows CPU-only" + cuda: "" + gpu: false + # GPU Builds are commented out, for bergamot-translator CI runs. 
+ # Windows CPU+GPU build + # - name: "Windows CPU+CUDA" + # cuda: "10.2" + # gpu: true + + runs-on: windows-2019 + name: ${{ matrix.name }} + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + submodules: recursive + + - name: Download MKL + run: | + # Wget retries downloading files and is faster than Invoke-WebRequest + C:\msys64\usr\bin\wget.exe -nv ${{ env.MKL_URL }} -O mkl.zip + Expand-Archive -Force mkl.zip ${{ github.workspace }}\mkl + # Set MKLROOT environment variable so that CMake can find MKL + echo "MKLROOT=${{ github.workspace }}\mkl" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append + shell: powershell + + - name: Install CUDA + run: | + .\3rd_party\marian-dev\scripts\ci\install_cuda_windows.ps1 "10.2" + # Set CUDA_PATH environment variable so that CMake can find CUDA + echo "CUDA_PATH=$env:CUDA_PATH" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append + echo "$env:CUDA_PATH/bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append + shell: powershell + if: matrix.gpu == true + + - name: Prepare vcpkg + uses: lukka/run-vcpkg@v4 + with: + vcpkgArguments: protobuf + vcpkgGitCommitId: 6185aa76504a5025f36754324abf307cc776f3da + vcpkgDirectory: ${{ github.workspace }}/vcpkg/ + vcpkgTriplet: x64-windows-static + + # Windows CUDA builds use USE_NCCL=off due to compilation errors. 
+ - name: Build Debug + uses: lukka/run-cmake@v3 + with: + buildDirectory: ${{ github.workspace }}/build/Debug + cmakeAppendedArgs: '-G Ninja + -DCMAKE_BUILD_TYPE="Debug" + -DOPENSSL_USE_STATIC_LIBS="TRUE" + -DOPENSSL_MSVC_STATIC_RT="TRUE" + -DCOMPILE_CPU="TRUE" + -DCOMPILE_CUDA="${{ matrix.gpu }}" + -DCOMPILE_SERVER="FALSE" + -DCOMPILE_TESTS="TRUE" + -DUSE_FBGEMM="TRUE" + -DUSE_MPI="FALSE" + -DUSE_NCCL="FALSE" + -DUSE_SENTENCEPIECE="TRUE" + -DUSE_STATIC_LIBS="TRUE"' + cmakeListsOrSettingsJson: CMakeListsTxtAdvanced + cmakeListsTxtPath: ${{ github.workspace }}/CMakeLists.txt + useVcpkgToolchainFile: true + # Building in Debug is sufficient for the all-in CPU+GPU compilation; + # its main purpose is to detect warnings that the Release build is not + # able to find sometimes. + if: matrix.gpu == true + + # Windows CUDA builds use USE_NCCL=off due to compilation errors + # Boost is pre-installed on Azure/GitHub-hosted Windows runners + # https://github.com/actions/virtual-environments/blob/main/images/win/Windows2019-Readme.md#boost + # (not used yet) + - name: Build Release + uses: lukka/run-cmake@v3 + with: + buildDirectory: ${{ github.workspace }}/build/ + cmakeAppendedArgs: '-G Ninja + -DBOOST_ROOT="$(BOOST_ROOT_1_72_0)" + -DBOOST_INCLUDEDIR="$(BOOST_ROOT_1_72_0)/include" + -DBOOST_LIBRARYDIR="$(BOOST_ROOT_1_72_0)/lib" + -DCMAKE_BUILD_TYPE="Release" + -DOPENSSL_USE_STATIC_LIBS="TRUE" + -DOPENSSL_MSVC_STATIC_RT="TRUE" + -DCOMPILE_CPU="TRUE" + -DCOMPILE_CUDA="${{ matrix.gpu }}" + -DCOMPILE_SERVER="FALSE" + -DCOMPILE_TESTS="TRUE" + -DUSE_FBGEMM="TRUE" + -DUSE_MPI="FALSE" + -DUSE_NCCL="FALSE" + -DUSE_SENTENCEPIECE="TRUE" + -DUSE_STATIC_LIBS="TRUE"' + cmakeListsOrSettingsJson: CMakeListsTxtAdvanced + cmakeListsTxtPath: ${{ github.workspace }}/CMakeLists.txt + useVcpkgToolchainFile: true + + # Removing unit-tests, taken care of in browsermt/marian-dev + # - name: Run unit tests + # working-directory: build/ + # run: ctest + # # Not run in GPU builds because 
GitHub-hosted VMs do not have GPUs + # if: matrix.gpu == false + + - name: Print versions + working-directory: build/ + run: | + .\marian.exe --version + .\marian-decoder.exe --version + .\marian-scorer.exe --version + dir *.exe + shell: cmd diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt index 6d5a5c926..644ac52de 100644 --- a/3rd_party/CMakeLists.txt +++ b/3rd_party/CMakeLists.txt @@ -9,3 +9,9 @@ target_include_directories(marian PUBLIC ${INCDIRS}) get_property(INCLUDE_DIRECTORIES DIRECTORY ssplit-cpp/src PROPERTY INCLUDE_DIRECTORIES) target_include_directories(ssplit PUBLIC ${INCLUDE_DIRECTORIES}) + +# Compilation flags +get_directory_property(CMAKE_C_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_C_FLAGS) +get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX_FLAGS) +set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE) +set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 96d5a712d..ee56e02f0 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 96d5a712d3b8bc56f120ba5220365f955719f4d4 +Subproject commit ee56e02f0525a4651157a07f74b44f456db14c8c From cd025e9f651bab6b901e0306690b8bed5625165a Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 23 Jan 2021 14:39:08 +0000 Subject: [PATCH 038/442] CI scripts: master -> main --- .github/workflows/macos.yml | 4 ++-- .github/workflows/ubuntu.yml | 4 ++-- .github/workflows/windows.yml | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/.github/workflows/macos.yml b/.github/workflows/macos.yml index 4a34a3cd7..8ccdecaf5 100644 --- a/.github/workflows/macos.yml +++ b/.github/workflows/macos.yml @@ -2,9 +2,9 @@ name: MacOS on: push: - branches: [ master ] + branches: [ main ] pull_request: - branches: [ master ] + branches: [ main ] jobs: build-macos: diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml index 88e72f780..240efd2c3 100644 --- 
a/.github/workflows/ubuntu.yml +++ b/.github/workflows/ubuntu.yml @@ -2,9 +2,9 @@ name: Ubuntu on: push: - branches: [ master ] + branches: [ main ] pull_request: - branches: [ master ] + branches: [ main ] jobs: build-ubuntu: diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index cc0b1bef5..ef9ad25d1 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -2,9 +2,9 @@ name: Windows on: push: - branches: [ master ] + branches: [ main ] pull_request: - branches: [ master ] + branches: [ main ] env: MKL_URL: "https://romang.blob.core.windows.net/mariandev/ci/mkl-2020.1-windows-static.zip" From 69adc7af777b5e672d54345f6e7bec5d915faade Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 24 Jan 2021 21:46:47 +0000 Subject: [PATCH 039/442] Changing code-style to clang-format-google --- app/main.cpp | 34 +++--- src/AbstractTranslationModel.h | 84 ++++++++------ src/QualityScore.h | 29 ++--- src/TranslationModelConfiguration.h | 87 +++++++------- src/TranslationRequest.h | 108 ++++++++++-------- src/TranslationResult.h | 95 +++++++-------- src/translator/AbstractTranslationModel.cpp | 10 +- src/translator/TranslationModel.cpp | 24 ++-- src/translator/TranslationModel.h | 78 +++++++------ ...TranslationModelConfigToOptionsAdaptor.cpp | 14 ++- .../TranslationModelConfigToOptionsAdaptor.h | 22 ++-- 11 files changed, 302 insertions(+), 283 deletions(-) diff --git a/app/main.cpp b/app/main.cpp index dc808228f..bb0fa34e2 100644 --- a/app/main.cpp +++ b/app/main.cpp @@ -7,29 +7,29 @@ #include -#include "TranslationModelConfiguration.h" #include "AbstractTranslationModel.h" +#include "TranslationModelConfiguration.h" #include "TranslationRequest.h" #include "TranslationResult.h" +int main(int argc, char **argv) { -int main(int argc, char** argv) { - - // Create an instance of AbstractTranslationModel with a dummy model configuration - TranslationModelConfiguration config("dummy_modelFilePath", - "dummy_sourceVocabPath", - 
"dummy_targetVocabPath"); - std::shared_ptr model = - AbstractTranslationModel::createInstance(config); + // Create an instance of AbstractTranslationModel with a dummy model + // configuration + TranslationModelConfiguration config( + "dummy_modelFilePath", "dummy_sourceVocabPath", "dummy_targetVocabPath"); + std::shared_ptr model = + AbstractTranslationModel::createInstance(config); - // Call to translate a dummy (empty) texts with a dummy (empty) translation request - TranslationRequest req; - std::vector texts; - auto result = model->translate(std::move(texts), req); + // Call to translate a dummy (empty) texts with a dummy (empty) translation + // request + TranslationRequest req; + std::vector texts; + auto result = model->translate(std::move(texts), req); - // Resolve the future and get the actual result - std::vector res = result.get(); + // Resolve the future and get the actual result + std::vector res = result.get(); - std::cout << "Count is: " << res.size() << std::endl; - return 0; + std::cout << "Count is: " << res.size() << std::endl; + return 0; } diff --git a/src/AbstractTranslationModel.h b/src/AbstractTranslationModel.h index ddadc07bf..b76aeebed 100644 --- a/src/AbstractTranslationModel.h +++ b/src/AbstractTranslationModel.h @@ -1,61 +1,69 @@ /* * AbstractTranslationModel.h * - * An interface for a translation model for translating a plain (without any markups and emojis) UTF-8 encoded text. - * The model supports translation from 1 source language to 1 target language. There can be different implementations + * An interface for a translation model for translating a plain (without any + * markups and emojis) UTF-8 encoded text. The model supports translation from 1 + * source language to 1 target language. There can be different implementations * of this interface. 
 */

#ifndef SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_
#define SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_

-#include <string>
-#include <vector>
 #include <future>
 #include <memory>
+#include <string>
+#include <vector>

 #include "TranslationModelConfiguration.h"
 #include "TranslationRequest.h"
 #include "TranslationResult.h"

-/* An interface for a translation model for translating a plain (without any markups and emojis) UTF-8 encoded text.
- * The model supports translation from 1 source language to 1 target language.
+/* An interface for a translation model for translating a plain (without any
+ * markups and emojis) UTF-8 encoded text. The model supports translation from 1
+ * source language to 1 target language.
 */
class AbstractTranslationModel {
public:
+  /* A Factory method to create and return an instance of an implementation of
+   * AbstractTranslationModel. The instance is created using translation model
+   * configuration (TranslationModelConfiguration).
+   */
+  static std::shared_ptr<AbstractTranslationModel>
+  createInstance(const TranslationModelConfiguration &config);
+
+  AbstractTranslationModel() = default;
+
+  virtual ~AbstractTranslationModel() = default;
+
+  /* This method performs translation on a list of (UTF-8 encoded) texts and
+   * returns a list of results in the same order. Each text entry can either be
+   * a word, a phrase, a sentence or a list of sentences and should contain
+   * plain text (without any markups or emojis). Additional information related
+   * to the translated text can be requested via TranslationRequest which is
+   * applied equally to each text entry.
+   *
+   * The translated text corresponding to each text entry and the additional
+   * information (as specified in the TranslationRequest) is encapsulated and
+   * returned in TranslationResult.
+   *
+   * The API splits each text entry into sentences internally, which are then
+   * translated independent of each other. The translated sentences are then
+   * joined together and returned in TranslationResult. Please refer to the
+   * TranslationRequest class to find out what additional information can be
+   * requested. The alignment information can only be requested if the model
+   * supports it (check isAlignmentSupported() API).
+   *
+   * The texts argument will become empty after the execution of this API (each
+   * entry of texts list will be moved to its corresponding TranslationResult
+   * object).
+   */
+  virtual std::future<std::vector<TranslationResult>>
+  translate(std::vector<std::string> &&texts, TranslationRequest request) = 0;

-  /* A Factory method to create and return an instance of an implementation of
-   * AbstractTranslationModel. The instance is created using translation model configuration
-   * (TranslationModelConfiguration).
-   */
-  static std::shared_ptr<AbstractTranslationModel>
-  createInstance(const TranslationModelConfiguration& config);
-
-  AbstractTranslationModel() = default;
-
-  virtual ~AbstractTranslationModel() = default;
-
-  /* This method performs translation on a list of (UTF-8 encoded) texts and returns a list of results in the same order.
-   * Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text
-   * (without any markups or emojis). Additional information related to the translated text can be requested via
-   * TranslationRequest which is applied equally to each text entry.
-   *
-   * The translated text corresponding to each text entry and the additional information (as specified in the
-   * TranslationRequest) is encapsulated and returned in TranslationResult.
-   *
-   * The API splits each text entry into sentences internally, which are then translated independent of each other.
-   * The translated sentences are then joined together and returned in TranslationResult.
-   * Please refer to the TranslationRequest class to find out what additional information can be requested.
-   * The alignment information can only be requested if the model supports it (check isAlignmentSupported() API).
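Editorial aside: the asynchronous contract documented in the comment above (consume the input vector, return a `std::future` resolving to results in the same order) can be sketched with the standard library alone. `TranslationResult` is declared elsewhere in this patch, so a plain `std::string` stands in for it here; `translateSketch` is a hypothetical name for illustration, not part of the API.

```cpp
#include <future>
#include <string>
#include <utility>
#include <vector>

// Sketch of the translate() contract: the caller's texts are moved into the
// task, and a std::future resolving to one result per input (same order) is
// returned. A real model would translate; this sketch just echoes each entry.
std::future<std::vector<std::string>>
translateSketch(std::vector<std::string> &&texts) {
  return std::async(std::launch::deferred,
                    [owned = std::move(texts)]() mutable {
                      std::vector<std::string> results;
                      results.reserve(owned.size());
                      for (auto &text : owned) {
                        // Each entry is moved into its result, mirroring the
                        // "texts becomes empty" behaviour described above.
                        results.push_back(std::move(text));
                      }
                      return results;
                    });
}
```

Resolving the future with `get()` yields the results, exactly as `main.cpp` does with the real API.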
-   *
-   * The texts argument will become empty after the execution of this API (each entry of texts list will be moved to its
-   * corresponding TranslationResult object).
-   */
-  virtual std::future<std::vector<TranslationResult>> translate(
-      std::vector<std::string> &&texts, TranslationRequest request) = 0;
-
-  /* Check if the model can provide alignment information b/w original and translated text. */
-  virtual bool isAlignmentSupported() const = 0;
+  /* Check if the model can provide alignment information b/w original and
+   * translated text. */
+  virtual bool isAlignmentSupported() const = 0;
 };

#endif /* SRC_TRANSLATOR_ABSTRACTTRANSLATIONMODEL_H_ */
diff --git a/src/QualityScore.h b/src/QualityScore.h
index 020aebc8e..3ad6349bd 100644
--- a/src/QualityScore.h
+++ b/src/QualityScore.h
@@ -6,31 +6,32 @@
 #ifndef SRC_TRANSLATOR_QUALITYSCORE_H_
 #define SRC_TRANSLATOR_QUALITYSCORE_H_

-#include <vector>
 #include <string_view>
+#include <vector>

-
-/* All possible Granularities for which Quality Scores can be returned for translated text. */
+/* All possible Granularities for which Quality Scores can be returned for
+ * translated text. */
 enum class QualityScoreGranularity {
-  WORD, SENTENCE, NONE,
+  WORD,
+  SENTENCE,
+  NONE,
 };

-/* This class represents the Quality Scores for various spans of a translated text at a specific granularity. */
+/* This class represents the Quality Scores for various spans of a translated
+ * text at a specific granularity. */
 class QualityScore {
private:
+  // Sections of the translated text for the Quality Scores.
+  std::vector<std::string_view> textViews;
-  // Sections of the translated text for the Quality Scores.
-  std::vector<std::string_view> textViews;
+  // Quality Scores corresponding to each entry of textViews in the same order
+  std::vector<float> textScores;

-  // Quality Scores corresponding to each entry of textViews in the same order
-  std::vector<float> textScores;
-
-  // Granularity of the text for the Quality scores above
-  QualityScoreGranularity textGranularity;
+  // Granularity of the text for the Quality scores above
+  QualityScoreGranularity textGranularity;

public:
-  // ToDo: Public Methods
+  // ToDo: Public Methods
 };
-
#endif /* SRC_TRANSLATOR_QUALITYSCORE_H_ */
diff --git a/src/TranslationModelConfiguration.h b/src/TranslationModelConfiguration.h
index 8c6582454..f4a5572ea 100644
--- a/src/TranslationModelConfiguration.h
+++ b/src/TranslationModelConfiguration.h
@@ -8,61 +8,54 @@
 #include <string>

-/* This class encapsulates the configuration that is required by a translation model to perform
- * translation.
+/* This class encapsulates the configuration that is required by a translation
+ * model to perform translation.
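As an aside, the shape of the configuration class described above (immutable path strings fixed at construction, read-only getters) can be shown as a standalone sketch. `ConfigSketch` is a hypothetical stand-in name; it only mirrors the patch, it is not the real `TranslationModelConfiguration`.

```cpp
#include <string>

// Illustrative mirror of the configuration class in this patch: three path
// strings set once at construction and exposed through const getters.
class ConfigSketch {
public:
  ConfigSketch(const std::string &model, const std::string &source,
               const std::string &target)
      : modelPath(model), sourceVocabPath(source), targetVocabPath(target) {}

  const std::string &getModelFilePath() const { return modelPath; }
  const std::string &getSourceVocabularyPath() const {
    return sourceVocabPath;
  }
  const std::string &getTargetVocabularyPath() const {
    return targetVocabPath;
  }

private:
  // const members make the configuration immutable after construction, at the
  // cost of disabling assignment (copy/move construction still works).
  const std::string modelPath;
  const std::string sourceVocabPath;
  const std::string targetVocabPath;
};
```

One design note: with `const std::string` members, writing `std::move(rhs.member)` in a move constructor (as the patch does) silently falls back to a copy, because a `const` object cannot bind to the non-const move overload. Dropping the hand-written move constructor, or the `const` on the members, would avoid the surprise.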
 */
class TranslationModelConfiguration {
public:
-
-  // Constructor
-  TranslationModelConfiguration(const std::string &modelFilePath,
-      const std::string &sourceVocabPath,
-      const std::string &targetVocabPath) :
-      modelPath(modelFilePath),
-      sourceLanguageVocabPath(sourceVocabPath),
-      targetLanguageVocabPath(targetVocabPath) {
-  }
-
-  // Copy constructor
-  TranslationModelConfiguration(const TranslationModelConfiguration &rhs) :
-      modelPath(rhs.modelPath),
-      sourceLanguageVocabPath(rhs.sourceLanguageVocabPath),
-      targetLanguageVocabPath(rhs.targetLanguageVocabPath) {
-  }
-
-  // Move constructor
-  TranslationModelConfiguration(TranslationModelConfiguration &&rhs) :
-      modelPath(std::move(rhs.modelPath)),
-      sourceLanguageVocabPath(std::move(rhs.sourceLanguageVocabPath)),
-      targetLanguageVocabPath(std::move(rhs.targetLanguageVocabPath)) {
-  }
-
-  // Return the path of the model file
-  const std::string& getModelFilePath() const {
-    return modelPath;
-  }
-
-  // Return the path of the source language vocabulary file
-  const std::string& getSourceVocabularyPath() const {
-    return sourceLanguageVocabPath;
-  }
-
-  // Return the path of the target language vocabulary file
-  const std::string& getTargetVocabularyPath() const {
-    return targetLanguageVocabPath;
-  }
+  // Constructor
+  TranslationModelConfiguration(const std::string &modelFilePath,
+                                const std::string &sourceVocabPath,
+                                const std::string &targetVocabPath)
+      : modelPath(modelFilePath), sourceLanguageVocabPath(sourceVocabPath),
+        targetLanguageVocabPath(targetVocabPath) {}
+
+  // Copy constructor
+  TranslationModelConfiguration(const TranslationModelConfiguration &rhs)
+      : modelPath(rhs.modelPath),
+        sourceLanguageVocabPath(rhs.sourceLanguageVocabPath),
+        targetLanguageVocabPath(rhs.targetLanguageVocabPath) {}
+
+  // Move constructor
+  TranslationModelConfiguration(TranslationModelConfiguration &&rhs)
+      : modelPath(std::move(rhs.modelPath)),
+        sourceLanguageVocabPath(std::move(rhs.sourceLanguageVocabPath)),
+        targetLanguageVocabPath(std::move(rhs.targetLanguageVocabPath)) {}
+
+  // Return the path of the model file
+  const std::string &getModelFilePath() const { return modelPath; }
+
+  // Return the path of the source language vocabulary file
+  const std::string &getSourceVocabularyPath() const {
+    return sourceLanguageVocabPath;
+  }
+
+  // Return the path of the target language vocabulary file
+  const std::string &getTargetVocabularyPath() const {
+    return targetLanguageVocabPath;
+  }

private:
-  // Path to the translation model file
-  const std::string modelPath;
+  // Path to the translation model file
+  const std::string modelPath;

-  // Path to the source vocabulary file to be used by the model
-  const std::string sourceLanguageVocabPath;
+  // Path to the source vocabulary file to be used by the model
+  const std::string sourceLanguageVocabPath;

-  // Path to the target vocabulary file to be used by the model
-  const std::string targetLanguageVocabPath;
+  // Path to the target vocabulary file to be used by the model
+  const std::string targetLanguageVocabPath;

-  // ToDo: Add other user configurable options (e.g. min batch size)
+  // ToDo: Add other user configurable options (e.g. min batch size)
 };

#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ */
diff --git a/src/TranslationRequest.h b/src/TranslationRequest.h
index b19cc892d..6d449bbab 100644
--- a/src/TranslationRequest.h
+++ b/src/TranslationRequest.h
@@ -1,7 +1,8 @@
 /*
  * TranslationRequest.h
  *
- * This file defines the translation request class to be used in AbstractTranslationModel::translate() API.
+ * This file defines the translation request class to be used in
+ * AbstractTranslationModel::translate() API.
  */

 #ifndef SRC_TRANSLATOR_TRANSLATIONREQUEST_H_
@@ -9,66 +10,75 @@

 #include "QualityScore.h"

-/* This class specifies the information related to the translated text (e.g. quality of the translation etc.) that
- * can be included in the TranslationResult. These optional requests are set/unset independent of each other i.e. setting
- * any one of them doesn’t have the side effect of setting any of the others.
+/* This class specifies the information related to the translated text (e.g.
+ * quality of the translation etc.) that can be included in the
+ * TranslationResult. These optional requests are set/unset independent of each
+ * other i.e. setting any one of them doesn’t have the side effect of setting
+ * any of the others.
 */
class TranslationRequest {
private:
-  // The granularity for which Quality scores of the translated text will be included in TranslationResult.
-  // QualityScoreGranularity::NONE means the scores are not included in TranslationResult.
-  QualityScoreGranularity qualityScoreGranularity = QualityScoreGranularity::NONE;
+  // The granularity for which Quality scores of the translated text will be
+  // included in TranslationResult. QualityScoreGranularity::NONE means the
+  // scores are not included in TranslationResult.
+  QualityScoreGranularity qualityScoreGranularity =
+      QualityScoreGranularity::NONE;

-  // A flag to include/exclude the information regarding how individual sentences of original text map to
-  // corresponding translated sentences in joined translated text in the TranslationResult.
-  // An example of sentence mappings:
-  //     originalText (containing 2 sentences) = "What is your name? My name is Abc."
-  //     translatedText (containing 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc."
-  //     sentenceMappings = [
-  //         {"What is your name?", "Was ist dein Name?"},  // Pair(originalText[0],translatedText[0])
-  //         {"My name is Abc", "Mein Name ist Abc."}       // Pair(originalText[1],translatedText[1])
-  //     ]
-  bool includeSentenceMapping = false;
+  // A flag to include/exclude the information regarding how individual
+  // sentences of original text map to corresponding translated sentences in
+  // joined translated text in the TranslationResult. An example of sentence
+  // mappings:
+  //     originalText (containing 2 sentences) = "What is your
+  //     name? My name is Abc." translatedText (containing 2 translated
+  //     sentences) = "Was ist dein Name? Mein Name ist Abc." sentenceMappings =
+  //     [
+  //       {"What is your name?", "Was ist dein Name?"}, //
+  //       Pair(originalText[0],translatedText[0])
+  //       {"My name is Abc", "Mein Name ist Abc."} //
+  //       Pair(originalText[1],translatedText[1])
+  //     ]
+  bool includeSentenceMapping = false;

public:
-  TranslationRequest() {}
+  TranslationRequest() {}

-  TranslationRequest(const TranslationRequest& request) :
-      qualityScoreGranularity(request.qualityScoreGranularity),
-      includeSentenceMapping(request.includeSentenceMapping) {
-  }
+  TranslationRequest(const TranslationRequest &request)
+      : qualityScoreGranularity(request.qualityScoreGranularity),
+        includeSentenceMapping(request.includeSentenceMapping) {}

-  ~TranslationRequest() {}
+  ~TranslationRequest() {}

-  /* Set the granularity for which the Quality scores of translated text should be included in the TranslationResult.
-   * By default (QualityScoreGranularity::NONE), scores are not included.
-   */
-  void setQualityScoreGranularity(QualityScoreGranularity granularity) {
-    qualityScoreGranularity = granularity;
-  }
+  /* Set the granularity for which the Quality scores of translated text should
+   * be included in the TranslationResult. By default
+   * (QualityScoreGranularity::NONE), scores are not included.
+   */
+  void setQualityScoreGranularity(QualityScoreGranularity granularity) {
+    qualityScoreGranularity = granularity;
+  }

-  /* Set to true/false to include/exclude the information regarding how individual sentences of original text map to
-   * corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this
-   * information is not included.
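The sentence-mapping structure documented in the comment above is just an ordered list of (original sentence, translated sentence) pairs. The sketch below builds the same English/German example the patch uses; `exampleMappings` is a hypothetical helper name.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sentence mappings as described in the patch: one pair per sentence, in
// order, pairing each original sentence with its translation.
std::vector<std::pair<std::string, std::string>> exampleMappings() {
  return {
      {"What is your name?", "Was ist dein Name?"},
      {"My name is Abc", "Mein Name ist Abc."},
  };
}
```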
-   */
-  void sentenceMappingInResult(bool includeMapping) {
-    includeSentenceMapping = includeMapping;
-  }
+  /* Set to true/false to include/exclude the information regarding how
+   * individual sentences of original text map to corresponding translated
+   * sentences in joined translated text in the TranslationResult. By default
+   * (false), this information is not included.
+   */
+  void sentenceMappingInResult(bool includeMapping) {
+    includeSentenceMapping = includeMapping;
+  }

-  /* Return the granularity for which the Quality scores of the translated text will be included in TranslationResult.
-   * QualityScoreGranularity::NONE means the scores will not be included.
-   */
-  QualityScoreGranularity getQualityScoreGranularity() const {
-    return qualityScoreGranularity;
-  }
+  /* Return the granularity for which the Quality scores of the translated text
+   * will be included in TranslationResult. QualityScoreGranularity::NONE means
+   * the scores will not be included.
+   */
+  QualityScoreGranularity getQualityScoreGranularity() const {
+    return qualityScoreGranularity;
+  }

-  /* Return whether the information regarding how individual sentences of original text map to corresponding translated
-   * sentences in joined translated text will be included in the TranslationResult. By default (false) means this
-   * information will not be included.
-   */
-  bool sentenceMappingInResult() const {
-    return includeSentenceMapping;
-  }
+  /* Return whether the information regarding how individual sentences of
+   * original text map to corresponding translated sentences in joined
+   * translated text will be included in the TranslationResult. By default
+   * (false) means this information will not be included.
+   */
+  bool sentenceMappingInResult() const { return includeSentenceMapping; }
 };

#endif /* SRC_TRANSLATOR_TRANSLATIONREQUEST_H_ */
diff --git a/src/TranslationResult.h b/src/TranslationResult.h
index 4d231a89b..34858f74c 100644
--- a/src/TranslationResult.h
+++ b/src/TranslationResult.h
@@ -1,76 +1,77 @@
 /*
  * TranslationResult.h
  *
- * The class that represents the result of AbstractTranslationModel::translate() API for each of its text entry and
- * TranslationRequest.
+ * The class that represents the result of AbstractTranslationModel::translate()
+ * API for each of its text entry and TranslationRequest.
  */

 #ifndef SRC_TRANSLATOR_TRANSLATIONRESULT_H_
 #define SRC_TRANSLATOR_TRANSLATIONRESULT_H_

-#include <vector>
 #include <string>
+#include <vector>

 #include "QualityScore.h"

-/* This class represents the result of AbstractTranslationModel::translate() API for each of its text entry and
- * TranslationRequest.
+/* This class represents the result of AbstractTranslationModel::translate() API
+ * for each of its text entry and TranslationRequest.
 */
class TranslationResult {
public:
-  typedef std::vector<std::pair<std::string_view, std::string_view>> SentenceMappings;
+  typedef std::vector<std::pair<std::string_view, std::string_view>>
+      SentenceMappings;

-  TranslationResult(const std::string &original, const std::string &translation) :
-      originalText(original), translatedText(translation) {}
+  TranslationResult(const std::string &original, const std::string &translation)
+      : originalText(original), translatedText(translation) {}

-  TranslationResult(std::string &&original, std::string &&translation) :
-      originalText(std::move(original)), translatedText(std::move(translation)) {}
+  TranslationResult(std::string &&original, std::string &&translation)
+      : originalText(std::move(original)),
+        translatedText(std::move(translation)) {}

-  /* Return the original text. */
-  const std::string& getOriginalText() const {
-    return originalText;
-  }
+  /* Return the original text. */
+  const std::string &getOriginalText() const { return originalText; }

-  /* Return the translated text. */
-  const std::string& getTranslatedText() const {
-    return translatedText;
-  }
+  /* Return the translated text. */
+  const std::string &getTranslatedText() const { return translatedText; }

-  /* Return the Quality scores of the translated text. */
-  const QualityScore& getQualityScore() const {
-    return qualityScore;
-  }
+  /* Return the Quality scores of the translated text. */
+  const QualityScore &getQualityScore() const { return qualityScore; }

-  /* Return the Sentence mappings (information regarding how individual sentences of originalText map to
-   * corresponding translated sentences in translatedText).
-   */
-  const SentenceMappings& getSentenceMappings() const {
-    return sentenceMappings;
-  }
+  /* Return the Sentence mappings (information regarding how individual
+   * sentences of originalText map to corresponding translated sentences in
+   * translatedText).
+   */
+  const SentenceMappings &getSentenceMappings() const {
+    return sentenceMappings;
+  }

private:
-  // Original text (in UTF-8 encoded format) that was supposed to be translated
-  std::string originalText;
+  // Original text (in UTF-8 encoded format) that was supposed to be translated
+  std::string originalText;

-  // Translation (in UTF-8 encoded format) of the originalText
-  std::string translatedText;
+  // Translation (in UTF-8 encoded format) of the originalText
+  std::string translatedText;

-  // Quality score of the translated text at the granularity specified in TranslationRequest.
-  // It is an optional result (it will have no information if not requested in TranslationRequest)
-  QualityScore qualityScore;
+  // Quality score of the translated text at the granularity specified in
+  // TranslationRequest. It is an optional result (it will have no information
+  // if not requested in TranslationRequest)
+  QualityScore qualityScore;

-  // Information regarding how individual sentences of originalText map to corresponding translated sentences
-  // in joined translated text (translatedText)
-  // An example of sentence mapping:
-  //     originalText (contains 2 sentences) = "What is your name? My name is Abc."
-  //     translatedText (contains 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc."
-  //     sentenceMappings = [
-  //         {"What is your name?", "Was ist dein Name?"},  // Pair(originalText[0],translatedText[0])
-  //         {"My name is Abc", "Mein Name ist Abc."}       // Pair(originalText[1],translatedText[1])
-  //     ]
-  //
-  // It is an optional result (it will be empty if not requested in TranslationRequest).
-  SentenceMappings sentenceMappings;
+  // Information regarding how individual sentences of originalText map to
+  // corresponding translated sentences in joined translated text
+  // (translatedText) An example of sentence mapping:
+  //     originalText (contains 2 sentences) = "What is your name?
+  //     My name is Abc." translatedText (contains 2 translated sentences) =
+  //     "Was ist dein Name? Mein Name ist Abc." sentenceMappings = [
+  //       {"What is your name?", "Was ist dein Name?"}, //
+  //       Pair(originalText[0],translatedText[0])
+  //       {"My name is Abc", "Mein Name ist Abc."} //
+  //       Pair(originalText[1],translatedText[1])
+  //     ]
+  //
+  // It is an optional result (it will be empty if not requested in
+  // TranslationRequest).
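Aside: the result type described above (original text, translated text, optional mappings, all moved in at construction) can be shown as a runnable sketch. `ResultSketch` is a hypothetical stand-in, not the real `TranslationResult`; it uses owning `std::string` pairs to stay self-contained.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch of the result type: texts and optional sentence mappings are moved
// in once and exposed read-only, mirroring the accessors in this patch.
struct ResultSketch {
  using SentenceMappings = std::vector<std::pair<std::string, std::string>>;

  ResultSketch(std::string original, std::string translated,
               SentenceMappings mappings)
      : originalText(std::move(original)),
        translatedText(std::move(translated)),
        sentenceMappings(std::move(mappings)) {}

  const std::string &getOriginalText() const { return originalText; }
  const std::string &getTranslatedText() const { return translatedText; }
  const SentenceMappings &getSentenceMappings() const {
    return sentenceMappings;
  }

private:
  std::string originalText;
  std::string translatedText;
  SentenceMappings sentenceMappings; // empty when mappings were not requested
};
```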
+  SentenceMappings sentenceMappings;
 };

#endif /* SRC_TRANSLATOR_TRANSLATIONRESULT_H_ */
diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp
index 597c592d3..94782fa81 100644
--- a/src/translator/AbstractTranslationModel.cpp
+++ b/src/translator/AbstractTranslationModel.cpp
@@ -12,10 +12,10 @@
 #include "TranslationModel.h"
 #include "TranslationModelConfigToOptionsAdaptor.h"

-
 std::shared_ptr<AbstractTranslationModel>
-AbstractTranslationModel::createInstance(const TranslationModelConfiguration& config) {
-  TranslationModelConfigToOptionsAdaptor adaptor;
-  auto options = adaptor.adapt(config);
-  return std::make_shared<TranslationModel>(options);
+AbstractTranslationModel::createInstance(
+    const TranslationModelConfiguration &config) {
+  TranslationModelConfigToOptionsAdaptor adaptor;
+  auto options = adaptor.adapt(config);
+  return std::make_shared<TranslationModel>(options);
 }
diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp
index 099d930cd..b3a8fec32 100644
--- a/src/translator/TranslationModel.cpp
+++ b/src/translator/TranslationModel.cpp
@@ -8,21 +8,19 @@

 #include "TranslationModel.h"

-TranslationModel::TranslationModel(std::shared_ptr<marian::Options> options) :
-    configOptions(std::move(options)), AbstractTranslationModel() {
-}
+TranslationModel::TranslationModel(std::shared_ptr<marian::Options> options)
+    : configOptions(std::move(options)), AbstractTranslationModel() {}

 TranslationModel::~TranslationModel() {}

-std::future<std::vector<TranslationResult>> TranslationModel::translate(
-    std::vector<std::string> &&texts, TranslationRequest request) {
-  //ToDo: Replace this code with the actual implementation
-  return std::async([]() {
-    std::vector<TranslationResult> results;
-    return results;
-  });
+std::future<std::vector<TranslationResult>>
+TranslationModel::translate(std::vector<std::string> &&texts,
+                            TranslationRequest request) {
+  // ToDo: Replace this code with the actual implementation
+  return std::async([]() {
+    std::vector<TranslationResult> results;
+    return results;
+  });
 }

-bool TranslationModel::isAlignmentSupported() const {
-  return false;
-}
+bool TranslationModel::isAlignmentSupported() const { return false; }
diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h
index 587926516..ba58969d1 100644
--- a/src/translator/TranslationModel.h
+++ b/src/translator/TranslationModel.h
@@ -7,9 +7,9 @@
 #ifndef SRC_TRANSLATOR_TRANSLATIONMODEL_H_
 #define SRC_TRANSLATOR_TRANSLATIONMODEL_H_

-#include <memory>
-#include <vector>
 #include <future>
+#include <memory>
+#include <vector>

 // All 3rd party includes
 #include "3rd_party/marian-dev/src/common/options.h"
@@ -18,47 +18,53 @@
 #include "AbstractTranslationModel.h"
 #include "TranslationModelConfiguration.h"

-/* A Translation model that translates a plain (without any markups and emojis) UTF-8 encoded text.
- * This implementation supports translation from 1 source language to 1 target language.
+/* A Translation model that translates a plain (without any markups and emojis)
+ * UTF-8 encoded text. This implementation supports translation from 1 source
+ * language to 1 target language.
 */
-class TranslationModel: public AbstractTranslationModel {
+class TranslationModel : public AbstractTranslationModel {
public:
-  /* Construct the model using the model configuration options.
-   */
-  TranslationModel(std::shared_ptr<marian::Options> options);
+  /* Construct the model using the model configuration options.
+   */
+  TranslationModel(std::shared_ptr<marian::Options> options);

-  ~TranslationModel();
+  ~TranslationModel();

-  /* This method performs translation on a list of UTF-8 encoded plain text (without any markups
-   * or emojis) and returns a list of results in the same order. The model supports translation
-   * from 1 source language to 1 target language.
-   *
-   * Each text entry can either be a word, a phrase, a sentence or a list of sentences. Additional
-   * information related to the translated text can be requested via TranslationRequest which is
-   * applied equally to each text entry. The translated text corresponding to each text entry and
-   * the additional information (as specified in the TranslationRequest) is encapsulated and
-   * returned in TranslationResult.
-   *
-   * The API splits each text entry into sentences internally, which are then translated
-   * independent of each other. The translated sentences are then joined back together and returned
-   * in TranslationResult.
-   *
-   * Please refer to the TranslationRequest class to find out what additional information can be
-   * requested. The alignment information can only be requested if the model supports it (check
-   * isAlignmentSupported() API).
-   *
-   * The texts argument will become empty after the execution of this API (each entry of texts list
-   * will be moved to its corresponding TranslationResult object).
-   */
-  std::future<std::vector<TranslationResult>> translate(
-      std::vector<std::string> &&texts, TranslationRequest request) override;
+  /* This method performs translation on a list of UTF-8 encoded plain text
+   * (without any markups or emojis) and returns a list of results in the same
+   * order. The model supports translation from 1 source language to 1 target
+   * language.
+   *
+   * Each text entry can either be a word, a phrase, a sentence or a list of
+   * sentences. Additional information related to the translated text can be
+   * requested via TranslationRequest which is applied equally to each text
+   * entry. The translated text corresponding to each text entry and the
+   * additional information (as specified in the TranslationRequest) is
+   * encapsulated and returned in TranslationResult.
+   *
+   * The API splits each text entry into sentences internally, which are then
+   * translated independent of each other. The translated sentences are then
+   * joined back together and returned in TranslationResult.
+   *
+   * Please refer to the TranslationRequest class to find out what additional
+   * information can be requested. The alignment information can only be
+   * requested if the model supports it (check isAlignmentSupported() API).
+   *
+   * The texts argument will become empty after the execution of this API (each
+   * entry of texts list will be moved to its corresponding TranslationResult
+   * object).
+   */
+  std::future<std::vector<TranslationResult>>
+  translate(std::vector<std::string> &&texts,
+            TranslationRequest request) override;

-  /* Check if the model can provide alignment information b/w original and translated text. */
-  bool isAlignmentSupported() const override;
+  /* Check if the model can provide alignment information b/w original and
+   * translated text. */
+  bool isAlignmentSupported() const override;

private:
-  // Model configuration options
-  std::shared_ptr<marian::Options> configOptions;
+  // Model configuration options
+  std::shared_ptr<marian::Options> configOptions;
 };

#endif /* SRC_TRANSLATOR_TRANSLATIONMODEL_H_ */
diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.cpp b/src/translator/TranslationModelConfigToOptionsAdaptor.cpp
index 3405a5fcf..00e37e0eb 100644
--- a/src/translator/TranslationModelConfigToOptionsAdaptor.cpp
+++ b/src/translator/TranslationModelConfigToOptionsAdaptor.cpp
@@ -6,12 +6,14 @@

 #include "TranslationModelConfigToOptionsAdaptor.h"

-TranslationModelConfigToOptionsAdaptor::TranslationModelConfigToOptionsAdaptor() {}
+TranslationModelConfigToOptionsAdaptor::
+    TranslationModelConfigToOptionsAdaptor() {}

-TranslationModelConfigToOptionsAdaptor::~TranslationModelConfigToOptionsAdaptor() {}
+TranslationModelConfigToOptionsAdaptor::
+    ~TranslationModelConfigToOptionsAdaptor() {}

-std::shared_ptr<marian::Options>
-TranslationModelConfigToOptionsAdaptor::adapt(const TranslationModelConfiguration& config) {
-  // ToDo: Add actual implementation
-  return std::make_shared<marian::Options>();
+std::shared_ptr<marian::Options> TranslationModelConfigToOptionsAdaptor::adapt(
+    const TranslationModelConfiguration &config) {
+  // ToDo: Add actual implementation
+  return std::make_shared<marian::Options>();
 }
diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.h b/src/translator/TranslationModelConfigToOptionsAdaptor.h
index 1eba4cced..49197b898 100644
--- a/src/translator/TranslationModelConfigToOptionsAdaptor.h
+++ b/src/translator/TranslationModelConfigToOptionsAdaptor.h
@@ -1,8 +1,9 @@
 /*
- * This class adapts the TranslationModelConfiguration object to marian::Options object.
- * marian::Options is a class that is specific to Marian and is used heavily inside it
- * as configuration options (even for translation workflow). It makes sense to work with
- * this class internally in the implementation files.
+ * This class adapts the TranslationModelConfiguration object to marian::Options
+ * object. marian::Options is a class that is specific to Marian and is used
+ * heavily inside it as configuration options (even for translation workflow).
+ * It makes sense to work with this class internally in the implementation
+ * files.
 */

 #ifndef SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_
@@ -16,17 +17,16 @@
 // All local includes
 #include "TranslationModelConfiguration.h"

-
class TranslationModelConfigToOptionsAdaptor {
public:
+  TranslationModelConfigToOptionsAdaptor();

-  TranslationModelConfigToOptionsAdaptor();
-
-  ~TranslationModelConfigToOptionsAdaptor();
+  ~TranslationModelConfigToOptionsAdaptor();

-  /* Create an Options object from the translation model configuration object.
-   */
-  std::shared_ptr<marian::Options> adapt(const TranslationModelConfiguration& config);
+  /* Create an Options object from the translation model configuration object.
+   */
+  std::shared_ptr<marian::Options>
+  adapt(const TranslationModelConfiguration &config);
 };

#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ */

From 08a7358c3d6caf55a6eb38f24b9955f474cd9729 Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Mon, 25 Jan 2021 18:15:22 +0000
Subject: [PATCH 040/442] Integrating marian-translator through API

Using std::string for config. Now capable of launching marian translator
through API interface.
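Aside on the adaptor above: it maps the API-level configuration onto Marian's internal `marian::Options`. That class is not available here, so the sketch below substitutes a plain `std::map`; the struct, function, and key names are illustrative only, not the real Marian option names.

```cpp
#include <map>
#include <string>

// Hypothetical input mirroring the configuration fields in the patch.
struct AdaptorInput {
  std::string modelPath;
  std::string sourceVocabPath;
  std::string targetVocabPath;
};

// Adaptor sketch: translate one configuration representation into another.
// A std::map of strings stands in for marian::Options.
std::map<std::string, std::string> adaptSketch(const AdaptorInput &config) {
  return {{"model", config.modelPath},
          {"source-vocab", config.sourceVocabPath},
          {"target-vocab", config.targetVocabPath}};
}
```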
There's a sketchy workaround to convert a string config to marian::Options,
with an added note.
---
 app/main-mts.cpp                            | 26 +------
 app/main.cpp                                | 52 ++++++++++---
 src/AbstractTranslationModel.h              |  2 +-
 src/TranslationResult.h                     | 12 ++-
 src/translator/AbstractTranslationModel.cpp |  9 +--
 src/translator/TranslationModel.cpp         | 82 +++++++++++++++++++--
 src/translator/TranslationModel.h           |  6 +-
 src/translator/parser.h                     | 32 ++++++++
 src/translator/translation_result.h         | 11 ++-
 9 files changed, 173 insertions(+), 59 deletions(-)
 create mode 100644 src/translator/parser.h

diff --git a/app/main-mts.cpp b/app/main-mts.cpp
index 3de57b074..9a1e71c63 100644
--- a/app/main-mts.cpp
+++ b/app/main-mts.cpp
@@ -3,35 +3,13 @@
 #include
 #include "common/definitions.h"
-#include "common/timer.h"
 #include "common/utils.h"
 #include "marian.h"
-#include "translator/history.h"
-#include "translator/output_collector.h"
-#include "translator/output_printer.h"
-
+#include "translator/parser.h"
 #include "translator/service.h"

 int main(int argc, char *argv[]) {
-  marian::ConfigParser cp(marian::cli::mode::translation);
-
-  cp.addOption<std::string>(
-      "--ssplit-prefix-file", "Bergamot Options",
-      "File with nonbreaking prefixes for sentence splitting.");
-
-  cp.addOption<std::string>("--ssplit-mode", "Server Options",
-                            "[paragraph, sentence, wrapped_text]");
-
-  cp.addOption<int>(
-      "--max-input-sentence-tokens", "Bergamot Options",
-      "Maximum input tokens to be processed in a single sentence.", 128);
-
-  cp.addOption<int>("--max-input-tokens", "Bergamot Options",
-                    "Maximum input tokens in a batch. control for"
-                    "Bergamot Queue",
-                    1024);
-
-  // Launch service.
+  auto cp = marian::bergamot::createConfigParser();
   auto options = cp.parseOptions(argc, argv, true);
   marian::bergamot::Service service(options);
diff --git a/app/main.cpp b/app/main.cpp
index bb0fa34e2..ec6ef6da0 100644
--- a/app/main.cpp
+++ b/app/main.cpp
@@ -11,25 +11,57 @@
 #include "TranslationModelConfiguration.h"
 #include "TranslationRequest.h"
 #include "TranslationResult.h"
+#include "translator/parser.h"

 int main(int argc, char **argv) {
-  // Create an instance of AbstractTranslationModel with a dummy model
-  // configuration
-  TranslationModelConfiguration config(
-      "dummy_modelFilePath", "dummy_sourceVocabPath", "dummy_targetVocabPath");
+  // Create a configParser and load command line parameters into a YAML config
+  // string.
+  auto configParser = marian::bergamot::createConfigParser();
+  auto options = configParser.parseOptions(argc, argv, true);
+  std::string config = options->asYamlString();
+  std::cout << config << std::endl;
+
+  // Route the config string to construct marian model through
+  // AbstractTranslationModel
   std::shared_ptr<AbstractTranslationModel> model =
       AbstractTranslationModel::createInstance(config);

-  // Call to translate a dummy (empty) texts with a dummy (empty) translation
-  // request
-  TranslationRequest req;
+  TranslationRequest translationRequest;
   std::vector<std::string> texts;
-  auto result = model->translate(std::move(texts), req);
+  for (int i = 0; i < 10; i++) {
+    texts.emplace_back(
+        "The Bergamot project will add and improve client-side machine"
+        "translation in a web browser. Unlike current cloud-based"
+        "options, running directly on users’ machines empowers citizens to"
+        "preserve their privacy and increases the uptake of language"
+        "technologies in Europe in various sectors that require"
+        "confidentiality. Free software integrated with an open-source web"
+        "browser, such as Mozilla Firefox, will enable bottom-up adoption"
+        "by non-experts, resulting in cost savings for private and public"
+        "sector users who would otherwise procure translation or operate"
+        "monolingually. Bergamot is a consortium coordinated by the"
+        "University of Edinburgh with partners Charles University in"
+        "Prague, the University of Sheffield, University of Tartu, and"
+        "Mozilla.");
+  }
+
+  auto result = model->translate(std::move(texts), translationRequest);

   // Resolve the future and get the actual result
-  std::vector<TranslationResult> res = result.get();
+  std::vector<TranslationResult> results = result.get();
+
+  for (auto &result : results) {
+    auto mappings = result.getSentenceMappings();
+    for (auto &p : mappings) {
+      std::string_view src = p.first;
+      std::string_view tgt = p.second;
+
+      std::cout << "[src]: " << src << std::endl;
+      std::cout << "[tgt]: " << tgt << std::endl;
+      std::cout << std::endl;
+    }
+  }

-  std::cout << "Count is: " << res.size() << std::endl;
   return 0;
 }
diff --git a/src/AbstractTranslationModel.h b/src/AbstractTranslationModel.h
index b76aeebed..69b72cf39 100644
--- a/src/AbstractTranslationModel.h
+++ b/src/AbstractTranslationModel.h
@@ -30,7 +30,7 @@ class AbstractTranslationModel {
    * configuration (TranslationModelConfiguration).
    */
   static std::shared_ptr<AbstractTranslationModel>
-  createInstance(const TranslationModelConfiguration &config);
+  createInstance(const std::string &config);

   AbstractTranslationModel() = default;

diff --git a/src/TranslationResult.h b/src/TranslationResult.h
index 34858f74c..6e5d801e1 100644
--- a/src/TranslationResult.h
+++ b/src/TranslationResult.h
@@ -21,12 +21,16 @@ class TranslationResult {
   typedef std::vector<std::pair<std::string_view, std::string_view>>
       SentenceMappings;

-  TranslationResult(const std::string &original, const std::string &translation)
-      : originalText(original), translatedText(translation) {}
+  TranslationResult(const std::string &original, const std::string &translation,
+                    SentenceMappings &sentenceMappings)
+      : originalText(original), translatedText(translation),
+        sentenceMappings(sentenceMappings) {}

-  TranslationResult(std::string &&original, std::string &&translation)
+  TranslationResult(std::string &&original, std::string &&translation,
+                    SentenceMappings &&sentenceMappings)
       : originalText(std::move(original)),
-        translatedText(std::move(translation)) {}
+        translatedText(std::move(translation)),
+        sentenceMappings(std::move(sentenceMappings)) {}

   /* Return the original text.
*/ const std::string &getOriginalText() const { return originalText; } diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index 94782fa81..e7a917922 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -13,9 +13,8 @@ #include "TranslationModelConfigToOptionsAdaptor.h" std::shared_ptr -AbstractTranslationModel::createInstance( - const TranslationModelConfiguration &config) { - TranslationModelConfigToOptionsAdaptor adaptor; - auto options = adaptor.adapt(config); - return std::make_shared(options); +AbstractTranslationModel::createInstance(const std::string &config) { + // TranslationModelConfigToOptionsAdaptor adaptor; + // auto options = adaptor.adapt(config); + return std::make_shared(config); } diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index b3a8fec32..ce1310614 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -6,21 +6,89 @@ #include #include +#include "3rd_party/marian-dev/src/3rd_party/yaml-cpp/yaml.h" +#include "3rd_party/marian-dev/src/common/config_parser.h" #include "TranslationModel.h" +#include "common/config_validator.h" +#include "common/options.h" +#include "translator/service.h" -TranslationModel::TranslationModel(std::shared_ptr options) - : configOptions(std::move(options)), AbstractTranslationModel() {} +std::shared_ptr parseOptions(const std::string &config) { + marian::Options options; + + // @TODO(jerinphilip) There's something off here, @XapaJIaMnu suggests + // that should not be using the defaultConfig. This function only has access + // to std::string config and needs to be able to construct Options from the + // same. + + // Absent the following code-segment, there is a parsing exception thrown on + // rebuilding YAML. 
+ // + // Error: Unhandled exception of type 'N4YAML11InvalidNodeE': invalid node; + // this may result from using a map iterator as a sequence iterator, or + // vice-versa + // + // Error: Aborted from void unhandledException() in + // 3rd_party/marian-dev/src/common/logging.cpp:113 + + marian::ConfigParser configParser(marian::cli::mode::translation); + const YAML::Node &defaultConfig = configParser.getConfig(); + + options.merge(defaultConfig); + + // Parse configs onto defaultConfig. + options.parse(config); + YAML::Node configCopy = options.cloneToYamlNode(); + + marian::ConfigValidator validator(configCopy); + validator.validateOptions(marian::cli::mode::translation); + + return std::make_shared(options); +} + +TranslationModel::TranslationModel(const std::string &config) + : configOptions_(std::move(parseOptions(config))), + AbstractTranslationModel(), service_(configOptions_) {} TranslationModel::~TranslationModel() {} std::future> TranslationModel::translate(std::vector &&texts, TranslationRequest request) { - // ToDo: Replace this code with the actual implementation - return std::async([]() { - std::vector results; - return results; - }); + // Implementing a non-async version first. Unpleasant, but should work. + std::promise> promise; + auto future = promise.get_future(); + + auto convert = [](marian::bergamot::TranslationResult &mTranslationResult) { + // Change marian::string_view to std::string_view + TranslationResult::SentenceMappings sentenceMappings; + for (auto &p : mTranslationResult.getSentenceMappings()) { + std::string_view src(p.first.data(), p.first.size()), + tgt(p.second.data(), p.second.size()); + sentenceMappings.emplace_back(src, tgt); + } + + TranslationResult translationResult( + std::move(mTranslationResult.source_), + std::move(mTranslationResult.translation_), + std::move(sentenceMappings)); + + return translationResult; + }; + + // This code, move into async? 
+ std::vector translationResults; + for (auto &text : texts) { + // Copying text, can also be replaced with move based function. + // translate(...) + auto intermediate = service_.translateWithCopy(text); + intermediate.wait(); + marian::bergamot::TranslationResult result = intermediate.get(); + translationResults.push_back(convert(result)); + } + + promise.set_value(translationResults); + return future; } bool TranslationModel::isAlignmentSupported() const { return false; } diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index ba58969d1..686ca0554 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -17,6 +17,7 @@ // All local project includes #include "AbstractTranslationModel.h" #include "TranslationModelConfiguration.h" +#include "translator/service.h" /* A Translation model that translates a plain (without any markups and emojis) * UTF-8 encoded text. This implementation supports translation from 1 source @@ -26,7 +27,7 @@ class TranslationModel : public AbstractTranslationModel { public: /* Construct the model using the model configuration options. 
*/ - TranslationModel(std::shared_ptr options); + TranslationModel(const std::string &config); ~TranslationModel(); @@ -64,7 +65,8 @@ class TranslationModel : public AbstractTranslationModel { private: // Model configuration options - std::shared_ptr configOptions; + std::shared_ptr configOptions_; // ORDER DEPENDENCY + marian::bergamot::Service service_; // ORDER DEPENDENCY }; #endif /* SRC_TRANSLATOR_TRANSLATIONMODEL_H_ */ diff --git a/src/translator/parser.h b/src/translator/parser.h new file mode 100644 index 000000000..e273d6aea --- /dev/null +++ b/src/translator/parser.h @@ -0,0 +1,32 @@ +#ifndef SRC_BERGAMOT_PARSER_H +#define SRC_BERGAMOT_PARSER_H + +#include "marian.h" + +namespace marian { +namespace bergamot { +marian::ConfigParser createConfigParser() { + marian::ConfigParser cp(marian::cli::mode::translation); + cp.addOption( + "--ssplit-prefix-file", "Bergamot Options", + "File with nonbreaking prefixes for sentence splitting."); + + cp.addOption("--ssplit-mode", "Server Options", + "[paragraph, sentence, wrapped_text]", "paragraph"); + + cp.addOption( + "--max-input-sentence-tokens", "Bergamot Options", + "Maximum input tokens to be processed in a single sentence.", 128); + + cp.addOption("--max-input-tokens", "Bergamot Options", + "Maximum input tokens in a batch; control for " + "Bergamot Queue.", + 1024); + + return cp; +} + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_PARSER_H diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h index fb5a42a09..3987ee27b 100644 --- a/src/translator/translation_result.h +++ b/src/translator/translation_result.h @@ -40,10 +40,14 @@ class TranslationResult { // For development use to benchmark with marian-decoder.
const Histories &getHistories() const { return histories_; } -private: std::string source_; std::string translation_; + // Adding the following to complete bergamot-translator spec, redundant while + // sourceMappings_ and targetMappings_ exists or vice-versa. + SentenceMappings sentenceMappings_; + +private: // Histories are currently required for interoperability with OutputPrinter // and OutputCollector and hence comparisons with marian-decoder. // Future hook to gain alignments. @@ -59,11 +63,6 @@ class TranslationResult { // string_views at the sentence-level. std::vector sourceMappings_; std::vector targetMappings_; - - // Adding the following to complete bergamot-translator spec, redundant while - // sourceMappings_ and targetMappings_ exists or vice-versa. - - SentenceMappings sentenceMappings_; }; } // namespace bergamot } // namespace marian From 026f1af887bb4e6dc205207b6433598f0ce89114 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 25 Jan 2021 11:52:23 +0100 Subject: [PATCH 041/442] Removed redundant lines from CMakeFile --- CMakeLists.txt | 4 ---- 1 file changed, 4 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 935cd1eab..0a2005dc1 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -17,9 +17,5 @@ option(USE_STATIC_LIBS "Link statically against non-system libs" ON) option(USE_MKL "Compile with MKL support" ON) add_subdirectory(3rd_party) - -# Adds the include directories set inside 3rd_party. 
-include_directories(${INCLUDE_DIRECTORIES}) - add_subdirectory(src) add_subdirectory(app) From b49f2c1af3a9113fbdf426b4133c0587e799ffa0 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 25 Jan 2021 18:46:04 +0100 Subject: [PATCH 042/442] Cleanup TranslationModelConfiguration to std::string change in API - Provide yaml formatted string as model configuration - Remove redundant files --- app/main.cpp | 1 - src/AbstractTranslationModel.h | 3 +- src/TranslationModelConfiguration.h | 61 ------------------- src/translator/AbstractTranslationModel.cpp | 6 -- src/translator/CMakeLists.txt | 1 - src/translator/TranslationModel.cpp | 5 +- src/translator/TranslationModel.h | 3 +- ...TranslationModelConfigToOptionsAdaptor.cpp | 19 ------ .../TranslationModelConfigToOptionsAdaptor.h | 32 ---------- 9 files changed, 6 insertions(+), 125 deletions(-) delete mode 100644 src/TranslationModelConfiguration.h delete mode 100644 src/translator/TranslationModelConfigToOptionsAdaptor.cpp delete mode 100644 src/translator/TranslationModelConfigToOptionsAdaptor.h diff --git a/app/main.cpp b/app/main.cpp index ec6ef6da0..f5d65969d 100644 --- a/app/main.cpp +++ b/app/main.cpp @@ -8,7 +8,6 @@ #include #include "AbstractTranslationModel.h" -#include "TranslationModelConfiguration.h" #include "TranslationRequest.h" #include "TranslationResult.h" #include "translator/parser.h" diff --git a/src/AbstractTranslationModel.h b/src/AbstractTranslationModel.h index 69b72cf39..6cb30c4a2 100644 --- a/src/AbstractTranslationModel.h +++ b/src/AbstractTranslationModel.h @@ -15,7 +15,6 @@ #include #include -#include "TranslationModelConfiguration.h" #include "TranslationRequest.h" #include "TranslationResult.h" @@ -27,7 +26,7 @@ class AbstractTranslationModel { public: /* A Factory method to create and return an instance of an implementation of * AbstractTranslationModel. The instance is created using translation model - * configuration (TranslationModelConfiguration). 
+ * configuration provided as yaml-formatted string. */ static std::shared_ptr createInstance(const std::string &config); diff --git a/src/TranslationModelConfiguration.h b/src/TranslationModelConfiguration.h deleted file mode 100644 index f4a5572ea..000000000 --- a/src/TranslationModelConfiguration.h +++ /dev/null @@ -1,61 +0,0 @@ -/* - * TranslationModelConfiguration.h - * - */ - -#ifndef SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ -#define SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ - -#include - -/* This class encapsulates the configuration that is required by a translation - * model to perform translation. - */ -class TranslationModelConfiguration { -public: - // Constructor - TranslationModelConfiguration(const std::string &modelFilePath, - const std::string &sourceVocabPath, - const std::string &targetVocabPath) - : modelPath(modelFilePath), sourceLanguageVocabPath(sourceVocabPath), - targetLanguageVocabPath(targetVocabPath) {} - - // Copy constructor - TranslationModelConfiguration(const TranslationModelConfiguration &rhs) - : modelPath(rhs.modelPath), - sourceLanguageVocabPath(rhs.sourceLanguageVocabPath), - targetLanguageVocabPath(rhs.targetLanguageVocabPath) {} - - // Move constructor - TranslationModelConfiguration(TranslationModelConfiguration &&rhs) - : modelPath(std::move(rhs.modelPath)), - sourceLanguageVocabPath(std::move(rhs.sourceLanguageVocabPath)), - targetLanguageVocabPath(std::move(rhs.targetLanguageVocabPath)) {} - - // Return the path of the model file - const std::string &getModelFilePath() const { return modelPath; } - - // Return the path of the source language vocabulary file - const std::string &getSourceVocabularyPath() const { - return sourceLanguageVocabPath; - } - - // Return the path of the target language vocabulary file - const std::string &getTargetVocabularyPath() const { - return targetLanguageVocabPath; - } - -private: - // Path to the translation model file - const std::string modelPath; - - // Path to the source 
vocabulary file to be used by the model - const std::string sourceLanguageVocabPath; - - // Path to the target vocabulary file to be used by the model - const std::string targetLanguageVocabPath; - - // ToDo: Add other user configurable options (e.g. min batch size) -}; - -#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGURATION_H_ */ diff --git a/src/translator/AbstractTranslationModel.cpp b/src/translator/AbstractTranslationModel.cpp index e7a917922..1b2f2b104 100644 --- a/src/translator/AbstractTranslationModel.cpp +++ b/src/translator/AbstractTranslationModel.cpp @@ -4,17 +4,11 @@ */ #include -// All 3rd party includes -#include "3rd_party/marian-dev/src/common/options.h" - // All local includes #include "AbstractTranslationModel.h" #include "TranslationModel.h" -#include "TranslationModelConfigToOptionsAdaptor.h" std::shared_ptr AbstractTranslationModel::createInstance(const std::string &config) { - // TranslationModelConfigToOptionsAdaptor adaptor; - // auto options = adaptor.adapt(config); return std::make_shared(config); } diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 27158a786..b6fcf69fc 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -1,7 +1,6 @@ add_library(bergamot-translator STATIC AbstractTranslationModel.cpp TranslationModel.cpp - TranslationModelConfigToOptionsAdaptor.cpp # Following files added from browsermt/mts@nuke textops.cpp diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index ce1310614..9bfaf1bec 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -6,11 +6,14 @@ #include #include +// All 3rd party includes #include "3rd_party/marian-dev/src/3rd_party/yaml-cpp/yaml.h" #include "3rd_party/marian-dev/src/common/config_parser.h" -#include "TranslationModel.h" #include "common/config_validator.h" #include "common/options.h" + +// All local project includes +#include "TranslationModel.h" #include 
"translator/service.h" std::shared_ptr parseOptions(const std::string &config) { diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index 686ca0554..c922538a3 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -16,7 +16,6 @@ // All local project includes #include "AbstractTranslationModel.h" -#include "TranslationModelConfiguration.h" #include "translator/service.h" /* A Translation model that translates a plain (without any markups and emojis) @@ -25,7 +24,7 @@ */ class TranslationModel : public AbstractTranslationModel { public: - /* Construct the model using the model configuration options. + /* Construct the model using the model configuration options as yaml-formatted string */ TranslationModel(const std::string &config); diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.cpp b/src/translator/TranslationModelConfigToOptionsAdaptor.cpp deleted file mode 100644 index 00e37e0eb..000000000 --- a/src/translator/TranslationModelConfigToOptionsAdaptor.cpp +++ /dev/null @@ -1,19 +0,0 @@ -/* - * TranslationModelConfigToOptionsAdaptor.cpp - * - */ -#include - -#include "TranslationModelConfigToOptionsAdaptor.h" - -TranslationModelConfigToOptionsAdaptor:: - TranslationModelConfigToOptionsAdaptor() {} - -TranslationModelConfigToOptionsAdaptor:: - ~TranslationModelConfigToOptionsAdaptor() {} - -std::shared_ptr TranslationModelConfigToOptionsAdaptor::adapt( - const TranslationModelConfiguration &config) { - // ToDo: Add actual implementation - return std::make_shared(); -} diff --git a/src/translator/TranslationModelConfigToOptionsAdaptor.h b/src/translator/TranslationModelConfigToOptionsAdaptor.h deleted file mode 100644 index 49197b898..000000000 --- a/src/translator/TranslationModelConfigToOptionsAdaptor.h +++ /dev/null @@ -1,32 +0,0 @@ -/* - * This class adapts the TranslationModelConfiguration object to marian::Options - * object. 
marian::Options is a class that is specific to Marian and is used - * heavily inside it as configuration options (even for translation workflow). - * It makes sense to work with this class internally in the implementation - * files. - */ - -#ifndef SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ -#define SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ - -#include - -// All 3rd party includes -#include "3rd_party/marian-dev/src/common/options.h" - -// All local includes -#include "TranslationModelConfiguration.h" - -class TranslationModelConfigToOptionsAdaptor { -public: - TranslationModelConfigToOptionsAdaptor(); - - ~TranslationModelConfigToOptionsAdaptor(); - - /* Create an Options object from the translation model configuration object. - */ - std::shared_ptr - adapt(const TranslationModelConfiguration &config); -}; - -#endif /* SRC_TRANSLATOR_TRANSLATIONMODELCONFIGTOOPTIONSADAPTOR_H_ */ From 0d16b1957ff3bda44311cd48d267ed238cf1c594 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 26 Jan 2021 14:49:28 +0100 Subject: [PATCH 043/442] Improved main.cpp file - Print original and translated text - Just add 2 vector entries for texts --- app/main.cpp | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/app/main.cpp b/app/main.cpp index f5d65969d..8b7fe5390 100644 --- a/app/main.cpp +++ b/app/main.cpp @@ -28,38 +28,38 @@ int main(int argc, char **argv) { TranslationRequest translationRequest; std::vector texts; - for (int i = 0; i < 10; i++) { - texts.emplace_back( - "The Bergamot project will add and improve client-side machine" - "translation in a web browser. Unlike current cloud-based" - "options, running directly on users’ machines empowers citizens to" - "preserve their privacy and increases the uptake of language" - "technologies in Europe in various sectors that require" - "confidentiality. 
Free software integrated with an open-source web" - "browser, such as Mozilla Firefox, will enable bottom-up adoption" - "by non-experts, resulting in cost savings for private and public" - "sector users who would otherwise procure translation or operate" - "monolingually. Bergamot is a consortium coordinated by the" - "University of Edinburgh with partners Charles University in" - "Prague, the University of Sheffield, University of Tartu, and" + texts.emplace_back("The Bergamot project will add and improve client-side machine " + "translation in a web browser. Unlike current cloud-based " + "options, running directly on users’ machines empowers citizens to " + "preserve their privacy and increases the uptake of language " + "technologies in Europe in various sectors that require " + "confidentiality."); + texts.emplace_back("Free software integrated with an open-source web " + "browser, such as Mozilla Firefox, will enable bottom-up adoption " + "by non-experts, resulting in cost savings for private and public " + "sector users who would otherwise procure translation or operate " + "monolingually. 
Bergamot is a consortium coordinated by the " + "University of Edinburgh with partners Charles University in " + "Prague, the University of Sheffield, University of Tartu, and " "Mozilla."); - } - auto result = model->translate(std::move(texts), translationRequest); + auto futureResults = model->translate(std::move(texts), translationRequest); // Resolve the future and get the actual result - std::vector results = result.get(); + std::vector results = futureResults.get(); for (auto &result : results) { + std::cout << "[original]: " << result.getOriginalText() << std::endl; + std::cout << "[translated]: " << result.getTranslatedText() << std::endl; auto mappings = result.getSentenceMappings(); for (auto &p : mappings) { std::string_view src = p.first; std::string_view tgt = p.second; - std::cout << "[src]: " << src << std::endl; - std::cout << "[tgt]: " << tgt << std::endl; - std::cout << std::endl; + std::cout << " [src Sentence]: " << src << std::endl; + std::cout << " [tgt Sentence]: " << tgt << std::endl; } + std::cout << std::endl; } return 0; From 9a17f365c6c0161742af901e2ff0c93f75aa7593 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 26 Jan 2021 21:18:15 +0000 Subject: [PATCH 044/442] Fix for garbled output through cli. Requirement for string_view is the original source string be transferred all the way from input to service to back to TranslationResult. This constraint was violated in several places by means of existence of a copy-constructor. The issue is fixed by deleting copy and assignment constructors in marian::bergamot::TranslationResult and UnifiedAPI::TranslationResult, which demonstrated a few occurances of the same. Replaced the same with move semantics. In addition, future is set and get using move semantics at the moment. Default move-constructor didn't seem to be working, so they're made explicit for TranslationResults. 
This commit additionally packs a few deletions and structural improvements (textops.cpp, batcher.cpp) made while inspecting and fixing the garbled outputs. They were kept, in the interest of time, rather than engineering prettified atomic commits. This combines the following commits from jp/string-view-bug: [acfc92 78a588 12d91b 00a277 919e2f 9d3a46 b7e39b 18f67b bf667c] --- app/main-mts.cpp | 25 ++++++++++---- src/TranslationResult.h | 7 ++++ src/translator/TranslationModel.cpp | 34 ++++++++----------- src/translator/batcher.cpp | 7 ++-- src/translator/request.cpp | 5 ++- src/translator/textops.cpp | 37 ++++++--------------- src/translator/translation_result.cpp | 48 +++++++++++++-------------- src/translator/translation_result.h | 26 +++++++++------ 8 files changed, 96 insertions(+), 93 deletions(-) diff --git a/app/main-mts.cpp b/app/main-mts.cpp index 9a1e71c63..44a019a0d 100644 --- a/app/main-mts.cpp +++ b/app/main-mts.cpp @@ -1,4 +1,5 @@ #include +#include #include #include @@ -7,6 +8,7 @@ #include "marian.h" #include "translator/parser.h" #include "translator/service.h" +#include "translator/translation_result.h" int main(int argc, char *argv[]) { auto cp = marian::bergamot::createConfigParser(); @@ -17,17 +19,26 @@ int main(int argc, char *argv[]) { std::ostringstream std_input; std_input << std::cin.rdbuf(); std::string input = std_input.str(); + using marian::bergamot::TranslationResult; - LOG(info, "IO complete Translating input"); // Wait on future until TranslationResult is complete - auto translation_result_future = service.translate(std::move(input)); + std::future translation_result_future = service.translate(std::move(input)); translation_result_future.wait(); - auto translation_result = translation_result_future.get(); + const TranslationResult &translation_result = translation_result_future.get(); - // Obtain sentencemappings and print them as Proof of Concept.
- for (auto &p : translation_result.getSentenceMappings()) { - std::cout << "[src] " << p.first << "\n"; - std::cout << "[tgt] " << p.second << "\n"; + std::cout << "service-cli [Source text]: "; + std::cout << translation_result.getOriginalText() << std::endl; + + std::cout << "service-cli [Translated text]: "; + std::cout << translation_result.getTranslatedText() << std::endl; + + // Obtain sentenceMappings and print them as Proof of Concept. + const TranslationResult::SentenceMappings &sentenceMappings = + translation_result.getSentenceMappings(); + for (auto &p : sentenceMappings) { + std::cout << "service-cli [src] " << p.first << "\n"; + std::cout << "service-cli [tgt] " << p.second << "\n"; } // Stop Service. diff --git a/src/TranslationResult.h b/src/TranslationResult.h index 6e5d801e1..d743ff5ff 100644 --- a/src/TranslationResult.h +++ b/src/TranslationResult.h @@ -26,12 +26,19 @@ class TranslationResult { : originalText(original), translatedText(translation), sentenceMappings(sentenceMappings) {} + TranslationResult(TranslationResult &&other) + : originalText(std::move(other.originalText)), + translatedText(std::move(other.translatedText)), + sentenceMappings(std::move(other.sentenceMappings)) {} + TranslationResult(std::string &&original, std::string &&translation, SentenceMappings &&sentenceMappings) : originalText(std::move(original)), translatedText(std::move(translation)), sentenceMappings(std::move(sentenceMappings)) {} + TranslationResult &operator=(const TranslationResult &) = delete; + /* Return the original text. 
*/ const std::string &getOriginalText() const { return originalText; } diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index 9bfaf1bec..f501678cf 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -62,8 +62,15 @@ TranslationModel::translate(std::vector &&texts, std::promise> promise; auto future = promise.get_future(); - auto convert = [](marian::bergamot::TranslationResult &mTranslationResult) { - // Change marian::string_view to std::string_view + // This code, move into async? + std::vector translationResults; + for (auto &text : texts) { + // Collect future as marian::bergamot::TranslationResult + auto intermediate = service_.translate(std::move(text)); + intermediate.wait(); + auto mTranslationResult(std::move(intermediate.get())); + + // Convert to UnifiedAPI::TranslationResult TranslationResult::SentenceMappings sentenceMappings; for (auto &p : mTranslationResult.getSentenceMappings()) { std::string_view src(p.first.data(), p.first.size()), @@ -71,26 +78,13 @@ TranslationModel::translate(std::vector &&texts, sentenceMappings.emplace_back(src, tgt); } - TranslationResult translationResult( - std::move(mTranslationResult.source_), - std::move(mTranslationResult.translation_), - std::move(sentenceMappings)); - - return translationResult; - }; - - // This code, move into async? - std::vector translationResults; - for (auto &text : texts) { - // Copying text, can also be replaced with move based function. - // translate(...) - auto intermediate = service_.translateWithCopy(text); - intermediate.wait(); - marian::bergamot::TranslationResult result = intermediate.get(); - translationResults.push_back(convert(result)); + // In place construction. 
+ translationResults.emplace_back(std::move(mTranslationResult.source_), + std::move(mTranslationResult.translation_), + std::move(sentenceMappings)); } - promise.set_value(translationResults); + promise.set_value(std::move(translationResults)); return future; } diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 471263df9..22ee46d2a 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -9,9 +9,10 @@ namespace bergamot { Batcher::Batcher(Ptr options) { max_input_tokens_ = options->get("max-input-tokens"); bucket_.resize(options->get("max-input-sentence-tokens") + 1); - ABORT_IF(max_input_tokens_ >= bucket_.size(), - "max-input-sentence-tokens cannot be greater than max-input-tokens, " - "batcher fail"); + ABORT_IF( + max_input_tokens_ < bucket_.size() - 1, + "max-input-tokens cannot be less than max-input-sentence-tokens, " + "batcher fail"); } void Batcher::addSentenceWithPriority(RequestSentence &sentence) { diff --git a/src/translator/request.cpp b/src/translator/request.cpp index 0d02c03ac..a743389b4 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -48,11 +48,10 @@ void Request::processHistory(size_t index, Ptr history) { void Request::completeRequest() { // Request no longer needs to hold the content, can transfer it to // TranslationResult. - TranslationResult translation_result(std::move(source_), std::move(segments_), + TranslationResult translation_result(std::move(source_), std::move(sourceAlignments_), std::move(histories_), *vocabs_); - LOG(info, "Last translation in. 
Closing request;"); - response_.set_value(translation_result); + response_.set_value(std::move(translation_result)); } bool Request::operator<(const Request &b) const { diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp index 837ea7226..25e48f1fd 100644 --- a/src/translator/textops.cpp +++ b/src/translator/textops.cpp @@ -50,10 +50,10 @@ SentenceSplitter::string2splitmode(const std::string &m) { return splitmode::wrapped_text; } -Segment TextProcessor::tokenize(const string_view &snt, +Segment TextProcessor::tokenize(const string_view &segment, TokenRanges &tokenRanges) { return vocabs_->front()->encodePreservingSource( - snt, tokenRanges, /*addEOS=*/false, /*inference=*/true); + segment, tokenRanges, /*addEOS=*/false, /*inference=*/true); } TextProcessor::TextProcessor(std::vector> &vocabs, @@ -90,33 +90,18 @@ void TextProcessor::process(const string_view &query, Segments &segments, void TextProcessor::truncate(Segment &segment, TokenRanges &tokenRanges, Segments &segments, std::vector &sourceRanges) { - if (segment.size() > max_input_sentence_tokens_) { - int offset; - // Loop as long as I can grab max_input_sentence_tokens_ - for (offset = 0; offset + max_input_sentence_tokens_ < segment.size(); - offset += max_input_sentence_tokens_) { - auto start = segment.begin() + offset; - - segments.emplace_back(start, start + max_input_sentence_tokens_); - segments.back().push_back(sourceEosId()); - - auto astart = tokenRanges.begin() + offset; - sourceRanges.emplace_back(astart, astart + max_input_sentence_tokens_); - } - - if (offset < max_input_sentence_tokens_) { - auto start = segment.begin() + offset; - segments.emplace_back(start, segment.end()); - segments.back().push_back(sourceEosId()); + for (int offset = 0; offset < segment.size(); + offset += max_input_sentence_tokens_) { + auto start = segment.begin() + offset; - auto astart = tokenRanges.begin() + offset; - sourceRanges.emplace_back(astart, tokenRanges.end()); - } + unsigned int left = 
segment.size() - offset; + unsigned int diff = std::min(max_input_sentence_tokens_, left); - } else { - segments.emplace_back(segment); + segments.emplace_back(start, start + diff); segments.back().push_back(sourceEosId()); - sourceRanges.emplace_back(tokenRanges); + + auto astart = tokenRanges.begin() + offset; + sourceRanges.emplace_back(astart, astart + diff); } } diff --git a/src/translator/translation_result.cpp b/src/translator/translation_result.cpp index 1c74314e3..d69259f84 100644 --- a/src/translator/translation_result.cpp +++ b/src/translator/translation_result.cpp @@ -7,32 +7,31 @@ namespace marian { namespace bergamot { -TranslationResult::TranslationResult(std::string &&source, Segments &&segments, +TranslationResult::TranslationResult(std::string &&source, std::vector &&sourceRanges, Histories &&histories, std::vector> &vocabs) : source_(std::move(source)), sourceRanges_(std::move(sourceRanges)), - segments_(std::move(segments)), histories_(std::move(histories)), - vocabs_(&vocabs) { + histories_(std::move(histories)) { - // Process sourceMappings into sourceMappings_. - LOG(info, "Creating sourcemappings"); - sourceMappings_.reserve(segments_.size()); - for (int i = 0; i < segments_.size(); i++) { + std::vector sourceMappings; + std::vector targetMappings; + + // Process sourceMappings into sourceMappings. + sourceMappings.reserve(sourceRanges_.size()); + for (int i = 0; i < sourceRanges_.size(); i++) { string_view first = sourceRanges_[i].front(); string_view last = sourceRanges_[i].back(); - int size = last.end() - first.begin(); - sourceMappings_.emplace_back(first.data(), size); + sourceMappings.emplace_back(first.data(), last.end() - first.begin()); } // Compiles translations into a single std::string translation_ // Current implementation uses += on std::string, multiple resizes. - // Stores ByterRanges as indices first, followed by conversion into + // Stores ByteRanges as indices first, followed by conversion into // string_views. 
// TODO(jerin): Add token level string_views here as well. - LOG(info, "Decoding"); std::vector> translationRanges; - int offset{0}, end{0}; + size_t offset{0}; bool first{true}; for (auto &history : histories_) { // TODO(jerin): Change hardcode of nBest = 1 @@ -40,31 +39,32 @@ TranslationResult::TranslationResult(std::string &&source, Segments &&segments, Result result = onebest[0]; // Expecting only one result; Words words = std::get<0>(result); - std::string decoded = vocabs_->back()->decode(words); + std::string decoded = (vocabs.back())->decode(words); if (first) { first = false; } else { translation_ += " "; + ++offset; } translation_ += decoded; - end = offset + (first ? 0 : 1) /*space*/ + decoded.size(); - translationRanges.emplace_back(offset, end); - offset = end; + translationRanges.emplace_back(offset, decoded.size()); + offset += decoded.size(); } // Converting ByteRanges as indices into string_views. - LOG(info, "generating targetMappings"); - targetMappings_.reserve(translationRanges.size()); - for (auto &p : translationRanges) { - targetMappings_.emplace_back(&translation_[p.first], p.second - p.first); + targetMappings.reserve(translationRanges.size()); + for (auto &range : translationRanges) { + const char *begin = &translation_[range.first]; + targetMappings.emplace_back(begin, range.second); } // Surely, let's add sentenceMappings_ - LOG(info, "generating SentenceMappings"); - for (auto p = sourceMappings_.begin(), q = targetMappings_.begin(); - p != sourceMappings_.end() && q != targetMappings_.end(); ++p, ++q) { - sentenceMappings_.emplace_back(*p, *q); + for (auto src = sourceMappings.begin(), tgt = targetMappings.begin(); + src != sourceMappings.end() && tgt != targetMappings.end(); + ++src, ++tgt) { + sentenceMappings_.emplace_back(*src, *tgt); + auto &t = sentenceMappings_.back(); } } diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h index 3987ee27b..edc9a8ddd 100644 --- 
a/src/translator/translation_result.h +++ b/src/translator/translation_result.h @@ -13,11 +13,21 @@ namespace marian { namespace bergamot { class TranslationResult { public: - TranslationResult(std::string &&source, Segments &&segments, + TranslationResult(std::string &&source, std::vector &&sourceRanges, Histories &&histories, std::vector> &vocabs); + TranslationResult(TranslationResult &&other) + : source_(std::move(other.source_)), + translation_(std::move(other.translation_)), + sourceRanges_(std::move(other.sourceRanges_)), + sentenceMappings_(std::move(other.sentenceMappings_)), + histories_(std::move(other.histories_)){}; + + TranslationResult(const TranslationResult &) = delete; + TranslationResult &operator=(const TranslationResult &) = delete; + // Returns const references to source and translated texts, for external // consumption. @@ -28,7 +38,8 @@ class TranslationResult { // pair for external consumption. Each entry corresponding // to a (source-sentence, target-sentence). - typedef std::vector> SentenceMappings; + typedef std::vector> + SentenceMappings; const SentenceMappings &getSentenceMappings() const { return sentenceMappings_; } @@ -40,6 +51,9 @@ class TranslationResult { // For development use to benchmark with marian-decoder. const Histories &getHistories() const { return histories_; } + // @jerinphilip: Why are these members no longer-private? For move-semantics + // with consistent string_views for bergamot-translator. + std::string source_; std::string translation_; // Adding the following to complete bergamot-translator spec, redundant while @@ -53,16 +67,8 @@ class TranslationResult { // Future hook to gain alignments. Histories histories_; - // Can be removed eventually. - Segments segments_; - std::vector> *vocabs_; - // string_views at the token level. std::vector sourceRanges_; - - // string_views at the sentence-level. 
- std::vector sourceMappings_; - std::vector targetMappings_; }; } // namespace bergamot } // namespace marian From e76a602dc7205567fdcb76820349bafa8f51bf51 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 28 Jan 2021 21:44:05 +0000 Subject: [PATCH 045/442] Removing config file printing --- app/main.cpp | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/app/main.cpp b/app/main.cpp index 8b7fe5390..ef61eb2da 100644 --- a/app/main.cpp +++ b/app/main.cpp @@ -19,7 +19,6 @@ int main(int argc, char **argv) { auto configParser = marian::bergamot::createConfigParser(); auto options = configParser.parseOptions(argc, argv, true); std::string config = options->asYamlString(); - std::cout << config << std::endl; // Route the config string to construct marian model through // AbstractTranslationModel @@ -28,20 +27,22 @@ int main(int argc, char **argv) { TranslationRequest translationRequest; std::vector texts; - texts.emplace_back("The Bergamot project will add and improve client-side machine " - "translation in a web browser. Unlike current cloud-based " - "options, running directly on users’ machines empowers citizens to " - "preserve their privacy and increases the uptake of language " - "technologies in Europe in various sectors that require " - "confidentiality."); - texts.emplace_back("Free software integrated with an open-source web " - "browser, such as Mozilla Firefox, will enable bottom-up adoption " - "by non-experts, resulting in cost savings for private and public " - "sector users who would otherwise procure translation or operate " - "monolingually. Bergamot is a consortium coordinated by the " - "University of Edinburgh with partners Charles University in " - "Prague, the University of Sheffield, University of Tartu, and " - "Mozilla."); + texts.emplace_back( + "The Bergamot project will add and improve client-side machine " + "translation in a web browser. 
Unlike current cloud-based " + "options, running directly on users’ machines empowers citizens to " + "preserve their privacy and increases the uptake of language " + "technologies in Europe in various sectors that require " + "confidentiality."); + texts.emplace_back( + "Free software integrated with an open-source web " + "browser, such as Mozilla Firefox, will enable bottom-up adoption " + "by non-experts, resulting in cost savings for private and public " + "sector users who would otherwise procure translation or operate " + "monolingually. Bergamot is a consortium coordinated by the " + "University of Edinburgh with partners Charles University in " + "Prague, the University of Sheffield, University of Tartu, and " + "Mozilla."); auto futureResults = model->translate(std::move(texts), translationRequest); From 548c8880ff024104e46673107709dd3f9d2c67f9 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 2 Feb 2021 14:39:19 +0000 Subject: [PATCH 046/442] CMake updates submodules --- CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index 0a2005dc1..2341410d7 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -19,3 +19,7 @@ option(USE_MKL "Compile with MKL support" ON) add_subdirectory(3rd_party) add_subdirectory(src) add_subdirectory(app) + +execute_process(COMMAND git submodule update --init --recursive --no-fetch + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) + From 2929077324acbb4488eac615422394e2f42218b8 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 2 Feb 2021 14:41:26 +0000 Subject: [PATCH 047/442] Reordering git submodule update before includes --- CMakeLists.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 2341410d7..ce48a9079 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -16,10 +16,11 @@ option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) option(USE_STATIC_LIBS "Link statically against non-system libs" ON) 
option(USE_MKL "Compile with MKL support" ON) +execute_process(COMMAND git submodule update --init --recursive --no-fetch + WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) + add_subdirectory(3rd_party) add_subdirectory(src) add_subdirectory(app) -execute_process(COMMAND git submodule update --init --recursive --no-fetch - WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From 9a54d2116cc0b26fcc7582c0a99c7905c2d3be66 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 8 Feb 2021 13:46:59 +0100 Subject: [PATCH 048/442] Updated marian-dev submodule - Switch to "wasm" branch of browsermt/marian-dev --- 3rd_party/marian-dev | 2 +- CMakeLists.txt | 5 ++++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index ee56e02f0..a4e50b66b 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit ee56e02f0525a4651157a07f74b44f456db14c8c +Subproject commit a4e50b66be38a94b90c46c4695d86de9932c34e8 diff --git a/CMakeLists.txt b/CMakeLists.txt index ce48a9079..45551ea85 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -14,7 +14,10 @@ set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") option(COMPILE_CUDA "Compile GPU version" OFF) option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) option(USE_STATIC_LIBS "Link statically against non-system libs" ON) -option(USE_MKL "Compile with MKL support" ON) +option(USE_MKL "Compile with MKL support" OFF) +option(COMPILE_DECODER_ONLY "Compile marian-decoder only" ON) +option(COMPILE_WITH_PTHREADS "Compile with pthreads support" OFF) +option(USE_WASM_COMPATIBLE_BLAS "Compile with a WASM compatible blas for decoder only builds" ON) execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From 47b4bae268bf98dd1fad70ce50731a5f74e09c3b Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 8 Feb 2021 14:31:12 +0100 Subject: [PATCH 049/442] Changed 
encodePreservingSource -> encodeWithByteRanges - This change happened because marian submodule changed this name - Native builds are working fine -- bergamot-translator-app output is consistent --- src/translator/textops.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/translator/textops.cpp b/src/translator/textops.cpp index 25e48f1fd..ac93421ab 100644 --- a/src/translator/textops.cpp +++ b/src/translator/textops.cpp @@ -52,7 +52,7 @@ SentenceSplitter::string2splitmode(const std::string &m) { Segment TextProcessor::tokenize(const string_view &segment, TokenRanges &tokenRanges) { - return vocabs_->front()->encodePreservingSource( + return vocabs_->front()->encodeWithByteRanges( segment, tokenRanges, /*addEOS=*/false, /*inference=*/true); } From 5683168a8d0011e7311ec62e13806b23bce52ec9 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 9 Feb 2021 15:42:02 +0100 Subject: [PATCH 050/442] Updated ssplit submodule to a different repository - Added abhi-agg/ssplit-cpp - Added its wasm branch in bergamot-translator - Native builds of bergamot-translator are successful -- Sentence splitting is NOT WORKING -- Only translation is working --- .gitmodules | 2 +- 3rd_party/ssplit-cpp | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.gitmodules b/.gitmodules index d3bbf18d6..e4feab500 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,6 +1,6 @@ [submodule "3rd_party/ssplit-cpp"] path = 3rd_party/ssplit-cpp - url = https://github.com/ugermann/ssplit-cpp + url = https://github.com/abhi-agg/ssplit-cpp [submodule "3rd_party/marian-dev"] path = 3rd_party/marian-dev url = https://github.com/browsermt/marian-dev diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index f5d022992..4f5d1348a 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit f5d022992f4a00c860eb809389748908bb85ffcf +Subproject commit 4f5d1348a3fba1a8cb70135f68470d613573f9f3 From 584700ce911de9da92489661c42a4ecc7c58d35e Mon 
Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 10 Feb 2021 11:15:16 +0100 Subject: [PATCH 051/442] Changed translate() API from non-blocking to blocking - Can be changed back to non-blocking once blocking API becomes integrable via WASM port in browser --- app/main.cpp | 4 ++-- src/AbstractTranslationModel.h | 2 +- src/translator/TranslationModel.cpp | 5 ++--- src/translator/TranslationModel.h | 2 +- 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/app/main.cpp b/app/main.cpp index ef61eb2da..2f67feb9c 100644 --- a/app/main.cpp +++ b/app/main.cpp @@ -44,10 +44,10 @@ int main(int argc, char **argv) { "Prague, the University of Sheffield, University of Tartu, and " "Mozilla."); - auto futureResults = model->translate(std::move(texts), translationRequest); + auto results = model->translate(std::move(texts), translationRequest); // Resolve the future and get the actual result - std::vector results = futureResults.get(); + //std::vector results = futureResults.get(); for (auto &result : results) { std::cout << "[original]: " << result.getOriginalText() << std::endl; diff --git a/src/AbstractTranslationModel.h b/src/AbstractTranslationModel.h index 6cb30c4a2..7562b0ad0 100644 --- a/src/AbstractTranslationModel.h +++ b/src/AbstractTranslationModel.h @@ -57,7 +57,7 @@ class AbstractTranslationModel { * entry of texts list will be moved to its corresponding TranslationResult * object). 
*/ - virtual std::future> + virtual std::vector translate(std::vector &&texts, TranslationRequest request) = 0; /* Check if the model can provide alignment information b/w original and diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index f501678cf..3d5ae2380 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -55,7 +55,7 @@ TranslationModel::TranslationModel(const std::string &config) TranslationModel::~TranslationModel() {} -std::future> +std::vector TranslationModel::translate(std::vector &&texts, TranslationRequest request) { // Implementing a non-async version first. Unpleasant, but should work. @@ -84,8 +84,7 @@ TranslationModel::translate(std::vector &&texts, std::move(sentenceMappings)); } - promise.set_value(std::move(translationResults)); - return future; + return translationResults; } bool TranslationModel::isAlignmentSupported() const { return false; } diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index c922538a3..d468e2fb6 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -54,7 +54,7 @@ class TranslationModel : public AbstractTranslationModel { * entry of texts list will be moved to its corresponding TranslationResult * object). 
*/ - std::future> + std::vector translate(std::vector &&texts, TranslationRequest request) override; From a2d32693448fbbc582efc0da1e05f6731e548845 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Wed, 10 Feb 2021 11:27:16 +0100 Subject: [PATCH 052/442] Updated ssplit submodule --- 3rd_party/ssplit-cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index 4f5d1348a..16864967b 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit 4f5d1348a3fba1a8cb70135f68470d613573f9f3 +Subproject commit 16864967b7313e76e3b107d11ec39d8d5cedff1e From 9747d9ba83e2eb6f7cf5edfee37a90592d2c220b Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 15:34:27 +0100 Subject: [PATCH 053/442] Add cmake option to compile project on WASM - Set cmake option COMPILE_WASM to ON to compile the project on WASM --- CMakeLists.txt | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 45551ea85..b662a7880 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -8,7 +8,9 @@ project(bergamot_translator CXX C) set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED ON) -set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") + +# Project specific cmake options +option(COMPILE_WASM "Compile for WASM" OFF) # Custom CMake options to compile marian (a 3rd party submodule) for this project option(COMPILE_CUDA "Compile GPU version" OFF) @@ -22,8 +24,19 @@ option(USE_WASM_COMPATIBLE_BLAS "Compile with a WASM compatible blas for decoder execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) -add_subdirectory(3rd_party) -add_subdirectory(src) -add_subdirectory(app) +if(NOT COMPILE_WASM) + # Set BUILD_ARCH to native only while compiling for non wasm platform + set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") +endif() 
+if(COMPILE_WASM) + add_compile_options(-pthread -O3 -g2 -fPIC -mssse3 -msimd128) + add_compile_options("SHELL:-s WASM=1" "SHELL:-s ASSERTIONS=1" "SHELL:-s DISABLE_EXCEPTION_CATCHING=0" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") + add_compile_options(-Wno-error=pthreads-mem-growth) +endif(COMPILE_WASM) +add_subdirectory(3rd_party) +add_subdirectory(src) +if(NOT COMPILE_WASM) + add_subdirectory(app) +endif() From b73d4f4cc275277b35545af2a0d35ea7953166d4 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 15:37:38 +0100 Subject: [PATCH 054/442] Set cmake option to compile marian library only - Set COMPILE_LIBRARY_ONLY to ON for marian library --- CMakeLists.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index b662a7880..daea56074 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -20,6 +20,7 @@ option(USE_MKL "Compile with MKL support" OFF) option(COMPILE_DECODER_ONLY "Compile marian-decoder only" ON) option(COMPILE_WITH_PTHREADS "Compile with pthreads support" OFF) option(USE_WASM_COMPATIBLE_BLAS "Compile with a WASM compatible blas for decoder only builds" ON) +SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From 838547e4d582089d6222aadf14e77732d8955d17 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 15:42:18 +0100 Subject: [PATCH 055/442] Set cmake options of marian properly for this project --- CMakeLists.txt | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index daea56074..09ac2fce3 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -12,14 +12,14 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON) # Project specific cmake options option(COMPILE_WASM "Compile for WASM" OFF) -# Custom CMake options to 
compile marian (a 3rd party submodule) for this project -option(COMPILE_CUDA "Compile GPU version" OFF) -option(USE_SENTENCEPIECE "Download and compile SentencePiece" ON) -option(USE_STATIC_LIBS "Link statically against non-system libs" ON) -option(USE_MKL "Compile with MKL support" OFF) -option(COMPILE_DECODER_ONLY "Compile marian-decoder only" ON) -option(COMPILE_WITH_PTHREADS "Compile with pthreads support" OFF) -option(USE_WASM_COMPATIBLE_BLAS "Compile with a WASM compatible blas for decoder only builds" ON) +# Set marian (3rd party submodule) cmake options to compile for this project +SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version") +SET(USE_SENTENCEPIECE ON CACHE BOOL "Download and compile SentencePiece") +SET(USE_STATIC_LIBS ON CACHE BOOL "Link statically against non-system libs") +SET(USE_MKL OFF CACHE BOOL "Compile with MKL support") +SET(COMPILE_DECODER_ONLY ON CACHE BOOL "Compile marian-decoder only") +SET(COMPILE_WITH_PTHREADS OFF CACHE BOOL "Compile with pthreads support") +SET(USE_WASM_COMPATIBLE_BLAS ON CACHE BOOL "Compile with a WASM compatible blas for decoder only builds") SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") execute_process(COMMAND git submodule update --init --recursive --no-fetch From 9b896507e3860b5c3cf0e452659d336fe43958e1 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 15:53:38 +0100 Subject: [PATCH 056/442] cmake compile option changes - Make native builds successful with marian decoder - COMPILE_DECODER_ONLY flag requires importing some compile definitions from marian --- src/translator/CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index b6fcf69fc..eab04abf3 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -11,6 +11,10 @@ add_library(bergamot-translator STATIC batcher.cpp translation_result.cpp ) +if (COMPILE_DECODER_ONLY) + # A 
dirty hack because of marian's bad cmake practices + target_compile_definitions(bergamot-translator PUBLIC DECODER_ONLY) +endif() target_link_libraries(bergamot-translator marian ssplit) From 79c445ae3a9c63fa68cd7687e5bdae7b76dc72b1 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 15:57:26 +0100 Subject: [PATCH 057/442] cmake compile option changes for wasm builds - Make WASM builds successful with marian decoder - Setting COMPILE_WASM to ON requires importing some compile definitions from marian --- src/translator/CMakeLists.txt | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index eab04abf3..b8ed19635 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -16,6 +16,11 @@ if (COMPILE_DECODER_ONLY) target_compile_definitions(bergamot-translator PUBLIC DECODER_ONLY) endif() +if(COMPILE_WASM) + # A dirty hack because of marian's bad cmake practices + target_compile_definitions(bergamot-translator PUBLIC USE_SSE2 WASM) +endif(COMPILE_WASM) + target_link_libraries(bergamot-translator marian ssplit) target_include_directories(bergamot-translator From a06530e92b6d16527487c8fa0ead4ae04f0ddbb5 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 16:14:03 +0100 Subject: [PATCH 058/442] Fixed a bug in TranslationModel class - Using bergamot-translator as a library fails at run time because necessary parser options are not set --- src/translator/TranslationModel.cpp | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index 3d5ae2380..fd2db1d2d 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -15,6 +15,8 @@ // All local project includes #include "TranslationModel.h" #include "translator/service.h" +#include "translator/parser.h" + std::shared_ptr parseOptions(const std::string &config) { marian::Options 
options; @@ -34,7 +36,7 @@ std::shared_ptr parseOptions(const std::string &config) { // Error: Aborted from void unhandledException() in // 3rd_party/marian-dev/src/common/logging.cpp:113 - marian::ConfigParser configParser(marian::cli::mode::translation); + marian::ConfigParser configParser = marian::bergamot::createConfigParser(); const YAML::Node &defaultConfig = configParser.getConfig(); options.merge(defaultConfig); From 23a952782479401c4ac31bab6eccccb546c1f4ee Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 16:38:36 +0100 Subject: [PATCH 059/442] Source code changes to compile the project without threads - Set COMPILE_THREAD_VARIANT cmake option to ON to compile multithreaded variant of the project --- CMakeLists.txt | 4 ++++ src/translator/CMakeLists.txt | 4 ++++ src/translator/batch_translator.cpp | 16 +++++++++++++++- src/translator/batch_translator.h | 8 ++++++++ src/translator/pcqueue.h | 29 +++++++++++++++++++++++++++++ src/translator/service.cpp | 3 +++ 6 files changed, 63 insertions(+), 1 deletion(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 09ac2fce3..7327e1449 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -11,6 +11,7 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON) # Project specific cmake options option(COMPILE_WASM "Compile for WASM" OFF) +option(COMPILE_THREAD_VARIANT "Compile with thread support" OFF) # Set marian (3rd party submodule) cmake options to compile for this project SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version") @@ -41,3 +42,6 @@ add_subdirectory(src) if(NOT COMPILE_WASM) add_subdirectory(app) endif() +if(COMPILE_WASM) + add_subdirectory(app) +endif(COMPILE_WASM) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index b8ed19635..71bdd97f6 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -21,6 +21,10 @@ if(COMPILE_WASM) target_compile_definitions(bergamot-translator PUBLIC USE_SSE2 WASM) endif(COMPILE_WASM) +if (COMPILE_THREAD_VARIANT) + 
target_compile_definitions(bergamot-translator PRIVATE WITH_PTHREADS) +endif() + target_link_libraries(bergamot-translator marian ssplit) target_include_directories(bergamot-translator diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 6380a00cc..6dc399321 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -14,7 +14,11 @@ BatchTranslator::BatchTranslator(DeviceId const device, Ptr options) : device_(device), options_(options), pcqueue_(&pcqueue), vocabs_(&vocabs) { +#ifdef WITH_PTHREADS thread_ = std::thread([this] { this->mainloop(); }); +#else + this->initGraph(); +#endif } void BatchTranslator::initGraph() { @@ -100,12 +104,16 @@ void BatchTranslator::translate(RequestSentences &requestSentences, } void BatchTranslator::mainloop() { +#ifdef WITH_PTHREADS initGraph(); +#endif PCItem pcitem; Histories histories; +#ifdef WITH_PTHREADS while (true) { +#endif pcqueue_->ConsumeSwap(pcitem); if (pcitem.isPoison()) { return; @@ -115,10 +123,16 @@ void BatchTranslator::mainloop() { pcitem.sentences[i].completeSentence(histories[i]); } } +#ifdef WITH_PTHREADS } +#endif } -void BatchTranslator::join() { thread_.join(); } +void BatchTranslator::join() { +#ifdef WITH_PTHREADS + thread_.join(); +#endif +} } // namespace bergamot } // namespace marian diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index 069155efb..3f1d2e4bd 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -29,10 +29,16 @@ class BatchTranslator { // convenience function for logging. 
TODO(jerin) std::string _identifier() { return "worker" + std::to_string(device_.no); } +#ifndef WITH_PTHREADS + void mainloop(); +#endif + private: void initGraph(); void translate(RequestSentences &requestSentences, Histories &histories); +#ifdef WITH_PTHREADS void mainloop(); +#endif Ptr options_; @@ -43,7 +49,9 @@ class BatchTranslator { Ptr slgen_; PCQueue *pcqueue_; +#ifdef WITH_PTHREADS std::thread thread_; +#endif }; } // namespace bergamot } // namespace marian diff --git a/src/translator/pcqueue.h b/src/translator/pcqueue.h index f0b354145..79d6b75e0 100644 --- a/src/translator/pcqueue.h +++ b/src/translator/pcqueue.h @@ -9,6 +9,7 @@ #include #include +#ifdef WITH_PTHREADS #ifdef __APPLE__ #include #include @@ -19,6 +20,7 @@ #else #include #endif +#endif // WITH_PTHREADS #if __GNUC__ >= 3 #define UTIL_UNLIKELY(x) __builtin_expect(!!(x), 0) @@ -29,6 +31,7 @@ namespace marian { namespace bergamot { +#ifdef WITH_PTHREADS /* OS X Maverick and Boost interprocess were doing "Function not implemented." * So this is my own wrapper around the mach kernel APIs. */ @@ -114,6 +117,20 @@ inline void WaitSemaphore(Semaphore &on) { } #endif // Apple +#else // WITH_PTHREADS +// A dummy Semaphore class that does nothing +class Semaphore { +public: + explicit Semaphore(unsigned int value) : count(value) {} + ~Semaphore() {} + void wait() {} + void post() {} +private: + unsigned int count; +}; + +inline void WaitSemaphore(Semaphore &semaphore) { semaphore.wait(); } +#endif // WITH_PTHREADS /** * Producer consumer queue safe for multiple producers and multiple consumers. @@ -134,7 +151,9 @@ template class PCQueue { void Produce(const T &val) { WaitSemaphore(empty_); { + #ifdef WITH_PTHREADS std::lock_guard produce_lock(produce_at_mutex_); + #endif try { *produce_at_ = val; } catch (...) 
{ @@ -151,7 +170,9 @@ template class PCQueue { void ProduceSwap(T &val) { WaitSemaphore(empty_); { + #ifdef WITH_PTHREADS std::lock_guard produce_lock(produce_at_mutex_); + #endif try { std::swap(*produce_at_, val); } catch (...) { @@ -168,7 +189,9 @@ template class PCQueue { T &Consume(T &out) { WaitSemaphore(used_); { + #ifdef WITH_PTHREADS std::lock_guard consume_lock(consume_at_mutex_); + #endif try { out = *consume_at_; } catch (...) { @@ -186,7 +209,9 @@ template class PCQueue { T &ConsumeSwap(T &out) { WaitSemaphore(used_); { + #ifdef WITH_PTHREADS std::lock_guard consume_lock(consume_at_mutex_); + #endif try { std::swap(out, *consume_at_); } catch (...) { @@ -220,11 +245,15 @@ template class PCQueue { // Index for next write in storage_. T *produce_at_; +#ifdef WITH_PTHREADS std::mutex produce_at_mutex_; +#endif // Index for next read from storage_. T *consume_at_; +#ifdef WITH_PTHREADS std::mutex consume_at_mutex_; +#endif }; template struct UnboundedPage { diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 4a5af301c..f61ad4731 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -73,6 +73,9 @@ std::future Service::translate(std::string &&input) { } } while (numSentences > 0); +#ifndef WITH_PTHREADS + workers_[0].mainloop(); +#endif return future; } From 7b80003a5fd60d5e28beee74d8f45590390581f5 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 16:59:07 +0100 Subject: [PATCH 060/442] Added code to generate proper JS bindings of translator - COMPILE_WASM cmake option sets WASM_BINDINGS compile definition that enables code for generating proper JS bindings --- src/TranslationResult.h | 22 +++++++++++++++++++++- src/translator/CMakeLists.txt | 2 ++ 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/src/TranslationResult.h b/src/TranslationResult.h index d743ff5ff..b4867af65 100644 --- a/src/TranslationResult.h +++ b/src/TranslationResult.h @@ -20,7 +20,11 @@ class TranslationResult 
{ public: typedef std::vector> SentenceMappings; - +#ifdef WASM_BINDINGS + TranslationResult(const std::string &original, const std::string &translation) + : originalText(original), translatedText(translation), + sentenceMappings() {} +#endif TranslationResult(const std::string &original, const std::string &translation, SentenceMappings &sentenceMappings) : originalText(original), translatedText(translation), @@ -31,13 +35,29 @@ class TranslationResult { translatedText(std::move(other.translatedText)), sentenceMappings(std::move(other.sentenceMappings)) {} +#ifdef WASM_BINDINGS + TranslationResult(const TranslationResult &other) + : originalText(other.originalText), + translatedText(other.translatedText), + sentenceMappings(other.sentenceMappings) {} +#endif + TranslationResult(std::string &&original, std::string &&translation, SentenceMappings &&sentenceMappings) : originalText(std::move(original)), translatedText(std::move(translation)), sentenceMappings(std::move(sentenceMappings)) {} +#ifndef WASM_BINDINGS TranslationResult &operator=(const TranslationResult &) = delete; +#else + TranslationResult &operator=(const TranslationResult &result) { + originalText = result.originalText; + translatedText = result.translatedText; + sentenceMappings = result.sentenceMappings; + return *this; + } +#endif /* Return the original text. 
*/ const std::string &getOriginalText() const { return originalText; } diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 71bdd97f6..ba2c2e033 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -19,6 +19,8 @@ endif() if(COMPILE_WASM) # A dirty hack because of marian's bad cmake practices target_compile_definitions(bergamot-translator PUBLIC USE_SSE2 WASM) + # Enable code that is required for generating JS bindings + target_compile_definitions(bergamot-translator PRIVATE WASM_BINDINGS) endif(COMPILE_WASM) if (COMPILE_THREAD_VARIANT) From 74b06d863ebbd0b0b59dfd7be1e541a338c8a3f8 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 19:09:30 +0100 Subject: [PATCH 061/442] Add wasm folder to compile JS bindings --- CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 7327e1449..4b6e2241b 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -43,5 +43,5 @@ if(NOT COMPILE_WASM) add_subdirectory(app) endif() if(COMPILE_WASM) - add_subdirectory(app) + add_subdirectory(wasm) endif(COMPILE_WASM) From de501e8f963b8fed6cc6f1799d55f2e20b325d3e Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 20:48:29 +0100 Subject: [PATCH 062/442] Added JS binding files and cmake infrastructure to build them - Added "wasm" folder - Contains README file as well --- CMakeLists.txt | 1 + wasm/CMakeLists.txt | 27 ++++++++++ wasm/README.md | 52 +++++++++++++++++++ wasm/bergamot.html | 54 ++++++++++++++++++++ wasm/bindings/TranslationModelBindings.cpp | 23 +++++++++ wasm/bindings/TranslationRequestBindings.cpp | 17 ++++++ wasm/bindings/TranslationResultBindings.cpp | 20 ++++++++ 7 files changed, 194 insertions(+) create mode 100644 wasm/CMakeLists.txt create mode 100644 wasm/README.md create mode 100644 wasm/bergamot.html create mode 100644 wasm/bindings/TranslationModelBindings.cpp create mode 100644 
wasm/bindings/TranslationRequestBindings.cpp create mode 100644 wasm/bindings/TranslationResultBindings.cpp diff --git a/CMakeLists.txt b/CMakeLists.txt index 4b6e2241b..505d78549 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -12,6 +12,7 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON) # Project specific cmake options option(COMPILE_WASM "Compile for WASM" OFF) option(COMPILE_THREAD_VARIANT "Compile with thread support" OFF) +option(PACKAGE_DIR "Directory including all the files to be packaged (pre-loaded) in wasm builds" "") # Set marian (3rd party submodule) cmake options to compile for this project SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version") diff --git a/wasm/CMakeLists.txt b/wasm/CMakeLists.txt new file mode 100644 index 000000000..9ede6a612 --- /dev/null +++ b/wasm/CMakeLists.txt @@ -0,0 +1,27 @@ +add_executable(bergamot-translator-worker + bindings/TranslationModelBindings.cpp + bindings/TranslationRequestBindings.cpp + bindings/TranslationResultBindings.cpp +) + +# This header inclusion needs to go away later as path to public headers of bergamot +# translator should be directly available from "bergamot-translator" target +target_include_directories(bergamot-translator-worker + PRIVATE ${CMAKE_SOURCE_DIR}/src/translator + PRIVATE ${CMAKE_SOURCE_DIR} +) +# This compile definition is required for generating binding code properly +target_compile_definitions(bergamot-translator-worker PRIVATE WASM_BINDINGS) + +set(LINKER_FLAGS "--bind -s ASSERTIONS=1 -s DISABLE_EXCEPTION_CATCHING=0 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1") +if (NOT PACKAGE_DIR STREQUAL "") + set(LINKER_FLAGS "${LINKER_FLAGS} --preload-file ${PACKAGE_DIR}@/") +endif() + +set_target_properties(bergamot-translator-worker PROPERTIES + SUFFIX ".js" + LINK_FLAGS ${LINKER_FLAGS} + ) +#target_link_options(bergamot-translator-worker --preload-file ${PACKAGE_DIR}@/) + +target_link_libraries(bergamot-translator-worker bergamot-translator) diff --git a/wasm/README.md b/wasm/README.md new file 
mode 100644 index 000000000..83d4738cd --- /dev/null +++ b/wasm/README.md @@ -0,0 +1,52 @@ +## Using Bergamot Translator in JavaScript +The example file `bergamot.html` in this folder demonstrates how to use the bergamot translator in JavaScript via a `<script>` tag. + + diff --git a/wasm/bindings/TranslationModelBindings.cpp b/wasm/bindings/TranslationModelBindings.cpp new file mode 100644 index 000000000..245416c6a --- /dev/null +++ b/wasm/bindings/TranslationModelBindings.cpp @@ -0,0 +1,23 @@ +/* + * TranslationModelBindings.cpp + * + * Bindings for TranslationModel class + */ + +#include <emscripten/bind.h> + +#include "TranslationModel.h" + +using namespace emscripten; + +// Binding code +EMSCRIPTEN_BINDINGS(translation_model) { + class_<TranslationModel>("TranslationModel") + .constructor<std::string>() + .function("translate", &TranslationModel::translate) + .function("isAlignmentSupported", &TranslationModel::isAlignmentSupported) + ; + + register_vector<std::string>("VectorString"); + register_vector<TranslationResult>("VectorTranslationResult"); +} diff --git a/wasm/bindings/TranslationRequestBindings.cpp b/wasm/bindings/TranslationRequestBindings.cpp new file mode 100644 index 000000000..bb5ec9884 --- /dev/null +++ b/wasm/bindings/TranslationRequestBindings.cpp @@ -0,0 +1,17 @@ +/* + * Bindings for TranslationRequest class + * + */ + +#include <emscripten/bind.h> + +#include "TranslationRequest.h" + +using namespace emscripten; + +// Binding code +EMSCRIPTEN_BINDINGS(translation_request) { + class_<TranslationRequest>("TranslationRequest") + .constructor<>() + ; +} diff --git a/wasm/bindings/TranslationResultBindings.cpp b/wasm/bindings/TranslationResultBindings.cpp new file mode 100644 index 000000000..a3713a130 --- /dev/null +++ b/wasm/bindings/TranslationResultBindings.cpp @@ -0,0 +1,20 @@ +/* + * Bindings for TranslationResult class + * + */ + +#include <emscripten/bind.h> +#include <vector> + +#include "TranslationResult.h" + +using namespace emscripten; + +// Binding code +EMSCRIPTEN_BINDINGS(translation_result) { + class_<TranslationResult>("TranslationResult") + .constructor<std::string, std::string>() + .function("getOriginalText", 
&TranslationResult::getOriginalText) + .function("getTranslatedText", &TranslationResult::getTranslatedText) + ; +} From e12647076c69c4e0355b598b16127d4112f662bd Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 23:27:16 +0100 Subject: [PATCH 063/442] Updated README with wasm build and use instructions --- README.md | 84 ++++++++++++++++++++++++++++++------------------------- 1 file changed, 46 insertions(+), 38 deletions(-) diff --git a/README.md b/README.md index 52f60b287..e1ad9c37a 100644 --- a/README.md +++ b/README.md @@ -3,58 +3,66 @@ Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. ## Build Instructions -``` -$ git clone https://github.com/browsermt/bergamot-translator -$ cd bergamot-translator -$ mkdir build -$ cd build -$ cmake ../ -$ make -j +### Build Natively + +```bash +git clone https://github.com/browsermt/bergamot-translator +cd bergamot-translator +mkdir build +cd build +cmake ../ +make -j ``` -## Usage +### Build WASM -### Bergamot Translator +To compile WASM, first download and install Emscripten using the following instructions: -The build will generate the library that can be linked to any project. All the public header files are specified in `src` folder. +1. Get the latest sdk: `git clone https://github.com/emscripten-core/emsdk.git` +2. Enter the cloned directory: `cd emsdk` +3. Install the latest sdk tools: `./emsdk install latest` +4. Activate the latest sdk tools: `./emsdk activate latest` +5. Activate path variables: `source ./emsdk_env.sh` -### `service-cli` +After the successful installation of Emscripten, perform these steps: -An executable `service-cli` is generated by the build in the `app` folder and -provides command line interface to the underlying translator. 
The models -required to run the command-line are available at -[data.statmt.org/bergamot/models/](http://data.statmt.org/bergamot/models/). -The following example uses an English to German tiny11 student model, available -at: +```bash +git clone https://github.com/browsermt/bergamot-translator +cd bergamot-translator +mkdir build-wasm +cd build-wasm +emcmake cmake -DCOMPILE_WASM=on ../ +emmake make -j ``` -* [data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz](http://data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz) +It should generate the artefacts (.js and .wasm files) in the `wasm` folder inside the build directory ("build-wasm" in this case). +The build also allows packaging files into the wasm binary (i.e. preloading in Emscripten’s virtual file system) using the cmake +option `PACKAGE_DIR`. The compile command below packages all the files in the PATH directory into the wasm binary. ```bash -MODEL_DIR=... # path to where the model-files are. -ARGS=( - -m $MODEL_DIR/model.intgemm.alphas.bin # Path to model file. - --vocabs - $MODEL_DIR/vocab.deen.spm # source-vocabulary - $MODEL_DIR/vocab.deen.spm # target-vocabulary +emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR= ../ +``` +Files packaged this way are preloaded in the root of the virtual file system. - # The following increases speed through one-best-decoding, shortlist and quantization. - --beam-size 1 --skip-cost --shortlist $MODEL_DIR/lex.s2t.gz 50 50 --int8shiftAlphaAll - # Number of CPU threads (workers to launch). Parallelizes over cores and improves speed. - --cpu-threads 4 +After Editing Files: - # Hyperparameters of how many tokens to be accounted for in a batch and maximum tokens in a sentence. - --max-input-sentence-tokens 1024 --max-input-tokens 1024 +```bash +emmake make -j +``` + +After Adding/Removing Files: + +```bash +emcmake cmake -DCOMPILE_WASM=on ../ +emmake make -j +``` - # Three modes are supported # - sentence: One sentence per line # - paragraph: One paragraph per line. 
- # - wrapped text: Paragraphs are separated by empty line. +### Using Native version - --ssplit-mode paragraph +The builds generate a library that can be integrated into any project. All the public header files are specified in the `src` folder. A short example of how to use the APIs is provided in the `app/main.cpp` file -) +### Using WASM version -./app/service-cli "${ARGS[@]}" < path-to-input-file -``` +Please follow the `README` inside the `wasm` folder of this repository that demonstrates how to use the translator in JavaScript. From ff95e37f89e2ed67e4a6420e6a3415bb8e794994 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Thu, 11 Feb 2021 23:51:45 +0100 Subject: [PATCH 064/442] Improved cmake option PACKAGE_DIR --- CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 505d78549..10256c218 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -12,7 +12,7 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON) # Project specific cmake options option(COMPILE_WASM "Compile for WASM" OFF) option(COMPILE_THREAD_VARIANT "Compile with thread support" OFF) -option(PACKAGE_DIR "Directory including all the files to be packaged (pre-loaded) in wasm builds" "") +SET(PACKAGE_DIR "" CACHE STRING "Directory including all the files to be packaged (pre-loaded) in wasm builds") From 28dcf55b417549f1b5ba7ec739e416166ac93591 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Fri, 12 Feb 2021 11:35:47 +0100 Subject: [PATCH 065/442] Improved cmake to use wasm compilation flags across project --- 3rd_party/CMakeLists.txt | 6 ++++++ CMakeLists.txt | 6 +++--- src/translator/CMakeLists.txt | 1 + wasm/CMakeLists.txt | 2 +- 4 files changed, 11 insertions(+), 4 deletions(-) diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt index 644ac52de..74ce906dd 100644 --- a/3rd_party/CMakeLists.txt +++ 
b/3rd_party/CMakeLists.txt @@ -1,4 +1,10 @@ add_subdirectory(marian-dev) + +if(COMPILE_WASM) + # This is a bad way of adding compilation flags. Will be improved soon. + add_compile_options(${WASM_COMPILE_FLAGS}) +endif(COMPILE_WASM) + add_subdirectory(ssplit-cpp) # Add include directories for 3rd party targets to be able to use it anywhere in the diff --git a/CMakeLists.txt b/CMakeLists.txt index 10256c218..677963f12 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -33,9 +33,9 @@ if(NOT COMPILE_WASM) endif() if(COMPILE_WASM) - add_compile_options(-pthread -O3 -g2 -fPIC -mssse3 -msimd128) - add_compile_options("SHELL:-s WASM=1" "SHELL:-s ASSERTIONS=1" "SHELL:-s DISABLE_EXCEPTION_CATCHING=0" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") - add_compile_options(-Wno-error=pthreads-mem-growth) + list(APPEND WASM_COMPILE_FLAGS -pthread -O3 -g2 -fPIC -mssse3 -msimd128) + list(APPEND WASM_COMPILE_FLAGS "SHELL:-s WASM=1" "SHELL:-s ASSERTIONS=1" "SHELL:-s DISABLE_EXCEPTION_CATCHING=0" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") + list(APPEND WASM_COMPILE_FLAGS -Wno-error=pthreads-mem-growth) endif(COMPILE_WASM) add_subdirectory(3rd_party) diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index ba2c2e033..1a664b3ef 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -21,6 +21,7 @@ if(COMPILE_WASM) target_compile_definitions(bergamot-translator PUBLIC USE_SSE2 WASM) # Enable code that is required for generating JS bindings target_compile_definitions(bergamot-translator PRIVATE WASM_BINDINGS) + target_compile_options(bergamot-translator PRIVATE ${WASM_COMPILE_FLAGS}) endif(COMPILE_WASM) if (COMPILE_THREAD_VARIANT) diff --git a/wasm/CMakeLists.txt b/wasm/CMakeLists.txt index 9ede6a612..40b08bf6a 100644 --- a/wasm/CMakeLists.txt +++ b/wasm/CMakeLists.txt @@ -12,6 +12,7 @@ 
target_include_directories(bergamot-translator-worker ) # This compile definition is required for generating binding code properly target_compile_definitions(bergamot-translator-worker PRIVATE WASM_BINDINGS) +target_compile_options(bergamot-translator-worker PRIVATE ${WASM_COMPILE_FLAGS}) set(LINKER_FLAGS "--bind -s ASSERTIONS=1 -s DISABLE_EXCEPTION_CATCHING=0 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1") if (NOT PACKAGE_DIR STREQUAL "") @@ -22,6 +23,5 @@ set_target_properties(bergamot-translator-worker PROPERTIES SUFFIX ".js" LINK_FLAGS ${LINKER_FLAGS} ) -#target_link_options(bergamot-translator-worker --preload-file ${PACKAGE_DIR}@/) target_link_libraries(bergamot-translator-worker bergamot-translator) From 3b7673bf15e9877f3cfc15c17a366db8a494a4d5 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Fri, 12 Feb 2021 14:38:16 +0100 Subject: [PATCH 066/442] Updated marian-dev submodule - This fixes the issue of sentencepiece not being able to checkout properly --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index a4e50b66b..29ecba1cb 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit a4e50b66be38a94b90c46c4695d86de9932c34e8 +Subproject commit 29ecba1cb1b8ea26ae582d3851e214769b89e566 From 38e8b3cd6d5a2db561ce201c3e69fb79c676389c Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Fri, 5 Feb 2021 12:55:57 +0000 Subject: [PATCH 067/442] Updates: marian-dev, ssplit for marian-decoder-new Updates marian-dev and ssplit submodules to point to the upstream commits which implements the following: - marian-dev: encodeWithByteRanges(...) to get source token byte-ranges - ssplit: Has a trivial sentencesplitter functionality implemented, and now is faster to benchmark with marian-decoder. This enables a marian-decoder replacement written through ssplit in this source to be benchmarked constantly with existing marian-decoder. 
Nits: Removes logging introduced for multiple workers, and respective log statements. --- .gitignore | 14 +++++ 3rd_party/marian-dev | 2 +- 3rd_party/ssplit-cpp | 2 +- app/CMakeLists.txt | 3 + app/main-mts.cpp | 13 ---- app/marian-decoder-new.cpp | 63 +++++++++++++++++++ src/translator/CMakeLists.txt | 4 +- src/translator/batch_translator.cpp | 1 - src/translator/batcher.cpp | 1 - src/translator/sanelogging.h | 44 ------------- src/translator/sentence_splitter.cpp | 52 +++++++++++++++ src/translator/sentence_splitter.h | 31 +++++++++ src/translator/service.cpp | 1 - src/translator/service.h | 2 +- .../{textops.cpp => text_processor.cpp} | 61 +++--------------- .../{textops.h => text_processor.h} | 37 +++-------- 16 files changed, 186 insertions(+), 145 deletions(-) create mode 100644 app/marian-decoder-new.cpp delete mode 100644 src/translator/sanelogging.h create mode 100644 src/translator/sentence_splitter.cpp create mode 100644 src/translator/sentence_splitter.h rename src/translator/{textops.cpp => text_processor.cpp} (52%) rename src/translator/{textops.h => text_processor.h} (56%) diff --git a/.gitignore b/.gitignore index e63aee1e1..54493b911 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,17 @@ *.swp *.swo +# CMake +CMakeLists.txt.user +CMakeCache.txt +CMakeFiles +CMakeScripts +Testing +Makefile +cmake_install.cmake +install_manifest.txt +compile_commands.json +CTestTestfile.cmake +_deps + + diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index ee56e02f0..2f6528045 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit ee56e02f0525a4651157a07f74b44f456db14c8c +Subproject commit 2f65280459737c37c270e4ad0b6d41de215d11e0 diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index f5d022992..01e71b496 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit f5d022992f4a00c860eb809389748908bb85ffcf +Subproject commit 01e71b4964fdc351f932a7a23cab4cb80b9698e8 diff --git 
a/app/CMakeLists.txt b/app/CMakeLists.txt index 6e71e9e27..24bd0b43e 100644 --- a/app/CMakeLists.txt +++ b/app/CMakeLists.txt @@ -3,3 +3,6 @@ target_link_libraries(bergamot-translator-app PRIVATE bergamot-translator) add_executable(service-cli main-mts.cpp) target_link_libraries(service-cli PRIVATE bergamot-translator) + +add_executable(marian-decoder-new marian-decoder-new.cpp) +target_link_libraries(marian-decoder-new PRIVATE bergamot-translator) diff --git a/app/main-mts.cpp b/app/main-mts.cpp index 44a019a0d..c94ff306c 100644 --- a/app/main-mts.cpp +++ b/app/main-mts.cpp @@ -26,21 +26,8 @@ int main(int argc, char *argv[]) { service.translate(std::move(input)); translation_result_future.wait(); const TranslationResult &translation_result = translation_result_future.get(); - - std::cout << "service-cli [Source text]: "; - std::cout << translation_result.getOriginalText() << std::endl; - - std::cout << "service-cli [Translated text]: "; std::cout << translation_result.getTranslatedText() << std::endl; - // Obtain sentenceMappings and print them as Proof of Concept. - const TranslationResult::SentenceMappings &sentenceMappings = - translation_result.getSentenceMappings(); - for (auto &p : sentenceMappings) { - std::cout << "service-cli [src] " << p.first << "\n"; - std::cout << "service-cli [tgt] " << p.second << "\n"; - } - // Stop Service. 
service.stop(); return 0; diff --git a/app/marian-decoder-new.cpp b/app/marian-decoder-new.cpp new file mode 100644 index 000000000..62b1bb4b3 --- /dev/null +++ b/app/marian-decoder-new.cpp @@ -0,0 +1,63 @@ +#include +#include +#include +#include + +#include "common/definitions.h" +#include "common/timer.h" +#include "common/utils.h" +#include "marian.h" +#include "translator/history.h" +#include "translator/output_collector.h" +#include "translator/output_printer.h" +#include "translator/parser.h" +#include "translator/service.h" +#include "translator/translation_result.h" + +void marian_decoder_minimal(const marian::Histories &histories, + marian::Ptr<marian::Vocab const> targetVocab, + marian::Ptr<marian::Options> options) { + + bool doNbest = options->get<bool>("n-best"); + auto collector = + marian::New<marian::OutputCollector>(options->get<std::string>("output")); + + // There is a dependency of vocabs here. + auto printer = marian::New<marian::OutputPrinter>(options, targetVocab); + if (options->get<bool>("quiet-translation")) + collector->setPrintingStrategy(marian::New<marian::QuietPrinting>()); + + for (auto &history : histories) { + std::stringstream best1; + std::stringstream bestn; + printer->print(history, best1, bestn); + collector->Write((long)history->getLineNum(), best1.str(), bestn.str(), + doNbest); + } +} + +int main(int argc, char *argv[]) { + auto cp = marian::bergamot::createConfigParser(); + auto options = cp.parseOptions(argc, argv, true); + marian::timer::Timer decoderTimer; + + marian::bergamot::Service service(options); + // Read a large input text blob from stdin + std::ostringstream std_input; + std_input << std::cin.rdbuf(); + std::string input = std_input.str(); + using marian::bergamot::TranslationResult; + + // Wait on future until TranslationResult is complete + std::future<TranslationResult> translation_result_future = + service.translate(std::move(input)); + translation_result_future.wait(); + const TranslationResult &translation_result = translation_result_future.get(); + + marian_decoder_minimal(translation_result.getHistories(), + service.targetVocab(), options); + + 
LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); + service.stop(); + return 0; +} diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index b6fcf69fc..16c3db962 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -3,7 +3,8 @@ add_library(bergamot-translator STATIC TranslationModel.cpp # Following files added from browsermt/mts@nuke - textops.cpp + text_processor.cpp + sentence_splitter.cpp batch_translator.cpp multifactor_priority.cpp request.cpp @@ -18,3 +19,4 @@ target_include_directories(bergamot-translator PRIVATE ${CMAKE_SOURCE_DIR} PUBLIC ${CMAKE_SOURCE_DIR}/src) + diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 6380a00cc..860255cd4 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -2,7 +2,6 @@ #include "common/logging.h" #include "data/corpus.h" #include "data/text_input.h" -#include "sanelogging.h" #include "translator/beam_search.h" namespace marian { diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 22ee46d2a..2fa4eaf09 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -1,6 +1,5 @@ #include "batcher.h" #include "common/logging.h" -#include "sanelogging.h" #include namespace marian { diff --git a/src/translator/sanelogging.h b/src/translator/sanelogging.h deleted file mode 100644 index 21f70dda8..000000000 --- a/src/translator/sanelogging.h +++ /dev/null @@ -1,44 +0,0 @@ -#ifndef SRC_BERGAMOT_SANELOGGING_H_ -#define SRC_BERGAMOT_SANELOGGING_H_ - -#include "spdlog/spdlog.h" -#include - -namespace marian { - -#define PLOG(worker, level, ...) -#define _PLOG(worker, level, ...) checkedPLog(worker, #level, __VA_ARGS__) - -template -void checkedPLog(std::string logger, std::string level, Args... 
args) { - Logger log = spdlog::get(logger); - if (!log) { - try { - log = spdlog::daily_logger_st(logger, "logs/" + logger + ".log"); - } catch (const spdlog::spdlog_ex &ex) { - std::cout << "Log initialization failed: " << ex.what() << std::endl; - } - } - - if (level == "trace") - log->trace(args...); - else if (level == "debug") - log->debug(args...); - else if (level == "info") - log->info(args...); - else if (level == "warn") - log->warn(args...); - else if (level == "error") - log->error(args...); - else if (level == "critical") - log->critical(args...); - else { - log->warn("Unknown log level '{}' for logger '{}'", level, logger); - } - // Not required when threads clean-exit. - log->flush(); -} - -} // namespace marian - -#endif // SRC_BERGAMOT_SANELOGGING_H_ diff --git a/src/translator/sentence_splitter.cpp b/src/translator/sentence_splitter.cpp new file mode 100644 index 000000000..0f9be019a --- /dev/null +++ b/src/translator/sentence_splitter.cpp @@ -0,0 +1,52 @@ +#include "common/cli_helper.h" +#include "common/logging.h" +#include "common/options.h" +#include "sentence_splitter.h" +#include + +namespace marian { +namespace bergamot { + +SentenceSplitter::SentenceSplitter(marian::Ptr options) + : options_(options) { + + std::string smode_str = options_->get("ssplit-mode", ""); + mode_ = string2splitmode(smode_str); + std::string ssplit_prefix_file = + options_->get("ssplit-prefix-file", ""); + + if (ssplit_prefix_file.size()) { + ssplit_prefix_file = marian::cli::interpolateEnvVars(ssplit_prefix_file); + + LOG(info, "Loading protected prefixes for sentence splitting from {}", + ssplit_prefix_file); + + ssplit_.load(ssplit_prefix_file); + } else { + LOG(warn, "Missing list of protected prefixes for sentence splitting. 
" + "Set with --ssplit-prefix-file."); + } +} + +ug::ssplit::SentenceStream +SentenceSplitter::createSentenceStream(const string_view &input) { + return std::move(ug::ssplit::SentenceStream(input.data(), input.size(), + this->ssplit_, mode_)); +} + +ug::ssplit::SentenceStream::splitmode +SentenceSplitter::string2splitmode(const std::string &m) { + typedef ug::ssplit::SentenceStream::splitmode splitmode; + // @TODO: throw Exception on error + if (m == "sentence" || m == "Sentence") + return splitmode::one_sentence_per_line; + if (m == "paragraph" || m == "Paragraph") + return splitmode::one_paragraph_per_line; + if (m != "wrapped_text" && m != "WrappedText" && m != "wrappedText") { + LOG(warn, "Ignoring unknown text input format specification: {}.", m); + } + return splitmode::wrapped_text; +} + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/sentence_splitter.h b/src/translator/sentence_splitter.h new file mode 100644 index 000000000..5175176bf --- /dev/null +++ b/src/translator/sentence_splitter.h @@ -0,0 +1,31 @@ +#ifndef SRC_BERGAMOT_SENTENCE_SPLITTER_H_ +#define SRC_BERGAMOT_SENTENCE_SPLITTER_H_ + +#include "common/options.h" +#include "data/types.h" +#include "ssplit.h" +#include + +namespace marian { +namespace bergamot { + +class SentenceSplitter { + // A wrapper around @ugermann's ssplit-cpp compiled from several places in + // mts. Constructed based on options. Used in TextProcessor below to create + // sentence-streams, which provide access to one sentence from blob of text at + // a time. 
+public: + explicit SentenceSplitter(Ptr options); + ug::ssplit::SentenceStream createSentenceStream(string_view const &input); + +private: + ug::ssplit::SentenceSplitter ssplit_; + Ptr options_; + ug::ssplit::SentenceStream::splitmode mode_; + ug::ssplit::SentenceStream::splitmode string2splitmode(const std::string &m); +}; + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_SENTENCE_SPLITTER_H_ diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 4a5af301c..2acbbdb1b 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -1,6 +1,5 @@ #include "service.h" #include "definitions.h" -#include "sanelogging.h" #include #include diff --git a/src/translator/service.h b/src/translator/service.h index 4069d1392..0ed8d0c1e 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -4,7 +4,7 @@ #include "batch_translator.h" #include "batcher.h" #include "pcqueue.h" -#include "textops.h" +#include "text_processor.h" #include "translation_result.h" #include diff --git a/src/translator/textops.cpp b/src/translator/text_processor.cpp similarity index 52% rename from src/translator/textops.cpp rename to src/translator/text_processor.cpp index 25e48f1fd..8114855bb 100644 --- a/src/translator/textops.cpp +++ b/src/translator/text_processor.cpp @@ -1,58 +1,17 @@ -#include "textops.h" -#include "common/timer.h" -#include -#include -#include -#include +#include "text_processor.h" +#include "data/types.h" +#include "definitions.h" + +#include "common/options.h" +#include "data/vocab.h" #include namespace marian { namespace bergamot { -SentenceSplitter::SentenceSplitter(marian::Ptr options) - : options_(options) { - - std::string smode_str = options_->get("ssplit-mode", ""); - mode_ = string2splitmode(smode_str); - std::string ssplit_prefix_file = - options_->get("ssplit-prefix-file", ""); - - if (ssplit_prefix_file.size()) { - ssplit_prefix_file = marian::cli::interpolateEnvVars(ssplit_prefix_file); - - 
LOG(info, "Loading protected prefixes for sentence splitting from {}", - ssplit_prefix_file); - - ssplit_.load(ssplit_prefix_file); - } else { - LOG(warn, "Missing list of protected prefixes for sentence splitting. " - "Set with --ssplit-prefix-file."); - } -} - -ug::ssplit::SentenceStream -SentenceSplitter::createSentenceStream(const string_view &input) { - pcrecpp::StringPiece spiece(input.begin(), input.size()); - return std::move(ug::ssplit::SentenceStream(spiece, this->ssplit_, mode_)); -} - -ug::ssplit::SentenceStream::splitmode -SentenceSplitter::string2splitmode(const std::string &m) { - typedef ug::ssplit::SentenceStream::splitmode splitmode; - // @TODO: throw Exception on error - if (m == "sentence" || m == "Sentence") - return splitmode::one_sentence_per_line; - if (m == "paragraph" || m == "Paragraph") - return splitmode::one_paragraph_per_line; - if (m != "wrapped_text" && m != "WrappedText" && m != "wrappedText") { - LOG(warn, "Ignoring unknown text input format specification: {}.", m); - } - return splitmode::wrapped_text; -} - Segment TextProcessor::tokenize(const string_view &segment, TokenRanges &tokenRanges) { - return vocabs_->front()->encodePreservingSource( + return vocabs_->front()->encodeWithByteRanges( segment, tokenRanges, /*addEOS=*/false, /*inference=*/true); } @@ -70,11 +29,11 @@ void TextProcessor::process(const string_view &query, Segments &segments, std::vector &sourceRanges) { auto sentenceStream = sentence_splitter_.createSentenceStream(query); - pcrecpp::StringPiece sentenceStringPiece; + std::string_view sentenceStringPiece; while (sentenceStream >> sentenceStringPiece) { - string_view sentence(sentenceStringPiece.data(), - sentenceStringPiece.size()); + marian::string_view sentence(sentenceStringPiece.data(), + sentenceStringPiece.size()); TokenRanges tokenRanges; Segment segment = tokenize(sentence, tokenRanges); diff --git a/src/translator/textops.h b/src/translator/text_processor.h similarity index 56% rename from 
src/translator/textops.h rename to src/translator/text_processor.h index 79a504013..111ae009b 100644 --- a/src/translator/textops.h +++ b/src/translator/text_processor.h @@ -1,40 +1,17 @@ -#ifndef SRC_BERGAMOT_TEXTOPS_H_ -#define SRC_BERGAMOT_TEXTOPS_H_ +#ifndef SRC_BERGAMOT_TEXT_PROCESSOR_H_ +#define SRC_BERGAMOT_TEXT_PROCESSOR_H_ -#include "common/definitions.h" -#include "common/logging.h" -#include "common/options.h" -#include "common/types.h" // missing in shortlist.h -#include "common/utils.h" -#include "data/sentencepiece_vocab.h" -#include "data/shortlist.h" +#include "data/types.h" +#include "data/vocab.h" #include "definitions.h" -#include "ssplit.h" -#include -#include -#include +#include "sentence_splitter.h" + #include namespace marian { namespace bergamot { -class SentenceSplitter { - // A wrapper around @ugermann's ssplit-cpp compiled from several places in - // mts. Constructed based on options. Used in TextProcessor below to create - // sentence-streams, which provide access to one sentence from blob of text at - // a time. -public: - explicit SentenceSplitter(Ptr options); - ug::ssplit::SentenceStream createSentenceStream(string_view const &input); - -private: - ug::ssplit::SentenceSplitter ssplit_; - Ptr options_; - ug::ssplit::SentenceStream::splitmode mode_; - ug::ssplit::SentenceStream::splitmode string2splitmode(const std::string &m); -}; - class TextProcessor { // TextProcessor handles loading the sentencepiece vocabulary and also // contains an instance of sentence-splitter based on ssplit. 
@@ -68,4 +45,4 @@ class TextProcessor { } // namespace bergamot } // namespace marian -#endif // SRC_BERGAMOT_TEXTOPS_H_ +#endif // SRC_BERGAMOT_TEXT_PROCESSOR_H_ From 9108d9f0b3e96c1890746ab740df1901b5cc2245 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Fri, 12 Feb 2021 15:25:40 -0800 Subject: [PATCH 068/442] Update README.md Add `--recursive` to `git clone` instructions --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e1ad9c37a..e8adaba32 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt. ### Build Natively ```bash -git clone https://github.com/browsermt/bergamot-translator +git clone --recursive https://github.com/browsermt/bergamot-translator cd bergamot-translator mkdir build cd build From 3a53a68444834aeb6e78bfdb35ae12570187acd7 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Fri, 12 Feb 2021 15:41:17 -0800 Subject: [PATCH 069/442] Update README.md updating `--recursive` on wasm instructions too --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e8adaba32..4b1094415 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@ make -j To compile WASM, first download and Install Emscripten using following instructions: -1. Get the latest sdk: `git clone https://github.com/emscripten-core/emsdk.git` +1. Get the latest sdk: `git clone --recursive https://github.com/emscripten-core/emsdk.git` 2. Enter the cloned directory: `cd emsdk` 3. Install the lastest sdk tools: `./emsdk install latest` 4. 
Activate the latest sdk tools: `./emsdk activate latest` From a97bf7b504e151494d3206e8b2459e666482640b Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Fri, 12 Feb 2021 17:00:12 -0800 Subject: [PATCH 070/442] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 4b1094415..2791ebf96 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@ make -j To compile WASM, first download and Install Emscripten using following instructions: -1. Get the latest sdk: `git clone --recursive https://github.com/emscripten-core/emsdk.git` +1. Get the latest sdk: `git clone https://github.com/emscripten-core/emsdk.git` 2. Enter the cloned directory: `cd emsdk` 3. Install the lastest sdk tools: `./emsdk install latest` 4. Activate the latest sdk tools: `./emsdk activate latest` @@ -28,7 +28,7 @@ To compile WASM, first download and Install Emscripten using following instructi After the successful installation of Emscripten, perform these steps: ```bash -git clone https://github.com/browsermt/bergamot-translator +git clone --recursive https://github.com/browsermt/bergamot-translator cd bergamot-translator mkdir build-wasm cd build-wasm From 47db65972cd791cbb59b4ee9825e1d80a1e9d0f1 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Fri, 12 Feb 2021 17:18:57 -0800 Subject: [PATCH 071/442] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 2791ebf96..3e458dfe0 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,8 @@ After the successful installation of Emscripten, perform these steps: ```bash git clone --recursive https://github.com/browsermt/bergamot-translator cd bergamot-translator +git checkout wasm-integration +git submodule update --recursive mkdir build-wasm cd build-wasm emcmake cmake -DCOMPILE_WASM=on ../ From 4764f11e95cb2ec3c2766949ba58a74ee0d2cc90 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 13 Feb 2021 10:55:07 +0000 Subject: [PATCH 
072/442] Move BatchTranslator::thread_ to Service (#10) Service now holds an std::vector instead of BatchTranslators. --- src/translator/batch_translator.cpp | 26 +++++++++++--------------- src/translator/batch_translator.h | 19 ++++++++----------- src/translator/service.cpp | 8 +++++--- src/translator/service.h | 2 +- 4 files changed, 25 insertions(+), 30 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 860255cd4..7f801c97c 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -8,15 +8,10 @@ namespace marian { namespace bergamot { BatchTranslator::BatchTranslator(DeviceId const device, - PCQueue &pcqueue, std::vector> &vocabs, Ptr options) - : device_(device), options_(options), pcqueue_(&pcqueue), vocabs_(&vocabs) { - - thread_ = std::thread([this] { this->mainloop(); }); -} - -void BatchTranslator::initGraph() { + : device_(device), options_(options), vocabs_(&vocabs) { + // Initializes the graph. 
if (options_->hasAndNotEmpty("shortlist")) { int srcIdx = 0, trgIdx = 1; bool shared_vcb = vocabs_->front() == vocabs_->back(); @@ -38,7 +33,6 @@ void BatchTranslator::initGraph() { scorer->setShortlistGenerator(slgen_); } } - graph_->forward(); } @@ -98,18 +92,22 @@ void BatchTranslator::translate(RequestSentences &requestSentences, histories = std::move(search->search(graph_, batch)); } -void BatchTranslator::mainloop() { - initGraph(); +// void BatchTranslator::join() { thread_.join(); } + +void translation_loop(DeviceId const &device, PCQueue &pcqueue, + std::vector> &vocabs, + Ptr options) { + + BatchTranslator translator(device, vocabs, options); PCItem pcitem; Histories histories; - while (true) { - pcqueue_->ConsumeSwap(pcitem); + pcqueue.ConsumeSwap(pcitem); if (pcitem.isPoison()) { return; } else { - translate(pcitem.sentences, histories); + translator.translate(pcitem.sentences, histories); for (int i = 0; i < pcitem.sentences.size(); i++) { pcitem.sentences[i].completeSentence(histories[i]); } @@ -117,7 +115,5 @@ void BatchTranslator::mainloop() { } } -void BatchTranslator::join() { thread_.join(); } - } // namespace bergamot } // namespace marian diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index 069155efb..c718b32a0 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -22,29 +22,26 @@ class BatchTranslator { // shut down in Service which calls join() on the threads. public: - BatchTranslator(DeviceId const device, PCQueue &pcqueue, - std::vector> &vocabs, Ptr options); - void join(); + BatchTranslator(DeviceId const device, std::vector> &vocabs, + Ptr options); // convenience function for logging. 
TODO(jerin) std::string _identifier() { return "worker" + std::to_string(device_.no); } - -private: - void initGraph(); void translate(RequestSentences &requestSentences, Histories &histories); - void mainloop(); +private: Ptr options_; - DeviceId device_; std::vector> *vocabs_; Ptr graph_; std::vector> scorers_; Ptr slgen_; - - PCQueue *pcqueue_; - std::thread thread_; }; + +void translation_loop(DeviceId const &device, PCQueue &pcqueue, + std::vector> &vocabs, + Ptr options); + } // namespace bergamot } // namespace marian diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 2acbbdb1b..62073f931 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -16,9 +16,11 @@ Service::Service(Ptr options) workers_.reserve(numWorkers_); - for (int i = 0; i < numWorkers_; i++) { - marian::DeviceId deviceId(i, DeviceType::cpu); - workers_.emplace_back(deviceId, pcqueue_, vocabs_, options); + for (int cpuId = 0; cpuId < numWorkers_; cpuId++) { + workers_.emplace_back([&] { + marian::DeviceId deviceId(cpuId, DeviceType::cpu); + translation_loop(deviceId, pcqueue_, vocabs_, options); + }); } } diff --git a/src/translator/service.h b/src/translator/service.h index 0ed8d0c1e..e516bba60 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -69,7 +69,7 @@ class Service { TextProcessor text_processor_; // ORDER DEPENDENCY Batcher batcher_; PCQueue pcqueue_; - std::vector workers_; + std::vector workers_; }; std::vector> loadVocabularies(Ptr options); From f1d9f67b56ed5d84f74236b166fd592c060bf8d2 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 13 Feb 2021 11:42:57 +0000 Subject: [PATCH 073/442] single-threaded run with --cpu-threads 0 (#10) --- src/translator/batch_translator.cpp | 13 +++---- src/translator/batch_translator.h | 2 +- src/translator/batcher.cpp | 25 +++++++++++++ src/translator/batcher.h | 4 ++ src/translator/service.cpp | 57 ++++++++++++++++------------- src/translator/service.h | 3 ++ 6 files changed, 
70 insertions(+), 34 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 7f801c97c..3d2ec41c3 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -36,8 +36,7 @@ BatchTranslator::BatchTranslator(DeviceId const device, graph_->forward(); } -void BatchTranslator::translate(RequestSentences &requestSentences, - Histories &histories) { +void BatchTranslator::translate(RequestSentences &requestSentences) { std::vector batchVector; for (auto &sentence : requestSentences) { @@ -89,7 +88,10 @@ void BatchTranslator::translate(RequestSentences &requestSentences, auto trgVocab = vocabs_->back(); auto search = New(options_, scorers_, trgVocab); - histories = std::move(search->search(graph_, batch)); + auto histories = std::move(search->search(graph_, batch)); + for (int i = 0; i < requestSentences.size(); i++) { + requestSentences[i].completeSentence(histories[i]); + } } // void BatchTranslator::join() { thread_.join(); } @@ -107,10 +109,7 @@ void translation_loop(DeviceId const &device, PCQueue &pcqueue, if (pcitem.isPoison()) { return; } else { - translator.translate(pcitem.sentences, histories); - for (int i = 0; i < pcitem.sentences.size(); i++) { - pcitem.sentences[i].completeSentence(histories[i]); - } + translator.translate(pcitem.sentences); } } } diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index c718b32a0..4067e59a0 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -27,7 +27,7 @@ class BatchTranslator { // convenience function for logging. 
TODO(jerin) std::string _identifier() { return "worker" + std::to_string(device_.no); } - void translate(RequestSentences &requestSentences, Histories &histories); + void translate(RequestSentences &requestSentences); private: Ptr options_; diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 2fa4eaf09..18bf5fdc1 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -50,5 +50,30 @@ void Batcher::cleaveBatch(RequestSentences &sentences) { } } +void Batcher::addWholeRequest(Ptr request) { + for (int i = 0; i < request->numSegments(); i++) { + RequestSentence requestSentence(i, request); + addSentenceWithPriority(requestSentence); + } +} + +void Batcher::enqueue(PCQueue &pcqueue) { + int numSentences; + do { + RequestSentences batchSentences; + cleaveBatch(batchSentences); + numSentences = batchSentences.size(); + + if (numSentences > 0) { + PCItem pcitem(batchNumber_++, std::move(batchSentences)); + pcqueue.ProduceSwap(pcitem); + } + + if (batchNumber_ % 500 == 0) { + LOG(info, "Queuing batch {}", batchNumber_); + } + } while (numSentences > 0); +} + } // namespace bergamot } // namespace marian diff --git a/src/translator/batcher.h b/src/translator/batcher.h index b60b642c7..2499cd2ff 100644 --- a/src/translator/batcher.h +++ b/src/translator/batcher.h @@ -4,6 +4,7 @@ #include "common/options.h" #include "data/corpus_base.h" #include "definitions.h" +#include "pcqueue.h" #include "request.h" #include @@ -19,6 +20,8 @@ class Batcher { // sentence. This method inserts the sentence into the internal data-structure // which maintains priority among sentences from multiple concurrent requests. void addSentenceWithPriority(RequestSentence &sentence); + void addWholeRequest(Ptr request); + void enqueue(PCQueue &pcqueue); // Loads sentences with sentences compiled from (tentatively) multiple // requests optimizing for both padding and priority. 
@@ -27,6 +30,7 @@ class Batcher { private: unsigned int max_input_tokens_; std::vector> bucket_; + unsigned int batchNumber_{0}; }; } // namespace bergamot diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 62073f931..fc713851e 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -14,13 +14,17 @@ Service::Service(Ptr options) text_processor_(vocabs_, options), batcher_(options), pcqueue_(2 * options->get("cpu-threads")) { - workers_.reserve(numWorkers_); - - for (int cpuId = 0; cpuId < numWorkers_; cpuId++) { - workers_.emplace_back([&] { - marian::DeviceId deviceId(cpuId, DeviceType::cpu); - translation_loop(deviceId, pcqueue_, vocabs_, options); - }); + if (numWorkers_ > 0) { + workers_.reserve(numWorkers_); + for (int cpuId = 0; cpuId < numWorkers_; cpuId++) { + workers_.emplace_back([&] { + marian::DeviceId deviceId(cpuId, DeviceType::cpu); + translation_loop(deviceId, pcqueue_, vocabs_, options); + }); + } + } else { + marian::DeviceId deviceId(/*cpuId=*/0, DeviceType::cpu); + translator = new BatchTranslator(deviceId, vocabs_, options); } } @@ -53,27 +57,28 @@ std::future Service::translate(std::string &&input) { std::move(segments), std::move(sourceAlignments), std::move(translationResultPromise)); - for (int i = 0; i < request->numSegments(); i++) { - RequestSentence requestSentence(i, request); - batcher_.addSentenceWithPriority(requestSentence); + batcher_.addWholeRequest(request); + if (numWorkers_ > 0) { + batcher_.enqueue(pcqueue_); + } else { + // Queue single-threaded + int numSentences; + do { + RequestSentences batchSentences; + batcher_.cleaveBatch(batchSentences); + numSentences = batchSentences.size(); + + if (numSentences > 0) { + translator->translate(batchSentences); + batchNumber_++; + } + + if (batchNumber_ % 500 == 0) { + LOG(info, "Tranlsating batch {}", batchNumber_); + } + } while (numSentences > 0); } - int numSentences; - do { - RequestSentences batchSentences; - 
batcher_.cleaveBatch(batchSentences); - numSentences = batchSentences.size(); - - if (numSentences > 0) { - PCItem pcitem(batchNumber_++, std::move(batchSentences)); - pcqueue_.ProduceSwap(pcitem); - } - - if (batchNumber_ % 500 == 0) { - LOG(info, "Queuing batch {}", batchNumber_); - } - } while (numSentences > 0); - return future; } diff --git a/src/translator/service.h b/src/translator/service.h index e516bba60..951398df5 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -70,6 +70,9 @@ class Service { Batcher batcher_; PCQueue pcqueue_; std::vector workers_; + + // Optional + BatchTranslator *translator{nullptr}; }; std::vector> loadVocabularies(Ptr options); From 77a600b637afd854a189f96b052f37896d37acb7 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 13 Feb 2021 14:19:10 +0000 Subject: [PATCH 074/442] Removing join() (#10) --- src/translator/batch_translator.cpp | 2 -- 1 file changed, 2 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 3d2ec41c3..b944bed32 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -94,8 +94,6 @@ void BatchTranslator::translate(RequestSentences &requestSentences) { } } -// void BatchTranslator::join() { thread_.join(); } - void translation_loop(DeviceId const &device, PCQueue &pcqueue, std::vector> &vocabs, Ptr options) { From 73a56a8f4fa447fb58e230905c7c6e3d25c366da Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 13 Feb 2021 15:48:23 +0000 Subject: [PATCH 075/442] Refactoring batching-mechanisms into Batcher Guided by an objective to move batching mechanism and queueing request to generate batches into a diffenrent thread. This commit is in preparation for this functionality. First, PCItem from the looks of it is *Batch*. Renamed to reflect the same. Fingers crossed, hopefully no naming conflicts with marian. BatchTranslator translates a "Batch" now, instead of vector. 
Additional data members are setup at Batch to enable development. Workflows previously in Service, but more adequate in Batcher are now moved, preparing to move Batcher/enqueuing of a request into a new thread making it non-blocking. This will allow service to queue requests into the batcher thread and exit, without waiting until the full-request is queued. Batcher now has a path with and without pcqueue. --- src/translator/batch_translator.cpp | 25 +++++----- src/translator/batch_translator.h | 4 +- src/translator/batcher.cpp | 73 +++++++++++++++-------------- src/translator/batcher.h | 7 +-- src/translator/request.h | 22 +++++---- src/translator/service.cpp | 27 ++++------- src/translator/service.h | 3 +- 7 files changed, 78 insertions(+), 83 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index b944bed32..a6e6b9347 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -36,10 +36,10 @@ BatchTranslator::BatchTranslator(DeviceId const device, graph_->forward(); } -void BatchTranslator::translate(RequestSentences &requestSentences) { +void BatchTranslator::translate(Batch &batch) { std::vector batchVector; - for (auto &sentence : requestSentences) { + for (auto &sentence : batch.sentences) { data::SentenceTuple sentence_tuple(sentence.lineNumber()); Segment segment = sentence.getUnderlyingSegment(); sentence_tuple.push_back(segment); @@ -82,32 +82,31 @@ void BatchTranslator::translate(RequestSentences &requestSentences) { for (size_t j = 0; j < maxDims.size(); ++j) subBatches[j]->setWords(words[j]); - auto batch = Ptr(new CorpusBatch(subBatches)); - batch->setSentenceIds(sentenceIds); + auto corpus_batch = Ptr(new CorpusBatch(subBatches)); + corpus_batch->setSentenceIds(sentenceIds); auto trgVocab = vocabs_->back(); auto search = New(options_, scorers_, trgVocab); - auto histories = std::move(search->search(graph_, batch)); - for (int i = 0; i < requestSentences.size(); i++) 
{ - requestSentences[i].completeSentence(histories[i]); + auto histories = std::move(search->search(graph_, corpus_batch)); + for (int i = 0; i < batch.sentences.size(); i++) { + batch.sentences[i].completeSentence(histories[i]); } } -void translation_loop(DeviceId const &device, PCQueue &pcqueue, +void translation_loop(DeviceId const &device, PCQueue &pcqueue, std::vector> &vocabs, Ptr options) { BatchTranslator translator(device, vocabs, options); - - PCItem pcitem; + Batch batch; Histories histories; while (true) { - pcqueue.ConsumeSwap(pcitem); - if (pcitem.isPoison()) { + pcqueue.ConsumeSwap(batch); + if (batch.isPoison()) { return; } else { - translator.translate(pcitem.sentences); + translator.translate(batch); } } } diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index 4067e59a0..2ee4e04ef 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -27,7 +27,7 @@ class BatchTranslator { // convenience function for logging. 
TODO(jerin) std::string _identifier() { return "worker" + std::to_string(device_.no); } - void translate(RequestSentences &requestSentences); + void translate(Batch &batch); private: Ptr options_; @@ -38,7 +38,7 @@ class BatchTranslator { Ptr slgen_; }; -void translation_loop(DeviceId const &device, PCQueue &pcqueue, +void translation_loop(DeviceId const &device, PCQueue &pcqueue, std::vector> &vocabs, Ptr options); diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 18bf5fdc1..13b563542 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -6,10 +6,10 @@ namespace marian { namespace bergamot { Batcher::Batcher(Ptr options) { - max_input_tokens_ = options->get("max-input-tokens"); + miniBatchWords = options->get("max-input-tokens"); bucket_.resize(options->get("max-input-sentence-tokens") + 1); ABORT_IF( - max_input_tokens_ < bucket_.size() - 1, + miniBatchWords < bucket_.size() - 1, "max-input-tokens cannot be less than than max-input-sentence-tokens, " "batcher fail"); } @@ -20,34 +20,48 @@ void Batcher::addSentenceWithPriority(RequestSentence &sentence) { bucket_[bucket_id].insert(sentence); } -void Batcher::cleaveBatch(RequestSentences &sentences) { +bool Batcher::operator>>(Batch &batch) { return cleaveBatch(batch); } + +bool Batcher::cleaveBatch(Batch &batch) { // For now simply iterates on buckets and converts batches greedily. This // has to be enhanced with optimizing over priority. The baseline // implementation should at least be as fast as marian's maxi-batch with full // corpus size as maxi-batch size. 
+ batch.reset(); + int paddedBatchSize = 0; - int segments_added = 0; - int current_input_tokens = 0; - int padded_batch_size = 0; - int prev_padded_batch_size; - - for (int i = 0; i < bucket_.size(); i++) { - auto p = bucket_[i].begin(); - while (p != bucket_[i].end()) { - padded_batch_size = (segments_added + 1) * i; - if (padded_batch_size <= max_input_tokens_) { + for (int length = 0; length < bucket_.size(); length++) { + auto p = bucket_[length].begin(); + while (p != bucket_[length].end()) { + paddedBatchSize = (batch.sentences.size() + 1) * length; + if (paddedBatchSize <= miniBatchWords) { auto q = p; ++p; - current_input_tokens += i; - sentences.push_back(*q); - ++segments_added; - bucket_[i].erase(q); - prev_padded_batch_size = padded_batch_size; + + batch.numTokens += length; + batch.sentences.push_back(*q); + batch.maxLength = std::max(batch.maxLength, length); + + bucket_[length].erase(q); } else { - return; + // Check if elements exist + assert(batch.sentences.size() > 0); + batch.Id = ++batchNumber_; + if (batchId % 500 == 0) { + batch.log(); + } + return true; } } } + + if (batch.sentences.size()) { + batch.Id = ++batchNumber_; + batch.log(); + return true; + } else { + return false; + } } void Batcher::addWholeRequest(Ptr request) { @@ -57,22 +71,11 @@ void Batcher::addWholeRequest(Ptr request) { } } -void Batcher::enqueue(PCQueue &pcqueue) { - int numSentences; - do { - RequestSentences batchSentences; - cleaveBatch(batchSentences); - numSentences = batchSentences.size(); - - if (numSentences > 0) { - PCItem pcitem(batchNumber_++, std::move(batchSentences)); - pcqueue.ProduceSwap(pcitem); - } - - if (batchNumber_ % 500 == 0) { - LOG(info, "Queuing batch {}", batchNumber_); - } - } while (numSentences > 0); +void Batcher::enqueue(PCQueue &pcqueue) { + Batch batch; + while (cleaveBatch(batch)) { + pcqueue.ProduceSwap(batch); + } } } // namespace bergamot diff --git a/src/translator/batcher.h b/src/translator/batcher.h index 2499cd2ff..d6b85f3f3 
100644 --- a/src/translator/batcher.h +++ b/src/translator/batcher.h @@ -21,14 +21,15 @@ class Batcher { // which maintains priority among sentences from multiple concurrent requests. void addSentenceWithPriority(RequestSentence &sentence); void addWholeRequest(Ptr request); - void enqueue(PCQueue &pcqueue); + void enqueue(PCQueue &pcqueue); // Loads sentences with sentences compiled from (tentatively) multiple // requests optimizing for both padding and priority. - void cleaveBatch(RequestSentences &sentences); + bool cleaveBatch(Batch &batch); + bool operator>>(Batch &batch); // alias private: - unsigned int max_input_tokens_; + unsigned int miniBatchWords; std::vector> bucket_; unsigned int batchNumber_{0}; }; diff --git a/src/translator/request.h b/src/translator/request.h index 6f268ba1c..673f88ce3 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -24,6 +24,7 @@ #include "definitions.h" #include "translation_result.h" +#include "common/logging.h" #include "data/types.h" #include "translator/beam_search.h" @@ -92,20 +93,23 @@ class RequestSentence { typedef std::vector RequestSentences; -struct PCItem { - int batchNumber; +struct Batch { + int Id; + int numTokens, maxLength; RequestSentences sentences; - // PCItem should be default constructible for PCQueue. Default constructed + // Batch should be default constructible for PCQueue. Default constructed // element is poison. - PCItem() : batchNumber(-1) {} - - // PCItem constructor to construct a legit PCItem. - explicit PCItem(int batchNumber, RequestSentences &&sentences) - : batchNumber(batchNumber), sentences(std::move(sentences)) {} + Batch() { reset(); } + void reset() { Id = -1, numTokens = 0, maxLength = 0, sentences.clear(); } // Convenience function to determine poison. 
- bool isPoison() { return (batchNumber == -1); } + bool isPoison() { return (Id == -1); } + + void log() { + LOG(info, "Batch(Id={}, tokens={}, max-length={}, sentences={})", Id, + numTokens, maxLength, sentences.size()); + } }; } // namespace bergamot diff --git a/src/translator/service.cpp b/src/translator/service.cpp index fc713851e..37019552c 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -8,8 +8,7 @@ namespace marian { namespace bergamot { Service::Service(Ptr options) - : requestId_(0), batchNumber_(0), - numWorkers_(options->get("cpu-threads")), + : requestId_(0), numWorkers_(options->get("cpu-threads")), vocabs_(std::move(loadVocabularies(options))), text_processor_(vocabs_, options), batcher_(options), pcqueue_(2 * options->get("cpu-threads")) { @@ -58,25 +57,15 @@ std::future Service::translate(std::string &&input) { std::move(translationResultPromise)); batcher_.addWholeRequest(request); + if (numWorkers_ > 0) { batcher_.enqueue(pcqueue_); } else { // Queue single-threaded - int numSentences; - do { - RequestSentences batchSentences; - batcher_.cleaveBatch(batchSentences); - numSentences = batchSentences.size(); - - if (numSentences > 0) { - translator->translate(batchSentences); - batchNumber_++; - } - - if (batchNumber_ % 500 == 0) { - LOG(info, "Tranlsating batch {}", batchNumber_); - } - } while (numSentences > 0); + Batch batch; + while (batcher_ >> batch) { + translator->translate(batch); + } } return future; @@ -85,8 +74,8 @@ std::future Service::translate(std::string &&input) { void Service::stop() { int counter = 0; for (auto &worker : workers_) { - PCItem pcitem; - pcqueue_.ProduceSwap(pcitem); + Batch batch; + pcqueue_.ProduceSwap(batch); ++counter; } diff --git a/src/translator/service.h b/src/translator/service.h index 951398df5..c57e609a7 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -46,7 +46,6 @@ class Service { private: unsigned int requestId_; - unsigned int batchNumber_; int 
numWorkers_; // vocabs are used to construct a Request, which later uses it to construct @@ -68,7 +67,7 @@ class Service { TextProcessor text_processor_; // ORDER DEPENDENCY Batcher batcher_; - PCQueue pcqueue_; + PCQueue pcqueue_; std::vector workers_; // Optional From e585a9e7861934e40d3d4e2a5793724be3a9e3a6 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sat, 13 Feb 2021 16:31:30 +0000 Subject: [PATCH 076/442] Sanitizing Batch construction Batch Ids cannot be set by outside classes to values < 0. Batch.Id_ = -1 : Poison, for use in PCQueue 0 : Default constructed, invalid batch. >0 : Legit batch. Book-keeping for batch metrics (maxLength, numTokens, etc) and logging are now moved to Batch. Batch is now a class instead of a struct with accessors controlling how members can be modified to suit above. --- src/translator/batch_translator.cpp | 7 ++-- src/translator/batcher.cpp | 23 ++++--------- src/translator/request.h | 53 ++++++++++++++++++++++------- src/translator/service.cpp | 4 +-- 4 files changed, 53 insertions(+), 34 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index a6e6b9347..13eb58a21 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -39,7 +39,8 @@ BatchTranslator::BatchTranslator(DeviceId const device, void BatchTranslator::translate(Batch &batch) { std::vector batchVector; - for (auto &sentence : batch.sentences) { + auto &sentences = batch.sentences(); + for (auto &sentence : sentences) { data::SentenceTuple sentence_tuple(sentence.lineNumber()); Segment segment = sentence.getUnderlyingSegment(); sentence_tuple.push_back(segment); @@ -89,9 +90,7 @@ void BatchTranslator::translate(Batch &batch) { auto search = New(options_, scorers_, trgVocab); auto histories = std::move(search->search(graph_, corpus_batch)); - for (int i = 0; i < batch.sentences.size(); i++) { - batch.sentences[i].completeSentence(histories[i]); - } + batch.completeBatch(histories); } 
void translation_loop(DeviceId const &device, PCQueue &pcqueue, diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 13b563542..5fdcc3ac6 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -33,31 +33,22 @@ bool Batcher::cleaveBatch(Batch &batch) { for (int length = 0; length < bucket_.size(); length++) { auto p = bucket_[length].begin(); while (p != bucket_[length].end()) { - paddedBatchSize = (batch.sentences.size() + 1) * length; + paddedBatchSize = (batch.size() + 1) * length; if (paddedBatchSize <= miniBatchWords) { - auto q = p; - ++p; - - batch.numTokens += length; - batch.sentences.push_back(*q); - batch.maxLength = std::max(batch.maxLength, length); - + auto q = p++; + batch.add(*q); bucket_[length].erase(q); } else { // Check if elements exist - assert(batch.sentences.size() > 0); - batch.Id = ++batchNumber_; - if (batchId % 500 == 0) { - batch.log(); - } + assert(batch.size() > 0); + batch.setId(++batchNumber_); return true; } } } - if (batch.sentences.size()) { - batch.Id = ++batchNumber_; - batch.log(); + if (batch.size()) { + batch.setId(++batchNumber_); return true; } else { return false; diff --git a/src/translator/request.h b/src/translator/request.h index 673f88ce3..5fb9c3c5d 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -28,6 +28,8 @@ #include "data/types.h" #include "translator/beam_search.h" +#include + #include #include @@ -93,23 +95,50 @@ class RequestSentence { typedef std::vector RequestSentences; -struct Batch { - int Id; - int numTokens, maxLength; - RequestSentences sentences; - - // Batch should be default constructible for PCQueue. Default constructed - // element is poison. +class Batch { +public: Batch() { reset(); } - void reset() { Id = -1, numTokens = 0, maxLength = 0, sentences.clear(); } - + void reset() { Id_ = 0, numTokens_ = 0, maxLength_ = 0, sentences_.clear(); } // Convenience function to determine poison. 
- bool isPoison() { return (Id == -1); } + bool isPoison() { return (Id_ == -1); } + static Batch poison() { + Batch poison_; + poison_.Id_ = -1; + return poison_; + } void log() { - LOG(info, "Batch(Id={}, tokens={}, max-length={}, sentences={})", Id, - numTokens, maxLength, sentences.size()); + LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_, + numTokens_, maxLength_, sentences_.size()); + } + + void add(const RequestSentence &sentence) { + sentences_.push_back(sentence); + maxLength_ = std::max(sentence.numTokens(), maxLength_); + numTokens_ += sentence.numTokens(); } + + size_t size() { return sentences_.size(); } + + void setId(int Id) { + assert(Id > 0); + Id_ = Id; + if (Id % 500 == 0) { + log(); + } + } + + const RequestSentences &sentences() { return sentences_; } + void completeBatch(const Histories &histories) { + for (int i = 0; i < sentences_.size(); i++) { + sentences_[i].completeSentence(histories[i]); + } + } + +private: + int Id_; + size_t numTokens_, maxLength_; + RequestSentences sentences_; }; } // namespace bergamot diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 37019552c..c93aa5f00 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -74,8 +74,8 @@ std::future Service::translate(std::string &&input) { void Service::stop() { int counter = 0; for (auto &worker : workers_) { - Batch batch; - pcqueue_.ProduceSwap(batch); + Batch poison = Batch::poison(); + pcqueue_.ProduceSwap(poison); ++counter; } From 1e413f71cd583bba570af8ed8fde7f797174dd41 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Sat, 13 Feb 2021 14:54:36 -0800 Subject: [PATCH 077/442] Including a more elaborated test page, a node webserver containing the proper cors headers and wasm mimetype --- .gitignore | 2 + README.md | 6 +- wasm/README.md | 42 +- wasm/bergamot.html | 54 -- wasm/test_page/bergamot.html | 140 +++++ wasm/test_page/helper.js | 40 ++ wasm/test_page/package-lock.json | 904 
+++++++++++++++++++++++++++++++ wasm/test_page/package.json | 7 + wasm/test_page/start_server.sh | 8 + 9 files changed, 1131 insertions(+), 72 deletions(-) delete mode 100644 wasm/bergamot.html create mode 100644 wasm/test_page/bergamot.html create mode 100644 wasm/test_page/helper.js create mode 100644 wasm/test_page/package-lock.json create mode 100644 wasm/test_page/package.json create mode 100644 wasm/test_page/start_server.sh diff --git a/.gitignore b/.gitignore index e63aee1e1..59363a81c 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,5 @@ *.swp *.swo +wasm/test_page/node_modules +build-wasm diff --git a/README.md b/README.md index 3e458dfe0..333e758e3 100644 --- a/README.md +++ b/README.md @@ -40,10 +40,12 @@ emmake make -j It should generate the artefacts (.js and .wasm files) in `wasm` folder inside build directory ("build-wasm" in this case). +Download the models from `https://github.com/mozilla-applied-ml/bergamot-models`, and place all the desired ones to package in a folder called `models`. + The build also allows packaging files into wasm binary (i.e. preloading in Emscripten’s virtual file system) using cmake -option `PACKAGE_DIR`. The compile command below packages all the files in PATH directory into wasm binary. +option `PACKAGE_DIR`. The compile command below packages all the files in PATH directory (in these case, your models) into wasm binary. ```bash -emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR= ../ +emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR= ./models ``` Files packaged this way are preloaded in the root of the virtual file system. 
diff --git a/wasm/README.md b/wasm/README.md index 83d4738cd..6be620956 100644 --- a/wasm/README.md +++ b/wasm/README.md @@ -1,5 +1,5 @@ ## Using Bergamot Translator in JavaScript -The example file `bergamot.html` in this folder demonstrates how to use the bergamot translator in JavaScript via a ` - - diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html new file mode 100644 index 000000000..49ca50e96 --- /dev/null +++ b/wasm/test_page/bergamot.html @@ -0,0 +1,140 @@ + + + + + + + + + +
+ [the 140 added lines of test-page markup were stripped during text extraction and could not be recovered]
+ + + + + diff --git a/wasm/test_page/helper.js b/wasm/test_page/helper.js new file mode 100644 index 000000000..bff116ced --- /dev/null +++ b/wasm/test_page/helper.js @@ -0,0 +1,40 @@ +/* + * @author - Based of a file from Gist here: https://gist.github.com/1757658 + * + * @modified - Mike Newell - it was on Gist so I figure I can use it + * + * @Description - Added support for a few more mime types including the new + * .ogv, .webm, and .mp4 file types for HTML5 video. + * + */ + +/* +* @modified - Andre Natal - removed unused types for the purpose of this use +case +*/ + +Helper = { + + types: { + "wasm" : "application/wasm" + , "js" : "application/javascript" + , "html" : "text/html" + , "htm" : "text/html" + , "ico" : "image/vnd.microsoft.icon", + }, + + getMime: function(u) { + + var ext = this.getExt(u.pathname).replace('.', ''); + + return this.types[ext.toLowerCase()] || 'application/octet-stream'; + + }, + + getExt: function(path) { + var i = path.lastIndexOf('.'); + + return (i < 0) ? 
'' : path.substr(i); + } + +}; diff --git a/wasm/test_page/package-lock.json b/wasm/test_page/package-lock.json new file mode 100644 index 000000000..065c92de8 --- /dev/null +++ b/wasm/test_page/package-lock.json @@ -0,0 +1,904 @@ +{ + "name": "test_page", + "lockfileVersion": 2, + "requires": true, + "packages": { + "": { + "dependencies": { + "cors": "^2.8.5", + "express": "^4.17.1", + "nocache": "^2.1.0" + } + }, + "node_modules/accepts": { + "version": "1.3.7", + "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.7.tgz", + "integrity": "sha512-Il80Qs2WjYlJIBNzNkK6KYqlVMTbZLXgHx2oT0pU/fjRHyEp+PEfEPY0R3WCwAGVOtauxh1hOxNgIf5bv7dQpA==", + "dependencies": { + "mime-types": "~2.1.24", + "negotiator": "0.6.2" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/array-flatten": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz", + "integrity": "sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=" + }, + "node_modules/body-parser": { + "version": "1.19.0", + "resolved": "https://registry.npmjs.org/body-parser/-/body-parser-1.19.0.tgz", + "integrity": "sha512-dhEPs72UPbDnAQJ9ZKMNTP6ptJaionhP5cBb541nXPlW60Jepo9RV/a4fX4XWW9CuFNK22krhrj1+rgzifNCsw==", + "dependencies": { + "bytes": "3.1.0", + "content-type": "~1.0.4", + "debug": "2.6.9", + "depd": "~1.1.2", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "on-finished": "~2.3.0", + "qs": "6.7.0", + "raw-body": "2.4.0", + "type-is": "~1.6.17" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/bytes": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/bytes/-/bytes-3.1.0.tgz", + "integrity": "sha512-zauLjrfCG+xvoyaqLoV8bLVXXNGC4JqlxFCutSDWA6fJrTo2ZuvLYTqZ7aHBLZSMOopbzwv8f+wZcVzfVTI2Dg==", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/content-disposition": { + "version": "0.5.3", + "resolved": "https://registry.npmjs.org/content-disposition/-/content-disposition-0.5.3.tgz", + "integrity": 
"sha512-ExO0774ikEObIAEV9kDo50o+79VCUdEB6n6lzKgGwupcVeRlhrj3qGAfwq8G6uBJjkqLrhT0qEYFcWng8z1z0g==", + "dependencies": { + "safe-buffer": "5.1.2" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/content-type": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/content-type/-/content-type-1.0.4.tgz", + "integrity": "sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/cookie": { + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.4.0.tgz", + "integrity": "sha512-+Hp8fLp57wnUSt0tY0tHEXh4voZRDnoIrZPqlo3DPiI4y9lwg/jqx+1Om94/W6ZaPDOUbnjOt/99w66zk+l1Xg==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/cookie-signature": { + "version": "1.0.6", + "resolved": "https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.6.tgz", + "integrity": "sha1-4wOogrNCzD7oylE6eZmXNNqzriw=" + }, + "node_modules/cors": { + "version": "2.8.5", + "resolved": "https://registry.npmjs.org/cors/-/cors-2.8.5.tgz", + "integrity": "sha512-KIHbLJqu73RGr/hnbrO9uBeixNGuvSQjul/jdFvS/KFSIH1hWVd1ng7zOHx+YrEfInLG7q4n6GHQ9cDtxv/P6g==", + "dependencies": { + "object-assign": "^4", + "vary": "^1" + }, + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/debug": { + "version": "2.6.9", + "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", + "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", + "dependencies": { + "ms": "2.0.0" + } + }, + "node_modules/depd": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/depd/-/depd-1.1.2.tgz", + "integrity": "sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/destroy": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/destroy/-/destroy-1.0.4.tgz", + "integrity": "sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA=" + }, + "node_modules/ee-first": { + 
"version": "1.1.1", + "resolved": "https://registry.npmjs.org/ee-first/-/ee-first-1.1.1.tgz", + "integrity": "sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0=" + }, + "node_modules/encodeurl": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-1.0.2.tgz", + "integrity": "sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k=", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg=" + }, + "node_modules/etag": { + "version": "1.8.1", + "resolved": "https://registry.npmjs.org/etag/-/etag-1.8.1.tgz", + "integrity": "sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/express": { + "version": "4.17.1", + "resolved": "https://registry.npmjs.org/express/-/express-4.17.1.tgz", + "integrity": "sha512-mHJ9O79RqluphRrcw2X/GTh3k9tVv8YcoyY4Kkh4WDMUYKRZUq0h1o0w2rrrxBqM7VoeUVqgb27xlEMXTnYt4g==", + "dependencies": { + "accepts": "~1.3.7", + "array-flatten": "1.1.1", + "body-parser": "1.19.0", + "content-disposition": "0.5.3", + "content-type": "~1.0.4", + "cookie": "0.4.0", + "cookie-signature": "1.0.6", + "debug": "2.6.9", + "depd": "~1.1.2", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "finalhandler": "~1.1.2", + "fresh": "0.5.2", + "merge-descriptors": "1.0.1", + "methods": "~1.1.2", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "path-to-regexp": "0.1.7", + "proxy-addr": "~2.0.5", + "qs": "6.7.0", + "range-parser": "~1.2.1", + "safe-buffer": "5.1.2", + "send": "0.17.1", + "serve-static": "1.14.1", + "setprototypeof": "1.1.1", + "statuses": "~1.5.0", + "type-is": "~1.6.18", + "utils-merge": "1.0.1", + "vary": "~1.1.2" + }, + "engines": { + "node": ">= 0.10.0" + } + }, + "node_modules/finalhandler": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/finalhandler/-/finalhandler-1.1.2.tgz", + "integrity": 
"sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA==", + "dependencies": { + "debug": "2.6.9", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "statuses": "~1.5.0", + "unpipe": "~1.0.0" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/forwarded": { + "version": "0.1.2", + "resolved": "https://registry.npmjs.org/forwarded/-/forwarded-0.1.2.tgz", + "integrity": "sha1-mMI9qxF1ZXuMBXPozszZGw/xjIQ=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/fresh": { + "version": "0.5.2", + "resolved": "https://registry.npmjs.org/fresh/-/fresh-0.5.2.tgz", + "integrity": "sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/http-errors": { + "version": "1.7.2", + "resolved": "https://registry.npmjs.org/http-errors/-/http-errors-1.7.2.tgz", + "integrity": "sha512-uUQBt3H/cSIVfch6i1EuPNy/YsRSOUBXTVfZ+yR7Zjez3qjBz6i9+i4zjNaoqcoFVI4lQJ5plg63TvGfRSDCRg==", + "dependencies": { + "depd": "~1.1.2", + "inherits": "2.0.3", + "setprototypeof": "1.1.1", + "statuses": ">= 1.5.0 < 2", + "toidentifier": "1.0.0" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "dependencies": { + "safer-buffer": ">= 2.1.2 < 3" + }, + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/inherits": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", + "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" + }, + "node_modules/ipaddr.js": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-1.9.1.tgz", + "integrity": "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g==", + "engines": 
{ + "node": ">= 0.10" + } + }, + "node_modules/media-typer": { + "version": "0.3.0", + "resolved": "https://registry.npmjs.org/media-typer/-/media-typer-0.3.0.tgz", + "integrity": "sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/merge-descriptors": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", + "integrity": "sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E=" + }, + "node_modules/methods": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/methods/-/methods-1.1.2.tgz", + "integrity": "sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/mime": { + "version": "1.6.0", + "resolved": "https://registry.npmjs.org/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==", + "bin": { + "mime": "cli.js" + }, + "engines": { + "node": ">=4" + } + }, + "node_modules/mime-db": { + "version": "1.45.0", + "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.45.0.tgz", + "integrity": "sha512-CkqLUxUk15hofLoLyljJSrukZi8mAtgd+yE5uO4tqRZsdsAJKv0O+rFMhVDRJgozy+yG6md5KwuXhD4ocIoP+w==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/mime-types": { + "version": "2.1.28", + "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.28.tgz", + "integrity": "sha512-0TO2yJ5YHYr7M2zzT7gDU1tbwHxEUWBCLt0lscSNpcdAfFyJOVEpRYNS7EXVcTLNj/25QO8gulHC5JtTzSE2UQ==", + "dependencies": { + "mime-db": "1.45.0" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/ms": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", + "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" + }, + "node_modules/negotiator": { + "version": "0.6.2", + "resolved": "https://registry.npmjs.org/negotiator/-/negotiator-0.6.2.tgz", + "integrity": 
"sha512-hZXc7K2e+PgeI1eDBe/10Ard4ekbfrrqG8Ep+8Jmf4JID2bNg7NvCPOZN+kfF574pFQI7mum2AUqDidoKqcTOw==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/nocache": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/nocache/-/nocache-2.1.0.tgz", + "integrity": "sha512-0L9FvHG3nfnnmaEQPjT9xhfN4ISk0A8/2j4M37Np4mcDesJjHgEUfgPhdCyZuFI954tjokaIj/A3NdpFNdEh4Q==", + "engines": { + "node": ">=4.0.0" + } + }, + "node_modules/object-assign": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", + "integrity": "sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/on-finished": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/on-finished/-/on-finished-2.3.0.tgz", + "integrity": "sha1-IPEzZIGwg811M3mSoWlxqi2QaUc=", + "dependencies": { + "ee-first": "1.1.1" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/parseurl": { + "version": "1.3.3", + "resolved": "https://registry.npmjs.org/parseurl/-/parseurl-1.3.3.tgz", + "integrity": "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ==", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/path-to-regexp": { + "version": "0.1.7", + "resolved": "https://registry.npmjs.org/path-to-regexp/-/path-to-regexp-0.1.7.tgz", + "integrity": "sha1-32BBeABfUi8V60SQ5yR6G/qmf4w=" + }, + "node_modules/proxy-addr": { + "version": "2.0.6", + "resolved": "https://registry.npmjs.org/proxy-addr/-/proxy-addr-2.0.6.tgz", + "integrity": "sha512-dh/frvCBVmSsDYzw6n926jv974gddhkFPfiN8hPOi30Wax25QZyZEGveluCgliBnqmuM+UJmBErbAUFIoDbjOw==", + "dependencies": { + "forwarded": "~0.1.2", + "ipaddr.js": "1.9.1" + }, + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/qs": { + "version": "6.7.0", + "resolved": "https://registry.npmjs.org/qs/-/qs-6.7.0.tgz", + "integrity": "sha512-VCdBRNFTX1fyE7Nb6FYoURo/SPe62QCaAyzJvUjwRaIsc+NePBEniHlvxFmmX56+HZphIGtV0XeCirBtpDrTyQ==", 
+ "engines": { + "node": ">=0.6" + } + }, + "node_modules/range-parser": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/range-parser/-/range-parser-1.2.1.tgz", + "integrity": "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/raw-body": { + "version": "2.4.0", + "resolved": "https://registry.npmjs.org/raw-body/-/raw-body-2.4.0.tgz", + "integrity": "sha512-4Oz8DUIwdvoa5qMJelxipzi/iJIi40O5cGV1wNYp5hvZP8ZN0T+jiNkL0QepXs+EsQ9XJ8ipEDoiH70ySUJP3Q==", + "dependencies": { + "bytes": "3.1.0", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "unpipe": "1.0.0" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/safe-buffer": { + "version": "5.1.2", + "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", + "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" + }, + "node_modules/safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==" + }, + "node_modules/send": { + "version": "0.17.1", + "resolved": "https://registry.npmjs.org/send/-/send-0.17.1.tgz", + "integrity": "sha512-BsVKsiGcQMFwT8UxypobUKyv7irCNRHk1T0G680vk88yf6LBByGcZJOTJCrTP2xVN6yI+XjPJcNuE3V4fT9sAg==", + "dependencies": { + "debug": "2.6.9", + "depd": "~1.1.2", + "destroy": "~1.0.4", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "fresh": "0.5.2", + "http-errors": "~1.7.2", + "mime": "1.6.0", + "ms": "2.1.1", + "on-finished": "~2.3.0", + "range-parser": "~1.2.1", + "statuses": "~1.5.0" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/send/node_modules/ms": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", + "integrity": 
"sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" + }, + "node_modules/serve-static": { + "version": "1.14.1", + "resolved": "https://registry.npmjs.org/serve-static/-/serve-static-1.14.1.tgz", + "integrity": "sha512-JMrvUwE54emCYWlTI+hGrGv5I8dEwmco/00EvkzIIsR7MqrHonbD9pO2MOfFnpFntl7ecpZs+3mW+XbQZu9QCg==", + "dependencies": { + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "parseurl": "~1.3.3", + "send": "0.17.1" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/setprototypeof": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/setprototypeof/-/setprototypeof-1.1.1.tgz", + "integrity": "sha512-JvdAWfbXeIGaZ9cILp38HntZSFSo3mWg6xGcJJsd+d4aRMOqauag1C63dJfDw7OaMYwEbHMOxEZ1lqVRYP2OAw==" + }, + "node_modules/statuses": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/statuses/-/statuses-1.5.0.tgz", + "integrity": "sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/toidentifier": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/toidentifier/-/toidentifier-1.0.0.tgz", + "integrity": "sha512-yaOH/Pk/VEhBWWTlhI+qXxDFXlejDGcQipMlyxda9nthulaxLZUNcUqFxokp0vcYnvteJln5FNQDRrxj3YcbVw==", + "engines": { + "node": ">=0.6" + } + }, + "node_modules/type-is": { + "version": "1.6.18", + "resolved": "https://registry.npmjs.org/type-is/-/type-is-1.6.18.tgz", + "integrity": "sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g==", + "dependencies": { + "media-typer": "0.3.0", + "mime-types": "~2.1.24" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/unpipe": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", + "integrity": "sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw=", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/utils-merge": { + "version": "1.0.1", + "resolved": 
"https://registry.npmjs.org/utils-merge/-/utils-merge-1.0.1.tgz", + "integrity": "sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM=", + "engines": { + "node": ">= 0.4.0" + } + }, + "node_modules/vary": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/vary/-/vary-1.1.2.tgz", + "integrity": "sha1-IpnwLG3tMNSllhsLn3RSShj2NPw=", + "engines": { + "node": ">= 0.8" + } + } + }, + "dependencies": { + "accepts": { + "version": "1.3.7", + "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.7.tgz", + "integrity": "sha512-Il80Qs2WjYlJIBNzNkK6KYqlVMTbZLXgHx2oT0pU/fjRHyEp+PEfEPY0R3WCwAGVOtauxh1hOxNgIf5bv7dQpA==", + "requires": { + "mime-types": "~2.1.24", + "negotiator": "0.6.2" + } + }, + "array-flatten": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz", + "integrity": "sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=" + }, + "body-parser": { + "version": "1.19.0", + "resolved": "https://registry.npmjs.org/body-parser/-/body-parser-1.19.0.tgz", + "integrity": "sha512-dhEPs72UPbDnAQJ9ZKMNTP6ptJaionhP5cBb541nXPlW60Jepo9RV/a4fX4XWW9CuFNK22krhrj1+rgzifNCsw==", + "requires": { + "bytes": "3.1.0", + "content-type": "~1.0.4", + "debug": "2.6.9", + "depd": "~1.1.2", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "on-finished": "~2.3.0", + "qs": "6.7.0", + "raw-body": "2.4.0", + "type-is": "~1.6.17" + } + }, + "bytes": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/bytes/-/bytes-3.1.0.tgz", + "integrity": "sha512-zauLjrfCG+xvoyaqLoV8bLVXXNGC4JqlxFCutSDWA6fJrTo2ZuvLYTqZ7aHBLZSMOopbzwv8f+wZcVzfVTI2Dg==" + }, + "content-disposition": { + "version": "0.5.3", + "resolved": "https://registry.npmjs.org/content-disposition/-/content-disposition-0.5.3.tgz", + "integrity": "sha512-ExO0774ikEObIAEV9kDo50o+79VCUdEB6n6lzKgGwupcVeRlhrj3qGAfwq8G6uBJjkqLrhT0qEYFcWng8z1z0g==", + "requires": { + "safe-buffer": "5.1.2" + } + }, + "content-type": { + "version": "1.0.4", + "resolved": 
"https://registry.npmjs.org/content-type/-/content-type-1.0.4.tgz", + "integrity": "sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA==" + }, + "cookie": { + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.4.0.tgz", + "integrity": "sha512-+Hp8fLp57wnUSt0tY0tHEXh4voZRDnoIrZPqlo3DPiI4y9lwg/jqx+1Om94/W6ZaPDOUbnjOt/99w66zk+l1Xg==" + }, + "cookie-signature": { + "version": "1.0.6", + "resolved": "https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.6.tgz", + "integrity": "sha1-4wOogrNCzD7oylE6eZmXNNqzriw=" + }, + "cors": { + "version": "2.8.5", + "resolved": "https://registry.npmjs.org/cors/-/cors-2.8.5.tgz", + "integrity": "sha512-KIHbLJqu73RGr/hnbrO9uBeixNGuvSQjul/jdFvS/KFSIH1hWVd1ng7zOHx+YrEfInLG7q4n6GHQ9cDtxv/P6g==", + "requires": { + "object-assign": "^4", + "vary": "^1" + } + }, + "debug": { + "version": "2.6.9", + "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", + "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", + "requires": { + "ms": "2.0.0" + } + }, + "depd": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/depd/-/depd-1.1.2.tgz", + "integrity": "sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak=" + }, + "destroy": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/destroy/-/destroy-1.0.4.tgz", + "integrity": "sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA=" + }, + "ee-first": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/ee-first/-/ee-first-1.1.1.tgz", + "integrity": "sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0=" + }, + "encodeurl": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-1.0.2.tgz", + "integrity": "sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k=" + }, + "escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg=" + }, + "etag": { + 
"version": "1.8.1", + "resolved": "https://registry.npmjs.org/etag/-/etag-1.8.1.tgz", + "integrity": "sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc=" + }, + "express": { + "version": "4.17.1", + "resolved": "https://registry.npmjs.org/express/-/express-4.17.1.tgz", + "integrity": "sha512-mHJ9O79RqluphRrcw2X/GTh3k9tVv8YcoyY4Kkh4WDMUYKRZUq0h1o0w2rrrxBqM7VoeUVqgb27xlEMXTnYt4g==", + "requires": { + "accepts": "~1.3.7", + "array-flatten": "1.1.1", + "body-parser": "1.19.0", + "content-disposition": "0.5.3", + "content-type": "~1.0.4", + "cookie": "0.4.0", + "cookie-signature": "1.0.6", + "debug": "2.6.9", + "depd": "~1.1.2", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "finalhandler": "~1.1.2", + "fresh": "0.5.2", + "merge-descriptors": "1.0.1", + "methods": "~1.1.2", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "path-to-regexp": "0.1.7", + "proxy-addr": "~2.0.5", + "qs": "6.7.0", + "range-parser": "~1.2.1", + "safe-buffer": "5.1.2", + "send": "0.17.1", + "serve-static": "1.14.1", + "setprototypeof": "1.1.1", + "statuses": "~1.5.0", + "type-is": "~1.6.18", + "utils-merge": "1.0.1", + "vary": "~1.1.2" + } + }, + "finalhandler": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/finalhandler/-/finalhandler-1.1.2.tgz", + "integrity": "sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA==", + "requires": { + "debug": "2.6.9", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "statuses": "~1.5.0", + "unpipe": "~1.0.0" + } + }, + "forwarded": { + "version": "0.1.2", + "resolved": "https://registry.npmjs.org/forwarded/-/forwarded-0.1.2.tgz", + "integrity": "sha1-mMI9qxF1ZXuMBXPozszZGw/xjIQ=" + }, + "fresh": { + "version": "0.5.2", + "resolved": "https://registry.npmjs.org/fresh/-/fresh-0.5.2.tgz", + "integrity": "sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac=" + }, + "http-errors": { + "version": "1.7.2", + "resolved": 
"https://registry.npmjs.org/http-errors/-/http-errors-1.7.2.tgz", + "integrity": "sha512-uUQBt3H/cSIVfch6i1EuPNy/YsRSOUBXTVfZ+yR7Zjez3qjBz6i9+i4zjNaoqcoFVI4lQJ5plg63TvGfRSDCRg==", + "requires": { + "depd": "~1.1.2", + "inherits": "2.0.3", + "setprototypeof": "1.1.1", + "statuses": ">= 1.5.0 < 2", + "toidentifier": "1.0.0" + } + }, + "iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "requires": { + "safer-buffer": ">= 2.1.2 < 3" + } + }, + "inherits": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", + "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" + }, + "ipaddr.js": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-1.9.1.tgz", + "integrity": "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g==" + }, + "media-typer": { + "version": "0.3.0", + "resolved": "https://registry.npmjs.org/media-typer/-/media-typer-0.3.0.tgz", + "integrity": "sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g=" + }, + "merge-descriptors": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", + "integrity": "sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E=" + }, + "methods": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/methods/-/methods-1.1.2.tgz", + "integrity": "sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4=" + }, + "mime": { + "version": "1.6.0", + "resolved": "https://registry.npmjs.org/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==" + }, + "mime-db": { + "version": "1.45.0", + "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.45.0.tgz", + "integrity": 
"sha512-CkqLUxUk15hofLoLyljJSrukZi8mAtgd+yE5uO4tqRZsdsAJKv0O+rFMhVDRJgozy+yG6md5KwuXhD4ocIoP+w==" + }, + "mime-types": { + "version": "2.1.28", + "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.28.tgz", + "integrity": "sha512-0TO2yJ5YHYr7M2zzT7gDU1tbwHxEUWBCLt0lscSNpcdAfFyJOVEpRYNS7EXVcTLNj/25QO8gulHC5JtTzSE2UQ==", + "requires": { + "mime-db": "1.45.0" + } + }, + "ms": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", + "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" + }, + "negotiator": { + "version": "0.6.2", + "resolved": "https://registry.npmjs.org/negotiator/-/negotiator-0.6.2.tgz", + "integrity": "sha512-hZXc7K2e+PgeI1eDBe/10Ard4ekbfrrqG8Ep+8Jmf4JID2bNg7NvCPOZN+kfF574pFQI7mum2AUqDidoKqcTOw==" + }, + "nocache": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/nocache/-/nocache-2.1.0.tgz", + "integrity": "sha512-0L9FvHG3nfnnmaEQPjT9xhfN4ISk0A8/2j4M37Np4mcDesJjHgEUfgPhdCyZuFI954tjokaIj/A3NdpFNdEh4Q==" + }, + "object-assign": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", + "integrity": "sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=" + }, + "on-finished": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/on-finished/-/on-finished-2.3.0.tgz", + "integrity": "sha1-IPEzZIGwg811M3mSoWlxqi2QaUc=", + "requires": { + "ee-first": "1.1.1" + } + }, + "parseurl": { + "version": "1.3.3", + "resolved": "https://registry.npmjs.org/parseurl/-/parseurl-1.3.3.tgz", + "integrity": "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ==" + }, + "path-to-regexp": { + "version": "0.1.7", + "resolved": "https://registry.npmjs.org/path-to-regexp/-/path-to-regexp-0.1.7.tgz", + "integrity": "sha1-32BBeABfUi8V60SQ5yR6G/qmf4w=" + }, + "proxy-addr": { + "version": "2.0.6", + "resolved": "https://registry.npmjs.org/proxy-addr/-/proxy-addr-2.0.6.tgz", + "integrity": 
"sha512-dh/frvCBVmSsDYzw6n926jv974gddhkFPfiN8hPOi30Wax25QZyZEGveluCgliBnqmuM+UJmBErbAUFIoDbjOw==", + "requires": { + "forwarded": "~0.1.2", + "ipaddr.js": "1.9.1" + } + }, + "qs": { + "version": "6.7.0", + "resolved": "https://registry.npmjs.org/qs/-/qs-6.7.0.tgz", + "integrity": "sha512-VCdBRNFTX1fyE7Nb6FYoURo/SPe62QCaAyzJvUjwRaIsc+NePBEniHlvxFmmX56+HZphIGtV0XeCirBtpDrTyQ==" + }, + "range-parser": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/range-parser/-/range-parser-1.2.1.tgz", + "integrity": "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==" + }, + "raw-body": { + "version": "2.4.0", + "resolved": "https://registry.npmjs.org/raw-body/-/raw-body-2.4.0.tgz", + "integrity": "sha512-4Oz8DUIwdvoa5qMJelxipzi/iJIi40O5cGV1wNYp5hvZP8ZN0T+jiNkL0QepXs+EsQ9XJ8ipEDoiH70ySUJP3Q==", + "requires": { + "bytes": "3.1.0", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "unpipe": "1.0.0" + } + }, + "safe-buffer": { + "version": "5.1.2", + "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", + "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" + }, + "safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==" + }, + "send": { + "version": "0.17.1", + "resolved": "https://registry.npmjs.org/send/-/send-0.17.1.tgz", + "integrity": "sha512-BsVKsiGcQMFwT8UxypobUKyv7irCNRHk1T0G680vk88yf6LBByGcZJOTJCrTP2xVN6yI+XjPJcNuE3V4fT9sAg==", + "requires": { + "debug": "2.6.9", + "depd": "~1.1.2", + "destroy": "~1.0.4", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "fresh": "0.5.2", + "http-errors": "~1.7.2", + "mime": "1.6.0", + "ms": "2.1.1", + "on-finished": "~2.3.0", + "range-parser": "~1.2.1", + "statuses": "~1.5.0" + }, + 
"dependencies": { + "ms": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", + "integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" + } + } + }, + "serve-static": { + "version": "1.14.1", + "resolved": "https://registry.npmjs.org/serve-static/-/serve-static-1.14.1.tgz", + "integrity": "sha512-JMrvUwE54emCYWlTI+hGrGv5I8dEwmco/00EvkzIIsR7MqrHonbD9pO2MOfFnpFntl7ecpZs+3mW+XbQZu9QCg==", + "requires": { + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "parseurl": "~1.3.3", + "send": "0.17.1" + } + }, + "setprototypeof": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/setprototypeof/-/setprototypeof-1.1.1.tgz", + "integrity": "sha512-JvdAWfbXeIGaZ9cILp38HntZSFSo3mWg6xGcJJsd+d4aRMOqauag1C63dJfDw7OaMYwEbHMOxEZ1lqVRYP2OAw==" + }, + "statuses": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/statuses/-/statuses-1.5.0.tgz", + "integrity": "sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow=" + }, + "toidentifier": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/toidentifier/-/toidentifier-1.0.0.tgz", + "integrity": "sha512-yaOH/Pk/VEhBWWTlhI+qXxDFXlejDGcQipMlyxda9nthulaxLZUNcUqFxokp0vcYnvteJln5FNQDRrxj3YcbVw==" + }, + "type-is": { + "version": "1.6.18", + "resolved": "https://registry.npmjs.org/type-is/-/type-is-1.6.18.tgz", + "integrity": "sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g==", + "requires": { + "media-typer": "0.3.0", + "mime-types": "~2.1.24" + } + }, + "unpipe": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", + "integrity": "sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw=" + }, + "utils-merge": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/utils-merge/-/utils-merge-1.0.1.tgz", + "integrity": "sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM=" + }, + "vary": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/vary/-/vary-1.1.2.tgz", + 
"integrity": "sha1-IpnwLG3tMNSllhsLn3RSShj2NPw=" + } + } +} diff --git a/wasm/test_page/package.json b/wasm/test_page/package.json new file mode 100644 index 000000000..20af6d2ab --- /dev/null +++ b/wasm/test_page/package.json @@ -0,0 +1,7 @@ +{ + "dependencies": { + "cors": "^2.8.5", + "express": "^4.17.1", + "nocache": "^2.1.0" + } +} diff --git a/wasm/test_page/start_server.sh b/wasm/test_page/start_server.sh new file mode 100644 index 000000000..b83344b8a --- /dev/null +++ b/wasm/test_page/start_server.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +cp ../../build-wasm/wasm/bergamot-translator-worker.data . +cp ../../build-wasm/wasm/bergamot-translator-worker.js . +cp ../../build-wasm/wasm/bergamot-translator-worker.wasm . +cp ../../build-wasm/wasm/bergamot-translator-worker.worker.js . +npm install +node bergamot-httpserver.js \ No newline at end of file From 47323d21b93795e19d82a499bfb13b71f7032c40 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 13:05:05 +0000 Subject: [PATCH 078/442] Getting rid of unused variables in Batch --- src/translator/request.h | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/src/translator/request.h b/src/translator/request.h index 5fb9c3c5d..eab0d4b2a 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -98,7 +98,10 @@ typedef std::vector RequestSentences; class Batch { public: Batch() { reset(); } - void reset() { Id_ = 0, numTokens_ = 0, maxLength_ = 0, sentences_.clear(); } + void reset() { + Id_ = 0; + sentences_.clear(); + } // Convenience function to determine poison. 
bool isPoison() { return (Id_ == -1); } static Batch poison() { @@ -108,15 +111,17 @@ class Batch { } void log() { + int numTokens{0}, maxLength{0}; + for (auto &sentence : sentences_) { + numTokens += sentence.numTokens(); + maxLength = std::max(maxLength, static_cast(sentence.numTokens())); + } + LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_, - numTokens_, maxLength_, sentences_.size()); + numTokens, maxLength, sentences_.size()); } - void add(const RequestSentence &sentence) { - sentences_.push_back(sentence); - maxLength_ = std::max(sentence.numTokens(), maxLength_); - numTokens_ += sentence.numTokens(); - } + void add(const RequestSentence &sentence) { sentences_.push_back(sentence); } size_t size() { return sentences_.size(); } @@ -137,7 +142,6 @@ class Batch { private: int Id_; - size_t numTokens_, maxLength_; RequestSentences sentences_; }; From ecc91c51e3b439b32173e3e4a821fdfe1a538436 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 13:23:46 +0000 Subject: [PATCH 079/442] BatchTranslator* -> unique_ptr --- src/translator/service.cpp | 3 ++- src/translator/service.h | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/src/translator/service.cpp b/src/translator/service.cpp index c93aa5f00..bdfb7e992 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -23,7 +23,8 @@ Service::Service(Ptr options) } } else { marian::DeviceId deviceId(/*cpuId=*/0, DeviceType::cpu); - translator = new BatchTranslator(deviceId, vocabs_, options); + translator = + UPtr(new BatchTranslator(deviceId, vocabs_, options)); } } diff --git a/src/translator/service.h b/src/translator/service.h index c57e609a7..db01468c7 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -71,7 +71,7 @@ class Service { std::vector workers_; // Optional - BatchTranslator *translator{nullptr}; + UPtr translator{nullptr}; }; std::vector> loadVocabularies(Ptr options); From 
0dbc8612c2431722152ca925f1bd7152187a399a Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Sun, 14 Feb 2021 09:15:08 -0800 Subject: [PATCH 080/442] Adding missing bergamot-httpserver.js --- wasm/test_page/bergamot-httpserver.js | 39 +++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) create mode 100644 wasm/test_page/bergamot-httpserver.js diff --git a/wasm/test_page/bergamot-httpserver.js b/wasm/test_page/bergamot-httpserver.js new file mode 100644 index 000000000..f23b3e750 --- /dev/null +++ b/wasm/test_page/bergamot-httpserver.js @@ -0,0 +1,39 @@ +require(__dirname + '/helper.js'); + +var http = require('http'); +var express = require('express'); +var app = express(); +var server = http.createServer(app); +var fs = require('fs'); +var url = require('url'); +const nocache = require('nocache'); +const cors = require('cors'); + +app.use(cors()) +app.use(nocache()); +app.get('/*.*' , cors(), function(req, res) { + var options = url.parse(req.url, true); + var mime = Helper.getMime(options); + serveFile(res, options.pathname, mime); +}); + +function serveFile(res, pathName, mime) { + mime = mime || 'text/html'; + fs.readFile(__dirname + '/' + pathName, function (err, data) { + if (err) { + res.writeHead(500, {"Content-Type": "text/plain"}); + return res.end('Error loading ' + pathName + " with Error: " + err); + } + res.header('Cross-Origin-Embedder-Policy','require-corp'); + res.header('Cross-Origin-Opener-Policy','same-origin'); + res.writeHead(200, {"Content-Type": mime}); + res.end(data); + }); +} + +server.listen(8000); +console.log('HTTP and BinaryJS server started on port 8000'); + + + + From 5bd4a1a3c0ef388249794298b5ed2c0b1cf92d05 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 19:58:29 +0000 Subject: [PATCH 081/442] Refactor: marian-TranslationResult and associated marian-TranslationResult has more guards in place. Switching to a construction on demand model for sentenceMappings. 
These changes propagate to bergamot translation results. Integration broke with the change in marian's internals, which are updated accordingly to get back functionality. Changes revealed a few bugs, which are fixed: - ConfigParser already discovered in wasm-integration (https://github.com/browsermt/bergamot-translator/commit/a06530e92b6d16527487c8fa0ead4ae04f0ddbb5). - Lambda captures and undefined values in DeviceId --- app/main-mts.cpp | 2 +- app/marian-decoder-new.cpp | 4 +- src/translator/TranslationModel.cpp | 18 +++++--- src/translator/parser.h | 3 +- src/translator/service.cpp | 10 ++--- src/translator/translation_result.cpp | 65 +++++++++++++++++---------- src/translator/translation_result.h | 42 +++++------------ 7 files changed, 75 insertions(+), 69 deletions(-) diff --git a/app/main-mts.cpp b/app/main-mts.cpp index c94ff306c..d8e756704 100644 --- a/app/main-mts.cpp +++ b/app/main-mts.cpp @@ -26,7 +26,7 @@ int main(int argc, char *argv[]) { service.translate(std::move(input)); translation_result_future.wait(); const TranslationResult &translation_result = translation_result_future.get(); - std::cout << translation_result.getTranslatedText() << std::endl; + std::cout << translation_result.translation() << std::endl; // Stop Service.
service.stop(); diff --git a/app/marian-decoder-new.cpp b/app/marian-decoder-new.cpp index 62b1bb4b3..6e44fb777 100644 --- a/app/marian-decoder-new.cpp +++ b/app/marian-decoder-new.cpp @@ -54,8 +54,8 @@ int main(int argc, char *argv[]) { translation_result_future.wait(); const TranslationResult &translation_result = translation_result_future.get(); - marian_decoder_minimal(translation_result.getHistories(), - service.targetVocab(), options); + marian_decoder_minimal(translation_result.histories(), service.targetVocab(), + options); LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); service.stop(); diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index f501678cf..9c55422ef 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -14,6 +14,7 @@ // All local project includes #include "TranslationModel.h" +#include "translator/parser.h" #include "translator/service.h" std::shared_ptr parseOptions(const std::string &config) { @@ -34,7 +35,7 @@ std::shared_ptr parseOptions(const std::string &config) { // Error: Aborted from void unhandledException() in // 3rd_party/marian-dev/src/common/logging.cpp:113 - marian::ConfigParser configParser(marian::cli::mode::translation); + marian::ConfigParser configParser = marian::bergamot::createConfigParser(); const YAML::Node &defaultConfig = configParser.getConfig(); options.merge(defaultConfig); @@ -70,18 +71,25 @@ TranslationModel::translate(std::vector &&texts, intermediate.wait(); auto mTranslationResult(std::move(intermediate.get())); + // This mess because marian::string_view != std::string_view + std::string source, translation; + marian::bergamot::TranslationResult::SentenceMappings mSentenceMappings; + mTranslationResult.move(source, translation, mSentenceMappings); + // Convert to UnifiedAPI::TranslationResult TranslationResult::SentenceMappings sentenceMappings; - for (auto &p : mTranslationResult.getSentenceMappings()) { + for (auto &p 
: mSentenceMappings) { std::string_view src(p.first.data(), p.first.size()), tgt(p.second.data(), p.second.size()); sentenceMappings.emplace_back(src, tgt); } // In place construction. - translationResults.emplace_back(std::move(mTranslationResult.source_), - std::move(mTranslationResult.translation_), - std::move(sentenceMappings)); + translationResults.emplace_back( + std::move(source), // &&mTranslationResult.source_ + std::move(translation), // &&mTranslationResult.translation_ + std::move(sentenceMappings) // &&sentenceMappings + ); } promise.set_value(std::move(translationResults)); diff --git a/src/translator/parser.h b/src/translator/parser.h index e273d6aea..606b6a47b 100644 --- a/src/translator/parser.h +++ b/src/translator/parser.h @@ -5,7 +5,8 @@ namespace marian { namespace bergamot { -marian::ConfigParser createConfigParser() { + +inline marian::ConfigParser createConfigParser() { marian::ConfigParser cp(marian::cli::mode::translation); cp.addOption( "--ssplit-prefix-file", "Bergamot Options", diff --git a/src/translator/service.cpp b/src/translator/service.cpp index bdfb7e992..ef2bacb64 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -15,11 +15,11 @@ Service::Service(Ptr options) if (numWorkers_ > 0) { workers_.reserve(numWorkers_); - for (int cpuId = 0; cpuId < numWorkers_; cpuId++) { - workers_.emplace_back([&] { - marian::DeviceId deviceId(cpuId, DeviceType::cpu); - translation_loop(deviceId, pcqueue_, vocabs_, options); - }); + for (size_t cpuId = 0; cpuId < numWorkers_; cpuId++) { + marian::DeviceId deviceId(cpuId, DeviceType::cpu); + workers_.emplace_back(translation_loop, // Function + deviceId, std::ref(pcqueue_), std::ref(vocabs_), + options); } } else { marian::DeviceId deviceId(/*cpuId=*/0, DeviceType::cpu); diff --git a/src/translator/translation_result.cpp b/src/translator/translation_result.cpp index d69259f84..ee147be42 100644 --- a/src/translator/translation_result.cpp +++ 
b/src/translator/translation_result.cpp @@ -14,22 +14,26 @@ TranslationResult::TranslationResult(std::string &&source, : source_(std::move(source)), sourceRanges_(std::move(sourceRanges)), histories_(std::move(histories)) { - std::vector sourceMappings; - std::vector targetMappings; + constructTargetProperties(vocabs); +} - // Process sourceMappings into sourceMappings. - sourceMappings.reserve(sourceRanges_.size()); - for (int i = 0; i < sourceRanges_.size(); i++) { - string_view first = sourceRanges_[i].front(); - string_view last = sourceRanges_[i].back(); - sourceMappings.emplace_back(first.data(), last.end() - first.begin()); - } +void TranslationResult::move(std::string &source, std::string &translation, + SentenceMappings &sentenceMappings) { + + constructSentenceMappings(sentenceMappings); + // Totally illegal stuff. + source = std::move(source_); + translation = std::move(translation_); - // Compiles translations into a single std::string translation_ - // Current implementation uses += on std::string, multiple resizes. - // Stores ByteRanges as indices first, followed by conversion into - // string_views. - // TODO(jerin): Add token level string_views here as well. + // The above assignment expects source, target be moved. + // which makes the following invalid, hence required to be cleared. + sourceRanges_.clear(); + targetRanges_.clear(); + histories_.clear(); +} + +void TranslationResult::constructTargetProperties( + std::vector> &vocabs) { std::vector> translationRanges; size_t offset{0}; bool first{true}; @@ -52,21 +56,36 @@ TranslationResult::TranslationResult(std::string &&source, offset += decoded.size(); } - // Converting ByteRanges as indices into string_views. - targetMappings.reserve(translationRanges.size()); + // TODO(@jerinphilip): + // Currently considers target tokens as whole text. Needs + // to be further enhanced in marian-dev to extract alignments. 
for (auto &range : translationRanges) { + std::vector targetMappings; const char *begin = &translation_[range.first]; targetMappings.emplace_back(begin, range.second); + targetRanges_.push_back(std::move(targetMappings)); } +} - // Surely, let's add sentenceMappings_ - for (auto src = sourceMappings.begin(), tgt = targetMappings.begin(); - src != sourceMappings.end() && tgt != targetMappings.end(); - ++src, ++tgt) { - sentenceMappings_.emplace_back(*src, *tgt); - auto &t = sentenceMappings_.back(); +void TranslationResult::constructSentenceMappings( + TranslationResult::SentenceMappings &sentenceMappings) { + + for (int i = 0; i < sourceRanges_.size(); i++) { + string_view first, last; + + // Handle source-sentence + first = sourceRanges_[i].front(); + last = sourceRanges_[i].back(); + string_view src_sentence(first.data(), last.end() - first.begin()); + + // Handle target-sentence + first = targetRanges_[i].front(); + last = targetRanges_[i].back(); + string_view tgt_sentence(first.data(), last.end() - first.begin()); + + // Add both into sentence-mappings + sentenceMappings.emplace_back(src_sentence, tgt_sentence); } } - } // namespace bergamot } // namespace marian diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h index edc9a8ddd..5903145ad 100644 --- a/src/translator/translation_result.h +++ b/src/translator/translation_result.h @@ -22,53 +22,31 @@ class TranslationResult { : source_(std::move(other.source_)), translation_(std::move(other.translation_)), sourceRanges_(std::move(other.sourceRanges_)), - sentenceMappings_(std::move(other.sentenceMappings_)), + targetRanges_(std::move(other.targetRanges_)), histories_(std::move(other.histories_)){}; TranslationResult(const TranslationResult &) = delete; TranslationResult &operator=(const TranslationResult &) = delete; - // Returns const references to source and translated texts, for external - // consumption. 
- - const std::string &getOriginalText() const { return source_; } - const std::string &getTranslatedText() const { return translation_; } - - // A mapping of string_views in the source_ and translation_ are provide as a - // pair for external consumption. Each entry corresponding - // to a (source-sentence, target-sentence). - typedef std::vector> SentenceMappings; - const SentenceMappings &getSentenceMappings() const { - return sentenceMappings_; - } - // Return the Quality scores of the translated text. - // Not implemented currently, commenting out. - // const QualityScore &getQualityScore() const { return qualityScore; } + void move(std::string &source, std::string &target, + SentenceMappings &sentenceMappings); - // For development use to benchmark with marian-decoder. - const Histories &getHistories() const { return histories_; } + const Histories &histories() const { return histories_; } + const std::string &source() const { return source_; } + const std::string &translation() const { return translation_; } - // @jerinphilip: Why are these members no longer-private? For move-semantics - // with consistent string_views for bergamot-translator. +private: + void constructTargetProperties(std::vector> &vocabs); + void constructSentenceMappings(SentenceMappings &); std::string source_; std::string translation_; - // Adding the following to complete bergamot-translator spec, redundant while - // sourceMappings_ and targetMappings_ exists or vice-versa. - - SentenceMappings sentenceMappings_; - -private: - // Histories are currently required for interoperability with OutputPrinter - // and OutputCollector and hence comparisons with marian-decoder. - // Future hook to gain alignments. Histories histories_; - - // string_views at the token level. 
std::vector sourceRanges_; + std::vector targetRanges_; }; } // namespace bergamot } // namespace marian From 0fc6105df49a4e0f05e1d382ea9909776ad3aeec Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 20:27:53 +0000 Subject: [PATCH 082/442] No more two TranslationResults (sort-of) To avoid confusion, this commit renames marian::bergamot::TranslationResult -> marian::bergamot::Response. Usages of marian::bergamot::TranslationResults are updated across the source to be consistent with the change and get source back working. --- app/main-mts.cpp | 13 ++++++------- app/marian-decoder-new.cpp | 14 ++++++-------- src/translator/TranslationModel.cpp | 10 +++++----- src/translator/TranslationModel.h | 3 ++- src/translator/request.cpp | 11 +++++------ src/translator/request.h | 4 ++-- src/translator/service.cpp | 16 ++++++++-------- src/translator/service.h | 10 +++++----- src/translator/translation_result.cpp | 17 ++++++++--------- src/translator/translation_result.h | 14 ++++++-------- 10 files changed, 53 insertions(+), 59 deletions(-) diff --git a/app/main-mts.cpp b/app/main-mts.cpp index d8e756704..b5a4938b0 100644 --- a/app/main-mts.cpp +++ b/app/main-mts.cpp @@ -19,14 +19,13 @@ int main(int argc, char *argv[]) { std::ostringstream std_input; std_input << std::cin.rdbuf(); std::string input = std_input.str(); - using marian::bergamot::TranslationResult; + using marian::bergamot::Response; - // Wait on future until TranslationResult is complete - std::future translation_result_future = - service.translate(std::move(input)); - translation_result_future.wait(); - const TranslationResult &translation_result = translation_result_future.get(); - std::cout << translation_result.translation() << std::endl; + // Wait on future until Response is complete + std::future responseFuture = service.translate(std::move(input)); + responseFuture.wait(); + const Response &response = responseFuture.get(); + std::cout << response.translation() << std::endl; // Stop 
Service. service.stop(); diff --git a/app/marian-decoder-new.cpp b/app/marian-decoder-new.cpp index 6e44fb777..8988310aa 100644 --- a/app/marian-decoder-new.cpp +++ b/app/marian-decoder-new.cpp @@ -46,16 +46,14 @@ int main(int argc, char *argv[]) { std::ostringstream std_input; std_input << std::cin.rdbuf(); std::string input = std_input.str(); - using marian::bergamot::TranslationResult; + using marian::bergamot::Response; - // Wait on future until TranslationResult is complete - std::future translation_result_future = - service.translate(std::move(input)); - translation_result_future.wait(); - const TranslationResult &translation_result = translation_result_future.get(); + // Wait on future until Response is complete + std::future responseFuture = service.translate(std::move(input)); + responseFuture.wait(); + const Response &response = responseFuture.get(); - marian_decoder_minimal(translation_result.histories(), service.targetVocab(), - options); + marian_decoder_minimal(response.histories(), service.targetVocab(), options); LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); service.stop(); diff --git a/src/translator/TranslationModel.cpp b/src/translator/TranslationModel.cpp index 9c55422ef..a5d396e36 100644 --- a/src/translator/TranslationModel.cpp +++ b/src/translator/TranslationModel.cpp @@ -69,12 +69,12 @@ TranslationModel::translate(std::vector &&texts, // Collect future as marian::bergamot::TranslationResult auto intermediate = service_.translate(std::move(text)); intermediate.wait(); - auto mTranslationResult(std::move(intermediate.get())); + auto marianResponse(std::move(intermediate.get())); // This mess because marian::string_view != std::string_view std::string source, translation; - marian::bergamot::TranslationResult::SentenceMappings mSentenceMappings; - mTranslationResult.move(source, translation, mSentenceMappings); + marian::bergamot::Response::SentenceMappings mSentenceMappings; + marianResponse.move(source, translation, 
mSentenceMappings); // Convert to UnifiedAPI::TranslationResult TranslationResult::SentenceMappings sentenceMappings; @@ -86,8 +86,8 @@ TranslationModel::translate(std::vector &&texts, // In place construction. translationResults.emplace_back( - std::move(source), // &&mTranslationResult.source_ - std::move(translation), // &&mTranslationResult.translation_ + std::move(source), // &&marianResponse.source_ + std::move(translation), // &&marianResponse.translation_ std::move(sentenceMappings) // &&sentenceMappings ); } diff --git a/src/translator/TranslationModel.h b/src/translator/TranslationModel.h index c922538a3..5f590d9e9 100644 --- a/src/translator/TranslationModel.h +++ b/src/translator/TranslationModel.h @@ -24,7 +24,8 @@ */ class TranslationModel : public AbstractTranslationModel { public: - /* Construct the model using the model configuration options as yaml-formatted string + /* Construct the model using the model configuration options as yaml-formatted + * string */ TranslationModel(const std::string &config); diff --git a/src/translator/request.cpp b/src/translator/request.cpp index a743389b4..5433699f0 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -14,11 +14,11 @@ Request::Request(unsigned int Id, int lineNumberBegin, std::vector> &vocabs, std::string &&source, Segments &&segments, std::vector &&sourceAlignments, - std::promise translationResultPromise) + std::promise responsePromise) : Id_(Id), lineNumberBegin_(lineNumberBegin), vocabs_(&vocabs), source_(std::move(source)), segments_(std::move(segments)), sourceAlignments_(std::move(sourceAlignments)), - response_(std::move(translationResultPromise)) { + response_(std::move(responsePromise)) { counter_ = segments_.size(); histories_.resize(segments_.size(), nullptr); @@ -47,10 +47,9 @@ void Request::processHistory(size_t index, Ptr history) { void Request::completeRequest() { // Request no longer needs to hold the content, can transfer it to - // TranslationResult. 
- TranslationResult translation_result(std::move(source_), - std::move(sourceAlignments_), - std::move(histories_), *vocabs_); + // Response. + Response translation_result(std::move(source_), std::move(sourceAlignments_), + std::move(histories_), *vocabs_); response_.set_value(std::move(translation_result)); } diff --git a/src/translator/request.h b/src/translator/request.h index eab0d4b2a..ddd6cccc0 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -48,13 +48,13 @@ class Request { std::vector sourceAlignments_; std::vector> histories_; - std::promise<TranslationResult> response_; + std::promise<Response> response_; public: Request(unsigned int Id, int lineNumberBegin, std::vector> &vocabs_, std::string &&source, Segments &&segments, std::vector &&sourceAlignments, - std::promise<TranslationResult> translationResultPromise); + std::promise<Response> responsePromise); // Obtain the count of tokens in the segment corresponding to index. Used to // insert sentence from multiple requests into the corresponding size bucket. diff --git a/src/translator/service.cpp b/src/translator/service.cpp index ef2bacb64..4ab539fa8 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -28,11 +28,11 @@ Service::Service(Ptr options) } } -std::future<TranslationResult> Service::translateWithCopy(std::string input) { +std::future<Response> Service::translateWithCopy(std::string input) { return translate(std::move(input)); } -std::future<TranslationResult> Service::translate(std::string &&input) { +std::future<Response> Service::translate(std::string &&input) { // Takes in a blob of text. Segments and std::vector are // extracted from the input (blob of text) and used to construct a Request // along with a promise.
promise value is set by the worker completing a @@ -49,13 +49,13 @@ std::future<TranslationResult> Service::translate(std::string &&input) { std::vector sourceAlignments; text_processor_.process(input, segments, sourceAlignments); - std::promise<TranslationResult> translationResultPromise; - auto future = translationResultPromise.get_future(); + std::promise<Response> responsePromise; + auto future = responsePromise.get_future(); - Ptr request = New( - requestId_++, /* lineNumberBegin = */ 0, vocabs_, std::move(input), - std::move(segments), std::move(sourceAlignments), - std::move(translationResultPromise)); + Ptr request = + New(requestId_++, /* lineNumberBegin = */ 0, vocabs_, + std::move(input), std::move(segments), + std::move(sourceAlignments), std::move(responsePromise)); batcher_.addWholeRequest(request); diff --git a/src/translator/service.h b/src/translator/service.h index db01468c7..6f26bc8a6 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -25,17 +25,17 @@ class Service { // options = ...; // service = Service(options); // std::string input_blob = "Hello World"; - // std::future<TranslationResult> + // std::future<Response> // response = service.translate(std::move(input_blob)); // response.wait(); - // TranslationResult result = response.get(); + // Response result = response.get(); public: explicit Service(Ptr options); // Constructs new string copying, calls translate internally. - std::future<TranslationResult> translateWithCopy(std::string input); - std::future<TranslationResult> translate(std::string &&input); + std::future<Response> translateWithCopy(std::string input); + std::future<Response> translate(std::string &&input); void stop(); @@ -49,7 +49,7 @@ class Service { int numWorkers_; // vocabs are used to construct a Request, which later uses it to construct - // TranslationResult (decode from words to string).
std::vector> vocabs_; // ORDER DEPENDENCY // Consists of: diff --git a/src/translator/translation_result.cpp b/src/translator/translation_result.cpp index ee147be42..58f092630 100644 --- a/src/translator/translation_result.cpp +++ b/src/translator/translation_result.cpp @@ -7,18 +7,17 @@ namespace marian { namespace bergamot { -TranslationResult::TranslationResult(std::string &&source, - std::vector &&sourceRanges, - Histories &&histories, - std::vector> &vocabs) +Response::Response(std::string &&source, + std::vector &&sourceRanges, + Histories &&histories, std::vector> &vocabs) : source_(std::move(source)), sourceRanges_(std::move(sourceRanges)), histories_(std::move(histories)) { constructTargetProperties(vocabs); } -void TranslationResult::move(std::string &source, std::string &translation, - SentenceMappings &sentenceMappings) { +void Response::move(std::string &source, std::string &translation, + SentenceMappings &sentenceMappings) { constructSentenceMappings(sentenceMappings); // Totally illegal stuff. 
@@ -32,7 +31,7 @@ void TranslationResult::move(std::string &source, std::string &translation, histories_.clear(); } -void TranslationResult::constructTargetProperties( +void Response::constructTargetProperties( std::vector> &vocabs) { std::vector> translationRanges; size_t offset{0}; @@ -67,8 +66,8 @@ void TranslationResult::constructTargetProperties( } } -void TranslationResult::constructSentenceMappings( - TranslationResult::SentenceMappings &sentenceMappings) { +void Response::constructSentenceMappings( + Response::SentenceMappings &sentenceMappings) { for (int i = 0; i < sourceRanges_.size(); i++) { string_view first, last; diff --git a/src/translator/translation_result.h b/src/translator/translation_result.h index 5903145ad..6ed892732 100644 --- a/src/translator/translation_result.h +++ b/src/translator/translation_result.h @@ -11,22 +11,20 @@ namespace marian { namespace bergamot { -class TranslationResult { +class Response { public: - TranslationResult(std::string &&source, - std::vector &&sourceRanges, - Histories &&histories, - std::vector> &vocabs); + Response(std::string &&source, std::vector &&sourceRanges, + Histories &&histories, std::vector> &vocabs); - TranslationResult(TranslationResult &&other) + Response(Response &&other) : source_(std::move(other.source_)), translation_(std::move(other.translation_)), sourceRanges_(std::move(other.sourceRanges_)), targetRanges_(std::move(other.targetRanges_)), histories_(std::move(other.histories_)){}; - TranslationResult(const TranslationResult &) = delete; - TranslationResult &operator=(const TranslationResult &) = delete; + Response(const Response &) = delete; + Response &operator=(const Response &) = delete; typedef std::vector> SentenceMappings; From 370e9e2fb619b5f45693a3d4e6e3dac1442b6fed Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 20:35:41 +0000 Subject: [PATCH 083/442] {translation_result -> response}.h; propagates; --- app/main-mts.cpp | 2 +- app/marian-decoder-new.cpp | 2 +-
src/translator/CMakeLists.txt | 2 +- src/translator/request.cpp | 2 +- src/translator/request.h | 2 +- src/translator/{translation_result.cpp => response.cpp} | 2 +- src/translator/{translation_result.h => response.h} | 6 +++--- src/translator/service.h | 2 +- 8 files changed, 10 insertions(+), 10 deletions(-) rename src/translator/{translation_result.cpp => response.cpp} (98%) rename src/translator/{translation_result.h => response.h} (91%) diff --git a/app/main-mts.cpp b/app/main-mts.cpp index b5a4938b0..78967be0e 100644 --- a/app/main-mts.cpp +++ b/app/main-mts.cpp @@ -7,8 +7,8 @@ #include "common/utils.h" #include "marian.h" #include "translator/parser.h" +#include "translator/response.h" #include "translator/service.h" -#include "translator/translation_result.h" int main(int argc, char *argv[]) { auto cp = marian::bergamot::createConfigParser(); diff --git a/app/marian-decoder-new.cpp b/app/marian-decoder-new.cpp index 8988310aa..f8079096d 100644 --- a/app/marian-decoder-new.cpp +++ b/app/marian-decoder-new.cpp @@ -11,8 +11,8 @@ #include "translator/output_collector.h" #include "translator/output_printer.h" #include "translator/parser.h" +#include "translator/response.h" #include "translator/service.h" -#include "translator/translation_result.h" void marian_decoder_minimal(const marian::Histories &histories, marian::Ptr targetVocab, diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 16c3db962..c279ab975 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -10,7 +10,7 @@ add_library(bergamot-translator STATIC request.cpp service.cpp batcher.cpp - translation_result.cpp + response.cpp ) target_link_libraries(bergamot-translator marian ssplit) diff --git a/src/translator/request.cpp b/src/translator/request.cpp index 5433699f0..23bd67963 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -1,7 +1,7 @@ #include "request.h" #include "definitions.h" -#include "translation_result.h" 
+#include "response.h" #include "common/logging.h" diff --git a/src/translator/request.h b/src/translator/request.h index ddd6cccc0..8912a497d 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -22,7 +22,7 @@ #define SRC_BERGAMOT_REQUEST_H_ #include "definitions.h" -#include "translation_result.h" +#include "response.h" #include "common/logging.h" #include "data/types.h" diff --git a/src/translator/translation_result.cpp b/src/translator/response.cpp similarity index 98% rename from src/translator/translation_result.cpp rename to src/translator/response.cpp index 58f092630..d40f88da7 100644 --- a/src/translator/translation_result.cpp +++ b/src/translator/response.cpp @@ -1,4 +1,4 @@ -#include "translation_result.h" +#include "response.h" #include "common/logging.h" #include "data/alignment.h" diff --git a/src/translator/translation_result.h b/src/translator/response.h similarity index 91% rename from src/translator/translation_result.h rename to src/translator/response.h index 6ed892732..57377176d 100644 --- a/src/translator/translation_result.h +++ b/src/translator/response.h @@ -1,5 +1,5 @@ -#ifndef SRC_BERGAMOT_TRANSLATION_RESULT_H_ -#define SRC_BERGAMOT_TRANSLATION_RESULT_H_ +#ifndef SRC_BERGAMOT_RESPONSE_H_ +#define SRC_BERGAMOT_RESPONSE_H_ #include "data/types.h" #include "definitions.h" @@ -49,4 +49,4 @@ class Response { } // namespace bergamot } // namespace marian -#endif // SRC_BERGAMOT_TRANSLATION_RESULT_H_ +#endif // SRC_BERGAMOT_RESPONSE_H_ diff --git a/src/translator/service.h b/src/translator/service.h index 6f26bc8a6..38a45c6d0 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -4,8 +4,8 @@ #include "batch_translator.h" #include "batcher.h" #include "pcqueue.h" +#include "response.h" #include "text_processor.h" -#include "translation_result.h" #include #include From be455a3da101132c5d7c3a283b90cc1cffd8a119 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 22:08:17 +0000 Subject: [PATCH 
084/442] Straightening multithreading in translator workers BatchTranslators are now held in Service. Threads are separate, and constructed via lambdas. Retaining BatchTranslator class and member function (Probably a matter of taste I guess). This should eliminate complaints in (#10), hopefully. --- src/translator/batch_translator.cpp | 12 +++++------- src/translator/batch_translator.h | 6 ++---- src/translator/service.cpp | 28 +++++++++++++++++++--------- src/translator/service.h | 4 +--- 4 files changed, 27 insertions(+), 23 deletions(-) diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp index 13eb58a21..7da63cf57 100644 --- a/src/translator/batch_translator.cpp +++ b/src/translator/batch_translator.cpp @@ -10,7 +10,9 @@ namespace bergamot { BatchTranslator::BatchTranslator(DeviceId const device, std::vector> &vocabs, Ptr options) - : device_(device), options_(options), vocabs_(&vocabs) { + : device_(device), options_(options), vocabs_(&vocabs) {} + +void BatchTranslator::initialize() { // Initializes the graph. if (options_->hasAndNotEmpty("shortlist")) { int srcIdx = 0, trgIdx = 1; @@ -93,11 +95,7 @@ void BatchTranslator::translate(Batch &batch) { batch.completeBatch(histories); } -void translation_loop(DeviceId const &device, PCQueue &pcqueue, - std::vector> &vocabs, - Ptr options) { - - BatchTranslator translator(device, vocabs, options); +void BatchTranslator::consumeFrom(PCQueue &pcqueue) { Batch batch; Histories histories; while (true) { @@ -105,7 +103,7 @@ void translation_loop(DeviceId const &device, PCQueue &pcqueue, if (batch.isPoison()) { return; } else { - translator.translate(batch); + translate(batch); } } } diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h index 2ee4e04ef..83b911ceb 100644 --- a/src/translator/batch_translator.h +++ b/src/translator/batch_translator.h @@ -28,6 +28,8 @@ class BatchTranslator { // convenience function for logging. 
TODO(jerin) std::string _identifier() { return "worker" + std::to_string(device_.no); } void translate(Batch &batch); + void initialize(); + void consumeFrom(PCQueue &pcqueue); private: Ptr options_; @@ -38,10 +40,6 @@ class BatchTranslator { Ptr slgen_; }; -void translation_loop(DeviceId const &device, PCQueue &pcqueue, - std::vector> &vocabs, - Ptr options); - } // namespace bergamot } // namespace marian diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 4ab539fa8..1b33558e7 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -13,18 +13,28 @@ Service::Service(Ptr options) text_processor_(vocabs_, options), batcher_(options), pcqueue_(2 * options->get("cpu-threads")) { - if (numWorkers_ > 0) { + if (numWorkers_ == 0) { + // In case workers are 0, a single translator is created and initialized + // in the main thread. + marian::DeviceId deviceId(/*cpuId=*/0, DeviceType::cpu); + translators_.emplace_back(deviceId, vocabs_, options); + translators_.back().initialize(); + } else { + // If the number of workers specified is greater than 0, translators_ are populated with + // uninitialized instances. These are then initialized inside + // individual threads and set to consume from the producer-consumer queue.
workers_.reserve(numWorkers_); + translators_.reserve(numWorkers_); for (size_t cpuId = 0; cpuId < numWorkers_; cpuId++) { marian::DeviceId deviceId(cpuId, DeviceType::cpu); - workers_.emplace_back(translation_loop, // Function - deviceId, std::ref(pcqueue_), std::ref(vocabs_), - options); + translators_.emplace_back(deviceId, vocabs_, options); + + auto &translator = translators_.back(); + workers_.emplace_back([&translator, this] { + translator.initialize(); + translator.consumeFrom(pcqueue_); + }); } - } else { - marian::DeviceId deviceId(/*cpuId=*/0, DeviceType::cpu); - translator = - UPtr(new BatchTranslator(deviceId, vocabs_, options)); } } @@ -65,7 +75,7 @@ std::future Service::translate(std::string &&input) { // Queue single-threaded Batch batch; while (batcher_ >> batch) { - translator->translate(batch); + translators_[0].translate(batch); } } diff --git a/src/translator/service.h b/src/translator/service.h index 38a45c6d0..55b754a2f 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -68,10 +68,8 @@ class Service { TextProcessor text_processor_; // ORDER DEPENDENCY Batcher batcher_; PCQueue pcqueue_; + std::vector translators_; std::vector workers_; - - // Optional - UPtr translator{nullptr}; }; std::vector> loadVocabularies(Ptr options); From 45a8309c6972b121d62f1e9329267f752b8c796b Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 14 Feb 2021 22:28:08 +0000 Subject: [PATCH 085/442] Missed translation_result -> response rename --- src/translator/request.cpp | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/translator/request.cpp b/src/translator/request.cpp index 23bd67963..9317f697e 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -48,9 +48,9 @@ void Request::processHistory(size_t index, Ptr history) { void Request::completeRequest() { // Request no longer needs to hold the content, can transfer it to // Response. 
- Response translation_result(std::move(source_), std::move(sourceAlignments_), - std::move(histories_), *vocabs_); - response_.set_value(std::move(translation_result)); + Response response(std::move(source_), std::move(sourceAlignments_), + std::move(histories_), *vocabs_); + response_.set_value(std::move(response)); } bool Request::operator<(const Request &b) const { From d27a96fc53add7b36d063aaf86c528bc03798eea Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 10:04:15 +0200 Subject: [PATCH 086/442] Updated wasm readme --- wasm/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/wasm/README.md b/wasm/README.md index 6be620956..131f9eb06 100644 --- a/wasm/README.md +++ b/wasm/README.md @@ -37,11 +37,12 @@ You can also see everything in action by following the next steps: * Start the test webserver (ensure you have the latest nodejs installed) ``` cd test_page -bash start_server +bash start_server.sh ``` * Open any of the browsers below * Firefox Nightly +87: make sure the following prefs are on (about:config) ```` + dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled = true javascript.options.wasm_simd = true javascript.options.wasm_simd_wormhole = true ```` From f7c86518cfbe418ba9db6655a6e093de520c618d Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 10:04:49 +0200 Subject: [PATCH 087/442] Update test page package-lock.json --- wasm/test_page/package-lock.json | 515 +------------------------------ 1 file changed, 1 insertion(+), 514 deletions(-) diff --git a/wasm/test_page/package-lock.json b/wasm/test_page/package-lock.json index 065c92de8..ae4cb9dd6 100644 --- a/wasm/test_page/package-lock.json +++ b/wasm/test_page/package-lock.json @@ -1,519 +1,6 @@ { - "name": "test_page", - "lockfileVersion": 2, "requires": true, - "packages": { - "": { - "dependencies": { - "cors": "^2.8.5", - "express": "^4.17.1", - "nocache": "^2.1.0" - } - }, - "node_modules/accepts": { - "version": "1.3.7", - "resolved": 
"https://registry.npmjs.org/accepts/-/accepts-1.3.7.tgz", - "integrity": "sha512-Il80Qs2WjYlJIBNzNkK6KYqlVMTbZLXgHx2oT0pU/fjRHyEp+PEfEPY0R3WCwAGVOtauxh1hOxNgIf5bv7dQpA==", - "dependencies": { - "mime-types": "~2.1.24", - "negotiator": "0.6.2" - }, - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/array-flatten": { - "version": "1.1.1", - "resolved": "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz", - "integrity": "sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=" - }, - "node_modules/body-parser": { - "version": "1.19.0", - "resolved": "https://registry.npmjs.org/body-parser/-/body-parser-1.19.0.tgz", - "integrity": "sha512-dhEPs72UPbDnAQJ9ZKMNTP6ptJaionhP5cBb541nXPlW60Jepo9RV/a4fX4XWW9CuFNK22krhrj1+rgzifNCsw==", - "dependencies": { - "bytes": "3.1.0", - "content-type": "~1.0.4", - "debug": "2.6.9", - "depd": "~1.1.2", - "http-errors": "1.7.2", - "iconv-lite": "0.4.24", - "on-finished": "~2.3.0", - "qs": "6.7.0", - "raw-body": "2.4.0", - "type-is": "~1.6.17" - }, - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/bytes": { - "version": "3.1.0", - "resolved": "https://registry.npmjs.org/bytes/-/bytes-3.1.0.tgz", - "integrity": "sha512-zauLjrfCG+xvoyaqLoV8bLVXXNGC4JqlxFCutSDWA6fJrTo2ZuvLYTqZ7aHBLZSMOopbzwv8f+wZcVzfVTI2Dg==", - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/content-disposition": { - "version": "0.5.3", - "resolved": "https://registry.npmjs.org/content-disposition/-/content-disposition-0.5.3.tgz", - "integrity": "sha512-ExO0774ikEObIAEV9kDo50o+79VCUdEB6n6lzKgGwupcVeRlhrj3qGAfwq8G6uBJjkqLrhT0qEYFcWng8z1z0g==", - "dependencies": { - "safe-buffer": "5.1.2" - }, - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/content-type": { - "version": "1.0.4", - "resolved": "https://registry.npmjs.org/content-type/-/content-type-1.0.4.tgz", - "integrity": "sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA==", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/cookie": { - 
"version": "0.4.0", - "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.4.0.tgz", - "integrity": "sha512-+Hp8fLp57wnUSt0tY0tHEXh4voZRDnoIrZPqlo3DPiI4y9lwg/jqx+1Om94/W6ZaPDOUbnjOt/99w66zk+l1Xg==", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/cookie-signature": { - "version": "1.0.6", - "resolved": "https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.6.tgz", - "integrity": "sha1-4wOogrNCzD7oylE6eZmXNNqzriw=" - }, - "node_modules/cors": { - "version": "2.8.5", - "resolved": "https://registry.npmjs.org/cors/-/cors-2.8.5.tgz", - "integrity": "sha512-KIHbLJqu73RGr/hnbrO9uBeixNGuvSQjul/jdFvS/KFSIH1hWVd1ng7zOHx+YrEfInLG7q4n6GHQ9cDtxv/P6g==", - "dependencies": { - "object-assign": "^4", - "vary": "^1" - }, - "engines": { - "node": ">= 0.10" - } - }, - "node_modules/debug": { - "version": "2.6.9", - "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", - "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", - "dependencies": { - "ms": "2.0.0" - } - }, - "node_modules/depd": { - "version": "1.1.2", - "resolved": "https://registry.npmjs.org/depd/-/depd-1.1.2.tgz", - "integrity": "sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/destroy": { - "version": "1.0.4", - "resolved": "https://registry.npmjs.org/destroy/-/destroy-1.0.4.tgz", - "integrity": "sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA=" - }, - "node_modules/ee-first": { - "version": "1.1.1", - "resolved": "https://registry.npmjs.org/ee-first/-/ee-first-1.1.1.tgz", - "integrity": "sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0=" - }, - "node_modules/encodeurl": { - "version": "1.0.2", - "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-1.0.2.tgz", - "integrity": "sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k=", - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/escape-html": { - "version": "1.0.3", - "resolved": "https://registry.npmjs.org/escape-html/-/escape-html-1.0.3.tgz", - 
"integrity": "sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg=" - }, - "node_modules/etag": { - "version": "1.8.1", - "resolved": "https://registry.npmjs.org/etag/-/etag-1.8.1.tgz", - "integrity": "sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/express": { - "version": "4.17.1", - "resolved": "https://registry.npmjs.org/express/-/express-4.17.1.tgz", - "integrity": "sha512-mHJ9O79RqluphRrcw2X/GTh3k9tVv8YcoyY4Kkh4WDMUYKRZUq0h1o0w2rrrxBqM7VoeUVqgb27xlEMXTnYt4g==", - "dependencies": { - "accepts": "~1.3.7", - "array-flatten": "1.1.1", - "body-parser": "1.19.0", - "content-disposition": "0.5.3", - "content-type": "~1.0.4", - "cookie": "0.4.0", - "cookie-signature": "1.0.6", - "debug": "2.6.9", - "depd": "~1.1.2", - "encodeurl": "~1.0.2", - "escape-html": "~1.0.3", - "etag": "~1.8.1", - "finalhandler": "~1.1.2", - "fresh": "0.5.2", - "merge-descriptors": "1.0.1", - "methods": "~1.1.2", - "on-finished": "~2.3.0", - "parseurl": "~1.3.3", - "path-to-regexp": "0.1.7", - "proxy-addr": "~2.0.5", - "qs": "6.7.0", - "range-parser": "~1.2.1", - "safe-buffer": "5.1.2", - "send": "0.17.1", - "serve-static": "1.14.1", - "setprototypeof": "1.1.1", - "statuses": "~1.5.0", - "type-is": "~1.6.18", - "utils-merge": "1.0.1", - "vary": "~1.1.2" - }, - "engines": { - "node": ">= 0.10.0" - } - }, - "node_modules/finalhandler": { - "version": "1.1.2", - "resolved": "https://registry.npmjs.org/finalhandler/-/finalhandler-1.1.2.tgz", - "integrity": "sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA==", - "dependencies": { - "debug": "2.6.9", - "encodeurl": "~1.0.2", - "escape-html": "~1.0.3", - "on-finished": "~2.3.0", - "parseurl": "~1.3.3", - "statuses": "~1.5.0", - "unpipe": "~1.0.0" - }, - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/forwarded": { - "version": "0.1.2", - "resolved": "https://registry.npmjs.org/forwarded/-/forwarded-0.1.2.tgz", - "integrity": "sha1-mMI9qxF1ZXuMBXPozszZGw/xjIQ=", - 
"engines": { - "node": ">= 0.6" - } - }, - "node_modules/fresh": { - "version": "0.5.2", - "resolved": "https://registry.npmjs.org/fresh/-/fresh-0.5.2.tgz", - "integrity": "sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/http-errors": { - "version": "1.7.2", - "resolved": "https://registry.npmjs.org/http-errors/-/http-errors-1.7.2.tgz", - "integrity": "sha512-uUQBt3H/cSIVfch6i1EuPNy/YsRSOUBXTVfZ+yR7Zjez3qjBz6i9+i4zjNaoqcoFVI4lQJ5plg63TvGfRSDCRg==", - "dependencies": { - "depd": "~1.1.2", - "inherits": "2.0.3", - "setprototypeof": "1.1.1", - "statuses": ">= 1.5.0 < 2", - "toidentifier": "1.0.0" - }, - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/iconv-lite": { - "version": "0.4.24", - "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.4.24.tgz", - "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", - "dependencies": { - "safer-buffer": ">= 2.1.2 < 3" - }, - "engines": { - "node": ">=0.10.0" - } - }, - "node_modules/inherits": { - "version": "2.0.3", - "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", - "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" - }, - "node_modules/ipaddr.js": { - "version": "1.9.1", - "resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-1.9.1.tgz", - "integrity": "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g==", - "engines": { - "node": ">= 0.10" - } - }, - "node_modules/media-typer": { - "version": "0.3.0", - "resolved": "https://registry.npmjs.org/media-typer/-/media-typer-0.3.0.tgz", - "integrity": "sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/merge-descriptors": { - "version": "1.0.1", - "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", - "integrity": "sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E=" - }, - "node_modules/methods": { - "version": "1.1.2", 
- "resolved": "https://registry.npmjs.org/methods/-/methods-1.1.2.tgz", - "integrity": "sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/mime": { - "version": "1.6.0", - "resolved": "https://registry.npmjs.org/mime/-/mime-1.6.0.tgz", - "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==", - "bin": { - "mime": "cli.js" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/mime-db": { - "version": "1.45.0", - "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.45.0.tgz", - "integrity": "sha512-CkqLUxUk15hofLoLyljJSrukZi8mAtgd+yE5uO4tqRZsdsAJKv0O+rFMhVDRJgozy+yG6md5KwuXhD4ocIoP+w==", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/mime-types": { - "version": "2.1.28", - "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.28.tgz", - "integrity": "sha512-0TO2yJ5YHYr7M2zzT7gDU1tbwHxEUWBCLt0lscSNpcdAfFyJOVEpRYNS7EXVcTLNj/25QO8gulHC5JtTzSE2UQ==", - "dependencies": { - "mime-db": "1.45.0" - }, - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/ms": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", - "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" - }, - "node_modules/negotiator": { - "version": "0.6.2", - "resolved": "https://registry.npmjs.org/negotiator/-/negotiator-0.6.2.tgz", - "integrity": "sha512-hZXc7K2e+PgeI1eDBe/10Ard4ekbfrrqG8Ep+8Jmf4JID2bNg7NvCPOZN+kfF574pFQI7mum2AUqDidoKqcTOw==", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/nocache": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/nocache/-/nocache-2.1.0.tgz", - "integrity": "sha512-0L9FvHG3nfnnmaEQPjT9xhfN4ISk0A8/2j4M37Np4mcDesJjHgEUfgPhdCyZuFI954tjokaIj/A3NdpFNdEh4Q==", - "engines": { - "node": ">=4.0.0" - } - }, - "node_modules/object-assign": { - "version": "4.1.1", - "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", - "integrity": 
"sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=", - "engines": { - "node": ">=0.10.0" - } - }, - "node_modules/on-finished": { - "version": "2.3.0", - "resolved": "https://registry.npmjs.org/on-finished/-/on-finished-2.3.0.tgz", - "integrity": "sha1-IPEzZIGwg811M3mSoWlxqi2QaUc=", - "dependencies": { - "ee-first": "1.1.1" - }, - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/parseurl": { - "version": "1.3.3", - "resolved": "https://registry.npmjs.org/parseurl/-/parseurl-1.3.3.tgz", - "integrity": "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ==", - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/path-to-regexp": { - "version": "0.1.7", - "resolved": "https://registry.npmjs.org/path-to-regexp/-/path-to-regexp-0.1.7.tgz", - "integrity": "sha1-32BBeABfUi8V60SQ5yR6G/qmf4w=" - }, - "node_modules/proxy-addr": { - "version": "2.0.6", - "resolved": "https://registry.npmjs.org/proxy-addr/-/proxy-addr-2.0.6.tgz", - "integrity": "sha512-dh/frvCBVmSsDYzw6n926jv974gddhkFPfiN8hPOi30Wax25QZyZEGveluCgliBnqmuM+UJmBErbAUFIoDbjOw==", - "dependencies": { - "forwarded": "~0.1.2", - "ipaddr.js": "1.9.1" - }, - "engines": { - "node": ">= 0.10" - } - }, - "node_modules/qs": { - "version": "6.7.0", - "resolved": "https://registry.npmjs.org/qs/-/qs-6.7.0.tgz", - "integrity": "sha512-VCdBRNFTX1fyE7Nb6FYoURo/SPe62QCaAyzJvUjwRaIsc+NePBEniHlvxFmmX56+HZphIGtV0XeCirBtpDrTyQ==", - "engines": { - "node": ">=0.6" - } - }, - "node_modules/range-parser": { - "version": "1.2.1", - "resolved": "https://registry.npmjs.org/range-parser/-/range-parser-1.2.1.tgz", - "integrity": "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/raw-body": { - "version": "2.4.0", - "resolved": "https://registry.npmjs.org/raw-body/-/raw-body-2.4.0.tgz", - "integrity": "sha512-4Oz8DUIwdvoa5qMJelxipzi/iJIi40O5cGV1wNYp5hvZP8ZN0T+jiNkL0QepXs+EsQ9XJ8ipEDoiH70ySUJP3Q==", - 
"dependencies": { - "bytes": "3.1.0", - "http-errors": "1.7.2", - "iconv-lite": "0.4.24", - "unpipe": "1.0.0" - }, - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/safe-buffer": { - "version": "5.1.2", - "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", - "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" - }, - "node_modules/safer-buffer": { - "version": "2.1.2", - "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz", - "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==" - }, - "node_modules/send": { - "version": "0.17.1", - "resolved": "https://registry.npmjs.org/send/-/send-0.17.1.tgz", - "integrity": "sha512-BsVKsiGcQMFwT8UxypobUKyv7irCNRHk1T0G680vk88yf6LBByGcZJOTJCrTP2xVN6yI+XjPJcNuE3V4fT9sAg==", - "dependencies": { - "debug": "2.6.9", - "depd": "~1.1.2", - "destroy": "~1.0.4", - "encodeurl": "~1.0.2", - "escape-html": "~1.0.3", - "etag": "~1.8.1", - "fresh": "0.5.2", - "http-errors": "~1.7.2", - "mime": "1.6.0", - "ms": "2.1.1", - "on-finished": "~2.3.0", - "range-parser": "~1.2.1", - "statuses": "~1.5.0" - }, - "engines": { - "node": ">= 0.8.0" - } - }, - "node_modules/send/node_modules/ms": { - "version": "2.1.1", - "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", - "integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" - }, - "node_modules/serve-static": { - "version": "1.14.1", - "resolved": "https://registry.npmjs.org/serve-static/-/serve-static-1.14.1.tgz", - "integrity": "sha512-JMrvUwE54emCYWlTI+hGrGv5I8dEwmco/00EvkzIIsR7MqrHonbD9pO2MOfFnpFntl7ecpZs+3mW+XbQZu9QCg==", - "dependencies": { - "encodeurl": "~1.0.2", - "escape-html": "~1.0.3", - "parseurl": "~1.3.3", - "send": "0.17.1" - }, - "engines": { - "node": ">= 0.8.0" - } - }, - "node_modules/setprototypeof": { - "version": "1.1.1", - "resolved": 
"https://registry.npmjs.org/setprototypeof/-/setprototypeof-1.1.1.tgz", - "integrity": "sha512-JvdAWfbXeIGaZ9cILp38HntZSFSo3mWg6xGcJJsd+d4aRMOqauag1C63dJfDw7OaMYwEbHMOxEZ1lqVRYP2OAw==" - }, - "node_modules/statuses": { - "version": "1.5.0", - "resolved": "https://registry.npmjs.org/statuses/-/statuses-1.5.0.tgz", - "integrity": "sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow=", - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/toidentifier": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/toidentifier/-/toidentifier-1.0.0.tgz", - "integrity": "sha512-yaOH/Pk/VEhBWWTlhI+qXxDFXlejDGcQipMlyxda9nthulaxLZUNcUqFxokp0vcYnvteJln5FNQDRrxj3YcbVw==", - "engines": { - "node": ">=0.6" - } - }, - "node_modules/type-is": { - "version": "1.6.18", - "resolved": "https://registry.npmjs.org/type-is/-/type-is-1.6.18.tgz", - "integrity": "sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g==", - "dependencies": { - "media-typer": "0.3.0", - "mime-types": "~2.1.24" - }, - "engines": { - "node": ">= 0.6" - } - }, - "node_modules/unpipe": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", - "integrity": "sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw=", - "engines": { - "node": ">= 0.8" - } - }, - "node_modules/utils-merge": { - "version": "1.0.1", - "resolved": "https://registry.npmjs.org/utils-merge/-/utils-merge-1.0.1.tgz", - "integrity": "sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM=", - "engines": { - "node": ">= 0.4.0" - } - }, - "node_modules/vary": { - "version": "1.1.2", - "resolved": "https://registry.npmjs.org/vary/-/vary-1.1.2.tgz", - "integrity": "sha1-IpnwLG3tMNSllhsLn3RSShj2NPw=", - "engines": { - "node": ">= 0.8" - } - } - }, + "lockfileVersion": 1, "dependencies": { "accepts": { "version": "1.3.7", From 26ea5bba7a0a37c5785d34be6586f154f1bebb0b Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 10:26:04 +0200 Subject: [PATCH 088/442] Some cleanup --- wasm/README.md | 6 +++--- 
wasm/test_page/bergamot-httpserver.js | 4 ---- wasm/test_page/bergamot.html | 6 +++--- 3 files changed, 6 insertions(+), 10 deletions(-) diff --git a/wasm/README.md b/wasm/README.md index 131f9eb06..bb431447c 100644 --- a/wasm/README.md +++ b/wasm/README.md @@ -35,17 +35,17 @@ input.delete(); You can also see everything in action by following the next steps: * Start the test webserver (ensure you have the latest nodejs installed) -``` +```bash cd test_page bash start_server.sh ``` * Open any of the browsers below * Firefox Nightly +87: make sure the following prefs are on (about:config) - ```` + ``` dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled = true javascript.options.wasm_simd = true javascript.options.wasm_simd_wormhole = true - ```` + ``` * Chrome Canary +90: start with the following argument ``` diff --git a/wasm/test_page/bergamot-httpserver.js b/wasm/test_page/bergamot-httpserver.js index f23b3e750..b28719fed 100644 --- a/wasm/test_page/bergamot-httpserver.js +++ b/wasm/test_page/bergamot-httpserver.js @@ -33,7 +33,3 @@ function serveFile(res, pathName, mime) { server.listen(8000); console.log('HTTP and BinaryJS server started on port 8000'); - - - - diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 49ca50e96..e7e1fe5b3 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -29,9 +29,9 @@
- - - + + +
From d3969bcd2d2430a4bf5f047d791eb768ba4cb013 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 10:34:57 +0200 Subject: [PATCH 089/442] Add support for translating multiple sentences on the test page + report words per second metric in the log --- wasm/test_page/bergamot.html | 42 +++++++++++++++++++++++++----------- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index e7e1fe5b3..d093208c2 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -37,10 +37,13 @@
- +

- +

@@ -65,17 +68,23 @@ model = new Module.TranslationModel(modelConfig); } - const translate = (sentence) => { + const translate = (sentences) => { // Instantiate the arguments of translate() API i.e. TranslationRequest and input (vector) var request = new Module.TranslationRequest(); let input = new Module.VectorString; // Initialize the input - input.push_back(sentence); - /* + sentences.forEach(sentence => { + // prevent empty sentences - it breaks the translation + if (sentence.trim() === "") { + return; + } + input.push_back(sentence.trim()) + }) // Access input (just for debugging) console.log('Input size=', input.size()); + /* for (let i = 0; i < input.size(); i++) { console.log(' val:' + input.get(i)); } @@ -85,14 +94,14 @@ let result = model.translate(input, request); // Access original and translated text from each entry of vector //console.log('Result size=', result.size(), ' - TimeDiff - ', (Date.now() - start)/1000); - let translatedText = ""; + const translatedSentences = []; for (let i = 0; i < result.size(); i++) { - translatedText += result.get(i).getTranslatedText() + " "; + translatedSentences.push(result.get(i).getTranslatedText()); } - console.log(translatedText); + console.log({translatedSentences}); request.delete(); input.delete(); - return translatedText; + return translatedSentences; } document.querySelector("#load").addEventListener("click", () => { @@ -105,10 +114,17 @@ const translateCall = () => { const text = document.querySelector('#from').value; - let start = Date.now(); - const translate_text = translate(text); - log(`sentence translation time ${(Date.now() - start)/1000} secs`); - document.querySelector('#to').value = translate_text; + const sentences = text.split("\n"); + let wordCount = 0; + sentences.forEach(sentence => { + wordCount += sentence.trim().split(" ").length; + }) + const start = Date.now(); + const translatedSentences = translate(sentences); + const secs = (Date.now() - start) / 1000; + log(`Translation of 
${translatedSentences.length} sentences (wordCount ${wordCount}) took ${secs} secs (${Math.round(wordCount / secs)} words per second)`); + + document.querySelector('#to').value = translatedSentences.join("\n"); } document.querySelector("#translate").addEventListener("click", () => { From 28c0ab2e04f6e32b999aac0caa181cd914f92e30 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 10:37:37 +0200 Subject: [PATCH 090/442] Tweak words per second metric in the test page log --- wasm/test_page/bergamot.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index d093208c2..992d7585d 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -117,7 +117,7 @@ const sentences = text.split("\n"); let wordCount = 0; sentences.forEach(sentence => { - wordCount += sentence.trim().split(" ").length; + wordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; }) const start = Date.now(); const translatedSentences = translate(sentences); const secs = (Date.now() - start) / 1000; log(`Translation of ${translatedSentences.length} sentences (wordCount ${wordCount}) took ${secs} secs (${Math.round(wordCount / secs)} words per second)`); From a33b3a3bb5bcac9fe34135d671773a11554dce82 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:21:36 +0200 Subject: [PATCH 091/442] Add instructions on how to assemble and package the set of files expected by the test page --- README.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 333e758e3..4bff753d1 100644 --- a/README.md +++ b/README.md @@ -45,10 +45,25 @@ Download the models from `https://github.com/mozilla-applied-ml/bergamot-models` The build also allows packaging files into wasm binary (i.e. preloading in Emscripten’s virtual file system) using cmake option `PACKAGE_DIR`. The compile command below packages all the files in PATH directory (in this case, your models) into wasm binary.
```bash -emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR= ./models +emcmake cmake -DCOMPILE_WASM=on -DPACKAGE_DIR=/repo/models ../ ``` Files packaged this way are preloaded in the root of the virtual file system. +To package the set of files expected by the test page: + +```bash +git clone https://github.com/browsermt/students +cd students/esen/ +./download-models.sh +cp esen.student.tiny11/lex.s2t ../../models/lex.esen.s2t +cp esen.student.tiny11/model.npz ../../models/model.esen.npz +cp esen.student.tiny11/vocab.esen.spm ../../models/vocab.esen.spm +cd - +cd students/enes/ +./download-models.sh +cp enes.student.tiny11/lex.s2t ../../models/lex.enes.s2t +cp enes.student.tiny11/model.npz ../../models/model.enes.npz +``` After Editing Files: From 53e0b9fc5c219ae57d79be57acbec0dd580e89a8 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:22:23 +0200 Subject: [PATCH 092/442] Fix typo in lexical shortlist argument on test page --- wasm/test_page/bergamot.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 992d7585d..4ead87dbb 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -61,7 +61,7 @@ // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ // This example captures the most relevant options: model file, vocabulary files and shortlist file // var modelConfig = "{\"models\":[\"/model.enes.npz\"],\"vocabs\":[\"/vocab.esen.spm\"],\"beam-size\":1}";//,\"shortlist\":[\"/lex.s2t\"] - const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1} ,\"shortlist\":[\"/lex.s2t\"]`; + const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1} ,\"shortlist\":[\"/lex.esen.s2t\"]`; // Instantiate the TranslationModel if (model) model.delete(); From 
e50dd0909f4709a6336b46a0baee175353ed0150 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:23:08 +0200 Subject: [PATCH 093/442] Ignore contents in models directory --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 59363a81c..6c301d661 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,4 @@ wasm/test_page/node_modules build-wasm +models From 7030fa015745070e0d7dc8ab6f0a5d25a1d95a78 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:25:13 +0200 Subject: [PATCH 094/442] Ignore test page bundled artifacts --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 6c301d661..d7d931f6e 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ wasm/test_page/node_modules build-wasm models +wasm/test_page/bergamot-translator-worker.* From 49ad6514aec6498e2a24a7dd96cff25d4e64ab5d Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:27:47 +0200 Subject: [PATCH 095/442] Add reproducible docker-based builds + let test page use these by default --- .gitignore | 2 +- docker/Makefile | 55 ++++++++++++++++++++++++++++++++++ docker/README.md | 27 +++++++++++++++++ docker/wasm/Dockerfile | 36 ++++++++++++++++++++++ wasm/test_page/start_server.sh | 8 ++--- 5 files changed, 123 insertions(+), 5 deletions(-) create mode 100644 docker/Makefile create mode 100644 docker/README.md create mode 100644 docker/wasm/Dockerfile diff --git a/.gitignore b/.gitignore index d7d931f6e..5a73aac90 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,6 @@ *.swo wasm/test_page/node_modules -build-wasm +build-* models wasm/test_page/bergamot-translator-worker.* diff --git a/docker/Makefile b/docker/Makefile new file mode 100644 index 000000000..583a58852 --- /dev/null +++ b/docker/Makefile @@ -0,0 +1,55 @@ +# -*- mode: makefile-gmake; indent-tabs-mode: true; tab-width: 4 -*- +SHELL = bash +PWD = $(shell pwd) +WASM_IMAGE = local/bergamot-translator-build-wasm + +all: wasm-image compile-wasm + 
+# Build the Docker image for WASM builds +wasm-image: + docker build -t local/bergamot-translator-build-wasm ./wasm/ + +# Commands for compilation: +cmake_cmd = cmake + +wasm_cmake_cmd = ${cmake_cmd} +wasm_cmake_cmd += -DCOMPILE_WASM=on +wasm_cmake_cmd += -DProtobuf_INCLUDE_DIR=/usr/opt/protobuf-wasm-lib/dist/include +wasm_cmake_cmd += -DProtobuf_LIBRARY=/usr/opt/protobuf-wasm-lib/dist/lib/libprotobuf.a +wasm_cmake_cmd += -DPACKAGE_DIR=/repo/models + +make_cmd = make +#make_cmd += VERBOSE=1 + +# ... and running things on Docker +docker_mounts = ${PWD}/..:/repo +docker_mounts += ${HOME}/.ccache:/.ccache +run_on_docker = docker run --rm +run_on_docker += $(addprefix -v, ${docker_mounts}) +run_on_docker += ${INTERACTIVE_DOCKER_SESSION} + +${HOME}/.ccache: + mkdir -p $@ + +# Remove the bergamot-translator WASM build dir, forcing a clean compilation attempt +clean-wasm: BUILD_DIR = /repo/build-wasm-docker +clean-wasm: ${HOME}/.ccache + ${run_on_docker} ${WASM_IMAGE} bash -c '(rm -rf ${BUILD_DIR} || true)' + +# Compile bergamot-translator to WASM +compile-wasm: BUILD_DIR = /repo/build-wasm-docker +compile-wasm: ${HOME}/.ccache + ${run_on_docker} ${WASM_IMAGE} bash -c 'mkdir -p ${BUILD_DIR} && \ +cd ${BUILD_DIR} && \ +(emcmake ${wasm_cmake_cmd} .. 
&& \ +(emmake ${make_cmd}) || \ +rm CMakeCache.txt)' + +# Start interactive shells for development / debugging purposes +native-shell: INTERACTIVE_DOCKER_SESSION = -it +native-shell: + ${run_on_docker} ${NATIVE_IMAGE} bash + +wasm-shell: INTERACTIVE_DOCKER_SESSION = -it +wasm-shell: + ${run_on_docker} ${WASM_IMAGE} bash diff --git a/docker/README.md b/docker/README.md new file mode 100644 index 000000000..d98456a54 --- /dev/null +++ b/docker/README.md @@ -0,0 +1,27 @@ +## WASM + +Prepare docker image for WASM compilation: + +```bash +make wasm-image +``` + +Compile to wasm: + +```bash +make compile-wasm +``` + +## Debugging + +Remove the marian-decoder build dir, forcing the next compilation attempt to start from scratch: + +```bash +make clean-wasm +``` + +Enter a docker container shell for manually running commands: + +```bash +make wasm-shell +``` diff --git a/docker/wasm/Dockerfile b/docker/wasm/Dockerfile new file mode 100644 index 000000000..f309662a7 --- /dev/null +++ b/docker/wasm/Dockerfile @@ -0,0 +1,36 @@ +FROM emscripten/emsdk:2.0.9 + +# Install specific version of CMake +WORKDIR /usr +RUN wget https://github.com/Kitware/CMake/releases/download/v3.17.2/cmake-3.17.2-Linux-x86_64.tar.gz -qO-\ + | tar xzf - --strip-components 1 + +# Install Python and Java (needed for Closure Compiler minification) +RUN apt-get update \ + && apt-get install -y \ + python3 \ + default-jre + +# Deps to compile protobuf from source + the protoc binary which we need natively +RUN apt-get update -y && apt-get --no-install-recommends -y install \ + protobuf-compiler \ + autoconf \ + autotools-dev \ + automake \ + autogen \ + libtool && ln -s /usr/bin/libtoolize /usr/bin/libtool \ + && mkdir -p /usr/opt \ + && cd /usr/opt \ + && git clone https://github.com/menduz/protobuf-wasm-lib + +RUN cd /usr/opt/protobuf-wasm-lib \ + && /bin/bash -c "BRANCH=v3.6.1 ./prepare.sh" +RUN cd /usr/opt/protobuf-wasm-lib/protobuf \ + && bash -x ../build.sh +RUN cp /usr/bin/protoc 
/usr/opt/protobuf-wasm-lib/dist/bin/protoc + +RUN apt-get --no-install-recommends -y install \ + libprotobuf-dev + +# Necessary for benchmarking +RUN pip3 install sacrebleu diff --git a/wasm/test_page/start_server.sh b/wasm/test_page/start_server.sh index b83344b8a..b0b5be1b2 100644 --- a/wasm/test_page/start_server.sh +++ b/wasm/test_page/start_server.sh @@ -1,8 +1,8 @@ #!/bin/bash -cp ../../build-wasm/wasm/bergamot-translator-worker.data . -cp ../../build-wasm/wasm/bergamot-translator-worker.js . -cp ../../build-wasm/wasm/bergamot-translator-worker.wasm . -cp ../../build-wasm/wasm/bergamot-translator-worker.worker.js . +cp ../../build-wasm-docker/wasm/bergamot-translator-worker.data . +cp ../../build-wasm-docker/wasm/bergamot-translator-worker.js . +cp ../../build-wasm-docker/wasm/bergamot-translator-worker.wasm . +cp ../../build-wasm-docker/wasm/bergamot-translator-worker.worker.js . npm install node bergamot-httpserver.js \ No newline at end of file From 77f39545f314c7a931c91aef0a11e871ff5a880c Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:30:45 +0200 Subject: [PATCH 096/442] Add time it takes to arrive to preRun to test page --- wasm/test_page/bergamot.html | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 4ead87dbb..7b38cc22f 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -141,13 +141,15 @@ document.querySelector("#log").value += message + "\n"; } + const start = Date.now(); let moduleLoadStart; var Module = { preRun: [function() { + log(`Time until Module.preRun: ${(Date.now() - start)/1000} secs`); moduleLoadStart = Date.now(); }], onRuntimeInitialized: function() { - log(`Wasm Runtime initialized in ${(Date.now() - moduleLoadStart)/1000} secs`); + log(`Wasm Runtime initialized (preRun -> onRuntimeInitialized) in ${(Date.now() - moduleLoadStart)/1000} secs`); } }; From dbdcdab1153be9891e2a44aa308b29c0141349aa Mon Sep 17 
00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 11:59:03 +0200 Subject: [PATCH 097/442] Avoid use of unsafe eval in glue code --- wasm/CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/CMakeLists.txt b/wasm/CMakeLists.txt index 40b08bf6a..837515837 100644 --- a/wasm/CMakeLists.txt +++ b/wasm/CMakeLists.txt @@ -14,7 +14,7 @@ target_include_directories(bergamot-translator-worker target_compile_definitions(bergamot-translator-worker PRIVATE WASM_BINDINGS) target_compile_options(bergamot-translator-worker PRIVATE ${WASM_COMPILE_FLAGS}) -set(LINKER_FLAGS "--bind -s ASSERTIONS=1 -s DISABLE_EXCEPTION_CATCHING=0 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1") +set(LINKER_FLAGS "--bind -s ASSERTIONS=1 -s DISABLE_EXCEPTION_CATCHING=0 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1 -s NO_DYNAMIC_EXECUTION=1") if (NOT PACKAGE_DIR STREQUAL "") set(LINKER_FLAGS "${LINKER_FLAGS} --preload-file ${PACKAGE_DIR}@/") endif() From 70bdcd436571de532ea202d95edf7cccf9505bb4 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 12:54:32 +0200 Subject: [PATCH 098/442] Fix typo from when fixing typo --- wasm/test_page/bergamot.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 7b38cc22f..e5d7a90b3 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -61,7 +61,7 @@ // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ // This example captures the most relevant options: model file, vocabulary files and shortlist file // var modelConfig = "{\"models\":[\"/model.enes.npz\"],\"vocabs\":[\"/vocab.esen.spm\"],\"beam-size\":1}";//,\"shortlist\":[\"/lex.s2t\"] - const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1} ,\"shortlist\":[\"/lex.esen.s2t\"]`; + const modelConfig = 
`{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1} ,\"shortlist\":[\"/lex.${lang}.s2t\"]`; // Instantiate the TranslationModel if (model) model.delete(); From da56501c4f255d9bc57c2d244e0979c29676ad3f Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:10:10 +0200 Subject: [PATCH 099/442] Finally found the original typo that made it appear as if loading the model in the test page was faster than elsewhere - the lexical shortlist was not being included at the right place in the model config --- wasm/test_page/bergamot.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index e5d7a90b3..6985cee89 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -61,7 +61,7 @@ // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ // This example captures the most relevant options: model file, vocabulary files and shortlist file // var modelConfig = "{\"models\":[\"/model.enes.npz\"],\"vocabs\":[\"/vocab.esen.spm\"],\"beam-size\":1}";//,\"shortlist\":[\"/lex.s2t\"] - const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1} ,\"shortlist\":[\"/lex.${lang}.s2t\"]`; + const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1,\"shortlist\":[\"/lex.${lang}.s2t\"]}`; // Instantiate the TranslationModel if (model) model.delete(); From 1e94d78c4d2b6bb9b763c16c59b0178a8458e18f Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:19:39 +0200 Subject: [PATCH 100/442] Formatting --- wasm/test_page/bergamot.html | 228 +++++++++++++++++------------------ 1 file changed, 114 insertions(+), 114 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 6985cee89..541da1580 100644 --- 
a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -1,41 +1,41 @@
[hunk body unrecoverable: HTML markup stripped from this diff during extraction]
From fcc998ffa4c2468baed11889951685ff0b923cf7 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:30:07 +0200 Subject: [PATCH 101/442] Add 10 lines of esen benchmark sentences to test page --- wasm/test_page/bergamot.html | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 541da1580..cbd266567 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -30,16 +30,24 @@
[hunk body unrecoverable: HTML markup stripped from this diff during extraction]
From f3ff1d29ae4d6d036f68bc993420c36015f10b09 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:30:46 +0200 Subject: [PATCH 102/442] Make modelConfig an object instead of string (less likelihood of typos) --- wasm/test_page/bergamot.html | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index cbd266567..0de9925e7 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -67,13 +67,25 @@ const loadModel = (lang) => { // Set the Model Configuration as YAML formatted string. // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ - // This example captures the most relevant options: model file, vocabulary files and shortlist file - // var modelConfig = "{\"models\":[\"/model.enes.npz\"],\"vocabs\":[\"/vocab.esen.spm\"],\"beam-size\":1}";//,\"shortlist\":[\"/lex.s2t\"] - const modelConfig = `{\"models\":[\"/model.${lang}.npz\"],\"vocabs\":[\"/vocab.esen.spm\",\"/vocab.esen.spm\"],\"beam-size\":1,\"shortlist\":[\"/lex.${lang}.s2t\"]}`; + + const modelConfig = { + "models": [ + `/model.${lang}.npz` + ], + "vocabs": [ + "/vocab.esen.spm", + "/vocab.esen.spm" + ], + "shortlist": [ + `/lex.${lang}.s2t`, + 50, + 50, + ] + }; // Instantiate the TranslationModel if (model) model.delete(); - model = new Module.TranslationModel(modelConfig); + model = new Module.TranslationModel(JSON.stringify(modelConfig)); } const translate = (sentences) => { From 7d6346d3b0b000f281e99f972bb2fe663b93b27f Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:35:22 +0200 Subject: [PATCH 103/442] Add model config used in pr6 benchmarks --- wasm/test_page/bergamot.html | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 0de9925e7..9322368ef 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -76,11 
+76,22 @@ "/vocab.esen.spm", "/vocab.esen.spm" ], + "beam-size": 1, + "mini-batch": 32, + "maxi-batch": 100, + "maxi-batch-sort": "src", + "workspace": 128, + "skip-cost": true, + "cpu-threads": 1, "shortlist": [ `/lex.${lang}.s2t`, 50, 50, ] + // TODO: Enable when wormhole is enabled + // "int8shift": true, + // TODO: Enable when loading of binary models is supported and we use model.intgemm.alphas.bin + // "int8shiftAlphaAll": true, }; // Instantiate the TranslationModel From 64d57d8aa089957f5c8ffe88f7ce805de0423e6e Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:50:59 +0200 Subject: [PATCH 104/442] Use yaml for modelConfig on test page --- wasm/test_page/bergamot.html | 51 +++++++++++++++++------------------- 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 9322368ef..04ff5aeb9 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -67,36 +67,33 @@ const loadModel = (lang) => { // Set the Model Configuration as YAML formatted string. 
// For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ - - const modelConfig = { - "models": [ - `/model.${lang}.npz` - ], - "vocabs": [ - "/vocab.esen.spm", - "/vocab.esen.spm" - ], - "beam-size": 1, - "mini-batch": 32, - "maxi-batch": 100, - "maxi-batch-sort": "src", - "workspace": 128, - "skip-cost": true, - "cpu-threads": 1, - "shortlist": [ - `/lex.${lang}.s2t`, - 50, - 50, - ] - // TODO: Enable when wormhole is enabled - // "int8shift": true, - // TODO: Enable when loading of binary models is supported and we use model.intgemm.alphas.bin - // "int8shiftAlphaAll": true, - }; + const modelConfig = `models: + - /model.${lang}.npz +vocabs: + - /vocab.esen.spm + - /vocab.esen.spm +beam-size: 1 +normalize: 1.0 +word-penalty: 0 +mini-batch: 32 +maxi-batch: 100 +maxi-batch-sort: src +workspace: 128 +max-length-factor: 2.0 +skip-cost: true +shortlist: + - lex.${lang}.s2t + - 50 + - 50 +`; +// TODO: Use in model config when wormhole is enabled: +// gemm-precision: int8shift +// TODO: Use in model config when loading of binary models is supported and we use model.intgemm.alphas.bin: +// gemm-precision: int8shiftAlphaAll // Instantiate the TranslationModel if (model) model.delete(); - model = new Module.TranslationModel(JSON.stringify(modelConfig)); + model = new Module.TranslationModel(modelConfig); } const translate = (sentences) => { From 3dd7a60b3511e5ebc09169f33d37913834e83a1d Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 15 Feb 2021 12:50:40 +0100 Subject: [PATCH 105/442] Enabled simd shuffle pattern for intgemm compilation - WORMHOLE cmake option is set to ON when compiling for WASM - WASM module might not run on Chrome --- CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index 677963f12..ccaf65224 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -23,6 +23,10 @@ SET(COMPILE_DECODER_ONLY ON CACHE BOOL "Compile marian-decoder only") 
SET(COMPILE_WITH_PTHREADS OFF CACHE BOOL "Compile with pthreads support") SET(USE_WASM_COMPATIBLE_BLAS ON CACHE BOOL "Compile with a WASM compatible blas for decoder only builds") SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") +if(COMPILE_WASM) + # Set WORMHOLE to ON for marian whenever compiling for wasm platform + SET(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") +endif() execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From 91e45cb4f08a1b9f59757c82c61fbd5b86d88915 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 13:58:12 +0200 Subject: [PATCH 106/442] Prepend shortlist path with / --- wasm/test_page/bergamot.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index 04ff5aeb9..8fc7824e1 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -82,7 +82,7 @@ max-length-factor: 2.0 skip-cost: true shortlist: - - lex.${lang}.s2t + - /lex.${lang}.s2t - 50 - 50 `; From 9a5ae9568e50856d520839854dc00ee2662b2d04 Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 14:24:59 +0200 Subject: [PATCH 107/442] Turn off assertions and disable exception catching for wasm builds --- CMakeLists.txt | 2 +- wasm/CMakeLists.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index ccaf65224..8044cb08a 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -38,7 +38,7 @@ endif() if(COMPILE_WASM) list(APPEND WASM_COMPILE_FLAGS -pthread -O3 -g2 -fPIC -mssse3 -msimd128) - list(APPEND WASM_COMPILE_FLAGS "SHELL:-s WASM=1" "SHELL:-s ASSERTIONS=1" "SHELL:-s DISABLE_EXCEPTION_CATCHING=0" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") + list(APPEND WASM_COMPILE_FLAGS "SHELL:-s WASM=1" "SHELL:-s
ASSERTIONS=0" "SHELL:-s DISABLE_EXCEPTION_CATCHING=1" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") list(APPEND WASM_COMPILE_FLAGS -Wno-error=pthreads-mem-growth) endif(COMPILE_WASM) diff --git a/wasm/CMakeLists.txt b/wasm/CMakeLists.txt index 837515837..748762d14 100644 --- a/wasm/CMakeLists.txt +++ b/wasm/CMakeLists.txt @@ -14,7 +14,7 @@ target_include_directories(bergamot-translator-worker target_compile_definitions(bergamot-translator-worker PRIVATE WASM_BINDINGS) target_compile_options(bergamot-translator-worker PRIVATE ${WASM_COMPILE_FLAGS}) -set(LINKER_FLAGS "--bind -s ASSERTIONS=1 -s DISABLE_EXCEPTION_CATCHING=0 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1 -s NO_DYNAMIC_EXECUTION=1") +set(LINKER_FLAGS "--bind -s ASSERTIONS=0 -s DISABLE_EXCEPTION_CATCHING=1 -s FORCE_FILESYSTEM=1 -s ALLOW_MEMORY_GROWTH=1 -s NO_DYNAMIC_EXECUTION=1") if (NOT PACKAGE_DIR STREQUAL "") set(LINKER_FLAGS "${LINKER_FLAGS} --preload-file ${PACKAGE_DIR}@/") endif() From 9a5cf30bbbdee83d98e933ee122aed00b26b161a Mon Sep 17 00:00:00 2001 From: Motin Date: Mon, 15 Feb 2021 15:03:00 +0200 Subject: [PATCH 108/442] Revert "Enabled simd shuffle pattern for intgemm compilation" This reverts commit 3dd7a60b3511e5ebc09169f33d37913834e83a1d. 
--- CMakeLists.txt | 4 ---- 1 file changed, 4 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 8044cb08a..108338411 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -23,10 +23,6 @@ SET(COMPILE_DECODER_ONLY ON CACHE BOOL "Compile marian-decoder only") SET(COMPILE_WITH_PTHREADS OFF CACHE BOOL "Compile with pthreads support") SET(USE_WASM_COMPATIBLE_BLAS ON CACHE BOOL "Compile with a WASM compatible blas for decoder only builds") SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") -if(COMPILE_WASM) - # Set WORMHOLE to ON for marian whenever compiling for wasm platform - SET(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") -endif() execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From ca6ca154b9ee74899f1a801a8a3c91972ca10043 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Mon, 15 Feb 2021 15:22:31 +0000 Subject: [PATCH 109/442] Changing fn name from enqueue to produceTo(pcqueue) --- src/translator/batcher.cpp | 2 +- src/translator/batcher.h | 2 +- src/translator/service.cpp | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/src/translator/batcher.cpp b/src/translator/batcher.cpp index 5fdcc3ac6..9ba0d035f 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batcher.cpp @@ -62,7 +62,7 @@ void Batcher::addWholeRequest(Ptr request) { } } -void Batcher::enqueue(PCQueue &pcqueue) { +void Batcher::produceTo(PCQueue &pcqueue) { Batch batch; while (cleaveBatch(batch)) { pcqueue.ProduceSwap(batch); diff --git a/src/translator/batcher.h b/src/translator/batcher.h index d6b85f3f3..342725708 100644 --- a/src/translator/batcher.h +++ b/src/translator/batcher.h @@ -21,7 +21,7 @@ class Batcher { // which maintains priority among sentences from multiple concurrent requests. 
void addSentenceWithPriority(RequestSentence &sentence); void addWholeRequest(Ptr request); - void enqueue(PCQueue &pcqueue); + void produceTo(PCQueue &pcqueue); // Loads sentences with sentences compiled from (tentatively) multiple // requests optimizing for both padding and priority. diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 1b33558e7..96f391c2d 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -70,7 +70,7 @@ std::future Service::translate(std::string &&input) { batcher_.addWholeRequest(request); if (numWorkers_ > 0) { - batcher_.enqueue(pcqueue_); + batcher_.produceTo(pcqueue_); } else { // Queue single-threaded Batch batch; From 0374ac4696b124ed9e015325aef3c1501a514736 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 15 Feb 2021 14:28:06 +0100 Subject: [PATCH 110/442] Updated marian submodule - Includes try/catch free builds - Has ASSERTION=0 and DISABLE_EXCEPTION_CATCHING=1 for wasm builds --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 29ecba1cb..467c43a29 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 29ecba1cb1b8ea26ae582d3851e214769b89e566 +Subproject commit 467c43a292a68b7913af2a00d353de97c1740f92 From 3607523c24ca69fa3b195f1aae1aaf0c0bb44f65 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 15 Feb 2021 16:54:50 +0100 Subject: [PATCH 111/442] Enabled COMPILE_WITHOUT_EXCEPTIONS for marian submodule --- CMakeLists.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index 108338411..a2aec07a3 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -23,6 +23,7 @@ SET(COMPILE_DECODER_ONLY ON CACHE BOOL "Compile marian-decoder only") SET(COMPILE_WITH_PTHREADS OFF CACHE BOOL "Compile with pthreads support") SET(USE_WASM_COMPATIBLE_BLAS ON CACHE BOOL "Compile with a WASM compatible blas for decoder only builds") 
SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") +SET(COMPILE_WITHOUT_EXCEPTIONS ON CACHE BOOL "Compile without exceptions") execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From c5c5339489d6d209271f76ac2f53ce7ac92fa7c0 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Mon, 15 Feb 2021 17:18:59 +0100 Subject: [PATCH 112/442] Re-enable simd shuffle pattern for intgemm compilation --- CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index a2aec07a3..8d1ff1b52 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -24,6 +24,10 @@ SET(COMPILE_WITH_PTHREADS OFF CACHE BOOL "Compile with pthreads support") SET(USE_WASM_COMPATIBLE_BLAS ON CACHE BOOL "Compile with a WASM compatible blas for decoder only builds") SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.") SET(COMPILE_WITHOUT_EXCEPTIONS ON CACHE BOOL "Compile without exceptions") +if(COMPILE_WASM) + # Set WORMHOLE to ON for marian whenever compiling for wasm platform + SET(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") +endif() execute_process(COMMAND git submodule update --init --recursive --no-fetch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}) From d5a5e754510aeb158fea3e82939426e4d29885ed Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Mon, 15 Feb 2021 20:21:10 +0000 Subject: [PATCH 113/442] Renaming variables; Enhancing documentation --- src/translator/request.cpp | 45 ++++++++++- src/translator/request.h | 148 ++++++++++++++++++++++--------------- src/translator/service.cpp | 6 +- 3 files changed, 133 insertions(+), 66 deletions(-) diff --git a/src/translator/request.cpp b/src/translator/request.cpp index 9317f697e..303f9cc7d 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -10,14 +10,15 @@ 
namespace marian { namespace bergamot { +// ----------------------------------------------------------------- Request::Request(unsigned int Id, int lineNumberBegin, std::vector> &vocabs, std::string &&source, Segments &&segments, - std::vector &&sourceAlignments, + std::vector &&sourceTokenRanges, std::promise responsePromise) : Id_(Id), lineNumberBegin_(lineNumberBegin), vocabs_(&vocabs), source_(std::move(source)), segments_(std::move(segments)), - sourceAlignments_(std::move(sourceAlignments)), + sourceTokenRanges_(std::move(sourceTokenRanges)), response_(std::move(responsePromise)) { counter_ = segments_.size(); @@ -48,7 +49,7 @@ void Request::processHistory(size_t index, Ptr history) { void Request::completeRequest() { // Request no longer needs to hold the content, can transfer it to // Response. - Response response(std::move(source_), std::move(sourceAlignments_), + Response response(std::move(source_), std::move(sourceTokenRanges_), std::move(histories_), *vocabs_); response_.set_value(std::move(response)); } @@ -58,6 +59,8 @@ bool Request::operator<(const Request &b) const { return Id_ < b.Id_; } +// ------------------------------------------------------------------ + RequestSentence::RequestSentence(size_t index, Ptr request) : index_(index), request_(request) {} @@ -87,5 +90,41 @@ bool operator<(const RequestSentence &a, const RequestSentence &b) { return a.request_ < b.request_; } +// ---------------------------------------------------------------------- + +void Batch::reset() { + Id_ = 0; + sentences_.clear(); +} + +void Batch::log() { + int numTokens{0}, maxLength{0}; + for (auto &sentence : sentences_) { + numTokens += sentence.numTokens(); + maxLength = std::max(maxLength, static_cast(sentence.numTokens())); + } + + LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_, + numTokens, maxLength, sentences_.size()); +} + +void Batch::add(const RequestSentence &sentence) { + sentences_.push_back(sentence); +} + +void Batch::setId(int 
Id) { + assert(Id > 0); + Id_ = Id; + if (Id % 500 == 0) { + log(); + } +} + +void Batch::completeBatch(const Histories &histories) { + for (int i = 0; i < sentences_.size(); i++) { + sentences_[i].completeSentence(histories[i]); + } +} + } // namespace bergamot } // namespace marian diff --git a/src/translator/request.h b/src/translator/request.h index 8912a497d..095a03ccd 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -3,20 +3,19 @@ // // Request: holds the input blob of a text, Segments (vector) which are // to go to the batching mechanism and alignments between the processed -// segments and the input blob (sourceAlignments). In addition, Request takes +// segments and the input blob (sourceTokenRanges). In addition, Request takes // care of the barrier which fires when all the Segments in a request are done -// translating by the workers (BatchTranslator). Request is to be extended with -// notions of Priority (sequence, user-given). +// translating by the workers (BatchTranslator). +// TODO(jerinphilip): Extend Request with notions of Priority (sequence, +// user-given). // -// RequestSentence: is a tuple of (index, Request*). This provides the +// RequestSentence: is a tuple of (index, Ptr). This provides the // batching mechanism access to the segment within the request. The backref to // Request allows event triggering the barrier upon completion of the last // sentence by a worker. // -// PCItem: is a vector of RequestSentences and a batchNumber, which is what the -// PCQueue holds. The batches are constructed from segments returned by a -// RequestSentence. Can be enhanced with paddingSize, countTokens eventually for -// logging. +// Batch: is a vector of RequestSentences tagged with a batchNumber, which is +// what the PCQueue holds. Batch is "produced" by the Batcher. 
#ifndef SRC_BERGAMOT_REQUEST_H_ #define SRC_BERGAMOT_REQUEST_H_ @@ -37,23 +36,10 @@ namespace marian { namespace bergamot { class Request { -private: - unsigned int Id_; - int lineNumberBegin_; - std::string source_; - std::atomic counter_; - std::vector> *vocabs_; - - Segments segments_; - std::vector sourceAlignments_; - std::vector> histories_; - - std::promise response_; - public: Request(unsigned int Id, int lineNumberBegin, std::vector> &vocabs_, std::string &&source, - Segments &&segments, std::vector &&sourceAlignments, + Segments &&segments, std::vector &&sourceTokenRanges, std::promise responsePromise); // Obtain the count of tokens in the segment correponding to index. Used to @@ -68,7 +54,8 @@ class Request { // several requests. Segment getSegment(size_t index) const; - // For notions of priority among requests (used to enable in Batcher). + // For notions of priority among requests, used to enable std::set in + // Batcher. bool operator<(const Request &request) const; // Processes a history obtained after translating in a heterogenous batch @@ -77,20 +64,60 @@ class Request { // On completion of last segment, sets value of the promise. void completeRequest(); + +private: + unsigned int Id_; + int lineNumberBegin_; + + // Multiple translation-workers can concurrently access the same Request. The + // following atomic atomically operates on the variable holding sentences + // remaining to be translated. + std::atomic counter_; + + // source_ holds the source string to be translated. segments_ hold the + // sentences generated from source_ in vector. sourceTokenRanges_ are + // string_views of the text corresponding to these words, pointing to + // sequences in source_. histories_ is a buffer which eventually stores the + // translations of each segment in the corresponding index. 
+ std::string source_; + Segments segments_; + std::vector sourceTokenRanges_; + std::vector> histories_; + + // Members above are moved into newly constructed Response on completion + // of translation of all segments. The promise below is set to this Response + // value. future to this promise is made available to the user through + // Service. + std::promise response_; + + // Constructing Response requires the vocabs_ used to generate Request. + std::vector> *vocabs_; }; class RequestSentence { -private: - size_t index_; - Ptr request_; + // A RequestSentence provides a view to a sentence within a Request. Existence + // of this class allows the sentences and associated information to be kept + // within Request. public: RequestSentence(size_t, Ptr); size_t numTokens() const; + + // lineNumber in Request, used for matching marian-decoder. SentenceTuple + // requires lineNumber to be set for Corpus based batches. size_t lineNumber() const; + + // Accessor to the segment represented by the RequestSentence. Segment getUnderlyingSegment() const; + + // Forwards call to Request, checking for completion. void completeSentence(Ptr history); + friend bool operator<(const RequestSentence &a, const RequestSentence &b); + +private: + size_t index_; + Ptr request_; }; typedef std::vector RequestSentences; @@ -98,47 +125,48 @@ typedef std::vector RequestSentences; class Batch { public: Batch() { reset(); } - void reset() { - Id_ = 0; - sentences_.clear(); - } - // Convenience function to determine poison. - bool isPoison() { return (Id_ == -1); } + // Reset is required to reuse the same batch by consumer. + void reset(); + + // Methods to construct and determine poison. static Batch poison() { Batch poison_; poison_.Id_ = -1; return poison_; } + bool isPoison() const { return (Id_ == -1); } + + size_t size() const { return sentences_.size(); } + + // Accessors to load data into a batch. Use add(...) to add sentences into a + // batch. 
Once complete with a legal batch, use setId to set Id_ accordingly. + // setId only allows setting Id > 0. For use in Batcher, which acts as a + // producer to a PCQueue holding "Batch"es. + // + // Id_ = + // -1 : Batch::Poison + // 0 : Empty Batch + // >0 : Legal batch containing sentences + + void add(const RequestSentence &sentence); + void setId(int Id); + + // Accessors to read from a Batch. For use in BatchTranslator (consumer on a + // PCQueue holding batches). + // + // sentences() are used to access sentences to construct marian internal + // batch. + const RequestSentences &sentences() { return sentences_; } - void log() { - int numTokens{0}, maxLength{0}; - for (auto &sentence : sentences_) { - numTokens += sentence.numTokens(); - maxLength = std::max(maxLength, static_cast(sentence.numTokens())); - } - - LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_, - numTokens, maxLength, sentences_.size()); - } - - void add(const RequestSentence &sentence) { sentences_.push_back(sentence); } - - size_t size() { return sentences_.size(); } - - void setId(int Id) { - assert(Id > 0); - Id_ = Id; - if (Id % 500 == 0) { - log(); - } - } + // On obtaining Histories after translating a batch, completeBatch can be + // called with Histories , which forwards the call to Request through + // RequestSentence and triggers completion, by setting the promised value to + // the future given to client. + void completeBatch(const Histories &histories); - const RequestSentences &sentences() { return sentences_; } - void completeBatch(const Histories &histories) { - for (int i = 0; i < sentences_.size(); i++) { - sentences_[i].completeSentence(histories[i]); - } - } + // Convenience function to log batch-statistics. numTokens, max-length. + // TODO(jerinphilip): Use to log and report packing efficiency. 
+  void log();

 private:
   int Id_;
diff --git a/src/translator/service.cpp b/src/translator/service.cpp
index 96f391c2d..2163eefb9 100644
--- a/src/translator/service.cpp
+++ b/src/translator/service.cpp
@@ -56,8 +56,8 @@ std::future<Response> Service::translate(std::string &&input) {
   // returns future corresponding to the promise.

   Segments segments;
-  std::vector<TokenRanges> sourceAlignments;
-  text_processor_.process(input, segments, sourceAlignments);
+  std::vector<TokenRanges> sourceTokenRanges;
+  text_processor_.process(input, segments, sourceTokenRanges);

   std::promise<Response> responsePromise;
   auto future = responsePromise.get_future();
@@ -65,7 +65,7 @@ std::future<Response> Service::translate(std::string &&input) {
   Ptr<Request> request = New<Request>(requestId_++, /* lineNumberBegin = */ 0, vocabs_,
                                      std::move(input), std::move(segments),
-                                     std::move(sourceAlignments), std::move(responsePromise));
+                                     std::move(sourceTokenRanges), std::move(responsePromise));

   batcher_.addWholeRequest(request);

From 921c2eedf812b3304a06ebfad890fb025755c2a0 Mon Sep 17 00:00:00 2001
From: Abhishek Aggarwal
Date: Tue, 16 Feb 2021 14:21:46 +0100
Subject: [PATCH 114/442] Updated config for min inference time

- This combination gives min inference time (~ 200 WPS) on local machine
---
 wasm/test_page/bergamot.html | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html
index 8fc7824e1..d91a9a160 100644
--- a/wasm/test_page/bergamot.html
+++ b/wasm/test_page/bergamot.html
@@ -75,17 +75,25 @@ beam-size: 1
 normalize: 1.0
 word-penalty: 0
-mini-batch: 32
-maxi-batch: 100
-maxi-batch-sort: src
+max-input-sentence-tokens: 128
+max-input-tokens: 1024
 workspace: 128
 max-length-factor: 2.0
 skip-cost: true
+cpu-threads: 1
+quiet: true
+quiet-translation: true
 shortlist:
     - /lex.${lang}.s2t
     - 50
     - 50
 `;
+/*
+This config is not valid anymore in new APIs
+mini-batch: 32
+maxi-batch: 100
+maxi-batch-sort: src
+*/
 // TODO: Use in model config when wormhole is enabled:
 // gemm-precision: 
int8shift // TODO: Use in model config when loading of binary models is supported and we use model.intgemm.alphas.bin: From b1e72ce75e2bce611b6dee11408278f8b3e3e4ec Mon Sep 17 00:00:00 2001 From: Motin Date: Tue, 16 Feb 2021 15:46:15 +0200 Subject: [PATCH 115/442] Updated instructions on how to get all relevant models in place for the upcoming release --- README.md | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 4bff753d1..0d55686ff 100644 --- a/README.md +++ b/README.md @@ -52,17 +52,10 @@ Files packaged this way are preloaded in the root of the virtual file system. To package the set of files expected by the test page: ```bash -git clone https://github.com/browsermt/students -cd students/esen/ -./download-models.sh -cp esen.student.tiny11/lex.s2t ../../models/lex.esen.s2t -cp esen.student.tiny11/model.npz ../../models/model.esen.npz -cp esen.student.tiny11/vocab.esen.spm ../../models/vocab.esen.spm -cd - -cd students/enes/ -./download-models.sh -cp enes.student.tiny11/lex.s2t ../../models/lex.enes.s2t -cp enes.student.tiny11/model.npz ../../models/model.enes.npz +mkdir models +git clone https://github.com/motin/bergamot-models +cp -r bergamot-models/* models +gunzip models/*/* ``` After Editing Files: From d907400a80d59cac771dad7f31d67bcb67411270 Mon Sep 17 00:00:00 2001 From: Motin Date: Tue, 16 Feb 2021 17:00:45 +0200 Subject: [PATCH 116/442] Updated test page to use the model structure from bergamot-models repo --- wasm/test_page/bergamot.html | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/wasm/test_page/bergamot.html b/wasm/test_page/bergamot.html index d91a9a160..795654495 100644 --- a/wasm/test_page/bergamot.html +++ b/wasm/test_page/bergamot.html @@ -64,14 +64,20 @@ + diff --git a/wasm/test_page/bergamot.js b/wasm/test_page/bergamot.js new file mode 100644 index 000000000..e586b213c --- /dev/null +++ b/wasm/test_page/bergamot.js @@ -0,0 +1,48 @@ +var 
worker; + +if (window.Worker) { + var worker = new Worker('worker.js'); + worker.postMessage(["load_module"]); +} + +const log = (message) => { + document.querySelector("#log").value += message + "\n"; +} + +document.querySelector("#translate").addEventListener("click", () => { + translateCall(); +}); + +document.querySelector("#from").addEventListener('keyup', function(event) { + if (event.keyCode === 13) { + translateCall(); + } +}); + +document.querySelector("#load").addEventListener("click", async() => { + document.querySelector("#load").disabled = true; + const lang = document.querySelector('input[name="modellang"]:checked').value; + const from = lang.substring(0, 2); + const to = lang.substring(2, 4); + let start = Date.now(); + worker.postMessage(["load_model", from, to]); + document.querySelector("#load").disabled = false; +}); + +const translateCall = () => { + const text = document.querySelector('#from').value; + const paragraphs = text.split("\n"); + + worker.postMessage(["translate", paragraphs]); +} + +worker.onmessage = function(e) { + console.debug(`Message received from worker`); + if (e.data[0] === 'translated_result') { + document.querySelector('#to').value = e.data[1].join("\n"); + log(e.data[2]); + } + if ((e.data[0] === 'module_loaded') || (e.data[0] === 'model_loaded')) { + log(e.data[1]); + } +} \ No newline at end of file diff --git a/wasm/test_page/worker.js b/wasm/test_page/worker.js new file mode 100644 index 000000000..329081011 --- /dev/null +++ b/wasm/test_page/worker.js @@ -0,0 +1,243 @@ +var translationService, responseOptions, input = undefined; +const BERGAMOT_TRANSLATOR_MODULE = "bergamot-translator-worker.js"; + +const encoder = new TextEncoder(); // string to utf-8 converter +const decoder = new TextDecoder(); // utf-8 to string converter + +const start = Date.now(); +let moduleLoadStart; +var Module = { + preRun: [function() { + log(`Time until Module.preRun: ${(Date.now() - start) / 1000} secs`); + moduleLoadStart = 
Date.now(); + }], + onRuntimeInitialized: function() { + log(`Wasm Runtime initialized (preRun -> onRuntimeInitialized) in ${(Date.now() - moduleLoadStart) / 1000} secs`); + } +}; + +const log = (message) => { + console.debug(message); +} + +onmessage = async function(e) { + let command = e.data[0]; + log(`Message '${command}' received from main script`); + let result = ""; + if (command === 'load_module') { + importScripts(BERGAMOT_TRANSLATOR_MODULE); + result = `Translator wasm module successfully loaded`; + log(result); + log('Posting message back to main script'); + postMessage(['module_loaded', result]); + } + else if (command === 'load_model') { + let start = Date.now(); + await constructTranslationService(e.data[1], e.data[2]); + result = `translation model '${e.data[1]}${e.data[2]}' successfully loaded; took ${(Date.now() - start) / 1000} secs`; + log(result); + log('Posting message back to main script'); + postMessage(['model_loaded', result]); + } + else if (command === 'translate') { + const inputParagraphs = e.data[1]; + let inputWordCount = 0; + inputParagraphs.forEach(sentence => { + inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; + }) + + let start = Date.now(); + const translatedParagraphs = translate(e.data[1]); + const secs = (Date.now() - start) / 1000; + result = `Translation of (${inputWordCount}) words took ${secs} secs (${Math.round(inputWordCount / secs)} words per second)`; + log(result); + log('Posting message back to main script'); + postMessage(['translated_result', translatedParagraphs, result]); + } +} + +// This function downloads file from a url and returns the array buffer +const downloadAsArrayBuffer = async(url) => { + const response = await fetch(url); + if (!response.ok) { + throw Error(`Downloading ${url} failed: HTTP ${response.status} - ${response.statusText}`); + } + return response.arrayBuffer(); +} + +// This function constructs and initializes the AlignedMemory from the array buffer 
and alignment size +const prepareAlignedMemoryFromBuffer = async (buffer, alignmentSize) => { + var byteArray = new Int8Array(buffer); + log(`Constructing Aligned memory with size: ${byteArray.byteLength} bytes with alignment: ${alignmentSize}`); + var alignedMemory = new Module.AlignedMemory(byteArray.byteLength, alignmentSize); + log(`Aligned memory construction done`); + const alignedByteArrayView = alignedMemory.getByteArrayView(); + alignedByteArrayView.set(byteArray); + log(`Aligned memory initialized`); + return alignedMemory; +} + +const constructTranslationService = async (from, to) => { + const languagePair = `${from}${to}`; + + // Vocab files are re-used in both translation directions + const vocabLanguagePair = from === "en" ? `${to}${from}` : languagePair; + + // Set the Model Configuration as YAML formatted string. + // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ + /*const modelConfig = `models: + - /${languagePair}/model.${languagePair}.intgemm.alphas.bin + vocabs: + - /${languagePair}/vocab.${vocabLanguagePair}.spm + - /${languagePair}/vocab.${vocabLanguagePair}.spm + beam-size: 1 + normalize: 1.0 + word-penalty: 0 + max-length-break: 128 + mini-batch-words: 1024 + workspace: 128 + max-length-factor: 2.0 + skip-cost: true + cpu-threads: 0 + quiet: true + quiet-translation: true + shortlist: + - /${languagePair}/lex.${languagePair}.s2t + - 50 + - 50 + `; + */ + + // TODO: gemm-precision: int8shiftAlphaAll (for the models that support this) + // DONOT CHANGE THE SPACES BETWEEN EACH ENTRY OF CONFIG + const modelConfig = `beam-size: 1 +normalize: 1.0 +word-penalty: 0 +max-length-break: 128 +mini-batch-words: 1024 +workspace: 128 +max-length-factor: 2.0 +skip-cost: true +cpu-threads: 0 +quiet: true +quiet-translation: true +gemm-precision: int8shift +`; + + const modelFile = `models/${languagePair}/model.${languagePair}.intgemm.alphas.bin`; + const shortlistFile = 
`models/${languagePair}/lex.50.50.${languagePair}.s2t.bin`; + const vocabFiles = [`models/${languagePair}/vocab.${vocabLanguagePair}.spm`, + `models/${languagePair}/vocab.${vocabLanguagePair}.spm`]; + + const uniqueVocabFiles = new Set(vocabFiles); + log(`modelFile: ${modelFile}\nshortlistFile: ${shortlistFile}\nNo. of unique vocabs: ${uniqueVocabFiles.size}`); + uniqueVocabFiles.forEach(item => log(`unique vocabFile: ${item}`)); + + try { + // Download the files as buffers from the given urls + let start = Date.now(); + const downloadedBuffers = await Promise.all([downloadAsArrayBuffer(modelFile), downloadAsArrayBuffer(shortlistFile)]); + const modelBuffer = downloadedBuffers[0]; + const shortListBuffer = downloadedBuffers[1]; + + const downloadedVocabBuffers = []; + for (let item of uniqueVocabFiles.values()) { + downloadedVocabBuffers.push(await downloadAsArrayBuffer(item)); + } + log(`All files for ${languagePair} language pair took ${(Date.now() - start) / 1000} secs to download`); + + // Construct AlignedMemory objects with downloaded buffers + let constructedAlignedMemories = await Promise.all([prepareAlignedMemoryFromBuffer(modelBuffer, 256), + prepareAlignedMemoryFromBuffer(shortListBuffer, 64)]); + let alignedModelMemory = constructedAlignedMemories[0]; + let alignedShortlistMemory = constructedAlignedMemories[1]; + let alignedVocabsMemoryList = new Module.AlignedMemoryList; + for(let item of downloadedVocabBuffers) { + let alignedMemory = await prepareAlignedMemoryFromBuffer(item, 64); + alignedVocabsMemoryList.push_back(alignedMemory); + } + log(`Aligned vocab memories: ${alignedVocabsMemoryList.get(0).size()}`); + log(`Aligned model memory: ${alignedModelMemory.size()}`); + log(`Aligned shortlist memory: ${alignedShortlistMemory.size()}`); + + // Instantiate the Translation Service + if (translationService) { + translationService.delete(); + translationService = undefined; + } + + log(`Creating Translation Service with config: ${modelConfig}`); + 
translationService = new Module.Service(modelConfig, alignedModelMemory, alignedShortlistMemory, alignedVocabsMemoryList); + if (typeof translationService === 'undefined') { + throw Error(`Translation Service construction failed`); + } + } catch (error) { + log(error); + } + } + +const translate = (paragraphs) => { + // Instantiate the arguments of translate() API i.e. ResponseOptions and input (vector) + var responseOptions = new Module.ResponseOptions(); + let input = new Module.VectorString; + + // Initialize the input + paragraphs.forEach(paragraph => { + // prevent empty paragraph - it breaks the translation + if (paragraph.trim() === "") { + return; + } + input.push_back(paragraph.trim()) + }) + // Access input (just for debugging) + log(`Input size: ${input.size()}`); + + // Translate the input, which is a vector; the result is a vector + let result = translationService.translate(input, responseOptions); + + const translatedParagraphs = []; + const translatedSentencesOfParagraphs = []; + const sourceSentencesOfParagraphs = []; + for (let i = 0; i < result.size(); i++) { + translatedParagraphs.push(result.get(i).getTranslatedText()); + translatedSentencesOfParagraphs.push(getAllTranslatedSentencesOfParagraph(result.get(i))); + sourceSentencesOfParagraphs.push(getAllSourceSentencesOfParagraph(result.get(i))); + } + log({ translatedParagraphs }); + log({ translatedSentencesOfParagraphs }); + log({ sourceSentencesOfParagraphs }); + + responseOptions.delete(); + input.delete(); + return translatedParagraphs; +} + +// This function extracts all the translated sentences from the Response and returns them. 
+const getAllTranslatedSentencesOfParagraph = (response) => {
+  const sentences = [];
+  const text = response.getTranslatedText();
+  for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) {
+    const utf8SentenceByteRange = response.getTranslatedSentence(sentenceIndex);
+    sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange));
+  }
+  return sentences;
+}
+
+// This function extracts all the source sentences from the Response and returns them.
+const getAllSourceSentencesOfParagraph = (response) => {
+  const sentences = [];
+  const text = response.getOriginalText();
+  for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) {
+    const utf8SentenceByteRange = response.getSourceSentence(sentenceIndex);
+    sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange));
+  }
+  return sentences;
+}
+
+// This function returns a substring of text (a string). The substring is represented by
+// byteRange (begin and end indices) within the utf-8 encoded version of the text. 
+const _getSentenceFromByteRange = (text, byteRange) => { + const utf8BytesView = encoder.encode(text); + const utf8SentenceBytes = utf8BytesView.subarray(byteRange.begin, byteRange.end); + return decoder.decode(utf8SentenceBytes); +} From ff391c6f0052c1fda54c77c6bab39ddfc9377455 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 24 Aug 2021 12:35:21 +0200 Subject: [PATCH 285/442] Updated marian submodule to latest commit of master --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 6087379f2..62bac858b 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 6087379f2ee7fb3062a82a6129ff81ca5fe56eed +Subproject commit 62bac858bfd37060beb707d12eb9711649ea4cf6 From cafb65e0b5df4b48be10b1788c308fb827dffdb3 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 24 Aug 2021 18:03:38 +0200 Subject: [PATCH 286/442] Wasm builds without SharedArrayBuffer --- CMakeLists.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index c58ddd4ff..a9586d8e5 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -87,9 +87,8 @@ endif() if(COMPILE_WASM) set(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") - list(APPEND WASM_COMPILE_FLAGS -pthread -O3 -g2 -fPIC -mssse3 -msimd128) + list(APPEND WASM_COMPILE_FLAGS -O3 -g2 -fPIC -mssse3 -msimd128) list(APPEND WASM_COMPILE_FLAGS "SHELL:-s WASM=1" "SHELL:-s ASSERTIONS=0" "SHELL:-s DISABLE_EXCEPTION_CATCHING=1" "SHELL:-s LLD_REPORT_UNDEFINED" "SHELL:-s FORCE_FILESYSTEM=1" "SHELL:-s ALLOW_MEMORY_GROWTH=1") - list(APPEND WASM_COMPILE_FLAGS -Wno-error=pthreads-mem-growth) endif(COMPILE_WASM) # Needs to be enabled before including the folder containing tests (src/tests) From 8e4374282a720c605bb9856dcd564d2fcec09baf Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 31 Aug 2021 15:45:14 +0200 
Subject: [PATCH 287/442] Circle CI wasm artifacts for non-wormhole builds --- .circleci/config.yml | 41 ++++++++++++++++++++++++++++++-- build-wasm.sh | 56 ++++++++++++++++++++++++++++++++++---------- 2 files changed, 82 insertions(+), 15 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 69ae35686..9b14ed154 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -1,6 +1,37 @@ version: 2.1 jobs: - build: + build-with-wormhole: + docker: + - image: 'emscripten/emsdk:2.0.9' + resource_class: medium + + working_directory: ~/checkout + + steps: + - checkout + + - run: + name: Build WASM + command: bash build-wasm.sh WORMHOLE + + - run: + name: Check artifacts + working_directory: build-wasm + command: | + ls -all bergamot* + if ls bergamot*.wasm &>/dev/null && ls bergamot*.js &>/dev/null + then + echo "Artifacts Successfully Generated" + else + echo "Failure: Artifacts Not Present" + exit 1 + fi + + - store_artifacts: + path: "build-wasm" + destination: "wasm-wormhole" + + build-without-wormhole: docker: - image: 'emscripten/emsdk:2.0.9' resource_class: medium @@ -29,4 +60,10 @@ jobs: - store_artifacts: path: "build-wasm" - destination: "build-wasm" + destination: "wasm-without-wormhole" + +workflows: + build: + jobs: + - build-with-wormhole + - build-without-wormhole \ No newline at end of file diff --git a/build-wasm.sh b/build-wasm.sh index d3cd9d1db..7da2685cf 100755 --- a/build-wasm.sh +++ b/build-wasm.sh @@ -1,15 +1,38 @@ #!/usr/bin/env bash - -# Usage: ./build-wasm.sh - set -e set -x +# Usage +Usage="Build translator to wasm (with/without wormhole). + +Usage: $(basename "$0") [WORMHOLE] + + where: + WORMHOLE An optional string argument + - when specified on command line, builds wasm artifacts with wormhole + - when not specified (the default behaviour), builds wasm artifacts without wormhole." 
+ +if [ "$#" -gt 1 ]; then + echo "Illegal number of parameters passed" + echo "$Usage" + exit +fi + +WORMHOLE=false + +if [ "$#" -eq 1 ]; then + if [ "$1" = "WORMHOLE" ]; then + WORMHOLE=true + else + echo "Illegal parameter passed" + echo "$Usage" + exit + fi +fi + # Run script from the context of the script-containing directory cd "$(dirname $0)" -# This file replicates the instructions found in ./README.md under "Build WASM" - # Prerequisite: Download and Install Emscripten using following instructions (unless the EMSDK env var is already set) if [ "$EMSDK" == "" ]; then EMSDK_UPDATE_REQUIRED=0 @@ -36,17 +59,24 @@ if [ "$EMSDK" == "" ]; then fi # Compile -# 1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile -if [ ! -d "build-wasm" ]; then - mkdir build-wasm +# 1. Create a folder where you want to build all the artifacts and compile +BUILD_DIRECTORY="build-wasm" +if [ ! -d ${BUILD_DIRECTORY} ]; then + mkdir ${BUILD_DIRECTORY} +fi +cd ${BUILD_DIRECTORY} + +if [ "$WORMHOLE" = true ]; then + emcmake cmake -DCOMPILE_WASM=on ../ +else + emcmake cmake -DCOMPILE_WASM=on -DWORMHOLE=off ../ fi -cd build-wasm -emcmake cmake -DCOMPILE_WASM=on ../ emmake make -j2 # 2. Enable SIMD Wormhole via Wasm instantiation API in generated artifacts -bash ../wasm/patch-artifacts-enable-wormhole.sh - -# The artifacts (.js and .wasm files) will be available in the build directory ("build-wasm" in this case). 
+if [ "$WORMHOLE" = true ]; then + bash ../wasm/patch-artifacts-enable-wormhole.sh +fi +# The artifacts (.js and .wasm files) will be available in the build directory exit 0 From 48e955c4685c6a244626416e4f0c061e4bc8ce7e Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 7 Sep 2021 19:10:41 +0100 Subject: [PATCH 288/442] BRT: Update sacrebleu to get tests back working (#217) Co-authored-by: Nikolay Bogoychev --- bergamot-translator-tests | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bergamot-translator-tests b/bergamot-translator-tests index ee534f750..2b1a1700e 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit ee534f7507966efe3199ac84e56bdd4b3950b736 +Subproject commit 2b1a1700e397934ba68746cb8ff9b251681d9eac From 63120c174e3edfd664175d4a2be095d8b50a112f Mon Sep 17 00:00:00 2001 From: Andre Barbosa Date: Thu, 16 Sep 2021 12:28:40 -0300 Subject: [PATCH 289/442] QualityEstimation: Preliminary Implementation (#197) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. 
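The space-based subword-to-word regrouping described above can be sketched roughly as follows. This is a simplified JavaScript illustration, not the repository's C++ implementation; the function name `groupSubwordScores` and its inputs are invented for this sketch, assuming a SentencePiece-style tokenization where a leading space marks the start of a new word:

```javascript
// Simplified sketch: merge subword pieces back into words by scanning for a
// leading space on each piece, then score each word as the average of the
// log-probs of its subwords. All names here are invented for illustration.
function groupSubwordScores(pieces, logProbs) {
  const words = [];
  const perWordLogProbs = [];
  pieces.forEach((piece, i) => {
    // The very first piece, or any piece with a leading space, starts a word.
    const startsNewWord = i === 0 || piece.startsWith(" ");
    if (startsNewWord) {
      words.push(piece.trimStart());
      perWordLogProbs.push([logProbs[i]]);
    } else {
      // Otherwise the piece continues the previous word.
      words[words.length - 1] += piece;
      perWordLogProbs[perWordLogProbs.length - 1].push(logProbs[i]);
    }
  });
  // Word score = mean of its subword log-probs.
  return words.map((word, i) => ({
    word,
    score: perWordLogProbs[i].reduce((a, b) => a + b, 0) / perWordLogProbs[i].length,
  }));
}
```

For example, pieces `["Hel", "lo", " wor", "ld"]` regroup into the two words `Hello` and `world`, each scored by averaging its two subword log-probs — which is why the client sees quality annotations on human-readable words rather than on model subwords.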
The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. 
Dos Santos Co-authored-by: Jerin Philip --- .gitignore | 3 + Doxyfile.in | 4 +- bergamot-translator-tests | 2 +- doc/conf.py | 2 +- src/tests/apps.cpp | 31 +++ src/tests/apps.h | 6 + src/tests/cli.cpp | 6 +- src/tests/units/CMakeLists.txt | 2 +- src/tests/units/quality_estimator_tests.cpp | 62 +++++ src/tests/units/quality_estimator_tests.h | 5 + src/translator/CMakeLists.txt | 1 + src/translator/byte_array_util.cpp | 18 ++ src/translator/byte_array_util.h | 2 + src/translator/definitions.h | 2 + src/translator/parser.h | 2 + src/translator/quality_estimator.cpp | 288 ++++++++++++++++++++ src/translator/quality_estimator.h | 222 +++++++++++++++ src/translator/response.h | 25 +- src/translator/response_builder.cpp | 16 +- src/translator/response_builder.h | 12 +- src/translator/response_options.h | 11 - src/translator/service.cpp | 6 +- src/translator/service.h | 7 +- 23 files changed, 686 insertions(+), 49 deletions(-) create mode 100644 src/tests/units/quality_estimator_tests.cpp create mode 100644 src/tests/units/quality_estimator_tests.h create mode 100644 src/translator/quality_estimator.cpp create mode 100644 src/translator/quality_estimator.h diff --git a/.gitignore b/.gitignore index 840e69ab8..49093ba25 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,6 @@ wasm/test_page/node_modules build-wasm models wasm/test_page/bergamot-translator-worker.* + +# VSCode +.vscode diff --git a/Doxyfile.in b/Doxyfile.in index 88948e2ad..7b69eb8c5 100644 --- a/Doxyfile.in +++ b/Doxyfile.in @@ -1533,7 +1533,7 @@ FORMULA_TRANSPARENT = YES # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. -USE_MATHJAX = NO +USE_MATHJAX = YES # When MathJax is enabled you can set the default output format to be used for # the MathJax output. See the MathJax site (see: @@ -1556,7 +1556,7 @@ MATHJAX_FORMAT = HTML-CSS # The default value is: http://cdn.mathjax.org/mathjax/latest. # This tag requires that the tag USE_MATHJAX is set to YES. 
-MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest +MATHJAX_RELPATH = https://cdn.jsdelivr.net/npm/mathjax@3 # The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax # extension names that should be enabled during MathJax rendering. For example diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 2b1a1700e..53c6e42a9 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 2b1a1700e397934ba68746cb8ff9b251681d9eac +Subproject commit 53c6e42a97e512698711068d0be3c208359b1801 diff --git a/doc/conf.py b/doc/conf.py index bffcda0cd..8a8f4224c 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -37,7 +37,7 @@ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ - 'sphinx.ext.imgmath', + 'sphinx.ext.mathjax', 'sphinx.ext.todo', 'breathe', 'exhale', diff --git a/src/tests/apps.cpp b/src/tests/apps.cpp index b42f7a495..991d3c3fd 100644 --- a/src/tests/apps.cpp +++ b/src/tests/apps.cpp @@ -55,6 +55,37 @@ void annotatedTextSentences(Ptr options, bool source) { } } +void qualityEstimatorWords(const Ptr &options) { + ResponseOptions responseOptions; + responseOptions.qualityScores = true; + const Response response = translateFromStdin(options, responseOptions); + + for (const auto &sentenceQualityEstimate : response.qualityScores) { + std::cout << "[SentenceBegin]\n"; + + for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) { + const string_view word(response.target.text.data() + wordByteRange.begin, wordByteRange.size()); + std::cout << word << "\n"; + } + std::cout << "[SentenceEnd]\n\n"; + } +} + +void qualityEstimatorScores(const Ptr &options) { + ResponseOptions responseOptions; + responseOptions.qualityScores = true; + const Response response = translateFromStdin(options, responseOptions); + + for (const auto &sentenceQualityEstimate : response.qualityScores) { + std::cout << std::fixed << std::setprecision(3) << 
sentenceQualityEstimate.sentenceScore << "\n";
+
+    for (const float &wordScore : sentenceQualityEstimate.wordScores) {
+      std::cout << std::fixed << std::setprecision(3) << wordScore << "\n";
+    }
+    std::cout << "\n";
+  }
+}
+
 }  // namespace testapp
 }  // namespace bergamot
 }  // namespace marian
diff --git a/src/tests/apps.h b/src/tests/apps.h
index b380b5782..deb6a12dc 100644
--- a/src/tests/apps.h
+++ b/src/tests/apps.h
@@ -33,6 +33,12 @@ void annotatedTextWords(Ptr<Options> options, bool source = true);
 // in each line, depending on source = true or false respectively.
 void annotatedTextSentences(Ptr<Options> options, bool source = true);

+// Reads from stdin and translates the read content. Prints the quality words for each sentence.
+void qualityEstimatorWords(const Ptr<Options>& options);
+
+// Reads from stdin and translates the read content. Prints the quality scores for each sentence.
+void qualityEstimatorScores(const Ptr<Options>& options);
+
 }  // namespace testapp
 }  // namespace bergamot
 }  // namespace marian
diff --git a/src/tests/cli.cpp b/src/tests/cli.cpp
index 4ecb24e02..0e9469ab0 100644
--- a/src/tests/cli.cpp
+++ b/src/tests/cli.cpp
@@ -12,8 +12,10 @@ int main(int argc, char *argv[]) {
     testapp::annotatedTextSentences(options, /*source=*/false);
   } else if (mode == "test-response-source-words") {
     testapp::annotatedTextWords(options, /*source=*/true);
-  } else if (mode == "test-response-target-words") {
-    testapp::annotatedTextWords(options, /*source=*/false);
+  } else if (mode == std::string("test-quality-estimator-words")) {
+    testapp::qualityEstimatorWords(options);
+  } else if (mode == std::string("test-quality-estimator-scores")) {
+    testapp::qualityEstimatorScores(options);
   } else {
     ABORT("Unknown --mode {}. 
Please run a valid test", mode);
   }
diff --git a/src/tests/units/CMakeLists.txt b/src/tests/units/CMakeLists.txt
index 5c1bc003c..4794badcd 100644
--- a/src/tests/units/CMakeLists.txt
+++ b/src/tests/units/CMakeLists.txt
@@ -1,7 +1,7 @@
 # Unit tests
 set(UNIT_TESTS
     annotation_tests
-)
+    quality_estimator_tests)

 foreach(test ${UNIT_TESTS})
   add_executable("run_${test}" run_tests.cpp "${test}.cpp")
diff --git a/src/tests/units/quality_estimator_tests.cpp b/src/tests/units/quality_estimator_tests.cpp
new file mode 100644
index 000000000..e11c07a7b
--- /dev/null
+++ b/src/tests/units/quality_estimator_tests.cpp
@@ -0,0 +1,62 @@
+#include "quality_estimator_tests.h"
+
+#include "catch.hpp"
+#include "translator/quality_estimator.h"
+
+using namespace marian::bergamot;
+
+SCENARIO("Logistic Regressor test", "[QualityEstimator]") {
+  GIVEN("A feature matrix") {
+    const std::vector<std::vector<float>> features = {{-0.3, -0.3, 1.0, -0.183683336},
+                                                      {-0.0001, -0.0001, 1.0, -0.183683336},
+                                                      {-0.002, -0.002, 1.0, -0.183683336},
+                                                      {-0.5, -0.5, 1.0, -0.183683336},
+                                                      {-0.15, -0.2, 2.0, -0.183683336}};
+
+    LogisticRegressorQualityEstimator::Matrix featureMatrix(features.size(), features.begin()->size());
+
+    for (int i = 0; i < features.size(); ++i) {
+      for (int j = 0; j < features.begin()->size(); ++j) {
+        featureMatrix.at(i, j) = features[i][j];
+      }
+    }
+
+    AND_GIVEN("A LogistRegressor") {
+      LogisticRegressorQualityEstimator::Array coefficients = {0.99000001, 0.899999976, -0.200000003, 0.5};
+      const float intercept = {-0.300000012};
+
+      LogisticRegressorQualityEstimator::Scale scale;
+      scale.stds = {0.200000003, 0.300000012, 2.5, 0.100000001};
+      scale.means = {-0.100000001, -0.769999981, 5, -0.5};
+
+      LogisticRegressorQualityEstimator lrQE(std::move(scale), std::move(coefficients), intercept);
+
+      WHEN("It's call predict") {
+        const std::vector<float> prediction = lrQE.predict(featureMatrix);
+
+        THEN("return the prediction") {
+          CHECK(prediction == std::vector<float>{-2.14596, -4.41793, -4.403, -0.93204, 
-3.03343});
+        }
+      }
+
+      WHEN("LR is construct by aligned memory") {
+        const auto lrQEAlignedMemory = LogisticRegressorQualityEstimator::fromAlignedMemory(lrQE.toAlignedMemory());
+
+        WHEN("It's call predict") {
+          const std::vector<float> prediction = lrQEAlignedMemory.predict(featureMatrix);
+
+          THEN("return the prediction") {
+            CHECK(prediction == std::vector<float>{-2.14596, -4.41793, -4.403, -0.93204, -3.03343});
+          }
+        }
+      }
+    }
+  }
+}
+
+bool operator==(const std::vector<float>& value1, const std::vector<float>& value2) {
+  return std::equal(value1.begin(), value1.end(), value2.begin(), value2.end(), [](const auto& a, const auto& b) {
+    auto value = Approx(b).epsilon(0.001);
+    return a == value;
+  });
+}
diff --git a/src/tests/units/quality_estimator_tests.h b/src/tests/units/quality_estimator_tests.h
new file mode 100644
index 000000000..37cba3ef3
--- /dev/null
+++ b/src/tests/units/quality_estimator_tests.h
@@ -0,0 +1,5 @@
+#pragma once
+
+#include <vector>
+
+bool operator==(const std::vector<float>& value1, const std::vector<float>& value2);
diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt
index 34e599ba6..c0ee6be7a 100644
--- a/src/translator/CMakeLists.txt
+++ b/src/translator/CMakeLists.txt
@@ -9,6 +9,7 @@ add_library(bergamot-translator STATIC
     request.cpp
     batcher.cpp
     response_builder.cpp
+    quality_estimator.cpp
     batch.cpp
     annotation.cpp
     service.cpp
diff --git a/src/translator/byte_array_util.cpp b/src/translator/byte_array_util.cpp
index 3790a01a9..83d06acb9 100644
--- a/src/translator/byte_array_util.cpp
+++ b/src/translator/byte_array_util.cpp
@@ -124,12 +124,30 @@ void getVocabsMemoryFromConfig(marian::Ptr<marian::Options> options,
   }
 }

+AlignedMemory getQualityEstimatorModel(const marian::Ptr<marian::Options>& options) {
+  const auto qualityEstimatorPath = options->get<std::string>("quality", "");
+  if (qualityEstimatorPath.empty()) {
+    return {};
+  }
+  return loadFileToMemory(qualityEstimatorPath, 64);
+}
+
+AlignedMemory getQualityEstimatorModel(MemoryBundle& memoryBundle, const marian::Ptr<marian::Options>& options) {
+  if 
(memoryBundle.qualityEstimatorMemory.size() == 0) { + return getQualityEstimatorModel(options); + } + + return std::move(memoryBundle.qualityEstimatorMemory); +} + MemoryBundle getMemoryBundleFromConfig(marian::Ptr<Options> options) { MemoryBundle memoryBundle; memoryBundle.model = getModelMemoryFromConfig(options); memoryBundle.shortlist = getShortlistMemoryFromConfig(options); getVocabsMemoryFromConfig(options, memoryBundle.vocabs); memoryBundle.ssplitPrefixFile = getSsplitPrefixFileMemoryFromConfig(options); + memoryBundle.qualityEstimatorMemory = getQualityEstimatorModel(options); + return memoryBundle; } diff --git a/src/translator/byte_array_util.h b/src/translator/byte_array_util.h index 04cbf9ee9..b445b3dec 100644 --- a/src/translator/byte_array_util.h +++ b/src/translator/byte_array_util.h @@ -6,6 +6,8 @@ namespace bergamot { AlignedMemory loadFileToMemory(const std::string& path, size_t alignment); AlignedMemory getModelMemoryFromConfig(marian::Ptr<Options> options); +AlignedMemory getQualityEstimatorModel(const marian::Ptr<Options>& options); +AlignedMemory getQualityEstimatorModel(MemoryBundle& memoryBundle, const marian::Ptr<Options>& options); AlignedMemory getShortlistMemoryFromConfig(marian::Ptr<Options> options); AlignedMemory getSsplitPrefixFileMemoryFromConfig(marian::Ptr<Options> options); void getVocabsMemoryFromConfig(marian::Ptr<Options> options, diff --git a/src/translator/definitions.h b/src/translator/definitions.h index d5b874353..a0f544ded 100644 --- a/src/translator/definitions.h +++ b/src/translator/definitions.h @@ -29,6 +29,8 @@ struct MemoryBundle { /// @todo Not implemented yet AlignedMemory ssplitPrefixFile{}; + + AlignedMemory qualityEstimatorMemory; ///< Byte-array of the QE model (aligned to 64) }; /// ByteRange stores indices for half-interval [begin, end) in a string. Can be diff --git a/src/translator/parser.h b/src/translator/parser.h index cd7096531..54aaaf86a 100644 --- a/src/translator/parser.h +++ b/src/translator/parser.h @@ -29,6 +29,8 @@ inline marian::ConfigParser createConfigParser() { cp.addOption<std::string>("--bergamot-mode", "Bergamot Options", "Operating mode for bergamot: [wasm, native, decoder]", "native"); + cp.addOption<std::string>("--quality", "Bergamot Options", "Path to the Quality Estimation model file"); + return cp; } diff --git a/src/translator/quality_estimator.cpp b/src/translator/quality_estimator.cpp new file mode 100644 index 000000000..936d293a4 --- /dev/null +++ b/src/translator/quality_estimator.cpp @@ -0,0 +1,288 @@ +#include "quality_estimator.h" + +namespace marian::bergamot { + +void UnsupervisedQualityEstimator::computeQualityScores(const Histories& histories, Response& response) const { + for (size_t i = 0; i < histories.size(); ++i) { + const Result result = histories[i]->top(); + const Hypothesis::PtrType& hypothesis = std::get<1>(result); + const std::vector<float> logProbs = hypothesis->tracebackWordScores(); + response.qualityScores.push_back(computeSentenceScores(logProbs, response.target, i)); + } +} + +Response::SentenceQualityScore UnsupervisedQualityEstimator::computeSentenceScores(const std::vector<float>& logProbs, + const AnnotatedText& target, + const size_t sentenceIdx) const { + const std::vector<SubwordRange> wordIndices = mapWords(logProbs, target, sentenceIdx); + + std::vector<float> wordScores; + + for (const SubwordRange& wordIndice : wordIndices) { + wordScores.push_back( + std::accumulate(logProbs.begin() + wordIndice.begin, logProbs.begin() + wordIndice.end, float(0.0)) / + wordIndice.size()); + } + + const float sentenceScore = + std::accumulate(std::begin(wordScores), std::end(wordScores), float(0.0)) / wordScores.size(); + + return {wordScores, subwordToWords(wordIndices, target, sentenceIdx), sentenceScore}; +} + +LogisticRegressorQualityEstimator::Matrix::Matrix(const size_t rowsParam, const size_t 
colsParam) + : rows(rowsParam), cols(colsParam), data_(rowsParam * colsParam) {} + +LogisticRegressorQualityEstimator::Matrix::Matrix(Matrix&& other) + : rows(other.rows), cols(other.cols), data_(std::move(other.data_)) {} + +const float& LogisticRegressorQualityEstimator::Matrix::at(const size_t row, const size_t col) const { + return data_[row * cols + col]; +} + +float& LogisticRegressorQualityEstimator::Matrix::at(const size_t row, const size_t col) { + return data_[row * cols + col]; +} + +LogisticRegressorQualityEstimator::LogisticRegressorQualityEstimator(Scale&& scale, Array&& coefficients, + const float intercept) + : scale_(std::move(scale)), coefficients_(std::move(coefficients)), intercept_(intercept), coefficientsByStds_() { + // Pre-compute the scale operations for the linear model + for (size_t i = 0; i < coefficients_.size(); ++i) { + coefficientsByStds_[i] = coefficients_[i] / scale_.stds[i]; + constantFactor_ += coefficientsByStds_[i] * scale_.means[i]; + } +} + +LogisticRegressorQualityEstimator::LogisticRegressorQualityEstimator(LogisticRegressorQualityEstimator&& other) + : scale_(std::move(other.scale_)), + coefficients_(std::move(other.coefficients_)), + intercept_(std::move(other.intercept_)), + coefficientsByStds_(std::move(other.coefficientsByStds_)), + constantFactor_(std::move(other.constantFactor_)) {} + +LogisticRegressorQualityEstimator LogisticRegressorQualityEstimator::fromAlignedMemory( + const AlignedMemory& alignedMemory) { + LOG(info, "[data] Loading Quality Estimator model from buffer"); + + const char* ptr = alignedMemory.begin(); + const size_t blobSize = alignedMemory.size(); + + ABORT_IF(blobSize < sizeof(Header), "Quality estimation file too small"); + const Header& header = *reinterpret_cast<const Header*>(ptr); + + ABORT_IF(header.magic != BINARY_QE_MODEL_MAGIC, "Incorrect magic bytes for quality estimation file"); + ABORT_IF(header.lrParametersDims <= 0, "The number of LR parameter dimensions cannot be less than or equal to zero"); + + const uint64_t expectedSize = + sizeof(Header) + (numLrParamsWithDimension_ * header.lrParametersDims + numIntercept_) * sizeof(float); + ABORT_IF(expectedSize != blobSize, "QE header claims file size should be {} bytes but file is {} bytes", expectedSize, + blobSize); + + ptr += sizeof(Header); + const float* memoryIndex = reinterpret_cast<const float*>(ptr); + + const float* stds = memoryIndex; + const float* means = memoryIndex += header.lrParametersDims; + const float* coefficientsMemory = memoryIndex += header.lrParametersDims; + const float intercept = *(memoryIndex += header.lrParametersDims); + + Scale scale; + + Array coefficients; + + for (int i = 0; i < header.lrParametersDims; ++i) { + scale.stds[i] = *(stds + i); + + ABORT_IF(scale.stds[i] == 0.0, "Invalid stds"); + + scale.means[i] = *(means + i); + coefficients[i] = *(coefficientsMemory + i); + } + + return LogisticRegressorQualityEstimator(std::move(scale), std::move(coefficients), intercept); +} + +AlignedMemory LogisticRegressorQualityEstimator::toAlignedMemory() const { + const size_t lrParametersDims = scale_.means.size(); + + const size_t lrSize = + (scale_.means.size() + scale_.stds.size() + coefficients_.size()) * sizeof(float) + sizeof(intercept_); + + Header header = {BINARY_QE_MODEL_MAGIC, lrParametersDims}; + marian::bergamot::AlignedMemory memory(sizeof(header) + lrSize); + + char* buffer = memory.begin(); + + memcpy(buffer, &header, sizeof(header)); + buffer += sizeof(header); + + for (const float std : scale_.stds) { + memcpy(buffer, &std, sizeof(std)); + buffer += sizeof(std); + } + + for (const float mean : scale_.means) { + memcpy(buffer, &mean, sizeof(mean)); + buffer += sizeof(mean); + } + + for (size_t i = 0; i < lrParametersDims; ++i) { + const float coefficient = coefficients_[i]; + memcpy(buffer, &coefficient, sizeof(coefficient)); + buffer += sizeof(coefficient); + } + + memcpy(buffer, &intercept_, sizeof(intercept_)); + buffer += sizeof(intercept_); + + return memory; +} + +void 
LogisticRegressorQualityEstimator::computeQualityScores(const Histories& histories, Response& response) const { + for (size_t i = 0; i < histories.size(); ++i) { + const Result result = histories[i]->top(); + const Hypothesis::PtrType& hypothesis = std::get<1>(result); + const std::vector<float> logProbs = hypothesis->tracebackWordScores(); + + response.qualityScores.push_back(computeSentenceScores(logProbs, response.target, i)); + } +} + +Response::SentenceQualityScore LogisticRegressorQualityEstimator::computeSentenceScores( + const std::vector<float>& logProbs, const AnnotatedText& target, const size_t sentenceIdx) const { + const std::vector<SubwordRange> wordIndices = mapWords(logProbs, target, sentenceIdx); + + const std::vector<float> wordScores = predict(extractFeatures(wordIndices, logProbs)); + + const float sentenceScore = + std::accumulate(std::begin(wordScores), std::end(wordScores), float(0.0)) / wordScores.size(); + + return {wordScores, subwordToWords(wordIndices, target, sentenceIdx), sentenceScore}; +} + +std::vector<float> LogisticRegressorQualityEstimator::predict(const Matrix& features) const { + std::vector<float> scores(features.rows); + + for (size_t i = 0; i < features.rows; ++i) { + for (size_t j = 0; j < features.cols; ++j) { + scores[i] += features.at(i, j) * coefficientsByStds_[j]; + } + } + + /// Applies the linear model followed by a sigmoid function to each element + + for (size_t i = 0; i < features.rows; ++i) { + scores[i] = std::log(1 - (1 / (1 + std::exp(-(scores[i] - constantFactor_ + intercept_))))); + } + + return scores; +} +// Preprocesses input data to provide the correct features for the LogisticRegression model. Currently, there are +// four features: the mean log probability of a given word (remember that a word is made of a few subword tokens); +// the minimum log probability of the subword-level tokens that a given word is made of; the number of subword-level +// tokens that a word is made of; and the overall mean log probability of the entire sequence. +LogisticRegressorQualityEstimator::Matrix LogisticRegressorQualityEstimator::extractFeatures( + const std::vector<SubwordRange>& wordIndices, const std::vector<float>& logProbs) const { + if (wordIndices.empty()) { + return Matrix(0, 0); + } + // The number of features (numFeatures), which currently must be 4 + Matrix features(wordIndices.size(), /*numFeatures =*/4); + size_t featureRow = 0; + // I_MEAN = index position in the feature vector that represents the mean log probability of a given word + // I_MIN = index position in the feature vector that represents the minimum log probability of a given word + // I_NUM_SUBWORDS = index position in the feature vector that represents the number of subwords that compose a given + // word + // I_OVERALL_MEAN = index position in the feature vector that represents the overall log probability score of the + // entire sequence + const size_t I_MEAN{0}, I_MIN{1}, I_NUM_SUBWORDS{2}, I_OVERALL_MEAN{3}; + + float overallMean = 0.0; + size_t numlogProbs = 0; + + for (const SubwordRange& wordIndice : wordIndices) { + if (wordIndice.begin == wordIndice.end) { + ++featureRow; + continue; + } + + float minScore = std::numeric_limits<float>::max(); + + for (size_t i = wordIndice.begin; i < wordIndice.end; ++i) { + ++numlogProbs; + overallMean += logProbs[i]; + features.at(featureRow, I_MEAN) += logProbs[i]; + + minScore = std::min(logProbs[i], minScore); + } + + features.at(featureRow, I_MEAN) /= static_cast<float>(wordIndice.size()); + features.at(featureRow, I_MIN) = minScore; + features.at(featureRow, I_NUM_SUBWORDS) = wordIndice.size(); + + ++featureRow; + } + + if (numlogProbs == 0) { + return Matrix(0, 0); + } + + overallMean /= wordIndices.rbegin()->end; + + for (size_t i = 0; i < features.rows; ++i) { + features.at(i, I_OVERALL_MEAN) = overallMean; + } + + return features; +} + +std::vector<SubwordRange> mapWords(const std::vector<float>& logProbs, const AnnotatedText& target, + const size_t sentenceIdx) { + // Ignore empty target + if ((logProbs.size() < 2) || (target.numWords(sentenceIdx) == 0)) { + return {}; + } + // A translated sentence is expected to contain at least one word + std::vector<SubwordRange> wordIndices(/*numWords=*/1); + + /// The LogisticRegressorQualityEstimator model ignores the presence of the EOS token, and hence we only need to + /// iterate n-1 positions. + for (size_t subwordIdx = 0; subwordIdx < (logProbs.size() - 1); ++subwordIdx) { + ByteRange subword = target.wordAsByteRange(sentenceIdx, subwordIdx); + + const char firstLetter = target.text.at(subword.begin); + + // if the first character is whitespace, it's the beginning of a new word + if (isspace(firstLetter)) { + wordIndices.back().end = subwordIdx; + wordIndices.emplace_back(); + wordIndices.back().begin = subwordIdx; + } + } + + wordIndices.back().end = logProbs.size() - 1; + + return wordIndices; +} + +std::vector<ByteRange> subwordToWords(const std::vector<SubwordRange>& wordIndices, const AnnotatedText& target, + const size_t sentenceIdx) { + std::vector<ByteRange> words; + + for (const SubwordRange& wordIndice : wordIndices) { + size_t wordBegin = target.wordAsByteRange(sentenceIdx, wordIndice.begin).begin; + size_t wordEnd = target.wordAsByteRange(sentenceIdx, wordIndice.end).begin; + + if (isspace(target.text.at(wordBegin))) { + ++wordBegin; + } + + words.emplace_back(ByteRange{wordBegin, wordEnd}); + } + + return words; +} + +} // namespace marian::bergamot diff --git a/src/translator/quality_estimator.h b/src/translator/quality_estimator.h new file mode 100644 index 000000000..3d2fd68ea --- /dev/null +++ b/src/translator/quality_estimator.h @@ -0,0 +1,222 @@ +#pragma once + +#include <memory> +#include <vector> + +#include 
"annotation.h" +#include "response.h" +#include "translator/history.h" + +namespace marian::bergamot { + +class QualityEstimator { + public: + /// Computes quality scores using values from Histories and subword tokens which come from Response + /// + /// + /// @param [in] histories: Histories obtained from translating a blob of source-text + /// @param [in] response: Partially constructed response, holding tokenization info + /// for source and target. The quality scores for each sentence obtained from the source-text blob + /// are written out as SentenceQualityScore into response. + virtual void computeQualityScores(const Histories &histories, Response &response) const = 0; +}; + +using SubwordRange = ByteRange; + +/// Unsupervised Quality Estimator model. It uses the translator model's log probabilities (log probs) as a proxy for +/// quality scores. Then, for a given word, its quality score is computed by taking the mean of the log probs of the +/// tokens that make it up. The sentence score is the mean of the word scores. +class UnsupervisedQualityEstimator : public QualityEstimator { + public: + void computeQualityScores(const Histories &histories, Response &response) const override; + + private: + Response::SentenceQualityScore computeSentenceScores(const std::vector<float> &logProbs, const AnnotatedText &target, + const size_t sentenceIdx) const; +}; + +// ASCII and Unicode text files never start with the following 64 bits +// It serves as a signature for quality estimator binary files +constexpr std::uint64_t BINARY_QE_MODEL_MAGIC = 0x78cc336f1d54b180; + +/// LogisticRegressorQualityEstimator model implementation: a linear regressor followed by a sigmoid function. Simply +/// speaking, an LR model requires its features to be scaled, so it contains four pieces of data: a vector of +/// coefficients and an intercept (which represent the linear model), plus a vector of means and a vector of stds +/// (which are necessary for feature scaling). These variables are first initialized by parsing a file (via +/// `fromAlignedMemory`), and then they are used to build the model representation. +class LogisticRegressorQualityEstimator : public QualityEstimator { + public: + using Array = std::array<float, /*numFeatures =*/4>; + + struct Header { + /// Binary QE file magic number + uint64_t magic; + /// Length of the LR parameters stds, means and coefficients. + uint64_t lrParametersDims; + }; + /// Struct that contains information for applying standard scaling + struct Scale { + /// Array of standard deviations of feature values. Its length equals the number of feature dimensions. + Array stds; + /// Array of means of feature values. Its length equals the number of feature dimensions. + Array means; + }; + /// Matrix is an internal data structure that was created only to be used in LogisticRegressorQualityEstimator + /// methods. It represents a matrix: its constructor receives the row and column counts, and the + /// method `at` accesses an element given row and column positions. + class Matrix { + public: + /// Number of rows + const size_t rows; + /// Number of columns + const size_t cols; + + /// @param [in] rowsParam: number of rows in the Matrix + /// @param [in] colsParam: number of columns in the Matrix + Matrix(const size_t rowsParam, const size_t colsParam); + /// Move constructor + Matrix(Matrix &&other); + + /// Returns the data value at a given row and column position + /// @param [in] row: row position + /// @param [in] col: col position + const float &at(const size_t row, const size_t col) const; + float &at(const size_t row, const size_t col); + + private: + std::vector<float> data_; + }; + /// Logistic Regressor constructor. It creates an LR model suitable for QualityEstimator use. + /// + /// + /// @param [in] scale: Arrays of stds and means used to apply standard scaling to the features + /// @param [in] coefficients: coefficient values of the linear part of the LR model + /// @param [in] intercept: intercept value of the linear part of the LR model + LogisticRegressorQualityEstimator(Scale &&scale, Array &&coefficients, const float intercept); + + /// Move constructor + LogisticRegressorQualityEstimator(LogisticRegressorQualityEstimator &&other); + + /// Parses a binary model file provided as AlignedMemory. + /// The AlignedMemory is expected to have the following structure: + /// - a header with the number of parameter dimensions + /// - a vector of standard deviations of features + /// - a vector of means of features + /// - a vector of coefficients + /// - an intercept value + static LogisticRegressorQualityEstimator fromAlignedMemory(const AlignedMemory &alignedMemory); + AlignedMemory toAlignedMemory() const; + + void computeQualityScores(const Histories &histories, Response &response) const override; + /// Given an input matrix \f$\mathbf{X}\f$, the usual logistic-regression computation proceeds as follows: + /// + /// 1) Standardize it, resulting in \f$\mathbf{Z} = \frac{(\mathbf{X}-\mu)}{\sigma}\f$, where \f$\mu\f$ stands for the + /// mean vector and \f$\sigma\f$ represents the standard-deviation vector + /// + /// 2) Then, we apply \f$\sum_{i=1}^{D}{ w_i z_i}\f$, where \f$D\f$ is the dimension (i.e. the number of features) and + /// \f$w\f$ is the model vector with learnt weights + /// + /// 3) We apply the sigmoid function to the result + /// + /// Notice, however, that for the first two steps we can do the following: + /// + /// \f{align*}{ + /// \sum_{i=1}^{D}{ w_i z_i} &= \mathbf{w^T}\left(\mathbf{\sigma^{-1}} \odot (\mathbf{x} - \mathbf{\mu})\right) \text{ // vectorizing step 1}\\ + /// &= \sum_{i=1}^{D}{\sigma_i^{-1} w_i (x_i - \mu_i)} \\ + /// &= \sum_{i=1}^{D}{\sigma_i^{-1} w_ix_i - \sigma_i^{-1} w_i \mu_i} \\ + /// &= \sum_{i=1}^{D}{\left(\sigma_i^{-1} w_i\right)x_i - \left(\sigma_i^{-1} w_i \mu_i\right)} + /// \f} + /// Then, \f$\sum_{i=1}^{D}{\sigma_i^{-1} w_i \mu_i}\f$ can be precomputed without any dependence on inference data; + /// it is stored in the variable \f$\textit{constantFactor_}\f$, to which \f$\textit{intercept_}\f$ is added at + /// inference time. + /// + /// @param [in] features: A Matrix struct of features. For a definition of the current features, please refer to + /// the `extractFeatures` method in `quality_estimator.cpp` + std::vector<float> predict(const Matrix &features) const; + + private: + Scale scale_; + Array coefficients_; + float intercept_; + Array coefficientsByStds_; + float constantFactor_ = 0.0; + + // Number of per-dimension parameter vectors: Scale (stds, means) and coefficients + static constexpr const size_t numLrParamsWithDimension_ = 3; + // Number of intercept values + static constexpr const size_t numIntercept_ = 1; + + /// Constructs the SentenceQualityScore struct + /// @param [in] logProbs: the log probabilities given by a translation model + /// @param [in] target: AnnotatedText target value + /// @param [in] sentenceIdx: the id of a candidate sentence + Response::SentenceQualityScore computeSentenceScores(const std::vector<float> &logProbs, const AnnotatedText &target, + const size_t sentenceIdx) const; + + Matrix extractFeatures(const std::vector<SubwordRange> &wordIndices, const std::vector<float> &logProbs) const; +}; + +/// createQualityEstimator model 
takes an `AlignedMemory`, which is the return value of `getQualityEstimatorModel`. +/// +/// `getQualityEstimatorModel` covers two cases: the `quality` option carries a value in `Options`, or it +/// does not. +/// +/// If no `quality` option is provided, the UnsupervisedQualityEstimator implementation is used by default. +/// +/// If a value is passed to the `quality` argument, the model file is read and converted into an `AlignedMemory` +/// structure, which instantiates a QualityEstimator object. + +/// @param [in] qualityFileMemory: An `AlignedMemory` which is created by parsing a QE model binary file through +/// getQualityEstimatorModel +inline std::shared_ptr<QualityEstimator> createQualityEstimator(const AlignedMemory &qualityFileMemory) { + // If no quality file, return the simple model + if (qualityFileMemory.size() == 0) { + return std::make_shared<UnsupervisedQualityEstimator>(); + } + + return std::make_shared<LogisticRegressorQualityEstimator>( + LogisticRegressorQualityEstimator::fromAlignedMemory(qualityFileMemory)); +} + +/// A word is composed of multiple subword tokens. Entire words are tokens split by whitespace. +/// This method takes a sequence of subword-level tokens (given by AnnotatedText), aligned with their log +/// probabilities, and conflates them into their respective words. +/// This function returns a SubwordRange (an alias of ByteRange) vector where each element corresponds to a word +/// id and holds the range of subwords that compose that word. +/// +/// If a translated sentence does not contain any alphanumeric character (therefore, it is made basically of the EOS +/// token), this method ignores it and returns an empty ByteRange vector of words. +/// +/// Example: +/// Suppose that you have the following source sentence (A): marian is a good translation service, and the translation +/// service gives you the following sentence (B): +/// +/// ma(0.15) ri(0.15) an(0.2) es(0.3) un(0.1) bu(0.3) en(0.2) ser(0.1) vi(0.2) cio(0.4) de(0.1) tra(0.4) du(0.2) +/// cción(0.1) +/// +/// The numbers in parentheses are the logProbs of the BPE tokens. +/// +/// Then, the result would be something like: +/// a vector where each position corresponds to the SubwordRange of the following words: marian +/// es un buen servicio de traducción. Hence, its length is 7. The value of the first element would be [0,3). + +/// @param [in] logProbs: the log probabilities of byte pair encodings (BPE) that come from the tracebackWordScores +/// method (which belongs to hypothesis.h in Marian) +/// @param [in] target: AnnotatedText target value +/// @param [in] sentenceIdx: the id of a candidate sentence +std::vector<SubwordRange> mapWords(const std::vector<float> &logProbs, const AnnotatedText &target, + const size_t sentenceIdx); + +/// Given a vector of SubwordRanges, it maps the elements to whole words rather than subword tokens. The words are +/// represented through ByteRanges. 
+ +/// @param [in] wordIndices: A vector where each element corresponds to the index of a real word and its values are +/// represented by the SubwordRanges (which are aliases of ByteRanges) that represent subword token positions +/// @param [in] target: AnnotatedText target value +/// @param [in] sentenceIdx: the id of a candidate sentence +std::vector<ByteRange> subwordToWords(const std::vector<SubwordRange> &wordIndices, const AnnotatedText &target, + const size_t sentenceIdx); + +} // namespace marian::bergamot diff --git a/src/translator/response.h b/src/translator/response.h index 2355f5225..b77fbb633 100644 --- a/src/translator/response.h +++ b/src/translator/response.h @@ -26,14 +26,6 @@ struct Point { /// Alignment is a sparse matrix, where Points represent entries with values. typedef std::vector Alignment; -/// -loglikelhoods of the sequence components as proxy to quality. -struct Quality { - /// Certainty/uncertainty score for sequence. - float sequence; - /// Certainty/uncertainty for each word in the sequence. - std::vector word; -}; - /// Response holds AnnotatedText(s) of source-text and translated text, /// alignment information between source and target sub-words and sentences. /// /// sentences boundaries, which are required to interpret Quality and /// Alignment (s) at the moment. struct Response { + /// SentenceQualityScore contains the quality data of a given translated sentence. + /// It includes the confidence (proxied by log probabilities) of each decoded word + /// (higher logprobs imply better-translated words), the ByteRanges of each term, + /// and the logprob of the whole sentence, represented as the mean word score. + struct SentenceQualityScore { + /// Quality score of each translated word + std::vector<float> wordScores; + /// Each word position in the translated text + std::vector<ByteRange> wordByteRanges; + /// Whole sentence quality score (the mean of its word scores) + float sentenceScore = 0.0; + }; + /// Convenience function to obtain number of units translated. Same as /// `.source.numSentences()` and `.target.numSentences().` The processing of a /// text into sentences is handled internally, and this information can be /// AnnotatedText source; /// translated text and annotations of (sub-)words and sentences. AnnotatedText target; - /// -logprob of each word and negative log likelihood of sequence (sentence) + /// logprob of each word and the total sequence (sentence) /// normalized by length, for each sentence processed by the translator. /// Indices correspond to ranges accessible through respective Annotation on /// source or target. - std::vector<Quality> qualityScores; + std::vector<SentenceQualityScore> qualityScores; /// Alignments between source and target. Each Alignment is a /// sparse matrix representation with indices corresponding diff --git a/src/translator/response_builder.cpp b/src/translator/response_builder.cpp index 2944de53a..d51fbbf57 100644 --- a/src/translator/response_builder.cpp +++ b/src/translator/response_builder.cpp @@ -6,21 +6,7 @@ namespace marian { namespace bergamot { void ResponseBuilder::buildQualityScores(Histories &histories, Response &response) { - std::vector<Quality> qualityScores; - for (auto &history : histories) { - // TODO(jerin): Change hardcode of nBest = 1 - NBestList onebest = history->nBest(1); - - Result result = onebest[0]; // Expecting only one result; - Words words = std::get<0>(result); - auto hyp = std::get<1>(result); - // Quality scores: Sequence level is obtained as normalized path scores. - // Word level using hypothesis traceback. These are most-likely - // logprobs. 
- auto normalizedPathScore = std::get<2>(result); - auto wordQualities = hyp->tracebackWordScores(); - response.qualityScores.push_back(Quality{normalizedPathScore, wordQualities}); - } + qualityEstimator_.computeQualityScores(histories, response); } void ResponseBuilder::buildAlignments(Histories &histories, Response &response) { diff --git a/src/translator/response_builder.h b/src/translator/response_builder.h index bee189516..614c7c282 100644 --- a/src/translator/response_builder.h +++ b/src/translator/response_builder.h @@ -1,7 +1,10 @@ #ifndef SRC_BERGAMOT_RESPONSE_BUILDER_H_ #define SRC_BERGAMOT_RESPONSE_BUILDER_H_ +#include + #include "data/types.h" +#include "quality_estimator.h" #include "response.h" #include "response_options.h" #include "vocabs.h" @@ -24,12 +27,15 @@ class ResponseBuilder { /// or not in the response and any additional configurable parameters. /// @param [in] vocabs: marian vocab object (used in decoding) /// @param [in] callback: callback with operates on the constructed Response. + /// @param [in] qualityEstimator: the QualityEstimator model that can be used + /// to provide translation quality probability. ResponseBuilder(ResponseOptions responseOptions, AnnotatedText &&source, Vocabs &vocabs, - std::function callback) + std::function callback, const QualityEstimator &qualityEstimator) : responseOptions_(responseOptions), source_(std::move(source)), vocabs_(vocabs), - callback_(std::move(callback)) {} + callback_(std::move(callback)), + qualityEstimator_(qualityEstimator) {} /// Constructs and sets the promise of a Response object from obtained /// histories after translating. @@ -86,6 +92,8 @@ class ResponseBuilder { std::function callback_; // To be set when callback triggered and // after Response constructed. 
AnnotatedText source_; + + const QualityEstimator &qualityEstimator_; }; } // namespace bergamot } // namespace marian diff --git a/src/translator/response_options.h b/src/translator/response_options.h index b74f5782a..92737a414 100644 --- a/src/translator/response_options.h +++ b/src/translator/response_options.h @@ -13,16 +13,6 @@ enum ConcatStrategy { SPACE }; -enum QualityScoreType { - /// Provide a free quality-score that comes with the machine-translation model - /// itself. - FREE, - - /// An expensive quality-score that runs additional computations to determine - /// quality of an output. - EXPENSIVE -}; - /// ResponseOptions dictate how to construct a Response for an input string of /// text to be translated. struct ResponseOptions { @@ -40,7 +30,6 @@ struct ResponseOptions { /// matrix). float alignmentThreshold{0.2f}; - QualityScoreType qualityScoreType{QualityScoreType::FREE}; ConcatStrategy concatStrategy{ConcatStrategy::FAITHFUL}; }; diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 26901debc..f5996aa45 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -4,6 +4,7 @@ #include #include "batch.h" +#include "byte_array_util.h" #include "definitions.h" namespace marian { @@ -17,7 +18,8 @@ Service::Service(Ptr options, MemoryBundle memoryBundle) batcher_(options), numWorkers_(std::max(1, options->get("cpu-threads"))), modelMemory_(std::move(memoryBundle.model)), - shortlistMemory_(std::move(memoryBundle.shortlist)) + shortlistMemory_(std::move(memoryBundle.shortlist)), + qualityEstimator_(createQualityEstimator(getQualityEstimatorModel(memoryBundle, options))) #ifdef WASM_COMPATIBLE_SOURCE , blocking_translator_(DeviceId(0, DeviceType::cpu), vocabs_, options_, &modelMemory_, &shortlistMemory_) @@ -71,7 +73,7 @@ void Service::queueRequest(std::string &&input, std::function text_processor_.process(std::move(input), source, segments); - ResponseBuilder responseBuilder(responseOptions, std::move(source), 
vocabs_, std::move(callback)); + ResponseBuilder responseBuilder(responseOptions, std::move(source), vocabs_, std::move(callback), *qualityEstimator_); Ptr request = New(requestId_++, std::move(segments), std::move(responseBuilder)); batcher_.addWholeRequest(request); diff --git a/src/translator/service.h b/src/translator/service.h index 0a3658048..3a3d616fc 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -3,6 +3,7 @@ #include "batch_translator.h" #include "data/types.h" +#include "quality_estimator.h" #include "response.h" #include "response_builder.h" #include "text_processor.h" @@ -46,7 +47,7 @@ class Service { /// the given bytearray memories. /// @param options Marian options object /// @param memoryBundle holds all byte-array memories. Can be a set/subset of - /// model, shortlist, vocabs and ssplitPrefixFile bytes. Optional. + /// model, shortlist, vocabs and ssplitPrefixFile or QualityEstimation bytes. Optional. explicit Service(Ptr options, MemoryBundle memoryBundle = {}); /// Construct Service from a string configuration. If memoryBundle is empty, Service is @@ -54,7 +55,7 @@ class Service { /// the given bytearray memories. /// @param [in] config string parsable as YAML expected to adhere with marian config /// @param [in] memoryBundle holds all byte-array memories. Can be a set/subset of - /// model, shortlist, vocabs and ssplitPrefixFile bytes. Optional. + /// model, shortlist, vocabs and ssplitPrefixFile or qualityEstimation bytes. Optional. explicit Service(const std::string &config, MemoryBundle memoryBundle = {}) : Service(parseOptions(config, /*validate=*/false), std::move(memoryBundle)) {} @@ -116,6 +117,8 @@ class Service { /// Shortlist memory passed as bytes. AlignedMemory shortlistMemory_; // ORDER DEPENDENCY (translators_) + std::shared_ptr qualityEstimator_; + /// Stores requestId of active request. Used to establish /// ordering among requests and logging/book-keeping. 
From cf541c68f9b43bce8c68e2292007a1573cfaa38e Mon Sep 17 00:00:00 2001
From: Jerin Philip
Date: Tue, 21 Sep 2021 18:10:40 +0100
Subject: [PATCH 290/442] Multiple TranslationModels Implementation (#210)

For outbound translation, we require having multiple models in the inventory at
the same time, abstracting out the "how-to-translate" for a given model.

Reorganization: TranslationModel + Service. The new entity containing
everything required to translate in one direction is `TranslationModel`. The
blocking single-threaded and async multi-threaded modes of operation are
decoupled into `BlockingService` and `AsyncService`. A new regression-test
using multiple models in conjunction is added, also serving as a demonstration
of using multiple models for Outbound Translation.

WASM: Since WebAssembly cannot use threads, it uses `BlockingService`. Bindings
are provided with a new API to work with one Service and multiple
TranslationModels, which the client (JS extension) can inventory and maintain.
Ownership of a given `TranslationModel` is shared while translations using the
model are active in the internal mechanism.

Config-Parsing: So far bergamot-translator has been hijacking marian's
config-parsing mechanisms. In order to support multiple models, it has become
impractical to continue this approach, so a bergamot-specific config parser is
provisioned for the command-line applications constituting tests. The original
marian config-parsing tooling is now associated only with `TranslationModel`.
The new config-parsing for the library manages workers and other common
options (tentatively).

Known issue: inefficient placement of workspaces leads to more memory usage
than necessary. This is to be fixed by changes trickling down from marian-dev
in a later pull request.
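The decoder and native apps in this patch bridge the callback-based `AsyncService::translate` to a blocking call with a `std::promise`/`std::future` pair. A minimal self-contained sketch of the same pattern; `translateAsync` and `translateBlocking` are hypothetical stand-ins, not the real bergamot API:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <thread>

// Hypothetical stand-in for Response.
struct Response {
  std::string text;
};

// A callback-style async translate: work happens on another thread.
void translateAsync(std::string input, std::function<void(Response &&)> callback) {
  std::thread worker([input = std::move(input), callback = std::move(callback)]() {
    callback(Response{"translated: " + input});
  });
  worker.detach();
}

// Bridge the callback API back to a blocking call with a promise/future pair,
// the same shape the decoder/native test apps in this patch use.
Response translateBlocking(std::string input) {
  std::promise<Response> responsePromise;
  std::future<Response> responseFuture = responsePromise.get_future();
  auto callback = [&responsePromise](Response &&response) {
    responsePromise.set_value(std::move(response));
  };
  translateAsync(std::move(input), std::move(callback));
  responseFuture.wait();
  return responseFuture.get();
}
```

Because `wait()` blocks until `set_value` runs, the stack-allocated promise stays alive for the detached worker, which is why the apps can capture it by reference.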
This PR also brings in BRT changes that fix previously broken speed-tests and
correct some QE outputs that differed due to not using a shortlist.
---
 app/bergamot.cpp | 26 ++-
 app/cli.h | 46 ++--
 bergamot-translator-tests | 2 +-
 src/tests/apps.cpp | 68 ++++--
 src/tests/apps.h | 14 +-
 src/tests/cli.cpp | 54 +++--
 src/translator/CMakeLists.txt | 7 +-
 src/translator/aggregate_batching_pool.cpp | 34 +++
 src/translator/aggregate_batching_pool.h | 68 ++++++
 src/translator/batch_translator.cpp | 128 -----------
 src/translator/batch_translator.h | 57 -----
 .../{batcher.cpp => batching_pool.cpp} | 15 +-
 src/translator/{batcher.h => batching_pool.h} | 23 +-
 src/translator/definitions.h | 3 +
 src/translator/parser.cpp | 170 +++++++++++++++
 src/translator/parser.h | 120 +++++------
 src/translator/request.h | 6 +-
 src/translator/response_builder.h | 2 +-
 src/translator/service.cpp | 99 ++++-----
 src/translator/service.h | 198 ++++++++----------
 src/translator/text_processor.cpp | 8 +-
 src/translator/text_processor.h | 6 +-
 src/translator/threadsafe_batcher.cpp | 38 ----
 src/translator/threadsafe_batcher.h | 57 -----
 src/translator/threadsafe_batching_pool.cpp | 49 +++++
 src/translator/threadsafe_batching_pool.h | 71 +++++++
 src/translator/translation_model.cpp | 173 +++++++++++++
 src/translator/translation_model.h | 122 +++++++++++
 wasm/bindings/service_bindings.cpp | 45 ++--
 29 files changed, 1068 insertions(+), 641 deletions(-)
 create mode 100644 src/translator/aggregate_batching_pool.cpp
 create mode 100644 src/translator/aggregate_batching_pool.h
 delete mode 100644 src/translator/batch_translator.cpp
 delete mode 100644 src/translator/batch_translator.h
 rename src/translator/{batcher.cpp => batching_pool.cpp} (83%)
 rename src/translator/{batcher.h => batching_pool.h} (63%)
 create mode 100644 src/translator/parser.cpp
 delete mode 100644 src/translator/threadsafe_batcher.cpp
 delete mode 100644 src/translator/threadsafe_batcher.h
 create mode 100644
src/translator/threadsafe_batching_pool.cpp create mode 100644 src/translator/threadsafe_batching_pool.h create mode 100644 src/translator/translation_model.cpp create mode 100644 src/translator/translation_model.h diff --git a/app/bergamot.cpp b/app/bergamot.cpp index 19dea1fcf..bffbbb112 100644 --- a/app/bergamot.cpp +++ b/app/bergamot.cpp @@ -1,18 +1,22 @@ #include "cli.h" int main(int argc, char *argv[]) { - auto cp = marian::bergamot::createConfigParser(); - auto options = cp.parseOptions(argc, argv, true); - const std::string mode = options->get("bergamot-mode"); + marian::bergamot::ConfigParser configParser; + configParser.parseArgs(argc, argv); + auto &config = configParser.getConfig(); using namespace marian::bergamot; - if (mode == "wasm") { - app::wasm(options); - } else if (mode == "native") { - app::native(options); - } else if (mode == "decoder") { - app::decoder(options); - } else { - ABORT("Unknown --mode {}. Use one of: {wasm,native,decoder}", mode); + switch (config.opMode) { + case OpMode::APP_WASM: + app::wasm(config); + break; + case OpMode::APP_NATIVE: + app::native(config); + break; + case OpMode::APP_DECODER: + app::decoder(config); + break; + default: + break; } return 0; } diff --git a/app/cli.h b/app/cli.h index 4afe8b9aa..9cb12dd28 100644 --- a/app/cli.h +++ b/app/cli.h @@ -34,34 +34,40 @@ namespace app { /// * Output: written to stdout as translations for the sentences supplied in corresponding lines /// /// @param [options]: Options to translate passed down to marian through Options. -void wasm(Ptr options) { +void wasm(const CLIConfig &config) { // Here, we take the command-line interface which is uniform across all apps. This is parsed into Ptr by // marian. However, mozilla does not allow a Ptr constructor and demands an std::string constructor since // std::string isn't marian internal unlike Ptr. 
Since this std::string path needs to be tested for mozilla // and since this class/CLI is intended at testing mozilla's path, we go from: // - // cmdline -> Ptr -> std::string -> Service(std::string) + // cmdline -> Ptr -> std::string -> TranslationModel(std::string) // // Overkill, yes. - std::string config = options->asYamlString(); - Service model(config); + const std::string &modelConfigPath = config.modelConfigPaths.front(); + + Ptr options = parseOptionsFromFilePath(modelConfigPath); + MemoryBundle memoryBundle = getMemoryBundleFromConfig(options); + + BlockingService::Config serviceConfig; + BlockingService service(serviceConfig); + + std::shared_ptr translationModel = + std::make_shared(options->asYamlString(), std::move(memoryBundle)); ResponseOptions responseOptions; std::vector texts; -#ifdef WASM_COMPATIBLE_SOURCE // Hide the translateMultiple operation for (std::string line; std::getline(std::cin, line);) { texts.emplace_back(line); } - auto results = model.translateMultiple(std::move(texts), responseOptions); + auto results = service.translateMultiple(translationModel, std::move(texts), responseOptions); for (auto &result : results) { std::cout << result.getTranslatedText() << std::endl; } -#endif } /// Application used to benchmark with marian-decoder from time-to-time. 
The implementation in this repository follows a @@ -82,9 +88,13 @@ void wasm(Ptr options) { /// * Output: to stdout, translations of the sentences supplied via stdin in corresponding lines /// /// @param [in] options: constructed from command-line supplied arguments -void decoder(Ptr options) { +void decoder(const CLIConfig &config) { marian::timer::Timer decoderTimer; - Service service(options); + AsyncService::Config asyncConfig{config.numWorkers}; + AsyncService service(asyncConfig); + auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); + MemoryBundle memoryBundle; + Ptr translationModel = service.createCompatibleModel(options, std::move(memoryBundle)); // Read a large input text blob from stdin std::ostringstream std_input; std_input << std::cin.rdbuf(); @@ -95,14 +105,15 @@ void decoder(Ptr options) { std::future responseFuture = responsePromise.get_future(); auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - service.translate(std::move(input), std::move(callback)); + service.translate(translationModel, std::move(input), std::move(callback)); responseFuture.wait(); const Response &response = responseFuture.get(); for (size_t sentenceIdx = 0; sentenceIdx < response.size(); sentenceIdx++) { std::cout << response.target.sentence(sentenceIdx) << "\n"; } - LOG(info, "Total time: {:.5f}s wall", decoderTimer.elapsed()); + + std::cerr << "Total time: " << std::setprecision(5) << decoderTimer.elapsed() << "s wall" << std::endl; } /// Command line interface to the test the features being developed as part of bergamot C++ library on native platform. @@ -114,16 +125,19 @@ void decoder(Ptr options) { /// * Output: to stdout, translation of the source text faithful to source structure. 
/// /// @param [in] options: options to build translator -void native(Ptr options) { +void native(const CLIConfig &config) { + AsyncService::Config asyncConfig{config.numWorkers}; + AsyncService service(asyncConfig); + + auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); // Prepare memories for bytearrays (including model, shortlist and vocabs) MemoryBundle memoryBundle; - - if (options->get("bytearray")) { + if (config.byteArray) { // Load legit values into bytearrays. memoryBundle = getMemoryBundleFromConfig(options); } - Service service(options, std::move(memoryBundle)); + Ptr translationModel = service.createCompatibleModel(options, std::move(memoryBundle)); // Read a large input text blob from stdin std::ostringstream std_input; @@ -137,7 +151,7 @@ void native(Ptr options) { std::future responseFuture = responsePromise.get_future(); auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - service.translate(std::move(input), std::move(callback), responseOptions); + service.translate(translationModel, std::move(input), std::move(callback), responseOptions); responseFuture.wait(); Response response = responseFuture.get(); diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 53c6e42a9..9dc3c5e9a 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 53c6e42a97e512698711068d0be3c208359b1801 +Subproject commit 9dc3c5e9a1027c1d6b4a467a27bdff16d0d6a006 diff --git a/src/tests/apps.cpp b/src/tests/apps.cpp index 991d3c3fd..63febfaf0 100644 --- a/src/tests/apps.cpp +++ b/src/tests/apps.cpp @@ -2,30 +2,25 @@ namespace marian { namespace bergamot { -namespace testapp { - -// Utility function, common for all testapps. 
-Response translateFromStdin(Ptr options, ResponseOptions responseOptions) { - // Prepare memories for bytearrays (including model, shortlist and vocabs) - MemoryBundle memoryBundle; - if (options->get("bytearray")) { - // Load legit values into bytearrays. - memoryBundle = getMemoryBundleFromConfig(options); - } - - Service service(options, std::move(memoryBundle)); +namespace { +std::string readFromStdin() { // Read a large input text blob from stdin std::ostringstream inputStream; inputStream << std::cin.rdbuf(); std::string input = inputStream.str(); + return input; +} +// Utility function, common for all testapps. +Response translateForResponse(AsyncService &service, Ptr model, std::string &&source, + ResponseOptions responseOptions) { std::promise responsePromise; std::future responseFuture = responsePromise.get_future(); auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - service.translate(std::move(input), callback, responseOptions); + service.translate(model, std::move(source), callback, responseOptions); responseFuture.wait(); @@ -33,10 +28,15 @@ Response translateFromStdin(Ptr options, ResponseOptions responseOption return response; } -void annotatedTextWords(Ptr options, bool source) { +} // namespace + +namespace testapp { + +void annotatedTextWords(AsyncService &service, Ptr model, bool sourceSide) { ResponseOptions responseOptions; - Response response = translateFromStdin(options, responseOptions); - AnnotatedText &annotatedText = source ? response.source : response.target; + std::string source = readFromStdin(); + Response response = translateForResponse(service, model, std::move(source), responseOptions); + AnnotatedText &annotatedText = sourceSide ? response.source : response.target; for (size_t s = 0; s < annotatedText.numSentences(); s++) { for (size_t w = 0; w < annotatedText.numWords(s); w++) { std::cout << (w == 0 ? 
"" : "\t"); @@ -46,19 +46,39 @@ void annotatedTextWords(Ptr options, bool source) { } } -void annotatedTextSentences(Ptr options, bool source) { +void annotatedTextSentences(AsyncService &service, Ptr model, bool sourceSide) { ResponseOptions responseOptions; - Response response = translateFromStdin(options, responseOptions); - AnnotatedText &annotatedText = source ? response.source : response.target; + std::string source = readFromStdin(); + Response response = translateForResponse(service, model, std::move(source), responseOptions); + AnnotatedText &annotatedText = sourceSide ? response.source : response.target; for (size_t s = 0; s < annotatedText.numSentences(); s++) { std::cout << annotatedText.sentence(s) << "\n"; } } -void qualityEstimatorWords(const Ptr &options) { +void forwardAndBackward(AsyncService &service, std::vector> &models) { + ABORT_IF(models.size() != 2, "Forward and backward test needs two models."); + ResponseOptions responseOptions; + std::string source = readFromStdin(); + Response forwardResponse = translateForResponse(service, models.front(), std::move(source), responseOptions); + + // Make a copy of target + std::string target = forwardResponse.target.text; + Response backwardResponse = translateForResponse(service, models.back(), std::move(target), responseOptions); + + // Print both onto the command-line + std::cout << forwardResponse.source.text; + std::cout << "----------------\n"; + std::cout << forwardResponse.target.text; + std::cout << "----------------\n"; + std::cout << backwardResponse.target.text; +} + +void qualityEstimatorWords(AsyncService &service, Ptr model) { ResponseOptions responseOptions; responseOptions.qualityScores = true; - const Response response = translateFromStdin(options, responseOptions); + std::string source = readFromStdin(); + const Response response = translateForResponse(service, model, std::move(source), responseOptions); for (const auto &sentenceQualityEstimate : response.qualityScores) { std::cout << 
"[SentenceBegin]\n"; @@ -71,10 +91,12 @@ void qualityEstimatorWords(const Ptr &options) { } } -void qualityEstimatorScores(const Ptr &options) { +void qualityEstimatorScores(AsyncService &service, Ptr model) { ResponseOptions responseOptions; responseOptions.qualityScores = true; - const Response response = translateFromStdin(options, responseOptions); + + std::string source = readFromStdin(); + const Response response = translateForResponse(service, model, std::move(source), responseOptions); for (const auto &sentenceQualityEstimate : response.qualityScores) { std::cout << std::fixed << std::setprecision(3) << sentenceQualityEstimate.sentenceScore << "\n"; diff --git a/src/tests/apps.h b/src/tests/apps.h index deb6a12dc..dee77a9be 100644 --- a/src/tests/apps.h +++ b/src/tests/apps.h @@ -21,23 +21,21 @@ namespace bergamot { namespace testapp { -// Utility function, common for all testapps. Reads content from stdin, builds a Service based on options and constructs -// a response containing translation data according responseOptions. -Response translateFromStdin(Ptr options, ResponseOptions responseOptions); - // Reads from stdin and translates. Prints the tokens separated by space for each sentence. Prints words from source // side text annotation if source=true, target annotation otherwise. -void annotatedTextWords(Ptr options, bool source = true); +void annotatedTextWords(AsyncService &service, Ptr model, bool source = true); // Reads from stdin and translates the read content. Prints the sentences in source or target in constructed response // in each line, depending on source = true or false respectively. -void annotatedTextSentences(Ptr options, bool source = true); +void annotatedTextSentences(AsyncService &service, Ptr model, bool source = true); + +void forwardAndBackward(AsyncService &service, std::vector> &models); // Reads from stdin and translates the read content. Prints the quality words for each sentence. 
-void qualityEstimatorWords(const Ptr& options); +void qualityEstimatorWords(AsyncService &service, Ptr model); // Reads from stdin and translates the read content. Prints the quality scores for each sentence. -void qualityEstimatorScores(const Ptr& options); +void qualityEstimatorScores(AsyncService &service, Ptr model); } // namespace testapp } // namespace bergamot diff --git a/src/tests/cli.cpp b/src/tests/cli.cpp index 0e9469ab0..90c386c84 100644 --- a/src/tests/cli.cpp +++ b/src/tests/cli.cpp @@ -1,23 +1,45 @@ - #include "apps.h" int main(int argc, char *argv[]) { - auto cp = marian::bergamot::createConfigParser(); - auto options = cp.parseOptions(argc, argv, true); - const std::string mode = options->get("bergamot-mode"); using namespace marian::bergamot; - if (mode == "test-response-source-sentences") { - testapp::annotatedTextSentences(options, /*source=*/true); - } else if (mode == "test-response-target-sentences") { - testapp::annotatedTextSentences(options, /*source=*/false); - } else if (mode == "test-response-source-words") { - testapp::annotatedTextWords(options, /*source=*/true); - } else if (mode == std::string("test-quality-estimator-words")) { - testapp::qualityEstimatorWords(options); - } else if (mode == std::string("test-quality-estimator-scores")) { - testapp::qualityEstimatorScores(options); - } else { - ABORT("Unknown --mode {}. 
Please run a valid test", mode); + marian::bergamot::ConfigParser configParser; + configParser.parseArgs(argc, argv); + auto &config = configParser.getConfig(); + AsyncService::Config serviceConfig{config.numWorkers}; + AsyncService service(serviceConfig); + std::vector> models; + + for (auto &modelConfigPath : config.modelConfigPaths) { + TranslationModel::Config modelConfig = parseOptionsFromFilePath(modelConfigPath); + std::shared_ptr model = service.createCompatibleModel(modelConfig); + models.push_back(model); + } + + switch (config.opMode) { + case OpMode::TEST_SOURCE_SENTENCES: + testapp::annotatedTextSentences(service, models.front(), /*source=*/true); + break; + case OpMode::TEST_TARGET_SENTENCES: + testapp::annotatedTextSentences(service, models.front(), /*source=*/false); + break; + case OpMode::TEST_SOURCE_WORDS: + testapp::annotatedTextWords(service, models.front(), /*source=*/true); + break; + case OpMode::TEST_TARGET_WORDS: + testapp::annotatedTextWords(service, models.front(), /*source=*/false); + break; + case OpMode::TEST_FORWARD_BACKWARD_FOR_OUTBOUND: + testapp::forwardAndBackward(service, models); + break; + case OpMode::TEST_QUALITY_ESTIMATOR_WORDS: + testapp::qualityEstimatorWords(service, models.front()); + break; + case OpMode::TEST_QUALITY_ESTIMATOR_SCORES: + testapp::qualityEstimatorScores(service, models.front()); + break; + default: + ABORT("Incompatible op-mode. 
Choose one of the test modes."); + break; } return 0; } diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index c0ee6be7a..ab1448800 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -5,15 +5,16 @@ configure_file(${CMAKE_CURRENT_SOURCE_DIR}/project_version.h.in add_library(bergamot-translator STATIC byte_array_util.cpp text_processor.cpp - batch_translator.cpp + translation_model.cpp request.cpp - batcher.cpp + batching_pool.cpp + aggregate_batching_pool.cpp response_builder.cpp quality_estimator.cpp batch.cpp annotation.cpp service.cpp - threadsafe_batcher.cpp + parser.cpp ) if (USE_WASM_COMPATIBLE_SOURCE) # Using wasm compatible sources should include this compile definition; diff --git a/src/translator/aggregate_batching_pool.cpp b/src/translator/aggregate_batching_pool.cpp new file mode 100644 index 000000000..38c55f1c4 --- /dev/null +++ b/src/translator/aggregate_batching_pool.cpp @@ -0,0 +1,34 @@ + +#include "aggregate_batching_pool.h" + +namespace marian { +namespace bergamot { + +AggregateBatchingPool::AggregateBatchingPool() { + // TODO(@jerinphilip): Set aggregate limits +} + +size_t AggregateBatchingPool::enqueueRequest(Ptr model, Ptr request) { + model->enqueueRequest(request); + aggregateQueue_.insert(model); + return request->numSegments(); +} + +size_t AggregateBatchingPool::generateBatch(Ptr& model, Batch& batch) { + while (!aggregateQueue_.empty()) { + auto candidateItr = aggregateQueue_.begin(); + Ptr candidate = *candidateItr; + size_t numSentences = candidate->generateBatch(batch); + if (numSentences > 0) { + model = candidate; + return numSentences; + } else { + // Try the next model's batching pool. 
+      aggregateQueue_.erase(candidateItr);
+    }
+  }
+  return /*numSentences=*/0;
+}
+
+} // namespace bergamot
+} // namespace marian
diff --git a/src/translator/aggregate_batching_pool.h b/src/translator/aggregate_batching_pool.h
new file mode 100644
index 000000000..5b5d4b17a
--- /dev/null
+++ b/src/translator/aggregate_batching_pool.h
@@ -0,0 +1,68 @@
+#ifndef SRC_BERGAMOT_AGGREGATE_BATCHING_POOL_H_
+#define SRC_BERGAMOT_AGGREGATE_BATCHING_POOL_H_
+
+#include
+#include
+
+#include "data/types.h"
+#include "translation_model.h"
+
+namespace marian {
+namespace bergamot {
+
+/// Hashes a pointer to an object using the address the pointer points to. If two pointers point to the same address,
+/// they hash to the same value. Useful to put widely shared_ptrs of entities (eg: TranslationModel, Vocab, Shortlist)
+/// etc into containers which require the members to be hashable (std::unordered_set, std::unordered_map).
+template <class T>
+struct HashPtr {
+  size_t operator()(const std::shared_ptr<T>& t) const {
+    size_t address = reinterpret_cast<size_t>(t.get());
+    return std::hash<size_t>()(address);
+  }
+};
+
+/// Aggregates request queueing and generation of batches from multiple TranslationModels (specifically, the
+/// BatchingPools within), thereby acting as an intermediary that enables multiple-translation-model capability in
+/// BlockingService and AsyncService.
+///
+/// A simple queue of shared owning references to TranslationModels is held here, from which batches are generated on
+/// demand. Since a queue is involved, ordering is first-come first-served on requests, except that an earlier request
+/// using the same TranslationModel which is still pending consumption can effectively cause priority inversion.
+///
+/// Actual storage for the requests and batch generation is within the respective TranslationModels, each of which
+/// owns its own BatchingPool.
+///
+/// Matches the API provided by BatchingPool, except arguments are additionally parameterized by TranslationModel.
+///
+/// Note: This class is not thread-safe. If needed, wrap it with ThreadsafeBatchingPool for a thread-safe equivalent.
+class AggregateBatchingPool {
+ public:
+  /// Create an AggregateBatchingPool with (tentatively) global limits (across all BatchingPools) imposed here.
+  AggregateBatchingPool();
+
+  /// Enqueue an existing request onto model, also keeping account that this model and request are now pending.
+  ///
+  /// @param [in] model: Model to use in translation. Shared ownership of this model is accepted by this object to
+  /// keep the model alive until translation is complete.
+  /// @param [in] request: A request to be enqueued to model.
+  /// @returns number of sentences added for translation.
+  size_t enqueueRequest(Ptr<TranslationModel> model, Ptr<Request> request);
+
+  /// Generate a batch from pending requests, obtained from available TranslationModels.
+  ///
+  /// @param [out] model: TranslationModel whose batch is generated.
+  /// @param [out] batch: Batch to write onto, which is consumed at translation elsewhere.
+  /// @returns Number of sentences in the generated batch.
+ size_t generateBatch(Ptr& model, Batch& batch); + + private: + std::unordered_set, HashPtr> aggregateQueue_; +}; + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_AGGREGATE_BATCHING_POOL_H_ diff --git a/src/translator/batch_translator.cpp b/src/translator/batch_translator.cpp deleted file mode 100644 index 889ff0073..000000000 --- a/src/translator/batch_translator.cpp +++ /dev/null @@ -1,128 +0,0 @@ -#include "batch_translator.h" - -#include "batch.h" -#include "byte_array_util.h" -#include "common/logging.h" -#include "data/corpus.h" -#include "data/text_input.h" -#include "translator/beam_search.h" - -namespace marian { -namespace bergamot { - -BatchTranslator::BatchTranslator(DeviceId const device, Vocabs &vocabs, Ptr options, - const AlignedMemory *modelMemory, const AlignedMemory *shortlistMemory) - : device_(device), - options_(options), - vocabs_(vocabs), - modelMemory_(modelMemory), - shortlistMemory_(shortlistMemory) {} - -void BatchTranslator::initialize() { - // Initializes the graph. 
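The by-address hashing that `HashPtr` in `aggregate_batching_pool.h` applies to `shared_ptr` keys can be exercised in isolation: two `shared_ptr`s to the same object hash equal, and `shared_ptr`'s built-in `operator==` already compares addresses, so re-inserting an aliased pointer into the set is a no-op. `Model` and `countDistinct` below are hypothetical stand-ins, assuming only the standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Same by-address hashing technique as HashPtr in aggregate_batching_pool.h.
template <class T>
struct HashPtr {
  size_t operator()(const std::shared_ptr<T> &t) const {
    size_t address = reinterpret_cast<size_t>(t.get());
    return std::hash<size_t>()(address);
  }
};

struct Model {};  // hypothetical stand-in for TranslationModel

// Count distinct models behind a list of (possibly aliased) shared_ptrs,
// the same membership semantics AggregateBatchingPool's queue relies on.
size_t countDistinct(const std::vector<std::shared_ptr<Model>> &models) {
  std::unordered_set<std::shared_ptr<Model>, HashPtr<Model>> queue(models.begin(), models.end());
  return queue.size();
}
```

This is why `enqueueRequest` can blindly `insert` the model on every call: aliases collapse to one entry.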
- if (options_->hasAndNotEmpty("shortlist")) { - int srcIdx = 0, trgIdx = 1; - bool shared_vcb = - vocabs_.sources().front() == - vocabs_.target(); // vocabs_->sources().front() is invoked as we currently only support one source vocab - if (shortlistMemory_->size() > 0 && shortlistMemory_->begin() != nullptr) { - slgen_ = New(shortlistMemory_->begin(), shortlistMemory_->size(), - vocabs_.sources().front(), vocabs_.target(), srcIdx, trgIdx, - shared_vcb, options_->get("check-bytearray")); - } else { - // Changed to BinaryShortlistGenerator to enable loading binary shortlist file - // This class also supports text shortlist file - slgen_ = New(options_, vocabs_.sources().front(), vocabs_.target(), srcIdx, - trgIdx, shared_vcb); - } - } - - graph_ = New(true); // set the graph to be inference only - auto prec = options_->get>("precision", {"float32"}); - graph_->setDefaultElementType(typeFromString(prec[0])); - graph_->setDevice(device_); - graph_->getBackend()->configureDevice(options_); - graph_->reserveWorkspaceMB(options_->get("workspace")); - if (modelMemory_->size() > 0 && - modelMemory_->begin() != - nullptr) { // If we have provided a byte array that contains the model memory, we can initialise the model - // from there, as opposed to from reading in the config file - ABORT_IF((uintptr_t)modelMemory_->begin() % 256 != 0, - "The provided memory is not aligned to 256 bytes and will crash when vector instructions are used on it."); - if (options_->get("check-bytearray")) { - ABORT_IF(!validateBinaryModel(*modelMemory_, modelMemory_->size()), - "The binary file is invalid. Incomplete or corrupted download?"); - } - const std::vector container = { - modelMemory_->begin()}; // Marian supports multiple models initialised in this manner hence std::vector. - // However we will only ever use 1 during decoding. 
- scorers_ = createScorers(options_, container); - } else { - scorers_ = createScorers(options_); - } - for (auto scorer : scorers_) { - scorer->init(graph_); - if (slgen_) { - scorer->setShortlistGenerator(slgen_); - } - } - graph_->forward(); -} - -void BatchTranslator::translate(Batch &batch) { - std::vector batchVector; - - auto &sentences = batch.sentences(); - size_t batchSequenceNumber{0}; - for (auto &sentence : sentences) { - data::SentenceTuple sentence_tuple(batchSequenceNumber); - Segment segment = sentence.getUnderlyingSegment(); - sentence_tuple.push_back(segment); - batchVector.push_back(sentence_tuple); - - ++batchSequenceNumber; - } - - size_t batchSize = batchVector.size(); - std::vector sentenceIds; - std::vector maxDims; - for (auto &ex : batchVector) { - if (maxDims.size() < ex.size()) maxDims.resize(ex.size(), 0); - for (size_t i = 0; i < ex.size(); ++i) { - if (ex[i].size() > (size_t)maxDims[i]) maxDims[i] = (int)ex[i].size(); - } - sentenceIds.push_back(ex.getId()); - } - - typedef marian::data::SubBatch SubBatch; - typedef marian::data::CorpusBatch CorpusBatch; - - std::vector> subBatches; - for (size_t j = 0; j < maxDims.size(); ++j) { - subBatches.emplace_back(New(batchSize, maxDims[j], vocabs_.sources().at(j))); - } - - std::vector words(maxDims.size(), 0); - for (size_t i = 0; i < batchSize; ++i) { - for (size_t j = 0; j < maxDims.size(); ++j) { - for (size_t k = 0; k < batchVector[i][j].size(); ++k) { - subBatches[j]->data()[k * batchSize + i] = batchVector[i][j][k]; - subBatches[j]->mask()[k * batchSize + i] = 1.f; - words[j]++; - } - } - } - - for (size_t j = 0; j < maxDims.size(); ++j) subBatches[j]->setWords(words[j]); - - auto corpus_batch = Ptr(new CorpusBatch(subBatches)); - corpus_batch->setSentenceIds(sentenceIds); - - auto search = New(options_, scorers_, vocabs_.target()); - - auto histories = std::move(search->search(graph_, corpus_batch)); - batch.completeBatch(histories); -} - -} // namespace bergamot -} // namespace 
marian diff --git a/src/translator/batch_translator.h b/src/translator/batch_translator.h deleted file mode 100644 index 6a7fa9842..000000000 --- a/src/translator/batch_translator.h +++ /dev/null @@ -1,57 +0,0 @@ -#ifndef SRC_BERGAMOT_BATCH_TRANSLATOR_H_ -#define SRC_BERGAMOT_BATCH_TRANSLATOR_H_ - -#include -#include - -#include "batch.h" -#include "common/utils.h" -#include "data/shortlist.h" -#include "definitions.h" -#include "request.h" -#include "translator/history.h" -#include "translator/scorers.h" -#include "vocabs.h" - -namespace marian { -namespace bergamot { - -class BatchTranslator { - // Launches minimal marian-translation (only CPU at the moment) in individual - // threads. Constructor launches each worker thread running mainloop(). - // mainloop runs until until it receives poison from the PCQueue. Threads are - // shut down in Service which calls join() on the threads. - - public: - /** - * Initialise the marian translator. - * @param device DeviceId that performs translation. Could be CPU or GPU - * @param vocabs Vector that contains ptrs to two vocabs - * @param options Marian options object - * @param modelMemory byte array (aligned to 256!!!) that contains the bytes of a model.bin. Provide a nullptr if not - * used. - * @param shortlistMemory byte array of shortlist (aligned to 64) - */ - explicit BatchTranslator(DeviceId const device, Vocabs& vocabs, Ptr options, - const AlignedMemory* modelMemory, const AlignedMemory* shortlistMemory); - - // convenience function for logging. 
TODO(jerin) - std::string _identifier() { return "worker" + std::to_string(device_.no); } - void translate(Batch& batch); - void initialize(); - - private: - Ptr options_; - DeviceId device_; - const Vocabs& vocabs_; - Ptr graph_; - std::vector> scorers_; - Ptr slgen_; - const AlignedMemory* modelMemory_{nullptr}; - const AlignedMemory* shortlistMemory_{nullptr}; -}; - -} // namespace bergamot -} // namespace marian - -#endif // SRC_BERGAMOT_BATCH_TRANSLATOR_H_ diff --git a/src/translator/batcher.cpp b/src/translator/batching_pool.cpp similarity index 83% rename from src/translator/batcher.cpp rename to src/translator/batching_pool.cpp index 0a14459f1..83b5e00ab 100644 --- a/src/translator/batcher.cpp +++ b/src/translator/batching_pool.cpp @@ -1,4 +1,4 @@ -#include "batcher.h" +#include "batching_pool.h" #include @@ -8,7 +8,7 @@ namespace marian { namespace bergamot { -Batcher::Batcher(Ptr options) { +BatchingPool::BatchingPool(Ptr options) { miniBatchWords = options->get("mini-batch-words"); bucket_.resize(options->get("max-length-break") + 1); ABORT_IF(bucket_.size() - 1 > miniBatchWords, @@ -16,7 +16,7 @@ Batcher::Batcher(Ptr options) { "longer than what can fit in a batch."); } -bool Batcher::cleaveBatch(Batch &batch) { +size_t BatchingPool::generateBatch(Batch &batch) { // For now simply iterates on buckets and converts batches greedily. This // has to be enhanced with optimizing over priority. 
The baseline // implementation should at least be as fast as marian's maxi-batch with full @@ -35,22 +35,23 @@ bool Batcher::cleaveBatch(Batch &batch) { } else { // Check if elements exist assert(batch.size() > 0); - return true; + return batch.size(); } } } - bool isValidBatch = batch.size() > 0; - return isValidBatch; + return batch.size(); } -void Batcher::addWholeRequest(Ptr request) { +size_t BatchingPool::enqueueRequest(Ptr request) { for (size_t i = 0; i < request->numSegments(); i++) { RequestSentence sentence(i, request); size_t bucket_id = sentence.numTokens(); assert(bucket_id < bucket_.size()); bucket_[bucket_id].insert(sentence); } + + return request->numSegments(); } } // namespace bergamot diff --git a/src/translator/batcher.h b/src/translator/batching_pool.h similarity index 63% rename from src/translator/batcher.h rename to src/translator/batching_pool.h index 277bfc934..68b2cf0d0 100644 --- a/src/translator/batcher.h +++ b/src/translator/batching_pool.h @@ -1,5 +1,5 @@ -#ifndef SRC_BERGAMOT_BATCHER_H_ -#define SRC_BERGAMOT_BATCHER_H_ +#ifndef SRC_BERGAMOT_BATCHING_POOL_H_ +#define SRC_BERGAMOT_BATCHING_POOL_H_ #include #include @@ -12,24 +12,21 @@ namespace marian { namespace bergamot { -class Batcher { + +class BatchingPool { public: - explicit Batcher(Ptr options); + explicit BatchingPool(Ptr options); // RequestSentence incorporates (tentative) notions of priority with each // sentence. This method inserts the sentence into the internal data-structure // which maintains priority among sentences from multiple concurrent requests. - void addWholeRequest(Ptr request); - - // indicate no more sentences will be added. Does nothing here, for parity to threadsafe version. - void shutdown() {} - - bool operator>>(Batch &batch) { return cleaveBatch(batch); } + size_t enqueueRequest(Ptr request); - private: // Loads sentences with sentences compiled from (tentatively) multiple // requests optimizing for both padding and priority. 
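The bucketed batching scheme above (sentences binned by token count, a batch filled greedily under a mini-batch-words budget) can be sketched as a toy. `ToyBatchingPool` is a hypothetical simplification that keeps only sentence lengths and omits the real `RequestSentence` bookkeeping:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy version of the length-bucketed pool: sentences are binned by token
// count, and a batch is filled greedily bucket-by-bucket up to a word budget.
class ToyBatchingPool {
 public:
  ToyBatchingPool(size_t maxLength, size_t miniBatchWords)
      : bucket_(maxLength + 1), miniBatchWords_(miniBatchWords) {}

  // enqueueRequest analogue: bin each sentence by its token count.
  void enqueue(size_t numTokens) { bucket_[numTokens].push_back(numTokens); }

  // generateBatch analogue: greedily take sentences, shortest buckets first,
  // while the word budget allows.
  std::vector<size_t> generateBatch() {
    std::vector<size_t> batch;
    size_t words = 0;
    for (auto &bin : bucket_) {
      while (!bin.empty() && words + bin.back() <= miniBatchWords_) {
        words += bin.back();
        batch.push_back(bin.back());
        bin.pop_back();
      }
      if (!bin.empty()) break;  // budget exhausted at this length
    }
    return batch;
  }

 private:
  std::vector<std::vector<size_t>> bucket_;  // bucket_[n] holds sentences of n tokens
  size_t miniBatchWords_;
};
```

Binning by exact length keeps padding near zero within a batch, which is the point of the bucket array in `BatchingPool`.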
- bool cleaveBatch(Batch &batch); + size_t generateBatch(Batch &batch); + + private: size_t miniBatchWords; std::vector> bucket_; size_t batchNumber_{0}; @@ -38,4 +35,4 @@ class Batcher { } // namespace bergamot } // namespace marian -#endif // SRC_BERGAMOT_BATCHER_H_ +#endif // SRC_BERGAMOT_BATCHING_POOL_H_ diff --git a/src/translator/definitions.h b/src/translator/definitions.h index a0f544ded..66ebb03b4 100644 --- a/src/translator/definitions.h +++ b/src/translator/definitions.h @@ -41,6 +41,9 @@ struct ByteRange { const size_t size() const { return end - begin; } }; +class Response; +using CallbackType = std::function; + } // namespace bergamot } // namespace marian diff --git a/src/translator/parser.cpp b/src/translator/parser.cpp new file mode 100644 index 000000000..d927409b5 --- /dev/null +++ b/src/translator/parser.cpp @@ -0,0 +1,170 @@ +#include "parser.h" + +#include + +#include "common/build_info.h" +#include "common/config.h" +#include "common/regex.h" +#include "common/version.h" + +namespace marian { +namespace bergamot { + +std::istringstream &operator>>(std::istringstream &in, OpMode &mode) { + std::string modeString; + in >> modeString; + std::unordered_map table = { + {"wasm", OpMode::APP_WASM}, + {"native", OpMode::APP_NATIVE}, + {"decoder", OpMode::APP_DECODER}, + {"test-response-source-sentences", OpMode::TEST_SOURCE_SENTENCES}, + {"test-response-target-sentences", OpMode::TEST_TARGET_SENTENCES}, + {"test-response-source-words", OpMode::TEST_SOURCE_WORDS}, + {"test-response-target-words", OpMode::TEST_TARGET_WORDS}, + {"test-quality-estimator-words", OpMode::TEST_QUALITY_ESTIMATOR_WORDS}, + {"test-quality-estimator-scores", OpMode::TEST_QUALITY_ESTIMATOR_SCORES}, + {"test-forward-backward", OpMode::TEST_FORWARD_BACKWARD_FOR_OUTBOUND}, + }; + + auto query = table.find(modeString); + if (query != table.end()) { + mode = query->second; + } else { + ABORT("Unknown mode {}", modeString); + } + + return in; +} + +ConfigParser::ConfigParser() : 
app_{"Bergamot Options"} { + addSpecialOptions(app_); + addOptionsBoundToConfig(app_, config_); +}; + +void ConfigParser::parseArgs(int argc, char *argv[]) { + try { + app_.parse(argc, argv); + handleSpecialOptions(); + } catch (const CLI::ParseError &e) { + exit(app_.exit(e)); + } +} + +void ConfigParser::addSpecialOptions(CLI::App &app) { + app.add_flag("--build-info", build_info_, "Print build-info and exit"); + app.add_flag("--version", version_, "Print version-info and exit"); +} + +void ConfigParser::handleSpecialOptions() { + if (build_info_) { +#ifndef _MSC_VER // cmake build options are not available on MSVC based build. + std::cerr << cmakeBuildOptionsAdvanced() << std::endl; + exit(0); +#else // _MSC_VER + ABORT("build-info is not available on MSVC based build."); +#endif // _MSC_VER + } + + if (version_) { + std::cerr << buildVersion() << std::endl; + exit(0); + } +} + +void ConfigParser::addOptionsBoundToConfig(CLI::App &app, CLIConfig &config) { + app.add_option("--model-config-paths", config.modelConfigPaths, + "Configuration files list, can be used for pivoting multiple models or multiple model workflows"); + + app.add_flag("--bytearray", config.byteArray, + "Flag holds whether to construct service from bytearrays, only for testing purpose"); + + app.add_flag("--check-bytearray", config.validateByteArray, + "Flag holds whether to check the content of the bytearrays (true by default)"); + + app.add_option("--cpu-threads", config.numWorkers, "Number of worker threads to use for translation"); + + app_.add_option("--bergamot-mode", config.opMode, "Operating mode for bergamot: [wasm, native, decoder]"); +} + +std::shared_ptr parseOptionsFromFilePath(const std::string &configPath, bool validate /*= true*/) { + // Read entire string and redirect to parseOptionsFromString + std::ifstream readStream(configPath); + std::stringstream buffer; + buffer << readStream.rdbuf(); + return parseOptionsFromString(buffer.str(), validate, 
/*pathsInSameDirAs=*/configPath); +}; + +std::shared_ptr parseOptionsFromString(const std::string &configAsString, bool validate /*= true*/, + std::string pathsInSameDirAs /*=""*/) { + marian::Options options; + + marian::ConfigParser configParser(cli::mode::translation); + + // These are additional options we use to hijack for our own marian-replacement layer (for batching, + // multi-request-compile etc) and hence goes into Ptr. + configParser.addOption("--max-length-break", "Bergamot Options", + "Maximum input tokens to be processed in a single sentence.", 128); + + // The following is a complete hijack of an existing option, so no need to add explicitly. + // configParser.addOption("--mini-batch-words", "Bergamot Options", + // "Maximum input tokens to be processed in a single sentence.", 1024); + + configParser.addOption("--ssplit-prefix-file", "Bergamot Options", + "File with nonbreaking prefixes for sentence splitting."); + + configParser.addOption("--ssplit-mode", "Bergamot Options", "[paragraph, sentence, wrapped_text]", + "paragraph"); + + configParser.addOption("--quality", "Bergamot Options", "File considering Quality Estimation model"); + + // Parse configs onto defaultConfig. The preliminary merge sets the YAML internal representation with legal values. + const YAML::Node &defaultConfig = configParser.getConfig(); + options.merge(defaultConfig); + options.parse(configAsString); + + // This is in a marian `.cpp` as of now, and requires explicit copy-here. 
+ // https://github.com/marian-nmt/marian-dev/blob/9fa166be885b025711f27b35453e0f2c00c9933e/src/common/config_parser.cpp#L28 + + // clang-format off + const std::set PATHS = { + "model", + "models", + "train-sets", + "vocabs", + "embedding-vectors", + "valid-sets", + "valid-script-path", + "valid-script-args", + "valid-log", + "valid-translation-output", + "input", // except: 'stdin', handled in makeAbsolutePaths and interpolateEnvVars + "output", // except: 'stdout', handled in makeAbsolutePaths and interpolateEnvVars + "pretrained-model", + "data-weighting", + "log", + "sqlite", // except: 'temporary', handled in the processPaths function + "shortlist", // except: only the first element in the sequence is a path, handled in the + // processPaths function + "ssplit-prefix-file", // added for bergamot + "quality", // added for bergamot + }; + // clang-format on + + if (!pathsInSameDirAs.empty()) { + YAML::Node configYAML = options.cloneToYamlNode(); + marian::cli::makeAbsolutePaths(configYAML, pathsInSameDirAs, PATHS); + options.merge(configYAML, /*overwrite=*/true); + } + + // Perform validation on parsed options only when requested + if (validate) { + YAML::Node configYAML = options.cloneToYamlNode(); + marian::ConfigValidator validator(configYAML); + validator.validateOptions(marian::cli::mode::translation); + } + + return std::make_shared(options); +} + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/parser.h b/src/translator/parser.h index 54aaaf86a..c9fffcebf 100644 --- a/src/translator/parser.h +++ b/src/translator/parser.h @@ -1,6 +1,10 @@ #ifndef SRC_BERGAMOT_PARSER_H #define SRC_BERGAMOT_PARSER_H +#include +#include + +#include "3rd_party/marian-dev/src/3rd_party/CLI/CLI.hpp" #include "3rd_party/yaml-cpp/yaml.h" #include "common/config_parser.h" #include "common/config_validator.h" @@ -10,65 +14,63 @@ namespace marian { namespace bergamot { -inline marian::ConfigParser createConfigParser() { - marian::ConfigParser 
cp(marian::cli::mode::translation); - cp.addOption("--ssplit-prefix-file", "Bergamot Options", - "File with nonbreaking prefixes for sentence splitting."); - - cp.addOption("--ssplit-mode", "Server Options", "[paragraph, sentence, wrapped_text]", "paragraph"); - - cp.addOption("--max-length-break", "Bergamot Options", - "Maximum input tokens to be processed in a single sentence.", 128); - - cp.addOption("--bytearray", "Bergamot Options", - "Flag holds whether to construct service from bytearrays, only for testing purpose", false); - - cp.addOption("--check-bytearray", "Bergamot Options", - "Flag holds whether to check the content of the bytearrays (true by default)", true); - - cp.addOption("--bergamot-mode", "Bergamot Options", - "Operating mode for bergamot: [wasm, native, decoder]", "native"); - - cp.addOption("--quality", "Bergamot Options", "File considering Quality Estimation model"); - - return cp; -} - -inline std::shared_ptr parseOptions(const std::string &config, bool validate = true) { - marian::Options options; - - // @TODO(jerinphilip) There's something off here, @XapaJIaMnu suggests - // that should not be using the defaultConfig. This function only has access - // to std::string config and needs to be able to construct Options from the - // same. - - // Absent the following code-segment, there is a parsing exception thrown on - // rebuilding YAML. - // - // Error: Unhandled exception of type 'N4YAML11InvalidNodeE': invalid node; - // this may result from using a map iterator as a sequence iterator, or - // vice-versa - // - // Error: Aborted from void unhandledException() in - // 3rd_party/marian-dev/src/common/logging.cpp:113 - - marian::ConfigParser configParser = createConfigParser(); - const YAML::Node &defaultConfig = configParser.getConfig(); - - options.merge(defaultConfig); - - // Parse configs onto defaultConfig. 
- options.parse(config); - YAML::Node configCopy = options.cloneToYamlNode(); - - if (validate) { - // Perform validation on parsed options only when requested - marian::ConfigValidator validator(configCopy); - validator.validateOptions(marian::cli::mode::translation); - } - - return std::make_shared(options); -} +enum OpMode { + APP_WASM, + APP_NATIVE, + APP_DECODER, + TEST_SOURCE_SENTENCES, + TEST_TARGET_SENTENCES, + TEST_SOURCE_WORDS, + TEST_TARGET_WORDS, + TEST_QUALITY_ESTIMATOR_WORDS, + TEST_QUALITY_ESTIMATOR_SCORES, + TEST_FORWARD_BACKWARD_FOR_OUTBOUND, +}; + +/// Overload for CL11, convert a read from a stringstream into opmode. +std::istringstream &operator>>(std::istringstream &in, OpMode &mode); + +struct CLIConfig { + using ModelConfigPaths = std::vector; + ModelConfigPaths modelConfigPaths; + bool byteArray; + bool validateByteArray; + size_t numWorkers; + OpMode opMode; +}; + +/// ConfigParser for bergamot. Internally stores config options with CLIConfig. CLI11 parsing binds the parsing code to +/// write to the members of the CLIConfig instance owned by this class. Usage: +/// +/// ```cpp +/// ConfigParser configParser; +/// configParser.parseArgs(argc, argv); +/// auto &config = configParser.getConfig(); +/// ``` +class ConfigParser { + public: + ConfigParser(); + void parseArgs(int argc, char *argv[]); + const CLIConfig &getConfig() { return config_; } + + private: + // Special Options: build-info and version. These are not taken down further, the respective logic executed and + // program exits after. 
+ void addSpecialOptions(CLI::App &app); + void handleSpecialOptions(); + + void addOptionsBoundToConfig(CLI::App &app, CLIConfig &config); + + CLIConfig config_; + CLI::App app_; + + bool build_info_{false}; + bool version_{false}; +}; + +std::shared_ptr parseOptionsFromString(const std::string &config, bool validate = true, + std::string pathsInSameDirAs = ""); +std::shared_ptr parseOptionsFromFilePath(const std::string &config, bool validate = true); } // namespace bergamot } // namespace marian diff --git a/src/translator/request.h b/src/translator/request.h index a2ea1af86..d2645f6d8 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -19,7 +19,7 @@ namespace bergamot { /// A Request is an internal representation used to represent a request after /// processed by TextProcessor into sentences constituted by marian::Words. /// -/// The batching mechanism (Batcher) draws from multiple Requests and compiles +/// The batching mechanism (BatchingPool) draws from multiple Requests and compiles /// sentences into a batch. When a batch completes translation (at /// BatchTranslator, intended in a different thread), backward propogation /// happens through: @@ -60,7 +60,7 @@ class Request { Segment getSegment(size_t index) const; /// For notions of priority among requests, used to enable std::set in - /// Batcher. + /// BatchingPool. bool operator<(const Request &request) const; /// Processes a history obtained after translating in a heterogenous batch @@ -90,7 +90,7 @@ class Request { /// A RequestSentence provides a view to a sentence within a Request. Existence /// of this class allows the sentences and associated information to be kept -/// within Request, while batching mechanism (Batcher) compiles Batch from +/// within Request, while batching mechanism (BatchingPool) compiles Batch from /// RequestSentence-s coming from different Requests. 
class RequestSentence { public: diff --git a/src/translator/response_builder.h b/src/translator/response_builder.h index 614c7c282..36bae1e9e 100644 --- a/src/translator/response_builder.h +++ b/src/translator/response_builder.h @@ -29,7 +29,7 @@ class ResponseBuilder { /// @param [in] callback: callback with operates on the constructed Response. /// @param [in] qualityEstimator: the QualityEstimator model that can be used /// to provide translation quality probability. - ResponseBuilder(ResponseOptions responseOptions, AnnotatedText &&source, Vocabs &vocabs, + ResponseBuilder(ResponseOptions responseOptions, AnnotatedText &&source, const Vocabs &vocabs, std::function callback, const QualityEstimator &qualityEstimator) : responseOptions_(responseOptions), source_(std::move(source)), diff --git a/src/translator/service.cpp b/src/translator/service.cpp index f5996aa45..9de69ba8a 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -10,88 +10,59 @@ namespace marian { namespace bergamot { -Service::Service(Ptr options, MemoryBundle memoryBundle) - : requestId_(0), - options_(options), - vocabs_(options, std::move(memoryBundle.vocabs)), - text_processor_(options, vocabs_, std::move(memoryBundle.ssplitPrefixFile)), - batcher_(options), - numWorkers_(std::max(1, options->get("cpu-threads"))), - modelMemory_(std::move(memoryBundle.model)), - shortlistMemory_(std::move(memoryBundle.shortlist)), - qualityEstimator_(createQualityEstimator(getQualityEstimatorModel(memoryBundle, options))) -#ifdef WASM_COMPATIBLE_SOURCE - , - blocking_translator_(DeviceId(0, DeviceType::cpu), vocabs_, options_, &modelMemory_, &shortlistMemory_) -#endif -{ -#ifdef WASM_COMPATIBLE_SOURCE - blocking_translator_.initialize(); -#else - workers_.reserve(numWorkers_); - for (size_t cpuId = 0; cpuId < numWorkers_; cpuId++) { - workers_.emplace_back([cpuId, this] { - marian::DeviceId deviceId(cpuId, DeviceType::cpu); - BatchTranslator translator(deviceId, vocabs_, options_, 
&modelMemory_, &shortlistMemory_); - translator.initialize(); - Batch batch; - // Run thread mainloop - while (batcher_ >> batch) { - translator.translate(batch); - } - }); - } -#endif -} +BlockingService::BlockingService(const BlockingService::Config &config) : requestId_(0), batchingPool_() {} -#ifdef WASM_COMPATIBLE_SOURCE -std::vector Service::translateMultiple(std::vector &&inputs, ResponseOptions responseOptions) { - // We queue the individual Requests so they get compiled at batches to be - // efficiently translated. +std::vector BlockingService::translateMultiple(std::shared_ptr translationModel, + std::vector &&sources, + const ResponseOptions &responseOptions) { std::vector responses; - responses.resize(inputs.size()); + responses.resize(sources.size()); - for (size_t i = 0; i < inputs.size(); i++) { + for (size_t i = 0; i < sources.size(); i++) { auto callback = [i, &responses](Response &&response) { responses[i] = std::move(response); }; // - queueRequest(std::move(inputs[i]), std::move(callback), responseOptions); + Ptr request = + translationModel->makeRequest(requestId_++, std::move(sources[i]), callback, responseOptions); + batchingPool_.enqueueRequest(translationModel, request); } Batch batch; - // There's no need to do shutdown here because it's single threaded. 
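The callback wiring in `translateMultiple` above — each source gets an index-capturing callback that writes its `Response` into the right slot, after which the single thread drains the queued work — can be sketched without any of the marian machinery. The string "translation" and the `pending` queue below are assumptions standing in for the real batching pool and model:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the blocking translateMultiple pattern: queue one closure
// per source, then drain synchronously. No shutdown handshake is needed
// because producer and consumer are the same (single) thread.
std::vector<std::string> translateMultiple(std::vector<std::string> &&sources) {
  std::vector<std::string> responses(sources.size());
  std::vector<std::function<void()>> pending;  // stands in for queued batches
  for (size_t i = 0; i < sources.size(); ++i) {
    std::string source = std::move(sources[i]);
    pending.push_back([i, &responses, source]() {
      responses[i] = "[translated] " + source;  // fake translation
    });
  }
  for (auto &work : pending) work();  // single-threaded drain loop
  return responses;
}
```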
- while (batcher_ >> batch) { - blocking_translator_.translate(batch); + Ptr model{nullptr}; + while (batchingPool_.generateBatch(model, batch)) { + model->translateBatch(/*deviceId=*/0, batch); } return responses; } -#endif - -void Service::queueRequest(std::string &&input, std::function &&callback, - ResponseOptions responseOptions) { - Segments segments; - AnnotatedText source; - - text_processor_.process(std::move(input), source, segments); - - ResponseBuilder responseBuilder(responseOptions, std::move(source), vocabs_, std::move(callback), *qualityEstimator_); - Ptr request = New(requestId_++, std::move(segments), std::move(responseBuilder)); - - batcher_.addWholeRequest(request); -} -void Service::translate(std::string &&input, std::function &&callback, - ResponseOptions responseOptions) { - queueRequest(std::move(input), std::move(callback), responseOptions); +AsyncService::AsyncService(const AsyncService::Config &config) : requestId_(0), config_(config), safeBatchingPool_() { + ABORT_IF(config_.numWorkers == 0, "Number of workers should be at least 1 in a threaded workflow"); + workers_.reserve(config_.numWorkers); + for (size_t cpuId = 0; cpuId < config_.numWorkers; cpuId++) { + workers_.emplace_back([cpuId, this] { + // Consumer thread main-loop. Note that this is an infinite-loop unless the monitor is explicitly told to + // shutdown, which happens in the destructor for this class. 
+ Batch batch; + Ptr translationModel{nullptr}; + while (safeBatchingPool_.generateBatch(translationModel, batch)) { + translationModel->translateBatch(cpuId, batch); + } + }); + } } -Service::~Service() { - batcher_.shutdown(); -#ifndef WASM_COMPATIBLE_SOURCE +AsyncService::~AsyncService() { + safeBatchingPool_.shutdown(); for (std::thread &worker : workers_) { assert(worker.joinable()); worker.join(); } -#endif +} + +void AsyncService::translate(std::shared_ptr translationModel, std::string &&source, + CallbackType callback, const ResponseOptions &responseOptions) { + // Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting. + Ptr request = translationModel->makeRequest(requestId_++, std::move(source), callback, responseOptions); + safeBatchingPool_.enqueueRequest(translationModel, request); } } // namespace bergamot diff --git a/src/translator/service.h b/src/translator/service.h index 3a3d616fc..d37f5c262 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -1,146 +1,116 @@ #ifndef SRC_BERGAMOT_SERVICE_H_ #define SRC_BERGAMOT_SERVICE_H_ -#include "batch_translator.h" +#include +#include +#include + #include "data/types.h" #include "quality_estimator.h" #include "response.h" #include "response_builder.h" #include "text_processor.h" -#include "threadsafe_batcher.h" +#include "threadsafe_batching_pool.h" +#include "translation_model.h" #include "translator/parser.h" #include "vocabs.h" -#ifndef WASM_COMPATIBLE_SOURCE -#include -#endif - -#include -#include - namespace marian { namespace bergamot { -/// This is intended to be similar to the ones provided for training or -/// decoding in ML pipelines with the following additional capabilities: -/// -/// 1. Provision of a request -> response based translation flow unlike the -/// usual a line based translation or decoding provided in most ML frameworks. -/// 2. 
Internal handling of normalization etc which changes source text to -/// provide to client translation meta-information like alignments consistent -/// with the unnormalized input text. -/// 3. The API splits each text entry into sentences internally, which are then -/// translated independent of each other. The translated sentences are then -/// joined back together and returned in Response. -/// -/// Service exposes methods to instantiate from a string configuration (which -/// can cover most translators) and to translate an incoming blob of text. -/// -/// Optionally Service can be initialized by also passing bytearray memories -/// for purposes of efficiency (which defaults to empty and then reads from -/// file supplied through config). +class BlockingService; +class AsyncService; + +/// See AsyncService. /// -class Service { +/// BlockingService is a not-threaded counterpart of AsyncService which can operate only in a blocking workflow (queue a +/// bunch of texts and optional args to translate, wait till the translation finishes). +class BlockingService { public: - /// Construct Service from Marian options. If memoryBundle is empty, Service is - /// initialized from file-based loading. Otherwise, Service is initialized from - /// the given bytearray memories. - /// @param options Marian options object - /// @param memoryBundle holds all byte-array memories. Can be a set/subset of - /// model, shortlist, vocabs and ssplitPrefixFile or QualityEstimation bytes. Optional. - explicit Service(Ptr options, MemoryBundle memoryBundle = {}); - - /// Construct Service from a string configuration. If memoryBundle is empty, Service is - /// initialized from file-based loading. Otherwise, Service is initialized from - /// the given bytearray memories. - /// @param [in] config string parsable as YAML expected to adhere with marian config - /// @param [in] memoryBundle holds all byte-array memories. 
Can be a set/subset of - /// model, shortlist, vocabs and ssplitPrefixFile or qualityEstimation bytes. Optional. - explicit Service(const std::string &config, MemoryBundle memoryBundle = {}) - : Service(parseOptions(config, /*validate=*/false), std::move(memoryBundle)) {} - - /// Explicit destructor to clean up after any threads initialized in - /// asynchronous operation mode. - ~Service(); - - /// Translate an input, providing Options to construct Response. This is - /// useful when one has to set/unset alignments or quality in the Response to - /// save compute spent in constructing these objects. - /// - /// @param [in] source: rvalue reference of the string to be translated - /// @param [in] callback: A callback function provided by the client which - /// accepts an rvalue of a Response. Called on successful construction of a - /// Response following completion of translation of source by worker threads. - /// @param [in] responseOptions: Options indicating whether or not to include - /// some member in the Response, also specify any additional configurable - /// parameters. - void translate(std::string &&source, std::function &&callback, - ResponseOptions options = ResponseOptions()); - -#ifdef WASM_COMPATIBLE_SOURCE - /// Translate multiple text-blobs in a single *blocking* API call, providing - /// ResponseOptions which applies across all text-blobs dictating how to - /// construct Response. ResponseOptions can be used to enable/disable - /// additional information like quality-scores, alignments etc. - /// - /// All texts are combined to efficiently construct batches together providing - /// speedups compared to calling translate() indepdently on individual - /// text-blob. Note that there will be minor differences in output when - /// text-blobs are individually translated due to approximations but similar - /// quality nonetheless. If you have async/multithread capabilities, it is - /// recommended to work with callbacks and translate() API. 
- /// - /// @param [in] source: rvalue reference of the string to be translated - /// @param [in] responseOptions: ResponseOptions indicating whether or not - /// to include some member in the Response, also specify any additional - /// configurable parameters. - std::vector translateMultiple(std::vector &&source, ResponseOptions responseOptions); -#endif + struct Config {}; + /// Construct a BlockingService with configuration loaded from an Options object. Does not require any keys, values to + /// be set. + BlockingService(const BlockingService::Config &config); + + /// Translate multiple text-blobs in a single *blocking* API call, providing ResponseOptions which applies across all + /// text-blobs dictating how to construct Response. ResponseOptions can be used to enable/disable additional + /// information like quality-scores, alignments etc. - /// Returns if model is alignment capable or not. - bool isAlignmentSupported() const { return options_->hasAndNotEmpty("alignment"); } + /// If you have async/multithread capabilities, it is recommended to work with AsyncService instead of this class. + /// Note that due to batching differences and consequent floating-point rounding differences, this is not guaranteed + /// to have the same output as AsyncService. + + /// @param [in] translationModel: TranslationModel to use for the request. + /// @param [in] source: rvalue reference of the string to be translated + /// @param [in] responseOptions: ResponseOptions indicating whether or not to include some member in the Response, + /// also specify any additional configurable parameters. + std::vector translateMultiple(std::shared_ptr translationModel, + std::vector &&source, const ResponseOptions &responseOptions); private: - /// Queue an input for translation. - void queueRequest(std::string &&input, std::function &&callback, ResponseOptions responseOptions); + /// Numbering requests processed through this instance. Used to keep account of arrival times of the request. 
This + /// allows for using this quantity in priority based ordering. + size_t requestId_; - /// Translates through direct interaction between batcher_ and translators_ + /// An aggregate batching pool associated with an async translating instance, which maintains an aggregate queue of + /// requests compiled from batching-pools of multiple translation models. Not thread-safe. + AggregateBatchingPool batchingPool_; - /// Number of workers to launch. - size_t numWorkers_; + Config config_; +}; - /// Options object holding the options Service was instantiated with. - Ptr options_; +/// Effectively a threadpool, providing an API to take a translation request of a source-text, paramaterized by +/// TranslationModel to be used for translation. Configurability on optional items for the Response corresponding to a +/// request is provisioned through ResponseOptions. +class AsyncService { + public: + struct Config { + size_t numWorkers; + }; + /// Construct an AsyncService with configuration loaded from Options. Expects positive integer value for + /// `cpu-threads`. Additionally requires options which configure AggregateBatchingPool. + AsyncService(const AsyncService::Config &config); + + /// Create a TranslationModel compatible with this instance of Service. Internally assigns how many replicas of + /// backend needed based on worker threads set. See TranslationModel for documentation on other params. + template + Ptr createCompatibleModel(const ConfigType &config, MemoryBundle &&memory = MemoryBundle{}) { + // @TODO: Remove this remove this dependency/coupling. + return New(config, std::move(memory), /*replicas=*/config_.numWorkers); + } + + /// With the supplied TranslationModel, translate an input. A Response is constructed with optional items set/unset + /// indicated via ResponseOptions. Upon completion translation of the input, the client supplied callback is triggered + /// with the constructed Response. Concurrent-calls to this function are safe. 
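The producer/callback contract of `AsyncService::translate` — hand over the source, get the finished `Response` later via the client-supplied callback — can be illustrated with a deliberately tiny stand-in. `MiniAsyncService` below is hypothetical: it spawns one thread per request purely to keep the sketch short, whereas the real service feeds a fixed worker pool through the batching pool:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <vector>

using Callback = std::function<void(std::string &&)>;

// Hypothetical stand-in for the async translate() callback pattern.
class MiniAsyncService {
 public:
  void translate(std::string &&source, Callback callback) {
    workers_.emplace_back(
        [source = std::move(source), callback = std::move(callback)]() mutable {
          std::string translated = "[translated] " + source;  // fake translation
          callback(std::move(translated));                    // fire client callback
        });
  }
  ~MiniAsyncService() {
    for (auto &worker : workers_) worker.join();  // explicit joins, as in AsyncService
  }

 private:
  std::vector<std::thread> workers_;
};
```

A client that needs blocking behaviour can bridge the callback to a `std::promise`, which is roughly how a synchronous wrapper over this API would look.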
+ /// + /// @param [in] translationModel: TranslationModel to use for the request. + /// @param [in] source: rvalue reference of the string to be translated. This is available as-is to the client later + /// in the Response corresponding to this call along with the translated-text and meta-data. + /// @param [in] callback: A callback function provided by the client which accepts an rvalue of a Response. + /// @param [in] responseOptions: Options indicating whether or not to include some member in the Response, also + /// specify any additional configurable parameters. + void translate(std::shared_ptr translationModel, std::string &&source, CallbackType callback, + const ResponseOptions &options = ResponseOptions()); + + /// Thread joins and proper shutdown are required to be handled explicitly. + ~AsyncService(); - /// Model memory to load model passed as bytes. - AlignedMemory modelMemory_; // ORDER DEPENDENCY (translators_) - /// Shortlist memory passed as bytes. - AlignedMemory shortlistMemory_; // ORDER DEPENDENCY (translators_) + private: + AsyncService::Config config_; - std::shared_ptr qualityEstimator_; + std::vector workers_; /// Stores requestId of active request. Used to establish /// ordering among requests and logging/book-keeping. + /// Numbering requests processed through this instance. Used to keep account of arrival times of the request. This + /// allows for using this quantity in priority based ordering. size_t requestId_; - /// Store vocabs representing source and target. - Vocabs vocabs_; // ORDER DEPENDENCY (text_processor_) - - /// TextProcesser takes a blob of text and converts into format consumable by - /// the batch-translator and annotates sentences and words. - TextProcessor text_processor_; // ORDER DEPENDENCY (vocabs_) - - /// Batcher handles generation of batches from a request, subject to - /// packing-efficiency and priority optimization heuristics. 
- ThreadsafeBatcher batcher_; - - // The following constructs are available providing full capabilities on a non - // WASM platform, where one does not have to hide threads. -#ifdef WASM_COMPATIBLE_SOURCE - BatchTranslator blocking_translator_; // ORDER DEPENDENCY (modelMemory_, shortlistMemory_) -#else - std::vector workers_; -#endif // WASM_COMPATIBLE_SOURCE + + /// An aggregate batching pool associated with an async translating instance, which maintains an aggregate queue of + /// requests compiled from batching-pools of multiple translation models. The batching pool is wrapped around one + /// object for thread-safety. + ThreadsafeBatchingPool safeBatchingPool_; }; } // namespace bergamot diff --git a/src/translator/text_processor.cpp b/src/translator/text_processor.cpp index 249ce8cda..b747f79a5 100644 --- a/src/translator/text_processor.cpp +++ b/src/translator/text_processor.cpp @@ -52,7 +52,7 @@ ug::ssplit::SentenceSplitter loadSplitter(const AlignedMemory &memory) { } // namespace -Segment TextProcessor::tokenize(const string_view &segment, std::vector &wordRanges) { +Segment TextProcessor::tokenize(const string_view &segment, std::vector &wordRanges) const { // vocabs_->sources().front() is invoked as we currently only support one source vocab return vocabs_.sources().front()->encodeWithByteRanges(segment, wordRanges, /*addEOS=*/false, /*inference=*/true); } @@ -81,10 +81,10 @@ TextProcessor::TextProcessor(Ptr options, const Vocabs &vocabs, const A void TextProcessor::parseCommonOptions(Ptr options) { maxLengthBreak_ = options->get("max-length-break"); - ssplitMode_ = string2splitmode(options->get("ssplit-mode", "paragraph")); + ssplitMode_ = string2splitmode(options->get("ssplit-mode")); } -void TextProcessor::process(std::string &&input, AnnotatedText &source, Segments &segments) { +void TextProcessor::process(std::string &&input, AnnotatedText &source, Segments &segments) const { source = std::move(AnnotatedText(std::move(input))); std::string_view 
input_converted(source.text.data(), source.text.size()); auto sentenceStream = ug::ssplit::SentenceStream(input_converted, ssplit_, ssplitMode_); @@ -108,7 +108,7 @@ void TextProcessor::process(std::string &&input, AnnotatedText &source, Segments } void TextProcessor::wrap(Segment &segment, std::vector &wordRanges, Segments &segments, - AnnotatedText &source) { + AnnotatedText &source) const { // There's an EOS token added to the words, manually. SentencePiece/marian-vocab is set to not append EOS. Marian // requires EOS to be at the end as a marker to start translating. So while we're supplied maxLengthBreak_ from // outside, we need to ensure there's space for EOS in each wrapped segment. diff --git a/src/translator/text_processor.h b/src/translator/text_processor.h index 1dc5a4fa7..a6c918c0e 100644 --- a/src/translator/text_processor.h +++ b/src/translator/text_processor.h @@ -47,17 +47,17 @@ class TextProcessor { /// @param [out] segments: marian::Word equivalents of the sentences processed and stored in AnnotatedText for /// consumption of marian translation pipeline. - void process(std::string &&blob, AnnotatedText &source, Segments &segments); + void process(std::string &&blob, AnnotatedText &source, Segments &segments) const; private: void parseCommonOptions(Ptr options); /// Tokenizes an input string, returns Words corresponding. Loads the /// corresponding byte-ranges into tokenRanges. - Segment tokenize(const string_view &input, std::vector &tokenRanges); + Segment tokenize(const string_view &input, std::vector &tokenRanges) const; /// Wrap into sentences of at most maxLengthBreak_ tokens and add to source. 
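The wrapping rule described above — segments of at most `maxLengthBreak_` tokens, with one slot per segment reserved for the EOS marker that marian expects at the end — can be sketched on plain string tokens. The `"</s>"` literal is an assumed placeholder for the vocabulary's actual EOS id:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Segment = std::vector<std::string>;

// Split a tokenized sentence into segments of at most maxLengthBreak tokens,
// reserving the last slot of each segment for EOS. Assumes maxLengthBreak >= 2.
std::vector<Segment> wrap(const Segment &tokens, size_t maxLengthBreak) {
  const std::string EOS = "</s>";                // placeholder for the vocab's EOS
  const size_t maxTokens = maxLengthBreak - 1;   // leave room for EOS
  std::vector<Segment> segments;
  for (size_t begin = 0; begin < tokens.size(); begin += maxTokens) {
    size_t end = std::min(tokens.size(), begin + maxTokens);
    Segment segment(tokens.begin() + begin, tokens.begin() + end);
    segment.push_back(EOS);
    segments.push_back(std::move(segment));
  }
  return segments;
}
```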
- void wrap(Segment &sentence, std::vector &tokenRanges, Segments &segments, AnnotatedText &source); + void wrap(Segment &sentence, std::vector &tokenRanges, Segments &segments, AnnotatedText &source) const; const Vocabs &vocabs_; ///< Vocabularies used to tokenize a sentence size_t maxLengthBreak_; ///< Parameter used to wrap sentences to a maximum number of tokens diff --git a/src/translator/threadsafe_batcher.cpp b/src/translator/threadsafe_batcher.cpp deleted file mode 100644 index 38b6681a9..000000000 --- a/src/translator/threadsafe_batcher.cpp +++ /dev/null @@ -1,38 +0,0 @@ -#ifndef WASM_COMPATIBLE_SOURCE -#include "threadsafe_batcher.h" - -#include - -namespace marian { -namespace bergamot { - -ThreadsafeBatcher::ThreadsafeBatcher(Ptr options) : backend_(options), enqueued_(0), shutdown_(false) {} - -ThreadsafeBatcher::~ThreadsafeBatcher() { shutdown(); } - -void ThreadsafeBatcher::addWholeRequest(Ptr request) { - std::unique_lock lock(mutex_); - assert(!shutdown_); - backend_.addWholeRequest(request); - enqueued_ += request->numSegments(); - work_.notify_all(); -} - -void ThreadsafeBatcher::shutdown() { - std::unique_lock lock(mutex_); - shutdown_ = true; - work_.notify_all(); -} - -bool ThreadsafeBatcher::operator>>(Batch &batch) { - std::unique_lock lock(mutex_); - work_.wait(lock, [this]() { return enqueued_ || shutdown_; }); - bool ret = backend_ >> batch; - assert(ret || shutdown_); - enqueued_ -= batch.size(); - return ret; -} - -} // namespace bergamot -} // namespace marian -#endif // WASM_COMPATIBLE_SOURCE diff --git a/src/translator/threadsafe_batcher.h b/src/translator/threadsafe_batcher.h deleted file mode 100644 index d0ab7b1cc..000000000 --- a/src/translator/threadsafe_batcher.h +++ /dev/null @@ -1,57 +0,0 @@ -/* Thread-safe wrapper around batcher. 
*/ -#ifndef SRC_BERGAMOT_THREADSAFE_BATCHER_H_ -#define SRC_BERGAMOT_THREADSAFE_BATCHER_H_ - -#include "batcher.h" -#include "common/options.h" -#include "definitions.h" - -#ifndef WASM_COMPATIBLE_SOURCE -#include -#include -#endif - -namespace marian { -namespace bergamot { - -#ifdef WASM_COMPATIBLE_SOURCE -// No threads, no locks. -typedef Batcher ThreadsafeBatcher; -#else - -class ThreadsafeBatcher { - public: - explicit ThreadsafeBatcher(Ptr options); - - ~ThreadsafeBatcher(); - - // Add sentences to be translated by calling these (see Batcher). When - // done, call shutdown. - void addWholeRequest(Ptr request); - void shutdown(); - - // Get a batch out of the batcher. Return false to shutdown worker. - bool operator>>(Batch &batch); - - private: - Batcher backend_; - - // Number of sentences in backend_; - size_t enqueued_; - - // Are we shutting down? - bool shutdown_; - - // Lock on this object. - std::mutex mutex_; - - // Signaled when there are sentences to translate. - std::condition_variable work_; -}; - -#endif - -} // namespace bergamot -} // namespace marian - -#endif // SRC_BERGAMOT_THREADSAFE_BATCHER_H_ diff --git a/src/translator/threadsafe_batching_pool.cpp b/src/translator/threadsafe_batching_pool.cpp new file mode 100644 index 000000000..0c0d8d85a --- /dev/null +++ b/src/translator/threadsafe_batching_pool.cpp @@ -0,0 +1,49 @@ + +#ifndef SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_IMPL +#error "This is an impl file and must not be included directly!" +#endif + +#include + +namespace marian { +namespace bergamot { + +template +template +ThreadsafeBatchingPool::ThreadsafeBatchingPool(Args &&... args) + : backend_(std::forward(args)...), enqueued_(0), shutdown_(false) {} + +template +ThreadsafeBatchingPool::~ThreadsafeBatchingPool() { + shutdown(); +} + +template +template +void ThreadsafeBatchingPool::enqueueRequest(Args &&... 
args) { + std::unique_lock lock(mutex_); + assert(!shutdown_); + enqueued_ += backend_.enqueueRequest(std::forward(args)...); + work_.notify_all(); +} + +template +void ThreadsafeBatchingPool::shutdown() { + std::unique_lock lock(mutex_); + shutdown_ = true; + work_.notify_all(); +} + +template +template +size_t ThreadsafeBatchingPool::generateBatch(Args &&... args) { + std::unique_lock lock(mutex_); + work_.wait(lock, [this]() { return enqueued_ || shutdown_; }); + size_t sentencesInBatch = backend_.generateBatch(std::forward(args)...); + assert(sentencesInBatch > 0 || shutdown_); + enqueued_ -= sentencesInBatch; + return sentencesInBatch; +} + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/threadsafe_batching_pool.h b/src/translator/threadsafe_batching_pool.h new file mode 100644 index 000000000..96896eab3 --- /dev/null +++ b/src/translator/threadsafe_batching_pool.h @@ -0,0 +1,71 @@ +/* Thread-safe wrapper around BatchingPool or AggregateBatchingPool, made generic with templates. */ +#ifndef SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_H_ +#define SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_H_ + +#include +#include + +#include "aggregate_batching_pool.h" +#include "batching_pool.h" +#include "common/options.h" +#include "definitions.h" +#include "translation_model.h" + +namespace marian { +namespace bergamot { + +/// The following mechanism operates in a multithreaded async workflow, guarding access to pushes into the structure +/// that keeps sentences bucketed by length and sorted by priority. +/// +/// This is a wrapper around a producer-consumer queue implemented as a monitor, where there is a mutex guarding the +/// underlying data structure (BatchingPoolType) and (worker/consumer) threads waiting on a condition variable and the +/// queuing thread producing and notifying waiting threads (consumers) through the same condition variable. 
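The monitor pattern described in this comment can be sketched independently of the bergamot classes. The `MonitorQueue` name and the `int` payload below are illustrative stand-ins, not the actual `ThreadsafeBatchingPool` API; the shutdown-aware wait predicate mirrors the one used above:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Illustrative monitor: a mutex guards the queue, consumers wait on a
// condition variable, and shutdown() releases any waiting consumer.
template <class Item>
class MonitorQueue {
 public:
  void produce(Item item) {
    std::unique_lock<std::mutex> lock(mutex_);
    queue_.push(std::move(item));
    work_.notify_all();
  }

  // Blocks until an item is available or shutdown() was called.
  // Returns false once the queue is drained after shutdown.
  bool consume(Item &out) {
    std::unique_lock<std::mutex> lock(mutex_);
    work_.wait(lock, [this]() { return !queue_.empty() || shutdown_; });
    if (queue_.empty()) return false;
    out = std::move(queue_.front());
    queue_.pop();
    return true;
  }

  void shutdown() {
    std::unique_lock<std::mutex> lock(mutex_);
    shutdown_ = true;
    work_.notify_all();
  }

 private:
  std::queue<Item> queue_;
  bool shutdown_ = false;
  std::mutex mutex_;             // Lock on this object.
  std::condition_variable work_; // Signaled when there is work or shutdown.
};

// Drains the queue, analogous to a worker loop calling generateBatch().
int sumAll(MonitorQueue<int> &q) {
  int total = 0, item = 0;
  while (q.consume(item)) total += item;
  return total;
}
```

A consumer thread would loop on `consume` exactly as the workers above loop on `generateBatch`, exiting once `shutdown()` has been called and the backlog is empty.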
+/// +/// Originally written for a single model (where items are produce: Request, consume: Batch), converted to +/// also work for multiple models where items are produce: (TranslationModel, Request), consume: (TranslationModel, +/// Batch). This is accomplished by template parameter packs. +/// +/// Requires BatchingPoolType to implement the following: +/// +/// * produce: `size_t enqueueRequest(...)` (returns the number of elements produced) +/// * consume: `size_t generateBatch(...)` (returns number of elements available to be consumed) + +template +class ThreadsafeBatchingPool { + public: + template + ThreadsafeBatchingPool(Args &&... args); + ~ThreadsafeBatchingPool(); + + template + void enqueueRequest(Args &&... args); + + template + size_t generateBatch(Args &&... args); + + void shutdown(); + + private: + BatchingPoolType backend_; + + // Number of sentences in backend_. + size_t enqueued_; + + // Are we shutting down? + bool shutdown_; + + // Lock on this object. + std::mutex mutex_; + + // Signaled when there are sentences to translate. 
+ std::condition_variable work_; +}; + +} // namespace bergamot +} // namespace marian + +#define SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_IMPL +#include "threadsafe_batching_pool.cpp" +#undef SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_IMPL + +#endif // SRC_BERGAMOT_THREADSAFE_BATCHING_POOL_H_ diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp new file mode 100644 index 000000000..5a2739542 --- /dev/null +++ b/src/translator/translation_model.cpp @@ -0,0 +1,173 @@ +#include "translation_model.h" + +#include "batch.h" +#include "byte_array_util.h" +#include "common/logging.h" +#include "data/corpus.h" +#include "data/text_input.h" +#include "parser.h" +#include "translator/beam_search.h" + +namespace marian { +namespace bergamot { + +TranslationModel::TranslationModel(const Config &options, MemoryBundle &&memory /*=MemoryBundle{}*/, + size_t replicas /*=1*/) + : options_(options), + memory_(std::move(memory)), + vocabs_(options, std::move(memory_.vocabs)), + textProcessor_(options, vocabs_, std::move(memory_.ssplitPrefixFile)), + batchingPool_(options), + qualityEstimator_(createQualityEstimator(getQualityEstimatorModel(memory, options))) { + ABORT_IF(replicas == 0, "At least one replica needs to be created."); + backend_.resize(replicas); + + if (options_->hasAndNotEmpty("shortlist")) { + int srcIdx = 0, trgIdx = 1; + bool shared_vcb = + vocabs_.sources().front() == + vocabs_.target(); // vocabs_->sources().front() is invoked as we currently only support one source vocab + if (memory_.shortlist.size() > 0 && memory_.shortlist.begin() != nullptr) { + bool check = options_->get("check-bytearray", false); + shortlistGenerator_ = New(memory_.shortlist.begin(), memory_.shortlist.size(), + vocabs_.sources().front(), vocabs_.target(), srcIdx, + trgIdx, shared_vcb, check); + } else { + // Changed to BinaryShortlistGenerator to enable loading binary shortlist file + // This class also supports text shortlist file + shortlistGenerator_ = 
New(options_, vocabs_.sources().front(), vocabs_.target(), + srcIdx, trgIdx, shared_vcb); + } + } + + for (size_t idx = 0; idx < replicas; idx++) { + loadBackend(idx); + } +} + +void TranslationModel::loadBackend(size_t idx) { + auto &graph = backend_[idx].graph; + auto &scorerEnsemble = backend_[idx].scorerEnsemble; + + marian::DeviceId device_(idx, DeviceType::cpu); + graph = New(/*inference=*/true); // set the graph to be inference only + auto prec = options_->get>("precision", {"float32"}); + graph->setDefaultElementType(typeFromString(prec[0])); + graph->setDevice(device_); + graph->getBackend()->configureDevice(options_); + graph->reserveWorkspaceMB(options_->get("workspace")); + + // Marian Model: Load from memoryBundle or shortList + if (memory_.model.size() > 0 && + memory_.model.begin() != + nullptr) { // If we have provided a byte array that contains the model memory, we can initialise the + // model from there, as opposed to from reading in the config file + ABORT_IF((uintptr_t)memory_.model.begin() % 256 != 0, + "The provided memory is not aligned to 256 bytes and will crash when vector instructions are used on it."); + if (options_->get("check-bytearray", false)) { + ABORT_IF(!validateBinaryModel(memory_.model, memory_.model.size()), + "The binary file is invalid. Incomplete or corrupted download?"); + } + const std::vector container = { + memory_.model.begin()}; // Marian supports multiple models initialised in this manner hence std::vector. + // However we will only ever use 1 during decoding. + scorerEnsemble = createScorers(options_, container); + } else { + scorerEnsemble = createScorers(options_); + } + for (auto scorer : scorerEnsemble) { + scorer->init(graph); + if (shortlistGenerator_) { + scorer->setShortlistGenerator(shortlistGenerator_); + } + } + graph->forward(); +} + +// Make request process is shared between Async and Blocking workflow of translating. 
+Ptr TranslationModel::makeRequest(size_t requestId, std::string &&source, CallbackType callback, + const ResponseOptions &responseOptions) { + Segments segments; + AnnotatedText annotatedSource; + + textProcessor_.process(std::move(source), annotatedSource, segments); + ResponseBuilder responseBuilder(responseOptions, std::move(annotatedSource), vocabs_, callback, *qualityEstimator_); + + Ptr request = New(requestId, std::move(segments), std::move(responseBuilder)); + return request; +} + +Ptr TranslationModel::convertToMarianBatch(Batch &batch) { + std::vector batchVector; + auto &sentences = batch.sentences(); + + size_t batchSequenceNumber{0}; + for (auto &sentence : sentences) { + data::SentenceTuple sentence_tuple(batchSequenceNumber); + Segment segment = sentence.getUnderlyingSegment(); + sentence_tuple.push_back(segment); + batchVector.push_back(sentence_tuple); + + ++batchSequenceNumber; + } + + // Usually one would expect inputs to be [B x T], where B = batch-size and T = max seq-len among the sentences in the + // batch. However, marian's library supports multi-source and ensembling through different source-vocabulary but same + // target vocabulary. This means the inputs are 3 dimensional when converted into marian's library formatted batches. + // + // Consequently B x T projects to N x B x T, where N = ensemble size. This adaptation does not fully force the idea of + // N = 1 (the code remains general, but N iterates only from 0-1 in the nested loop). 
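For a single source vocabulary (N = 1), the padding step performed below can be illustrated with plain vectors in place of marian's `SubBatch`; `PaddedBatch` and `pack` are illustrative names, not part of the codebase, but the `k * batchSize + i` time-major indexing matches the loops in `convertToMarianBatch`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PaddedBatch {
  size_t batchSize = 0;
  size_t maxLength = 0;
  std::vector<int> data;    // time-major: data[k * batchSize + i]
  std::vector<float> mask;  // 1.0 where a real token exists, 0.0 for padding
};

// Packs variable-length token sequences into one [T x B] block, mirroring
// the SubBatch data/mask layout used by marian.
PaddedBatch pack(const std::vector<std::vector<int>> &sentences) {
  PaddedBatch batch;
  batch.batchSize = sentences.size();
  for (const auto &sentence : sentences) {
    batch.maxLength = std::max(batch.maxLength, sentence.size());
  }
  batch.data.assign(batch.batchSize * batch.maxLength, 0);
  batch.mask.assign(batch.batchSize * batch.maxLength, 0.f);
  for (size_t i = 0; i < batch.batchSize; ++i) {    // i: sentence in batch
    for (size_t k = 0; k < sentences[i].size(); ++k) {  // k: time step
      batch.data[k * batch.batchSize + i] = sentences[i][k];
      batch.mask[k * batch.batchSize + i] = 1.f;
    }
  }
  return batch;
}
```

The mask is what lets the decoder ignore padded positions; for the multi-source (N > 1) case the real code simply builds one such block per source vocabulary.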
+ + size_t batchSize = batchVector.size(); + + std::vector sentenceIds; + std::vector maxDims; + + for (auto &example : batchVector) { + if (maxDims.size() < example.size()) { + maxDims.resize(example.size(), 0); + } + for (size_t i = 0; i < example.size(); ++i) { + if (example[i].size() > static_cast(maxDims[i])) { + maxDims[i] = static_cast(example[i].size()); + } + } + sentenceIds.push_back(example.getId()); + } + + using SubBatch = marian::data::SubBatch; + std::vector> subBatches; + for (size_t j = 0; j < maxDims.size(); ++j) { + subBatches.emplace_back(New(batchSize, maxDims[j], vocabs_.sources().at(j))); + } + + std::vector words(maxDims.size(), 0); + for (size_t i = 0; i < batchSize; ++i) { + for (size_t j = 0; j < maxDims.size(); ++j) { + for (size_t k = 0; k < batchVector[i][j].size(); ++k) { + subBatches[j]->data()[k * batchSize + i] = batchVector[i][j][k]; + subBatches[j]->mask()[k * batchSize + i] = 1.f; + words[j]++; + } + } + } + + for (size_t j = 0; j < maxDims.size(); ++j) { + subBatches[j]->setWords(words[j]); + } + + using CorpusBatch = marian::data::CorpusBatch; + Ptr corpusBatch = New(subBatches); + corpusBatch->setSentenceIds(sentenceIds); + return corpusBatch; +} + +void TranslationModel::translateBatch(size_t deviceId, Batch &batch) { + auto &backend = backend_[deviceId]; + BeamSearch search(options_, backend.scorerEnsemble, vocabs_.target()); + Histories histories = search.search(backend.graph, convertToMarianBatch(batch)); + batch.completeBatch(histories); +} + +} // namespace bergamot +} // namespace marian diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h new file mode 100644 index 000000000..599e6c707 --- /dev/null +++ b/src/translator/translation_model.h @@ -0,0 +1,122 @@ +#ifndef SRC_BERGAMOT_TRANSLATION_MODEL_H_ +#define SRC_BERGAMOT_TRANSLATION_MODEL_H_ + +#include +#include + +#include "batch.h" +#include "batching_pool.h" +#include "common/utils.h" +#include "data/shortlist.h" +#include 
"definitions.h" +#include "parser.h" +#include "request.h" +#include "text_processor.h" +#include "translator/history.h" +#include "translator/scorers.h" +#include "vocabs.h" + +namespace marian { +namespace bergamot { + +/// A TranslationModel is associated with the translation of a single language direction. Holds the graph and other +/// structures required to run the forward pass of the neural network, along with preprocessing logic (TextProcessor) +/// and a BatchingPool to create batches that are to be used in conjunction with an instance. +/// +/// Thread-safety is not handled here, but the methods are available at a fine enough granularity to be used in a +/// threaded async workflow for translation. + +class TranslationModel { + public: + using Config = Ptr; + using ShortlistGenerator = Ptr; + + /// Equivalent to the options-based constructor, where `options` is parsed from string configuration. Configuration can + /// be JSON or YAML. Keys expected correspond to those of `marian-decoder`, available at + /// https://marian-nmt.github.io/docs/cmd/marian-decoder/ + /// + /// Note that `replicas` is not stable. This is a temporary workaround until the more daunting task of separating the + /// workspace from TranslationModel and binding it to threads is undertaken. Until the separation is + /// achieved, both TranslationModel and Service will need to be aware of workers. This is expected to be resolved + /// eventually, with only Service having the knowledge of how many workers are active. + /// + /// WebAssembly uses only a single thread, and we can hardcode replicas = 1 and use it anywhere and (client) needn't be + /// aware of this ugliness at the moment, thus providing a stable API solely for WebAssembly single-threaded modus + /// operandi. + /// + /// TODO(@jerinphilip): Clean this up. 
+ TranslationModel(const std::string& config, MemoryBundle&& memory, size_t replicas = 1) + : TranslationModel(parseOptionsFromString(config, /*validate=*/false), std::move(memory), replicas){}; + + /// Construct TranslationModel from marian-options. If memory is empty, TranslationModel is initialized from + /// paths available in the options object, backed by filesystem. Otherwise, TranslationModel is initialized from the + /// given MemoryBundle composed of AlignedMemory holding equivalent parameters. + /// + /// @param [in] options: Marian options object. + /// @param [in] memory: MemoryBundle object holding memory buffers containing parameters to build MarianBackend, + /// ShortlistGenerator, Vocabs and SentenceSplitter. + TranslationModel(const Config& options, MemoryBundle&& memory = MemoryBundle{}, size_t replicas = 1); + + /// Make a Request to be translated by this TranslationModel instance. + /// @param [in] requestId: Unique identifier associated with this request, available from Service. + /// @param [in] source: Source text to be translated. Ownership is accepted and eventually returned to the client in + /// Response corresponding to the Request created here. + /// @param [in] callback: Callback (from client) to be issued upon completion of translation of all sentences in the + /// created Request. + /// @param [in] responseOptions: Configuration used to prepare the Response corresponding to the created request. + // @returns Request created from the query parameters wrapped within a shared-pointer. + Ptr makeRequest(size_t requestId, std::string&& source, CallbackType callback, + const ResponseOptions& responseOptions); + + /// Relays a request to the batching-pool specific to this translation model. 
+ /// @param [in] request: Request constructed through makeRequest + void enqueueRequest(Ptr request) { batchingPool_.enqueueRequest(request); }; + + /// Generates a batch from the batching-pool for this translation model, compiling from several active requests. Note + /// that it is possible that calls to this method can give empty-batches. + /// + /// @param [out] batch: Batch to write a generated batch on to. + /// @returns number of sentences that constitute the Batch. + size_t generateBatch(Batch& batch) { return batchingPool_.generateBatch(batch); } + + /// Translate a batch generated with generateBatch + /// + /// @param [in] deviceId: There are replicas of backend created for use in each worker thread. deviceId indicates + /// which replica to use. + /// @param [in] batch: A batch generated from generateBatch from the same TranslationModel instance. + void translateBatch(size_t deviceId, Batch& batch); + + private: + Config options_; + MemoryBundle memory_; + Vocabs vocabs_; + TextProcessor textProcessor_; + + /// Maintains sentences from multiple requests bucketed by length and sorted by priority in each bucket. + BatchingPool batchingPool_; + + /// A package of marian-entities which form a backend to translate. + struct MarianBackend { + using Graph = Ptr; + using ScorerEnsemble = std::vector>; + + Graph graph; + ScorerEnsemble scorerEnsemble; + }; + + // ShortlistGenerator is purely const, we don't need one per thread. + ShortlistGenerator shortlistGenerator_; + + /// Hold replicas of the backend (graph, scorers, shortlist) for use in each thread. 
+ /// Controlled and consistent external access via graph(id), scorerEnsemble(id), + std::vector backend_; + std::shared_ptr qualityEstimator_; + + void loadBackend(size_t idx); + Ptr convertToMarianBatch(Batch& batch); +}; + +} // namespace bergamot +} // namespace marian + +#endif // SRC_BERGAMOT_TRANSLATION_MODEL_H_ diff --git a/wasm/bindings/service_bindings.cpp b/wasm/bindings/service_bindings.cpp index 416a318ad..d05cf57cf 100644 --- a/wasm/bindings/service_bindings.cpp +++ b/wasm/bindings/service_bindings.cpp @@ -8,8 +8,10 @@ using namespace emscripten; -typedef marian::bergamot::Service Service; -typedef marian::bergamot::AlignedMemory AlignedMemory; +using BlockingService = marian::bergamot::BlockingService; +using TranslationModel = marian::bergamot::TranslationModel; +using AlignedMemory = marian::bergamot::AlignedMemory; +using MemoryBundle = marian::bergamot::MemoryBundle; val getByteArrayView(AlignedMemory& alignedMemory) { return val(typed_memory_view(alignedMemory.size(), alignedMemory.as())); @@ -42,9 +44,9 @@ std::vector> prepareVocabsSmartMemories(std::vect return vocabsSmartMemories; } -marian::bergamot::MemoryBundle prepareMemoryBundle(AlignedMemory* modelMemory, AlignedMemory* shortlistMemory, - std::vector uniqueVocabsMemories) { - marian::bergamot::MemoryBundle memoryBundle; +MemoryBundle prepareMemoryBundle(AlignedMemory* modelMemory, AlignedMemory* shortlistMemory, + std::vector uniqueVocabsMemories) { + MemoryBundle memoryBundle; memoryBundle.model = std::move(*modelMemory); memoryBundle.shortlist = std::move(*shortlistMemory); memoryBundle.vocabs = std::move(prepareVocabsSmartMemories(uniqueVocabsMemories)); @@ -52,18 +54,31 @@ marian::bergamot::MemoryBundle prepareMemoryBundle(AlignedMemory* modelMemory, A return memoryBundle; } -Service* ServiceFactory(const std::string& config, AlignedMemory* modelMemory, AlignedMemory* shortlistMemory, - std::vector uniqueVocabsMemories) { - return new Service(config, 
std::move(prepareMemoryBundle(modelMemory, shortlistMemory, uniqueVocabsMemories))); +// This allows only shared_ptrs to be operational in JavaScript, according to emscripten. +// https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#smart-pointers +std::shared_ptr TranslationModelFactory(const std::string& config, AlignedMemory* model, + AlignedMemory* shortlist, + std::vector vocabs) { + MemoryBundle memoryBundle = prepareMemoryBundle(model, shortlist, vocabs); + return std::make_shared(config, std::move(memoryBundle)); } -EMSCRIPTEN_BINDINGS(translation_service) { - class_("Service") - .constructor(&ServiceFactory, allow_raw_pointers()) - .function("translate", &Service::translateMultiple) - .function("isAlignmentSupported", &Service::isAlignmentSupported); - // ^ We redirect Service::translateMultiple to WASMBound::translate instead. Sane API is - // translate. If and when async comes, we can be done with this inconsistency. +EMSCRIPTEN_BINDINGS(translation_model) { + class_("TranslationModel") + .smart_ptr_constructor("TranslationModel", &TranslationModelFactory, allow_raw_pointers()); +} + +EMSCRIPTEN_BINDINGS(blocking_service_config) { + value_object("BlockingServiceConfig"); + // .field("name", &BlockingService::Config::name") + // The above is a future hook. Note that more will come - for cache, for workspace-size or graph details limits on + // aggregate-batching etc. 
+} + +EMSCRIPTEN_BINDINGS(blocking_service) { + class_("BlockingService") + .constructor() + .function("translate", &BlockingService::translateMultiple); register_vector("VectorString"); } From c7b626dfd0217471db5f034b03fe68e6ac933d0f Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Tue, 28 Sep 2021 15:53:02 +0530 Subject: [PATCH 291/442] Adapted wasm test page for new Service interface (#224) - The new interface now supports running multiple TranslationModels --- wasm/test_page/bergamot.js | 12 +++- wasm/test_page/worker.js | 120 ++++++++++++++++++++++--------------- 2 files changed, 81 insertions(+), 51 deletions(-) diff --git a/wasm/test_page/bergamot.js b/wasm/test_page/bergamot.js index e586b213c..848fba177 100644 --- a/wasm/test_page/bergamot.js +++ b/wasm/test_page/bergamot.js @@ -32,14 +32,20 @@ document.querySelector("#load").addEventListener("click", async() => { const translateCall = () => { const text = document.querySelector('#from').value; const paragraphs = text.split("\n"); - - worker.postMessage(["translate", paragraphs]); + document.querySelector("#load").disabled = true; + const lang = document.querySelector('input[name="modellang"]:checked').value; + const from = lang.substring(0, 2); + const to = lang.substring(2, 4); + worker.postMessage(["translate", from, to, paragraphs]); + document.querySelector("#load").disabled = false; } worker.onmessage = function(e) { console.debug(`Message received from worker`); if (e.data[0] === 'translated_result') { - document.querySelector('#to').value = e.data[1].join("\n"); + if (e.data[1]) { + document.querySelector('#to').value = e.data[1].join("\n"); + } log(e.data[2]); } if ((e.data[0] === 'module_loaded') || (e.data[0] === 'model_loaded')) { diff --git a/wasm/test_page/worker.js b/wasm/test_page/worker.js index 329081011..8b53a271a 100644 --- a/wasm/test_page/worker.js +++ b/wasm/test_page/worker.js @@ -1,4 +1,6 @@ var translationService, responseOptions, 
input = undefined; +// A map of language-pair to TranslationModel object +var translationModels = new Map(); const BERGAMOT_TRANSLATOR_MODULE = "bergamot-translator-worker.js"; const encoder = new TextEncoder(); // string to utf-8 converter @@ -33,23 +35,35 @@ onmessage = async function(e) { } else if (command === 'load_model') { let start = Date.now(); - await constructTranslationService(e.data[1], e.data[2]); - result = `translation model '${e.data[1]}${e.data[2]}' successfully loaded; took ${(Date.now() - start) / 1000} secs`; + try { + await constructTranslationService(); + await constructTranslationModel(e.data[1], e.data[2]); + result = `translation model '${e.data[1]}${e.data[2]}' successfully loaded; took ${(Date.now() - start) / 1000} secs`; + } catch (error) { + result = `translation model '${e.data[1]}${e.data[2]}' loading failed: '${error.message}'`; + } log(result); log('Posting message back to main script'); postMessage(['model_loaded', result]); } else if (command === 'translate') { - const inputParagraphs = e.data[1]; + const from = e.data[1]; + const to = e.data[2]; + const inputParagraphs = e.data[3]; let inputWordCount = 0; inputParagraphs.forEach(sentence => { inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; }) let start = Date.now(); - const translatedParagraphs = translate(e.data[1]); - const secs = (Date.now() - start) / 1000; - result = `Translation of (${inputWordCount}) words took ${secs} secs (${Math.round(inputWordCount / secs)} words per second)`; + var translatedParagraphs; + try { + translatedParagraphs = translate(from, to, inputParagraphs); + const secs = (Date.now() - start) / 1000; + result = `Translation '${from}${to}' Successful. 
Speed: ${Math.round(inputWordCount / secs)} Words per second (${inputWordCount} words in ${secs} secs)`; + } catch (error) { + result = `Error: ${error.message}`; + } log(result); log('Posting message back to main script'); postMessage(['translated_result', translatedParagraphs, result]); @@ -77,8 +91,24 @@ const prepareAlignedMemoryFromBuffer = async (buffer, alignmentSize) => { return alignedMemory; } -const constructTranslationService = async (from, to) => { +// Instantiate the Translation Service +const constructTranslationService = async () => { + if (!translationService) { + var translationServiceConfig = {}; + log(`Creating Translation Service with config: ${translationServiceConfig}`); + translationService = new Module.BlockingService(translationServiceConfig); + log(`Translation Service created successfully`); + } +} + +const constructTranslationModel = async (from, to) => { const languagePair = `${from}${to}`; + if (translationModels.has(languagePair)) { + var oldModel = translationModels.get(languagePair); + // Destruct the old TranslationModel explicitly and Remove its entry from the map + oldModel.delete(); + translationModels.delete(languagePair); + } // Vocab files are re-used in both translation directions const vocabLanguagePair = from === "en" ? `${to}${from}` : languagePair; @@ -133,50 +163,44 @@ gemm-precision: int8shift log(`modelFile: ${modelFile}\nshortlistFile: ${shortlistFile}\nNo. 
of unique vocabs: ${uniqueVocabFiles.size}`); uniqueVocabFiles.forEach(item => log(`unique vocabFile: ${item}`)); - try { - // Download the files as buffers from the given urls - let start = Date.now(); - const downloadedBuffers = await Promise.all([downloadAsArrayBuffer(modelFile), downloadAsArrayBuffer(shortlistFile)]); - const modelBuffer = downloadedBuffers[0]; - const shortListBuffer = downloadedBuffers[1]; + // Download the files as buffers from the given urls + let start = Date.now(); + const downloadedBuffers = await Promise.all([downloadAsArrayBuffer(modelFile), downloadAsArrayBuffer(shortlistFile)]); + const modelBuffer = downloadedBuffers[0]; + const shortListBuffer = downloadedBuffers[1]; - const downloadedVocabBuffers = []; - for (let item of uniqueVocabFiles.values()) { - downloadedVocabBuffers.push(await downloadAsArrayBuffer(item)); - } - log(`All files for ${languagePair} language pair took ${(Date.now() - start) / 1000} secs to download`); - - // Construct AlignedMemory objects with downloaded buffers - let constructedAlignedMemories = await Promise.all([prepareAlignedMemoryFromBuffer(modelBuffer, 256), - prepareAlignedMemoryFromBuffer(shortListBuffer, 64)]); - let alignedModelMemory = constructedAlignedMemories[0]; - let alignedShortlistMemory = constructedAlignedMemories[1]; - let alignedVocabsMemoryList = new Module.AlignedMemoryList; - for(let item of downloadedVocabBuffers) { - let alignedMemory = await prepareAlignedMemoryFromBuffer(item, 64); - alignedVocabsMemoryList.push_back(alignedMemory); - } - log(`Aligned vocab memories: ${alignedVocabsMemoryList.get(0).size()}`); - log(`Aligned model memory: ${alignedModelMemory.size()}`); - log(`Aligned shortlist memory: ${alignedShortlistMemory.size()}`); - - // Instantiate the Translation Service - if (translationService) { - translationService.delete(); - translationService = undefined; - } + const downloadedVocabBuffers = []; + for (let item of uniqueVocabFiles.values()) { + 
downloadedVocabBuffers.push(await downloadAsArrayBuffer(item)); + } + log(`All files for ${languagePair} language pair took ${(Date.now() - start) / 1000} secs to download`); - log(`Creating Translation Service with config: ${modelConfig}`); - translationService = new Module.Service(modelConfig, alignedModelMemory, alignedShortlistMemory, alignedVocabsMemoryList); - if (typeof translationService === 'undefined') { - throw Error(`Translation Service construction failed`); - } - } catch (error) { - log(error); + // Construct AlignedMemory objects with downloaded buffers + let constructedAlignedMemories = await Promise.all([prepareAlignedMemoryFromBuffer(modelBuffer, 256), + prepareAlignedMemoryFromBuffer(shortListBuffer, 64)]); + let alignedModelMemory = constructedAlignedMemories[0]; + let alignedShortlistMemory = constructedAlignedMemories[1]; + let alignedVocabsMemoryList = new Module.AlignedMemoryList; + for(let item of downloadedVocabBuffers) { + let alignedMemory = await prepareAlignedMemoryFromBuffer(item, 64); + alignedVocabsMemoryList.push_back(alignedMemory); + } + log(`Aligned vocab memories: ${alignedVocabsMemoryList.get(0).size()}`); + log(`Aligned model memory: ${alignedModelMemory.size()}`); + log(`Aligned shortlist memory: ${alignedShortlistMemory.size()}`); + + log(`Creating Translation Model with config: ${modelConfig}`); + var translationModel = new Module.TranslationModel(modelConfig, alignedModelMemory, alignedShortlistMemory, alignedVocabsMemoryList); + translationModels.set(languagePair, translationModel); +} + +const translate = (from, to, paragraphs) => { + const languagePair = `${from}${to}`; + if (!translationModels.has(languagePair)) { + throw Error(`Please load translation model '${languagePair}' before translating`); } - } + translationModel = translationModels.get(languagePair); -const translate = (paragraphs) => { // Instantiate the arguments of translate() API i.e. 
ResponseOptions and input (vector) var responseOptions = new Module.ResponseOptions(); let input = new Module.VectorString; @@ -193,7 +217,7 @@ const translate = (paragraphs) => { log(`Input size: ${input.size()}`); // Translate the input, which is a vector; the result is a vector - let result = translationService.translate(input, responseOptions); + let result = translationService.translate(translationModel, input, responseOptions); const translatedParagraphs = []; const translatedSentencesOfParagraphs = []; From a0cb1e4b3d2e06027a7f979b0c66f1336a6688e9 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Tue, 19 Oct 2021 14:40:54 +0200 Subject: [PATCH 292/442] Wasm test page UI for translating b/w non-English language pairs (#231) * Updated Wasm test page UI for translating b/w non-English language pairs * Both "from" and "to" language dropdowns now allow non-English languages --- wasm/README.md | 106 +----- wasm/test_page/bergamot-httpserver.js | 61 ++- wasm/test_page/bergamot.html | 66 ---- wasm/test_page/bergamot.js | 54 --- wasm/test_page/css/index.css | 99 +++++ wasm/test_page/helper.js | 40 -- wasm/test_page/index.html | 33 ++ wasm/test_page/js/index.js | 101 +++++ wasm/test_page/js/modelRegistry.js | 328 ++++++++++++++++ wasm/test_page/js/worker.js | 298 +++++++++++++++ wasm/test_page/package-lock.json | 515 +++++++++++++++++++++++++- wasm/test_page/start_server.sh | 6 +- wasm/test_page/worker.js | 267 ------------- 13 files changed, 1452 insertions(+), 522 deletions(-) delete mode 100644 wasm/test_page/bergamot.html delete mode 100644 wasm/test_page/bergamot.js create mode 100644 wasm/test_page/css/index.css delete mode 100644 wasm/test_page/helper.js create mode 100644 wasm/test_page/index.html create mode 100644 wasm/test_page/js/index.js create mode 100644 wasm/test_page/js/modelRegistry.js create mode 100644 wasm/test_page/js/worker.js delete mode 100644 wasm/test_page/worker.js diff --git 
a/wasm/README.md b/wasm/README.md index 728b0a364..a0b3d7820 100644 --- a/wasm/README.md +++ b/wasm/README.md @@ -1,95 +1,25 @@ # Using Bergamot Translator in JavaScript -Instructions in this document assume current-directory to be -[wasm](https://github.com/browsermt/bergamot-translator/tree/main/wasm) within -bergamot-translator source. - -The example file `bergamot.html` in the folder `test_page` demonstrates how to -use the bergamot translator in JavaScript via a ` - - - diff --git a/wasm/test_page/bergamot.js b/wasm/test_page/bergamot.js deleted file mode 100644 index 848fba177..000000000 --- a/wasm/test_page/bergamot.js +++ /dev/null @@ -1,54 +0,0 @@ -var worker; - -if (window.Worker) { - var worker = new Worker('worker.js'); - worker.postMessage(["load_module"]); -} - -const log = (message) => { - document.querySelector("#log").value += message + "\n"; -} - -document.querySelector("#translate").addEventListener("click", () => { - translateCall(); -}); - -document.querySelector("#from").addEventListener('keyup', function(event) { - if (event.keyCode === 13) { - translateCall(); - } -}); - -document.querySelector("#load").addEventListener("click", async() => { - document.querySelector("#load").disabled = true; - const lang = document.querySelector('input[name="modellang"]:checked').value; - const from = lang.substring(0, 2); - const to = lang.substring(2, 4); - let start = Date.now(); - worker.postMessage(["load_model", from, to]); - document.querySelector("#load").disabled = false; -}); - -const translateCall = () => { - const text = document.querySelector('#from').value; - const paragraphs = text.split("\n"); - document.querySelector("#load").disabled = true; - const lang = document.querySelector('input[name="modellang"]:checked').value; - const from = lang.substring(0, 2); - const to = lang.substring(2, 4); - worker.postMessage(["translate", from, to, paragraphs]); - document.querySelector("#load").disabled = false; -} - -worker.onmessage = function(e) { - 
console.debug(`Message received from worker`); - if (e.data[0] === 'translated_result') { - if (e.data[1]) { - document.querySelector('#to').value = e.data[1].join("\n"); - } - log(e.data[2]); - } - if ((e.data[0] === 'module_loaded') || (e.data[0] === 'model_loaded')) { - log(e.data[1]); - } -} \ No newline at end of file diff --git a/wasm/test_page/css/index.css b/wasm/test_page/css/index.css new file mode 100644 index 000000000..bbc5bf147 --- /dev/null +++ b/wasm/test_page/css/index.css @@ -0,0 +1,99 @@ +* { + box-sizing: border-box; +} + +html, +body { + height: 100%; + margin: 0; + font-size: 18px; + font-family: Optima, Helvetica, Arial; +} + +body { + padding: 1rem; +} + +.app { + padding: 1rem; + display: grid; + grid: "from swap to" 1fr "status status status" auto / 1fr auto 1fr; + grid-gap: 1rem; + overflow: hidden; + min-height: 400px; + max-width: 1024px; + margin: 1em auto; +} + +@media screen and (max-width: 640px) { + .app { + grid: "from from" auto "status swap" auto "to to" auto / 1fr; + } +} + +.panel { + display: grid; + grid-template-rows: auto 1fr; + grid-gap: 1rem; +} + +label { + padding: 0 0.5em; + display: flex; + align-items: center; +} + +.lang-select { + padding: 0.25rem 0.5rem; + margin-left: 1rem; + background: #f4f4f4; + font-size: 0.9rem; + border: 1px solid #ccc; + border-radius: 0.25rem; + cursor: pointer; +} + +.panel--from { + grid-area: from; +} + +.panel--to { + grid-area: to; +} + +.swap { + align-self: center; + grid-area: swap; + font-size: 1.1rem; +} + +#status { + grid-area: status; + text-align: center; + align-self: center; +} + +textarea { + padding: 1rem; + font-family: sans-serif; + font-size: 1rem; + resize: none; + border-radius: 2px; + border: 1px solid #ccc; +} + +button { + cursor: pointer; + border: 1px solid #88c; + border-radius: 4px; + background: #eef; + padding: 0; + padding: 0.25rem 0.5rem; +} +button:hover { + background: #cce; +} + +#output { + background-color: #f4f4f4; +} diff --git 
a/wasm/test_page/helper.js b/wasm/test_page/helper.js deleted file mode 100644 index bff116ced..000000000 --- a/wasm/test_page/helper.js +++ /dev/null @@ -1,40 +0,0 @@ -/* - * @author - Based of a file from Gist here: https://gist.github.com/1757658 - * - * @modified - Mike Newell - it was on Gist so I figure I can use it - * - * @Description - Added support for a few more mime types including the new - * .ogv, .webm, and .mp4 file types for HTML5 video. - * - */ - -/* -* @modified - Andre Natal - removed unused types for the purpose of this use -case -*/ - -Helper = { - - types: { - "wasm" : "application/wasm" - , "js" : "application/javascript" - , "html" : "text/html" - , "htm" : "text/html" - , "ico" : "image/vnd.microsoft.icon", - }, - - getMime: function(u) { - - var ext = this.getExt(u.pathname).replace('.', ''); - - return this.types[ext.toLowerCase()] || 'application/octet-stream'; - - }, - - getExt: function(path) { - var i = path.lastIndexOf('.'); - - return (i < 0) ? '' : path.substr(i); - } - -}; diff --git a/wasm/test_page/index.html b/wasm/test_page/index.html new file mode 100644 index 000000000..86eae4637 --- /dev/null +++ b/wasm/test_page/index.html @@ -0,0 +1,33 @@ + + + + Mozilla Translations + + + + + +
+
+ + +
+ +
+ + +
+ +
+ + + diff --git a/wasm/test_page/js/index.js b/wasm/test_page/js/index.js new file mode 100644 index 000000000..6b580415f --- /dev/null +++ b/wasm/test_page/js/index.js @@ -0,0 +1,101 @@ +let worker; +let modelRegistry; + +const $ = selector => document.querySelector(selector); +const status = message => ($("#status").innerText = message); + +const langFrom = $("#lang-from"); +const langTo = $("#lang-to"); + +const langs = [ + ["en", "English"], + ["it", "Italian"], + ["pt", "Portuguese"], + ["ru", "Russian"], + ["cs", "Czech"], + ["de", "German"], + ["es", "Spanish"], + ["et", "Estonian"], +]; + +if (window.Worker) { + worker = new Worker("js/worker.js"); + worker.postMessage(["import"]); +} + +document.querySelector("#input").addEventListener("keyup", function (event) { + translateCall(); +}); + +const translateCall = () => { + const text = document.querySelector("#input").value + " "; + if (!text.trim().length) return; + const paragraphs = text.split("\n"); + $("#output").setAttribute("disabled", true); + const lngFrom = langFrom.value; + const lngTo = langTo.value; + worker.postMessage(["translate", lngFrom, lngTo, paragraphs]); +}; + +worker.onmessage = function (e) { + if (e.data[0] === "translate_reply" && e.data[1]) { + document.querySelector("#output").value = e.data[1].join("\n\n"); + $("#output").removeAttribute("disabled"); + } else if (e.data[0] === "load_model_reply" && e.data[1]) { + status(e.data[1]); + translateCall(); + } else if (e.data[0] === "import_reply" && e.data[1]) { + modelRegistry = e.data[1]; + init(); + } +}; + +langs.forEach(([code, name]) => { + langFrom.innerHTML += ``; + langTo.innerHTML += ``; +}); + +const loadModel = () => { + const lngFrom = langFrom.value; + const lngTo = langTo.value; + if (lngFrom !== lngTo) { + status(`Installing model...`); + console.log(`Loading model '${lngFrom}${lngTo}'`); + worker.postMessage(["load_model", lngFrom, lngTo]); + } else { + const input = document.querySelector("#input").value; + 
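The page and the worker above talk over a deliberately simple protocol: the main thread posts plain arrays of the form `[command, ...args]` (`"import"`, `"load_model"`, `"translate"`), and the worker answers with `[command + "_reply", payload]`. A minimal sketch of that round trip; the mock worker object and its canned payloads are illustrative stand-ins, since the real worker needs the wasm module:

```javascript
// Mock of the index.js <-> worker.js message protocol. This stub only mirrors
// the message shapes so the dispatch logic can be exercised without wasm.
const makeMockWorker = () => {
  const worker = {
    onmessage: null,
    postMessage(data) {
      const [command, ...args] = data;
      let payload;
      if (command === "load_model") {
        // worker.js replies with a human-readable status string
        payload = "Model successfully loaded";
      } else if (command === "translate") {
        // a real worker would return translated paragraphs; echo them instead
        payload = args[2];
      }
      // Replies follow the `<command>_reply` naming convention from worker.js
      worker.onmessage({ data: [`${command}_reply`, payload] });
    },
  };
  return worker;
};

const mock = makeMockWorker();
mock.onmessage = e => console.log(e.data[0], JSON.stringify(e.data[1]));
mock.postMessage(["load_model", "de", "en"]);
mock.postMessage(["translate", "de", "en", ["Hallo Welt"]]);
```

In the real page the `*_reply` handlers additionally update `#status`, re-enable the output box, and trigger `translateCall()`; the stub keeps only the message envelope.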
document.querySelector("#output").value = input; + } +}; + +langFrom.addEventListener("change", e => { + loadModel(); +}); + +langTo.addEventListener("change", e => { + loadModel(); +}); + +$(".swap").addEventListener("click", e => { + [langFrom.value, langTo.value] = [langTo.value, langFrom.value]; + $("#input").value = $("#output").value; + loadModel(); +}); + +function init() { + // try to guess input language from user agent + let myLang = navigator.language; + if (myLang) { + myLang = myLang.split("-")[0]; + let langIndex = langs.findIndex(([code]) => code === myLang); + if (langIndex > -1) { + console.log("guessing input language is", myLang); + langFrom.value = myLang; + } + } + + // find first output lang that *isn't* input language + langTo.value = langs.find(([code]) => code !== langFrom.value)[0]; + // load this model + loadModel(); +} diff --git a/wasm/test_page/js/modelRegistry.js b/wasm/test_page/js/modelRegistry.js new file mode 100644 index 000000000..c8d6eda5e --- /dev/null +++ b/wasm/test_page/js/modelRegistry.js @@ -0,0 +1,328 @@ + +//const rootURL = "https://storage.googleapis.com/bergamot-models-sandbox/0.2.10"; +const rootURL = "../models"; + +const modelRegistry = { + enit: { + vocab: { + name: "vocab.enit.spm", + size: 814128, + estimatedCompressedSize: 405338, + expectedSha256Hash: + "de8cbeb79e0139304bfa47e8559f2447016bf9906225a97d3df1baed4de8f3a3", + }, + lex: { + name: "lex.50.50.enit.s2t.bin", + size: 4489920, + estimatedCompressedSize: 2409986, + expectedSha256Hash: + "bb1fad3b3f6a13ebce1698cf7f39ca736c4dea4525f3dab5e1a78436f07445e6", + }, + model: { + name: "model.enit.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 13283223, + expectedSha256Hash: + "a5ce3723f62ead92a0e0373b6df0ad8e3e6d22963adb1333984206e33b8b6c61", + }, + }, + enpt: { + vocab: { + name: "vocab.enpt.spm", + size: 812781, + estimatedCompressedSize: 406524, + expectedSha256Hash: + "633a3d782c79f7d5e4b94ab96848f47c2fdf8ba82dd99efd1742b8a696bbd0cc", + }, 
+ lex: { + name: "lex.50.50.enpt.s2t.bin", + size: 4472528, + estimatedCompressedSize: 2411984, + expectedSha256Hash: + "1e96599123d275afa37353dfe84677a4070f013494fbdc9c52a28445cc9bc38d", + }, + model: { + name: "model.enpt.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 13429592, + expectedSha256Hash: + "d968735704c75e33c2e183b9241f14c0b2a560d01d88a2728e5c0119a4d7fb22", + }, + }, + enru: { + vocab: { + name: "vocab.enru.spm", + size: 937157, + estimatedCompressedSize: 435776, + expectedSha256Hash: + "feca2d44f01b946c85faba3b15b5eb53344bec84cd14a1a4d4a82ddd774c5edd", + }, + lex: { + name: "lex.50.50.enru.s2t.bin", + size: 3049096, + estimatedCompressedSize: 1579779, + expectedSha256Hash: + "7bd3e2c0a72286fe1f3da65c56c49a7cd77efa5f1d1a444e2a9e769480b96ff3", + }, + model: { + name: "model.enru.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 12853987, + expectedSha256Hash: + "4a45186a93b8a2dd9301c66a3b3dad580b1bcfa74aadda583ca383f9fe0dea93", + }, + }, + iten: { + vocab: { + name: "vocab.iten.spm", + size: 814151, + estimatedCompressedSize: 405416, + expectedSha256Hash: + "22d5ce6973be5360a921103acbe984a9bfca952a1f6c55c9cb5ef7de4fd58266", + }, + lex: { + name: "lex.50.50.iten.s2t.bin", + size: 5238420, + estimatedCompressedSize: 2860178, + expectedSha256Hash: + "357d362373022b029ee9965975a133e6f36fdb0fed749202ff578365cf0111f8", + }, + model: { + name: "model.iten.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 13423308, + expectedSha256Hash: + "1fae546faeb9046f80b1b7e940b37b660974ce72902778181d6cd1c30b717f35", + }, + }, + pten: { + vocab: { + name: "vocab.pten.spm", + size: 812889, + estimatedCompressedSize: 406730, + expectedSha256Hash: + "8389979e3c965688b07aeb712a7e44406e5dcdb2b84087229d26fcc71448c4ed", + }, + lex: { + name: "lex.50.50.pten.s2t.bin", + size: 5001420, + estimatedCompressedSize: 2733800, + expectedSha256Hash: + "212ed0ae44a6f920cd6d17ca02f0a523ba6c4b0ef5078ae310c20bc4c51484c5", + }, + model: { + 
name: "model.pten.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 13584764, + expectedSha256Hash: + "6c3b7af01772022a19712410c63342ba581468c2f1aac34d7488409c4043e697", + }, + }, + ruen: { + vocab: { + name: "vocab.ruen.spm", + size: 936576, + estimatedCompressedSize: 435801, + expectedSha256Hash: + "aaf9a325c0a988c507d0312cb6ba1a02bac7a370bcd879aedee626a40bfbda78", + }, + lex: { + name: "lex.50.50.ruen.s2t.bin", + size: 5090836, + estimatedCompressedSize: 2684919, + expectedSha256Hash: + "e6667e22f5f86be4872e3768b7184727f5dd8c9f2ccfb0639baabcb1176f5d11", + }, + model: { + name: "model.ruen.intgemm.alphas.bin", + size: 17140836, + estimatedCompressedSize: 13108893, + expectedSha256Hash: + "3b6a0305e3d232fadd54f5a765365b7b96ad6d8f2e818cba594b02fbd8fadb3d", + }, + }, + csen: { + vocab: { + name: "vocab.csen.spm", + size: 769763, + estimatedCompressedSize: 366392, + expectedSha256Hash: + "f71cc5d045e479607078e079884f44032f5a0b82547fb96eefa29cd1eb47c6f3", + }, + lex: { + name: "lex.50.50.csen.s2t.bin", + size: 4535788, + estimatedCompressedSize: 2418488, + expectedSha256Hash: + "8228a3c3f7887759a62b7d7c674a7bef9b70161913f9b0939ab58f71186835c2", + }, + model: { + name: "model.csen.intgemm.alphas.bin", + size: 17140756, + estimatedCompressedSize: 13045032, + expectedSha256Hash: + "5b16661e2864dc50b2f4091a16bdd4ec8d8283e04271e602159ba348df5d6e2d", + }, + }, + deen: { + vocab: { + name: "vocab.deen.spm", + size: 784269, + estimatedCompressedSize: 410738, + expectedSha256Hash: + "417668f2ed297970febafb5b079a9d5ebc4ed0b3550ac8386d67a90473a09bd7", + }, + lex: { + name: "lex.50.50.deen.s2t.bin", + size: 5047568, + estimatedCompressedSize: 2657472, + expectedSha256Hash: + "2f7c0f7bbce97ae5b52454074a892ba7b7610fb98e3c5d341e4ca79f0850c4de", + }, + model: { + name: "model.deen.intgemm.alphas.bin", + size: 17140837, + estimatedCompressedSize: 13091214, + expectedSha256Hash: + "dda44d87ab0d8ad3b3871122fd3ee385f37878183a8b4ec139cd909531ec5009", + }, + }, + encs: { + 
vocab: { + name: "vocab.csen.spm", + size: 769763, + estimatedCompressedSize: 366392, + expectedSha256Hash: + "f71cc5d045e479607078e079884f44032f5a0b82547fb96eefa29cd1eb47c6f3", + }, + lex: { + name: "lex.50.50.encs.s2t.bin", + size: 3556124, + estimatedCompressedSize: 1913246, + expectedSha256Hash: + "e19c77231bf977988e31ff8db15fe79966b5170564bd3e10613f239e7f461d97", + }, + model: { + name: "model.encs.intgemm.alphas.bin", + size: 17140756, + estimatedCompressedSize: 12630325, + expectedSha256Hash: + "9a2fe0588bd972accfc801e2f31c945de0557804a91666ae5ab43b94fb74ac4b", + }, + }, + ende: { + vocab: { + name: "vocab.deen.spm", + size: 797501, + estimatedCompressedSize: 412505, + expectedSha256Hash: + "bc8f8229933d8294c727f3eab12f6f064e7082b929f2d29494c8a1e619ba174c", + }, + lex: { + name: "lex.50.50.ende.s2t.bin", + size: 3062492, + estimatedCompressedSize: 1575385, + expectedSha256Hash: + "764797d075f0642c0b079cce6547348d65fe4e92ac69fa6a8605cd8b53dacb3f", + }, + model: { + name: "model.ende.intgemm.alphas.bin", + size: 17140498, + estimatedCompressedSize: 13207068, + expectedSha256Hash: + "f0946515c6645304f0706fa66a051c3b7b7c507f12d0c850f276c18165a10c14", + }, + }, + enes: { + vocab: { + name: "vocab.esen.spm", + size: 825463, + estimatedCompressedSize: 414566, + expectedSha256Hash: + "909b1eea1face0d7f90a474fe29a8c0fef8d104b6e41e65616f864c964ba8845", + }, + lex: { + name: "lex.50.50.enes.s2t.bin", + size: 3347104, + estimatedCompressedSize: 1720700, + expectedSha256Hash: + "3a113d713dec3cf1d12bba5b138ae616e28bba4bbc7fe7fd39ba145e26b86d7f", + }, + model: { + name: "model.enes.intgemm.alphas.bin", + size: 17140755, + estimatedCompressedSize: 12602853, + expectedSha256Hash: + "fa7460037a3163e03fe1d23602f964bff2331da6ee813637e092ddf37156ef53", + }, + }, + enet: { + vocab: { + name: "vocab.eten.spm", + size: 828426, + estimatedCompressedSize: 416995, + expectedSha256Hash: + "e3b66bc141f6123cd40746e2fb9b8ee4f89cbf324ab27d6bbf3782e52f15fa2d", + }, + lex: { + name: 
"lex.50.50.enet.s2t.bin", + size: 2700780, + estimatedCompressedSize: 1336443, + expectedSha256Hash: + "3d1b40ff43ebef82cf98d416a88a1ea19eb325a85785eef102f59878a63a829d", + }, + model: { + name: "model.enet.intgemm.alphas.bin", + size: 17140754, + estimatedCompressedSize: 12543318, + expectedSha256Hash: + "a28874a8b702a519a14dc71bcee726a5cb4b539eeaada2d06492f751469a1fd6", + }, + }, + esen: { + vocab: { + name: "vocab.esen.spm", + size: 825463, + estimatedCompressedSize: 414566, + expectedSha256Hash: + "909b1eea1face0d7f90a474fe29a8c0fef8d104b6e41e65616f864c964ba8845", + }, + lex: { + name: "lex.50.50.esen.s2t.bin", + size: 3860888, + estimatedCompressedSize: 1978538, + expectedSha256Hash: + "f11a2c23ef85ab1fee1c412b908d69bc20d66fd59faa8f7da5a5f0347eddf969", + }, + model: { + name: "model.esen.intgemm.alphas.bin", + size: 17140755, + estimatedCompressedSize: 13215960, + expectedSha256Hash: + "4b6b7f451094aaa447d012658af158ffc708fc8842dde2f871a58404f5457fe0", + }, + }, + eten: { + vocab: { + name: "vocab.eten.spm", + size: 828426, + estimatedCompressedSize: 416995, + expectedSha256Hash: + "e3b66bc141f6123cd40746e2fb9b8ee4f89cbf324ab27d6bbf3782e52f15fa2d", + }, + lex: { + name: "lex.50.50.eten.s2t.bin", + size: 3974944, + estimatedCompressedSize: 1920655, + expectedSha256Hash: + "6992bedc590e60e610a28129c80746fe5f33144a4520e2c5508d87db14ca54f8", + }, + model: { + name: "model.eten.intgemm.alphas.bin", + size: 17140754, + estimatedCompressedSize: 12222624, + expectedSha256Hash: + "aac98a2371e216ee2d4843cbe896c617f6687501e17225ac83482eba52fd0028", + }, + }, +}; \ No newline at end of file diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js new file mode 100644 index 000000000..1cf3a1461 --- /dev/null +++ b/wasm/test_page/js/worker.js @@ -0,0 +1,298 @@ +// All variables specific to translation service +var translationService, responseOptions, input = undefined; +// A map of language-pair to TranslationModel object +var languagePairToTranslationModels = 
 new Map(); + +const BERGAMOT_TRANSLATOR_MODULE = "bergamot-translator-worker.js"; +const MODEL_REGISTRY = "modelRegistry.js"; + +const encoder = new TextEncoder(); // string to utf-8 converter +const decoder = new TextDecoder(); // utf-8 to string converter + +const start = Date.now(); +let moduleLoadStart; +var Module = { + preRun: [function() { + log(`Time until Module.preRun: ${(Date.now() - start) / 1000} secs`); + moduleLoadStart = Date.now(); + }], + onRuntimeInitialized: function() { + log(`Wasm runtime initialized successfully (preRun -> onRuntimeInitialized) in ${(Date.now() - moduleLoadStart) / 1000} secs`); + importScripts(MODEL_REGISTRY); + postMessage([`import_reply`, modelRegistry]); + } +}; + +const log = (message) => { + console.debug(message); +} + +onmessage = async function(e) { + const command = e.data[0]; + log(`Message '${command}' received from main script`); + let result = ""; + if (command === 'import') { + importScripts(BERGAMOT_TRANSLATOR_MODULE); + } else if (command === 'load_model') { + let start = Date.now(); + let from = e.data[1]; + let to = e.data[2]; + try { + await constructTranslationService(); + await constructTranslationModel(from, to); + log(`Model '${from}${to}' successfully constructed. 
Time taken: ${(Date.now() - start) / 1000} secs`); + result = "Model successfully loaded"; + } catch (error) { + log(`Model '${from}${to}' construction failed: '${error.message}'`); + result = "Model loading failed"; + } + log(`'${command}' command done; posting message back to main script`); + postMessage([`${command}_reply`, result]); + } else if (command === 'translate') { + const from = e.data[1]; + const to = e.data[2]; + const inputParagraphs = e.data[3]; + let inputWordCount = 0; + inputParagraphs.forEach(sentence => { + inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; + }) + let start = Date.now(); + try { + result = translate(from, to, inputParagraphs); + const secs = (Date.now() - start) / 1000; + log(`Translation '${from}${to}' successful. Speed: ${Math.round(inputWordCount / secs)} WPS (${inputWordCount} words in ${secs} secs)`); + } catch (error) { + log(`Error: ${error.message}`); + } + log(`'${command}' command done; posting message back to main script`); + postMessage([`${command}_reply`, result]); + } +} + +// Instantiates the Translation Service +const constructTranslationService = async () => { + if (!translationService) { + var translationServiceConfig = {}; + log(`Creating Translation Service with config: ${translationServiceConfig}`); + translationService = new Module.BlockingService(translationServiceConfig); + log(`Translation Service created successfully`); + } +} + +// Constructs a translation model object for the given source and target language pair +const constructTranslationModel = async (from, to) => { + // Delete all previously constructed translation models and clear the map + languagePairToTranslationModels.forEach((value, key) => { + log(`Destroying model '${key}'`); + value.delete(); + }); + languagePairToTranslationModels.clear(); + + // If neither language is English, construct two models with + // English as a pivot language. 
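The comment above describes the pivoting strategy used throughout this worker: when neither side of the pair is English, two English-anchored models are built, and translation later runs source -> en -> target. The composition can be sketched standalone; `translateOne` is a hypothetical direct per-pair translator used only for illustration, not a Bergamot API:

```javascript
// Pivot-through-English composition, mirroring constructTranslationModel /
// translate in worker.js. `translateOne(from, to, paragraphs)` is a toy
// stand-in for a direct translator of one language pair.
const translateViaPivot = (from, to, paragraphs, translateOne) => {
  if (from !== "en" && to !== "en") {
    // Neither side is English: hop source -> en, then en -> target
    const pivoted = translateOne(from, "en", paragraphs);
    return translateOne("en", to, pivoted);
  }
  return translateOne(from, to, paragraphs);
};

// Toy translator that just records the direction it was asked to translate
const tag = (from, to, paragraphs) => paragraphs.map(p => `${from}>${to}(${p})`);

console.log(translateViaPivot("es", "et", ["hola"], tag)); // two hops via "en"
console.log(translateViaPivot("es", "en", ["hola"], tag)); // direct, one hop
```

The same guard (`from !== 'en' && to !== 'en'`) gates both model construction and translation in the worker, so the two hops always have matching models available.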
+ if (from !== 'en' && to !== 'en') { + log(`Constructing model '${from}${to}' via pivoting: '${from}en' and 'en${to}'`); + await Promise.all([_constructTranslationModelInvolvingEnglish(from, 'en'), + _constructTranslationModelInvolvingEnglish('en', to)]); + } + else { + log(`Constructing model '${from}${to}'`); + await _constructTranslationModelInvolvingEnglish(from, to); + } +} + +// Translates text from the source language to the target language. +const translate = (from, to, paragraphs) => { + // If neither language is English, translate with + // English as a pivot language. + if (from !== 'en' && to !== 'en') { + log(`Translating '${from}${to}' via pivoting: '${from}en' -> 'en${to}'`); + let translatedParagraphsInEnglish = _translateInvolvingEnglish(from, 'en', paragraphs); + return _translateInvolvingEnglish('en', to, translatedParagraphsInEnglish); + } + else { + log(`Translating '${from}${to}'`); + return _translateInvolvingEnglish(from, to, paragraphs); + } +} + +// Downloads a file from a URL and returns its contents as an ArrayBuffer +const _downloadAsArrayBuffer = async (url) => { + const response = await fetch(url); + if (!response.ok) { + throw Error(`Downloading ${url} failed: HTTP ${response.status} - ${response.statusText}`); + } + return response.arrayBuffer(); +} + +// Constructs and initializes an AlignedMemory from an array buffer and an alignment size +const _prepareAlignedMemoryFromBuffer = async (buffer, alignmentSize) => { + var byteArray = new Int8Array(buffer); + log(`Constructing Aligned memory. 
Size: ${byteArray.byteLength} bytes, Alignment: ${alignmentSize}`); + var alignedMemory = new Module.AlignedMemory(byteArray.byteLength, alignmentSize); + log(`Aligned memory construction done`); + const alignedByteArrayView = alignedMemory.getByteArrayView(); + alignedByteArrayView.set(byteArray); + log(`Aligned memory initialized`); + return alignedMemory; +} + +const _constructTranslationModelInvolvingEnglish = async (from, to) => { + const languagePair = `${from}${to}`; + + /* Set the model configuration as a YAML-formatted string. + For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ + Vocab files are re-used in both translation directions: + const vocabLanguagePair = from === "en" ? `${to}${from}` : languagePair; + const modelConfig = `models: + - /${languagePair}/model.${languagePair}.intgemm.alphas.bin + vocabs: + - /${languagePair}/vocab.${vocabLanguagePair}.spm + - /${languagePair}/vocab.${vocabLanguagePair}.spm + beam-size: 1 + normalize: 1.0 + word-penalty: 0 + max-length-break: 128 + mini-batch-words: 1024 + workspace: 128 + max-length-factor: 2.0 + skip-cost: true + cpu-threads: 0 + quiet: true + quiet-translation: true + shortlist: + - /${languagePair}/lex.${languagePair}.s2t + - 50 + - 50 + `; + */ + + // TODO: gemm-precision: int8shiftAlphaAll (for the models that support this) + // DO NOT CHANGE THE SPACES BETWEEN EACH ENTRY OF CONFIG + const modelConfig = `beam-size: 1 +normalize: 1.0 +word-penalty: 0 +max-length-break: 128 +mini-batch-words: 1024 +workspace: 128 +max-length-factor: 2.0 +skip-cost: true +cpu-threads: 0 +quiet: true +quiet-translation: true +gemm-precision: int8shiftAll +`; + + const modelFile = `${rootURL}/${languagePair}/${modelRegistry[languagePair]["model"].name}`; + const shortlistFile = `${rootURL}/${languagePair}/${modelRegistry[languagePair]["lex"].name}`; + const vocabFiles = [`${rootURL}/${languagePair}/${modelRegistry[languagePair]["vocab"].name}`, + 
`${rootURL}/${languagePair}/${modelRegistry[languagePair]["vocab"].name}`]; + + const uniqueVocabFiles = new Set(vocabFiles); + log(`modelFile: ${modelFile}\nshortlistFile: ${shortlistFile}\nNo. of unique vocabs: ${uniqueVocabFiles.size}`); + uniqueVocabFiles.forEach(item => log(`unique vocabFile: ${item}`)); + + // Download the files as buffers from the given urls + let start = Date.now(); + const downloadedBuffers = await Promise.all([_downloadAsArrayBuffer(modelFile), _downloadAsArrayBuffer(shortlistFile)]); + const modelBuffer = downloadedBuffers[0]; + const shortListBuffer = downloadedBuffers[1]; + + const downloadedVocabBuffers = []; + for (let item of uniqueVocabFiles.values()) { + downloadedVocabBuffers.push(await _downloadAsArrayBuffer(item)); + } + log(`Total Download time for all files of '${languagePair}': ${(Date.now() - start) / 1000} secs`); + + // Construct AlignedMemory objects with downloaded buffers + let constructedAlignedMemories = await Promise.all([_prepareAlignedMemoryFromBuffer(modelBuffer, 256), + _prepareAlignedMemoryFromBuffer(shortListBuffer, 64)]); + let alignedModelMemory = constructedAlignedMemories[0]; + let alignedShortlistMemory = constructedAlignedMemories[1]; + let alignedVocabsMemoryList = new Module.AlignedMemoryList; + for(let item of downloadedVocabBuffers) { + let alignedMemory = await _prepareAlignedMemoryFromBuffer(item, 64); + alignedVocabsMemoryList.push_back(alignedMemory); + } + for (let vocabs=0; vocabs < alignedVocabsMemoryList.size(); vocabs++) { + log(`Aligned vocab memory${vocabs+1} size: ${alignedVocabsMemoryList.get(vocabs).size()}`); + } + log(`Aligned model memory size: ${alignedModelMemory.size()}`); + log(`Aligned shortlist memory size: ${alignedShortlistMemory.size()}`); + + log(`Translation Model config: ${modelConfig}`); + var translationModel = new Module.TranslationModel(modelConfig, alignedModelMemory, alignedShortlistMemory, alignedVocabsMemoryList); + 
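`_prepareAlignedMemoryFromBuffer` above wraps each downloaded `ArrayBuffer` in an `Int8Array` and copies it into the wasm-side view returned by `getByteArrayView()`. The copy can be sketched with plain typed arrays; note that rounding the capacity up to a multiple of the alignment is an assumption about what `Module.AlignedMemory` does internally, not something stated in this code:

```javascript
// Stand-in for Module.AlignedMemory: a zero-initialized buffer whose capacity
// is padded to a multiple of `alignment` (assumed), plus the same byte copy
// that worker.js performs via alignedByteArrayView.set(byteArray).
const roundUpToAlignment = (byteLength, alignment) =>
  Math.ceil(byteLength / alignment) * alignment;

const prepareAlignedCopy = (buffer, alignment) => {
  const bytes = new Int8Array(buffer);
  const capacity = roundUpToAlignment(bytes.byteLength, alignment); // assumed padding
  const aligned = new Int8Array(capacity);
  aligned.set(bytes); // copy the downloaded payload into the padded buffer
  return aligned;
};

const payload = Uint8Array.from([1, 2, 3, 4, 5]).buffer; // 5 bytes of "model data"
const copy = prepareAlignedCopy(payload, 64);
console.log(copy.byteLength, copy[0], copy[4]); // 64-byte capacity, data intact
```

The alignments used above (256 for the model, 64 for the shortlist and vocabularies) come straight from the calls in this file.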
languagePairToTranslationModels.set(languagePair, translationModel); +} + +const _translateInvolvingEnglish = (from, to, paragraphs) => { + const languagePair = `${from}${to}`; + if (!languagePairToTranslationModels.has(languagePair)) { + throw Error(`Please load translation model '${languagePair}' before translating`); + } + const translationModel = languagePairToTranslationModels.get(languagePair); + + // Instantiate the arguments of the translate() API, i.e. ResponseOptions and input (a VectorString) + var responseOptions = new Module.ResponseOptions(); + let input = new Module.VectorString; + + // Initialize the input + paragraphs.forEach(paragraph => { + // Skip empty paragraphs; they break the translation + if (paragraph.trim() === "") { + return; + } + input.push_back(paragraph.trim()); + }) + + // Log the input size (just for debugging) + log(`Input size: ${input.size()}`); + + // Translate the input, which is a vector; the result is a vector + let result = translationService.translate(translationModel, input, responseOptions); + + const translatedParagraphs = []; + const translatedSentencesOfParagraphs = []; + const sourceSentencesOfParagraphs = []; + for (let i = 0; i < result.size(); i++) { + translatedParagraphs.push(result.get(i).getTranslatedText()); + translatedSentencesOfParagraphs.push(_getAllTranslatedSentencesOfParagraph(result.get(i))); + sourceSentencesOfParagraphs.push(_getAllSourceSentencesOfParagraph(result.get(i))); + } + + responseOptions.delete(); + input.delete(); + return translatedParagraphs; +} + +// Extracts all the translated sentences from the Response and returns them. 
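The `Response` sentence accessors used below return ranges of bytes into the UTF-8 encoding of the text, not JavaScript string indices, which is why worker.js round-trips through `TextEncoder`/`TextDecoder`. A self-contained sketch of that conversion (the byte offsets in the example are worked out by hand for this sample string):

```javascript
// Extract a substring identified by a byte range over the UTF-8 encoding of
// the text, the same conversion _getSentenceFromByteRange performs.
const utf8Encoder = new TextEncoder(); // string -> UTF-8 bytes
const utf8Decoder = new TextDecoder(); // UTF-8 bytes -> string

const sentenceFromByteRange = (text, byteRange) => {
  const utf8Bytes = utf8Encoder.encode(text);
  return utf8Decoder.decode(utf8Bytes.subarray(byteRange.begin, byteRange.end));
};

const text = "Héllo world. Second sentence.";
// "é" encodes to 2 bytes, so "Héllo world." spans bytes 0..13 even though it
// is only 12 characters; plain string indices would be off by one here.
console.log(sentenceFromByteRange(text, { begin: 0, end: 13 }));
```

Slicing the decoded string instead of the byte array would silently corrupt sentences in any language with multi-byte characters, which is most of the pairs in the model registry.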
+const _getAllTranslatedSentencesOfParagraph = (response) => { + const sentences = []; + const text = response.getTranslatedText(); + for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { + const utf8SentenceByteRange = response.getTranslatedSentence(sentenceIndex); + sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); + } + return sentences; +} + +// Extracts all the source sentences from the Response and returns them. +const _getAllSourceSentencesOfParagraph = (response) => { + const sentences = []; + const text = response.getOriginalText(); + for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { + const utf8SentenceByteRange = response.getSourceSentence(sentenceIndex); + sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); + } + return sentences; +} + +/* + * Returns a substring of text (a string). The substring is represented by + * byteRange (begin and end endices) within the utf-8 encoded version of the text. 
+ */ +const _getSentenceFromByteRange = (text, byteRange) => { + const utf8BytesView = encoder.encode(text); + const utf8SentenceBytes = utf8BytesView.subarray(byteRange.begin, byteRange.end); + return decoder.decode(utf8SentenceBytes); +} diff --git a/wasm/test_page/package-lock.json b/wasm/test_page/package-lock.json index ae4cb9dd6..065c92de8 100644 --- a/wasm/test_page/package-lock.json +++ b/wasm/test_page/package-lock.json @@ -1,6 +1,519 @@ { + "name": "test_page", + "lockfileVersion": 2, "requires": true, - "lockfileVersion": 1, + "packages": { + "": { + "dependencies": { + "cors": "^2.8.5", + "express": "^4.17.1", + "nocache": "^2.1.0" + } + }, + "node_modules/accepts": { + "version": "1.3.7", + "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.7.tgz", + "integrity": "sha512-Il80Qs2WjYlJIBNzNkK6KYqlVMTbZLXgHx2oT0pU/fjRHyEp+PEfEPY0R3WCwAGVOtauxh1hOxNgIf5bv7dQpA==", + "dependencies": { + "mime-types": "~2.1.24", + "negotiator": "0.6.2" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/array-flatten": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/array-flatten/-/array-flatten-1.1.1.tgz", + "integrity": "sha1-ml9pkFGx5wczKPKgCJaLZOopVdI=" + }, + "node_modules/body-parser": { + "version": "1.19.0", + "resolved": "https://registry.npmjs.org/body-parser/-/body-parser-1.19.0.tgz", + "integrity": "sha512-dhEPs72UPbDnAQJ9ZKMNTP6ptJaionhP5cBb541nXPlW60Jepo9RV/a4fX4XWW9CuFNK22krhrj1+rgzifNCsw==", + "dependencies": { + "bytes": "3.1.0", + "content-type": "~1.0.4", + "debug": "2.6.9", + "depd": "~1.1.2", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "on-finished": "~2.3.0", + "qs": "6.7.0", + "raw-body": "2.4.0", + "type-is": "~1.6.17" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/bytes": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/bytes/-/bytes-3.1.0.tgz", + "integrity": "sha512-zauLjrfCG+xvoyaqLoV8bLVXXNGC4JqlxFCutSDWA6fJrTo2ZuvLYTqZ7aHBLZSMOopbzwv8f+wZcVzfVTI2Dg==", + 
"engines": { + "node": ">= 0.8" + } + }, + "node_modules/content-disposition": { + "version": "0.5.3", + "resolved": "https://registry.npmjs.org/content-disposition/-/content-disposition-0.5.3.tgz", + "integrity": "sha512-ExO0774ikEObIAEV9kDo50o+79VCUdEB6n6lzKgGwupcVeRlhrj3qGAfwq8G6uBJjkqLrhT0qEYFcWng8z1z0g==", + "dependencies": { + "safe-buffer": "5.1.2" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/content-type": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/content-type/-/content-type-1.0.4.tgz", + "integrity": "sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/cookie": { + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.4.0.tgz", + "integrity": "sha512-+Hp8fLp57wnUSt0tY0tHEXh4voZRDnoIrZPqlo3DPiI4y9lwg/jqx+1Om94/W6ZaPDOUbnjOt/99w66zk+l1Xg==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/cookie-signature": { + "version": "1.0.6", + "resolved": "https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.6.tgz", + "integrity": "sha1-4wOogrNCzD7oylE6eZmXNNqzriw=" + }, + "node_modules/cors": { + "version": "2.8.5", + "resolved": "https://registry.npmjs.org/cors/-/cors-2.8.5.tgz", + "integrity": "sha512-KIHbLJqu73RGr/hnbrO9uBeixNGuvSQjul/jdFvS/KFSIH1hWVd1ng7zOHx+YrEfInLG7q4n6GHQ9cDtxv/P6g==", + "dependencies": { + "object-assign": "^4", + "vary": "^1" + }, + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/debug": { + "version": "2.6.9", + "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", + "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", + "dependencies": { + "ms": "2.0.0" + } + }, + "node_modules/depd": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/depd/-/depd-1.1.2.tgz", + "integrity": "sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak=", + "engines": { + "node": ">= 0.6" + } + }, 
+ "node_modules/destroy": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/destroy/-/destroy-1.0.4.tgz", + "integrity": "sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA=" + }, + "node_modules/ee-first": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/ee-first/-/ee-first-1.1.1.tgz", + "integrity": "sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0=" + }, + "node_modules/encodeurl": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-1.0.2.tgz", + "integrity": "sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k=", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg=" + }, + "node_modules/etag": { + "version": "1.8.1", + "resolved": "https://registry.npmjs.org/etag/-/etag-1.8.1.tgz", + "integrity": "sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/express": { + "version": "4.17.1", + "resolved": "https://registry.npmjs.org/express/-/express-4.17.1.tgz", + "integrity": "sha512-mHJ9O79RqluphRrcw2X/GTh3k9tVv8YcoyY4Kkh4WDMUYKRZUq0h1o0w2rrrxBqM7VoeUVqgb27xlEMXTnYt4g==", + "dependencies": { + "accepts": "~1.3.7", + "array-flatten": "1.1.1", + "body-parser": "1.19.0", + "content-disposition": "0.5.3", + "content-type": "~1.0.4", + "cookie": "0.4.0", + "cookie-signature": "1.0.6", + "debug": "2.6.9", + "depd": "~1.1.2", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "finalhandler": "~1.1.2", + "fresh": "0.5.2", + "merge-descriptors": "1.0.1", + "methods": "~1.1.2", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "path-to-regexp": "0.1.7", + "proxy-addr": "~2.0.5", + "qs": "6.7.0", + "range-parser": "~1.2.1", + "safe-buffer": "5.1.2", + "send": "0.17.1", + "serve-static": "1.14.1", + "setprototypeof": "1.1.1", + "statuses": "~1.5.0", + "type-is": "~1.6.18", + "utils-merge": "1.0.1", + "vary": "~1.1.2" + }, + 
"engines": { + "node": ">= 0.10.0" + } + }, + "node_modules/finalhandler": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/finalhandler/-/finalhandler-1.1.2.tgz", + "integrity": "sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA==", + "dependencies": { + "debug": "2.6.9", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "on-finished": "~2.3.0", + "parseurl": "~1.3.3", + "statuses": "~1.5.0", + "unpipe": "~1.0.0" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/forwarded": { + "version": "0.1.2", + "resolved": "https://registry.npmjs.org/forwarded/-/forwarded-0.1.2.tgz", + "integrity": "sha1-mMI9qxF1ZXuMBXPozszZGw/xjIQ=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/fresh": { + "version": "0.5.2", + "resolved": "https://registry.npmjs.org/fresh/-/fresh-0.5.2.tgz", + "integrity": "sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/http-errors": { + "version": "1.7.2", + "resolved": "https://registry.npmjs.org/http-errors/-/http-errors-1.7.2.tgz", + "integrity": "sha512-uUQBt3H/cSIVfch6i1EuPNy/YsRSOUBXTVfZ+yR7Zjez3qjBz6i9+i4zjNaoqcoFVI4lQJ5plg63TvGfRSDCRg==", + "dependencies": { + "depd": "~1.1.2", + "inherits": "2.0.3", + "setprototypeof": "1.1.1", + "statuses": ">= 1.5.0 < 2", + "toidentifier": "1.0.0" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmjs.org/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "dependencies": { + "safer-buffer": ">= 2.1.2 < 3" + }, + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/inherits": { + "version": "2.0.3", + "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", + "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" + }, + "node_modules/ipaddr.js": { + "version": "1.9.1", + 
"resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-1.9.1.tgz", + "integrity": "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g==", + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/media-typer": { + "version": "0.3.0", + "resolved": "https://registry.npmjs.org/media-typer/-/media-typer-0.3.0.tgz", + "integrity": "sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/merge-descriptors": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-1.0.1.tgz", + "integrity": "sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E=" + }, + "node_modules/methods": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/methods/-/methods-1.1.2.tgz", + "integrity": "sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/mime": { + "version": "1.6.0", + "resolved": "https://registry.npmjs.org/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==", + "bin": { + "mime": "cli.js" + }, + "engines": { + "node": ">=4" + } + }, + "node_modules/mime-db": { + "version": "1.45.0", + "resolved": "https://registry.npmjs.org/mime-db/-/mime-db-1.45.0.tgz", + "integrity": "sha512-CkqLUxUk15hofLoLyljJSrukZi8mAtgd+yE5uO4tqRZsdsAJKv0O+rFMhVDRJgozy+yG6md5KwuXhD4ocIoP+w==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/mime-types": { + "version": "2.1.28", + "resolved": "https://registry.npmjs.org/mime-types/-/mime-types-2.1.28.tgz", + "integrity": "sha512-0TO2yJ5YHYr7M2zzT7gDU1tbwHxEUWBCLt0lscSNpcdAfFyJOVEpRYNS7EXVcTLNj/25QO8gulHC5JtTzSE2UQ==", + "dependencies": { + "mime-db": "1.45.0" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/ms": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", + "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" + }, + "node_modules/negotiator": { 
+ "version": "0.6.2", + "resolved": "https://registry.npmjs.org/negotiator/-/negotiator-0.6.2.tgz", + "integrity": "sha512-hZXc7K2e+PgeI1eDBe/10Ard4ekbfrrqG8Ep+8Jmf4JID2bNg7NvCPOZN+kfF574pFQI7mum2AUqDidoKqcTOw==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/nocache": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/nocache/-/nocache-2.1.0.tgz", + "integrity": "sha512-0L9FvHG3nfnnmaEQPjT9xhfN4ISk0A8/2j4M37Np4mcDesJjHgEUfgPhdCyZuFI954tjokaIj/A3NdpFNdEh4Q==", + "engines": { + "node": ">=4.0.0" + } + }, + "node_modules/object-assign": { + "version": "4.1.1", + "resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz", + "integrity": "sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM=", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/on-finished": { + "version": "2.3.0", + "resolved": "https://registry.npmjs.org/on-finished/-/on-finished-2.3.0.tgz", + "integrity": "sha1-IPEzZIGwg811M3mSoWlxqi2QaUc=", + "dependencies": { + "ee-first": "1.1.1" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/parseurl": { + "version": "1.3.3", + "resolved": "https://registry.npmjs.org/parseurl/-/parseurl-1.3.3.tgz", + "integrity": "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ==", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/path-to-regexp": { + "version": "0.1.7", + "resolved": "https://registry.npmjs.org/path-to-regexp/-/path-to-regexp-0.1.7.tgz", + "integrity": "sha1-32BBeABfUi8V60SQ5yR6G/qmf4w=" + }, + "node_modules/proxy-addr": { + "version": "2.0.6", + "resolved": "https://registry.npmjs.org/proxy-addr/-/proxy-addr-2.0.6.tgz", + "integrity": "sha512-dh/frvCBVmSsDYzw6n926jv974gddhkFPfiN8hPOi30Wax25QZyZEGveluCgliBnqmuM+UJmBErbAUFIoDbjOw==", + "dependencies": { + "forwarded": "~0.1.2", + "ipaddr.js": "1.9.1" + }, + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/qs": { + "version": "6.7.0", + "resolved": "https://registry.npmjs.org/qs/-/qs-6.7.0.tgz", 
+ "integrity": "sha512-VCdBRNFTX1fyE7Nb6FYoURo/SPe62QCaAyzJvUjwRaIsc+NePBEniHlvxFmmX56+HZphIGtV0XeCirBtpDrTyQ==", + "engines": { + "node": ">=0.6" + } + }, + "node_modules/range-parser": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/range-parser/-/range-parser-1.2.1.tgz", + "integrity": "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg==", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/raw-body": { + "version": "2.4.0", + "resolved": "https://registry.npmjs.org/raw-body/-/raw-body-2.4.0.tgz", + "integrity": "sha512-4Oz8DUIwdvoa5qMJelxipzi/iJIi40O5cGV1wNYp5hvZP8ZN0T+jiNkL0QepXs+EsQ9XJ8ipEDoiH70ySUJP3Q==", + "dependencies": { + "bytes": "3.1.0", + "http-errors": "1.7.2", + "iconv-lite": "0.4.24", + "unpipe": "1.0.0" + }, + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/safe-buffer": { + "version": "5.1.2", + "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", + "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" + }, + "node_modules/safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmjs.org/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==" + }, + "node_modules/send": { + "version": "0.17.1", + "resolved": "https://registry.npmjs.org/send/-/send-0.17.1.tgz", + "integrity": "sha512-BsVKsiGcQMFwT8UxypobUKyv7irCNRHk1T0G680vk88yf6LBByGcZJOTJCrTP2xVN6yI+XjPJcNuE3V4fT9sAg==", + "dependencies": { + "debug": "2.6.9", + "depd": "~1.1.2", + "destroy": "~1.0.4", + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "etag": "~1.8.1", + "fresh": "0.5.2", + "http-errors": "~1.7.2", + "mime": "1.6.0", + "ms": "2.1.1", + "on-finished": "~2.3.0", + "range-parser": "~1.2.1", + "statuses": "~1.5.0" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/send/node_modules/ms": { + 
"version": "2.1.1", + "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", + "integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" + }, + "node_modules/serve-static": { + "version": "1.14.1", + "resolved": "https://registry.npmjs.org/serve-static/-/serve-static-1.14.1.tgz", + "integrity": "sha512-JMrvUwE54emCYWlTI+hGrGv5I8dEwmco/00EvkzIIsR7MqrHonbD9pO2MOfFnpFntl7ecpZs+3mW+XbQZu9QCg==", + "dependencies": { + "encodeurl": "~1.0.2", + "escape-html": "~1.0.3", + "parseurl": "~1.3.3", + "send": "0.17.1" + }, + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/setprototypeof": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/setprototypeof/-/setprototypeof-1.1.1.tgz", + "integrity": "sha512-JvdAWfbXeIGaZ9cILp38HntZSFSo3mWg6xGcJJsd+d4aRMOqauag1C63dJfDw7OaMYwEbHMOxEZ1lqVRYP2OAw==" + }, + "node_modules/statuses": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/statuses/-/statuses-1.5.0.tgz", + "integrity": "sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow=", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/toidentifier": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/toidentifier/-/toidentifier-1.0.0.tgz", + "integrity": "sha512-yaOH/Pk/VEhBWWTlhI+qXxDFXlejDGcQipMlyxda9nthulaxLZUNcUqFxokp0vcYnvteJln5FNQDRrxj3YcbVw==", + "engines": { + "node": ">=0.6" + } + }, + "node_modules/type-is": { + "version": "1.6.18", + "resolved": "https://registry.npmjs.org/type-is/-/type-is-1.6.18.tgz", + "integrity": "sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g==", + "dependencies": { + "media-typer": "0.3.0", + "mime-types": "~2.1.24" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/unpipe": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/unpipe/-/unpipe-1.0.0.tgz", + "integrity": "sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw=", + "engines": { + "node": ">= 0.8" + } + }, + "node_modules/utils-merge": { + 
"version": "1.0.1", + "resolved": "https://registry.npmjs.org/utils-merge/-/utils-merge-1.0.1.tgz", + "integrity": "sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM=", + "engines": { + "node": ">= 0.4.0" + } + }, + "node_modules/vary": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/vary/-/vary-1.1.2.tgz", + "integrity": "sha1-IpnwLG3tMNSllhsLn3RSShj2NPw=", + "engines": { + "node": ">= 0.8" + } + } + }, "dependencies": { "accepts": { "version": "1.3.7", diff --git a/wasm/test_page/start_server.sh b/wasm/test_page/start_server.sh index 911364665..8cb90071c 100644 --- a/wasm/test_page/start_server.sh +++ b/wasm/test_page/start_server.sh @@ -19,13 +19,13 @@ if [ ! -e "$1" ]; then exit fi -WASM_ARTIFACTS="$1/bergamot-translator-worker.*" +WASM_ARTIFACTS="$1/bergamot-translator-worker.js $1/bergamot-translator-worker.wasm" for i in $WASM_ARTIFACTS; do [ -f "$i" ] || breaks - cp $i . + cp $i js/. echo "Copied \"$i\"" done npm install echo "Start httpserver" -node bergamot-httpserver.js \ No newline at end of file +node bergamot-httpserver.js 80 1 0 \ No newline at end of file diff --git a/wasm/test_page/worker.js b/wasm/test_page/worker.js deleted file mode 100644 index 8b53a271a..000000000 --- a/wasm/test_page/worker.js +++ /dev/null @@ -1,267 +0,0 @@ -var translationService, responseOptions, input = undefined; -// A map of language-pair to TranslationModel object -var translationModels = new Map(); -const BERGAMOT_TRANSLATOR_MODULE = "bergamot-translator-worker.js"; - -const encoder = new TextEncoder(); // string to utf-8 converter -const decoder = new TextDecoder(); // utf-8 to string converter - -const start = Date.now(); -let moduleLoadStart; -var Module = { - preRun: [function() { - log(`Time until Module.preRun: ${(Date.now() - start) / 1000} secs`); - moduleLoadStart = Date.now(); - }], - onRuntimeInitialized: function() { - log(`Wasm Runtime initialized (preRun -> onRuntimeInitialized) in ${(Date.now() - moduleLoadStart) / 1000} secs`); - } -}; - -const log = 
(message) => { - console.debug(message); -} - -onmessage = async function(e) { - let command = e.data[0]; - log(`Message '${command}' received from main script`); - let result = ""; - if (command === 'load_module') { - importScripts(BERGAMOT_TRANSLATOR_MODULE); - result = `Translator wasm module successfully loaded`; - log(result); - log('Posting message back to main script'); - postMessage(['module_loaded', result]); - } - else if (command === 'load_model') { - let start = Date.now(); - try { - await constructTranslationService(); - await constructTranslationModel(e.data[1], e.data[2]); - result = `translation model '${e.data[1]}${e.data[2]}' successfully loaded; took ${(Date.now() - start) / 1000} secs`; - } catch (error) { - result = `translation model '${e.data[1]}${e.data[2]}' loading failed: '${error.message}'`; - } - log(result); - log('Posting message back to main script'); - postMessage(['model_loaded', result]); - } - else if (command === 'translate') { - const from = e.data[1]; - const to = e.data[2]; - const inputParagraphs = e.data[3]; - let inputWordCount = 0; - inputParagraphs.forEach(sentence => { - inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; - }) - - let start = Date.now(); - var translatedParagraphs; - try { - translatedParagraphs = translate(from, to, inputParagraphs); - const secs = (Date.now() - start) / 1000; - result = `Translation '${from}${to}' Successful. 
Speed: ${Math.round(inputWordCount / secs)} Words per second (${inputWordCount} words in ${secs} secs)`; - } catch (error) { - result = `Error: ${error.message}`; - } - log(result); - log('Posting message back to main script'); - postMessage(['translated_result', translatedParagraphs, result]); - } -} - -// This function downloads file from a url and returns the array buffer -const downloadAsArrayBuffer = async(url) => { - const response = await fetch(url); - if (!response.ok) { - throw Error(`Downloading ${url} failed: HTTP ${response.status} - ${response.statusText}`); - } - return response.arrayBuffer(); -} - -// This function constructs and initializes the AlignedMemory from the array buffer and alignment size -const prepareAlignedMemoryFromBuffer = async (buffer, alignmentSize) => { - var byteArray = new Int8Array(buffer); - log(`Constructing Aligned memory with size: ${byteArray.byteLength} bytes with alignment: ${alignmentSize}`); - var alignedMemory = new Module.AlignedMemory(byteArray.byteLength, alignmentSize); - log(`Aligned memory construction done`); - const alignedByteArrayView = alignedMemory.getByteArrayView(); - alignedByteArrayView.set(byteArray); - log(`Aligned memory initialized`); - return alignedMemory; -} - -// Instantiate the Translation Service -const constructTranslationService = async () => { - if (!translationService) { - var translationServiceConfig = {}; - log(`Creating Translation Service with config: ${translationServiceConfig}`); - translationService = new Module.BlockingService(translationServiceConfig); - log(`Translation Service created successfully`); - } -} - -const constructTranslationModel = async (from, to) => { - const languagePair = `${from}${to}`; - if (translationModels.has(languagePair)) { - var oldModel = translationModels.get(languagePair); - // Destruct the old TranslationModel explicitly and Remove its entry from the map - oldModel.delete(); - translationModels.delete(languagePair); - } - - // Vocab files are 
re-used in both translation directions - const vocabLanguagePair = from === "en" ? `${to}${from}` : languagePair; - - // Set the Model Configuration as YAML formatted string. - // For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ - /*const modelConfig = `models: - - /${languagePair}/model.${languagePair}.intgemm.alphas.bin - vocabs: - - /${languagePair}/vocab.${vocabLanguagePair}.spm - - /${languagePair}/vocab.${vocabLanguagePair}.spm - beam-size: 1 - normalize: 1.0 - word-penalty: 0 - max-length-break: 128 - mini-batch-words: 1024 - workspace: 128 - max-length-factor: 2.0 - skip-cost: true - cpu-threads: 0 - quiet: true - quiet-translation: true - shortlist: - - /${languagePair}/lex.${languagePair}.s2t - - 50 - - 50 - `; - */ - - // TODO: gemm-precision: int8shiftAlphaAll (for the models that support this) - // DONOT CHANGE THE SPACES BETWEEN EACH ENTRY OF CONFIG - const modelConfig = `beam-size: 1 -normalize: 1.0 -word-penalty: 0 -max-length-break: 128 -mini-batch-words: 1024 -workspace: 128 -max-length-factor: 2.0 -skip-cost: true -cpu-threads: 0 -quiet: true -quiet-translation: true -gemm-precision: int8shift -`; - - const modelFile = `models/${languagePair}/model.${languagePair}.intgemm.alphas.bin`; - const shortlistFile = `models/${languagePair}/lex.50.50.${languagePair}.s2t.bin`; - const vocabFiles = [`models/${languagePair}/vocab.${vocabLanguagePair}.spm`, - `models/${languagePair}/vocab.${vocabLanguagePair}.spm`]; - - const uniqueVocabFiles = new Set(vocabFiles); - log(`modelFile: ${modelFile}\nshortlistFile: ${shortlistFile}\nNo. 
of unique vocabs: ${uniqueVocabFiles.size}`); - uniqueVocabFiles.forEach(item => log(`unique vocabFile: ${item}`)); - - // Download the files as buffers from the given urls - let start = Date.now(); - const downloadedBuffers = await Promise.all([downloadAsArrayBuffer(modelFile), downloadAsArrayBuffer(shortlistFile)]); - const modelBuffer = downloadedBuffers[0]; - const shortListBuffer = downloadedBuffers[1]; - - const downloadedVocabBuffers = []; - for (let item of uniqueVocabFiles.values()) { - downloadedVocabBuffers.push(await downloadAsArrayBuffer(item)); - } - log(`All files for ${languagePair} language pair took ${(Date.now() - start) / 1000} secs to download`); - - // Construct AlignedMemory objects with downloaded buffers - let constructedAlignedMemories = await Promise.all([prepareAlignedMemoryFromBuffer(modelBuffer, 256), - prepareAlignedMemoryFromBuffer(shortListBuffer, 64)]); - let alignedModelMemory = constructedAlignedMemories[0]; - let alignedShortlistMemory = constructedAlignedMemories[1]; - let alignedVocabsMemoryList = new Module.AlignedMemoryList; - for(let item of downloadedVocabBuffers) { - let alignedMemory = await prepareAlignedMemoryFromBuffer(item, 64); - alignedVocabsMemoryList.push_back(alignedMemory); - } - log(`Aligned vocab memories: ${alignedVocabsMemoryList.get(0).size()}`); - log(`Aligned model memory: ${alignedModelMemory.size()}`); - log(`Aligned shortlist memory: ${alignedShortlistMemory.size()}`); - - log(`Creating Translation Model with config: ${modelConfig}`); - var translationModel = new Module.TranslationModel(modelConfig, alignedModelMemory, alignedShortlistMemory, alignedVocabsMemoryList); - translationModels.set(languagePair, translationModel); -} - -const translate = (from, to, paragraphs) => { - const languagePair = `${from}${to}`; - if (!translationModels.has(languagePair)) { - throw Error(`Please load translation model '${languagePair}' before translating`); - } - translationModel = 
translationModels.get(languagePair); - - // Instantiate the arguments of translate() API i.e. ResponseOptions and input (vector) - var responseOptions = new Module.ResponseOptions(); - let input = new Module.VectorString; - - // Initialize the input - paragraphs.forEach(paragraph => { - // prevent empty paragraph - it breaks the translation - if (paragraph.trim() === "") { - return; - } - input.push_back(paragraph.trim()) - }) - // Access input (just for debugging) - log(`Input size: ${input.size()}`); - - // Translate the input, which is a vector; the result is a vector - let result = translationService.translate(translationModel, input, responseOptions); - - const translatedParagraphs = []; - const translatedSentencesOfParagraphs = []; - const sourceSentencesOfParagraphs = []; - for (let i = 0; i < result.size(); i++) { - translatedParagraphs.push(result.get(i).getTranslatedText()); - translatedSentencesOfParagraphs.push(getAllTranslatedSentencesOfParagraph(result.get(i))); - sourceSentencesOfParagraphs.push(getAllSourceSentencesOfParagraph(result.get(i))); - } - log({ translatedParagraphs }); - log({ translatedSentencesOfParagraphs }); - log({ sourceSentencesOfParagraphs }); - - responseOptions.delete(); - input.delete(); - return translatedParagraphs; -} - -// This function extracts all the translated sentences from the Response and returns them. -const getAllTranslatedSentencesOfParagraph = (response) => { - const sentences = []; - const text = response.getTranslatedText(); - for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { - const utf8SentenceByteRange = response.getTranslatedSentence(sentenceIndex); - sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); - } - return sentences; -} - -// This function extracts all the source sentences from the Response and returns them. 
-const getAllSourceSentencesOfParagraph = (response) => { - const sentences = []; - const text = response.getOriginalText(); - for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { - const utf8SentenceByteRange = response.getSourceSentence(sentenceIndex); - sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); - } - return sentences; -} - -// This function returns a substring of text (a string). The substring is represented by -// byteRange (begin and end endices) within the utf-8 encoded version of the text. -const _getSentenceFromByteRange = (text, byteRange) => { - const utf8BytesView = encoder.encode(text); - const utf8SentenceBytes = utf8BytesView.subarray(byteRange.begin, byteRange.end); - return decoder.decode(utf8SentenceBytes); -} From c5167b3d8cda016f305d192f438a54e841cbc46c Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Wed, 27 Oct 2021 11:54:39 +0200 Subject: [PATCH 293/442] Import matrix-multiply from a separate wasm module (#232) * Updated marian-dev submodule * Import wasm gemm from a separate wasm module - The fallback implementation of gemm is currently being imported dynamically for wasm target * Updated CI scripts and README to import GEMM from a separate wasm module * Setting model config to int8shiftAlphaAll in wasm test page --- .github/workflows/wasm-custom_marian-mac.yml | 4 ++ .../workflows/wasm-custom_marian-ubuntu.yml | 4 ++ 3rd_party/marian-dev | 2 +- README.md | 10 +++++ build-wasm.sh | 3 ++ wasm/CMakeLists.txt | 6 +++ wasm/patch-artifacts-import-gemm-module.sh | 44 +++++++++++++++++++ wasm/test_page/js/worker.js | 2 +- 8 files changed, 73 insertions(+), 2 deletions(-) create mode 100644 wasm/patch-artifacts-import-gemm-module.sh diff --git a/.github/workflows/wasm-custom_marian-mac.yml b/.github/workflows/wasm-custom_marian-mac.yml index 746fb9cdd..636323581 100644 --- a/.github/workflows/wasm-custom_marian-mac.yml +++ 
b/.github/workflows/wasm-custom_marian-mac.yml @@ -39,6 +39,10 @@ jobs: working-directory: build-wasm run: bash ../wasm/patch-artifacts-enable-wormhole.sh + - name: Import GEMM library from a separate wasm module + working-directory: build-wasm + run: bash ../wasm/patch-artifacts-import-gemm-module.sh + - name: Check artifacts working-directory: build-wasm run: | diff --git a/.github/workflows/wasm-custom_marian-ubuntu.yml b/.github/workflows/wasm-custom_marian-ubuntu.yml index dcea92850..b644d9763 100644 --- a/.github/workflows/wasm-custom_marian-ubuntu.yml +++ b/.github/workflows/wasm-custom_marian-ubuntu.yml @@ -39,6 +39,10 @@ jobs: working-directory: build-wasm run: bash ../wasm/patch-artifacts-enable-wormhole.sh + - name: Import GEMM library from a separate wasm module + working-directory: build-wasm + run: bash ../wasm/patch-artifacts-import-gemm-module.sh + - name: Check artifacts working-directory: build-wasm run: | diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 62bac858b..a1a82ff64 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 62bac858bfd37060beb707d12eb9711649ea4cf6 +Subproject commit a1a82ff64910dc066d64d631cd7a8212df9f88cd diff --git a/README.md b/README.md index 3ba11f026..156d12875 100644 --- a/README.md +++ b/README.md @@ -46,6 +46,11 @@ To build a version that translates with higher speeds on Firefox Nightly browser bash ../wasm/patch-artifacts-enable-wormhole.sh ``` + 3. Patch generated artifacts to import GEMM library from a separate wasm module + ```bash + bash ../wasm/patch-artifacts-import-gemm-module.sh + ``` + To build a version that runs on all browsers (including Firefox Nightly) but translates slowly, follow these instructions: 1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile @@ -56,6 +61,11 @@ To build a version that runs on all browsers (including Firefox Nightly) but tra emmake make -j2 ``` + 2. 
Patch generated artifacts to import GEMM library from a separate wasm module + ```bash + bash ../wasm/patch-artifacts-import-gemm-module.sh + ``` + #### Recompiling As long as you don't update any submodule, just follow [Compile](#Compile) steps.\ If you update a submodule, execute following command in repository root folder before executing diff --git a/build-wasm.sh b/build-wasm.sh index 7da2685cf..adc6556c3 100755 --- a/build-wasm.sh +++ b/build-wasm.sh @@ -78,5 +78,8 @@ if [ "$WORMHOLE" = true ]; then bash ../wasm/patch-artifacts-enable-wormhole.sh fi +# 3. Import GEMM library from a separate wasm module +bash ../wasm/patch-artifacts-import-gemm-module.sh + # The artifacts (.js and .wasm files) will be available in the build directory exit 0 diff --git a/wasm/CMakeLists.txt b/wasm/CMakeLists.txt index 1580defa1..92c9e1698 100644 --- a/wasm/CMakeLists.txt +++ b/wasm/CMakeLists.txt @@ -26,6 +26,12 @@ set(LINKER_FLAGS "${LINKER_FLAGS} -s ENVIRONMENT=web,worker") # Append version information in the Javascript artifact set(LINKER_FLAGS "${LINKER_FLAGS} --extern-pre-js ${CMAKE_CURRENT_BINARY_DIR}/project_version.js") +# Allow importing undefined symbols dynamically +set(LINKER_FLAGS "${LINKER_FLAGS} -s ERROR_ON_UNDEFINED_SYMBOLS=0 -s DECLARE_ASM_MODULE_EXPORTS=0") + +# Export all the functions of fallback implementation of GEMM for wasm target +set(LINKER_FLAGS "${LINKER_FLAGS} -s EXPORTED_FUNCTIONS=[_int8PrepareAFallback,_int8PrepareBFallback,_int8PrepareBFromTransposedFallback,_int8PrepareBFromQuantizedTransposedFallback,_int8PrepareBiasFallback,_int8MultiplyAndAddBiasFallback,_int8SelectColumnsOfBFallback]") + set_target_properties(bergamot-translator-worker PROPERTIES SUFFIX ".js" LINK_FLAGS ${LINKER_FLAGS} diff --git a/wasm/patch-artifacts-import-gemm-module.sh b/wasm/patch-artifacts-import-gemm-module.sh new file mode 100644 index 000000000..2f2e29afd --- /dev/null +++ b/wasm/patch-artifacts-import-gemm-module.sh @@ -0,0 +1,44 @@ +#!/bin/bash +usage="Patch wasm 
artifacts to import fallback implementation of gemm for wasm. + +Usage: $(basename "$0") [WASM_ARTIFACTS_FOLDER] + + where: + WASM_ARTIFACTS_FOLDER Folder containing wasm artifacts + (An optional argument, if unspecified the default is: current folder)" + +if [ "$#" -gt 1 ]; then + echo "Illegal number of parameters passed" + echo "$usage" + exit +fi + +# Parse wasm artifacts folder if provided via script argument or set it to default +WASM_ARTIFACTS_FOLDER=$PWD +if [ "$#" -eq 1 ]; then + if [ ! -e "$1" ]; then + echo "Error: Folder \""$1"\" doesn't exist" + exit + fi + WASM_ARTIFACTS_FOLDER="$1" +fi + +WASM_ARTIFACTS_JAVASCRIPT_FILE="bergamot-translator-worker.js" +WASM_ARTIFACTS="$WASM_ARTIFACTS_FOLDER/${WASM_ARTIFACTS_JAVASCRIPT_FILE}" +if [ ! -e "$WASM_ARTIFACTS" ]; then + echo "Error: Artifact \"$WASM_ARTIFACTS\" doesn't exist" + exit +fi + +echo "Polyfill the fallback integer (8-bit) gemm implementation from the main module" +sed -i.bak 's/"env"[[:space:]]*:[[:space:]]*asmLibraryArg,/"env": asmLibraryArg,\ + "wasm_gemm":{\ + "int8_prepare_a": (...a) => Module["asm"].int8PrepareAFallback(...a),\ + "int8_prepare_b": (...a) => Module["asm"].int8PrepareBFallback(...a),\ + "int8_prepare_b_from_transposed": (...a) => Module["asm"].int8PrepareBFromTransposedFallback(...a),\ + "int8_prepare_b_from_quantized_transposed": (...a) => Module["asm"].int8PrepareBFromQuantizedTransposedFallback(...a),\ + "int8_prepare_bias": (...a) => Module["asm"].int8PrepareBiasFallback(...a),\ + "int8_multiply_and_add_bias": (...a) => Module["asm"].int8MultiplyAndAddBiasFallback(...a),\ + "int8_select_columns_of_b": (...a) => Module["asm"].int8SelectColumnsOfBFallback(...a),\ + },/g' ${WASM_ARTIFACTS_JAVASCRIPT_FILE} +echo "SUCCESS" \ No newline at end of file diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index 1cf3a1461..189658903 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -180,7 +180,7 @@ skip-cost: true cpu-threads: 0 quiet: 
true quiet-translation: true -gemm-precision: int8shiftAll +gemm-precision: int8shiftAlphaAll `; const modelFile = `${rootURL}/${languagePair}/${modelRegistry[languagePair]["model"].name}`; From d0d08c0f54b12868717c510d4118c52d4687bfa0 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Wed, 27 Oct 2021 19:26:55 +0200 Subject: [PATCH 294/442] JS bindings for Quality Estimation (#239) * Quality Score bindings complete * Updated wasm test page to test the bindings - Word and sentence scores can be seen in browser console --- wasm/bindings/response_bindings.cpp | 19 ++- wasm/bindings/response_options_bindings.cpp | 9 +- wasm/test_page/js/worker.js | 159 ++++++++++++++------ 3 files changed, 137 insertions(+), 50 deletions(-) diff --git a/wasm/bindings/response_bindings.cpp b/wasm/bindings/response_bindings.cpp index ca688249c..11bc4cabb 100644 --- a/wasm/bindings/response_bindings.cpp +++ b/wasm/bindings/response_bindings.cpp @@ -9,25 +9,36 @@ #include "response.h" -typedef marian::bergamot::Response Response; +using Response = marian::bergamot::Response; +using SentenceQualityScore = marian::bergamot::Response::SentenceQualityScore; +using ByteRange = marian::bergamot::ByteRange; using namespace emscripten; // Binding code EMSCRIPTEN_BINDINGS(byte_range) { - value_object("ByteRange") - .field("begin", &marian::bergamot::ByteRange::begin) - .field("end", &marian::bergamot::ByteRange::end); + value_object("ByteRange").field("begin", &ByteRange::begin).field("end", &ByteRange::end); } +std::vector getQualityScores(const Response& response) { return response.qualityScores; } + EMSCRIPTEN_BINDINGS(response) { class_("Response") .constructor<>() .function("size", &Response::size) + .function("getQualityScores", &getQualityScores) .function("getOriginalText", &Response::getOriginalText) .function("getTranslatedText", &Response::getTranslatedText) .function("getSourceSentence", &Response::getSourceSentenceAsByteRange) 
.function("getTranslatedSentence", &Response::getTargetSentenceAsByteRange); + value_object<SentenceQualityScore>("SentenceQualityScore") + .field("wordScores", &SentenceQualityScore::wordScores) + .field("wordByteRanges", &SentenceQualityScore::wordByteRanges) + .field("sentenceScore", &SentenceQualityScore::sentenceScore); + register_vector<Response>("VectorResponse"); + register_vector<SentenceQualityScore>("VectorSentenceQualityScore"); + register_vector<float>("VectorFloat"); + register_vector<ByteRange>("VectorByteRange"); } diff --git a/wasm/bindings/response_options_bindings.cpp b/wasm/bindings/response_options_bindings.cpp index e2bf8e1f5..4addbcbfc 100644 --- a/wasm/bindings/response_options_bindings.cpp +++ b/wasm/bindings/response_options_bindings.cpp @@ -7,9 +7,14 @@ #include "response_options.h" -typedef marian::bergamot::ResponseOptions ResponseOptions; +using ResponseOptions = marian::bergamot::ResponseOptions; using namespace emscripten; // Binding code -EMSCRIPTEN_BINDINGS(response_options) { class_<ResponseOptions>("ResponseOptions").constructor<>(); } +EMSCRIPTEN_BINDINGS(response_options) { + value_object<ResponseOptions>("ResponseOptions") + .field("qualityScores", &ResponseOptions::qualityScores) + .field("alignment", &ResponseOptions::alignment) + .field("alignmentThreshold", &ResponseOptions::alignmentThreshold); +} diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index 189658903..f6dc83623 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -1,5 +1,5 @@ // All variables specific to translation service -var translationService, responseOptions, input = undefined; +var translationService = undefined; // A map of language-pair to TranslationModel object var languagePairToTranslationModels = new Map(); @@ -51,14 +51,14 @@ onmessage = async function(e) { } else if (command === 'translate') { const from = e.data[1]; const to = e.data[2]; - const inputParagraphs = e.data[3]; + const input = e.data[3]; let inputWordCount = 0; - inputParagraphs.forEach(sentence => { + input.forEach(sentence => {
inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; }) let start = Date.now(); try { - result = translate(from, to, inputParagraphs); + result = translate(from, to, input); const secs = (Date.now() - start) / 1000; log(`Translation '${from}${to}' Successful. Speed: ${Math.round(inputWordCount / secs)} WPS (${inputWordCount} words in ${secs} secs)`); } catch (error) { @@ -102,17 +102,17 @@ const constructTranslationModel = async (from, to) => { } // Translates text from source language to target language. -const translate = (from, to, paragraphs) => { +const translate = (from, to, input) => { // If none of the languages is English then perform translation with // English as a pivot language. if (from !== 'en' && to !== 'en') { log(`Translating '${from}${to}' via pivoting: '${from}en' -> 'en${to}'`); - let translatedParagraphsInEnglish = _translateInvolvingEnglish(from, 'en', paragraphs); - return _translateInvolvingEnglish('en', to, translatedParagraphsInEnglish); + const translatedTextInEnglish = _translateInvolvingEnglish(from, 'en', input); + return _translateInvolvingEnglish('en', to, translatedTextInEnglish); } else { log(`Translating '${from}${to}'`); - return _translateInvolvingEnglish(from, to, paragraphs); + return _translateInvolvingEnglish(from, to, input); } } @@ -225,64 +225,135 @@ gemm-precision: int8shiftAlphaAll languagePairToTranslationModels.set(languagePair, translationModel); } -const _translateInvolvingEnglish = (from, to, paragraphs) => { +const _translateInvolvingEnglish = (from, to, input) => { const languagePair = `${from}${to}`; if (!languagePairToTranslationModels.has(languagePair)) { throw Error(`Please load translation model '${languagePair}' before translating`); } translationModel = languagePairToTranslationModels.get(languagePair); - // Instantiate the arguments of translate() API i.e. 
ResponseOptions and input (vector) - var responseOptions = new Module.ResponseOptions(); - let input = new Module.VectorString; + // Prepare the arguments of translate() API i.e. ResponseOptions and vectorSourceText (i.e. a vector) + const responseOptions = _prepareResponseOptions(); + let vectorSourceText = _prepareSourceText(input); - // Initialize the input - paragraphs.forEach(paragraph => { - // prevent empty paragraph - it breaks the translation - if (paragraph.trim() === "") { - return; - } - input.push_back(paragraph.trim()) - }) + // Call translate() API; result is vector where every item of vector corresponds + // to an item of vectorSourceText in the same order + const vectorResponse = translationService.translate(translationModel, vectorSourceText, responseOptions); + + // Parse all relevant information from vectorResponse + const listTranslatedText = _parseTranslatedText(vectorResponse); + const listTranslatedTextSentences = _parseTranslatedTextSentences(vectorResponse); + const listSourceTextSentences = _parseSourceTextSentences(vectorResponse); + const listTranslatedTextSentenceQualityScores = _parseTranslatedTextSentenceQualityScores(vectorResponse); - // Access input (just for debugging) - log(`Input size: ${input.size()}`); + log(`Translated text: ${listTranslatedText}`); + log(`Translated sentences: ${JSON.stringify(listTranslatedTextSentences)}`); + log(`Source sentences: ${JSON.stringify(listSourceTextSentences)}`); + log(`Translated sentence quality scores: ${JSON.stringify(listTranslatedTextSentenceQualityScores)}`); - // Translate the input, which is a vector; the result is a vector - let result = translationService.translate(translationModel, input, responseOptions); + // Delete prepared SourceText to avoid memory leak + vectorSourceText.delete(); + + return listTranslatedText; +} - const translatedParagraphs = []; - const translatedSentencesOfParagraphs = []; - const sourceSentencesOfParagraphs = []; - for (let i = 0; i < result.size(); 
i++) { - translatedParagraphs.push(result.get(i).getTranslatedText()); - translatedSentencesOfParagraphs.push(_getAllTranslatedSentencesOfParagraph(result.get(i))); - sourceSentencesOfParagraphs.push(_getAllSourceSentencesOfParagraph(result.get(i))); +const _parseTranslatedText = (vectorResponse) => { + const result = []; + for (let i = 0; i < vectorResponse.size(); i++) { + const response = vectorResponse.get(i); + result.push(response.getTranslatedText()); } + return result; +} + +const _parseTranslatedTextSentences = (vectorResponse) => { + const result = []; + for (let i = 0; i < vectorResponse.size(); i++) { + const response = vectorResponse.get(i); + result.push(_getTranslatedSentences(response)); + } + return result; +} + +const _parseSourceTextSentences = (vectorResponse) => { + const result = []; + for (let i = 0; i < vectorResponse.size(); i++) { + const response = vectorResponse.get(i); + result.push(_getSourceSentences(response)); + } + return result; +} - responseOptions.delete(); - input.delete(); - return translatedParagraphs; +const _parseTranslatedTextSentenceQualityScores = (vectorResponse) => { + const result = []; + for (let i = 0; i < vectorResponse.size(); i++) { + const response = vectorResponse.get(i); + const translatedText = response.getTranslatedText(); + const vectorSentenceQualityScore = response.getQualityScores(); + log(`No. 
of sentences: "${vectorSentenceQualityScore.size()}"`); + const sentenceQualityScores = []; + for (let sentenceIndex=0; sentenceIndex < vectorSentenceQualityScore.size(); sentenceIndex++) { + const sentenceQualityScoreObject = vectorSentenceQualityScore.get(sentenceIndex); + const wordByteRangeList = []; + const wordList = []; + const wordScoreList = []; + const vectorWordScore = sentenceQualityScoreObject.wordScores; + const vectorWordByteRange = sentenceQualityScoreObject.wordByteRanges; + + for (let wordIndex = 0; wordIndex < vectorWordScore.size(); wordIndex++) { + const wordScore = vectorWordScore.get(wordIndex); + const wordByteRange = vectorWordByteRange.get(wordIndex); + wordScoreList.push(wordScore); + wordByteRangeList.push(wordByteRange); + const word = _getSubString(translatedText, wordByteRange); + wordList.push(word); + } + + const sentenceQualityScore = { + wordByteRanges: wordByteRangeList, + words: wordList, + wordScores: wordScoreList, + sentenceScore: sentenceQualityScoreObject.sentenceScore + }; + sentenceQualityScores.push(sentenceQualityScore); + } + result.push(sentenceQualityScores); + } + return result; +} + +const _prepareResponseOptions = () => { + return {qualityScores: true, alignment: false, alignmentThreshold: 0.2}; +} + +const _prepareSourceText = (input) => { + let vectorSourceText = new Module.VectorString; + input.forEach(paragraph => { + // prevent empty paragraph - it breaks the translation + if (paragraph.trim() === "") { + return; + } + vectorSourceText.push_back(paragraph.trim()) + }) + return vectorSourceText; } -// Extracts all the translated sentences from the Response and returns them. 
-const _getAllTranslatedSentencesOfParagraph = (response) => { +const _getTranslatedSentences = (response) => { const sentences = []; const text = response.getTranslatedText(); for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { const utf8SentenceByteRange = response.getTranslatedSentence(sentenceIndex); - sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); + sentences.push(_getSubString(text, utf8SentenceByteRange)); } return sentences; } -// Extracts all the source sentences from the Response and returns them. -const _getAllSourceSentencesOfParagraph = (response) => { +const _getSourceSentences = (response) => { const sentences = []; const text = response.getOriginalText(); for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) { const utf8SentenceByteRange = response.getSourceSentence(sentenceIndex); - sentences.push(_getSentenceFromByteRange(text, utf8SentenceByteRange)); + sentences.push(_getSubString(text, utf8SentenceByteRange)); } return sentences; } @@ -291,8 +362,8 @@ const _getAllSourceSentencesOfParagraph = (response) => { * Returns a substring of text (a string). The substring is represented by * byteRange (begin and end indices) within the utf-8 encoded version of the text.
*/ -const _getSentenceFromByteRange = (text, byteRange) => { - const utf8BytesView = encoder.encode(text); - const utf8SentenceBytes = utf8BytesView.subarray(byteRange.begin, byteRange.end); - return decoder.decode(utf8SentenceBytes); +const _getSubString = (text, utf8ByteRange) => { + const textUtf8ByteView = encoder.encode(text); + const substringUtf8ByteView = textUtf8ByteView.subarray(utf8ByteRange.begin, utf8ByteRange.end); + return decoder.decode(substringUtf8ByteView); } From 2b98c67996eb2df7f3233c293eeb640e3b0b2fa3 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 27 Oct 2021 20:37:05 +0100 Subject: [PATCH 295/442] Cache for translations (#227) Sets a cache to operate for each sentence that a TranslationModel process caching the corresponding marian::History for a {TranslationModel::Id, marian::Words} key. Cache is thus shared across multiple TranslationModels bound to the lifetime of a Service. Cache gracefully downgrades in the case of WebAssembly. --- bergamot-translator-tests | 2 +- src/tests/apps.cpp | 33 ++++++++++ src/tests/apps.h | 2 + src/tests/cli.cpp | 10 ++- src/tests/units/CMakeLists.txt | 1 + src/tests/units/cache_tests.cpp | 56 ++++++++++++++++ src/translator/aggregate_batching_pool.cpp | 4 +- src/translator/batching_pool.cpp | 15 +++-- src/translator/cache.h | 75 ++++++++++++++++++++++ src/translator/parser.cpp | 6 ++ src/translator/parser.h | 6 ++ src/translator/request.cpp | 56 ++++++++++++++-- src/translator/request.h | 18 +++++- src/translator/service.cpp | 13 ++-- src/translator/service.h | 23 ++++++- src/translator/translation_model.cpp | 11 +++- src/translator/translation_model.h | 11 +++- 17 files changed, 314 insertions(+), 28 deletions(-) create mode 100644 src/tests/units/cache_tests.cpp create mode 100644 src/translator/cache.h diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 9dc3c5e9a..6bd396922 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 
9dc3c5e9a1027c1d6b4a467a27bdff16d0d6a006 +Subproject commit 6bd396922b2159b62c55530cb3ee6a40323d4171 diff --git a/src/tests/apps.cpp b/src/tests/apps.cpp index 63febfaf0..20c6d2acb 100644 --- a/src/tests/apps.cpp +++ b/src/tests/apps.cpp @@ -108,6 +108,39 @@ void qualityEstimatorScores(AsyncService &service, Ptr model) } } +void translationCache(AsyncService &service, Ptr model) { + ResponseOptions responseOptions; + + // Read a large input text blob from stdin + const std::string source = readFromStdin(); + + // Round 1 + std::string buffer = source; + Response firstResponse = translateForResponse(service, model, std::move(buffer), responseOptions); + + auto statsFirstRun = service.cacheStats(); + LOG(info, "Cache Hits/Misses = {}/{}", statsFirstRun.hits, statsFirstRun.misses); + ABORT_IF(statsFirstRun.hits != 0, "Expecting no cache hits, but hits found."); + + // Round 2; There should be cache hits + buffer = source; + Response secondResponse = translateForResponse(service, model, std::move(buffer), responseOptions); + + auto statsSecondRun = service.cacheStats(); + LOG(info, "Cache Hits/Misses = {}/{}", statsSecondRun.hits, statsSecondRun.misses); + ABORT_IF(statsSecondRun.hits <= 0, "At least one hit expected, none found."); + if (statsSecondRun.hits != statsFirstRun.misses) { + std::cerr << "Mismatch in expected hits (Hits, Misses = " << statsSecondRun.hits << ", " << statsSecondRun.misses + << "). This can happen due to random eviction." << std::endl; + } + + ABORT_IF(firstResponse.target.text != secondResponse.target.text, + "Recompiled string provided different output when operated with cache. 
On the same hardware while using " + "same path, this is expected to be same."); + + std::cout << firstResponse.target.text; +} + } // namespace testapp } // namespace bergamot } // namespace marian diff --git a/src/tests/apps.h b/src/tests/apps.h index dee77a9be..9e45a1caa 100644 --- a/src/tests/apps.h +++ b/src/tests/apps.h @@ -37,6 +37,8 @@ void qualityEstimatorWords(AsyncService &service, Ptr model); // Reads from stdin and translates the read content. Prints the quality scores for each sentence. void qualityEstimatorScores(AsyncService &service, Ptr model); +// Tests if cache is active and functional +void translationCache(AsyncService &service, Ptr model); } // namespace testapp } // namespace bergamot } // namespace marian diff --git a/src/tests/cli.cpp b/src/tests/cli.cpp index 90c386c84..ba4d73218 100644 --- a/src/tests/cli.cpp +++ b/src/tests/cli.cpp @@ -5,7 +5,11 @@ int main(int argc, char *argv[]) { marian::bergamot::ConfigParser configParser; configParser.parseArgs(argc, argv); auto &config = configParser.getConfig(); - AsyncService::Config serviceConfig{config.numWorkers}; + AsyncService::Config serviceConfig; + serviceConfig.numWorkers = config.numWorkers; + serviceConfig.cacheEnabled = config.cacheEnabled; + serviceConfig.cacheMutexBuckets = config.cacheMutexBuckets; + serviceConfig.cacheSize = config.cacheSize; AsyncService service(serviceConfig); std::vector> models; @@ -37,6 +41,10 @@ int main(int argc, char *argv[]) { case OpMode::TEST_QUALITY_ESTIMATOR_SCORES: testapp::qualityEstimatorScores(service, models.front()); break; + case OpMode::TEST_TRANSLATION_CACHE: + testapp::translationCache(service, models.front()); + break; + default: ABORT("Incompatible op-mode. 
Choose one of the test modes."); break; diff --git a/src/tests/units/CMakeLists.txt b/src/tests/units/CMakeLists.txt index 4794badcd..2570e05e7 100644 --- a/src/tests/units/CMakeLists.txt +++ b/src/tests/units/CMakeLists.txt @@ -1,6 +1,7 @@ # Unit tests set(UNIT_TESTS annotation_tests + cache_tests quality_estimator_tests) foreach(test ${UNIT_TESTS}) diff --git a/src/tests/units/cache_tests.cpp b/src/tests/units/cache_tests.cpp new file mode 100644 index 000000000..f2f1b19ed --- /dev/null +++ b/src/tests/units/cache_tests.cpp @@ -0,0 +1,56 @@ + +#include +#include + +#include "catch.hpp" +#include "translator/cache.h" +#include "translator/history.h" + +using namespace marian::bergamot; + +TEST_CASE("Test Cache in a threaded setting") { + size_t numThreads = 100; + size_t numIters = 10000; + using Key = int; + using Value = int; + using TestCache = AtomicCache; + + TestCache cache(/*size=*/300, /*mutexBuckets=*/16); + + auto op = [numIters, &cache]() { + std::mt19937_64 randomGenerator; + randomGenerator.seed(42); // reproducible outputs + Value randMax = 2000; + + for (size_t i = 0; i < numIters; i++) { + Key query = randomGenerator() % randMax; + std::pair result = cache.find(query); + if (result.first) { + REQUIRE(result.second == query); + } + + Value value = query; + cache.store(/*key=*/query, std::move(value)); + } + }; + + std::vector workers; + for (size_t t = 0; t < numThreads; t++) { + workers.emplace_back(op); + } + + for (size_t t = 0; t < numThreads; t++) { + workers[t].join(); + } + + TestCache::Stats stats = cache.stats(); + float hitRate = static_cast(stats.hits) / static_cast(stats.hits + stats.misses); + + // This is non-deterministic due to threads. + std::cout << "Hit-Rate:" << hitRate << "\n"; + std::cout << "(Hits, Misses) = " << stats.hits << " " << stats.misses << "\n"; + + // Can we create a specialization of the actual cache-type we want? Does it compile, at least? + // We already have Ptr, it's easier to move Ptr to cache. 
+ TranslationCache translationCache(/*size=*/300, /*mutexBuckets=*/16); +} diff --git a/src/translator/aggregate_batching_pool.cpp b/src/translator/aggregate_batching_pool.cpp index 38c55f1c4..60f5fcd2e 100644 --- a/src/translator/aggregate_batching_pool.cpp +++ b/src/translator/aggregate_batching_pool.cpp @@ -9,9 +9,9 @@ AggregateBatchingPool::AggregateBatchingPool() { } size_t AggregateBatchingPool::enqueueRequest(Ptr<TranslationModel> model, Ptr<Request> request) { - model->enqueueRequest(request); + size_t sentencesEnqueued = model->enqueueRequest(request); aggregateQueue_.insert(model); - return request->numSegments(); + return sentencesEnqueued; } size_t AggregateBatchingPool::generateBatch(Ptr<TranslationModel>& model, Batch& batch) { diff --git a/src/translator/batching_pool.cpp b/src/translator/batching_pool.cpp index 83b5e00ab..1033e80cc 100644 --- a/src/translator/batching_pool.cpp +++ b/src/translator/batching_pool.cpp @@ -44,14 +44,19 @@ size_t BatchingPool::generateBatch(Batch &batch) { } size_t BatchingPool::enqueueRequest(Ptr<Request> request) { + size_t toBeFreshlyTranslated = 0; for (size_t i = 0; i < request->numSegments(); i++) { - RequestSentence sentence(i, request); - size_t bucket_id = sentence.numTokens(); - assert(bucket_id < bucket_.size()); - bucket_[bucket_id].insert(sentence); + if (!request->cacheHitPrefilled(i)) { + RequestSentence sentence(i, request); + size_t bucket_id = sentence.numTokens(); + assert(bucket_id < bucket_.size()); + bucket_[bucket_id].insert(sentence); + + toBeFreshlyTranslated += 1; + } } - return request->numSegments(); + return toBeFreshlyTranslated; } } // namespace bergamot diff --git a/src/translator/cache.h b/src/translator/cache.h new file mode 100644 index 000000000..ba68e4e93 --- /dev/null +++ b/src/translator/cache.h @@ -0,0 +1,75 @@ +#pragma once +#include +#include +#include + +#include "definitions.h" +#include "translator/history.h" + +namespace marian::bergamot { + +template <class Key, class Value, class Hash = std::hash<Key>, class Equals = std::equal_to<Key>> +class AtomicCache { + public: + struct Stats { + size_t hits{0}; + size_t misses{0}; + }; + + explicit AtomicCache(size_t size, size_t buckets) : records_(size), mutexBuckets_(buckets) {} + + std::pair<bool, Value> find(const Key &key) const { + Value value; + bool found = atomicLoad(key, value); + return std::make_pair(found, value); + } + + void store(const Key &key, Value value) { atomicStore(key, value); } + + const Stats stats() const { return stats_; } + + private: + using Record = std::pair<Key, Value>; + + bool atomicLoad(const Key &key, Value &value) const { + // No probing, direct map onto records_ + size_t index = hash_(key) % records_.size(); + size_t mutexId = index % mutexBuckets_.size(); + + std::lock_guard<std::mutex> lock(mutexBuckets_[mutexId]); + const Record &candidate = records_[index]; + if (equals_(key, candidate.first)) { + value = candidate.second; + stats_.hits += 1; + return true; + } else { + stats_.misses += 1; + } + + return false; + } + + void atomicStore(const Key &key, Value value) { + // No probing, direct map onto records_ + size_t index = hash_(key) % records_.size(); + size_t mutexId = index % mutexBuckets_.size(); + + std::lock_guard<std::mutex> lock(mutexBuckets_[mutexId]); + Record &candidate = records_[index]; + + candidate.first = key; + candidate.second = value; + } + + std::vector<Record> records_; + + mutable std::vector<std::mutex> mutexBuckets_; + mutable Stats stats_; + + Hash hash_; + Equals equals_; +}; + +typedef AtomicCache<size_t, Ptr<History>> TranslationCache; + +} // namespace marian::bergamot diff --git a/src/translator/parser.cpp b/src/translator/parser.cpp index d927409b5..2295fd6c9 100644 --- a/src/translator/parser.cpp +++ b/src/translator/parser.cpp @@ -24,6 +24,7 @@ std::istringstream &operator>>(std::istringstream &in, OpMode &mode) { {"test-quality-estimator-words", OpMode::TEST_QUALITY_ESTIMATOR_WORDS}, {"test-quality-estimator-scores", OpMode::TEST_QUALITY_ESTIMATOR_SCORES}, {"test-forward-backward", OpMode::TEST_FORWARD_BACKWARD_FOR_OUTBOUND}, + {"test-translation-cache", OpMode::TEST_TRANSLATION_CACHE}, }; auto query =
table.find(modeString); @@ -84,6 +85,11 @@ void ConfigParser::addOptionsBoundToConfig(CLI::App &app, CLIConfig &config) { app.add_option("--cpu-threads", config.numWorkers, "Number of worker threads to use for translation"); app_.add_option("--bergamot-mode", config.opMode, "Operating mode for bergamot: [wasm, native, decoder]"); + + app_.add_option("--cache-translations", config.cacheEnabled, "Whether to cache translations or not."); + app_.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); + app_.add_option("--cache-mutex-buckets", config.cacheMutexBuckets, + "Number of mutex buckets to control locking granularity"); } std::shared_ptr parseOptionsFromFilePath(const std::string &configPath, bool validate /*= true*/) { diff --git a/src/translator/parser.h b/src/translator/parser.h index c9fffcebf..80006f3b0 100644 --- a/src/translator/parser.h +++ b/src/translator/parser.h @@ -25,6 +25,7 @@ enum OpMode { TEST_QUALITY_ESTIMATOR_WORDS, TEST_QUALITY_ESTIMATOR_SCORES, TEST_FORWARD_BACKWARD_FOR_OUTBOUND, + TEST_TRANSLATION_CACHE, }; /// Overload for CL11, convert a read from a stringstream into opmode. @@ -37,6 +38,11 @@ struct CLIConfig { bool validateByteArray; size_t numWorkers; OpMode opMode; + + // Cache parameters + bool cacheEnabled{false}; + size_t cacheSize{20}; + size_t cacheMutexBuckets{4}; }; /// ConfigParser for bergamot. Internally stores config options with CLIConfig. 
CLI11 parsing binds the parsing code to diff --git a/src/translator/request.cpp b/src/translator/request.cpp index 9bdae9f74..feba62a4a 100644 --- a/src/translator/request.cpp +++ b/src/translator/request.cpp @@ -3,28 +3,63 @@ #include #include "annotation.h" +#include "cache.h" #include "common/logging.h" #include "definitions.h" #include "response.h" +#include "translation_model.h" namespace marian { namespace bergamot { +size_t hashForCache(const TranslationModel &model, const marian::Words &words) { + size_t seed = model.modelId(); + for (auto &word : words) { + size_t hashWord = static_cast<size_t>(word.toWordIndex()); + util::hash_combine(seed, hashWord); + } + return seed; +} + // ----------------------------------------------------------------- -Request::Request(size_t Id, Segments &&segments, ResponseBuilder &&responseBuilder) +Request::Request(size_t Id, const TranslationModel &model, Segments &&segments, ResponseBuilder &&responseBuilder, + TranslationCache *cache) : Id_(Id), + model_(model), segments_(std::move(segments)), - responseBuilder_(std::move(responseBuilder)) - -{ + responseBuilder_(std::move(responseBuilder)), + cache_(cache) { counter_ = segments_.size(); histories_.resize(segments_.size(), nullptr); - // If there are no segments_, we are never able to trigger the responseBuilder - // calls from a different thread. However, in this case we want an empty valid - // response. + // 1. If there are no segments_, we are never able to trigger the responseBuilder calls from a different thread. This + // happens when the user provides empty input, or the sentence and subword preprocessing deems no translatable units + // present. However, in this case we want an empty valid response. There's no need to do any additional processing + // here.
if (segments_.size() == 0) { responseBuilder_(std::move(histories_)); + } else { + counter_ = segments_.size(); + histories_.resize(segments_.size()); + + if (cache_ != nullptr) { + // Iterate through segments, see if any can be prefilled from cache. If prefilled, mark the particular segments as + // complete (non-empty ProcessedRequestSentence). Also update accounting used elsewhere (counter_) to reflect one + // less segment to translate. + for (size_t idx = 0; idx < segments_.size(); idx++) { + size_t key = hashForCache(model_, getSegment(idx)); + auto [found, history] = cache_->find(key); + if (found) { + histories_[idx] = history; + --counter_; + } + } + // 2. Also, if the cache manages to prefill every history and bring the counter down to zero, we have to trigger + // the ResponseBuilder here as well: no segments go into batching, so processHistory is never triggered. + if (counter_.load() == 0) { + responseBuilder_(std::move(histories_)); + } + } } } @@ -37,7 +72,14 @@ Segment Request::getSegment(size_t index) const { return segments_[index]; } void Request::processHistory(size_t index, Ptr<History> history) { // Concurrently called by multiple workers as a history from translation is // ready. The container storing histories is set with the value obtained. + + // Fill in the placeholder with the History obtained by fresh translation. Only cache-misses get this far, so + // update the cache (if available) to store the result. histories_[index] = history; + if (cache_ != nullptr) { + size_t key = hashForCache(model_, getSegment(index)); + cache_->store(key, histories_[index]); + } // In case this is the last request in, completeRequest is called, which sets the // value of the promise.
diff --git a/src/translator/request.h b/src/translator/request.h index d2645f6d8..8415e3233 100644 --- a/src/translator/request.h +++ b/src/translator/request.h @@ -6,6 +6,7 @@ #include #include "annotation.h" +#include "cache.h" #include "common/logging.h" #include "data/types.h" #include "definitions.h" @@ -16,6 +17,8 @@ namespace marian { namespace bergamot { +class TranslationModel; + /// A Request is an internal representation used to represent a request after /// being processed by TextProcessor into sentences constituted by marian::Words. /// @@ -42,11 +45,16 @@ class Request { /// /// /// @param [in] Id: Identifier assigned to Request by Service. + /// @param [in] model: TranslationModel for identifying a unique translation unit key (model, words in a sentence) for + /// cache. /// @param [in] segments: Each segment is a unit to be translated. /// @param [in] responseBuilder: Callback function (of ResponseBuilder type) /// to be triggered upon the completion of translation of all units in a /// Request. - Request(size_t Id, Segments &&segments, ResponseBuilder &&responseBuilder); + /// @param [in] cache: Cache supplied externally to attempt to fetch translations or store them after completion for + /// reuse later. + Request(size_t Id, const TranslationModel &model, Segments &&segments, ResponseBuilder &&responseBuilder, + TranslationCache *cache); /// Obtain the count of tokens in the segment corresponding to index. Used to /// insert sentences from multiple requests into the corresponding size bucket. @@ -67,9 +75,14 @@ class Request { /// compiled from requests. void processHistory(size_t index, Ptr<History> history); + bool cacheHitPrefilled(size_t index) const { return histories_[index] != nullptr; } + private: size_t Id_; + /// TranslationModel associated with this request + const TranslationModel &model_; + /// Multiple translation-workers can concurrently access the same Request.
The /// following atomic atomically operates on the variable holding sentences /// remaining to be translated. @@ -86,6 +99,9 @@ class Request { /// Constructing Response requires the vocabs_ used to generate Request. /// std::vector> *vocabs_; ResponseBuilder responseBuilder_; + + /// Cache used to hold unit translations. If nullptr, means no-caching. + TranslationCache *cache_; }; /// A RequestSentence provides a view to a sentence within a Request. Existence diff --git a/src/translator/service.cpp b/src/translator/service.cpp index 9de69ba8a..ca92721da 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -10,7 +10,8 @@ namespace marian { namespace bergamot { -BlockingService::BlockingService(const BlockingService::Config &config) : requestId_(0), batchingPool_() {} +BlockingService::BlockingService(const BlockingService::Config &config) + : config_(config), requestId_(0), batchingPool_(), cache_(config.cacheSize, /*mutexBuckets=*/1) {} std::vector BlockingService::translateMultiple(std::shared_ptr translationModel, std::vector &&sources, @@ -20,8 +21,9 @@ std::vector BlockingService::translateMultiple(std::shared_ptr request = - translationModel->makeRequest(requestId_++, std::move(sources[i]), callback, responseOptions); + translationModel->makeRequest(requestId_++, std::move(sources[i]), callback, responseOptions, cache); batchingPool_.enqueueRequest(translationModel, request); } @@ -34,7 +36,8 @@ std::vector BlockingService::translateMultiple(std::shared_ptr translationModel, std::string &&source, CallbackType callback, const ResponseOptions &responseOptions) { // Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting. - Ptr request = translationModel->makeRequest(requestId_++, std::move(source), callback, responseOptions); + TranslationCache *cache = config_.cacheEnabled ? 
&cache_ : nullptr; + Ptr<Request> request = + translationModel->makeRequest(requestId_++, std::move(source), callback, responseOptions, cache); safeBatchingPool_.enqueueRequest(translationModel, request); } diff --git a/src/translator/service.h b/src/translator/service.h index d37f5c262..fae9dbffc 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -5,6 +5,7 @@ #include #include +#include "cache.h" #include "data/types.h" #include "quality_estimator.h" #include "response.h" @@ -27,7 +28,11 @@ class AsyncService; /// bunch of texts and optional args to translate, wait till the translation finishes). class BlockingService { public: - struct Config {}; + struct Config { + bool cacheEnabled{false}; ///< Whether to enable cache or not. + size_t cacheSize{2000}; ///< Size in History items to be stored in the cache. Loosely corresponds to sentences to + /// cache in the real world. + }; /// Construct a BlockingService with configuration loaded from an Options object. Does not require any keys, values to /// be set. BlockingService(const BlockingService::Config &config); @@ -47,6 +52,8 @@ class BlockingService { std::vector<Response> translateMultiple(std::shared_ptr<TranslationModel> translationModel, std::vector<std::string> &&source, const ResponseOptions &responseOptions); + TranslationCache::Stats cacheStats() { return cache_.stats(); } + private: /// Numbering requests processed through this instance. Used to keep account of arrival times of the request. This /// allows for using this quantity in priority based ordering. @@ -57,6 +64,8 @@ class BlockingService { AggregateBatchingPool batchingPool_; Config config_; + + TranslationCache cache_; }; /// Effectively a threadpool, providing an API to take a translation request of a source-text, parameterized by @@ -65,7 +74,13 @@ class BlockingService { class AsyncService { public: struct Config { - size_t numWorkers; + size_t numWorkers; ///< How many worker translation threads to spawn. + bool cacheEnabled{false}; ///< Whether to enable cache or not.
+ size_t cacheSize{2000}; ///< Size in History items to be stored in the cache. Loosely corresponds to sentences to + /// cache in the real world. + size_t cacheMutexBuckets; ///< Controls the granularity of locking to reduce contention by bucketing mutexes + ///< guarding cache entry read write. Optimal at min(core, numWorkers) assuming a + ///< reasonably large cache-size. }; /// Construct an AsyncService with configuration loaded from Options. Expects positive integer value for /// `cpu-threads`. Additionally requires options which configure AggregateBatchingPool. @@ -95,6 +110,8 @@ class AsyncService { /// Thread joins and proper shutdown are required to be handled explicitly. ~AsyncService(); + TranslationCache::Stats cacheStats() { return cache_.stats(); } + private: AsyncService::Config config_; @@ -111,6 +128,8 @@ class AsyncService { /// requests compiled from batching-pools of multiple translation models. The batching pool is wrapped around one /// object for thread-safety. ThreadsafeBatchingPool safeBatchingPool_; + + TranslationCache cache_; }; } // namespace bergamot diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp index 5a2739542..5cf2b85f4 100644 --- a/src/translator/translation_model.cpp +++ b/src/translator/translation_model.cpp @@ -2,6 +2,7 @@ #include "batch.h" #include "byte_array_util.h" +#include "cache.h" #include "common/logging.h" #include "data/corpus.h" #include "data/text_input.h" @@ -11,9 +12,12 @@ namespace marian { namespace bergamot { +std::atomic TranslationModel::modelCounter_ = 0; + TranslationModel::TranslationModel(const Config &options, MemoryBundle &&memory /*=MemoryBundle{}*/, size_t replicas /*=1*/) - : options_(options), + : modelId_(modelCounter_++), + options_(options), memory_(std::move(memory)), vocabs_(options, std::move(memory_.vocabs)), textProcessor_(options, vocabs_, std::move(memory_.ssplitPrefixFile)), @@ -86,14 +90,15 @@ void TranslationModel::loadBackend(size_t idx) { // 
Make request process is shared between Async and Blocking workflow of translating. Ptr TranslationModel::makeRequest(size_t requestId, std::string &&source, CallbackType callback, - const ResponseOptions &responseOptions) { + const ResponseOptions &responseOptions, TranslationCache *cache) { Segments segments; AnnotatedText annotatedSource; textProcessor_.process(std::move(source), annotatedSource, segments); ResponseBuilder responseBuilder(responseOptions, std::move(annotatedSource), vocabs_, callback, *qualityEstimator_); - Ptr request = New(requestId, std::move(segments), std::move(responseBuilder)); + Ptr request = + New(requestId, /*model=*/*this, std::move(segments), std::move(responseBuilder), cache); return request; } diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h index 599e6c707..6d2169494 100644 --- a/src/translator/translation_model.h +++ b/src/translator/translation_model.h @@ -6,6 +6,7 @@ #include "batch.h" #include "batching_pool.h" +#include "cache.h" #include "common/utils.h" #include "data/shortlist.h" #include "definitions.h" @@ -66,11 +67,11 @@ class TranslationModel { /// @param [in] responseOptions: Configuration used to prepare the Response corresponding to the created request. // @returns Request created from the query parameters wrapped within a shared-pointer. Ptr makeRequest(size_t requestId, std::string&& source, CallbackType callback, - const ResponseOptions& responseOptions); + const ResponseOptions& responseOptions, TranslationCache* cache); /// Relays a request to the batching-pool specific to this translation model. /// @param [in] request: Request constructed through makeRequest - void enqueueRequest(Ptr request) { batchingPool_.enqueueRequest(request); }; + size_t enqueueRequest(Ptr request) { return batchingPool_.enqueueRequest(request); }; /// Generates a batch from the batching-pool for this translation model, compiling from several active requests. 
Note /// that it is possible that calls to this method can give empty-batches. @@ -86,7 +87,11 @@ class TranslationModel { /// @param [in] batch: A batch generated from generateBatch from the same TranslationModel instance. void translateBatch(size_t deviceId, Batch& batch); + /// Returns a unique-identifier for the model. + size_t modelId() const { return modelId_; } + private: + size_t modelId_; Config options_; MemoryBundle memory_; Vocabs vocabs_; @@ -114,6 +119,8 @@ class TranslationModel { void loadBackend(size_t idx); Ptr convertToMarianBatch(Batch& batch); + + static std::atomic modelCounter_; }; } // namespace bergamot From 45412ce7de0ba000bae96ce376421f4ef3250c85 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 28 Oct 2021 09:30:02 +0100 Subject: [PATCH 296/442] Set PR to any branch to trigger workflows (#230) --- .github/workflows/coding-styles.yml | 2 +- .github/workflows/doc.yml | 2 +- .github/workflows/wasm-custom_marian-mac.yml | 2 +- .github/workflows/wasm-custom_marian-ubuntu.yml | 2 +- .github/workflows/windows.yml | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/.github/workflows/coding-styles.yml b/.github/workflows/coding-styles.yml index 330790e88..0bff2ec79 100644 --- a/.github/workflows/coding-styles.yml +++ b/.github/workflows/coding-styles.yml @@ -6,7 +6,7 @@ on: push: branches: [ main, ci-sandbox ] pull_request: - branches: [ main, ci-sandbox ] + branches: [ '**' ] jobs: clang-format: diff --git a/.github/workflows/doc.yml b/.github/workflows/doc.yml index 706465e39..3874822b8 100644 --- a/.github/workflows/doc.yml +++ b/.github/workflows/doc.yml @@ -5,7 +5,7 @@ on: branches: [ main, ci-sandbox ] tags: ['v[0-9]+.[0-9]+.[0-9]+'] pull_request: - branches: [ main ] + branches: [ '**' ] jobs: api-documentation: diff --git a/.github/workflows/wasm-custom_marian-mac.yml b/.github/workflows/wasm-custom_marian-mac.yml index 636323581..a27f6b8de 100644 --- a/.github/workflows/wasm-custom_marian-mac.yml +++ 
b/.github/workflows/wasm-custom_marian-mac.yml @@ -4,7 +4,7 @@ on: push: branches: [ main, ci-sandbox ] pull_request: - branches: [ main, ci-sandbox ] + branches: [ '**' ] jobs: build-wasm: diff --git a/.github/workflows/wasm-custom_marian-ubuntu.yml b/.github/workflows/wasm-custom_marian-ubuntu.yml index b644d9763..80d083fb8 100644 --- a/.github/workflows/wasm-custom_marian-ubuntu.yml +++ b/.github/workflows/wasm-custom_marian-ubuntu.yml @@ -4,7 +4,7 @@ on: push: branches: [ main, ci-sandbox ] pull_request: - branches: [ main, ci-sandbox ] + branches: [ '**' ] jobs: build-wasm: diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index 7d1aca9d5..0933835de 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -4,7 +4,7 @@ on: push: branches: [ main, ci-sandbox ] pull_request: - branches: [ main, ci-sandbox ] + branches: [ '**' ] env: MKL_URL: "https://romang.blob.core.windows.net/mariandev/ci/mkl-2020.1-windows-static.zip" From 47e57c95a6eb4e8c3d1f6ef9cd0cbdccd04b84a6 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Fri, 29 Oct 2021 13:40:28 +0100 Subject: [PATCH 297/442] [ssplit-cpp] Enable position independent library when compiled from sources (#240) --- 3rd_party/ssplit-cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index f0fe09765..72dbd9346 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit f0fe09765ce22c6db79b15123c6599b2b419d240 +Subproject commit 72dbd9346b9f0eede4444922c4e3fcfdc0d16abb From 9b443997e2c36d34679975a3ebddb374c9740b68 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 31 Oct 2021 12:33:42 +0000 Subject: [PATCH 298/442] EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243) --- .github/workflows/windows.yml | 3 +-- 3rd_party/CMakeLists.txt | 4 ++-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/.github/workflows/windows.yml 
b/.github/workflows/windows.yml index 0933835de..74d2439f5 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -62,6 +62,5 @@ jobs: - name: Print versions working-directory: build run: | - .\app\service-cli.exe --version - dir *.exe + .\app\bergamot.exe --version shell: cmd diff --git a/3rd_party/CMakeLists.txt b/3rd_party/CMakeLists.txt index 70e50d663..b84a37b80 100644 --- a/3rd_party/CMakeLists.txt +++ b/3rd_party/CMakeLists.txt @@ -1,13 +1,13 @@ # marian-dev is tested elsewhere in both paths, turning off here. set(COMPILE_TESTS OFF) -add_subdirectory(marian-dev) +add_subdirectory(marian-dev EXCLUDE_FROM_ALL) if(COMPILE_WASM) # This is a bad way of adding compilation flags. Will be improved soon. add_compile_options(${WASM_COMPILE_FLAGS}) endif(COMPILE_WASM) -add_subdirectory(ssplit-cpp) +add_subdirectory(ssplit-cpp EXCLUDE_FROM_ALL) # Add include directories for 3rd party targets to be able to use it anywhere in the # project without explicitly specifying their include directories. 
Once they From c5bc3f5191c7d733f9d836a3bf007d58e4b71d96 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Mon, 1 Nov 2021 13:06:23 +0100 Subject: [PATCH 299/442] Update config "skip-cost" to enable log probabilities for QE scores (#247) - Updated wasm test page --- wasm/test_page/js/worker.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index f6dc83623..912cc87d9 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -176,7 +176,7 @@ max-length-break: 128 mini-batch-words: 1024 workspace: 128 max-length-factor: 2.0 -skip-cost: true +skip-cost: false cpu-threads: 0 quiet: true quiet-translation: true From 806169c822c7d240d88ba17dc2c236e560fdc4dc Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Mon, 1 Nov 2021 16:31:01 +0000 Subject: [PATCH 300/442] Recover logging (#226) --- 3rd_party/marian-dev | 2 +- bergamot-translator-tests | 2 +- src/translator/logging.h | 35 +++++++++++++++++++++++++++++++++++ src/translator/service.h | 5 +++++ 4 files changed, 42 insertions(+), 2 deletions(-) create mode 100644 src/translator/logging.h diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index a1a82ff64..87643a4e3 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit a1a82ff64910dc066d64d631cd7a8212df9f88cd +Subproject commit 87643a4e3b121c74d3b0a4f048e9f6836ad11078 diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 6bd396922..9dc3c5e9a 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 6bd396922b2159b62c55530cb3ee6a40323d4171 +Subproject commit 9dc3c5e9a1027c1d6b4a467a27bdff16d0d6a006 diff --git a/src/translator/logging.h b/src/translator/logging.h new file mode 100644 index 000000000..bd5b17a45 --- /dev/null +++ b/src/translator/logging.h @@ -0,0 +1,35 @@ +#include 
"3rd_party/marian-dev/src/3rd_party/spdlog/spdlog.h" +#include "common/logging.h" + +namespace marian { +namespace bergamot { + +// RAII Wrap around logging, to clean up after the object on stack. +class Logger { + public: + Logger() : marianLoggers_(createLoggers()) { + // We are manually creating loggers, because this is usually created in marian as a side-effect of + // config-parsing. + } + + ~Logger() { + // We need to manually destroy the loggers, as marian doesn't do + // that but will complain when a new marian::Config tries to + // initialise loggers with the same name. + for (auto &logger : marianLoggers_) { + if (logger) { + spdlog::drop(logger->name()); + } + } + } + + // Explicit destructor above is an indicator we should not allow this class to copy-construct. + Logger &operator=(const Logger &) = delete; + Logger(const Logger &) = delete; + + private: + using MarianLogger = std::shared_ptr; + std::vector marianLoggers_; +}; +} // namespace bergamot +} // namespace marian diff --git a/src/translator/service.h b/src/translator/service.h index fae9dbffc..d58a759da 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -7,6 +7,7 @@ #include "cache.h" #include "data/types.h" +#include "logging.h" #include "quality_estimator.h" #include "response.h" #include "response_builder.h" @@ -65,6 +66,8 @@ class BlockingService { Config config_; + // Logger which shuts down cleanly with service. + Logger logger_; TranslationCache cache_; }; @@ -129,6 +132,8 @@ class AsyncService { /// object for thread-safety. ThreadsafeBatchingPool safeBatchingPool_; + // Logger which shuts down cleanly with service. 
+ Logger logger_; TranslationCache cache_; }; From 0bb8095bca166d765ab837d2a155e54048994006 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Mon, 1 Nov 2021 19:21:28 +0000 Subject: [PATCH 301/442] Deprecate hardAlignment in favour of softAlignment (#250) --- bergamot-translator-tests | 2 +- src/translator/response.h | 20 ++++---------------- src/translator/response_builder.cpp | 9 +-------- src/translator/response_options.h | 6 ------ wasm/bindings/response_options_bindings.cpp | 3 +-- wasm/test_page/js/worker.js | 2 +- 6 files changed, 8 insertions(+), 34 deletions(-) diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 9dc3c5e9a..9344b9835 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 9dc3c5e9a1027c1d6b4a467a27bdff16d0d6a006 +Subproject commit 9344b9835797f7c19ee49d30bff134b74a1a336e diff --git a/src/translator/response.h b/src/translator/response.h index b77fbb633..49ac80392 100644 --- a/src/translator/response.h +++ b/src/translator/response.h @@ -14,18 +14,6 @@ namespace marian { namespace bergamot { -/// Alignment is stored as a sparse matrix, this pretty much aligns with marian -/// internals but is brought here to maintain translator -/// agnosticism/independence. -struct Point { - size_t src; ///< Index pointing to source ByteRange - size_t tgt; ///< Index pointing to target ByteRange - float prob; ///< Score between [0, 1] on indicating degree of alignment. -}; - -/// Alignment is a sparse matrix, where Points represent entries with values. -typedef std::vector Alignment; - /// Response holds AnnotatedText(s) of source-text and translated text, /// alignment information between source and target sub-words and sentences. /// @@ -65,10 +53,10 @@ struct Response { /// source or target. std::vector qualityScores; - /// Alignments between source and target. 
Each Alignment is a - /// sparse matrix representation with indices corresponding - /// to (sub-)words accessible through Annotation. - std::vector alignments; + /// Alignments between source and target. This is a collection of dense matrices providing + /// P[t][s] = p(source-token s | target token t) + /// with an alignment matrix for each sentence. + std::vector>> alignments; /// Returns the source sentence (in terms of byte range) corresponding to sentenceIdx. /// diff --git a/src/translator/response_builder.cpp b/src/translator/response_builder.cpp index d51fbbf57..f1bb773e0 100644 --- a/src/translator/response_builder.cpp +++ b/src/translator/response_builder.cpp @@ -22,14 +22,7 @@ void ResponseBuilder::buildAlignments(Histories &histories, Response &response) // mean WASM bindings for a structure deep within marian source. auto hyp = std::get<1>(result); auto softAlignment = hyp->tracebackAlignment(); - auto threshold = responseOptions_.alignmentThreshold; - auto hardAlignment = data::ConvertSoftAlignToHardAlign(softAlignment, threshold); - Alignment unified_alignment; - for (auto &p : hardAlignment) { - unified_alignment.emplace_back(Point{p.srcPos, p.tgtPos, p.prob}); - } - - response.alignments.push_back(std::move(unified_alignment)); + response.alignments.push_back(std::move(softAlignment)); } } diff --git a/src/translator/response_options.h b/src/translator/response_options.h index 92737a414..43b1c433b 100644 --- a/src/translator/response_options.h +++ b/src/translator/response_options.h @@ -24,12 +24,6 @@ struct ResponseOptions { /// `alignment=true`. bool sentenceMappings{false}; - /// Threshold between `[0.0f, 1.0f]` to filter alignments into a sparse - /// matrix. Higher value implies stronger filtering leading to provision of - /// higher-confidence matches. `1.0f` gives argmax (not the full-dense - /// matrix). 
- float alignmentThreshold{0.2f}; - ConcatStrategy concatStrategy{ConcatStrategy::FAITHFUL}; }; diff --git a/wasm/bindings/response_options_bindings.cpp b/wasm/bindings/response_options_bindings.cpp index 4addbcbfc..deafe1e0a 100644 --- a/wasm/bindings/response_options_bindings.cpp +++ b/wasm/bindings/response_options_bindings.cpp @@ -15,6 +15,5 @@ using namespace emscripten; EMSCRIPTEN_BINDINGS(response_options) { value_object("ResponseOptions") .field("qualityScores", &ResponseOptions::qualityScores) - .field("alignment", &ResponseOptions::alignment) - .field("alignmentThreshold", &ResponseOptions::alignmentThreshold); + .field("alignment", &ResponseOptions::alignment); } diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index 912cc87d9..7fbaea8d2 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -323,7 +323,7 @@ const _parseTranslatedTextSentenceQualityScores = (vectorResponse) => { } const _prepareResponseOptions = () => { - return {qualityScores: true, alignment: false, alignmentThreshold: 0.2}; + return {qualityScores: true, alignment: false}; } const _prepareSourceText = (input) => { From 7693a1d0076929a57ba11a809932548234c82595 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Wed, 3 Nov 2021 13:54:48 +0100 Subject: [PATCH 302/442] Updated marian submodule (#256) --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 87643a4e3..200e81c0c 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 87643a4e3b121c74d3b0a4f048e9f6836ad11078 +Subproject commit 200e81c0cc88259c540b96afc6e0867cb05570b0 From fa4efb483ba4f5f4e3ac98bc8c3f14b2e87541f1 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Fri, 5 Nov 2021 16:46:03 +0000 Subject: [PATCH 303/442] Update ssplit cpp, pcre2 source compile to fix broken builds (#258) * Update ssplit cpp, pcre2 source 
compile to fix tests * Syncing with browsermt/ssplit-cpp * Removing accidental binary inclusion * Removing brt accidental update by git add -u * Fix windows workflow, vcpkg is broken use our cmake route * [ssplit-cpp] Try searching different library names for Windows --- .github/workflows/windows.yml | 3 ++- 3rd_party/ssplit-cpp | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index 74d2439f5..434c947cb 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -39,7 +39,7 @@ jobs: - name: Prepare vcpkg uses: lukka/run-vcpkg@v7.4 with: - vcpkgArguments: protobuf pcre2 + vcpkgArguments: protobuf vcpkgGitCommitId: 8dddc6c899ce6fdbeab38b525a31e7f23cb2d5bb vcpkgDirectory: ${{ github.workspace }}/vcpkg/ vcpkgTriplet: x64-windows-static @@ -51,6 +51,7 @@ jobs: buildDirectory: ${{ github.workspace }}/build cmakeAppendedArgs: '-G Ninja -DCMAKE_BUILD_TYPE="Release" + -DSSPLIT_USE_INTERNAL_PCRE2="ON" -DUSE_WASM_COMPATIBLE_SOURCE="OFF" -DUSE_STATIC_LIBS="TRUE"' cmakeListsOrSettingsJson: CMakeListsTxtAdvanced diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index 72dbd9346..36beacd1e 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit 72dbd9346b9f0eede4444922c4e3fcfdc0d16abb +Subproject commit 36beacd1ee4d9d591346d8e0f7f7700c7a91eb9f From 5a693b7eecda96100a9f9397a16d04737fe6d7f7 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Fri, 5 Nov 2021 20:48:28 +0000 Subject: [PATCH 304/442] Fixes windows workflow for PCRE2 (#260) --- .github/workflows/windows.yml | 3 +- 3rd_party/ssplit-cpp | 2 +- .../ports/pcre2/pcre2-10.35_fix-uwp.patch | 10 +++ vcpkg-override/ports/pcre2/portfile.cmake | 72 +++++++++++++++++++ vcpkg-override/ports/pcre2/vcpkg.json | 6 ++ 5 files changed, 90 insertions(+), 3 deletions(-) create mode 100644 vcpkg-override/ports/pcre2/pcre2-10.35_fix-uwp.patch create mode 100644 
vcpkg-override/ports/pcre2/portfile.cmake create mode 100644 vcpkg-override/ports/pcre2/vcpkg.json diff --git a/.github/workflows/windows.yml b/.github/workflows/windows.yml index 434c947cb..66ac2c413 100644 --- a/.github/workflows/windows.yml +++ b/.github/workflows/windows.yml @@ -39,7 +39,7 @@ jobs: - name: Prepare vcpkg uses: lukka/run-vcpkg@v7.4 with: - vcpkgArguments: protobuf + vcpkgArguments: protobuf pcre2 --overlay-ports="${{ github.workspace }}\vcpkg-override\ports\pcre2" vcpkgGitCommitId: 8dddc6c899ce6fdbeab38b525a31e7f23cb2d5bb vcpkgDirectory: ${{ github.workspace }}/vcpkg/ vcpkgTriplet: x64-windows-static @@ -51,7 +51,6 @@ jobs: buildDirectory: ${{ github.workspace }}/build cmakeAppendedArgs: '-G Ninja -DCMAKE_BUILD_TYPE="Release" - -DSSPLIT_USE_INTERNAL_PCRE2="ON" -DUSE_WASM_COMPATIBLE_SOURCE="OFF" -DUSE_STATIC_LIBS="TRUE"' cmakeListsOrSettingsJson: CMakeListsTxtAdvanced diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index 36beacd1e..a08d6bce2 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit 36beacd1ee4d9d591346d8e0f7f7700c7a91eb9f +Subproject commit a08d6bce20619a8475736832d5418458c14db9d4 diff --git a/vcpkg-override/ports/pcre2/pcre2-10.35_fix-uwp.patch b/vcpkg-override/ports/pcre2/pcre2-10.35_fix-uwp.patch new file mode 100644 index 000000000..476dde0f6 --- /dev/null +++ b/vcpkg-override/ports/pcre2/pcre2-10.35_fix-uwp.patch @@ -0,0 +1,10 @@ +--- a/CMakeLists.txt 2020-05-09 16:43:10.000000000 +0200 ++++ b/CMakeLists.txt 2020-06-03 20:57:17.026182500 +0200 +@@ -619,6 +619,7 @@ + + IF(MSVC) + ADD_DEFINITIONS(-D_CRT_SECURE_NO_DEPRECATE -D_CRT_SECURE_NO_WARNINGS) ++ add_compile_options(/wd4146) + ENDIF(MSVC) + + SET(CMAKE_INCLUDE_CURRENT_DIR 1) diff --git a/vcpkg-override/ports/pcre2/portfile.cmake b/vcpkg-override/ports/pcre2/portfile.cmake new file mode 100644 index 000000000..641af1cd1 --- /dev/null +++ b/vcpkg-override/ports/pcre2/portfile.cmake @@ -0,0 +1,72 @@ +set(PCRE2_VERSION 10.37) 
+set(EXPECTED_SHA f91760a8e0747f52211612fb0e134d685e224d16bd884eb574718d077a586b1fd7b6435d4e3b75c879b12e02b252467ecc28cdc4bc2903c783dacab089f99c99) +set(PATCHES + pcre2-10.35_fix-uwp.patch +) + +vcpkg_download_distfile(ARCHIVE + URLS "https://sourceforge.net/projects/pcre/files/pcre2/${PCRE2_VERSION}/pcre2-${PCRE2_VERSION}.zip" + FILENAME "pcre2-${PCRE2_VERSION}.zip" + SHA512 ${EXPECTED_SHA} + SILENT_EXIT +) + +if (EXISTS "${ARCHIVE}") + vcpkg_extract_source_archive_ex( + OUT_SOURCE_PATH SOURCE_PATH + ARCHIVE ${ARCHIVE} + PATCHES ${PATCHES} + ) +else() + vcpkg_from_sourceforge( + OUT_SOURCE_PATH SOURCE_PATH + REPO pcre/pcre2 + REF ${PCRE2_VERSION} + FILENAME "pcre2-${PCRE2_VERSION}.zip" + SHA512 ${EXPECTED_SHA} + PATCHES ${PATCHES} + ) +endif() + +if(VCPKG_CMAKE_SYSTEM_NAME STREQUAL "Emscripten" OR VCPKG_CMAKE_SYSTEM_NAME STREQUAL "iOS") + set(JIT OFF) +else() + set(JIT ON) +endif() + +vcpkg_configure_cmake( + SOURCE_PATH ${SOURCE_PATH} + PREFER_NINJA + OPTIONS + -DPCRE2_BUILD_PCRE2_8=ON + -DPCRE2_BUILD_PCRE2_16=ON + -DPCRE2_BUILD_PCRE2_32=ON + -DPCRE2_SUPPORT_JIT=${JIT} + -DPCRE2_SUPPORT_UNICODE=ON + -DPCRE2_BUILD_TESTS=OFF + -DPCRE2_BUILD_PCRE2GREP=OFF) + +vcpkg_install_cmake() + +file(READ ${CURRENT_PACKAGES_DIR}/include/pcre2.h PCRE2_H) +if(VCPKG_LIBRARY_LINKAGE STREQUAL "static") + string(REPLACE "defined(PCRE2_STATIC)" "1" PCRE2_H "${PCRE2_H}") +else() + string(REPLACE "defined(PCRE2_STATIC)" "0" PCRE2_H "${PCRE2_H}") +endif() +file(WRITE ${CURRENT_PACKAGES_DIR}/include/pcre2.h "${PCRE2_H}") + +vcpkg_fixup_pkgconfig() + +vcpkg_copy_pdbs() + +file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/man) +file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/share/doc) +file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/debug/include) +file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/debug/man) +file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/debug/share) +if(VCPKG_LIBRARY_LINKAGE STREQUAL "static") + file(REMOVE_RECURSE "${CURRENT_PACKAGES_DIR}/bin" "${CURRENT_PACKAGES_DIR}/debug/bin") +endif() + 
+file(INSTALL ${SOURCE_PATH}/COPYING DESTINATION ${CURRENT_PACKAGES_DIR}/share/${PORT} RENAME copyright) diff --git a/vcpkg-override/ports/pcre2/vcpkg.json b/vcpkg-override/ports/pcre2/vcpkg.json new file mode 100644 index 000000000..80d87e8fe --- /dev/null +++ b/vcpkg-override/ports/pcre2/vcpkg.json @@ -0,0 +1,6 @@ +{ + "name": "pcre2", + "version-string": "10.37", + "description": "PCRE2 is a re-working of the original Perl Compatible Regular Expressions library", + "homepage": "https://pcre.org/" +} From d6a14b1d6ff65ddd52780dee2798c5477cec2a62 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Mon, 15 Nov 2021 00:14:21 -0800 Subject: [PATCH 305/442] Fix badge to point to this repo instead mozilla's (#261) --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 156d12875..11f144cce 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Bergamot Translator -[![CircleCI badge](https://img.shields.io/circleci/project/github/mozilla/bergamot-translator/main.svg?label=CircleCI)](https://circleci.com/gh/mozilla/bergamot-translator/) +[![CircleCI badge](https://img.shields.io/circleci/project/github/browsermt/bergamot-translator/main.svg?label=CircleCI)](https://circleci.com/gh/browsermt/bergamot-translator/) Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. 
From f9e55b3cd845478f8cc84b795f0a1e5720991100 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Mon, 15 Nov 2021 22:30:52 +0100 Subject: [PATCH 306/442] Make script run from any directory (#262) * Make script run from any directory --- wasm/test_page/start_server.sh | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-) diff --git a/wasm/test_page/start_server.sh b/wasm/test_page/start_server.sh index 8cb90071c..59d455d14 100644 --- a/wasm/test_page/start_server.sh +++ b/wasm/test_page/start_server.sh @@ -1,11 +1,13 @@ #!/bin/bash -usage="Copy wasm artifacts from build directory and start httpserver +usage="Copy wasm artifacts from the given folder and start httpserver -Usage: $(basename "$0") [WASM_ARTIFACTS_FOLDER] +Usage: $(basename "$0") [ARTIFACTS_SOURCE_FOLDER] where: - WASM_ARTIFACTS_FOLDER Folder containing pre-built wasm artifacts" + ARTIFACTS_SOURCE_FOLDER Directory containing pre-built wasm artifacts" + +SCRIPT_ABSOLUTE_PATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )" if [ "$#" -ne 1 ]; then echo "Illegal number of parameters passed" @@ -13,19 +15,25 @@ if [ "$#" -ne 1 ]; then exit fi -# Check if WASM_ARTIFACTS_FOLDER is valid or not +# Check if ARTIFACTS_SOURCE_FOLDER is valid or not if [ ! -e "$1" ]; then echo "Error: Folder \""$1"\" doesn't exist" exit fi -WASM_ARTIFACTS="$1/bergamot-translator-worker.js $1/bergamot-translator-worker.wasm" -for i in $WASM_ARTIFACTS; do +# Prepare a list all wasm artifacts to be copied and copy them to the destination folder +ARTIFACTS_BASE_NAME="bergamot-translator-worker" +ARTIFACTS="$1/$ARTIFACTS_BASE_NAME.js $1/$ARTIFACTS_BASE_NAME.wasm" +ARTIFACTS_DESTINATION_FOLDER=$SCRIPT_ABSOLUTE_PATH/js + +for i in $ARTIFACTS; do [ -f "$i" ] || breaks - cp $i js/. 
- echo "Copied \"$i\"" + cp $i $ARTIFACTS_DESTINATION_FOLDER + echo "Copied \"$i\" to \"$ARTIFACTS_DESTINATION_FOLDER\"" done -npm install -echo "Start httpserver" -node bergamot-httpserver.js 80 1 0 \ No newline at end of file +# Start http server +(cd $SCRIPT_ABSOLUTE_PATH; +npm install; +echo "Start httpserver"; +node bergamot-httpserver.js 80 1 0) From 2b1b0531ff359c00685f8ef750f83edbcb7bd578 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Wed, 17 Nov 2021 09:18:55 +0100 Subject: [PATCH 307/442] Import optimized gemm implementation (when available) for wasm target (#265) * Enable importing optimized gemm module for wasm - Updated emscripten generated JS code to -- import and use the optimized gemm module when available, otherwise use fallback gemm implementation * Added logging for gemm implementation being used for wasm target --- wasm/import-gemm-module.js | 25 ++++++++++++++++ wasm/patch-artifacts-import-gemm-module.sh | 33 +++++++++------------- 2 files changed, 38 insertions(+), 20 deletions(-) create mode 100644 wasm/import-gemm-module.js diff --git a/wasm/import-gemm-module.js b/wasm/import-gemm-module.js new file mode 100644 index 000000000..e23a69d7f --- /dev/null +++ b/wasm/import-gemm-module.js @@ -0,0 +1,25 @@ + +/* Use an optimized gemm implementation if available, otherwise use the fallback + * implementation. 
+ */ +function createWasmGemm() { + const OPTIMIZED_GEMM = "mozIntGemm"; + const FALLBACK_GEMM = "asm"; + + if (WebAssembly[OPTIMIZED_GEMM]) { + console.log(`Using optimized gemm (${OPTIMIZED_GEMM}) implementation`); + return new WebAssembly.Instance(WebAssembly[OPTIMIZED_GEMM](), {"": {memory: wasmMemory}}).exports; + } + else { + console.log(`Using fallback gemm implementation`); + return { + "int8_prepare_a": (...a) => Module[FALLBACK_GEMM]["int8PrepareAFallback"](...a), + "int8_prepare_b": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFallback"](...a), + "int8_prepare_b_from_transposed": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFromTransposedFallback"](...a), + "int8_prepare_b_from_quantized_transposed": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFromQuantizedTransposedFallback"](...a), + "int8_prepare_bias": (...a) => Module[FALLBACK_GEMM]["int8PrepareBiasFallback"](...a), + "int8_multiply_and_add_bias": (...a) => Module[FALLBACK_GEMM]["int8MultiplyAndAddBiasFallback"](...a), + "int8_select_columns_of_b": (...a) => Module[FALLBACK_GEMM]["int8SelectColumnsOfBFallback"](...a) + } + } +} diff --git a/wasm/patch-artifacts-import-gemm-module.sh b/wasm/patch-artifacts-import-gemm-module.sh index 2f2e29afd..d9fa648fe 100644 --- a/wasm/patch-artifacts-import-gemm-module.sh +++ b/wasm/patch-artifacts-import-gemm-module.sh @@ -1,10 +1,10 @@ #!/bin/bash -usage="Patch wasm artifacts to import fallback implementation of gemm for wasm. +usage="Patch wasm artifacts to import gemm implementation for wasm. 
-Usage: $(basename "$0") [WASM_ARTIFACTS_FOLDER] +Usage: $(basename "$0") [ARTIFACTS_FOLDER] where: - WASM_ARTIFACTS_FOLDER Folder containing wasm artifacts + ARTIFACTS_FOLDER Folder containing wasm artifacts (An optional argument, if unspecified the default is: current folder)" if [ "$#" -gt 1 ]; then @@ -14,31 +14,24 @@ if [ "$#" -gt 1 ]; then fi # Parse wasm artifacts folder if provided via script argument or set it to default -WASM_ARTIFACTS_FOLDER=$PWD +ARTIFACTS_FOLDER=$PWD if [ "$#" -eq 1 ]; then if [ ! -e "$1" ]; then echo "Error: Folder \""$1"\" doesn't exist" exit fi - WASM_ARTIFACTS_FOLDER="$1" + ARTIFACTS_FOLDER="$1" fi -WASM_ARTIFACTS_JAVASCRIPT_FILE="bergamot-translator-worker.js" -WASM_ARTIFACTS="$WASM_ARTIFACTS_FOLDER/${WASM_ARTIFACTS_JAVASCRIPT_FILE}" -if [ ! -e "$WASM_ARTIFACTS" ]; then - echo "Error: Artifact \"$WASM_ARTIFACTS\" doesn't exist" +ARTIFACT="$ARTIFACTS_FOLDER/bergamot-translator-worker.js" +if [ ! -e "$ARTIFACT" ]; then + echo "Error: Artifact \"$ARTIFACT\" doesn't exist" exit fi -echo "Polyfill the fallback integer (8-bit) gemm implementation from the main module" +echo "Importing integer (8-bit) gemm implementation" +SCRIPT_ABSOLUTE_PATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )" sed -i.bak 's/"env"[[:space:]]*:[[:space:]]*asmLibraryArg,/"env": asmLibraryArg,\ - "wasm_gemm":{\ - "int8_prepare_a": (...a) => Module["asm"].int8PrepareAFallback(...a),\ - "int8_prepare_b": (...a) => Module["asm"].int8PrepareBFallback(...a),\ - "int8_prepare_b_from_transposed": (...a) => Module["asm"].int8PrepareBFromTransposedFallback(...a),\ - "int8_prepare_b_from_quantized_transposed": (...a) => Module["asm"].int8PrepareBFromQuantizedTransposedFallback(...a),\ - "int8_prepare_bias": (...a) => Module["asm"].int8PrepareBiasFallback(...a),\ - "int8_multiply_and_add_bias": (...a) => Module["asm"].int8MultiplyAndAddBiasFallback(...a),\ - "int8_select_columns_of_b": (...a) => Module["asm"].int8SelectColumnsOfBFallback(...a),\ - },/g' 
${WASM_ARTIFACTS_JAVASCRIPT_FILE} -echo "SUCCESS" \ No newline at end of file + "wasm_gemm": createWasmGemm(),/g' ${ARTIFACT} +cat $SCRIPT_ABSOLUTE_PATH/import-gemm-module.js >> ${ARTIFACT} +echo "SUCCESS" From 40366162d82e8ddfbb9023da039670ad3d616ecb Mon Sep 17 00:00:00 2001 From: Kenneth Heafield Date: Thu, 25 Nov 2021 13:57:50 +0000 Subject: [PATCH 308/442] HTML input (#253) Co-authored-by: Jelmer van der Linde Co-authored-by: Abhishek Aggarwal --- src/tests/units/CMakeLists.txt | 3 +- src/tests/units/html_tests.cpp | 519 +++++++++++++++++++ src/tests/units/html_tests.h | 9 + src/translator/CMakeLists.txt | 2 + src/translator/definitions.h | 1 + src/translator/html.cpp | 538 ++++++++++++++++++++ src/translator/html.h | 50 ++ src/translator/response_builder.h | 9 +- src/translator/response_options.h | 2 + src/translator/translation_model.cpp | 5 +- src/translator/xh_scanner.cpp | 454 +++++++++++++++++ src/translator/xh_scanner.h | 130 +++++ wasm/bindings/response_options_bindings.cpp | 3 +- wasm/test_page/js/worker.js | 2 +- 14 files changed, 1721 insertions(+), 6 deletions(-) create mode 100644 src/tests/units/html_tests.cpp create mode 100644 src/tests/units/html_tests.h create mode 100644 src/translator/html.cpp create mode 100644 src/translator/html.h create mode 100644 src/translator/xh_scanner.cpp create mode 100644 src/translator/xh_scanner.h diff --git a/src/tests/units/CMakeLists.txt b/src/tests/units/CMakeLists.txt index 2570e05e7..8c29ab397 100644 --- a/src/tests/units/CMakeLists.txt +++ b/src/tests/units/CMakeLists.txt @@ -2,7 +2,8 @@ set(UNIT_TESTS annotation_tests cache_tests - quality_estimator_tests) + quality_estimator_tests + html_tests) foreach(test ${UNIT_TESTS}) add_executable("run_${test}" run_tests.cpp "${test}.cpp") diff --git a/src/tests/units/html_tests.cpp b/src/tests/units/html_tests.cpp new file mode 100644 index 000000000..258847970 --- /dev/null +++ b/src/tests/units/html_tests.cpp @@ -0,0 +1,519 @@ +#include "html_tests.h" + 
+#include <vector>
+
+#include "catch.hpp"
+#include "data/types.h"  // for marian::string_view
+#include "translator/html.h"
+#include "translator/response.h"
+
+using namespace marian::bergamot;
+using marian::string_view;
+
+std::ostream &operator<<(std::ostream &out, std::pair<ByteRange, ByteRange> const &b) {
+  return out << '(' << b.first << ',' << b.second << ')';
+}
+
+std::ostream &operator<<(std::ostream &out, ByteRange const &b) { return out << '{' << b.begin << ',' << b.end << '}'; }
+
+std::vector<ByteRange> AsByteRanges(AnnotatedText const &annotation) {
+  std::vector<ByteRange> words;
+  words.emplace_back(annotation.annotation.gap(0));
+  for (size_t sentenceIdx = 0; sentenceIdx < annotation.numSentences(); ++sentenceIdx) {
+    for (size_t wordIdx = 0; wordIdx < annotation.numWords(sentenceIdx); ++wordIdx)
+      words.emplace_back(annotation.wordAsByteRange(sentenceIdx, wordIdx));
+    words.emplace_back(annotation.annotation.gap(sentenceIdx + 1));
+  }
+  return words;
+}
+
+std::vector<std::string> AsTokens(AnnotatedText const &annotation) {
+  std::vector<std::string> words;
+  words.emplace_back(annotation.gap(0));
+  for (size_t sentenceIdx = 0; sentenceIdx < annotation.numSentences(); ++sentenceIdx) {
+    for (size_t wordIdx = 0; wordIdx < annotation.numWords(sentenceIdx); ++wordIdx)
+      words.emplace_back(annotation.word(sentenceIdx, wordIdx));
+    words.emplace_back(annotation.gap(sentenceIdx + 1));
+  }
+  return words;
+}
+
+void RecordSentenceFromByteRange(AnnotatedText &text, std::vector<ByteRange> const &ranges) {
+  assert(ranges.size() > 0);
+
+  std::vector<string_view> tokens;
+  tokens.reserve(ranges.size());
+
+  for (auto &&range : ranges) tokens.emplace_back(text.text.data() + range.begin, range.size());
+
+  text.recordExistingSentence(tokens.begin(), tokens.end(), text.text.data() + ranges[0].begin);
+}
+
+TEST_CASE("Ignore HTML if process_markup is false") {
+  std::string html_code("

This text & has HTML in it

"); + + std::string input(html_code); + HTML html(std::move(input), false); + CHECK(input == html_code); + + Response response; + response.source.text = html_code; + response.target.text = html_code; + html.Restore(response); + + // Assert that Restore() does not mess with my HTML code + CHECK(response.source.text == html_code); +} + +TEST_CASE("Test reconstruction") { + std::string input("

Hello world how are you?

\n"); + + std::string text(input); + HTML html(std::move(text), true); // TODO: move, but really a reference? + CHECK(text == "Hello world how are you?\n"); + + AnnotatedText source(std::move(text)); + std::vector tokens{ + string_view(source.text.data() + 0, 4), // Hell + string_view(source.text.data() + 4, 1), // o + string_view(source.text.data() + 5, 6), // _world + string_view(source.text.data() + 11, 4), // _how + string_view(source.text.data() + 15, 4), // _are + string_view(source.text.data() + 19, 4), // _you + string_view(source.text.data() + 23, 1), // ? + string_view(source.text.data() + 24, 0), // "\n" (but 0 length?) + }; + + source.recordExistingSentence(tokens.begin(), tokens.end(), source.text.data()); + + Response response; + response.source = source; + + html.Restore(response); + // CHECK(response.source.text == input); // fails because has been moved to the front of the token + CHECK(response.source.text == "

Hello world how are you?

\n"); + + std::vector restored_tokens{ + ByteRange{0, 0 + 0}, // (start of sentence) + ByteRange{0, 0 + 21}, //

Hell + ByteRange{21, 21 + 1}, // o + ByteRange{22, 22 + 9}, // _world + ByteRange{31, 31 + 8}, // _how + ByteRange{39, 39 + 7}, // _are + ByteRange{46, 46 + 4}, // _you + ByteRange{50, 50 + 5}, // ? + ByteRange{55, 55 + 0}, // "" + ByteRange{55, 55 + 5}, //

\n + }; + CHECK(response.source.text.size() == restored_tokens.back().end); + CHECK(AsByteRanges(response.source) == restored_tokens); + + // Same test as above, but easier to read. Will use this further down. + std::vector restored_tokens_str{"", + "

Hell", // Should really be "

Hell" + "o", + " world", + " how", + " are", + " you", + "?", + "", // end of sentence + "

\n"}; + + CHECK(AsTokens(response.source) == restored_tokens_str); +} + +TEST_CASE("Test reconstruction of multiple sentences") { + std::string input("

This is a sentence. And so is this.

\n"); + + HTML html(std::move(input), true); + CHECK(input == "This is a sentence. And so is this.\n"); + + Response response; + response.source = AnnotatedText(std::move(input)); + + RecordSentenceFromByteRange(response.source, { + ByteRange{0, 4}, // 0.0 "This" + ByteRange{4, 7}, // 0.1 " is" + ByteRange{7, 9}, // 0.2 " a" + ByteRange{9, 18}, // 0.3 " sentence" + ByteRange{18, 19}, // 0.4 "." + }); + + RecordSentenceFromByteRange(response.source, { + ByteRange{20, 23}, // 1.0 "And" + ByteRange{23, 26}, // 1.1 " so" + ByteRange{26, 29}, // 1.2 " is" + ByteRange{29, 34}, // 1.3 " this" + ByteRange{34, 35}, // 1.4 "." + }); + + std::vector tokens{"", "This", " is", " a", " sentence", ".", " ", + "And", " so", " is", " this", ".", "\n"}; + + CHECK(AsTokens(response.source) == tokens); + + html.Restore(response); + + std::vector html_tokens{ + "", "

This", " is", " a", " sentence", ".", " ", "And", " so", " is", " this", ".", + "

\n", //

got moved into post-sentence gap + }; + + CHECK(AsTokens(response.source) == html_tokens); +} + +TEST_CASE("Test case html entities") { + // These are all entities I would expect in innerHTML, since all other entities + // can be encoded as UTF-8 so there's no need to encode them through &...; when + // innerHTML encodes the DOM as HTML. + std::string input("

This is a sentence <with> named & entities

\n"); + HTML html(std::move(input), true); + CHECK(input == "This is a sentence named & entities\n"); + + Response response; + response.source = AnnotatedText(std::move(input)); + + RecordSentenceFromByteRange(response.source, { + ByteRange{0, 4}, // 0.0 "This" + ByteRange{4, 7}, // 0.1 " is" + ByteRange{7, 9}, // 0.2 " a" + ByteRange{9, 18}, // 0.3 " sentence" + ByteRange{18, 20}, // 0.4 " <" + ByteRange{20, 24}, // 0.5 "with" + ByteRange{24, 25}, // 0.6 ">" + ByteRange{25, 31}, // 0.7 " named" + ByteRange{31, 33}, // 0.8 " &" + ByteRange{33, 42}, // 0.9 " entities" + ByteRange{42, 42} // 0.10 "" + }); + + html.Restore(response); + + std::vector html_tokens{"", "

This", + " is", " a", + " sentence", + " <", // Oh trouble! The < is completely 'consumed' + "with", ">", + " named", " &", + " entities", "", + "

\n"}; + + CHECK(AsTokens(response.source) == html_tokens); +} + +TEST_CASE("Test self-closing tags should be treated as spaces") { + std::string input("

Space
please?

\n"); + + HTML html(std::move(input), true); + CHECK(input == "Space please?\n"); +} + +TEST_CASE("Test reconstruction of target sentence") { + std::string input("

hello world

\n"); + HTML html(std::move(input), true); + CHECK(input == "hello world\n"); + + AnnotatedText source("hello world\n"); + RecordSentenceFromByteRange(source, { + ByteRange{0, 4}, // 0.0 "hell" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 11}, // 0.2 " world" + ByteRange{11, 11} // 0.3 "" + }); + + AnnotatedText target("hallo Welt\n"); + RecordSentenceFromByteRange(target, { + ByteRange{0, 4}, // 0.0 "hall" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 10}, // 0.2 " Welt" + ByteRange{10, 10} // 0.3 "" + }); + + Response response; + response.source = source; + response.target = target; + + html.Restore(response); + + std::vector html_tokens_source{"", "

hell", "o", " world", "", "

\n"}; + + std::vector html_tokens_target{"", "

hall", "o", " Welt", "", "

\n"}; + + CHECK(AsTokens(response.source) == html_tokens_source); + CHECK(AsTokens(response.target) == html_tokens_target); +} + +TEST_CASE("Test reconstruction of target sentence with entities") { + std::string input("

hello world & friends!

\n"); + HTML html(std::move(input), true); + CHECK(input == "hello world & friends!\n"); + + AnnotatedText source("hello world & friends!\n"); + RecordSentenceFromByteRange(source, { + ByteRange{0, 4}, // 0.0 "hell" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 11}, // 0.2 " world" + ByteRange{11, 13}, // 0.3 " &" + ByteRange{13, 21}, // 0.4 " friends" + ByteRange{21, 22}, // 0.5 "!" + ByteRange{22, 22} // 0.6 "" + }); + + AnnotatedText target("hallo Welt & Freunde!\n"); + RecordSentenceFromByteRange(target, { + ByteRange{0, 4}, // 0.0 "hall" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 10}, // 0.2 " Welt" + ByteRange{10, 12}, // 0.3 " &" + ByteRange{12, 20}, // 0.4 " Freunde" + ByteRange{20, 21}, // 0.5 "!" + ByteRange{21, 21} // 0.6 "" + }); + + Response response; + response.source = source; + response.target = target; + + html.Restore(response); + + std::vector html_tokens_source{"", "

hell", "o", " world", " &", + " friends", "!", "", "

\n"}; + + std::vector html_tokens_target{"", "

hall", "o", " Welt", " &", + + " Freunde", "!", "", "

\n"}; + + CHECK(AsTokens(response.source) == html_tokens_source); + CHECK(AsTokens(response.target) == html_tokens_target); +} + +TEST_CASE("Test reconstruction of target with multiple sentences") { + std::string input( + "

hello world! How does this deal with multiple sentences? Will it work?

\n"); + HTML html(std::move(input), true); + + AnnotatedText source("hello world! How does this deal with multiple sentences? Will it work?\n"); + CHECK(source.text == input); + + RecordSentenceFromByteRange(source, { + ByteRange{0, 4}, // 0.0 "hell" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 11}, // 0.2 " world" + ByteRange{11, 12}, // 0.3 "!" + ByteRange{12, 12} // 0.4 "" + }); + RecordSentenceFromByteRange(source, { + ByteRange{13, 16}, // 1.0 "How" + ByteRange{16, 21}, // 1.1 " does" + ByteRange{21, 26}, // 1.2 " this" + ByteRange{26, 32}, // 1.3 " deal" + ByteRange{32, 37}, // 1.4 " with" + ByteRange{37, 46}, // 1.5 " multiple" + ByteRange{46, 55}, // 1.6 " sentence" + ByteRange{55, 56}, // 1.7 "s" + ByteRange{56, 57}, // 1.8 "?" + ByteRange{57, 57} // 1.9 "" + }); + RecordSentenceFromByteRange(source, { + ByteRange{58, 62}, // 2.0 "Will" + ByteRange{62, 65}, // 2.1 " it" + ByteRange{65, 70}, // 2.2 " work" + ByteRange{70, 71}, // 2.3 "?" + ByteRange{71, 71} // 2.4 "" + }); + + AnnotatedText target("hallo Welt! Wie geht das mit mehreren Sätzen um? Wird es funktionieren?\n"); + RecordSentenceFromByteRange(target, { + ByteRange{0, 4}, // 0.0 "hall" + ByteRange{4, 5}, // 0.1 "o" + ByteRange{5, 10}, // 0.2 " Welt" + ByteRange{10, 11}, // 0.3 "!" + ByteRange{11, 11}, // 0.4 "" + }); + RecordSentenceFromByteRange(target, { + ByteRange{12, 15}, // 1.0 "Wie" + ByteRange{15, 20}, // 1.1 " geht" + ByteRange{20, 24}, // 1.2 " das" + ByteRange{24, 28}, // 1.3 " mit" + ByteRange{28, 37}, // 1.4 " mehreren" + ByteRange{37, 44}, // 1.5 " Sätze" + ByteRange{44, 45}, // 1.6 "n" + ByteRange{45, 48}, // 1.7 " um" + ByteRange{48, 49}, // 1.8 "?" + ByteRange{49, 49}, // 1.9 "" + }); + RecordSentenceFromByteRange(target, { + ByteRange{50, 54}, // 2.0 "Wird" + ByteRange{54, 57}, // 2.1 " es" + ByteRange{57, 71}, // 2.2 " funktionieren" + ByteRange{71, 72}, // 2.3 "?" 
+ ByteRange{72, 72}, // 2.4 "" + }); + + std::vector text_tokens_source{ + "", "hall", "o", " Welt", "!", "", " ", "Wie", " geht", " das", " mit", " mehreren", + " Sätze", "n", " um", "?", "", " ", "Wird", " es", " funktionieren", "?", "", "\n"}; + + CHECK(AsTokens(target) == text_tokens_source); + + Response response; + response.source = source; + response.target = target; + html.Restore(response); + + std::vector html_tokens_source{"", + "

hell", + "o", + " world", + "!", + "", + " ", + "How", + " does", + " this", + " deal", // note how both spaces moved to __deal + " with", + " multiple", + " sentence", + "s", + "?", + "", + " ", + "Will", + " it", + " work", + "?", + "", + "

\n"}; + CHECK(AsTokens(response.source) == html_tokens_source); +} + +TEST_CASE("Test self-closing tag (HTML5)") { + std::string input("

hello world and other creatures

\n"); + HTML html(std::move(input), true); + CHECK(input == "hello world and other creatures\n"); // Note double space between "hello" and "world" +} + +TEST_CASE("Test empty tag", "[!mayfail]") { + std::string input( + "

hello world

\n"); + HTML html(std::move(input), true); + CHECK(input == "hello world\n"); + + Response response; + + std::string sentence_str("hello world"); + std::vector sentence{ + string_view(sentence_str.data() + 0, 4), // 0.0 hell + string_view(sentence_str.data() + 4, 1), // 0.1 o + string_view(sentence_str.data() + 5, 6), // 0.2 _world + string_view(sentence_str.data() + 11, 0), // 0.3 "" + }; + response.source.appendSentence("", sentence.begin(), sentence.end()); + response.source.appendEndingWhitespace("\n"); + + html.Restore(response); + CHECK(response.source.text == + "

hello world

\n"); +} + +TEST_CASE("End-to-end translation") { + std::string input("

I like to drive this car.

\n"); + HTML html(std::move(input), true); + CHECK(input == "I like to drive this car.\n"); + + Response response; + + // clang-format off + response.alignments = std::vector>>{{ + {0.982376, 0.00742467, 0.00682965, 0.00121767, 0.000848056,6.51436e-05,7.53791e-06,0.00123162}, + {0.165639, 0.368694, 0.230394, 0.222476, 0.00349563, 0.00105052, 0.000603092,0.00764845}, + {0.00493271,0.0805876, 0.0139988, 0.89116, 0.000928116,0.00200724, 0.000512013,0.00587302}, + {0.0194648, 0.411029, 0.087059, 0.0477847, 0.26596, 0.111161, 0.000392092,0.0571499}, + {0.00879706,0.492504, 0.0448291, 0.007779, 0.423114, 0.0125523, 0.00119587, 0.00922804}, + {0.00181909,0.00603626, 0.0335758, 0.037193, 0.747266, 0.102497, 0.0585782, 0.0130341}, + {4.1348e-06,0.000156165,2.16369e-05,0.00275059, 0.00183456, 0.992357, 0.0023765, 0.000499018}, + {0.00149043,0.000719392,0.0168534, 0.00430164, 0.00200343, 0.0106381, 0.948566, 0.0154279}, + {0.0903136, 0.0550843, 0.0699474, 0.0792285, 0.223006, 0.207565, 0.129241, 0.145614}, + }}; + // clang-format on + + { + std::string sentence_str("I like to drive this car."); + std::vector sentence{ + string_view(sentence_str.data() + 0, 1), // 0.0 "I" + string_view(sentence_str.data() + 1, 5), // 0.1 " like" + string_view(sentence_str.data() + 6, 3), // 0.2 " to" + string_view(sentence_str.data() + 9, 6), // 0.3 " drive" + string_view(sentence_str.data() + 15, 5), // 0.4 " this" + string_view(sentence_str.data() + 20, 4), // 0.5 " car" + string_view(sentence_str.data() + 24, 1), // 0.6 "." 
+ string_view(sentence_str.data() + 25, 0), // 0.7 "" + }; + response.source.appendSentence("", sentence.begin(), sentence.end()); + response.source.appendEndingWhitespace("\n"); + } + + { + std::string sentence_str("Ich fahre gerne dieses Auto."); + std::vector sentence{ + string_view(sentence_str.data() + 0, 3), // 0.0 "Ich" + string_view(sentence_str.data() + 3, 1), // 0.1 " " + string_view(sentence_str.data() + 4, 4), // 0.2 "fahr" + string_view(sentence_str.data() + 8, 1), // 0.3 "e" + string_view(sentence_str.data() + 9, 6), // 0.4 " gerne" + string_view(sentence_str.data() + 15, 7), // 0.5 " dieses" + string_view(sentence_str.data() + 22, 5), // 0.6 " Auto" + string_view(sentence_str.data() + 27, 1), // 0.7 "." + string_view(sentence_str.data() + 28, 0), // 0.8 "" + }; + response.target.appendSentence("", sentence.begin(), sentence.end()); + response.target.appendEndingWhitespace("\n"); + } + + html.Restore(response); + + { + AnnotatedText source; + std::string sentence_str("

I like to drive this car."); + std::vector sentence{ + string_view(sentence_str.data() + 0, 4), // 0.0 "

I" + string_view(sentence_str.data() + 4, 8), // 0.1 " like" + string_view(sentence_str.data() + 12, 7), // 0.2 " to" + string_view(sentence_str.data() + 19, 9), // 0.3 " drive" + string_view(sentence_str.data() + 28, 9), // 0.4 " this" + string_view(sentence_str.data() + 37, 4), // 0.5 " car" + string_view(sentence_str.data() + 41, 1), // 0.6 "." + string_view(sentence_str.data() + 42, 0), // 0.7 "" + }; + source.appendSentence("", sentence.begin(), sentence.end()); + source.appendEndingWhitespace("

\n"); + + CHECK(AsTokens(response.source) == AsTokens(source)); + } + + { + AnnotatedText target; + std::string sentence_str("

Ich fahre gerne dieses Auto."); + std::vector sentence{ + string_view(sentence_str.data() + 0, 6), // 0.0 "

Ich" + string_view(sentence_str.data() + 6, 4), // 0.1 " " + string_view(sentence_str.data() + 10, 4), // 0.2 "fahr" + string_view(sentence_str.data() + 14, 1), // 0.3 "e" + string_view(sentence_str.data() + 15, 13), // 0.4 " gerne" + string_view(sentence_str.data() + 28, 11), // 0.5 " dieses" + string_view(sentence_str.data() + 39, 5), // 0.6 " Auto" + string_view(sentence_str.data() + 44, 1), // 0.7 "." + string_view(sentence_str.data() + 45, 0), // 0.8 "" + }; + target.appendSentence("", sentence.begin(), sentence.end()); + target.appendEndingWhitespace("

\n");
+
+    CHECK(AsTokens(response.target) == AsTokens(target));
+  }
+}
+
+// TEST_CASE("")
\ No newline at end of file
diff --git a/src/tests/units/html_tests.h b/src/tests/units/html_tests.h
new file mode 100644
index 000000000..0407b65b2
--- /dev/null
+++ b/src/tests/units/html_tests.h
@@ -0,0 +1,9 @@
+#pragma once
+#include <iostream>
+
+#include "translator/definitions.h"
+
+std::ostream &operator<<(std::ostream &out, marian::bergamot::ByteRange const &b);
+
+std::ostream &operator<<(std::ostream &out,
+                         std::pair<marian::bergamot::ByteRange, marian::bergamot::ByteRange> const &b);
diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt
index ab1448800..6779b0fa4 100644
--- a/src/translator/CMakeLists.txt
+++ b/src/translator/CMakeLists.txt
@@ -15,6 +15,8 @@ add_library(bergamot-translator STATIC
     annotation.cpp
     service.cpp
     parser.cpp
+    html.cpp
+    xh_scanner.cpp
 )
 if (USE_WASM_COMPATIBLE_SOURCE)
 # Using wasm compatible sources should include this compile definition;
diff --git a/src/translator/definitions.h b/src/translator/definitions.h
index 66ebb03b4..2ac6bf0ef 100644
--- a/src/translator/definitions.h
+++ b/src/translator/definitions.h
@@ -39,6 +39,7 @@ struct ByteRange {
   size_t begin;
   size_t end;
   const size_t size() const { return end - begin; }
+  bool operator==(ByteRange other) const { return begin == other.begin && end == other.end; }
 };
 
 class Response;
diff --git a/src/translator/html.cpp b/src/translator/html.cpp
new file mode 100644
index 000000000..0614c37e5
--- /dev/null
+++ b/src/translator/html.cpp
@@ -0,0 +1,538 @@
+#include "html.h"
+
+#include "response.h"
+#include "xh_scanner.h"
+
+namespace {
+using marian::string_view;
+using marian::bergamot::AnnotatedText;
+using marian::bergamot::ByteRange;
+using marian::bergamot::HTML;
+using marian::bergamot::Response;
+
+void EncodeEntities(string_view const &input, std::string &output) {
+  output.clear();
+  output.reserve(input.size());
+
+  for (auto it = input.begin(); it != input.end(); ++it) {
+    switch (*it) {
+      case '&':
+        output.append("&amp;");
+        break;
+      case '<':
+        output.append("&lt;");
+        break;
+      case '>':
+        output.append("&gt;");
+        break;
+      // case ???:
+      //   output.append("&nbsp;");
+      //   break;
+      // case '"':
+      //   output.append("&quot;");
+      //   break;
+      // case '\'':
+      //   output.append("&apos;");
+      //   break;
+      default:
+        output.push_back(*it);
+        break;
+    }
+  }
+}
+
+size_t CountPrefixWhitespaces(string_view const &input) {
+  size_t size = 0;
+  while (size < input.size() && input[size] == ' ') ++size;
+  return size;
+}
+
+std::ostream &operator<<(std::ostream &out, HTML::Tag const *tag) {
+  if (tag == nullptr) return out << "[nullptr]";
+  out << '<' << tag->name << tag->attributes;
+  if (tag->empty) out << '/';
+  return out << '>';
+}
+
+std::ostream &operator<<(std::ostream &out, HTML::Taint const &tags) {
+  for (auto it = tags.begin(); it != tags.end(); ++it) {
+    if (it != tags.begin()) out << ' ';
+    out << *it;
+  }
+  return out;
+}
+
+// Very simple replacement for std::format introduced in C++20
+std::string format(std::string const &format_str) { return format_str; }
+
+template <typename Arg>
+std::string format(std::string const &format_str, Arg arg) {
+  std::ostringstream os;
+  auto index = format_str.find("{}");
+  assert(index != std::string::npos);
+  os << format_str.substr(0, index) << arg << format_str.substr(index + 2);
+  return os.str();
+}
+
+template <typename Arg, typename... Args>
+std::string format(std::string const &format_str, Arg arg, Args... args) {
+  std::ostringstream os;
+  auto index = format_str.find("{}");
+  assert(index != std::string::npos);
+  os << format_str.substr(0, index) << arg << format(format_str.substr(index + 2), std::forward<Args>(args)...);
+  return os.str();
+}
+
+bool IsBlockElement(std::string const &name) {
+  // List of elements that we expect might occur inside words, and that should
+  // not introduce spacings around them. Not strictly inline elements, nor flow
+  // elements.
+  // See also https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
+  static std::unordered_set<std::string> inline_ish_elements{
+      "abbr",  "a",    "b",      "em",  "i",   "kbd",  "mark", "math", "output", "q",   "ruby",
+      "small", "span", "strong", "sub", "sup", "time", "u",    "var",  "wbr",    "ins", "del"};
+
+  return inline_ish_elements.find(name) == inline_ish_elements.end();
+}
+
+bool IsEmtpyElement(std::string const &name) {
+  // List of elements for which we do not expect a closing tag, or self-closing
+  // elements in XHTML. See also https://developer.mozilla.org/en-US/docs/Glossary/Empty_element
+  static std::unordered_set<std::string> empty_elements{"area",  "base", "br",   "col",   "embed",  "hr",    "img",
+                                                        "input", "link", "meta", "param", "source", "track", "wbr"};
+
+  return empty_elements.find(name) != empty_elements.end();
+}
+
+void DiffTags(HTML::Taint const &prev, HTML::Taint const &curr, HTML::Taint &opening, HTML::Taint &closing) {
+  opening.clear();
+  closing.clear();
+
+  size_t i = 0;
+
+  // Find first difference
+  for (; i < prev.size(); ++i)
+    if (i >= curr.size() || prev[i] != curr[i]) break;
+
+  std::copy_if(prev.begin() + i, prev.end(), std::back_inserter(closing), [&](HTML::Tag *tag) { return !tag->empty; });
+
+  opening.insert(opening.end(), curr.begin() + i, curr.end());
+}
+
+bool Intersects(ByteRange const &range, HTML::Span const &span) {
+  return range.begin <= span.end && range.end >= span.begin;
+}
+
+void FilterEmpty(HTML::Taint &stack) {
+  auto dst = stack.begin();
+
+  for (auto src = stack.begin(); src != stack.end(); ++src)
+    if (!(*src)->empty) *(dst++) = *src;
+
+  stack.resize(dst - stack.begin());
+}
+
+template <typename Fun>
+AnnotatedText Apply(AnnotatedText const &in, Fun fun) {
+  AnnotatedText out;
+
+  for (size_t sentenceIdx = 0; sentenceIdx < in.numSentences(); ++sentenceIdx) {
+    std::string sentence;
+    std::vector<ByteRange> tokens;
+
+    std::string prefix = fun(in.annotation.gap(sentenceIdx), in.gap(sentenceIdx), false);
+
+    for (size_t wordIdx = 0; wordIdx < in.numWords(sentenceIdx); ++wordIdx) {
+      std::string token = fun(in.wordAsByteRange(sentenceIdx, wordIdx), in.word(sentenceIdx, wordIdx), false);
+      tokens.push_back(ByteRange{sentence.size(), sentence.size() + token.size()});
+      sentence += token;
+    }
+
+    // Convert our ByteRanges to string_views since that's what appendSentence
+    // expects
+    // TODO: extend AnnotatedText::appendSentence to accept str + ByteRanges
+    // directly
+    std::vector<string_view> token_views(tokens.size());
+    std::transform(tokens.begin(), tokens.end(), token_views.begin(),
+                   [&](ByteRange const &range) { return string_view(sentence.data() + range.begin, range.size()); });
+
+    out.appendSentence(prefix, token_views.begin(), token_views.end());
+  }
+
+  out.appendEndingWhitespace(fun(in.annotation.gap(in.numSentences()), in.gap(in.numSentences()), true));
+
+  return out;
+}
+
+bool IsContinuation(string_view str) { return !str.empty() && str.compare(0, 1, " ", 1) != 0; }
+
+void HardAlignments(Response const &response, std::vector<std::vector<size_t>> &alignments) {
+  // For each sentence...
+  for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) {
+    alignments.emplace_back();
+    assert(response.alignments[sentenceIdx].size() == response.target.numWords(sentenceIdx));
+
+    // Hard-align: find for each target token the most prevalent source token
+    for (size_t t = 0; t < response.alignments[sentenceIdx].size(); ++t) {
+      size_t s_max = 0;
+      for (size_t s = 1; s < response.alignments[sentenceIdx][t].size(); ++s) {
+        if (response.alignments[sentenceIdx][t][s] > response.alignments[sentenceIdx][t][s_max]) {
+          s_max = s;
+        }
+      }
+
+      alignments.back().push_back(s_max);
+    }
+
+    // Next, we try to smooth out these selected alignments with a few heuristics
+    for (size_t t = 0; t < response.target.numWords(sentenceIdx); ++t) {
+      // If this token is a continuation of a previous token, pick the tags from the most
+      // prevalent token for the whole word.
+      if (t > 0 && IsContinuation(response.target.word(sentenceIdx, t))) {
+        // Note: only looking at the previous token since that will already
+        // have this treatment applied to it.
+        size_t s_curr = alignments.back()[t];
+        size_t s_prev = alignments.back()[t - 1];
+        float score_curr = response.alignments[sentenceIdx][t][s_curr];
+        float score_prev = response.alignments[sentenceIdx][t - 1][s_prev];
+
+        size_t s_max = score_curr > score_prev ? s_curr : s_prev;
+
+        // Apply this to all previous tokens in the word. Note: `i` is unsigned,
+        // so count down with `i-- > 0` rather than `i >= 0`, which would never
+        // become false.
+        for (size_t i = t + 1; i-- > 0;) {
+          alignments.back()[i] = s_max;
+
+          // Stop if this was the beginning of the word
+          if (!IsContinuation(response.target.word(sentenceIdx, i))) break;
+        }
+      }
+    }
+  }
+}
+
+void InterpolateAlignments(Response const &response, std::vector<std::vector<size_t>> &alignments) {
+  for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) {
+    alignments.emplace_back();
+    double ratio = (double)response.source.numWords(sentenceIdx) / response.target.numWords(sentenceIdx);
+
+    for (size_t wordIdx = 0; wordIdx < response.target.numWords(sentenceIdx); ++wordIdx) {
+      size_t source_token_idx = static_cast<size_t>(ratio * wordIdx);
+      assert(source_token_idx < response.source.numWords(sentenceIdx));
+      alignments.back().push_back(source_token_idx);
+    }
+  }
+}
+
+void CopyTaint(Response const &response, std::vector<std::vector<size_t>> const &alignments,
+               std::vector<HTML::Taint> const &token_tags, std::vector<HTML::Taint> &token_tags_target) {
+  size_t token_offset = 0;
+
+  // Fill token_tags_target based on the alignments we just made up.
+  // NOTE: this should match the exact order of Apply()
+  for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) {
+    token_tags_target.push_back(token_tags[token_offset]);  // token_tag for sentence prefix gap
+    for (size_t t = 0; t < response.target.numWords(sentenceIdx); ++t) {
+      size_t s = alignments[sentenceIdx][t];
+      assert(s < response.source.numWords(sentenceIdx));
+      token_tags_target.push_back(token_tags[token_offset + 1 + s]);  // +1 for prefix gap
+    }
+
+    token_offset += response.source.numWords(sentenceIdx) + 1;  // +1 for prefix gap
+  }
+
+  assert(token_offset < token_tags.size());
+  token_tags_target.push_back(token_tags[token_offset]);  // token_tag for ending whitespace
+}
+
+AnnotatedText RestoreSource(AnnotatedText const &in, std::vector<HTML::Taint> &token_tags,
+                            std::vector<HTML::Span>::const_iterator span_it,
+                            std::vector<HTML::Span>::const_iterator span_end) {
+  auto prev_it = span_it;  // safe because the first span is always an empty span,
+                           // and the while-loop below will do the rest
+
+  // workspace variables for lambda
+  std::string html;
+  HTML::Taint opening, closing;
+
+  return Apply(in, [&](ByteRange range, string_view token, bool last) {
+    // Do encoding of any entities that popped up in the translation
+    // (Also effectively clears html from previous call)
+    EncodeEntities(token, html);
+
+    size_t offset = 0;  // Size added by prepending HTML
+    size_t whitespace_size = CountPrefixWhitespaces(token);
+
+    // Potential issue: spans and tokens can intersect, e.g.
+    //
+    // text

h e ll o

+ // spans |1| |2| |3333| (so only 2 is tainted with

, others only

) + // tokens |111111111111111|2| + // + // Now 1 covers span 1 to 3, so what taint should it get? Just

, or

? + + // Seek to the last span that overlaps with this token + while (true) { + DiffTags(prev_it->tags, span_it->tags, opening, closing); + prev_it = span_it; + + for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) { + std::string close_tag = format("", (*cit)->name); + html.insert(offset, close_tag); + offset += close_tag.size(); + } + + for (HTML::Tag const *tag : opening) { + std::string open_tag = format("<{}{}>", tag->name, tag->attributes); + html.insert(offset + whitespace_size, open_tag); + offset += open_tag.size(); + } + + if (span_it + 1 != span_end && ((span_it + 1)->begin < range.end || last)) { + span_it++; + continue; + } + + break; + } + + // TODO: This is just the taint of the last span, not the ones in between + // I don't know if that is okay for transferring taints. We'll need to test. + token_tags.push_back(prev_it->tags); + + return html; + }); +} + +AnnotatedText RestoreTarget(AnnotatedText const &in, std::vector const &token_tags_target) { + auto token_prev_it = token_tags_target.begin(); + auto token_tags_it = token_tags_target.begin() + 1; + + // workspace for lambda + std::string html; + HTML::Taint opening, closing; + + AnnotatedText out = Apply(in, [&](ByteRange range, string_view token, bool last) { + // Do encoding of any entities that popped up in the translation + // (Also effectively clears html from previous call) + EncodeEntities(token, html); + + size_t offset = 0; // Size added by prepending HTML + size_t whitespace_size = CountPrefixWhitespaces(token); + + assert(token_tags_it != token_tags_target.end()); + DiffTags(*token_prev_it, *token_tags_it, opening, closing); + + for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) { + std::string close_tag = format("", (*cit)->name); + html.insert(offset, close_tag); + offset += close_tag.size(); + } + + for (HTML::Tag const *tag : opening) { + std::string open_tag = format("<{}{}>", tag->name, tag->attributes); + html.insert(offset + whitespace_size, 
open_tag); + offset += open_tag.size(); + } + + // If this is the last token of the response, close all open tags. + if (last) { + for (auto cit = token_tags_it->crbegin(); cit != token_tags_it->crend(); ++cit) { + html += format("", (*cit)->name); + } + } + + ++token_prev_it; + ++token_tags_it; + + return html; + }); + + // Assert that we did in fact use all our taints + assert(token_tags_it == token_tags_target.end()); + + return out; +} + +std::ostream &DebugPrintMapping(std::ostream &out, Response const &response, + std::vector> const &alignments, + std::vector const &token_tags_target) { + auto taints = token_tags_target.begin(); + for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { + out << "Mapped sentence prefix with tags: "; + for (auto &&taint : *(++taints)) out << '/' << taint->name; + out << '\n'; + + for (size_t wordIdx = 0; wordIdx < response.target.numWords(sentenceIdx); ++wordIdx) { + assert(sentenceIdx < alignments.size()); + assert(wordIdx < alignments[sentenceIdx].size()); + + out << "Mapped "; + out << std::setw(10) << std::setfill(' ') << response.target.word(sentenceIdx, wordIdx); + out << " to "; + out << std::setw(10) << std::setfill(' ') << response.source.word(sentenceIdx, alignments[sentenceIdx][wordIdx]); + out << " with tags: "; + for (auto &&taint : *(++taints)) out << '/' << taint->name; + out << '\n'; + } + } + + out << "Mapped end-of-input with tags: "; + for (auto &&taint : *(++taints)) out << '/' << taint->name; + out << '\n'; + + assert(++taints == token_tags_target.end()); + return out; +} + +std::ostream &DebugPrintAlignmentScores(std::ostream &out, Response const &response) { + out << "std::vector>> alignments{\n"; + for (size_t sentenceIdx = 0; sentenceIdx < response.source.numSentences(); ++sentenceIdx) { + out << " {\n"; + for (size_t t = 0; t < response.alignments[sentenceIdx].size(); ++t) { + out << " {"; + for (size_t s = 0; s < response.alignments[sentenceIdx][t].size(); ++s) { + 
out << std::fixed << std::setw(8) << std::setprecision(8) << std::setfill(' ') + << response.alignments[sentenceIdx][t][s]; + out << ", "; + } + out << "},\n"; + } + out << " },\n"; + } + return out << "};\n"; +} + +size_t DebugCountTokens(AnnotatedText const &text) { + size_t tokens = 1; // for the ending gap + for (size_t sentenceIdx = 0; sentenceIdx < text.numSentences(); ++sentenceIdx) { + tokens += 1 + text.numWords(sentenceIdx); // pre-sentence prefix/gap + each word + } + return tokens; +} + +} // namespace + +namespace marian { +namespace bergamot { + +HTML::HTML(std::string &&source, bool process_markup) { + if (!process_markup) return; + std::string original = std::move(source); + markup::instream in(original.data(), original.data() + original.size()); + markup::scanner scanner(in); + source.clear(); // source is moved out of, so should be clear anyway + + Taint stack; + spans_.push_back(Span{0, 0, {}}); + + bool stop = false; + while (!stop) { + switch (scanner.get_token()) { + case markup::scanner::TT_ERROR: + throw BadHTML("HTML parse error"); + + case markup::scanner::TT_EOF: + stop = true; + break; + + case markup::scanner::TT_TEXT: { + auto begin = source.size(); + source.append(scanner.get_value()); + spans_.push_back(Span{begin, source.size(), stack}); + FilterEmpty(stack); + } break; + + case markup::scanner::TT_TAG_START: + // If it makes sense to treat this element as a break in a word (e.g. + //
<p>, <div>, <li>) make sure it does so in this text as well.
+        // TODO: Strong assumption here that the language uses spaces to
+        // separate words
+        if (IsBlockElement(scanner.get_tag_name()) && !source.empty() && source.back() != ' ') source.push_back(' ');
+
+        pool_.emplace_back(new Tag{
+            scanner.get_tag_name(), std::string(),
+            IsEmtpyElement(scanner.get_tag_name())  // TODO: detect empty elements by doing a second pass and detecting
+                                                    // non-closed elements?
+        });
+
+        stack.push_back(pool_.back().get());
+        break;
+
+      case markup::scanner::TT_TAG_END:
+        // Note: self-closing tags emit TT_TAG_END immediately after TT_TAG_START
+        // but since we're parsing HTML5, a sole <img/> will never emit a TT_TAG_END
+        if (stack.empty())
+          throw BadHTML(format("Encountered more closing tags ({}) than opening tags", scanner.get_tag_name()));
+
+        // TODO: what to do with "" case, where tag is immediately closed
+        // so it never makes it into the taint of any of the spans? Add it as
+        // an empty tag to the previous/following?
+        if (stack.back()->name != scanner.get_tag_name())
+          throw BadHTML(format("Encountered unexpected closing tag </{}>, stack is {}", scanner.get_tag_name(), stack));
+        stack.pop_back();
+        break;
+
+      case markup::scanner::TT_ATTR:
+        // TODO could be more efficient if format() accepted a destination, i.e. format_to?
+        stack.back()->attributes += format(" {}=\"{}\"", scanner.get_attr_name(), scanner.get_value());
+        break;
+
+      default:
+        break;
+    }
+  }
+
+  if (!stack.empty()) throw BadHTML(format("Not all tags were closed: {}", stack));
+
+  // Add a trailing span (that's empty) to signify all closed tags.
+  spans_.emplace_back(Span{source.size() + 1, source.size() + 1, stack});
+}
+
+void HTML::Restore(Response &response) {
+  if (spans_.empty()) return;
+
+  // Reconstruction of HTML tags:
+  // 1. Map each token to a Span
+  // 2. Apply the taint of that span to the token
+  // 3. Reconstruct the source HTML with these tainted tokens
+  // 4.
Transfer the taint from the source tokens to the target tokens using alignment information
+  // 5. Reconstruct the target HTML with these tainted tokens
+
+  std::vector<Taint> token_tags;  // List of HTML tags active per token in source
+                                  // Calculating these is a side-effect of restoring
+                                  // the HTML in response.source.
+
+  AnnotatedText source = RestoreSource(response.source, token_tags, spans_.cbegin(), spans_.cend());
+  assert(token_tags.size() == DebugCountTokens(response.source));
+
+  // For every token in target, find the token in source that best matches.
+  std::vector<std::vector<size_t>> alignments;
+
+  // If we do have alignment information from the model, we use that to taint
+  // tokens with the tags from their source token counterpart. If there is no
+  // alignment information available, we just interpolate based on sentence
+  // length (badly).
+  if (!response.alignments.empty()) {
+    // DebugPrintAlignmentScores(std::cerr, response);
+    HardAlignments(response, alignments);
+  } else {
+    InterpolateAlignments(response, alignments);
+  }
+
+  std::vector<Taint> token_tags_target;
+  token_tags_target.emplace_back();  // add empty one to the beginning for easy
+                                     // life later on (we start iterating at 1,
+                                     // and can then do i - 1 for empty).
+  CopyTaint(response, alignments, token_tags, token_tags_target);
+  assert(token_tags_target.size() == DebugCountTokens(response.target) + 1);
+
+  // DebugPrintMapping(std::cerr, response, alignments, token_tags_target);
+
+  AnnotatedText target = RestoreTarget(response.target, token_tags_target);
+
+  response.source = source;
+  response.target = target;
+}
+
+}  // namespace bergamot
+}  // namespace marian
diff --git a/src/translator/html.h b/src/translator/html.h
new file mode 100644
index 000000000..ba4691541
--- /dev/null
+++ b/src/translator/html.h
@@ -0,0 +1,50 @@
+#ifndef SRC_BERGAMOT_HTML_H_
+#define SRC_BERGAMOT_HTML_H_
+
+#include <stdexcept>
+#include <string>
+
+#include "definitions.h"
+
+namespace marian {
+namespace bergamot {
+
+struct Response;
+
+class BadHTML : public std::runtime_error {
+ public:
+  explicit BadHTML(std::string const &what) : std::runtime_error(what) {}
+};
+
+class HTML {
+ public:
+  struct Tag {
+    std::string name;
+    std::string attributes;
+    bool empty;
+  };
+
+  typedef std::vector<Tag *> Taint;
+
+  struct Span {
+    size_t begin;
+    size_t end;
+    Taint tags;  // Note: free pointer! Lifetime of tags is managed by pool_
+    inline size_t size() const { return end - begin; }
+  };
+
+  explicit HTML(std::string &&source, bool process_markup);
+  void Restore(Response &response);
+
+ private:
+  // List of text spans, and which tags are applied to them
+  std::vector<Span> spans_;
+
+  // a pool of tags that we free when HTML goes out of scope
+  std::vector<std::unique_ptr<Tag>> pool_;
+};
+
+}  // namespace bergamot
+}  // namespace marian
+
+#endif  // SRC_BERGAMOT_HTML_H_
diff --git a/src/translator/response_builder.h b/src/translator/response_builder.h
index 36bae1e9e..b9d163a2e 100644
--- a/src/translator/response_builder.h
+++ b/src/translator/response_builder.h
@@ -4,6 +4,7 @@
 #include <vector>
 
 #include "data/types.h"
+#include "html.h"
 #include "quality_estimator.h"
 #include "response.h"
 #include "response_options.h"
@@ -30,12 +31,13 @@ class ResponseBuilder {
   /// @param [in] qualityEstimator: the QualityEstimator model that can be used
   /// to provide translation quality probability.
   ResponseBuilder(ResponseOptions responseOptions, AnnotatedText &&source, const Vocabs &vocabs,
-                  std::function<void(Response &&)> callback, const QualityEstimator &qualityEstimator)
+                  std::function<void(Response &&)> callback, const QualityEstimator &qualityEstimator, HTML &&html)
       : responseOptions_(responseOptions),
         source_(std::move(source)),
         vocabs_(vocabs),
         callback_(std::move(callback)),
-        qualityEstimator_(qualityEstimator) {}
+        qualityEstimator_(qualityEstimator),
+        html_(std::move(html)) {}
 
   /// Constructs and sets the promise of a Response object from obtained
   /// histories after translating.
@@ -62,6 +64,7 @@ class ResponseBuilder {
       if (responseOptions_.alignment) {
         buildAlignments(histories, response);
       }
+      html_.Restore(response);
       callback_(std::move(response));
     }
 
@@ -94,6 +97,8 @@ class ResponseBuilder {
   AnnotatedText source_;
 
   const QualityEstimator &qualityEstimator_;
+
+  HTML html_;
 };
 }  // namespace bergamot
 }  // namespace marian
diff --git a/src/translator/response_options.h b/src/translator/response_options.h
index 43b1c433b..b5867d00d 100644
--- a/src/translator/response_options.h
+++ b/src/translator/response_options.h
@@ -19,6 +19,8 @@ struct ResponseOptions {
   bool qualityScores{false};  ///< Include quality-scores or not.
   bool alignment{false};      ///< Include alignments or not.
 
+  bool HTML{false};  ///< Remove HTML tags from text and (TODO) insert in output.
+
   /// Whether to include sentenceMappings or not. Alignments require
   /// sentenceMappings and are available irrespective of this option if
   /// `alignment=true`.
diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp
index 5cf2b85f4..9d2eb0cdb 100644
--- a/src/translator/translation_model.cpp
+++ b/src/translator/translation_model.cpp
@@ -6,6 +6,7 @@
 #include "common/logging.h"
 #include "data/corpus.h"
 #include "data/text_input.h"
+#include "html.h"
 #include "parser.h"
 #include "translator/beam_search.h"
 
@@ -94,8 +95,10 @@ Ptr<Request> TranslationModel::makeRequest(size_t requestId, std::string &&source,
   Segments segments;
   AnnotatedText annotatedSource;
 
+  HTML html(std::move(source), responseOptions.HTML);
   textProcessor_.process(std::move(source), annotatedSource, segments);
-  ResponseBuilder responseBuilder(responseOptions, std::move(annotatedSource), vocabs_, callback, *qualityEstimator_);
+  ResponseBuilder responseBuilder(responseOptions, std::move(annotatedSource), vocabs_, callback, *qualityEstimator_,
+                                  std::move(html));
 
   Ptr<Request> request =
       New<Request>(requestId, /*model=*/*this, std::move(segments), std::move(responseBuilder), cache);
diff --git a/src/translator/xh_scanner.cpp
b/src/translator/xh_scanner.cpp new file mode 100644 index 000000000..78ae13526 --- /dev/null +++ b/src/translator/xh_scanner.cpp @@ -0,0 +1,454 @@ +// https://www.codeproject.com/Articles/14076/Fast-and-Compact-HTML-XML-Scanner-Tokenizer +// BSD license + +#include "xh_scanner.h" + +#include +#include + +namespace markup { + +// case sensitive string equality test +// s_lowcase shall be lowercase string +inline bool equal(const char *s, const char *s1, size_t length) { return strncmp(s, s1, length) == 0; } + +const char *scanner::get_value() { + value[value_length] = 0; + return value; +} + +const char *scanner::get_attr_name() { + attr_name[attr_name_length] = 0; + return attr_name; +} + +const char *scanner::get_tag_name() { + tag_name[tag_name_length] = 0; + return tag_name; +} + +scanner::token_type scanner::scan_body() { + text_begin = input.p; + if (input_char) { + --text_begin; + } + text_end = text_begin; + value_length = 0; + char c = get_char(); + + if (c == 0) + return TT_EOF; + else if (c == '<') + return scan_tag(); + else if (c == '&') + return scan_entity(); + + while (true) { + append_value(c); + ++text_end; + + c = get_char(); + + if (c == 0) { + push_back(c); + break; + } + if (c == '<') { + push_back(c); + break; + } + if (c == '&') { + push_back(c); + break; + } + } + return TT_TEXT; +} + +scanner::token_type scanner::scan_head() { + char c = skip_whitespace(); + + if (c == '>') { + if (equal(tag_name, "script", 6)) { + // script is special because we want to parse the attributes, + // but not the content + c_scan = &scanner::scan_special; + return scan_special(); + } else if (equal(tag_name, "style", 5)) { + // same with style + c_scan = &scanner::scan_special; + return scan_special(); + } + c_scan = &scanner::scan_body; + return scan_body(); + } + if (c == '/') { + char t = get_char(); + if (t == '>') { + // self closing tag + c_scan = &scanner::scan_body; + return TT_TAG_END; + } else { + push_back(t); + return TT_ERROR; + } // erroneous 
situtation - standalone '/' + } + + attr_name_length = 0; + value_length = 0; + + // attribute name... + while (c != '=') { + if (c == 0) return TT_EOF; + if (c == '>') { + push_back(c); + return TT_ATTR; + } // attribute without value (HTML style) + if (is_whitespace(c)) { + c = skip_whitespace(); + if (c != '=') { + push_back(c); + return TT_ATTR; + } // attribute without value (HTML style) + else + break; + } + if (c == '<') return TT_ERROR; + append_attr_name(c); + c = get_char(); + } + + c = skip_whitespace(); + // attribute value... + + if (c == '\"') { + c = get_char(); + while (c) { + if (c == '\"') return TT_ATTR; + // if (c == '&') c = scan_entity(); + append_value(c); + c = get_char(); + } + } else if (c == '\'') // allowed in html + { + c = get_char(); + while (c) { + if (c == '\'') return TT_ATTR; + // if (c == '&') c = scan_entity(); + append_value(c); + c = get_char(); + } + } else // scan token, allowed in html: e.g. align=center + { + c = get_char(); + do { + if (is_whitespace(c)) return TT_ATTR; + /* these two removed in favour of better html support: + if( c == '/' || c == '>' ) { push_back(c); return TT_ATTR; } + if( c == '&' ) c = scan_entity();*/ + if (c == '>') { + push_back(c); + return TT_ATTR; + } + append_value(c); + c = get_char(); + } while (c); + } + + return TT_ERROR; +} + +// caller already consumed '<' +// scan header start or tag tail +scanner::token_type scanner::scan_tag() { + tag_name_length = 0; + + char c = get_char(); + + bool is_tail = c == '/'; + if (is_tail) c = get_char(); + + while (c) { + if (is_whitespace(c)) { + c = skip_whitespace(); + break; + } + if (c == '/' || c == '>') break; + append_tag_name(c); + + switch (tag_name_length) { + case 3: + if (equal(tag_name, "!--", 3)) { + c_scan = &scanner::scan_comment; + return TT_COMMENT_START; + } + break; + case 8: + if (equal(tag_name, "![CDATA[", 8)) { + c_scan = &scanner::scan_cdata; + return TT_CDATA_START; + } + break; + case 7: + if (equal(tag_name, "!ENTITY", 8)) { 
+ c_scan = &scanner::scan_entity_decl; + return TT_ENTITY_START; + } + break; + } + + c = get_char(); + } + + if (c == 0) return TT_ERROR; + + if (is_tail) { + if (c == '>') return TT_TAG_END; + return TT_ERROR; + } else + push_back(c); + + c_scan = &scanner::scan_head; + return TT_TAG_START; +} + +scanner::token_type scanner::scan_entity() { + // note that when scan_entity() is called, & is already consumed. + + char buffer[8]; + unsigned int buflen = 0; + buffer[buflen++] = '&'; // (just makes resolve_entity and append_value(buffer) easier) + + bool has_end = false; + + while (true) { + char c = get_char(); + buffer[buflen++] = c; + + // Found end of entity + if (c == ';') break; + + // Too long to be entity + if (buflen == sizeof(buffer)) break; + + // Not a character we'd expect in an entity (esp '&' or '<') + if (!isalpha(c)) break; + } + + // Keep the text_end that scanner::scan_body uses similarly up-to-date. Since + // scan_entity() is only called from scan_body we assume text_begin is already + // set correctly by it. + text_end += buflen; + + // If we found the end of the entity, and we can identify it, then + // resolve_entity() will emit the char it encoded. + if (buffer[buflen - 1] == ';' && resolve_entity(buffer, buflen)) { + return TT_TEXT; + } + + // Otherwise, we just emit whatever we read as text, except for the last + // character that caused us to break. That may be another &, or a <, which we + // would want to scan properly. 
+  for (unsigned int i = 0; i < buflen - 1; ++i) append_value(buffer[i]);
+  push_back(buffer[buflen - 1]);
+  --text_end;  // because push_back()
+  return TT_TEXT;
+}
+
+bool scanner::resolve_entity(char *buffer, unsigned int len) {
+  switch (len) {
+    case 4:
+      if (equal(buffer, "&lt;", 4)) {
+        append_value('<');
+        return true;
+      }
+      if (equal(buffer, "&gt;", 4)) {
+        append_value('>');
+        return true;
+      }
+      break;
+
+    case 5:
+      if (equal(buffer, "&amp;", 5)) {
+        append_value('&');
+        return true;
+      }
+      break;
+
+    case 6:
+      if (equal(buffer, "&quot;", 6)) {
+        append_value('"');
+        return true;
+      }
+      if (equal(buffer, "&apos;", 6)) {
+        append_value('\'');
+        return true;
+      }
+      if (equal(buffer, "&nbsp;", 6)) {
+        append_value(' ');  // TODO: handle non-breaking spaces better than just converting them to spaces
+        return true;
+      }
+      break;
+  }
+  return false;
+}
+
+// skip whitespaces.
+// returns first non-whitespace char
+char scanner::skip_whitespace() {
+  while (char c = get_char()) {
+    if (!is_whitespace(c)) return c;
+  }
+  return 0;
+}
+
+void scanner::push_back(char c) { input_char = c; }
+
+char scanner::get_char() {
+  if (input_char) {
+    char t(input_char);
+    input_char = 0;
+    return t;
+  }
+  return input.get_char();
+}
+
+bool scanner::is_whitespace(char c) {
+  return c <= ' ' && (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f');
+}
+
+void scanner::append_value(char c) {
+  if (value_length < (MAX_TOKEN_SIZE - 1)) value[value_length++] = c;
+}
+
+void scanner::append_attr_name(char c) {
+  if (attr_name_length < (MAX_NAME_SIZE - 1)) attr_name[attr_name_length++] = char(c);
+}
+
+void scanner::append_tag_name(char c) {
+  if (tag_name_length < (MAX_NAME_SIZE - 1))
+    tag_name[tag_name_length++] =
+        std::tolower(static_cast<unsigned char>(c));  // cast because std::tolower has undefined behaviour otherwise
+}
+
+scanner::token_type scanner::scan_comment() {
+  if (got_tail) {
+    c_scan = &scanner::scan_body;
+    got_tail = false;
+    return TT_COMMENT_END;
+  }
+  for (value_length = 0; value_length <
(MAX_TOKEN_SIZE - 1); ++value_length) { + char c = get_char(); + if (c == 0) return TT_EOF; + value[value_length] = c; + + if (value_length >= 2 && value[value_length] == '>' && value[value_length - 1] == '-' && + value[value_length - 2] == '-') { + got_tail = true; + value_length -= 2; + break; + } + } + return TT_DATA; +} + +scanner::token_type scanner::scan_special() { + if (got_tail) { + c_scan = &scanner::scan_body; + got_tail = false; + return TT_TAG_END; + } + for (value_length = 0; value_length < (MAX_TOKEN_SIZE - 1); ++value_length) { + char c = get_char(); + if (c == 0) return TT_EOF; + + // in case MAX_TOKEN_SIZE limit breaks up the end tag + if (c == '<' && value_length + tag_name_length + 3 >= MAX_TOKEN_SIZE) { + push_back(c); + break; + } + + value[value_length] = c; + + if (c == '>' && value_length >= tag_name_length + 2) { + unsigned int i = tag_name_length - 1; + do { + if (value[value_length + i - tag_name_length] != tag_name[i]) break; + --i; + } while (i > 0); + if (i > 0) continue; + if (value[value_length - tag_name_length - 1] != '/') continue; + if (value[value_length - tag_name_length - 2] != '<') continue; + + got_tail = true; + value_length = value_length - tag_name_length - 2; + break; + } + } + return TT_DATA; +} + +scanner::token_type scanner::scan_cdata() { + if (got_tail) { + c_scan = &scanner::scan_body; + got_tail = false; + return TT_CDATA_END; + } + for (value_length = 0; value_length < (MAX_TOKEN_SIZE - 1); ++value_length) { + char c = get_char(); + if (c == 0) return TT_EOF; + value[value_length] = c; + + if (value_length >= 2 && value[value_length] == '>' && value[value_length - 1] == ']' && + value[value_length - 2] == ']') { + got_tail = true; + value_length -= 2; + break; + } + } + return TT_DATA; +} + +scanner::token_type scanner::scan_pi() { + if (got_tail) { + c_scan = &scanner::scan_body; + got_tail = false; + return TT_PI_END; + } + for (value_length = 0; value_length < (MAX_TOKEN_SIZE - 1); ++value_length) { + char c 
= get_char();
+    if (c == 0) return TT_EOF;
+    value[value_length] = c;
+
+    if (value_length >= 1 && value[value_length] == '>' && value[value_length - 1] == '?') {
+      got_tail = true;
+      value_length -= 1;
+      break;
+    }
+  }
+  return TT_DATA;
+}
+
+scanner::token_type scanner::scan_entity_decl() {
+  if (got_tail) {
+    c_scan = &scanner::scan_body;
+    got_tail = false;
+    return TT_ENTITY_END;
+  }
+  char t;
+  unsigned int tc = 0;
+  for (value_length = 0; value_length < (MAX_TOKEN_SIZE - 1); ++value_length) {
+    t = get_char();
+    if (t == 0) return TT_EOF;
+    value[value_length] = t;
+    if (t == '\"')
+      tc++;
+    else if (t == '>' && (tc & 1u) == 0) {
+      got_tail = true;
+      break;
+    }
+  }
+  return TT_DATA;
+}
+
+}  // namespace markup
diff --git a/src/translator/xh_scanner.h b/src/translator/xh_scanner.h
new file mode 100644
index 000000000..0b2dd2be2
--- /dev/null
+++ b/src/translator/xh_scanner.h
@@ -0,0 +1,130 @@
+// https://www.codeproject.com/Articles/14076/Fast-and-Compact-HTML-XML-Scanner-Tokenizer
+// BSD license
+//|
+//| simple and fast XML/HTML scanner/tokenizer
+//|
+//| (C) Andrew Fedoniouk @ terrainformatica.com
+//|
+#include <cstring>
+
+namespace markup {
+struct instream {
+  const char *p;
+  const char *end;
+  explicit instream(const char *src) : p(src), end(src + strlen(src)) {}
+  instream(const char *begin, const char *end) : p(begin), end(end) {}
+  char get_char() { return p < end ?
*p++ : 0; } +}; + +class scanner { + public: + enum token_type { + TT_ERROR = -1, + TT_EOF = 0, + + TT_TAG_START, // + // ^-- happens here + // + // ^-- or here + TT_ATTR, // + // ^-- happens here + TT_TEXT, + + TT_DATA, // content of followings: + // (also content of TT_TAG_START and TT_TAG_END, if the tag is 'script' or 'style') + + TT_COMMENT_START, + TT_COMMENT_END, // after "" + TT_CDATA_START, + TT_CDATA_END, // after "" + TT_PI_START, + TT_PI_END, // after "" + TT_ENTITY_START, + TT_ENTITY_END, // after "" + + }; + + enum $ { MAX_TOKEN_SIZE = 1024, MAX_NAME_SIZE = 128 }; + + public: + explicit scanner(instream &is) + : value_length(0), tag_name_length(0), attr_name_length(0), input(is), input_char(0), got_tail(false) { + c_scan = &scanner::scan_body; + } + + // get next token + token_type get_token() { return (this->*c_scan)(); } + + // get text span backed by original input. + const char *get_text_begin() { return text_begin; } + const char *get_text_end() { return text_end; } + + // get value of TT_TEXT, TT_ATTR and TT_DATA + const char *get_value(); + + // get attribute name + const char *get_attr_name(); + + // get tag name (always lowercase) + const char *get_tag_name(); + + private: /* methods */ + typedef token_type (scanner::*scan)(); + + scan c_scan; // current 'reader' + + // content 'readers' + token_type scan_body(); + + token_type scan_head(); + + token_type scan_comment(); + + token_type scan_cdata(); + + token_type scan_special(); + + token_type scan_pi(); + + token_type scan_tag(); + + token_type scan_entity(); + + token_type scan_entity_decl(); + + char skip_whitespace(); + + void push_back(char c); + + char get_char(); + + bool resolve_entity(char *buffer, unsigned int len); + + static bool is_whitespace(char c); + + void append_value(char c); + + void append_attr_name(char c); + + void append_tag_name(char c); + + private: /* data */ + char value[MAX_TOKEN_SIZE]{}; + unsigned int value_length; + + char tag_name[MAX_NAME_SIZE]{}; + unsigned 
int tag_name_length; + + char attr_name[MAX_NAME_SIZE]{}; + unsigned int attr_name_length; + + instream &input; + char input_char; + + bool got_tail; // aux flag used in scan_comment, etc. + + const char *text_begin, *text_end; +}; +} // namespace markup diff --git a/wasm/bindings/response_options_bindings.cpp b/wasm/bindings/response_options_bindings.cpp index deafe1e0a..c58d24c64 100644 --- a/wasm/bindings/response_options_bindings.cpp +++ b/wasm/bindings/response_options_bindings.cpp @@ -15,5 +15,6 @@ using namespace emscripten; EMSCRIPTEN_BINDINGS(response_options) { value_object("ResponseOptions") .field("qualityScores", &ResponseOptions::qualityScores) - .field("alignment", &ResponseOptions::alignment); + .field("alignment", &ResponseOptions::alignment) + .field("html", &ResponseOptions::HTML); } diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index 7fbaea8d2..f252a9b3c 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -323,7 +323,7 @@ const _parseTranslatedTextSentenceQualityScores = (vectorResponse) => { } const _prepareResponseOptions = () => { - return {qualityScores: true, alignment: false}; + return {qualityScores: true, alignment: false, html: true}; } const _prepareSourceText = (input) => { From eea5554b91dceea0e51a628adc844d7fc1e7ae85 Mon Sep 17 00:00:00 2001 From: Jelmer Date: Mon, 29 Nov 2021 08:41:24 +0000 Subject: [PATCH 309/442] HTML handling improvements (#266) * Fix out-of-bounds error when determining alignment for whole word If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds. * Don't segfault if alignment info is not available When alignment info is requested, but model is missing `alignment: soft` you'd get empty alignment info for every target token. * Partial fix for handling empty elements This fixes a parse error when dealing with something like `

    ...

    ` or `...
    ` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side. * Fix formatting --- src/tests/units/html_tests.cpp | 18 ++++++++++ src/translator/html.cpp | 64 ++++++++++++++++++++++++---------- 2 files changed, 64 insertions(+), 18 deletions(-) diff --git a/src/tests/units/html_tests.cpp b/src/tests/units/html_tests.cpp index 258847970..59244a1b5 100644 --- a/src/tests/units/html_tests.cpp +++ b/src/tests/units/html_tests.cpp @@ -395,6 +395,24 @@ TEST_CASE("Test self-closing tag (HTML5)") { CHECK(input == "hello world and other creatures\n"); // Note double space between "hello" and "world" } +TEST_CASE("Test empty self-closing tag at end of input") { + std::string input("hello
    "); + HTML html(std::move(input), true); + CHECK(input == "hello "); +} + +TEST_CASE("Test empty tag pair at end of input") { + std::string input("hello "); + HTML html(std::move(input), true); + CHECK(input == "hello "); +} + +TEST_CASE("Test empty self-closing pair at end of input in parent") { + std::string input("

    hello

    "); + HTML html(std::move(input), true); + CHECK(input == "hello "); +} + TEST_CASE("Test empty tag", "[!mayfail]") { std::string input( "

    hello empty_elements{"area", "base", "br", "col", "embed", "hr", "img", @@ -132,6 +132,10 @@ void FilterEmpty(HTML::Taint &stack) { stack.resize(dst - stack.begin()); } +bool ContainsTag(HTML::Taint const &stack, HTML::Tag const *tag) { + return std::find(stack.rbegin(), stack.rend(), tag) != stack.rend(); +} + template AnnotatedText Apply(AnnotatedText const &in, Fun fun) { AnnotatedText out; @@ -166,6 +170,10 @@ AnnotatedText Apply(AnnotatedText const &in, Fun fun) { bool IsContinuation(string_view str) { return !str.empty() && str.compare(0, 1, " ", 1) != 0; } +bool HasAlignments(Response const &response) { + return !response.alignments.empty() && !response.alignments[0][0].empty(); +} + void HardAlignments(Response const &response, std::vector> &alignments) { // For each sentence... for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { @@ -199,11 +207,11 @@ void HardAlignments(Response const &response, std::vector> & size_t s_max = score_curr > score_prev ? s_curr : s_prev; // Apply this to all previous tokens in the word - for (size_t i = t; i >= 0; --i) { + for (size_t i = t;; --i) { alignments.back()[i] = s_max; - // Stop if this was the beginning of the word - if (!IsContinuation(response.target.word(sentenceIdx, i))) break; + // Stop if this was the first token or the beginning of the word + if (i == 0 || !IsContinuation(response.target.word(sentenceIdx, i))) break; } } } @@ -262,6 +270,14 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t size_t offset = 0; // Size added by prepending HTML size_t whitespace_size = CountPrefixWhitespaces(token); + // Close tags we want to show up left (before) the token, but open tags + // ideally come directly after any prefix whitespace. However, some tokens + // match multiple spans. If a previous span has added an open tag, after any + // whitespace, and the next span closes said tag again, we need to close + // it after the whitespace. 
So after the first open tag, any closing tag + // should also align right, after whitespace, not before. Hence this bool. + bool close_left = true; + // Potential issue: spans and tokens can intersect, e.g. // // text

    h e ll o

    @@ -277,7 +293,7 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) { std::string close_tag = format("", (*cit)->name); - html.insert(offset, close_tag); + html.insert(offset + (close_left ? 0 : whitespace_size), close_tag); offset += close_tag.size(); } @@ -285,6 +301,7 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t std::string open_tag = format("<{}{}>", tag->name, tag->attributes); html.insert(offset + whitespace_size, open_tag); offset += open_tag.size(); + close_left = false; } if (span_it + 1 != span_end && ((span_it + 1)->begin < range.end || last)) { @@ -295,8 +312,9 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t break; } - // TODO: This is just the taint of the last span, not the ones in between - // I don't know if that is okay for transferring taints. We'll need to test. + // TODO: This is just the taint of the last span, not the ones in between. + // This makes us lose empty tags, and maybe some markup as well, in the + // response target HTML restoration. token_tags.push_back(prev_it->tags); return html; @@ -422,6 +440,7 @@ HTML::HTML(std::string &&source, bool process_markup) { markup::scanner scanner(in); source.clear(); // source is moved out of, so should be clear anyway + Tag *tag; Taint stack; spans_.push_back(Span{0, 0, {}}); @@ -449,13 +468,19 @@ HTML::HTML(std::string &&source, bool process_markup) { // separate words if (IsBlockElement(scanner.get_tag_name()) && !source.empty() && source.back() != ' ') source.push_back(' '); - pool_.emplace_back(new Tag{ - scanner.get_tag_name(), std::string(), - IsEmtpyElement(scanner.get_tag_name()) // TODO: detect empty elements by doing a second pass and detecting - // non-closed elements? 
- }); + tag = new Tag{scanner.get_tag_name(), std::string(), IsEmptyElement(scanner.get_tag_name())}; + pool_.emplace_back(tag); // pool_ takes ownership of our tag stack.push_back(pool_.back().get()); + + // Empty elements (e.g. ) are not applicable to a span of text + // so instead we "apply" them to an empty span in between, and then + // immediately remove them again from the stack. + if (tag->empty) { + spans_.push_back(Span{source.size(), source.size(), stack}); + stack.pop_back(); + } + break; case markup::scanner::TT_TAG_END: @@ -464,17 +489,20 @@ HTML::HTML(std::string &&source, bool process_markup) { if (stack.empty()) throw BadHTML(format("Encountered more closing tags ({}) than opening tags", scanner.get_tag_name())); - // TODO: what to do with "" case, where tag is immediately closed - // so it never makes it into the taint of any of the spans? Add it as - // an empty tag to the previous/following? if (stack.back()->name != scanner.get_tag_name()) throw BadHTML(format("Encountered unexpected closing tag , stack is {}", scanner.get_tag_name(), stack)); + + // What to do with "" case, where tag is immediately closed + // so it never makes it into the taint of any of the spans? This adds + // an empty span so it still lives. + if (spans_.empty() || !ContainsTag(spans_.back().tags, stack.back())) + spans_.push_back(Span{source.size(), source.size(), stack}); + stack.pop_back(); break; case markup::scanner::TT_ATTR: - // TODO could be more efficient if format() accepted a destination, i.e. format_to? - stack.back()->attributes += format(" {}=\"{}\"", scanner.get_attr_name(), scanner.get_value()); + tag->attributes += format(" {}=\"{}\"", scanner.get_attr_name(), scanner.get_value()); break; default: @@ -512,7 +540,7 @@ void HTML::Restore(Response &response) { // tokens with the tags from their source token counterpart. If there is no // alignment information available, we just interpolate based on sentence // length (badly). 
- if (!response.alignments.empty()) { + if (HasAlignments(response)) { // DebugPrintAlignmentScores(std::cerr, response); HardAlignments(response, alignments); } else { From e8fd01e9f4c28d3acdd49485cd4b9395b87aa631 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal Date: Tue, 30 Nov 2021 14:31:01 +0100 Subject: [PATCH 310/442] Updated marian-dev submodule --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 200e81c0c..a284a05a1 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 200e81c0cc88259c540b96afc6e0867cb05570b0 +Subproject commit a284a05a12bdc6fdf72223c0120838b26d3a977c From 8e79897f30a3948621e95b657fda2bfc6f69bc76 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Wed, 1 Dec 2021 11:32:51 +0100 Subject: [PATCH 311/442] Updated configuration for html text translation to work in wasm test page (#269) * Updated translator configuration in wasm test page - Added alignment: soft * Set ResponseOptions::alignment to "true" - Had to be set for html text translation to work --- wasm/test_page/js/worker.js | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index f252a9b3c..2711fe1c6 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -181,6 +181,7 @@ cpu-threads: 0 quiet: true quiet-translation: true gemm-precision: int8shiftAlphaAll +alignment: soft `; const modelFile = `${rootURL}/${languagePair}/${modelRegistry[languagePair]["model"].name}`; @@ -323,7 +324,7 @@ const _parseTranslatedTextSentenceQualityScores = (vectorResponse) => { } const _prepareResponseOptions = () => { - return {qualityScores: true, alignment: false, html: true}; + return {qualityScores: true, alignment: true, html: true}; } const _prepareSourceText = (input) => { From e75a9e1da3ecaace48b8cf41c191c1920b3cd3ac Mon Sep 17 
00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Tue, 14 Dec 2021 16:39:19 +0100 Subject: [PATCH 312/442] More robust logic to import wasm gemm (#276) - Import optimized gemm implementation only if all the necessary functions are provided by it, otherwise use the fallback gemm --- wasm/import-gemm-module.js | 49 +++++++++++++++++++++++++++----------- 1 file changed, 35 insertions(+), 14 deletions(-) diff --git a/wasm/import-gemm-module.js b/wasm/import-gemm-module.js index e23a69d7f..369c551cc 100644 --- a/wasm/import-gemm-module.js +++ b/wasm/import-gemm-module.js @@ -3,23 +3,44 @@ * implementation. */ function createWasmGemm() { + // Name of the optimized gemm implementation. const OPTIMIZED_GEMM = "mozIntGemm"; - const FALLBACK_GEMM = "asm"; - if (WebAssembly[OPTIMIZED_GEMM]) { - console.log(`Using optimized gemm (${OPTIMIZED_GEMM}) implementation`); - return new WebAssembly.Instance(WebAssembly[OPTIMIZED_GEMM](), {"": {memory: wasmMemory}}).exports; + // A map of expected gemm function to the corresponding fallback gemm function names.
+ const GEMM_TO_FALLBACK_FUNCTIONS_MAP = { + "int8_prepare_a": "int8PrepareAFallback", + "int8_prepare_b": "int8PrepareBFallback", + "int8_prepare_b_from_transposed": "int8PrepareBFromTransposedFallback", + "int8_prepare_b_from_quantized_transposed": "int8PrepareBFromQuantizedTransposedFallback", + "int8_prepare_bias": "int8PrepareBiasFallback", + "int8_multiply_and_add_bias": "int8MultiplyAndAddBiasFallback", + "int8_select_columns_of_b": "int8SelectColumnsOfBFallback" + }; + + const optimizedGemmModule = WebAssembly[OPTIMIZED_GEMM]; + if (!optimizedGemmModule) { + return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); } - else { - console.log(`Using fallback gemm implementation`); - return { - "int8_prepare_a": (...a) => Module[FALLBACK_GEMM]["int8PrepareAFallback"](...a), - "int8_prepare_b": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFallback"](...a), - "int8_prepare_b_from_transposed": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFromTransposedFallback"](...a), - "int8_prepare_b_from_quantized_transposed": (...a) => Module[FALLBACK_GEMM]["int8PrepareBFromQuantizedTransposedFallback"](...a), - "int8_prepare_bias": (...a) => Module[FALLBACK_GEMM]["int8PrepareBiasFallback"](...a), - "int8_multiply_and_add_bias": (...a) => Module[FALLBACK_GEMM]["int8MultiplyAndAddBiasFallback"](...a), - "int8_select_columns_of_b": (...a) => Module[FALLBACK_GEMM]["int8SelectColumnsOfBFallback"](...a) + + const optimizedGemmModuleExports = new WebAssembly.Instance(optimizedGemmModule(), {"": {memory: wasmMemory}}).exports; + for (let key in GEMM_TO_FALLBACK_FUNCTIONS_MAP) { + if (!optimizedGemmModuleExports[key]) { + return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); } } + console.log(`Using optimized gemm (${OPTIMIZED_GEMM}) implementation`); + return optimizedGemmModuleExports; +} + +// Return the fallback gemm implementation. 
+function fallbackGemm(gemmToFallbackFunctionsMap) { + // The fallback gemm implementation + const FALLBACK_GEMM = "asm"; + + let fallbackGemmModuleExports = {}; + for (let key in gemmToFallbackFunctionsMap) { + fallbackGemmModuleExports[key] = (...a) => Module[FALLBACK_GEMM][gemmToFallbackFunctionsMap[key]](...a) + } + console.log(`Using fallback gemm implementation`); + return fallbackGemmModuleExports; } From 571d312930374d834f3dbdc4cd611f4be1fc820e Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 14 Dec 2021 16:34:30 +0000 Subject: [PATCH 313/442] Constrain mistune to fix docs CI (#278) --- doc/requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/requirements.txt b/doc/requirements.txt index 8d56e6839..28e6e70ca 100644 --- a/doc/requirements.txt +++ b/doc/requirements.txt @@ -2,5 +2,6 @@ sphinx==2.4.4 breathe==4.13.0 exhale sphinx_rtd_theme +mistune<2.0.0 recommonmark m2r From feb9c90429fe23423dc23c56e8d0ee19d85acec7 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Tue, 14 Dec 2021 21:52:00 +0100 Subject: [PATCH 314/442] Additional logs in JS translation worker (#277) - Print source text received in the response - Print no. 
of block elements in the input --- wasm/test_page/js/worker.js | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js index 2711fe1c6..3bc89bef5 100644 --- a/wasm/test_page/js/worker.js +++ b/wasm/test_page/js/worker.js @@ -53,11 +53,14 @@ onmessage = async function(e) { const to = e.data[2]; const input = e.data[3]; let inputWordCount = 0; + let inputBlockElements = 0; input.forEach(sentence => { inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; + inputBlockElements++; }) let start = Date.now(); try { + log(`Blocks to translate: ${inputBlockElements}`); result = translate(from, to, input); const secs = (Date.now() - start) / 1000; log(`Translation '${from}${to}' Successful. Speed: ${Math.round(inputWordCount / secs)} WPS (${inputWordCount} words in ${secs} secs)`); @@ -243,10 +246,12 @@ const _translateInvolvingEnglish = (from, to, input) => { // Parse all relevant information from vectorResponse const listTranslatedText = _parseTranslatedText(vectorResponse); + const listSourceText = _parseSourceText(vectorResponse); const listTranslatedTextSentences = _parseTranslatedTextSentences(vectorResponse); const listSourceTextSentences = _parseSourceTextSentences(vectorResponse); const listTranslatedTextSentenceQualityScores = _parseTranslatedTextSentenceQualityScores(vectorResponse); + log(`Source text: ${listSourceText}`); log(`Translated text: ${listTranslatedText}`); log(`Translated sentences: ${JSON.stringify(listTranslatedTextSentences)}`); log(`Source sentences: ${JSON.stringify(listSourceTextSentences)}`); @@ -276,6 +281,15 @@ const _parseTranslatedTextSentences = (vectorResponse) => { return result; } +const _parseSourceText = (vectorResponse) => { + const result = []; + for (let i = 0; i < vectorResponse.size(); i++) { + const response = vectorResponse.get(i); + result.push(response.getOriginalText()); + } + return result; +} + const 
_parseSourceTextSentences = (vectorResponse) => { const result = []; for (let i = 0; i < vectorResponse.size(); i++) { From 8563f0856f6dada0d6ce4037e727033dda97cfd1 Mon Sep 17 00:00:00 2001 From: Nikolay Bogoychev Date: Tue, 14 Dec 2021 23:53:53 +0000 Subject: [PATCH 315/442] Proper arch setting on win32 (#275) * Proper arch detection on win32 * Whoops --- 3rd_party/marian-dev | 2 +- CMakeLists.txt | 36 ++++++++++++++++++++++++++++++------ 2 files changed, 31 insertions(+), 7 deletions(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index a284a05a1..08b154463 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit a284a05a12bdc6fdf72223c0120838b26d3a977c +Subproject commit 08b1544636fe13eaf1fbacb17c6fb050abfb8d42 diff --git a/CMakeLists.txt b/CMakeLists.txt index a9586d8e5..006e9521d 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -20,10 +20,39 @@ if(NOT CMAKE_BUILD_TYPE) message(WARNING "CMAKE_BUILD_TYPE not set; setting to Release") set(CMAKE_BUILD_TYPE "Release") endif() + +if(NOT COMPILE_WASM) + # Setting BUILD_ARCH to native invokes CPU intrinsic detection logic below. + # Prevent invoking that logic for WASM builds. + set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") + + # Unfortunately MSVC supports a limited subset of BUILD_ARCH flags. 
Instead try to guess + # what architecture we can compile to reading BUILD_ARCH and mapping it to MSVC values + # references: https://clang.llvm.org/docs/UsersManual.html https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/i386-and-x86-64-Options.html + # https://docs.microsoft.com/en-us/cpp/build/reference/arch-x86?redirectedfrom=MSDN&amp;view=vs-2019&view=msvc-170 https://devblogs.microsoft.com/oldnewthing/20201026-00/?p=104397 + # This is by no means an exhaustive list but should match the most common flags Linux programmers expect to parse to MSVC + if(MSVC) + if(BUILD_ARCH STREQUAL "native") # avx2 is good default for native. Very few desktop systems support avx512 + set(MSVC_BUILD_ARCH "/arch:AVX2") + elseif(BUILD_ARCH STREQUAL "skylake-avx512" OR BUILD_ARCH STREQUAL "cannonlake" OR BUILD_ARCH STREQUAL "x86-64-v4" OR BUILD_ARCH STREQUAL "tigerlake" OR BUILD_ARCH STREQUAL "cooperlake" OR BUILD_ARCH STREQUAL "cascadelake") + set(MSVC_BUILD_ARCH "/arch:AVX512") + elseif(BUILD_ARCH STREQUAL "core-avx2" OR BUILD_ARCH STREQUAL "haswell" OR BUILD_ARCH STREQUAL "x86-64-v3" OR BUILD_ARCH STREQUAL "broadwell" OR BUILD_ARCH STREQUAL "skylake") + set(MSVC_BUILD_ARCH "/arch:AVX2") + elseif(BUILD_ARCH STREQUAL "sandybridge" OR BUILD_ARCH STREQUAL "corei7-avx" OR BUILD_ARCH STREQUAL "core-avx-i" OR BUILD_ARCH STREQUAL "ivybridge") + set(MSVC_BUILD_ARCH "/arch:AVX") + elseif(BUILD_ARCH STREQUAL "nehalem" OR BUILD_ARCH STREQUAL "westmere" OR BUILD_ARCH STREQUAL "x86-64-v2" OR BUILD_ARCH STREQUAL "corei7" OR BUILD_ARCH STREQUAL "core2") + set(MSVC_BUILD_ARCH "/arch:SSE2") # This is MSVC default. We won't go down to SSE because we don't support that hardware at all with intgemm. Marian recommends to only go down to SSE4.1 at most + else() + message(WARNING "Unknown BUILD_ARCH ${BUILD_ARCH} provided. 
Default to SSE2 for Windows build") + set(MSVC_BUILD_ARCH "/arch:SSE2") + endif() + endif(MSVC) +endif() + #MSVC can't seem to pick up correct flags otherwise: if(MSVC) add_definitions(-DUSE_SSE2=1) # Supposed to fix something in the sse_mathfun.h but not sure it does - set(INTRINSICS "/arch:AVX2") # ARCH we're targetting on win32. @TODO variable + set(INTRINSICS ${MSVC_BUILD_ARCH}) # ARCH we're targetting on win32. @TODO variable set(CMAKE_CXX_FLAGS "/EHsc /DWIN32 /D_WINDOWS /DUNICODE /D_UNICODE /D_CRT_NONSTDC_NO_WARNINGS /D_CRT_SECURE_NO_WARNINGS /bigobj") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS} /MT /O2 ${INTRINSICS} /Zi /MP /GL /DNDEBUG") @@ -80,11 +109,6 @@ include(GetVersionFromFile) message(STATUS "Project name: ${PROJECT_NAME}") message(STATUS "Project version: ${PROJECT_VERSION_STRING_FULL}") -if(NOT COMPILE_WASM) - # Set BUILD_ARCH to native only while compiling for non wasm platform - set(BUILD_ARCH native CACHE STRING "Compile for this CPU architecture.") -endif() - if(COMPILE_WASM) set(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") list(APPEND WASM_COMPILE_FLAGS -O3 -g2 -fPIC -mssse3 -msimd128) From 420f12b3ff9eb854a8e09be975d4deddf60712c2 Mon Sep 17 00:00:00 2001 From: Jelmer Date: Wed, 15 Dec 2021 23:01:49 +0100 Subject: [PATCH 316/442] Remove value length limit from HTML parser & interpolated alignments (#274) * Remove InterpolateAlignment And some code improvements * Replace the fixed value buffer with a std::string backing * Fix tests that had no alignment info These depended on the linear interpolation that I removed * Remove arbitrary limits on tag and attribute names This might also fix a bug caused by the eager lower casing of tag names, which could break and token_type scan_special(); - token_type scan_pi(); - + // Consumes token_type scan_tag(); - token_type scan_entity(); - - token_type scan_entity_decl(); + // Consumes '&' etc, emits parent_token_type + token_type 
scan_entity(token_type parent_token_type); - char skip_whitespace(); + size_t skip_whitespace(); - void push_back(char c); - - char get_char(); - - bool resolve_entity(char *buffer, unsigned int len); + bool resolve_entity(string_ref const &buffer, string_ref &decoded) const; static bool is_whitespace(char c); - void append_value(char c); - - void append_attr_name(char c); - - void append_tag_name(char c); - private: /* data */ - char value[MAX_TOKEN_SIZE]{}; - unsigned int value_length; - - char tag_name[MAX_NAME_SIZE]{}; - unsigned int tag_name_length; - - char attr_name[MAX_NAME_SIZE]{}; - unsigned int attr_name_length; - - instream &input; - char input_char; + string_ref value_; + string_ref tag_name_; + string_ref attr_name_; - bool got_tail; // aux flag used in scan_comment, etc. + instream &input_; - const char *text_begin, *text_end; + bool got_tail; // aux flag used in scan_comment }; } // namespace markup From 8884b390554b816a3273859eb9ed58d8e50b5fbc Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Fri, 17 Dec 2021 17:39:43 +0100 Subject: [PATCH 317/442] Disabled importing optimized gemm module (#282) - Until the optimized gemm module stops requiring Shared Array Buffer, we can't really use it in Firefox --- wasm/import-gemm-module.js | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/wasm/import-gemm-module.js b/wasm/import-gemm-module.js index 369c551cc..8d20c58a7 100644 --- a/wasm/import-gemm-module.js +++ b/wasm/import-gemm-module.js @@ -3,9 +3,6 @@ * implementation. */ function createWasmGemm() { - // Name of the optimized gemm implementation. - const OPTIMIZED_GEMM = "mozIntGemm"; - // A map of expected gemm function to the corresponding fallback gemm function names. 
const GEMM_TO_FALLBACK_FUNCTIONS_MAP = { "int8_prepare_a": "int8PrepareAFallback", @@ -17,19 +14,28 @@ function createWasmGemm() { "int8_select_columns_of_b": "int8SelectColumnsOfBFallback" }; - const optimizedGemmModule = WebAssembly[OPTIMIZED_GEMM]; - if (!optimizedGemmModule) { - return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); - } + // ToDo: Activate the if code and remove else code once optimized gemm can work without shared array buffer. + if (0) { + // Name of the optimized gemm implementation. + const OPTIMIZED_GEMM = "mozIntGemm"; - const optimizedGemmModuleExports = new WebAssembly.Instance(optimizedGemmModule(), {"": {memory: wasmMemory}}).exports; - for (let key in GEMM_TO_FALLBACK_FUNCTIONS_MAP) { - if (!optimizedGemmModuleExports[key]) { + const optimizedGemmModule = WebAssembly[OPTIMIZED_GEMM]; + if (!optimizedGemmModule) { return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); } + + const optimizedGemmModuleExports = new WebAssembly.Instance(optimizedGemmModule(), {"": {memory: wasmMemory}}).exports; + for (let key in GEMM_TO_FALLBACK_FUNCTIONS_MAP) { + if (!optimizedGemmModuleExports[key]) { + return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); + } + } + console.log(`Using optimized gemm (${OPTIMIZED_GEMM}) implementation`); + return optimizedGemmModuleExports; + } + else { + return fallbackGemm(GEMM_TO_FALLBACK_FUNCTIONS_MAP); } - console.log(`Using optimized gemm (${OPTIMIZED_GEMM}) implementation`); - return optimizedGemmModuleExports; } // Return the fallback gemm implementation. From 793d132b7c8dfa2c41ca5745c09ba402e1749eb8 Mon Sep 17 00:00:00 2001 From: Andre Natal Date: Fri, 17 Dec 2021 15:05:11 -0800 Subject: [PATCH 318/442] Adding circle ci job to push the wasm artifacts to github releases (#280) * Adding circle ci job to push the wasm artifacts to github releases. 
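Patches 312 and 317 above implement the same selection policy: use the optimized `mozIntGemm` module only when it exists and exports every required function, otherwise shim every call through to the fallback. A minimal, self-contained sketch of that policy follows; the module objects here are plain stand-ins, not the real `WebAssembly`/`Module` globals:

```javascript
// Pick the optimized gemm implementation only if it provides every required
// function; otherwise build a shim that forwards each call to the fallback
// under its fallback-specific name.
function selectGemm(optimized, fallback, requiredToFallbackName) {
  const required = Object.keys(requiredToFallbackName);
  const useOptimized =
    optimized && required.every(fn => typeof optimized[fn] === "function");
  if (useOptimized) return optimized;

  const shim = {};
  for (const fn of required) {
    const fallbackName = requiredToFallbackName[fn];
    // Forward arguments untouched to the fallback implementation.
    shim[fn] = (...args) => fallback[fallbackName](...args);
  }
  return shim;
}
```

The point of checking every export, rather than just the module's presence, is that a partially implemented optimized module silently degrades to the fallback instead of failing at call time.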
* Updated config.yml --- .circleci/config.yml | 66 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 61 insertions(+), 5 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 9b14ed154..d9ff7933d 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -11,8 +11,9 @@ jobs: - checkout - run: - name: Build WASM - command: bash build-wasm.sh WORMHOLE + name: Build WASM WORMHOLE + command: | + bash build-wasm.sh WORMHOLE - run: name: Check artifacts @@ -22,11 +23,21 @@ jobs: if ls bergamot*.wasm &>/dev/null && ls bergamot*.js &>/dev/null then echo "Artifacts Successfully Generated" + mkdir ../artifacts + cp bergamot-translator-worker.wasm ../artifacts/bergamot-translator-worker-with-wormhole.wasm + cp bergamot-translator-worker.js ../artifacts/bergamot-translator-worker-with-wormhole.js + shasum -a 256 ../artifacts/* > ../artifacts/SHA256-1 + cp ../BERGAMOT_VERSION ../artifacts/ else echo "Failure: Artifacts Not Present" exit 1 fi + - persist_to_workspace: + root: . + paths: + - artifacts/* + - store_artifacts: path: "build-wasm" destination: "wasm-wormhole" @@ -43,7 +54,8 @@ jobs: - run: name: Build WASM - command: bash build-wasm.sh + command: | + bash build-wasm.sh - run: name: Check artifacts @@ -53,17 +65,61 @@ jobs: if ls bergamot*.wasm &>/dev/null && ls bergamot*.js &>/dev/null then echo "Artifacts Successfully Generated" + mkdir ../artifacts + cp bergamot-translator-worker.wasm ../artifacts/bergamot-translator-worker-without-wormhole.wasm + cp bergamot-translator-worker.js ../artifacts/bergamot-translator-worker-without-wormhole.js + shasum -a 256 ../artifacts/* > ../artifacts/SHA256-2 else echo "Failure: Artifacts Not Present" exit 1 fi + - persist_to_workspace: + root: . 
+ paths: + - artifacts/* - store_artifacts: path: "build-wasm" destination: "wasm-without-wormhole" + publish_to_github: + docker: + - image: cibuilds/github:0.10 + steps: + - attach_workspace: + # Must be absolute path or relative path from working_directory + at: ./ + - run: + name: "Publish Release on GitHub" + command: | + export COMMIT=$(echo $CIRCLE_SHA1 | cut -c -7) + export VERSION=$(cat ./artifacts/BERGAMOT_VERSION | cut -c 2-) + VERSION=$VERSION+$COMMIT + ls -lsa ./artifacts/ > ./artifacts/FILESIZES + cat ./artifacts/SHA256-1 ./artifacts/SHA256-2 > ./artifacts/SHA256 + rm ./artifacts/SHA256-1 + rm ./artifacts/SHA256-2 + rm ./artifacts/BERGAMOT_VERSION + ghr -t ${GHTOKEN} -u ${CIRCLE_PROJECT_USERNAME} -r ${CIRCLE_PROJECT_REPONAME} -c ${CIRCLE_SHA1} -delete ${VERSION} ./artifacts/ workflows: build: jobs: - - build-with-wormhole - - build-without-wormhole \ No newline at end of file + - build-with-wormhole: + filters: + tags: + only: /^v.*/ + - build-without-wormhole: + filters: + tags: + only: /^v.*/ + - publish_to_github: + filters: + tags: + only: /^v.*/ + branches: + ignore: /.*/ + requires: + - build-without-wormhole + - build-with-wormhole + + From 1a27a8e0a75ab4f0154cc61ab4ae9ccc1cf9842e Mon Sep 17 00:00:00 2001 From: Jelmer Date: Mon, 20 Dec 2021 16:24:30 +0100 Subject: [PATCH 319/442] Increase HTML test coverage (#279) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Fix bug in HasAlignments check When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it. 
* Add more unit tests for xh_scanner Trying to increase that code coverage to 100% * Add test for whitespaces around attributes * Make accessing value(), attr_name() and tag_name() at the wrong time safer * Fix bug in "); + markup::Scanner scanner(in); + + CHECK(scanner.next() == markup::Scanner::TT_TAG_START); + CHECK(scanner.tag() == "style"); + CHECK(scanner.next() == markup::Scanner::TT_DATA); + CHECK(scanner.value() == "body { background: url(test.png); }"); + CHECK(scanner.next() == markup::Scanner::TT_TAG_END); + CHECK(scanner.next() == markup::Scanner::TT_EOF); +} + +TEST_CASE("scan processing instruction") { + // Based on https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 + // element.outerHTML can produce processing instructions in the html. These + // should be treated similar to . + markup::instream in("<?xml version=\"1.0\"?>"); + markup::Scanner scanner(in); + + CHECK(scanner.next() == markup::Scanner::TT_PROCESSING_INSTRUCTION_START); + CHECK(scanner.next() == markup::Scanner::TT_DATA); + CHECK(scanner.value() == "xml version=\"1.0\""); + CHECK(scanner.next() == markup::Scanner::TT_PROCESSING_INSTRUCTION_END); + CHECK(scanner.next() == markup::Scanner::TT_EOF); } \ No newline at end of file diff --git a/src/translator/html.cpp b/src/translator/html.cpp index efe7969f6..f531b44fe 100644 --- a/src/translator/html.cpp +++ b/src/translator/html.cpp @@ -10,7 +10,7 @@ using marian::bergamot::ByteRange; using marian::bergamot::HTML; using marian::bergamot::Response; -void EncodeEntities(string_view const &input, std::string &output) { +void encodeEntities(string_view const &input, std::string &output) { output.clear(); output.reserve(input.size()); @@ -41,7 +41,7 @@ void EncodeEntities(string_view const &input, std::string &output) { } } -size_t CountPrefixWhitespaces(string_view const &input) { +size_t countPrefixWhitespaces(string_view const &input) { size_t size = 0; while (size < input.size() && input[size] == ' ') ++size; return size; } @@ -63,47
+63,50 @@ std::ostream &operator<<(std::ostream &out, HTML::Taint const &tags) { } // Very simple replacement for std::format introduced in C++20 -std::string format(std::string const &format_str) { return format_str; } +std::string format(std::string const &formatTemplate) { return formatTemplate; } template -std::string format(std::string const &format_str, Arg arg) { +std::string format(std::string const &formatTemplate, Arg arg) { std::ostringstream os; - auto index = format_str.find("{}"); + auto index = formatTemplate.find("{}"); assert(index != std::string::npos); - os << format_str.substr(0, index) << arg << format_str.substr(index + 2); + os << formatTemplate.substr(0, index) << arg << formatTemplate.substr(index + 2); return os.str(); } template -std::string format(std::string const &format_str, Arg arg, Args... args) { +std::string format(std::string const &formatTemplate, Arg arg, Args... args) { std::ostringstream os; - auto index = format_str.find("{}"); + auto index = formatTemplate.find("{}"); assert(index != std::string::npos); - os << format_str.substr(0, index) << arg << format(format_str.substr(index + 2), std::forward(args)...); + os << formatTemplate.substr(0, index) << arg << format(formatTemplate.substr(index + 2), std::forward(args)...); return os.str(); } -bool IsBlockElement(std::string_view const &name) { +bool isBlockElement(std::string_view const &name) { // List of elements that we expect might occur inside words, and that should // not introduce spacings around them. Not strictly inline elements, nor flow // elements. 
See also https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories - static std::unordered_set inline_ish_elements{ + static std::unordered_set inlineishElements{ "abbr", "a", "b", "em", "i", "kbd", "mark", "math", "output", "q", "ruby", "small", "span", "strong", "sub", "sup", "time", "u", "var", "wbr", "ins", "del"}; - return inline_ish_elements.find(std::string(name)) == inline_ish_elements.end(); + return inlineishElements.find(std::string(name)) == inlineishElements.end(); } -bool IsEmptyElement(std::string_view const &name) { +bool isVoidTag(std::string_view const &name) { // List of elements for which we do not expect a closing tag, or self-closing // elements in XHTML. See also https://developer.mozilla.org/en-US/docs/Glossary/Empty_element - static std::unordered_set empty_elements{"area", "base", "br", "col", "embed", "hr", "img", - "input", "link", "meta", "param", "source", "track", "wbr"}; + // More relevant source of this list: + // https://searchfox.org/mozilla-central/rev/7d17fd1fe9f0005a2fb19e5d53da4741b06a98ba/dom/base/FragmentOrElement.cpp#1791 + static std::unordered_set voidElements{"area", "base", "basefont", "bgsound", "br", "col", + "embed", "frame", "hr", "img", "input", "keygen", + "link", "meta", "param", "source", "track", "wbr"}; - return empty_elements.find(std::string(name)) != empty_elements.end(); + return voidElements.find(std::string(name)) != voidElements.end(); } -void DiffTags(HTML::Taint const &prev, HTML::Taint const &curr, HTML::Taint &opening, HTML::Taint &closing) { +void diffTags(HTML::Taint const &prev, HTML::Taint const &curr, HTML::Taint &opening, HTML::Taint &closing) { opening.clear(); closing.clear(); @@ -118,11 +121,11 @@ void DiffTags(HTML::Taint const &prev, HTML::Taint const &curr, HTML::Taint &ope opening.insert(opening.end(), curr.begin() + i, curr.end()); } -bool Intersects(ByteRange const &range, HTML::Span const &span) { +bool intersects(ByteRange const &range, HTML::Span const &span) { 
return range.begin <= span.end && range.end >= span.begin; }; -void FilterEmpty(HTML::Taint &stack) { +void filterEmpty(HTML::Taint &stack) { auto src = stack.begin(); auto dst = stack.begin(); @@ -132,12 +135,12 @@ void FilterEmpty(HTML::Taint &stack) { stack.resize(dst - stack.begin()); } -bool ContainsTag(HTML::Taint const &stack, HTML::Tag const *tag) { +bool containsTag(HTML::Taint const &stack, HTML::Tag const *tag) { return std::find(stack.rbegin(), stack.rend(), tag) != stack.rend(); } template -AnnotatedText Apply(AnnotatedText const &in, Fun fun) { +AnnotatedText apply(AnnotatedText const &in, Fun fun) { AnnotatedText out; for (size_t sentenceIdx = 0; sentenceIdx < in.numSentences(); ++sentenceIdx) { @@ -168,20 +171,27 @@ AnnotatedText Apply(AnnotatedText const &in, Fun fun) { return out; } -bool IsContinuation(string_view str) { return !str.empty() && str.compare(0, 1, " ", 1) != 0; } +bool isContinuation(string_view str) { return !str.empty() && str.compare(0, 1, " ", 1) != 0; } -bool HasAlignments(Response const &response) { +bool hasAlignments(Response const &response) { // Test for each sentence individually as a sentence may be empty (or there) // might be no sentences, so just testing for alignments.empty() would not be // sufficient. - for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) + for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { + // If response.alignments is just empty, this might catch it. if (response.alignments.size() <= sentenceIdx || response.alignments[sentenceIdx].size() != response.target.numWords(sentenceIdx)) return false; + + // If response.alignments is "empty" because the model did not provide alignments, + // it still has entries for each target word. But all these entries are empty. 
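The tightened `hasAlignments` check in the hunk above amounts to a shape test on the alignment matrices: one row per target word in every sentence, one score per source word in every row, so a model that produced no alignments (rows present but all empty) is rejected too. A sketch under those assumed data layouts:

```javascript
// alignments[s][t] is the row of source-word scores for target word t of
// sentence s; targetWordCounts[s] / sourceWordCounts[s] give the expected
// dimensions. Returns false on any missing or empty row.
function hasAlignments(alignments, targetWordCounts, sourceWordCounts) {
  for (let s = 0; s < targetWordCounts.length; s++) {
    const rows = alignments[s];
    // Catches alignments being entirely absent or truncated.
    if (!rows || rows.length !== targetWordCounts[s]) return false;
    // Catches the "rows exist but are all empty" case from patch 319.
    for (const row of rows) {
      if (row.length !== sourceWordCounts[s]) return false;
    }
  }
  return true;
}
```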
+ for (size_t wordIdx = 0; wordIdx < response.target.numWords(sentenceIdx); ++wordIdx) + if (response.alignments[sentenceIdx][wordIdx].size() != response.source.numWords(sentenceIdx)) return false; + } return true; } -void HardAlignments(Response const &response, std::vector> &alignments) { +void hardAlignments(Response const &response, std::vector> &alignments) { // For each sentence... for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { alignments.emplace_back(); @@ -204,24 +214,24 @@ void HardAlignments(Response const &response, std::vector> & for (size_t t = 1; t + 1 < response.target.numWords(sentenceIdx); ++t) { // If this token is a continuation of a previous token, pick the tags from the most // prevalent token for the whole word. - if (IsContinuation(response.target.word(sentenceIdx, t))) { + if (isContinuation(response.target.word(sentenceIdx, t))) { // Note: only looking at the previous token since that will already // have this treatment applied to it. 
- size_t s_curr = alignments.back()[t]; - size_t s_prev = alignments.back()[t - 1]; - float score_curr = response.alignments[sentenceIdx][t][s_curr]; - float score_prev = response.alignments[sentenceIdx][t - 1][s_prev]; + size_t currSentenceIdx = alignments.back()[t]; + size_t prevSentenceIdx = alignments.back()[t - 1]; + float currScore = response.alignments[sentenceIdx][t][currSentenceIdx]; + float prevScore = response.alignments[sentenceIdx][t - 1][prevSentenceIdx]; - if (score_curr > score_prev) { + if (currScore > prevScore) { // Apply this to all previous tokens in the word for (size_t i = t;; --i) { - alignments.back()[i] = s_curr; + alignments.back()[i] = currSentenceIdx; // Stop if this was the first token or the beginning of the word - if (i == 0 || !IsContinuation(response.target.word(sentenceIdx, i))) break; + if (i == 0 || !isContinuation(response.target.word(sentenceIdx, i))) break; } } else { - alignments.back()[t] = s_prev; + alignments.back()[t] = prevSentenceIdx; } } } @@ -231,28 +241,28 @@ void HardAlignments(Response const &response, std::vector> & } } -void CopyTaint(Response const &response, std::vector> const &alignments, - std::vector const &token_tags, std::vector &token_tags_target) { - size_t token_offset = 0; +void copyTaint(Response const &response, std::vector> const &alignments, + std::vector const &sourceTokenTags, std::vector &targetTokenTags) { + size_t offset = 0; - // Fill token_tags_target based on the alignments we just made up. + // Fill targetTokenTags based on the alignments we just made up. 
// NOTE: this should match the exact order of Apply() for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { - token_tags_target.push_back(token_tags[token_offset]); // token_tag for sentence ending gap + targetTokenTags.push_back(sourceTokenTags[offset]); // token_tag for sentence ending gap for (size_t t = 0; t < response.target.numWords(sentenceIdx); ++t) { size_t s = alignments[sentenceIdx][t]; assert(s < response.source.numWords(sentenceIdx)); - token_tags_target.push_back(token_tags[token_offset + 1 + s]); // +1 for prefix gap + targetTokenTags.push_back(sourceTokenTags[offset + 1 + s]); // +1 for prefix gap } - token_offset += response.source.numWords(sentenceIdx) + 1; // +1 for prefix gap + offset += response.source.numWords(sentenceIdx) + 1; // +1 for prefix gap } - assert(token_offset < token_tags.size()); - token_tags_target.push_back(token_tags[token_offset]); // token_tag for ending whitespace + assert(offset < sourceTokenTags.size()); + targetTokenTags.push_back(sourceTokenTags[offset]); // token_tag for ending whitespace } -AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &token_tags, +AnnotatedText restoreSource(AnnotatedText const &in, std::vector &token_tags, std::vector::const_iterator span_it, std::vector::const_iterator span_end) { auto prev_it = span_it; // safe because first span is always empty span, and @@ -262,13 +272,13 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t std::string html; HTML::Taint opening, closing; - return Apply(in, [&](ByteRange range, string_view token, bool last) { + return apply(in, [&](ByteRange range, string_view token, bool last) { // Do encoding of any entities that popped up in the translation // (Also effectively clears html from previous call) - EncodeEntities(token, html); + encodeEntities(token, html); size_t offset = 0; // Size added by prepending HTML - size_t whitespace_size = CountPrefixWhitespaces(token); + size_t whitespace_size 
= countPrefixWhitespaces(token); // Close tags we want to show up left (before) the token, but open tags // ideally come directly after any prefix whitespace. However, some tokens @@ -288,7 +298,7 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t // Seek to the last span that overlaps with this token while (true) { - DiffTags(prev_it->tags, span_it->tags, opening, closing); + diffTags(prev_it->tags, span_it->tags, opening, closing); prev_it = span_it; for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) { @@ -321,7 +331,7 @@ AnnotatedText RestoreSource(AnnotatedText const &in, std::vector &t }); } -AnnotatedText RestoreTarget(AnnotatedText const &in, std::vector const &token_tags_target) { +AnnotatedText restoreTarget(AnnotatedText const &in, std::vector const &token_tags_target) { auto token_prev_it = token_tags_target.begin(); auto token_tags_it = token_tags_target.begin() + 1; @@ -329,16 +339,16 @@ AnnotatedText RestoreTarget(AnnotatedText const &in, std::vector co std::string html; HTML::Taint opening, closing; - AnnotatedText out = Apply(in, [&](ByteRange range, string_view token, bool last) { + AnnotatedText out = apply(in, [&](ByteRange range, string_view token, bool last) { // Do encoding of any entities that popped up in the translation // (Also effectively clears html from previous call) - EncodeEntities(token, html); + encodeEntities(token, html); size_t offset = 0; // Size added by prepending HTML - size_t whitespace_size = CountPrefixWhitespaces(token); + size_t whitespace_size = countPrefixWhitespaces(token); assert(token_tags_it != token_tags_target.end()); - DiffTags(*token_prev_it, *token_tags_it, opening, closing); + diffTags(*token_prev_it, *token_tags_it, opening, closing); for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) { std::string close_tag = format("", (*cit)->name); @@ -371,7 +381,7 @@ AnnotatedText RestoreTarget(AnnotatedText const &in, std::vector co return out; } -std::ostream 
&DebugPrintMapping(std::ostream &out, Response const &response, +std::ostream &debugPrintMapping(std::ostream &out, Response const &response, std::vector> const &alignments, std::vector const &token_tags_target) { auto taints = token_tags_target.begin(); @@ -402,7 +412,7 @@ std::ostream &DebugPrintMapping(std::ostream &out, Response const &response, return out; } -std::ostream &DebugPrintAlignmentScores(std::ostream &out, Response const &response) { +std::ostream &debugPrintAlignmentScores(std::ostream &out, Response const &response) { out << "std::vector>> alignments{\n"; for (size_t sentenceIdx = 0; sentenceIdx < response.source.numSentences(); ++sentenceIdx) { out << " {\n"; @@ -420,7 +430,7 @@ std::ostream &DebugPrintAlignmentScores(std::ostream &out, Response const &respo return out << "};\n"; } -size_t DebugCountTokens(AnnotatedText const &text) { +size_t debugCountTokens(AnnotatedText const &text) { size_t tokens = 1; // for the ending gap for (size_t sentenceIdx = 0; sentenceIdx < text.numSentences(); ++sentenceIdx) { tokens += 1 + text.numWords(sentenceIdx); // pre-sentence prefix/gap + each word @@ -430,14 +440,13 @@ size_t DebugCountTokens(AnnotatedText const &text) { } // namespace -namespace marian { -namespace bergamot { +namespace marian::bergamot { HTML::HTML(std::string &&source, bool process_markup) { if (!process_markup) return; std::string original = std::move(source); markup::instream in(original.data(), original.data() + original.size()); - markup::scanner scanner(in); + markup::Scanner scanner(in); source.clear(); // source is moved out of, so should be clear anyway Tag *tag; @@ -446,30 +455,30 @@ HTML::HTML(std::string &&source, bool process_markup) { bool stop = false; while (!stop) { - switch (scanner.next_token()) { - case markup::scanner::TT_ERROR: + switch (scanner.next()) { + case markup::Scanner::TT_ERROR: throw BadHTML("HTML parse error"); - case markup::scanner::TT_EOF: + case markup::Scanner::TT_EOF: stop = true; break; - case 
markup::scanner::TT_TEXT: { + case markup::Scanner::TT_TEXT: { auto begin = source.size(); source.append(scanner.value()); spans_.push_back(Span{begin, source.size(), stack}); - FilterEmpty(stack); + filterEmpty(stack); } break; - case markup::scanner::TT_TAG_START: + case markup::Scanner::TT_TAG_START: // If it makes sense to treat this element as a break in a word (e.g. //
    <p>, <br>,
  <li>) make sure it does so in this text as well. // TODO: Strong assumption here that the language uses spaces to // separate words - if (IsBlockElement(scanner.tag_name()) && !source.empty() && source.back() != ' ') source.push_back(' '); + if (isBlockElement(scanner.tag()) && !source.empty() && source.back() != ' ') source.push_back(' '); // pool_ takes ownership of our tag, makes sure it's freed when necessary - pool_.emplace_back(new Tag{std::string(scanner.tag_name()), std::string(), IsEmptyElement(scanner.tag_name())}); + pool_.emplace_back(new Tag{std::string(scanner.tag()), std::string(), isVoidTag(scanner.tag())}); // Tag *tag is used by attribute parsing tag = pool_.back().get(); @@ -485,27 +494,26 @@ HTML::HTML(std::string &&source, bool process_markup) { } break; - case markup::scanner::TT_TAG_END: + case markup::Scanner::TT_TAG_END: // Note: self-closing tags emit TT_TAG_END immediately after TT_TAG_START // but since we're parsing HTML5, a sole <img/> will never emit a TT_TAG_END - if (stack.empty()) - throw BadHTML(format("Encountered more closing tags ({}) than opening tags", scanner.tag_name())); + if (stack.empty()) throw BadHTML(format("Encountered more closing tags ({}) than opening tags", scanner.tag())); - if (stack.back()->name != scanner.tag_name()) - throw BadHTML(format("Encountered unexpected closing tag </{}>, stack is {}", scanner.tag_name(), stack)); + if (stack.back()->name != scanner.tag()) + throw BadHTML(format("Encountered unexpected closing tag </{}>, stack is {}", scanner.tag(), stack)); // What to do with "<tag></tag>" case, where tag is immediately closed // so it never makes it into the taint of any of the spans? This adds // an empty span so it still gets recorded in spans_. 
- if (spans_.empty() || !ContainsTag(spans_.back().tags, stack.back())) + if (spans_.empty() || !containsTag(spans_.back().tags, stack.back())) spans_.push_back(Span{source.size(), source.size(), stack}); stack.pop_back(); break; - case markup::scanner::TT_ATTR: + case markup::Scanner::TT_ATTRIBUTE: assert(tag != nullptr); - tag->attributes += format(" {}=\"{}\"", scanner.attr_name(), scanner.value()); + tag->attributes += format(" {}=\"{}\"", scanner.attribute(), scanner.value()); break; default: @@ -519,14 +527,14 @@ HTML::HTML(std::string &&source, bool process_markup) { spans_.emplace_back(Span{source.size() + 1, source.size() + 1, stack}); } -void HTML::Restore(Response &response) { +void HTML::restore(Response &response) { // No-op if process_markup was false (and thus spans_ is empty) // TODO: replace this with optional at a higher level if (spans_.empty()) return; // We need alignment info to transfer the HTML tags from the input to the // translation. If those are not available, no HTML in translations for you. - ABORT_UNLESS(HasAlignments(response), + ABORT_UNLESS(hasAlignments(response), "Response object does not contain alignments. TranslationModel or ResponseOptions is misconfigured?"); // Reconstruction of HTML tags: @@ -540,27 +548,26 @@ void HTML::Restore(Response &response) { // Calculating these is a side-effect of restoring // the HTML in response.source. - AnnotatedText source = RestoreSource(response.source, token_tags, spans_.cbegin(), spans_.cend()); - assert(token_tags.size() == DebugCountTokens(response.source)); + AnnotatedText source = restoreSource(response.source, token_tags, spans_.cbegin(), spans_.cend()); + assert(token_tags.size() == debugCountTokens(response.source)); // Find for every token in target the token in source that best matches. 
std::vector> alignments; - HardAlignments(response, alignments); + hardAlignments(response, alignments); std::vector token_tags_target; token_tags_target.emplace_back(); // add empty one to the beginning for easy // life later on (we start iterating at 1, // and can then do i - 1 for empty. - CopyTaint(response, alignments, token_tags, token_tags_target); - assert(token_tags_target.size() == DebugCountTokens(response.target) + 1); + copyTaint(response, alignments, token_tags, token_tags_target); + assert(token_tags_target.size() == debugCountTokens(response.target) + 1); // DebugPrintMapping(std::cerr, response, alignments, token_tags_target); - AnnotatedText target = RestoreTarget(response.target, token_tags_target); + AnnotatedText target = restoreTarget(response.target, token_tags_target); response.source = source; response.target = target; } -} // namespace bergamot -} // namespace marian +} // namespace marian::bergamot diff --git a/src/translator/html.h b/src/translator/html.h index ba4691541..5ddb3d006 100644 --- a/src/translator/html.h +++ b/src/translator/html.h @@ -34,7 +34,7 @@ class HTML { }; explicit HTML(std::string &&source, bool process_markup); - void Restore(Response &response); + void restore(Response &response); private: // List of text spans, and which tags are applied to them diff --git a/src/translator/response_builder.h b/src/translator/response_builder.h index b9d163a2e..baa648850 100644 --- a/src/translator/response_builder.h +++ b/src/translator/response_builder.h @@ -64,7 +64,7 @@ class ResponseBuilder { if (responseOptions_.alignment) { buildAlignments(histories, response); } - html_.Restore(response); + html_.restore(response); callback_(std::move(response)); } diff --git a/src/translator/xh_scanner.cpp b/src/translator/xh_scanner.cpp index bb72f8020..85eb7e972 100644 --- a/src/translator/xh_scanner.cpp +++ b/src/translator/xh_scanner.cpp @@ -9,14 +9,14 @@ namespace { -// Simple replacement for str.ends_with(compile-time C string) +// 
Simple replacement for string_view.ends_with(compile-time C string) template -inline bool ends_with(markup::string_ref &str, const Char_t (&suffix)[Len]) { +inline bool endsWith(markup::string_ref &str, const Char_t (&suffix)[Len]) { size_t offset = str.size - (Len - 1); return offset <= str.size && std::memcmp(str.data + offset, suffix, Len - 1) == 0; } -inline bool equals_case_insensitive(const char *lhs, const char *rhs, size_t len) { +inline bool equalsCaseInsensitive(const char *lhs, const char *rhs, size_t len) { for (size_t i = 0; i < len; ++i) { // cast to unsigned char otherwise std::tolower has undefined behaviour if (std::tolower(static_cast(lhs[i])) != std::tolower(static_cast(rhs[i]))) @@ -28,8 +28,8 @@ inline bool equals_case_insensitive(const char *lhs, const char *rhs, size_t len // Alias for the above, but with compile-time known C string template -inline bool equals_case_insensitive(markup::string_ref &lhs, const char (&rhs)[Len]) { - return lhs.size == Len - 1 && equals_case_insensitive(lhs.data, rhs, Len); +inline bool equalsCaseInsensitive(markup::string_ref &lhs, const char (&rhs)[Len]) { + return lhs.size == Len - 1 && equalsCaseInsensitive(lhs.data, rhs, Len - 1); } template @@ -43,22 +43,22 @@ namespace markup { // case sensitive string equality test // s_lowcase shall be lowercase string -std::string_view scanner::value() const { return std::string_view(value_.data, value_.size); } +std::string_view Scanner::value() const { return std::string_view(value_.data, value_.size); } -std::string_view scanner::attr_name() const { return std::string_view(attr_name_.data, attr_name_.size); } +std::string_view Scanner::attribute() const { return std::string_view(attributeName_.data, attributeName_.size); } -std::string_view scanner::tag_name() const { return std::string_view(tag_name_.data, tag_name_.size); } +std::string_view Scanner::tag() const { return std::string_view(tagName_.data, tagName_.size); } -scanner::token_type scanner::scan_body() { 
+Scanner::TokenType Scanner::scanBody() { value_ = string_ref{input_.pos(), 0}; switch (input_.peek()) { case '\0': return TT_EOF; case '<': - return scan_tag(); + return scanTag(); case '&': - return scan_entity(TT_TEXT); + return scanEntity(TT_TEXT); } while (true) { @@ -79,50 +79,50 @@ scanner::token_type scanner::scan_body() { // ... // |------------| // Followed by: -// - scan_special if + // ^-- or here + // + // ^-- or here + // comes after TT_COMMENT_START, TT_PI_START, or TT_TAG_START + // if the tag was - token_type scan_special(); + TokenType scanSpecial(); // Consumes - token_type scan_tag(); + TokenType scanTag(); // Consumes '&' etc, emits parent_token_type - token_type scan_entity(token_type parent_token_type); + TokenType scanEntity(TokenType parentTokenType); - size_t skip_whitespace(); + size_t skipWhitespace(); - bool resolve_entity(string_ref const &buffer, string_ref &decoded) const; + bool resolveEntity(string_ref const &buffer, string_ref &decoded) const; - static bool is_whitespace(char c); + static bool isWhitespace(char c); private: /* data */ string_ref value_; - string_ref tag_name_; - string_ref attr_name_; + string_ref tagName_; + string_ref attributeName_; + + ScanPtr scanFun_; // current 'reader' instream &input_; - bool got_tail; // aux flag used in scan_comment + bool gotTail_; // aux flag used in scanComment, scanSpecial, scanProcessingInstruction }; } // namespace markup From bcbbfe129525ed2be8dbf00d2da9d412667f1d8d Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Tue, 21 Dec 2021 09:22:37 +0000 Subject: [PATCH 320/442] Better command-line with isolation for both Services and co-located defaults and parsing (#252) * CLI Rework * Consolidate common tests, template specialize CLI * Remove remnant cache stuff * [BRT]: Run BRT with new cli * Formalizing bridge * Removing stuff from parsing and moving to TestSuite * Template includes, everything consolidating at tests * Inlining readFromStdin * Removing unnecessary headers * Checking 
in template implementation which was missing * Sane defaults, some catches at BRT * BRT: Install fixes * Updating marian-dev to point to main * Removing the enum indirection, using strings at one place, directly * Fix typo; * [BRT] test blocking service via native * Conservative defaults for workers and cache-mutex buckets in AsyncService * Create proper barriers for cmdline app * Build failure fixes * Moving common, common-impl to a familiar structure * Binary reorganization: async, blocking, wasm - async tests AsyncService - blocking tests BlockingService - wasm arranges tests for things that are Mozilla requirements. eg: - bytearray - multiple sentences in same translate request workflow. * [brt] updates to adapt to cli rework * [brt] updates to adapt to cli rework, all working * Empty commit, sync brt online and run GitHub CI * Switch for parser to have multiple mode or not * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Removing remnant faithful translation test from blocking/ --- app/bergamot.cpp | 52 +++-- app/cli.h | 174 ---------------- bergamot-translator-tests | 2 +- src/tests/CMakeLists.txt | 20 +- src/tests/apps.cpp | 146 ------------- src/tests/apps.h | 46 ----- src/tests/async.cpp | 27 +++ src/tests/blocking.cpp | 25 +++ src/tests/cli.cpp | 53 ----- src/tests/common-impl.cpp | 192 ++++++++++++++++++ src/tests/common.h | 88 ++++++++ ...ntgemm_resolve.cpp => intgemm-resolve.cpp} | 0 src/tests/wasm.cpp | 53 +++++ src/translator/parser.cpp | 84 -------- src/translator/parser.h | 96 +++++---- src/translator/service.h | 22 +- src/translator/utils.h | 15 ++ 17 files changed, 522 insertions(+), 573 deletions(-) delete mode 100644 app/cli.h delete mode 100644 src/tests/apps.cpp delete mode 100644 src/tests/apps.h create mode 100644 src/tests/async.cpp create mode 100644 src/tests/blocking.cpp delete mode 100644 src/tests/cli.cpp create mode 100644 src/tests/common-impl.cpp create 
mode 100644 src/tests/common.h rename src/tests/{intgemm_resolve.cpp => intgemm-resolve.cpp} (100%) create mode 100644 src/tests/wasm.cpp create mode 100644 src/translator/utils.h diff --git a/app/bergamot.cpp b/app/bergamot.cpp index bffbbb112..5629f9110 100644 --- a/app/bergamot.cpp +++ b/app/bergamot.cpp @@ -1,22 +1,42 @@ -#include "cli.h" +#include "translator/byte_array_util.h" +#include "translator/parser.h" +#include "translator/response.h" +#include "translator/response_options.h" +#include "translator/service.h" +#include "translator/utils.h" int main(int argc, char *argv[]) { - marian::bergamot::ConfigParser configParser; + using namespace marian::bergamot; + ConfigParser configParser("Bergamot CLI", /*multiOpMode=*/false); configParser.parseArgs(argc, argv); auto &config = configParser.getConfig(); - using namespace marian::bergamot; - switch (config.opMode) { - case OpMode::APP_WASM: - app::wasm(config); - break; - case OpMode::APP_NATIVE: - app::native(config); - break; - case OpMode::APP_DECODER: - app::decoder(config); - break; - default: - break; - } + + AsyncService service(config.serviceConfig); + + // Construct a model. + auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); + + MemoryBundle memoryBundle; + std::shared_ptr<TranslationModel> model = service.createCompatibleModel(options, std::move(memoryBundle)); + + ResponseOptions responseOptions; + std::string input = readFromStdin(); + + // Create a barrier using future/promise. + std::promise<Response> promise; + std::future<Response> future = promise.get_future(); + auto callback = [&promise](Response &&response) { + // Fulfill promise. + promise.set_value(std::move(response)); + }; + + service.translate(model, std::move(input), callback, responseOptions); + + // Wait until promise sets the response. + Response response = future.get(); + + // Print (only) translated text. 
+ std::cout << response.target.text; + return 0; } diff --git a/app/cli.h b/app/cli.h deleted file mode 100644 index 08f203466..000000000 --- a/app/cli.h +++ /dev/null @@ -1,174 +0,0 @@ -#ifndef BERGAMOT_APP_CLI_H -#define BERGAMOT_APP_CLI_H -#include -#include -#include -#include -#include - -#include "common/definitions.h" -#include "common/timer.h" -#include "common/utils.h" -#include "marian.h" -#include "translator/byte_array_util.h" -#include "translator/parser.h" -#include "translator/response.h" -#include "translator/response_options.h" -#include "translator/service.h" - -namespace marian { -namespace bergamot { - -// marian::bergamot:: makes life easier, won't need to prefix it everywhere and these classes plenty use constructs. - -namespace app { - -/// Previously bergamot-translator-app. Provides a command-line app on native which executes the code-path used by Web -/// Assembly. Expected to be maintained consistent with how the browser (Mozilla through WebAssembly) dictates its API -/// and tests be intact. Also used in [bergamot-evaluation](https://github.com/mozilla/bergamot-evaluation). -/// -/// Usage example: -/// [brt/tests/basic/test_bergamot_translator_app_intgemm_8bit.cpu-threads.0.sh](https://github.com/browsermt/bergamot-translator-tests/blob/main/tests/basic/test_bergamot_translator_app_intgemm_8bit.cpu-threads.0.sh) -/// -/// * Input : read from stdin as sentences as lines of text. -/// * Output: written to stdout as translations for the sentences supplied in corresponding lines -/// -/// @param [options]: Options to translate passed down to marian through Options. -void wasm(const CLIConfig &config) { - // Here, we take the command-line interface which is uniform across all apps. This is parsed into Ptr by - // marian. However, mozilla does not allow a Ptr constructor and demands an std::string constructor since - // std::string isn't marian internal unlike Ptr. 
Since this std::string path needs to be tested for mozilla - // and since this class/CLI is intended at testing mozilla's path, we go from: - // - // cmdline -> Ptr -> std::string -> TranslationModel(std::string) - // - // Overkill, yes. - - const std::string &modelConfigPath = config.modelConfigPaths.front(); - - Ptr options = parseOptionsFromFilePath(modelConfigPath); - MemoryBundle memoryBundle = getMemoryBundleFromConfig(options); - - BlockingService::Config serviceConfig; - BlockingService service(serviceConfig); - - std::shared_ptr translationModel = - std::make_shared(options->asYamlString(), std::move(memoryBundle)); - - ResponseOptions responseOptions; - if (config.html) { - responseOptions.HTML = true; - responseOptions.alignment = true; // Necessary for HTML - } - std::vector texts; - - // Hide the translateMultiple operation - for (std::string line; std::getline(std::cin, line);) { - texts.emplace_back(line); - } - - auto results = service.translateMultiple(translationModel, std::move(texts), responseOptions); - - for (auto &result : results) { - std::cout << result.getTranslatedText() << std::endl; - } -} - -/// Application used to benchmark with marian-decoder from time-to-time. The implementation in this repository follows a -/// different route than marian-decoder and routinely needs to be checked that the speeds while operating similar to -/// marian-decoder are not affected during the course of development. -/// -/// Example usage: -/// [brt/speed-tests/test_wngt20_perf.sh](https://github.com/browsermt/bergamot-translator-tests/blob/main/speed-tests/test_wngt20_perf.sh). -/// -/// Expected to be compatible with Translator[1] and marian-decoder[2]. 
-/// -/// - [1] -/// [marian-dev/../src/translator/translator.h](https://github.com/marian-nmt/marian-dev/blob/master/src/translator/translator.h) -/// - [2] -/// [marian-dev/../src/command/marian_decoder.cpp](https://github.com/marian-nmt/marian/blob/master/src/command/marian_decoder.cpp) -/// -/// * Input: stdin, lines containing sentences, same as marian-decoder. -/// * Output: to stdout, translations of the sentences supplied via stdin in corresponding lines -/// -/// @param [in] options: constructed from command-line supplied arguments -void decoder(const CLIConfig &config) { - marian::timer::Timer decoderTimer; - AsyncService::Config asyncConfig{config.numWorkers}; - AsyncService service(asyncConfig); - auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); - MemoryBundle memoryBundle; - Ptr translationModel = service.createCompatibleModel(options, std::move(memoryBundle)); - // Read a large input text blob from stdin - std::ostringstream std_input; - std_input << std::cin.rdbuf(); - std::string input = std_input.str(); - - // Wait on future until Response is complete - std::promise responsePromise; - std::future responseFuture = responsePromise.get_future(); - auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - - service.translate(translationModel, std::move(input), std::move(callback)); - responseFuture.wait(); - const Response &response = responseFuture.get(); - - for (size_t sentenceIdx = 0; sentenceIdx < response.size(); sentenceIdx++) { - std::cout << response.target.sentence(sentenceIdx) << "\n"; - } - - std::cerr << "Total time: " << std::setprecision(5) << decoderTimer.elapsed() << "s wall" << std::endl; -} - -/// Command line interface to the test the features being developed as part of bergamot C++ library on native platform. 
-/// -/// Usage example: -/// [brt/tests/basic/test_service-cli_intgemm_8bit.cpu-threads.4.sh](https://github.com/browsermt/bergamot-translator-tests/blob/main/tests/basic/test_service-cli_intgemm_8bit.cpu-threads.4.sh) -/// -/// * Input: reads from stdin, blob of text, read as a whole ; sentence-splitting etc handled internally. -/// * Output: to stdout, translation of the source text faithful to source structure. -/// -/// @param [in] options: options to build translator -void native(const CLIConfig &config) { - AsyncService::Config asyncConfig{config.numWorkers}; - AsyncService service(asyncConfig); - - auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); - // Prepare memories for bytearrays (including model, shortlist and vocabs) - MemoryBundle memoryBundle; - if (config.byteArray) { - // Load legit values into bytearrays. - memoryBundle = getMemoryBundleFromConfig(options); - } - - Ptr translationModel = service.createCompatibleModel(options, std::move(memoryBundle)); - - // Read a large input text blob from stdin - std::ostringstream std_input; - std_input << std::cin.rdbuf(); - std::string input = std_input.str(); - - ResponseOptions responseOptions; - if (config.html) { - responseOptions.HTML = true; - responseOptions.alignment = true; // Necessary for HTML - } - - // Wait on future until Response is complete - std::promise responsePromise; - std::future responseFuture = responsePromise.get_future(); - auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - - service.translate(translationModel, std::move(input), std::move(callback), responseOptions); - responseFuture.wait(); - Response response = responseFuture.get(); - - std::cout << response.target.text; -} - -} // namespace app - -} // namespace bergamot -} // namespace marian - -#endif // BERGAMOT_APP_CLI_H diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 9344b9835..5524e37a0 160000 --- 
a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 9344b9835797f7c19ee49d30bff134b74a1a336e +Subproject commit 5524e37a01920dc5149dcc87b047615c6a70aa53 diff --git a/src/tests/CMakeLists.txt b/src/tests/CMakeLists.txt index 483bd075f..86fe00236 100644 --- a/src/tests/CMakeLists.txt +++ b/src/tests/CMakeLists.txt @@ -13,20 +13,12 @@ endif (COMPILE_UNIT_TESTS) if(NOT MSVC) # Testing apps - set(APP_TESTS) - add_executable("bergamot-test" "cli.cpp" "apps.cpp") - - if(CUDA_FOUND) - target_link_libraries("bergamot-test" bergamot-translator) - else(CUDA_FOUND) - target_link_libraries("bergamot-test" bergamot-translator) - endif(CUDA_FOUND) - - set_target_properties("bergamot-test" PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}") + set(TEST_BINARIES async blocking intgemm-resolve wasm) + foreach(binary ${TEST_BINARIES}) + add_executable("${binary}" "${binary}.cpp") + target_link_libraries("${binary}" bergamot-translator) + set_target_properties("${binary}" PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/tests/") + endforeach(binary) - # Adding an intgemm_resolve cmdline - add_executable(intgemm-resolve intgemm_resolve.cpp) - target_link_libraries(intgemm-resolve PRIVATE bergamot-translator) - set_target_properties(intgemm-resolve PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}") endif(NOT MSVC) diff --git a/src/tests/apps.cpp b/src/tests/apps.cpp deleted file mode 100644 index 20c6d2acb..000000000 --- a/src/tests/apps.cpp +++ /dev/null @@ -1,146 +0,0 @@ -#include "apps.h" - -namespace marian { -namespace bergamot { - -namespace { - -std::string readFromStdin() { - // Read a large input text blob from stdin - std::ostringstream inputStream; - inputStream << std::cin.rdbuf(); - std::string input = inputStream.str(); - return input; -} - -// Utility function, common for all testapps. 
-Response translateForResponse(AsyncService &service, Ptr model, std::string &&source, - ResponseOptions responseOptions) { - std::promise responsePromise; - std::future responseFuture = responsePromise.get_future(); - - auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; - service.translate(model, std::move(source), callback, responseOptions); - - responseFuture.wait(); - - Response response = responseFuture.get(); - return response; -} - -} // namespace - -namespace testapp { - -void annotatedTextWords(AsyncService &service, Ptr model, bool sourceSide) { - ResponseOptions responseOptions; - std::string source = readFromStdin(); - Response response = translateForResponse(service, model, std::move(source), responseOptions); - AnnotatedText &annotatedText = sourceSide ? response.source : response.target; - for (size_t s = 0; s < annotatedText.numSentences(); s++) { - for (size_t w = 0; w < annotatedText.numWords(s); w++) { - std::cout << (w == 0 ? "" : "\t"); - std::cout << annotatedText.word(s, w); - } - std::cout << "\n"; - } -} - -void annotatedTextSentences(AsyncService &service, Ptr model, bool sourceSide) { - ResponseOptions responseOptions; - std::string source = readFromStdin(); - Response response = translateForResponse(service, model, std::move(source), responseOptions); - AnnotatedText &annotatedText = sourceSide ? 
response.source : response.target; - for (size_t s = 0; s < annotatedText.numSentences(); s++) { - std::cout << annotatedText.sentence(s) << "\n"; - } -} - -void forwardAndBackward(AsyncService &service, std::vector> &models) { - ABORT_IF(models.size() != 2, "Forward and backward test needs two models."); - ResponseOptions responseOptions; - std::string source = readFromStdin(); - Response forwardResponse = translateForResponse(service, models.front(), std::move(source), responseOptions); - - // Make a copy of target - std::string target = forwardResponse.target.text; - Response backwardResponse = translateForResponse(service, models.back(), std::move(target), responseOptions); - - // Print both onto the command-line - std::cout << forwardResponse.source.text; - std::cout << "----------------\n"; - std::cout << forwardResponse.target.text; - std::cout << "----------------\n"; - std::cout << backwardResponse.target.text; -} - -void qualityEstimatorWords(AsyncService &service, Ptr model) { - ResponseOptions responseOptions; - responseOptions.qualityScores = true; - std::string source = readFromStdin(); - const Response response = translateForResponse(service, model, std::move(source), responseOptions); - - for (const auto &sentenceQualityEstimate : response.qualityScores) { - std::cout << "[SentenceBegin]\n"; - - for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) { - const string_view word(response.target.text.data() + wordByteRange.begin, wordByteRange.size()); - std::cout << word << "\n"; - } - std::cout << "[SentenceEnd]\n\n"; - } -} - -void qualityEstimatorScores(AsyncService &service, Ptr model) { - ResponseOptions responseOptions; - responseOptions.qualityScores = true; - - std::string source = readFromStdin(); - const Response response = translateForResponse(service, model, std::move(source), responseOptions); - - for (const auto &sentenceQualityEstimate : response.qualityScores) { - std::cout << std::fixed << std::setprecision(3) << 
sentenceQualityEstimate.sentenceScore << "\n"; - - for (const float &wordScore : sentenceQualityEstimate.wordScores) { - std::cout << std::fixed << std::setprecision(3) << wordScore << "\n"; - } - std::cout << "\n"; - } -} - -void translationCache(AsyncService &service, Ptr model) { - ResponseOptions responseOptions; - - // Read a large input text blob from stdin - const std::string source = readFromStdin(); - - // Round 1 - std::string buffer = source; - Response firstResponse = translateForResponse(service, model, std::move(buffer), responseOptions); - - auto statsFirstRun = service.cacheStats(); - LOG(info, "Cache Hits/Misses = {}/{}", statsFirstRun.hits, statsFirstRun.misses); - ABORT_IF(statsFirstRun.hits != 0, "Expecting no cache hits, but hits found."); - - // Round 2; There should be cache hits - buffer = source; - Response secondResponse = translateForResponse(service, model, std::move(buffer), responseOptions); - - auto statsSecondRun = service.cacheStats(); - LOG(info, "Cache Hits/Misses = {}/{}", statsSecondRun.hits, statsSecondRun.misses); - ABORT_IF(statsSecondRun.hits <= 0, "At least one hit expected, none found."); - if (statsSecondRun.hits != statsFirstRun.misses) { - std::cerr << "Mismatch in expected hits (Hits, Misses = " << statsSecondRun.hits << ", " << statsSecondRun.misses - << "). This can happen due to random eviction." << std::endl; - } - - ABORT_IF(firstResponse.target.text != secondResponse.target.text, - "Recompiled string provided different output when operated with cache. 
On the same hardware while using " - "same path, this is expected to be same."); - - std::cout << firstResponse.target.text; -} - -} // namespace testapp -} // namespace bergamot -} // namespace marian diff --git a/src/tests/apps.h b/src/tests/apps.h deleted file mode 100644 index 9e45a1caa..000000000 --- a/src/tests/apps.h +++ /dev/null @@ -1,46 +0,0 @@ -#ifndef BERGAMOT_SRC_TESTS_APPS_H -#define BERGAMOT_SRC_TESTS_APPS_H -#include -#include -#include -#include -#include - -#include "common/definitions.h" -#include "common/timer.h" -#include "common/utils.h" -#include "marian.h" -#include "translator/byte_array_util.h" -#include "translator/parser.h" -#include "translator/response.h" -#include "translator/response_options.h" -#include "translator/service.h" - -namespace marian { -namespace bergamot { - -namespace testapp { - -// Reads from stdin and translates. Prints the tokens separated by space for each sentence. Prints words from source -// side text annotation if source=true, target annotation otherwise. -void annotatedTextWords(AsyncService &service, Ptr model, bool source = true); - -// Reads from stdin and translates the read content. Prints the sentences in source or target in constructed response -// in each line, depending on source = true or false respectively. -void annotatedTextSentences(AsyncService &service, Ptr model, bool source = true); - -void forwardAndBackward(AsyncService &service, std::vector> &models); - -// Reads from stdin and translates the read content. Prints the quality words for each sentence. -void qualityEstimatorWords(AsyncService &service, Ptr model); - -// Reads from stdin and translates the read content. Prints the quality scores for each sentence. 
-void qualityEstimatorScores(AsyncService &service, Ptr model); - -// Tests if cache is active and functional -void translationCache(AsyncService &service, Ptr model); -} // namespace testapp -} // namespace bergamot -} // namespace marian - -#endif // BERGAMOT_SRC_TESTS_APPS_H diff --git a/src/tests/async.cpp b/src/tests/async.cpp new file mode 100644 index 000000000..25ba334ae --- /dev/null +++ b/src/tests/async.cpp @@ -0,0 +1,27 @@ +#include "common.h" +#include "translator/parser.h" +#include "translator/service.h" +#include "translator/translation_model.h" + +using namespace marian::bergamot; + +int main(int argc, char *argv[]) { + ConfigParser configParser("AsyncService test-suite", /*multiOpMode=*/true); + configParser.parseArgs(argc, argv); + auto &config = configParser.getConfig(); + + AsyncService service(config.serviceConfig); + + std::vector> models; + + for (auto &modelConfigPath : config.modelConfigPaths) { + TranslationModel::Config modelConfig = parseOptionsFromFilePath(modelConfigPath); + std::shared_ptr model = service.createCompatibleModel(modelConfig); + models.push_back(model); + } + + TestSuite testSuite(service); + testSuite.run(config.opMode, models); + + return 0; +} diff --git a/src/tests/blocking.cpp b/src/tests/blocking.cpp new file mode 100644 index 000000000..3bbb45634 --- /dev/null +++ b/src/tests/blocking.cpp @@ -0,0 +1,25 @@ +#include "common.h" +using namespace marian::bergamot; + +int main(int argc, char *argv[]) { + ConfigParser configParser("BlockingService test-suite", /*multiOpMode=*/true); + configParser.parseArgs(argc, argv); + + auto &config = configParser.getConfig(); + BlockingService service(config.serviceConfig); + + TestSuite testSuite(service); + std::vector> models; + + for (auto &modelConfigPath : config.modelConfigPaths) { + TranslationModel::Config modelConfig = parseOptionsFromFilePath(modelConfigPath); + std::shared_ptr model = std::make_shared(modelConfig); + models.push_back(model); + } + + /// WASM is one 
special case where WASM path is being checked, involving translateMultiple and a multi-line feed. + /// Hence we do not bind it at a single input-blob single Response constraint imposed by the TestSuite. + testSuite.run(config.opMode, models); + + return 0; +} diff --git a/src/tests/cli.cpp b/src/tests/cli.cpp deleted file mode 100644 index ba4d73218..000000000 --- a/src/tests/cli.cpp +++ /dev/null @@ -1,53 +0,0 @@ -#include "apps.h" - -int main(int argc, char *argv[]) { - using namespace marian::bergamot; - marian::bergamot::ConfigParser configParser; - configParser.parseArgs(argc, argv); - auto &config = configParser.getConfig(); - AsyncService::Config serviceConfig; - serviceConfig.numWorkers = config.numWorkers; - serviceConfig.cacheEnabled = config.cacheEnabled; - serviceConfig.cacheMutexBuckets = config.cacheMutexBuckets; - serviceConfig.cacheSize = config.cacheSize; - AsyncService service(serviceConfig); - std::vector> models; - - for (auto &modelConfigPath : config.modelConfigPaths) { - TranslationModel::Config modelConfig = parseOptionsFromFilePath(modelConfigPath); - std::shared_ptr model = service.createCompatibleModel(modelConfig); - models.push_back(model); - } - - switch (config.opMode) { - case OpMode::TEST_SOURCE_SENTENCES: - testapp::annotatedTextSentences(service, models.front(), /*source=*/true); - break; - case OpMode::TEST_TARGET_SENTENCES: - testapp::annotatedTextSentences(service, models.front(), /*source=*/false); - break; - case OpMode::TEST_SOURCE_WORDS: - testapp::annotatedTextWords(service, models.front(), /*source=*/true); - break; - case OpMode::TEST_TARGET_WORDS: - testapp::annotatedTextWords(service, models.front(), /*source=*/false); - break; - case OpMode::TEST_FORWARD_BACKWARD_FOR_OUTBOUND: - testapp::forwardAndBackward(service, models); - break; - case OpMode::TEST_QUALITY_ESTIMATOR_WORDS: - testapp::qualityEstimatorWords(service, models.front()); - break; - case OpMode::TEST_QUALITY_ESTIMATOR_SCORES: - 
testapp::qualityEstimatorScores(service, models.front()); - break; - case OpMode::TEST_TRANSLATION_CACHE: - testapp::translationCache(service, models.front()); - break; - - default: - ABORT("Incompatible op-mode. Choose one of the test modes."); - break; - } - return 0; -} diff --git a/src/tests/common-impl.cpp b/src/tests/common-impl.cpp new file mode 100644 index 000000000..49ebfc53c --- /dev/null +++ b/src/tests/common-impl.cpp @@ -0,0 +1,192 @@ + +#ifndef BERGAMOT_TESTS_COMMON_IMPL +#error "This is an impl file and must not be included directly!" +#endif + +Response Bridge::translate(BlockingService &service, std::shared_ptr &model, + std::string &&source, const ResponseOptions &responseOptions) { + // project source to a vector of std::string, send in, unpack the first element from + // vector, return. + std::vector sources = {source}; + return service.translateMultiple(model, std::move(sources), responseOptions).front(); +} + +Response Bridge::translate(AsyncService &service, std::shared_ptr &model, + std::string &&source, const ResponseOptions &responseOptions) { + // downgrade to blocking via promise, future, wait and return response; + std::promise responsePromise; + std::future responseFuture = responsePromise.get_future(); + + auto callback = [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); }; + service.translate(model, std::move(source), callback, responseOptions); + + responseFuture.wait(); + + Response response = responseFuture.get(); + return response; +} + +template +TestSuite::TestSuite(Service &service) : service_{service} {} + +template +void TestSuite::TestSuite::run(const std::string &opModeAsString, std::vector> &models) { + if (opModeAsString == "decoder") { + benchmarkDecoder(models.front()); + } else if (opModeAsString == "test-response-source-sentences") { + annotatedTextSentences(models.front(), /*source=*/true); + } else if (opModeAsString == "test-response-target-sentences") { + 
annotatedTextSentences(models.front(), /*source=*/false); + } else if (opModeAsString == "test-response-source-words") { + annotatedTextWords(models.front(), /*source=*/true); + } else if (opModeAsString == "test-response-target-words") { + annotatedTextWords(models.front(), /*source=*/false); + } else if (opModeAsString == "test-forward-backward") { + forwardAndBackward(models); + } else if (opModeAsString == "test-quality-estimator-words") { + qualityEstimatorWords(models.front()); + } else if (opModeAsString == "test-quality-estimator-scores") { + qualityEstimatorScores(models.front()); + } else if (opModeAsString == "test-translation-cache") { + translationCache(models.front()); + } else { + std::cerr << "Incompatible test mode. Choose from the one of the valid test-modes"; + std::abort(); + } +} + +template +void TestSuite::benchmarkDecoder(Ptr &model) { + marian::timer::Timer decoderTimer; + std::string source = readFromStdin(); + + ResponseOptions responseOptions; + Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + + for (size_t sentenceIdx = 0; sentenceIdx < response.size(); sentenceIdx++) { + std::cout << response.target.sentence(sentenceIdx) << "\n"; + } + + std::cerr << "Total time: " << std::setprecision(5) << decoderTimer.elapsed() << "s wall" << std::endl; +} + +// Reads from stdin and translates. Prints the tokens separated by space for each sentence. Prints words from source +// side text annotation if source=true, target annotation otherwise. +template +void TestSuite::annotatedTextWords(Ptr model, bool sourceSide /*=true*/) { + ResponseOptions responseOptions; + std::string source = readFromStdin(); + Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + AnnotatedText &annotatedText = sourceSide ? 
response.source : response.target; + for (size_t s = 0; s < annotatedText.numSentences(); s++) { + for (size_t w = 0; w < annotatedText.numWords(s); w++) { + std::cout << (w == 0 ? "" : "\t"); + std::cout << annotatedText.word(s, w); + } + std::cout << "\n"; + } +} + +// Reads from stdin and translates the read content. Prints the sentences in source or target in constructed response +// in each line, depending on source = true or false respectively. +template +void TestSuite::annotatedTextSentences(Ptr model, bool sourceSide /*=true*/) { + ResponseOptions responseOptions; + std::string source = readFromStdin(); + Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + AnnotatedText &annotatedText = sourceSide ? response.source : response.target; + for (size_t s = 0; s < annotatedText.numSentences(); s++) { + std::cout << annotatedText.sentence(s) << "\n"; + } +} + +template +void TestSuite::forwardAndBackward(std::vector> &models) { + ABORT_IF(models.size() != 2, "Forward and backward test needs two models."); + ResponseOptions responseOptions; + std::string source = readFromStdin(); + Response forwardResponse = bridge_.translate(service_, models.front(), std::move(source), responseOptions); + + // Make a copy of target + std::string target = forwardResponse.target.text; + Response backwardResponse = bridge_.translate(service_, models.back(), std::move(target), responseOptions); + + // Print both onto the command-line + std::cout << forwardResponse.source.text; + std::cout << "----------------\n"; + std::cout << forwardResponse.target.text; + std::cout << "----------------\n"; + std::cout << backwardResponse.target.text; +} + +// Reads from stdin and translates the read content. Prints the quality words for each sentence. 
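The `Bridge<AsyncService>::translate` implementation earlier in this file downgrades a callback-based asynchronous API to a blocking call via a `std::promise`/`std::future` pair. Stripped of the bergamot types, the pattern can be sketched as follows — the `Response` struct and `translateAsync` API here are stand-ins for illustration, not the real bergamot signatures:

```cpp
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Stand-in for a service result type.
struct Response {
  std::string text;
};

// Stand-in for an async API that reports its result through a callback on a worker thread.
void translateAsync(std::string source, std::function<void(Response &&)> callback) {
  std::thread([source = std::move(source), callback = std::move(callback)]() {
    callback(Response{source + " (translated)"});
  }).detach();
}

// Blocking wrapper: park the result in a promise, then wait on the matching future.
// The promise outlives the callback because we block in this frame until it is fulfilled.
Response translateBlocking(std::string source) {
  std::promise<Response> responsePromise;
  std::future<Response> responseFuture = responsePromise.get_future();

  translateAsync(std::move(source),
                 [&responsePromise](Response &&response) { responsePromise.set_value(std::move(response)); });

  responseFuture.wait();  // block until the worker thread calls set_value
  return responseFuture.get();
}
```

The same shape works for any callback-style API; the only requirement is that the wrapper's stack frame stays alive (here, by blocking) until the callback fires.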
+template +void TestSuite::qualityEstimatorWords(Ptr model) { + ResponseOptions responseOptions; + responseOptions.qualityScores = true; + std::string source = readFromStdin(); + const Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + + for (const auto &sentenceQualityEstimate : response.qualityScores) { + std::cout << "[SentenceBegin]\n"; + + for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) { + const string_view word(response.target.text.data() + wordByteRange.begin, wordByteRange.size()); + std::cout << word << "\n"; + } + std::cout << "[SentenceEnd]\n\n"; + } +} + +// Reads from stdin and translates the read content. Prints the quality scores for each sentence. +template +void TestSuite::qualityEstimatorScores(Ptr model) { + ResponseOptions responseOptions; + responseOptions.qualityScores = true; + + std::string source = readFromStdin(); + const Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + + for (const auto &sentenceQualityEstimate : response.qualityScores) { + std::cout << std::fixed << std::setprecision(3) << sentenceQualityEstimate.sentenceScore << "\n"; + + for (const float &wordScore : sentenceQualityEstimate.wordScores) { + std::cout << std::fixed << std::setprecision(3) << wordScore << "\n"; + } + std::cout << "\n"; + } +} + +template +void TestSuite::translationCache(Ptr model) { + ResponseOptions responseOptions; + + // Read a large input text blob from stdin + const std::string source = readFromStdin(); + + // Round 1 + std::string buffer = source; + Response firstResponse = bridge_.translate(service_, model, std::move(buffer), responseOptions); + + auto statsFirstRun = service_.cacheStats(); + LOG(info, "Cache Hits/Misses = {}/{}", statsFirstRun.hits, statsFirstRun.misses); + ABORT_IF(statsFirstRun.hits != 0, "Expecting no cache hits, but hits found."); + + // Round 2; There should be cache hits + buffer = source; + Response 
secondResponse = bridge_.translate(service_, model, std::move(buffer), responseOptions); + + auto statsSecondRun = service_.cacheStats(); + LOG(info, "Cache Hits/Misses = {}/{}", statsSecondRun.hits, statsSecondRun.misses); + ABORT_IF(statsSecondRun.hits <= 0, "At least one hit expected, none found."); + if (statsSecondRun.hits != statsFirstRun.misses) { + std::cerr << "Mismatch in expected hits (Hits, Misses = " << statsSecondRun.hits << ", " << statsSecondRun.misses + << "). This can happen due to random eviction." << std::endl; + } + + ABORT_IF(firstResponse.target.text != secondResponse.target.text, + "Recompiled string provided different output when operated with cache. On the same hardware while using " + "same path, this is expected to be same."); + + std::cout << firstResponse.target.text; +} diff --git a/src/tests/common.h b/src/tests/common.h new file mode 100644 index 000000000..dff47e483 --- /dev/null +++ b/src/tests/common.h @@ -0,0 +1,88 @@ +#pragma once +#include +#include +#include +#include +#include +#include + +#include "common/definitions.h" +#include "common/timer.h" +#include "common/utils.h" +#include "marian.h" +#include "translator/byte_array_util.h" +#include "translator/parser.h" +#include "translator/response.h" +#include "translator/response_options.h" +#include "translator/service.h" +#include "translator/utils.h" + +namespace marian::bergamot { + +/// Due to the stubborn-ness of the extension and native to not agree on API (e.g, translateMultiple vs translate), +/// different underlying cache we have the following "bridge" at test-applications - taking into account the fact that +/// the most commonly used primitives across both Services is a single text blob in and corresponding Response out, in a +/// blocking fashion. +/// +/// The following contraption constrains a single sentence to single Response parameterized by Service, in a test-suite +/// below. 
This allows sharing of code for test-suite between WebAssembly's workflows and Native's workflows. +/// +/// The intention here is to use templating to achieve the same thing an ifdef would have at compile-time. Also mandates +/// after bridge layer, both WebAssembly and Native paths compile correctly (this does not guarantee outputs are the +/// same through both code-paths, or that both are tested at runtime - only that both compile and work with a bridge). +/// +/// For any complex workflows involving non-blocking concurrent translation, it is required to write something not +/// constrained by the following. + +template +struct Bridge : public std::false_type {}; + +template <> +struct Bridge : public std::true_type { + Response translate(BlockingService &service, std::shared_ptr &model, std::string &&source, + const ResponseOptions &responseOptions); +}; + +template <> +struct Bridge : public std::true_type { + Response translate(AsyncService &service, std::shared_ptr &model, std::string &&source, + const ResponseOptions &responseOptions); +}; + +template +class TestSuite { + private: + Bridge bridge_; + Service &service_; + + public: + TestSuite(Service &service); + void run(const std::string &opModeAsString, std::vector> &models); + + private: + void benchmarkDecoder(Ptr &model); + + // Reads from stdin and translates. Prints the tokens separated by space for each sentence. Prints words from source + // side text annotation if source=true, target annotation otherwise. + void annotatedTextWords(Ptr model, bool sourceSide = true); + + // Reads from stdin and translates the read content. Prints the sentences in source or target in constructed response + // in each line, depending on source = true or false respectively. + void annotatedTextSentences(Ptr model, bool sourceSide = true); + + void forwardAndBackward(std::vector> &models); + + // Reads from stdin and translates the read content. Prints the quality words for each sentence. 
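The `Bridge` template declared below uses class-template specialization to pick a per-service implementation at compile time, playing the role an `#ifdef` otherwise would: the primary template derives from `std::false_type` and has no `translate()`, so only the explicitly specialized services are usable. A reduced sketch of the mechanism, with toy services and a string-returning `translate` standing in for the real `BlockingService`/`AsyncService` signatures:

```cpp
#include <string>
#include <type_traits>
#include <utility>

struct BlockingService {};
struct AsyncService {};

// Primary template: false_type and no translate() — an unsupported Service fails to compile.
template <class Service>
struct Bridge : std::false_type {};

template <>
struct Bridge<BlockingService> : std::true_type {
  std::string translate(BlockingService &, std::string &&source) { return "[blocking] " + source; }
};

template <>
struct Bridge<AsyncService> : std::true_type {
  std::string translate(AsyncService &, std::string &&source) { return "[async] " + source; }
};

// A test-suite written once, shared by both services through the bridge member.
template <class Service>
class TestSuite {
 public:
  explicit TestSuite(Service &service) : service_(service) {}
  std::string run(std::string input) { return bridge_.translate(service_, std::move(input)); }

 private:
  Bridge<Service> bridge_;
  Service &service_;
};
```

Because selection happens at instantiation time, both paths must compile — which is exactly the guarantee the comment above describes: the bridge ensures both WebAssembly and native paths build, without claiming they produce identical output.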
+ void qualityEstimatorWords(Ptr model); + + // Reads from stdin and translates the read content. Prints the quality scores for each sentence. + void qualityEstimatorScores(Ptr model); + + void translationCache(Ptr model); +}; + +#define BERGAMOT_TESTS_COMMON_IMPL +#include "common-impl.cpp" +#undef BERGAMOT_TESTS_COMMON_IMPL + +} // namespace marian::bergamot diff --git a/src/tests/intgemm_resolve.cpp b/src/tests/intgemm-resolve.cpp similarity index 100% rename from src/tests/intgemm_resolve.cpp rename to src/tests/intgemm-resolve.cpp diff --git a/src/tests/wasm.cpp b/src/tests/wasm.cpp new file mode 100644 index 000000000..9a29a20e1 --- /dev/null +++ b/src/tests/wasm.cpp @@ -0,0 +1,53 @@ +#include "common.h" +using namespace marian::bergamot; + +void wasm(BlockingService &service, std::shared_ptr &model) { + ResponseOptions responseOptions; + std::vector texts; + + // WASM always requires HTML and alignment. + // TODO(jerinphilip): Fix this, bring in actual tests. + // responseOptions.HTML = true; + // responseOptions.alignment = true; // Necessary for HTML + + // Hide the translateMultiple operation + for (std::string line; std::getline(std::cin, line);) { + texts.emplace_back(line); + } + + auto results = service.translateMultiple(model, std::move(texts), responseOptions); + + for (auto &result : results) { + std::cout << result.getTranslatedText() << std::endl; + } +} + +int main(int argc, char *argv[]) { + ConfigParser configParser("WebAssembly test-suite", /*multiOpMode=*/true); + configParser.parseArgs(argc, argv); + + auto &config = configParser.getConfig(); + BlockingService service(config.serviceConfig); + + TestSuite testSuite(service); + std::vector> models; + + for (auto &modelConfigPath : config.modelConfigPaths) { + TranslationModel::Config modelConfig = parseOptionsFromFilePath(modelConfigPath); + // Anything WASM is expected to use the byte-array-loads. 
So we hard-code grabbing MemoryBundle from FS and use the + // MemoryBundle capable constructor. + MemoryBundle memoryBundle = getMemoryBundleFromConfig(modelConfig); + std::shared_ptr model = std::make_shared(modelConfig, std::move(memoryBundle)); + models.push_back(model); + } + + /// WASM is one special case where WASM path is being checked, involving translateMultiple and a multi-line feed. + /// Hence we do not bind it at a single input-blob single Response constraint imposed by the TestSuite. + if (config.opMode == "wasm") { + wasm(service, models.front()); + } else { + testSuite.run(config.opMode, models); + } + + return 0; +} diff --git a/src/translator/parser.cpp b/src/translator/parser.cpp index e875d97e0..2636b7472 100644 --- a/src/translator/parser.cpp +++ b/src/translator/parser.cpp @@ -10,90 +10,6 @@ namespace marian { namespace bergamot { -std::istringstream &operator>>(std::istringstream &in, OpMode &mode) { - std::string modeString; - in >> modeString; - std::unordered_map table = { - {"wasm", OpMode::APP_WASM}, - {"native", OpMode::APP_NATIVE}, - {"decoder", OpMode::APP_DECODER}, - {"test-response-source-sentences", OpMode::TEST_SOURCE_SENTENCES}, - {"test-response-target-sentences", OpMode::TEST_TARGET_SENTENCES}, - {"test-response-source-words", OpMode::TEST_SOURCE_WORDS}, - {"test-response-target-words", OpMode::TEST_TARGET_WORDS}, - {"test-quality-estimator-words", OpMode::TEST_QUALITY_ESTIMATOR_WORDS}, - {"test-quality-estimator-scores", OpMode::TEST_QUALITY_ESTIMATOR_SCORES}, - {"test-forward-backward", OpMode::TEST_FORWARD_BACKWARD_FOR_OUTBOUND}, - {"test-translation-cache", OpMode::TEST_TRANSLATION_CACHE}, - }; - - auto query = table.find(modeString); - if (query != table.end()) { - mode = query->second; - } else { - ABORT("Unknown mode {}", modeString); - } - - return in; -} - -ConfigParser::ConfigParser() : app_{"Bergamot Options"} { - addSpecialOptions(app_); - addOptionsBoundToConfig(app_, config_); -}; - -void 
ConfigParser::parseArgs(int argc, char *argv[]) { - try { - app_.parse(argc, argv); - handleSpecialOptions(); - } catch (const CLI::ParseError &e) { - exit(app_.exit(e)); - } -} - -void ConfigParser::addSpecialOptions(CLI::App &app) { - app.add_flag("--build-info", build_info_, "Print build-info and exit"); - app.add_flag("--version", version_, "Print version-info and exit"); -} - -void ConfigParser::handleSpecialOptions() { - if (build_info_) { -#ifndef _MSC_VER // cmake build options are not available on MSVC based build. - std::cerr << cmakeBuildOptionsAdvanced() << std::endl; - exit(0); -#else // _MSC_VER - ABORT("build-info is not available on MSVC based build."); -#endif // _MSC_VER - } - - if (version_) { - std::cerr << buildVersion() << std::endl; - exit(0); - } -} - -void ConfigParser::addOptionsBoundToConfig(CLI::App &app, CLIConfig &config) { - app.add_option("--model-config-paths", config.modelConfigPaths, - "Configuration files list, can be used for pivoting multiple models or multiple model workflows"); - - app.add_flag("--bytearray", config.byteArray, - "Flag holds whether to construct service from bytearrays, only for testing purpose"); - - app.add_flag("--check-bytearray", config.validateByteArray, - "Flag holds whether to check the content of the bytearrays (true by default)"); - - app.add_option("--cpu-threads", config.numWorkers, "Number of worker threads to use for translation"); - - app_.add_option("--bergamot-mode", config.opMode, "Operating mode for bergamot: [wasm, native, decoder]"); - - app_.add_option("--cache-translations", config.cacheEnabled, "Whether to cache translations or not."); - app_.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); - app_.add_option("--cache-mutex-buckets", config.cacheMutexBuckets, - "Number of mutex buckets to control locking granularity"); - - app_.add_flag("--html", config.html, "Whether input and output should be HTML"); -} - std::shared_ptr 
parseOptionsFromFilePath(const std::string &configPath, bool validate /*= true*/) { // Read entire string and redirect to parseOptionsFromString std::ifstream readStream(configPath); diff --git a/src/translator/parser.h b/src/translator/parser.h index 1aff5dba7..793582dd0 100644 --- a/src/translator/parser.h +++ b/src/translator/parser.h @@ -6,6 +6,7 @@ #include "3rd_party/marian-dev/src/3rd_party/CLI/CLI.hpp" #include "3rd_party/yaml-cpp/yaml.h" +#include "common/build_info.h" #include "common/config_parser.h" #include "common/config_validator.h" #include "common/options.h" @@ -14,36 +15,34 @@ namespace marian { namespace bergamot { -enum OpMode { - APP_WASM, - APP_NATIVE, - APP_DECODER, - TEST_SOURCE_SENTENCES, - TEST_TARGET_SENTENCES, - TEST_SOURCE_WORDS, - TEST_TARGET_WORDS, - TEST_QUALITY_ESTIMATOR_WORDS, - TEST_QUALITY_ESTIMATOR_SCORES, - TEST_FORWARD_BACKWARD_FOR_OUTBOUND, - TEST_TRANSLATION_CACHE, -}; - -/// Overload for CL11, convert a read from a stringstream into opmode. -std::istringstream &operator>>(std::istringstream &in, OpMode &mode); - +template struct CLIConfig { + using ServiceConfig = typename Service::Config; using ModelConfigPaths = std::vector; + + std::string opMode; + + // For marian-models we retain the old marian-yml configs to a large extent. These are supplied as file-paths to the + // CLI. For multiple model workflows, we allow more than one model config to be supplied. How to process the models + // provided is decided by the application. ModelConfigPaths modelConfigPaths; - bool byteArray; - bool validateByteArray; - bool html; - size_t numWorkers; - OpMode opMode; - - // Cache parameters - bool cacheEnabled{false}; - size_t cacheSize{20}; - size_t cacheMutexBuckets{4}; + + ServiceConfig serviceConfig; + + /// All config in bergamot has the following templated addOptions(...) method hierarchically placing parse actions on + /// "option-groups" in nested structs. 
This allows to keep additional documentation and information on defaults + /// alongside. Since this is templated with App, we don't add a CLI11 dependency in any configs, thus CLI11 not coming + /// into the picture until the parser is instantiated. + template + static void addOptions(App &app, CLIConfig &config, bool multiOpMode = false) { + if (multiOpMode) { + app.add_option("--bergamot-mode", config.opMode, ""); + } + app.add_option("--model-config-paths", config.modelConfigPaths, + "Configuration files list, can be used for pivoting multiple models or multiple model workflows"); + + ServiceConfig::addOptions(app, config.serviceConfig); + }; }; /// ConfigParser for bergamot. Internally stores config options with CLIConfig. CLI11 parsing binds the parsing code to @@ -54,21 +53,48 @@ struct CLIConfig { /// configParser.parseArgs(argc, argv); /// auto &config = configParser.getConfig(); /// ``` +template class ConfigParser { public: - ConfigParser(); - void parseArgs(int argc, char *argv[]); - const CLIConfig &getConfig() { return config_; } + ConfigParser(const std::string &appName, bool multiOpMode = false) : app_{appName} { + addSpecialOptions(app_); + CLIConfig::addOptions(app_, config_, multiOpMode); + }; + void parseArgs(int argc, char *argv[]) { + try { + app_.parse(argc, argv); + handleSpecialOptions(); + } catch (const CLI::ParseError &e) { + exit(app_.exit(e)); + } + }; + const CLIConfig &getConfig() { return config_; } private: // Special Options: build-info and version. These are not taken down further, the respective logic executed and // program exits after. - void addSpecialOptions(CLI::App &app); - void handleSpecialOptions(); + void addSpecialOptions(CLI::App &app) { + app.add_flag("--build-info", build_info_, "Print build-info and exit"); + app.add_flag("--version", version_, "Print version-info and exit"); + }; + + void handleSpecialOptions() { + if (build_info_) { +#ifndef _MSC_VER // cmake build options are not available on MSVC based build. 
+ std::cerr << cmakeBuildOptionsAdvanced() << std::endl; + exit(0); +#else // _MSC_VER + ABORT("build-info is not available on MSVC based build."); +#endif // _MSC_VER + } - void addOptionsBoundToConfig(CLI::App &app, CLIConfig &config); + if (version_) { + std::cerr << buildVersion() << std::endl; + exit(0); + } + } - CLIConfig config_; + CLIConfig config_; CLI::App app_; bool build_info_{false}; @@ -79,7 +105,7 @@ std::shared_ptr parseOptionsFromString(const std::string &confi std::string pathsInSameDirAs = ""); std::shared_ptr parseOptionsFromFilePath(const std::string &config, bool validate = true); -} // namespace bergamot +} // namespace bergamot } // namespace marian #endif // SRC_BERGAMOT_PARSER_H diff --git a/src/translator/service.h b/src/translator/service.h index d58a759da..383ab9885 100644 --- a/src/translator/service.h +++ b/src/translator/service.h @@ -33,6 +33,12 @@ class BlockingService { bool cacheEnabled{false}; ///< Whether to enable cache or not. size_t cacheSize{2000}; ///< Size in History items to be stored in the cache. Loosely corresponds to sentences to /// cache in the real world. + template + static void addOptions(App &app, Config &config) { + // Options will come here. + app.add_option("--cache-translations", config.cacheEnabled, "Whether to cache translations or not."); + app.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); + } }; /// Construct a BlockingService with configuration loaded from an Options object. Does not require any keys, values to /// be set. @@ -77,13 +83,21 @@ class BlockingService { class AsyncService { public: struct Config { - size_t numWorkers; ///< How many worker translation threads to spawn. + size_t numWorkers{1}; ///< How many worker translation threads to spawn. bool cacheEnabled{false}; ///< Whether to enable cache or not. size_t cacheSize{2000}; ///< Size in History items to be stored in the cache. Loosely corresponds to sentences to /// cache in the real world. 
- size_t cacheMutexBuckets; ///< Controls the granularity of locking to reduce contention by bucketing mutexes - ///< guarding cache entry read write. Optimal at min(core, numWorkers) assuming a - ///< reasonably large cache-size. + size_t cacheMutexBuckets{1}; ///< Controls the granularity of locking to reduce contention by bucketing mutexes + ///< guarding cache entry read write. Optimal at min(core, numWorkers) assuming a + ///< reasonably large cache-size. + template + static void addOptions(App &app, Config &config) { + app.add_option("--cpu-threads", config.numWorkers, "Workers to form translation backend"); + app.add_option("--cache-translations", config.cacheEnabled, "Whether to cache translations or not."); + app.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); + app.add_option("--cache-mutex-buckets", config.cacheMutexBuckets, + "Number of mutex buckets to control locking granularity"); + } }; /// Construct an AsyncService with configuration loaded from Options. Expects positive integer value for /// `cpu-threads`. Additionally requires options which configure AggregateBatchingPool. diff --git a/src/translator/utils.h b/src/translator/utils.h new file mode 100644 index 000000000..a35cebcbd --- /dev/null +++ b/src/translator/utils.h @@ -0,0 +1,15 @@ +#pragma once + +#include + +namespace marian::bergamot { + +inline std::string readFromStdin() { + // Read a large input text blob from stdin + std::ostringstream inputStream; + inputStream << std::cin.rdbuf(); + std::string input = inputStream.str(); + return input; +} + +} // namespace marian::bergamot From f55377b6876e04a9c858c84dcdfa4a0faa361e3c Mon Sep 17 00:00:00 2001 From: Jelmer Date: Tue, 21 Dec 2021 14:44:04 +0100 Subject: [PATCH 321/442] HTML transfer empty elements (#283) * Fix test case This should now be implemented * Remove FilterEmpty This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack. 
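The `cacheMutexBuckets` option documented in the `AsyncService::Config` hunk above reduces contention through lock striping: rather than one mutex guarding the whole cache, each entry hashes to one of N independently locked buckets, so threads touching different buckets never serialize on the same lock. A generic sketch of the idea — this is an illustration of the technique, not the actual bergamot cache implementation:

```cpp
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hash-partitioned map: each bucket carries its own mutex, so lock contention
// scales down roughly with the number of buckets (optimal near the worker count).
class StripedCache {
 public:
  explicit StripedCache(size_t mutexBuckets) : buckets_(mutexBuckets) {}

  void put(const std::string &key, const std::string &value) {
    Bucket &bucket = bucketFor(key);
    std::lock_guard<std::mutex> lock(bucket.mutex);
    bucket.entries[key] = value;
  }

  std::optional<std::string> get(const std::string &key) {
    Bucket &bucket = bucketFor(key);
    std::lock_guard<std::mutex> lock(bucket.mutex);
    auto it = bucket.entries.find(key);
    if (it == bucket.entries.end()) return std::nullopt;
    return it->second;
  }

 private:
  struct Bucket {
    std::mutex mutex;
    std::unordered_map<std::string, std::string> entries;
  };

  // Same key always maps to the same bucket, so per-key operations stay consistent.
  Bucket &bucketFor(const std::string &key) {
    return buckets_[std::hash<std::string>{}(key) % buckets_.size()];
  }

  std::vector<Bucket> buckets_;  // sized once at construction; never resized (mutexes are immovable)
};
```

Note the bucket vector is sized once in the constructor and never resized, since `std::mutex` is neither copyable nor movable.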
* Insert skipped empty source spans into target HTML Also refactor variable names to better match their contents and be more consistent with each other. This implementation passes all test cases, finally! * Fix remaining style changes * Move HTML formatting to its own section That code had become exact copies in three different places --- src/tests/units/html_tests.cpp | 4 +- src/translator/html.cpp | 263 ++++++++++++++++++--------------- 2 files changed, 145 insertions(+), 122 deletions(-) diff --git a/src/tests/units/html_tests.cpp b/src/tests/units/html_tests.cpp index 9fd2acfc6..e3d79379f 100644 --- a/src/tests/units/html_tests.cpp +++ b/src/tests/units/html_tests.cpp @@ -162,7 +162,7 @@ TEST_CASE("Do not abort if the input is just empty element") { Response response; html.restore(response); CHECK(response.source.text == "

    "); - CHECK(response.target.text == ""); // Should be

    but hey not there yet. + CHECK(response.target.text == "

    "); } TEST_CASE("Test case html entities") { @@ -388,7 +388,7 @@ TEST_CASE("Test empty self-closing pair at end of input in parent") { CHECK(input == "hello "); } -TEST_CASE("Test empty tag", "[!mayfail]") { +TEST_CASE("Test empty tag") { std::string test_str( "

    hello world

    \n"); diff --git a/src/translator/html.cpp b/src/translator/html.cpp index f531b44fe..4424241c2 100644 --- a/src/translator/html.cpp +++ b/src/translator/html.cpp @@ -12,7 +12,7 @@ using marian::bergamot::Response; void encodeEntities(string_view const &input, std::string &output) { output.clear(); - output.reserve(input.size()); + output.reserve(input.size()); // assumes there are no entities in most cases for (auto it = input.begin(); it != input.end(); ++it) { switch (*it) { @@ -83,6 +83,21 @@ std::string format(std::string const &formatTemplate, Arg arg, Args... args) { return os.str(); } +// Syntactic sugar around rbegin() and rend() that allows me to write +// `for (auto &&item : reversed(container))` instead of the needlessly verbose +// `for (auto it = container.rbegin(); it != container.rend(); ++it)` +template +class reversed { + public: + typedef typename T::const_reverse_iterator iterator; + explicit reversed(T const &container) : container_(container){}; + iterator begin() const { return container_.rbegin(); } + iterator end() const { return container_.rend(); } + + private: + T const &container_; +}; + bool isBlockElement(std::string_view const &name) { // List of elements that we expect might occur inside words, and that should // not introduce spacings around them. 
Not strictly inline elements, nor flow @@ -125,16 +140,6 @@ bool intersects(ByteRange const &range, HTML::Span const &span) { return range.begin <= span.end && range.end >= span.begin; }; -void filterEmpty(HTML::Taint &stack) { - auto src = stack.begin(); - auto dst = stack.begin(); - - for (auto src = stack.begin(); src != stack.end(); ++src) - if (!(*src)->empty) *(dst++) = *src; - - stack.resize(dst - stack.begin()); -} - bool containsTag(HTML::Taint const &stack, HTML::Tag const *tag) { return std::find(stack.rbegin(), stack.rend(), tag) != stack.rend(); } @@ -159,11 +164,11 @@ AnnotatedText apply(AnnotatedText const &in, Fun fun) { // expects // TODO: extend AnnotatedText::appendSentence to accept str + ByteRanges // directly - std::vector token_views(tokens.size()); - std::transform(tokens.begin(), tokens.end(), token_views.begin(), + std::vector views(tokens.size()); + std::transform(tokens.begin(), tokens.end(), views.begin(), [&](ByteRange const &range) { return string_view(sentence.data() + range.begin, range.size()); }); - out.appendSentence(prefix, token_views.begin(), token_views.end()); + out.appendSentence(prefix, views.begin(), views.end()); } out.appendEndingWhitespace(fun(in.annotation.gap(in.numSentences()), in.gap(in.numSentences()), true)); @@ -200,14 +205,14 @@ void hardAlignments(Response const &response, std::vector> & // Note: only search from 0 to N-1 because token N is end-of-sentence token // that can only align with the end-of-sentence token of the target for (size_t t = 0; t + 1 < response.target.numWords(sentenceIdx); ++t) { - size_t s_max = 0; + size_t maxS = 0; for (size_t s = 1; s + 1 < response.source.numWords(sentenceIdx); ++s) { - if (response.alignments[sentenceIdx][t][s] > response.alignments[sentenceIdx][t][s_max]) { - s_max = s; + if (response.alignments[sentenceIdx][t][s] > response.alignments[sentenceIdx][t][maxS]) { + maxS = s; } } - alignments.back().push_back(s_max); + alignments.back().push_back(maxS); } // Next, we 
try to smooth out these selected alignments with a few heuristics @@ -241,52 +246,84 @@ void hardAlignments(Response const &response, std::vector> & } } +// Internal type used to point to a position in HTML::spans_. +typedef std::vector::const_iterator SpanIterator; + void copyTaint(Response const &response, std::vector> const &alignments, - std::vector const &sourceTokenTags, std::vector &targetTokenTags) { + std::vector const &sourceTokenSpans, std::vector &targetTokenSpans) { size_t offset = 0; - // Fill targetTokenTags based on the alignments we just made up. + // Fill targetTokenSpans based on the alignments we just made up. // NOTE: this should match the exact order of Apply() for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) { - targetTokenTags.push_back(sourceTokenTags[offset]); // token_tag for sentence ending gap + targetTokenSpans.push_back(sourceTokenSpans[offset]); // token_tag for sentence ending gap for (size_t t = 0; t < response.target.numWords(sentenceIdx); ++t) { size_t s = alignments[sentenceIdx][t]; assert(s < response.source.numWords(sentenceIdx)); - targetTokenTags.push_back(sourceTokenTags[offset + 1 + s]); // +1 for prefix gap + targetTokenSpans.push_back(sourceTokenSpans[offset + 1 + s]); // +1 for prefix gap } offset += response.source.numWords(sentenceIdx) + 1; // +1 for prefix gap } - assert(offset < sourceTokenTags.size()); - targetTokenTags.push_back(sourceTokenTags[offset]); // token_tag for ending whitespace + assert(offset < sourceTokenSpans.size()); + targetTokenSpans.push_back(sourceTokenSpans[offset]); // token_tag for ending whitespace } -AnnotatedText restoreSource(AnnotatedText const &in, std::vector &token_tags, - std::vector::const_iterator span_it, - std::vector::const_iterator span_end) { - auto prev_it = span_it; // safe because first span is always empty span, and - // and the while-loop below will do the rest +// Little helper class to append HTML to a token +class TokenFormatter 
+{
+ public:
+  TokenFormatter(string_view token)
+      : html_(), offset_(0), whitespaceSize_(countPrefixWhitespaces(token)), closeLeft_(true) {
+    // Do encoding of any entities that popped up in the translation
+    encodeEntities(token, html_);
+  }
+
+  std::string &&html() { return std::move(html_); }
 
-  // workspace variables for lambda
-  std::string html;
-  HTML::Taint opening, closing;
+  // Append the markup necessary for moving from `prev` set of tags to `curr`.
+  void append(HTML::Taint const &prev, HTML::Taint const &curr) {
+    HTML::Taint opening, closing;
 
-  return apply(in, [&](ByteRange range, string_view token, bool last) {
-    // Do encoding of any entities that popped up in the translation
-    // (Also effectively clears html from previous call)
-    encodeEntities(token, html);
+    diffTags(prev, curr, opening, closing);
 
-    size_t offset = 0;  // Size added by prepending HTML
-    size_t whitespace_size = countPrefixWhitespaces(token);
+    for (HTML::Tag const *tag : reversed(closing)) {
+      std::string closeTag = format("</{}>", tag->name);
+      html_.insert(offset_ + (closeLeft_ ? 0 : whitespaceSize_), closeTag);
+      offset_ += closeTag.size();
+    }
 
-    // Close tags we want to show up left (before) the token, but open tags
-    // ideally come directly after any prefix whitespace. However, some tokens
-    // match multiple spans. If a previous span has added an open tag, after any
-    // whitespace, and the next span closes said tag again, we need to close
-    // it after the whitespace. So after the first open tag, any closing tag
-    // should also align right, after whitespace, not before. Hence this bool.
-    bool close_left = true;
+    for (HTML::Tag const *tag : opening) {
+      std::string openTag = format("<{}{}>", tag->name, tag->attributes);
+      html_.insert(offset_ + whitespaceSize_, openTag);
+      offset_ += openTag.size();
+      closeLeft_ = false;
+    }
+  }
+
+ private:
+  std::string html_;       // Output html
+  size_t offset_;          // Size added by prepending HTML
+  size_t whitespaceSize_;  // number of prefix whitespace characters
+
+  // Close tags we want to show up left (before) the token, but open tags
+  // ideally come directly after any prefix whitespace. However, some tokens
+  // match multiple spans. If a previous span has added an open tag, after any
+  // whitespace, and the next span closes said tag again, we need to close
+  // it after the whitespace. So after the first open tag, any closing tag
+  // should also align right, after whitespace, not before. Hence this bool.
+  bool closeLeft_;
+};
+
+AnnotatedText restoreSource(AnnotatedText const &in, std::vector<HTML::Span> const &sourceSpans,
+                            std::vector<SpanIterator> &sourceTokenSpans) {
+  auto spanIt = sourceSpans.begin();
+  auto prevIt = sourceSpans.begin();  // safe because first span is always empty span, and
+                                      // and the while-loop below will do the rest
+  assert(prevIt == sourceSpans.end() || prevIt->tags.empty());
+
+  return apply(in, [&](ByteRange range, string_view token, bool last) {
+    TokenFormatter formatter(token);
 
     // Potential issue: spans and tokens can intersect, e.g.
     //
@@ -295,27 +332,16 @@ AnnotatedText restoreSource(AnnotatedText const &in, std::vector<HTML::Taint> &t
     //    tokens |111111111111111|2|
     //
     // Now 1 covers span 1 to 3, so what taint should it get? Just <p>, or
     // <p><u>?
+    // Note: only relevant if isBlockElement is used. If we just insert spaces
+    // around all elements, every segment of `hello` will be a token.
 
     // Seek to the last span that overlaps with this token
     while (true) {
-      diffTags(prev_it->tags, span_it->tags, opening, closing);
-      prev_it = span_it;
-
-      for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) {
-        std::string close_tag = format("</{}>", (*cit)->name);
-        html.insert(offset + (close_left ? 0 : whitespace_size), close_tag);
-        offset += close_tag.size();
-      }
+      formatter.append(prevIt->tags, spanIt->tags);
+      prevIt = spanIt;
 
-      for (HTML::Tag const *tag : opening) {
-        std::string open_tag = format("<{}{}>", tag->name, tag->attributes);
-        html.insert(offset + whitespace_size, open_tag);
-        offset += open_tag.size();
-        close_left = false;
-      }
-
-      if (span_it + 1 != span_end && ((span_it + 1)->begin < range.end || last)) {
-        span_it++;
+      if (spanIt + 1 != sourceSpans.end() && ((spanIt + 1)->begin < range.end || last)) {
+        spanIt++;
         continue;
       }
@@ -323,71 +349,69 @@ AnnotatedText restoreSource(AnnotatedText const &in, std::vector<HTML::Taint> &t
     }
 
     // TODO: This is just the taint of the last span, not the ones in between.
-    // This makes us lose empty tags, and maybe some markup as well, in the
-    // response target HTML restoration.
-    token_tags.push_back(prev_it->tags);
+    // This makes us lose some markup of parts of tokens as described above.
+    sourceTokenSpans.push_back(prevIt);
 
-    return html;
+    return std::move(formatter.html());
   });
 }
 
-AnnotatedText restoreTarget(AnnotatedText const &in, std::vector<HTML::Taint> const &token_tags_target) {
-  auto token_prev_it = token_tags_target.begin();
-  auto token_tags_it = token_tags_target.begin() + 1;
-
-  // workspace for lambda
-  std::string html;
-  HTML::Taint opening, closing;
+AnnotatedText restoreTarget(AnnotatedText const &in, std::vector<HTML::Span> const &sourceSpans,
+                            std::vector<SpanIterator> const &targetTokenSpans) {
+  auto prevSpan = sourceSpans.begin();
+  auto targetSpanIt = targetTokenSpans.begin();
 
   AnnotatedText out = apply(in, [&](ByteRange range, string_view token, bool last) {
-    // Do encoding of any entities that popped up in the translation
-    // (Also effectively clears html from previous call)
-    encodeEntities(token, html);
+    TokenFormatter formatter(token);
 
-    size_t offset = 0;  // Size added by prepending HTML
-    size_t whitespace_size = countPrefixWhitespaces(token);
+    // First we scan through spans_ to catch up to the span assigned to this
+    // token. We're only interested in empty spans (empty and void elements)
+    for (auto span_it = prevSpan + 1; span_it < *targetSpanIt; span_it++) {
+      // We're only interested in empty spans between the spans in targetSpanIt
+      if (span_it->size() != 0) continue;
 
-    assert(token_tags_it != token_tags_target.end());
-    diffTags(*token_prev_it, *token_tags_it, opening, closing);
+      formatter.append(prevSpan->tags, span_it->tags);
 
-    for (auto cit = closing.crbegin(); cit != closing.crend(); ++cit) {
-      std::string close_tag = format("</{}>", (*cit)->name);
-      html.insert(offset, close_tag);
-      offset += close_tag.size();
+      // Note: here, not in 3rd part of for-statement because we don't want to
+      // set prevSpan if the continue clause at the beginning of this for-loop
+      // was hit.
+      prevSpan = span_it;
     }
 
-    for (HTML::Tag const *tag : opening) {
-      std::string open_tag = format("<{}{}>", tag->name, tag->attributes);
-      html.insert(offset + whitespace_size, open_tag);
-      offset += open_tag.size();
-    }
+    // Now do the same thing but for our target set of tags. Note that we cannot
+    // combine this in the for-loop above (i.e. `span_it <= *targetSpanIt`)
+    // because there is no guarantee that the order in `targetTokenSpans` is
+    // the same as that of `spans`.
+    formatter.append(prevSpan->tags, (*targetSpanIt)->tags);
 
     // If this is the last token of the response, close all open tags.
     if (last) {
-      for (auto cit = token_tags_it->crbegin(); cit != token_tags_it->crend(); ++cit) {
-        html += format("</{}>", (*cit)->name);
-      }
+      // Note: this assert is true due to our current implementation of
+      // HardAlignments() that always matches the last token of the input with
+      // the last token of the output. But let's assume someone someday changes
+      // HardAlignments(), and then this for-loop will be necessary.
+      // assert((*targetSpanIt)->tags.empty());
+      formatter.append((*targetSpanIt)->tags, HTML::Taint());
     }
 
-    ++token_prev_it;
-    ++token_tags_it;
+    prevSpan = *targetSpanIt++;
 
-    return html;
+    return std::move(formatter.html());
   });
 
   // Assert that we did in fact use all our taints
-  assert(token_tags_it == token_tags_target.end());
+  assert(targetSpanIt == targetTokenSpans.end());
 
   return out;
 }
 
 std::ostream &debugPrintMapping(std::ostream &out, Response const &response,
                                 std::vector<std::vector<size_t>> const &alignments,
-                                std::vector<HTML::Taint> const &token_tags_target) {
-  auto taints = token_tags_target.begin();
+                                std::vector<SpanIterator> const &targetTokenSpans) {
+  auto spans = targetTokenSpans.begin();
   for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) {
     out << "Mapped sentence prefix with tags: ";
-    for (auto &&taint : *(++taints)) out << '/' << taint->name;
+    for (auto &&taint : (*++spans)->tags) out << '/' << taint->name;
     out << '\n';
 
     for (size_t wordIdx = 0; wordIdx < response.target.numWords(sentenceIdx); ++wordIdx) {
@@ -399,16 +423,16 @@ std::ostream &debugPrintMapping(std::ostream &out, Response const &response,
       out << " to ";
       out << std::setw(10) << std::setfill(' ') << response.source.word(sentenceIdx, alignments[sentenceIdx][wordIdx]);
       out << " with tags: ";
-      for (auto &&taint : *(++taints)) out << '/' << taint->name;
+      for (auto &&taint : (*++spans)->tags) out << '/' << taint->name;
       out << '\n';
     }
   }
 
   out << "Mapped end-of-input with tags: ";
-  for (auto &&taint : *(++taints)) out << '/' << taint->name;
+  for (auto &&taint : (*++spans)->tags) out << '/' << taint->name;
   out << '\n';
 
-  assert(++taints == token_tags_target.end());
+  assert(++spans == targetTokenSpans.end());
   return out;
 }
 
@@ -467,7 +491,6 @@ HTML::HTML(std::string &&source, bool process_markup) {
         auto begin = source.size();
         source.append(scanner.value());
         spans_.push_back(Span{begin, source.size(), stack});
-        filterEmpty(stack);
       } break;
 
       case markup::Scanner::TT_TAG_START:
@@ -539,32 +562,32
@@ void HTML::restore(Response &response) {
   // Reconstruction of HTML tags:
   // 1. Map each token to a Span
-  // 2. Apply the taint of that span to the token
-  // 3. Reconstruct the source HTML with these tainted tokens
-  // 4. Transfer the taint from the source tokens to the target tokens using alignment information
+  // 2. Reconstruct the source HTML with these tainted tokens
+  // 3. Transfer the spans from the source tokens to the target tokens using alignment information
+  // 4. For spans that represent empty elements (e.g. <img>) figure out their position
   // 5. Reconstruct the target HTML with these tainted tokens
 
-  std::vector<HTML::Taint> token_tags;  // List of HTML tags active per token in source
-                                        // Calculating these is a side-effect of restoring
-                                        // the HTML in response.source.
+  // sourceTokenSpans is a vector with a pointer to a span for each token. We
+  // use iterators here to point to these positions so we can easily compare if
+  // one span comes before or after another, information we'll need when we need
+  // to figure out whether we've skipped spans (of empty elements) when
+  // reconstructing HTML in response.target.
+  std::vector<SpanIterator> sourceTokenSpans;
 
-  AnnotatedText source = restoreSource(response.source, token_tags, spans_.cbegin(), spans_.cend());
-  assert(token_tags.size() == debugCountTokens(response.source));
+  // RestoreSource re-inserts HTML into the source text, but also identifies
+  // which span each source token fits into best.
+  AnnotatedText source = restoreSource(response.source, spans_, sourceTokenSpans);
+  assert(sourceTokenSpans.size() == debugCountTokens(response.source));
 
   // Find for every token in target the token in source that best matches.
   std::vector<std::vector<size_t>> alignments;
   hardAlignments(response, alignments);
 
-  std::vector<HTML::Taint> token_tags_target;
-  token_tags_target.emplace_back();  // add empty one to the beginning for easy
-                                     // life later on (we start iterating at 1,
-                                     // and can then do i - 1 for empty.
- copyTaint(response, alignments, token_tags, token_tags_target); - assert(token_tags_target.size() == debugCountTokens(response.target) + 1); - - // DebugPrintMapping(std::cerr, response, alignments, token_tags_target); + std::vector targetTokenSpans; + copyTaint(response, alignments, sourceTokenSpans, targetTokenSpans); + assert(targetTokenSpans.size() == debugCountTokens(response.target)); - AnnotatedText target = restoreTarget(response.target, token_tags_target); + AnnotatedText target = restoreTarget(response.target, spans_, targetTokenSpans); response.source = source; response.target = target; From 9e1c1e8dbf4817f411f718c75ce42493d92c43b6 Mon Sep 17 00:00:00 2001 From: Abhishek Aggarwal <66322306+abhi-agg@users.noreply.github.com> Date: Tue, 21 Dec 2021 23:58:13 +0100 Subject: [PATCH 322/442] CI: Circle CI config script update (#287) - Robust artifact presence check - Variable name refactoring - Storing only those artifacts that are required - Remove commit sha from the names of the Github Releases - Use BERGAMOT_VERSION file contents for Git Tag names --- .circleci/config.yml | 34 +++++++++++++++------------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index d9ff7933d..fbea34e18 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -19,13 +19,12 @@ jobs: name: Check artifacts working_directory: build-wasm command: | - ls -all bergamot* - if ls bergamot*.wasm &>/dev/null && ls bergamot*.js &>/dev/null - then + ARTIFACT_BASE="bergamot-translator-worker" + if [[ -f "$ARTIFACT_BASE.js" && -f "$ARTIFACT_BASE.wasm" ]]; then echo "Artifacts Successfully Generated" mkdir ../artifacts - cp bergamot-translator-worker.wasm ../artifacts/bergamot-translator-worker-with-wormhole.wasm - cp bergamot-translator-worker.js ../artifacts/bergamot-translator-worker-with-wormhole.js + cp $ARTIFACT_BASE.wasm ../artifacts/$ARTIFACT_BASE-with-wormhole.wasm + cp $ARTIFACT_BASE.js 
../artifacts/$ARTIFACT_BASE-with-wormhole.js shasum -a 256 ../artifacts/* > ../artifacts/SHA256-1 cp ../BERGAMOT_VERSION ../artifacts/ else @@ -39,7 +38,7 @@ jobs: - artifacts/* - store_artifacts: - path: "build-wasm" + path: "artifacts" destination: "wasm-wormhole" build-without-wormhole: @@ -61,26 +60,27 @@ jobs: name: Check artifacts working_directory: build-wasm command: | - ls -all bergamot* - if ls bergamot*.wasm &>/dev/null && ls bergamot*.js &>/dev/null - then + ARTIFACT_BASE="bergamot-translator-worker" + if [[ -f "$ARTIFACT_BASE.js" && -f "$ARTIFACT_BASE.wasm" ]]; then echo "Artifacts Successfully Generated" mkdir ../artifacts - cp bergamot-translator-worker.wasm ../artifacts/bergamot-translator-worker-without-wormhole.wasm - cp bergamot-translator-worker.js ../artifacts/bergamot-translator-worker-without-wormhole.js + cp $ARTIFACT_BASE.wasm ../artifacts/$ARTIFACT_BASE-without-wormhole.wasm + cp $ARTIFACT_BASE.js ../artifacts/$ARTIFACT_BASE-without-wormhole.js shasum -a 256 ../artifacts/* > ../artifacts/SHA256-2 else echo "Failure: Artifacts Not Present" exit 1 fi + - persist_to_workspace: root: . 
paths: - artifacts/* - store_artifacts: - path: "build-wasm" + path: "artifacts" destination: "wasm-without-wormhole" + publish_to_github: docker: - image: cibuilds/github:0.10 @@ -91,15 +91,11 @@ jobs: - run: name: "Publish Release on GitHub" command: | - export COMMIT=$(echo $CIRCLE_SHA1 | cut -c -7) - export VERSION=$(cat ./artifacts/BERGAMOT_VERSION | cut -c 2-) - VERSION=$VERSION+$COMMIT + export TAG_VERSION=$(cat ./artifacts/BERGAMOT_VERSION) ls -lsa ./artifacts/ > ./artifacts/FILESIZES cat ./artifacts/SHA256-1 ./artifacts/SHA256-2 > ./artifacts/SHA256 - rm ./artifacts/SHA256-1 - rm ./artifacts/SHA256-2 - rm ./artifacts/BERGAMOT_VERSION - ghr -t ${GHTOKEN} -u ${CIRCLE_PROJECT_USERNAME} -r ${CIRCLE_PROJECT_REPONAME} -c ${CIRCLE_SHA1} -delete ${VERSION} ./artifacts/ + rm ./artifacts/SHA256-1 ./artifacts/SHA256-2 ./artifacts/BERGAMOT_VERSION + ghr -t ${GHTOKEN} -u ${CIRCLE_PROJECT_USERNAME} -r ${CIRCLE_PROJECT_REPONAME} -c ${CIRCLE_SHA1} -delete ${TAG_VERSION} ./artifacts/ workflows: build: From 6e6042c98f2194cd10514844a9a71e218d8e7830 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Wed, 29 Dec 2021 11:02:56 +0000 Subject: [PATCH 323/442] GitHub CI: Update YAML to run all tests on marian-full (#292) Previously there were #native tags and #wasm tags separating the two. There is now a clear separation between async, blocking and wasm. 
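As an aside, the robust presence check introduced in the CircleCI change above can be exercised locally with a sketch along the following lines. The standalone `check_artifacts` function and the temporary directories are illustrative assumptions for local testing, not part of the actual CI config:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same base name as in the CI config above.
ARTIFACT_BASE="bergamot-translator-worker"

# Succeeds only when both the .js and .wasm artifacts exist in the
# given build directory, mirroring the `[[ -f ... && -f ... ]]` guard.
check_artifacts() {
  local dir="$1"
  if [[ -f "$dir/$ARTIFACT_BASE.js" && -f "$dir/$ARTIFACT_BASE.wasm" ]]; then
    echo "Artifacts Successfully Generated"
  else
    echo "Failure: Artifacts Not Present" >&2
    return 1
  fi
}

# Simulate a successful build and a failed one.
dir="$(mktemp -d)"
touch "$dir/$ARTIFACT_BASE.js" "$dir/$ARTIFACT_BASE.wasm"
check_artifacts "$dir"

empty="$(mktemp -d)"
check_artifacts "$empty" || echo "check failed as expected"
```

Compared with the earlier `ls bergamot*.wasm` glob, the explicit `-f` tests cannot be fooled by stray files matching the glob, and the single `ARTIFACT_BASE` variable keeps the copy commands and the check in sync.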
--- .github/workflows/native.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/workflows/native.yml b/.github/workflows/native.yml index e27cbe1b0..4a8dbffdc 100644 --- a/.github/workflows/native.yml +++ b/.github/workflows/native.yml @@ -25,19 +25,19 @@ jobs: os: ubuntu-18.04 identifier: ubuntu_1804_full cmake: -DCOMPILE_TESTS=on - brt_tags: "'#native'" + brt_args: "" unittests: 'true' - name: Ubuntu 18.04 minimal os: ubuntu-18.04 identifier: ubuntu_1804_minimal cmake: -DCOMPILE_TESTS=on -DUSE_WASM_COMPATIBLE_SOURCE=on - brt_tags: "'#wasm'" + brt_args: "'#wasm'" unittests: 'false' - name: Ubuntu 20.04 full os: ubuntu-20.04 identifier: ubuntu_2004_full cmake: -DCOMPILE_TESTS=on - brt_tags: "'#native'" + brt_tags: "" unittests: 'true' - name: Ubuntu 20.04 minimal os: ubuntu-20.04 @@ -140,7 +140,7 @@ jobs: os: macos-10.15 identifier: mac_1015_full cmake: -DCOMPILE_TESTS=on -DUSE_APPLE_ACCELERATE=off -DUSE_FBGEMM=off -DUSE_STATIC_LIBS=off - brt_tags: "'#native'" + brt_tags: "" unittests: 'true' - name: MacOS 10.15 minimal os: macos-10.15 From 8eb238ed5ec30c0ea03ba9507df23ab73fe2ef04 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 30 Dec 2021 14:29:12 +0000 Subject: [PATCH 324/442] HTML basic integration tests (#291) --- bergamot-translator-tests | 2 +- src/tests/common-impl.cpp | 13 +++++++++++++ src/tests/common.h | 2 ++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 5524e37a0..59720cb67 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 5524e37a01920dc5149dcc87b047615c6a70aa53 +Subproject commit 59720cb67458c4682cde7e999a4b18d6934ab988 diff --git a/src/tests/common-impl.cpp b/src/tests/common-impl.cpp index 49ebfc53c..9fc44c9ad 100644 --- a/src/tests/common-impl.cpp +++ b/src/tests/common-impl.cpp @@ -49,6 +49,8 @@ void TestSuite::TestSuite::run(const std::string &opModeAsString, std:: 
qualityEstimatorScores(models.front()); } else if (opModeAsString == "test-translation-cache") { translationCache(models.front()); + } else if (opModeAsString == "test-html-translation") { + htmlTranslation(models.front()); } else { std::cerr << "Incompatible test mode. Choose from the one of the valid test-modes"; std::abort(); @@ -138,6 +140,17 @@ void TestSuite::qualityEstimatorWords(Ptr model) { } } +template +void TestSuite::htmlTranslation(Ptr model) { + ResponseOptions responseOptions; + responseOptions.HTML = true; + responseOptions.alignment = true; + std::string source = readFromStdin(); + const Response response = bridge_.translate(service_, model, std::move(source), responseOptions); + + std::cout << response.target.text; +} + // Reads from stdin and translates the read content. Prints the quality scores for each sentence. template void TestSuite::qualityEstimatorScores(Ptr model) { diff --git a/src/tests/common.h b/src/tests/common.h index dff47e483..1e454858c 100644 --- a/src/tests/common.h +++ b/src/tests/common.h @@ -79,6 +79,8 @@ class TestSuite { void qualityEstimatorScores(Ptr model); void translationCache(Ptr model); + + void htmlTranslation(Ptr model); }; #define BERGAMOT_TESTS_COMMON_IMPL From d209e4fc49290989dfb2443163b12e150ad1b97a Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Thu, 30 Dec 2021 16:12:30 +0000 Subject: [PATCH 325/442] Fix typo in BRT args on CI runs (#294) --- .github/workflows/native.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/native.yml b/.github/workflows/native.yml index 4a8dbffdc..9ff351c0a 100644 --- a/.github/workflows/native.yml +++ b/.github/workflows/native.yml @@ -25,13 +25,13 @@ jobs: os: ubuntu-18.04 identifier: ubuntu_1804_full cmake: -DCOMPILE_TESTS=on - brt_args: "" + brt_tags: "" unittests: 'true' - name: Ubuntu 18.04 minimal os: ubuntu-18.04 identifier: ubuntu_1804_minimal cmake: -DCOMPILE_TESTS=on -DUSE_WASM_COMPATIBLE_SOURCE=on - brt_args: "'#wasm'" + 
brt_tags: "'#wasm'" unittests: 'false' - name: Ubuntu 20.04 full os: ubuntu-20.04 From ddccc77570f64206d1df38cd19957022b4f26b3c Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 2 Jan 2022 00:17:12 +0000 Subject: [PATCH 326/442] Turn logging off by default, allow turning on via config/cmdline (#295) * Turn logging off by default, allow turning on via config/cmdline * No need to store config in member variable if things are decided at construction time --- src/translator/logging.h | 38 +++++++++++++++++++++++++++++++++++++- src/translator/service.cpp | 12 ++++++++++-- src/translator/service.h | 6 ++++++ 3 files changed, 53 insertions(+), 3 deletions(-) diff --git a/src/translator/logging.h b/src/translator/logging.h index bd5b17a45..2256d7889 100644 --- a/src/translator/logging.h +++ b/src/translator/logging.h @@ -7,9 +7,45 @@ namespace bergamot { // RAII Wrap around logging, to clean up after the object on stack. class Logger { public: - Logger() : marianLoggers_(createLoggers()) { + struct Config { + std::string level{"off"}; + template + static void addOptions(App &app, Config &config) { + app.add_option("--log-level", config.level, + "Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off"); + } + }; + + Logger(const Config &config) : marianLoggers_(createLoggers()) { // We are manually creating loggers, because this is usually created in marian as a side-effect of // config-parsing. 
+ for (auto &logger : marianLoggers_) { + setLoggingLevel(*logger, config.level); + } + } + + // Taken from + // https://github.com/marian-nmt/marian-dev/blob/c84599d08ad69059279abd5a7417a8053db8b631/src/common/logging.cpp#L45 + static bool setLoggingLevel(spdlog::logger &logger, std::string const level) { + if (level == "trace") + logger.set_level(spdlog::level::trace); + else if (level == "debug") + logger.set_level(spdlog::level::debug); + else if (level == "info") + logger.set_level(spdlog::level::info); + else if (level == "warn") + logger.set_level(spdlog::level::warn); + else if (level == "err" || level == "error") + logger.set_level(spdlog::level::err); + else if (level == "critical") + logger.set_level(spdlog::level::critical); + else if (level == "off") + logger.set_level(spdlog::level::off); + else { + logger.warn("Unknown log level '{}' for logger '{}'", level.c_str(), logger.name().c_str()); + return false; + } + return true; } ~Logger() { diff --git a/src/translator/service.cpp b/src/translator/service.cpp index ca92721da..8acbc97de 100644 --- a/src/translator/service.cpp +++ b/src/translator/service.cpp @@ -11,7 +11,11 @@ namespace marian { namespace bergamot { BlockingService::BlockingService(const BlockingService::Config &config) - : config_(config), requestId_(0), batchingPool_(), cache_(config.cacheSize, /*mutexBuckets=*/1) {} + : config_(config), + requestId_(0), + batchingPool_(), + cache_(config.cacheSize, /*mutexBuckets=*/1), + logger_(config.logger) {} std::vector BlockingService::translateMultiple(std::shared_ptr translationModel, std::vector &&sources, @@ -37,7 +41,11 @@ std::vector BlockingService::translateMultiple(std::shared_ptr static void addOptions(App &app, Config &config) { // Options will come here. 
app.add_option("--cache-translations", config.cacheEnabled, "Whether to cache translations or not."); app.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); + Logger::Config::addOptions(app, config.logger); } }; /// Construct a BlockingService with configuration loaded from an Options object. Does not require any keys, values to @@ -90,6 +93,8 @@ class AsyncService { size_t cacheMutexBuckets{1}; ///< Controls the granularity of locking to reduce contention by bucketing mutexes ///< guarding cache entry read write. Optimal at min(core, numWorkers) assuming a ///< reasonably large cache-size. + Logger::Config logger; // Configurations for logging + template static void addOptions(App &app, Config &config) { app.add_option("--cpu-threads", config.numWorkers, "Workers to form translation backend"); @@ -97,6 +102,7 @@ class AsyncService { app.add_option("--cache-size", config.cacheSize, "Number of entries to store in cache."); app.add_option("--cache-mutex-buckets", config.cacheMutexBuckets, "Number of mutex buckets to control locking granularity"); + Logger::Config::addOptions(app, config.logger); } }; /// Construct an AsyncService with configuration loaded from Options. 
Expects positive integer value for From 3883dd19713b0f6f30eb4c3cfcdb8e488eab3a76 Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Sun, 2 Jan 2022 12:33:30 +0000 Subject: [PATCH 327/442] cache: threadsafety-fixes; optional stats collection (#245) * Make stats hits misses atomic to guard when mutex has multiple buckets * Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable * -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional * Make stats() call without enabling build fatal abort --- CMakeLists.txt | 2 ++ bergamot-translator-tests | 2 +- src/translator/CMakeLists.txt | 4 ++++ src/translator/cache.h | 24 ++++++++++++++++++++---- 4 files changed, 27 insertions(+), 5 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 006e9521d..f121ca0fb 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -74,6 +74,8 @@ cmake_dependent_option(USE_WASM_COMPATIBLE_SOURCE "Use wasm compatible sources" # WASM disables a million libraries, which also includes the unit test-library. 
cmake_dependent_option(COMPILE_UNIT_TESTS "Compile unit tests" OFF "USE_WASM_COMPATIBLE_SOURCE" ON) option(COMPILE_TESTS "Compile bergamot-tests" OFF) +cmake_dependent_option(ENABLE_CACHE_STATS "Enable stats on cache" ON "COMPILE_TESTS" OFF) + # Set 3rd party submodule specific cmake options for this project SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version") diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 59720cb67..332e976df 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 59720cb67458c4682cde7e999a4b18d6934ab988 +Subproject commit 332e976df4583793a09b6483b80b972621fcfadb diff --git a/src/translator/CMakeLists.txt b/src/translator/CMakeLists.txt index 6779b0fa4..dbead6173 100644 --- a/src/translator/CMakeLists.txt +++ b/src/translator/CMakeLists.txt @@ -32,6 +32,10 @@ if(COMPILE_WASM) target_compile_options(bergamot-translator PRIVATE ${WASM_COMPILE_FLAGS}) endif(COMPILE_WASM) +if(ENABLE_CACHE_STATS) + target_compile_definitions(bergamot-translator PUBLIC ENABLE_CACHE_STATS) +endif(ENABLE_CACHE_STATS) + target_link_libraries(bergamot-translator marian ssplit) target_include_directories(bergamot-translator diff --git a/src/translator/cache.h b/src/translator/cache.h index ba68e4e93..ceeca5d32 100644 --- a/src/translator/cache.h +++ b/src/translator/cache.h @@ -1,4 +1,5 @@ #pragma once +#include #include #include #include @@ -26,7 +27,14 @@ class AtomicCache { void store(const Key &key, Value value) { atomicStore(key, value); } - const Stats stats() const { return stats_; } + const Stats stats() const { +#ifdef ENABLE_CACHE_STATS + return Stats{hits_.load(), misses_.load()}; +#else + ABORT("Cache statistics requested without enabling in builds. 
Please use -DENABLE_CACHE_STATS with cmake."); + return Stats{0, 0}; +#endif + } private: using Record = std::pair; @@ -40,10 +48,14 @@ class AtomicCache { const Record &candidate = records_[index]; if (equals_(key, candidate.first)) { value = candidate.second; - stats_.hits += 1; +#ifdef ENABLE_CACHE_STATS + ++hits_; +#endif return true; } else { - stats_.misses += 1; +#ifdef ENABLE_CACHE_STATS + ++misses_; +#endif } return false; @@ -64,7 +76,11 @@ class AtomicCache { std::vector records_; mutable std::vector mutexBuckets_; - mutable Stats stats_; + +#ifdef ENABLE_CACHE_STATS + mutable std::atomic hits_{0}; + mutable std::atomic misses_{0}; +#endif Hash hash_; Equals equals_; From 81c21928d5c360e47b998a6d24abe055bab9165b Mon Sep 17 00:00:00 2001 From: Jerin Philip Date: Mon, 3 Jan 2022 12:27:41 +0000 Subject: [PATCH 328/442] Have alignments placed if HTML is on (#296) --- src/translator/response_builder.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/translator/response_builder.h b/src/translator/response_builder.h index baa648850..345951a0e 100644 --- a/src/translator/response_builder.h +++ b/src/translator/response_builder.h @@ -61,7 +61,7 @@ class ResponseBuilder { buildQualityScores(histories, response); } - if (responseOptions_.alignment) { + if (responseOptions_.alignment || responseOptions_.HTML) { buildAlignments(histories, response); } html_.restore(response); From dae02a3c8d2f95139d3c9623f8644b2b4776dab9 Mon Sep 17 00:00:00 2001 From: Jelmer Date: Wed, 5 Jan 2022 14:33:51 +0100 Subject: [PATCH 329/442] HTML transfer script/style/etc elements (#285) --- src/tests/units/html_tests.cpp | 47 +++++++++++++ src/translator/html.cpp | 120 +++++++++++++++++---------------- src/translator/html.h | 12 +++- 3 files changed, 120 insertions(+), 59 deletions(-) diff --git a/src/tests/units/html_tests.cpp b/src/tests/units/html_tests.cpp index e3d79379f..48af7066c 100644 --- a/src/tests/units/html_tests.cpp +++ b/src/tests/units/html_tests.cpp 
@@ -419,6 +419,53 @@ TEST_CASE("Test empty tag") { CHECK(response.target.text == test_str); } +TEST_CASE("Test world"); + + std::string input(test_str); + HTML html(std::move(input), true); + CHECK(input == "hello world"); + + Response response; + std::string sentence_str("hello world"); + std::vector sentence{ + string_view(sentence_str.data() + 0, 4), // 0.0 hell + string_view(sentence_str.data() + 4, 1), // 0.1 o + string_view(sentence_str.data() + 5, 6), // 0.2 _world + string_view(sentence_str.data() + 11, 0), // 0.3 "" + }; + response.source.appendSentence("", sentence.begin(), sentence.end()); + response.target.appendSentence("", sentence.begin(), sentence.end()); + response.alignments = {identity_matrix(4)}; + + html.restore(response); + CHECK(response.source.text == test_str); + CHECK(response.target.text == test_str); +} + +TEST_CASE("Test comment") { + std::string test_str("foo bar"); + + std::string input(test_str); + HTML html(std::move(input), true); + CHECK(input == "foo bar"); + + Response response; + std::string sentence_str("foo bar"); + std::vector sentence{ + string_view(sentence_str.data() + 0, 3), // foo + string_view(sentence_str.data() + 3, 4), // _bar + string_view(sentence_str.data() + 7, 0), // "" + }; + response.source.appendSentence("", sentence.begin(), sentence.end()); + response.target.appendSentence("", sentence.begin(), sentence.end()); + response.alignments = {identity_matrix(3)}; + + html.restore(response); + CHECK(response.source.text == test_str); + CHECK(response.target.text == test_str); +} + TEST_CASE("End-to-end translation") { std::string input("

<p>I like to drive this car.</p>\n");
   HTML html(std::move(input), true);
diff --git a/src/translator/html.cpp b/src/translator/html.cpp
index 4424241c2..13ab422ac 100644
--- a/src/translator/html.cpp
+++ b/src/translator/html.cpp
@@ -47,11 +47,20 @@ size_t countPrefixWhitespaces(string_view const &input) {
   return size;
 }
 
+// Formatters used for exception messages combined with format()
 std::ostream &operator<<(std::ostream &out, HTML::Tag const *tag) {
   if (tag == nullptr) return out << "[nullptr]";
-  out << '<' << tag->name << tag->attributes;
-  if (tag->empty) out << '/';
-  return out << '>';
+  switch (tag->type) {
+    case HTML::Tag::ELEMENT:
+      return out << '<' << tag->name << tag->attributes << '>';
+    case HTML::Tag::VOID_ELEMENT:
+      return out << '<' << tag->name << tag->attributes << "/>";
+    case HTML::Tag::COMMENT:
+      return out << "<!--" << tag->data << "-->";
+    case HTML::Tag::PROCESSING_INSTRUCTION:
+      return out << "<?" << tag->data << "?>";
+  }
+  return out << "[Unknown tag type]";
 }
 
 std::ostream &operator<<(std::ostream &out, HTML::Taint const &tags) {
@@ -131,7 +140,9 @@ void diffTags(HTML::Taint const &prev, HTML::Taint const &curr, HTML::Taint &ope
   for (; i < prev.size(); ++i)
     if (i >= curr.size() || prev[i] != curr[i]) break;
 
-  std::copy_if(prev.begin() + i, prev.end(), std::back_inserter(closing), [&](HTML::Tag *tag) { return !tag->empty; });
+  // Only nodes of type ELEMENT can have children and thus would need a closing tag.
+  std::copy_if(prev.begin() + i, prev.end(), std::back_inserter(closing),
+               [&](HTML::Tag *tag) { return tag->type == HTML::Tag::ELEMENT; });
 
   opening.insert(opening.end(), curr.begin() + i, curr.end());
 }
@@ -273,7 +284,7 @@ void copyTaint(Response const &response, std::vector<std::vector<size_t>> const
 
 // Little helper class to append HTML to a token
 class TokenFormatter {
  public:
-  TokenFormatter(string_view token)
+  explicit TokenFormatter(string_view token)
       : html_(), offset_(0), whitespaceSize_(countPrefixWhitespaces(token)), closeLeft_(true) {
     // Do encoding of any entities that popped up in the translation
     encodeEntities(token, html_);
@@ -288,13 +299,26 @@ class TokenFormatter {
     diffTags(prev, curr, opening, closing);
 
     for (HTML::Tag const *tag : reversed(closing)) {
+      assert(tag->type == HTML::Tag::ELEMENT);
       std::string closeTag = format("</{}>", tag->name);
       html_.insert(offset_ + (closeLeft_ ? 0 : whitespaceSize_), closeTag);
       offset_ += closeTag.size();
     }
 
     for (HTML::Tag const *tag : opening) {
-      std::string openTag = format("<{}{}>", tag->name, tag->attributes);
+      std::string openTag;
+      switch (tag->type) {
+        case HTML::Tag::ELEMENT:
+        case HTML::Tag::VOID_ELEMENT:
+          openTag = format("<{}{}>{}", tag->name, tag->attributes, tag->data);
+          break;
+        case HTML::Tag::COMMENT:
+          openTag = format("<!--{}-->", tag->data);
+          break;
+        case HTML::Tag::PROCESSING_INSTRUCTION:
+          openTag = format("<?{}?>", tag->data);
+          break;
+      }
       html_.insert(offset_ + whitespaceSize_, openTag);
       offset_ += openTag.size();
       closeLeft_ = false;
@@ -405,55 +429,6 @@ AnnotatedText restoreTarget(AnnotatedText const &in, std::vector<HTML::SpanIterator> con
   return out;
 }
 
-std::ostream &debugPrintMapping(std::ostream &out, Response const &response,
-                                std::vector<std::vector<size_t>> const &alignments,
-                                std::vector<HTML::SpanIterator> const &targetTokenSpans) {
-  auto spans = targetTokenSpans.begin();
-  for (size_t sentenceIdx = 0; sentenceIdx < response.target.numSentences(); ++sentenceIdx) {
-    out << "Mapped sentence prefix with tags: ";
-    for (auto &&taint : (*++spans)->tags) out << '/' <<
taint->name; - out << '\n'; - - for (size_t wordIdx = 0; wordIdx < response.target.numWords(sentenceIdx); ++wordIdx) { - assert(sentenceIdx < alignments.size()); - assert(wordIdx < alignments[sentenceIdx].size()); - - out << "Mapped "; - out << std::setw(10) << std::setfill(' ') << response.target.word(sentenceIdx, wordIdx); - out << " to "; - out << std::setw(10) << std::setfill(' ') << response.source.word(sentenceIdx, alignments[sentenceIdx][wordIdx]); - out << " with tags: "; - for (auto &&taint : (*++spans)->tags) out << '/' << taint->name; - out << '\n'; - } - } - - out << "Mapped end-of-input with tags: "; - for (auto &&taint : (*++spans)->tags) out << '/' << taint->name; - out << '\n'; - - assert(++spans == targetTokenSpans.end()); - return out; -} - -std::ostream &debugPrintAlignmentScores(std::ostream &out, Response const &response) { - out << "std::vector>> alignments{\n"; - for (size_t sentenceIdx = 0; sentenceIdx < response.source.numSentences(); ++sentenceIdx) { - out << " {\n"; - for (size_t t = 0; t < response.alignments[sentenceIdx].size(); ++t) { - out << " {"; - for (size_t s = 0; s < response.alignments[sentenceIdx][t].size(); ++s) { - out << std::fixed << std::setw(8) << std::setprecision(8) << std::setfill(' ') - << response.alignments[sentenceIdx][t][s]; - out << ", "; - } - out << "},\n"; - } - out << " },\n"; - } - return out << "};\n"; -} - size_t debugCountTokens(AnnotatedText const &text) { size_t tokens = 1; // for the ending gap for (size_t sentenceIdx = 0; sentenceIdx < text.numSentences(); ++sentenceIdx) { @@ -501,7 +476,8 @@ HTML::HTML(std::string &&source, bool process_markup) { if (isBlockElement(scanner.tag()) && !source.empty() && source.back() != ' ') source.push_back(' '); // pool_ takes ownership of our tag, makes sure it's freed when necessary - pool_.emplace_back(new Tag{std::string(scanner.tag()), std::string(), isVoidTag(scanner.tag())}); + pool_.emplace_back(new Tag{isVoidTag(scanner.tag()) ? 
Tag::VOID_ELEMENT : Tag::ELEMENT, + std::string(scanner.tag()), std::string()}); // Tag *tag is used by attribute parsing tag = pool_.back().get(); @@ -511,7 +487,7 @@ HTML::HTML(std::string &&source, bool process_markup) { // Empty elements (e.g. ) are not applicable to a span of text // so instead we "apply" them to an empty span in between, and then // immediately remove them again from the stack. - if (tag->empty) { + if (tag->type == Tag::VOID_ELEMENT) { spans_.push_back(Span{source.size(), source.size(), stack}); stack.pop_back(); } @@ -539,8 +515,36 @@ HTML::HTML(std::string &&source, bool process_markup) { tag->attributes += format(" {}=\"{}\"", scanner.attribute(), scanner.value()); break; - default: + case markup::Scanner::TT_COMMENT_START: + // pool_ takes ownership of our tag, makes sure it's freed when necessary + pool_.emplace_back(new Tag{Tag::COMMENT}); + tag = pool_.back().get(); + stack.push_back(tag); + spans_.push_back(Span{source.size(), source.size(), stack}); + stack.pop_back(); + break; + + case markup::Scanner::TT_PROCESSING_INSTRUCTION_START: + // pool_ takes ownership of our tag, makes sure it's freed when necessary + pool_.emplace_back(new Tag{Tag::PROCESSING_INSTRUCTION}); + tag = pool_.back().get(); + stack.push_back(tag); + spans_.push_back(Span{source.size(), source.size(), stack}); + stack.pop_back(); + break; + + case markup::Scanner::TT_COMMENT_END: + case markup::Scanner::TT_PROCESSING_INSTRUCTION_END: + tag = nullptr; + break; + + case markup::Scanner::TT_DATA: + assert(tag != nullptr); + tag->data = scanner.value(); break; + + default: + throw BadHTML("Unsupported scanner token type"); } } diff --git a/src/translator/html.h b/src/translator/html.h index 5ddb3d006..b233fd225 100644 --- a/src/translator/html.h +++ b/src/translator/html.h @@ -19,9 +19,19 @@ class BadHTML : public std::runtime_error { class HTML { public: struct Tag { + enum NodeType { + ELEMENT, + VOID_ELEMENT, + COMMENT, + PROCESSING_INSTRUCTION, + }; + + 
NodeType type; // Type of the node std::string name; std::string attributes; - bool empty; + std::string data; // Raw data of an element that just needs to be + // copied as is, e.g. because + /// the script tag may not be nested, but that is not the case for these + /// elements per se. Some tags, like + diff --git a/wasm/test_page/js/index.js b/wasm/test_page/js/index.js index b1c308e8b..56cbfdc72 100644 --- a/wasm/test_page/js/index.js +++ b/wasm/test_page/js/index.js @@ -1,156 +1,215 @@ -let worker; -let modelRegistry; +import {LatencyOptimisedTranslator, TranslatorBacking, CancelledError, SupersededError} from '../node_modules/@browsermt/bergamot-translator/translator.js'; -const $ = selector => document.querySelector(selector); -const status = message => ($("#status").innerText = message); - -const langFrom = $("#lang-from"); -const langTo = $("#lang-to"); - -if (window.Worker) { - worker = new Worker("js/worker.js"); - worker.postMessage(["import"]); +function $(selector) { + return document.querySelector(selector); } -document.querySelector("#input").addEventListener("keyup", function (event) { - translateCall(); -}); - -const _prepareTranslateOptions = (paragraphs) => { - const translateOptions = []; - paragraphs.forEach(paragraph => { - // Each option object can be different for each entry. 
But to keep the test page simple, - // we just keep all the options same (specifically avoiding parsing the input to determine - // html/non-html text) - translateOptions.push({"isQualityScores": true, "isHtml": true}); - }); - return translateOptions; -}; +function $$(selector) { + return document.querySelectorAll(selector); +} -const textToHTML = (text) => { +function encodeHTML(text) { const div = document.createElement('div'); div.appendChild(document.createTextNode(text)); return div.innerHTML; -}; - -const translateCall = () => { - const text = document.querySelector("#input").value; - if (!text.trim().length) return; - - const paragraphs = text.split(/\n+/).map(textToHTML); // escape HTML - const translateOptions = _prepareTranslateOptions(paragraphs); - const lngFrom = langFrom.value; - const lngTo = langTo.value; - worker.postMessage(["translate", lngFrom, lngTo, paragraphs, translateOptions]); -}; - -const addQualityClasses = (root) => { - // You can do this wit CSS variables, calc() and min/max, but JS is just easier +} - root.querySelectorAll('[x-bergamot-sentence-score]').forEach(el => { +function addQualityIndicators() { + $$('#output [x-bergamot-sentence-score]').forEach(el => { // The threshold is ln(0.5) (https://github.com/browsermt/bergamot-translator/pull/370#issuecomment-1058123399) - el.classList.toggle('bad', parseFloat(el.getAttribute('x-bergamot-sentence-score')) < -0.6931); + el.classList.toggle('bad', parseFloat(el.getAttribute('x-bergamot-sentence-score')) < Math.log(0.5)); }); - root.querySelectorAll('[x-bergamot-word-score]').forEach(el => { + $$('#output [x-bergamot-word-score]').forEach(el => { // The threshold is ln(0.5) (https://github.com/browsermt/bergamot-translator/pull/370#issuecomment-1058123399) - el.classList.toggle('bad', parseFloat(el.getAttribute('x-bergamot-word-score')) < -0.6931); + el.classList.toggle('bad', parseFloat(el.getAttribute('x-bergamot-word-score')) < Math.log(0.5)); }); // Add tooltips to each (sub)word 
with sentence and word score.
-  root.querySelectorAll('[x-bergamot-sentence-score] > [x-bergamot-word-score]').forEach(el => {
+  $$('#output [x-bergamot-sentence-score] > [x-bergamot-word-score]').forEach(el => {
     const sentenceScore = parseFloat(el.parentNode.getAttribute('x-bergamot-sentence-score'));
     const wordScore = parseFloat(el.getAttribute('x-bergamot-word-score'));
-    el.title = `Sentence: ${sentenceScore} Word: ${wordScore}`;
+    el.title = `Sentence: ${Math.exp(sentenceScore).toFixed(2)} Word: ${Math.exp(wordScore).toFixed(2)}`;
   });
 }
 
-worker.onmessage = function (e) {
-  if (e.data[0] === "translate_reply" && e.data[1]) {
-    // Clear output of previous translation
-    document.querySelector("#output").innerHTML = '';
-
-    // Add each translation in its own div to have a known root in which the
-    // sentence ids are unique. Used for highlighting sentences.
-    e.data[1].forEach(translatedHTML => {
-      const translation = document.createElement('div');
-      translation.classList.add('translation');
-      translation.innerHTML = translatedHTML;
-      addQualityClasses(translation);
-      document.querySelector("#output").appendChild(translation);
-    });
-  } else if (e.data[0] === "load_model_reply" && e.data[1]) {
-    status(e.data[1]);
-    translateCall();
-  } else if (e.data[0] === "import_reply" && e.data[1]) {
-    modelRegistry = e.data[1];
-    init();
+function highlightSentence(element) {
+  const sentence = element.parentNode.hasAttribute('x-bergamot-sentence-index')
+    ? element.parentNode.getAttribute('x-bergamot-sentence-index')
+    : null;
+  $$('#output font[x-bergamot-sentence-index]').forEach(el => {
+    el.classList.toggle('highlight-sentence', el.getAttribute('x-bergamot-sentence-index') === sentence);
+  })
+}
+
+/**
+ * Very minimal WYSIWYG editor. Just keyboard shortcuts for the IYKYK crowd.
+ */ +class Editor { + constructor(root) { + this.isApple = window.navigator.platform.startsWith('Mac'); + + this.root = root; + this.root.addEventListener('keydown', this.onkeydown.bind(this)); + + this.mapping = { + "b": "bold", + "i": "italic", + "u": "underline", + }; } -}; - -const loadModel = () => { - const lngFrom = langFrom.value; - const lngTo = langTo.value; - if (lngFrom !== lngTo) { - status(`Installing model...`); - console.log(`Loading model '${lngFrom}${lngTo}'`); - worker.postMessage(["load_model", lngFrom, lngTo]); - } else { - const input = textToHTML(document.querySelector("#input").value); - document.querySelector("#output").innerHTML = input; + + onkeydown(event) { + if (!(this.isApple ? event.metaKey : event.ctrlKey)) + return; + + if (!(event.key in this.mapping)) + return; + + document.execCommand(this.mapping[event.key], false, null); + + event.preventDefault(); } -}; - -langFrom.addEventListener("change", e => { - loadModel(); -}); - -langTo.addEventListener("change", e => { - loadModel(); -}); - -$(".swap").addEventListener("click", e => { - [langFrom.value, langTo.value] = [langTo.value, langFrom.value]; - $("#input").value = $("#output").innerText; - loadModel(); -}); - -$('#output').addEventListener('mouseover', e => { - const root = e.target.closest('.translation'); - const sentence = e.target.parentNode.hasAttribute('x-bergamot-sentence-index') ? 
e.target.parentNode.getAttribute('x-bergamot-sentence-index') : null;
-  document.querySelectorAll('#output font[x-bergamot-sentence-index]').forEach(el => {
-    el.classList.toggle('highlight-sentence', el.getAttribute('x-bergamot-sentence-index') === sentence && el.closest('.translation') === root);
-  })
-})
+}
+
+async function main() {
+  const options = {
+    cacheSize: 2 ** 13, // 8192 entries; note `2^13` would be bitwise XOR (= 14)
+    downloadTimeout: null // Disable timeout
+  };
+
+  const backing = new TranslatorBacking(options);
+
+  let pending = 0; // Number of pending requests
+
+  // Patch the fetch() function to track number of pending requests
+  backing.fetch = async function(...args) {
+    try {
+      $('.app').classList.toggle('loading', ++pending > 0);
+      return await TranslatorBacking.prototype.fetch.call(backing, ...args);
+    } finally {
+      $('.app').classList.toggle('loading', --pending > 0);
+    }
+  };
 
-function init() {
-  // Populate langs
-  const langs = Array.from(new Set(Object.keys(modelRegistry).reduce((acc, key) => acc.concat([key.substr(0, 2), key.substr(2, 2)]), [])));
-  const langNames = new Intl.DisplayNames(undefined, {type: "language"});
+  // Wait for the language model registry to load. Once it is loaded, use
+  // it to fill the "from" and "to" language selection dropdowns.
+ await backing.registry.then(models => { + const names = new Intl.DisplayNames(['en'], {type: 'language'}); - // Sort languages by display name - langs.sort((a, b) => langNames.of(a).localeCompare(langNames.of(b))); + ['from', 'to'].forEach(field => { + const languages = new Set(models.map(model => model[field])); + const select = $(`#lang-${field}`); - // Populate the dropdowns - langs.forEach(code => { - const name = langNames.of(code); - langFrom.innerHTML += ``; - langTo.innerHTML += ``; + const pairs = Array.from(languages, code => ({code, name: names.of(code)})); + + pairs.sort(({name: a}, {name: b}) => a.localeCompare(b)); + + pairs.forEach(({name, code}) => { + select.add(new Option(name, code)); + }) + }); + + $('#lang-from').value = 'en'; + $('#lang-to').value = 'es'; }); - // try to guess input language from user agent - let myLang = navigator.language; - if (myLang) { - myLang = myLang.split("-")[0]; - let langIndex = langs.indexOf(myLang); - if (langIndex > -1) { - console.log("guessing input language is", myLang); - langFrom.value = myLang; + // Intentionally do this after querying backing.registry to make sure that + // that request is fired off first. Now we can start thinking about loading + // the WASM binary etc. + const translator = new LatencyOptimisedTranslator(options, backing); + + let abortController = new AbortController(); + + const translate = async () => { + try { + const from = $('#lang-from').value; + const to = $('#lang-to').value; + + // Querying models to see whether quality estimation is supported by all + // of them. 
+      const models = await backing.getModels({from, to});
+      const qualityScores = models.every(model => 'qualityModel' in model.files);
+
+      $('.app').classList.add('translating');
+
+      const response = await translator.translate({
+        from,
+        to,
+        text: $('#input').innerHTML,
+        html: true,
+        qualityScores
+      }, {signal: abortController.signal});
+
+      $('#output').innerHTML = response.target.text;
+      $('#output').classList.toggle('has-quality-scores', qualityScores);
+
+      if (qualityScores)
+        addQualityIndicators();
+
+    } catch (error) {
+      // Ignore errors caused by changing the language pair (which triggers
+      // abort()) and 'errors' caused by typing too fast, superseding a
+      // translation that was still in progress (or being loaded)
+      if (error.constructor === SupersededError || error.constructor === CancelledError)
+        return;
+
+      // Ignore errors caused by selecting a bad pair (e.g. en -> en)
+      if (error.message.startsWith('No model available to translate from'))
+        return;
+
+      alert(`Error during translation: ${error}\n\n${error.stack}`);
+    } finally {
+      const worker = await Promise.race([translator.worker, Promise.resolve(null)]);
+      $('.app').classList.toggle('translating', worker === null || !worker.idle);
+    }
+  }
 
-  // find first output lang that *isn't* input language
-  langTo.value = langs.find(code => code !== langFrom.value);
-  // load this model
-  loadModel();
+  const reset = async () => {
+    // Cancel any pending loading/translation
+    abortController.abort();
+
+    // Reset abort controller to a fresh un-aborted one
+    abortController = new AbortController();
+
+    // Clear output to make it more clear something is happening
+    $('#output').innerHTML = '';
+
+    // Immediately start loading the new selection
+    translate();
+  }
+
+  $('button.swap').addEventListener('click', () => {
+    const tmp = $('#lang-from').value;
+    $('#lang-from').value = $('#lang-to').value;
+    $('#lang-to').value = tmp;
translate(); + }) + + // Simple WYSIWYG controls + const editor = new Editor($('#input')); + + // Translate on any change + $('#input').addEventListener('input', translate); + $('#lang-from').addEventListener('input', reset); + $('#lang-to').addEventListener('input', reset); + + // Hook up sentence boundary highlighting if that information is available. + $('#output').addEventListener('mouseover', (e) => highlightSentence(e.target)) + + // Wait for bergamot-translator to load. This could throw a CompileError + // which we want to catch so we can show "oh noes browser not supported!" + translator.worker.catch(error => { + // Catch CompileErrors because for those we know what to do. + if (error.name === 'CompileError') + $('#unsupported-browser').hidden = false; + else + throw error; + }); } + +main(); diff --git a/wasm/test_page/js/worker.js b/wasm/test_page/js/worker.js deleted file mode 100644 index 3327d8a3a..000000000 --- a/wasm/test_page/js/worker.js +++ /dev/null @@ -1,352 +0,0 @@ -// All variables specific to translation service -var translationService = undefined; - -// Model registry -let modelRegistry = undefined; - -// A map of language-pair to TranslationModel object -var languagePairToTranslationModels = new Map(); - -const BERGAMOT_TRANSLATOR_MODULE = "bergamot-translator-worker.js"; -const MODEL_REGISTRY = "../models/registry.json"; -const MODEL_ROOT_URL = "../models/"; -const PIVOT_LANGUAGE = 'en'; - -// Information corresponding to each file type -const fileInfo = [ - {"type": "model", "alignment": 256}, - {"type": "lex", "alignment": 64}, - {"type": "vocab", "alignment": 64}, - {"type": "qualityModel", "alignment": 64} -]; - -const encoder = new TextEncoder(); // string to utf-8 converter -const decoder = new TextDecoder(); // utf-8 to string converter - -const start = Date.now(); -let moduleLoadStart; -var Module = { - preRun: [function() { - log(`Time until Module.preRun: ${(Date.now() - start) / 1000} secs`); - moduleLoadStart = Date.now(); - 
}], - onRuntimeInitialized: async function() { - log(`Wasm Runtime initialized Successfully (preRun -> onRuntimeInitialized) in ${(Date.now() - moduleLoadStart) / 1000} secs`); - const response = await fetch(MODEL_REGISTRY); - modelRegistry = await response.json(); - postMessage([`import_reply`, modelRegistry]); - } -}; - -const log = (message) => { - console.debug(message); -} - -onmessage = async function(e) { - const command = e.data[0]; - log(`Message '${command}' received from main script`); - let result = ""; - if (command === 'import') { - importScripts(BERGAMOT_TRANSLATOR_MODULE); - } else if (command === 'load_model') { - let start = Date.now(); - let from = e.data[1]; - let to = e.data[2]; - try { - await constructTranslationService(); - await constructTranslationModel(from, to); - log(`Model '${from}${to}' successfully constructed. Time taken: ${(Date.now() - start) / 1000} secs`); - result = "Model successfully loaded"; - } catch (error) { - log(`Model '${from}${to}' construction failed: '${error.message}'`); - result = "Model loading failed"; - } - log(`'${command}' command done, Posting message back to main script`); - postMessage([`${command}_reply`, result]); - } else if (command === 'translate') { - const from = e.data[1]; - const to = e.data[2]; - const input = e.data[3]; - const translateOptions = e.data[4]; - let inputWordCount = 0; - let inputBlockElements = 0; - input.forEach(sentence => { - inputWordCount += sentence.trim().split(" ").filter(word => word.trim() !== "").length; - inputBlockElements++; - }) - let start = Date.now(); - try { - log(`Blocks to translate: ${inputBlockElements}`); - result = translate(from, to, input, translateOptions); - const secs = (Date.now() - start) / 1000; - log(`Translation '${from}${to}' Successful. 
Speed: ${Math.round(inputWordCount / secs)} WPS (${inputWordCount} words in ${secs} secs)`); - } catch (error) { - log(`Error: ${error.message}`); - } - log(`'${command}' command done, Posting message back to main script`); - postMessage([`${command}_reply`, result]); - } -} - -// Instantiates the Translation Service -const constructTranslationService = async () => { - if (!translationService) { - var translationServiceConfig = {cacheSize: 20000}; - log(`Creating Translation Service with config: ${translationServiceConfig}`); - translationService = new Module.BlockingService(translationServiceConfig); - log(`Translation Service created successfully`); - } -} - -// Constructs translation model(s) for the source and target language pair (using -// pivoting if required). -const constructTranslationModel = async (from, to) => { - // Delete all previously constructed translation models and clear the map - languagePairToTranslationModels.forEach((value, key) => { - log(`Destructing model '${key}'`); - value.delete(); - }); - languagePairToTranslationModels.clear(); - - if (_isPivotingRequired(from, to)) { - // Pivoting requires 2 translation models - const languagePairSrcToPivot = _getLanguagePair(from, PIVOT_LANGUAGE); - const languagePairPivotToTarget = _getLanguagePair(PIVOT_LANGUAGE, to); - await Promise.all([_constructTranslationModelHelper(languagePairSrcToPivot), - _constructTranslationModelHelper(languagePairPivotToTarget)]); - } - else { - // Non-pivoting case requires only 1 translation model - await _constructTranslationModelHelper(_getLanguagePair(from, to)); - } -} - -// Translates text from source language to target language (via pivoting if necessary). -const translate = (from, to, input, translateOptions) => { - let vectorResponseOptions, vectorSourceText, vectorResponse; - try { - // Prepare the arguments (vectorResponseOptions and vectorSourceText (vector)) of Translation API and call it. 
- // Result is a vector where each of its item corresponds to one item of vectorSourceText in the same order. - vectorResponseOptions = _prepareResponseOptions(translateOptions); - vectorSourceText = _prepareSourceText(input); - - if (_isPivotingRequired(from, to)) { - // Translate via pivoting - const translationModelSrcToPivot = _getLoadedTranslationModel(from, PIVOT_LANGUAGE); - const translationModelPivotToTarget = _getLoadedTranslationModel(PIVOT_LANGUAGE, to); - vectorResponse = translationService.translateViaPivoting(translationModelSrcToPivot, - translationModelPivotToTarget, - vectorSourceText, - vectorResponseOptions); - } - else { - // Translate without pivoting - const translationModel = _getLoadedTranslationModel(from, to); - vectorResponse = translationService.translate(translationModel, vectorSourceText, vectorResponseOptions); - } - - // Parse all relevant information from vectorResponse - const listTranslatedText = _parseTranslatedText(vectorResponse); - const listSourceText = _parseSourceText(vectorResponse); - const listTranslatedTextSentences = _parseTranslatedTextSentences(vectorResponse); - const listSourceTextSentences = _parseSourceTextSentences(vectorResponse); - - log(`Source text: ${listSourceText}`); - log(`Translated text: ${listTranslatedText}`); - log(`Translated sentences: ${JSON.stringify(listTranslatedTextSentences)}`); - log(`Source sentences: ${JSON.stringify(listSourceTextSentences)}`); - - return listTranslatedText; - } finally { - // Necessary clean up - if (vectorSourceText != null) vectorSourceText.delete(); - if (vectorResponseOptions != null) vectorResponseOptions.delete(); - if (vectorResponse != null) vectorResponse.delete(); - } -} - -// Downloads file from a url and returns the array buffer -const _downloadAsArrayBuffer = async(url) => { - const response = await fetch(url); - if (!response.ok) { - throw Error(`Downloading ${url} failed: HTTP ${response.status} - ${response.statusText}`); - } - return 
response.arrayBuffer(); -} - -// Constructs and initializes the AlignedMemory from the array buffer and alignment size -const _prepareAlignedMemoryFromBuffer = async (buffer, alignmentSize) => { - var byteArray = new Int8Array(buffer); - var alignedMemory = new Module.AlignedMemory(byteArray.byteLength, alignmentSize); - const alignedByteArrayView = alignedMemory.getByteArrayView(); - alignedByteArrayView.set(byteArray); - return alignedMemory; -} - -async function prepareAlignedMemory(file, languagePair) { - const fileName = `${MODEL_ROOT_URL}/${languagePair}/${modelRegistry[languagePair][file.type].name}`; - const buffer = await _downloadAsArrayBuffer(fileName); - const alignedMemory = await _prepareAlignedMemoryFromBuffer(buffer, file.alignment); - log(`"${file.type}" aligned memory prepared. Size:${alignedMemory.size()} bytes, alignment:${file.alignment}`); - return alignedMemory; -} - -const _constructTranslationModelHelper = async (languagePair) => { - log(`Constructing translation model ${languagePair}`); - - /*Set the Model Configuration as YAML formatted string. - For available configuration options, please check: https://marian-nmt.github.io/docs/cmd/marian-decoder/ - Vocab files are re-used in both translation directions. 
- DO NOT CHANGE THE SPACES BETWEEN EACH ENTRY OF CONFIG - */ - const modelConfig = `beam-size: 1 -normalize: 1.0 -word-penalty: 0 -max-length-break: 128 -mini-batch-words: 1024 -workspace: 128 -max-length-factor: 2.0 -skip-cost: false -cpu-threads: 0 -quiet: true -quiet-translation: true -gemm-precision: int8shiftAlphaAll -alignment: soft -`; - - const promises = []; - fileInfo.filter(file => modelRegistry[languagePair].hasOwnProperty(file.type)) - .map((file) => { - promises.push(prepareAlignedMemory(file, languagePair)); - }); - - const alignedMemories = await Promise.all(promises); - - log(`Translation Model config: ${modelConfig}`); - log(`Aligned memory sizes: Model:${alignedMemories[0].size()} Shortlist:${alignedMemories[1].size()} Vocab:${alignedMemories[2].size()}`); - const alignedVocabMemoryList = new Module.AlignedMemoryList(); - alignedVocabMemoryList.push_back(alignedMemories[2]); - let translationModel; - if (alignedMemories.length === fileInfo.length) { - log(`QE:${alignedMemories[3].size()}`); - translationModel = new Module.TranslationModel(modelConfig, alignedMemories[0], alignedMemories[1], alignedVocabMemoryList, alignedMemories[3]); - } - else { - translationModel = new Module.TranslationModel(modelConfig, alignedMemories[0], alignedMemories[1], alignedVocabMemoryList, null); - } - languagePairToTranslationModels.set(languagePair, translationModel); -} - -const _isPivotingRequired = (from, to) => { - return (from !== PIVOT_LANGUAGE) && (to !== PIVOT_LANGUAGE); -} - -const _getLanguagePair = (srcLang, tgtLang) => { - return `${srcLang}${tgtLang}`; -} - -const _getLoadedTranslationModel = (srcLang, tgtLang) => { - const languagePair = _getLanguagePair(srcLang, tgtLang); - if (!languagePairToTranslationModels.has(languagePair)) { - throw Error(`Translation model '${languagePair}' not loaded`); - } - return languagePairToTranslationModels.get(languagePair); -} - -const _parseTranslatedText = (vectorResponse) => { - const result = []; - for (let i = 
0; i < vectorResponse.size(); i++) {
-    const response = vectorResponse.get(i);
-    result.push(response.getTranslatedText());
-  }
-  return result;
-}
-
-const _parseTranslatedTextSentences = (vectorResponse) => {
-  const result = [];
-  for (let i = 0; i < vectorResponse.size(); i++) {
-    const response = vectorResponse.get(i);
-    result.push(_getTranslatedSentences(response));
-  }
-  return result;
-}
-
-const _parseSourceText = (vectorResponse) => {
-  const result = [];
-  for (let i = 0; i < vectorResponse.size(); i++) {
-    const response = vectorResponse.get(i);
-    result.push(response.getOriginalText());
-  }
-  return result;
-}
-
-const _parseSourceTextSentences = (vectorResponse) => {
-  const result = [];
-  for (let i = 0; i < vectorResponse.size(); i++) {
-    const response = vectorResponse.get(i);
-    result.push(_getSourceSentences(response));
-  }
-  return result;
-}
-
-const _prepareResponseOptions = (translateOptions) => {
-  let vectorResponseOptions = new Module.VectorResponseOptions;
-  translateOptions.forEach(translateOption => {
-    vectorResponseOptions.push_back({
-      qualityScores: translateOption["isQualityScores"],
-      alignment: true,
-      html: translateOption["isHtml"]
-    });
-  });
-  if (vectorResponseOptions.size() == 0) {
-    vectorResponseOptions.delete();
-    throw Error(`No Translation Options provided`);
-  }
-  return vectorResponseOptions;
-}
-
-const _prepareSourceText = (input) => {
-  let vectorSourceText = new Module.VectorString;
-  input.forEach(paragraph => {
-    // prevent empty paragraph - it breaks the translation
-    if (paragraph.trim() === "") {
-      return;
-    }
-    vectorSourceText.push_back(paragraph.trim())
-  })
-  if (vectorSourceText.size() == 0) {
-    vectorSourceText.delete();
-    throw Error(`No text provided to translate`);
-  }
-  return vectorSourceText;
-}
-
-const _getTranslatedSentences = (response) => {
-  const sentences = [];
-  const text = response.getTranslatedText();
-  for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) {
-    const utf8SentenceByteRange = response.getTranslatedSentence(sentenceIndex);
-    sentences.push(_getSubString(text, utf8SentenceByteRange));
-  }
-  return sentences;
-}
-
-const _getSourceSentences = (response) => {
-  const sentences = [];
-  const text = response.getOriginalText();
-  for (let sentenceIndex = 0; sentenceIndex < response.size(); sentenceIndex++) {
-    const utf8SentenceByteRange = response.getSourceSentence(sentenceIndex);
-    sentences.push(_getSubString(text, utf8SentenceByteRange));
-  }
-  return sentences;
-}
-
-/*
- * Returns a substring of text (a string). The substring is represented by
- * byteRange (begin and end indices) within the utf-8 encoded version of the text.
- */
-const _getSubString = (text, utf8ByteRange) => {
-  const textUtf8ByteView = encoder.encode(text);
-  const substringUtf8ByteView = textUtf8ByteView.subarray(utf8ByteRange.begin, utf8ByteRange.end);
-  return decoder.decode(substringUtf8ByteView);
-}
diff --git a/wasm/test_page/logos.png b/wasm/test_page/logos.png
new file mode 100644
index 0000000000000000000000000000000000000000..7646f3ca2623fd25629cc37b939cbc7331141caa
GIT binary patch
literal 15207
[base85-encoded binary patch data for logos.png (15207 bytes) omitted]

literal 0
HcmV?d00001

diff --git a/wasm/test_page/package-lock.json b/wasm/test_page/package-lock.json
index 5ead514d8..22d229647 100644
--- a/wasm/test_page/package-lock.json
+++ b/wasm/test_page/package-lock.json
@@ -5,11 +5,21 @@
   "packages": {
     "": {
       "dependencies": {
+        "@browsermt/bergamot-translator": "file:../module",
         "cors": "^2.8.5",
         "express": "^4.18.2",
         "nocache": "^2.1.0"
       }
     },
+    "../module": {
+      "name": "@browsermt/bergamot-translator",
+      "version": "0.4.8",
+      "license": "MPL-2.0"
+    },
+    "node_modules/@browsermt/bergamot-translator": {
+      "resolved": "../module",
+      "link": true
+    },
     "node_modules/accepts": {
       "version": "1.3.8",
       "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.8.tgz",
@@ -616,6 +626,9 @@
       }
     },
     "dependencies": {
+      "@browsermt/bergamot-translator": {
+        "version": "file:../module"
+      },
       "accepts": {
         "version": "1.3.8",
         "resolved": "https://registry.npmjs.org/accepts/-/accepts-1.3.8.tgz",
diff --git a/wasm/test_page/package.json b/wasm/test_page/package.json
index 79447e3bf..622b48c1a 100644
--- a/wasm/test_page/package.json
+++ b/wasm/test_page/package.json
@@ -1,7 +1,14 @@
 {
   "dependencies": {
+
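An aside on the byte-range technique used by the `_getSubString` helper deleted above: the translator reports sentence boundaries as offsets into the UTF-8 encoding of the text, not as JavaScript string (UTF-16) indices, so the text must be re-encoded before slicing. A minimal, self-contained sketch of the same idea — plain `TextEncoder`/`TextDecoder`, with the bergamot `Response` API deliberately left out:

```javascript
// Extract a substring given begin/end offsets into the UTF-8 encoding of
// the text (what the translator's sentence byte ranges refer to), rather
// than JavaScript's UTF-16 code-unit indices.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

const getSubString = (text, utf8ByteRange) => {
  const textUtf8ByteView = encoder.encode(text);
  const substringUtf8ByteView = textUtf8ByteView.subarray(utf8ByteRange.begin, utf8ByteRange.end);
  return decoder.decode(substringUtf8ByteView);
};

// "¡Hola!" occupies 7 bytes, because "¡" takes two bytes in UTF-8.
console.log(getSubString("¡Hola! Adiós.", { begin: 0, end: 7 })); // ¡Hola!
```

A UTF-16 `text.substring` called with the same numbers would return the wrong span whenever the text contains multi-byte characters, which is why the helper re-encodes first.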
"@browsermt/bergamot-translator": "file:../module", "cors": "^2.8.5", "express": "^4.18.2", "nocache": "^2.1.0" + }, + "config": { + "port": 80 + }, + "scripts": { + "start": "node ./bergamot-httpserver.js $npm_package_config_port 1 0" } } diff --git a/wasm/test_page/start_server.sh b/wasm/test_page/start_server.sh index 59d455d14..5b6eeb0a3 100644 --- a/wasm/test_page/start_server.sh +++ b/wasm/test_page/start_server.sh @@ -24,7 +24,7 @@ fi # Prepare a list all wasm artifacts to be copied and copy them to the destination folder ARTIFACTS_BASE_NAME="bergamot-translator-worker" ARTIFACTS="$1/$ARTIFACTS_BASE_NAME.js $1/$ARTIFACTS_BASE_NAME.wasm" -ARTIFACTS_DESTINATION_FOLDER=$SCRIPT_ABSOLUTE_PATH/js +ARTIFACTS_DESTINATION_FOLDER=$SCRIPT_ABSOLUTE_PATH/../module/worker for i in $ARTIFACTS; do [ -f "$i" ] || breaks From 1ba7461a36ed94423896d47f8fd8397e7265eb3e Mon Sep 17 00:00:00 2001 From: Nikolay Bogoychev Date: Thu, 19 Jan 2023 10:06:57 +0000 Subject: [PATCH 395/442] Fix compilation on x86 --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 4b30c267c..69e27d298 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 4b30c267c701198cef4cddcd646cca17ccbb16f5 +Subproject commit 69e27d298419a2ff0e24ea7c43cad997fa8230c0 From 82c276a15c23a40bc7e21e8a1e0a289a6ce57017 Mon Sep 17 00:00:00 2001 From: Kenneth Heafield Date: Wed, 1 Mar 2023 18:30:38 +0000 Subject: [PATCH 396/442] Fix path to example program --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b70c818ec..eae9ef319 100644 --- a/README.md +++ b/README.md @@ -80,7 +80,7 @@ git submodule update --init --recursive ### Using Native version The builds generate library that can be integrated to any project. All the public header files are specified in `src` folder.\ -A short example of how to use the APIs is provided in `app/main.cpp` file. 
+A short example of how to use the APIs is provided in `app/bergamot.cpp` file. ### Using WASM version From eb0fe1b583d3c66a59bbbe1ce830f76a6d037496 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 4 May 2023 10:55:15 +0100 Subject: [PATCH 397/442] Bump 3rd_party/marian-dev from `69e27d2` to `8ceb051` (#446) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `69e27d2` to `8ceb051`. - [Release notes](https://github.com/browsermt/marian-dev/releases) - [Commits](https://github.com/browsermt/marian-dev/compare/69e27d298419a2ff0e24ea7c43cad997fa8230c0...8ceb051b7f6388ed5edf7e1e2d0dde0c3cd7d737) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 69e27d298..8ceb051b7 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 69e27d298419a2ff0e24ea7c43cad997fa8230c0 +Subproject commit 8ceb051b7f6388ed5edf7e1e2d0dde0c3cd7d737 From fceb713b2749724bff1eaa8cafdd694b740f3304 Mon Sep 17 00:00:00 2001 From: Nikolay Bogoychev Date: Thu, 4 May 2023 11:16:07 +0100 Subject: [PATCH 398/442] Update workflows --- .github/workflows/native.yml | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/.github/workflows/native.yml b/.github/workflows/native.yml index 8ee8c5c5f..41a91af1c 100644 --- a/.github/workflows/native.yml +++ b/.github/workflows/native.yml @@ -27,9 +27,9 @@ jobs: cmake: -DCOMPILE_TESTS=on brt_tags: "" unittests: 'true' - - name: Ubuntu 18.04 minimal - os: ubuntu-18.04 - identifier: ubuntu_1804_minimal + - name: Ubuntu 22.04 minimal + os: ubuntu-22.04 + identifier: ubuntu_2204_minimal cmake: -DCOMPILE_TESTS=on 
-DUSE_WASM_COMPATIBLE_SOURCE=on brt_tags: "'#wasm'" unittests: 'false' @@ -140,15 +140,15 @@ jobs: fail-fast: false matrix: include: - - name: MacOS 10.15 full - os: macos-10.15 - identifier: mac_1015_full + - name: MacOS 12 full + os: macos-12 + identifier: mac_12_full cmake: -DCOMPILE_TESTS=on -DUSE_APPLE_ACCELERATE=off -DUSE_FBGEMM=off -DUSE_STATIC_LIBS=off brt_tags: "" unittests: 'true' - - name: MacOS 10.15 minimal - os: macos-10.15 - identifier: mac_1015_minimal + - name: MacOS 12 minimal + os: macos-12 + identifier: mac_12_minimal cmake: -DCOMPILE_TESTS=on -DUSE_APPLE_ACCELERATE=off -DUSE_FBGEMM=off -DUSE_STATIC_LIBS=on -DUSE_WASM_COMPATIBLE_SOURCE=on brt_tags: "'#wasm'" unittests: 'false' From 3c2a667f9b5b748a3808a78b373098791ed636de Mon Sep 17 00:00:00 2001 From: Nikolay Bogoychev Date: Thu, 4 May 2023 12:06:20 +0100 Subject: [PATCH 399/442] Try harder to install gperftools --- .github/workflows/native.yml | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/.github/workflows/native.yml b/.github/workflows/native.yml index 41a91af1c..6c5f56913 100644 --- a/.github/workflows/native.yml +++ b/.github/workflows/native.yml @@ -21,9 +21,9 @@ jobs: fail-fast: false matrix: include: - - name: Ubuntu 18.04 full - os: ubuntu-18.04 - identifier: ubuntu_1804_full + - name: Ubuntu 22.04 full + os: ubuntu-22.04 + identifier: ubuntu_2204_full cmake: -DCOMPILE_TESTS=on brt_tags: "" unittests: 'true' @@ -55,9 +55,7 @@ jobs: - name: Install Dependencies run: |- sudo apt-get update - sudo apt-get install -y \ - libgoogle-perftools-dev libprotobuf-dev protobuf-compiler \ - libboost-all-dev ccache + sudo apt-get install -y libprotobuf-dev protobuf-compiler libboost-all-dev ccache libunwind-dev libgoogle-perftools-dev - name: Install MKL run: |- wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB" | sudo apt-key add - From b3d36bca905a201f1239f74e5b0049db66065bed Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" 
<49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 10 May 2023 16:07:24 +0100 Subject: [PATCH 400/442] Bump 3rd_party/marian-dev from `8ceb051` to `bb65f47` (#447) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `8ceb051` to `bb65f47`. - [Commits](https://github.com/browsermt/marian-dev/compare/8ceb051b7f6388ed5edf7e1e2d0dde0c3cd7d737...bb65f473d535e6bcbc1a97beff5824397c0cd9cb) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 8ceb051b7..bb65f473d 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 8ceb051b7f6388ed5edf7e1e2d0dde0c3cd7d737 +Subproject commit bb65f473d535e6bcbc1a97beff5824397c0cd9cb From ada8c3922490cc6a507bcf81fa4882b435595323 Mon Sep 17 00:00:00 2001 From: XapaJIaMnu Date: Tue, 6 Jun 2023 17:04:49 +0100 Subject: [PATCH 401/442] Fix compilation on newer gcc --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index bb65f473d..b20981969 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit bb65f473d535e6bcbc1a97beff5824397c0cd9cb +Subproject commit b209819699e0725fa2dde4ebc98b7d91ded0c243 From eaa2562fe0b3b2bd9ac3424962ada33b7c3be2f1 Mon Sep 17 00:00:00 2001 From: XapaJIaMnu Date: Thu, 13 Jul 2023 00:14:13 +0100 Subject: [PATCH 402/442] Sentencepiece windows compilation --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index b20981969..6a6bbb627 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 
b209819699e0725fa2dde4ebc98b7d91ded0c243 +Subproject commit 6a6bbb627877d40840b8b852eea80ddff22adceb From e333208cb93b01e0ec93402c4448cc7b18daeda9 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 31 Jul 2023 15:26:44 +0100 Subject: [PATCH 403/442] Bump 3rd_party/marian-dev from `6a6bbb6` to `aa0221e` (#452) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `6a6bbb6` to `aa0221e`. - [Commits](https://github.com/browsermt/marian-dev/compare/6a6bbb627877d40840b8b852eea80ddff22adceb...aa0221e687fe8b3b69b5bb64279d4349663ad410) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 6a6bbb627..aa0221e68 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 6a6bbb627877d40840b8b852eea80ddff22adceb +Subproject commit aa0221e687fe8b3b69b5bb64279d4349663ad410 From becb6e2cda6b76ac66fe5396de04ded3e20c3503 Mon Sep 17 00:00:00 2001 From: Graeme Nail Date: Mon, 31 Jul 2023 15:27:24 +0100 Subject: [PATCH 404/442] Fix Python formatting (Black) (#453) --- bindings/python/repository.py | 2 -- setup.py | 2 +- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/bindings/python/repository.py b/bindings/python/repository.py index 323b4482b..9667c7242 100644 --- a/bindings/python/repository.py +++ b/bindings/python/repository.py @@ -139,7 +139,6 @@ def download(self, model_identifier: str): with tarfile.open(save_location) as model_archive: def is_within_directory(directory, target): - abs_directory = os.path.abspath(directory) abs_target = os.path.abspath(target) @@ -148,7 +147,6 @@ def is_within_directory(directory, target): return prefix == 
abs_directory def safe_extract(tar, path=".", members=None, *, numeric_owner=False): - for member in tar.getmembers(): member_path = os.path.join(path, member.name) if not is_within_directory(path, member_path): diff --git a/setup.py b/setup.py index 51161a3c0..ed4c6dc81 100644 --- a/setup.py +++ b/setup.py @@ -16,6 +16,7 @@ "win-arm64": "ARM64", } + # A CMakeExtension needs a sourcedir instead of a file list. # The name must be the _single_ output extension from the CMake build. # If you need multiple extensions, see scikit-build. @@ -84,7 +85,6 @@ def build_extension(self, ext): pass else: - # Single config generators are handled "normally" single_config = any(x in cmake_generator for x in {"NMake", "Ninja"}) From cbfa839eef6715da0f356d47cfac55fe22700ae9 Mon Sep 17 00:00:00 2001 From: Graeme Nail Date: Mon, 31 Jul 2023 15:54:42 +0100 Subject: [PATCH 405/442] Fix CI (#454) * Use ubuntu-latest, macos-latest in GitHub Actions for cibuildwheel * Update deprecated ubuntu-18.04 to ubuntu-latest for docs in GH actions --- .github/workflows/build.yml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index d0afe1649..f06b26357 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -24,7 +24,7 @@ jobs: build-wheels: strategy: matrix: - os: [ubuntu-20.04, macos-10.15] + os: [ubuntu-latest, macos-latest] fail-fast: false name: "cibuildwheel / ${{ matrix.os }}" @@ -281,7 +281,7 @@ jobs: ${{github.workspace}}/build-wasm/bergamot-translator-worker.wasm ${{github.workspace}}/build-wasm/bergamot-translator-worker.js.bak - + upload-wasm: name: "Upload node package to NPM" runs-on: ubuntu-latest @@ -383,7 +383,7 @@ jobs: python3 -m pytype bindings/python docs: - runs-on: ubuntu-18.04 + runs-on: ubuntu-latest needs: [build-wheels] steps: - name: Checkout From 8011f9c849ca7351f886c55f9780d8583fb4c8f5 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" 
<49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 31 Jul 2023 15:54:53 +0100 Subject: [PATCH 406/442] Bump bergamot-translator-tests from `7984d14` to `a04432d` (#455) Bumps [bergamot-translator-tests](https://github.com/browsermt/bergamot-translator-tests) from `7984d14` to `a04432d`. - [Commits](https://github.com/browsermt/bergamot-translator-tests/compare/7984d140aef00489699d0b7711fa942816224294...a04432d7921bfa1dd62bc2e5cdca46b226f256de) --- updated-dependencies: - dependency-name: bergamot-translator-tests dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- bergamot-translator-tests | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bergamot-translator-tests b/bergamot-translator-tests index 7984d140a..a04432d79 160000 --- a/bergamot-translator-tests +++ b/bergamot-translator-tests @@ -1 +1 @@ -Subproject commit 7984d140aef00489699d0b7711fa942816224294 +Subproject commit a04432d7921bfa1dd62bc2e5cdca46b226f256de From 4b0da8d434e5a688139255873afd177f647ef777 Mon Sep 17 00:00:00 2001 From: Graeme Nail Date: Tue, 1 Aug 2023 19:35:11 +0100 Subject: [PATCH 407/442] Enables model ensembles (#450) * Enables model ensembles Adds the ability to use ensembles of models. This supports ensembles of binary- or npz-format models, as well as mixtures of both. When all models in the ensembles are of binary format, the load from memory path is used. Otherwise, they are loaded via the file system. Enable log-level debug for output related to this. * Fix formatting * Fix WASM bindings for MemoryBundle For now, this does not support ensembles. * Remove shared_ptr wrapping the AlignedMemory of models. 
* Fix formatting --- src/translator/byte_array_util.cpp | 31 ++++++++++++---------- src/translator/byte_array_util.h | 2 +- src/translator/definitions.h | 4 +-- src/translator/translation_model.cpp | 39 ++++++++++++++++++---------- wasm/bindings/service_bindings.cpp | 2 +- 5 files changed, 46 insertions(+), 32 deletions(-) diff --git a/src/translator/byte_array_util.cpp b/src/translator/byte_array_util.cpp index 183dea3c0..c7515e797 100644 --- a/src/translator/byte_array_util.cpp +++ b/src/translator/byte_array_util.cpp @@ -91,21 +91,24 @@ AlignedMemory loadFileToMemory(const std::string& path, size_t alignment) { return alignedMemory; } -AlignedMemory getModelMemoryFromConfig(marian::Ptr options) { +std::vector getModelMemoryFromConfig(marian::Ptr options) { auto models = options->get>("models"); - ABORT_IF(models.size() != 1, "Loading multiple binary models is not supported for now as it is not necessary."); - - // If binary model we load into aligned memory. If .npz we leave it be to - // return empty aligned memory, thus allowing traditional file system loads. - if (marian::io::isBin(models[0])) { - AlignedMemory alignedMemory = loadFileToMemory(models[0], 256); - return alignedMemory; - } else if (marian::io::isNpz(models[0])) { - return AlignedMemory(); - } else { - ABORT("Unknown extension for model: {}, should be one of `.bin` or `.npz`", models[0]); + + std::vector modelMemories(models.size()); + for (size_t i = 0; i < models.size(); ++i) { + const auto model = models[i]; + if (marian::io::isBin(model)) { + modelMemories[i] = loadFileToMemory(model, 256); + } else if (marian::io::isNpz(model)) { + // if any of the models are npz format, we revert to loading from file for all models. 
+ LOG(debug, "Encountered an npz file {}; will use file loading for {} models", model, models.size()); + return {}; + } else { + ABORT("Unknown extension for model: {}, should be one of `.bin` or `.npz`", model); + } } - return AlignedMemory(); + + return modelMemories; } AlignedMemory getShortlistMemoryFromConfig(marian::Ptr options) { @@ -153,7 +156,7 @@ AlignedMemory getQualityEstimatorModel(MemoryBundle& memoryBundle, const marian: MemoryBundle getMemoryBundleFromConfig(marian::Ptr options) { MemoryBundle memoryBundle; - memoryBundle.model = getModelMemoryFromConfig(options); + memoryBundle.models = getModelMemoryFromConfig(options); memoryBundle.shortlist = getShortlistMemoryFromConfig(options); getVocabsMemoryFromConfig(options, memoryBundle.vocabs); memoryBundle.ssplitPrefixFile = getSsplitPrefixFileMemoryFromConfig(options); diff --git a/src/translator/byte_array_util.h b/src/translator/byte_array_util.h index b445b3dec..851a175fd 100644 --- a/src/translator/byte_array_util.h +++ b/src/translator/byte_array_util.h @@ -5,7 +5,7 @@ namespace marian { namespace bergamot { AlignedMemory loadFileToMemory(const std::string& path, size_t alignment); -AlignedMemory getModelMemoryFromConfig(marian::Ptr options); +std::vector getModelMemoryFromConfig(marian::Ptr options); AlignedMemory getQualityEstimatorModel(const marian::Ptr& options); AlignedMemory getQualityEstimatorModel(MemoryBundle& memoryBundle, const marian::Ptr& options); AlignedMemory getShortlistMemoryFromConfig(marian::Ptr options); diff --git a/src/translator/definitions.h b/src/translator/definitions.h index b3bc1019b..efba3f9f6 100644 --- a/src/translator/definitions.h +++ b/src/translator/definitions.h @@ -19,8 +19,8 @@ typedef AlignedVector AlignedMemory; /// Memory bundle for all byte-arrays. /// Can be a set/subset of model, shortlist, vocabs and ssplitPrefixFile bytes. 
struct MemoryBundle { - AlignedMemory model{}; ///< Byte-array of model (aligned to 256) - AlignedMemory shortlist{}; ///< Byte-array of shortlist (aligned to 64) + std::vector models{}; ///< Byte-array of model (each element is aligned to 256) + AlignedMemory shortlist{}; ///< Byte-array of shortlist (aligned to 64) /// Vector of vocabulary memories (aligned to 64). /// If two vocabularies are the same (based on the filenames), two entries (shared diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp index 3f91ebb47..6f8dd4dc8 100644 --- a/src/translator/translation_model.cpp +++ b/src/translator/translation_model.cpp @@ -61,24 +61,35 @@ void TranslationModel::loadBackend(size_t idx) { graph->getBackend()->configureDevice(options_); graph->reserveWorkspaceMB(options_->get("workspace")); - // Marian Model: Load from memoryBundle or shortList - if (memory_.model.size() > 0 && - memory_.model.begin() != - nullptr) { // If we have provided a byte array that contains the model memory, we can initialise the - // model from there, as opposed to from reading in the config file - ABORT_IF((uintptr_t)memory_.model.begin() % 256 != 0, - "The provided memory is not aligned to 256 bytes and will crash when vector instructions are used on it."); - if (options_->get("check-bytearray", false)) { - ABORT_IF(!validateBinaryModel(memory_.model, memory_.model.size()), - "The binary file is invalid. Incomplete or corrupted download?"); - } - const std::vector container = { - memory_.model.begin()}; // Marian supports multiple models initialised in this manner hence std::vector. - // However we will only ever use 1 during decoding. 
+ // if memory_.models is populated, then all models were of binary format + if (memory_.models.size() >= 1) { + const std::vector container = std::invoke([&]() { + std::vector model_ptrs(memory_.models.size()); + for (size_t i = 0; i < memory_.models.size(); ++i) { + const AlignedMemory &model = memory_.models[i]; + + ABORT_IF(model.size() == 0 || model.begin() == nullptr, "The provided memory is empty. Cannot load the model."); + ABORT_IF( + (uintptr_t)model.begin() % 256 != 0, + "The provided memory is not aligned to 256 bytes and will crash when vector instructions are used on it."); + if (options_->get("check-bytearray", false)) { + ABORT_IF(!validateBinaryModel(model, model.size()), + "The binary file is invalid. Incomplete or corrupted download?"); + } + + model_ptrs[i] = model.begin(); + LOG(debug, "Loaded model {} of {} from memory", (i + 1), model_ptrs.size()); + } + return model_ptrs; + }); + scorerEnsemble = createScorers(options_, container); } else { + // load npz format models, or a mixture of binary/npz formats scorerEnsemble = createScorers(options_); + LOG(debug, "Loaded {} model(s) from file", scorerEnsemble.size()); } + for (auto scorer : scorerEnsemble) { scorer->init(graph); if (shortlistGenerator_) { diff --git a/wasm/bindings/service_bindings.cpp b/wasm/bindings/service_bindings.cpp index d56615dc6..54675a498 100644 --- a/wasm/bindings/service_bindings.cpp +++ b/wasm/bindings/service_bindings.cpp @@ -48,7 +48,7 @@ MemoryBundle prepareMemoryBundle(AlignedMemory* modelMemory, AlignedMemory* shor std::vector uniqueVocabsMemories, AlignedMemory* qualityEstimatorMemory) { MemoryBundle memoryBundle; - memoryBundle.model = std::move(*modelMemory); + memoryBundle.models.emplace_back(std::move(*modelMemory)); memoryBundle.shortlist = std::move(*shortlistMemory); memoryBundle.vocabs = std::move(prepareVocabsSmartMemories(uniqueVocabsMemories)); if (qualityEstimatorMemory != nullptr) { From 2bdc493df3fa5109b2cd434a7a9634eb021b514b Mon Sep 17 00:00:00 
2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 8 Aug 2023 10:37:24 +0300 Subject: [PATCH 408/442] Bump 3rd_party/ssplit-cpp from `ad2c5a5` to `a311f98` (#456) Bumps [3rd_party/ssplit-cpp](https://github.com/browsermt/ssplit-cpp) from `ad2c5a5` to `a311f98`. - [Commits](https://github.com/browsermt/ssplit-cpp/compare/ad2c5a52a507ec5a1f58c6403fc674e76e92e185...a311f9865ade34db1e8e080e6cc146f55dafb067) --- updated-dependencies: - dependency-name: 3rd_party/ssplit-cpp dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/ssplit-cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/ssplit-cpp b/3rd_party/ssplit-cpp index ad2c5a52a..a311f9865 160000 --- a/3rd_party/ssplit-cpp +++ b/3rd_party/ssplit-cpp @@ -1 +1 @@ -Subproject commit ad2c5a52a507ec5a1f58c6403fc674e76e92e185 +Subproject commit a311f9865ade34db1e8e080e6cc146f55dafb067 From ca954670aa4327630a3aee427668728b12b02df7 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 11 Aug 2023 15:04:27 +0100 Subject: [PATCH 409/442] Bump 3rd_party/marian-dev from `aa0221e` to `8dbde0f` (#458) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `aa0221e` to `8dbde0f`. - [Commits](https://github.com/browsermt/marian-dev/compare/aa0221e687fe8b3b69b5bb64279d4349663ad410...8dbde0fd8e690ad8791fb7fc94dba7674ee7c77e) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index aa0221e68..8dbde0fd8 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit aa0221e687fe8b3b69b5bb64279d4349663ad410 +Subproject commit 8dbde0fd8e690ad8791fb7fc94dba7674ee7c77e From 534ed37a3d609f867a65c250328c5745b306a3c5 Mon Sep 17 00:00:00 2001 From: Nikolay Bogoychev Date: Mon, 14 Aug 2023 17:22:54 +0300 Subject: [PATCH 410/442] Remove wormhole references (#459) * Remove warmhole references * Remove more references to the WORMHOLE * Update marian to wormhole removed marian * Whoops --------- Co-authored-by: Jelmer van der Linde --- .circleci/config.yml | 69 +++---------------------- .github/workflows/build.yml | 3 +- 3rd_party/marian-dev | 2 +- CMakeLists.txt | 1 - README.md | 9 +--- build-wasm.sh | 41 +-------------- wasm/README.md | 14 +---- wasm/patch-artifacts-enable-wormhole.sh | 36 ------------- 8 files changed, 16 insertions(+), 159 deletions(-) delete mode 100644 wasm/patch-artifacts-enable-wormhole.sh diff --git a/.circleci/config.yml b/.circleci/config.yml index 140e3116d..52d58fc09 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -1,52 +1,6 @@ version: 2.1 jobs: - build-with-wormhole: - docker: - - image: 'emscripten/emsdk:3.1.8' - resource_class: medium - - working_directory: ~/checkout - - steps: - - checkout - - - run: - name: Build WASM WORMHOLE - command: | - bash build-wasm.sh WORMHOLE - - - run: - name: Check artifacts - working_directory: build-wasm - command: | - ARTIFACT_BASE="bergamot-translator-worker" - ARTIFACT_SUFFIX="with-wormhole" - ARTIFACT_FINAL=$ARTIFACT_BASE-$ARTIFACT_SUFFIX - - if [[ -f "$ARTIFACT_BASE.js" && -f "$ARTIFACT_BASE.wasm" ]]; then - echo "Artifacts Successfully Generated" - mkdir ../artifacts - cp 
$ARTIFACT_BASE.wasm ../artifacts/$ARTIFACT_FINAL.wasm - cp $ARTIFACT_BASE.js ../artifacts/$ARTIFACT_FINAL.js - cd ../artifacts - shasum -a 256 $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize-$ARTIFACT_SUFFIX - ls -lsa $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize-$ARTIFACT_SUFFIX - cp ../BERGAMOT_VERSION . - else - echo "Failure: Artifacts Not Present" - exit 1 - fi - - - persist_to_workspace: - root: . - paths: - - artifacts/* - - - store_artifacts: - path: "artifacts" - destination: "wasm-wormhole" - - build-without-wormhole: + build: docker: - image: 'emscripten/emsdk:3.1.8' resource_class: medium @@ -66,8 +20,7 @@ jobs: working_directory: build-wasm command: | ARTIFACT_BASE="bergamot-translator-worker" - ARTIFACT_SUFFIX="without-wormhole" - ARTIFACT_FINAL=$ARTIFACT_BASE-$ARTIFACT_SUFFIX + ARTIFACT_FINAL=$ARTIFACT_BASE if [[ -f "$ARTIFACT_BASE.js" && -f "$ARTIFACT_BASE.wasm" ]]; then echo "Artifacts Successfully Generated" @@ -75,8 +28,8 @@ jobs: cp $ARTIFACT_BASE.wasm ../artifacts/$ARTIFACT_FINAL.wasm cp $ARTIFACT_BASE.js ../artifacts/$ARTIFACT_FINAL.js cd ../artifacts - shasum -a 256 $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize-$ARTIFACT_SUFFIX - ls -lsa $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize-$ARTIFACT_SUFFIX + shasum -a 256 $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize + ls -lsa $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize else echo "Failure: Artifacts Not Present" exit 1 @@ -89,7 +42,7 @@ jobs: - store_artifacts: path: "artifacts" - destination: "wasm-without-wormhole" + destination: "wasm" publish_to_github: docker: @@ -106,18 +59,13 @@ jobs: name: "Publish Release on GitHub" command: | export TAG_VERSION=$(cat ./artifacts/BERGAMOT_VERSION) - cat ./artifacts/sha256-filesize-without-wormhole ./artifacts/sha256-filesize-with-wormhole >> ./artifacts/sha256-filesize - rm ./artifacts/sha256-filesize-without-wormhole ./artifacts/sha256-filesize-with-wormhole 
./artifacts/BERGAMOT_VERSION + rm ./artifacts/BERGAMOT_VERSION ghr -t ${GHTOKEN} -u ${CIRCLE_PROJECT_USERNAME} -r ${CIRCLE_PROJECT_REPONAME} -c ${CIRCLE_SHA1} -delete ${TAG_VERSION} ./artifacts/ workflows: build: jobs: - - build-with-wormhole: - filters: - tags: - only: /^v.*/ - - build-without-wormhole: + - build: filters: tags: only: /^v.*/ @@ -128,7 +76,6 @@ workflows: branches: ignore: /.*/ requires: - - build-without-wormhole - - build-with-wormhole + - build diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index f06b26357..830924c2c 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -236,7 +236,7 @@ jobs: run: | mkdir -p build-wasm cd build-wasm - emcmake cmake -DCOMPILE_WASM=on -DWORMHOLE=off .. + emcmake cmake -DCOMPILE_WASM=on .. - name: "Compile" @@ -276,7 +276,6 @@ jobs: name: wasm-artefacts if-no-files-found: error path: | - # Without wormhole ${{github.workspace}}/build-wasm/bergamot-translator-worker.js ${{github.workspace}}/build-wasm/bergamot-translator-worker.wasm ${{github.workspace}}/build-wasm/bergamot-translator-worker.js.bak diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 8dbde0fd8..300a50f42 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 8dbde0fd8e690ad8791fb7fc94dba7674ee7c77e +Subproject commit 300a50f4251d978dc197d15bb7b296597b1eb221 diff --git a/CMakeLists.txt b/CMakeLists.txt index dc51acf80..82940de82 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -113,7 +113,6 @@ message(STATUS "Project version: ${PROJECT_VERSION_STRING_FULL}") if(COMPILE_WASM) # See https://github.com/emscripten-core/emscripten/blob/main/src/settings.js - set(WORMHOLE ON CACHE BOOL "Use WASM wormhole in intgemm https://bugzilla.mozilla.org/show_bug.cgi?id=1672160") list(APPEND WASM_COMPILE_FLAGS -O3 # Preserve whitespaces in JS even for release builds; this doesn't increase wasm binary size diff --git a/README.md b/README.md index eae9ef319..05c3c3d25 
100644 --- a/README.md +++ b/README.md @@ -41,12 +41,7 @@ To build a version that translates with higher speeds on Firefox Nightly browser The wasm artifacts (.js and .wasm files) will be available in the build directory ("build-wasm" in this case). - 2. Enable SIMD Wormhole via Wasm instantiation API in generated artifacts - ```bash - bash ../wasm/patch-artifacts-enable-wormhole.sh - ``` - - 3. Patch generated artifacts to import GEMM library from a separate wasm module + 2. Patch generated artifacts to import GEMM library from a separate wasm module ```bash bash ../wasm/patch-artifacts-import-gemm-module.sh ``` @@ -57,7 +52,7 @@ To build a version that runs on all browsers (including Firefox Nightly) but tra ```bash mkdir build-wasm cd build-wasm - emcmake cmake -DCOMPILE_WASM=on -DWORMHOLE=off ../ + emcmake cmake -DCOMPILE_WASM=on ../ emmake make -j2 ``` diff --git a/build-wasm.sh b/build-wasm.sh index ff12013d1..b6d70efb6 100755 --- a/build-wasm.sh +++ b/build-wasm.sh @@ -2,34 +2,6 @@ set -e set -x -# Usage -Usage="Build translator to wasm (with/without wormhole). - -Usage: $(basename "$0") [WORMHOLE] - - where: - WORMHOLE An optional string argument - - when specified on command line, builds wasm artifacts with wormhole - - when not specified (the default behaviour), builds wasm artifacts without wormhole." - -if [ "$#" -gt 1 ]; then - echo "Illegal number of parameters passed" - echo "$Usage" - exit -fi - -WORMHOLE=false - -if [ "$#" -eq 1 ]; then - if [ "$1" = "WORMHOLE" ]; then - WORMHOLE=true - else - echo "Illegal parameter passed" - echo "$Usage" - exit - fi -fi - # Run script from the context of the script-containing directory cd "$(dirname $0)" @@ -66,19 +38,10 @@ if [ ! -d ${BUILD_DIRECTORY} ]; then fi cd ${BUILD_DIRECTORY} -if [ "$WORMHOLE" = true ]; then - emcmake cmake -DCOMPILE_WASM=on ../ -else - emcmake cmake -DCOMPILE_WASM=on -DWORMHOLE=off ../ -fi +emcmake cmake -DCOMPILE_WASM=on ../ emmake make -j2 -# 2. 
Enable SIMD Wormhole via Wasm instantiation API in generated artifacts -if [ "$WORMHOLE" = true ]; then - bash ../wasm/patch-artifacts-enable-wormhole.sh -fi - -# 3. Import GEMM library from a separate wasm module +# 2. Import GEMM library from a separate wasm module bash ../wasm/patch-artifacts-import-gemm-module.sh # The artifacts (.js and .wasm files) will be available in the build directory diff --git a/wasm/README.md b/wasm/README.md index e2d9a447c..0f3f77426 100644 --- a/wasm/README.md +++ b/wasm/README.md @@ -32,18 +32,8 @@ Alternatively refer to the file `test_page/js/worker.js` that demonstrates how t Provide the folder containing the wasm artifacts as the first argument of `start_server.sh` script (`../../build-wasm` in this case). -* Open any of the browsers below - * Firefox Nightly +87: make sure the following prefs are on (about:config) - ``` - dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled = true - javascript.options.wasm_simd = true - javascript.options.wasm_simd_wormhole = true - ``` - - * Chrome Canary +90: start with the following argument - ``` - --js-flags="--experimental-wasm-simd" - ``` +* Open any browser (tested with latest Chrome/Firefox/Safari) + * Browse to the following page: ``` diff --git a/wasm/patch-artifacts-enable-wormhole.sh b/wasm/patch-artifacts-enable-wormhole.sh deleted file mode 100644 index e39988b4e..000000000 --- a/wasm/patch-artifacts-enable-wormhole.sh +++ /dev/null @@ -1,36 +0,0 @@ -#!/bin/bash -usage="Patch wasm artifacts to enable wormhole via APIs that compile and instantiate wasm module. 
- -Usage: $(basename "$0") [WASM_ARTIFACTS_FOLDER] - - where: - WASM_ARTIFACTS_FOLDER Folder containing wasm artifacts - (An optional argument, if unspecified the default is: current folder)" - -if [ "$#" -gt 1 ]; then - echo "Illegal number of parameters passed" - echo "$usage" - exit -fi - -# Parse wasm artifacts folder if provided via script argument or set it to default -WASM_ARTIFACTS_FOLDER=$PWD -if [ "$#" -eq 1 ]; then - if [ ! -e "$1" ]; then - echo "Error: Folder \""$1"\" doesn't exist" - exit - fi - WASM_ARTIFACTS_FOLDER="$1" -fi - -WASM_ARTIFACTS="$WASM_ARTIFACTS_FOLDER/bergamot-translator-worker.js" -if [ ! -e "$WASM_ARTIFACTS" ]; then - echo "Error: Artifact \"$WASM_ARTIFACTS\" doesn't exist" - exit -fi - -echo "Patching \"$WASM_ARTIFACTS\" to enable wormhole via APIs that compile and instantiate wasm module" -sed -i.bak 's/WebAssembly.instantiateStreaming[[:space:]]*([[:space:]]*response[[:space:]]*,[[:space:]]*info[[:space:]]*)/WebAssembly.instantiateStreaming(response, info, {simdWormhole:true})/g' $WASM_ARTIFACTS -sed -i.bak 's/WebAssembly.instantiate[[:space:]]*([[:space:]]*binary[[:space:]]*,[[:space:]]*info[[:space:]]*)/WebAssembly.instantiate(binary, info, {simdWormhole:true})/g' $WASM_ARTIFACTS -sed -i.bak 's/WebAssembly.Module[[:space:]]*([[:space:]]*bytes[[:space:]]*)/WebAssembly.Module(bytes, {simdWormhole:true})/g' $WASM_ARTIFACTS -echo "Done" From 47024ec7a3ed2fe7909c01758ed9ca51625d8703 Mon Sep 17 00:00:00 2001 From: Greg Tatum Date: Wed, 16 Aug 2023 09:35:26 -0500 Subject: [PATCH 411/442] Add more things to the gitignore that are not being ignored (#462) --- .gitignore | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index c796e0656..94b32949c 100644 --- a/.gitignore +++ b/.gitignore @@ -17,7 +17,10 @@ _deps wasm/test_page/node_modules -build-wasm +/build +/build-native +/build-wasm +/emsdk models wasm/module/worker/bergamot-translator-worker.* wasm/module/browsermt-bergamot-translator-*.tgz 
From 62770bb067d2c79bc83c82f2e45063ee73754c39 Mon Sep 17 00:00:00 2001 From: Greg Tatum Date: Wed, 16 Aug 2023 10:14:56 -0500 Subject: [PATCH 412/442] Generate a compile_commands.json by default with cmake (#461) --- CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CMakeLists.txt b/CMakeLists.txt index 82940de82..d8a2d00cb 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -14,6 +14,10 @@ project(bergamot_translator CXX C) set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED ON) +# Generate a compile_commands.json in the build directory. The compile commands allow +# code editors to understand the build process and provide static analysis of the code. +set(CMAKE_EXPORT_COMPILE_COMMANDS ON) + # Note that with CMake MSVC build, the option CMAKE_BUILD_TYPE is automatically derived from the key # 'configurationType' in CMakeSettings.json configurations if(NOT CMAKE_BUILD_TYPE) From db3826266d11e611f9a96ab36a2deb84c4938697 Mon Sep 17 00:00:00 2001 From: Greg Tatum Date: Thu, 17 Aug 2023 01:55:49 -0500 Subject: [PATCH 413/442] Report the wasm size on builds (#460) --- build-wasm.sh | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/build-wasm.sh b/build-wasm.sh index b6d70efb6..443907232 100755 --- a/build-wasm.sh +++ b/build-wasm.sh @@ -44,5 +44,22 @@ emmake make -j2 # 2. Import GEMM library from a separate wasm module bash ../wasm/patch-artifacts-import-gemm-module.sh +set +x +echo "" +echo "Build complete" +echo "" +echo " ./build-wasm/bergamot-translator-worker.js" +echo " ./build-wasm/bergamot-translator-worker.wasm" + +WASM_SIZE=$(wc -c bergamot-translator-worker.wasm | awk '{print $1}') +GZIP_SIZE=$(gzip -c bergamot-translator-worker.wasm | wc -c | xargs) # xargs trims the whitespace + +# Convert it to human readable. 
+WASM_SIZE="$(awk 'BEGIN {printf "%.2f",'$WASM_SIZE'/1048576}')M ($WASM_SIZE bytes)" +GZIP_SIZE="$(awk 'BEGIN {printf "%.2f",'$GZIP_SIZE'/1048576}')M ($GZIP_SIZE bytes)" + +echo " Uncompressed wasm size: $WASM_SIZE" +echo " Compressed wasm size: $GZIP_SIZE" + # The artifacts (.js and .wasm files) will be available in the build directory exit 0 From 0b069acce6076bf6d01d6fab132a332ca26ef076 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 11 Sep 2023 08:20:47 +0100 Subject: [PATCH 414/442] Bump 3rd_party/marian-dev from `300a50f` to `780df27` (#464) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `300a50f` to `780df27`. - [Commits](https://github.com/browsermt/marian-dev/compare/300a50f4251d978dc197d15bb7b296597b1eb221...780df2708e023ce47c0e1e89f2f4a7f3beab5271) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 300a50f42..780df2708 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 300a50f4251d978dc197d15bb7b296597b1eb221 +Subproject commit 780df2708e023ce47c0e1e89f2f4a7f3beab5271 From 321be8ae0486de3af67307c4cb2e005994593597 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 20 Sep 2023 08:10:18 +0100 Subject: [PATCH 415/442] Bump 3rd_party/marian-dev from `780df27` to `11c6ae7` (#466) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `780df27` to `11c6ae7`. 
- [Commits](https://github.com/browsermt/marian-dev/compare/780df2708e023ce47c0e1e89f2f4a7f3beab5271...11c6ae7c46be21ef96ed10c60f28022fa968939f) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 780df2708..11c6ae7c4 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 780df2708e023ce47c0e1e89f2f4a7f3beab5271 +Subproject commit 11c6ae7c46be21ef96ed10c60f28022fa968939f From 73182d4c58000f74a5bf2e2529f2d2344a584625 Mon Sep 17 00:00:00 2001 From: Kenneth Heafield Date: Thu, 7 Dec 2023 10:21:45 -0500 Subject: [PATCH 416/442] Pull in marian-dev with fixed CI and clang --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 11c6ae7c4..831a7362e 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 11c6ae7c46be21ef96ed10c60f28022fa968939f +Subproject commit 831a7362e26a5d43602658d31a2b52571dd16761 From 7774029d0dc239817f009a1dea84e2a195797052 Mon Sep 17 00:00:00 2001 From: Kenneth Heafield Date: Thu, 7 Dec 2023 11:03:33 -0500 Subject: [PATCH 417/442] clang: marian-dev with newer fbgemm --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 831a7362e..ecda59e61 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 831a7362e26a5d43602658d31a2b52571dd16761 +Subproject commit ecda59e6105fb1d7935892c3bacfbc9562b235f1 From 0367ae07a79d2769b861a13e07fc205969c75ce2 Mon Sep 17 00:00:00 2001 From: Kenneth Heafield Date: Thu, 7 Dec 2023 12:10:50 -0500 Subject: [PATCH 418/442] Fix MKL key URL --- 
.github/workflows/native.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/native.yml b/.github/workflows/native.yml index 6c5f56913..505381cbc 100644 --- a/.github/workflows/native.yml +++ b/.github/workflows/native.yml @@ -58,7 +58,7 @@ jobs: sudo apt-get install -y libprotobuf-dev protobuf-compiler libboost-all-dev ccache libunwind-dev libgoogle-perftools-dev - name: Install MKL run: |- - wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB" | sudo apt-key add - + wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB" | sudo apt-key add - sudo sh -c "echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list" sudo apt-get update -o Dir::Etc::sourcelist="/etc/apt/sources.list.d/intel-mkl.list" sudo apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088 From 983331bbc98e5b76b11ee265f4ecb22d69ad035f Mon Sep 17 00:00:00 2001 From: XapaJIaMnu Date: Tue, 19 Dec 2023 18:41:18 +0000 Subject: [PATCH 419/442] More pendantic spm --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index ecda59e61..2be8344fc 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit ecda59e6105fb1d7935892c3bacfbc9562b235f1 +Subproject commit 2be8344fcf2776fb43a7376284067164674cbfaf From 5261614dfd2f4098c32911f4aa7e7759afd13abb Mon Sep 17 00:00:00 2001 From: Kirandevraj Date: Sun, 24 Mar 2024 01:51:46 +0530 Subject: [PATCH 420/442] model url update in example script (#470) --- examples/run-native.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/run-native.sh b/examples/run-native.sh index b02968a23..84e1302f0 100644 --- a/examples/run-native.sh +++ b/examples/run-native.sh @@ -3,8 +3,8 @@ # Obtain an example model from the web. 
mkdir -p models wget --quiet --continue --directory models/ \ - http://data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz -(cd models && tar -xzf ende.student.tiny11.tar.gz) + https://data.statmt.org/bergamot/models/deen/ende.student.tiny11.v2.93821e13b3c511b5.tar.gz +(cd models && tar -xzf ende.student.tiny11.v2.93821e13b3c511b5.tar.gz) # Patch the config-files generated from marian for use in bergamot. python3 bergamot-translator-tests/tools/patch-marian-for-bergamot.py \ From 34acd8d982d33fd38378093108b1a48ffa542c3c Mon Sep 17 00:00:00 2001 From: Yo'av Moshe Date: Sat, 20 Apr 2024 00:17:45 +0200 Subject: [PATCH 421/442] fix downloading of models in the python binding (#472) models come in files named like `csen.student.base.v1.cd5418ba6a412fc7.tar.gz`, but the directory they create when extracted are named like `csen.student.base`. we therefore need to remove not just the extension but everything following and including the 3rd period --- bindings/python/repository.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bindings/python/repository.py b/bindings/python/repository.py index 9667c7242..9ea3ac023 100644 --- a/bindings/python/repository.py +++ b/bindings/python/repository.py @@ -180,7 +180,7 @@ def safe_extract(tar, path=".", members=None, *, numeric_owner=False): def _archive_name_without_extension(self, url: URL): o = urlparse(url) fname = os.path.basename(o.path) # something tar.gz. 
- fname_without_extension = fname.replace(".tar.gz", "") + fname_without_extension = ".".join(fname.split(".")[:3]) return fname_without_extension From 9271618ebbdc5d21ac4dc4df9e72beb7ce644774 Mon Sep 17 00:00:00 2001 From: XapaJIaMnu Date: Sun, 12 May 2024 09:51:02 +0100 Subject: [PATCH 422/442] Update submodule --- 3rd_party/marian-dev | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/3rd_party/marian-dev b/3rd_party/marian-dev index 2be8344fc..2781d735d 160000 --- a/3rd_party/marian-dev +++ b/3rd_party/marian-dev @@ -1 +1 @@ -Subproject commit 2be8344fcf2776fb43a7376284067164674cbfaf +Subproject commit 2781d735d4a10dca876d61be587afdab2726293c From bbb844243c028bf88f8fc4def142d4389dda2354 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 19 Sep 2024 16:46:44 -0500 Subject: [PATCH 423/442] Move inference-engine git submodules to the repository root The fork of Bergamot, now located in the `inference-engine` directory, had its own set of defined submodules. These need to be moved to the repository root in order to function correctly within a mono repo.
--- .gitmodules | 21 +++++++++++++++++++++ inference-engine/.gitmodules | 12 ------------ inference-engine/CMakeLists.txt | 5 ++++- 3 files changed, 25 insertions(+), 13 deletions(-) delete mode 100644 inference-engine/.gitmodules diff --git a/.gitmodules b/.gitmodules index f1813a444..e6abab367 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,18 +1,39 @@ [submodule "fast_align"] path = 3rd_party/fast_align url = https://github.com/clab/fast_align + [submodule "extract-lex"] path = 3rd_party/extract-lex url = https://github.com/marian-nmt/extract-lex + +[submodule "inference-engine/bergamot-translator-tests"] + path = inference-engine/bergamot-translator-tests + url = https://github.com/browsermt/bergamot-translator-tests + +[submodule "inference-engine/3rd_party/pybind11"] + path = inference-engine/3rd_party/pybind11 + url = https://github.com/pybind/pybind11.git + +[submodule "inference-engine/3rd_party/marian-dev"] + path = inference-engine/3rd_party/marian-dev + url = https://github.com/browsermt/marian-dev + +[submodule "inference-engine/3rd_party/ssplit-cpp"] + path = inference-engine/3rd_party/ssplit-cpp + url = https://github.com/browsermt/ssplit-cpp + [submodule "3rd_party/kenlm"] path = 3rd_party/kenlm url = https://github.com/kpu/kenlm + [submodule "3rd_party/browsermt-marian-dev"] path = 3rd_party/browsermt-marian-dev url = https://github.com/browsermt/marian-dev + [submodule "3rd_party/marian-dev"] path = 3rd_party/marian-dev url = https://github.com/marian-nmt/marian-dev + [submodule "3rd_party/preprocess"] path = 3rd_party/preprocess url = https://github.com/kpu/preprocess.git diff --git a/inference-engine/.gitmodules b/inference-engine/.gitmodules deleted file mode 100644 index cfedde289..000000000 --- a/inference-engine/.gitmodules +++ /dev/null @@ -1,12 +0,0 @@ -[submodule "3rd_party/marian-dev"] - path = 3rd_party/marian-dev - url = https://github.com/browsermt/marian-dev -[submodule "3rd_party/ssplit-cpp"] - path = 3rd_party/ssplit-cpp - url = 
https://github.com/browsermt/ssplit-cpp -[submodule "bergamot-translator-tests"] - path = bergamot-translator-tests - url = https://github.com/browsermt/bergamot-translator-tests -[submodule "3rd_party/pybind11"] - path = 3rd_party/pybind11 - url = https://github.com/pybind/pybind11.git diff --git a/inference-engine/CMakeLists.txt b/inference-engine/CMakeLists.txt index d8a2d00cb..da01c6048 100644 --- a/inference-engine/CMakeLists.txt +++ b/inference-engine/CMakeLists.txt @@ -11,6 +11,9 @@ endif() project(bergamot_translator CXX C) +# Retrieve the parent-directory path of PROJECT_SOURCE_DIR and assign that to REPOSITORY_ROOT_DIR. +cmake_path(GET PROJECT_SOURCE_DIR PARENT_PATH REPOSITORY_ROOT_DIR) + set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED ON) @@ -96,7 +99,7 @@ endif() # Documentation: https://cliutils.gitlab.io/modern-cmake/chapters/projects/submodule.html # Ensures the submodules are set correctly during a build. find_package(Git QUIET) -if(GIT_FOUND AND EXISTS "${PROJECT_SOURCE_DIR}/.git") +if(GIT_FOUND AND EXISTS "${REPOSITORY_ROOT_DIR}/.git") # Update submodules as needed option(GIT_SUBMODULE "Check submodules during build" ON) if(GIT_SUBMODULE) From cad39633ef7f18eec8999b9bac1c80afc5089836 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 19 Sep 2024 16:55:12 -0500 Subject: [PATCH 424/442] Rename inference-engine/3rd_party/marian-nmt --- .gitmodules | 26 ++++++++++++++++--- inference-engine/3rd_party/CMakeLists.txt | 14 +++++----- .../{marian-dev => browsermt-marian-dev} | 0 .../patches/01-marian-fstream-for-macos.patch | 6 ++--- inference-engine/src/tests/CMakeLists.txt | 2 +- inference-engine/src/translator/logging.h | 2 +- inference-engine/src/translator/parser.h | 2 +- 7 files changed, 35 insertions(+), 17 deletions(-) rename inference-engine/3rd_party/{marian-dev => browsermt-marian-dev} (100%) diff --git a/.gitmodules b/.gitmodules index e6abab367..5d1bbf716 100644 --- a/.gitmodules +++ b/.gitmodules @@ -10,14 +10,32 @@ path = 
inference-engine/bergamot-translator-tests url = https://github.com/browsermt/bergamot-translator-tests +# This is the same dependency and repository as `3rd_party/browsermt-marian-dev` below. +# +# When forking `inference-engine` into to this project, I made an earnest attempt to utilize the preexisting +# `3rd_party/browsermt-marian-dev` submodule within `inference-engine`. Unfortunately, I ran into several roadblocks: +# +# 1) I cannot directly add `3rd_party/browsermt-marian-dev` as a cmake subdirectory because cmake is aware that +# this path is not a subdirectory of the `inference-engine` project root. +# +# 2) Symbolic links do not appear to work for git submodule direcotires the way that they do for regular directories. +# Even if the symbolic link had linked correctly, it may have still failed due to the considerations of 1). +# +# 3) I tried using cmake to copy the files from `3rd_party/browsermt-marian-dev` into `inference-engine/3rd_party/browsermt-marian-dev` +# at build time, which would ensure that there is no duplicate reference to the URL in this file, however the upstream dependency itself +# has hard-coded cmake expectations that the `.git` directory is only one level up, which appears to work correctly for the way git submodules +# are configured, but does not work if the files are copied over to a regular directory deeper in the repository. +# +# It may be possible to remove `3rd_party/browsermt-marian-dev` to instead use `inference-engine/3rd-party/browsermt-marian-dev` everywhere +# within this repository, but I will leave that for a future commit if there is a need to do so. 
+[submodule "inference-engine/3rd_party/browsermt-marian-dev"] + path = inference-engine/3rd_party/browsermt-marian-dev + url = https://github.com/browsermt/marian-dev + [submodule "inference-engine/3rd_party/pybind11"] path = inference-engine/3rd_party/pybind11 url = https://github.com/pybind/pybind11.git -[submodule "inference-engine/3rd_party/marian-dev"] - path = inference-engine/3rd_party/marian-dev - url = https://github.com/browsermt/marian-dev - [submodule "inference-engine/3rd_party/ssplit-cpp"] path = inference-engine/3rd_party/ssplit-cpp url = https://github.com/browsermt/ssplit-cpp diff --git a/inference-engine/3rd_party/CMakeLists.txt b/inference-engine/3rd_party/CMakeLists.txt index eac898eb9..0185d7673 100644 --- a/inference-engine/3rd_party/CMakeLists.txt +++ b/inference-engine/3rd_party/CMakeLists.txt @@ -1,6 +1,6 @@ -# marian-dev is tested elsewhere in both paths, turning off here. +# browsermt-marian-dev is tested elsewhere in both paths, turning off here. set(COMPILE_TESTS OFF) -add_subdirectory(marian-dev EXCLUDE_FROM_ALL) +add_subdirectory(browsermt-marian-dev EXCLUDE_FROM_ALL) if(COMPILE_WASM) # This is a bad way of adding compilation flags. Will be improved soon. @@ -13,21 +13,21 @@ add_subdirectory(ssplit-cpp EXCLUDE_FROM_ALL) # Add include directories for 3rd party targets to be able to use it anywhere in the # project without explicitly specifying their include directories. Once they # fixe this problem, it can be removed. 
-get_property(INCDIRS DIRECTORY marian-dev/src PROPERTY INCLUDE_DIRECTORIES) +get_property(INCDIRS DIRECTORY browsermt-marian-dev/src PROPERTY INCLUDE_DIRECTORIES) target_include_directories(marian PUBLIC ${INCDIRS}) get_property(INCLUDE_DIRECTORIES DIRECTORY ssplit-cpp/src PROPERTY INCLUDE_DIRECTORIES) target_include_directories(ssplit PUBLIC ${INCLUDE_DIRECTORIES}) -get_property(COMPILE_DEFINITIONS DIRECTORY marian-dev PROPERTY COMPILE_DEFINITIONS) +get_property(COMPILE_DEFINITIONS DIRECTORY browsermt-marian-dev PROPERTY COMPILE_DEFINITIONS) target_compile_definitions(marian PUBLIC ${COMPILE_DEFINITIONS}) -get_property(COMPILE_OPTIONS DIRECTORY marian-dev PROPERTY COMPILE_OPTIONS) +get_property(COMPILE_OPTIONS DIRECTORY browsermt-marian-dev PROPERTY COMPILE_OPTIONS) target_compile_options(marian PUBLIC ${COMPILE_OPTIONS}) # Compilation flags -get_directory_property(CMAKE_C_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_C_FLAGS) -get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX_FLAGS) +get_directory_property(CMAKE_C_FLAGS DIRECTORY browsermt-marian-dev DEFINITION CMAKE_C_FLAGS) +get_directory_property(CMAKE_CXX_FLAGS DIRECTORY browsermt-marian-dev DEFINITION CMAKE_CXX_FLAGS) set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE) set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) diff --git a/inference-engine/3rd_party/marian-dev b/inference-engine/3rd_party/browsermt-marian-dev similarity index 100% rename from inference-engine/3rd_party/marian-dev rename to inference-engine/3rd_party/browsermt-marian-dev diff --git a/inference-engine/patches/01-marian-fstream-for-macos.patch b/inference-engine/patches/01-marian-fstream-for-macos.patch index 5219227d9..6b521ba7e 100644 --- a/inference-engine/patches/01-marian-fstream-for-macos.patch +++ b/inference-engine/patches/01-marian-fstream-for-macos.patch @@ -1,7 +1,7 @@ -diff --git a/3rd_party/marian-dev/src/3rd_party/zstr/strict_fstream.hpp 
b/3rd_party/marian-dev/src/3rd_party/zstr/strict_fstream.hpp +diff --git a/3rd_party/browsermt-marian-dev/src/3rd_party/zstr/strict_fstream.hpp b/3rd_party/browsermt-marian-dev/src/3rd_party/zstr/strict_fstream.hpp index 7b1173931df977e69021f3995fa064a492f89d38..948e91eaf99b6b29ce41cf793fba6717f3b5f5b5 100644 ---- a/3rd_party/marian-dev/src/3rd_party/zstr/strict_fstream.hpp -+++ b/3rd_party/marian-dev/src/3rd_party/zstr/strict_fstream.hpp +--- a/3rd_party/browsermt-marian-dev/src/3rd_party/zstr/strict_fstream.hpp ++++ b/3rd_party/browsermt-marian-dev/src/3rd_party/zstr/strict_fstream.hpp @@ -27,7 +27,7 @@ static std::string strerror() { buff = "Unknown error"; diff --git a/inference-engine/src/tests/CMakeLists.txt b/inference-engine/src/tests/CMakeLists.txt index 86fe00236..cd0e4c777 100644 --- a/inference-engine/src/tests/CMakeLists.txt +++ b/inference-engine/src/tests/CMakeLists.txt @@ -1,7 +1,7 @@ # Unit tests # Include Catch explicitly from marian. -set(CATCH_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/3rd_party/marian-dev/3rd-party) +set(CATCH_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/3rd_party/browsermt-marian-dev/3rd-party) add_library(Catch INTERFACE) target_include_directories(Catch INTERFACE ${CATCH_INCLUDE_DIR}) diff --git a/inference-engine/src/translator/logging.h b/inference-engine/src/translator/logging.h index 2256d7889..704492283 100644 --- a/inference-engine/src/translator/logging.h +++ b/inference-engine/src/translator/logging.h @@ -1,4 +1,4 @@ -#include "3rd_party/marian-dev/src/3rd_party/spdlog/spdlog.h" +#include "3rd_party/browsermt-marian-dev/src/3rd_party/spdlog/spdlog.h" #include "common/logging.h" namespace marian { diff --git a/inference-engine/src/translator/parser.h b/inference-engine/src/translator/parser.h index 793582dd0..8f98e2c73 100644 --- a/inference-engine/src/translator/parser.h +++ b/inference-engine/src/translator/parser.h @@ -4,7 +4,7 @@ #include #include -#include "3rd_party/marian-dev/src/3rd_party/CLI/CLI.hpp" +#include 
"3rd_party/browsermt-marian-dev/src/3rd_party/CLI/CLI.hpp" #include "3rd_party/yaml-cpp/yaml.h" #include "common/build_info.h" #include "common/config_parser.h" From 3da08c956a0f160870e33015d6526f0f7858c8ac Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:04:29 -0500 Subject: [PATCH 425/442] Remove bergamot-translator-tests dependency --- .gitmodules | 22 ---------------------- inference-engine/bergamot-translator-tests | 1 - 2 files changed, 23 deletions(-) delete mode 160000 inference-engine/bergamot-translator-tests diff --git a/.gitmodules b/.gitmodules index 5d1bbf716..ce6f3230b 100644 --- a/.gitmodules +++ b/.gitmodules @@ -6,28 +6,6 @@ path = 3rd_party/extract-lex url = https://github.com/marian-nmt/extract-lex -[submodule "inference-engine/bergamot-translator-tests"] - path = inference-engine/bergamot-translator-tests - url = https://github.com/browsermt/bergamot-translator-tests - -# This is the same dependency and repository as `3rd_party/browsermt-marian-dev` below. -# -# When forking `inference-engine` into to this project, I made an earnest attempt to utilize the preexisting -# `3rd_party/browsermt-marian-dev` submodule within `inference-engine`. Unfortunately, I ran into several roadblocks: -# -# 1) I cannot directly add `3rd_party/browsermt-marian-dev` as a cmake subdirectory because cmake is aware that -# this path is not a subdirectory of the `inference-engine` project root. -# -# 2) Symbolic links do not appear to work for git submodule direcotires the way that they do for regular directories. -# Even if the symbolic link had linked correctly, it may have still failed due to the considerations of 1). 
-# -# 3) I tried using cmake to copy the files from `3rd_party/browsermt-marian-dev` into `inference-engine/3rd_party/browsermt-marian-dev` -# at build time, which would ensure that there is no duplicate reference to the URL in this file, however the upstream dependency itself -# has hard-coded cmake expectations that the `.git` directory is only one level up, which appears to work correctly for the way git submodules -# are configured, but does not work if the files are copied over to a regular directory deeper in the repository. -# -# It may be possible to remove `3rd_party/browsermt-marian-dev` to instead use `inference-engine/3rd-party/browsermt-marian-dev` everywhere -# within this repository, but I will leave that for a future commit if there is a need to do so. [submodule "inference-engine/3rd_party/browsermt-marian-dev"] path = inference-engine/3rd_party/browsermt-marian-dev url = https://github.com/browsermt/marian-dev diff --git a/inference-engine/bergamot-translator-tests b/inference-engine/bergamot-translator-tests deleted file mode 160000 index a04432d79..000000000 --- a/inference-engine/bergamot-translator-tests +++ /dev/null @@ -1 +0,0 @@ -Subproject commit a04432d7921bfa1dd62bc2e5cdca46b226f256de From 37d0113997cc09b2f0035aa4cb8258ba37dd0c2f Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Fri, 20 Sep 2024 14:38:15 -0500 Subject: [PATCH 426/442] Remove .circleci and .github files --- inference-engine/.circleci/config.yml | 81 --- inference-engine/.github/dependabot.yml | 9 - inference-engine/.github/workflows/arm.yml | 139 ------ inference-engine/.github/workflows/build.yml | 466 ------------------ .../.github/workflows/coding-styles.yml | 42 -- inference-engine/.github/workflows/native.yml | 243 --------- .../.github/workflows/windows.yml | 128 ----- 7 files changed, 1108 deletions(-) delete mode 100644 inference-engine/.circleci/config.yml delete mode 100644 inference-engine/.github/dependabot.yml delete mode 100644 
inference-engine/.github/workflows/arm.yml delete mode 100644 inference-engine/.github/workflows/build.yml delete mode 100644 inference-engine/.github/workflows/coding-styles.yml delete mode 100644 inference-engine/.github/workflows/native.yml delete mode 100644 inference-engine/.github/workflows/windows.yml diff --git a/inference-engine/.circleci/config.yml b/inference-engine/.circleci/config.yml deleted file mode 100644 index 52d58fc09..000000000 --- a/inference-engine/.circleci/config.yml +++ /dev/null @@ -1,81 +0,0 @@ -version: 2.1 -jobs: - build: - docker: - - image: 'emscripten/emsdk:3.1.8' - resource_class: medium - - working_directory: ~/checkout - - steps: - - checkout - - - run: - name: Build WASM - command: | - bash build-wasm.sh - - - run: - name: Check artifacts - working_directory: build-wasm - command: | - ARTIFACT_BASE="bergamot-translator-worker" - ARTIFACT_FINAL=$ARTIFACT_BASE - - if [[ -f "$ARTIFACT_BASE.js" && -f "$ARTIFACT_BASE.wasm" ]]; then - echo "Artifacts Successfully Generated" - mkdir ../artifacts - cp $ARTIFACT_BASE.wasm ../artifacts/$ARTIFACT_FINAL.wasm - cp $ARTIFACT_BASE.js ../artifacts/$ARTIFACT_FINAL.js - cd ../artifacts - shasum -a 256 $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize - ls -lsa $ARTIFACT_FINAL.wasm $ARTIFACT_FINAL.js >> sha256-filesize - else - echo "Failure: Artifacts Not Present" - exit 1 - fi - - - persist_to_workspace: - root: . 
- paths: - - artifacts/* - - - store_artifacts: - path: "artifacts" - destination: "wasm" - - publish_to_github: - docker: - - image: cibuilds/github:0.10 - steps: - - attach_workspace: - # Must be absolute path or relative path from working_directory - at: ./ - - when: - condition: - equal: [ 'https://github.com/mozilla/bergamot-translator', << pipeline.project.git_url >> ] - steps: - - run: - name: "Publish Release on GitHub" - command: | - export TAG_VERSION=$(cat ./artifacts/BERGAMOT_VERSION) - rm ./artifacts/BERGAMOT_VERSION - ghr -t ${GHTOKEN} -u ${CIRCLE_PROJECT_USERNAME} -r ${CIRCLE_PROJECT_REPONAME} -c ${CIRCLE_SHA1} -delete ${TAG_VERSION} ./artifacts/ - -workflows: - build: - jobs: - - build: - filters: - tags: - only: /^v.*/ - - publish_to_github: - filters: - tags: - only: /^v.*/ - branches: - ignore: /.*/ - requires: - - build - - diff --git a/inference-engine/.github/dependabot.yml b/inference-engine/.github/dependabot.yml deleted file mode 100644 index bbb39076f..000000000 --- a/inference-engine/.github/dependabot.yml +++ /dev/null @@ -1,9 +0,0 @@ -version: 2 - -updates: - # Maintain dependencies for Git Submodules - - package-ecosystem: "gitsubmodule" - directory: "/" - schedule: - interval: "daily" - diff --git a/inference-engine/.github/workflows/arm.yml b/inference-engine/.github/workflows/arm.yml deleted file mode 100644 index 2ee14548d..000000000 --- a/inference-engine/.github/workflows/arm.yml +++ /dev/null @@ -1,139 +0,0 @@ -name: ARM -'on': - push: - branches: - - main - - ci-sandbox - pull_request: - branches: - - '**' -env: - ccache_basedir: ${{ github.workspace }} - ccache_dir: "${{ github.workspace }}/.ccache" - ccache_compilercheck: content - ccache_compress: 'true' - ccache_compresslevel: 9 - ccache_maxsize: 200M - ccache_cmake: -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache - ndk: "${{ github.workspace }}/android-ndk-r23b" - abi: "arm64-v8a" - minsdk_version : 28 - android_platform: 28 - -jobs: - ubuntu: - 
name: "arm-v8a cross-compile via Android NDK" - runs-on: ubuntu-latest - - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - - name: Install prerequisites - run: | - wget -c --quiet https://dl.google.com/android/repository/android-ndk-r23b-linux.zip - unzip -qq android-ndk-r23b-linux.zip - sudo apt-get -y install ccache cmake - - - name: Generate ccache_vars for ccache based on machine - shell: bash - id: ccache_vars - run: |- - echo "::set-output name=hash::$(echo ${{ env.ccache_compilercheck }})" - echo "::set-output name=timestamp::$(date '+%Y-%m-%dT%H.%M.%S')" - - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: ${{ env.ccache_dir }} - key: ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - restore-keys: |- - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }} - ccache-${{ matrix.identifier }} - - - name: ccache environment setup - run: |- - echo "CCACHE_COMPILER_CHECK=${{ env.ccache_compilercheck }}" >> $GITHUB_ENV - echo "CCACHE_BASEDIR=${{ env.ccache_basedir }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESS=${{ env.ccache_compress }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }}" >> $GITHUB_ENV - echo "CCACHE_DIR=${{ env.ccache_dir }}" >> $GITHUB_ENV - echo "CCACHE_MAXSIZE=${{ env.ccache_maxsize }}" >> $GITHUB_ENV - - - name: ccache prolog - run: |- - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - - name: Generate buildfiles for bergamot-translator on android via cmake - run: |- - mkdir -p build - cd build - NDK=${{ env.ndk }} - ABI=${{ env.abi }} - MINSDK_VERSION=${{ env.minsdk_version }} - ANDROID_PLATFORM=android-${{ env.android_platform }} - OTHER_ANDROID_ARGS=( - -DANDROID_ARM_NEON=TRUE - ) - OTHER_MARIAN_ARGS=( - -DCOMPILE_CUDA=off - 
-DCOMPILE_CPU=on - -DCMAKE_HAVE_THREADS_LIBRARY=1 - -DCMAKE_USE_WIN32_THREADS_INIT=0 - -DCMAKE_USE_PTHREADS_INIT=1 - -DTHREADS_PREFER_PTHREAD_FLAG=ON - -DBUILD_ARCH=armv8-a - # -DCOMPILE_WITHOUT_EXCEPTIONS=on # Apparently this can reduce the binary size, let's see. - -DSSPLIT_USE_INTERNAL_PCRE2=ON - ) - # Additionally list variables finally configured. - cmake -L \ - -DCMAKE_BUILD_TYPE=Release \ - -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_TOOLCHAIN=clang \ - -DANDROID_ABI=$ABI \ - -DANDROID_PLATFORM=$ANDROID_PLATFORM \ - -DANDROID_NATIVE_API_LEVEL=$MINSDKVERSION \ - -DANDROID_TOOLCHAIN_NAME=arm-linux-androideabi-4.8 \ - -DANDROID_STL=c++_static \ - -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache \ - "${OTHER_ANDROID_ARGS[@]}" "${OTHER_MARIAN_ARGS[@]}" \ - .. - - - - name : Build bergamot-translator for android - working-directory: build - run: |- - make -j2 - - - name: ccache epilog - run: 'ccache -s # Print current cache stats' - - - uses: actions/upload-artifact@v2 - with: - path: ${{github.workspace}}/build/app/bergamot - - - # Disable release for now. 
- # release: - # name: Release Latest Build - # runs-on: ubuntu-latest - # needs: [ubuntu] - # if: github.ref == 'refs/heads/master' - # steps: - # - name: Download artifacts - # uses: actions/download-artifact@v2 - # - # - name: Update GitHub prerelease - # uses: marvinpinto/action-automatic-releases@latest - # with: - # repo_token: ${{ secrets.GITHUB_TOKEN }} - # automatic_release_tag: latest - # prerelease: true - # title: "Latest Build" - # files: | - # artifact/marian-decoder diff --git a/inference-engine/.github/workflows/build.yml b/inference-engine/.github/workflows/build.yml deleted file mode 100644 index 830924c2c..000000000 --- a/inference-engine/.github/workflows/build.yml +++ /dev/null @@ -1,466 +0,0 @@ -name: "Build" -'on': - push: - branches: - - main - - ci-sandbox - tags: - - "v*.*.*" - pull_request: - branches: - - '**' -env: - qt_version: "6.2.1" # only used by build-macos - emsdk_version: 3.1.8 # For use in emscripten build - ccache_basedir: ${{ github.workspace }} - ccache_dir: "${{ github.workspace }}/.ccache" - ccache_compilercheck: content - ccache_compress: 'true' - ccache_compresslevel: 9 - ccache_maxsize: 200M - ccache_cmake: -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache - -jobs: - build-wheels: - strategy: - matrix: - os: [ubuntu-latest, macos-latest] - fail-fast: false - - name: "cibuildwheel / ${{ matrix.os }}" - runs-on: ${{ matrix.os }} - - steps: - - uses: actions/checkout@v2 - with: - submodules: recursive - - - name: Generate ccache_vars for ccache based on machine - shell: bash - id: ccache_vars - run: |- - echo "::set-output name=hash::$(echo ${{ env.ccache_compilercheck }})" - echo "::set-output name=timestamp::$(date '+%Y-%m-%dT%H.%M.%S')" - - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: ${{ env.ccache_dir }} - key: ccache-cibuildwheel-${{ matrix.os }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - 
restore-keys: |- - ccache-cibuildwheel-${{ matrix.os }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-cibuildwheel-${{ matrix.os }}-${{ steps.ccache_vars.outputs.hash }} - ccache-cibuildwheel-${{ matrix.os }} - - - name: ccache environment setup - run: |- - mkdir -p ${{ env.ccache_dir }} - - - name: Inject local version identifier for non tag builds - if: ${{ !startsWith(github.ref, 'refs/tags/v') }} - run: |- - echo "PYTHON_LOCAL_VERSION_IDENTIFIER=$(git rev-parse --short HEAD)" >> $GITHUB_ENV - - - name: Apply MacOS patch - if: ${{ startsWith(runner.os, 'mac') }} - run: | - patch -p1 < patches/01-marian-fstream-for-macos.patch - - - name: Build wheels - uses: pypa/cibuildwheel@v2.6.1 - # to supply options, put them in 'env', like: - env: - CIBW_ENVIRONMENT_LINUX: - BUILD_ARCH=core-avx-i - USE_CCACHE=1 - CCACHE_COMPILER_CHECK=${{ env.ccache_compilercheck }} - CCACHE_COMPRESS=${{ env.ccache_compress }} - CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }} - CCACHE_MAXSIZE=${{ env.ccache_maxsize }} - PYTHON_LOCAL_VERSION_IDENTIFIER=${{ env.PYTHON_LOCAL_VERSION_IDENTIFIER }} - CCACHE_DIR=/host/${{ env.ccache_dir }} - CCACHE_BASEDIR=/host/${{ env.ccache_basedir }} - - CIBW_ENVIRONMENT_MACOS: - BUILD_ARCH=core-avx-i - USE_CCACHE=1 - CCACHE_COMPILER_CHECK=${{ env.ccache_compilercheck }} - CCACHE_COMPRESS=${{ env.ccache_compress }} - CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }} - CCACHE_MAXSIZE=${{ env.ccache_maxsize }} - PYTHON_LOCAL_VERSION_IDENTIFIER=${{ env.PYTHON_LOCAL_VERSION_IDENTIFIER }} - CCACHE_DIR=${{ env.ccache_dir }} - CCACHE_BASEDIR=${{ env.ccache_basedir }} - MACOSX_DEPLOYMENT_TARGET=10.9 - - CIBW_BEFORE_BUILD_LINUX: | - yum install -y ccache - - # Install Intel MKL. 
- yum-config-manager -y --add-repo https://yum.repos.intel.com/mkl/setup/intel-mkl.repo - yum install -y intel-mkl - - chmod -R a+rwx /host/${{ env.ccache_dir }} - - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - CIBW_BEFORE_BUILD_MACOS: | - brew install openblas protobuf ccache boost pybind11 - chmod -R a+rwx ${{ env.ccache_dir }} - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - CIBW_BUILD: "cp{36,37,38,39,310}-*manylinux_x86_64 cp{36,37,38,39,310}-macosx_x86_64" - - CIBW_BEFORE_TEST: | - ccache -s # Print current ccache stats - - CIBW_TEST_COMMAND: | - # The wheels are installed automatically and available. - - # Fetch models from translateLocally repository. - python3 -m bergamot download -m en-de-tiny - python3 -m bergamot download -m de-en-tiny - python3 -m bergamot ls - - # Fetch models from opus repository. - python3 -m bergamot download -m eng-fin-tiny -r opus - python3 -m bergamot ls -r opus - - # Run the sample python script shipped with module - python3 -m bergamot translate --model en-de-tiny <<< "Hello World" - python3 -m bergamot translate --model en-de-tiny de-en-tiny <<< "Hello World" - python3 -m bergamot translate --model eng-fin-tiny --repository opus <<< "Hello World" - - - - uses: actions/upload-artifact@v2 - with: - name: wheels - path: ./wheelhouse/*.whl - - upload-wheels: - name: "Upload wheels to PyPI" - runs-on: ubuntu-latest - if: ${{ startsWith(github.ref, 'refs/tags/v') }} - needs: [build-wheels] - steps: - - name: Download artifacts - uses: actions/download-artifact@v2 - with: - name: wheels - - - name: Publish wheels to PyPI - env: - TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} - TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} - run: | - python3 -m pip install twine - twine upload *.whl - - - build-wasm: - name: "emscripten" - runs-on: ubuntu-latest - steps: - - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - - name: Set ccache environment for emcc - 
run: | - # We are hardcoding this to mtime instead of env pickup. Rest use content. - echo "CCACHE_COMPILER_CHECK=mtime" >> $GITHUB_ENV - - echo "CCACHE_BASEDIR=${{ env.ccache_basedir }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESS=${{ env.ccache_compress }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }}" >> $GITHUB_ENV - echo "CCACHE_DIR=${{ env.ccache_dir }}" >> $GITHUB_ENV - echo "CCACHE_MAXSIZE=${{ env.ccache_maxsize }}" >> $GITHUB_ENV - # https://emscripten.org/docs/compiling/Building-Projects.html#using-a-compiler-wrapper - echo "EM_COMPILER_WRAPPER=ccache" >> $GITHUB_ENV - - # This need to be run before setup, so ccache build caching doesn't complain. - - name: Obtain emsdk sources - run: | - git clone --depth 1 https://github.com/emscripten-core/emsdk.git - - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: | - ${{ env.ccache_dir }} - ${{ github.workspace }}/emsdk/ccache/git-emscripten_64bit/ - key: ccache-${{ github.job }}-${{ env.emsdk_version }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - restore-keys: |- - ccache-${{ github.job }}-${{ env.emsdk_version }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-${{ github.job }}-${{ env.emsdk_version }}-${{ steps.ccache_vars.outputs.hash }} - ccache-${{ github.job }}-${{ env.emsdk_version }} - - - name: Setup Emscripten toolchain - run: | - (cd emsdk && ./emsdk install ${{ env.emsdk_version }} ccache-git-emscripten-64bit) - (cd emsdk && ./emsdk activate ${{ env.emsdk_version }} ccache-git-emscripten-64bit) - # mtime of this file is checked by ccache, we set it to avoid cache misses. - touch -m -d '1 Jan 2021 12:00' emsdk/.emscripten - - # These needs to be done in the activated shell. 
- eval $(./emsdk/emsdk construct_env \ - | sed 's/export PATH=\(.*\);/echo \1 >> $GITHUB_PATH;/' \ - | sed 's/export \(.*\);/echo \1 >> $GITHUB_ENV;/' ); - - # This looks more permanent than version pinned, so keeping temporarily to avoid failures. - echo "${{ github.workspace }}/emsdk/ccache/git-emscripten_64bit/bin" >> $GITHUB_PATH - - - name: Generate ccache_vars for ccache based on machine - shell: bash - id: ccache_vars - run: |- - echo "::set-output name=hash::$(echo ${{ env.ccache_compilercheck }})" - echo "::set-output name=timestamp::$(date '+%Y-%m-%dT%H.%M.%S')" - - - name: Verify Emscripten setup - run: | - emcc --version - emcmake cmake --version - emmake make --version - - - name: ccache prolog - run: |- - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - - name: "Configure builds" - run: | - mkdir -p build-wasm - cd build-wasm - emcmake cmake -DCOMPILE_WASM=on .. - - - - name: "Compile" - working-directory: build-wasm - run: | - emmake make -j2 - - - name: ccache epilog - run: | - ccache -s # Print current cache stats - - - name: Import GEMM library from a separate wasm module - working-directory: build-wasm - run: bash ../wasm/patch-artifacts-import-gemm-module.sh - - # Setup nodejs-18, as nodejs-14 provided by emsdk fails when running - # and newer version of node allows us to use fetch(). - - name: Setup nodejs - uses: actions/setup-node@v3 - with: - node-version: 18 - - - name: Test run - working-directory: wasm - run: | - cp ../build-wasm/bergamot-translator-worker.{js,wasm} ./ - npm install jsdom - - # --unhandled-rejections make the script exit with a non-zero code (at least on node-14). - # So leaving this here. - node --unhandled-rejections=strict node-test.js - - # Upload both together. 
- - name: Upload wasm artifact - uses: actions/upload-artifact@v2 - with: - name: wasm-artefacts - if-no-files-found: error - path: | - ${{github.workspace}}/build-wasm/bergamot-translator-worker.js - ${{github.workspace}}/build-wasm/bergamot-translator-worker.wasm - ${{github.workspace}}/build-wasm/bergamot-translator-worker.js.bak - - - upload-wasm: - name: "Upload node package to NPM" - runs-on: ubuntu-latest - if: ${{ startsWith(github.ref, 'refs/tags/v') }} - needs: [build-wasm] - steps: - - name: Download artifacts - uses: actions/download-artifact@v2 - with: - name: wasm-artefacts - path: wasm/module/worker - - - uses: actions/setup-node@v3 - with: - node-version: '18.x' - registry-url: 'https://registry.npmjs.org' - - run: npm ci - - run: npm publish - env: - NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }} - - - - # Try to upload a release using https://github.com/marvinpinto/actions/issues/177#issuecomment-917605585 as a model - release-latest: - name: Release Latest Build - runs-on: ubuntu-latest - needs: [build-wheels, build-wasm] - if: github.ref == 'refs/heads/main' - steps: - - name: Download artifacts - uses: actions/download-artifact@v2 - - # Leave the below be, it will be useful. - - name: List downloaded assets - run: | - find ./ - - - name: Update GitHub prerelease - uses: marvinpinto/action-automatic-releases@latest - with: - repo_token: ${{ secrets.GITHUB_TOKEN }} - automatic_release_tag: latest - prerelease: true - title: "Latest Build" - files: | - wheels/*.whl - wasm-artefacts/bergamot-translator-worker.js - wasm-artefacts/bergamot-translator-worker.wasm - - release-version: - name: Release version - runs-on: ubuntu-latest - needs: [build-wheels, build-wasm] - permissions: - contents: "write" - packages: "write" - pull-requests: "read" - if: startsWith(github.ref, 'refs/tags/v') - steps: - - name: Download artifacts - uses: actions/download-artifact@v2 - - # Leave the below be, it will be useful. 
- - name: List downloaded assets - run: | - find ./ - - - name: Update GitHub release - uses: marvinpinto/action-automatic-releases@latest - with: - repo_token: ${{ secrets.GITHUB_TOKEN }} - automatic_release_tag: ${{ github.ref_name }} - prerelease: false - title: "${{ github.ref_name }}" - files: | - wheels/*.whl - wasm-artefacts/bergamot-translator-worker.js - wasm-artefacts/bergamot-translator-worker.wasm - - - python-checks: - name: "formatting and typechecks" - runs-on: "ubuntu-latest" - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - name: Install Dependencies - run: |- - python3 -m pip install black isort pytype - - name: "Formatting checks: black, isort" - run: | - python3 -m black --diff --check bindings/python/ setup.py doc/conf.py - python3 -m isort --profile black --diff --check bindings/python setup.py doc/conf.py - - name: "Static typing checks: pytype" - run: |- - python3 -m pytype bindings/python - - docs: - runs-on: ubuntu-latest - needs: [build-wheels] - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - # Runs javascript to extract push events from both tags and branch (only main, due to workflow trigger) - # converts refs/<>/ -> - # eg: - # refs/head/main -> main - # refs/tags/v0.1.0 -> v0.1.0 - # - - name: Download artifacts - uses: actions/download-artifact@v2 - - name: Extract tag name - id: tag - uses: actions/github-script@0.2.0 - if: ${{ github.event_name == 'push' }} - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - const args = context.payload.ref.split("/"); - [refs, category, ...rest] = args; - return rest.join("/"); - - # Patches the BERGAMOT_VERSION file used by sphinx-docs at run time to - # obtain names like 'main' or 'ci-sandbox' to not confuse with version - # based documentation built separately. 
- - name: Deploy-time patch version - run: | - echo ${{steps.tag.outputs.result }} > BERGAMOT_VERSION - - - name: Set up Doxygen - run: sudo apt-get install -y doxygen - - - name: Set up Python - uses: actions/setup-python@v2 - with: - python-version: 3.7 - - - name: Set up dependency cache - uses: actions/cache@v2 - with: - path: ~/.cache/pip - key: ${{ runner.os }}-pip-${{ hashFiles('doc/requirements.txt') }} - restore-keys: | - ${{ runner.os }}-pip- - - - name: Install dependencies - working-directory: ./doc - run: | - python3 -m pip install -r requirements.txt - python3 -m pip install --find-links=${{github.workspace}}/wheels bergamot - - - name: Build documentation - working-directory: ./doc - run: sphinx-build -b html ./ build/ - - - - name: Deploy 🚀 - uses: JamesIves/github-pages-deploy-action@4.1.3 - if: ${{ github.event_name == 'push' && github.repository == 'browsermt/bergamot-translator' }} - with: - repository-name: 'browsermt/docs' - branch: gh-pages # The branch the action should deploy to. - folder: './doc/build/' # The folder the action should deploy. - target-folder: '${{ steps.tag.outputs.result }}' - ssh-key: ${{ secrets.BERGAMOT_SSH_PRIVATE_KEY }} - - # This artifact contains the HTML output of Sphinx only. - # With index.html at the root of the produced zip file. - # For use for maintainers to download the zip and check render of - # documentation while generated at pull-request. 
- - name: Upload documentation - uses: actions/upload-artifact@v2 - if: ${{ github.event_name == 'pull_request'}} - with: - name: api-docs - path: ./doc/build/ - if-no-files-found: error diff --git a/inference-engine/.github/workflows/coding-styles.yml b/inference-engine/.github/workflows/coding-styles.yml deleted file mode 100644 index b13345601..000000000 --- a/inference-engine/.github/workflows/coding-styles.yml +++ /dev/null @@ -1,42 +0,0 @@ -name: "Coding Style" - -on: - push: - branches: [ main, ci-sandbox ] - pull_request: - branches: [ '**' ] - -jobs: - clang-format: - name: "clang-format" - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - - name: Install dependencies - run: | - sudo apt-get update - sudo apt-get install -y build-essential cmake - sudo apt-get install -y clang-format clang-tidy - - - name: Run clang-format - run: - python3 run-clang-format.py --style file -r src wasm bindings/python - - - - name: Prepare build, compilation database etc. - run: | - mkdir -p build - cd build - cmake \ - -DUSE_WASM_COMPATIBLE_SOURCE=off -DCMAKE_EXPORT_COMPILE_COMMANDS=on \ - -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \ - .. 
- - - name: Run clang-tidy - run: | - run-clang-tidy -p build "$PWD/src/.*" - run-clang-tidy -p build "$PWD/app/.*" diff --git a/inference-engine/.github/workflows/native.yml b/inference-engine/.github/workflows/native.yml deleted file mode 100644 index 505381cbc..000000000 --- a/inference-engine/.github/workflows/native.yml +++ /dev/null @@ -1,243 +0,0 @@ -name: native -'on': - push: - branches: - - main - - ci-sandbox - pull_request: - branches: - - '**' -env: - ccache_basedir: ${{ github.workspace }} - ccache_dir: "${{ github.workspace }}/.ccache" - ccache_compilercheck: content - ccache_compress: 'true' - ccache_compresslevel: 9 - ccache_maxsize: 200M - ccache_cmake: -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache -jobs: - ubuntu: - strategy: - fail-fast: false - matrix: - include: - - name: Ubuntu 22.04 full - os: ubuntu-22.04 - identifier: ubuntu_2204_full - cmake: -DCOMPILE_TESTS=on - brt_tags: "" - unittests: 'true' - - name: Ubuntu 22.04 minimal - os: ubuntu-22.04 - identifier: ubuntu_2204_minimal - cmake: -DCOMPILE_TESTS=on -DUSE_WASM_COMPATIBLE_SOURCE=on - brt_tags: "'#wasm'" - unittests: 'false' - - name: Ubuntu 20.04 full - os: ubuntu-20.04 - identifier: ubuntu_2004_full - cmake: -DCOMPILE_TESTS=on - brt_tags: "" - unittests: 'true' - - name: Ubuntu 20.04 minimal - os: ubuntu-20.04 - identifier: ubuntu_2004_minimal - cmake: -DCOMPILE_TESTS=on -DUSE_WASM_COMPATIBLE_SOURCE=on - brt_tags: "'#wasm'" - unittests: 'false' - name: ${{ matrix.name }} - runs-on: ${{ matrix.os }} - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - name: Install Dependencies - run: |- - sudo apt-get update - sudo apt-get install -y libprotobuf-dev protobuf-compiler libboost-all-dev ccache libunwind-dev libgoogle-perftools-dev - - name: Install MKL - run: |- - wget -qO- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB" | sudo apt-key add - - sudo sh -c "echo deb 
https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list" - sudo apt-get update -o Dir::Etc::sourcelist="/etc/apt/sources.list.d/intel-mkl.list" - sudo apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088 - - name: Generate ccache_vars for ccache based on machine - shell: bash - id: ccache_vars - run: |- - echo "::set-output name=hash::$(echo ${{ env.ccache_compilercheck }})" - echo "::set-output name=timestamp::$(date '+%Y-%m-%dT%H.%M.%S')" - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: ${{ env.ccache_dir }} - key: ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - restore-keys: |- - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }} - ccache-${{ matrix.identifier }} - - name: ccache environment setup - run: |- - echo "CCACHE_COMPILER_CHECK=${{ env.ccache_compilercheck }}" >> $GITHUB_ENV - echo "CCACHE_BASEDIR=${{ env.ccache_basedir }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESS=${{ env.ccache_compress }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }}" >> $GITHUB_ENV - echo "CCACHE_DIR=${{ env.ccache_dir }}" >> $GITHUB_ENV - echo "CCACHE_MAXSIZE=${{ env.ccache_maxsize }}" >> $GITHUB_ENV - - name: ccache prolog - run: |- - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - name: cmake - run: |- - mkdir -p build - cd build - cmake -L .. 
${{ matrix.cmake }} ${{ env.ccache_cmake }} - - name: Build from source - working-directory: build - run: make -j2 - - name: ccache epilog - run: 'ccache -s # Print current cache stats' - - name: Print Versions - working-directory: build - run: ./app/bergamot --version - - name: Run unit tests - working-directory: build - run: make test - if: ${{ matrix.unittests == 'true' }} - - name: Install regression-test framework (BRT) - working-directory: bergamot-translator-tests - run: make install - - name: Run regression-tests (BRT) - working-directory: bergamot-translator-tests - id: brt_run - run: MARIAN=../build ./run_brt.sh ${{ matrix.brt_tags }} - - name: Print logs of unsuccessful BRTs - working-directory: bergamot-translator-tests - run: |- - grep "tests.*.sh" previous.log \ - | sed 's/^\s*-\s*//' \ - | xargs -I% bash -c 'echo %; tail -n20 %.log' - if: ${{ always() && steps.brt_run.outcome == 'failure' }} - - name: Upload regression-tests artifacts - uses: actions/upload-artifact@v2 - if: ${{ always() && steps.brt_run.outcome != 'skipped' }} - with: - name: brt-${{ matrix.identifier }} - path: |- - bergamot-translator-tests/**/*.expected - bergamot-translator-tests/**/*.log - bergamot-translator-tests/**/*.out - - name: Confirm native-run example script works - run: |- - bash examples/run-native.sh - - mac: - strategy: - fail-fast: false - matrix: - include: - - name: MacOS 12 full - os: macos-12 - identifier: mac_12_full - cmake: -DCOMPILE_TESTS=on -DUSE_APPLE_ACCELERATE=off -DUSE_FBGEMM=off -DUSE_STATIC_LIBS=off - brt_tags: "" - unittests: 'true' - - name: MacOS 12 minimal - os: macos-12 - identifier: mac_12_minimal - cmake: -DCOMPILE_TESTS=on -DUSE_APPLE_ACCELERATE=off -DUSE_FBGEMM=off -DUSE_STATIC_LIBS=on -DUSE_WASM_COMPATIBLE_SOURCE=on - brt_tags: "'#wasm'" - unittests: 'false' - name: ${{ matrix.name }} - runs-on: ${{ matrix.os }} - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - name: Install Dependencies - run: |- 
- brew update - brew install openblas protobuf ccache - brew install coreutils findutils - - name: Setup path with gnu - run: |- - echo "/usr/local/opt/coreutils/libexec/gnubin" >> $GITHUB_PATH - echo "/usr/local/opt/findutils/libexec/gnubin" >> $GITHUB_PATH - - name: Setup BLAS - run: |- - echo "LDFLAGS=-L/usr/local/opt/openblas/lib" >> $GITHUB_ENV - echo "CPPFLAGS=-I/usr/local/opt/openblas/include" >> $GITHUB_ENV - - name: Generate ccache_vars for ccache based on machine - shell: bash - id: ccache_vars - run: |- - echo "::set-output name=hash::$(echo ${{ env.ccache_compilercheck }})" - echo "::set-output name=timestamp::$(date '+%Y-%m-%dT%H.%M.%S')" - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: ${{ env.ccache_dir }} - key: ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - restore-keys: |- - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }} - ccache-${{ matrix.identifier }} - - name: ccache environment setup - run: |- - echo "CCACHE_COMPILER_CHECK=${{ env.ccache_compilercheck }}" >> $GITHUB_ENV - echo "CCACHE_BASEDIR=${{ env.ccache_basedir }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESS=${{ env.ccache_compress }}" >> $GITHUB_ENV - echo "CCACHE_COMPRESSLEVEL=${{ env.ccache_compresslevel }}" >> $GITHUB_ENV - echo "CCACHE_DIR=${{ env.ccache_dir }}" >> $GITHUB_ENV - echo "CCACHE_MAXSIZE=${{ env.ccache_maxsize }}" >> $GITHUB_ENV - - name: ccache prolog - run: |- - ccache -s # Print current cache stats - ccache -z # Zero cache entry - - name: cmake - run: |- - mkdir -p build - cd build - cmake -L .. 
${{ matrix.cmake }} ${{ env.ccache_cmake }} - - name: Build from source - working-directory: build - run: make -j2 - - name: ccache epilog - run: 'ccache -s # Print current cache stats' - - name: Print Versions - working-directory: build - run: ./app/bergamot --version - - name: Run unit tests - working-directory: build - run: make test - if: ${{ matrix.unittests == 'true' }} - - name: Install regression-test framework (BRT) - working-directory: bergamot-translator-tests - run: make install - - name: Run regression-tests (BRT) - working-directory: bergamot-translator-tests - id: brt_run - run: MARIAN=../build ./run_brt.sh ${{ matrix.brt_tags }} - - name: Print logs of unsuccessful BRTs - working-directory: bergamot-translator-tests - run: |- - grep "tests.*.sh" previous.log \ - | sed 's/^\s*-\s*//' \ - | xargs -I% bash -c 'echo %; tail -n20 %.log' - if: ${{ always() && steps.brt_run.outcome == 'failure' }} - - name: Upload regression-tests artifacts - uses: actions/upload-artifact@v2 - if: ${{ always() && steps.brt_run.outcome != 'skipped' }} - with: - name: brt-${{ matrix.identifier }} - path: |- - bergamot-translator-tests/**/*.expected - bergamot-translator-tests/**/*.log - bergamot-translator-tests/**/*.out - - name: Confirm native-run example script works - run: |- - bash examples/run-native.sh - diff --git a/inference-engine/.github/workflows/windows.yml b/inference-engine/.github/workflows/windows.yml deleted file mode 100644 index a0ff86b84..000000000 --- a/inference-engine/.github/workflows/windows.yml +++ /dev/null @@ -1,128 +0,0 @@ -name: Windows - -on: - push: - branches: [ main, ci-sandbox ] - pull_request: - branches: [ '**' ] - -env: - MKL_URL: "https://data.statmt.org/romang/marian-regression-tests/ci/mkl-2020.1-windows-static.zip" - CCACHE_BASEDIR: "${{ github.workspace }}" - CCACHE_DIR: "${{ github.workspace }}\\ccache" - CCACHE_COMPILERCHECK: content - CCACHE_COMPRESS: 'true' - CCACHE_COMPRESSLEVEL: 9 - CCACHE_MAXSIZE: 200M - ccache_version: 
'4.5' - -jobs: - build-windows: - strategy: - matrix: - include: - # Windows CPU-only build - - name: "Windows CPU-only" - identifier: "windows-x64" - - runs-on: windows-2019 - name: ${{ matrix.name }} - - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - submodules: recursive - - - - name: Download ccache - shell: cmake -P {0} - run: | - set(ccache_url "https://github.com/cristianadam/ccache/releases/download/v${{ env.ccache_version }}/${{ runner.os }}.tar.xz") - file(DOWNLOAD "${ccache_url}" ./ccache.tar.xz SHOW_PROGRESS) - execute_process(COMMAND ${CMAKE_COMMAND} -E tar xvf ./ccache.tar.xz RESULT_VARIABLE ret) - if(ret AND NOT ret EQUAL 0) - message( FATAL_ERROR "Bad exit status") - endif() - - - name: Generate ccache_vars for ccache based on machine - shell: cmake -P {0} - id: ccache_vars - run: |- - string(TIMESTAMP current_date "%Y-%m-%d-%H;%M;%S" UTC) - message("::set-output name=timestamp::${current_date}") - message("::set-output name=hash::${{ env.ccache_compilercheck }}") - - - name: Cache-op for build-cache through ccache - uses: actions/cache@v2 - with: - path: ${{ env.CCACHE_DIR }} - key: ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }}-${{ steps.ccache_vars.outputs.timestamp }} - restore-keys: |- - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }}-${{ github.ref }} - ccache-${{ matrix.identifier }}-${{ steps.ccache_vars.outputs.hash }} - ccache-${{ matrix.identifier }} - - - name: ccache prolog - run: |- - ${{github.workspace}}\ccache.exe -sv # Print current cache stats - ${{github.workspace}}\ccache.exe -z # Zero cache statistics - - - name: Download MKL - run: | - # Wget retries downloading files and is faster than Invoke-WebRequest - C:\msys64\usr\bin\wget.exe -nv ${{ env.MKL_URL }} -O mkl.zip - Expand-Archive -Force mkl.zip ${{ github.workspace }}\mkl - # Set MKLROOT environment variable so that CMake can find MKL - echo "MKLROOT=${{ github.workspace }}\mkl" | Out-File -FilePath
$env:GITHUB_ENV -Encoding utf8 -Append - shell: powershell - - - name: Disable debug vcpkg build - shell: powershell - working-directory: C:\vcpkg\triplets - run: | - $PSDefaultParameterValues['Out-File:Encoding'] = 'utf8' # Powershell murders me. - echo "set(VCPKG_BUILD_TYPE release)" | Tee-Object -FilePath x64-windows-static.cmake -Append - echo "set(VCPKG_BUILD_TYPE release)" | Tee-Object -FilePath x64-windows.cmake -Append - cat x64-windows-static.cmake - cat x64-windows.cmake - - - name: Install dependencies with vcpkg - working-directory: C:\vcpkg - run: | - $Env:VCPKG_BUILD_TYPE = 'release' - $Env:VCPKG_DEFAULT_TRIPLET = 'x64-windows-static' # QT6 version, linguist tools not working yet: qtbase:x64-windows-static qttools:x64-windows-static qtsvg:x64-windows-static - .\vcpkg install protobuf:x64-windows-static pcre2:x64-windows-static - .\vcpkg upgrade --no-dry-run # In case there are new builds available after cache restoration - shell: powershell - - - name: Create Build Environment - # Some projects don't allow in-source building, so create a separate build directory - # We'll use this as our working directory for all subsequent commands - run: cmake -E make_directory ${{github.workspace}}/build - - - name: Configure - working-directory: ${{github.workspace}}/build #@TODO figure out how variables are accessed from power shell, as they seem to not be read. - run: | - cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_STATIC_LIBS=ON -DVCPKG_TARGET_TRIPLET='x64-windows-static' ` - -DCMAKE_TOOLCHAIN_FILE="C:/vcpkg/scripts/buildsystems/vcpkg.cmake" ` - -DCMAKE_CXX_COMPILER_LAUNCHER=${{github.workspace}}\ccache.exe ` - -DCMAKE_C_COMPILER_LAUNCHER=${{github.workspace}}\ccache.exe - shell: powershell - - - name: Build - working-directory: ${{github.workspace}}/build - run: cmake --build . 
--config Release -j3 - shell: powershell - - - - name: Print versions - working-directory: ${{github.workspace}}/build - run: | - .\app\Release\bergamot.exe --version - - shell: cmd - - - name: ccache epilog - run: |- - ${{github.workspace}}\\ccache.exe -sv # Print current cache stats From 27e85d25a7f9f2c34169c279c2ea6fae1d30e7ae Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Fri, 20 Sep 2024 13:07:21 -0500 Subject: [PATCH 427/442] Remove unneeded Python code --- .gitmodules | 4 - inference-engine/3rd_party/CMakeLists.txt | 6 +- inference-engine/3rd_party/pybind11 | 1 - inference-engine/bindings/CMakeLists.txt | 1 - .../bindings/python/CMakeLists.txt | 9 - inference-engine/bindings/python/README.md | 14 - inference-engine/bindings/python/__init__.py | 18 - inference-engine/bindings/python/__main__.py | 20 - inference-engine/bindings/python/bergamot.cpp | 213 --------- inference-engine/bindings/python/cmds.py | 177 -------- .../bindings/python/repository.py | 218 ---------- .../bindings/python/typing_utils.py | 5 - inference-engine/bindings/python/utils.py | 52 --- inference-engine/run-clang-format.py | 408 ------------------ inference-engine/setup.py | 248 ----------- 15 files changed, 1 insertion(+), 1393 deletions(-) delete mode 160000 inference-engine/3rd_party/pybind11 delete mode 100644 inference-engine/bindings/CMakeLists.txt delete mode 100644 inference-engine/bindings/python/CMakeLists.txt delete mode 100644 inference-engine/bindings/python/README.md delete mode 100644 inference-engine/bindings/python/__init__.py delete mode 100644 inference-engine/bindings/python/__main__.py delete mode 100644 inference-engine/bindings/python/bergamot.cpp delete mode 100644 inference-engine/bindings/python/cmds.py delete mode 100644 inference-engine/bindings/python/repository.py delete mode 100644 inference-engine/bindings/python/typing_utils.py delete mode 100644 inference-engine/bindings/python/utils.py delete mode 100644 inference-engine/run-clang-format.py delete 
mode 100644 inference-engine/setup.py diff --git a/.gitmodules b/.gitmodules index ce6f3230b..a07948957 100644 --- a/.gitmodules +++ b/.gitmodules @@ -10,10 +10,6 @@ path = inference-engine/3rd_party/browsermt-marian-dev url = https://github.com/browsermt/marian-dev -[submodule "inference-engine/3rd_party/pybind11"] - path = inference-engine/3rd_party/pybind11 - url = https://github.com/pybind/pybind11.git - [submodule "inference-engine/3rd_party/ssplit-cpp"] path = inference-engine/3rd_party/ssplit-cpp url = https://github.com/browsermt/ssplit-cpp diff --git a/inference-engine/3rd_party/CMakeLists.txt b/inference-engine/3rd_party/CMakeLists.txt index 0185d7673..62ba02722 100644 --- a/inference-engine/3rd_party/CMakeLists.txt +++ b/inference-engine/3rd_party/CMakeLists.txt @@ -29,8 +29,4 @@ target_compile_options(marian PUBLIC ${COMPILE_OPTIONS}) get_directory_property(CMAKE_C_FLAGS DIRECTORY browsermt-marian-dev DEFINITION CMAKE_C_FLAGS) get_directory_property(CMAKE_CXX_FLAGS DIRECTORY browsermt-marian-dev DEFINITION CMAKE_CXX_FLAGS) set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE) -set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) - -if(COMPILE_PYTHON) - add_subdirectory(pybind11) -endif(COMPILE_PYTHON) +set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE) diff --git a/inference-engine/3rd_party/pybind11 b/inference-engine/3rd_party/pybind11 deleted file mode 160000 index 9ec1128c7..000000000 --- a/inference-engine/3rd_party/pybind11 +++ /dev/null @@ -1 +0,0 @@ -Subproject commit 9ec1128c7aac3d069a4ec2bd1dfc7f57c6526d1c diff --git a/inference-engine/bindings/CMakeLists.txt b/inference-engine/bindings/CMakeLists.txt deleted file mode 100644 index 8e5f91a37..000000000 --- a/inference-engine/bindings/CMakeLists.txt +++ /dev/null @@ -1 +0,0 @@ -add_subdirectory(python) diff --git a/inference-engine/bindings/python/CMakeLists.txt b/inference-engine/bindings/python/CMakeLists.txt deleted file mode 100644 index 16e3e48d3..000000000 --- 
a/inference-engine/bindings/python/CMakeLists.txt +++ /dev/null @@ -1,9 +0,0 @@ -find_package(Python COMPONENTS Interpreter Development.Module REQUIRED) - -message("Using Python: " ${Python_EXECUTABLE}) - -# pybind11 method: -pybind11_add_module(_bergamot SHARED bergamot.cpp) -target_link_libraries(_bergamot PUBLIC pybind11::module pybind11::headers bergamot-translator) -target_include_directories(_bergamot PUBLIC ${PROJECT_SOURCE_DIR} ${PROJECT_SOURCE_DIR}/src - ${CMAKE_BINARY_DIR}/src) diff --git a/inference-engine/bindings/python/README.md b/inference-engine/bindings/python/README.md deleted file mode 100644 index 3797b7dea..000000000 --- a/inference-engine/bindings/python/README.md +++ /dev/null @@ -1,14 +0,0 @@ -# bergamot-translator - -The [Bergamot project](https://browser.mt/) adds and improves client-side -machine translation in a web browser. - -This package provides Python bindings to bergamot-translator, developed as part -of the Bergamot Project, along with assorted extras that enable further use -of the library for local translation on a consumer machine. - -Bergamot is a consortium coordinated by the University of Edinburgh with -partners Charles University in Prague, the University of Sheffield, University -of Tartu, and Mozilla.
- - diff --git a/inference-engine/bindings/python/__init__.py b/inference-engine/bindings/python/__init__.py deleted file mode 100644 index 5855a4faf..000000000 --- a/inference-engine/bindings/python/__init__.py +++ /dev/null @@ -1,18 +0,0 @@ -import typing - -from ._bergamot import * # type: ignore -from .repository import Aggregator, TranslateLocallyLike - -REPOSITORY = Aggregator( - [ - TranslateLocallyLike("browsermt", "https://translatelocally.com/models.json"), - TranslateLocallyLike( - "opus", "https://object.pouta.csc.fi/OPUS-MT-models/app/models.json" - ), - ] -) -""" -REPOSITORY is a global object that aggregates multiple model-providers to -provide a (model-provider: str, model-code: str) based query mechanism to -get models. -""" diff --git a/inference-engine/bindings/python/__main__.py b/inference-engine/bindings/python/__main__.py deleted file mode 100644 index 35014c099..000000000 --- a/inference-engine/bindings/python/__main__.py +++ /dev/null @@ -1,20 +0,0 @@ -import argparse -import sys -from argparse import ArgumentParser - -from .cmds import CMDS, make_parser - - -def main() -> None: - parser = make_parser() - args = parser.parse_args() - - if args.action in CMDS: - CMDS[args.action].execute(args) - else: - parser.print_help(sys.stderr) - sys.exit(1) - - -if __name__ == "__main__": - main() diff --git a/inference-engine/bindings/python/bergamot.cpp b/inference-engine/bindings/python/bergamot.cpp deleted file mode 100644 index 2ffb2267e..000000000 --- a/inference-engine/bindings/python/bergamot.cpp +++ /dev/null @@ -1,213 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -namespace py = pybind11; - -using marian::bergamot::AnnotatedText; -using marian::bergamot::ByteRange; -using marian::bergamot::ConcatStrategy; -using marian::bergamot::Response; -using marian::bergamot::ResponseOptions; -using Service = marian::bergamot::AsyncService; -using 
_Model = marian::bergamot::TranslationModel; -using Model = std::shared_ptr<_Model>; -using Alignment = std::vector<std::vector<float>>; -using Alignments = std::vector<Alignment>; - -PYBIND11_MAKE_OPAQUE(std::vector<std::string>); -PYBIND11_MAKE_OPAQUE(std::vector<Response>); -PYBIND11_MAKE_OPAQUE(Alignments); - -class ServicePyAdapter { - public: - ServicePyAdapter(const Service::Config &config) : service_(make_service(config)) { - // Set marian to throw exceptions instead of std::abort() - marian::setThrowExceptionOnAbort(true); - } - - std::shared_ptr<_Model> modelFromConfig(const std::string &config) { - auto parsedConfig = marian::bergamot::parseOptionsFromString(config); - return service_.createCompatibleModel(parsedConfig); - } - - std::shared_ptr<_Model> modelFromConfigPath(const std::string &configPath) { - auto config = marian::bergamot::parseOptionsFromFilePath(configPath); - return service_.createCompatibleModel(config); - } - - std::vector<Response> translate(Model model, std::vector<std::string> &inputs, const ResponseOptions &options) { - py::scoped_ostream_redirect outstream(std::cout, // std::ostream& - py::module_::import("sys").attr("stdout") // Python output - ); - py::scoped_ostream_redirect errstream(std::cerr, // std::ostream& - py::module_::import("sys").attr("stderr") // Python output - ); - - py::call_guard<py::gil_scoped_release> gil_guard; - - // Prepare promises, save respective futures. Have callbacks in async set - // value to the promises. - std::vector<std::future<Response>> futures; - std::vector<std::promise<Response>> promises; - promises.resize(inputs.size()); - - for (size_t i = 0; i < inputs.size(); i++) { - auto callback = [&promises, i](Response &&response) { promises[i].set_value(std::move(response)); }; - - service_.translate(model, std::move(inputs[i]), std::move(callback), options); - - futures.push_back(std::move(promises[i].get_future())); - } - - // Wait on all futures to be ready.
- std::vector<Response> responses; - for (size_t i = 0; i < futures.size(); i++) { - futures[i].wait(); - responses.push_back(std::move(futures[i].get())); - } - - return responses; - } - - std::vector<Response> pivot(Model first, Model second, std::vector<std::string> &inputs, - const ResponseOptions &options) { - py::scoped_ostream_redirect outstream(std::cout, // std::ostream& - py::module_::import("sys").attr("stdout") // Python output - ); - py::scoped_ostream_redirect errstream(std::cerr, // std::ostream& - py::module_::import("sys").attr("stderr") // Python output - ); - - py::call_guard<py::gil_scoped_release> gil_guard; - // Prepare promises, save respective futures. Have callbacks in async set - // value to the promises. - std::vector<std::future<Response>> futures; - std::vector<std::promise<Response>> promises; - promises.resize(inputs.size()); - - for (size_t i = 0; i < inputs.size(); i++) { - auto callback = [&promises, i](Response &&response) { promises[i].set_value(std::move(response)); }; - - service_.pivot(first, second, std::move(inputs[i]), std::move(callback), options); - - futures.push_back(std::move(promises[i].get_future())); - } - - // Wait on all futures to be ready.
- std::vector<Response> responses; - for (size_t i = 0; i < futures.size(); i++) { - futures[i].wait(); - responses.push_back(std::move(futures[i].get())); - } - - return responses; - } - - private /*functions*/: - static Service make_service(const Service::Config &config) { - py::scoped_ostream_redirect outstream(std::cout, // std::ostream& - py::module_::import("sys").attr("stdout") // Python output - ); - py::scoped_ostream_redirect errstream(std::cerr, // std::ostream& - py::module_::import("sys").attr("stderr") // Python output - ); - - py::call_guard<py::gil_scoped_release> gil_guard; - - return Service(config); - } - - private /*data*/: - Service service_; -}; - -PYBIND11_MODULE(_bergamot, m) { - m.doc() = "Bergamot pybind11 bindings"; - m.attr("__version__") = marian::bergamot::bergamotBuildVersion(); - py::class_<ByteRange>(m, "ByteRange") - .def(py::init<>()) - .def_readonly("begin", &ByteRange::begin) - .def_readonly("end", &ByteRange::end) - .def("__repr__", [](const ByteRange &range) { - return "{" + std::to_string(range.begin) + ", " + std::to_string(range.end) + "}"; - }); - - py::class_<AnnotatedText>(m, "AnnotatedText") - .def(py::init<>()) - .def("numWords", &AnnotatedText::numWords) - .def("numSentences", &AnnotatedText::numSentences) - .def("word", - [](const AnnotatedText &annotatedText, size_t sentenceIdx, size_t wordIdx) -> std::string { - auto view = annotatedText.word(sentenceIdx, wordIdx); - return std::string(view.data(), view.size()); - }) - .def("sentence", - [](const AnnotatedText &annotatedText, size_t sentenceIdx) -> std::string { - auto view = annotatedText.sentence(sentenceIdx); - return std::string(view.data(), view.size()); - }) - .def("wordAsByteRange", &AnnotatedText::wordAsByteRange) - .def("sentenceAsByteRange", &AnnotatedText::sentenceAsByteRange) - .def_readonly("text", &AnnotatedText::text); - - py::class_<Response>(m, "Response") - .def(py::init<>()) - .def_readonly("source", &Response::source) - .def_readonly("target", &Response::target) - .def_readonly("alignments",
&Response::alignments); - - py::bind_vector<std::vector<std::string>>(m, "VectorString"); - py::bind_vector<std::vector<Response>>(m, "VectorResponse"); - - py::enum_<ConcatStrategy>(m, "ConcatStrategy") - .value("FAITHFUL", ConcatStrategy::FAITHFUL) - .value("SPACE", ConcatStrategy::SPACE) - .export_values(); - - py::class_<ResponseOptions>(m, "ResponseOptions") - .def( - py::init<>([](bool qualityScores, bool alignment, bool HTML, bool sentenceMappings, ConcatStrategy strategy) { - return ResponseOptions{qualityScores, alignment, HTML, sentenceMappings, strategy}; - }), - py::arg("qualityScores") = true, py::arg("alignment") = false, py::arg("HTML") = false, - py::arg("sentenceMappings") = true, py::arg("concatStrategy") = ConcatStrategy::FAITHFUL) - .def_readwrite("qualityScores", &ResponseOptions::qualityScores) - .def_readwrite("HTML", &ResponseOptions::HTML) - .def_readwrite("alignment", &ResponseOptions::alignment) - .def_readwrite("concatStrategy", &ResponseOptions::concatStrategy) - .def_readwrite("sentenceMappings", &ResponseOptions::sentenceMappings); - - py::class_<ServicePyAdapter>(m, "Service") - .def(py::init<const Service::Config &>()) - .def("modelFromConfig", &ServicePyAdapter::modelFromConfig) - .def("modelFromConfigPath", &ServicePyAdapter::modelFromConfigPath) - .def("translate", &ServicePyAdapter::translate) - .def("pivot", &ServicePyAdapter::pivot); - - py::class_<Service::Config>(m, "ServiceConfig") - .def(py::init<>([](size_t numWorkers, size_t cacheSize, std::string logging) { - Service::Config config; - config.numWorkers = numWorkers; - config.cacheSize = cacheSize; - config.logger.level = logging; - return config; - }), - py::arg("numWorkers") = 1, py::arg("cacheSize") = 0, py::arg("logLevel") = "off") - .def_readwrite("numWorkers", &Service::Config::numWorkers) - .def_readwrite("cacheSize", &Service::Config::cacheSize); - - py::class_<_Model, std::shared_ptr<_Model>>(m, "TranslationModel"); -} diff --git a/inference-engine/bindings/python/cmds.py deleted file mode 100644 index 5949adaca..000000000 ---
a/inference-engine/bindings/python/cmds.py +++ /dev/null @@ -1,177 +0,0 @@ -import argparse -import sys -from collections import Counter, defaultdict - -from . import REPOSITORY, ResponseOptions, Service, ServiceConfig, VectorString - -CMDS = {} - - -def _register_cmd(cmd: str): - """ - Convenience decorator function, which populates the dictionary above with - commands created in a declarative fashion. - """ - - def __inner(cls): - CMDS[cmd] = cls - return cls - - return __inner - - -@_register_cmd("translate") -class Translate: - @staticmethod - def embed_subparser(key: str, subparsers: argparse._SubParsersAction): - translate = subparsers.add_parser( - key, - description="translate using a given model. Multiple models mean pivoting", - ) - - translate.add_argument( - "-m", - "--model", - type=str, - nargs="+", - help="Path to model file(s) to use in forward or pivot translation", - required=True, - ) - - translate.add_argument( - "-r", - "--repository", - type=str, - help="Repository to download model from", - choices=REPOSITORY.available(), - default="browsermt", - ) - - translate.add_argument( - "--num-workers", - type=int, - help="Number of worker threads to use to translate", - default=4, - ) - - translate.add_argument( - "--log-level", - type=str, - default="off", - help="Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off", - ) - - # Tweak response-options for quick HTML in out via commandline - options = translate.add_argument_group("response-options") - options.add_argument("--html", type=bool, default=False) - options.add_argument("--alignment", type=bool, default=False) - options.add_argument("--quality-scores", type=bool, default=False) - - @staticmethod - def execute(args: argparse.Namespace): - # Build service - - config = ServiceConfig(numWorkers=args.num_workers, logLevel=args.log_level) - service = Service(config) - - models = [ - service.modelFromConfigPath( - REPOSITORY.modelConfigPath(args.repository, model) - ) - 
for model in args.model - ] - - # Configure a few options which require how a Response is constructed - options = ResponseOptions( - alignment=args.alignment, qualityScores=args.quality_scores, HTML=args.html - ) - - source = sys.stdin.read() - responses = None - if len(models) == 1: - [model] = models - responses = service.translate(model, VectorString([source]), options) - else: - [first, second] = models - responses = service.pivot(first, second, VectorString([source]), options) - - for response in responses: - print(response.target.text, end="") - - -@_register_cmd("download") -class Download: - @staticmethod - def embed_subparser(key: str, subparsers: argparse._SubParsersAction): - download = subparsers.add_parser( - key, description="Download models from the web." - ) - - download.add_argument( - "-m", - "--model", - type=str, - required=False, - default=None, - help="Fetch model with given code. Use ls to list available models. Optional, if none supplied all models are downloaded.", - ) - - download.add_argument( - "-r", - "--repository", - type=str, - help="Repository to download model from", - choices=REPOSITORY.available(), - default="browsermt", - ) - - @staticmethod - def execute(args: argparse.Namespace): - if args.model is not None: - REPOSITORY.download(args.repository, args.model) - else: - for model in REPOSITORY.models(args.repository, filter_downloaded=False): - REPOSITORY.download(args.repository, model) - - -@_register_cmd("ls") -class List: - @staticmethod - def embed_subparser(key: str, subparsers: argparse._SubParsersAction): - ls = subparsers.add_parser(key, description="List available models.") - ls.add_argument( - "-r", - "--repository", - type=str, - help="Repository to list models from", - choices=REPOSITORY.available(), - default="browsermt", - ) - - @staticmethod - def execute(args: argparse.Namespace): - print("Available models: ") - for counter, identifier in enumerate( - REPOSITORY.models(args.repository, filter_downloaded=True), 1 
- ): - model = REPOSITORY.model(args.repository, identifier) - print( - " {}.".format(str(counter).rjust(4)), - model["code"], - model["name"], - ) - print() - - -def make_parser() -> argparse.ArgumentParser: - parser = argparse.ArgumentParser("bergamot") - subparsers = parser.add_subparsers( - title="actions", - description="The following actions are available through the bergamot package", - help="To obtain help on how to run these actions supply -h.", - dest="action", - ) - - for key, cls in CMDS.items(): - cls.embed_subparser(key, subparsers) - return parser diff --git a/inference-engine/bindings/python/repository.py deleted file mode 100644 index 9ea3ac023..000000000 --- a/inference-engine/bindings/python/repository.py +++ /dev/null @@ -1,218 +0,0 @@ -import json -import os -import tarfile -import typing as t -from abc import ABC, abstractmethod -from functools import partial -from urllib.parse import urlparse - -import requests -from appdirs import AppDirs - -from .typing_utils import URL, PathLike -from .utils import download_resource, patch_marian_for_bergamot - -APP = "bergamot" - - -class Repository(ABC): - """ - An interface for several repositories. Intended to enable interchangeable - use of translateLocally and Mozilla repositories for usage through Python.
- """ - - @property - @abstractmethod - def name(self): - pass - - @abstractmethod - def update(self): - """Updates the model list""" - pass - - @abstractmethod - def models(self) -> t.List[str]: - """returns identifiers for available models""" - pass - - @abstractmethod - def model(self, model_identifier: str) -> t.Any: - """returns entry for the for available models""" - pass - - @abstractmethod - def modelConfigPath(self, model_identifier: str) -> str: - """returns modelConfigPath for for a given model-identifier""" - pass - - @abstractmethod - def download(self, model_identifier: str): - pass - - -class TranslateLocallyLike(Repository): - """ - This class implements Repository to fetch models from translateLocally. - AppDirs is used to standardize directories and further specialization - happens with translateLocally identifier. - """ - - def __init__(self, name, url): - self.url = url - self._name = name - appDir = AppDirs(APP) - f = lambda *args: os.path.join(*args, self._name) - self.dirs = { - "cache": f(appDir.user_cache_dir), - "config": f(appDir.user_config_dir), - "data": f(appDir.user_data_dir), - "archive": f(appDir.user_data_dir, "archives"), - "models": f(appDir.user_data_dir, "models"), - } - - for directory in self.dirs.values(): - os.makedirs(directory, exist_ok=True) - - self.models_file_path = os.path.join(self.dirs["config"], "models.json") - self.data = self._load_data(self.models_file_path) - - # Update inverse lookup. - self.data_by_code = {} - for model in self.data["models"]: - self.data_by_code[model["code"]] = model - - @property - def name(self) -> str: - return self._name - - def _load_data(self, models_file_path): - """ - Load model data from existing file. If file does not exist, download from the web. - """ - if os.path.exists(models_file_path): - # File already exists, prefer to work with this. - # A user is expected to update manually if model's already - # downloaded and setup. 
- with open(models_file_path) as model_file: - return json.load(model_file) - else: - # We are running for the first time. - # Try to fetch this file from the internet. - self.update() - with open(models_file_path) as model_file: - return json.load(model_file) - - def update(self) -> None: - inventory = requests.get(self.url).text - with open(self.models_file_path, "w+") as models_file: - models_file.write(inventory) - - def models(self, filter_downloaded: bool = True) -> t.List[str]: - codes = [] - for model in self.data["models"]: - if filter_downloaded: - fprefix = self._archive_name_without_extension(model["url"]) - model_dir = os.path.join(self.dirs["models"], fprefix) - if os.path.exists(model_dir): - codes.append(model["code"]) - else: - codes.append(model["code"]) - return codes - - def modelConfigPath(self, model_identifier: str) -> str: - model = self.model(model_identifier) - fprefix = self._archive_name_without_extension(model["url"]) - model_dir = os.path.join(self.dirs["models"], fprefix) - return os.path.join(model_dir, "config.bergamot.yml") - - def model(self, model_identifier: str) -> t.Any: - return self.data_by_code[model_identifier] - - def download(self, model_identifier: str): - # Download path - model = self.model(model_identifier) - model_archive = "{}.tar.gz".format(model["shortName"]) - save_location = os.path.join(self.dirs["archive"], model_archive) - download_resource(model["url"], save_location) - - with tarfile.open(save_location) as model_archive: - - def is_within_directory(directory, target): - abs_directory = os.path.abspath(directory) - abs_target = os.path.abspath(target) - - prefix = os.path.commonprefix([abs_directory, abs_target]) - - return prefix == abs_directory - - def safe_extract(tar, path=".", members=None, *, numeric_owner=False): - for member in tar.getmembers(): - member_path = os.path.join(path, member.name) - if not is_within_directory(path, member_path): - raise Exception("Attempted Path Traversal in Tar File") 
- - tar.extractall(path, members, numeric_owner=numeric_owner) - - safe_extract(model_archive, self.dirs["models"]) - fprefix = self._archive_name_without_extension(model["url"]) - model_dir = os.path.join(self.dirs["models"], fprefix) - symlink = os.path.join(self.dirs["models"], model["code"]) - - print( - "Downloading and extracting {} into ... {}".format( - model["code"], model_dir - ), - end=" ", - ) - - if not os.path.exists(symlink): - os.symlink(model_dir, symlink) - - config_path = os.path.join(symlink, "config.intgemm8bitalpha.yml") - bergamot_config_path = os.path.join(symlink, "config.bergamot.yml") - - # Finally patch so we don't have to reload this again. - patch_marian_for_bergamot(config_path, bergamot_config_path) - - print("Done.") - - def _archive_name_without_extension(self, url: URL): - o = urlparse(url) - fname = os.path.basename(o.path) # something tar.gz. - fname_without_extension = ".".join(fname.split(".")[:3]) - return fname_without_extension - - -class Aggregator: - def __init__(self, repositories: t.List[Repository]): - self.repositories = {} - for repository in repositories: - if repository.name in self.repositories: - raise ValueError("Duplicate repository found.") - self.repositories[repository.name] = repository - - # Default to the first repository supplied. - self.default_repository = repositories[0] - - def update(self, name: str) -> None: - self.repositories.get(name, self.default_repository).update() - - def modelConfigPath(self, name: str, code: str) -> PathLike: - return self.repositories.get(name, self.default_repository).modelConfigPath( - code - ) - - def models(self, name: str, filter_downloaded: bool = True) -> t.List[str]: - return self.repositories.get(name, self.default_repository).models(filter_downloaded) - - def model(self, name: str, model_identifier: str) -> t.Any: - return self.repositories.get(name, self.default_repository).model( - model_identifier - ) - - def available(self): - return list(self.repositories.keys()) - - def download(self,
name: str, model_identifier: str) -> None: - self.repositories.get(name, self.default_repository).download(model_identifier) diff --git a/inference-engine/bindings/python/typing_utils.py b/inference-engine/bindings/python/typing_utils.py deleted file mode 100644 index 3e1682cff..000000000 --- a/inference-engine/bindings/python/typing_utils.py +++ /dev/null @@ -1,5 +0,0 @@ -import pathlib -import typing as t - -PathLike = t.TypeVar("PathLike", str, pathlib.Path) -URL = str diff --git a/inference-engine/bindings/python/utils.py b/inference-engine/bindings/python/utils.py deleted file mode 100644 index 3164c171c..000000000 --- a/inference-engine/bindings/python/utils.py +++ /dev/null @@ -1,52 +0,0 @@ -import os - -import requests -import yaml - -from .typing_utils import URL, PathLike - - -def download_resource(url: URL, save_location: PathLike, force_download=False): - """ - Downloads a resource from url into save_location, overwrites only if - force_download is true. - """ - if force_download or not os.path.exists(save_location): - response = requests.get(url, stream=True) - # Throw an error for bad status codes - response.raise_for_status() - with open(save_location, "wb") as handle: - for block in response.iter_content(1024): - handle.write(block) - - -def patch_marian_for_bergamot( - marian_config_path: PathLike, bergamot_config_path: PathLike, quality: bool = False -): - """ - Accepts path to a config-file from marian-training and following - quantization and adjusts parameters for use in bergamot. - """ - # Load marian_config_path - data = None - with open(marian_config_path) as fp: - data = yaml.load(fp, Loader=yaml.FullLoader) - - # Update a few entries. Things here are hardcoded. - data.update( - { - "ssplit-prefix-file": "", - "ssplit-mode": "paragraph", - "max-length-break": 128, - "mini-batch-words": 1024, - "workspace": 128, # shipped models use big workspaces. We'd prefer to keep it low.
- "alignment": "soft", - } - ) - - if quality: - data.update({"quality": quality, "skip-cost": False}) - - # Write-out. - with open(bergamot_config_path, "w") as output_file: - print(yaml.dump(data, sort_keys=False), file=output_file) diff --git a/inference-engine/run-clang-format.py b/inference-engine/run-clang-format.py deleted file mode 100644 index dcabaf1ec..000000000 --- a/inference-engine/run-clang-format.py +++ /dev/null @@ -1,408 +0,0 @@ -#!/usr/bin/env python -"""A wrapper script around clang-format, suitable for linting multiple files -and to use for continuous integration. - -This is an alternative API for the clang-format command line. -It runs over multiple files and directories in parallel. -A diff output is produced and a sensible exit code is returned. - -""" - -from __future__ import print_function, unicode_literals - -import argparse -import codecs -import difflib -import fnmatch -import io -import errno -import multiprocessing -import os -import signal -import subprocess -import sys -import traceback - -from functools import partial - -try: - from subprocess import DEVNULL # py3k -except ImportError: - DEVNULL = open(os.devnull, "wb") - - -DEFAULT_EXTENSIONS = 'c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx' -DEFAULT_CLANG_FORMAT_IGNORE = '.clang-format-ignore' - - -class ExitStatus: - SUCCESS = 0 - DIFF = 1 - TROUBLE = 2 - -def excludes_from_file(ignore_file): - excludes = [] - try: - with io.open(ignore_file, 'r', encoding='utf-8') as f: - for line in f: - if line.startswith('#'): - # ignore comments - continue - pattern = line.rstrip() - if not pattern: - # allow empty lines - continue - excludes.append(pattern) - except EnvironmentError as e: - if e.errno != errno.ENOENT: - raise - return excludes; - -def list_files(files, recursive=False, extensions=None, exclude=None): - if extensions is None: - extensions = [] - if exclude is None: - exclude = [] - - out = [] - for file in files: - if recursive and os.path.isdir(file): - for dirpath, dnames, 
fnames in os.walk(file): - fpaths = [os.path.join(dirpath, fname) for fname in fnames] - for pattern in exclude: - # os.walk() supports trimming down the dnames list - # by modifying it in-place, - # to avoid unnecessary directory listings. - dnames[:] = [ - x for x in dnames - if - not fnmatch.fnmatch(os.path.join(dirpath, x), pattern) - ] - fpaths = [ - x for x in fpaths if not fnmatch.fnmatch(x, pattern) - ] - for f in fpaths: - ext = os.path.splitext(f)[1][1:] - if ext in extensions: - out.append(f) - else: - out.append(file) - return out - - -def make_diff(file, original, reformatted): - return list( - difflib.unified_diff( - original, - reformatted, - fromfile='{}\t(original)'.format(file), - tofile='{}\t(reformatted)'.format(file), - n=3)) - - -class DiffError(Exception): - def __init__(self, message, errs=None): - super(DiffError, self).__init__(message) - self.errs = errs or [] - - -class UnexpectedError(Exception): - def __init__(self, message, exc=None): - super(UnexpectedError, self).__init__(message) - self.formatted_traceback = traceback.format_exc() - self.exc = exc - - -def run_clang_format_diff_wrapper(args, file): - try: - ret = run_clang_format_diff(args, file) - return ret - except DiffError: - raise - except Exception as e: - raise UnexpectedError('{}: {}: {}'.format(file, e.__class__.__name__, - e), e) - - -def run_clang_format_diff(args, file): - try: - with io.open(file, 'r', encoding='utf-8') as f: - original = f.readlines() - except IOError as exc: - raise DiffError(str(exc)) - - if args.in_place: - invocation = [args.clang_format_executable, '-i', file] - else: - invocation = [args.clang_format_executable, file] - - if args.style: - invocation.extend(['--style', args.style]) - - if args.dry_run: - print(" ".join(invocation)) - return [], [] - - # Use of utf-8 to decode the process output. - # - # Hopefully, this is the correct thing to do. 
- # - # It's done due to the following assumptions (which may be incorrect): - # - clang-format will return the bytes read from the files as-is, - # without conversion, and it is already assumed that the files use utf-8. - # - if the diagnostics were internationalized, they would use utf-8: - # > Adding Translations to Clang - # > - # > Not possible yet! - # > Diagnostic strings should be written in UTF-8, - # > the client can translate to the relevant code page if needed. - # > Each translation completely replaces the format string - # > for the diagnostic. - # > -- http://clang.llvm.org/docs/InternalsManual.html#internals-diag-translation - # - # It's not pretty, due to Python 2 & 3 compatibility. - encoding_py3 = {} - if sys.version_info[0] >= 3: - encoding_py3['encoding'] = 'utf-8' - - try: - proc = subprocess.Popen( - invocation, - stdout=subprocess.PIPE, - stderr=subprocess.PIPE, - universal_newlines=True, - **encoding_py3) - except OSError as exc: - raise DiffError( - "Command '{}' failed to start: {}".format( - subprocess.list2cmdline(invocation), exc - ) - ) - proc_stdout = proc.stdout - proc_stderr = proc.stderr - if sys.version_info[0] < 3: - # make the pipes compatible with Python 3, - # reading lines should output unicode - encoding = 'utf-8' - proc_stdout = codecs.getreader(encoding)(proc_stdout) - proc_stderr = codecs.getreader(encoding)(proc_stderr) - # hopefully the stderr pipe won't get full and block the process - outs = list(proc_stdout.readlines()) - errs = list(proc_stderr.readlines()) - proc.wait() - if proc.returncode: - raise DiffError( - "Command '{}' returned non-zero exit status {}".format( - subprocess.list2cmdline(invocation), proc.returncode - ), - errs, - ) - if args.in_place: - return [], errs - return make_diff(file, original, outs), errs - - -def bold_red(s): - return '\x1b[1m\x1b[31m' + s + '\x1b[0m' - - -def colorize(diff_lines): - def bold(s): - return '\x1b[1m' + s + '\x1b[0m' - - def cyan(s): - return '\x1b[36m' + s +
'\x1b[0m' - - def green(s): - return '\x1b[32m' + s + '\x1b[0m' - - def red(s): - return '\x1b[31m' + s + '\x1b[0m' - - for line in diff_lines: - if line[:4] in ['--- ', '+++ ']: - yield bold(line) - elif line.startswith('@@ '): - yield cyan(line) - elif line.startswith('+'): - yield green(line) - elif line.startswith('-'): - yield red(line) - else: - yield line - - -def print_diff(diff_lines, use_color): - if use_color: - diff_lines = colorize(diff_lines) - if sys.version_info[0] < 3: - sys.stdout.writelines((l.encode('utf-8') for l in diff_lines)) - else: - sys.stdout.writelines(diff_lines) - - -def print_trouble(prog, message, use_colors): - error_text = 'error:' - if use_colors: - error_text = bold_red(error_text) - print("{}: {} {}".format(prog, error_text, message), file=sys.stderr) - - -def main(): - parser = argparse.ArgumentParser(description=__doc__) - parser.add_argument( - '--clang-format-executable', - metavar='EXECUTABLE', - help='path to the clang-format executable', - default='clang-format') - parser.add_argument( - '--extensions', - help='comma separated list of file extensions (default: {})'.format( - DEFAULT_EXTENSIONS), - default=DEFAULT_EXTENSIONS) - parser.add_argument( - '-r', - '--recursive', - action='store_true', - help='run recursively over directories') - parser.add_argument( - '-d', - '--dry-run', - action='store_true', - help='just print the list of files') - parser.add_argument( - '-i', - '--in-place', - action='store_true', - help='format file instead of printing differences') - parser.add_argument('files', metavar='file', nargs='+') - parser.add_argument( - '-q', - '--quiet', - action='store_true', - help="disable output, useful for the exit code") - parser.add_argument( - '-j', - metavar='N', - type=int, - default=0, - help='run N clang-format jobs in parallel' - ' (default number of cpus + 1)') - parser.add_argument( - '--color', - default='auto', - choices=['auto', 'always', 'never'], - help='show colored diff (default: auto)') - 
parser.add_argument( - '-e', - '--exclude', - metavar='PATTERN', - action='append', - default=[], - help='exclude paths matching the given glob-like pattern(s)' - ' from recursive search') - parser.add_argument( - '--style', - help='formatting style to apply (LLVM, Google, Chromium, Mozilla, WebKit)') - - args = parser.parse_args() - - # use default signal handling, like diff return SIGINT value on ^C - # https://bugs.python.org/issue14229#msg156446 - signal.signal(signal.SIGINT, signal.SIG_DFL) - try: - signal.SIGPIPE - except AttributeError: - # compatibility, SIGPIPE does not exist on Windows - pass - else: - signal.signal(signal.SIGPIPE, signal.SIG_DFL) - - colored_stdout = False - colored_stderr = False - if args.color == 'always': - colored_stdout = True - colored_stderr = True - elif args.color == 'auto': - colored_stdout = sys.stdout.isatty() - colored_stderr = sys.stderr.isatty() - - version_invocation = [args.clang_format_executable, str("--version")] - try: - subprocess.check_call(version_invocation, stdout=DEVNULL) - except subprocess.CalledProcessError as e: - print_trouble(parser.prog, str(e), use_colors=colored_stderr) - return ExitStatus.TROUBLE - except OSError as e: - print_trouble( - parser.prog, - "Command '{}' failed to start: {}".format( - subprocess.list2cmdline(version_invocation), e - ), - use_colors=colored_stderr, - ) - return ExitStatus.TROUBLE - - retcode = ExitStatus.SUCCESS - - excludes = excludes_from_file(DEFAULT_CLANG_FORMAT_IGNORE) - excludes.extend(args.exclude) - - files = list_files( - args.files, - recursive=args.recursive, - exclude=excludes, - extensions=args.extensions.split(',')) - - if not files: - return - - njobs = args.j - if njobs == 0: - njobs = multiprocessing.cpu_count() + 1 - njobs = min(len(files), njobs) - - if njobs == 1: - # execute directly instead of in a pool, - # less overhead, simpler stacktraces - it = (run_clang_format_diff_wrapper(args, file) for file in files) - pool = None - else: - pool = 
multiprocessing.Pool(njobs) - it = pool.imap_unordered( - partial(run_clang_format_diff_wrapper, args), files) - pool.close() - while True: - try: - outs, errs = next(it) - except StopIteration: - break - except DiffError as e: - print_trouble(parser.prog, str(e), use_colors=colored_stderr) - retcode = ExitStatus.TROUBLE - sys.stderr.writelines(e.errs) - except UnexpectedError as e: - print_trouble(parser.prog, str(e), use_colors=colored_stderr) - sys.stderr.write(e.formatted_traceback) - retcode = ExitStatus.TROUBLE - # stop at the first unexpected error, - # something could be very wrong, - # don't process all files unnecessarily - if pool: - pool.terminate() - break - else: - sys.stderr.writelines(errs) - if outs == []: - continue - if not args.quiet: - print_diff(outs, use_color=colored_stdout) - if retcode == ExitStatus.SUCCESS: - retcode = ExitStatus.DIFF - if pool: - pool.join() - return retcode - - -if __name__ == '__main__': - sys.exit(main()) diff --git a/inference-engine/setup.py b/inference-engine/setup.py deleted file mode 100644 index ed4c6dc81..000000000 --- a/inference-engine/setup.py +++ /dev/null @@ -1,248 +0,0 @@ -import io -import os -import re -import subprocess -import sys - -from setuptools import Command, Extension, find_packages, setup -from setuptools.command.build_ext import build_ext -from setuptools.command.build_py import build_py as _build_py - -# Convert distutils Windows platform specifiers to CMake -A arguments -PLAT_TO_CMAKE = { - "win32": "Win32", - "win-amd64": "x64", - "win-arm32": "ARM", - "win-arm64": "ARM64", -} - - -# A CMakeExtension needs a sourcedir instead of a file list. -# The name must be the _single_ output extension from the CMake build. -# If you need multiple extensions, see scikit-build. 
-class CMakeExtension(Extension): - def __init__(self, name, sourcedir=""): - Extension.__init__(self, name, sources=[]) - self.sourcedir = os.path.abspath(sourcedir) - - -class CMakeBuild(build_ext): - def build_extension(self, ext): - extdir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name))) - - # required for auto-detection & inclusion of auxiliary "native" libs - if not extdir.endswith(os.path.sep): - extdir += os.path.sep - - debug = int(os.environ.get("DEBUG", 0)) if self.debug is None else self.debug - cfg = "Debug" if debug else "Release" - - # CMake lets you override the generator - we need to check this. - # Can be set with Conda-Build, for example. - cmake_generator = os.environ.get("CMAKE_GENERATOR", "") - build_arch = os.environ.get("BUILD_ARCH", "native") - - # Set Python_EXECUTABLE instead if you use PYBIND11_FINDPYTHON - # EXAMPLE_VERSION_INFO shows you how to pass a value into the C++ code - # from Python. - cmake_args = [ - f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}", - f"-DPYTHON_EXECUTABLE={sys.executable}", - f"-DCMAKE_BUILD_TYPE={cfg}", # not used on MSVC, but no harm - f"-DCOMPILE_PYTHON=ON", - f"-DSSPLIT_USE_INTERNAL_PCRE2=ON", - f"-DBUILD_ARCH={build_arch}", - ] - - build_args = ["-t", "_bergamot"] - # Adding CMake arguments set as environment variable - # (needed e.g. to build for ARM OSx on conda-forge) - if "CMAKE_ARGS" in os.environ: - cmake_args += [item for item in os.environ["CMAKE_ARGS"].split(" ") if item] - - # In this example, we pass in the version to C++. You might not need to. - cmake_args += [f"-DEXAMPLE_VERSION_INFO={self.distribution.get_version()}"] - - use_ccache = os.environ.get("USE_CCACHE", "0") == "1" - if use_ccache: - cmake_args += [ - f"-DCMAKE_CXX_COMPILER_LAUNCHER=ccache", - f"-DCMAKE_C_COMPILER_LAUNCHER=ccache", - ] - - if self.compiler.compiler_type != "msvc": - # Using Ninja-build since it a) is available as a wheel and b) - # multithreads automatically. 
MSVC would require all variables be - # exported for Ninja to pick it up, which is a little tricky to do. - # Users can override the generator with CMAKE_GENERATOR in CMake - # 3.15+. - if not cmake_generator: - try: - import ninja # noqa: F401 - - cmake_args += ["-GNinja"] - except ImportError: - pass - - else: - # Single config generators are handled "normally" - single_config = any(x in cmake_generator for x in {"NMake", "Ninja"}) - - # CMake allows an arch-in-generator style for backward compatibility - contains_arch = any(x in cmake_generator for x in {"ARM", "Win64"}) - - # Specify the arch if using MSVC generator, but only if it doesn't - # contain a backward-compatibility arch spec already in the - # generator name. - if not single_config and not contains_arch: - cmake_args += ["-A", PLAT_TO_CMAKE[self.plat_name]] - - # Multi-config generators have a different way to specify configs - if not single_config: - cmake_args += [ - f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_{cfg.upper()}={extdir}" - ] - build_args += ["--config", cfg] - - if sys.platform.startswith("darwin"): - # Cross-compile support for macOS - respect ARCHFLAGS if set - archs = re.findall(r"-arch (\S+)", os.environ.get("ARCHFLAGS", "")) - if archs: - cmake_args += ["-DCMAKE_OSX_ARCHITECTURES={}".format(";".join(archs))] - - # Set CMAKE_BUILD_PARALLEL_LEVEL to control the parallel build level - # across all generators. - if "CMAKE_BUILD_PARALLEL_LEVEL" not in os.environ: - # self.parallel is a Python 3 only way to set parallel jobs by hand - # using -j in the build_ext call, not supported by pip or PyPA-build. - if hasattr(self, "parallel") and self.parallel: - # CMake 3.12+ only. 
- build_args += [f"-j{self.parallel}"] - - if not os.path.exists(self.build_temp): - os.makedirs(self.build_temp) - - print("cmake", ext.sourcedir, " ".join(cmake_args)) - - subprocess.check_call( - ["cmake", ext.sourcedir] + cmake_args, cwd=self.build_temp - ) - subprocess.check_call( - ["cmake", "--build", "."] + build_args, cwd=self.build_temp - ) - - -here = os.path.abspath(os.path.dirname(__file__)) - -# Import the README and use it as the long-description. -# Note: this will only work if 'README.md' is present in your MANIFEST.in file! -long_description = "" -with io.open(os.path.join(here, "bindings/python/README.md"), encoding="utf-8") as f: - long_description = "\n" + f.read() - -version = None -with open(os.path.join(here, "BERGAMOT_VERSION")) as f: - version = f.read().strip() - suffix = os.environ.get("PYTHON_LOCAL_VERSION_IDENTIFIER", None) - if suffix: - version = "{}+{}".format(version, suffix) - - -class UploadCommand(Command): - """Support setup.py upload.""" - - description = "Build and publish the package." - user_options = [] - - @staticmethod - def status(s): - """Prints things in bold.""" - print("\033[1m{0}\033[0m".format(s)) - - def initialize_options(self): - pass - - def finalize_options(self): - pass - - def run(self): - try: - self.status("Removing previous builds…") - rmtree(os.path.join(here, "dist")) - except OSError: - pass - - self.status("Building Source and Wheel (universal) distribution…") - os.system("{0} setup.py sdist bdist_wheel --universal".format(sys.executable)) - - self.status("Pushing git tags…") - os.system("git push --tags") - - self.status("Uploading the package to PyPI via Twine…") - os.system("twine upload dist/*") - - sys.exit() - - -class build_py(_build_py): - def run(self): - self.run_command("build_ext") - return super().run() - - -# The information here can also be placed in setup.cfg - better separation of -# logic and declaration, and simpler if you include description/version in a file. 
-setup( - name="bergamot", - version=version, - author="Jerin Philip", - author_email="jerinphilip@live.in", - url="https://github.com/browsermt/bergamot-translator/", - description="Translate text-content locally in your machine across languages.", - long_description=long_description, - long_description_content_type="text/markdown", - ext_modules=[CMakeExtension("bergamot/_bergamot")], - cmdclass={"build_py": build_py, "build_ext": CMakeBuild}, - zip_safe=False, - extras_require={"test": ["pytest>=6.0"]}, - license_files=("LICENSE",), - python_requires=">=3.6", - packages=["bergamot"], - package_dir={"bergamot": "bindings/python"}, - install_requires=["requests", "pyyaml>=5.1", "appdirs"], - entry_points={ - "console_scripts": [ - "bergamot = bergamot.__main__:main", - ], - }, - # Classifiers help users find your project by categorizing it. - # - # For a list of valid classifiers, see https://pypi.org/classifiers/ - classifiers=[ # Optional - # How mature is this project? Common values are - # 3 - Alpha - # 4 - Beta - # 5 - Production/Stable - "Development Status :: 3 - Alpha", - # Indicate who your project is intended for - "Intended Audience :: Developers", - "Topic :: Software Development :: Build Tools", - # Pick your license as you wish - "License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)", - # Specify the Python versions you support here. In particular, ensure - # that you indicate you support Python 3. These classifiers are *not* - # checked by 'pip install'. See instead 'python_requires' below.
- "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.6", - "Programming Language :: Python :: 3.7", - "Programming Language :: Python :: 3.8", - "Programming Language :: Python :: 3.9", - "Programming Language :: Python :: 3.10", - "Programming Language :: Python :: 3 :: Only", - ], - project_urls={ - "Bug Reports": "https://github.com/browsermt/bergamot-translator/issues", - "Source": "https://github.com/browsermt/bergamot-translator/", - "Documentation": "https://browser.mt/docs/main/python.html", - }, -) From 1019fdd06df0ddca950feea8d1627aa8eb06bdf6 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:02:24 -0500 Subject: [PATCH 428/442] Remove unneeded CLI code --- inference-engine/CMakeLists.txt | 4 +-- inference-engine/app/CMakeLists.txt | 2 -- inference-engine/app/bergamot.cpp | 41 ----------------------------- 3 files changed, 1 insertion(+), 46 deletions(-) delete mode 100644 inference-engine/app/CMakeLists.txt delete mode 100644 inference-engine/app/bergamot.cpp diff --git a/inference-engine/CMakeLists.txt b/inference-engine/CMakeLists.txt index da01c6048..febff3e6e 100644 --- a/inference-engine/CMakeLists.txt +++ b/inference-engine/CMakeLists.txt @@ -60,7 +60,7 @@ endif() if(MSVC) add_definitions(-DUSE_SSE2=1) # Supposed to fix something in the sse_mathfun.h but not sure it does set(INTRINSICS ${MSVC_BUILD_ARCH}) # ARCH we're targeting on win32. @TODO variable - + set(CMAKE_CXX_FLAGS "/EHsc /DWIN32 /D_WINDOWS /DUNICODE /D_UNICODE /D_CRT_NONSTDC_NO_WARNINGS /D_CRT_SECURE_NO_WARNINGS /bigobj") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS} /MT /O2 ${INTRINSICS} /MP /GL /DNDEBUG") set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS} /MTd /Od /Ob0 ${INTRINSICS} /RTC1 /Zi /D_DEBUG") @@ -179,8 +179,6 @@ add_subdirectory(src) if(COMPILE_WASM) add_subdirectory(wasm) -else() - add_subdirectory(app) endif(COMPILE_WASM) option(COMPILE_PYTHON "Compile python bindings.
Intended to be activated with setup.py" OFF) diff --git a/inference-engine/app/CMakeLists.txt b/inference-engine/app/CMakeLists.txt deleted file mode 100644 index b5c6a433b..000000000 --- a/inference-engine/app/CMakeLists.txt +++ /dev/null @@ -1,2 +0,0 @@ -add_executable(bergamot bergamot.cpp) -target_link_libraries(bergamot PRIVATE bergamot-translator) diff --git a/inference-engine/app/bergamot.cpp b/inference-engine/app/bergamot.cpp deleted file mode 100644 index 195e167b1..000000000 --- a/inference-engine/app/bergamot.cpp +++ /dev/null @@ -1,41 +0,0 @@ -#include "translator/byte_array_util.h" -#include "translator/parser.h" -#include "translator/response.h" -#include "translator/response_options.h" -#include "translator/service.h" -#include "translator/utils.h" - -int main(int argc, char *argv[]) { - using namespace marian::bergamot; - ConfigParser configParser("Bergamot CLI", /*multiOpMode=*/false); - configParser.parseArgs(argc, argv); - auto &config = configParser.getConfig(); - - AsyncService service(config.serviceConfig); - - // Construct a model. - auto options = parseOptionsFromFilePath(config.modelConfigPaths.front()); - - std::shared_ptr<TranslationModel> model = service.createCompatibleModel(options); - - ResponseOptions responseOptions; - std::string input = readFromStdin(); - - // Create a barrier using future/promise. - std::promise<Response> promise; - std::future<Response> future = promise.get_future(); - auto callback = [&promise](Response &&response) { - // Fulfill promise. - promise.set_value(std::move(response)); - }; - - service.translate(model, std::move(input), callback, responseOptions); - - // Wait until promise sets the response. - Response response = future.get(); - - // Print (only) translated text.
- std::cout << response.target.text; - - return 0; -} From b6906d9df1b3a0d5439a85bd9fa5cacba54d3f71 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:02:41 -0500 Subject: [PATCH 429/442] Remove unneeded doc code --- inference-engine/doc/.gitignore | 4 - inference-engine/doc/CI.md | 22 -- inference-engine/doc/README.md | 51 ----- inference-engine/doc/Unified_API.md | 212 -------------------- inference-engine/doc/_static/css/custom.css | 4 - inference-engine/doc/conf.py | 126 ------------ inference-engine/doc/index.rst | 40 ---- inference-engine/doc/make.bat | 35 ---- inference-engine/doc/marian-integration.rst | 97 --------- inference-engine/doc/python.rst | 87 -------- inference-engine/doc/references.bib | 0 inference-engine/doc/requirements.txt | 9 - inference-engine/doc/wasm-example.md | 1 - 13 files changed, 688 deletions(-) delete mode 100644 inference-engine/doc/.gitignore delete mode 100644 inference-engine/doc/CI.md delete mode 100644 inference-engine/doc/README.md delete mode 100644 inference-engine/doc/Unified_API.md delete mode 100644 inference-engine/doc/_static/css/custom.css delete mode 100644 inference-engine/doc/conf.py delete mode 100644 inference-engine/doc/index.rst delete mode 100644 inference-engine/doc/make.bat delete mode 100644 inference-engine/doc/marian-integration.rst delete mode 100644 inference-engine/doc/python.rst delete mode 100644 inference-engine/doc/references.bib delete mode 100644 inference-engine/doc/requirements.txt delete mode 120000 inference-engine/doc/wasm-example.md diff --git a/inference-engine/doc/.gitignore b/inference-engine/doc/.gitignore deleted file mode 100644 index 4d192b770..000000000 --- a/inference-engine/doc/.gitignore +++ /dev/null @@ -1,4 +0,0 @@ -api -build -doxygen -venv diff --git a/inference-engine/doc/CI.md b/inference-engine/doc/CI.md deleted file mode 100644 index 2f29b02c1..000000000 --- a/inference-engine/doc/CI.md +++ /dev/null @@ -1,22 +0,0 @@ -# Continuous Integration - -[Circle
CI](https://circleci.com/) is used for continuous integration. Configured via `./.circleci/config.yml`. - -## Run Circle CI locally (requires Docker) - -1. [Install the CircleCI local cli](https://circleci.com/docs/2.0/local-cli/#installation) -2. Validate Circle CI configuration (useful exercise before pushing any changes to the configuration) - -```shell -circleci config validate -c .circleci/config.yml -``` - -3. To better mimic the starting point for CI, commit your changes and clone your repository into a clean directory then run CircleCI inside that directory: - -```shell -git clone . /tmp/$(basename $PWD) -cd /tmp/$(basename $PWD) -circleci build -``` - -Note: Steps related to caching and uploading/storing artifacts will report as failed locally. This is not necessarily a problem, they are designed to fail since the operations are not supported locally by the CircleCI build agent. diff --git a/inference-engine/doc/README.md b/inference-engine/doc/README.md deleted file mode 100644 index 87d86ba1c..000000000 --- a/inference-engine/doc/README.md +++ /dev/null @@ -1,51 +0,0 @@ -# Marian NMT code documentation and library API - -This directory contains code documentation and library API for developers of Marian NMT. - -The documentation is generated using -[Sphinx](https://www.sphinx-doc.org/en/master/usage/quickstart.html) + -[Breathe](https://breathe.readthedocs.io/en/latest/directives.html) + -[Doxygen](http://www.doxygen.nl/manual/docblocks.html) + -[Exhale](https://exhale.readthedocs.io/en/latest/usage.html). -The documentation source code is written in `.rst` or `.md` files with special directives that allow -to reference to C++ source code and documentation. The source documents are then built into static -HTML pages.
- - -## Installation - -On Ubuntu 20.04, install the following packages: - - sudo apt-get install python3 python3-pip python3-setuptools doxygen - -Then set up a Python environment and install modules: - - pip3 install virtualenv - virtualenv venv -p python3 - source venv/bin/activate - pip install -r requirements.txt - -Documentation building should also work on Windows, but it has not been tested. - - -## Generation - -The documentation can be generated by running: - - make html - -The website will be generated into `build/html` and accessible by opening _index.html_ in your -browser. - -Directories: - -- `build` - automatically output directory for HTML documentation -- `doxygen` - automatically generated Doxygen XML files -- `api` - automatic library API generated with Exhale -- `.rst` and `.md` files in this directory and its subdirectories are documentation source files -- `_static` - custom CSS and JavaScript files - - -## Writing documentation - -To be documented... diff --git a/inference-engine/doc/Unified_API.md b/inference-engine/doc/Unified_API.md deleted file mode 100644 index e6a14301b..000000000 --- a/inference-engine/doc/Unified_API.md +++ /dev/null @@ -1,212 +0,0 @@ -# Unified (C++) API of Bergamot Translator - -/* A Translation model interface for translating a plain utf-8 encoded text (without any markups and emojis). The model supports translation from 1 source language to 1 target language. There can be different implementations of this interface. */ - -class **AbstractTranslationModel** { - - public: - - AbstractTranslationModel(); - - virtual ~AbstractTranslationModel() {}; - - /* This method performs translation on a list of (utf-8) texts and returns a list of results in the same order. Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text (without any markups or emojis). 
Additional information related to the translated text can be requested via TranslationRequest which is applied equally to each text entry. The translated text corresponding to each text entry and the additional information (as specified in the TranslationRequest) is encapsulated and returned in TranslationResult. - The API splits each text entry into sentences internally, which are then translated independent of each other. The translated sentences are then joined together and returned in TranslationResult. - Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check isAlignmentSupported() API). - */ - virtual std::vector> translate(std::vector texts, TranslationRequest request) = 0; - - /* Check if the model can provide alignment information b/w original and translated text. */ - virtual bool isAlignmentSupported() const = 0; -} - -/* This class specifies the additional information related to the translated text (e.g. quality of the translation etc.) that can be requested to be included in the TranslationResult. These optional requests are set/unset independent of each other i.e. setting any one of them doesn’t have the side effect of setting any of the others. */ - -class **TranslationRequest** { - - private: - - // Optional request. The granularity for which Quality scores of the translated text will be included in TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included. - QualityScoreGranularity qualityScore = QualityScoreGranularity::NONE; - - // Optional request. The type of the alignment b/w original and translated text that will be included in TranslationResult. By default (AlignmentType::NONE), alignment is not included. - AlignmentType alignmentType = AlignmentType::NONE; - - // Optional request. A true/false value will include/exclude the original text in the TranslationResult. 
By default (false), the original text is not included. - bool includeOriginalText = false; - - // Optional request. A true/false value will include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included. - bool includeSentenceMapping = false; - - public: - - explicit TranslationRequest(); - - ~TranslationRequest(); - - /* Set the granularity for which the Quality scores of translated text should be included in the TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included. */ - void setQualityScoreGranularity(QualityScoreGranularity granularity); - - /* Set the type of Alignment b/w original and translated text to be included in the TranslationResult. By default (AlignmentType::NONE), alignment is not included. */ - void setAlignmentType(AlignmentType alignmentType); - - /* Set to true/false to include/exclude the original text in the TranslationResult. By default (false), the original text is not included. */ - void includeOriginalText(bool originalText); - - /* Set to true/false to include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included. */ - void includeSentenceMapping(bool sentenceMapping); - - /* Return the granularity for which the Quality scores of the translated text will be included in TranslationResult. QualityScoreGranularity::NONE means the scores will not be included. */ - QualityScoreGranularity getQualityScoreGranularity() const; - - /* Return the type of Alignment b/w original and translated text that should be included in the TranslationResult. AlignmentType::NONE means the alignment will not be included. 
*/ - AlignmentType getAlignmentType() const; - - /* Return whether the original text should be included in the TranslationResult. False means the original text will not be included. */ - bool includeOriginalText() const; - - /* Return whether the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text should be included in the TranslationResult. False means this information will not be included. */ - bool includeSentenceMapping() const; -} - -/* This class represents the result of translation on a TranslationRequest. */ - -class **TranslationResult** { - - private: - - // Original text (utf-8) that was supposed to be translated; An optional result (it will be an empty string if not requested in TranslationRequest). - std::string originalText; - - // Translation (in utf-8 format) of the originalText - std::string translatedText; - - // Quality score of the translated text at the granularity specified in TranslationRequest; An optional result (it will have no information if not requested in TranslationRequest) - QualityScore qualityScore; - - // Alignment information b/w original and translated text for AlignmentType specified in TranslationRequest; An optional result (it will have no information if not requested in TranslationRequest) - Alignment alignment; - - // Information regarding how individual sentences of originalText map to corresponding translated sentences - // in joined translated text (translatedText); An optional result (it will be empty if not requested in TranslationRequest); - // An example: - // originalText (contains 2 sentences) = "What is your name? My name is Abc." - // translatedText (contains 2 translated sentences) = "Was ist dein Name? Mein Name ist Abc." 
- // sentenceMappings = [ - // {"What is your name?", "Was ist dein Name?"}, // A pair of Sentence 1 of originalText (originalText[0]) and corresponding translated sentence in translatedText (translatedText[0]) - // {"My name is Abc", "Mein Name ist Abc."} // A pair of Sentence 2 of originalText (originalText[1]) and corresponding translated sentence in translatedText (translatedText[1]) - // ] - // - std::vector<std::pair<std::string_view, std::string_view>> sentenceMappings; - - public: - // ToDo: Public Methods -} - -/* This class encapsulates the configuration that is required by a translation model to perform translation. This configuration includes a path to the model file, source language vocabulary file, target language vocabulary file along with other options. */ - -class **TranslationModelConfiguration** { - - private: - - // Path to the translation model file - const std::string modelPath; - - // Path to the source vocabulary file to be used by the model - const std::string sourceLanguageVocabPath; - - // Path to the target vocabulary file to be used by the model - const std::string targetLanguageVocabPath; - - // ToDo: Add all possible user configurable options (e.g.
min batch size, max batch size) that are relevant for translation - - public: - - // Provide the path to the model file along with the source and target vocabulary files - TranslationModelConfiguration(const std::string& modelFilePath, - const std::string& sourceVocabPath, - const std::string& targetVocabPath); - - // Return the path of the model file - const std::string& getModelFilePath() const; - - // Return the path of the source language vocabulary file - const std::string& getSourceVocabularyPath() const; - - // Return the path of the target language vocabulary file - const std::string& getTargetVocabularyPath() const; -} - -// All possible granularities for which Quality Scores can be returned for translated (utf-8) text - -enum class QualityScoreGranularity { - - WORD, - SENTENCE, - NONE, -} - -// All possible supported alignment types between a text and its translation - -enum class AlignmentType { - - SOFT, - NONE, -} - -// This class represents the Quality Scores for various spans of the translated text at a specific granularity - -class QualityScore { - - private: - - // Sections of a text for the Quality Scores - std::vector<std::string_view> textViews; - - // Quality Scores corresponding to each section of the text in textViews in the same order - std::vector<float> textScores; - - // Granularity of the text for the Quality scores above - QualityScoreGranularity textGranularity; - - public: - // ToDo: Public Methods -} - -// This class encapsulates a translated text, all the sections of the original text that align to this translated text and the corresponding alignments for each of these sections of original text.
- -class Alignment { - - private: - - // A list of sections of a translated text - // An example: originalText = "What do you need" - // translatedText = "Was brauchst du" - // translatedTextViews = ["Was ", "brauchst", "du"] - std::vector<std::string_view> translatedTextViews; - - // Each ith entry of this container corresponds to a list of all the sections of the original text that align to the ith entry of translatedTextView - // For the example above: - // translatedTextViews = ["Was ", "brauchst", "du"] - // originalTextViews = [ - // ["What"], // originalTextViews[0] = All sections of original text that align with translatedTextViews[0] i.e. "Was" - // ["you", "need"], // originalTextViews[1] = All sections of original text that align with translatedTextViews[1] i.e. "brauchst" - // ["you"] // originalTextViews[2] = All sections of original text that align with translatedTextViews[2] i.e. "du" - // ] - std::vector<std::vector<std::string_view>> originalTextViews; - - // Each ith entry of this container corresponds to the alignments of all the sections of the original text (ith entry of originalTextViews) that align to the ith entry of translatedTextViews - // For the example above: - // alignments = [ - // [0.90], // alignments[0] = Alignments of all sections of original text (i.e. originalTextViews[0]) to translatedTextViews[0] i.e. "Was" - // [0.3, 0.7], // alignments[1] = Alignments of all sections of original text (i.e. originalTextViews[1]) to translatedTextViews[1] i.e. "brauchst" - // [0.9] // alignments[2] = Alignments of all sections of original text (i.e. originalTextViews[2]) to translatedTextViews[2] i.e.
"du" - // ] - std::vector<std::vector<float>> alignments; - - // Type of the alignment b/w original and translated text above - AlignmentType alignmentType; - - public: - // ToDo: Public Methods -} diff --git a/inference-engine/doc/_static/css/custom.css b/inference-engine/doc/_static/css/custom.css deleted file mode 100644 index 8352655e1..000000000 --- a/inference-engine/doc/_static/css/custom.css +++ /dev/null @@ -1,4 +0,0 @@ -.wy-body-for-nav > .wy-grid-for-nav > .wy-nav-side { - border-bottom: 5px solid #28bbee; - /*background-color: #494d55;*/ -} diff --git a/inference-engine/doc/conf.py b/inference-engine/doc/conf.py deleted file mode 100644 index a86f4cbea..000000000 --- a/inference-engine/doc/conf.py +++ /dev/null @@ -1,126 +0,0 @@ -# Configuration file for the Sphinx documentation builder. -# -# This file only contains a selection of the most common options. For a full -# list see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# -- Path setup -------------------------------------------------------------- - -import datetime - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here.
-# -import os -import sys - -sys.path.insert(0, os.path.abspath(".")) - - -# -- Project information ----------------------------------------------------- - -project = "Bergamot Translator" -copyright = "2021-2022 Bergamot Translator Team" -author = "Bergamot Translator Team" - -# The full version, including alpha/beta/rc tags -# TODO: add GitHub commit hash to the version -version_file = os.path.join( - os.path.dirname(os.path.dirname(__file__)), "BERGAMOT_VERSION" -) -with open(os.path.abspath(version_file)) as f: - version = f.read().strip() -release = version + " " + str(datetime.date.today()) - - -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - "sphinx.ext.mathjax", - "sphinx.ext.todo", - "breathe", - "exhale", - "recommonmark", - "sphinx.ext.autodoc", - "sphinxarg.ext", -] - -# Add any paths that contain templates here, relative to this directory. -templates_path = ["_templates"] - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This pattern also affects html_static_path and html_extra_path. -exclude_patterns = [ - "build", - "doxygen", - "venv", - "README.md", -] - - -# -- Options for HTML output ------------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = "sphinx_rtd_theme" -htmlhelp_basename = "bergamot-translator" - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". 
-html_static_path = ["_static"] -html_css_files = ["css/custom.css"] - -# The base URL which points to the root of the HTML documentation -html_baseurl = "https://browser.mt/docs" - - -# -- Extension configuration ------------------------------------------------- - -breathe_projects = {"bergamot-translator": "./doxygen/xml"} -breathe_default_project = "bergamot-translator" - -doxygen_config = """ -INPUT = ../src ../app -EXCLUDE += ../3rd_party -EXCLUDE += ../src/tests -EXCLUDE_PATTERNS = *.md *.txt -FILE_PATTERNS += *.cu -EXTENSION_MAPPING += cu=C++ inc=C++ -ENABLE_PREPROCESSING = YES -JAVADOC_AUTOBRIEF = YES -WARN_IF_UNDOCUMENTED = NO -""" - -exhale_args = { - "containmentFolder": "./api", - "rootFileName": "library_index.rst", - "rootFileTitle": "Library API", - "doxygenStripFromPath": "..", - "createTreeView": True, - "exhaleExecutesDoxygen": True, - "exhaleDoxygenStdin": doxygen_config.strip(), -} - -primary_domain = "cpp" -highlight_language = "cpp" - -# A trick to include markdown files from outside the source directory using -# 'mdinclude'. Warning: all other markdown files not included via 'mdinclude' -# will be rendered using recommonmark as recommended by Sphinx -from m2r import MdInclude - - -def setup(app): - # from m2r to make `mdinclude` work - app.add_config_value("no_underscore_emphasis", False, "env") - app.add_config_value("m2r_parse_relative_links", False, "env") - app.add_config_value("m2r_anonymous_references", False, "env") - app.add_config_value("m2r_disable_inline_math", False, "env") - app.add_directive("mdinclude", MdInclude) diff --git a/inference-engine/doc/index.rst b/inference-engine/doc/index.rst deleted file mode 100644 index 54dc1e8dc..000000000 --- a/inference-engine/doc/index.rst +++ /dev/null @@ -1,40 +0,0 @@ -Welcome to Bergamot Translator's documentation! 
-=============================================== - -|buildcpu| |tests| |release| |license| - -Bergamot translator provides a unified API for (Marian NMT framework based) -neural machine translation functionality in accordance with the Bergamot -project that focuses on improving client-side machine translation in a web -browser. - -This is developer documentation. - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - marian-integration - wasm-example - api/library_index - python - - - -Indices and tables ------------------- - -* :ref:`genindex` - - -.. |buildcpu| image:: https://img.shields.io/jenkins/s/http/vali.inf.ed.ac.uk/jenkins/view/browsermt/job/bergamot-translator.svg?label=CPU%20Build - :target: http://vali.inf.ed.ac.uk/jenkins/job/bergamot-translator - :alt: CPU build status - -.. |tests| image:: https://img.shields.io/jenkins/s/http/vali.inf.ed.ac.uk/jenkins/view/marian/job/bergamot-translator-regression-tests.svg?label=Tests - :target: http://vali.inf.ed.ac.uk/jenkins/job/bergamot-translator-regression-tests/ - :alt: Tests status - -.. |license| image:: https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg - :target: https://opensource.org/licenses/MPL-2.0 - :alt: License: MPL diff --git a/inference-engine/doc/make.bat b/inference-engine/doc/make.bat deleted file mode 100644 index 6247f7e23..000000000 --- a/inference-engine/doc/make.bat +++ /dev/null @@ -1,35 +0,0 @@ -@ECHO OFF - -pushd %~dp0 - -REM Command file for Sphinx documentation - -if "%SPHINXBUILD%" == "" ( - set SPHINXBUILD=sphinx-build -) -set SOURCEDIR=source -set BUILDDIR=build - -if "%1" == "" goto help - -%SPHINXBUILD% >NUL 2>NUL -if errorlevel 9009 ( - echo. - echo.The 'sphinx-build' command was not found. Make sure you have Sphinx - echo.installed, then set the SPHINXBUILD environment variable to point - echo.to the full path of the 'sphinx-build' executable. Alternatively you - echo.may add the Sphinx directory to PATH. - echo. 
- echo.If you don't have Sphinx installed, grab it from - echo.http://sphinx-doc.org/ - exit /b 1 -) - -%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% -goto end - -:help -%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% - -:end -popd diff --git a/inference-engine/doc/marian-integration.rst b/inference-engine/doc/marian-integration.rst deleted file mode 100644 index 756e0a810..000000000 --- a/inference-engine/doc/marian-integration.rst +++ /dev/null @@ -1,97 +0,0 @@ -Bergamot C++ Library -==================== - -This document contains instructions for developing modifications on top -of the Marian machine translation toolkit powering bergamot-translator. -The library is optimized towards fast and efficient translation of a -given input. - -Build Instructions ------------------- - -Note: You are strongly advised to refer to the continuous integration on -this repository, which builds bergamot-translator and associated -applications from scratch. Examples to run these command -line-applications are available in the -`bergamot-translator-tests `__ -repository. Builds take about 30 minutes on a consumer-grade machine, so -using a tool like ccache is highly recommended. - -Dependencies -~~~~~~~~~~~~ - -Marian CPU version requires Intel MKL or OpenBLAS. Both are free, but -MKL is not open-sourced. Intel MKL is strongly recommended as it is -faster. On Ubuntu 16.04 and newer it can be installed from the APT -repositories. - -.. code:: bash - - wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB' | sudo apt-key add - - sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' - sudo apt-get update - sudo apt-get install intel-mkl-64bit-2020.0-088 - -On macOS, the Apple Accelerate framework will be used instead of -MKL/OpenBLAS.
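The preference order described above (MKL where available, Accelerate on macOS, OpenBLAS as the free fallback) can be sketched as a small shell helper. This is only an illustration: the `pick_blas` function is invented for this sketch, and the package name it probes is taken from the apt instructions above; the actual backend selection is done by CMake at configure time.

```shell
# Sketch: mirror the documented BLAS preference order for a Marian CPU build.
# pick_blas is a hypothetical helper, not part of the build system.
pick_blas() {
  case "$(uname -s)" in
    Darwin)
      # macOS uses the Apple Accelerate framework.
      echo "accelerate"
      ;;
    Linux)
      # Prefer MKL when the package from the apt instructions is installed.
      if dpkg -s intel-mkl-64bit-2020.0-088 >/dev/null 2>&1; then
        echo "mkl"
      else
        # Free fallback when MKL is absent.
        echo "openblas"
      fi
      ;;
    *)
      echo "openblas"
      ;;
  esac
}

pick_blas
```

The sketch only reports which backend the documentation recommends for the current host; CMake performs the real detection during configuration.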
- -Building bergamot-translator -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Web Assembly (WASM) reduces building to only using a subset of -functionalities of marian, the translation library powering -bergamot-translator. When developing bergamot-translator it is important -that the sources added be compatible with marian. Therefore, it is -required to set ``-DUSE_WASM_COMPATIBLE_SOURCE=on``. - -:: - - $ git clone https://github.com/browsermt/bergamot-translator - $ cd bergamot-translator - $ mkdir build - $ cd build - $ cmake .. -DUSE_WASM_COMPATIBLE_SOURCE=off -DCMAKE_BUILD_TYPE=Release - $ make -j2 - -The build will generate the library that can be linked to any project. -All the public header files are specified in ``src`` folder. - -Command line apps ------------------ - -bergamot-translator is intended to be used as a library. However, we -provide a command-line application which is capable of translating text -provided on standard-input. During development this application is used -to perform regression-tests. - - -Example command line run ------------------------- - -The models required to run the command-line are available at -`data.statmt.org/bergamot/models/ `__. - -The following example uses an English to German tiny11 student model, -available at: - -- `data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz `__ - -.. literalinclude:: ../examples/run-native.sh - :language: bash - -Coding Style ------------- - -This repository contains C++ and JS source-files, of which C++ should -adhere to the clang-format based style guidelines. You may configure -your development environment to use the ``.clang-format`` and -``.clang-format-ignore`` files provided in the root folder of this -repository with your preferred choice of editor/tooling. - -One simple and recommended method to get your code to adhere to this -style is to issue the following command in the source-root of this -repository, which is used to also check for the coding style in the CI. - -.. 
code:: bash - - python3 run-clang-format.py -i --style file -r src wasm diff --git a/inference-engine/doc/python.rst b/inference-engine/doc/python.rst deleted file mode 100644 index 0426f349f..000000000 --- a/inference-engine/doc/python.rst +++ /dev/null @@ -1,87 +0,0 @@ -.. Bergamot documentation master file, created by - sphinx-quickstart on Tue Jan 18 17:26:57 2022. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Python -======= - -.. toctree:: - :maxdepth: 3 - :caption: Contents: - - -This document describes python bindings from bergamot-translator and a -batteries included python package supplied for easy use. The library also -provides entry point via a command-line making it easier for the average user -to get started. - -As bergamot-translator is built on top of marian, the python API should also -work as python bindings for marian trained models, if they need to be -integrated into python code-bases. - -*Disclaimer*: The package is still in early stages and unstable. Functions and -classes might move around quite fast. Use at your own risk. - -Command Line Interface ----------------------- - -.. argparse:: - :ref: bergamot.cmds.make_parser - :prog: bergamot - - -Module Documentation --------------------- - -.. automodule:: bergamot - :members: - :undoc-members: - -bergamot-translator -+++++++++++++++++++ - -The following components are exported from C++ via python-bindings and form -library primitives that can be used to build translation workflows. - -.. autoclass:: bergamot.ServiceConfig - :members: - :undoc-members: - -.. autoclass:: bergamot.Service - :members: - :undoc-members: - - -.. autoclass:: bergamot.TranslationModel - :members: - :undoc-members: - -.. autoclass:: bergamot.ResponseOptions - :members: - :undoc-members: - -Model Inventory -+++++++++++++++ - -.. autoclass:: bergamot.repository.Repository - :members: - :undoc-members: - -.. 
autoclass:: bergamot.repository.TranslateLocallyLike - :members: - :undoc-members: - -Utilities -+++++++++ - -.. autofunction:: bergamot.utils.patch_marian_for_bergamot - - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/inference-engine/doc/references.bib b/inference-engine/doc/references.bib deleted file mode 100644 index e69de29bb..000000000 diff --git a/inference-engine/doc/requirements.txt b/inference-engine/doc/requirements.txt deleted file mode 100644 index 778f08914..000000000 --- a/inference-engine/doc/requirements.txt +++ /dev/null @@ -1,9 +0,0 @@ -sphinx==2.4.4 -breathe==4.13.0 -Jinja2==3.0.3 -exhale -sphinx_rtd_theme -mistune<2.0.0 -recommonmark -m2r -sphinx-argparse diff --git a/inference-engine/doc/wasm-example.md b/inference-engine/doc/wasm-example.md deleted file mode 120000 index 9188e9356..000000000 --- a/inference-engine/doc/wasm-example.md +++ /dev/null @@ -1 +0,0 @@ -../wasm/README.md \ No newline at end of file From 00f4a30984c62fb58f06c9d1f672b91762c52a5f Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:07:27 -0500 Subject: [PATCH 430/442] Add build-local script to inference-engine --- Taskfile.yml | 6 ++++ inference-engine/.gitignore | 1 + inference-engine/scripts/build-local.sh | 35 +++++++++++++++++++++++ inference-engine/scripts/detect-docker.sh | 19 ++++++++++++ 4 files changed, 61 insertions(+) create mode 100755 inference-engine/scripts/build-local.sh create mode 100755 inference-engine/scripts/detect-docker.sh diff --git a/Taskfile.yml b/Taskfile.yml index c767924e1..44da5e5c4 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -75,6 +75,12 @@ tasks: cmds: - poetry run opuscleaner-server serve --host=0.0.0.0 --port=8000 + inference-engine-build: + desc: Build inference engine. + cmds: + - >- + task docker-run -- ./inference-engine/scripts/build-local.sh + lint-black: desc: Checks the styling of the Python code with Black. 
deps: [poetry-install-black] diff --git a/inference-engine/.gitignore b/inference-engine/.gitignore index 94b32949c..78202d979 100644 --- a/inference-engine/.gitignore +++ b/inference-engine/.gitignore @@ -18,6 +18,7 @@ _deps wasm/test_page/node_modules /build +/build-local /build-native /build-wasm /emsdk diff --git a/inference-engine/scripts/build-local.sh b/inference-engine/scripts/build-local.sh new file mode 100755 index 000000000..97595276f --- /dev/null +++ b/inference-engine/scripts/build-local.sh @@ -0,0 +1,35 @@ +#!/bin/bash +set -e + +# Run script from the context of inference-engine directory +cd "$(dirname $0)/.." + +# Ensure script is running within docker +./scripts/detect-docker.sh inference-engine-build + +# Return the number of available CPUs, or default to 1 if nproc is unavailable. +detect_cpus() { + if command -v nproc >/dev/null 2>&1; then + nproc + else + echo 1 + fi +} + +if [ ! -d "build-local" ]; then + echo "Creating build-local directory..." + mkdir build-local +else + echo "build-local directory already exists. Skipping creation." +fi + +cd build-local || exit + +echo "Running cmake for build-local..." +cmake ../ + +# Run make using the detected number of CPUs +CPUS=$(detect_cpus) +echo "Running make for build-local with $CPUS CPUs..." +make -j ${CPUS} + diff --git a/inference-engine/scripts/detect-docker.sh b/inference-engine/scripts/detect-docker.sh new file mode 100755 index 000000000..c1065349a --- /dev/null +++ b/inference-engine/scripts/detect-docker.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +help_task=$1 + +if [ -z "${IS_DOCKER}" ]; then + if [ "${ALLOW_RUN_ON_HOST}" != "1" ]; then + echo >&2 + echo "Error: This script needs to be run inside Docker, or you must set ALLOW_RUN_ON_HOST=1." >&2 + echo >&2 + if [ -n "${help_task}" ]; then + echo " Help: To run this script directly in docker, run: task ${help_task}" >&2 + fi + echo " Help: To enter docker, run: task docker" >&2 + exit 1 + else + echo >&2 + echo "ALLOW_RUN_ON_HOST is set to 1. 
Continuing..." >&2 + fi +fi From bb47eed7cfa516b90e6038f457b4b345a6f30a5f Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:07:27 -0500 Subject: [PATCH 431/442] Add unit-tests script to inference-engine --- Taskfile.yml | 6 +++ inference-engine/scripts/build-local.sh | 17 ++++++++- inference-engine/scripts/unit-tests.sh | 49 +++++++++++++++++++++++++ 3 files changed, 71 insertions(+), 1 deletion(-) create mode 100755 inference-engine/scripts/unit-tests.sh diff --git a/Taskfile.yml b/Taskfile.yml index 44da5e5c4..a9f52aaae 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -81,6 +81,12 @@ tasks: - >- task docker-run -- ./inference-engine/scripts/build-local.sh + inference-engine-test: + desc: Run inference-engine tests. + cmds: + - >- + task docker-run -- ./inference-engine/scripts/unit-tests.sh + lint-black: desc: Checks the styling of the Python code with Black. deps: [poetry-install-black] diff --git a/inference-engine/scripts/build-local.sh b/inference-engine/scripts/build-local.sh index 97595276f..65e42e761 100755 --- a/inference-engine/scripts/build-local.sh +++ b/inference-engine/scripts/build-local.sh @@ -16,6 +16,16 @@ detect_cpus() { fi } +# Parse command-line arguments for the --test flag +COMPILE_TESTS=OFF +while [[ "$#" -gt 0 ]]; do + case $1 in + "--test") COMPILE_TESTS=ON ;; + *) echo "Unknown parameter passed: $1"; exit 1 ;; + esac + shift +done + if [ ! -d "build-local" ]; then echo "Creating build-local directory..." mkdir build-local @@ -25,8 +35,13 @@ fi cd build-local || exit +# Run cmake with optional COMPILE_TESTS flag echo "Running cmake for build-local..." 
-cmake ../ +if [ "$COMPILE_TESTS" = "ON" ]; then + cmake ../ -DCOMPILE_TESTS=ON +else + cmake ../ +fi # Run make using the detected number of CPUs CPUS=$(detect_cpus) diff --git a/inference-engine/scripts/unit-tests.sh b/inference-engine/scripts/unit-tests.sh new file mode 100755 index 000000000..f4f12e3e1 --- /dev/null +++ b/inference-engine/scripts/unit-tests.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -e + +# Run script from the context of inference-engine directory +cd "$(dirname $0)/.." + +# Ensure script is running within docker +./scripts/detect-docker.sh inference-engine-test + +# Check if build-local/src/tests/units directory exists +if [ ! -d "build-local/src/tests/units" ]; then + echo "Directory build-local/src/tests/units does not exist. Running build." + ./scripts/build-local.sh --test +else + echo "Directory build-local/src/tests/units already exists. Skipping build." +fi + +# Change to the unit tests directory +cd build-local/src/tests/units + +# List of test commands +tests=( + "./run_annotation_tests" + "./run_cache_tests" + "./run_html_tests" + "./run_quality_estimator_tests" + "./run_xh_scanner_tests" +) + +# Run all tests, collect failures +failures=0 + +for test in "${tests[@]}"; do + echo "Running $test..." + if ! $test; then + echo "$test failed!" + failures=$((failures + 1)) + fi +done + +# If any test failed, exit with a non-zero status +if [ $failures -gt 0 ]; then + echo "$failures test(s) failed." + exit 1 +else + echo "All tests passed successfully." 
+ exit 0 +fi + From cf23bf758582c4c8cd564fa5c21ac0772e95aec3 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:07:27 -0500 Subject: [PATCH 432/442] Add clean script to inference-engine --- Taskfile.yml | 6 ++++++ inference-engine/scripts/clean.sh | 29 +++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) create mode 100755 inference-engine/scripts/clean.sh diff --git a/Taskfile.yml b/Taskfile.yml index a9f52aaae..9fe48bdc8 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -75,6 +75,12 @@ tasks: cmds: - poetry run opuscleaner-server serve --host=0.0.0.0 --port=8000 + inference-engine-clean: + desc: Clean build artifacts from the inference-engine directory. + cmds: + - >- + task docker-run -- ./inference-engine/scripts/clean.sh + inference-engine-build: desc: Build inference engine. cmds: diff --git a/inference-engine/scripts/clean.sh b/inference-engine/scripts/clean.sh new file mode 100755 index 000000000..410291705 --- /dev/null +++ b/inference-engine/scripts/clean.sh @@ -0,0 +1,29 @@ +#!/bin/bash +set -e + +# Run script from the context of inference-engine directory +cd "$(dirname $0)/.." + +# Ensure script is running within docker +./scripts/detect-docker.sh inference-engine-clean + +# List of directories to clean +dirs=("build-local" "build-wasm" "emsdk") + +# Flag to track if any directories were cleaned +cleaned=false + +# Check and remove directories +for dir in "${dirs[@]}"; do + if [ -d "$dir" ]; then + echo "Removing $dir..." 
+ rm -rf "$dir" + cleaned=true + fi +done + +# If no directories were cleaned, print a message +if [ "$cleaned" = false ]; then + echo "Nothing to clean" +fi + From 07e3216cd48e079058cdc6343398d48839c235d5 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Thu, 26 Sep 2024 13:07:27 -0500 Subject: [PATCH 433/442] Move build-wasm script to inference-engine/scripts directory --- Taskfile.yml | 6 ++++++ inference-engine/{ => scripts}/build-wasm.sh | 10 +++++++--- 2 files changed, 13 insertions(+), 3 deletions(-) rename inference-engine/{ => scripts}/build-wasm.sh (90%) diff --git a/Taskfile.yml b/Taskfile.yml index 9fe48bdc8..8bf8dddf5 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -93,6 +93,12 @@ tasks: - >- task docker-run -- ./inference-engine/scripts/unit-tests.sh + inference-engine-build-wasm: + desc: Build inference engine WASM. + cmds: + - >- + task docker-run -- ./inference-engine/scripts/build-wasm.sh + lint-black: desc: Checks the styling of the Python code with Black. deps: [poetry-install-black] diff --git a/inference-engine/build-wasm.sh b/inference-engine/scripts/build-wasm.sh similarity index 90% rename from inference-engine/build-wasm.sh rename to inference-engine/scripts/build-wasm.sh index 443907232..4cabed2b5 100755 --- a/inference-engine/build-wasm.sh +++ b/inference-engine/scripts/build-wasm.sh @@ -1,9 +1,13 @@ #!/usr/bin/env bash set -e -set -x -# Run script from the context of the script-containing directory -cd "$(dirname $0)" +# Run script from the context of inference-engine directory +cd "$(dirname $0)/.." 
+ +# Ensure script is running within docker +./scripts/detect-docker.sh inference-engine-build-wasm + +set -x # Prerequisite: Download and Install Emscripten using following instructions (unless the EMSDK env var is already set) if [ "$EMSDK" == "" ]; then From c62bea0890747c588d83b2cf3578a860719e3cc9 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Wed, 25 Sep 2024 16:02:11 -0500 Subject: [PATCH 434/442] Add review groups to CODEOWNERS --- .github/CODEOWNERS | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 26f6f4418..fa4f321cf 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,5 +1,31 @@ +# Firefox Translations review group +.dockerignore @mozilla/firefox-translations +.github @mozilla/firefox-translations +.gitignore @mozilla/firefox-translations +.gitmodules @mozilla/firefox-translations +docker @mozilla/firefox-translations +docs @mozilla/firefox-translations +utils @mozilla/firefox-translations +CODE_OF_CONDUCT.md @mozilla/firefox-translations +LICENSE @mozilla/firefox-translations +poetry.lock @mozilla/firefox-translations +pyproject.toml @mozilla/firefox-translations +README.md @mozilla/firefox-translations +Taskfile.yml @mozilla/firefox-translations + +# Translations Training review group +configs @mozilla/translations-training +pipeline @mozilla/translations-training +snakemake @mozilla/translations-training +tests @mozilla/translations-training +tracking @mozilla/translations-training + +# Translations Inference review group +inference-engine @mozilla/translations-inference + # Taskcluster pipeline related files. Changes to these ought to be reviewed by # RelEng to watch for security issues and best practices. These should also # be reviewed by people familiar with the pipeline itself. 
-.taskcluster.yml @mozilla/releng -taskcluster @mozilla/releng +.taskcluster.yml @mozilla/releng @mozilla/translations-training +taskcluster @mozilla/releng @mozilla/translations-training + From 72b6c9def8ce7ae4ec47e31289f5f41ad295807f Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Mon, 30 Sep 2024 15:32:13 -0500 Subject: [PATCH 435/442] Rename inference-engine to inference --- .gitmodules | 14 ++++++------ Taskfile.yml | 20 +++++++++--------- {inference-engine => inference}/.clang-format | 0 .../.clang-format-ignore | 0 {inference-engine => inference}/.clang-tidy | 0 {inference-engine => inference}/.gitignore | 0 .../3rd_party/CMakeLists.txt | 0 .../3rd_party/browsermt-marian-dev | 0 .../3rd_party/ssplit-cpp | 0 .../BERGAMOT_VERSION | 0 .../CMakeLists.txt | 0 {inference-engine => inference}/Doxyfile.in | 0 {inference-engine => inference}/LICENSE | 0 {inference-engine => inference}/MANIFEST.in | 0 {inference-engine => inference}/README.md | 0 .../cmake/GetVersionFromFile.cmake | 0 .../examples/run-native.sh | 0 .../patches/01-marian-fstream-for-macos.patch | 0 .../scripts/build-local.sh | 4 ++-- .../scripts/build-wasm.sh | 4 ++-- .../scripts/clean.sh | 4 ++-- .../scripts/detect-docker.sh | 0 .../scripts/unit-tests.sh | 4 ++-- .../src/CMakeLists.txt | 0 .../src/tests/CMakeLists.txt | 0 .../src/tests/async.cpp | 0 .../src/tests/blocking.cpp | 0 .../src/tests/common-impl.cpp | 0 .../src/tests/common.h | 0 .../src/tests/intgemm-resolve.cpp | 0 .../src/tests/units/CMakeLists.txt | 0 .../src/tests/units/annotation_tests.cpp | 0 .../src/tests/units/cache_tests.cpp | 0 .../src/tests/units/html_tests.cpp | 0 .../src/tests/units/html_tests.h | 0 .../tests/units/quality_estimator_tests.cpp | 0 .../src/tests/units/quality_estimator_tests.h | 0 .../src/tests/units/run_tests.cpp | 0 .../src/tests/units/xh_scanner_tests.cpp | 0 .../src/tests/wasm.cpp | 0 .../src/translator/CMakeLists.txt | 0 .../translator/aggregate_batching_pool.cpp | 0 
.../src/translator/aggregate_batching_pool.h | 0 .../src/translator/aligned.h | 0 .../src/translator/annotation.cpp | 0 .../src/translator/annotation.h | 0 .../src/translator/batch.cpp | 0 .../src/translator/batch.h | 0 .../src/translator/batching_pool.cpp | 0 .../src/translator/batching_pool.h | 0 .../src/translator/byte_array_util.cpp | 0 .../src/translator/byte_array_util.h | 0 .../src/translator/cache.h | 0 .../src/translator/definitions.h | 0 .../src/translator/html.cpp | 0 .../src/translator/html.h | 0 .../src/translator/logging.h | 0 .../src/translator/parser.cpp | 0 .../src/translator/parser.h | 0 .../src/translator/project_version.h.in | 0 .../src/translator/quality_estimator.cpp | 0 .../src/translator/quality_estimator.h | 0 .../src/translator/request.cpp | 0 .../src/translator/request.h | 0 .../src/translator/response.cpp | 0 .../src/translator/response.h | 0 .../src/translator/response_builder.cpp | 0 .../src/translator/response_builder.h | 0 .../src/translator/response_options.h | 0 .../src/translator/service.cpp | 0 .../src/translator/service.h | 0 .../src/translator/text_processor.cpp | 0 .../src/translator/text_processor.h | 0 .../translator/threadsafe_batching_pool.cpp | 0 .../src/translator/threadsafe_batching_pool.h | 0 .../src/translator/translation_model.cpp | 0 .../src/translator/translation_model.h | 0 .../src/translator/utils.h | 0 .../src/translator/vocabs.h | 0 .../src/translator/xh_scanner.cpp | 0 .../src/translator/xh_scanner.h | 0 .../wasm/CMakeLists.txt | 0 .../wasm/README.md | 0 .../wasm/bindings/response_bindings.cpp | 0 .../bindings/response_options_bindings.cpp | 0 .../wasm/bindings/service_bindings.cpp | 0 .../wasm/import-gemm-module.js | 0 .../wasm/module/README.md | 0 .../wasm/module/main.js | 0 .../wasm/module/package.json | 0 .../wasm/module/translator.js | 0 .../wasm/module/worker/package.json | 0 .../wasm/module/worker/translator-worker.js | 0 .../wasm/node-test.js | 0 .../patch-artifacts-import-gemm-module.sh | 0 
.../wasm/project_version.js.in | 0 .../wasm/test_page/bergamot-httpserver.js | 0 .../wasm/test_page/css/index.css | 0 .../wasm/test_page/index.html | 0 .../wasm/test_page/js/index.js | 0 .../wasm/test_page/logos.png | Bin .../wasm/test_page/package-lock.json | 0 .../wasm/test_page/package.json | 0 .../wasm/test_page/start_server.sh | 0 104 files changed, 24 insertions(+), 26 deletions(-) rename {inference-engine => inference}/.clang-format (100%) rename {inference-engine => inference}/.clang-format-ignore (100%) rename {inference-engine => inference}/.clang-tidy (100%) rename {inference-engine => inference}/.gitignore (100%) rename {inference-engine => inference}/3rd_party/CMakeLists.txt (100%) rename {inference-engine => inference}/3rd_party/browsermt-marian-dev (100%) rename {inference-engine => inference}/3rd_party/ssplit-cpp (100%) rename {inference-engine => inference}/BERGAMOT_VERSION (100%) rename {inference-engine => inference}/CMakeLists.txt (100%) rename {inference-engine => inference}/Doxyfile.in (100%) rename {inference-engine => inference}/LICENSE (100%) rename {inference-engine => inference}/MANIFEST.in (100%) rename {inference-engine => inference}/README.md (100%) rename {inference-engine => inference}/cmake/GetVersionFromFile.cmake (100%) rename {inference-engine => inference}/examples/run-native.sh (100%) rename {inference-engine => inference}/patches/01-marian-fstream-for-macos.patch (100%) rename {inference-engine => inference}/scripts/build-local.sh (89%) rename {inference-engine => inference}/scripts/build-wasm.sh (94%) rename {inference-engine => inference}/scripts/clean.sh (82%) rename {inference-engine => inference}/scripts/detect-docker.sh (100%) rename {inference-engine => inference}/scripts/unit-tests.sh (90%) rename {inference-engine => inference}/src/CMakeLists.txt (100%) rename {inference-engine => inference}/src/tests/CMakeLists.txt (100%) rename {inference-engine => inference}/src/tests/async.cpp (100%) rename {inference-engine => 
inference}/src/tests/blocking.cpp (100%) rename {inference-engine => inference}/src/tests/common-impl.cpp (100%) rename {inference-engine => inference}/src/tests/common.h (100%) rename {inference-engine => inference}/src/tests/intgemm-resolve.cpp (100%) rename {inference-engine => inference}/src/tests/units/CMakeLists.txt (100%) rename {inference-engine => inference}/src/tests/units/annotation_tests.cpp (100%) rename {inference-engine => inference}/src/tests/units/cache_tests.cpp (100%) rename {inference-engine => inference}/src/tests/units/html_tests.cpp (100%) rename {inference-engine => inference}/src/tests/units/html_tests.h (100%) rename {inference-engine => inference}/src/tests/units/quality_estimator_tests.cpp (100%) rename {inference-engine => inference}/src/tests/units/quality_estimator_tests.h (100%) rename {inference-engine => inference}/src/tests/units/run_tests.cpp (100%) rename {inference-engine => inference}/src/tests/units/xh_scanner_tests.cpp (100%) rename {inference-engine => inference}/src/tests/wasm.cpp (100%) rename {inference-engine => inference}/src/translator/CMakeLists.txt (100%) rename {inference-engine => inference}/src/translator/aggregate_batching_pool.cpp (100%) rename {inference-engine => inference}/src/translator/aggregate_batching_pool.h (100%) rename {inference-engine => inference}/src/translator/aligned.h (100%) rename {inference-engine => inference}/src/translator/annotation.cpp (100%) rename {inference-engine => inference}/src/translator/annotation.h (100%) rename {inference-engine => inference}/src/translator/batch.cpp (100%) rename {inference-engine => inference}/src/translator/batch.h (100%) rename {inference-engine => inference}/src/translator/batching_pool.cpp (100%) rename {inference-engine => inference}/src/translator/batching_pool.h (100%) rename {inference-engine => inference}/src/translator/byte_array_util.cpp (100%) rename {inference-engine => inference}/src/translator/byte_array_util.h (100%) rename {inference-engine 
=> inference}/src/translator/cache.h (100%) rename {inference-engine => inference}/src/translator/definitions.h (100%) rename {inference-engine => inference}/src/translator/html.cpp (100%) rename {inference-engine => inference}/src/translator/html.h (100%) rename {inference-engine => inference}/src/translator/logging.h (100%) rename {inference-engine => inference}/src/translator/parser.cpp (100%) rename {inference-engine => inference}/src/translator/parser.h (100%) rename {inference-engine => inference}/src/translator/project_version.h.in (100%) rename {inference-engine => inference}/src/translator/quality_estimator.cpp (100%) rename {inference-engine => inference}/src/translator/quality_estimator.h (100%) rename {inference-engine => inference}/src/translator/request.cpp (100%) rename {inference-engine => inference}/src/translator/request.h (100%) rename {inference-engine => inference}/src/translator/response.cpp (100%) rename {inference-engine => inference}/src/translator/response.h (100%) rename {inference-engine => inference}/src/translator/response_builder.cpp (100%) rename {inference-engine => inference}/src/translator/response_builder.h (100%) rename {inference-engine => inference}/src/translator/response_options.h (100%) rename {inference-engine => inference}/src/translator/service.cpp (100%) rename {inference-engine => inference}/src/translator/service.h (100%) rename {inference-engine => inference}/src/translator/text_processor.cpp (100%) rename {inference-engine => inference}/src/translator/text_processor.h (100%) rename {inference-engine => inference}/src/translator/threadsafe_batching_pool.cpp (100%) rename {inference-engine => inference}/src/translator/threadsafe_batching_pool.h (100%) rename {inference-engine => inference}/src/translator/translation_model.cpp (100%) rename {inference-engine => inference}/src/translator/translation_model.h (100%) rename {inference-engine => inference}/src/translator/utils.h (100%) rename {inference-engine => 
inference}/src/translator/vocabs.h (100%) rename {inference-engine => inference}/src/translator/xh_scanner.cpp (100%) rename {inference-engine => inference}/src/translator/xh_scanner.h (100%) rename {inference-engine => inference}/wasm/CMakeLists.txt (100%) rename {inference-engine => inference}/wasm/README.md (100%) rename {inference-engine => inference}/wasm/bindings/response_bindings.cpp (100%) rename {inference-engine => inference}/wasm/bindings/response_options_bindings.cpp (100%) rename {inference-engine => inference}/wasm/bindings/service_bindings.cpp (100%) rename {inference-engine => inference}/wasm/import-gemm-module.js (100%) rename {inference-engine => inference}/wasm/module/README.md (100%) rename {inference-engine => inference}/wasm/module/main.js (100%) rename {inference-engine => inference}/wasm/module/package.json (100%) rename {inference-engine => inference}/wasm/module/translator.js (100%) rename {inference-engine => inference}/wasm/module/worker/package.json (100%) rename {inference-engine => inference}/wasm/module/worker/translator-worker.js (100%) rename {inference-engine => inference}/wasm/node-test.js (100%) rename {inference-engine => inference}/wasm/patch-artifacts-import-gemm-module.sh (100%) rename {inference-engine => inference}/wasm/project_version.js.in (100%) rename {inference-engine => inference}/wasm/test_page/bergamot-httpserver.js (100%) rename {inference-engine => inference}/wasm/test_page/css/index.css (100%) rename {inference-engine => inference}/wasm/test_page/index.html (100%) rename {inference-engine => inference}/wasm/test_page/js/index.js (100%) rename {inference-engine => inference}/wasm/test_page/logos.png (100%) rename {inference-engine => inference}/wasm/test_page/package-lock.json (100%) rename {inference-engine => inference}/wasm/test_page/package.json (100%) rename {inference-engine => inference}/wasm/test_page/start_server.sh (100%) diff --git a/.gitmodules b/.gitmodules index a07948957..ebb589038 100644 --- 
a/.gitmodules +++ b/.gitmodules @@ -6,14 +6,6 @@ path = 3rd_party/extract-lex url = https://github.com/marian-nmt/extract-lex -[submodule "inference-engine/3rd_party/browsermt-marian-dev"] - path = inference-engine/3rd_party/browsermt-marian-dev - url = https://github.com/browsermt/marian-dev - -[submodule "inference-engine/3rd_party/ssplit-cpp"] - path = inference-engine/3rd_party/ssplit-cpp - url = https://github.com/browsermt/ssplit-cpp - [submodule "3rd_party/kenlm"] path = 3rd_party/kenlm url = https://github.com/kpu/kenlm @@ -29,3 +21,9 @@ [submodule "3rd_party/preprocess"] path = 3rd_party/preprocess url = https://github.com/kpu/preprocess.git +[submodule "inference/3rd_party/browsermt-marian-dev"] + path = inference/3rd_party/browsermt-marian-dev + url = https://github.com/browsermt/marian-dev +[submodule "inference/3rd_party/ssplit-cpp"] + path = inference/3rd_party/ssplit-cpp + url = https://github.com/browsermt/ssplit-cpp diff --git a/Taskfile.yml b/Taskfile.yml index 8bf8dddf5..745c5f05c 100644 --- a/Taskfile.yml +++ b/Taskfile.yml @@ -75,29 +75,29 @@ tasks: cmds: - poetry run opuscleaner-server serve --host=0.0.0.0 --port=8000 - inference-engine-clean: - desc: Clean build artifacts from the inference-engine directory. + inference-clean: + desc: Clean build artifacts from the inference directory. cmds: - >- - task docker-run -- ./inference-engine/scripts/clean.sh + task docker-run -- ./inference/scripts/clean.sh - inference-engine-build: + inference-build: desc: Build inference engine. cmds: - >- - task docker-run -- ./inference-engine/scripts/build-local.sh + task docker-run -- ./inference/scripts/build-local.sh - inference-engine-test: - desc: Run inference-engine tests. + inference-test: + desc: Run inference tests. cmds: - >- - task docker-run -- ./inference-engine/scripts/unit-tests.sh + task docker-run -- ./inference/scripts/unit-tests.sh - inference-engine-build-wasm: + inference-build-wasm: desc: Build inference engine WASM. 
cmds: - >- - task docker-run -- ./inference-engine/scripts/build-wasm.sh + task docker-run -- ./inference/scripts/build-wasm.sh lint-black: desc: Checks the styling of the Python code with Black. diff --git a/inference-engine/.clang-format b/inference/.clang-format similarity index 100% rename from inference-engine/.clang-format rename to inference/.clang-format diff --git a/inference-engine/.clang-format-ignore b/inference/.clang-format-ignore similarity index 100% rename from inference-engine/.clang-format-ignore rename to inference/.clang-format-ignore diff --git a/inference-engine/.clang-tidy b/inference/.clang-tidy similarity index 100% rename from inference-engine/.clang-tidy rename to inference/.clang-tidy diff --git a/inference-engine/.gitignore b/inference/.gitignore similarity index 100% rename from inference-engine/.gitignore rename to inference/.gitignore diff --git a/inference-engine/3rd_party/CMakeLists.txt b/inference/3rd_party/CMakeLists.txt similarity index 100% rename from inference-engine/3rd_party/CMakeLists.txt rename to inference/3rd_party/CMakeLists.txt diff --git a/inference-engine/3rd_party/browsermt-marian-dev b/inference/3rd_party/browsermt-marian-dev similarity index 100% rename from inference-engine/3rd_party/browsermt-marian-dev rename to inference/3rd_party/browsermt-marian-dev diff --git a/inference-engine/3rd_party/ssplit-cpp b/inference/3rd_party/ssplit-cpp similarity index 100% rename from inference-engine/3rd_party/ssplit-cpp rename to inference/3rd_party/ssplit-cpp diff --git a/inference-engine/BERGAMOT_VERSION b/inference/BERGAMOT_VERSION similarity index 100% rename from inference-engine/BERGAMOT_VERSION rename to inference/BERGAMOT_VERSION diff --git a/inference-engine/CMakeLists.txt b/inference/CMakeLists.txt similarity index 100% rename from inference-engine/CMakeLists.txt rename to inference/CMakeLists.txt diff --git a/inference-engine/Doxyfile.in b/inference/Doxyfile.in similarity index 100% rename from 
inference-engine/Doxyfile.in rename to inference/Doxyfile.in diff --git a/inference-engine/LICENSE b/inference/LICENSE similarity index 100% rename from inference-engine/LICENSE rename to inference/LICENSE diff --git a/inference-engine/MANIFEST.in b/inference/MANIFEST.in similarity index 100% rename from inference-engine/MANIFEST.in rename to inference/MANIFEST.in diff --git a/inference-engine/README.md b/inference/README.md similarity index 100% rename from inference-engine/README.md rename to inference/README.md diff --git a/inference-engine/cmake/GetVersionFromFile.cmake b/inference/cmake/GetVersionFromFile.cmake similarity index 100% rename from inference-engine/cmake/GetVersionFromFile.cmake rename to inference/cmake/GetVersionFromFile.cmake diff --git a/inference-engine/examples/run-native.sh b/inference/examples/run-native.sh similarity index 100% rename from inference-engine/examples/run-native.sh rename to inference/examples/run-native.sh diff --git a/inference-engine/patches/01-marian-fstream-for-macos.patch b/inference/patches/01-marian-fstream-for-macos.patch similarity index 100% rename from inference-engine/patches/01-marian-fstream-for-macos.patch rename to inference/patches/01-marian-fstream-for-macos.patch diff --git a/inference-engine/scripts/build-local.sh b/inference/scripts/build-local.sh similarity index 89% rename from inference-engine/scripts/build-local.sh rename to inference/scripts/build-local.sh index 65e42e761..ae64689fe 100755 --- a/inference-engine/scripts/build-local.sh +++ b/inference/scripts/build-local.sh @@ -1,11 +1,11 @@ #!/bin/bash set -e -# Run script from the context of inference-engine directory +# Run script from the context of inference directory cd "$(dirname $0)/.." # Ensure script is running within docker -./scripts/detect-docker.sh inference-engine-build +./scripts/detect-docker.sh inference-build # Return the number of available CPUs, or default to 1 if nproc is unavailable. 
detect_cpus() { diff --git a/inference-engine/scripts/build-wasm.sh b/inference/scripts/build-wasm.sh similarity index 94% rename from inference-engine/scripts/build-wasm.sh rename to inference/scripts/build-wasm.sh index 4cabed2b5..c21eea985 100755 --- a/inference-engine/scripts/build-wasm.sh +++ b/inference/scripts/build-wasm.sh @@ -1,11 +1,11 @@ #!/usr/bin/env bash set -e -# Run script from the context of inference-engine directory +# Run script from the context of inference directory cd "$(dirname $0)/.." # Ensure script is running within docker -./scripts/detect-docker.sh inference-engine-build-wasm +./scripts/detect-docker.sh inference-build-wasm set -x diff --git a/inference-engine/scripts/clean.sh b/inference/scripts/clean.sh similarity index 82% rename from inference-engine/scripts/clean.sh rename to inference/scripts/clean.sh index 410291705..73f5ae5eb 100755 --- a/inference-engine/scripts/clean.sh +++ b/inference/scripts/clean.sh @@ -1,11 +1,11 @@ #!/bin/bash set -e -# Run script from the context of inference-engine directory +# Run script from the context of inference directory cd "$(dirname $0)/.." 
# Ensure script is running within docker -./scripts/detect-docker.sh inference-engine-clean +./scripts/detect-docker.sh inference-clean # List of directories to clean dirs=("build-local" "build-wasm" "emsdk") diff --git a/inference-engine/scripts/detect-docker.sh b/inference/scripts/detect-docker.sh similarity index 100% rename from inference-engine/scripts/detect-docker.sh rename to inference/scripts/detect-docker.sh diff --git a/inference-engine/scripts/unit-tests.sh b/inference/scripts/unit-tests.sh similarity index 90% rename from inference-engine/scripts/unit-tests.sh rename to inference/scripts/unit-tests.sh index f4f12e3e1..dd8be9925 100755 --- a/inference-engine/scripts/unit-tests.sh +++ b/inference/scripts/unit-tests.sh @@ -1,11 +1,11 @@ #!/bin/bash set -e -# Run script from the context of inference-engine directory +# Run script from the context of inference directory cd "$(dirname $0)/.." # Ensure script is running within docker -./scripts/detect-docker.sh inference-engine-test +./scripts/detect-docker.sh inference-test # Check if build-local/src/tests/units directory exists if [ ! 
-d "build-local/src/tests/units" ]; then diff --git a/inference-engine/src/CMakeLists.txt b/inference/src/CMakeLists.txt similarity index 100% rename from inference-engine/src/CMakeLists.txt rename to inference/src/CMakeLists.txt diff --git a/inference-engine/src/tests/CMakeLists.txt b/inference/src/tests/CMakeLists.txt similarity index 100% rename from inference-engine/src/tests/CMakeLists.txt rename to inference/src/tests/CMakeLists.txt diff --git a/inference-engine/src/tests/async.cpp b/inference/src/tests/async.cpp similarity index 100% rename from inference-engine/src/tests/async.cpp rename to inference/src/tests/async.cpp diff --git a/inference-engine/src/tests/blocking.cpp b/inference/src/tests/blocking.cpp similarity index 100% rename from inference-engine/src/tests/blocking.cpp rename to inference/src/tests/blocking.cpp diff --git a/inference-engine/src/tests/common-impl.cpp b/inference/src/tests/common-impl.cpp similarity index 100% rename from inference-engine/src/tests/common-impl.cpp rename to inference/src/tests/common-impl.cpp diff --git a/inference-engine/src/tests/common.h b/inference/src/tests/common.h similarity index 100% rename from inference-engine/src/tests/common.h rename to inference/src/tests/common.h diff --git a/inference-engine/src/tests/intgemm-resolve.cpp b/inference/src/tests/intgemm-resolve.cpp similarity index 100% rename from inference-engine/src/tests/intgemm-resolve.cpp rename to inference/src/tests/intgemm-resolve.cpp diff --git a/inference-engine/src/tests/units/CMakeLists.txt b/inference/src/tests/units/CMakeLists.txt similarity index 100% rename from inference-engine/src/tests/units/CMakeLists.txt rename to inference/src/tests/units/CMakeLists.txt diff --git a/inference-engine/src/tests/units/annotation_tests.cpp b/inference/src/tests/units/annotation_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/annotation_tests.cpp rename to inference/src/tests/units/annotation_tests.cpp diff --git 
a/inference-engine/src/tests/units/cache_tests.cpp b/inference/src/tests/units/cache_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/cache_tests.cpp rename to inference/src/tests/units/cache_tests.cpp diff --git a/inference-engine/src/tests/units/html_tests.cpp b/inference/src/tests/units/html_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/html_tests.cpp rename to inference/src/tests/units/html_tests.cpp diff --git a/inference-engine/src/tests/units/html_tests.h b/inference/src/tests/units/html_tests.h similarity index 100% rename from inference-engine/src/tests/units/html_tests.h rename to inference/src/tests/units/html_tests.h diff --git a/inference-engine/src/tests/units/quality_estimator_tests.cpp b/inference/src/tests/units/quality_estimator_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/quality_estimator_tests.cpp rename to inference/src/tests/units/quality_estimator_tests.cpp diff --git a/inference-engine/src/tests/units/quality_estimator_tests.h b/inference/src/tests/units/quality_estimator_tests.h similarity index 100% rename from inference-engine/src/tests/units/quality_estimator_tests.h rename to inference/src/tests/units/quality_estimator_tests.h diff --git a/inference-engine/src/tests/units/run_tests.cpp b/inference/src/tests/units/run_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/run_tests.cpp rename to inference/src/tests/units/run_tests.cpp diff --git a/inference-engine/src/tests/units/xh_scanner_tests.cpp b/inference/src/tests/units/xh_scanner_tests.cpp similarity index 100% rename from inference-engine/src/tests/units/xh_scanner_tests.cpp rename to inference/src/tests/units/xh_scanner_tests.cpp diff --git a/inference-engine/src/tests/wasm.cpp b/inference/src/tests/wasm.cpp similarity index 100% rename from inference-engine/src/tests/wasm.cpp rename to inference/src/tests/wasm.cpp diff --git 
a/inference-engine/src/translator/CMakeLists.txt b/inference/src/translator/CMakeLists.txt similarity index 100% rename from inference-engine/src/translator/CMakeLists.txt rename to inference/src/translator/CMakeLists.txt diff --git a/inference-engine/src/translator/aggregate_batching_pool.cpp b/inference/src/translator/aggregate_batching_pool.cpp similarity index 100% rename from inference-engine/src/translator/aggregate_batching_pool.cpp rename to inference/src/translator/aggregate_batching_pool.cpp diff --git a/inference-engine/src/translator/aggregate_batching_pool.h b/inference/src/translator/aggregate_batching_pool.h similarity index 100% rename from inference-engine/src/translator/aggregate_batching_pool.h rename to inference/src/translator/aggregate_batching_pool.h diff --git a/inference-engine/src/translator/aligned.h b/inference/src/translator/aligned.h similarity index 100% rename from inference-engine/src/translator/aligned.h rename to inference/src/translator/aligned.h diff --git a/inference-engine/src/translator/annotation.cpp b/inference/src/translator/annotation.cpp similarity index 100% rename from inference-engine/src/translator/annotation.cpp rename to inference/src/translator/annotation.cpp diff --git a/inference-engine/src/translator/annotation.h b/inference/src/translator/annotation.h similarity index 100% rename from inference-engine/src/translator/annotation.h rename to inference/src/translator/annotation.h diff --git a/inference-engine/src/translator/batch.cpp b/inference/src/translator/batch.cpp similarity index 100% rename from inference-engine/src/translator/batch.cpp rename to inference/src/translator/batch.cpp diff --git a/inference-engine/src/translator/batch.h b/inference/src/translator/batch.h similarity index 100% rename from inference-engine/src/translator/batch.h rename to inference/src/translator/batch.h diff --git a/inference-engine/src/translator/batching_pool.cpp b/inference/src/translator/batching_pool.cpp similarity index 
100% rename from inference-engine/src/translator/batching_pool.cpp rename to inference/src/translator/batching_pool.cpp diff --git a/inference-engine/src/translator/batching_pool.h b/inference/src/translator/batching_pool.h similarity index 100% rename from inference-engine/src/translator/batching_pool.h rename to inference/src/translator/batching_pool.h diff --git a/inference-engine/src/translator/byte_array_util.cpp b/inference/src/translator/byte_array_util.cpp similarity index 100% rename from inference-engine/src/translator/byte_array_util.cpp rename to inference/src/translator/byte_array_util.cpp diff --git a/inference-engine/src/translator/byte_array_util.h b/inference/src/translator/byte_array_util.h similarity index 100% rename from inference-engine/src/translator/byte_array_util.h rename to inference/src/translator/byte_array_util.h diff --git a/inference-engine/src/translator/cache.h b/inference/src/translator/cache.h similarity index 100% rename from inference-engine/src/translator/cache.h rename to inference/src/translator/cache.h diff --git a/inference-engine/src/translator/definitions.h b/inference/src/translator/definitions.h similarity index 100% rename from inference-engine/src/translator/definitions.h rename to inference/src/translator/definitions.h diff --git a/inference-engine/src/translator/html.cpp b/inference/src/translator/html.cpp similarity index 100% rename from inference-engine/src/translator/html.cpp rename to inference/src/translator/html.cpp diff --git a/inference-engine/src/translator/html.h b/inference/src/translator/html.h similarity index 100% rename from inference-engine/src/translator/html.h rename to inference/src/translator/html.h diff --git a/inference-engine/src/translator/logging.h b/inference/src/translator/logging.h similarity index 100% rename from inference-engine/src/translator/logging.h rename to inference/src/translator/logging.h diff --git a/inference-engine/src/translator/parser.cpp 
b/inference/src/translator/parser.cpp similarity index 100% rename from inference-engine/src/translator/parser.cpp rename to inference/src/translator/parser.cpp diff --git a/inference-engine/src/translator/parser.h b/inference/src/translator/parser.h similarity index 100% rename from inference-engine/src/translator/parser.h rename to inference/src/translator/parser.h diff --git a/inference-engine/src/translator/project_version.h.in b/inference/src/translator/project_version.h.in similarity index 100% rename from inference-engine/src/translator/project_version.h.in rename to inference/src/translator/project_version.h.in diff --git a/inference-engine/src/translator/quality_estimator.cpp b/inference/src/translator/quality_estimator.cpp similarity index 100% rename from inference-engine/src/translator/quality_estimator.cpp rename to inference/src/translator/quality_estimator.cpp diff --git a/inference-engine/src/translator/quality_estimator.h b/inference/src/translator/quality_estimator.h similarity index 100% rename from inference-engine/src/translator/quality_estimator.h rename to inference/src/translator/quality_estimator.h diff --git a/inference-engine/src/translator/request.cpp b/inference/src/translator/request.cpp similarity index 100% rename from inference-engine/src/translator/request.cpp rename to inference/src/translator/request.cpp diff --git a/inference-engine/src/translator/request.h b/inference/src/translator/request.h similarity index 100% rename from inference-engine/src/translator/request.h rename to inference/src/translator/request.h diff --git a/inference-engine/src/translator/response.cpp b/inference/src/translator/response.cpp similarity index 100% rename from inference-engine/src/translator/response.cpp rename to inference/src/translator/response.cpp diff --git a/inference-engine/src/translator/response.h b/inference/src/translator/response.h similarity index 100% rename from inference-engine/src/translator/response.h rename to 
inference/src/translator/response.h diff --git a/inference-engine/src/translator/response_builder.cpp b/inference/src/translator/response_builder.cpp similarity index 100% rename from inference-engine/src/translator/response_builder.cpp rename to inference/src/translator/response_builder.cpp diff --git a/inference-engine/src/translator/response_builder.h b/inference/src/translator/response_builder.h similarity index 100% rename from inference-engine/src/translator/response_builder.h rename to inference/src/translator/response_builder.h diff --git a/inference-engine/src/translator/response_options.h b/inference/src/translator/response_options.h similarity index 100% rename from inference-engine/src/translator/response_options.h rename to inference/src/translator/response_options.h diff --git a/inference-engine/src/translator/service.cpp b/inference/src/translator/service.cpp similarity index 100% rename from inference-engine/src/translator/service.cpp rename to inference/src/translator/service.cpp diff --git a/inference-engine/src/translator/service.h b/inference/src/translator/service.h similarity index 100% rename from inference-engine/src/translator/service.h rename to inference/src/translator/service.h diff --git a/inference-engine/src/translator/text_processor.cpp b/inference/src/translator/text_processor.cpp similarity index 100% rename from inference-engine/src/translator/text_processor.cpp rename to inference/src/translator/text_processor.cpp diff --git a/inference-engine/src/translator/text_processor.h b/inference/src/translator/text_processor.h similarity index 100% rename from inference-engine/src/translator/text_processor.h rename to inference/src/translator/text_processor.h diff --git a/inference-engine/src/translator/threadsafe_batching_pool.cpp b/inference/src/translator/threadsafe_batching_pool.cpp similarity index 100% rename from inference-engine/src/translator/threadsafe_batching_pool.cpp rename to 
inference/src/translator/threadsafe_batching_pool.cpp diff --git a/inference-engine/src/translator/threadsafe_batching_pool.h b/inference/src/translator/threadsafe_batching_pool.h similarity index 100% rename from inference-engine/src/translator/threadsafe_batching_pool.h rename to inference/src/translator/threadsafe_batching_pool.h diff --git a/inference-engine/src/translator/translation_model.cpp b/inference/src/translator/translation_model.cpp similarity index 100% rename from inference-engine/src/translator/translation_model.cpp rename to inference/src/translator/translation_model.cpp diff --git a/inference-engine/src/translator/translation_model.h b/inference/src/translator/translation_model.h similarity index 100% rename from inference-engine/src/translator/translation_model.h rename to inference/src/translator/translation_model.h diff --git a/inference-engine/src/translator/utils.h b/inference/src/translator/utils.h similarity index 100% rename from inference-engine/src/translator/utils.h rename to inference/src/translator/utils.h diff --git a/inference-engine/src/translator/vocabs.h b/inference/src/translator/vocabs.h similarity index 100% rename from inference-engine/src/translator/vocabs.h rename to inference/src/translator/vocabs.h diff --git a/inference-engine/src/translator/xh_scanner.cpp b/inference/src/translator/xh_scanner.cpp similarity index 100% rename from inference-engine/src/translator/xh_scanner.cpp rename to inference/src/translator/xh_scanner.cpp diff --git a/inference-engine/src/translator/xh_scanner.h b/inference/src/translator/xh_scanner.h similarity index 100% rename from inference-engine/src/translator/xh_scanner.h rename to inference/src/translator/xh_scanner.h diff --git a/inference-engine/wasm/CMakeLists.txt b/inference/wasm/CMakeLists.txt similarity index 100% rename from inference-engine/wasm/CMakeLists.txt rename to inference/wasm/CMakeLists.txt diff --git a/inference-engine/wasm/README.md b/inference/wasm/README.md similarity 
index 100% rename from inference-engine/wasm/README.md rename to inference/wasm/README.md diff --git a/inference-engine/wasm/bindings/response_bindings.cpp b/inference/wasm/bindings/response_bindings.cpp similarity index 100% rename from inference-engine/wasm/bindings/response_bindings.cpp rename to inference/wasm/bindings/response_bindings.cpp diff --git a/inference-engine/wasm/bindings/response_options_bindings.cpp b/inference/wasm/bindings/response_options_bindings.cpp similarity index 100% rename from inference-engine/wasm/bindings/response_options_bindings.cpp rename to inference/wasm/bindings/response_options_bindings.cpp diff --git a/inference-engine/wasm/bindings/service_bindings.cpp b/inference/wasm/bindings/service_bindings.cpp similarity index 100% rename from inference-engine/wasm/bindings/service_bindings.cpp rename to inference/wasm/bindings/service_bindings.cpp diff --git a/inference-engine/wasm/import-gemm-module.js b/inference/wasm/import-gemm-module.js similarity index 100% rename from inference-engine/wasm/import-gemm-module.js rename to inference/wasm/import-gemm-module.js diff --git a/inference-engine/wasm/module/README.md b/inference/wasm/module/README.md similarity index 100% rename from inference-engine/wasm/module/README.md rename to inference/wasm/module/README.md diff --git a/inference-engine/wasm/module/main.js b/inference/wasm/module/main.js similarity index 100% rename from inference-engine/wasm/module/main.js rename to inference/wasm/module/main.js diff --git a/inference-engine/wasm/module/package.json b/inference/wasm/module/package.json similarity index 100% rename from inference-engine/wasm/module/package.json rename to inference/wasm/module/package.json diff --git a/inference-engine/wasm/module/translator.js b/inference/wasm/module/translator.js similarity index 100% rename from inference-engine/wasm/module/translator.js rename to inference/wasm/module/translator.js diff --git a/inference-engine/wasm/module/worker/package.json 
b/inference/wasm/module/worker/package.json similarity index 100% rename from inference-engine/wasm/module/worker/package.json rename to inference/wasm/module/worker/package.json diff --git a/inference-engine/wasm/module/worker/translator-worker.js b/inference/wasm/module/worker/translator-worker.js similarity index 100% rename from inference-engine/wasm/module/worker/translator-worker.js rename to inference/wasm/module/worker/translator-worker.js diff --git a/inference-engine/wasm/node-test.js b/inference/wasm/node-test.js similarity index 100% rename from inference-engine/wasm/node-test.js rename to inference/wasm/node-test.js diff --git a/inference-engine/wasm/patch-artifacts-import-gemm-module.sh b/inference/wasm/patch-artifacts-import-gemm-module.sh similarity index 100% rename from inference-engine/wasm/patch-artifacts-import-gemm-module.sh rename to inference/wasm/patch-artifacts-import-gemm-module.sh diff --git a/inference-engine/wasm/project_version.js.in b/inference/wasm/project_version.js.in similarity index 100% rename from inference-engine/wasm/project_version.js.in rename to inference/wasm/project_version.js.in diff --git a/inference-engine/wasm/test_page/bergamot-httpserver.js b/inference/wasm/test_page/bergamot-httpserver.js similarity index 100% rename from inference-engine/wasm/test_page/bergamot-httpserver.js rename to inference/wasm/test_page/bergamot-httpserver.js diff --git a/inference-engine/wasm/test_page/css/index.css b/inference/wasm/test_page/css/index.css similarity index 100% rename from inference-engine/wasm/test_page/css/index.css rename to inference/wasm/test_page/css/index.css diff --git a/inference-engine/wasm/test_page/index.html b/inference/wasm/test_page/index.html similarity index 100% rename from inference-engine/wasm/test_page/index.html rename to inference/wasm/test_page/index.html diff --git a/inference-engine/wasm/test_page/js/index.js b/inference/wasm/test_page/js/index.js similarity index 100% rename from 
inference-engine/wasm/test_page/js/index.js rename to inference/wasm/test_page/js/index.js diff --git a/inference-engine/wasm/test_page/logos.png b/inference/wasm/test_page/logos.png similarity index 100% rename from inference-engine/wasm/test_page/logos.png rename to inference/wasm/test_page/logos.png diff --git a/inference-engine/wasm/test_page/package-lock.json b/inference/wasm/test_page/package-lock.json similarity index 100% rename from inference-engine/wasm/test_page/package-lock.json rename to inference/wasm/test_page/package-lock.json diff --git a/inference-engine/wasm/test_page/package.json b/inference/wasm/test_page/package.json similarity index 100% rename from inference-engine/wasm/test_page/package.json rename to inference/wasm/test_page/package.json diff --git a/inference-engine/wasm/test_page/start_server.sh b/inference/wasm/test_page/start_server.sh similarity index 100% rename from inference-engine/wasm/test_page/start_server.sh rename to inference/wasm/test_page/start_server.sh From 8d2edd1f136605ee65cc2c517c35b7c2496529ed Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Mon, 30 Sep 2024 16:13:30 -0500 Subject: [PATCH 436/442] Reintroduce browsermt-marian-dev comment to .gitmodules file --- .gitmodules | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/.gitmodules b/.gitmodules index ebb589038..51663e9bf 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,29 +1,42 @@ [submodule "fast_align"] path = 3rd_party/fast_align url = https://github.com/clab/fast_align - [submodule "extract-lex"] path = 3rd_party/extract-lex url = https://github.com/marian-nmt/extract-lex - [submodule "3rd_party/kenlm"] path = 3rd_party/kenlm url = https://github.com/kpu/kenlm - [submodule "3rd_party/browsermt-marian-dev"] path = 3rd_party/browsermt-marian-dev url = https://github.com/browsermt/marian-dev - [submodule "3rd_party/marian-dev"] path = 3rd_party/marian-dev url = https://github.com/marian-nmt/marian-dev - [submodule 
"3rd_party/preprocess"] path = 3rd_party/preprocess url = https://github.com/kpu/preprocess.git -[submodule "inference/3rd_party/browsermt-marian-dev"] - path = inference/3rd_party/browsermt-marian-dev - url = https://github.com/browsermt/marian-dev [submodule "inference/3rd_party/ssplit-cpp"] path = inference/3rd_party/ssplit-cpp url = https://github.com/browsermt/ssplit-cpp +# This is the same dependency and repository as `3rd_party/browsermt-marian-dev` below. +# +# When forking `inference-engine` into to this project, I made an earnest attempt to utilize the preexisting +# `3rd_party/browsermt-marian-dev` submodule within `inference-engine`. Unfortunately, I ran into several roadblocks: +# +# 1) I cannot directly add `3rd_party/browsermt-marian-dev` as a cmake subdirectory because cmake is aware that +# this path is not a subdirectory of the `inference-engine` project root. +# +# 2) Symbolic links do not appear to work for git submodule direcotires the way that they do for regular directories. +# Even if the symbolic link had linked correctly, it may have still failed due to the considerations of 1). +# +# 3) I tried using cmake to copy the files from `3rd_party/browsermt-marian-dev` into `inference-engine/3rd_party/browsermt-marian-dev` +# at build time, which would ensure that there is no duplicate reference to the URL in this file, however the upstream dependency itself +# has hard-coded expectations that the `.git` directory is only one level up, which appears to work correctly for the way git submodules are +# configured, but does not work if the files are copied over to a regular directory deeper in the repository's directory tree. +# +# It may be possible to remove `3rd_party/browsermt-marian-dev` to instead use `inference-engine/3rd-party/browsermt-marian-dev` everywhere +# within this repository, but I will leave that for a future commit if there is a need to do so. 
+[submodule "inference/3rd_party/browsermt-marian-dev"] + path = inference/3rd_party/browsermt-marian-dev + url = https://github.com/browsermt/marian-dev From 01e3af527ef7b153ceb7b53fd04e220c6bbbd323 Mon Sep 17 00:00:00 2001 From: Erik Nordin Date: Mon, 30 Sep 2024 16:15:18 -0500 Subject: [PATCH 437/442] Remove sub-directory README files --- inference/README.md | 82 ----------- inference/wasm/README.md | 46 ------ inference/wasm/module/README.md | 238 -------------------------------- 3 files changed, 366 deletions(-) delete mode 100644 inference/README.md delete mode 100644 inference/wasm/README.md delete mode 100644 inference/wasm/module/README.md diff --git a/inference/README.md b/inference/README.md deleted file mode 100644 index 05c3c3d25..000000000 --- a/inference/README.md +++ /dev/null @@ -1,82 +0,0 @@ -# Bergamot Translator - -[![CircleCI badge](https://img.shields.io/circleci/project/github/browsermt/bergamot-translator/main.svg?label=CircleCI)](https://circleci.com/gh/browsermt/bergamot-translator/) - -Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. - -## Build Instructions - -### Build Natively -Create a folder where you want to build all the artifacts (`build-native` in this case) and compile - -```bash -mkdir build-native -cd build-native -cmake ../ -make -j2 -``` - -### Build WASM -#### Prerequisite - -Building on wasm requires Emscripten toolchain. 
It can be downloaded and installed using the following instructions: - -* Get the latest sdk: `git clone https://github.com/emscripten-core/emsdk.git` -* Enter the cloned directory: `cd emsdk` -* Install the sdk: `./emsdk install 3.1.8` -* Activate the sdk: `./emsdk activate 3.1.8` -* Activate path variables: `source ./emsdk_env.sh` - -#### Compile - -To build a version that translates at higher speeds in the Firefox Nightly browser, follow these instructions: - - 1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile - ```bash - mkdir build-wasm - cd build-wasm - emcmake cmake -DCOMPILE_WASM=on ../ - emmake make -j2 - ``` - - The wasm artifacts (.js and .wasm files) will be available in the build directory ("build-wasm" in this case). - - 2. Patch the generated artifacts to import the GEMM library from a separate wasm module - ```bash - bash ../wasm/patch-artifacts-import-gemm-module.sh - ``` - -To build a version that runs on all browsers (including Firefox Nightly) but translates slowly, follow these instructions: - - 1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile - ```bash - mkdir build-wasm - cd build-wasm - emcmake cmake -DCOMPILE_WASM=on ../ - emmake make -j2 - ``` - - 2. Patch the generated artifacts to import the GEMM library from a separate wasm module - ```bash - bash ../wasm/patch-artifacts-import-gemm-module.sh - ``` - -#### Recompiling -As long as you don't update any submodule, just follow the [Compile](#Compile) steps.\ -If you update a submodule, execute the following command in the repository root folder before executing the -[Compile](#Compile) steps. -```bash -git submodule update --init --recursive -``` - - -## How to use - -### Using Native version - -The builds generate a library that can be integrated into any project. All the public header files are specified in the `src` folder.\ -A short example of how to use the APIs is provided in the `app/bergamot.cpp` file. 
- -### Using WASM version - -Please follow the `README` inside the `wasm` folder of this repository, which demonstrates how to use the translator in JavaScript. diff --git a/inference/wasm/README.md b/inference/wasm/README.md deleted file mode 100644 index 0f3f77426..000000000 --- a/inference/wasm/README.md +++ /dev/null @@ -1,46 +0,0 @@ -# Using Bergamot Translator in JavaScript - -All the instructions below are meant to run from the current directory. - -## Using JS APIs - -See [node-test.js](./node-test.js) for an annotated example of how to use the WASM module. Most of the code from it can also be used in a browser context. - -Alternatively, refer to the file `test_page/js/worker.js`, which demonstrates how to use the bergamot translator in JavaScript via a `