3.1.1 (#821)

MontrealCorpusTools · Jun 24, 2024 · c33e40e
1 parent 945424c
Showing 19 changed files with 207 additions and 74 deletions.
diff --git a/docs/source/changelog/changelog_3.0.rst b/docs/source/changelog/changelog_3.0.rst
@@ -5,6 +5,15 @@
 3.0 Changelog
 *************
 
+3.1.1
+-----
+
+- Fixed a bug where hidden files and folders would be parsed as corpus data
+- Fixed a bug where validation would not respect :code:`--no_final_clean`
+- Fixed a rare crash in training when a job would not have utterances assigned to it
+- Fixed a bug where MFA would mistakenly report that dictionary and acoustic model phones did not match for older versions
+
+
 3.1.0
 -----
 

diff --git a/docs/source/user_guide/concepts/hmm.md b/docs/source/user_guide/concepts/hmm.md
@@ -22,4 +22,37 @@ Still under construction, I hope to fill these sections out as I have time.
 
 ### MFA topology
 
-MFA uses a variable 5-state topology for modeling phones.  Each state has a likelihood to transition to the final state in addition to the next state.  What this is means is that each phone has a minimum duration of 10ms (corresponding to the default time step for MFCC generation), rather than 30ms for a more standard 3-state HMM.  Having a shorter minimum duration reduces alignment errors from short or dropped phones, i.e., American English flaps or schwas, or accommodate for dictionary errors (though these should still be fixed).
+MFA uses a variable 3-state topology for modeling phones.  Each state has a likelihood to transition to the final state in addition to the next state.  What this means is that each phone has a minimum duration of 10ms (corresponding to the default time step for MFCC generation), rather than 30ms for a more standard 3-state HMM.  Having a shorter minimum duration reduces alignment errors from short or dropped phones, e.g., American English flaps or schwas, and accommodates dictionary errors (though these should still be fixed).
+
+#### Customizing topologies
+
+Custom numbers of states can be specified via a topology configuration file. The configuration file should list per-phone minimum and maximum states, as below.
+
+```{code}yaml
+tʃ:
+  - min_states: 3
+  - max_states: 5
+ɾ:
+  - min_states: 1
+  - max_states: 1
+```
+
+In the above example, the {ipa_inline}`[tʃ]` phone will have a variable topology with a minimum of 3 states before terminating, and up to 5 states to cover additional transitions for the complex articulation. Conversely, the {ipa_inline}`[ɾ]` phone is a very short articulation, so setting both minimum and maximum to 1 state ensures that additional states are not used to model the phone.
+
+```{seealso}
+* [Example configuration files](https://github.com/MontrealCorpusTools/mfa-models/tree/main/config/acoustic/topologies)
+```
+
+## Clustering phones
+
+In a monophone model, each phone is modeled the same regardless of the surrounding phonological context. Consider the {ipa_inline}`[P]` in the words {ipa_inline}`paid [P EY1 D]` and {ipa_inline}`spade [S P EY1 D]` in English. The actual pronunciation of the {ipa_inline}`[P]` in paid will be an aspirated {ipa_inline}`[pʰ]` but the pronunciation of {ipa_inline}`[P]` following {ipa_inline}`[S]` is an unaspirated {ipa_inline}`[p]`.
+
+To more accurately model these phonological variants, we use triphone models. Under the hood, each phone gets transformed into a sequence of three phones, including the phone and its preceding and following phones. So the representation for "paid" and "spade" becomes {ipa_inline}`[#/P/EY1 P/EY1/D EY1/D/#]` and {ipa_inline}`[#/S/P S/P/EY1 P/EY1/D EY1/D/#]`. At this level, we have made it so that the {ipa_inline}`[P]` phones have two different labels, so each can be modeled differently.
+
+However, representing phones this way results in a massive explosion of the number of phones, with not as many corresponding occurrences. If there is not much data for particular phones, modeling them appropriately becomes challenging. The solution to this data sparsity issue is to cluster the resulting states based on their similarity. For the triphone {ipa_inline}`[P/EY1/D]`, triphones like {ipa_inline}`[P/EY1/T]`, {ipa_inline}`[B/EY1/D]`, {ipa_inline}`[M/EY1/D]`, {ipa_inline}`[B/EY1/T]`, and {ipa_inline}`[M/EY1/T]` will all have similar acoustics, as they're {ipa_inline}`[EY1]` vowels with bilabial stops preceding and oral coronal stops following. The triphone {ipa_inline}`[P/EY1/N]` and others with following nasals will likely not be similar enough due to regressive nasalization in English.
+
+As a result of the phone clustering, the number of PDFs being modeled is reduced to a more manageable number with fewer data sparsity issues.
+
+```{note}
+By default, Kaldi and earlier versions of MFA included silence phones with the nonsilence phones when clustering, on the idea that, for instance, stops have a closure state that is similar to silence. However, clustering silence states with nonsilence states has led to gross alignment errors with less clean data, so MFA 3.1 and later removes all instances of the silence phone being clustered with nonsilence phones. The OOV phone is still clustered with both silence and nonsilence phones, however, and OOVs can cover multiple words.
+```
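
The triphone relabeling described in the new "Clustering phones" section can be illustrated with a small sketch. This is not MFA or Kaldi code: the function name `to_triphones` and the `#` boundary symbol handling are assumptions made for illustration, chosen to reproduce the labels shown in the documentation text.

```python
from typing import List


def to_triphones(phones: List[str], boundary: str = "#") -> List[str]:
    """Label each phone as left/phone/right, padding word edges with a boundary symbol."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}/{padded[i]}/{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


paid = to_triphones(["P", "EY1", "D"])
spade = to_triphones(["S", "P", "EY1", "D"])

print(" ".join(paid))   # #/P/EY1 P/EY1/D EY1/D/#
print(" ".join(spade))  # #/S/P S/P/EY1 P/EY1/D EY1/D/#
```

The two `P` phones end up with different labels (`#/P/EY1` versus `S/P/EY1`), which is what allows the aspirated and unaspirated variants to be modeled separately before clustering merges similar contexts.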
diff --git a/montreal_forced_aligner/abc.py b/montreal_forced_aligner/abc.py
@@ -699,7 +699,7 @@ def cleanup(self) -> None:
                 logger.info(f"Done! Everything took {time.time() - self.start_time:.3f} seconds")
                 if config.FINAL_CLEAN:
                     logger.debug(
-                        "Cleaning up temporary files, use the --debug flag to keep temporary files."
+                        "Cleaning up temporary files, use the --no_final_clean flag to keep temporary files."
                     )
                     if hasattr(self, "delete_database"):
                         if config.USE_POSTGRES:

diff --git a/montreal_forced_aligner/acoustic_modeling/monophone.py b/montreal_forced_aligner/acoustic_modeling/monophone.py
@@ -133,9 +133,10 @@ def _run(self):
                 writer.Close()
                 self.callback((accumulator.transition_accs, accumulator.gmm_accs))
                 train_logger.info(f"Done {num_done} utterances, errors on {num_error} utterances.")
-                train_logger.info(
-                    f"Overall avg like per frame (Gaussian only) = {tot_like/tot_t} over {tot_t} frames."
-                )
+                if tot_t:
+                    train_logger.info(
+                        f"Overall avg like per frame (Gaussian only) = {tot_like/tot_t} over {tot_t} frames."
+                    )
 
 
 class MonophoneTrainer(AcousticModelTrainingMixin):

diff --git a/montreal_forced_aligner/acoustic_modeling/triphone.py b/montreal_forced_aligner/acoustic_modeling/triphone.py
@@ -411,11 +411,22 @@ def _setup_tree(self, init_from_previous=False, initial_mix_up=True) -> None:
             silence_sets = [
                 x for x in questions if silence_phone_id in x and x != [silence_phone_id]
             ]
+            filtered = []
+            existing_sets = {tuple(x) for x in questions}
             for q_set in silence_sets:
                 train_logger.debug(", ".join([self.reversed_phone_mapping[x] for x in q_set]))
-            questions = [
-                x for x in questions if silence_phone_id not in x or x == [silence_phone_id]
-            ]
+
+            for q_set in questions:
+                if silence_phone_id not in q_set or q_set == [silence_phone_id]:
+                    filtered.append(q_set)
+                    continue
+                q_set = [x for x in q_set if x != silence_phone_id]
+                if not q_set:
+                    continue
+                if tuple(q_set) in existing_sets:
+                    continue
+                filtered.append(q_set)
+            questions = filtered
 
             extra_questions = self.worker.extra_questions_mapping
             if extra_questions:
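
The new filtering loop in `_setup_tree` can be read in isolation as the sketch below. The standalone function `strip_silence` and the integer phone IDs are illustrative assumptions; the logic mirrors the hunk above, including the fact that `existing` is computed once from the original question sets and not updated as reduced sets are added.

```python
from typing import List


def strip_silence(questions: List[List[int]], silence_id: int) -> List[List[int]]:
    """Remove the silence phone from mixed question sets, dropping duplicates."""
    existing = {tuple(q) for q in questions}
    filtered = []
    for q_set in questions:
        # Sets without silence, and the silence-only set, pass through unchanged.
        if silence_id not in q_set or q_set == [silence_id]:
            filtered.append(q_set)
            continue
        reduced = [p for p in q_set if p != silence_id]
        # Skip sets that become empty or duplicate an original question set.
        if not reduced or tuple(reduced) in existing:
            continue
        filtered.append(reduced)
    return filtered


# Phone ID 1 stands in for the silence phone.
questions = [[1], [1, 2, 3], [2, 3], [1, 4], [5, 6]]
print(strip_silence(questions, silence_id=1))  # [[1], [2, 3], [4], [5, 6]]
```

Compared with the removed list comprehension, which discarded any mixed set outright, this keeps the nonsilence part of a mixed set (here `[1, 4]` becomes `[4]`) as long as it is not already present.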

diff --git a/montreal_forced_aligner/alignment/base.py b/montreal_forced_aligner/alignment/base.py
@@ -1449,6 +1449,7 @@ def evaluate_alignments(
                 align_phones,
                 silence_phone=self.optional_silence_phone,
                 custom_mapping=mapping,
+                debug=config.DEBUG,
             )
             unaligned_utts = []
             utterances: typing.List[Utterance] = session.query(Utterance).options(

diff --git a/montreal_forced_aligner/command_line/align.py b/montreal_forced_aligner/command_line/align.py
@@ -4,7 +4,6 @@
 from pathlib import Path
 
 import rich_click as click
-import yaml
 
 from montreal_forced_aligner import config
 from montreal_forced_aligner.alignment import PretrainedAligner
@@ -15,7 +14,7 @@
     validate_g2p_model,
 )
 from montreal_forced_aligner.data import WorkflowType
-from montreal_forced_aligner.helper import mfa_open
+from montreal_forced_aligner.helper import load_evaluation_mapping
 
 __all__ = ["align_corpus_cli"]
 
@@ -137,8 +136,7 @@ def align_corpus_cli(context, **kwargs) -> None:
             aligner.load_reference_alignments(reference_directory)
             mapping = None
             if custom_mapping_path:
-                with mfa_open(custom_mapping_path, "r") as f:
-                    mapping = yaml.load(f, Loader=yaml.Loader)
+                mapping = load_evaluation_mapping(custom_mapping_path)
                 aligner.validate_mapping(mapping)
             reference_alignments = WorkflowType.reference
         else:

diff --git a/montreal_forced_aligner/command_line/validate.py b/montreal_forced_aligner/command_line/validate.py
@@ -104,6 +104,7 @@ def validate_corpus_cli(context, **kwargs) -> None:
     """
     if kwargs.get("profile", None) is not None:
         config.profile = kwargs.pop("profile")
+    config.FINAL_CLEAN = True
     config.update_configuration(kwargs)
     kwargs["USE_THREADING"] = False
 
@@ -139,7 +140,6 @@ def validate_corpus_cli(context, **kwargs) -> None:
         validator.dirty = True
         raise
     finally:
-        config.FINAL_CLEAN = True
         validator.cleanup()
 
 

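The change to `validate_corpus_cli` is about ordering: the default is now set before user options are applied, instead of being forced in the `finally` block afterwards. A minimal sketch of the difference follows. The `run_validate` function and the use of `SimpleNamespace` as a stand-in for MFA's `config` module are hypothetical, and the real `config.update_configuration` may behave differently.

```python
from types import SimpleNamespace


def run_validate(user_options: dict, force_in_finally: bool) -> bool:
    """Return the FINAL_CLEAN value that cleanup would see."""
    config = SimpleNamespace(FINAL_CLEAN=False)  # hypothetical stand-in for MFA's config
    if not force_in_finally:
        config.FINAL_CLEAN = True  # new order: default first, user options may override
    vars(config).update(user_options)  # stand-in for config.update_configuration(kwargs)
    try:
        pass  # validation work would happen here
    finally:
        if force_in_finally:
            config.FINAL_CLEAN = True  # old order: the user's choice is overwritten
    return config.FINAL_CLEAN


print(run_validate({"FINAL_CLEAN": False}, force_in_finally=True))   # True
print(run_validate({"FINAL_CLEAN": False}, force_in_finally=False))  # False
print(run_validate({}, force_in_finally=False))                      # True
```

Under these assumptions, the old ordering ignores a request to keep temporary files, while the new ordering keeps cleaning as the default and still honors `--no_final_clean`.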
diff --git a/montreal_forced_aligner/corpus/acoustic_corpus.py b/montreal_forced_aligner/corpus/acoustic_corpus.py
@@ -234,8 +234,12 @@ def load_reference_alignments(self, reference_directory: Path) -> None:
                     max_id = p_id
             new_phones = []
             for root, _, files in os.walk(reference_directory, followlinks=True):
+                if root.startswith("."):  # Ignore hidden directories
+                    continue
                 root_speaker = os.path.basename(root)
                 for f in files:
+                    if f.startswith("."):  # Ignore hidden files
+                        continue
                     if f.endswith(".TextGrid"):
                         file_name = f.replace(".TextGrid", "")
                         file_id = session.query(File.id).filter_by(name=file_name).scalar()
@@ -988,6 +992,8 @@ def _load_corpus_from_source(self) -> None:
             for root, _, files in os.walk(self.audio_directory, followlinks=True):
                 if self.stopped.is_set():
                     return
+                if root.startswith("."):  # Ignore hidden directories
+                    continue
                 exts = find_exts(files)
                 exts.wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
                 exts.other_audio_files = {
@@ -999,12 +1005,14 @@ def _load_corpus_from_source(self) -> None:
         with self.session() as session:
             import_data = DatabaseImportData()
             for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+                if self.stopped.is_set():
+                    return
+                if root.startswith("."):  # Ignore hidden directories
+                    continue
                 exts = find_exts(files)
                 relative_path = (
                     root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
                 )
-                if self.stopped.is_set():
-                    return
                 if not use_audio_directory:
                     all_sound_files = {}
                     wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
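
One thing to keep in mind when reading the `root.startswith(".")` checks: `os.walk` yields `root` as a path that begins with the directory passed in, so the result of that test depends on how the corpus directory itself was spelled. A possible alternative, sketched below, prunes hidden directories by name while walking. The helper `walk_visible` is hypothetical and is not part of MFA.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


def walk_visible(top) -> Iterator[Tuple[str, List[str]]]:
    """Walk a directory tree, skipping hidden directories and hidden files."""
    for root, dirs, files in os.walk(top, followlinks=True):
        # Editing dirs in place stops os.walk from descending into hidden directories.
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        yield root, sorted(f for f in files if not f.startswith("."))


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "speaker1").mkdir()
    (corpus / "speaker1" / "a.wav").touch()
    (corpus / "speaker1" / ".DS_Store").touch()
    (corpus / ".git").mkdir()
    (corpus / ".git" / "config.lab").touch()
    found = [
        os.path.relpath(os.path.join(root, f), corpus).replace(os.sep, "/")
        for root, files in walk_visible(corpus)
        for f in files
    ]

print(found)  # ['speaker1/a.wav']
```

Because the check is applied to directory and file names and not to the full path, it behaves the same whether the corpus directory is given as an absolute path or as something like `./corpus`.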

diff --git a/montreal_forced_aligner/corpus/helper.py b/montreal_forced_aligner/corpus/helper.py
@@ -62,6 +62,8 @@ def find_exts(files: typing.List[str]) -> FileExtensions:
     """
     exts = FileExtensions(set(), {}, {}, {}, {})
     for full_filename in files:
+        if full_filename.startswith("."):  # Ignore hidden files
+            continue
         try:
             filename, fext = full_filename.rsplit(".", maxsplit=1)
         except ValueError:

diff --git a/montreal_forced_aligner/corpus/multiprocessing.py b/montreal_forced_aligner/corpus/multiprocessing.py
@@ -106,6 +106,8 @@ def run(self) -> None:
         if self.audio_directory and os.path.exists(self.audio_directory):
             use_audio_directory = True
             for root, _, files in os.walk(self.audio_directory, followlinks=True):
+                if root.startswith("."):  # Ignore hidden directories
+                    continue
                 exts = find_exts(files)
                 wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
                 other_audio_files = {
@@ -114,11 +116,12 @@ def run(self) -> None:
                 all_sound_files.update(other_audio_files)
                 all_sound_files.update(wav_files)
         for root, _, files in os.walk(self.corpus_directory, followlinks=True):
-            exts = find_exts(files)
-            relative_path = root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
-
             if self.stopped.is_set():
                 break
+            if root.startswith("."):  # Ignore hidden directories
+                continue
+            exts = find_exts(files)
+            relative_path = root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
             if not use_audio_directory:
                 all_sound_files = {}
                 exts.wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}

diff --git a/montreal_forced_aligner/corpus/text_corpus.py b/montreal_forced_aligner/corpus/text_corpus.py
@@ -65,13 +65,14 @@ def _load_corpus_from_source_mp(self) -> None:
             file_count = 0
             with tqdm(total=1, disable=config.QUIET) as pbar, self.session() as session:
                 for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+                    if self.stopped.is_set():
+                        break
+                    if root.startswith("."):  # Ignore hidden directories
+                        continue
                     exts = find_exts(files)
                     relative_path = (
                         root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
                     )
-
-                    if self.stopped.is_set():
-                        break
                     for file_name in exts.identifiers:
                         if self.stopped.is_set():
                             break
@@ -181,12 +182,14 @@ def _load_corpus_from_source(self) -> None:
         sanitize_function = getattr(self, "sanitize_function", None)
         with self.session() as session:
             for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+                if self.stopped:
+                    return
+                if root.startswith("."):  # Ignore hidden directories
+                    continue
                 exts = find_exts(files)
                 relative_path = (
                     root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
                 )
-                if self.stopped:
-                    return
                 for file_name in exts.identifiers:
                     wav_path = None
                     if file_name in exts.lab_files:

diff --git a/montreal_forced_aligner/dictionary/mixins.py b/montreal_forced_aligner/dictionary/mixins.py
@@ -260,11 +260,7 @@ def get_base_phone(self, phone: str) -> str:
     @property
     def extra_questions_mapping(self) -> Dict[str, List[str]]:
         """Mapping of extra questions for the given phone set type"""
-        mapping = {"silence_question": []}
-        for p in sorted(self.silence_phones):
-            mapping["silence_question"].append(p)
-            if self.position_dependent_phones:
-                mapping["silence_question"].extend([p + x for x in self.positions])
+        mapping = {}
         for k, v in self.phone_set_type.extra_questions.items():
             if k not in mapping:
                 mapping[k] = []
@@ -427,7 +423,8 @@ def _generate_positional_list(self, phones: Set[str]) -> List[str]:
             List of positional phones, sorted by base phone
         """
         positional_phones = []
-        phones |= {self.get_base_phone(p) for p in phones}
+        if not hasattr(self, "acoustic_model"):
+            phones |= {self.get_base_phone(p) for p in phones}
         for p in sorted(phones):
             if p not in self.non_silence_phones:
                 continue

diff --git a/montreal_forced_aligner/dictionary/multispeaker.py b/montreal_forced_aligner/dictionary/multispeaker.py
@@ -152,7 +152,7 @@ def tokenizers(self):
                         bracketed_word=self.bracketed_word,
                         cutoff_word=self.cutoff_word,
                         ignore_case=self.ignore_case,
-                        use_g2p=self.use_g2p,
+                        use_g2p=self.use_g2p or getattr(self, "g2p_model", None) is not None,
                         clitic_set=clitic_set,
                         grapheme_set=grapheme_set,
                     )