3.1.1 (#821)
mmcauliffe authored Jun 24, 2024
1 parent 945424c commit c33e40e
Showing 19 changed files with 207 additions and 74 deletions.
9 changes: 9 additions & 0 deletions docs/source/changelog/changelog_3.0.rst
@@ -5,6 +5,15 @@
 3.0 Changelog
 *************

+3.1.1
+-----
+
+- Fixed a bug where hidden files and folders would be parsed as corpus data
+- Fixed a bug where validation would not respect :code:`--no_final_clean`
+- Fixed a rare crash in training when a job would not have utterances assigned to it
+- Fixed a bug where MFA would mistakenly report that dictionary and acoustic model phones did not match for older versions
+
+
 3.1.0
 -----

35 changes: 34 additions & 1 deletion docs/source/user_guide/concepts/hmm.md
@@ -22,4 +22,37 @@ Still under construction, I hope to fill these sections out as I have time.

 ### MFA topology

-MFA uses a variable 5-state topology for modeling phones. Each state has a likelihood to transition to the final state in addition to the next state. What this is means is that each phone has a minimum duration of 10ms (corresponding to the default time step for MFCC generation), rather than 30ms for a more standard 3-state HMM. Having a shorter minimum duration reduces alignment errors from short or dropped phones, i.e., American English flaps or schwas, or accommodate for dictionary errors (though these should still be fixed).
+MFA uses a variable 3-state topology for modeling phones. Each state has a likelihood to transition to the final state in addition to the next state. What this means is that each phone has a minimum duration of 10ms (corresponding to the default time step for MFCC generation), rather than 30ms for a more standard 3-state HMM. Having a shorter minimum duration reduces alignment errors from short or dropped phones, e.g., American English flaps or schwas, and accommodates dictionary errors (though these should still be fixed).
+
+#### Customizing topologies
+
+Custom numbers of states can be specified via a topology configuration file. The configuration file should list per-phone minimum and maximum states, as below.
+
+```{code}yaml
+tʃ:
+- min_states: 3
+- max_states: 5
+ɾ:
+- min_states: 1
+- max_states: 1
+```
+
+In the above example, the {ipa_inline}`[tʃ]` phone will have a variable topology with a minimum of 3 states before terminating, and up to 5 states to cover additional transitions for the complex articulation. Conversely, the {ipa_inline}`[ɾ]` phone is a very short articulation, so setting both minimum and maximum to 1 state ensures that additional states are not used to model the phone.
+
+```{seealso}
+* [Example configuration files](https://github.com/MontrealCorpusTools/mfa-models/tree/main/config/acoustic/topologies)
+```
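As a rough illustration of how the per-phone state limits relate to minimum duration, the sketch below assumes the configuration has already been loaded into a plain dictionary and that every mandatory state consumes at least one 10 ms frame. The `FRAME_SHIFT_MS` constant, the `topology` dictionary, and the `min_duration_ms` helper are hypothetical names used only for this example; they are not part of MFA's API.

```python
FRAME_SHIFT_MS = 10  # assumed default MFCC time step, as described above

# Hypothetical in-memory form of the YAML example: each phone maps to a
# list of single-key mappings, which is what a YAML loader would produce.
topology = {
    "tʃ": [{"min_states": 3}, {"max_states": 5}],
    "ɾ": [{"min_states": 1}, {"max_states": 1}],
}


def min_duration_ms(entries, frame_shift_ms=FRAME_SHIFT_MS):
    """Shortest alignable duration, assuming one frame per mandatory state."""
    settings = {}
    for entry in entries:
        settings.update(entry)  # merge the single-key mappings
    min_states = settings["min_states"]
    max_states = settings["max_states"]
    if not 1 <= min_states <= max_states:
        raise ValueError(f"invalid state range: {min_states}-{max_states}")
    return min_states * frame_shift_ms


durations = {phone: min_duration_ms(entries) for phone, entries in topology.items()}
```

Under these assumptions the flap can be as short as a single frame, while the affricate needs at least three frames.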
+
+## Clustering phones
+
+In a monophone model, each phone is modeled the same regardless of the surrounding phonological context. Consider the {ipa_inline}`[P]` in the words {ipa_inline}`paid [P EY1 D]` and {ipa_inline}`spade [S P EY1 D]` in English. The actual pronunciation of the {ipa_inline}`[P]` in paid will be an aspirated {ipa_inline}`[pʰ]` but the pronunciation of {ipa_inline}`[P]` following {ipa_inline}`[S]` is an unaspirated {ipa_inline}`[p]`.
+
+To more accurately model these phonological variants, we use triphone models. Under the hood, each phone gets transformed into a sequence of three phones, including the phone and its preceding and following phones. So the representation for "paid" and "spade" becomes {ipa_inline}`[#/P/EY1 P/EY1/D EY1/D/#]` and {ipa_inline}`[#/S/P S/P/EY1 P/EY1/D EY1/D/#]`. At this level, we have made it so that the {ipa_inline}`[P]` phones have two different labels, so each can be modeled differently.
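The triphone relabeling described above can be sketched in a few lines. This is only an illustration of the labeling scheme; the `to_triphones` helper and its `boundary` parameter are hypothetical names, and Kaldi and MFA represent context internally in their own way.

```python
def to_triphones(phones, boundary="#"):
    """Label each phone with its left and right neighbours."""
    padded = [boundary, *phones, boundary]
    return [
        f"{padded[i - 1]}/{padded[i]}/{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


paid = to_triphones(["P", "EY1", "D"])
spade = to_triphones(["S", "P", "EY1", "D"])
```

The two `P` phones end up with different labels (`#/P/EY1` versus `S/P/EY1`), which is what lets the aspirated and unaspirated variants be modeled separately.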
+
+However, representing phones this way results in a massive explosion of the number of phones, with correspondingly fewer occurrences of each. If there is not much data for particular phones, modeling them appropriately becomes challenging. The solution to this data sparsity issue is to cluster the resulting states based on their similarity. For the triphone {ipa_inline}`[P/EY1/D]`, triphones like {ipa_inline}`[P/EY1/T]`, {ipa_inline}`[B/EY1/D]`, {ipa_inline}`[M/EY1/D]`, {ipa_inline}`[B/EY1/T]`, and {ipa_inline}`[M/EY1/T]` will all have similar acoustics, as they're {ipa_inline}`[EY1]` vowels with bilabial stops preceding and oral coronal stops following. The triphone {ipa_inline}`[P/EY1/N]` and others with following nasals will likely not be similar enough due to regressive nasalization in English.
+
+As a result of the phone clustering, the number of PDFs being modeled is reduced to a more manageable number with fewer data sparsity issues.
+
+```{note}
+By default, Kaldi and earlier versions of MFA included silence phones with the nonsilence phones, based on the idea that, for instance, stops have a closure state that is similar to silence. However, clustering silence states with nonsilence states has led to gross alignment errors with less clean data, so MFA 3.1 and later removes all instances of the silence phone being clustered with nonsilence phones. The OOV phone is still clustered with both silence and nonsilence phones, however, and OOVs can cover multiple words.
+```
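The removal of silence from the clustering questions can be sketched as a standalone function. It follows the same steps as the `_setup_tree` change in `triphone.py` in this commit, but the function name and the integer phone ids are made up for illustration.

```python
def drop_silence_from_questions(questions, silence_id):
    """Remove the silence phone from every question set except the
    silence-only set, skipping sets that become empty or duplicate an
    existing question."""
    existing = {tuple(q) for q in questions}
    filtered = []
    for q_set in questions:
        # Sets without silence, and the silence-only set, pass through unchanged.
        if silence_id not in q_set or q_set == [silence_id]:
            filtered.append(q_set)
            continue
        stripped = [p for p in q_set if p != silence_id]
        if not stripped or tuple(stripped) in existing:
            continue
        filtered.append(stripped)
    return filtered


# Hypothetical ids: 0 is the silence phone, the rest are nonsilence phones.
questions = [[0], [0, 1, 2], [1, 2], [0, 3], [4, 5]]
result = drop_silence_from_questions(questions, silence_id=0)
```

Here `[0, 1, 2]` is dropped because stripping silence yields the existing `[1, 2]`, while `[0, 3]` survives as `[3]`.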
2 changes: 1 addition & 1 deletion montreal_forced_aligner/abc.py
@@ -699,7 +699,7 @@ def cleanup(self) -> None:
 logger.info(f"Done! Everything took {time.time() - self.start_time:.3f} seconds")
 if config.FINAL_CLEAN:
 logger.debug(
-"Cleaning up temporary files, use the --debug flag to keep temporary files."
+"Cleaning up temporary files, use the --no_final_clean flag to keep temporary files."
 )
 if hasattr(self, "delete_database"):
 if config.USE_POSTGRES:
7 changes: 4 additions & 3 deletions montreal_forced_aligner/acoustic_modeling/monophone.py
@@ -133,9 +133,10 @@ def _run(self):
 writer.Close()
 self.callback((accumulator.transition_accs, accumulator.gmm_accs))
 train_logger.info(f"Done {num_done} utterances, errors on {num_error} utterances.")
-train_logger.info(
-f"Overall avg like per frame (Gaussian only) = {tot_like/tot_t} over {tot_t} frames."
-)
+if tot_t:
+train_logger.info(
+f"Overall avg like per frame (Gaussian only) = {tot_like/tot_t} over {tot_t} frames."
+)


 class MonophoneTrainer(AcousticModelTrainingMixin):
17 changes: 14 additions & 3 deletions montreal_forced_aligner/acoustic_modeling/triphone.py
@@ -411,11 +411,22 @@ def _setup_tree(self, init_from_previous=False, initial_mix_up=True) -> None:
 silence_sets = [
 x for x in questions if silence_phone_id in x and x != [silence_phone_id]
 ]
+filtered = []
+existing_sets = {tuple(x) for x in questions}
 for q_set in silence_sets:
 train_logger.debug(", ".join([self.reversed_phone_mapping[x] for x in q_set]))
-questions = [
-x for x in questions if silence_phone_id not in x or x == [silence_phone_id]
-]
+
+for q_set in questions:
+if silence_phone_id not in q_set or q_set == [silence_phone_id]:
+filtered.append(q_set)
+continue
+q_set = [x for x in q_set if x != silence_phone_id]
+if not q_set:
+continue
+if tuple(q_set) in existing_sets:
+continue
+filtered.append(q_set)
+questions = filtered

 extra_questions = self.worker.extra_questions_mapping
 if extra_questions:
1 change: 1 addition & 0 deletions montreal_forced_aligner/alignment/base.py
@@ -1449,6 +1449,7 @@ def evaluate_alignments(
 align_phones,
 silence_phone=self.optional_silence_phone,
 custom_mapping=mapping,
+debug=config.DEBUG,
 )
 unaligned_utts = []
 utterances: typing.List[Utterance] = session.query(Utterance).options(
6 changes: 2 additions & 4 deletions montreal_forced_aligner/command_line/align.py
@@ -4,7 +4,6 @@
 from pathlib import Path

 import rich_click as click
-import yaml

 from montreal_forced_aligner import config
 from montreal_forced_aligner.alignment import PretrainedAligner
@@ -15,7 +14,7 @@
 validate_g2p_model,
 )
 from montreal_forced_aligner.data import WorkflowType
-from montreal_forced_aligner.helper import mfa_open
+from montreal_forced_aligner.helper import load_evaluation_mapping

 __all__ = ["align_corpus_cli"]

@@ -137,8 +136,7 @@ def align_corpus_cli(context, **kwargs) -> None:
 aligner.load_reference_alignments(reference_directory)
 mapping = None
 if custom_mapping_path:
-with mfa_open(custom_mapping_path, "r") as f:
-mapping = yaml.load(f, Loader=yaml.Loader)
+mapping = load_evaluation_mapping(custom_mapping_path)
 aligner.validate_mapping(mapping)
 reference_alignments = WorkflowType.reference
 else:
2 changes: 1 addition & 1 deletion montreal_forced_aligner/command_line/validate.py
@@ -104,6 +104,7 @@ def validate_corpus_cli(context, **kwargs) -> None:
 """
 if kwargs.get("profile", None) is not None:
 config.profile = kwargs.pop("profile")
+config.FINAL_CLEAN = True
 config.update_configuration(kwargs)
 kwargs["USE_THREADING"] = False

@@ -139,7 +140,6 @@ def validate_corpus_cli(context, **kwargs) -> None:
 validator.dirty = True
 raise
 finally:
-config.FINAL_CLEAN = True
 validator.cleanup()


12 changes: 10 additions & 2 deletions montreal_forced_aligner/corpus/acoustic_corpus.py
@@ -234,8 +234,12 @@ def load_reference_alignments(self, reference_directory: Path) -> None:
 max_id = p_id
 new_phones = []
 for root, _, files in os.walk(reference_directory, followlinks=True):
+if root.startswith("."): # Ignore hidden directories
+continue
 root_speaker = os.path.basename(root)
 for f in files:
+if f.startswith("."): # Ignore hidden files
+continue
 if f.endswith(".TextGrid"):
 file_name = f.replace(".TextGrid", "")
 file_id = session.query(File.id).filter_by(name=file_name).scalar()
@@ -988,6 +992,8 @@ def _load_corpus_from_source(self) -> None:
 for root, _, files in os.walk(self.audio_directory, followlinks=True):
 if self.stopped.is_set():
 return
+if root.startswith("."): # Ignore hidden directories
+continue
 exts = find_exts(files)
 exts.wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
 exts.other_audio_files = {
@@ -999,12 +1005,14 @@ def _load_corpus_from_source(self) -> None:
 with self.session() as session:
 import_data = DatabaseImportData()
 for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+if self.stopped.is_set():
+return
+if root.startswith("."): # Ignore hidden directories
+continue
 exts = find_exts(files)
 relative_path = (
 root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
 )
-if self.stopped.is_set():
-return
 if not use_audio_directory:
 all_sound_files = {}
 wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
2 changes: 2 additions & 0 deletions montreal_forced_aligner/corpus/helper.py
@@ -62,6 +62,8 @@ def find_exts(files: typing.List[str]) -> FileExtensions:
 """
 exts = FileExtensions(set(), {}, {}, {}, {})
 for full_filename in files:
+if full_filename.startswith("."): # Ignore hidden files
+continue
 try:
 filename, fext = full_filename.rsplit(".", maxsplit=1)
 except ValueError:
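For comparison, here is a minimal standalone sketch of skipping hidden files and directories during a walk. It is an alternative formulation rather than the code in this commit: `os.walk` yields each `root` as a path joined onto the top directory, so this version filters on directory and file names and prunes `dirs` in place so that hidden directories are never entered. The `visible_files` helper and the sample layout are hypothetical.

```python
import os
import tempfile


def visible_files(top):
    """Yield file paths under `top`, skipping hidden files and directories."""
    for root, dirs, files in os.walk(top, followlinks=True):
        # Prune in place so os.walk never descends into hidden directories.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if name.startswith("."):  # Ignore hidden files
                continue
            yield os.path.join(root, name)


# Build a throwaway corpus layout with one visible and two hidden entries.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "speaker", ".cache"))
    for rel in ("speaker/a.wav", "speaker/.DS_Store", "speaker/.cache/b.wav"):
        open(os.path.join(top, rel), "w").close()
    found = sorted(os.path.relpath(p, top) for p in visible_files(top))
```

Only `speaker/a.wav` is reported; the hidden file and the contents of the hidden directory are skipped.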
9 changes: 6 additions & 3 deletions montreal_forced_aligner/corpus/multiprocessing.py
@@ -106,6 +106,8 @@ def run(self) -> None:
 if self.audio_directory and os.path.exists(self.audio_directory):
 use_audio_directory = True
 for root, _, files in os.walk(self.audio_directory, followlinks=True):
+if root.startswith("."): # Ignore hidden directories
+continue
 exts = find_exts(files)
 wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
 other_audio_files = {
@@ -114,11 +116,12 @@ def run(self) -> None:
 all_sound_files.update(other_audio_files)
 all_sound_files.update(wav_files)
 for root, _, files in os.walk(self.corpus_directory, followlinks=True):
-exts = find_exts(files)
-relative_path = root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
-
 if self.stopped.is_set():
 break
+if root.startswith("."): # Ignore hidden directories
+continue
+exts = find_exts(files)
+relative_path = root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
 if not use_audio_directory:
 all_sound_files = {}
 exts.wav_files = {k: os.path.join(root, v) for k, v in exts.wav_files.items()}
13 changes: 8 additions & 5 deletions montreal_forced_aligner/corpus/text_corpus.py
@@ -65,13 +65,14 @@ def _load_corpus_from_source_mp(self) -> None:
 file_count = 0
 with tqdm(total=1, disable=config.QUIET) as pbar, self.session() as session:
 for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+if self.stopped.is_set():
+break
+if root.startswith("."): # Ignore hidden directories
+continue
 exts = find_exts(files)
 relative_path = (
 root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
 )
-
-if self.stopped.is_set():
-break
 for file_name in exts.identifiers:
 if self.stopped.is_set():
 break
@@ -181,12 +182,14 @@ def _load_corpus_from_source(self) -> None:
 sanitize_function = getattr(self, "sanitize_function", None)
 with self.session() as session:
 for root, _, files in os.walk(self.corpus_directory, followlinks=True):
+if self.stopped:
+return
+if root.startswith("."): # Ignore hidden directories
+continue
 exts = find_exts(files)
 relative_path = (
 root.replace(str(self.corpus_directory), "").lstrip("/").lstrip("\\")
 )
-if self.stopped:
-return
 for file_name in exts.identifiers:
 wav_path = None
 if file_name in exts.lab_files:
9 changes: 3 additions & 6 deletions montreal_forced_aligner/dictionary/mixins.py
@@ -260,11 +260,7 @@ def get_base_phone(self, phone: str) -> str:
 @property
 def extra_questions_mapping(self) -> Dict[str, List[str]]:
 """Mapping of extra questions for the given phone set type"""
-mapping = {"silence_question": []}
-for p in sorted(self.silence_phones):
-mapping["silence_question"].append(p)
-if self.position_dependent_phones:
-mapping["silence_question"].extend([p + x for x in self.positions])
+mapping = {}
 for k, v in self.phone_set_type.extra_questions.items():
 if k not in mapping:
 mapping[k] = []
@@ -427,7 +423,8 @@ def _generate_positional_list(self, phones: Set[str]) -> List[str]:
 List of positional phones, sorted by base phone
 """
 positional_phones = []
-phones |= {self.get_base_phone(p) for p in phones}
+if not hasattr(self, "acoustic_model"):
+phones |= {self.get_base_phone(p) for p in phones}
 for p in sorted(phones):
 if p not in self.non_silence_phones:
 continue
2 changes: 1 addition & 1 deletion montreal_forced_aligner/dictionary/multispeaker.py
@@ -152,7 +152,7 @@ def tokenizers(self):
 bracketed_word=self.bracketed_word,
 cutoff_word=self.cutoff_word,
 ignore_case=self.ignore_case,
-use_g2p=self.use_g2p,
+use_g2p=self.use_g2p or getattr(self, "g2p_model", None) is not None,
 clitic_set=clitic_set,
 grapheme_set=grapheme_set,
 )