
review #1

Merged
merged 68 commits into from
Jan 16, 2024

Conversation

Lallapallooza
Collaborator

No description provided.

Dockerfile Outdated
@@ -0,0 +1,25 @@
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
Collaborator Author

Why all these contortions installing cuDNN by hand? You can take a ready-made image that ships cuDNN, for example
nvcr.io/nvidia/tensorrt:22.08-py3

A more suitable one can probably be found.

run_docker.sh Outdated
@@ -0,0 +1,9 @@
#!/bin/bash

app=$PWD
Collaborator Author

Use $(pwd) here.
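For context, $(pwd) is command substitution (it runs the pwd builtin), while $PWD is an inherited shell variable; both resolve the current directory. A minimal sketch of the suggested line, with the rest of run_docker.sh left out:

```shell
#!/bin/bash
# Command substitution, as suggested in the review, instead of the $PWD variable.
# $(pwd) queries the pwd builtin each time; $PWD trusts the inherited environment.
app=$(pwd)
echo "$app"
```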

Collaborator Author

Comments are needed on every field. Here is an example of what GPT-4 generated; fix it up so it reads better, i.e. keep the grouping:

DATA OPTIONS

MODEL OPTIONS

TRAIN OPTIONS

AUGMENTATION OPTIONS

And trim the redundancy in the comments.

Wherever lines appear in the style

# Section
param1 : VeryLongType1 = "value1" # doc1
long_param_name : Type1 = "value1" # doc2

format them as

# Section
param1          : VeryLongType1 = "value1" # doc1
long_param_name : Type1         = "value1" # doc2
@dataclass
class TrainConfig:
    # The computing platform for training: 'cuda' for NVIDIA GPUs or 'cpu' for CPU-based training.
    device: str = "cuda"

    # Directory path where the MUSDB18-HQ dataset is stored or to be downloaded.
    musdb_path: str = "musdb18hq"

    # Directory path for saving training metadata, like track names and lengths.
    metadata_train_path: str = "metadata"

    # Directory path for saving testing metadata.
    metadata_test_path: str = "metadata1"

    # Length (in seconds) of each audio segment used during training. Shorter segments can help in learning finer details.
    segment: int = 5

    # Batch size for training, determining how many samples are processed together. Affects memory usage and gradient updates.
    batch_size: int = 6

    # Whether to shuffle the training dataset at the beginning of each epoch. Shuffling can help in reducing overfitting.
    shuffle_train: bool = True

    # Whether to shuffle the validation dataset. Generally not needed as validation performance is independent of data order.
    shuffle_valid: bool = False

    # Whether to drop the last incomplete batch in case the dataset size isn't divisible by the batch size.
    drop_last: bool = True

    # Number of worker processes used for loading data. Higher numbers can speed up data loading.
    num_workers: int = 2

    # Strategy for monitoring metrics to save model checkpoints: 'min' for saving when a monitored metric is minimized.
    metric_monitor_mode: str = "min"

    # Number of best-performing model weights to save based on the monitored metric.
    save_top_k_model_weights: int = 1

    # PM_Unet model configurations
    # Specific sources to target in source separation, e.g., separating drums, bass, etc.
    model_source: tuple = ("drums", "bass", "other", "vocals")

    # The depth of the U-Net architecture, affecting its capacity and complexity.
    model_depth: int = 4

    # Number of initial channels in U-Net layers, influencing the model's learning capability.
    model_channel: int = 28

    # Indicates whether the input audio should be treated as mono (True) or stereo (False).
    is_mono: bool = False

    # Whether to utilize masking within the model for source separation.
    mask_mode: bool = False

    # Mode of skip connections in U-Net ('concat' for concatenation, 'add' for summation).
    skip_mode: str = "concat"

    # Number of bins used in Fast Fourier Transform (FFT) during audio processing.
    nfft: int = 4096

    # Determines whether to use LSTM layers as the bottleneck in the U-Net architecture.
    bottlneck_lstm: bool = True

    # Number of LSTM layers if bottleneck is enabled, adding to the model's complexity and ability to capture temporal dependencies.
    layers: int = 2

    # Flag to decide whether to apply Short Time Fourier Transform (STFT) on input audio.
    stft_flag: bool = True

    # Augmentation parameters
    # Maximum number of samples to shift in time during data augmentation, helping the model generalize over temporal variations.
    shift: int = 8192

    # Probability of applying pitch shift augmentation to the audio, introducing pitch variability in training data.
    pitchshift_proba: float = 0.2

    # Range of semitone shifts for pitch shift augmentation applied specifically to vocal tracks.
    vocals_min_semitones: int = -5
    vocals_max_semitones: int = 5

    # Range of semitone shifts for pitch shift augmentation applied to non-vocal tracks.
    other_min_semitones: int = -2
    other_max_semitones: int = 2

    # Flag to enable pitch shift augmentation on non-vocal sources.
    pitchshift_flag_other: bool = False

    # Probability of applying time stretching or compression to alter the speed of the audio without changing its pitch.
    time_change_proba: float = 0.2

    # Factors for time stretching/compression, defining the range and intensity of this augmentation.
    time_change_factors: tuple = (0.8, 0.85, 0.9, 0.95, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3)

    # Probability of remixing audio tracks, useful for training the model to recognize and separate blended sounds.
    remix_proba: float = 1

    # Number of tracks to group together for the remix augmentation.
    remix_group_size: int = batch_size

    # Probability and range of scaling the amplitude of audio tracks, introducing dynamic range variability.
    scale_proba: float = 1
    scale_min: float = 0.25
    scale_max: float = 1.25

    # Probability of applying a fade effect to the audio, simulating natural starts and ends of sounds.
    fade_mask_proba: float = 0.1

    # Probability of doubling one channel's audio to both channels, simulating mono audio in a stereo setup.
    double_proba: float = 0.1

    # Probability of reversing a segment of the audio track, useful for capturing reversed audio patterns.
    reverse_proba: float = 0.2

    # Probability and depth of mixing tracks together to create mashups, aiding the model in learning from complex blends of sounds.
    mushap_proba: float = 0.0
    mushap_depth: int = 2

    # Loss function parameters
    # Multiplicative factors for different components of the loss function, affecting the emphasis on different aspects of the training objective.
    factor: int = 1
    c_factor: int = 1

    # Number of FFT bins for calculating loss, impacting the granularity of frequency-based loss computation.
    loss_nfft: tuple = (4096,)

    # Gamma parameter for adjusting the focus of the loss function on certain aspects of the audio spectrum.
    gamma: float = 0.3

    # Learning rate for the optimizer, a crucial parameter for training convergence.
    lr: float = 0.5 * 3e-3

    # Period of the cosine annealing schedule in learning rate adjustment, influencing the adaptation of learning rate over time.
    T_0: int = 40

    # Maximum number of training epochs, defining the total duration of the training process.
    max_epochs: int = 100

    # Precision of training computations, e.g., '16' for 16-bit floating-point precision, reducing memory usage and potentially speeding up training.
    precision: str = "16"

    # Gradient clipping value to prevent exploding gradients, aiding in training stability.
    grad_clip: float = 0.5
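As a side note on why a flat dataclass like this is convenient: per-run overrides need no edits to the file. A minimal sketch, with TrainConfig abbreviated to three of the fields above:

```python
from dataclasses import dataclass, replace

@dataclass
class TrainConfig:
    # Abbreviated stand-in for the full config above.
    device: str = "cuda"
    batch_size: int = 6
    lr: float = 0.5 * 3e-3

# Defaults for a full run; replace() returns a copy with selected overrides,
# leaving the base config untouched.
base = TrainConfig()
debug = replace(base, device="cpu", batch_size=2)
print(debug.device, debug.batch_size, base.batch_size)
```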

channels=2,
ext=File.EXT,
):
"""
Collaborator Author

Change it to:

"""
A dataset class for audio source separation, compatible with WAV (or MP3) files.
This class allows training with arbitrary sources, where each audio track is represented by a separate folder within the specified root directory. Each folder should contain audio files for different sources, named as `{source}.{ext}`.

Args:
    root (Path or str): The root directory of the dataset where audio tracks are stored.
    metadata (dict): Metadata information generated by the `build_metadata` function. It contains details like track names and lengths.
    sources (list[str]): A list of source names to be separated, e.g., ['drums', 'vocals'].
    segment (Optional[float]): The length of each audio segment in seconds. If `None`, the entire track is used.
    shift (Optional[float]): The stride in seconds between samples. Determines the overlap between consecutive audio segments.
    normalize (bool): If True, normalizes the input audio based on the entire track's statistics, not just individual segments.
    samplerate (int): The target sample rate. Audio files with a different sample rate will be resampled to this rate.
    channels (int): The target number of audio channels. If different, audio will be converted accordingly.
    ext (str): The file extension of the audio files, default is '.wav'.

Note:
    The `samplerate` and `channels` parameters are used to ensure consistency across the dataset. They allow on-the-fly conversion of audio properties to match the target specifications.
"""


def resolve_weigths(self):
    if self.model_bottlneck_lstm:
        self.weights_path = self.config.weights_dir / "weight_LSTM.pt"
Collaborator Author

Hardcoding weight filenames is bad.

)

if model.bottlneck_lstm:
    weights_path = config.weights_dir / "weight_LSTM.pt"
Collaborator Author

Hardcoding weight names is bad; pass the path as a parameter.
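One way to act on this, sketched with hypothetical names (resolve_weights and the conv_name default are not from the repo, only the LSTM filename is):

```python
from pathlib import Path

def resolve_weights(weights_dir: Path, use_lstm: bool,
                    lstm_name: str = "weight_LSTM.pt",
                    conv_name: str = "weight_conv.pt") -> Path:
    # Filenames arrive as parameters with defaults, so callers can override
    # them instead of relying on names hardcoded inside the function.
    return weights_dir / (lstm_name if use_lstm else conv_name)

print(resolve_weights(Path("weights"), use_lstm=True))
```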

def istft(self, z):
    return self.model.stft.istft(z, self.length_wave)

SEGMENT_WAVE = 44100
Collaborator Author

Hardcoding is bad; pass this as a parameter.
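A sketch of passing the value in via argparse instead of a module-level constant; the flag name is illustrative, and the default preserves the current 44100-sample behaviour:

```python
import argparse

parser = argparse.ArgumentParser()
# Default keeps today's behaviour; callers can override it per run.
parser.add_argument("--segment-wave", type=int, default=44100,
                    help="segment length in samples (sample rate * seconds)")
args = parser.parse_args([])  # empty list: use defaults for this demo
print(args.segment_wave)
```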

)

model_path = str(
    args.out_dir + f"/{args.class_name}_outer_stft_{SEGMENT_WAVE / 44100:.1f}"
Collaborator Author

Hardcoding is bad; pass this as a parameter.

Collaborator
@d-a-yakovlev Dec 29, 2023

model_filename = f"{args.class_name}_outer_stft_{config.segment_duration:.1f}"
model_path = args.out_dir + '/' + model_filename

ok?


if start_converter:
    subprocess.Popen(
        ["python3", "/app/converter.py"], executable="/bin/bash", shell=True
Collaborator Author

Hardcoding is bad; pass this as a parameter.
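A sketch of the same idea for the subprocess call; start_converter and its parameter are hypothetical names. The list form also removes the need for shell=True, and sys.executable avoids hardcoding "python3":

```python
import subprocess
import sys

def start_converter(converter_script: str) -> subprocess.Popen:
    # The script path arrives as a parameter instead of being hardcoded;
    # a plain argument list is passed straight to the interpreter, no shell.
    return subprocess.Popen([sys.executable, converter_script])

# Usage (path shown only as an example):
# proc = start_converter("/app/converter.py")
```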

Collaborator Author

Give this a proper name: tf_lite_stream.py

@maks00170 maks00170 merged commit 3bea291 into review Jan 16, 2024
2 checks passed