Hi @Peetee06, thanks for reaching out! It's interesting that you are intentionally trying to get your model to overfit, and it looks like you're onto the right idea with bumping up early stopping (which, by the way, you can also set to -1 to disable it entirely). Some questions on my end:
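For reference, a minimal sketch of disabling early stopping through Ludwig's Python API. The feature definitions are placeholders, not the actual config from this thread, and this assumes a Ludwig version where the section is named `trainer` (older releases used `training`):

```python
from ludwig.api import LudwigModel

# Sketch config -- feature names and types are placeholders,
# not the configuration discussed in this thread.
config = {
    "input_features": [{"name": "audio_path", "type": "audio"}],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {
        "epochs": 300,     # train for a long time
        "early_stop": -1,  # -1 disables early stopping entirely
    },
}

model = LudwigModel(config)
# train_stats is a dict of per-split statistics; its exact layout
# varies by Ludwig version, so check the docs for your release.
train_stats, _, _ = model.train(dataset="train.csv")
```

With `early_stop: -1` the trainer runs all 300 epochs regardless of what the validation loss does.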
---
@Peetee06 thanks for sharing the info. Happy to assist in getting your model to overfit. In general, overfitting depends on two factors:

1. Does your model have sufficient capacity to fit the dataset? Training for longer and disabling early stopping are good starting points. Additionally, you'd want to make sure that your model has enough capacity (i.e., parameters) to overfit on the data. To figure out whether the model is overfitting or underfitting, can you share the training curves for your model? You should be able to generate them by setting up TensorBoard (instructions here). Ideally, when overfitting, your training loss should go to zero while the validation and test loss continue to increase. If you aren't seeing that, can you try a larger model? (A sketch of plotting these curves follows after this list.)
2. Does your dataset have enough signal to train on? To figure this out, I'd recommend trying to visualize your data if possible. However, visualizing audio data points is not as straightforward and requires something like t-SNE projections of embeddings of your data, so I would recommend starting with the suggestions listed above (a t-SNE sketch also follows below).

Happy to follow up when you can share your training curves or the output of trying larger models.
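As a rough sketch of plotting those learning curves directly from the statistics Ludwig returns (the nested dictionary keys here are an assumption based on Ludwig's `training_statistics.json` layout; verify them for your version, or use TensorBoard as suggested above):

```python
import matplotlib.pyplot as plt

def plot_learning_curves(train_stats, out_path="learning_curves.png"):
    """Plot per-epoch combined loss for each data split.

    `train_stats` is the first element returned by LudwigModel.train()
    (see the earlier sketch). The ["combined"]["loss"] keys are an
    assumption about the statistics layout, not a guaranteed API.
    """
    for split in ("training", "validation", "test"):
        if split in train_stats:
            plt.plot(train_stats[split]["combined"]["loss"], label=split)
    plt.xlabel("epoch")
    plt.ylabel("combined loss")
    plt.legend()
    plt.savefig(out_path)
```

If the training curve plateaus well above zero instead of heading toward it, that points to underfitting, i.e., too little capacity.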
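And a minimal sketch of the t-SNE visualization mentioned in point 2, assuming you have already extracted an array of audio embeddings (the file names and shapes are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: an (n_samples, n_dims) array of audio embeddings
# (e.g. encoder outputs) and one class label per sample.
embeddings = np.load("audio_embeddings.npy")
labels = np.load("labels.npy")

# Project to 2-D. Note: perplexity must be smaller than n_samples.
projected = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(projected[:, 0], projected[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of audio embeddings")
plt.savefig("tsne_embeddings.png")
```

Clearly separated clusters per class would suggest the dataset carries learnable signal; a single undifferentiated blob would suggest it does not.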
---
I am currently working on my master's thesis on the topic of technical debt in ML. I am implementing the Infrastructure Tests proposed in "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" by Breck et al. using Ludwig. For the test "Infra 2: Model specification code is unit tested", they propose training a model to overfit on a test dataset as an indicator of whether the model can actually learn from the given data.
I have not yet been able to get a model to overfit using Ludwig. What I have tried so far is "disabling" early stopping (setting it to the number of epochs) and setting the number of epochs to a high value (300). The model converges to a loss of 0.38 ± 0.04 and an accuracy of 0.87 ± 0.02 (train/validation/test).
This is the configuration I used:
From what I understand, if the model were overfitting, the validation and test accuracy should drop significantly compared to the training accuracy.
How do I get the model to overfit so I can implement the Infra 2 test?
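A hedged sketch of what such an Infra 2 overfit test could look like with Ludwig, using a toy tabular dataset for brevity (the same pattern applies to an audio config). The statistics keys, feature types, and thresholds are assumptions to verify against your Ludwig version, not Breck et al.'s or Ludwig's prescribed implementation:

```python
import pandas as pd
from ludwig.api import LudwigModel

def test_model_can_overfit_tiny_dataset():
    # Tiny, trivially learnable toy dataset standing in for the real test data.
    df = pd.DataFrame({
        "x": [0.0, 0.1, 0.9, 1.0] * 5,
        "label": ["a", "a", "b", "b"] * 5,
    })
    config = {
        # "number" is the type name in recent Ludwig; older versions used "numerical".
        "input_features": [{"name": "x", "type": "number"}],
        "output_features": [{"name": "label", "type": "category"}],
        "trainer": {"epochs": 200, "early_stop": -1},
    }
    model = LudwigModel(config)
    train_stats, _, _ = model.train(dataset=df)
    # Assumed statistics layout; verify against your Ludwig version.
    final_loss = train_stats["training"]["combined"]["loss"][-1]
    assert final_loss < 0.05, f"model failed to overfit (loss={final_loss})"
```

If an assertion like this keeps failing even with more epochs or a larger model, that points to a bug in the model specification code, which is exactly what the Infra 2 test is meant to catch.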