🐛[BUG]: Compute dataset statistics on training data #606

Open
albertocarpentieri opened this issue Jul 17, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

@albertocarpentieri

Version

0.6.0

On which installation method(s) does this occur?

Docker

Describe the issue

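In examples/weather/dataset_download/start_mirror.py, the global_means and global_stds files (used later for normalization) are computed over the entire dataset rather than only the training set. As a result, the normalization constants are influenced by data that is held out from training.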

Current implementation

    if cfg.compute_mean_std:
        stats_path = os.path.join(cfg.hdf5_store_path, "stats")
        print(f"Saving global mean and std at {stats_path}")
        if not os.path.exists(stats_path):
            os.makedirs(stats_path)
        era5_mean = np.array(
            era5_xarray.mean(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
        )
        era5_std = np.array(
            era5_xarray.std(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
        )
        print(f"Finished saving global mean and std at {stats_path}")

Proposed modification

    if cfg.compute_mean_std:
        # Compute stats only on training data
        train_era5_xarray = era5_xarray.sel(
            time=era5_xarray.time.dt.year.isin(train_years)
        )
        stats_path = os.path.join(cfg.hdf5_store_path, "stats")
        print(f"Saving global mean and std at {stats_path}")
        if not os.path.exists(stats_path):
            os.makedirs(stats_path)
        era5_mean = np.array(
            train_era5_xarray.mean(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
        )
        era5_std = np.array(
            train_era5_xarray.std(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
        )
        print(f"Finished saving global mean and std at {stats_path}")
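
As a rough check of what the proposed change does, here is a minimal NumPy-only sketch on synthetic data. The (time, channel, latitude, longitude) array layout, the years vector, and the helper name channel_stats are assumptions made for illustration and are not taken from start_mirror.py, which operates on an xarray object instead.

```python
import numpy as np


def channel_stats(data, years, train_years):
    """Per-channel mean and std computed over the selected years only.

    data:  array of shape (time, channel, lat, lon)  (assumed layout)
    years: integer array of shape (time,), the year of each time step
    """
    # Boolean mask over the time axis, analogous to year.isin(train_years)
    train_mask = np.isin(years, train_years)
    train = data[train_mask]
    # Reduce over time, latitude and longitude; keep one value per channel
    mean = train.mean(axis=(0, 2, 3)).reshape(1, -1, 1, 1)
    std = train.std(axis=(0, 2, 3)).reshape(1, -1, 1, 1)
    return mean, std


rng = np.random.default_rng(0)
years = np.repeat([2018, 2019, 2020], 4)  # 12 synthetic time steps
data = rng.normal(size=(12, 2, 3, 5))
data[years == 2020] += 10.0  # held-out year with a shifted distribution

# Statistics restricted to training years vs. statistics over everything
train_mean, train_std = channel_stats(data, years, [2018, 2019])
full_mean, full_std = channel_stats(data, years, [2018, 2019, 2020])
```

In this toy setup the full-dataset mean and std are pulled toward the shifted held-out year, while the training-only statistics are not, which is the effect the proposed modification is meant to avoid.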

Minimum reproducible example

No response

Relevant log output

No response

Environment details

Modulus Docker container version 24.04
@albertocarpentieri added the "? - Needs Triage" and "bug" labels on Jul 17, 2024
@albertocarpentieri changed the title from "🐛[BUG]: Compute dataset statistics on test set" to "🐛[BUG]: Compute dataset statistics on training data" on Jul 17, 2024
@mnabian
Collaborator

mnabian commented Oct 17, 2024

@loliverhennigh is the proposed modification acceptable to you?

@mnabian removed the "? - Needs Triage" label on Oct 17, 2024
@loliverhennigh
Collaborator

This seems fine to me, @albertocarpentieri, if you want to submit a PR. I'll mention that with the unified recipe for training global weather models, we compute the statistics on the fly with a moving average.
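
For readers unfamiliar with the on-the-fly approach mentioned above, one possible form is an exponential moving average of per-channel batch statistics. This is only a sketch under assumptions: the class name RunningChannelStats, the momentum parameter, and the (batch, channel, lat, lon) layout are hypothetical, and the unified recipe may compute its statistics differently.

```python
import numpy as np


class RunningChannelStats:
    """Exponential moving average of per-channel mean and variance.

    Hypothetical helper for illustration; not part of Modulus.
    """

    def __init__(self, num_channels, momentum=0.1):
        self.momentum = momentum
        self.mean = np.zeros((1, num_channels, 1, 1))
        self.var = np.ones((1, num_channels, 1, 1))
        self.initialized = False

    def update(self, batch):
        # batch is assumed to have shape (batch, channel, lat, lon)
        batch_mean = batch.mean(axis=(0, 2, 3), keepdims=True)
        batch_var = batch.var(axis=(0, 2, 3), keepdims=True)
        if not self.initialized:
            # Seed the averages with the first batch to avoid bias toward 0/1
            self.mean, self.var = batch_mean, batch_var
            self.initialized = True
        else:
            m = self.momentum
            self.mean = (1 - m) * self.mean + m * batch_mean
            self.var = (1 - m) * self.var + m * batch_var

    def normalize(self, batch, eps=1e-6):
        # Broadcasts the (1, C, 1, 1) statistics over the batch
        return (batch - self.mean) / np.sqrt(self.var + eps)


# Usage on synthetic training batches drawn from N(5, 2^2)
rng = np.random.default_rng(0)
stats = RunningChannelStats(num_channels=2)
for _ in range(50):
    stats.update(rng.normal(loc=5.0, scale=2.0, size=(8, 2, 4, 4)))
normalized = stats.normalize(rng.normal(loc=5.0, scale=2.0, size=(8, 2, 4, 4)))
```

Because such an accumulator would only ever be fed training batches, its estimates never see held-out data. The trade-off is that the averaged per-batch variance is an approximation, not the exact dataset variance.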
