This folder contains a set of subdirectories, one for each model, each containing the submitted model output files for that model. The structure of these directories and their contents follows the model output guidelines in our documentation. Documentation for hub submissions specifically is provided below.
All model output files should be submitted directly to a team's subdirectory within the model-output/ folder. Data in this directory should be added to the repository through a pull request so that automatic data validation checks are run.
These instructions provide detail about the data format as well as validation that you can do prior to this pull request. In addition, we describe metadata that each model should provide in the model-metadata folder.
Table of Contents
- Model output details
- Submission file format
- Data formatting
- Model output validation
- Weekly ensemble build
- Policy on late submissions
This hub follows hubverse data standards. Submissions must include either mean outputs or sample-based model outputs. If sample-based model outputs are submitted and means are not, modelers should assume that these samples may be used to compute a mean prediction, which may be scored.
We use the term “model task” below to refer to a prediction for a specific clade, location and target date. For example, if mean model outputs are submitted, there will be one value between 0 and 1 for each model task. The submitted values for all clades must sum to 1 (within +/- 0.001) for a given location and target date. As we will describe in further detail below, the target for prediction is the proportion of circulating viral genomes for a given location and target date amongst infected individuals that are sequenced for SARS-CoV-2.
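The sum-to-one requirement can be checked before submission. The following is a minimal sketch, assuming the rows have already been loaded into plain dictionaries keyed by column name; the function name `find_bad_sums` and the example rows are hypothetical, and this is not the hub's own validation code.

```python
from collections import defaultdict

TOLERANCE = 0.001  # allowed deviation from 1, as stated above


def find_bad_sums(rows):
    """Return {(location, target_date): total} for groups whose mean values do not sum to 1."""
    totals = defaultdict(float)
    for row in rows:
        if row["output_type"] == "mean":
            totals[(row["location"], row["target_date"])] += row["value"]
    return {group: total for group, total in totals.items()
            if abs(total - 1.0) > TOLERANCE}


# Hypothetical rows: the first target date sums to 1, the second does not.
rows = [
    {"location": "MA", "target_date": "2024-09-23", "clade": "24A", "output_type": "mean", "value": 0.4},
    {"location": "MA", "target_date": "2024-09-23", "clade": "other", "output_type": "mean", "value": 0.6},
    {"location": "MA", "target_date": "2024-09-24", "clade": "24A", "output_type": "mean", "value": 0.4},
    {"location": "MA", "target_date": "2024-09-24", "clade": "other", "output_type": "mean", "value": 0.5},
]
print(find_bad_sums(rows))  # reports only the 2024-09-24 group
```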
To submit probabilistic predictions, a sample format is used to encode samples from the predictive distribution for each model task. The hub requires exactly 100 samples for each model task. One key advantage to submitting sample-based output is that dependence can be encoded across horizons (corresponding to trajectories of variant prevalence over time), or even across locations (see details in Hubverse sample model-output specifications). For this hub, we require that samples be submitted in such a way as to imply that they are structured into trajectories across clades and horizons. (See following section for how variants are classified into clade categories.) In particular, a common sample ID will be used in multiple rows of the submission file with different combinations of clade and target date. This means that
- at each location and target date, a common sample ID (in the `output_type_id` column) ensures that the clade proportions sum to 1, and
- for each location and clade, common sample IDs across `target_date` values allow us to draw trajectories by clade.
This specification corresponds to a hubverse-style "compound modeling task" that includes the following fields: `nowcast_date`, `location`. Samples then capture dependence across the complementary set of task IDs: `target_date`, `clade`.
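The trajectory structure described above can be checked mechanically. The sketch below is illustrative only (the function name `check_samples` and the dictionary-per-row representation are assumptions, and it is not the hub's validation code). Within each `nowcast_date` and `location`, it checks that every sample ID covers every `target_date` and `clade` combination exactly once, that the clade values for each target date sum to 1 within the stated tolerance, and that the expected number of samples is present.

```python
from collections import defaultdict


def check_samples(rows, expected_samples=100, tolerance=0.001):
    """Return a list of problem descriptions for the sample rows (empty if none found)."""
    values = defaultdict(dict)  # (unit, sample_id) -> {(target_date, clade): value}
    tasks = defaultdict(set)    # unit -> every (target_date, clade) seen for that unit
    ids = defaultdict(set)      # unit -> sample IDs seen for that unit
    problems = []
    for row in rows:
        if row["output_type"] != "sample":
            continue
        unit = (row["nowcast_date"], row["location"])  # compound modeling task
        task = (row["target_date"], row["clade"])
        sample_id = row["output_type_id"]
        if task in values[(unit, sample_id)]:
            problems.append(f"{unit} sample {sample_id}: duplicate row for {task}")
        values[(unit, sample_id)][task] = row["value"]
        tasks[unit].add(task)
        ids[unit].add(sample_id)

    for unit, sample_ids in ids.items():
        if len(sample_ids) != expected_samples:
            problems.append(f"{unit}: {len(sample_ids)} samples, expected {expected_samples}")

    for (unit, sample_id), by_task in values.items():
        missing = tasks[unit] - set(by_task)
        if missing:
            problems.append(f"{unit} sample {sample_id}: missing {sorted(missing)}")
        totals = defaultdict(float)
        for (target_date, _clade), value in by_task.items():
            totals[target_date] += value
        for target_date, total in totals.items():
            if abs(total - 1.0) > tolerance:
                problems.append(
                    f"{unit} sample {sample_id}: clades sum to {total:.4f} on {target_date}")
    return problems
```

An empty list means no problems were found by these particular checks; it does not guarantee that a file will pass the hub's automatic validation.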
We note that sample IDs present in the `output_type_id` column of submissions are not necessarily inherent properties of how the samples are generated, as they can be changed post hoc by a modeler. For example, some models may make nowcasts independently by target date, but the samples could be tied together either randomly or via some other correlation structure or secondary model to assign sample IDs that are consistent across target dates. As another example, some models may make forecasts that have joint dependence structure across locations as well as target dates. Sample IDs can be shared across locations as well, but this is not required for the submission to pass validation.
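As an illustration of the random option mentioned above, the following sketch ties together draws that were generated independently for each target date. It assumes each draw is a dictionary mapping clade to proportion that already sums to 1; the function name `tie_samples`, the seed argument, and the ID pattern (location code plus a zero-padded index) are hypothetical choices, not hub requirements.

```python
import random


def tie_samples(draws_by_date, nowcast_date, location, seed=0):
    """Randomly pair independent per-date draws and give each pairing one sample ID."""
    rng = random.Random(seed)
    rows = []
    for target_date in sorted(draws_by_date):
        draws = draws_by_date[target_date]
        order = list(range(len(draws)))
        rng.shuffle(order)  # random pairing of this date's draws with sample IDs
        for sample_index, draw_index in enumerate(order):
            sample_id = f"{location}{sample_index:02d}"
            for clade, value in draws[draw_index].items():
                rows.append({
                    "nowcast_date": nowcast_date,
                    "target_date": target_date,
                    "location": location,
                    "clade": clade,
                    "output_type": "sample",
                    "output_type_id": sample_id,
                    "value": value,
                })
    return rows
```

Because each draw keeps all of its clades together, the proportions for a given sample ID and target date still sum to 1 after pairing. A random pairing carries no information about dependence across dates, so trajectories built this way should be interpreted accordingly.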
To be included in the hub ensemble model, samples must be submitted; the mean forecast for the hub ensemble will be obtained as a summary of sample predictions.
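To make the idea of a mean obtained from samples concrete, here is a minimal sketch that averages one model's sample values for each model task. It is not the hub's ensemble procedure, which is not specified here; the function name `means_from_samples` is hypothetical, and `None` is used to stand in for the `NA` value of `output_type_id` on mean rows.

```python
from collections import defaultdict
from statistics import fmean


def means_from_samples(rows):
    """Average the sample values for each (nowcast_date, target_date, location, clade)."""
    grouped = defaultdict(list)
    for row in rows:
        if row["output_type"] == "sample":
            task = (row["nowcast_date"], row["target_date"], row["location"], row["clade"])
            grouped[task].append(row["value"])
    return [
        {
            "nowcast_date": nowcast_date,
            "target_date": target_date,
            "location": location,
            "clade": clade,
            "output_type": "mean",
            "output_type_id": None,  # stands in for NA on mean rows
            "value": fmean(values),
        }
        for (nowcast_date, target_date, location, clade), values in sorted(grouped.items())
    ]
```

If every sample's clade proportions sum to 1 for a given location and target date, the resulting means do as well, up to floating-point error.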
Submissions must be submitted as `.parquet` files and must follow a specific tabular data format. Every submission file must contain the following columns:
- `nowcast_date`: the date of the Wednesday submission deadline, in `YYYY-MM-DD` format.
- `target_date`: the date that a specific nowcast prediction is made for, in `YYYY-MM-DD` format.
- `location`: the two-letter abbreviation for a US state, including DC and PR for Washington DC and Puerto Rico.
- `clade`: the label for a Nextstrain clade (or "other"), as defined on a per-round basis in these files.
- `output_type`: the type of output represented by this row, either `mean` or `sample`.
- `output_type_id`: either `NA` for `mean` rows or, for `sample` rows, an alphanumeric sample ID value that links together rows from the same predictive sample from the model.
- `value`: the predicted proportion (between 0 and 1 inclusive) for the combination of `target_date`, `location` and `clade`.
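The column rules above translate into simple per-row checks. The sketch below is an illustration under stated assumptions, not the hub's validator: rows are plain dictionaries, `NA` is represented as `None`, and the function names are hypothetical. It does not check `location` or `clade` against the hub's allowed values.

```python
import re
from datetime import date

REQUIRED_COLUMNS = ["nowcast_date", "target_date", "location", "clade",
                    "output_type", "output_type_id", "value"]
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(text):
    """True if text is a valid calendar date written as YYYY-MM-DD."""
    if not isinstance(text, str) or not DATE_PATTERN.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def check_row(row):
    """Return a list of problems for one row (empty if none found)."""
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in row]
    if problems:
        return problems
    for column in ("nowcast_date", "target_date"):
        if not is_iso_date(row[column]):
            problems.append(f"{column} is not a YYYY-MM-DD date")
    if is_iso_date(row["nowcast_date"]):
        # Monday is 0 in date.weekday(), so Wednesday is 2.
        if date.fromisoformat(row["nowcast_date"]).weekday() != 2:
            problems.append("nowcast_date is not a Wednesday")
    sample_id = row["output_type_id"]
    if row["output_type"] == "mean":
        if sample_id is not None:
            problems.append("mean rows need a missing (NA) output_type_id")
    elif row["output_type"] == "sample":
        if not isinstance(sample_id, str) or not re.fullmatch(r"[A-Za-z0-9]+", sample_id):
            problems.append("sample rows need an alphanumeric output_type_id")
    else:
        problems.append("output_type must be mean or sample")
    value = row["value"]
    if not isinstance(value, (int, float)) or not 0 <= value <= 1:
        problems.append("value must be between 0 and 1 inclusive")
    return problems
```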
Here are a few example rows, showing `mean` values for ten unique modeling tasks (a modeling task is a unique combination of `location`, `target_date` and `clade`):
nowcast_date | target_date | location | clade | output_type | output_type_id | value |
---|---|---|---|---|---|---|
2024-09-25 | 2024-09-23 | MA | 24A | mean | NA | 0.1 |
2024-09-25 | 2024-09-23 | MA | 24B | mean | NA | 0.2 |
2024-09-25 | 2024-09-23 | MA | 24C | mean | NA | 0.05 |
2024-09-25 | 2024-09-23 | MA | recombinant | mean | NA | 0.6 |
2024-09-25 | 2024-09-23 | MA | other | mean | NA | 0.05 |
2024-09-25 | 2024-09-24 | MA | 24A | mean | NA | 0.12 |
2024-09-25 | 2024-09-24 | MA | 24B | mean | NA | 0.18 |
2024-09-25 | 2024-09-24 | MA | 24C | mean | NA | 0.02 |
2024-09-25 | 2024-09-24 | MA | recombinant | mean | NA | 0.6 |
2024-09-25 | 2024-09-24 | MA | other | mean | NA | 0.08 |
Here are a few example rows, showing two predictive samples for ten unique modeling tasks. Rows that share the same value in the `output_type_id` column are assumed to be drawn from the same predictive sample from the model:
nowcast_date | target_date | location | clade | output_type | output_type_id | value |
---|---|---|---|---|---|---|
2024-09-25 | 2024-09-23 | MA | 24A | sample | MA00 | 0.1 |
2024-09-25 | 2024-09-23 | MA | 24B | sample | MA00 | 0.2 |
2024-09-25 | 2024-09-23 | MA | 24C | sample | MA00 | 0.05 |
2024-09-25 | 2024-09-23 | MA | recombinant | sample | MA00 | 0.6 |
2024-09-25 | 2024-09-23 | MA | other | sample | MA00 | 0.05 |
2024-09-25 | 2024-09-24 | MA | 24A | sample | MA00 | 0.12 |
2024-09-25 | 2024-09-24 | MA | 24B | sample | MA00 | 0.18 |
2024-09-25 | 2024-09-24 | MA | 24C | sample | MA00 | 0.02 |
2024-09-25 | 2024-09-24 | MA | recombinant | sample | MA00 | 0.6 |
2024-09-25 | 2024-09-24 | MA | other | sample | MA00 | 0.08 |
2024-09-25 | 2024-09-23 | MA | 24A | sample | MA01 | 0.1 |
2024-09-25 | 2024-09-23 | MA | 24B | sample | MA01 | 0.2 |
2024-09-25 | 2024-09-23 | MA | 24C | sample | MA01 | 0.05 |
2024-09-25 | 2024-09-23 | MA | recombinant | sample | MA01 | 0.6 |
2024-09-25 | 2024-09-23 | MA | other | sample | MA01 | 0.05 |
2024-09-25 | 2024-09-24 | MA | 24A | sample | MA01 | 0.12 |
2024-09-25 | 2024-09-24 | MA | 24B | sample | MA01 | 0.18 |
2024-09-25 | 2024-09-24 | MA | 24C | sample | MA01 | 0.02 |
2024-09-25 | 2024-09-24 | MA | recombinant | sample | MA01 | 0.6 |
2024-09-25 | 2024-09-24 | MA | other | sample | MA01 | 0.08 |
The automatic checks in place for forecast files submitted to this repository validate both the filename and the file contents to ensure the file can be used in the visualization and ensemble forecasting.
Each model that submits forecasts for this project will have a unique subdirectory within the model-output/ directory in this GitHub repository where forecasts will be submitted. Each subdirectory must be named `team-model`, where `team` is the `team_abbr` field from the model metadata file and `model` is the `model_abbr` field from the model metadata file.
Both `team` and `model` should be fewer than 15 characters long and should not include hyphens or other special characters, with the exception of "_".
The combination of `team` and `model` should be distinct from that of any other model in the project.
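The naming rules for `team` and `model` can be expressed as a short pattern check. This is a sketch under one assumption that should be stated plainly: "special characters" is read here as anything other than ASCII letters, digits and "_". The function names are hypothetical.

```python
import re

# Assumption: only ASCII letters, digits and "_" are allowed.
# {1,14} enforces "fewer than 15 characters".
ABBR_PATTERN = re.compile(r"[A-Za-z0-9_]{1,14}")


def is_valid_abbr(text):
    """True if text satisfies the length and character rules for team or model."""
    return ABBR_PATTERN.fullmatch(text) is not None


def directory_name(team_abbr, model_abbr):
    """Build the team-model directory name, rejecting invalid abbreviations."""
    for label, value in (("team", team_abbr), ("model", model_abbr)):
        if not is_valid_abbr(value):
            raise ValueError(f"invalid {label} abbreviation: {value!r}")
    return f"{team_abbr}-{model_abbr}"
```

Excluding hyphens from both parts keeps the single hyphen in `team-model` unambiguous as the separator.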
The metadata file will be saved within the model-metadata directory in the Hub's GitHub repository, and should have the following naming convention: `team-model.yml`
Details on the content and formatting of metadata files are provided in the model-metadata README.
Each submission file name should have the following format: `YYYY-MM-DD-team-model.parquet`, where `YYYY` is the 4-digit year, `MM` is the 2-digit month, `DD` is the 2-digit day, `team` is the `team_abbr`, and `model` is the `model_abbr`.
The date `YYYY-MM-DD` is the `nowcast_date`. This should be the Wednesday submission deadline for each round.
The `team` and `model` in this file name must match the `team` and `model` in the name of the directory this file is in.
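The file naming rules can be checked from the path alone. The sketch below is illustrative (the function name `check_submission_path` and the example paths are hypothetical). It looks only at the file name without its extension and at the parent directory name, and confirms that the date is a Wednesday and that the `team-model` part matches the directory.

```python
import re
from datetime import date
from pathlib import PurePosixPath

STEM_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})-(.+)")


def check_submission_path(path):
    """Return a list of naming problems for a submission path (empty if none found)."""
    path = PurePosixPath(path)
    match = STEM_PATTERN.fullmatch(path.stem)
    if match is None:
        return ["file name does not start with YYYY-MM-DD-"]
    date_text, team_model = match.groups()
    problems = []
    try:
        nowcast_date = date.fromisoformat(date_text)
    except ValueError:
        problems.append(f"{date_text} is not a valid date")
    else:
        if nowcast_date.weekday() != 2:  # Monday is 0, so Wednesday is 2
            problems.append(f"{date_text} is not a Wednesday")
    if team_model != path.parent.name:
        problems.append(f"{team_model} does not match directory {path.parent.name}")
    return problems


# Hypothetical path following the conventions above.
print(check_submission_path("model-output/teamA-model1/2024-09-25-teamA-model1.parquet"))
```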
To ensure proper data formatting, validation checks are run automatically on pull requests that add new data to `model-output/`. Optionally, you may also run these validations locally.
When a pull request is submitted, the data are validated through GitHub Actions, which runs the tests present in the hubValidations package. The intent of these tests is to validate the requirements above. Please let us know if you face issues while running the tests.
Every Wednesday evening, we will generate an ensemble using the valid submissions received by that week's deadline. Some or all participant forecasts may be combined into an ensemble forecast to be published in real time along with the participant forecasts. In addition, some or all forecasts may be displayed alongside the output of a baseline model for comparison.
In order to ensure that forecasting is done in real-time, all forecasts are required to be submitted to this repository by Wednesday at 8pm ET each week. We do not accept late forecasts.