add option to remove windows with poor data quality #1059

Open. jasminerienecker wants to merge 3 commits into main.

@jasminerienecker (Contributor) commented Jul 9, 2024

This PR adds the parameter data_availability_threshold (defaults to 0.0 to maintain current functionality) to all models that inherit from the BaseWindows class. This parameter allows us to discard windows where the percentage of good-quality data points is below the threshold. The quality of a data point is determined by the corresponding value in the available_mask column.

This is functionality I currently need, as my dataset has many large gaps and I don't want to train the model on them.

I have added a test to the end of the base_windows notebook.
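
The filtering described above can be sketched without any library code. This is only an illustration, not the PR's implementation: the helper names `window_availability` and `keep_windows` are hypothetical, and the sketch assumes the threshold is a fraction between 0 and 1 and that `available_mask` holds 1 for good points and 0 for missing ones.

```python
def window_availability(available_mask, window_size, step=1):
    """Return the fraction of available points in each sliding window."""
    fractions = []
    for start in range(0, len(available_mask) - window_size + 1, step):
        window = available_mask[start:start + window_size]
        fractions.append(sum(window) / window_size)
    return fractions


def keep_windows(available_mask, window_size, threshold=0.0, step=1):
    """Return start indices of windows whose availability is not below the threshold."""
    fractions = window_availability(available_mask, window_size, step)
    return [i * step for i, frac in enumerate(fractions) if frac >= threshold]


# Toy mask with a gap of four missing points in the middle (assumed data).
mask = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

# A threshold of 0.0 keeps every window, matching the "no change by default" intent.
kept_default = keep_windows(mask, window_size=4)

# A threshold of 0.5 drops windows where fewer than half of the points are available.
kept_strict = keep_windows(mask, window_size=4, threshold=0.5)
```

With this toy mask, the strict setting drops the three windows that sit mostly inside the gap and keeps the windows at either end.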


@elephaint (Contributor) commented

@jasminerienecker Thanks for the PR, I think I understand what you're trying to do.

Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?

@jasminerienecker (Contributor, Author) commented Jul 11, 2024

> @jasminerienecker Thanks for the PR, I think I understand what you're trying to do.
>
> Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?

@elephaint Going through the code, it seems the base_windows class assumes all the timesteps are available. For example, if your data is at one-minute resolution but there is a gap of 10 minutes, the windows are created as if no timesteps were missing.

I think this means you have to keep the rows with missing values in the dataset. But if there are longer chunks of missing data (that is, most of the values in a window are NaN), this could interfere with model training. This solution is a way of keeping the temporal information while not training the model on windows where the majority of the data is unavailable.
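
To make the point above concrete, here is a small sketch of what happens when rows are simply dropped, and of the alternative of keeping every timestep and flagging the missing ones. It is not library code: `fill_gaps` is a hypothetical helper, the timestamps and values are made up, and the 0/1 flag mirrors the available_mask idea discussed in this thread.

```python
from datetime import datetime, timedelta


def fill_gaps(rows, freq):
    """Put sorted (timestamp, value) rows on a regular grid.

    Returns (timestamp, value, available_mask) tuples; missing timesteps
    get a NaN value and an available_mask of 0.
    """
    observed = dict(rows)
    start, end = rows[0][0], rows[-1][0]
    filled = []
    ts = start
    while ts <= end:
        if ts in observed:
            filled.append((ts, observed[ts], 1))
        else:
            filled.append((ts, float("nan"), 0))
        ts += freq
    return filled


# One-minute data with readings at minutes 0-2 and 13-14, so 10 timesteps are missing.
t0 = datetime(2024, 7, 9, 12, 0)
rows = [(t0 + timedelta(minutes=m), float(m)) for m in (0, 1, 2, 13, 14)]

# If the gap rows are dropped, a positional window of 4 rows silently spans 13 minutes.
span_of_first_four_rows = rows[3][0] - rows[0][0]

# Keeping every timestep preserves the spacing and marks the gap in the mask.
filled = fill_gaps(rows, timedelta(minutes=1))
mask = [flag for _, _, flag in filled]
```

The resulting mask is what a per-window availability check would read to decide whether a window is mostly gap.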

Please let me know if there's something I've missed though!

@elephaint (Contributor) commented

@jasminerienecker Thanks; I think this PR could be a generalization of #1036 (@jose-moralez). I have to think about the behaviour, and we would also have to include the changes in the other Base classes.
