Time Series Interpolation is wrong #21351
And here is the result we should get with the original frequency of 21H:

df1 = pd.DataFrame(ts1)
df2 = pd.DataFrame(ts1.asfreq('15H'))
tmp = pd.concat([df1, df2]).sort_index().interpolate(method='time')
tmp = tmp[~tmp.index.duplicated(keep='first')]
pd.merge(tmp, df2, how='inner', on=0, left_index=True, right_index=True)

This is, IMHO, what ts1.resample('15H').interpolate(method='time') should return.
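For reference, here is a self-contained sketch of the bug and of the workaround described above. The construction of ts1 is an assumption (it is not shown in this excerpt); four points at a 21H frequency, matching the range(4) / 21H description in the problem report at the bottom of this thread, reproduce the all-zero result.

import pandas as pd

# Assumed reconstruction of ts1 (not shown in this excerpt): 4 points, 21H apart.
ts1 = pd.Series(range(4), index=pd.date_range("2016-08-28", periods=4, freq="21H"))

# Buggy behaviour: every value collapses to 0, because only the first
# original timestamp falls on the 15H grid.
print(ts1.resample("15H").interpolate(method="time"))

# Workaround: interpolate over the union of both indexes, then keep only
# the 15H timestamps.
target = ts1.asfreq("15H").index
extra = target.difference(ts1.index)
union = pd.concat([ts1, pd.Series(index=extra, dtype=float)]).sort_index()
print(union.interpolate(method="time").reindex(target))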
@oricou: Indeed, that is odd! Investigation and PR are welcome!
Two years later, still the same bug... I suggested a correction which might not be optimal, but at least it gives the right result. What makes it take so long to fix this bug?
BTW I don't understand the Missing-Data flag. You just have to run the first 4 cells at the top of the thread.
@oricou you are welcome to have a look. There are 3500 open issues and pandas is all volunteer. The Missing Data tag means that this issue is about missing data itself.
OK, I will try. I still don't understand the Missing Data tag. What is expected?
Nothing - this is a tag that indicates this issue is about NaNs.
This issue describes the same problem: #21418. The interpolation only uses the values whose timestamps are in the resampled index. This produces really unexpected and hard-to-interpret results. As a quick fix I would propose adding text with this information to the documentation.
Here is the shortest example I could come up with:

import numpy as np
import pandas as pd
index = pd.date_range(start="2020-01-01 00:00:0.0", end="2020-01-16 00:00:0.0", freq="3D")
data = (np.linspace(1, 16, len(index)) < 8).astype(float)
ds = pd.Series(index=index, data=data).rename("Original")
pd.concat([ds, ds.resample("5D").interpolate().rename("Resampled")], axis=1)
The data from the original Series is interpolated using only the rows 2020-01-01 and 2020-01-16, because these are the only dates that appear in both the original and the resampled Series.
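A short sketch, reusing the example above, of why only those two rows matter: as later comments point out, the resampler upsamples with asfreq() before interpolating, and asfreq() drops every original row that does not fall on the new 5D grid.

import numpy as np
import pandas as pd

index = pd.date_range(start="2020-01-01", end="2020-01-16", freq="3D")
data = (np.linspace(1, 16, len(index)) < 8).astype(float)
ds = pd.Series(index=index, data=data)

# Only 2020-01-01 and 2020-01-16 fall on the 5D grid; every other original
# row is discarded, so the subsequent interpolation is just a straight line
# between those two values.
print(ds.resample("5D").asfreq())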
Why should we do a fix that just adds a warning when I have explained how to do it correctly (see my second message)? I know I should send a pull request; I have almost done it, but I am not convinced it will do any good.
I did not want to attack you, I was just trying to help. Sorry. This issue has been open for almost 2 years; how could I know that you are close to delivering a pull request? Could you please link the pull request, even if it's only a draft (GitHub now allows marking pull requests as drafts), so people know about your progress? In the meantime I did my own investigation. In my opinion this line is responsible for the issue: pandas/pandas/core/resample.py, line 861 in 94e2982
This sets all values not in the new index to NaN (everything but 2020-01-01 and 2020-01-16); you can see that by calling it yourself. Instead, the code should combine both indices, do the interpolation, and drop the old index. Is this similar to your approach? How did you approach it? Best regards
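As an illustration of that idea, here is a user-space sketch (not the actual pandas internals; the helper name and signature are made up for this example) that combines both indexes, interpolates, and then drops the original timestamps:

import pandas as pd

def resample_interpolate(series: pd.Series, rule: str, method: str = "time") -> pd.Series:
    """Upsample `series` to `rule`, interpolating through all original data points."""
    target = series.resample(rule).asfreq().index          # the new grid
    extra = target.difference(series.index)                # grid points with no data yet
    combined = pd.concat([series, pd.Series(index=extra, dtype=float)]).sort_index()
    # Interpolate over the union of both indexes, then drop the old timestamps.
    return combined.interpolate(method=method).reindex(target)

With the 3D/5D example above, the middle 5D timestamps come out as 1.0 and 0.0 (interpolated through the neighbouring original points) rather than as points on a straight line between the two shared rows.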
All you need to do is add these lines in pandas/pandas/core/resample.py at line 1181:

and in the header (I don't know exactly where):

to get:

It should be tested!
@oricou did you try running it?

$ git diff
diff --git a/pandas/core/resample.py b/pandas/core/resample.py
index 2308f9edb4..1ca3c6ec18 100644
--- a/pandas/core/resample.py
+++ b/pandas/core/resample.py
@@ -1175,8 +1175,10 @@ class DatetimeIndexResampler(Resampler):
result = obj.reindex(
res_index, method=method, limit=limit, fill_value=fill_value
)
-
- result = self._apply_loffset(result)
+ from pandas.core.reshape.concat import concat
+ tmp = concat([obj, result]).sort_index().interpolate(method='time')
+ result = tmp[result.index]
+ result = result[~result.index.duplicated(keep='first')]
return self._wrap_result(result)
    def _wrap_result(self, result):

So, that might not quite be the solution.
Nope... I will do it... when pip is OK again: "XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future."
Please push your changes into a Pull Request to end the my/your machine debate. Follow these links for more information: |
@MarcoGorelli I started the tests and I added the frequency to the result, since my last modification broke it. Now it is better, but I still have many errors. The first error is not mine, I would say: tests/resample/test_base.py line 55 says the result should be:

which is wrong from my point of view; the correct answer is, I think, the interpolation I provide:

So what should I do when the tests expect a wrong answer?
@JulianWgs Don't expect people to be as involved as you are. If the deal is "do all the work" or "do nothing, not even a ticket", you will lose a lot of help. I don't understand your reference to the "my/your machine debate", but it seems aggressive.
Open a pull request with your changes, change the test you think is wrong, and see if reviewers agree.
Here is my diff:

% git diff
diff --git a/pandas/core/resample.py b/pandas/core/resample.py
index 2308f9edb4..f682ff07cd 100644
--- a/pandas/core/resample.py
+++ b/pandas/core/resample.py
@@ -86,6 +86,8 @@ from pandas.tseries.offsets import (
Tick,
)
+from pandas.core.reshape.concat import concat
+
_shared_docs_kwargs: Dict[str, str] = {}
@@ -859,6 +861,11 @@ class Resampler(BaseGroupBy, ShallowMixin):
Interpolate values according to different methods.
"""
result = self._upsample("asfreq")
+ if isinstance(result.index, DatetimeIndex):
+ obj = self._selected_obj
+ tmp = concat([obj, result]).sort_index().interpolate(method='time')
+ tmp = tmp[result.index]
+ result[...] = tmp[~tmp.index.duplicated(keep='first')]
return result.interpolate(
method=method,
axis=axis,
@oricou your best bet is to actually open a pull request.
With one test which does not pass? OK, let's try. Thanks.
Edited to add: I am currently looking at this issue and the related pull request (#39886). I just wanted to make sure I'm not overlapping with anyone, however unlikely it may be. I'll probably come back to this next weekend to work on it some more, but if anyone is interested, please go ahead (:

Related Issues and PRs
To Reproduce

I have been able to replicate the error described in this issue. See the following example:

import pandas as pd
import numpy as np
resample_rule = "18T"
dti = pd.date_range("2023-06-10", periods=3, freq="30T")
original = pd.Series(np.arange(len(dti), dtype=float), dti)
resampled = original.resample(resample_rule).sum() # any func will do
resampled.loc[:] = np.nan # reset values
expected = (
pd.concat([original, resampled])
.sort_index()
.interpolate(method="time")
.drop(
original.index.drop(resampled.index, errors="ignore"), errors="ignore"
)
.drop_duplicates()
)
result = original.resample(resample_rule).interpolate(method="time")
print(f"> original\n{original}\n\n> expected\n{expected}\n\n> result\n{result}") Output
The interpolation will fail for Successful interpolation for
|
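For what it's worth, here is a sketch of the aligned vs. non-aligned contrast described above; the exact rules were rendered as code in the original comment and are not visible here, so "15T" as the working rule is only an assumption (any rule whose grid contains all the original timestamps behaves the same way):

import numpy as np
import pandas as pd

dti = pd.date_range("2023-06-10", periods=3, freq="30T")
original = pd.Series(np.arange(len(dti), dtype=float), dti)

# Every 30T timestamp lies on the 15T grid, so the result looks right.
print(original.resample("15T").interpolate(method="time"))

# Only the first timestamp lies on the 18T grid, so the intermediate
# asfreq() is mostly NaN and the interpolation degenerates.
print(original.resample("18T").interpolate(method="time"))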
On 12/06/23, at 03:44, Joshua Shew ***@***.***> wrote:
I am currently looking at this issue and the related pull request (#39886). Just wanted to make sure I'm not overlapping with anyone, however unlikely it may be.
You can work on it. You are most welcome!
Olivier.
…a points

Fixes pandas-dev#21351 (Time Series Interpolation is wrong). Interpolation on resampled data was not working correctly, because the original data points were discarded in the resampling process if they were not aligned with the resampled index. In this implementation the missing data points are added to the dataset before interpolation and discarded afterwards. Does not handle PeriodIndex.
@jreback @jbrockmendel @oricou @Jython1415 The most important aspect in the code is the comments around the examples here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/resample.py#L1064-L1080 ... The case with 400ms resampling causes only 3 of the original data points to be included when the interpolation method is called. See this plot: The blue points are the original example data (
However, objectively nobody would "expect" this result, and it is correctly labeled as "misleading" in the comment. The purpose of such interpolation is precisely to either (a) increase the time resolution of a time series using some assumed underlying function (i.e. linear, polynomial, ...) or (b) align measurement data taken at potentially arbitrary timestamps to a fixed grid. In neither case does the current implementation lead to the expected results. It only works when the new desired sampling includes all of the known data points exactly. The orange points in the plot show the interpolation result with my fixed code, achieved by first concatenating the non-covered data points with the new resampled index, then doing the interpolation, then discarding the excess data points. I'd argue this is the result most people would expect. In the edge case, the current behavior leads to completely unreasonable results, such as:

index = pd.date_range('2000-01-01 00:01:00', periods=5, freq='2h')
df = pd.Series(np.arange(5.), index)
df.resample('1h').interpolate(method='linear')

printing an all-NaN series. Here, a one-minute offset makes the upsampling lead to all-NaN results. If I am given the green light that it is okay to correct the interpolation behavior in version 2.x, I can fix the test cases which are based on hard-coded wrong expectations.
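A small sketch, based on the example above, that makes the mechanism explicit; the public asfreq() call stands in for the internal upsampling step discussed earlier in the thread:

import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01 00:01:00", periods=5, freq="2h")
s = pd.Series(np.arange(5.), index)

# None of the original timestamps (offset by one minute) fall on the hourly
# grid, so the upsampled intermediate is entirely NaN...
print(s.resample("1h").asfreq())

# ...and interpolate() then has no values to interpolate between.
print(s.resample("1h").interpolate(method="linear"))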
With the caveat that I haven't had time to fully digest your message: there is a version 3.0 coming up, so if it's necessary to correct any current wrong behaviour, then this may be a good chance - want to make a PR?
@MarcoGorelli Thank you for the open ears and the quick feedback :) Please note that a PR is already there: #56515 ... I can target 3.0 instead (and adjust the implementation accordingly). But I want to stress that the behavior in 2.x is "wrong", so it would be wise to carefully consider whether it could target 2.x to benefit users of pandas 2. What do you think?
Nice, thanks. I doubt we can get it reviewed that fast, I'm afraid (also, it looks like there's still a test failure?)
So much for long messages 😄 ... Let me try shorter (others, please see the longer message above for proper details):
it's unlikely there's time for |
Okay, what branch do I need to target? (I wasn't able to figure out where the 3.0 code lives) |
just the |
@MarcoGorelli okay, but I am targeting |
I'd suggest fixing the test cases within the same PR; we can't merge unless it's green.
Problem description
The answer should not always be 0.

Note that without method='time' the result is the same.

It can look like it is OK: if I choose another frequency at the beginning, 20H instead of 21H, then it is OK except for the last value:
My interpretation is that it puts NaN everywhere except at the times where it has data, that is 2016-08-28 00:00:00 and 2016-08-30 12:00:00, and afterwards it does a linear interpolation.

My example is bad because I used range(4), which is linear. If I set the values to 0 9 9 3 9, then the interpolation gives the same result, which is totally wrong now.
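A sketch of that non-linear variant, assuming the series is built with five points at a 21H frequency (the actual construction is in the cells referenced above and not shown here):

import pandas as pd

# Assumed construction; only the values 0 9 9 3 9 come from the report above.
ts1 = pd.Series([0, 9, 9, 3, 9], index=pd.date_range("2016-08-28", periods=5, freq="21H"))

# Only the very first timestamp falls on the 15H grid, so the interpolation
# still returns 0 everywhere, regardless of the values.
print(ts1.resample("15H").interpolate(method="time"))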
Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: fr_FR.UTF-8
pandas: 0.23.0
pytest: 3.5.0
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None