Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Lead and lag operators? #791

Closed
sglyon opened this issue Mar 27, 2015 · 13 comments
Closed

Question: Lead and lag operators? #791

sglyon opened this issue Mar 27, 2015 · 13 comments
Labels
non-breaking The proposed change is not breaking

Comments

@sglyon
Copy link

sglyon commented Mar 27, 2015

I do a lot of work with pandas and really appreciate the shift method on a DataFrame. It simply applies a lead or lag (depending on the sign of the argument) to the DataFrame, retaining the index. A simple working example:

In [345]: df = pd.DataFrame(np.arange(12).reshape(3,4))

In [346]: df
Out[346]:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [348]: df.shift(1)
Out[348]:
    0   1   2   3
0 NaN NaN NaN NaN
1   0   1   2   3
2   4   5   6   7

In [349]: df.shift(-1)
Out[349]:
    0   1   2   3
0   4   5   6   7
1   8   9  10  11
2 NaN NaN NaN NaN

Notice how shift(1) moves the row originally indexed by 0 to the 1 index an fills the new index with missing data. The shift happens in the other direction with a negative argument.

I know this is achievable in part by doing things like df[1:end-1, :] and df[2:end, :] for lead and lag respectively, but the killer thing about the python routines is that it applies the shift relative to the original index. This makes things like arithmetic or building autoregressive models dead simple because the row index makes sure data is always aligned. Things like constructing the model for x_t = a + b + x_{t-1} + c y_t + d z_t + eps_t becomes very easy:

df["lx"]  = df["x"].shift(1)
mod = "x ~ lx + y + z"

and all data alignment happens automagically!

I know that the core developers of this package have chosen not to have an intrinsic bind between data and a row index (meaning the index is actually considered part of the data and this bind won't be broken unless the user explicitly indicates that it should) so this exact same functionality may not be quite as straightforward. That being said, I thought I'd ask if anyone was aware of a workflow or set of methods that would allow me to easily achieve the same functionality.

@tshort
Copy link
Contributor

tshort commented Mar 27, 2015

One way to do this is to define a new AbstractDataVector (or AbstractNullableVector when it's ready) that is a view into a column vector. The view would automatically account for the shift when indexing. It shouldn't be that hard to write. You just need a few getindex methods to make the shifting work. Then, you don't need to do anything in the DataFrames package. This new type could go into a standalone package or into the DataArrays package (or its successor).

@johnmyleswhite
Copy link
Contributor

This is a rough one since I don't see a way to do this for generic databases, since it's based on assumptions about the ordering of rows.

I suspect we ultimately need to fork DataFrames into a library that's useful to people who work with RDBMS and a second library that's useful to people who work with time series data.

@johansigfrids
Copy link
Contributor

Does not TimeSeries.jl already provide a separate tabular structure for working with time series data?

@tshort
Copy link
Contributor

tshort commented Mar 28, 2015

Yes, but it's different; a TimeSeries is a matrix representation like a zoo object in R. DataFrames are also useful for time series data, especially for irregular data or for cases where time data is mixed with other attributes.

@nalimilan
Copy link
Member

DataFrames already support sort!, which wouldn't work for RDBMSs. I don't see that as a problem: it can simply raise an error when the chosen backend does not support the operation. Same for lead/lag.

Anyway, I think people who work with time series data may well need a RDBMS too. I taught SAS with SQL to deal with stock prices data, which can easily become large enough that storing everything in memory is not a good idea. And it can be made to work, since SQL supports LEAD and LAG in queries. It just needs an ORDER BY statement to give a meaning to "next row" and "previous row". Maybe a syntax can be found in DataFramesMeta.jl, which would work for both in-memory and database-backed DataFrames?

@johnmyleswhite
Copy link
Contributor

It would be totally reasonable to make "inherently ordered" into a trait and have things fail for DataFrame-like objects (like databases) that don't implement that trait. My main worry there is that it requires a lot of discipline to not get lazy and assume ordering when it's not really necessary, since you then write code that's much less portable than possible. I think using sort! is a pretty good example of that: I don't think a data store should be assumed to exist in a permanently sorted state since so many operations don't depend upon sorting and those that do should make their sorting explicit.

I don't think standard SQL supports LEAD and LAG, although it's true that you could make it work by hand using an ORDER BY.

My broader concern is that I'd like us to make sure we think through which operations require which assumptions to make work. Pandas makes a lot of assumptions you can't make about a system like Hive: that gives it a lot of extra power, but at the expense of generality (and also applicability to the kinds of work I do). Ideally, we'd find a way to expose both sets of operations, but I can imagine that some of the behavior you'd want from Pandas would require a change in architecture to match the assumptions that Pandas makes.

@nalimilan
Copy link
Member

LEAD and LAG appear to have been introduced by ANSI SQL 99 as part of "analytic functions"/window functions. They are supported by several major implementations (among which apparently CUBRID, DB2, Oracle, PostgreSQL, Microsoft SQL Server, and Sybase SQL Anywhere -- there are convoluted workarounds for e.g. MySQL). Hive supports them since 2013: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Not supporting this would make Julia lag (pun intended) behind other programs.

The way dplyr supports this is interesting: http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html Basically, it translates many commands that are valid R code into SQL statements. Its lead and lag functions take an order_by argument. Also, the arrange function defines an order which is automatically passed as ORDER BY in subsequent SQL statements when not overridden. This is a very clever strategy, which effectively makes the table behave as if it was physically sorted (which might indeed be the case if the data frame is stored in memory). Generally, dplyr is a great source of inspiration.

@sglyon
Copy link
Author

sglyon commented Mar 29, 2015

@johansigfrids TimeSeries.jl does do this, but I prefer working with DataFrames. Also, the behavior of TimeSeries with lead and lag is simply to drop observations completely from the object (so lag(ta, 1) with have 1 less row than the original ta) rather than mark them as missing. This puts the burden of data alignment back on the user, especially when using the data to construct statistical models in formulas. At that point I think I'd be better off using a DataFrame from the start.

@johnmyleswhite I like the goal of DataFrames being a general front-end into many data RDBMSs , but I also like the quick and powerful functionality I am used to with pandas. I realize that much of this comes as a result of pandas making some assumptions for me, but most of the time I'm ok with that. Do you think there is a way we can support both "modes" of operation -- one fully flexible making little or no assumptions for you and the other making some pandas-esque assumptions to offer more out of the box flexibility?

dplyr is great. I have been happy to watch DataFramesMeta implement many of the dplyr ideas.

@sglyon
Copy link
Author

sglyon commented Mar 29, 2015

Other comment about TimeSeries.jl -- often my "time-series" data is just data I have simulated. There are no real-world dates or times associated with any of my data so I don't want to go through the extra work to create real date/datetime indices for them.

I'm looking for a way to get classic time-series functions (like moving windows, lead/lag), without the hassle of making up dates for my data.

@garborg
Copy link
Contributor

garborg commented Mar 30, 2015

@nalimilan's suggestion of dplyr-/DataFramesMeta-like translation of Base Julia syntax to the backend language sounds right to me.

Overall, I've slowly been coming over to @johnmyleswhite's side -- I think that the core DataFrames API should align with RDBMS ops, and these special operations that we find convenient on in-memory tables should more explicitly belong to abstract subtypes of AbstractDataFrame.

@milktrader
Copy link

@spencerlyon2 once we move to v0.4, TimeSeries.jl will likely mark as missing the observations consumed by a computation such as lag or lead and use the Nullable type as a sentinel. JuliaStats/TimeSeries.jl#179

@nalimilan
Copy link
Member

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Jul 31, 2020

The recommended way to do it now is:

julia> using DataFrames, ShiftedArrays

julia> df = DataFrame(reshape(1:20, 5,4))
5×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 6     │ 11    │ 16    │
│ 2   │ 2     │ 7     │ 12    │ 17    │
│ 3   │ 3     │ 8     │ 13    │ 18    │
│ 4   │ 4     │ 9     │ 14    │ 19    │
│ 5   │ 5     │ 10    │ 15    │ 20    │

julia> mapcols(x -> lag(x, -1), df)
5×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Int64?  │ Int64?  │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 2       │ 7       │ 12      │ 17      │
│ 2   │ 3       │ 8       │ 13      │ 18      │
│ 3   │ 4       │ 9       │ 14      │ 19      │
│ 4   │ 5       │ 10      │ 15      │ 20      │
│ 5   │ missing │ missing │ missing │ missing │

julia> mapcols(lag, df)
5×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Int64?  │ Int64?  │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ missing │ missing │ missing │ missing │
│ 2   │ 1       │ 6       │ 11      │ 16      │
│ 3   │ 2       │ 7       │ 12      │ 17      │
│ 4   │ 3       │ 8       │ 13      │ 18      │
│ 5   │ 4       │ 9       │ 14      │ 19      │

I think it is easy enough so I am closing this issue.

If you feel otherwise please reopen.

@bkamins bkamins closed this as completed Jul 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

8 participants