date_range / creating trading indices #138

Closed
maread99 opened this issue Jul 9, 2021 · 12 comments

Comments

@maread99

maread99 commented Jul 9, 2021

Hi @Stryder-Git, great to see a new take on evaluating a trading index, especially if it's quicker!

I've copied your date_range function into a Jupyter Notebook and this...

import pandas_market_calendars as pmc
cal = pmc.get_calendar('XHKG')
schedule = cal.schedule("2021-07-07", "2021-07-07")
date_range(schedule, frequency="25T", closed="left", force_close=True)

returned...

DatetimeIndex(['2021-07-07 02:00:00+00:00', '2021-07-07 02:25:00+00:00',
               '2021-07-07 02:50:00+00:00', '2021-07-07 03:15:00+00:00',
               '2021-07-07 03:40:00+00:00', '2021-07-07 05:20:00+00:00',
               '2021-07-07 05:45:00+00:00', '2021-07-07 06:10:00+00:00',
               '2021-07-07 06:35:00+00:00', '2021-07-07 07:00:00+00:00',
               '2021-07-07 07:25:00+00:00', '2021-07-07 07:50:00+00:00',
               '2021-07-07 08:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

...which is the same as the return from the existing function.

As you're working on this now I thought I'd mention a couple of things that occurred to me when I was looking at using date_range. You may or may not want to consider them...

For reference:
schedule.iloc[0]
->

market_open    2021-07-07 02:00:00+00:00
market_close   2021-07-07 08:00:00+00:00
break_start    2021-07-07 04:00:00+00:00
break_end      2021-07-07 05:00:00+00:00
Name: 2021-07-07 00:00:00, dtype: datetime64[ns, UTC]
  • if forcing close on the market close, wouldn't you also force close on the break start? i.e. shouldn't 04:00 also be an indice in the above example?
  • indices following the break end are only accurate if the break end is 'on-frequency' based off the market open. Above this is shown by the first post-break indice being 05:20. I would have thought this first post-break indice should be 05:00 (as closed on "left"), or at a push 05:25 (on-frequency based on break end), although 05:20 seems rather meaningless in light of there having been a trading break? Personally, I'd consider this a bug in the existing function.
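A minimal sketch, in plain pandas, of the index the two points above would lead to for this session if each side of the break were treated as its own sub-session. The helper `subsession_index` is a made-up name for illustration, not part of pandas_market_calendars, and the filtering is done by hand rather than through any library option.

```python
import pandas as pd


def subsession_index(start, end, freq, force_close=True):
    """Left-closed index for one sub-session (hypothetical helper)."""
    points = pd.date_range(start, end, freq=freq)  # both ends included by default
    points = points[points < end]  # closed "left": drop a point landing on the end
    if force_close:
        points = points.append(pd.DatetimeIndex([end]))
    return points


freq = pd.Timedelta(minutes=25)

# Times taken from the XHKG schedule row shown above.
morning = subsession_index(
    pd.Timestamp("2021-07-07 02:00", tz="UTC"),
    pd.Timestamp("2021-07-07 04:00", tz="UTC"),
    freq,
)
afternoon = subsession_index(
    pd.Timestamp("2021-07-07 05:00", tz="UTC"),
    pd.Timestamp("2021-07-07 08:00", tz="UTC"),
    freq,
)
index = morning.append(afternoon)
print(index)
```

Under these assumptions the morning ends on the break start (04:00) and the afternoon restarts on the break end (05:00), continuing on-frequency from there (05:25, 05:50, ...).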

Maybe some food for thought if you hadn't already considered these points.

Originally posted by @maread99 in #136 (comment)

@maread99
Author

maread99 commented Jul 9, 2021

Originally posted by @Stryder-Git in #136 (comment)

Hi @maread99 and @rsheftel, I haven't used schedules with breaks in my own research, so that is something I haven't considered. I only set out to make it faster and wanted to preserve the behaviour of the original function.

But your suggestions come at a good time because I actually just finished a newer version, which still behaves the exact same way, is slightly more efficient and is the result of finding some edge cases when doing more thorough testing.

Running these tests I found a couple of peculiarities that I would adjust if I were to change the behaviour of the function, which you may be interested in, in addition to your suggestions:

  • With closed= "right" and force_close= False, if the difference between market_open and market_close is smaller than the frequency, some days disappear.
import pandas_market_calendars as mcal
nyse = mcal.get_calendar("NYSE")
sched = nyse.schedule("2020-12-22", "2020-12-26")
sched
>>
                         market_open              market_close
2020-12-22 2020-12-22 14:30:00+00:00 2020-12-22 21:00:00+00:00
2020-12-23 2020-12-23 14:30:00+00:00 2020-12-23 21:00:00+00:00
2020-12-24 2020-12-24 14:30:00+00:00 2020-12-24 18:00:00+00:00

mcal.date_range(sched, "4H", closed = "right", force_close= False)
>> DatetimeIndex(['2020-12-22 18:30:00+00:00', '2020-12-23 18:30:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

# --> Dec 24th disappears, but is a valid day in the schedule
  • If the market_close is before the market_open (e.g. 24H markets) an empty index is returned, unless force_close= True, which causes an index to be returned with nothing but closing times, regardless of the requested frequency.
cmes = mcal.get_calendar("CMES")
sched = cmes.schedule("2016-12-27", "2017-01-02")

mcal.date_range(sched, "1H")
>> DatetimeIndex(['2016-12-27 23:00:00+00:00', '2016-12-28 23:00:00+00:00',
               '2016-12-29 23:00:00+00:00', '2016-12-30 23:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

mcal.date_range(sched, "1D", force_close= False)     # <-- Daily frequency
>> DatetimeIndex([], dtype='datetime64[ns, UTC]', freq=None)

Both of these are caused by the call to pd.date_range in the original function, and might not be what you want/expect.
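Both effects can be reproduced with pd.date_range alone. The sketch below is only an illustration of the mechanism, not the original function: the left edge is dropped by hand to mimic closed="right", and the open/close pair in the second case uses illustrative values one minute apart.

```python
import pandas as pd

freq = pd.Timedelta(hours=4)

# 1) Full day (2020-12-22, 14:30 to 21:00 UTC): two points fit.
full_open = pd.Timestamp("2020-12-22 14:30", tz="UTC")
full_close = pd.Timestamp("2020-12-22 21:00", tz="UTC")
full = pd.date_range(full_open, full_close, freq=freq)
full_right = full[full > full_open]  # only 18:30 remains

# 2) Half day (2020-12-24, 14:30 to 18:00 UTC): shorter than the frequency.
half_open = pd.Timestamp("2020-12-24 14:30", tz="UTC")
half_close = pd.Timestamp("2020-12-24 18:00", tz="UTC")
half = pd.date_range(half_open, half_close, freq=freq)  # only the open fits
half_right = half[half > half_open]  # dropping the open leaves nothing

# 3) Close earlier than open (illustrative values): start > end.
late_open = pd.Timestamp("2016-12-27 23:01", tz="UTC")
early_close = pd.Timestamp("2016-12-27 23:00", tz="UTC")
backwards = pd.date_range(late_open, early_close, freq=freq)

print(len(full_right), len(half_right), len(backwards))
```

In the second case the only on-frequency point is the open itself, so removing it empties the day; in the third, pd.date_range has nothing to generate because the start is later than the end.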

So I am definitely interested in looking into how to implement this as effectively as possible. But I wanted your thoughts on it (and give others the chance to point out what they might want to be different) before working on changing the original behaviour.

I also added the test test_utils.test_new_date_range, which may be a bit overkill but it shows that as of now, the original behaviour is preserved. (It iterates over all possible settings, a list of frequencies, and all possible calendar-schedules between "2016-12-15" and "2017-01-05" and then compares it to the output of the original function.)

I am looking forward to your thoughts.

@maread99
Author

maread99 commented Jul 9, 2021

Originally posted by @Stryder-Git in #136 (comment)

To plant some seeds for ideas, my initial thoughts on each of these situations:

  • Breaks

    • this shouldn't be too hard to implement, and my first instinct is to handle each day twice, if breaks are present. Basically performing the same timeseries calculation by first treating market_open as open and break_start as close, then break_end as the second open and market_close as the second close. Which should yield what you expect (using the break_start as a type of close and aligning a row at break_end)
  • Shorter day than frequency

    • Considering that this only seems to happen with force_close= False, which means that the user has explicitly chosen not to add the close, I am not sure if any changes should be implemented at all. But I would consider raising a warning letting the user know that a chosen frequency may lead to lost days, I wonder what your thoughts are on this?
  • market_close before market_open

    • After briefly checking the schedule for all the calendars where close_time <= open_time, I realized that it is only ['CMES', 'IEPA', 'us_futures'] where some weirdness is caused. For the others (see below) it is not really a problem, which makes me wonder whether it would be better to set up the schedule/MarketCalendar differently for those markets rather than trying to deal with it in the date_range function.
      This also reminds me of the fact that the schedule of XNZE on December 22nd and December 30th has odd closing times which don't seem to make sense
      --> So these four markets may need some adjusting, which I could open an Issue for, if you agree?
# Get the markets that I thought were an issue
cals = {}
for name in mcal.get_calendar_names():
    calendar = mcal.get_calendar(name)
    if calendar.close_time <= calendar.open_time:
        print(name, ":", calendar.open_time, calendar.close_time)
        cals[name] = calendar
>> 
CME_Equity : 17:00:00 16:00:00
CBOT_Equity : 17:00:00 16:00:00
CME_Agriculture : 17:01:00 17:00:00
CBOT_Agriculture : 17:01:00 17:00:00
COMEX_Agriculture : 17:01:00 17:00:00
NYMEX_Agriculture : 17:01:00 17:00:00
CME_Rate : 17:00:00 16:00:00
CBOT_Rate : 17:00:00 16:00:00
CME_InterestRate : 17:00:00 16:00:00
CBOT_InterestRate : 17:00:00 16:00:00
CME_Bond : 17:00:00 16:00:00
CBOT_Bond : 17:00:00 16:00:00
ICE : 20:01:00 18:00:00
ICEUS : 20:01:00 18:00:00
NYFE : 20:01:00 18:00:00
CMES : 17:01:00 17:00:00
IEPA : 20:00:00 18:00:00


# Get the markets that I actually mean
off = {}
for name, calendar in cals.items():
    sched = calendar.schedule("2021-06-28", "2021-07-04")
    if (sched.market_close <= sched.market_open).any():
        print(name)
        off[name] = calendar
        display(sched)
        display(mcal.date_range(sched, "6H", closed= "left", force_close= True))
        print()

>> .....

off.keys()
>> dict_keys(['CMES', 'IEPA', 'us_futures'])

@maread99
Author

Hi @Stryder-Git, @rsheftel, I haven't contributed to pandas_market_calendars although I was intending to raise the 'breaks bug' I mentioned above and at some point offer a new version. @Stryder-Git, as you seem to be looking at doing the same I thought it opportune to let you have where I got to (trading_index.txt). Feel free to interrogate it, use it, criticise it and certainly to improve it!

It is still very much a draft. I've only tested it thoroughly with a couple of calendars, and with the 24 hour calendars I've only tested it lightly, so I wouldn't be surprised if you can break it quite easily. I wrote it based on exchange_calendars calendars; it might work straight off with a pandas_market_calendars calendar or you might need to change a line or two.

If you don't have / want to install pydantic then just comment out the decorator.

With regards to the specific points you raised:

Breaks

  • this shouldn't be too hard to implement, and my first instinct is to handle each day twice, if breaks are present. Basically performing the same timeseries calculation by first treating market_open as open and break_start as close, then break_end as the second open and market_close as the second close. Which should yield what you expect (using the break_start as a type of close and aligning a row at break_end)

This is exactly what I did. It obviously makes it considerably slower.
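The "handle each day twice" idea quoted above can be sketched at the schedule level as follows. This is not the code in trading_index.txt; `split_on_breaks` is a hypothetical helper, the schedule is hand-built, and the second day (a session without a break) uses made-up times.

```python
import pandas as pd


def split_on_breaks(schedule):
    """Return one (start, end) row per sub-session (hypothetical helper).

    Assumes the break_start/break_end columns exist and hold NaT on days
    without a break.
    """
    has_break = schedule["break_start"].notna()
    # First part: open -> break start, or open -> close when there is no break.
    first = pd.DataFrame(
        {
            "start": schedule["market_open"],
            "end": schedule["break_start"].fillna(schedule["market_close"]),
        }
    )
    # Second part: break end -> close, only for days that have a break.
    second = pd.DataFrame(
        {
            "start": schedule.loc[has_break, "break_end"],
            "end": schedule.loc[has_break, "market_close"],
        }
    )
    return pd.concat([first, second]).sort_values("start")


schedule = pd.DataFrame(
    {
        "market_open": pd.to_datetime(["2021-07-07 02:00", "2021-07-08 02:00"], utc=True),
        "market_close": pd.to_datetime(["2021-07-07 08:00", "2021-07-08 04:00"], utc=True),
        "break_start": pd.to_datetime(["2021-07-07 04:00", None], utc=True),
        "break_end": pd.to_datetime(["2021-07-07 05:00", None], utc=True),
    },
    index=pd.to_datetime(["2021-07-07", "2021-07-08"]),
)

sessions = split_on_breaks(schedule)
print(sessions)
```

Each row of `sessions` can then be treated as an ordinary open/close pair, which is where the extra cost comes from: a day with a break needs two evaluations instead of one.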

Shorter day than frequency

Considering that this only seems to happen with force_close= False, which means that the user has explicitly chosen not to add the close, I am not sure if any changes should be implemented at all. But I would consider raising a warning letting the user know that a chosen frequency may lead to lost days, I wonder what your thoughts are on this?

Personally, if a user asks for the index to be closed 'right', doesn't force_close and goes with a frequency longer than the trading day, I wouldn't consider it that odd that the day isn't represented. Same with...

mcal.date_range(sched, "1D", force_close= False)     # <-- Daily frequency
>> DatetimeIndex([], dtype='datetime64[ns, UTC]', freq=None)

Having said that, you'll see that my attempt interprets closed 'right' as meaning that the final indice for a session should be included even when that indice is later than the market close. If the user doesn't want this indice then they can force_close. If they don't want the market close either then they can close on the 'left'. I haven't accommodated closing on neither side (i.e. not including the first or last indice) although if you wanted to support that I guess you'd implement it as closed 'left' and drop the first indice from each session/subsession.

As an aside, from looking at your test I suspect you'd find the hypothesis package of interest if you're not already familiar with it - you only need to define what type of inputs a function can receive and then if the function can be broken within those constraints, it'll find a way to break it (in my experience, usually with edge cases that I would never have imagined).

@Stryder-Git
Collaborator

Hi @maread99, @rsheftel I have briefly read through your code and you seem to have already taken it some steps further than I have.
I particularly like the idea of also offering Interval indices and making more thorough checks.
Obviously my main point of criticism would be that you are still using itertuples and a (or even two) call(s) to pd.date_range for each day in the schedule. I think it would absolutely be possible to incorporate my calculation of the indices into your code, making it much faster.

Regarding breaks, splitting the calculation will definitely hamper the performance, but it may be an interesting optimization problem to consider and, with the vectorized calculation, can still be done at a decent speed.
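For readers unfamiliar with the idea, one possible way to compute all sessions at once, without a pd.date_range call per day, is sketched below with NumPy integers. It is an assumption about how such a calculation could look, not the code from the PR; `vectorised_left_points` is a hypothetical name and the inputs are plain integers (for example nanoseconds since the epoch).

```python
import numpy as np


def vectorised_left_points(starts, ends, step):
    """On-frequency points in [start, end) for every session at once.

    starts, ends: integer arrays with ends >= starts; step: positive integer
    in the same unit. Hypothetical helper for illustration.
    """
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    counts = -(-(ends - starts) // step)  # ceil division: points per session
    offsets = np.cumsum(counts) - counts  # position of each session's first point
    # 0, 1, 2, ... restarting at every session boundary.
    within = np.arange(counts.sum()) - np.repeat(offsets, counts)
    return np.repeat(starts, counts) + within * step


# Two sessions in arbitrary integer units: [0, 60) and [100, 130), step 25.
points = vectorised_left_points([0, 100], [60, 130], 25)
print(points)
```

With nanosecond inputs, the resulting integers can be handed to pd.to_datetime(..., utc=True) to obtain a DatetimeIndex; sub-sessions created by breaks are just additional rows in `starts` and `ends`.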

I guess the exact behaviour of closed and force_close will mostly be a matter of interpretation/preference and will just require clear documentation.

Implementing our ideas would cause a different behaviour of the original function and possibly some functions and/or classes to be added to the project. Before I spend a more serious amount of time, it would be great to get some other users' opinions as well, in particular @rsheftel's.
I actually have some family/travel plans ahead and might not be responding very quickly the next couple of weeks, so that might also give enough time for some other users to pick this up.

But, knowing myself, I will probably end up playing around with some ideas anyway and will keep you posted if I happen to create something I like.
(The hypothesis package is also quite interesting, thanks for the suggestion!)

@rsheftel
Owner

@maread99 and @Stryder-Git, this is all great work. I know the community appreciates it, and so do I.

Here are my thoughts:

  1. Adding new classes/modules/etc is perfectly fine. I would take functionality and speed over code minimalism. So I don't think you need to worry about that.

  2. As long as the existing tests pass, and you add new tests for new functionality, then I think you are safe. The only contracts the package should make are those in the tests. I would assume users that use functionality not in tests are caveat emptor.

For the examples given above, I am comfortable with the behavior you both have been describing. It could lead to some confusion, but I think the instances would be rare and the code would be consistent with less special cases so that is a plus.

Thanks both for taking the time to do this, if/when you submit the final PR I would be happy to merge it.

@maread99
Author

@Stryder-Git, I agree with all your most recent comments and certainly look forward to seeing your revised implementation.

@rsheftel, regarding...

For the examples given above, I am comfortable with the behavior you both have been describing. It could lead to some confusion, but I think the instances would be rare and the code would be consistent with less special cases so that is a plus.

...as @Stryder-Git noted, I guess it comes down to clear documentation. On the meaning of the closed parameter, I would suggest the following interpretation covers all the bases:

  • "left" - include left indice of first interval/bar, do not include right indice of last interval/bar.
  • "right" - do not include left indice of first interval, include right indice of last interval.
  • "both" - include both left indice of first interval and right indice of last interval.
  • "neither" - do not include either left indice of first interval or right indice of last interval.

...with the last interval considered as the last interval with a left side earlier than the close (i.e. right side may align with close or may fall after the close). I believe the above, together with force_close, allows for all useful index configurations to be specified and that, moreover, anything less would be left wanting by way of not providing for at least one reasonable configuration.
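The four options can be illustrated for a single session as below. This is a sketch of the interpretation listed above under the stated definition of the last interval, not an existing library function; `session_points` is a hypothetical name.

```python
import math

import pandas as pd


def session_points(open_, close, freq, closed="both"):
    """Points for one session under the interpretation above (hypothetical helper)."""
    # The last interval is the last one whose left side is earlier than the close.
    n_intervals = math.ceil((close - open_) / freq)
    # Left side of every interval plus the right side of the last one.
    points = [open_ + i * freq for i in range(n_intervals + 1)]
    if closed in ("right", "neither"):
        points = points[1:]  # drop left side of the first interval
    if closed in ("left", "neither"):
        points = points[:-1]  # drop right side of the last interval
    return pd.DatetimeIndex(points)


open_ = pd.Timestamp("2020-12-24 14:30", tz="UTC")
close = pd.Timestamp("2020-12-24 18:00", tz="UTC")
freq = pd.Timedelta(hours=1)

for option in ("left", "right", "both", "neither"):
    print(option, list(session_points(open_, close, freq, option).strftime("%H:%M")))
```

Here the last interval runs from 17:30 to 18:30, so "right" and "both" include 18:30 even though it falls after the 18:00 close; whether that point is kept, dropped or replaced by the close is then a matter for force_close.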

@rsheftel
Owner

I agree with and am fine with the implementation you suggested with one change. To be consistent with the pandas date_range function and documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) the default should be None and None should have the same meaning as "both" in your list above.

I also strongly agree with @Stryder-Git comment that we need more tests. That is also the best way for people to be sure of how the implementation works.

Other than that I would say go ahead and submit the PR and I will merge it in.

@Stryder-Git
Collaborator

Thanks for your reactions, @rsheftel and @maread99, and sorry for the late response. I am working on the enhanced version at the moment, paying attention to the points that I outlined here. Currently all tests pass, except for the tests using calendars with breaks.
The code is also available at the bottom

@rsheftel, please see the first bulletpoint.

  • Breaks
    • This is handled like @maread99 suggested. Although, @rsheftel, you said that you wanted the existing tests to pass, but there are 4 assertions in test_utils.test_date_range_w_breaks that would need to be adjusted to accommodate the new way of handling breaks. I am willing to do that since I am also planning on writing some extra tests next week anyway.

---> Do you agree that aligning the indices along break_start and break_end makes more sense than how it has been handled originally, and that I can change those tests?

  • Shorter day than frequency

    • When thinking about it, I agree with maread99, so it will be handled the same way, making the days disappear when force_close= False. To be clear, with the new calculation, this means that if the frequency is larger than the difference between market_open and break_start, or break_end and market_close, that part of the day will also disappear (but only when force_close= False). I have chosen not to add a warning for this yet, let me know if you think I should.
  • Market_close before market_open

    • Since I found that this only concerns the following three calendars ['CMES', 'IEPA', 'us_futures'], the function is just going to raise a ValueError when such a schedule is passed to it, requesting the user to adjust the schedule, rather than trying to handle this inside the function. Also, I believe that a separate Issue/PR to fix those calendars would be more appropriate and I might do so at a later date.
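A check of the kind described in the last point might look like the sketch below. The helper name, the error message and the hand-built schedule (with illustrative times) are assumptions for the example, not the code from the PR.

```python
import pandas as pd


def check_open_before_close(schedule):
    """Raise if any session closes before it opens (hypothetical helper)."""
    bad = schedule["market_close"] < schedule["market_open"]
    if bad.any():
        dates = [str(d.date()) for d in schedule.index[bad.to_numpy()]]
        raise ValueError(
            f"market_close is earlier than market_open on {dates}; "
            "please adjust the schedule before passing it to date_range."
        )


schedule = pd.DataFrame(
    {
        "market_open": pd.to_datetime(["2016-12-27 23:01"], utc=True),
        "market_close": pd.to_datetime(["2016-12-27 23:00"], utc=True),
    },
    index=pd.to_datetime(["2016-12-27"]),
)

try:
    check_open_before_close(schedule)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```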

Extra Points

To be as consistent as possible I have only changed two things about the possible inputs to the function.

  • You can now pass None to force_close.
  • And, in addition to None, you can also pass "both" to closed, which will be the exact same as passing None.

The new interpretation of the possible values of force_close are:

  • True: guarantee that the close is the last value of the day
  • False: guarantee that there is no value larger than the close
  • None: don't force anything

This allows for an extra configuration that I don't remember being possible previously, while still allowing the same results with the old inputs for closed and force_close.
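The three force_close values can be read as a post-processing step on the points of one session. The sketch below follows the definitions listed above; `apply_force_close` is a hypothetical helper, not the function from the linked code.

```python
import pandas as pd


def apply_force_close(points, close, force_close):
    """Apply the three force_close meanings to one session (hypothetical helper)."""
    if force_close is None:
        return points  # don't force anything
    points = points[points <= close]  # no value larger than the close
    if force_close and (len(points) == 0 or points[-1] != close):
        # True: make sure the close is the last value of the day.
        points = points.append(pd.DatetimeIndex([close]))
    return points


# A right-closed hourly session whose last point (18:30) overshoots the 18:00 close.
points = pd.date_range("2020-12-24 15:30", periods=4, freq=pd.Timedelta(hours=1), tz="UTC")
close = pd.Timestamp("2020-12-24 18:00", tz="UTC")

for option in (True, False, None):
    print(option, list(apply_force_close(points, close, option).strftime("%H:%M")))
```

With None the overshooting 18:30 survives, which is the configuration that neither True nor False can produce.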

Since I still want to do some tests and add a way to handle some edge cases when getting overlapping indices, it isn't ready for a PR yet, but here is the code if you want to have a look. You can also see the docstring there for a permutation analysis. But I will explain the kwargs more clearly in the actual PR/ final docstring.

@rsheftel
Owner

rsheftel commented Aug 8, 2021

---> Do you agree that aligning the indices along break_start and break_end makes more sense than how it has been handled originally, and that I can change those tests?

Yes.

I have chosen not to add a warning for this yet, let me know if you think I should.

We should definitely have a warning, otherwise I am sure we will get the question over and over in the issues or on Reddit
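One way such a warning could be raised is sketched below. The helper name and the message wording are assumptions, the schedule is hand-built from the NYSE example earlier in the thread, and for simplicity the check looks at whole days only (it ignores breaks).

```python
import warnings

import pandas as pd


def warn_if_sessions_lost(schedule, freq):
    """Warn when any session is shorter than the frequency (hypothetical helper)."""
    lengths = schedule["market_close"] - schedule["market_open"]
    short = lengths < freq
    if short.any():
        warnings.warn(
            f"{int(short.sum())} session(s) are shorter than the frequency {freq} "
            "and may be missing from the index when force_close=False.",
            UserWarning,
            stacklevel=2,
        )
    return short


schedule = pd.DataFrame(
    {
        "market_open": pd.to_datetime(
            ["2020-12-22 14:30", "2020-12-23 14:30", "2020-12-24 14:30"], utc=True
        ),
        "market_close": pd.to_datetime(
            ["2020-12-22 21:00", "2020-12-23 21:00", "2020-12-24 18:00"], utc=True
        ),
    },
    index=pd.to_datetime(["2020-12-22", "2020-12-23", "2020-12-24"]),
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    short = warn_if_sessions_lost(schedule, pd.Timedelta(hours=4))
print(short.tolist(), [str(w.message) for w in caught])
```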

Since I found that this only concerns the following three calendars ['CMES', 'IEPA', 'us_futures'], the function is just going to raise a ValueError when such a schedule is passed to it, requesting the user to adjust the schedule, rather than trying to handle this inside the function.

The problem with the futures calendars is that to make them start properly on a Sunday, the open needs to be greater than the close. We will have to think about how to handle this.

@Stryder-Git
Collaborator

When implementing the warning, I did some more digging and made changes to handle some edge cases and get a much cleaner result.

The edge cases are where the start of the trading session == the end of it. This happens in some tests and schedules, and leads to intervals representing no trading, which I don't think makes any sense.

E.g.:

cal = FakeBreakCalendar()
# when the close is the break end
schedule = cal.schedule('2016-12-30', '2016-12-30')

schedule
>>
              market_open              market_close               break_start                 break_end
2016-12-30 14:30:00+00:00 2016-12-30 15:40:00+00:00 2016-12-30 15:00:00+00:00 2016-12-30 15:40:00+00:00

# -- >> As you can see, after 2016-12-30 15:00:00+00:00, there is no trading anymore for the rest of the day...

# "15min", closed = None, force_close = True
#### original expectation
expected = ['2016-12-30 14:30:00+00:00', '2016-12-30 14:45:00+00:00', '2016-12-30 15:00:00+00:00','2016-12-30 15:40:00+00:00']

# -- >> new expectation, since '2016-12-30 15:40:00+00:00', really doesn't mean anything.
expected = ['2016-12-30 14:30:00+00:00', '2016-12-30 14:45:00+00:00', '2016-12-30 15:00:00+00:00']

The same logic applies to market_open == break_start, and market_open == market_close (which I actually haven't encountered yet)
Please see the implementation in #142
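The edge case above can be sketched by dropping sub-sessions whose start equals their end before generating any points. This is an illustration under that assumption, with hand-built sub-sessions taken from the FakeBreakCalendar row shown above, and not the implementation in #142.

```python
import pandas as pd

# Sub-sessions for 2016-12-30: open -> break_start, then break_end -> close.
sessions = pd.DataFrame(
    {
        "start": pd.to_datetime(["2016-12-30 14:30", "2016-12-30 15:40"], utc=True),
        "end": pd.to_datetime(["2016-12-30 15:00", "2016-12-30 15:40"], utc=True),
    }
)

# Keep only sub-sessions in which some trading actually takes place.
trading = sessions[sessions["start"] < sessions["end"]]

freq = pd.Timedelta(minutes=15)
points = []
for start, end in zip(trading["start"], trading["end"]):
    # Both ends are included; the end happens to be on-frequency in this example.
    points.extend(pd.date_range(start, end, freq=freq))
index = pd.DatetimeIndex(points)
print(index)
```

The zero-length afternoon (15:40 to 15:40) contributes nothing, so the index stops at 15:00, matching the new expectation.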

@rsheftel
Owner

Thank you everyone for your hard work. This is merged into master and updated to v3.0 in 5634385 with corresponding PyPI release.

@maread99
Author

@Stryder-Git, I've only just got round to having a look at your new date_range. Tremendous work on the vectorised implementation!! I've overhauled my attempt with a very similar vectorised evaluation - one which I would never have got anywhere close to if you hadn't shown the way!

If you're interested, the PR's here in a bit of a development queue over at exchange_calendars. Hopefully it will be incorporated prior to release 3.4. You'll see the internals work in nanos (sourced from ExchangeCalendar's nano properties). This has made it quite considerably faster than the current date_range. Using a schedule for Hong Kong and with closed as "right" I got:

  • 33min frequency over a year, 8x faster (1.8ms v 14.7ms)
  • 1ms frequency over a single day, 10x faster (1.6s v 16.6s)

Worth noting that, if you were interested in doing something similar with the nanos, you could simplify what I've done a lot by not offering the separate force_break_close option.
