Error in dev/test/monitoring split? #1

Open
MiladShahidi opened this issue Aug 2, 2023 · 0 comments

Hi,

Thanks for sharing the task and your code. It's very helpful and instructive.

I've encountered an issue in your code which suggests that the dev/test/monitoring split may be done incorrectly. I may have misunderstood the context, so I'd appreciate it if you could correct me if I got this wrong.

The Issue

You create three subsets for development, out-of-time monitoring and post-deployment monitoring. These are subsets of the original application data frame. I'd expect these to be mutually exclusive samples. However, there seems to be considerable overlap between the three.

The original application data frame has around 70K unique records (based on unique_id) after cleaning. One would expect the row counts of the three subsets to add up to the same number, but they don't. Here are the row counts for each sub-sample:

Development: 32616
Out-of-time monitoring: 48537
Post-deployment monitoring: 56010

which adds up to 137163, almost twice the number of records in the cleaned application data frame. This seems to suggest that the three sub-samples have considerable overlap.

(Note that each row has a unique id, the unique_id index, and corresponds to a unique loan.)
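
For reference, here is a minimal sketch of how the overlap could be measured directly rather than inferred from row counts. It assumes each subset is a pandas DataFrame indexed by unique_id; the function name `overlap_report` and the toy frames are hypothetical and not taken from your code.

```python
import pandas as pd


def overlap_report(subsets):
    """Count the index values shared by each pair of named subsets."""
    names = list(subsets)
    report = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = subsets[first].index.intersection(subsets[second].index)
            report[(first, second)] = len(shared)
    return report


# Toy stand-ins for the three sub-samples (hypothetical data).
dev = pd.DataFrame({"x": [1, 2, 3]}, index=pd.Index([1, 2, 3], name="unique_id"))
oot = pd.DataFrame({"x": [3, 4]}, index=pd.Index([3, 4], name="unique_id"))
post = pd.DataFrame({"x": [4, 5]}, index=pd.Index([4, 5], name="unique_id"))

report = overlap_report({"dev": dev, "oot": oot, "post": post})
print(report)
```

If the three sub-samples were mutually exclusive, every count in the report would be zero.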

As another way to check this, I looked at the earliest origination_date in each sub-sample:

application_out_of_time['origination_date'].min() returns 2018-08-01 (while this subset is supposed to be loan applications after Aug-2019).

application_post_deployment['origination_date'].min() returns 2019-01-02 (while this subset is supposed to start in Jan-2020).
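
The same check can be written as a small helper that returns the rows originated before a window's intended start. This is only a sketch: `rows_before`, the toy data, and the exact cutoff of 2019-08-01 are my assumptions, not values from your code.

```python
import pandas as pd


def rows_before(df, start, date_col="origination_date"):
    """Return the rows whose date falls before the window's intended start."""
    return df.loc[df[date_col] < pd.Timestamp(start)]


# Toy stand-in for application_out_of_time (hypothetical data).
toy_out_of_time = pd.DataFrame(
    {"origination_date": pd.to_datetime(["2018-08-01", "2019-08-15", "2019-10-01"])},
    index=pd.Index([1, 2, 3], name="unique_id"),
)

# Assumed start of the out-of-time window; adjust to the real definition.
early = rows_before(toy_out_of_time, "2019-08-01")
print(early)
```

An empty result would mean every loan in the subset was originated inside its window.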

Possible Cause

Assuming I got this right so far, I think the cause of the overlap is that you used repayment dates (monthly_outcome.date) to decide which sub-sample a loan belongs to. For example, suppose the post-deployment sample is defined as "all loan applications since 1st Jan 2020". Since you use monthly_outcome, any loan that is still making repayments after 01-01-2020 will be assigned to the post-deployment set in your code, even if application.origination_date is before that date.

Same goes for the other two sub-samples.
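
To illustrate what I mean, here is a toy sketch contrasting the two selection rules. The frame and column names mirror the ones mentioned above, but the data, the `cutoff` variable, and the filtering code are my own assumptions and may not match how your code actually builds the subsets.

```python
import pandas as pd

# Toy application table: one row per loan, indexed by unique_id.
application = pd.DataFrame(
    {"origination_date": pd.to_datetime(["2018-08-01", "2019-09-15", "2020-02-10"])},
    index=pd.Index([101, 102, 103], name="unique_id"),
)

# Toy repayment table: several rows per loan, one per repayment date.
monthly_outcome = pd.DataFrame(
    {
        "unique_id": [101, 101, 102, 102, 103],
        "date": pd.to_datetime(
            ["2019-12-01", "2020-01-01", "2019-12-01", "2020-03-01", "2020-03-01"]
        ),
    }
)

cutoff = pd.Timestamp("2020-01-01")  # assumed start of post-deployment

# Rule A: select loans with any repayment on or after the cutoff.
ids_by_repayment = monthly_outcome.loc[
    monthly_outcome["date"] >= cutoff, "unique_id"
].unique()
post_by_repayment = application.loc[application.index.isin(ids_by_repayment)]

# Rule B: select loans originated on or after the cutoff.
post_by_origination = application.loc[application["origination_date"] >= cutoff]

print(post_by_repayment.index.tolist())
print(post_by_origination.index.tolist())
```

In this toy data, rule A pulls in loans 101 and 102 even though both were originated before the cutoff, while rule B keeps only loan 103. That is the pattern I believe explains the early origination dates reported above.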

I'd appreciate it if you could let me know whether I'm missing something here or my understanding is correct.
