Error in dev/test/monitoring split? #1

MiladShahidi · 2023-08-02T15:58:18Z

Hi,

Thanks for sharing the task and your code. It's very helpful and instructive.

I've encountered an issue in your code which suggests that the dev/test/monitoring split may be done incorrectly. I may have misunderstood the context, so I'd appreciate it if you could correct me if I got this wrong.

The Issue

You create three subsets for development, out-of-time monitoring and post-deployment monitoring. These are subsets of the original application data frame. I'd expect these to be mutually exclusive samples. However, there seems to be considerable overlap between the three.

The original application data frame has around 70K unique records (based on unique_id) after cleaning. One would expect the number of rows in the three subsets to add up to the same number. But they don't. Here are the number of rows in each sub-sample:

Development: 32616
Out-of-time monitoring: 48537
Post-deployment monitoring: 56010

which adds up to 137163, almost twice the number of records in the cleaned application data frame. This seems to suggest that the three sub-samples have considerable overlap.

(Note that each row has a unique id, the unique_id index, and corresponds to a unique loan.)
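A more direct check than comparing row counts is to intersect the unique_id indexes of the sub-samples pairwise. The snippet below is only a minimal sketch on toy data: the three small frames and the helper name pairwise_overlap are made up for illustration, and I'm assuming each sub-sample is a pandas DataFrame indexed by unique_id, as described above.

```python
from itertools import combinations

import pandas as pd


def pairwise_overlap(subsets):
    """Count the unique_id values shared by each pair of sub-samples.

    `subsets` maps a label to a DataFrame indexed by unique_id.
    (Hypothetical helper, not part of the repository.)
    """
    return {
        (a, b): len(subsets[a].index.intersection(subsets[b].index))
        for a, b in combinations(subsets, 2)
    }


# Toy stand-ins for the three sub-samples (values are invented).
dev = pd.DataFrame({"x": [1, 2, 3]}, index=pd.Index([101, 102, 103], name="unique_id"))
oot = pd.DataFrame({"x": [3, 4]}, index=pd.Index([103, 104], name="unique_id"))
post = pd.DataFrame({"x": [4, 5]}, index=pd.Index([104, 105], name="unique_id"))

overlaps = pairwise_overlap(
    {"development": dev, "out_of_time": oot, "post_deployment": post}
)
for pair, n_shared in overlaps.items():
    print(pair, n_shared)
```

If the sub-samples were mutually exclusive, every count would be zero.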

As another way to check this, I looked at the earliest origination_date in each sub-sample:

application_out_of_time['origination_date'].min() returns 2018-08-01 (while this subset is supposed to be loan applications after Aug-2019).

application_post_deployment['origination_date'].min() returns 2019-01-02 (while this subset is supposed to start in Jan-2020).

Possible Cause

Assuming I got this right so far, I think the overlap arises because you used repayment dates (monthly_outcome.date) to decide which sub-sample a loan belongs to. For example, the post-deployment sample is defined as "all loan applications since 1st Jan 2020". Since you filter on monthly_outcome, any loan that is still making repayments after 01-01-2020 is assigned to the post-deployment set in your code, even if its application.origination_date is before that date.

Same goes for the other two sub-samples.
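To illustrate what I mean, here is a minimal sketch on toy data. It is not your actual code: the column names follow the ones mentioned above, the three loans and their dates are invented, and I'm assuming monthly_outcome carries a unique_id column linking each repayment row to its loan.

```python
import pandas as pd

# Three toy loans, indexed by unique_id (values are invented).
application = pd.DataFrame(
    {"origination_date": pd.to_datetime(["2019-03-01", "2019-10-01", "2020-02-01"])},
    index=pd.Index([1, 2, 3], name="unique_id"),
)

# One row per loan per repayment month (assumed layout).
monthly_outcome = pd.DataFrame(
    {
        "unique_id": [1, 1, 2, 2, 3],
        "date": pd.to_datetime(
            ["2019-04-01", "2020-03-01", "2019-11-01", "2020-01-01", "2020-03-01"]
        ),
    }
)

cutoff = pd.Timestamp("2020-01-01")  # start of the post-deployment window

# Selecting on repayment dates: any loan with a repayment on or after the
# cutoff is picked up, regardless of when it was originated.
ids_by_repayment = monthly_outcome.loc[
    monthly_outcome["date"] >= cutoff, "unique_id"
].unique()
post_by_repayment = application.loc[application.index.isin(ids_by_repayment)]

# Selecting on origination date: only loans originated on or after the cutoff.
post_by_origination = application.loc[application["origination_date"] >= cutoff]

print(sorted(post_by_repayment.index))    # loans 1, 2 and 3
print(sorted(post_by_origination.index))  # loan 3 only
```

In this toy case, loans 1 and 2 were originated in 2019 but still land in the post-deployment set when the filter is applied to repayment dates, which is the pattern I think I'm seeing in the real data.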

I'd appreciate it if you could let me know whether I'm missing something here or my understanding is correct.
