Hi,
Thanks for sharing the task and your code. It's very helpful and instructive.
I've run into something in your code that suggests the dev/test/monitoring split may be done incorrectly. I may have misunderstood the context, so I'd appreciate it if you could correct me if I got this wrong.
The Issue
You create three subsets for development, out-of-time monitoring and post-deployment monitoring. These are subsets of the original application data frame. I'd expect these to be mutually exclusive samples. However, there seems to be considerable overlap between the three.
The original application data frame has around 70K unique records (based on unique_id) after cleaning. One would expect the number of rows in the three subsets to add up to the same number, but they don't. Here is the number of rows in each sub-sample:
Development: 32616
Out-of-time monitoring: 48537
Post-deployment monitoring: 56010
This adds up to 137163, almost twice the number of records in the cleaned application data frame, which suggests that the three sub-samples overlap considerably.
(Note that each row has a unique id, the unique_id index, and corresponds to a unique loan.)
As another way to check this, I looked at the earliest origination_date in each sub-sample:
application_out_of_time['origination_date'].min() returns 2018-08-01 (while this subset is supposed to be loan applications after Aug-2019).
application_post_deployment['origination_date'].min() returns 2019-01-02 (while this subset is supposed to start in Jan-2020).
Possible Cause
Assuming I got this right so far, I think the cause of the overlap is that you used repayment dates (monthly_outcome.date) to decide which sub-sample a loan belongs to. For example, suppose the post-deployment sample is defined as "all loan applications since 1st Jan 2020". Since you use monthly_outcome, any loan that is still making repayments after 01-01-2020 will be assigned to the post-deployment set in your code, even if application.origination_date is before that date.
The same goes for the other two sub-samples.
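To illustrate what I mean, here is a minimal sketch on made-up data. It is not a copy of your code: split_by_repayment_date is only my guess at the effect of the current logic, the four toy loans are invented, and the cutoffs (2019-08-01 and 2020-01-01) are my reading of "after Aug-2019" and "Jan-2020". The helper names total_rows and overlapping_ids are hypothetical as well.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are invented for illustration.
application = pd.DataFrame(
    {
        "origination_date": pd.to_datetime(
            ["2018-08-01", "2019-01-02", "2019-09-15", "2020-02-10"]
        )
    },
    index=pd.Index([1, 2, 3, 4], name="unique_id"),
)
monthly_outcome = pd.DataFrame(
    {
        "unique_id": [1, 1, 2, 2, 3, 4],
        "date": pd.to_datetime(
            ["2019-06-30", "2019-09-30", "2019-10-31",
             "2020-01-31", "2020-03-31", "2020-03-31"]
        ),
    }
)

# Assumed period boundaries (not taken from the repository).
OOT_START = pd.Timestamp("2019-08-01")
POST_START = pd.Timestamp("2020-01-01")


def period_masks(dates):
    """Boolean masks assigning each date to exactly one period."""
    return {
        "development": dates < OOT_START,
        "out_of_time": (dates >= OOT_START) & (dates < POST_START),
        "post_deployment": dates >= POST_START,
    }


def split_by_repayment_date(application, monthly_outcome):
    """Suspected behaviour: a loan lands in every period where it has a repayment row."""
    masks = period_masks(monthly_outcome["date"])
    return {
        name: application.loc[
            sorted(monthly_outcome.loc[mask, "unique_id"].unique().tolist())
        ]
        for name, mask in masks.items()
    }


def split_by_origination_date(application):
    """Alternative: each loan lands in exactly one period, by origination_date."""
    masks = period_masks(application["origination_date"])
    return {name: application[mask] for name, mask in masks.items()}


def total_rows(subsets):
    return sum(len(s) for s in subsets.values())


def overlapping_ids(subsets):
    """unique_id values that appear in more than one subset."""
    all_ids = pd.concat([s.index.to_series() for s in subsets.values()])
    return sorted(set(all_ids[all_ids.duplicated()].tolist()))


by_repayment = split_by_repayment_date(application, monthly_outcome)
by_origination = split_by_origination_date(application)

print("repayment split:  ", total_rows(by_repayment), overlapping_ids(by_repayment))
print("origination split:", total_rows(by_origination), overlapping_ids(by_origination))
```

On this toy data the repayment-based split yields more rows than there are loans and reports shared ids, while the origination-based split adds up to len(application) with no shared ids. Running the same overlapping_ids check on the three real subsets should confirm or refute my reading.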
I'd appreciate it if you could let me know whether I'm missing something here or my understanding is correct.