[QST] Unable to replicate evaluation metrics when using ignore_masking. #506
Comments
@TassaraR, thank you for testing the library and investigating the consistency of the scores returned by the model! I was able to replicate your results using the code you shared with the end-to-end example data. After investigating, the source of the difference in scores you are observing comes from how masking is applied in T4Rec: at the masked positions we are not replacing each input feature with zeros, but with a trainable [MASK] embedding.
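For illustration, a minimal sketch of that replacement step, assuming `mask_schema` is the boolean tensor of masked positions and `mask_emb` is the trainable [MASK] embedding (the names are illustrative, not the library's exact internals):

```python
import torch

# inputs:      (batch, seq_len, d_model) item embeddings fed to the transformer
# mask_schema: (batch, seq_len) boolean tensor, True at the positions to mask
# mask_emb:    (d_model,) trainable [MASK] embedding
masked_inputs = torch.where(
    mask_schema.unsqueeze(-1),   # broadcast the mask over the embedding dimension
    mask_emb.expand_as(inputs),  # masked positions receive the [MASK] embedding
    inputs,                      # all other positions keep their original embeddings
)
```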
Hope that helps you understand how masking is used in T4Rec. Could you also please test the shared code on your end to validate that the scores match the ones returned by `trainer.evaluate()`?
Thanks for your time and answer, @sararb! I'm going to test it out. It's going to take me a little while though!
So I tested out the code, but I had to change it a bit to accommodate my own implementation:

```python
trainer.eval_dataset_or_path = 'short_eval.parquet'
train_metrics = trainer.evaluate(metric_key_prefix='eval')
```

Returns:

As I mentioned before, I changed my code from GPU to CPU:

```python
_ = model.eval()
_ = model.cpu()
```

I re-ran my custom code with my custom recall function:

```python
recall(topk_pred, labels)
```

Returns:
And finally, the code that you sent me, @sararb:

```python
mdl = trainer.model.wrapper_module

input_block = mdl.heads[0].body.inputs
masking_block = mdl.heads[0].body.inputs.masking
sequential_block = mdl.heads[0].body[1]
transformer_block = mdl.heads[0].body[2]
prediction_block = mdl.heads[0].prediction_task_dict['next-item']

inputs_wo_masking = input_block(batch_pred, ignore_masking=True, training=False)
masked_positions = masking_block._compute_masked_targets(batch_pred['products_padded'], training=False)
apply_masking_to_inputs = masking_block.apply_mask_to_inputs(inputs_wo_masking, masked_positions.schema)
sequential_pass = sequential_block(apply_masking_to_inputs, training=False)
transformer_pass = transformer_block(sequential_pass, training=False)
predictions = prediction_block(transformer_pass)['predictions']

_, topk_pred = torch.topk(predictions, k=10)
recall(topk_pred, labels)
```

Which returns the exact same value:
Still, after testing this I'm left with some questions, as I feel I'm struggling with offline predictions. For my current use case I cannot rely on Triton, so I'm planning on building my own API to serve the model, which means I need to perform offline inference.

The problem I notice is that `model(batch_pred, training=False)` always masks the last existing value of each sequence, so it's trying to predict the masked value instead of the new/next item. Given that, I would assume the correct way to perform inference should be `model(batch_pred, training=False, ignore_masking=True)`, as it should of course "ignore the mask".

As I understand it, while performing the prediction the last value is replaced by a special "[MASK]" token instead of a zero. The problem is that in a real-case scenario we won't have a [MASK] token. So I was trying to evaluate the model in a "simulated real-life scenario" by manually masking that last value with a zero and performing inference over those sequences, as I already know the label. In this case the model under-performed (as seen in the original question), returning a much lower score.

I don't know if I'm missing or misunderstanding something, but I've been struggling with this for a while. (I also hope I made my point clear, as English is my second language.) It would be awesome if some examples could be provided. I've been checking the previous issues on this topic, but none of them have actually helped me resolve it.

Again, thanks a lot for your time and effort!
Hi all, I am also a bit confused by this discrepancy. To summarize, it seems there are two ways in which one can get next-item predictions for an input sequence:

1. Calling `model(batch, training=False)`, which masks the last item of each sequence internally and predicts it.
2. Manually masking/removing the last item and calling the model with `ignore_masking=True`.

It seems that these two approaches give noticeably different results.
@Conway, thanks for your questions and we apologize for the confusion. The parameter
@TassaraR, thank you for investigating the model’s performance in different modes. Regarding your point:
Code to add dummy interaction to manually masked sequences:
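The snippet itself isn't reproduced above; as an illustration only, here is a sketch of one way to do this, assuming `sequences` is a zero-padded `(batch, max_len)` tensor of item ids and `dummy_item_id` is any valid item id used purely as a placeholder to be masked:

```python
import torch

def add_dummy_interaction(sequences: torch.Tensor, dummy_item_id: int) -> torch.Tensor:
    """Append a dummy item right after the last real item of each zero-padded session.

    With model(batch, training=False), the masking module masks the last non-padded
    position, so the dummy gets masked and the real history stays intact.
    """
    sequences = sequences.clone()
    lengths = (sequences != 0).sum(dim=1)                 # number of real items per session
    assert int(lengths.max()) < sequences.size(1), "no free slot left for the dummy item"
    rows = torch.arange(sequences.size(0))
    sequences[rows, lengths] = dummy_item_id              # first padding slot becomes the dummy
    return sequences
```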
Thank you for pointing out this discrepancy! I opened a bug ticket to track the issue, and we are currently working on a fix to ensure that the performance of the model is not affected during inference.
Hi @sararb, thanks for your answer! First of all, it seems pretty clear now. I had also planned on creating a dummy variable and performing masking. Let's imagine an array of maximum size 5 that contains a session. In this session the user has a full cart, so it has 5 items in total.
If we want to get recommendations, we then need to create a dummy variable so that it can be masked later. The only issue is that, in a real-case scenario, given this array, the only way to do that might be to shift the array's values to the left and drop the first item, as we don't have any more space in the array to fit the dummy variable.
This way we are actually losing information, as we had to remove the first item. One approach to solve this, and the one I have in mind, is extending my input size by 1. So instead of having a sequence of 5, we now work with a sequence of 6, but the last spot will always be a 0 (so each session should always be padded by at least one 0 at the end). We train our model with this extended sequence.
This value can then be replaced by the dummy variable to be masked, and thus we don't have to drop any of our existing values. I don't know, though, whether this may create issues at inference, as the model was never trained on a session with 6 real items. (I hope it doesn't cause any problems.)
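A small sketch of the two options just described (illustrative only; the item ids and lengths are made up):

```python
import torch

session = torch.tensor([11, 22, 33, 44, 55])   # a "full" session of 5 items
dummy = 999                                     # placeholder item to be masked at inference

# Option 1: keep max length 5 -- shift left and drop the oldest item to make room.
shifted = torch.cat([session[1:], torch.tensor([dummy])])    # [22, 33, 44, 55, 999]

# Option 2: train with max length 6 so the last slot is always padding (0);
# at inference that slot can hold the dummy without losing any history.
extended = torch.cat([session, torch.tensor([0])])           # [11, 22, 33, 44, 55, 0]
extended[-1] = dummy                                         # [11, 22, 33, 44, 55, 999]
```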
❓ Questions & Help
Details
transformers4rec==0.1.13
I've been trying out t4r for a while and I decided to try and replicate the evaluation metrics on my own by performing offline predictions.
I've been using a model architecture similar to one in the provided examples and similar features.
Model:
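The model definition isn't included in the issue as scraped; as a stand-in, here is a sketch of an XLNet next-item model similar to the library's end-to-end example notebook (the schema file, dimensions, and metric cut-offs are assumptions, not the reporter's actual configuration):

```python
import transformers4rec.torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, RecallAt
from merlin_standard_lib import Schema

schema = Schema().from_proto_text("schema.pbtxt")  # assumed schema file
max_sequence_length, d_model = 20, 64

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    masking="mlm",                 # masked language modeling, as in the example notebooks
    d_output=d_model,
)

transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=4, n_layer=2, total_seq_length=max_sequence_length
)

body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock([d_model]),
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)

metrics = [RecallAt(top_ks=[10, 20], labels_onehot=True),
           NDCGAt(top_ks=[10, 20], labels_onehot=True)]

model = tr.Model(tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, hf_format=True, metrics=metrics),
    inputs=inputs,
))
```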
When I evaluate the metrics by using:
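The exact call isn't shown here, but judging from the snippets later in the thread it is presumably along these lines (the parquet path is made up):

```python
trainer.eval_dataset_or_path = "eval.parquet"       # assumed evaluation file
eval_metrics = trainer.evaluate(metric_key_prefix="eval")
```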
I get the following recall:
```
'eval_/next-item/recall_at_10': 0.1004810556769371
```
I tried writing some code to replicate the results and validate that the evaluation metrics returned by the model were correct.
(As the model was already trained I moved everything from GPU to CPU)
The following script transforms my data from a `pd.DataFrame` to a dict with the appropriate format for the T4Rec model and also extracts the labels (the last product of each sequence).
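The script itself isn't included above; a rough sketch of what such a conversion could look like, assuming a DataFrame with a single fixed-length, zero-padded list column `products_padded` (the column name is taken from the code later in the thread; everything else, including the helper name `df_to_batch`, is hypothetical):

```python
import numpy as np
import pandas as pd
import torch

def df_to_batch(df: pd.DataFrame, col: str = "products_padded"):
    """Stack list-valued session rows into a padded tensor dict and extract the labels."""
    padded = torch.tensor(np.stack(df[col].to_numpy()), dtype=torch.long)  # (batch, max_len)
    lengths = (padded != 0).sum(dim=1)                                     # real items per session
    labels = padded[torch.arange(len(padded)), lengths - 1]                # last real item = label
    return {col: padded}, labels

# eval_df is an assumed DataFrame of sessions; other features are omitted for brevity
batch_pred, labels = df_to_batch(eval_df)
```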
For the moment I'm not removing the last item from the sessions in my dataset, as I know `model(data, training=False)` does that for me. I also created a function to evaluate the recall on my own:
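The recall function isn't reproduced above; a simple recall@k along the following lines would be consistent with how it is called later (`recall(topk_pred, labels)`):

```python
import torch

def recall(topk_pred: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of sessions whose true next item appears among the top-k predicted ids.

    topk_pred: (batch, k) item ids returned by torch.topk
    labels:    (batch,)   ground-truth next-item ids
    """
    hits = (topk_pred == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```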
I performed offline predictions by using the following code:
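That code isn't shown here either; presumably it looked roughly like this sketch (the handling of the model output is hedged because, depending on the head configuration, it may be a tensor or a dict):

```python
import torch

_ = model.eval()
_ = model.cpu()

with torch.no_grad():
    out = model(batch_pred, training=False)  # masks the last item of each session internally
    predictions = out["predictions"] if isinstance(out, dict) else out

_, topk_pred = torch.topk(predictions, k=10)
print(recall(topk_pred, labels))
```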
After evaluating using my code, it returns a pretty similar result:
BUT when I try to run the predictions by "manually" masking the last item of each session and using `ignore_masking=True`, I get an entirely different result. I re-ran my script for label extraction and for adapting the pandas DataFrame to a dict, but this time I mask the last item of each session:
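That variant of the script isn't shown above; a sketch of the manual masking, reusing the hypothetical `df_to_batch` helper from earlier:

```python
import torch

batch_pred, _ = df_to_batch(eval_df)                 # full sequences
seqs = batch_pred["products_padded"]
lengths = (seqs != 0).sum(dim=1)
rows = torch.arange(len(seqs))

labels = seqs[rows, lengths - 1].clone()             # keep the true last item as the label...
seqs[rows, lengths - 1] = 0                          # ...and zero it out in the model input
```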
I did check that the masking process was performed correctly. I then re-ran the inference phase with `ignore_masking=True` and got different and disappointing results.
I can't figure out whether I'm missing something or did something wrong, but for the moment I can't find anything on my side.
It would be of great help if someone could check this issue out and try to replicate this experiment.
Thanks in advance to anyone willing to look into this issue.