-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StringNormalizer drops strings when they only contain stop words #1031
Conversation
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
@xadupre Hi! Any updates on this problem ? As I understand the fast fix would be to process every string independently. RIght ? |
That would work. |
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
My tests show that performance on InferenceSession run is twice slower without vectorization - 400 rps instead of 800 rps on my server. |
If you are using a loop, it is not really suprising. There is no parallelization even though each row is processed independently. |
May you suggest some impovement in this case ? My guess is if I know indices of dropped strings then I can rerun inference session only on them instead loop over all input items. |
Also I want to notice that I don't user stop_words parameter. How exactly set stop_words parameter to empty or None while converting to ONNX model ? |
One issue is StringNormalizer is defined in onnx. To change its behaviour, it has to be changed in onnx and onnxruntime. It is long. onnxruntime-extensions is a project implementing custom kernel without going through the process of validating the definition by onnx mainteners. The best option here is to implement is to implement a custom kernel. I'm hesitating in adding more in onnxruntime or creating a similar projet to host custom kernel dedicated to standard machine learning. But a custom kernel would solve the issue. |
@xadupre What do you mean by custom kernel exactly ? |
A C++ implementation of an operator. Let's assume we define the operator StringNormalizer2. onnxruntime does not know it because it is new so it cannot run the inference unless we give it a piece of code compiled into an assembly. onnxruntime lets you do that. See https://onnxruntime.ai/docs/reference/operators/add-custom-op.html. |
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
This issue seems to be fixed with scikit-learn==1.5.0. Could you check on your side? |
…x#1031) * investigate Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix unit tests Signed-off-by: Xavier Dupre <xadupre@microsoft.com> --------- Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
* Extend CI to test with onnxruntime==1.18.0 (#1093) * Extend CI to test with onnxruntime==1.18.0 Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update doc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * simplify pipelines Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * rename master into main Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * action Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * example Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * remove benchmark Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * doc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix unittest Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> --------- Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * Increase last supported opset to 19 in readme (#1087) See: https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/__init__.py#L12 Signed-off-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com> Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * Fix the converters for scikit-learn==1.5.0 (#1095) * Extend CI to test with onnxruntime==1.18.0 Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update doc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * simplify pipelines Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * rename master into main Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * action Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * update CI Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * example Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * remove benchmark Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * doc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix unittest Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix title Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix disc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * better ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * linear Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix two unit tests Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix PLSRegression Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix version Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * ci Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * precision Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix unit test Signed-off-by: Xavier Dupre <xadupre@microsoft.com> --------- Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * StringNormalizer drops strings when they only contain stop words (#1031) * investigate Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * fix unit tests Signed-off-by: Xavier Dupre <xadupre@microsoft.com> --------- Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * Fix for gaussian process Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * add test for issue 1073 Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * Fix multi dimensional GaussianRegressor Signed-off-by: Xavier Dupre <xadupre@microsoft.com> * doc Signed-off-by: Xavier Dupre <xadupre@microsoft.com> --------- Signed-off-by: Xavier Dupre <xadupre@microsoft.com> Signed-off-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com> Co-authored-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com> Co-authored-by: Liberty Askew <laskew@sailgp.com> Co-authored-by: dreivmeister <felix.brandt@fh-bielefeld.de>
Issue raised in microsoft/onnxruntime#17795 and #998. After investigation, operator StringNormalizer drops strings containing only stopwords. The expected output can be obtained by running the inference for every input independantly.
A possible fix would be to update onnx specifications to not drop strings. Another fix could be to update the converter and loop over rows but that's not very efficient.