Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringNormalizer drops strings when they only contain stop words #1031

Merged
merged 6 commits into from
May 28, 2024

Conversation

xadupre
Copy link
Collaborator

@xadupre xadupre commented Oct 5, 2023

Issue raised in microsoft/onnxruntime#17795 and #998. After investigation, operator StringNormalizer drops strings containing only stopwords. The expected output can be obtained by running the inference for every input independantly.

A possible fix would be to update onnx specifications to not drop strings. Another fix could be to update the converter and loop over rows but that's not very efficient.

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
@vsbaldeev
Copy link

@xadupre Hi! Any updates on this problem ?

As I understand the fast fix would be to process every string independently. RIght ?

@xadupre
Copy link
Collaborator Author

xadupre commented Oct 23, 2023

That would work.

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
@vsbaldeev
Copy link

My tests show that performance on InferenceSession run is twice slower without vectorization - 400 rps instead of 800 rps on my server.

@xadupre
Copy link
Collaborator Author

xadupre commented Dec 13, 2023

If you are using a loop, it is not really suprising. There is no parallelization even though each row is processed independently.

@vsbaldeev
Copy link

If you are using a loop, it is not really suprising. There is no parallelization even though each row is processed independently.

May you suggest some impovement in this case ?

My guess is if I know indices of dropped strings then I can rerun inference session only on them instead loop over all input items.

@vsbaldeev
Copy link

Also I want to notice that I don't user stop_words parameter. How exactly set stop_words parameter to empty or None while converting to ONNX model ?

@xadupre
Copy link
Collaborator Author

xadupre commented Dec 14, 2023

One issue is StringNormalizer is defined in onnx. To change its behaviour, it has to be changed in onnx and onnxruntime. It is long. onnxruntime-extensions is a project implementing custom kernel without going through the process of validating the definition by onnx mainteners. The best option here is to implement is to implement a custom kernel. I'm hesitating in adding more in onnxruntime or creating a similar projet to host custom kernel dedicated to standard machine learning. But a custom kernel would solve the issue.

@vsbaldeev
Copy link

@xadupre What do you mean by custom kernel exactly ?

@xadupre
Copy link
Collaborator Author

xadupre commented Dec 15, 2023

A C++ implementation of an operator. Let's assume we define the operator StringNormalizer2. onnxruntime does not know it because it is new so it cannot run the inference unless we give it a piece of code compiled into an assembly. onnxruntime lets you do that. See https://onnxruntime.ai/docs/reference/operators/add-custom-op.html.

@xadupre
Copy link
Collaborator Author

xadupre commented May 23, 2024

This issue seems to be fixed with scikit-learn==1.5.0. Could you check on your side?

@xadupre xadupre merged commit 80a5d14 into onnx:main May 28, 2024
20 checks passed
xadupre added a commit to xadupre/sklearn-onnx that referenced this pull request May 28, 2024
…x#1031)

* investigate

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix unit tests

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

---------

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
xadupre added a commit that referenced this pull request May 28, 2024
* Extend CI to test with onnxruntime==1.18.0 (#1093)

* Extend CI to test with onnxruntime==1.18.0

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update doc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* simplify pipelines

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* rename master into main

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* action

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* example

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* remove benchmark

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* doc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix unittest

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

---------

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* Increase last supported opset to 19 in readme (#1087)

See: https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/__init__.py#L12

Signed-off-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com>
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* Fix the converters for scikit-learn==1.5.0 (#1095)

* Extend CI to test with onnxruntime==1.18.0

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update doc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* simplify pipelines

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* rename master into main

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* action

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* update CI

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* example

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* remove benchmark

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* doc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix unittest

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix title

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix disc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* better ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* linear

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix two unit tests

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix PLSRegression

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix version

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* ci

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* precision

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix unit test

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

---------

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* StringNormalizer drops strings when they only contain stop words (#1031)

* investigate

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* fix unit tests

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

---------

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* Fix for gaussian process

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* add test for issue 1073

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* Fix multi dimensional GaussianRegressor

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

* doc

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>

---------

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Signed-off-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com>
Co-authored-by: XiangRongLin <41164160+XiangRongLin@users.noreply.github.com>
Co-authored-by: Liberty Askew <laskew@sailgp.com>
Co-authored-by: dreivmeister <felix.brandt@fh-bielefeld.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants