Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: PreprocessLogParsingStage should be folded into PreprocessNLPStage #801

Closed
2 tasks done
dagardner-nv opened this issue Mar 27, 2023 · 2 comments · Fixed by #842
Closed
2 tasks done

[FEA]: PreprocessLogParsingStage should be folded into PreprocessNLPStage #801

dagardner-nv opened this issue Mar 27, 2023 · 2 comments · Fixed by #842
Assignees
Labels
feature request New feature or request

Comments

@dagardner-nv
Copy link
Contributor

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

The pre-processing stage in examples/log_parsing is 99% similar to PreprocessNLPStage and should either be a subclass or PreprocessNLPStage should optionally be configured to add white-space around punctuation marks.

Describe your ideal solution

n/a

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@dagardner-nv dagardner-nv added the feature request New feature or request label Mar 27, 2023
@dagardner-nv dagardner-nv self-assigned this Mar 27, 2023
@github-actions github-actions bot added the Needs Triage Need team to review and classify label Mar 27, 2023
@dagardner-nv
Copy link
Contributor Author

Folding this into the PreprocessNLPStage has the advantage that it enables C++ execution for the stage, subclassing has the advantage of providing an example on how to extend one of the built-in Morpheus stages.

Since pre-processing is likely the most common custom class for users to want to implement I'm going to go with the subclass route.

@dagardner-nv
Copy link
Contributor Author

Turns out PreprocessLogParsingStage isn't needed at all. The only thing it does is provide different default constructor values, along with surrounding punctuation with white-space:

        for symbol in string.punctuation:
            text_ser = text_ser.str.replace(symbol, ' ' + symbol + ' ')

However this isn't needed as the cudf subword tokenizer removes punctuation, and the results from both stages are identical.

From:
https://github.com/rapidsai/cudf/blob/43f0c3d70cb8bf777d6856c432245faaa947cc4f/cpp/include/nvtext/subword_tokenize.hpp

 * The strings are first normalized by converting to lower-case, removing
 * punctuation, replacing a select set of multi-byte characters and
 * whitespace characters.

@dagardner-nv dagardner-nv added 3 - Ready for Review and removed Needs Triage Need team to review and classify labels Apr 4, 2023
@rapids-bot rapids-bot bot closed this as completed in #842 Apr 10, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 10, 2023
)

* Code in `PreprocessLogParsingStage` was about 99% the same as `PreprocessNLPStage`
* The only thing `PreprocessLogParsingStage` provided was different default constructor values, along with some special handling of punctuation. However the cudf subword_tokenizer removes all punctuation.

fixes #801

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #842
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant