[FEA]: PreprocessLogParsingStage should be folded into PreprocessNLPStage #801

dagardner-nv · 2023-03-27T19:03:17Z

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

The pre-processing stage in examples/log_parsing is 99% similar to PreprocessNLPStage. It should either be a subclass of PreprocessNLPStage, or PreprocessNLPStage should be optionally configurable to add white-space around punctuation marks.

Describe your ideal solution

n/a

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

I agree to follow this project's Code of Conduct
I have searched the open feature requests and have found no duplicates for this feature request

dagardner-nv · 2023-04-03T23:14:58Z

Folding this into PreprocessNLPStage has the advantage of enabling C++ execution for the stage, while subclassing has the advantage of providing an example of how to extend one of the built-in Morpheus stages.

Since pre-processing is likely the most common custom class users will want to implement, I'm going to go with the subclass route.
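
For illustration only, the two options weighed above could look roughly like the sketch below. The class names, the `pad_punctuation` flag, and the `clean` method are hypothetical stand-ins operating on plain Python strings; they are not the Morpheus stage API.

```python
import string


class NLPPreprocessSketch:
    """Hypothetical stand-in for a text pre-processing stage (not the Morpheus API)."""

    def __init__(self, pad_punctuation: bool = False):
        # "Fold in" route: a single class whose behavior is toggled by a flag.
        self.pad_punctuation = pad_punctuation

    def clean(self, text: str) -> str:
        if self.pad_punctuation:
            # Surround every punctuation symbol with white-space.
            for symbol in string.punctuation:
                text = text.replace(symbol, ' ' + symbol + ' ')
        return text


class LogParsingPreprocessSketch(NLPPreprocessSketch):
    """Subclass route: reuse the parent logic and only change the defaults."""

    def __init__(self):
        super().__init__(pad_punctuation=True)
```

In this sketch both routes produce the same text; the difference is only whether the variation lives in a constructor argument or in a separate class that users can read as an extension example.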

dagardner-nv · 2023-04-04T16:14:11Z

Turns out PreprocessLogParsingStage isn't needed at all. The only thing it does is provide different default constructor values, along with surrounding punctuation with white-space:

        for symbol in string.punctuation:
            text_ser = text_ser.str.replace(symbol, ' ' + symbol + ' ')

However, this isn't needed: the cudf subword tokenizer removes punctuation, and the results from both stages are identical.

From:
https://github.com/rapidsai/cudf/blob/43f0c3d70cb8bf777d6856c432245faaa947cc4f/cpp/include/nvtext/subword_tokenize.hpp

 * The strings are first normalized by converting to lower-case, removing
 * punctuation, replacing a select set of multi-byte characters and
 * whitespace characters.
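
To see what the punctuation loop quoted earlier does on its own, here is a minimal sketch that applies the same replacement to a plain Python string rather than a cudf string Series. The helper name `pad_punctuation` and the sample inputs are assumptions made for illustration.

```python
import string


def pad_punctuation(text: str) -> str:
    """Surround each punctuation symbol in `text` with single spaces.

    Plain-Python analogue of the loop above; the original operates on a
    string Series through its `.str.replace` accessor.
    """
    for symbol in string.punctuation:
        text = text.replace(symbol, ' ' + symbol + ' ')
    return text


if __name__ == "__main__":
    # Existing spaces are kept, so runs of white-space can appear in the output.
    print(pad_punctuation("GET /index.html"))
```

Splitting the padded text on white-space yields each punctuation mark as its own token, which is the only effect this step has before tokenization.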

* Code in `PreprocessLogParsingStage` was about 99% the same as `PreprocessNLPStage`
* The only thing `PreprocessLogParsingStage` provided was different default constructor values, along with some special handling of punctuation. However the cudf subword_tokenizer removes all punctuation.

fixes #801

Authors:
- David Gardner (https://github.com/dagardner-nv)

Approvers:
- Michael Demoret (https://github.com/mdemoret-nv)

URL: #842

dagardner-nv added the feature request New feature or request label Mar 27, 2023

dagardner-nv self-assigned this Mar 27, 2023

github-actions bot added the Needs Triage Need team to review and classify label Mar 27, 2023

dagardner-nv mentioned this issue Apr 4, 2023

Replace usage of PreprocessLogParsingStage with PreprocessNLPStage #842

Merged

3 tasks

dagardner-nv added 3 - Ready for Review and removed Needs Triage Need team to review and classify labels Apr 4, 2023

rapids-bot bot closed this as completed in #842 Apr 10, 2023
