Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
The pre-processing stage in examples/log_parsing is 99% similar to PreprocessNLPStage and should either be a subclass or PreprocessNLPStage should optionally be configured to add white-space around punctuation marks.
Describe your ideal solution
n/a
Describe any alternatives you have considered
No response
Additional context
No response
Code of Conduct
I agree to follow this project's Code of Conduct
I have searched the open feature requests and have found no duplicates for this feature request
Folding this into PreprocessNLPStage has the advantage of enabling C++ execution for the stage; subclassing has the advantage of providing an example of how to extend one of the built-in Morpheus stages.
Since a pre-processing stage is likely the custom class users most often want to implement, I'm going to go with the subclass route.
Turns out PreprocessLogParsingStage isn't needed at all. The only thing it does is provide different default constructor values, along with surrounding punctuation with white-space:
for symbol in string.punctuation:
    text_ser = text_ser.str.replace(symbol, ' ' + symbol + ' ')
However, this isn't needed, as the cudf subword tokenizer removes punctuation, and the results from both stages are identical.
* The strings are first normalized by converting to lower-case, removing
* punctuation, replacing a select set of multi-byte characters and
* whitespace characters.
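For anyone unsure what the punctuation handling quoted above actually did to the text, here is a minimal sketch of the same loop on a plain Python string. It is an illustration only: the real stage operated on a cudf Series via `.str.replace`, and the helper name `pad_punctuation` and the sample log line are made up for this example.

```python
import string


def pad_punctuation(text: str) -> str:
    """Surround every ASCII punctuation character with single spaces.

    Hypothetical helper: mirrors the loop from the log_parsing stage,
    but on a plain str instead of a cudf Series.
    """
    for symbol in string.punctuation:
        text = text.replace(symbol, ' ' + symbol + ' ')
    return text


if __name__ == "__main__":
    line = "error: disk full (code=28)"  # made-up sample log line
    padded = pad_punctuation(line)
    # After padding, a whitespace split isolates each punctuation mark.
    print(padded.split())
```

The padding only changes where whitespace falls around punctuation, so a tokenizer that does its own punctuation handling during normalization has no need for it as a separate step.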
* Code in `PreprocessLogParsingStage` was about 99% the same as `PreprocessNLPStage`
* The only thing `PreprocessLogParsingStage` provided was different default constructor values, along with some special handling of punctuation. However, the cudf subword_tokenizer removes all punctuation.
fixes #801
Authors:
- David Gardner (https://github.com/dagardner-nv)
Approvers:
- Michael Demoret (https://github.com/mdemoret-nv)
URL: #842