Require that all SpanGroup spans are from the current doc #12569
This originally stemmed from https://support.prodi.gy/t/encountering-an-error-with-custom-iob2-format-in-the-ner-spancat-compare-project/6504 but kind of blew up while I was working on it.

Due to how the IOB converter is designed (i.e., it doesn't support a provided vocab or pipeline), the project created copies of docs with different vocabs but the same text and then tried to combine spans from different docs into the same reference doc through All This change gets especially messy in terms of copying spans within When initializing a

The copied spans are based on character offsets so that it's possible to copy spans between two docs with different tokenizations. This required changes to
The restriction on only adding spans from the current doc was already implemented for all operations except for `SpanGroup.__init__`. Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans.
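To illustrate the idea of copying spans by character offsets, here is a minimal pure-Python sketch. It is not spaCy's implementation: `token_offsets`, `char_span_to_tokens`, and `copy_spans` are hypothetical helpers, tokens are modelled as plain `(start_char, end_char)` pairs, and raising `ValueError` for a span that does not align is an assumption of this sketch. Like the change described above, it does not check that the two texts are identical, only that the offsets land on token boundaries in the target.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]  # one (start_char, end_char) pair per token


def token_offsets(words: List[str], spaces: List[bool]) -> Offsets:
    """Compute character offsets for tokens given trailing-space flags."""
    offsets, pos = [], 0
    for word, space in zip(words, spaces):
        offsets.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)
    return offsets


def char_span_to_tokens(
    offsets: Offsets, start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map a character range to a token range, or None if it does not
    start and end exactly on token boundaries."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i for i, (_, e) in enumerate(offsets)}
    if start_char not in starts or end_char not in ends:
        return None
    start, end = starts[start_char], ends[end_char] + 1
    return (start, end) if start < end else None


def copy_spans(
    source: Offsets, target: Offsets, spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Copy token spans from source to target via character offsets."""
    copied = []
    for start, end in spans:
        start_char, end_char = source[start][0], source[end - 1][1]
        aligned = char_span_to_tokens(target, start_char, end_char)
        if aligned is None:
            # Assumption of this sketch: reject spans that do not align.
            raise ValueError(f"chars {start_char}-{end_char} are not a valid span")
        copied.append(aligned)
    return copied


# Same text, two tokenizations of "New York-based firm".
coarse = token_offsets(["New", "York-based", "firm"], [True, True, False])
fine = token_offsets(
    ["New", "York", "-", "based", "firm"], [True, False, False, True, False]
)

# "New York-based" is tokens 0..2 in the coarse doc and 0..4 in the fine doc.
copied = copy_spans(coarse, fine, [(0, 2)])
```

Copying in the other direction can fail: "New York" (tokens 0..2 in the fine tokenization) ends in the middle of the coarse token "York-based", so the sketch rejects it rather than producing an invalid span.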
LGTM, only minor comments. As I'm not super familiar with the `Span`/`Doc` internals, I'd recommend a second reviewer.
Description
The restriction on only adding spans from the current doc was already implemented for all operations except for `SpanGroup.__init__`. Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans.
Types of change
Bug fix?
Checklist