Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require that all SpanGroup spans are from the current doc #12569

Merged
merged 2 commits into from
Jun 1, 2023

Conversation

adrianeboyd
Copy link
Contributor

Description

The restriction on only adding spans from the current doc were already implemented for all operations except for SpanGroup.__init__.

Initialize copied spans for SpanGroup.copy with Doc.char_span in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans.

Types of change

Bug fix?

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Apr 24, 2023
@adrianeboyd adrianeboyd marked this pull request as draft April 24, 2023 17:36
@adrianeboyd
Copy link
Contributor Author

This originally stemmed from https://support.prodi.gy/t/encountering-an-error-with-custom-iob2-format-in-the-ner-spancat-compare-project/6504 but kind of blew up while I was working on it.

Due to how the IOB converter is designed (i.e., it doesn't support a provided vocab or pipeline), the project created copies of docs with different vocabs but the same text and then tried to combine spans from different docs into the same reference doc through SpanGroup.__init__. I initially thought it was just a vocab issue, but it expanded into the different doc problem on top. The relevant code is here:

https://github.com/explosion/projects/blob/dbc1a2e4383e029967b67ab868bd05aa515477e9/experimental/ner_spancat_compare/scripts/convert.py#L45-L53

All SpanGroup methods that append spans check that the docs match, but SpanGroup.__init__ didn't require this and I don't think it makes sense to support spans that are not from the current doc. This is breaking, but I don't feel like the alternatives make sense because it would be too easy to set invalid spans in a SpanGroup otherwise.

This change gets especially messy in terms of copying spans within Example objects as within the tests, but it still seems like directly copying predicted spans into a reference doc, which may have a different tokenization, should not be allowed.

When initializing a SpanGroup all the spans need to be from same doc, but when copying spans between docs, there is still no requirement that the texts match, just that the spans end up as valid spans.

The copied spans are based on character offsets so that it's possible to copy spans between two docs with different tokenizations. This required changes to Doc.copy because otherwise the Doc was empty at the point when the spans were copied.

The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.

Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.
@adrianeboyd adrianeboyd added the v3.6 Related to v3.6 label Apr 24, 2023
@adrianeboyd adrianeboyd marked this pull request as ready for review April 24, 2023 18:16
Copy link
Contributor

@rmitsch rmitsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only minor comments. As I'm not super familiar with the Span/Doc internals, I'd recommend a second reviewer.

spacy/tests/parser/test_ner.py Outdated Show resolved Hide resolved
spacy/tokens/span_group.pyx Show resolved Hide resolved
@svlandeg svlandeg merged commit c4112a1 into explosion:master Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects v3.6 Related to v3.6
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants