Require that all SpanGroup spans are from the current doc #12569

adrianeboyd · 2023-04-24T17:35:05Z

Description

The restriction on only adding spans from the current doc were already implemented for all operations except for SpanGroup.__init__.

Initialize copied spans for SpanGroup.copy with Doc.char_span in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans.

Types of change

Bug fix?

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd · 2023-04-24T17:50:08Z

This originally stemmed from https://support.prodi.gy/t/encountering-an-error-with-custom-iob2-format-in-the-ner-spancat-compare-project/6504 but kind of blew up while I was working on it.

Due to how the IOB converter is designed (i.e., it doesn't support a provided vocab or pipeline), the project created copies of docs with different vocabs but the same text and then tried to combine spans from different docs into the same reference doc through SpanGroup.__init__. I initially thought it was just a vocab issue, but it expanded into the different doc problem on top. The relevant code is here:

https://github.com/explosion/projects/blob/dbc1a2e4383e029967b67ab868bd05aa515477e9/experimental/ner_spancat_compare/scripts/convert.py#L45-L53

All SpanGroup methods that append spans check that the docs match, but SpanGroup.__init__ didn't require this and I don't think it makes sense to support spans that are not from the current doc. This is breaking, but I don't feel like the alternatives make sense because it would be too easy to set invalid spans in a SpanGroup otherwise.

This change gets especially messy in terms of copying spans within Example objects as within the tests, but it still seems like directly copying predicted spans into a reference doc, which may have a different tokenization, should not be allowed.

When initializing a SpanGroup all the spans need to be from same doc, but when copying spans between docs, there is still no requirement that the texts match, just that the spans end up as valid spans.

The copied spans are based on character offsets so that it's possible to copy spans between two docs with different tokenizations. This required changes to Doc.copy because otherwise the Doc was empty at the point when the spans were copied.

The restriction on only adding spans from the current doc were already implemented for all operations except for `SpanGroup.__init__`. Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans.

rmitsch

LGTM, only minor comments. As I'm not super familiar with the Span/Doc internals, I'd recommend a second reviewer.

spacy/tests/parser/test_ner.py

spacy/tokens/span_group.pyx

adrianeboyd added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Apr 24, 2023

adrianeboyd marked this pull request as draft April 24, 2023 17:36

adrianeboyd force-pushed the bugfix/spangroup-vocab branch from 373e7ac to 445d5bc Compare April 24, 2023 17:49

adrianeboyd force-pushed the bugfix/spangroup-vocab branch from 445d5bc to b9c3045 Compare April 24, 2023 17:50

adrianeboyd added the v3.6 Related to v3.6 label Apr 24, 2023

adrianeboyd marked this pull request as ready for review April 24, 2023 18:16

rmitsch approved these changes Apr 28, 2023

View reviewed changes

spacy/tests/parser/test_ner.py Outdated Show resolved Hide resolved

spacy/tokens/span_group.pyx Show resolved Hide resolved

Undo change in test_beam_overfitting_IO

5dcad99

svlandeg merged commit c4112a1 into explosion:master Jun 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Require that all SpanGroup spans are from the current doc #12569

Require that all SpanGroup spans are from the current doc #12569

adrianeboyd commented Apr 24, 2023

adrianeboyd commented Apr 24, 2023

rmitsch left a comment

Require that all SpanGroup spans are from the current doc #12569

Require that all SpanGroup spans are from the current doc #12569

Conversation

adrianeboyd commented Apr 24, 2023

Description

Types of change

Checklist

adrianeboyd commented Apr 24, 2023

rmitsch left a comment

Choose a reason for hiding this comment