FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30
Comments
Hmm, looks like it affects the general logic. This change fails
Thanks for creating this issue, it's really on point with what we need right now 🙌 It's always nice to see a high-quality GitHub Issue, thank you for contributing! 🚀🚀
But what if the disclaimer message were a little bit longer, then it would break the >50% check 🤔 I'm thinking, maybe we should have a "failover step", where when the TopLevelSectionTitle couldn't be found the normal way, then scan "TextElements" to be converted to TopLevelSectionTitle. Or maybe even better - pre-split the title into two semantic elements, as "No matters require disclosure" could be its own element? I'll do some exploration/refactoring, and get back to you soon |
I will quickly mention that failing e2e tests don't necessarily mean that the change is bad. Sometimes the tests are wrong and need to be updated, for example when we improve the parser output. It's a moving goalpost to make sure we don't regress. In this case the tests say that two elements are now recognized as Titles while previously they were regular text paragraphs |
Sure, thanks! |
Moved to alphanome-ai/sec-ai#47 |
Related to alphanome-ai/sec-ai#47
Currently, TextElement can become HighlightedTextElement only if 80% of the text content is bold with some font weight. This is not ideal in cases such as part2item3 of 10-Q_MMM_0000066740-23-000058, where only ~55% of the text content is bold with some font weight. [The screenshot of this section and its style_string default dict are not preserved here.]
Because of this, TextElement cannot become HighlightedTextElement, which in turn cannot become TitleElement, which in turn cannot become TopLevelSectionElement.
Solution:
Change PERCENTAGE_THRESHOLD from 80 to a value less than 54.9.
sec-parser/sec_parser/semantic_elements/highlighted_text_element.py
Line 60 in 58304f0
To find a sweet spot, I checked whether there were more failures like this one where the threshold needs to be decreased, but could not find any in the dataset. So I think the best value for now would be 50 percent, assuming neither the trailing description nor the bold heading has more text content.
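To make the effect of the threshold concrete, here is a minimal sketch of the kind of check described above. It is not the actual sec-parser code: the function names, the `BOLD_THRESHOLD` value of 600, the shape of the style dictionary, and the 55/45 split are all assumptions chosen to mirror the ~55% bold case from this issue.

```python
# Hypothetical sketch, not the sec-parser implementation.
# Assumption: styles arrive as {(css_property, value): percent_of_text}.
BOLD_THRESHOLD = 600  # assumed numeric font-weight cutoff for "bold"


def bold_share(style_percentage):
    """Sum the percentage of text whose font-weight counts as bold."""
    total = 0.0
    for (prop, value), pct in style_percentage.items():
        if prop != "font-weight":
            continue
        if value == "bold":
            total += pct
        elif value.isdigit() and int(value) >= BOLD_THRESHOLD:
            total += pct
    return total


def is_highlighted(style_percentage, percentage_threshold):
    """True when enough of the text is bold to pass the threshold."""
    return bold_share(style_percentage) >= percentage_threshold


# Illustrative numbers only: a bold heading followed by a regular-weight
# sentence, with roughly 55% of the characters in bold.
example_styles = {
    ("font-weight", "700"): 55.0,
    ("font-weight", "400"): 45.0,
}

print(is_highlighted(example_styles, 80))  # fails with the current threshold
print(is_highlighted(example_styles, 50))  # passes with the proposed threshold
```

With these assumed numbers, the element is rejected at a threshold of 80 and accepted at 50, which is the behavior change the proposal is after. A longer trailing sentence would push the bold share below 50 and the check would fail again.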