FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

INF800 · 2023-10-24T03:30:48Z

Currently, TextElement can become HighlightedTextElement only if 80% of the text content is bold with some font weight. This is not an ideal scenario in cases such as part2item3 of 10-Q_MMM_0000066740-23-000058 which looks like:

where ~55% of text content is bold with some font weight. The style_string default dict for it looks like:

Because of this TextElement cannot become HighlightedTextElement which in turn cannot become TitleElement which in turn cannot become TopLevelSectionElement.

Solution:

Change PERCENTAGE_THRESHOLD from 80 to a value less than 54.9.

sec-parser/sec_parser/semantic_elements/highlighted_text_element.py

Line 60 in 58304f0

PERCENTAGE_THRESHOLD = 80

In order to find a sweet spot I checked if there were more fails like these where the threshold needs to be decreased, but could not find any in the dataset. So, I think the best value as of now would be 50 percent assuming neither the trailing description or the bold heading has more text content.

The text was updated successfully, but these errors were encountered:

INF800 · 2023-10-24T06:09:49Z

Hmm, looks like it affects the general logic. This change fails e2e-verify with:

"10-Q_GOOG_0001652044-23-000070" Line 938 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 938 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 942 [expected]: "html_hash": "17b63bd7"
"10-Q_GOOG_0001652044-23-000070" Line 942 [actual]:   "html_hash": "17b63bd7",
"10-Q_GOOG_0001652044-23-000070" Line 943 [actual]:   "level": 0
"10-Q_GOOG_0001652044-23-000070" Line 3928 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 3929 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 3932 [expected]: "html_hash": "9ce67b24"
"10-Q_GOOG_0001652044-23-000070" Line 3933 [actual]:   "html_hash": "9ce67b24",
"10-Q_GOOG_0001652044-23-000070" Line 3934 [actual]:   "level": 0
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                ┃                         ┃      Execution Time      ┃
┃ Report                         ┃        Accuracy         ┃  (Limit, %Limit, Size)   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10-Q_GOOG_0001652044-23-000070 │ 4 Missing, 6 Unexpected │ 974ms (2095ms, 46%, 2M)  │
│ 10-Q_AAPL_0000320193-23-000077 │            ✓            │ 393ms (963ms, 41%, 963k) │
└────────────────────────────────┴─────────────────────────┴──────────────────────────┘
Note: Execution Time Limit is based on a set rate of 1 microseconds per HTML character (Size).

[ERROR] Verification failed.

Elijas · 2023-10-24T09:28:17Z

Thanks for creating this issue, it's really on point with what we need right now 🙌 It's always nice to see a high-quality Github Issue, thank you for contributing! 🚀🚀

So, I think the best value as of now would be 50 percent assuming neither the trailing description or the bold heading has more text content.

But what if the disclaimer message were a little bit longer, then it would break the >50% check 🤔

I'm thinking, maybe we should have a "failover step", where when the TopLevelSectionTitle couldn't be found the normal way, then scan "TextElements" to be converted to TopLevelSectionTitle. Or maybe even better - pre-split the title into two semantic elements, as "No matters require disclosure" could be its own element?

I'll do some exploration/refactoring, and get back to you soon

Elijas · 2023-10-24T12:41:08Z

Hmm, looks like it affects the general logic. This change fails e2e-verify with:

"10-Q_GOOG_0001652044-23-000070" Line 938 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 938 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 942 [expected]: "html_hash": "17b63bd7"
"10-Q_GOOG_0001652044-23-000070" Line 942 [actual]:   "html_hash": "17b63bd7",
"10-Q_GOOG_0001652044-23-000070" Line 943 [actual]:   "level": 0
"10-Q_GOOG_0001652044-23-000070" Line 3928 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 3929 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 3932 [expected]: "html_hash": "9ce67b24"
"10-Q_GOOG_0001652044-23-000070" Line 3933 [actual]:   "html_hash": "9ce67b24",
"10-Q_GOOG_0001652044-23-000070" Line 3934 [actual]:   "level": 0
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                ┃                         ┃      Execution Time      ┃
┃ Report                         ┃        Accuracy         ┃  (Limit, %Limit, Size)   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10-Q_GOOG_0001652044-23-000070 │ 4 Missing, 6 Unexpected │ 974ms (2095ms, 46%, 2M)  │
│ 10-Q_AAPL_0000320193-23-000077 │            ✓            │ 393ms (963ms, 41%, 963k) │
└────────────────────────────────┴─────────────────────────┴──────────────────────────┘
Note: Execution Time Limit is based on a set rate of 1 microseconds per HTML character (Size).

[ERROR] Verification failed.

I will quickly mention that failing e2e tests doesn't necessarily mean that it's bad. Sometimes the tests are wrong and need to be updated, for example when we improve the parser output. It's a moving goalpost to make sure we don't regress.

In this case the tests say that two elements are now recognized as Titles while previously the were regular text paragraphs

INF800 · 2023-10-24T22:56:10Z

Sure, thanks!

INF800 · 2023-10-25T01:25:49Z

Found a case where un-highlighted text can be a title too. (See Item 1)

Ticker - RKT

Elijas · 2023-11-07T14:05:38Z

Moved to alphanome-ai/sec-ai#47

INF800 mentioned this issue Oct 24, 2023

Change threshold for text style #32

Closed

Elijas added the type:bug Some feature is not working as intended label Oct 24, 2023

Elijas closed this as not planned Won't fix, can't repro, duplicate, stale Nov 7, 2023

Elijas mentioned this issue Nov 7, 2023

Extend Top Level Section detection to work for more 10-Q reports alphanome-ai/sec-ai#47

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

INF800 commented Oct 24, 2023 •

edited

Loading

INF800 commented Oct 24, 2023 •

edited

Loading

Elijas commented Oct 24, 2023 •

edited

Loading

Elijas commented Oct 24, 2023 •

edited

Loading

INF800 commented Oct 24, 2023

INF800 commented Oct 25, 2023 •

edited

Loading

Elijas commented Nov 7, 2023

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

Comments

INF800 commented Oct 24, 2023 • edited Loading

Solution:

INF800 commented Oct 24, 2023 • edited Loading

Elijas commented Oct 24, 2023 • edited Loading

Elijas commented Oct 24, 2023 • edited Loading

INF800 commented Oct 24, 2023

INF800 commented Oct 25, 2023 • edited Loading

Elijas commented Nov 7, 2023

INF800 commented Oct 24, 2023 •

edited

Loading

INF800 commented Oct 24, 2023 •

edited

Loading

Elijas commented Oct 24, 2023 •

edited

Loading

Elijas commented Oct 24, 2023 •

edited

Loading

INF800 commented Oct 25, 2023 •

edited

Loading