Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3'] #30

Closed
INF800 opened this issue Oct 24, 2023 · 6 comments
Labels
type:bug Some feature is not working as intended

Comments

@INF800
Copy link
Contributor

INF800 commented Oct 24, 2023

Related to alphanome-ai/sec-ai#47


Currently, TextElement can become HighlightedTextElement only if 80% of the text content is bold with some font weight. This is not an ideal scenario in cases such as part2item3 of 10-Q_MMM_0000066740-23-000058 which looks like:

image

where ~55% of text content is bold with some font weight. The style_string default dict for it looks like:

image

Because of this TextElement cannot become HighlightedTextElement which in turn cannot become TitleElement which in turn cannot become TopLevelSectionElement.

Solution:

Change PERCENTAGE_THRESHOLD from 80 to a value less than 54.9.

In order to find a sweet spot I checked if there were more fails like these where the threshold needs to be decreased, but could not find any in the dataset. So, I think the best value as of now would be 50 percent assuming neither the trailing description or the bold heading has more text content.

@INF800
Copy link
Contributor Author

INF800 commented Oct 24, 2023

Hmm, looks like it affects the general logic. This change fails e2e-verify with:

"10-Q_GOOG_0001652044-23-000070" Line 938 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 938 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 942 [expected]: "html_hash": "17b63bd7"
"10-Q_GOOG_0001652044-23-000070" Line 942 [actual]:   "html_hash": "17b63bd7",
"10-Q_GOOG_0001652044-23-000070" Line 943 [actual]:   "level": 0
"10-Q_GOOG_0001652044-23-000070" Line 3928 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 3929 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 3932 [expected]: "html_hash": "9ce67b24"
"10-Q_GOOG_0001652044-23-000070" Line 3933 [actual]:   "html_hash": "9ce67b24",
"10-Q_GOOG_0001652044-23-000070" Line 3934 [actual]:   "level": 0
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                ┃                         ┃      Execution Time      ┃
┃ Report                         ┃        Accuracy         ┃  (Limit, %Limit, Size)   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10-Q_GOOG_0001652044-23-000070 │ 4 Missing, 6 Unexpected │ 974ms (2095ms, 46%, 2M)  │
│ 10-Q_AAPL_0000320193-23-000077 │            ✓            │ 393ms (963ms, 41%, 963k) │
└────────────────────────────────┴─────────────────────────┴──────────────────────────┘
Note: Execution Time Limit is based on a set rate of 1 microseconds per HTML character (Size).

[ERROR] Verification failed.

@Elijas
Copy link
Member

Elijas commented Oct 24, 2023

Thanks for creating this issue, it's really on point with what we need right now 🙌 It's always nice to see a high-quality Github Issue, thank you for contributing! 🚀🚀

So, I think the best value as of now would be 50 percent assuming neither the trailing description or the bold heading has more text content.

But what if the disclaimer message were a little bit longer, then it would break the >50% check 🤔

I'm thinking, maybe we should have a "failover step", where when the TopLevelSectionTitle couldn't be found the normal way, then scan "TextElements" to be converted to TopLevelSectionTitle. Or maybe even better - pre-split the title into two semantic elements, as "No matters require disclosure" could be its own element?

I'll do some exploration/refactoring, and get back to you soon

@Elijas Elijas added the type:bug Some feature is not working as intended label Oct 24, 2023
@Elijas
Copy link
Member

Elijas commented Oct 24, 2023

Hmm, looks like it affects the general logic. This change fails e2e-verify with:

"10-Q_GOOG_0001652044-23-000070" Line 938 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 938 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 942 [expected]: "html_hash": "17b63bd7"
"10-Q_GOOG_0001652044-23-000070" Line 942 [actual]:   "html_hash": "17b63bd7",
"10-Q_GOOG_0001652044-23-000070" Line 943 [actual]:   "level": 0
"10-Q_GOOG_0001652044-23-000070" Line 3928 [expected]: "cls_name": "TextElement",
"10-Q_GOOG_0001652044-23-000070" Line 3929 [actual]:   "cls_name": "TitleElement",
"10-Q_GOOG_0001652044-23-000070" Line 3932 [expected]: "html_hash": "9ce67b24"
"10-Q_GOOG_0001652044-23-000070" Line 3933 [actual]:   "html_hash": "9ce67b24",
"10-Q_GOOG_0001652044-23-000070" Line 3934 [actual]:   "level": 0
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                ┃                         ┃      Execution Time      ┃
┃ Report                         ┃        Accuracy         ┃  (Limit, %Limit, Size)   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10-Q_GOOG_0001652044-23-000070 │ 4 Missing, 6 Unexpected │ 974ms (2095ms, 46%, 2M)  │
│ 10-Q_AAPL_0000320193-23-000077 │            ✓            │ 393ms (963ms, 41%, 963k) │
└────────────────────────────────┴─────────────────────────┴──────────────────────────┘
Note: Execution Time Limit is based on a set rate of 1 microseconds per HTML character (Size).

[ERROR] Verification failed.

I will quickly mention that failing e2e tests doesn't necessarily mean that it's bad. Sometimes the tests are wrong and need to be updated, for example when we improve the parser output. It's a moving goalpost to make sure we don't regress.

In this case the tests say that two elements are now recognized as Titles while previously the were regular text paragraphs

@INF800
Copy link
Contributor Author

INF800 commented Oct 24, 2023

Sure, thanks!

@INF800
Copy link
Contributor Author

INF800 commented Oct 25, 2023

Found a case where un-highlighted text can be a title too. (See Item 1)
image
Ticker - RKT

@Elijas
Copy link
Member

Elijas commented Nov 7, 2023

Moved to alphanome-ai/sec-ai#47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Some feature is not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants