-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🏇 PRF: prefer (and encourage) lxml #382
Conversation
bollwyvl
commented
Apr 13, 2021
- for Slow page builds in 6.0 when generating navigation bar #381
Aside from various screwups on my end, it looks like bs4-on-lxml returns different dom... |
import lxml | ||
BS4_PARSER = "lxml" | ||
logger.info("Using lxml for HTML parsing") | ||
except ImportError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need this too?
except ImportError: | |
except (ImportError, ModuleNotFoundError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nawp, think we're good:
ModuleNotFoundError.__mro__
(ModuleNotFoundError, ImportError, Exception, BaseException, object)
So: re: the dom changes:
So actually: I think the right play for this PR is to just always use lxml... and indeed, stop using bs4 altogether, so that xpath, etc. becomes usable. It's 2021, it's not that big a deal to install anymore, even on windows, and if two major downstreams are feeling performance issues, it doesn't seem unreasonable to have a performance-first approach. Of course we should look more places, but it's pretty hard without benchmarks, which are... tricky to add, after the fact. |
You mean not just using the lxml parser through bs4, but actually not using bs4? Note that right now we quite heavily depend on manipulating the HTML using bs4 functionality (although it has decreased a bit after the changes in #346) |
yeah, lxml can edit stuff, too. one thing i have noted: html actually is quite whitespace aware, so the lxml pretty print is very conservative, and makes testing hard... if proceeding further, would probably keep bs4 for validation purposes. |
Anyhow, I'll hold on this, as it would almost certainly need more investigation, and no doubt trigger a |