Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🏇 PRF: prefer (and encourage) lxml #382

Closed
wants to merge 1 commit into from

Conversation

bollwyvl
Copy link
Collaborator

@bollwyvl bollwyvl changed the title PRF: 🐎 prefer (and encourage) lxml 🏇 PRF: prefer (and encourage) lxml Apr 13, 2021
@bollwyvl
Copy link
Collaborator Author

Aside from various screwups on my end, it looks like bs4-on-lxml returns different dom...

import lxml
BS4_PARSER = "lxml"
logger.info("Using lxml for HTML parsing")
except ImportError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need this too?

Suggested change
except ImportError:
except (ImportError, ModuleNotFoundError):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nawp, think we're good:

ModuleNotFoundError.__mro__
(ModuleNotFoundError, ImportError, Exception, BaseException, object)

@bollwyvl
Copy link
Collaborator Author

So: re: the dom changes:

  • lxml will parse whatever, but always gives back a full HTML document
  • if the lxml is not used, an invalid <input type="checkbox"><tags><that><aren't><allowed></input> gets created, as can be seen in the fixtures

So actually: I think the right play for this PR is to just always use lxml... and indeed, stop using bs4 altogether, so that xpath, etc. becomes usable. It's 2021, it's not that big a deal to install anymore, even on windows, and if two major downstreams are feeling performance issues, it doesn't seem unreasonable to have a performance-first approach. Of course we should look more places, but it's pretty hard without benchmarks, which are... tricky to add, after the fact.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 13, 2021

and indeed, stop using bs4 altogether

You mean not just using the lxml parser through bs4, but actually not using bs4? Note that right now we quite heavily depend on manipulating the HTML using bs4 functionality (although it has decreased a bit after the changes in #346)

@bollwyvl
Copy link
Collaborator Author

heavily depend on manipulating the HTML using bs4 functionality

yeah, lxml can edit stuff, too. one thing i have noted: html actually is quite whitespace aware, so the lxml pretty print is very conservative, and makes testing hard... if proceeding further, would probably keep bs4 for validation purposes.

@bollwyvl
Copy link
Collaborator Author

Anyhow, I'll hold on this, as it would almost certainly need more investigation, and no doubt trigger a 0.7.0...

@bollwyvl bollwyvl closed this Aug 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants