🏇 PRF: prefer (and encourage) lxml #382

bollwyvl · 2021-04-13T14:27:47Z

for Slow page builds in 6.0 when generating navigation bar #381

bollwyvl · 2021-04-13T14:31:26Z

Aside from various screwups on my end, it looks like bs4-on-lxml returns different dom...

drammock · 2021-04-13T14:33:29Z

pydata_sphinx_theme/__init__.py

+    import lxml
+    BS4_PARSER = "lxml"
+    logger.info("Using lxml for HTML parsing")
+except ImportError:


might need this too?

Suggested change

-except ImportError:
+except (ImportError, ModuleNotFoundError):

nawp, think we're good:

ModuleNotFoundError.__mro__ (ModuleNotFoundError, ImportError, Exception, BaseException, object)
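
A minimal sketch of the optional-import pattern from this review thread, showing why a single `except ImportError` is enough. `pick_bs4_parser` and `import_failure_type` are hypothetical helpers, and the `"html.parser"` fallback is an assumption, since the diff excerpt does not show the body of the `except` branch.

```python
# ModuleNotFoundError is a subclass of ImportError, so catching
# ImportError also catches ModuleNotFoundError.
assert issubclass(ModuleNotFoundError, ImportError)


def pick_bs4_parser() -> str:
    """Return the parser name to pass to BeautifulSoup (hypothetical helper)."""
    try:
        import lxml  # noqa: F401  (only checking that it is importable)
    except ImportError:
        # Assumed fallback: the parser bundled with the standard library.
        return "html.parser"
    return "lxml"


def import_failure_type(module_name: str):
    """Return the name of the exception raised when importing, or None on success."""
    try:
        __import__(module_name)
    except ImportError as err:  # also catches ModuleNotFoundError
        return type(err).__name__
    return None
```

Importing a module that does not exist raises `ModuleNotFoundError`, and the `except ImportError` clause in `import_failure_type` still catches it.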

bollwyvl · 2021-04-13T17:30:00Z

So: re: the dom changes:

- lxml will parse whatever, but always gives back a full HTML document
- if lxml is not used, an invalid <input type="checkbox"><tags><that><aren't><allowed></input> gets created, as can be seen in the fixtures
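
The "full HTML document" behaviour can be seen with a small fragment. This is a sketch only: `render_with` is a hypothetical helper, the `<p>hi</p>` markup is made up for illustration, and the function returns `None` when bs4 or the requested parser is not installed.

```python
def render_with(parser: str, markup: str = "<p>hi</p>"):
    """Parse `markup` with bs4 using `parser` and return the serialized result.

    Returns None if bs4, or the requested parser, is unavailable.
    """
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return None
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser (e.g. lxml) is not installed.
        return None


if __name__ == "__main__":
    for name in ("html.parser", "lxml"):
        print(name, "->", render_with(name))
```

With both packages installed, `html.parser` leaves the fragment as a fragment, while `lxml` wraps it in `<html><body>...</body></html>`.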

So actually: I think the right play for this PR is to just always use lxml... and indeed, stop using bs4 altogether, so that xpath, etc. become usable. It's 2021, and lxml is not that big a deal to install anymore, even on Windows; if two major downstreams are feeling performance issues, a performance-first approach doesn't seem unreasonable. Of course we should look in more places, but that's pretty hard without benchmarks, which are... tricky to add after the fact.

jorisvandenbossche · 2021-04-13T17:46:40Z

and indeed, stop using bs4 altogether

You mean not just using the lxml parser through bs4, but actually not using bs4? Note that right now we quite heavily depend on manipulating the HTML using bs4 functionality (although it has decreased a bit after the changes in #346)

bollwyvl · 2021-04-13T19:02:18Z

heavily depend on manipulating the HTML using bs4 functionality

yeah, lxml can edit stuff, too. one thing i have noted: html actually is quite whitespace aware, so the lxml pretty print is very conservative, and makes testing hard... if proceeding further, would probably keep bs4 for validation purposes.
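
As a rough illustration of editing HTML with lxml directly (selecting with XPath, then mutating nodes), here is a sketch. `tag_checkboxes`, the sample markup, and the `demo` class name are all hypothetical and not taken from the theme's code; the function returns `None` when lxml is not installed.

```python
def tag_checkboxes(markup: str, css_class: str = "demo"):
    """Add a class to every checkbox <input> in `markup` using lxml and XPath."""
    try:
        import lxml.html
    except ImportError:
        return None
    root = lxml.html.fromstring(markup)
    # XPath selection, which bs4 does not offer.
    for node in root.xpath('//input[@type="checkbox"]'):
        node.set("class", css_class)
    # Serialize back to a string without pretty-printing.
    return lxml.html.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tag_checkboxes('<div><input type="checkbox"><label>x</label></div>'))
```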

bollwyvl · 2021-04-13T19:07:53Z

Anyhow, I'll hold on this, as it would almost certainly need more investigation, and no doubt trigger a 0.7.0...

try lxml

2eea69a

bollwyvl changed the title from "PRF: 🐎 prefer (and encourage) lxml" to "🏇 PRF: prefer (and encourage) lxml" on Apr 13, 2021

drammock reviewed Apr 13, 2021

bollwyvl closed this Aug 28, 2021
