Faster importing of "bookmarks.html" using regex? #213

Closed
gingerbeardman opened this issue Feb 23, 2022 · 4 comments · Fixed by #261
Labels
enhancement New feature or request

Comments

@gingerbeardman
Contributor

gingerbeardman commented Feb 23, 2022

I decided to post this as a new issue rather than as a reply to #50 (closed).

Given that bookmarks HTML files are machine generated, my first thought to load them would be to use a well-crafted regex to capture the information into named groups.

A quick Google search shows that regexes like this are already in use. Example: https://stackoverflow.com/a/51237774

I don't know Python too well, but in PHP I came up with this regex, which extracts data from bookmarks files exported by:

  • Chromium
  • Firefox
  • Opera
  • Safari
  • Pinboard
  • Linkding

Supports the following attributes, optional unless stated:

  • HREF (required)
  • ADD_DATE
  • LAST_MODIFIED
  • ICON_URI
  • ICON
  • PRIVATE
  • TOREAD
  • TAGS
  • Title (required)

Code

https://gist.github.com/gingerbeardman/0008ba0eaf03050e1c1492ea57314d35

Execution time

I'd be interested to know the difference in performance with your importer.

On my old laptop:

4170 bookmarks
0.025 seconds

Explanation

Pattern

$pattern = '|<DT><A HREF="(?P<href>.*?)"\s*(ADD_DATE="(?P<add_date>.*?)")?\s*(LAST_MODIFIED="(?P<last_modified>.*?)")?\s*(ICON_URI="(?P<icon_uri>.*?)")?\s*(ICON="(?P<icon>.*?)")?\s*(PRIVATE="(?P<private>.*?)")?\s*(TOREAD="(?P<toread>.*?)")?\s*(TAGS="(?P<tags>.*?)")?>(.*?)</A>|';

HREF and Title are required; everything else is an optional named group using the regex format \s*(ATTR="(?P<attr>.*?)")? where ATTR is the HTML attribute and ?P<attr> sets the group name. Let me know if anything is unclear.

Optional elements are loaded into arrays of the same size as all the others, with empty values where an attribute is missing, so processing the arrays is easy.
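For comparison, here is a minimal Python sketch of the same idea. It is a port of the PHP pattern above, not code from the gist or from the project. Two things are assumptions of this sketch: the wrapper groups are made non-capturing, and the final group is named `title`. `parse_bookmarks` is a hypothetical helper name.

```python
import re

# Port of the PHP pattern above. The "title" group name and the
# non-capturing (?:...) wrappers are additions made for this sketch.
BOOKMARK_RE = re.compile(
    r'<DT><A HREF="(?P<href>.*?)"'
    r'\s*(?:ADD_DATE="(?P<add_date>.*?)")?'
    r'\s*(?:LAST_MODIFIED="(?P<last_modified>.*?)")?'
    r'\s*(?:ICON_URI="(?P<icon_uri>.*?)")?'
    r'\s*(?:ICON="(?P<icon>.*?)")?'
    r'\s*(?:PRIVATE="(?P<private>.*?)")?'
    r'\s*(?:TOREAD="(?P<toread>.*?)")?'
    r'\s*(?:TAGS="(?P<tags>.*?)")?'
    r'>(?P<title>.*?)</A>'
)


def parse_bookmarks(html: str) -> list[dict[str, str]]:
    """Return one dict per bookmark; absent optional attributes become ''."""
    return [m.groupdict(default="") for m in BOOKMARK_RE.finditer(html)]


sample = (
    '<DT><A HREF="https://example.com" ADD_DATE="1645574400" TAGS="a,b">Example</A>\n'
    '<DT><A HREF="https://example.org">No attributes</A>\n'
)
bookmarks = parse_bookmarks(sample)
```

Like the PHP version, this expects the attributes in the fixed order shown, and every bookmark ends up with the same set of keys, which keeps later processing uniform.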

Etc

Of course, this is not a full import, as URLs will need to be checked to be valid, text sanitised, etc. But this gets the data from the HTML into structures that can be dealt with in code, and it takes mere microseconds per bookmark.
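The follow-up checks mentioned here could look roughly like the sketch below. The function names and the policy of accepting only absolute http(s) URLs are assumptions made for illustration, not rules taken from the project.

```python
import html
from urllib.parse import urlsplit


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host (an assumed policy)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_title(raw: str) -> str:
    """Decode HTML entities and collapse runs of whitespace in a title."""
    return " ".join(html.unescape(raw).split())
```

A regex capture returns the raw text between the tags, so entities such as `&amp;` are still encoded at that point; `clean_title` decodes them after extraction.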

Thoughts appreciated!

gingerbeardman changed the title from "Faster bookmarks.html importing using regex?" to 'Faster importing of "bookmarks.html" using regex?' on Feb 23, 2022
@sissbruecker
Owner

The parsing could be improved, but the performance gains would be negligible since most of the import time is spent on database operations. It would be more valuable to optimize these first.
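To illustrate what optimizing database operations can mean in general, here is a small sketch that contrasts per-row inserts with a single batched insert. It uses Python's built-in `sqlite3` module and a made-up `bookmarks` table as stand-ins; the project's actual storage layer and schema are not shown in this thread.

```python
import sqlite3


def insert_one_by_one(conn, rows):
    """One statement and one commit per bookmark: simple, but many round trips."""
    for url, title in rows:
        conn.execute("INSERT INTO bookmarks (url, title) VALUES (?, ?)", (url, title))
        conn.commit()


def insert_in_bulk(conn, rows):
    """A single executemany call inside one transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO bookmarks (url, title) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT, title TEXT)")  # hypothetical schema
rows = [(f"https://example.com/{i}", f"Bookmark {i}") for i in range(1000)]
insert_in_bulk(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM bookmarks").fetchone()[0]
```

How much batching helps depends on the database and its commit cost, so it is worth measuring on the target hardware.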

The current parser is kind of tricky to maintain, and the regex would probably be simpler. There is also an alternative parser proposed here: #199

sissbruecker added the enhancement (New feature or request) label on May 14, 2022
@gingerbeardman
Contributor Author

Interesting to read that most of the time is spent on DB.

An idea would be to do a quick import (using regex or whatever) and then do a background operation to gather metadata. That way I can import my bookmarks and start using them immediately, and then if I give it a bit of time it will have the extra data.
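A rough sketch of that two-phase idea, assuming a plain dict as the bookmark store and a thread pool for the background work. `fetch_metadata` is a hypothetical placeholder that returns a stub instead of making a network request; none of these names come from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_metadata(url: str) -> dict:
    """Hypothetical placeholder: a real version would download the page.
    This stub returns canned data without any network access."""
    return {"description": f"metadata for {url}"}


def quick_import(urls, store):
    """Phase 1: save bare bookmarks so they are usable straight away."""
    for url in urls:
        store[url] = {"url": url, "description": ""}


def enrich_in_background(urls, store, executor):
    """Phase 2: schedule metadata lookups without blocking the caller."""
    def update(url):
        store[url].update(fetch_metadata(url))
    return [executor.submit(update, url) for url in urls]


store = {}
urls = ["https://example.com", "https://example.org"]
quick_import(urls, store)  # bookmarks can be used from this point on
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = enrich_in_background(urls, store, executor)
# Leaving the with-block waits for outstanding tasks; a long-running
# application would keep the executor alive instead.
```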

Recently I lost the contents of Docker on my NAS so I need to set this up again.

@sissbruecker
Owner

Did some work on this: master...perf/improve_import_performance

This changes all database operations to run in bulk, and uses a new parser based on HTMLParser, adapted from #199. On my notebook, importing ~1000 bookmarks now takes 0.5s (down from 20s); on my Raspberry Pi 3 it's 10 seconds.

Still needs some polish, and tests.
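For readers unfamiliar with the approach, a parser built on Python's standard `html.parser.HTMLParser` might look roughly like this. It is an illustrative sketch only, not the code from the branch or from #199; the class name, function name, and the choice of attributes to keep are assumptions.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect <A> elements from a Netscape-style bookmarks file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # bookmark being read, or None outside <A>

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            attributes = dict(attrs)
            self._current = {
                "href": attributes.get("href") or "",
                "add_date": attributes.get("add_date") or "",
                "tags": attributes.get("tags") or "",
                "title": "",
            }

    def handle_data(self, data):
        # Text inside <A>...</A> is the bookmark title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.bookmarks.append(self._current)
            self._current = None


def parse_bookmarks_html(html_text: str) -> list:
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks


sample = (
    '<DL><p>\n'
    '<DT><A HREF="https://example.com" ADD_DATE="1645574400" TAGS="a,b">Tom &amp; Jerry</A>\n'
    '</DL>'
)
bookmarks = parse_bookmarks_html(sample)
```

Unlike a fixed-order regex, this does not depend on attribute order, and entity decoding is handled by the parser.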

@gingerbeardman
Contributor Author

Awesome! That's a great improvement.
