Faster importing of "bookmarks.html" using regex? #213

gingerbeardman · 2022-02-23T21:43:38Z

Decided to post this as a new issue rather than a reply to #50 (closed)

Given that bookmarks HTML files are machine generated, my first thought to load them would be to use a well-crafted regex to capture the information into named groups.

A quick google shows there are already some such regexes in use. Example: https://stackoverflow.com/a/51237774

I don't know Python too well, but in PHP I came up with this regex that will extract data from bookmarks files exported by:

Chromium
Firefox
Opera
Safari
Pinboard
Linkding

Supports the following attributes, optional unless stated:

HREF (required)
ADD_DATE
LAST_MODIFIED
ICON_URI
ICON
PRIVATE
TOREAD
TAGS
Title (required)

Code

https://gist.github.com/gingerbeardman/0008ba0eaf03050e1c1492ea57314d35

Execution time

I'd be interested to know the difference in performance with your importer.

On my old laptop:

4170 bookmarks
0.025 seconds

Explanation

Pattern

$pattern = '|<DT><A HREF="(?P<href>.*?)"\s*(ADD_DATE="(?P<add_date>.*?)")?\s*(LAST_MODIFIED="(?P<last_modified>.*?)")?\s*(ICON_URI="(?P<icon_uri>.*?)")?\s*(ICON="(?P<icon>.*?)")?\s*(PRIVATE="(?P<private>.*?)")?\s*(TOREAD="(?P<toread>.*?)")?\s*(TAGS="(?P<tags>.*?)")?>(.*?)</A>|';

HREF and Title are required, everything else is an optional named group using the regex format \s*(ATTR="(?P<attr>.*?)")? where ATTR is the HTML attribute and ?P<attr> signifies the group name. Let me know if anything is unclear.

Optional elements are loaded as arrays of the same size as all the others, with empty entries where an attribute is absent, so processing of the arrays is easy.
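For readers more at home in Python, here is a rough port of the pattern above using the standard `re` module. It is only a sketch: the function name `parse_bookmarks_regex` and the `title` group name are my own additions (the PHP pattern leaves the title group unnamed), and the optional wrappers are written as non-capturing groups. Like the original, it assumes the attributes appear in the order listed.

```python
import re

# Port of the PHP pattern: HREF and the title are required,
# every other attribute is an optional named group.
BOOKMARK_RE = re.compile(
    r'<DT><A HREF="(?P<href>.*?)"\s*'
    r'(?:ADD_DATE="(?P<add_date>.*?)")?\s*'
    r'(?:LAST_MODIFIED="(?P<last_modified>.*?)")?\s*'
    r'(?:ICON_URI="(?P<icon_uri>.*?)")?\s*'
    r'(?:ICON="(?P<icon>.*?)")?\s*'
    r'(?:PRIVATE="(?P<private>.*?)")?\s*'
    r'(?:TOREAD="(?P<toread>.*?)")?\s*'
    r'(?:TAGS="(?P<tags>.*?)")?>'
    r'(?P<title>.*?)</A>'  # "title" is a hypothetical group name
)


def parse_bookmarks_regex(html):
    """Return one dict per bookmark; absent optional attributes are None."""
    return [match.groupdict() for match in BOOKMARK_RE.finditer(html)]


sample = (
    '<DT><A HREF="https://example.com" ADD_DATE="1645652618" TAGS="a,b">Example</A>\n'
    '<DT><A HREF="https://example.org">Other</A>\n'
)
for bookmark in parse_bookmarks_regex(sample):
    print(bookmark["href"], bookmark["title"], bookmark["tags"])
```

Where PHP hands back parallel arrays, `groupdict()` gives one dict per bookmark with `None` for any attribute that was not present.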

Etc

Of course, this is not a full import, as URLs will need to be checked to be valid, text sanitised, etc. But this gets the data from the HTML into structures that can be dealt with in code and it takes mere milliseconds.

Thoughts appreciated!

sissbruecker · 2022-05-14T00:26:56Z

The parsing could be improved, but the performance gains would be negligible since most of the import is spent on database operations. It would be more valuable to optimize these first.

The current parser is kind of tricky to maintain, and the regex would probably be simpler. There is also an alternative parser proposed here: #199
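To illustrate the point about database time, here is a minimal sketch of batching writes into a single transaction. It assumes the standard `sqlite3` module and a hypothetical `bookmarks` table with `url` and `title` columns; it is not linkding's actual schema or ORM code, and `import_bulk` is an invented name.

```python
import sqlite3


def import_bulk(conn, bookmarks):
    """Insert all bookmarks in one transaction with a single executemany call."""
    rows = [(b["href"], b["title"]) for b in bookmarks]
    with conn:  # commits once at the end, or rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO bookmarks (url, title) VALUES (?, ?)",
            rows,
        )


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT PRIMARY KEY, title TEXT)")

import_bulk(conn, [
    {"href": "https://example.com", "title": "Example"},
    {"href": "https://example.org", "title": "Other"},
])
print(conn.execute("SELECT COUNT(*) FROM bookmarks").fetchone()[0])
```

Issuing one statement and one commit per bookmark is usually the slow path; grouping the rows keeps the per-row overhead down regardless of which parser produced them.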

gingerbeardman · 2022-05-15T15:01:47Z

Interesting to read that most of the time is spent on DB.

An idea would be to do a quick import (using regex or whatever) and then do a background operation to gather metadata. That way I can import my bookmarks and start using them immediately, and then if I give it a bit of time it will have the extra data.

Recently I lost the contents of Docker on my NAS so I need to set this up again.

sissbruecker · 2022-05-16T19:20:50Z

Did some work on this: master...perf/improve_import_performance

This changes all database operations to run in bulk, and uses a new parser based on HTMLParser, adapted from #199. On my notebook, importing ~1000 bookmarks now takes 0.5s (down from 20s); on my Raspberry Pi 3 it's 10 seconds.

Still needs some polish, and tests.
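For context, a parser built on the standard library's `html.parser.HTMLParser` could look roughly like the sketch below. This is not the code from the branch or from #199; the class name `BookmarkParser` and the output dict layout are assumptions made for illustration.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collects one dict per <A> element in a bookmarks.html file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # the <A> element currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._current = dict(attrs)
            self._current["title"] = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.bookmarks.append(self._current)
            self._current = None


parser = BookmarkParser()
parser.feed(
    '<DT><A HREF="https://example.com" ADD_DATE="1645652618" TAGS="a,b">'
    'Example &amp; more</A>\n'
)
parser.close()
print(parser.bookmarks)
```

Compared with a regex, this does not depend on attribute order, and character references such as `&amp;` are decoded by the parser.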

gingerbeardman · 2022-05-16T22:53:26Z

Awesome! That's a great improvement.

gingerbeardman changed the title ~~Faster bookmarks.html importing using regex?~~ Faster importing of "bookmarks.html" using regex? Feb 23, 2022

sissbruecker added the "enhancement" (New feature or request) label May 14, 2022

sissbruecker mentioned this issue May 21, 2022

Improve import performance #261

Merged

sissbruecker closed this as completed in #261 May 21, 2022
