Faster importing of "bookmarks.html" using regex? #213
Comments
The parsing could be improved, but the performance gains would be negligible since most of the import is spent on database operations. It would be more valuable to optimize these first. The current parser is kind of tricky to maintain, and the regex would probably be simpler. There is also an alternative parser proposed here: #199
Interesting to read that most of the time is spent on the DB. An idea would be to do a quick import (using regex or whatever) and then run a background operation to gather metadata. That way I can import my bookmarks and start using them immediately, and if I give it a bit of time it will have the extra data. Recently I lost the contents of Docker on my NAS, so I need to set this up again.
Did some work on this: master...perf/improve_import_performance This changes all database operations to run in bulk, and uses a new parser based on Still needs some polish, and tests.
Awesome! That's a great improvement.
Decided to post this as a new issue rather than a reply to #50 (closed)
Given that bookmarks HTML files are machine generated, my first thought to load them would be to use a well-crafted regex to capture the information into named groups.
A quick Google search shows that such regexes are already in use. Example: https://stackoverflow.com/a/51237774
I don't know Python too well, but in PHP I came up with this regex that will extract data from bookmarks file from:
Supports the following attributes, optional unless stated:
Code
https://gist.github.com/gingerbeardman/0008ba0eaf03050e1c1492ea57314d35
Execution time
I'd be interested to know the difference in performance with your importer.
On my old laptop:
Explanation
Pattern
HREF and Title are required; everything else is an optional named group using the regex format
\s*(ATTR="(?P<attr>.*?)")?
where ATTR is the HTML attribute and ?P<attr> signifies the group name. Let me know if anything is unclear.
Optional elements that are missing are still loaded as arrays of the same size as all the others (with empty entries), so processing the arrays is easy.
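For comparison, here is a minimal Python sketch of the same named-group idea. It is not the regex from the gist: the attribute list (ADD_DATE, TAGS), the name `BOOKMARK_RE`, and the helper `parse_bookmarks` are assumptions made for illustration. It also uses `[^"]*` instead of `.*?` so that a value can never run past its closing quote, and it only captures the optional attributes when they appear in the order listed.

```python
import re

# Hypothetical pattern following the format described above.
# HREF and the title are required; ADD_DATE and TAGS are optional named groups.
# The attribute list and its order are assumptions for this sketch.
BOOKMARK_RE = re.compile(
    r'<A\s+HREF="(?P<href>[^"]*)"'
    r'(?:\s+ADD_DATE="(?P<add_date>[^"]*)")?'
    r'(?:\s+TAGS="(?P<tags>[^"]*)")?'
    r'[^>]*>(?P<title>.*?)</A>',
    re.IGNORECASE | re.DOTALL,
)


def parse_bookmarks(html: str) -> list[dict[str, str]]:
    """Return one dict per bookmark; missing optional attributes become ''."""
    return [
        {name: value or "" for name, value in match.groupdict().items()}
        for match in BOOKMARK_RE.finditer(html)
    ]


sample = (
    '<DT><A HREF="https://example.com" ADD_DATE="1600000000" TAGS="a,b">Example</A>\n'
    '<DT><A HREF="https://example.org">No extras</A>\n'
)

for row in parse_bookmarks(sample):
    print(row)
```

Every returned dict has the same keys, which mirrors the "arrays of the same size" behaviour described above: an optional attribute that is absent shows up as an empty string rather than a missing entry.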
Etc
Of course, this is not a full import, as URLs will need to be checked to be valid, text sanitised, etc. But this gets the data from the HTML into structures that can be dealt with in code and it takes mere microseconds.
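As a rough illustration of that follow-up step, the sketch below checks a URL and tidies a title using only the Python standard library. The function names and the allowed scheme list are assumptions for this example, not part of any existing importer, and a real import would likely need stricter rules.

```python
import html
from urllib.parse import urlsplit

# Assumption for this sketch: only web links are accepted.
ALLOWED_SCHEMES = {"http", "https"}


def is_valid_url(url: str) -> bool:
    """Accept only URLs with an allowed scheme and a non-empty host part."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def clean_text(raw: str) -> str:
    """Decode HTML entities and collapse runs of whitespace."""
    return " ".join(html.unescape(raw).split())


print(is_valid_url("https://example.com/page"))
print(is_valid_url("javascript:alert(1)"))
print(clean_text("  Tom &amp; Jerry\n "))
```

Note that `clean_text` decodes entities for storage, so the text would still need to be escaped again wherever it is rendered as HTML.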
Thoughts appreciated!