Allow the option to archive with a headless browser #14
Comments
That is a fantastic idea. Given the original requirement, we implemented similar features in screenshot, but it is still not what you expected. Perhaps we can take things further and develop a piecemeal approach here. |
The biggest challenge for me is developing or choosing a script and its interpreter. I have no experience with this, but rod has a good API. 😄 I will try to implement this. I really prefer this mode; dealing with all elements one by one is too hard. |
If this happens, I'd like the feature to be optional if it requires a lot of complex external dependencies. I'm trying to keep shiori as simple as possible, and we just got rid of CGO by switching the SQLite driver. Since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :) |
I'm also a big fan of staying CGo-free and having fewer dependencies. chromedp and rod are both based on the Chrome DevTools Protocol, without CGo or tons of dependencies. 😃 |
@fmartingr Please don't worry about complex external dependencies. Perhaps we can look forward to the given works. Anyway, PRs are welcome. |
Hey, I created a simple demo. https://github.com/hellodword/web-archiving-with-headless-chromium-demo env rod=show,bin=/path/to/chrome go run . It's very simple, but provides custom And use singlefile for saving. |
This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk would no longer be required. If we stick to this plan, a new project might be a better alternative. @fmartingr What do you think? |
Right, and the demo is buggy. 😂 But it's optional, just like in archivebox: archivebox has multiple saving modes, and singlefile is only one of them. What I want to show is the ability to run custom scripts, plus a highly recommended CDP library for Go, which I think is much better than chromedp. |
Appreciate the time and effort. Personally, I prefer the option of trying to inject the script in headless over the one implemented in the screenshot project. It appears that making it an option would be reasonable, so if SingleFile is added as a browser extension, I would prefer to put the gitmodule in the An example of archiving results using |
I still haven't started migrating to obelisk yet... it will be an interesting amount of work, and I do not have much time to spare these weeks (most of it goes into replying to issues and PRs, yay FOSS! 😂). My comment was more about the current state of shiori and some comments by our packages regarding external dependencies or ecosystems. For me the ideal solution is to import That said, I don't want my comments/vision to halt obelisk's progress! I'm just expressing my fears from a user perspective, not imposing anything. I haven't used any library like this in a while (and not in the Go world, anyway), so I just wanted to make sure I don't create future problems for shiori. You folks are the experts here :) |
Right, it was a demo, so I used singlefile directly as an embedded dependency; it could or should act as a I think archiving tools nowadays do not necessarily need Chromium, but they do need the ability to run extension scripts. One reason is that there is too much anti-bot stuff (captchas, WAFs, and so on) on the internet. |
I'm interested in this somehow, so let's do it. Related to wabarc/wayback#92 |
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days |
It seems to me that the ideal solution would be the ability to prepare the page for saving on the client rather than on the server, and then send it to the server. A lot of sites now use dynamic image loading and captcha checks; they load comments only once you scroll down to them (and comments are sometimes more interesting than the article itself), or they don't load all comments (discussion threads stay hidden until you force them open). There is a lot of dynamic behavior. Therefore, it is better to save the page after checking with your own eyes that everything needed is loaded and displayed. There is no universal solution here, so it is preferable to inspect the page yourself. I just looked into my Pocket archive and it made me very sad: many domains are already parked and the sites are gone. And the pages themselves (on a premium plan) are far from completely saved; sometimes they don't even have text. So now I'm looking for a solution to this problem. I have started saving pages through SingleFile, but if it were tied into shiori, that would be just the perfect bookmark manager. At the same time, I would like shiori not to keep the extracted text in its database (except perhaps for quick search), but always to retrieve it again from the saved page. Text extraction algorithms will keep improving, and content stored in the database may have been extracted incorrectly and be out of date under a new version of the application. |
@Katarn Thank you for your offer, it's a fantastic idea. As intended,
Makes |
Just like archivebox, I think archivebox is very nice, but there're two issues:
I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?