Store index in IPFS for full IPFS-based web archive system #61
Comments
@ikreymer - I brain-dumped something related to this in #60. The current idea is to transmit the CDXJ using your mechanism of choice while both nodes are up, which is sub-optimal. I have not investigated IPLD/IPNS recently beyond the spec's site at http://ipld.io/.
Should the CDXJ indexes somehow be self-aware of their IPFS/IPLD hash? Such data could be stored in the initial CDXJ metadata fields and potentially used in the future to update an index. @ibnesayeed
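A minimal sketch of what recording a hash in the CDXJ metadata could look like. It assumes a CDXJ layout in which metadata lines start with `!` followed by a key and a JSON value, and record lines are `SURT datetime JSON`. The `index_hash` field name and both helper functions are hypothetical, not part of ipwb. Since a file cannot contain its own content hash, the recorded value would presumably be the hash of a previously published version of the index.

```python
import json

def parse_cdxj(text):
    """Split CDXJ text into a metadata dict and a list of (surt, datetime, fields)."""
    meta, records = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("!"):
            # Metadata line: "!key <json value>"
            key, _, value = line[1:].partition(" ")
            meta[key] = json.loads(value)
        else:
            # Record line: "<surt> <14-digit datetime> <json object>"
            surt, datetime14, payload = line.split(" ", 2)
            records.append((surt, datetime14, json.loads(payload)))
    return meta, records

def with_index_hash(text, index_hash):
    """Return CDXJ text whose !meta line carries a hypothetical 'index_hash' field."""
    out, seen = [], False
    for line in text.splitlines():
        if line.startswith("!meta "):
            fields = json.loads(line[len("!meta "):])
            fields["index_hash"] = index_hash
            line = "!meta " + json.dumps(fields, sort_keys=True)
            seen = True
        out.append(line)
    if not seen:
        # No metadata line yet: add one at the top of the index.
        out.insert(0, "!meta " + json.dumps({"index_hash": index_hash}))
    return "\n".join(out) + "\n"

# Placeholder data; the locator and hash values are not real IPFS hashes.
SAMPLE = (
    '!meta {"generator": "example"}\n'
    'com,example)/ 20160305192247 {"locator": "urn:ipfs/placeholderA/placeholderB", '
    '"mime_type": "text/html", "status_code": "200"}\n'
)
```

A receiving node could read `index_hash` from the parsed metadata to decide whether it already holds the version that an update builds on.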
I personally don't like the idea of storing the CDXJ as such in IPFS. It can be done for the sake of storage, but not with discovery as the primary goal. However, I would very much like to make the system free from the CDXJ and fully working on IPFS alone. The main function of the CDXJ is to allow lookup/discovery, which can be done differently, such as how Fluidinfo did it. This would require IPFS to support a key-value store or a linked-data-style node graph exploration capability. One possible workflow would be to define a well-known starting point for everything. In case of Fluidinfo it is An alternate approach would be to offload the Each This will even allow import and export of WARC files. Exporting to a WARC file can be selective if the user wants to apply some filters. It is important to note that if such an archiving system is created, it is not necessary to use WARC-based archiving; independent tools like browser extensions can push observation records (Mementos) into the system with added attributes such as session information and make them globally accessible. This way the archiving system will be fully decentralized and collaborative, in both capture and replay. /cc @jbenet for thoughts.
This discussion is great -- I need to page in a lot of design considerations here to better understand the possibilities and make suggestions. In order to get answers faster, and perhaps voice any considerations / feature requests, it may be a good idea to schedule a real-time discussion in one of the Monday sprints. cc @flyingzumwalt -- you may be a better person than me to help in this discussion. cc @nicola @diasdavid @whyrusleeping we should have nice importers for WARCs to IPLD to ensure that deduplication is maximized in use cases like these. This should not take a lot of work, even now. It just requires kicking off the "importers project".
@jbenet, how mature is IPLD and where can we read more about it? I had a feeling that it was still in the design phase, perhaps because I read about it a long time ago. Also, how far has IPNS reached in supporting Memento? Refs:
@ibnesayeed I agree that the index should be a native IPFS structure, rather than CDXJ. Thanks for adding the diagrams; I think all those approaches are good. I think these link relationships make sense, especially the second option focusing on the TimeGate, which links to several Mementos. Another key part is a list of TimeGates, which would allow building a collection out of multiple URLs. There would also need to be a way to expand this collection and add links. Perhaps IPLD could be used here? It's been a while since I had a chance to look at this more closely, and I don't have much time either, but I would be happy to join a call as @jbenet suggested. I really hope this problem can be figured out once and for all. To me, this is the critical issue that needs to be solved for a true IPFS-based web archiving solution. Not to be overly dramatic, but being able to look up resources by URL and datetime entirely through IPFS is the key difference between just another storage backend to put WARCs into and a ground-breaking, decentralized new web archiving system. I would very much like to see the latter happen :)
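As a rough illustration of the lookup by URL and datetime described above, here is a sketch in which a collection maps URLs to TimeGate nodes and each TimeGate maps 14-digit datetimes to Memento links. The node layout, the `closest_memento` helper, and all hash values are hypothetical placeholders. Nothing here talks to IPFS; it only shows the shape of the resolution step.

```python
from bisect import bisect_left
from datetime import datetime

def parse14(stamp):
    """Parse a 14-digit YYYYMMDDhhmmss datetime string."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

# Hypothetical TimeGate node: datetime -> link to a Memento (placeholder hashes).
TIMEGATE = {
    "20160101000000": {"/": "placeholder-memento-a"},
    "20160615120000": {"/": "placeholder-memento-b"},
    "20170301083000": {"/": "placeholder-memento-c"},
}

# Hypothetical collection node: URL -> TimeGate.
COLLECTION = {"example.com/": TIMEGATE}

def closest_memento(timegate, wanted):
    """Return (datetime, link) of the Memento closest in time to `wanted`.

    Assumes `timegate` is non-empty and its keys are 14-digit datetimes,
    so lexical order equals chronological order.
    """
    stamps = sorted(timegate)
    i = bisect_left(stamps, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = stamps[max(i - 1, 0):i + 1]
    target = parse14(wanted)
    best = min(candidates, key=lambda s: abs(parse14(s) - target))
    return best, timegate[best]
```

Expanding the collection would then mean adding a key to the collection node or a datetime entry to a TimeGate node, each of which produces a new version of that node in a content-addressed store.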
Basic IPLD support has been merged into master, and will be released in 0.4.5 (likely before the end of the year). While the tooling around it so far is a bit sparse (we're still figuring out the best interfaces for this), you can use it and resolve things through it already. For example, take this IPLD object (written in JSON):
{
  "Hello": "World",
  "cats": {
    "kitty": {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
    "moustache": {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
    "bulletcat": {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
  },
  "code": {"/": "QmTG7sMBQXy6niraGSL9mjWvHUHJaMQYKAJ9Stqwmzo992"},
  "catArray": [
    {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
    {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
    {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
  ]
}
You can put this into ipfs like:
And then view paths over this like:
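The commands themselves did not survive in this copy of the thread. To illustrate what viewing paths over such an object means, here is a purely local sketch that walks a slash-separated path through the JSON above and stops at the first link, written as `{"/": "<hash>"}`. The `resolve` and `is_link` helpers are hypothetical and do not call ipfs. A real resolver would fetch the linked object and continue with the remaining path.

```python
# The example object from the comment above.
DOC = {
    "Hello": "World",
    "cats": {
        "kitty": {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
        "moustache": {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
        "bulletcat": {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"},
    },
    "code": {"/": "QmTG7sMBQXy6niraGSL9mjWvHUHJaMQYKAJ9Stqwmzo992"},
    "catArray": [
        {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
        {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
        {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"},
    ],
}

def is_link(node):
    """In this JSON form, a link is a dict whose only key is "/"."""
    return isinstance(node, dict) and set(node) == {"/"}

def resolve(node, path):
    """Walk `path` inside one object.

    Returns (value, remaining_path). If a link is reached before the path
    is exhausted, the link and the unresolved remainder are returned,
    because following it would require fetching another object.
    """
    segments = [s for s in path.split("/") if s]
    for i, segment in enumerate(segments):
        if is_link(node):
            return node, "/".join(segments[i:])
        # List elements are addressed by their integer index.
        node = node[int(segment)] if isinstance(node, list) else node[segment]
    return node, ""
```

For instance, `resolve(DOC, "cats/kitty")` ends on the link to the `kitty` object, while `resolve(DOC, "code/foo/bar")` returns the `code` link together with the unresolved remainder `foo/bar`.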
The second one is my personal preference as well. It off-loads version resolution to IPNS and does not pollute the object graph itself. The first approach, however, was something I put here just for comments as I was sketching possible architectures on the whiteboard. It causes an undesirable chain of updates: the TimeMap, the node that keeps pointers to all the mementos, would change after every memento addition, leaving unwanted dangling versions of the TimeMap object that are not linked from anywhere. Additionally, reference from the
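The chain of updates in the first approach can be sketched as follows. In a content-addressed store a node's address is derived from its content, so adding a memento produces a TimeMap with a new address, and every node that links to the TimeMap must change as well. This sketch uses SHA-256 over canonical JSON as a stand-in for an IPFS hash, which is an assumption for illustration only. The node layout and helper names are hypothetical.

```python
import hashlib
import json

def address(node):
    """Stand-in content address: SHA-256 of the node's canonical JSON."""
    data = json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def add_memento(timemap, datetime14, memento_link):
    """Return a new TimeMap with one more memento; the old one is untouched."""
    updated = dict(timemap)
    updated[datetime14] = {"/": memento_link}
    return updated

# Placeholder links, not real IPFS hashes.
timemap_v1 = {"20160101000000": {"/": "placeholder-memento-a"}}
timemap_v2 = add_memento(timemap_v1, "20160615120000", "placeholder-memento-b")

# A parent that links to the TimeMap by address changes too, and so on upward.
parent_v1 = {"timemap": {"/": address(timemap_v1)}}
parent_v2 = {"timemap": {"/": address(timemap_v2)}}

print(address(timemap_v1) != address(timemap_v2))  # the TimeMap address changed
print(address(parent_v1) != address(parent_v2))    # so did the parent's address
```

After the update, `timemap_v1` still exists under its old address but nothing in the new graph points to it. Resolving the latest version through a mutable name, as in the second approach, keeps that churn out of the object graph.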
One way of implementing collection building in the IPFS could be as follows:
Let me describe a bit more about the well-known entry point
It's great to see lots of progress happening on ipwb! However, to me it seems that the key aspect still missing is the ability to store and augment the index (CDXJ) in IPFS as well for a full IPFS system. The user should be able to:
Unless I'm missing something, the current system requires that the user maintain the index locally on their own file system (it could also be put into another system, such as Redis).
Removing this limitation is a key requirement for being able to run a web archive entirely on IPFS itself.
I remember that last year there was discussion of IPLD as a possible solution. Unfortunately, it's been a while since I had a chance to look at this, but I wonder if there have been any new developments or insights?