Pretty listing of resources and title for HTML pages #178

Open
ibnesayeed opened this issue Jun 6, 2017 · 10 comments

Comments

@ibnesayeed
Member

Currently the bare URLs are listed for the archived resources (as shown in #177). We can make it better looking and less space consuming by:

  • Showing titles of HTML pages and hyperlinking corresponding mementos
  • For resources where title is not present or is not applicable, their file name can be extracted from the URL
  • As a fallback, URLs can be used if nothing else is feasible
  • Along with the title, if generated, thumbnails would also be a good way to present resources.
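
The fallback order in the list above (title, then file name, then URL) could be sketched roughly as follows. This is only an illustration: `display_label` is a hypothetical helper, not an existing function in the codebase, and it assumes the title (if any) has already been read from the index.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def display_label(url, title=None):
    """Pick a human-readable label: title, then file name, then the URL.

    Hypothetical helper for illustration only.
    """
    if title and title.strip():
        return title.strip()
    # Last path segment, e.g. "frog.png" from ".../froggies/frog.png"
    filename = basename(unquote(urlsplit(url).path))
    # An empty file name (e.g. a URL ending in "/") falls back to the URL
    return filename or url
```

For a URL such as `http://matkelly.com/froggies/frog.png` with no title this yields the file name, while a bare `http://matkelly.com/` falls through to the URL itself.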

The title of each memento can be stored in the CDXJ file as an optional field. Title extraction would require HTML parsing at the time of indexing.
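
As a rough idea of what index-time title extraction might look like, here is a minimal sketch using only the standard library's `html.parser`. The names `TitleParser` and `extract_title` are assumptions made up for this example, and a real indexer would also need to deal with character encodings and malformed markup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are ignored
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks) if self._chunks else None


def extract_title(html_text):
    """Return the raw title text, or None if the page has no usable title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

The returned string is the raw title, so any whitespace cleanup would still have to happen before it is written to the index.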

@machawk1
Member

machawk1 commented Jul 14, 2017

@ibnesayeed Do you believe title extraction should be the default functionality or only activated when a flag is passed to the indexing script?

My vote is the former: it makes the CDXJ more verbose, but also richer and more user-friendly when parsed and displayed. We may also offer the option to enrich CDXJ TimeMaps that lack this information from within the replay interface, e.g.,

A sample CDXJ w/o title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200"}

...
being passed to the replay system, then an "Enrich" button hit to change the CDXJ to:

A sample CDXJ with title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200", "title": "Lorem Ipsum"}
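
The per-record part of such an "Enrich" step might look something like the sketch below. It assumes each record is a single line of the form `<SURT> <datetime> <JSON>` as in the samples; `enrich_cdxj_line` is a hypothetical name, and how the title is obtained is left out.

```python
import json


def enrich_cdxj_line(line, title):
    """Return the CDXJ record with a "title" attribute added to its JSON block.

    Hypothetical helper; assumes "<SURT> <datetime> <JSON>" records.
    """
    # Header lines such as !context and !meta are passed through unchanged
    if line.startswith("!"):
        return line
    surt, datetime14, payload = line.split(" ", 2)
    fields = json.loads(payload)
    fields.setdefault("title", title)  # keep an existing title, if any
    return f"{surt} {datetime14} {json.dumps(fields)}"
```

Since the JSON block is re-serialized, the resulting line differs byte for byte from the original, which is exactly why the content hash changes.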

This will have ramifications of generating a different hash if the CDXJ is itself pushed into IPFS, a use case I anticipate for collaboration/sharing of a collection of captures. With the eventual IPNS integration and our indexless system (#61), the ramifications would be less severe.

@ibnesayeed
Member Author

Either one is fine for me. Indexing is a one-time job, so it is fine if title extraction takes a bit of extra time, but it can be skipped with a flag when a lot of data is to be indexed and more index annotations may follow later. Just make sure to sanitize the extracted title by trimming any leading or trailing whitespace and converting newlines (if any) to spaces before storing it in the CDXJ.
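
A minimal sketch of that sanitization, assuming the raw title is already a Python string. `sanitize_title` is a made-up name, and note that this version also collapses runs of internal spaces and tabs, which goes slightly beyond what is asked for here.

```python
def sanitize_title(raw):
    """Trim the title and turn newlines (and other whitespace runs) into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace
    return " ".join(raw.split())
```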

As long as we are not storing raw CDXJ files in IPFS, there is no harm in adding titles later. The newly proposed model can utilize IPLD for attaching such metadata.

@ibnesayeed
Member Author

Now we extract titles from the HTML pages and store them in the index.

@machawk1
Member

@ibnesayeed Do you want to work on surfacing these values to replace the URI-R+datetime that is currently displayed? Also, thoughts on retaining the display of the URI-R to correspond with the title? Perhaps dimmed/gray, smaller, and adjacent to the title? I would like to continue to see the URI-R in some fashion without something like a hover.

@ibnesayeed
Member Author

On the landing page we want to surface only a handful of captures that meet certain criteria. For them we can make cards/chips that will hold more information in a more appealing way. We can either use some minimal card formatting or go for MementoEmbed style cards on the landing page for a few URI-Ms. /cc @shawnmjones.

@machawk1
Member

@ibnesayeed Any ideas on the heuristic we use for which mementos are displayed? Should we also give the user the option of an extended interface to see a comprehensive list?

There were times in developing WAIL that a list of all URI-Rs archived would have been handy from the replay system.

@ibnesayeed
Member Author

ibnesayeed commented Aug 27, 2018

A comprehensive pretty listing with filtering and pagination, or raw CDXJ index download, should go in the admin interface.

@ibnesayeed
Member Author

Any ideas on the heuristic we use for which mementos are displayed?

We do not use any heuristics and let the browser handle it if no content-type was recorded. Some web archives do have some logic in place to predict content-type when missing, but their accuracy is not perfect.

@machawk1
Member

@ibnesayeed "Any ideas on the heuristic we use for which mementos are displayed?" was not asking about content type but rather: if we have 100 mementos, which ones are displayed, even if they are all HTML? Random? Newest? Largest? Let the user decide? If so, what's the default?

@ibnesayeed
Member Author

The item must be an HTML page. From there we can either go for k number of random items, newest items, most archived items, or all of these under different sections.
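
To make the options concrete, here is one possible sketch of such a selection. Everything in it is an assumption for illustration: records are taken to be dicts with hypothetical `url`, `datetime` (14-digit string), and `mime_type` keys, and `pick_for_landing` is not an existing function.

```python
import random
from collections import Counter


def pick_for_landing(records, k=5, seed=None):
    """Select up to k HTML captures per section (random, newest, most archived)."""
    # Only HTML pages qualify; allow parameters such as "; charset=utf-8"
    html = [r for r in records if r.get("mime_type", "").startswith("text/html")]
    rng = random.Random(seed)
    # Number of captures per URL, used for the "most archived" section
    counts = Counter(r["url"] for r in html)
    return {
        "random": rng.sample(html, min(k, len(html))),
        # 14-digit datetimes sort correctly as plain strings
        "newest": sorted(html, key=lambda r: r["datetime"], reverse=True)[:k],
        "most_archived": [url for url, _ in counts.most_common(k)],
    }
```

Returning all three sections at once leaves it to the landing page to decide whether to show one of them or each under its own heading.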
