Pretty listing of resources and title for HTML pages #178

ibnesayeed · 2017-06-06T17:21:44Z

Currently the bare URLs are listed for the archived resources (as shown in #177). We can make it better looking and less space consuming by:

Showing titles of HTML pages and hyperlinking corresponding mementos
For resources where title is not present or is not applicable, their file name can be extracted from the URL
As a fallback, URLs can be used if nothing else is feasible
Along with the title, if generated, thumbnails would also be a good way to present resources.
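The fallback order in the list above (title, then file name, then URL) could look roughly like the sketch below. `display_label` is a hypothetical helper name, not something that exists in ipwb, and it assumes the title (if any) has already been read from the index.

```python
import posixpath
from typing import Optional
from urllib.parse import unquote, urlsplit


def display_label(url: str, title: Optional[str] = None) -> str:
    """Pick a label for a listed resource: title, else file name, else URL."""
    # 1. Prefer the page title when one is available and non-blank.
    if title and title.strip():
        return title.strip()
    # 2. Otherwise use the last path segment of the URL as a file name.
    filename = posixpath.basename(unquote(urlsplit(url).path))
    # 3. Fall back to the bare URL when the path has no file name.
    return filename or url
```

For example, `display_label("http://matkelly.com/froggies/frog.png")` would yield `frog.png`, while a URL ending in `/` falls through to the URL itself.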

The title of each memento can be stored in the CDXJ file as an optional field. Title extraction would require HTML parsing at the time of indexing.
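A minimal sketch of what title extraction at indexing time might involve, using only the standard library `html.parser`. The names `TitleParser` and `extract_title` are made up for illustration; this is not necessarily how the indexer does (or will do) it.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        return "".join(self._chunks) if self._done else None


def extract_title(html: str) -> Optional[str]:
    """Return the raw title text, or None when no complete <title> is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

The returned text is raw, so any whitespace cleanup would still need to happen before it is written to the index.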

machawk1 · 2017-07-14T13:08:10Z

@ibnesayeed Do you believe title extraction should be the default functionality or only activated when a flag is passed to the indexing script?

My vote is the former: it makes the CDXJ more verbose, but richer and more user-friendly when parsed and displayed. We may also offer the option to enrich CDXJ TimeMaps that do not have this information from within the replay interface, e.g.,

A sample CDXJ w/o title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200"}

...
being passed to the replay system, then an "Enrich" button hit to change the CDXJ to:

A sample CDXJ w/ title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200", "title": "Lorem Ipsum"}

This has the ramification of generating a different hash if the CDXJ is itself pushed into IPFS, a use case I anticipate for collaboration/sharing of a collection of captures. With the eventual IPNS integration and our indexless system (#61), the ramifications would be less severe.
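To make the "Enrich" step concrete, here is a rough sketch of adding a title to a single CDXJ record. It assumes each record is `<SURT key> <14-digit datetime> <JSON block>` separated by single spaces, as in the samples above; `enrich_record` is a hypothetical name, not an existing ipwb function.

```python
import json


def enrich_record(line: str, title: str) -> str:
    """Return the CDXJ record with a "title" attribute added to its JSON block."""
    if line.startswith("!"):
        # Header lines such as !context and !meta are not capture records.
        return line
    # Split into SURT key, datetime, and the JSON attributes block.
    surt, datetime, payload = line.rstrip("\n").split(" ", 2)
    attrs = json.loads(payload)
    attrs["title"] = title
    # json.dumps keeps key order, so "title" is appended after existing keys.
    return f"{surt} {datetime} {json.dumps(attrs, ensure_ascii=False)}"
```

Because the serialized record changes, any content-derived address of a CDXJ file stored in IPFS would change as well, which is the ramification noted above.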

ibnesayeed · 2017-07-14T17:28:40Z

Either one is fine for me. Indexing is a one-time job, so it is fine if title extraction takes a bit of extra time, but it can be skipped with a flag when a lot of data is to be indexed and more index annotations may follow later. Just make sure to sanitize the extracted title by stripping any leading or trailing whitespace and converting newlines (if any) to spaces before storing it in the CDXJ.
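The sanitization described above could be as small as the sketch below. `sanitize_title` is a hypothetical name, and collapsing every run of whitespace (not just newlines) into a single space is an assumption that goes slightly beyond what is asked for here.

```python
import re


def sanitize_title(raw: str) -> str:
    """Strip leading/trailing whitespace and turn newlines into spaces."""
    # \s+ matches newlines, tabs, and repeated spaces; each run becomes one
    # space so the title stays on a single CDXJ line.
    return re.sub(r"\s+", " ", raw).strip()
```

A title such as `"  Lorem\n  Ipsum \r\n"` would be stored as `Lorem Ipsum`.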

As long as we are not storing raw CDXJ files in IPFS, there is no harm in adding titles later. The newly proposed model can utilize IPLD for attaching such metadata.

ibnesayeed · 2018-08-27T15:11:54Z

Now we extract titles from the HTML pages and store them in the index.

machawk1 · 2018-08-27T15:35:49Z

@ibnesayeed Do you want to work on surfacing these values to replace the URI-R+datetime that is currently displayed? Also, thoughts on retaining the display of the URI-R to correspond with the title? Perhaps dimmed/gray, smaller, and adjacent to the title? I would like to continue to see the URI-R in some fashion without something like a hover.

ibnesayeed · 2018-08-27T18:56:35Z

On the landing page we want to surface just a handful of captures that meet certain criteria. For them we can make cards/chips that will hold more information in a more appealing way. We can either use some minimal card formatting or go for MementoEmbed style cards on the landing page for a few URI-Ms. /cc @shawnmjones.

machawk1 · 2018-08-27T18:58:24Z

@ibnesayeed Any ideas on the heuristic we use for which mementos are displayed? Should we also give the user the option of an extended interface to see a comprehensive list?

There were times in developing WAIL that a list of all URI-Rs archived would have been handy from the replay system.

ibnesayeed · 2018-08-27T19:01:50Z

A comprehensive pretty listing with filtering and pagination, or raw CDXJ index downloading, should go in the admin interface.

ibnesayeed · 2018-08-27T19:03:45Z

Any ideas on the heuristic we use for which mementos are displayed?

We do not use any heuristics and let the browser handle it if no content-type was recorded. Some web archives do have some logic in place to predict content-type when missing, but their accuracy is not perfect.

machawk1 · 2018-08-27T19:07:32Z

@ibnesayeed "Any ideas on the heuristic we use for which mementos are displayed?" was not asking about content type but rather: if we have 100 mementos, which are displayed, even if they are all HTML? Random? Newest? Largest? Let the user decide? If so, what's the default?

ibnesayeed · 2018-08-27T19:13:02Z

The item must be an HTML page. From there we can either go for k number of random items, newest items, most archived items, or all of these under different sections.
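A rough sketch of the selection options listed above, assuming each capture has already been parsed into a dict with `uri_r`, `datetime` (14-digit), and `mime_type` keys. Those key names and all function names here are assumptions for illustration, not the replay code's actual data model.

```python
import random
from collections import Counter
from typing import Dict, List, Optional

Capture = Dict[str, str]


def html_captures(captures: List[Capture]) -> List[Capture]:
    """Keep only captures recorded as HTML pages."""
    return [c for c in captures if c.get("mime_type", "").startswith("text/html")]


def newest(captures: List[Capture], k: int) -> List[Capture]:
    """The k most recent HTML captures (14-digit datetimes sort as strings)."""
    pool = html_captures(captures)
    return sorted(pool, key=lambda c: c["datetime"], reverse=True)[:k]


def random_pick(captures: List[Capture], k: int,
                seed: Optional[int] = None) -> List[Capture]:
    """Up to k HTML captures chosen at random; seed makes it repeatable."""
    pool = html_captures(captures)
    return random.Random(seed).sample(pool, min(k, len(pool)))


def most_archived(captures: List[Capture], k: int) -> List[str]:
    """The k URI-Rs with the most HTML captures."""
    counts = Counter(c["uri_r"] for c in html_captures(captures))
    return [uri for uri, _ in counts.most_common(k)]
```

Each function could back its own section on the landing page, or one of them could serve as the default with the others offered as options.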

machawk1 added enhancement ipwb replay labels Jun 6, 2017

machawk1 added this to the 2.0 (Extended more featureful implementation) milestone Jun 6, 2017

ibnesayeed mentioned this issue Jun 29, 2018

Misreported number of "HTML pages listed" in replay interface #405

Closed

ibnesayeed added a commit that referenced this issue Aug 26, 2018

Extract title from payload and add to the index when available, #178

e200af0

ibnesayeed mentioned this issue Aug 26, 2018

Extract title from payload and add to the index when available #527

Merged

machawk1 mentioned this issue Sep 19, 2018

Split date from time and position it before URI-R in listing #566

Merged

machawk1 added a commit that referenced this issue Sep 25, 2018

Surfaces the titles to the webUI and associates with the respective J…

dc57f25

…SON and links for #178

machawk1 mentioned this issue Sep 25, 2018

Surfaces the titles to the webUI and associates with the respective JSON and links for #178 #570

Merged

machawk1 added a commit that referenced this issue Sep 25, 2018

Merge pull request #570 from oduwsdl/issue-178

7558bc7

Surfaces the titles to the webUI and associates with the respective JSON and links for #178

machawk1 mentioned this issue Feb 21, 2019

Implement memory-efficient in-file binary search for CDXJ indexes #604

Closed
