Pretty listing of resources and title for HTML pages #178
Comments
@ibnesayeed Do you believe title extraction should be the default functionality or only activated when a flag is passed to the indexing script? My vote is the former: it makes the CDXJ more verbose, but richer and more user-friendly when parsed and displayed. We may also offer the option to enrich CDXJ TimeMaps that do not have this information from within the replay interface, e.g.,

A sample CDXJ w/o title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200"}

...

A sample CDXJ with a title attribute

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200", "title": "Lorem Ipsum"}

Adding titles will generate a different hash if the CDXJ is itself pushed into IPFS, a use case I anticipate for collaboration/sharing of a collection of captures. With the eventual IPNS integration and our indexless system (#61), the ramifications would be less severe.
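To make the enrichment and hash concern concrete, here is a minimal sketch, not ipwb's actual code. The helper names `parse_cdxj_line` and `add_title` are hypothetical. A plain SHA-256 digest stands in for an IPFS content hash, which is computed differently but is likewise a function of the file's bytes.

```python
import hashlib
import json


def parse_cdxj_line(line):
    """Split a CDXJ record into (SURT key, 14-digit datetime, JSON attributes)."""
    surt, datetime14, attrs = line.strip().split(" ", 2)
    return surt, datetime14, json.loads(attrs)


def add_title(line, title):
    """Return a copy of the record with a "title" attribute appended."""
    surt, datetime14, attrs = parse_cdxj_line(line)
    attrs["title"] = title
    return f"{surt} {datetime14} {json.dumps(attrs)}"


# The HTML record from the sample above, without a title.
record = (
    'edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 '
    '{"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/'
    'QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", '
    '"mime_type": "text/html", "status_code": "200"}'
)
enriched = add_title(record, "Lorem Ipsum")

# Stand-in for content addressing: any change to the bytes changes the digest.
before = hashlib.sha256(record.encode("utf-8")).hexdigest()
after = hashlib.sha256(enriched.encode("utf-8")).hexdigest()
print(before != after)
```

Because the digest depends on every byte, enriching an already-shared CDXJ yields a new address, which is the ramification described above.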
Either one is fine with me. Indexing is a one-time job, so it is fine if title extraction takes a bit of extra time, but it should be possible to skip it with a flag when a lot of data is to be indexed and more index annotations may follow later. Just make sure to sanitize the extracted title: strip any leading or trailing whitespace and convert newlines (if any) to spaces before storing it in the CDXJ. As long as we are not storing raw CDXJ files in IPFS, there is no harm in adding titles later. The newly proposed model can utilize IPLD for attaching such metadata.
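The extraction and sanitization steps could be sketched as follows. This is an illustration using only Python's standard-library `html.parser`, not the code that was eventually merged; `TitleParser`, `sanitize_title`, and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element encountered."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Ignore any later <title> (e.g. inside inline SVG) once one is captured.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks)


def sanitize_title(raw):
    """Strip leading/trailing whitespace and turn newlines into spaces.

    str.split() with no arguments also collapses runs of whitespace,
    which goes slightly beyond the request but keeps the value on one line.
    """
    return " ".join(raw.split())


def extract_title(html):
    """Return the sanitized title of an HTML document, or "" if none is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return sanitize_title(parser.title)
```

An indexer could call `extract_title` only for records whose MIME type is HTML and omit the `title` attribute when the result is empty.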
Now we extract titles from the HTML pages and store them in the index.
@ibnesayeed Do you want to work on surfacing these values to replace the URI-R+datetime that is currently displayed? Also, any thoughts on retaining the display of the URI-R alongside the title? Perhaps dimmed/gray, smaller, and adjacent to the title? I would like to continue to see the URI-R in some fashion without requiring something like a hover.
On the landing page we only want to surface a handful of captures that meet certain criteria. For them we can make cards/chips that hold more information in a more appealing way. We can either use some minimal card formatting or go for MementoEmbed-style cards on the landing page for a few URI-Ms. /cc @shawnmjones.
@ibnesayeed Any ideas on the heuristic we use for which mementos are displayed? Should we also give the user the option of an extended interface to see a comprehensive list? There were times in developing WAIL that a list of all URI-Rs archived would have been handy from the replay system.
A comprehensive pretty listing with filtering and pagination, or raw CDXJ index downloading, should go in the admin interface.
We do not use any heuristics and let the browser handle it if no content-type was recorded. Some web archives do have logic in place to predict the content-type when it is missing, but it is not always accurate.
@ibnesayeed "Any ideas on the heuristic we use for which mementos are displayed?" was not asking about content type, but rather: if we have 100 mementos, which ones are displayed, even if they are all HTML? Random? Newest? Largest? Let the user decide? If so, what's the default?
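Two of the options raised in this question ("Newest" and "Random") could be sketched as interchangeable strategies. This is purely illustrative: `select_mementos`, the record keys `datetime` and `mime_type`, and the default `limit` are assumptions, not anything decided in this thread.

```python
import random


def select_mementos(records, limit=5, strategy="newest", seed=None):
    """Pick up to `limit` HTML mementos to surface on a landing page.

    `records` is assumed to be a list of dicts with a 14-digit "datetime"
    string and an optional "mime_type" (hypothetical shape for this sketch).
    """
    html = [r for r in records if r.get("mime_type", "").startswith("text/html")]
    if strategy == "newest":
        # 14-digit datetimes sort chronologically as plain strings.
        return sorted(html, key=lambda r: r["datetime"], reverse=True)[:limit]
    if strategy == "random":
        return random.Random(seed).sample(html, min(limit, len(html)))
    raise ValueError(f"unknown strategy: {strategy}")


mementos = [
    {"datetime": "20170301192639", "mime_type": "text/html"},
    {"datetime": "20170301192640", "mime_type": "image/png"},
    {"datetime": "20180101000000", "mime_type": "text/html"},
    {"datetime": "20160101000000", "mime_type": "text/html"},
]
newest_two = select_mementos(mementos, limit=2)
```

Keeping the strategy a parameter leaves the "let the user decide" option open while still requiring a default.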
The item must be an HTML page. From there we can either go for
Surfaces the titles to the webUI and associates with the respective JSON and links for #178
Currently the bare URLs are listed for the archived resources (as shown in #177). We can make the listing better looking and less space-consuming by:
The title of each memento can be stored in the CDXJ file as an optional field. Title extraction would require HTML parsing at the time of indexing.