
ContentSources

leonardr edited this page Apr 21, 2014 · 34 revisions

Misc leftovers from 4/16 meeting

  • Talk to Heather McCormack, community manager for 3M; she knows Wattpad, Smashwords, etc.
  • Talk to Contra Costa about their acquisition design. Can we send them usage signals and get quality signals in return?
  • KUALI-OLE

Paid books

3M, Overdrive, and Axis 360 are the main vendors of DRM-encrypted paid ebooks. Each has a custom API that covers more or less the same ground. Server-Side Design has a back-to-back comparison of 3M, Overdrive, and Axis 360 when it comes to the integration features we care about. I won't go into the details here.

Overdrive

  • What format is the downloaded file? ODM? ACSM? We need to be able to open it in our reader.
  • No API for returning a book early. ("If a format has been locked in, then users need to use software like OverDrive Media Console or Adobe Digital Editions to complete the return process.")
  • We need to be able to do this from our reader.
  • Overdrive's web site has a separate section offering about 27000 ebooks from Project Gutenberg. The presentation is awful. This is probably the ~29000 ebooks from the 2010 PG DVD.

3M Cloud Library

  • We have a copy of the secret documentation.
  • The primary missing functionality here is that there is no way to download a DRM-encrypted book. 3M wants everyone to use their reader.
  • Also missing: we don't hear about it when someone gives up on a book and releases their hold.

Axis 360

  • We have a copy of the secret documentation.
  • Seems to have every major feature we need.

Open Library

Open Library is a project of the Internet Archive which uses physical books in libraries around the country as proxies for ebooks which can be checked out by anyone.

Each Open Library patron must have an individual Open Library account. You can check out up to five books at once. Titles are encrypted with Adobe DRM. PDF quality is good. OCR is atrocious, so EPUBs are not that great.

Open Library has a lot of old (non-public-domain) junk, but it also has a lot of solid midlist fiction and children's books. For instance, I did a spot check and looked at 38 ebooks tagged "Hugo winner" in the library. There were 19 titles with available copies, 13 titles that had been checked out, and 6 titles only available in DAISY format.

Open Library lent out about 140k ebooks in March 2014, which is significantly better than NYPL did.

APIs

Open Library's API is a read-only API that provides bibliographic information about a book and its editions. It links to ebook versions of a book, but doesn't say whether copies are available or checked out.

The Open Library Read API can look up book identifiers and return bibliographic information, plus a link to the page where you can borrow that book from Open Library. This API does specify 'status' as 'full access', 'lendable', 'checked out' or 'restricted'.
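
As a rough illustration of how a client might use the 'status' field, the sketch below filters a Read API style response for items a patron could read or borrow right away. The response shape (an `items` list whose entries carry `status` and `itemURL`) is an assumption made for illustration and should be checked against the Read API documentation. The sample data is made up.

```python
# Statuses that mean a patron can get the book immediately.
# The four status strings come from the Read API description above.
OBTAINABLE = {"full access", "lendable"}


def obtainable_items(response):
    """Return (status, url) pairs for items that are not checked out or restricted.

    `response` is ASSUMED to look like:
        {"records": {...}, "items": [{"status": ..., "itemURL": ...}, ...]}
    """
    results = []
    for item in response.get("items", []):
        status = item.get("status")
        if status in OBTAINABLE:
            results.append((status, item.get("itemURL")))
    return results


# Hypothetical sample response (not real Open Library data).
sample = {
    "records": {},
    "items": [
        {"status": "checked out", "itemURL": "http://example.org/borrow/1"},
        {"status": "lendable", "itemURL": "http://example.org/borrow/2"},
        {"status": "restricted", "itemURL": "http://example.org/borrow/3"},
        {"status": "full access", "itemURL": "http://example.org/read/4"},
    ],
}

for status, url in obtainable_items(sample):
    print(status, url)
```

Even with a filter like this, the borrow step itself still has to happen on the Open Library site, since there is no API for checking out a book.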

Open Library has minimal OPDS support which seems to be undocumented.

Open Library has no API functionality for checking out a book or joining the queue for a book.

Summary

It's tempting to see the nice selection available through Open Library and want to capture some of it for ourselves. However, the nice selection will not last too long if we start presenting Open Library books to 100ks of patrons. There is only one available copy of most books.

It's also hypocritical to integrate our reader into Open Library (and difficult to get Open Library to cooperate with us) since NYPL does not contribute any books to Open Library.

Publishers with unknown integration support

  • iVerse - Comic books. Charges libraries on a per-checkout basis and requires a custom app. We would need to pitch them the idea of an integrated reader, but they have the technology.
  • Wattpad - Popular fan fiction community. Since it's a community and not so much a corpus, I don't know if there's value in integrating it.
  • Smashwords - Has an OPDS feed, but it's not clear how we could borrow paid books except through Overdrive.

These are closer to the academic end. They probably have custom APIs with per-site access models.

  • Elsevier
  • Wiley
  • Springer
  • O'Reilly
  • Project Muse
  • JSTOR
  • Oxford Scholarship Online

Open-access books

Project Gutenberg

About 45,000 free public domain texts.

We plan to set up a mirror of the Gutenberg ePub documents. This gives us a text source over which we have complete technical and legal control.

  • Set up the mirror.
  • Provision a machine and copy over the ePubs.
  • Retrieve the MARC records and insert an "Electronic resource" pointer in each. There are a number of sources for MARC records (this one is from gutenberg.org but looks identical to the one at readingroo.ms), and a third-party script for converting Gutenberg RDF documents to MARC.
  • Generate an OPDS feed, either from the MARC records or from the underlying RDF data.
  • The MARC records, the OPDS feed, and the mirror should eventually be updated nightly. The Adelaide third-party library provides daily feeds of new and updated MARC records.
  • Come up with some way of identifying when a Gutenberg text is the same as an Overdrive text, and prefer the Gutenberg text.
  • How many books do we check out through Overdrive that we could replace with Gutenberg?
  • Are there changes we could make to the default epubs that would improve user experience? Doing some manual work on the top 100 books would be worth it.
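
One possible starting point for identifying when a Gutenberg text is the same as an Overdrive text is a normalized title/author key. This is only a sketch: the record fields (`title`, `author`) and the sample records are hypothetical, and real matching would also need to handle subtitles, translations, and variant author names.

```python
import re
import unicodedata


def _words(text):
    """Strip accents, lowercase, drop punctuation, and split into words."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return text.split()


def match_key(title, author):
    """Build a crude comparison key from a title and an author name."""
    title_words = _words(title)
    # Ignore a leading article so "The Time Machine" matches "Time Machine".
    if title_words and title_words[0] in {"a", "an", "the"}:
        title_words = title_words[1:]
    # Sort author words so "Wells, H. G." matches "H. G. Wells".
    author_words = sorted(_words(author))
    return (" ".join(title_words), " ".join(author_words))


def gutenberg_replacements(gutenberg, overdrive):
    """Return the Overdrive records that appear to have a Gutenberg equivalent."""
    free_keys = {match_key(b["title"], b["author"]) for b in gutenberg}
    return [b for b in overdrive
            if match_key(b["title"], b["author"]) in free_keys]


# Hypothetical sample records.
gutenberg = [{"title": "The Time Machine", "author": "Wells, H. G."}]
overdrive = [
    {"title": "Time Machine", "author": "H. G. Wells"},
    {"title": "A Farewell to Arms", "author": "Ernest Hemingway"},
]

print(gutenberg_replacements(gutenberg, overdrive))
```

Running the same key over checkout logs would give a first estimate of how many Overdrive checkouts could be replaced with Gutenberg texts.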

unglue.it

Only about 20 books currently, but they're high quality open access stuff. There is an unknown integration API that requires a library partner account. James is investigating this.

HathiTrust

  • We contributed some 10ks of texts to be scanned. Some 1Ms of public documents are available in total.
  • From the premises of a library, anyone can download full documents in PDF format. They can also get EPUB format, but only by using the mobile site.
  • Cardholders cannot download full documents off premises because our authentication system is not hooked up to Hathi. (Hathi prefers Shibboleth, used by most university libraries.)
  • Works start with a cover page including a legal disclaimer.
  • OCR is bad enough to make the PDF format preferable.
  • There are unanswered questions as to what exactly we can do with these 1Ms of documents and even the 10ks of documents we originally contributed. However, it looks like a web user on NYPL premises can download a PDF of any of those 1Ms of texts.

The API

  • HathiTrust Data API - API documentation in PDF format.
  • Full volumes are available for the Espressnet project only, and only in Espresso Book Machine format. (PDF documentation search term: "Volume-type resources".) Everyone else is restricted to page images, which makes the API useless for our purposes.
  • Leonard asked Josh to pass on the request for Espressnet-like access, and for the ability to get documents in PDF and EPUB format.

Internet Archive

Publishes several million public domain titles, scanned in-house. Integration is possible through OPDS. We can mirror the alphabetical OPDS feed once, and keep it up to date by periodically grabbing the "Recent Scans" feed. We can show IA results in search results and, if the user selects one of these texts, have them download directly from IA (while grabbing a copy for ourselves so we can directly serve the next person who downloads it).
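
The mirror-once-then-update approach could be sketched as follows: parse each Atom-based feed into entries keyed by Atom `id`, then let entries from the newer feed overwrite older ones in a local catalog. The feed snippets below are made-up samples rather than real IA output, and fetching the feeds over HTTP is left out.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entries(feed_xml):
    """Map each Atom entry id to its title, updated timestamp, and link hrefs."""
    root = ET.fromstring(feed_xml)
    entries = {}
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        entries[entry_id] = {
            "title": entry.findtext(ATOM + "title"),
            "updated": entry.findtext(ATOM + "updated"),
            "links": [link.get("href") for link in entry.findall(ATOM + "link")],
        }
    return entries


def merge(catalog, recent):
    """Apply entries from a 'recent' feed on top of an existing catalog."""
    merged = dict(catalog)
    merged.update(recent)
    return merged


# Hypothetical stand-in for the one-time alphabetical mirror.
FULL_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>urn:example:book-a</id>
    <title>Book A</title>
    <updated>2014-01-01T00:00:00Z</updated>
    <link href="http://example.org/a.pdf" type="application/pdf"/>
  </entry>
</feed>"""

# Hypothetical stand-in for a later "Recent Scans" feed.
RECENT_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>urn:example:book-a</id>
    <title>Book A</title>
    <updated>2014-04-01T00:00:00Z</updated>
    <link href="http://example.org/a-v2.pdf" type="application/pdf"/>
  </entry>
  <entry>
    <id>urn:example:book-b</id>
    <title>Book B</title>
    <updated>2014-04-02T00:00:00Z</updated>
    <link href="http://example.org/b.pdf" type="application/pdf"/>
  </entry>
</feed>"""

catalog = merge(parse_entries(FULL_FEED), parse_entries(RECENT_FEED))
for entry_id in sorted(catalog):
    print(entry_id, catalog[entry_id]["updated"])
```

Keying on the Atom `id` means a rescanned title replaces its old entry instead of appearing twice.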

IA has a wide selection of texts but quality is pretty bad. Bibliographical information is sometimes missing. OCR quality is very poor. EPUB editions are sometimes drastically truncated relative to the PDF edition--we would want to use the PDF edition pretty much everywhere. Many texts are random historical documents, not "books" as generally understood.

IA also makes available full copies of works I'm pretty sure are still under copyright, e.g. A Farewell to Arms. I'm not sure what their legal justification is. Note that that is a book scanned and OCRed by the Internet Archive itself, not contributed by a user (which is a whole different problem). So whitelisting books provided by IA won't solve this problem, if it is indeed a problem.

There are subsets of IA's catalog that are extremely interesting: for instance, there's an amazing selection of old periodicals.

Summary: curated views of IA's catalog can be integrated into our catalog, but not IMO the entire catalog. It's a junkyard.

Other open-access sources with integration ability

  • OAPEN - Open access books from European academic presses. ~700 books in English, ~350 in Dutch. Individual download links. Provided XML file includes bibliographic metadata and download links for each book--it's basically an OPDS feed.

Open-access sources with no integration ability

  • Digital Comic Museum - About 15,000 public domain comics, mostly from the 1940s-1960s. Comics are free but must be downloaded one at a time, and bots are forbidden by TOS. For the month of April they have a deal whereby a donation of $X gets you an FTP quota of X*10 gigabytes of data. I estimate 15,000 comics would be between 375 and 475 gigabytes of data.

  • Comic Book Plus - Very similar story to Digital Comic Museum. Many comics from the UK.

  • NASA - About 35 books, including some interesting books of history.

  • Wikibooks

  • Online periodicals: Medium, Matter, Atavist, Byliner, LongReads
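
For the Digital Comic Museum entry above, the donation needed to cover the estimated download size follows directly from the stated terms ($X buys X*10 gigabytes). A small worked calculation, assuming 1 GB = 1000 MB for the per-comic average:

```python
COMICS = 15_000
LOW_GB, HIGH_GB = 375, 475   # estimated total size, from the notes above
GB_PER_DOLLAR = 10           # "a donation of $X gets you ... X*10 gigabytes"


def donation_needed(gigabytes, gb_per_dollar=GB_PER_DOLLAR):
    """Dollars required to buy an FTP quota of the given size."""
    return gigabytes / gb_per_dollar


def average_mb_per_comic(gigabytes, comics=COMICS):
    """Average file size implied by the estimate (assumes 1 GB = 1000 MB)."""
    return gigabytes * 1000 / comics


print(donation_needed(LOW_GB), donation_needed(HIGH_GB))  # 37.5 47.5
print(average_mb_per_comic(LOW_GB))                        # 25.0
```

So a donation somewhere between $37.50 and $47.50 would cover the whole collection, if the size estimate holds.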

Some university presses with a lot of stuff:

Lists of links to open-access books

These sites don't host books; they host catalogues of links to books made available by other sources. There's likely to be a lot of overlap here.

  • The Online Books Page - Completely random curated selection of books from Gutenberg, Hathi, archive.org, university websites, and random websites (did you know the US Golf Association hosts some historical books on golf?)
  • E-Books directory - Selection is oriented towards pleasure reading. Claimed 8829 books in collection.
  • The Assayer - Selection is oriented towards math and technical books. About 1450 books.

Open-access sources with bad/overlapping selection

Not worth detailed investigation given that they don't host any popular books we can't get elsewhere, but let's keep them in mind.

  • Project Gutenberg Self-Publishing
  • WikiSource
  • Open Book Publishers - 41 books with 16 more on the way. Books are made available under one CC license or another, but only the HTML editions are free to read. Epub editions (presumably also CC-licensed) cost a small amount of money, usually 6 UKP.
  • Knowledge Unlatched - An unglue.it-like pilot program in which libraries band together to make academic-press books open access. End result is a few books released through OAPEN and Hathi.
  • Open Humanities Press - 14 theory-heavy books. Native PDFs. There are also 17 journals, but each has its own separate website.

More university presses:

Open textbooks

Entities who might be interested in cutting a deal

  • FeedBooks - An ebook distributor that is much closer to us politically than 3M/Overdrive.
  • Kensington - Street literature, hard to get into libraries
  • Tor Forge - "had a freemium model at one point but not sure they still do". Leonard happens to know Tor's editor-in-chief professionally and can ask him about this.
  • Singularity and Company - A bookstore in Brooklyn that acquires the rights to, digitizes and releases two books every month (one science fiction, one pulp adventure) on a subscription basis. Sample subscription offer.

Art

We have access to a huge variety of non-textual works to use in the interstitial spaces of our application. Think of the way the MTA uses art on the subway to make uniform spaces more interesting and relieve the boredom of waiting. We can display some pre-cached artwork while (e.g.) waiting for a download to complete or for a search to run. This is an easy way to give even a generic e-reader some personality and a "library" feel.

Any topic-based discovery algorithm we use for books should also work for matching artwork to books.

CoverArt

DPLA

DPLA has a comprehensive API with liberal usage policies. Access to texts is scattershot. Most of the available texts come from Hathi Trust, which we would access separately, and most texts are of highly specialized interest. The artwork is a different story. By integrating with DPLA we get a great source for high-quality interstitial art along with machine-readable metadata about the art.
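
A minimal sketch of how interstitial art candidates might be pulled from a DPLA search result. The endpoint URL, the `q`/`page_size`/`api_key` parameters, and the response field names (`docs`, `sourceResource`, `object`, `isShownAt`) reflect my understanding of the DPLA API and should be treated as assumptions to verify against the DPLA documentation. The sample response is made up, and no network call is made.

```python
from urllib.parse import urlencode

DPLA_ITEMS_URL = "https://api.dp.la/v2/items"  # assumed endpoint


def build_query(term, api_key, page_size=10):
    """Build a search URL for the (assumed) DPLA items endpoint."""
    params = {"q": term, "page_size": page_size, "api_key": api_key}
    return DPLA_ITEMS_URL + "?" + urlencode(params)


def art_candidates(response):
    """Pull title, image URL, and source page out of a search response."""
    candidates = []
    for doc in response.get("docs", []):
        image_url = doc.get("object")
        if not image_url:
            continue  # nothing to display, so skip this record
        candidates.append({
            "title": doc.get("sourceResource", {}).get("title"),
            "image": image_url,
            "source_page": doc.get("isShownAt"),
        })
    return candidates


# Hypothetical sample response (field names are assumptions, data is made up).
sample = {
    "docs": [
        {
            "sourceResource": {"title": "Sample print"},
            "object": "http://example.org/thumb/1.jpg",
            "isShownAt": "http://example.org/item/1",
        },
        {"sourceResource": {"title": "Record with no image"}},
    ]
}

print(build_query("subway mosaic", "YOUR_KEY"))
print(art_candidates(sample))
```

The results could be fetched ahead of time and cached, so artwork is ready to show while a download or search is in progress.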

A random sampling of art: 1 2 3 4. All of these are from the Smithsonian, which (like most DPLA contributors) has generous terms of reuse for educational purposes.

The Cooper-Hewitt museum has an API, and the Smithsonian has an internal API called EDAN, but I believe the DPLA API is the only programmatic access to art from across the Smithsonian.

Europeana

Basically the DPLA for European institutions. I have not investigated the API because there's no point until we decide whether to do anything with DPLA data, but it looks pretty similar to DPLA. A sample of the available data.

Wikimedia Commons

A sample of public domain art available through Wikimedia Commons. Access is through the MediaWiki API. Metadata (whether relating to source or to topics) is not as good as for DPLA materials.

Instagram

Joe did a Shipit Day project that involved pulling Instagram photos tagged #NYPL and using them as background images.
