
ContentSources

leonardr edited this page Aug 15, 2014 · 34 revisions

Misc leftovers from 4/16 meeting

  • Talk to Heather McCormack, community manager for 3M, she knows Wattpad, Smashwords, etc.
  • Talk to Contra Costa about their acquisition design. Can we send them usage signals and get quality signals in return?
  • KUALI-OLE

What is "Content"?

For our Minimum Viable Product, "content" means:

  • Textual ebooks, delivered electronically
  • Electronically delivered graphic novels (there are some already in Overdrive/3M).
  • Print books, to be checked out of branch libraries

When doing our initial design we should keep in mind the following types of content we might deliver electronically and "read" through our client app:

  • Audiobooks
  • Periodicals

These types of content are a little further out there, but theoretically possible:

  • Academic papers
  • Textbooks
  • Reproductions of artwork
  • Historical documents, e.g. maps
  • Software (e.g. the stuff on The Console Living Room)

Textual Ebooks

Paid books

3M, Overdrive, and Axis 360 are the main vendors of DRM-encrypted paid ebooks. Each has a custom API that covers more or less the same ground. Server-Side Design has a back-to-back comparison of 3M, Overdrive, and Axis 360 when it comes to the integration features we care about. I won't go into the details here.

Overdrive

  • What format is the downloaded file? ODM? ACSM? We need to be able to open it in our reader.
  • No API for returning a book early. ("If a format has been locked in, then users need to use software like OverDrive Media Console or Adobe Digital Editions to complete the return process.")
  • We need to be able to do this from our reader.
  • Overdrive's web site has a separate section offering about 27000 ebooks from Project Gutenberg. The presentation is awful. This is probably the ~29000 ebooks from the 2010 PG DVD.

3M Cloud Library

  • We have a copy of the secret documentation.
  • The primary missing functionality here is that there is no way to download a DRM-encrypted book. 3M wants everyone to use their reader.
  • Also missing: we don't hear about it when someone gives up on a book and releases their hold.

Axis 360

  • We have a copy of the secret documentation.
  • Seems to have every major feature we need.

Open Library

Open Library is a project of the Internet Archive which uses physical books in libraries around the country as proxies for ebooks which can be checked out by anyone.

Each Open Library patron must have an individual Open Library account. You can check out up to five books at once. Titles are encrypted with Adobe DRM. PDF quality is good. OCR is atrocious, so EPUBs are not that great.

Open Library has a lot of old (non-public-domain) junk, but it also has a lot of solid midlist fiction and children's books. For instance, I did a spot check and looked at 38 ebooks tagged "Hugo winner" in the library. There were 19 titles with available copies, 13 titles that had been checked out, and 6 titles only available in DAISY format.

Open Library lent out about 140k ebooks in March 2014, which is significantly better than NYPL did.

APIs

Open Library's API is a read-only API that provides bibliographic information about a book and its editions. It links to ebook versions of a book, but doesn't say whether copies are available or checked out.

The Open Library Read API can look up book identifiers and return bibliographic information, plus a link to the page where you can borrow that book from Open Library. This API does specify 'status' as 'full access', 'lendable', 'checked out' or 'restricted'.
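
As a rough illustration of how a client might use that status field, the sketch below ranks items by status and picks the most accessible one. The item dictionaries and the `pick_best_item` and `is_borrowable_now` helpers are hypothetical; only the four status strings come from the description above, and the real response layout would need to be checked against the Read API documentation.

```python
# Hypothetical helpers: choose the most accessible ebook item for a title.
# Only the four status strings are taken from the Read API description;
# the surrounding dictionary layout is an assumption for illustration.

STATUS_RANK = {
    "full access": 0,   # readable without borrowing
    "lendable": 1,      # can be borrowed
    "checked out": 2,   # exists, but currently on loan
    "restricted": 3,    # not available to us
}


def pick_best_item(items):
    """Return the item with the most permissive status, or None."""
    known = [item for item in items if item.get("status") in STATUS_RANK]
    if not known:
        return None
    return min(known, key=lambda item: STATUS_RANK[item["status"]])


def is_borrowable_now(item):
    """True if a patron could start reading without waiting."""
    return item is not None and item.get("status") in ("full access", "lendable")
```

Items with an unrecognized status are ignored rather than guessed at, so a change in the API's vocabulary would show up as missing results instead of wrong ones.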

Open Library has minimal OPDS support which seems to be undocumented.

Open Library has no API functionality for checking out a book or joining the queue for a book.

Summary

It's tempting to see the nice selection available through Open Library and want to capture some of it for ourselves. However, the nice selection will not last too long if we start presenting Open Library books to 100ks of patrons. There is only one available copy of most books.

It's also hypocritical to integrate our reader into Open Library (and difficult to get Open Library to cooperate with us) since NYPL does not contribute any books to Open Library.

Publishers with unknown integration support

  • iVerse - Comic books. In their own words: "An online streaming service that provides libraries with access to thousands of digital graphic novels, comic books and manga at a per-checkout price with simultaneous circulation to virtually any mobile device, tablet or PC. Library Edition is built upon our popular Comics Plus digital distribution platform (powered by iVerse Media) that has delivered more than 20 million content downloads to consumers worldwide." We would need to pitch them the idea of an integrated reader, but they have the technology.
  • WattPad - Popular fan fiction community. Since it's a community and not so much a corpus, I don't know if there's value in integrating it.
  • SmashWords - Has an OPDS feed, but it's not clear how we could borrow paid books except through Overdrive.

These are closer to the academic end. They probably have custom APIs with per-site access models.

  • Elsevier
  • Wiley
  • Springer
  • O'Reilly
  • Project Muse
  • JSTOR
  • Oxford Scholarship Online

Open-access books

Project Gutenberg

About 45,000 free public domain texts.

We plan to set up a mirror of the Gutenberg ePub documents. This gives us a text source over which we have complete technical and legal control.

  • Set up the mirror.
  • Provision a machine and copy over the ePubs.
  • Retrieve the MARC records and insert an "Electronic resource" pointer in each. There are a number of sources for MARC records (this one is from gutenberg.org but looks identical to the one at readingroo.ms), and a third-party script for converting Gutenberg RDF documents to MARC.
  • Generate an OPDS feed, either from the MARC records or from the underlying RDF data.
  • The MARC records, the OPDS feed, and the mirror should eventually be updated nightly. The Adelaide third-party library provides daily feeds of new and updated MARC records.
  • Come up with some way of identifying when a Gutenberg text is the same as an Overdrive text, and prefer the Gutenberg text.
  • How many books do we check out through Overdrive that we could replace with Gutenberg?
  • Are there changes we could make to the default epubs that would improve user experience? Doing some manual work on the top 100 books would be worth it.
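
One possible starting point for recognizing when a Gutenberg text is the same as an Overdrive text is a normalized title-and-author key. This is only a sketch of a heuristic, not a settled design: the `match_key` and `find_replaceable` functions, the record fields, and any sample records are made up for illustration, and real catalog data would need fuzzier matching (subtitles, translations, multiple authors).

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def match_key(title, author):
    """Build a comparison key from a title and an author name."""
    # Drop a leading article so "The Time Machine" matches "Time Machine".
    title = re.sub(r"^(the|a|an) ", "", normalize(title))
    # Sort name parts so "Wells, H. G." matches "H.G. Wells".
    author = " ".join(sorted(normalize(author).split()))
    return (title, author)


def find_replaceable(overdrive_records, gutenberg_records):
    """Return (overdrive, gutenberg) pairs that appear to be the same work."""
    by_key = {match_key(g["title"], g["author"]): g for g in gutenberg_records}
    pairs = []
    for record in overdrive_records:
        key = match_key(record["title"], record["author"])
        if key in by_key:
            pairs.append((record, by_key[key]))
    return pairs
```

Running something like this over circulation data would give a first, rough answer to how many Overdrive checkouts could be served from Gutenberg instead.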

unglue.it

Only about 20 books currently, but they're high quality open access stuff. There is an unknown integration API that requires a library partner account. James is investigating this.

HathiTrust

  • We contributed some 10ks of texts to be scanned. Some 1Ms of public documents are available in total.
  • From the premises of a library, anyone can download full documents in PDF format. They can also get EPUB format, but only by using the mobile site.
  • Cardholders cannot download full documents off premises because our authentication system is not hooked up to Hathi. (Hathi prefers Shibboleth, used by most university libraries.)
  • Works start with a cover page including a legal disclaimer.
  • OCR is bad enough to make the PDF format preferable.
  • There are unanswered questions as to what exactly we can do with these 1Ms of documents and even the 10Ks of documents we originally contributed. However, it looks like a web user on NYPL premises can download a PDF of any of those 1Ms of texts.

The API

  • HathiTrust Data API: API documentation in PDF format.
  • Full volumes are available for the Espressnet project only, and only in Espresso Book Machine format. (PDF documentation search term: "Volume-type resources".) Everyone else is restricted to page images, which makes the API useless for our purposes.
  • Leonard asked Josh to pass on the request for Espressnet-like access, and for the ability to get documents in PDF and EPUB format.

Internet Archive

Publishes several million public domain titles, scanned in-house. Integration is possible through OPDS. We can mirror the alphabetical OPDS feed once, and keep it up to date by periodically grabbing the "Recent Scans" feed. We can show IA results in search results and if the user selects one of these texts, have them download directly from IA (while grabbing a copy for ourselves so we can directly serve the next person who downloads it).
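
The mirror-once-then-update approach can be sketched as follows. This assumes both feeds are ordinary Atom documents whose entries carry a unique `<id>`, as OPDS catalogs do; the `merge_feed` helper and the in-memory dictionary standing in for the mirror are illustrative only, and nothing here fetches anything from IA.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entries(feed_xml):
    """Yield (id, title, updated) for each entry in an Atom/OPDS feed."""
    root = ET.fromstring(feed_xml)
    for entry in root.iter(ATOM + "entry"):
        yield (
            entry.findtext(ATOM + "id"),
            entry.findtext(ATOM + "title"),
            entry.findtext(ATOM + "updated"),
        )


def merge_feed(mirror, feed_xml):
    """Add new entries to the mirror and refresh ones that changed.

    Returns the number of entries added or updated.
    """
    changed = 0
    for entry_id, title, updated in parse_entries(feed_xml):
        known = mirror.get(entry_id)
        # Timestamps are compared as strings, which is only safe when the
        # feed uses one consistent ISO 8601 format.
        if known is None or (updated or "") > (known["updated"] or ""):
            mirror[entry_id] = {"title": title, "updated": updated}
            changed += 1
    return changed
```

The same `merge_feed` call would serve for the initial pass over the alphabetical feed and for each later pass over the recent-scans feed, since an entry already in the mirror is only touched when its timestamp is newer.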

IA has a wide selection of texts but quality is pretty bad. Bibliographical information is sometimes missing. OCR quality is very poor. EPUB editions are sometimes drastically truncated relative to the PDF edition--we would want to use the PDF edition pretty much everywhere. Many texts are random historical documents, not "books" as generally understood.

IA also makes available full copies of works I'm pretty sure are still under copyright, e.g. A Farewell to Arms. I'm not sure what their legal justification is. Note that that is a book scanned and OCRed by the Internet Archive itself, not contributed by a user (which is a whole different problem). So whitelisting books provided by IA won't solve this problem, if it is indeed a problem.

There are subsets of IA's catalog that are extremely interesting: for instance, there's an amazing selection of old periodicals.

Summary: curated views of IA's catalog can be integrated into our catalog, but not IMO the entire catalog. It's a junkyard.

The dates on which books enter the public domain vary between countries. Some rights depend on registrations on a country-by-country basis, but the most important factor is how many years must pass after the death of the creator before a work enters the public domain.

This list will be updated yearly. If you want any other site to be included, please leave a suggestion in the comments below.

Europeana

Europeana offers access to millions of digitized items from European museums, libraries, and archives.

More than 2,000 institutions across Europe contribute to the site, including the British Library and national libraries from many countries.

To find free public domain books, search for the author or title, and in the left-side panel narrow results by file type (“text”), and by copyright (“public domain marked”).

Pulp Magazines Project

About 200 scanned PDF magazines. A combination of pre-1923 magazines and post-1923 magazines with no copyright renewal. An academic project.

Living Books about Life

25 nonfiction books on biology topics. General interest, not textbooks.

Feedbooks

This French ebook site is designed with mobile reading in mind. It’s tailored for mobile browsers, so you can download free ebooks directly to your tablet or smartphone.

Feedbooks offers thousands of public domain ebooks in five languages. Unlike on the Internet Archive, most of the free books have covers, so they look good in your e-reader or e-reading application.

The site integrates with Readmill, so you can send an ebook directly to your Readmill cloud library.

⇢ Feedbooks

Manybooks

This is a popular catalog of public domain ebooks, sourced from Project Gutenberg and Internet Archive.

The books are available in a vast number of different file formats, so if you are looking for less popular ones, like Plucker or FictionBook2, Manybooks is a good destination to explore.

Currently there are almost 30,000 titles in Manybooks.

⇢ Manybooks

World Public Library


The site is a part of a global effort to “preserve and disseminate classic works of literature, serials, bibliographies, dictionaries, encyclopaedias and other reference works.”

All books available here are free, but not all are public domain, so reading the Terms and Conditions section is recommended.

The site offers over 3 million digital items, grouped into easy-to-browse collections, including classic literature, children’s books, and academic research collections.

⇢ World Public Library

Google Book Search

Google launched its own ebookstore some time ago, but the earlier book scan project, Google Book Search, is still there.

Perform any search and you’ll see a list of results. If you see a Preview or Full View link under the book’s title, it means you can read its scan in the browser.

⇢ Google Book Search

Books Should Be Free

The site offers thousands of free public domain books, as audiobooks or text files. Titles in 30 languages can be found here.

⇢ Books Should Be Free

The Literature Network

The site calls itself a “searchable online literature for the student, educator, or enthusiast.”

Currently there are over 3,600 full books and over 4,400 short stories & poems from over 250 authors.

The key to exploring the site is the author index, from which you can browse linked books, quotes, forum threads, and quizzes.

⇢ The Literature Network

Bartleby

The site offers free Harvard Classics - complete volumes of the most comprehensive and well-researched anthology of all time (read-online, no downloads offered).

⇢ Bartleby

DailyLit

The platform’s offer should suit modern-day people who are always in a hurry. You can read an ebook in daily installments, delivered by mail or RSS feed.

Apart from DailyLit’s own serialized fiction, you can find hundreds of classic novels here. Pride and Prejudice and War of the Worlds were the first two books offered at the platform’s launch in 2006.

⇢ DailyLit

Read Easily

The site is designed particularly for partially sighted and visually impaired readers.

Free classics can be read online, and you can change colors, fonts, as well as increase font size to make the text more legible.

⇢ Read Easily

LibriVox

Founded in 2005, LibriVox is an extensive library of free public domain audiobooks.

Volunteers record chapters of public domain books. Afterwards LibriVox releases the audio file for free in the public domain, and you may use it the way you like.

⇢ LibriVox

Legamus

The site makes free audiobooks from texts that have entered the public domain in Europe.

⇢ Legamus

Open Culture

Open Culture is a popular blog that curates access to educational and cultural media.

Among several collections, you can find a directory of over 500 free ebooks here. Most of them are in the public domain.

⇢ Open Culture

Classic Literature Library

Public domain books organized into collections. The complete works of William Shakespeare, Jules Verne, Charles Dickens or Mark Twain, among others.

⇢ Classic Literature Library

The Online Books Page

The site, managed by the University of Pennsylvania, offers a clean interface to browse for over 1 million free ebooks from around the web.

⇢ The Online Books Page

Great Books and Classics

A repository of works of classic writers and philosophers, from Sophocles to Epicurus, to Sun-Tzu.

The books in digital format can be read here online as html files.

⇢ Great Books and Classics

Classic Reader

All books on this website are in the public domain. You can choose from 3810 titles by 358 authors.

⇢ Classic Reader

Planet Publish

A decent collection of popular works of classic literature, in pdf format.

⇢ Planet Publish

Classical Chinese Literature

Chinese classics with each character hyperlinked to its definition and etymology.

⇢ Classical Chinese Literature

Wolne Lektury

Extremely well-managed collection of free public domain books in Polish.

Currently there are over 2,300 titles available, either for online reading or to download (epub, mobi, pdf formats).

⇢ Wolne Lektury

Projekti Lönnrot

Public domain books in Finnish and Swedish.

⇢ Projekti Lönnrot

Non-English sources

Español

A non-comprehensive list of ebook resources with free offerings:

Русский язык

Polski

Français

Deutsch

Português

中文

한국어

日本語

עברית

Scandinavian languages

Other open-access sources with integration ability

  • OAPEN - Open access books from European academic presses. ~700 books in English, ~350 in Dutch. Individual download links. Provided XML file includes bibliographic metadata and download links for each book--it's basically an OPDS feed.

Open-access sources with no integration ability

  • Digital Comic Museum - About 15,000 public domain comics, mostly from the 1940s-1960s. Comics are free but must be downloaded one at a time, and bots are forbidden by TOS. For the month of April they have a deal whereby a donation of $X gets you an FTP quota of X*10 gigabytes of data. I estimate 15,000 comics would be between 375 and 475 gigabytes of data.

  • Comic Book Plus - Very similar story to Digital Comic Museum. Many comics from the UK.

  • NASA - About 35 books, including some interesting books of history.

  • Wikibooks

  • Online periodicals: Medium, Matter, Atavist, Byliner, LongReads

Some university presses with a lot of stuff:

Lists of links to open-access books

These sites don't host books; they host catalogues of links to books made available by other sources. There's likely to be a lot of overlap here.

  • The Online Books Page - Completely random curated selection of books from Gutenberg, Hathi, archive.org, university websites, and random websites (did you know the US Golf Association hosts some historical books on golf?)
  • E-Books directory - Selection is oriented towards pleasure reading. Claimed 8829 books in collection.
  • The Assayer - Selection is oriented towards math and technical books. About 1450 books.

Open-access sources with bad/overlapping selection

Not worth detailed investigation given that they don't host any popular books we can't get elsewhere, but let's keep them in mind.

  • Project Gutenberg Self-Publishing
  • WikiSource
  • Open Book Publishers - 41 books with 16 more on the way. Books are made available under one CC license or another, but only the HTML editions are free to read. Epub editions (presumably also CC-licensed) cost a small amount of money, usually 6 UKP.
  • Knowledge Unlatched - An unglue.it-like pilot program in which libraries band together to make academic-press books open access. End result is a few books released through OAPEN and Hathi.
  • Open Humanities Press - 14 theory-heavy books. Native PDFs. There are also 17 journals, but each has its own separate website.

More university presses:

Open textbooks

Entities who might be interested in cutting a deal

  • FeedBooks - An ebook distributor that is much closer to us politically than 3M/Overdrive.
  • WattPad
  • Kensington - Street literature, hard to get into libraries
  • Tor Forge - "had a freemium model at one point but not sure they still do". Leonard happens to know Tor's editor-in-chief professionally and can ask him about this.
  • Singularity and Company - A bookstore in Brooklyn that acquires the rights to, digitizes and releases two books every month (one science fiction, one pulp adventure) on a subscription basis. Sample subscription offer.

Interstitial Art

We have access to a huge variety of non-textual works to use in the interstitial spaces of our application. Think of the way the MTA uses art on the subway to make uniform spaces more interesting and relieve the boredom of waiting. We can display some pre-cached artwork while (e.g.) waiting for a download to complete or for a search to run. This is an easy way to give even a generic e-reader some personality and a "library" feel.

Any topic-based discovery algorithm we use for books should also work for matching artwork to books.
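
As one illustration of that idea, the sketch below scores artworks against a book by the overlap of their subject tags (Jaccard similarity). It is a stand-in for whatever discovery algorithm is eventually chosen, not a proposal for it; the `subjects` field and the `best_artwork` helper are invented for the example.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union of two tag sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_artwork(book_subjects, artworks):
    """Return the artwork whose subjects overlap most with the book's.

    Returns None when no artwork shares any subject with the book, so the
    caller can fall back to a generic image.
    """
    scored = [(jaccard(book_subjects, art["subjects"]), art) for art in artworks]
    scored = [pair for pair in scored if pair[0] > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

Because the function only needs a set of subject tags on each side, the same scoring could in principle be reused for book-to-book recommendations.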

NYPL collections

We have a huge collection of digitized material that we already make available to patrons.

DPLA

DPLA has a comprehensive API with liberal usage policies. Access to texts is scattershot. Most of the available texts come from Hathi Trust, which we would access separately, and most texts are of highly specialized interest. The artwork is a different story. By integrating with DPLA we get a great source for high-quality interstitial art along with machine-readable metadata about the art.

A random sampling of art: 1 2 3 4 All of these are from the Smithsonian, which (like most DPLA contributors) has generous terms of reuse for educational purposes.

The Cooper-Hewitt museum has an API, and the Smithsonian has an internal API called EDAN, but I believe the DPLA API is the only programmatic access to art from across the Smithsonian.
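
A minimal sketch of what pulling interstitial art from DPLA might look like is given here. The endpoint, the query parameter names, and the response fields (`docs`, `object`, `isShownAt`, `sourceResource.title`) are all assumptions to be checked against the DPLA API documentation before use; the helper names are hypothetical, and the code only builds a URL and reads an already-decoded response, without making any request.

```python
from urllib.parse import urlencode

DPLA_ITEMS = "https://api.dp.la/v2/items"  # assumed endpoint


def image_search_url(query, api_key, page_size=10):
    """Build a search URL restricted to image-type items (assumed parameters)."""
    params = {
        "q": query,
        "sourceResource.type": "image",
        "page_size": page_size,
        "api_key": api_key,
    }
    return DPLA_ITEMS + "?" + urlencode(params)


def extract_art(response):
    """Pull (title, image_url, source_page) tuples out of a decoded response."""
    results = []
    for doc in response.get("docs", []):
        image_url = doc.get("object")
        if not image_url:
            continue  # nothing we could display
        title = doc.get("sourceResource", {}).get("title", "")
        if isinstance(title, list):
            # Assumed: a title may arrive as a list of strings.
            title = title[0] if title else ""
        results.append((title, image_url, doc.get("isShownAt")))
    return results
```

Keeping the source page alongside each image would make it easy to credit the contributing institution wherever the art is displayed.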

Europeana

Basically the DPLA for European institutions. I have not investigated the API because there's no point until we decide whether to do anything with DPLA data, but it looks pretty similar to DPLA. A sample of the available data.

Wikimedia Commons

A sample of public domain art available through Wikimedia Commons. Access is through the MediaWiki API. Metadata (whether relating to source or to topics) is not as good as for DPLA materials.
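
For a sense of what access through the MediaWiki API involves, here is a small sketch that builds a `categorymembers` query for the files in a Commons category and reads the titles out of a decoded JSON response. The category passed in would be whichever one we choose to draw from (any name used when calling these helpers is hypothetical), the helper names are made up, and no request is actually sent.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=20):
    """Build a MediaWiki API query listing the files in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",
        "cmlimit": limit,
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)


def file_titles(response):
    """Return the page titles from a decoded categorymembers response."""
    members = response.get("query", {}).get("categorymembers", [])
    return [member["title"] for member in members]
```

Category membership is about the only topical signal this gives us, which fits the observation that Commons metadata is weaker than DPLA's.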

Instagram

Joe did a Shipit Day project that involved pulling Instagram photos tagged #NYPL and using them as background images.
