This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Data Exports

Paul Beaudoin edited this page Oct 11, 2016 · 6 revisions

Warning: This is an experimental feature that requires experimental code. Use (or merge) the bots-and-data-export-release branch to enable it.

[Draft] Scribe Data Exports

Scribe's data model is rather complex and not ideal for reasoning about the data one has collected at the end of one's transcription project. To support data analysis, Scribe includes a feature to generate data exports that collate all collected contributions into a series of JSON documents - one per Subject Set. These documents include all collected transcription data with confidence measures.

The data is anonymized to the extent that no personal information is included. User_ids are included so that you can analyze individual contribution behavior, but those identifiers cannot be used to retrieve an individual's name, email, or other personal information.

Creating

The export can be generated with this command:

rake project:build_and_export_final_data

Big caveat: This process will attempt to write a large ZIP file to disk. If your project is hosted on a box with "ephemeral storage", like Heroku, the write may not succeed. In that case, you'll have to run project:build_and_export_final_data locally against a copy of your database.
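A sketch of that local workflow follows. The restore step depends on your database and host, so treat these commands as placeholders rather than exact instructions:

```shell
# Sketch: run the export locally against a copy of production data.
# 1. Download a dump of your production database from your host
#    (the mechanism varies by database and hosting provider).
# 2. Restore that dump into a local database and point your local
#    Scribe checkout at it via your database config.
# 3. From the project root, run the export task:
bundle exec rake project:build_and_export_final_data
# The ZIP is written under ./public in your local checkout.
```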

Making It Public (Or Not)

You have the option to advertise the availability of your data exports. After you've generated your first export, proceed to YOURAPP/admin/data where you'll find a checkbox to "Allow the public to download data". If you enable that option, your app will immediately serve requests to YOURAPP/#/data/exports, which advertises your rolling classification count, allows visitors to browse exports by keyword, and provides links to the latest dump. A link to an Atom feed is also available in case researchers would like to be notified when new exports are published.

Note that enabling "Allow the public to download data" at YOURAPP/admin/data will not, on its own, advertise the availability of the /#/data/exports route. If you'd like casual users of your project to discover the Data Exports page on their own (rather than, say, receiving the link from you by email), you should include a link in one of your custom content pages. For example, if you have a "Data" page describing the data you're collecting, you might include a markdown link like [Download/Browse the Latest Exports](/#/data/exports) somewhere in that page.

You can disable "Allow the public to download data" at any time to restrict access to /#/data/exports. Note that the export ZIP files placed in ./public will technically remain downloadable regardless of whether you've enabled "Allow the public to download data". The ZIP filenames are obscure to mitigate guessing, but nothing else prevents someone from downloading a ZIP once they know the filename. If you need to completely delete an export, you'll have to remove the ZIP from the filesystem yourself.
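Removing an export means deleting its ZIP from the app's public directory. The filename below is a placeholder; use the actual name of the export you want gone:

```shell
# Placeholder filename -- substitute the real export filename.
rm ./public/your-export-filename.zip
```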

Contents

Data exports consist of a series of JSON documents - one per SubjectSet.

The organization of each dump looks like this:

FinalSubjectSet
 + Subjects
   + Assertions
     + Versions

These elements are described below.

FinalSubjectSet (aka "collection of pages")

This is the top-level element; it stores all collected data in its descendant Subjects.

  • id: Unique identifier
  • meta_data: Hash of metadata fields copied from original sets created at project initialization
  • search_terms_by_field: Hash of all distinct values for each entity transcribed for the set
  • subjects: Array of Subject elements (see below)

Subject (aka "page")

A subject represents a single page and includes all assertions made about that page.

  • id: Unique identifier
  • location: Hash of image URLs - typically includes 'standard' and 'thumbnail' entries
  • status: String representing status of the subject on the progress to completion. Values and their meanings include:
    • "active": Subject is being actively marked/transcribed
    • "bad": Subject has been taken out of the active pool because multiple contributors marked it as being blank or otherwise invalid.
    • "retired": Subject has been taken out of the active pool because multiple contributors marked it as being complete.
  • width: Int pixel width of "standard" size image
  • height: Int pixel height of "standard" size image
  • meta_data: Hash of metadata fields copied from original subjects created at project initialization
  • assertions: Array of Assertion elements (see below)

Assertion

An assertion is a single declaration about data contained in an area of the page - typically derived from multiple transcriptions.

  • id: Unique identifier
  • status: String representing status of assertion in workflows. Values are:
    • "awaiting_votes": Assertion represents several convergent transcriptions that are actively being verified (voted upon) in Verify
    • "awaiting_transcriptions": Assertion represents at least one transcription but, per project configuration, requires additional transcriptions before proceeding to Verify (if applicable)
    • "complete": Assertion has sufficient transcriptions and votes as configured for the project to be considered a confident assertion.
    • "contentious": Assertion probably has merit, but contributors could not decide on the right data transcription, so no further classifications are allowed.
  • name: String name of the field as specified in the workflow task export_name configuration parameter.
  • confidence: Float confidence measure, where 0 means no confidence and 1.0 means full confidence.
  • data: Hash of the data represented by the assertion. Note that low confidence assertions may have multiple alternate transcriptions represented by child "AssertionVersion" elements
  • versions: Array of AssertionVersion elements (see below)

AssertionVersion

An AssertionVersion is a single distinct transcription. Multiple different transcriptions are possible for a given region. When they disagree, one AssertionVersion is created for each distinct set of data. When multiple contributors submit identical data, their contributions are represented by a single AssertionVersion.

  • data: Hash of the data submitted by contributors. If the prompt requested a single transcription, "value" will be the only entry.
  • votes: Int number of contributors who agree on this version.
  • confidence: Float confidence measure, where 0 means no confidence and 1.0 means full confidence.
  • instances: Array of hashes representing individual transcriptions. Each hash includes:
    • "created": Date of creation
    • "user_id": Unique identifier for contributor
    • "duration": Float representing the approximate time spent completing the task.
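To make the structure above concrete, here is a sketch of how one might load a single export document and pull out confident assertions. The document below is invented for illustration (field names follow this page; the values, the 0.9 threshold, and the filename-free setup are assumptions, not part of Scribe):

```ruby
require 'json'

# Hypothetical, minimal export document mirroring the structure described
# above (FinalSubjectSet > Subjects > Assertions > Versions). In practice
# you would read this from one of the per-SubjectSet JSON files in the ZIP.
export_json = <<~JSON
  {
    "id": "set-1",
    "meta_data": { "title": "Ledger 1842" },
    "subjects": [
      {
        "id": "subject-1",
        "status": "retired",
        "assertions": [
          {
            "id": "assertion-1",
            "status": "complete",
            "name": "surname",
            "confidence": 0.92,
            "data": { "value": "Beaudoin" },
            "versions": [
              { "data": { "value": "Beaudoin" }, "votes": 3, "confidence": 0.92 },
              { "data": { "value": "Beaudin" },  "votes": 1, "confidence": 0.25 }
            ]
          }
        ]
      }
    ]
  }
JSON

set = JSON.parse(export_json)

# Collect every completed assertion above an (arbitrary) confidence threshold.
confident = set["subjects"]
  .flat_map { |subject| subject["assertions"] }
  .select { |a| a["status"] == "complete" && a["confidence"] >= 0.9 }
  .map    { |a| [a["name"], a["data"]["value"]] }

puts confident.inspect  # => [["surname", "Beaudoin"]]
```

Low-confidence assertions carry their competing readings in "versions", so the same traversal can be extended to inspect each AssertionVersion's votes and confidence when the top-level data is ambiguous.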