Raw data storage and transfer protocol #59

Closed
qlquanle opened this issue Aug 9, 2017 · 20 comments

@qlquanle
Contributor

qlquanle commented Aug 9, 2017

Develop and document protocol for raw data storage. (8h)

  • Follow SVN protocol for documenting and storing raw data (readme w/ download date, who, and where; license/citation, codebooks), but do so in Dropbox.
  • Determine what provenance.log looks like. (e.g., checksum, hash, file enumeration + file size)
  • Simple data cleaning code is allowed to live in Dropbox.
  • Update RA Manual to reflect these practices.

Document protocol for data transfer. (2h)

  • Manual installation of externals. The user points to the location via config_user.yaml (see the sketch below).
  • Update RA Manual (and template readme) to reflect these practices.
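
As an illustration of the config_user.yaml bullet above, a build script might resolve an external like this; the key names and paths are hypothetical, not an agreed-on schema:

import os
import yaml # pip install pyyaml

# Hypothetical config_user.yaml, maintained locally by each user:
#
#   external:
#     nytimes_raw: /path/to/Dropbox/gslab_data/raw/nytimes
#
with open('config_user.yaml') as f:
    config = yaml.safe_load(f)

# Resolve the external through the config instead of hard-coding a path.
raw_dir = config['external']['nytimes_raw'] # hypothetical key name
data_dir = os.path.join(raw_dir, 'data')    # only data/ (with its provenance) gets pulled
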
@yuchuan2016
Contributor

@stanfordquan @Shun-Yang @arosenbe , see below for my first proposal of the storage protocol. Can you take a look to see if you have any comments? Thanks!

For each directory in raw, we have at most the following four top-level directories, along with a README.txt:

  • orig: stores the original dataset as-is. No changes should be performed.
  • docs: contains any documentation related to the raw data, including codebooks, licenses, email correspondence, etc. If no such files exist, this directory can be absent.
  • source: contains a script named clean.xx, which should be light and easy to understand. It also has a script print_provenance.py, run after the cleaning, which produces provenance.txt in data. Ideally, I think we should make print_provenance a function in gslab_python and have only one script, clean.py, that does some light cleaning and calls print_provenance at the end. But sometimes other languages may be used to clean the dataset. Perhaps we can have a make.py that executes the cleaning script and then calls the print_provenance function?
  • data: the cleaned dataset along with a provenance.txt. We only need to pull this directory as externals.

The README.txt should contain

  • OVERVIEW: A one-sentence description of the dataset.
  • SOURCE: A description of when, where, how, and by whom the dataset was obtained. The goal is that anyone who follows the instructions exactly and has the right access should be able to obtain the dataset.
  • DESCRIPTION: A description of the contents of each file in orig, and what changes the cleaning step makes, if any.
  • NOTES: Other related notes not falling into the above category.

The print_provenance function would produce a provenance.txt stored in data with the following information:

  • Enumerate all file names in data together with the file size and last modification time.
  • Loop over all files in data and produce a checksum for each file.
  • Append the README.txt at the end.

If no cleaning step is needed, I'm slightly in favor of storing the dataset in data and not having a separate orig. I want to keep it consistent that all we ever pull is the data directory, which has the provenance.txt.
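
As a minimal sketch of the print_provenance step proposed above; the name, hash choice, and output format are illustrative rather than a final API:

import hashlib
import os
import time

def print_provenance(data_dir, readme_path, out_name = 'provenance.txt'):
    '''
    Illustrative only: list every file in data_dir with its size, last
    modification time, and checksum, then append the README verbatim.
    '''
    lines = ['path|bytes|last modified|md5']
    for dirpath, dirnames, filenames in os.walk(data_dir):
        dirnames.sort() # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            mtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(stat.st_mtime))
            with open(path, 'rb') as f: # reads the whole file; fine for a sketch
                checksum = hashlib.md5(f.read()).hexdigest()
            lines.append('%s|%d|%s|%s' % (os.path.relpath(path, data_dir),
                                          stat.st_size, mtime, checksum))
    with open(readme_path) as f:
        readme = f.read()
    with open(os.path.join(data_dir, out_name), 'w') as f:
        f.write('\n'.join(lines) + '\n\n==README verbatim==\n' + readme)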

@qlquanle
Contributor Author

qlquanle commented Aug 28, 2017

@yuchuan2016, this is a great start.


About the raw directory, I like the straightforward structure.

So does this mean we should never have a raw directory without a GSLab-approved provenance.log, even if we do no data manipulation? I very much endorse that. The only potential downside is suppose you have the following data flow:

Raw (DropBox) ---->  Cleaning (GitHub) ----> Release (DropBox) 
  ugly.log            provenance.log         provenance.log

Suppose that the cleaning step is so complicated that you need GitHub, so you put the cleaning code on GitHub. ugly.log is a log created by the original data provider, out of GSLab's control. Since the Raw directory has no code, you have two options:

  • orig and data are identical, and clean.xx is basically empty. This is obviously bad because you're storing duplicates.

  • Only store one copy (and name it orig or data). But then ugly.log is not GSLab-approved.


I have no comment about README.txt.


What is the size limit of provenance.txt?

As an example, this is the ldc_nytimes equivalent of ugly.log. With 1.8 million articles, their file list is 50.2 megabytes. That's a lot to carry around, especially because we are going to require analysis code repos to have a committed log of all their external assets. I prefer having a dual structure: a hash/signature/checksum of the directory, and then the complete log stored in the same location with your proposed information (file list, per-file checksums, README.md).

It is also important to note that creating this provenance.txt with the proposed information is not trivial, and the file itself could be massive. The current NYT_archive data have roughly 165 × 12 × 30 × 200 ≈ 10 million files, for example.
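
One way to get that dual structure is to roll the per-file checksums up into a single directory-level digest that travels with the external, while the full file list stays next to the raw data. A rough sketch, assuming MD5 just for concreteness:

import hashlib
import os

def directory_digest(root):
    '''
    Combine per-file checksums into one directory-level checksum.
    The per-file lines could be written out as the complete log; only
    the final hexdigest needs to travel with the analysis repo.
    '''
    rollup = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort() # deterministic order so the digest is reproducible
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            file_hash = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    file_hash.update(chunk)
            # Hash "relative path|file digest" so renames change the rollup too.
            entry = '%s|%s\n' % (os.path.relpath(path, root), file_hash.hexdigest())
            rollup.update(entry.encode('utf-8'))
    return rollup.hexdigest()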

@arosenbe
Contributor

Nice work @yuchuan2016! I like the structure of provenance, and I especially appreciate the idea of including the README. I have a few comments:

  • I think description should also (possibly first) include a description of the files in data. This is the information most relevant to users, so I think it should be emphasized.
  • I agree that our provenance-producing code should live on GitHub, perhaps in gslab_python/gslab_misc. However, this creates a common external dependency across all our raw data directories, and the dependency may be version-dependent as well. Do you have a sense of the appropriate way to document this?
  • I'd lean toward something like a make.py in source to help keep the syntax constant across directories.
  • Relatedly, does clean.xx build the cleaned dataset directly in data, or do we build it somewhere else and then move it in afterward? How does this generalize to the no-cleaning case? Ideally, the same/similar steps can be used in every make.py.

@arosenbe
Contributor

@stanfordquan,

I'd fit your situation above into @yuchuan2016's framework like this:

  1. We have two data assets: the raw data and the cleaned version.
  2. Produce the raw data asset by putting the data in data and the ugly.log in docs before creating the provenance.
  3. Create the second asset on GitHub. This asset gets its own provenance, which incorporates the asset from the first.
  4. Release the second asset to gslab_data/derived.

Let me know if this doesn't make sense.

@qlquanle
Contributor Author

@arosenbe that's what I was proposing in the second bullet point. The problem there is that you are storing the raw data folder without a GSLab-approved provenance.

@yuchuan2016
Contributor

@stanfordquan @arosenbe , my thought would be

  • If there is a cleaning step, we store the dataset in orig, execute the cleaning code and print_provenance in source/make.py, and store the cleaned data and provenance.log in data.
  • If there is no cleaning step (or the cleaning step is so complicated that we do it on GitHub), we directly store the dataset in data, put ugly.log in docs, only execute print_provenance in source/make.py, and store provenance.log in data.

So we always have a GSLab-approved provenance.log in a raw directory. Does this make sense?

@qlquanle
Contributor Author

@yuchuan2016 I see. In that case, the clean.xx repo on GitHub would call the data from orig?

@yuchuan2016
Contributor

@stanfordquan , I think we should always call data from data. When there is no cleaning step on Dropbox, we simply put the original dataset in data instead of orig.

@qlquanle
Contributor Author

That solves my concern about the directory structure. Can you take a look at my comment about file size?

@arosenbe
Contributor

arosenbe commented Aug 29, 2017

@yuchuan2016,

I like the idea of always calling the data from the same repo. I do worry that the instructions may get confusing:

  • When data require cleaning, never directly edit files in orig but do edit files in data.
  • When data do not require cleaning, never directly edit files in data.

This seems to be inviting raw-data issues since we allow edits in a location based on a feature of the data: their cleanliness. Perhaps each location should have a dedicated purpose. We put "frozen" data in orig and data for export in data, even when both directories have the same contents.

I'm not wild about double storing the data, but thought I'd bring this up.

@yuchuan2016
Contributor

@arosenbe, that's a good point. I agree that a consistent purpose for each directory is also desirable.

What if we follow the practice in SVN? In most sub-directories of raw in SVN, there is no code. So we would not allow any cleaning code to live in raw. We would only have orig and docs, plus a top-level make.py that calls print_provenance. We could put the cleaning code in derived/xx/source on Dropbox or GitHub and make a release to derived/xx/data. The structure is as follows. The problem is that we would still need to pull from raw/xx/orig for datasets that do not need cleaning, and from derived/xx/data for datasets that do.

  • raw
    • informative_name
      • docs
      • orig
      • readme.txt
      • make.py
  • derived
    • informative_name
      • source (this could live on Dropbox or Github)
      • data

@arosenbe
Contributor

@yuchuan2016, that structure makes sense to me. I do foresee problems with large datasets, especially ones that don't need cleaning. Since we store the data under two different directories, we'll have to rclone it (or do something similar) from one location to another. Levi's been trying to rclone a 3TB data store to Dropbox, and the timeline's measured in weeks. It'd be great if there were a lighter-weight solution (not sure there is).

@yuchuan2016
Contributor

yuchuan2016 commented Aug 29, 2017

Per team discussion, we want to maintain the orig/source/data/docs structure for each sub-directory in raw. If there is no cleaning code, source/build.py would contain only print_provenance, and data would contain a symbolic link to orig. We need to see if rclone works fine with symbolic links.

For the print_provenance function, we want to produce two files.

  • A short provenance.log that contains the md5sum of the whole data directory, its size, the time it was created/last modified, the machine on which it was modified, and the README.
  • A long provenance_complete.log that contains the above information for each file/subdirectory up to a certain depth. We need to allow flexible options.

When we pull data from raw to other places, we only pull the short-version provenance.log. If something goes wrong, we can go back to raw to check the long provenance_complete.log.

@stanfordquan @Shun-Yang @arosenbe FYI in case I missed something. @arosenbe, it would be appreciated if you could take the lead on writing print_provenance (or another name you see as appropriate).
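
As a sketch of the no-cleaning case under this scheme, source/build.py could symlink the originals into data before stamping the provenance; the layout and names below are assumptions:

import os

# Hypothetical layout: this script lives in source/, next to orig/, data/, docs/.
HERE = os.path.dirname(os.path.abspath(__file__))
ORIG = os.path.join(HERE, '..', 'orig')
DATA = os.path.join(HERE, '..', 'data')

if not os.path.isdir(DATA):
    os.makedirs(DATA)

# Symlink each original file into data/ instead of copying it.
for name in sorted(os.listdir(ORIG)):
    link = os.path.join(DATA, name)
    if not os.path.lexists(link):
        os.symlink(os.path.relpath(os.path.join(ORIG, name), DATA), link)

# print_provenance would then run on data/ to produce the short provenance.log
# and the long provenance_complete.log described above.

rclone's --copy-links option would then transfer the files the links point to rather than the links themselves.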

@yuchuan2016
Contributor

rclone has an option --copy-links which works fine with symlinks.

@yuchuan2016
Contributor

@stanfordquan @arosenbe @Shun-Yang, I updated my deliverable per the discussion during lunch:

Directory structure:

The top level structure for a sub-directory in raw is:

  • /orig/: stores the original dataset when it needs some cleaning. No changes should be performed on the dataset at this step.
  • /source/: contains an optional clean.xx that performs a cleaning step light enough to be easy to understand, and a make.py that executes the cleaning script as well as print_provenance and sends the cleaned data and provenance.txt to data. If no cleaning step is needed, make.py only executes print_provenance.
  • /data/: stores the original dataset when it does not need cleaning, or the cleaned dataset produced from source, together with a provenance.txt.
  • /docs/: contains any documentation related to the raw data, including codebooks, licenses, email correspondence, etc. If no such files exist, this directory can be absent.
  • README.txt

So the minimum structure would be /source/, /data/, and README.txt, when no cleaning is needed and no additional documentation exists. Anything called by other directories should be pulled from /data/.

README.txt

A README.txt should exist at the top level that contains the following information

  • Overview: A one-sentence description of the dataset.
  • Description: A detailed description of the contents of orig and data, and what changes the cleaning step makes, if any.
  • Source: A description of when, where, how, and by whom the dataset was obtained. The goal is that anyone who follows the instructions exactly and has the right access should be able to obtain the dataset.
  • Notes: Other related notes not falling into the above category.

Provenance.txt

The print_provenance function should live in gslab_python/gslab_misc and produce a provenance.txt with the following information (subject to change):

  • A summary of the directory, including the total number of files, the modification time, the size and the checksum of the directory.
  • Append the README.txt.
  • Loop over all files in the directory and print the file name, modification time, file size, and checksum for each. Stop printing when the number of files reaches a limit (say, 1,000).
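
A sketch of the make.py described under Directory structure above, covering both the cleaning and no-cleaning cases; the clean.py name and the print_provenance import are assumptions (the function does not exist in gslab_python yet):

import os
import subprocess

# Hypothetical import once the function lands in gslab_python/gslab_misc:
# from gslab_misc import print_provenance

HERE = os.path.dirname(os.path.abspath(__file__))
DATA = os.path.join(HERE, '..', 'data')
README = os.path.join(HERE, '..', 'README.txt')
CLEAN = os.path.join(HERE, 'clean.py') # optional cleaning script

if os.path.isfile(CLEAN):
    # Cleaning case: clean.py reads from ../orig and writes cleaned files to ../data.
    subprocess.check_call(['python', CLEAN])
# No-cleaning case: the original files already sit in ../data, so nothing to run here.

# Either way, finish by stamping the provenance next to the data (call signature assumed):
# print_provenance(DATA, README, os.path.join(DATA, 'provenance.txt'))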

@arosenbe
Contributor

arosenbe commented Aug 30, 2017

I've mocked up a script to produce provenance.txt. It uses the MurmurHash3 Python implementation to produce checksums. I got the idea from this SO response. It also uses a backport of scandir from Python 3, which works like os.listdir but also returns file metadata like ls -l. Returning metadata when listing a directory makes os.walk-like algorithms much faster. That's exactly what we're using here, except pulling even more metadata!

I benchmarked the script against three repositories: small, medium, and large. The small repo is gslab_python: 186 files, 11 MB. For the medium, I use media-productivity-events: 341 files, 662 MB. The large is my local ldc-nytimes repo: 1.8 million files, 23 GB. Each entry below gives the time needed to create, and the space needed to store, the provenance.txt under the conditions on the axes.

| Repo   | Directory-level  | Add file paths, sizes, and modification dates | Add checksum     |
|--------|------------------|-----------------------------------------------|------------------|
| Small  | < 0.1 sec., 2 KB | < 0.1 sec., 42 KB                             | < 1 sec., 65 KB  |
| Medium | < 1 sec., 2 KB   | < 1 sec., 37 KB                               | 8 sec., 57 KB    |
| Large  | 150 sec., 864 B  | 170 sec., 114 MB                              | 4000 sec., 187 MB |

I've attached the most complete provenance for the small repo and the directory-level provenance for the large repo.

The script:
import os
import sys
import datetime

import scandir # pip install scandir
import mmh3 # pip install mmh3

def make_provenance(start_path, readme_path, provenance_path, 
                    include_details = True, include_checksum = True):
    file_details = determine_file_details(include_details, include_checksum)

    total_size, num_files, last_mtime, details = scan_wrapper(
        start_path, include_details, include_checksum, file_details)

    write_heading(start_path, provenance_path)
    write_directory_info(provenance_path, total_size, num_files, last_mtime)
    if include_details:
        write_detailed_info(provenance_path, details)
    write_readme(readme_path, provenance_path)

def determine_file_details(include_details, include_checksum):
    '''
    Determine if a checksum entry appears in the detailed file-level information.
    '''
    file_details = 'path|bytes|most recent modification time'
    if include_checksum:
        file_details = '%s|MurmurHash3' % file_details
    return [file_details]

def scan_wrapper(start_path, include_details, include_checksum, file_details):
    '''
    Walk through start_path and get info on files in all subdirectories.
    Walk in the same order as os.walk.
    Scan iteratively (recursion-like) without overflowing the stack on large directories.
    '''
    total_size, num_files, last_mtime, file_details, dirs = scan(
        start_path, include_details, include_checksum, file_details)
    while dirs:
        new_start_path = dirs.pop(0) 
        total_size, num_files, last_mtime, file_details, dirs = scan(
            new_start_path, include_details, include_checksum, file_details, 
            dirs, total_size, num_files, last_mtime)
    return total_size, num_files, last_mtime, file_details

def scan(start_path, include_details, include_checksum, file_details,
         dirs = None, total_size = 0, num_files = 0, last_mtime = 0): 
    '''
    Grab file and create directory info from start path. 
    Also return list of unvisited subdirectories under start_path. 
    '''
    if dirs is None: # Avoid a shared mutable default list across calls.
        dirs = []
    print start_path
    entries = scandir.scandir(start_path)
    for entry in entries:
        if entry.is_dir(follow_symlinks = False): # Store subdirs
            dirs.append(entry.path)
        elif entry.is_file():
            # Get file info
            path = entry.path
            stat = entry.stat()
            size = stat.st_size
            mtime = datetime.datetime.fromtimestamp(stat.st_mtime).strftime('%Y-%m-%d %H:%M:%S')
            # Incorporate file info into directory info.
            total_size += size            
            num_files += 1
            last_mtime = max(last_mtime, mtime)
            # Optional detailed file information
            if include_details: 
                line = '%s|%s|%s' % (path, size, mtime)
                if include_checksum:
                    with open(path, 'rb') as f: # Read bytes so checksums don't depend on newline translation.
                        checksum = str(mmh3.hash128(f.read(), 2017))
                    line = '%s|%s' % (line, checksum)
                file_details.append(line)
    return total_size, num_files, last_mtime, file_details, dirs

# May need to replace the \n with os.linesep for windows. 
def write_heading(start_path, provenance_path):
    '''
    Write standard heading for provenance: what and for where it is. 
    '''
    out = 'GSLab directory provenance\ndirectory: %s\n' % os.path.abspath(start_path) 
    with open(provenance_path, 'wb') as f:
        f.write(out)

def write_directory_info(provenance_path, total_size, num_files, last_mtime):
    '''
    Write directory-level information to provenance. 
    '''
    out = 'total bytes: %s\n' % total_size + \
          'number of files: %s\n' % num_files + \
          'most recent modification time: %s\n' % last_mtime
    with open(provenance_path, 'ab') as f:
        f.write('\n==Directory information=====\n')
        f.write(out)

def write_detailed_info(provenance_path, details): # Need to implement a line limit here.
    '''
    Writes file-level information to provenance.
    '''
    out = '%s\n' % '\n'.join(details)
    with open(provenance_path, 'ab') as f:
        f.write('\n==File information=====\n')
        f.write(out)

def write_readme(readme_path, provenance_path):
    '''
    Writes readme to provenance.
    '''
    with open(readme_path, 'rU') as f:
        out = '%s' % f.read()
    with open(provenance_path, 'ab') as f:
        f.write('\n==README verbatim=====\n')
        f.write(out)

I made a mistake in the provenance.txt I uploaded. I'm not going to fix it.
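
One way to add the line limit flagged in write_detailed_info above; this is not part of the posted script, and the 1,000-file default just mirrors the earlier proposal:

def write_detailed_info(provenance_path, details, max_lines = 1000):
    '''
    Writes file-level information to provenance, truncating after max_lines files.
    details[0] is the header row from determine_file_details.
    '''
    shown = details[:max_lines + 1]
    out = '%s\n' % '\n'.join(shown)
    omitted = len(details) - len(shown)
    if omitted > 0:
        out = '%s... %s additional files omitted ...\n' % (out, omitted)
    with open(provenance_path, 'ab') as f:
        f.write('\n==File information=====\n')
        f.write(out)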

@yuchuan2016
Contributor

@arosenbe @stanfordquan @Shun-Yang , whoever has time could take a look at my edits of the RA manual to see if there is anything to change/improve. Thanks!

@qlquanle
Contributor Author

On it. I'll commit my own edits on top of your version, and then you can take a look at the diff and finalize the RA manual changes.

@qlquanle
Contributor Author

@yuchuan2016 here are the changes, only in language/style. I agree with the spirit of the changes.

@yuchuan2016
Contributor

@stanfordquan , I agree with all changes!
