Raw data storage and transfer protocol #59
@stanfordquan @Shun-Yang @arosenbe, see below for my first proposal of the storage protocol. Can you take a look to see if you have any comments? Thanks!

For each directory in
The
The
If no code cleaning step is needed, I'm slightly in favor of storing the dataset in |
@yuchuan2016, this is a great start. About the

So does this mean we should never have a
Suppose that the cleaning step is so complicated that you need GitHub, so you put the cleaning code on GitHub.
I have no comment about

What is the size limit of

As an example, this is the

It is also important to note that creating this |
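On the size-limit question, one option is to cap the file-level section of provenance.txt at a fixed number of lines. A hypothetical Python 3 sketch (the helper name and the default cap of 1000 are my inventions, not a GSLab rule):

```python
def cap_details(details, max_lines=1000):
    """Truncate file-level detail lines, recording how many were omitted."""
    if len(details) <= max_lines:
        return details
    omitted = len(details) - max_lines
    return details[:max_lines] + ['... %d more lines omitted ...' % omitted]

print(cap_details(['a', 'b', 'c', 'd'], max_lines=2))  # → ['a', 'b', '... 2 more lines omitted ...']
```

This keeps provenance.txt bounded no matter how many files the directory holds, at the cost of losing per-file detail past the cap.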
Nice work @yuchuan2016! I like the structure of provenance, and I especially appreciate the idea of including the README. I have a few comments:
|
I'd fit your situation above into @yuchuan2016's framework like this:
Let me know if this doesn't make sense. |
@arosenbe that's what I was proposing in the second bullet point. The problem there is that you are storing the raw data folder without a GSLab-approved provenance. |
@stanfordquan @arosenbe , my thought would be
So we always have a GSLab-approved
@yuchuan2016 I see. In that case, the |
@stanfordquan , I think we should always call data from |
That solves my concern about the directory structure. Can you take a look at my comment about file size? |
I like the idea of always calling the data from the same repo. I do worry that the instructions may get confusing:

This seems to be inviting raw-data issues, since we allow edits in a location based on a feature of the data: their cleanliness. Perhaps each location should have a dedicated purpose. We put "frozen" data in

I'm not wild about double storing the data, but thought I'd bring this up. |
@arosenbe, it's a good point. I agree that a consistent purpose for each directory is also desirable. What if we follow the practice in svn? In most sub-directories in
|
@yuchuan2016, that structure makes sense to me. I do foresee problems with large datasets, especially ones that don't need cleaning. Since we store the data under two different directories, we'll have to rclone it (or do something similar) from one location to another. Levi's been trying to rclone a 3TB data store to Dropbox, and the timeline's measured in weeks. It'd be great if there were a lighter-weight solution (not sure there is). |
Per team discussion, we want to maintain the

For the
When we pull data from

@stanfordquan @Shun-Yang @arosenbe FYI in case I missed something. @arosenbe, it would be appreciated if you want to take the lead on writing the |
|
@stanfordquan @arosenbe @Shun-Yang, I updated my deliverable per the discussion during lunch.

Directory structure:
The top level structure for a sub-directory in
So the minimum structure would be

README.txt
A README.txt should exist at the top level that contains the following information
Provenance.txt
The
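For illustration, the minimum structure could be checked mechanically. A Python 3 sketch, assuming only that README.txt and provenance.txt must sit at the top level (the helper name is my own):

```python
import os
import tempfile

REQUIRED_FILES = ['README.txt', 'provenance.txt']  # minimum structure proposed above

def check_minimum_structure(directory):
    """Return the required top-level files missing from directory."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

# Demo: an empty directory is missing both required files.
with tempfile.TemporaryDirectory() as d:
    print(check_minimum_structure(d))  # → ['README.txt', 'provenance.txt']
```

A check like this could run before any commit to the raw data store, so a directory can't enter the store without its README and provenance.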
|
I've mocked up a script to produce provenance.txt. It uses the MurmurHash3 Python implementation to produce checksums. I got the idea from this SO response. It also uses the scandir backport of os.scandir. I benchmarked the script against three repositories: small, medium, and large. The small repo is gslab_python: 186 files, 11 MB. For the medium, I use media-productivity-events: 341 files, 662 MB. The large is my local ldc-nytimes repo: 1.8 million files, 23 GB. Each entry below is the time needed to create, and the space needed to store, the provenance.txt under the conditions on the axes.
I've attached the most complete provenance for the small repo and the directory-level provenance for the large repo. The script (Python 2):

import os
import sys
import datetime
import scandir  # pip install scandir
import mmh3  # pip install mmh3

def make_provenance(start_path, readme_path, provenance_path,
                    include_details = True, include_checksum = True):
    file_details = determine_file_details(include_details, include_checksum)
    total_size, num_files, last_mtime, details = scan_wrapper(
        start_path, include_details, include_checksum, file_details)
    write_heading(start_path, provenance_path)
    write_directory_info(provenance_path, total_size, num_files, last_mtime)
    if include_details:
        write_detailed_info(provenance_path, details)
    write_readme(readme_path, provenance_path)

def determine_file_details(include_details, include_checksum):
    '''
    Determine if a checksum entry appears in the detailed file-level information.
    '''
    file_details = 'path|bytes|most recent modification time'
    if include_checksum:
        file_details = '%s|MurmurHash3' % file_details
    return [file_details]

def scan_wrapper(start_path, include_details, include_checksum, file_details):
    '''
    Walk through start_path and get info on files in all subdirectories.
    Walk in the same order as os.walk.
    Scan iteratively to be recursive-like without overflowing the stack
    on large directories.
    '''
    total_size, num_files, last_mtime, file_details, dirs = scan(
        start_path, include_details, include_checksum, file_details)
    while dirs:
        new_start_path = dirs.pop(0)
        total_size, num_files, last_mtime, file_details, dirs = scan(
            new_start_path, include_details, include_checksum, file_details,
            dirs, total_size, num_files, last_mtime)
    return total_size, num_files, last_mtime, file_details

def scan(start_path, include_details, include_checksum, file_details,
         dirs = None, total_size = 0, num_files = 0, last_mtime = ''):
    '''
    Grab file info and directory info from start_path.
    Also return the list of unvisited subdirectories under start_path.
    '''
    if dirs is None:  # Avoid a shared mutable default argument
        dirs = []
    print start_path
    entries = scandir.scandir(start_path)
    for entry in entries:
        if entry.is_dir(follow_symlinks = False):  # Store subdirs
            dirs.append(entry.path)
        elif entry.is_file():
            # Get file info
            path = entry.path
            stat = entry.stat()
            size = stat.st_size
            mtime = datetime.datetime.fromtimestamp(stat.st_mtime).strftime('%Y-%m-%d %H:%M:%S')
            # Incorporate file info into directory info.
            total_size += size
            num_files += 1
            # The ISO-style timestamp strings sort lexicographically.
            last_mtime = max(last_mtime, mtime)
            # Optional detailed file information
            if include_details:
                line = '%s|%s|%s' % (path, size, mtime)
                if include_checksum:
                    # Read in binary mode so the checksum does not depend
                    # on platform newline translation.
                    with open(path, 'rb') as f:
                        checksum = str(mmh3.hash128(f.read(), 2017))
                    line = '%s|%s' % (line, checksum)
                file_details.append(line)
    return total_size, num_files, last_mtime, file_details, dirs

# May need to replace the \n with os.linesep for Windows.
def write_heading(start_path, provenance_path):
    '''
    Write the standard heading for provenance: what it is and where it is for.
    '''
    out = 'GSLab directory provenance\ndirectory: %s\n' % os.path.abspath(start_path)
    with open(provenance_path, 'wb') as f:
        f.write(out)

def write_directory_info(provenance_path, total_size, num_files, last_mtime):
    '''
    Write directory-level information to provenance.
    '''
    out = 'total bytes: %s\n' % total_size + \
          'number of files: %s\n' % num_files + \
          'most recent modification time: %s\n' % last_mtime
    with open(provenance_path, 'ab') as f:
        f.write('\n==Directory information=====\n')
        f.write(out)

def write_detailed_info(provenance_path, details):  # Need to implement a line limit here.
    '''
    Write file-level information to provenance.
    '''
    out = '%s\n' % '\n'.join(details)
    with open(provenance_path, 'ab') as f:
        f.write('\n==File information=====\n')
        f.write(out)

def write_readme(readme_path, provenance_path):
    '''
    Write the README verbatim into provenance.
    '''
    with open(readme_path, 'rU') as f:
        out = f.read()
    with open(provenance_path, 'ab') as f:
        f.write('\n==README verbatim=====\n')
        f.write(out)

I made a mistake in the provenance.txt I uploaded. I'm not going to fix it. |
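mmh3 is a third-party dependency; where installing it is a problem, a stdlib hash from hashlib could stand in for the checksum column. A Python 3 sketch of my own (not part of the script above), reading in chunks so multi-GB files don't exhaust memory:

```python
import hashlib

def file_checksum(path, algorithm='sha1', chunk_size=1 << 20):
    """Hash a file in binary mode, one chunk at a time."""
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

As with the mmh3 version, binary mode matters: universal-newline reads would make the checksum depend on the platform's line endings.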
@arosenbe @stanfordquan @Shun-Yang, whoever has time, could you take a look at my edits of the RA manual to see if there is anything to change or improve? Thanks! |
On it. I'll commit my own edits from your version, and then you can take a look at the diff and finalize the RA manual changes. |
@yuchuan2016 here are the changes, only in language/style. I agree with the spirit of the changes. |
@stanfordquan, I agree with all the changes! Summary:
|
Develop and document protocol for raw data storage. (8h)
- What provenance.log looks like (e.g., checksum, hash, file enumeration + file size)

Document protocol for data transfer. (2h)
- config_user.yaml.