
Comparison of bucket indexing tools #9

Open
charlesbluca opened this issue Nov 9, 2020 · 10 comments

Comments

@charlesbluca
Member

charlesbluca commented Nov 9, 2020

As discussed in #7, regularly generating an index of all the files in a bucket/directory could be very useful for:

  • Generating/updating catalogs or databases of relevant datasets
  • Keeping track of files for synchronization purposes

There are a lot of tools that could do this work - some exclusive to specific cloud providers, others not. Some of these tools include:

  • gsutil (supports both Google Cloud and S3)
  • Rclone (supports a variety of cloud providers, including Google Cloud and S3)
  • AWS CLI (supports only S3)
  • S3P (supports only S3)

I tested the above tools on both Google Cloud and S3 (when relevant) to get a sense of which would be most useful for listing the entirety of a large bucket. Some basic parameters of the testing include:

  • Target bucket(s) - Pangeo's CMIP6 buckets in both Google Cloud and S3 storage; both are ~550 TB of data comprising some 20,000,000+ files
  • Command - a flat listing of each bucket's entire contents with size and modification time (relevant mainly for synchronization purposes); the invocations used were roughly those sketched below this list
  • Output redirection - currently all output is written to a file unaltered; this may change if we want to transform the output with something like sed before writing it to file
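
For reference, the invocations used were roughly of the following shape (bucket and remote names are placeholders, and the exact flags may have differed slightly between runs):

gsutil ls -l "gs://<bucket>/**" > gsutil_listing.txt
rclone lsl <remote>:<bucket> > rclone_listing.txt
aws s3 ls s3://<bucket> --recursive > awscli_listing.txt
s3p ls --bucket <bucket> > s3p_listing.txt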

The output of these tests can be found here. Some observations:

  • For S3 listing, S3P was by far the fastest, running 4-6x faster than the AWS CLI listing (~40 min versus ~165 min)
  • For Google Cloud listing, Rclone was by far the fastest, running nearly 4x faster than the gsutil listing (~47 min versus ~173 min)
  • Both gsutil and Rclone had trouble with S3 storage, failing to list the bucket within the 6 hour timeout; requesting modification times likely influenced these results, as both tools list significantly faster when that information is excluded

Obviously additional testing of more cloud listing tools (MinIO client for example) would be ideal, but these results provide some motivation to dig deeper into Rclone and S3P to index CMIP6 data in Google Cloud and S3 storage, respectively.

@charlesbluca
Member Author

If we use these tools for synchronization purposes, which would entail diffing two buckets' index files to check what must be transferred or deleted, it is important that the two files format size, modification time, and directory structure identically so that they can be compared properly.

Because none of the tools I listed above formats its listings the same way, to use indexes effectively for bucket synchronization we will either have to:

  • Use the same listing tool for all buckets we intend to keep synchronized
  • Reformat all indexes to follow one standard format

The second option seems preferable, as there are a variety of tools to edit the listing output before or after it is written to file, and the extra time this would take seems negligible compared to the time it would take to use a non-ideal listing tool.
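
As a rough illustration of the second option, here is a minimal sketch of normalizing two listings to a common "<size> <path>" format before diffing them, assuming rclone lsl prints "<size> <date> <time> <path>" and aws s3 ls --recursive prints "<date> <time> <size> <path>", and that paths contain no whitespace (file names here are hypothetical; modification times would need similar treatment):

# normalize the Google Cloud listing from rclone to "<size> <path>"
awk '{print $1, $4}' rclone_listing.txt | sort > gcs_index.txt
# normalize the S3 listing from AWS CLI to "<size> <path>"
awk '{print $3, $4}' awscli_listing.txt | sort > s3_index.txt
# any output here is a file that differs between the two indexes
diff gcs_index.txt s3_index.txt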

@charlesbluca
Member Author

After some further investigation of the listing output, I see there is another issue with using index files for synchronization purposes - different listing tools have different ideas of what to list for modification time!

In particular, Rclone seems to work off of an internal modification time, which is unaltered when a file is copied to another bucket - if a file was last modified on Google Cloud on January 10th, and then copied over to S3 on a later date, Rclone would say both files were last modified on January 10th. In contrast, gsutil and AWS CLI (and consequently S3P) use the time a file was uploaded to its containing bucket as the modification time - in the previous example, the file on Google Cloud would've last been modified on January 10th, and the file on S3 would've last been "modified" on whatever date it was copied over.

A workaround to this problem would be to rely instead upon the checksums of files to test whether they are identical, in particular MD5 hashes, which are supported on both Google Cloud and S3. The ability to generate file checksums is limited when using AWS CLI (only for individual files) and potentially unavailable in S3P (hard to tell, as the documentation is sparse), but it can be done while listing a bucket using gsutil or Rclone:

gsutil hash -m ... 
rclone hashsum MD5 ...

Rclone seems to be able to generate hashes on S3 storage significantly faster than modification times, which could make it a viable option for S3.
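
For example, assuming both buckets are configured as rclone remotes (called gcs and s3 here purely as placeholder names), a checksum-based comparison could look something like:

rclone hashsum MD5 gcs:<bucket> | sort > gcs_md5.txt
rclone hashsum MD5 s3:<bucket> | sort > s3_md5.txt
# lines unique to either file correspond to objects that differ or are missing
comm -3 gcs_md5.txt s3_md5.txt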

@rabernat
Contributor

Here's another one to try!

https://twitter.com/thundercloudvol/status/1326348841965264896

@charlesbluca
Member Author

Thanks for the info! I'll add cloud-files and MinIO Client to the general listing tests; maybe cloud-files could also be useful from Python code for tasks not suited to gcsfs or s3fs.

As for my testing of MD5 checksum listings, Rclone was still significantly faster than gsutil, and much faster on Google Cloud than on S3 (30 minutes vs. 210 minutes). However, it was able to complete the S3 listing within the timeout, which could potentially make it useful depending on how often we plan on indexing the buckets.

@charlesbluca
Member Author

CloudFiles and MinIO Client both seem to perform similarly to Rclone when it comes to Google Cloud storage, getting a listing in ~30 minutes (no mod times or checksums). Unfortunately, they also share the slower listing times when it comes to S3, taking around 3-4x longer.

So far, the optimal indexing tools seem to be Rclone for Google Cloud and S3P for S3 if synchronization isn't a concern, and Rclone listing checksums if it is (though this is still very slow in S3).

@charlesbluca
Member Author

It looks like S3P is able to generate MD5 checksums using its each command instead of a standard ls; in fact, roughly the same output as rclone hashsum MD5 can be generated for an S3 bucket using:

s3p each --bucket target-bucket --map "js:(item) => console.log(item.ETag.slice(1,-1), item.Key)" --quiet

Since this tool is backed by s3api, there might be an equivalent to this command using AWS CLI, but I doubt that it would run nearly as fast. I'm going to test this S3P listing and look into CloudFiles and MinIO Client for similar functionality.
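
For what it's worth, a rough s3api-based equivalent (untested here, and note the ETags come back wrapped in quotes that would still need stripping) might look like:

aws s3api list-objects-v2 --bucket target-bucket --query 'Contents[].[ETag, Key]' --output text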

@rabernat
Contributor

Thanks for continuing to work on this Charles!

@charlesbluca
Member Author

The results of the S3P listing are significant - we are able to generate a list of all files with checksums in the same amount of time (sometimes less!) it would take to simply list the S3 bucket keys. This means we should be able to generate indexes of both buckets in around 30 minutes each, with or without checksums!

I'll continue testing on CloudFiles/MinIO to see if either can generate checksums for Google Cloud faster than Rclone, but as it is I expect to see some performance improvements in the synchronization process now that we're able to generate S3 checksums much faster.

@rabernat
Contributor

One final thing to keep in mind is cost. Each of these API requests has a tiny cost associated with it. Along with the speed information, it would be good to have a ballpark figure on how much each option costs.

@charlesbluca
Member Author

charlesbluca commented Nov 18, 2020

Good point - pricing is definitely the biggest motivator behind how often we plan to run these scripts.

Based on the pricing of S3 and Cloud Storage, it looks like a list operation is priced at $0.005 per 1,000 requests (or $5 per 1,000,000 requests). I'm still looking for clear numbers on how many objects a Cloud Storage list operation returns, but going off of S3's numbers this would be 1,000 objects per list operation.

Assuming our CMIP6 buckets have somewhere around 25,000,000 objects each now, we can get a ballpark figure of:

2 buckets * (25,000,000 objects / 1,000 objects per request) * ($0.005 / 1,000 requests) = $0.25

Per listing of both buckets, split roughly equally across the two cloud providers. I don't expect this number to change dramatically from tool to tool, since from my understanding they tend to differ not in how many requests they send to get the listing, but in how those requests are sent (serial vs. parallel). In the case of my Rclone sync workflow, this would mean that (ignoring egress fees) listing out the buckets for comparison 4x daily costs roughly $30 monthly, versus roughly $7.50 monthly if we only did this once daily.
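
Spelling out those monthly figures with the $0.25 per-listing estimate:

$0.25 per listing * 4 listings per day * 30 days = $30.00 per month
$0.25 per listing * 1 listing per day * 30 days = $7.50 per month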

Another thing to take into account is egress fees, i.e. the cost of downloading the index files, which would likely be in excess of 2 GB each (although compression is definitely an option here). In Cloud Storage, general egress (downloading a file to a non-cloud-associated resource) is $0.12 / GB, while in S3 it is $0.09 / GB. We can see from this that the total egress cost of downloading both index files to one place would exceed the cost of listing out the buckets themselves. This serves as a motivator to move most of our cataloging/synchronization services to a GCP/AWS resource, which could bring the egress fees down to minimal or even free (although we would still need to pay full egress fees for whichever index file has to be downloaded to the other cloud provider).
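
Using the ~2 GB per index estimate, a single download of both index files would cost roughly:

(2 GB * $0.12 / GB) + (2 GB * $0.09 / GB) = $0.24 + $0.18 = $0.42

which is indeed more than the ~$0.25 it costs to list both buckets in the first place.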
