Copy from S3 is slow #483

Closed
vitalyk-multinarity opened this issue Aug 27, 2023 · 7 comments
Labels
question Further information is requested

Comments

@vitalyk-multinarity

Mountpoint for Amazon S3 version

mount-s3 1.0.0

AWS Region

eu-central-1

Describe the running environment

Running on Ubuntu 22.04. The S3 bucket and the EC2 instance are in the same region.

What happened?

Copying 3.7 GB of data from the S3 mount point to a local directory takes more than 10 minutes, while 'aws s3 cp s3://my-test-bucket /tmp' takes 1 minute.

Using this command:

time cp -r mount-point/* /tmp/

Relevant log output

Aug 27 10:09:27 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=18 ino=2 fh=2 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/") has an invalid name and will be unavailable
Aug 27 10:09:27 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=24 ino=3 fh=3 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/") has an invalid name and will be unavailable
Aug 27 10:09:27 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=30 ino=4 fh=4 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/baseline_dataset/") has an invalid name and will be unavailable
Aug 27 10:09:27 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=36 ino=5 fh=5 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/baseline_dataset/0685_58_home_direct_light_on_keyboard/") has an invalid name and will be unavailable
Aug 27 10:10:50 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=3498 ino=2 fh=8 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/") has an invalid name and will be unavailable
Aug 27 10:10:50 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=3506 ino=3 fh=9 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/") has an invalid name and will be unavailable
Aug 27 10:10:50 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=3514 ino=4 fh=10 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/baseline_dataset/") has an invalid name and will be unavailable
Aug 27 10:10:50 ip-172-31-9-108 mount-s3[38410]: [WARN] readdirplus{req=3522 ino=5 fh=11 offset=0}: mountpoint_s3::inode::readdir: file '' (full key "Projects/DVT/baseline_dataset/0685_58_home_direct_light_on_keyboard/") has an invalid name and will be unavailable
Aug 27 10:10:58 ip-172-31-9-108 mount-s3[38410]: [WARN] mountpoint_s3::prefetch::part_queue: closed channel
Aug 27 10:10:59 ip-172-31-9-108 mount-s3[38410]: message repeated 3189 times: [ [WARN] mountpoint_s3::prefetch::part_queue: closed channel]
vitalyk-multinarity added the "bug" (Something isn't working) label on Aug 27, 2023
@monthonk
Contributor

Hi, we are investigating this problem. What we can share right now is that we ran some tests and noticed that the kernel may try to improve copy performance by sending readahead requests to Mountpoint, but they are interpreted as random reads and end up disrupting Mountpoint's prefetcher logic.

Could you share more info about how many files you copied and what mount options you configured on Mountpoint?

@vitalykarasik

vitalykarasik commented Aug 29, 2023

> Hi, we are investigating this problem. What we can share right now is that we ran some tests and noticed that the kernel may try to improve copy performance by sending readahead requests to Mountpoint, but they are interpreted as random reads and end up disrupting Mountpoint's prefetcher logic.
>
> Could you share more info about how many files you copied and what mount options you configured on Mountpoint?

Thank you!
I didn't use any special mount params.
As for my files - it may be problematic: I have a few (about 5) big movies and a lot (about 6K) small images.

@monthonk
Contributor

monthonk commented Sep 1, 2023

We created a new issue (#488) to track the readahead problem, which is most likely the root cause of the slow copy. Until it's fixed, running Mountpoint in single-threaded mode (--max-threads=1) might give you better results.
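
A minimal sketch of that workaround, for anyone trying it. The bucket name and mount directory are placeholders, the `MOUNT_S3_BIN` override is a hypothetical convenience for dry runs, and the `--max-threads` spelling is an assumption about the Mountpoint CLI, so check `mount-s3 --help` for the installed version first.

```shell
# Mount a bucket with a single FUSE daemon thread (workaround sketch).
# Assumption: the option is spelled --max-threads; verify with `mount-s3 --help`.
mount_single_threaded() {
  bucket=$1      # e.g. my-test-bucket
  mount_dir=$2   # e.g. mount-point (must exist and not be mounted yet)
  "${MOUNT_S3_BIN:-mount-s3}" "$bucket" "$mount_dir" --max-threads=1
}
```

Usage would look like `mount_single_threaded my-test-bucket mount-point`; setting `MOUNT_S3_BIN=echo` prints the arguments instead of mounting.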

@swiftfwi

swiftfwi commented Sep 5, 2023

After we changed to O_DIRECT, the read speed is still slow, about 200mb/s. However, the write speed is about 1200mb/s.

@dannycjones
Contributor

Hey, I was digging into this issue. While we have #488 tracking a performance issue related to out-of-order reads, the issue here is that cp is performing the reads one file at a time while the AWS CLI will use up to 10 threads by default: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests
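
To see how much of the gap comes from concurrency alone, one option is to rerun the AWS CLI copy limited to a single concurrent request. The sketch below assumes the `default` profile and uses a hypothetical helper name, `cli_copy_serial`; the setting it changes is the `max_concurrent_requests` value from the page linked above.

```shell
# Run an AWS CLI copy limited to one concurrent request, for a like-for-like
# comparison with a serial `cp`. Note: this changes the default profile's
# configuration; restore it afterwards (the documented default is 10).
cli_copy_serial() {
  s3_uri=$1   # e.g. s3://my-test-bucket
  dest=$2     # e.g. /tmp
  aws configure set default.s3.max_concurrent_requests 1 &&
  aws s3 cp "$s3_uri" "$dest" --recursive
}
```

In bash this could be timed with `time cli_copy_serial s3://my-test-bucket /tmp`, followed by `aws configure set default.s3.max_concurrent_requests 10` to restore the default.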

You can achieve a similar result by parallelizing the copy over multiple concurrent jobs, which will also drive the read requests to Mountpoint in parallel. For example, I tested using GNU Parallel below. It creates a list of files in the directory and then parallelizes the copy over ten jobs.

find /mnt/s3/github-483 -type f | parallel -P 10 cp -v {} ./

I tried a quick comparison. I populated my bucket with 6x 512MiB objects and 6000x 120KiB objects. On my machine, the serial cp took around 15m24s while the parallel cp took around 0m42s. For comparison, the AWS CLI took around 0m24s.

Can you give this approach a try and let us know? As I said, we are looking into #488 but I suspect you'll see a much greater improvement by parallelizing these copies. Thanks!

@vitalykarasik

Thank you! I'll try this method [hopefully next week],
Vitaly

dannycjones added the "question" (Further information is requested) label and removed the "bug" (Something isn't working) label on Oct 3, 2023
@dannycjones
Contributor

dannycjones commented Oct 3, 2023

I'm closing this issue since we provided the recommendation to parallelize UNIX cp. Comparing the parallel copy performed by the AWS CLI to a sequential copy being requested to mountpoint-s3 is not an "apples to apples" comparison. The application should parallelize its file operations if possible.

We are separately working on #488 to fix the performance issue triggered during some parallel reading, but that's unrelated to this use case.
