Benchmark ReadsPipelineSpark against running its component Spark tools individually #3395

Closed
droazen opened this issue Aug 1, 2017 · 4 comments

Comments

@droazen
Collaborator

droazen commented Aug 1, 2017

No description provided.

@droazen added this to the Engine-4.0 milestone Aug 1, 2017
@droazen changed the title from "Benchmark ReadsPipelineSpark against running its component tools individually" to "Benchmark ReadsPipelineSpark against running its component Spark tools individually" Aug 1, 2017
@tomwhite
Contributor

tomwhite commented Nov 3, 2017

@droazen
Collaborator Author

droazen commented Nov 3, 2017

These initial results suggest that the savings from a pure-Spark pipeline are in the 15-30% range. @tomwhite Do you attribute these savings mostly to avoiding writing/reading intermediate outputs?

@droazen
Collaborator Author

droazen commented Nov 3, 2017

Also, once we've confirmed these results, we'll want to compare the total core hours of the Spark pipeline against the core hours of an equivalent non-Spark pipeline, to see if the savings provided by a pure-Spark pipeline actually make it cheaper than a non-Spark pipeline.

@droazen droazen modified the milestones: Engine-4.0, Engine-4.1 Jan 16, 2018
@tomwhite
Contributor

Closing this as we have more Spark speed improvements now (see e.g. #5127). We can open new issues to track further Spark performance improvements.
