Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async write for decompression #2975

Merged
merged 6 commits into from
Jan 21, 2022
Merged

Conversation

yoniko
Copy link
Contributor

@yoniko yoniko commented Jan 5, 2022

Changes:

  • Added --[no-]asyncio flag for CLI decompression.
  • Replaced dstBuffer in decompression with a pool of write jobs.
  • Added an ability to execute write jobs in a separate thread.
  • Added an ability to wait (join) on all jobs in a thread pool (queue and running).
  • Fixed bug in multiframe decompression in lz4.

Benchmarked total runtime (real time) of decompression on 3 machines, these are the runtime improvements in %:

flag                     --asyncio  --no-asyncio  improvement
input_file      machine
enwik8.zst      desktop      0.168         0.211    20.379147
                macbook      0.296         0.556    46.762590
                server       0.195         0.241    19.087137
silesia.tar.zst desktop      0.285         0.388    26.546392
                macbook      0.634         0.982    35.437882
                server       0.337         0.439    23.234624

Notes:

  • user & system time / cpu utilization is mostly the same and might even be higher, but total time is reduced because we wait less time on IO
  • Impact might be even bigger on bigger files as we also measure initialization here

@terrelln
Copy link
Contributor

terrelln commented Jan 5, 2022

Added --[no-]asyncio flag for CLI decompression.

This should probably support compression too. And be overridden if the number of threads is explicitly set with -T.

@yoniko
Copy link
Contributor Author

yoniko commented Jan 5, 2022

That's a good point.
Taking it a bit further, perhaps it'd be better to just consolidate with --single-threaded?
The only problem there would be that the --single-threaded has an opposite default than --asyncio has at the moment (we wanted to have it off by default as its still experimental).

@terrelln
Copy link
Contributor

terrelln commented Jan 5, 2022

Yeah, I think we should combine the functionality with --single-thread. --no-asyncio and --single-thread should just be aliases.

We can default compression to --asyncio and decompression to --no-asyncio, until we're ready to enable it by default.

Currently, --single-thread only applies when -T# is set to 0 or unset. It also doesn't apply if ZSTD_NBTHREADS is not zero. I don't think we actually want ZSTD_NBTHREADS to override --single-thread, but that is open for discussion.

Point being, we should think carefully about the interactions between --single-thread, --[no-]asyncio, -T#, and ZSTD_NBTHREADS. And make sure that we have a consistent story that we're happy with. And that it can be extended to multi-threaded decompression, if we ever add that.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jan 5, 2022

Currently, --single-thread only applies when -T# is set to 0 or unset. It also doesn't apply if ZSTD_NBTHREADS is not zero.

ZSTD_NBTHREADS is only meant to be a "default" value for -T#, therefore it should be overwritten by any explicit command, including -T# or --single-thread.

Currently, --single-thread only applies when -T# is set to 0 or unset.

This seems strange to me, likely an overlook.
I believe it should supersede -T#,
and there is probably a point to output a warning message in case --single-thread is combined with -T2.

Another candidate policy could be "last one wins", meaning that --single-thread -T2 actually means -T2, and -T2 --single-thread actually means --single-thread. Here also, a warning message would be welcomed to underline the discrepancy.

Lastly, a last policy could be to refuse to work in presence of a discrepancy.

I tend to prefer the first policy (--single-thread has priority), but this is certainly opened to discussion.

Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking great!

Could you also add some benchmark figures to the PR summary?

lib/common/pool.c Show resolved Hide resolved
POOL_ctx* writerPool;
ZSTD_pthread_mutex_t writeJobsMutex;
void* jobs[DECOMPRESSION_MAX_WRITE_JOBS];
volatile int availableWriteJobs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

availableWriteJobs does not need to be volatile.

But can you add a newline before writeJobsMutex and after availableWriteJobs, and a comment explaining that the mutex protects jobs and availableWriteJobs?

@@ -2147,6 +2244,82 @@ FIO_fwriteSparseEnd(const FIO_prefs_t* const prefs, FILE* file, unsigned storedS
} }
}

static void FIO_writePoolSetDstFile(write_pool_ctx_t *ctx, FILE* dstFile) {
assert(ctx!=NULL);
// We can change the dst file only if we have finished writing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These need to be /* */ comments. // comments are C99.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return job;
}

static void FIO_writePoolCreateThreadPool(write_pool_ctx_t *ctx, const FIO_prefs_t *prefs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming: Can you name the functions FIO_WritePool_createThreadPool()? Or just WritePool_createThreadPool() since they're all static. That makes it clear its the function createThreadPool() for the object/concept/group WritePool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
}

static write_pool_ctx_t* FIO_writePoolCreate(FIO_prefs_t* const prefs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WritePool_create()

@@ -2147,6 +2244,82 @@ FIO_fwriteSparseEnd(const FIO_prefs_t* const prefs, FILE* file, unsigned storedS
} }
}

static void FIO_writePoolSetDstFile(write_pool_ctx_t *ctx, FILE* dstFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WritePool_setDstFile(), etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

FIO_WritePoolWriteJobExecute(job);
}

static void FIO_writePoolQueueWriteJobAndGetNextAvailable(write_job_t **job) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the functions you intend to be "public", e.g. not the helper functions only used by the other WritePool functions, could you add documentation? You could also add docs for the helpers, but that is less necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@terrelln
Copy link
Contributor

terrelln commented Jan 5, 2022

I tend to prefer the first policy (--single-thread has priority), but this is certainly opened to discussion.

I agree that --single-thread should certainly have priority over ZSTD_NBTHREADS. I could see it going either way for -T#, and don't have a strong preference, as long as we are consistent.

If we decide that --single-thread should override -T#, should --no-asyncio also override -T#?

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jan 6, 2022

If we decide that --single-thread should override -T#, should --no-asyncio also override -T#?

If we decide that --single-thread == --no-asyncio, then yes.

There is a catch though.
--single-thread produces a compressed output different from -T1.
if --no-asyncio == --single-thread, then it will have the same impact,
which has consequences from a reproducibility perspective.
Yet, while it's reasonable to explain that --single-thread impacts the compression process, hence impacts the compressed output, it's more difficult to use the same argument for --no-asyncio, which should only impact I/O.

Thoughts ?

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jan 6, 2022

Could you add a test featuring multi-frames files ?
It's currently a supported scenario, and it's possibly one of the difficult scenarios for --async-io.

Bonus point for files with multiple heterogeneous frames (also currently supported).

@yoniko
Copy link
Contributor Author

yoniko commented Jan 6, 2022

I've actually tested this manually, but it makes sense to add an automated test. Will do it.

@yoniko
Copy link
Contributor Author

yoniko commented Jan 6, 2022

Updating here that we've spoken about --single-thread and --[no-]asyncio and decided to leave the two uncoupled as they have two different underlining meanings:

  • -T# and --single-thread control the threading mode for compression which directly affects the result of compression itself. Thus the --single-thread flag's importance is in compressing in a single-threaded mode and not about having the cli actually run in a single thread.
  • --[no-]asyncio is about having asyncio interfaces for (de)compression. These asyncio interfaces could work with a single threaded compression as well (if available).

testAsyncIO lz4
addTwoFrames lz4
fi
cat tmp_uncompressed | shasum > tmp2
Copy link
Contributor

@Cyan4973 Cyan4973 Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use $MD5SUM to designate the checksum program,
which generally translates into md5sum, but not always (freeBSD uses gmd5sum, macos uses md5 -r, etc.).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, pushed a fix.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jan 6, 2022

I like the new code path.
I believe it fits well in current design, and even clarifies the processing pattern.
Evidence is that it can be generalized with minimal effort to all supported decoders.

The last issue is about the meson build script,
which I suspect is related to meson's contributors insistence of building the CLI against the shared library,
even though it is a scenario we explicitly do not support.
This PR makes the situation a bit worse in this regard, since it makes the CLI depend even more on non-public components of libzstd, which is fine as long as CLI is linked to the static library, but not the dynamic one.

It may seem a unrelated secondary topic, but we'll nonetheless need a solution to this situation before the merge.

@yoniko
Copy link
Contributor Author

yoniko commented Jan 11, 2022

meson build fixed and Travis finishes successfully.
I did need to include some more files from lib/common in the building of the binary.

- Added --[no-]asyncio flag for CLI decompression.
- Replaced dstBuffer in decompression with a pool of write jobs.
- Added an ability to execute write jobs in a separate thread.
- Added an ability to wait (join) on all jobs in a thread pool (queue and running).
@yoniko yoniko merged commit 1598e6c into facebook:dev Jan 21, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants