-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define the Zarr streaming API #291
Conversation
33e1085
to
18ac6d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a partial review. It covers the main CMakeLists.txt
file and the public interface zarr.h
. I will continue reviewing the C++ code separately.
src/CMakeLists.txt
Outdated
|
||
####### Acquire Zarr Streaming Library ####### | ||
|
||
set(tgt acquire-zarr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some information about why "streaming" is now the primary target? Additionally, would it be more effective to organize the code by placing all "driver" code in one folder and "streaming" code in another? This approach would allow each subfolder to have its own CMake file defining its target, rather than having a single CMake file defining two targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point, the streaming code is going to go into its own repo, but this is probably a good opportunity to separate them.
src/CMakeLists.txt
Outdated
|
||
set(tgt acquire-zarr) | ||
|
||
add_library(${tgt} STATIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's best practice to let users decide whether to build a library as static or shared. While CMake defaults to STATIC
, it offers the BUILD_SHARED_LIBS
option for override. I'd omit specifying the library type unless necessary—in which case, I'll add a comment explaining why.
src/CMakeLists.txt
Outdated
set(tgt acquire-zarr) | ||
|
||
add_library(${tgt} STATIC | ||
include/zarr.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's standard practice to place the include folder containing public headers outside the src directory. This separation clearly distinguishes the public interface from implementation details and simplifies integration for users.
src/CMakeLists.txt
Outdated
PUBLIC | ||
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include> | ||
PRIVATE | ||
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/internal> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you choose to name a directory internal
? If it solely contains streaming code, shouldn't its name reflect that content? This question relates to my earlier suggestion about creating separate directories for each target, each with its own CMake file.
src/CMakeLists.txt
Outdated
) | ||
|
||
install(TARGETS ${tgt} | ||
LIBRARY DESTINATION lib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The presence of a LIBRARY DESTINATION
configuration alongside an explicit request for a STATIC
library suggests that there may not be a strict requirement for compiling a static library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what you mean by that?
src/include/zarr.h
Outdated
* @param[in, out] settings The Zarr stream settings struct. | ||
* @param[in] index The index of the dimension to set. Must be less than the | ||
* number of dimensions reserved with ZarrStreamSettings_reserve_dimensions. | ||
* @param[in] name The name of the dimension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the “name” of the dimension user defined?
src/include/zarr.h
Outdated
/** | ||
* @brief Set the multiscale flag for the Zarr stream. | ||
* @param[in, out] settings The Zarr stream settings struct. | ||
* @param[in] multiscale A flag indicating whether to stream to multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A flag, as in true or false? If so, this comment should be clearer. Also, doesn't C11 support bools (stdbool.h
)?
src/include/zarr.h
Outdated
const char* ZarrStreamSettings_get_s3_access_key_id( | ||
const ZarrStreamSettings* settings); | ||
const char* ZarrStreamSettings_get_s3_secret_access_key( | ||
const ZarrStreamSettings* settings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to expose individual getters? Also, having accessor functions for secret keys feels wrong to me. If the function returns a pointer to a sensitive value like an S3 secret access key, it can expose the key to anyone who has access to the pointer. This could be a security risk if the key is used inappropriately or if the pointer is leaked.
src/include/zarr.h
Outdated
char* name, | ||
size_t bytes_of_name, | ||
ZarrDimensionType* kind, | ||
size_t* array_size_px, | ||
size_t* chunk_size_px, | ||
size_t* shard_size_chunks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all these out
params would be better represented as a struct.
src/include/zarr.h
Outdated
|
||
ZarrVersion ZarrStream_get_version(const ZarrStream* stream); | ||
|
||
const char* ZarrStream_get_store_path(const ZarrStream* stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these getters for the same underlying object, but instead of passing the settings object, you passed the stream? If that's the case, why not use one or the other? I'm still trying to understand the benefit of exposing every parameter as an accessor method, so I'm surprised to see another set of accessor functions.
I may need to learn more about this API, but currently it feels like a lot of unnecessary flexibility (and maybe a violation of the principle of least exposure). I might be misunderstanding the library's purpose, but I'm wondering: if I've already provided this information, why are accessor methods needed to retrieve it? Wouldn't returning an instance of the settings object be more straightforward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another partial review that focuses on logging.
src/internal/logger.cpp
Outdated
@@ -0,0 +1,81 @@ | |||
#include "logger.hh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we already have a logger in the common
repo? Why do we need another one? If there are certain benefits to this logger, should we make it our primary logger?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this refactor removes the dependency on the acquire libraries. I concede that a lot of this doesn't make sense in isolation. If you want an idea of what it looks like all together, you can check out the standalone branch.
src/internal/logger.cpp
Outdated
int line, | ||
const char* func, | ||
const char* format, | ||
...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ offers better alternatives for handling variable arguments, such as variadic templates and initializer lists, which are type-safe and more flexible. I think it should work with __VA_ARGS__
.
src/internal/logger.cpp
Outdated
} | ||
|
||
std::string | ||
Logger::log(LogLevel level, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this code is thread-safe. If multiple threads log messages simultaneously, the output might become interleaved or corrupted.
src/internal/logger.cpp
Outdated
|
||
std::string | ||
Logger::log(LogLevel level, | ||
const char* file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use std::string_view
instead of const char*
?
src/internal/logger.cpp
Outdated
<< std::setfill('0') << std::setw(3) << ms.count() << " " << prefix | ||
<< filename << ":" << line << " " << func << ": "; | ||
|
||
char buffer[1024]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If the formatted log message exceeds this size, wouldn't it cause a buffer overflow?
- Why not use
std::string
, which is the expected return type anyway?
src/internal/logger.hh
Outdated
@@ -0,0 +1,38 @@ | |||
#include "zarr.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This design seems fragile. We're incorporating a C API wrapper into a logger class that should be generalized for accessing properties—properties that ought to be defined here initially. Are there any alternative approaches we could consider?
src/internal/logger.cpp
Outdated
auto now = std::chrono::system_clock::now(); | ||
auto time = std::chrono::system_clock::to_time_t(now); | ||
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>( | ||
now.time_since_epoch()) % | ||
1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can organize this code better. It would be an improvement to have a private member function that returns the current timestamp as a string, which would encapsulate this code, as well as std::put_time
.
src/internal/logger.hh
Outdated
#define LOG_DEBUG(...) \ | ||
Logger::log(LogLevel_Debug, __FILE__, __LINE__, __func__, __VA_ARGS__) | ||
#define LOG_INFO(...) \ | ||
Logger::log(LogLevel_Info, __FILE__, __LINE__, __func__, __VA_ARGS__) | ||
#define LOG_WARNING(...) \ | ||
Logger::log(LogLevel_Warning, __FILE__, __LINE__, __func__, __VA_ARGS__) | ||
#define LOG_ERROR(...) \ | ||
Logger::log(LogLevel_Error, __FILE__, __LINE__, __func__, __VA_ARGS__) | ||
|
||
#define EXPECT(e, ...) \ | ||
do { \ | ||
if (!(e)) { \ | ||
const std::string __err = LOG_ERROR(__VA_ARGS__); \ | ||
throw std::runtime_error(__err); \ | ||
} \ | ||
} while (0) | ||
#define CHECK(e) EXPECT(e, "Expression evaluated as false:\n\t%s", #e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like all these macros can be replaced with static member functions, for example:
template <typename... Args>
static void debug(const char* format, Args... args) {
log(LogLevel_Debug, __FILE__, __LINE__, __func__, format, args...);
}
This approach improves type checking (which you get from variadic templates), and it avoids all the issues that come with macro expansions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another partial review focusing on stream.settings
.
src/internal/stream.settings.hh
Outdated
struct ZarrDimension_s | ||
{ | ||
std::string name; /* Name of the dimension */ | ||
uint8_t kind; /* Type of dimension */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of member variables has implications for memory layout. A general rule of thumb is ordering members from largest to smallest type. It can potentially reduce padding and make the struct more memory efficient. This comment is applicable to all structs.
src/internal/stream.settings.hh
Outdated
{ | ||
std::string store_path; /* Path to the Zarr store on the local filesystem */ | ||
|
||
std::string s3_endpoint; /* Endpoint for the S3 service */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's an opportunity to organize the code better by introducing more granular types here.
struct ZarrStreamS3Config {
std::string endpoint;
std::string bucket_name;
// ...
};
struct ZarrStreamCompressionConfig {
uint8_t compressor;
uint8_t compressor_codec;
// ...
};
src/internal/stream.settings.hh
Outdated
bool multiscale; /* Whether to stream to multiple resolutions */ | ||
}; | ||
|
||
bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think [[nodiscard]]
is suitable here. It makes it explicit that the return value needs to be examined, as opposed to assuming it throws an exception.
src/internal/stream.settings.cpp
Outdated
#define EXPECT_VALID_ARGUMENT(e, ...) \ | ||
do { \ | ||
if (!(e)) { \ | ||
LOG_ERROR(__VA_ARGS__); \ | ||
return ZarrError_InvalidArgument; \ | ||
} \ | ||
} while (0) | ||
|
||
#define ZARR_DIMENSION_MIN 3 | ||
#define ZARR_DIMENSION_MAX 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think
EXPECT_VALID_ARGUMENT
is defined in multiple places. Can we consolidate? - We should avoid macros for constants and functions (ES.31). C++ offers safer alternatives. This applies to all macro functions in this file. If there isn't a strong requirement to using them, I think they are actively bad.
src/internal/stream.settings.cpp
Outdated
|
||
#define SETTINGS_SET_STRING(settings, member, bytes_of_member) \ | ||
do { \ | ||
if (!(settings)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better clearer to say if (settings == nullptr)
?
src/internal/stream.settings.cpp
Outdated
ZarrStreamSettings* settings, | ||
const char* s3_secret_access_key, | ||
size_t bytes_of_s3_secret_access_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Are we using clang-tidy for code formatting? If we don't, we should probably configure it for consistency.
src/internal/stream.settings.cpp
Outdated
if (dim_name.empty()) { | ||
LOG_ERROR("Invalid name. Must not be empty"); | ||
return ZarrError_InvalidArgument; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- You already checked that
name
is not null, and thatbytes_of_name
is greater than zero. Under what circumstances woulddim_name
be empty? - If such a case exists, it would be better to check it against a
std::string_view
to avoid a premature allocation ofstd::string
.
src/internal/stream.settings.cpp
Outdated
} | ||
|
||
ZarrError | ||
ZarrStreamSettings_get_dimension(const ZarrStreamSettings* settings, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think there needs to be a distinction in the name between an accessor function that returns a value and a function that is used to assign values into output parameters.
- The signature of this function reinforces the need for a struct that represents a "dimension", instead of passing individual properties.
src/internal/stream.settings.cpp
Outdated
EXPECT_VALID_ARGUMENT(settings, "Null pointer: settings"); | ||
EXPECT_VALID_ARGUMENT(name, "Null pointer: name"); | ||
EXPECT_VALID_ARGUMENT(kind, "Null pointer: kind"); | ||
EXPECT_VALID_ARGUMENT(array_size_px, "Null pointer: array_size_px"); | ||
EXPECT_VALID_ARGUMENT(chunk_size_px, "Null pointer: chunk_size_px"); | ||
EXPECT_VALID_ARGUMENT(shard_size_chunks, "Null pointer: shard_size_chunks"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would an accessor or query function need to validate its input?
src/internal/stream.settings.cpp
Outdated
|
||
/* Getters */ | ||
const char* | ||
ZarrStreamSettings_get_store_path(const ZarrStreamSettings* settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find all the _get_
functions confusing. I suspect they're unnecessary boilerplate, but I might be overlooking something. My reasoning is simple: if I already own the settings object, why do I need a public API for accessing individual values, especially for a flat aggregate type?
@shlomnissan I'm picking this up in #293 and beyond. |
This defines the Zarr streaming API in zarr.h. It also moves driver tests to tests/driver and adds logger code and code for setting parameters on the Zarr stream.
The bulk of the changed files are just moved.