Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define the Zarr streaming API #291

Closed
wants to merge 18 commits into from
Closed

Define the Zarr streaming API #291

wants to merge 18 commits into from

Conversation

aliddell
Copy link
Member

@aliddell aliddell commented Aug 30, 2024

This defines the Zarr streaming API in zarr.h. It also moves driver tests to tests/driver and adds logger code and code for setting parameters on the Zarr stream.

The bulk of the changed files are just moved.

Copy link
Collaborator

@shlomnissan shlomnissan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review. It covers the main CMakeLists.txt file and the public interface zarr.h. I will continue reviewing the C++ code separately.


####### Acquire Zarr Streaming Library #######

set(tgt acquire-zarr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some information about why "streaming" is now the primary target? Additionally, would it be more effective to organize the code by placing all "driver" code in one folder and "streaming" code in another? This approach would allow each subfolder to have its own CMake file defining its target, rather than having a single CMake file defining two targets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, the streaming code is going to go into its own repo, but this is probably a good opportunity to separate them.


set(tgt acquire-zarr)

add_library(${tgt} STATIC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best practice to let users decide whether to build a library as static or shared. While CMake defaults to STATIC, it offers the BUILD_SHARED_LIBS option for override. I'd omit specifying the library type unless necessary—in which case, I'll add a comment explaining why.

set(tgt acquire-zarr)

add_library(${tgt} STATIC
include/zarr.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's standard practice to place the include folder containing public headers outside the src directory. This separation clearly distinguishes the public interface from implementation details and simplifies integration for users.

PUBLIC
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/internal>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to name a directory internal? If it solely contains streaming code, shouldn't its name reflect that content? This question relates to my earlier suggestion about creating separate directories for each target, each with its own CMake file.

)

install(TARGETS ${tgt}
LIBRARY DESTINATION lib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The presence of a LIBRARY DESTINATION configuration alongside an explicit request for a STATIC library suggests that there may not be a strict requirement for compiling a static library.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what you mean by that?

* @param[in, out] settings The Zarr stream settings struct.
* @param[in] index The index of the dimension to set. Must be less than the
* number of dimensions reserved with ZarrStreamSettings_reserve_dimensions.
* @param[in] name The name of the dimension.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the “name” of the dimension user defined?

/**
* @brief Set the multiscale flag for the Zarr stream.
* @param[in, out] settings The Zarr stream settings struct.
* @param[in] multiscale A flag indicating whether to stream to multiple
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A flag, as in true or false? If so, this comment should be clearer. Also, doesn't C11 support bools (stdbool.h)?

Comment on lines 374 to 377
const char* ZarrStreamSettings_get_s3_access_key_id(
const ZarrStreamSettings* settings);
const char* ZarrStreamSettings_get_s3_secret_access_key(
const ZarrStreamSettings* settings);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to expose individual getters? Also, having accessor functions for secret keys feels wrong to me. If the function returns a pointer to a sensitive value like an S3 secret access key, it can expose the key to anyone who has access to the pointer. This could be a security risk if the key is used inappropriately or if the pointer is leaked.

Comment on lines 417 to 422
char* name,
size_t bytes_of_name,
ZarrDimensionType* kind,
size_t* array_size_px,
size_t* chunk_size_px,
size_t* shard_size_chunks);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all these out params would be better represented as a struct.


ZarrVersion ZarrStream_get_version(const ZarrStream* stream);

const char* ZarrStream_get_store_path(const ZarrStream* stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these getters for the same underlying object, but instead of passing the settings object, you passed the stream? If that's the case, why not use one or the other? I'm still trying to understand the benefit of exposing every parameter as an accessor method, so I'm surprised to see another set of accessor functions.

I may need to learn more about this API, but currently it feels like a lot of unnecessary flexibility (and maybe a violation of the principle of least exposure). I might be misunderstanding the library's purpose, but I'm wondering: if I've already provided this information, why are accessor methods needed to retrieve it? Wouldn't returning an instance of the settings object be more straightforward?

Copy link
Collaborator

@shlomnissan shlomnissan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another partial review that focuses on logging.

@@ -0,0 +1,81 @@
#include "logger.hh"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we already have a logger in the common repo? Why do we need another one? If there are certain benefits to this logger, should we make it our primary logger?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this refactor removes the dependency on the acquire libraries. I concede that a lot of this doesn't make sense in isolation. If you want an idea of what it looks like all together, you can check out the standalone branch.

int line,
const char* func,
const char* format,
...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ offers better alternatives for handling variable arguments, such as variadic templates and initializer lists, which are type-safe and more flexible. I think it should work with __VA_ARGS__.

}

std::string
Logger::log(LogLevel level,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this code is thread-safe. If multiple threads log messages simultaneously, the output might become interleaved or corrupted.


std::string
Logger::log(LogLevel level,
const char* file,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use std::string_view instead of const char*?

<< std::setfill('0') << std::setw(3) << ms.count() << " " << prefix
<< filename << ":" << line << " " << func << ": ";

char buffer[1024];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If the formatted log message exceeds this size, wouldn't it cause a buffer overflow?
  • Why not use std::string, which is the expected return type anyway?

@@ -0,0 +1,38 @@
#include "zarr.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design seems fragile. We're incorporating a C API wrapper into a logger class that should be generalized for accessing properties—properties that ought to be defined here initially. Are there any alternative approaches we could consider?

Comment on lines 59 to 63
auto now = std::chrono::system_clock::now();
auto time = std::chrono::system_clock::to_time_t(now);
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
now.time_since_epoch()) %
1000;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can organize this code better. It would be an improvement to have a private member function that returns the current timestamp as a string, which would encapsulate this code, as well as std::put_time.

Comment on lines 22 to 38
#define LOG_DEBUG(...) \
Logger::log(LogLevel_Debug, __FILE__, __LINE__, __func__, __VA_ARGS__)
#define LOG_INFO(...) \
Logger::log(LogLevel_Info, __FILE__, __LINE__, __func__, __VA_ARGS__)
#define LOG_WARNING(...) \
Logger::log(LogLevel_Warning, __FILE__, __LINE__, __func__, __VA_ARGS__)
#define LOG_ERROR(...) \
Logger::log(LogLevel_Error, __FILE__, __LINE__, __func__, __VA_ARGS__)

#define EXPECT(e, ...) \
do { \
if (!(e)) { \
const std::string __err = LOG_ERROR(__VA_ARGS__); \
throw std::runtime_error(__err); \
} \
} while (0)
#define CHECK(e) EXPECT(e, "Expression evaluated as false:\n\t%s", #e)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like all these macros can be replaced with static member functions, for example:

template <typename... Args>
static void debug(const char* format, Args... args) {
   log(LogLevel_Debug, __FILE__, __LINE__, __func__, format, args...);
}

This approach improves type checking (which you get from variadic templates), and it avoids all the issues that come with macro expansions.

Copy link
Collaborator

@shlomnissan shlomnissan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another partial review focusing on stream.settings.

struct ZarrDimension_s
{
std::string name; /* Name of the dimension */
uint8_t kind; /* Type of dimension */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of member variables has implications for memory layout. A general rule of thumb is ordering members from largest to smallest type. It can potentially reduce padding and make the struct more memory efficient. This comment is applicable to all structs.

{
std::string store_path; /* Path to the Zarr store on the local filesystem */

std::string s3_endpoint; /* Endpoint for the S3 service */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an opportunity to organize the code better by introducing more granular types here.

struct ZarrStreamS3Config {
    std::string endpoint;
    std::string bucket_name;
    // ...
};

struct ZarrStreamCompressionConfig {
    uint8_t compressor;
    uint8_t compressor_codec;
    // ...
};

bool multiscale; /* Whether to stream to multiple resolutions */
};

bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think [[nodiscard]] is suitable here. It makes it explicit that the return value needs to be examined, as opposed to assuming it throws an exception.

Comment on lines 10 to 19
#define EXPECT_VALID_ARGUMENT(e, ...) \
do { \
if (!(e)) { \
LOG_ERROR(__VA_ARGS__); \
return ZarrError_InvalidArgument; \
} \
} while (0)

#define ZARR_DIMENSION_MIN 3
#define ZARR_DIMENSION_MAX 32
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think EXPECT_VALID_ARGUMENT is defined in multiple places. Can we consolidate?
  2. We should avoid macros for constants and functions (ES.31). C++ offers safer alternatives. This applies to all macro functions in this file. If there isn't a strong requirement to using them, I think they are actively bad.


#define SETTINGS_SET_STRING(settings, member, bytes_of_member) \
do { \
if (!(settings)) { \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better clearer to say if (settings == nullptr)?

Comment on lines 182 to 184
ZarrStreamSettings* settings,
const char* s3_secret_access_key,
size_t bytes_of_s3_secret_access_key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Are we using clang-tidy for code formatting? If we don't, we should probably configure it for consistency.

Comment on lines 356 to 359
if (dim_name.empty()) {
LOG_ERROR("Invalid name. Must not be empty");
return ZarrError_InvalidArgument;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • You already checked that name is not null, and that bytes_of_name is greater than zero. Under what circumstances would dim_name be empty?
  • If such a case exists, it would be better to check it against a std::string_view to avoid a premature allocation of std::string.

}

ZarrError
ZarrStreamSettings_get_dimension(const ZarrStreamSettings* settings,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think there needs to be a distinction in the name between an accessor function that returns a value and a function that is used to assign values into output parameters.
  • The signature of this function reinforces the need for a struct that represents a "dimension", instead of passing individual properties.

Comment on lines 490 to 495
EXPECT_VALID_ARGUMENT(settings, "Null pointer: settings");
EXPECT_VALID_ARGUMENT(name, "Null pointer: name");
EXPECT_VALID_ARGUMENT(kind, "Null pointer: kind");
EXPECT_VALID_ARGUMENT(array_size_px, "Null pointer: array_size_px");
EXPECT_VALID_ARGUMENT(chunk_size_px, "Null pointer: chunk_size_px");
EXPECT_VALID_ARGUMENT(shard_size_chunks, "Null pointer: shard_size_chunks");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would an accessor or query function need to validate its input?


/* Getters */
const char*
ZarrStreamSettings_get_store_path(const ZarrStreamSettings* settings)
Copy link
Collaborator

@shlomnissan shlomnissan Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find all the _get_ functions confusing. I suspect they're unnecessary boilerplate, but I might be overlooking something. My reasoning is simple: if I already own the settings object, why do I need a public API for accessing individual values, especially for a flat aggregate type?

@aliddell
Copy link
Member Author

@shlomnissan I'm picking this up in #293 and beyond.

@aliddell aliddell deleted the standalone-sequence branch September 17, 2024 19:36
aliddell added a commit that referenced this pull request Sep 18, 2024
…ries (#293)

Splitting up #291. Moves files and updates CMakeLists.txt files
accordingly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants