Define the Zarr streaming API #291

aliddell · 2024-08-30T17:34:10Z

This defines the Zarr streaming API in zarr.h. It also moves the driver tests to tests/driver, adds a logger, and adds code for setting parameters on the Zarr stream.

Most of the changed files were only moved, not modified.

shlomnissan

This is a partial review. It covers the main CMakeLists.txt file and the public interface zarr.h. I will continue reviewing the C++ code separately.

shlomnissan · 2024-09-04T18:25:16Z

src/CMakeLists.txt

+
+####### Acquire Zarr Streaming Library #######
+
+set(tgt acquire-zarr)


Could we add some information about why "streaming" is now the primary target? Additionally, would it be more effective to organize the code by placing all "driver" code in one folder and "streaming" code in another? This approach would allow each subfolder to have its own CMake file defining its target, rather than having a single CMake file defining two targets.

At some point, the streaming code is going to go into its own repo, but this is probably a good opportunity to separate them.

shlomnissan · 2024-09-04T18:25:33Z

src/CMakeLists.txt

+
+set(tgt acquire-zarr)
+
+add_library(${tgt} STATIC


It's best practice to let users decide whether to build a library as static or shared. While CMake defaults to STATIC, it offers the BUILD_SHARED_LIBS option for override. I'd omit specifying the library type unless necessary, in which case I'd add a comment explaining why.

shlomnissan · 2024-09-04T18:25:41Z

src/CMakeLists.txt

+set(tgt acquire-zarr)
+
+add_library(${tgt} STATIC
+        include/zarr.h


It's standard practice to place the include folder containing public headers outside the src directory. This separation clearly distinguishes the public interface from implementation details and simplifies integration for users.

shlomnissan · 2024-09-04T18:26:00Z

src/CMakeLists.txt

+        PUBLIC
+        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
+        PRIVATE
+        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/internal>


Why did you choose to name a directory internal? If it solely contains streaming code, shouldn't its name reflect that content? This question relates to my earlier suggestion about creating separate directories for each target, each with its own CMake file.

shlomnissan · 2024-09-04T18:27:28Z

src/CMakeLists.txt

+)
+
+install(TARGETS ${tgt}
+        LIBRARY DESTINATION lib


The presence of a LIBRARY DESTINATION configuration alongside an explicit request for a STATIC library suggests that there may not be a strict requirement for compiling a static library.

Can you explain what you mean by that?

shlomnissan · 2024-09-04T20:48:09Z

src/include/zarr.h

+     * @param[in, out] settings The Zarr stream settings struct.
+     * @param[in] index The index of the dimension to set. Must be less than the
+     * number of dimensions reserved with ZarrStreamSettings_reserve_dimensions.
+     * @param[in] name The name of the dimension.


Is the “name” of the dimension user defined?

shlomnissan · 2024-09-04T20:50:14Z

src/include/zarr.h

+    /**
+     * @brief Set the multiscale flag for the Zarr stream.
+     * @param[in, out] settings The Zarr stream settings struct.
+     * @param[in] multiscale A flag indicating whether to stream to multiple


A flag, as in true or false? If so, this comment should be clearer. Also, doesn't C11 support bools (stdbool.h)?

shlomnissan · 2024-09-04T20:52:57Z

src/include/zarr.h

+    const char* ZarrStreamSettings_get_s3_access_key_id(
+      const ZarrStreamSettings* settings);
+    const char* ZarrStreamSettings_get_s3_secret_access_key(
+      const ZarrStreamSettings* settings);


Why do you need to expose individual getters? Also, having accessor functions for secret keys feels wrong to me. If the function returns a pointer to a sensitive value like an S3 secret access key, it can expose the key to anyone who has access to the pointer. This could be a security risk if the key is used inappropriately or if the pointer is leaked.

shlomnissan · 2024-09-04T20:54:16Z

src/include/zarr.h

+      char* name,
+      size_t bytes_of_name,
+      ZarrDimensionType* kind,
+      size_t* array_size_px,
+      size_t* chunk_size_px,
+      size_t* shard_size_chunks);


I think all these out params would be better represented as a struct.
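A rough sketch of what that could look like. The struct name `ZarrDimensionProperties`, the fixed 64-byte name buffer, the enum values, and the `get_dimension` function are all placeholders I made up for illustration; none of them come from zarr.h.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the ZarrDimensionType enum in zarr.h.
enum ZarrDimensionType { ZarrDimensionType_Space, ZarrDimensionType_Time };

// Hypothetical: one plain struct replacing the six out parameters.
// It uses only C-compatible members so it could live in a C header.
struct ZarrDimensionProperties {
    char name[64];
    ZarrDimensionType kind;
    std::size_t array_size_px;
    std::size_t chunk_size_px;
    std::size_t shard_size_chunks;
};

// Hypothetical getter: fills `out` with fixed demo values and returns
// false when given a null pointer.
bool get_dimension(ZarrDimensionProperties* out)
{
    if (!out) {
        return false;
    }
    std::strncpy(out->name, "x", sizeof(out->name) - 1);
    out->name[sizeof(out->name) - 1] = '\0'; // always terminate the name
    out->kind = ZarrDimensionType_Space;
    out->array_size_px = 1920;
    out->chunk_size_px = 64;
    out->shard_size_chunks = 4;
    return true;
}
```

The caller then passes one pointer instead of six, and adding a field later does not change the function signature.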

shlomnissan · 2024-09-04T21:01:48Z

src/include/zarr.h

+
+    ZarrVersion ZarrStream_get_version(const ZarrStream* stream);
+
+    const char* ZarrStream_get_store_path(const ZarrStream* stream);


Are these getters for the same underlying object, but instead of passing the settings object, you passed the stream? If that's the case, why not use one or the other? I'm still trying to understand the benefit of exposing every parameter as an accessor method, so I'm surprised to see another set of accessor functions.

I may need to learn more about this API, but currently it feels like a lot of unnecessary flexibility (and maybe a violation of the principle of least exposure). I might be misunderstanding the library's purpose, but I'm wondering: if I've already provided this information, why are accessor methods needed to retrieve it? Wouldn't returning an instance of the settings object be more straightforward?

shlomnissan

This is another partial review that focuses on logging.

shlomnissan · 2024-09-04T21:08:48Z

src/internal/logger.cpp

@@ -0,0 +1,81 @@
+#include "logger.hh"


Don't we already have a logger in the common repo? Why do we need another one? If there are certain benefits to this logger, should we make it our primary logger?

Because this refactor removes the dependency on the acquire libraries. I concede that a lot of this doesn't make sense in isolation. If you want an idea of what it looks like all together, you can check out the standalone branch.

shlomnissan · 2024-09-04T21:10:06Z

src/internal/logger.cpp

+            int line,
+            const char* func,
+            const char* format,
+            ...)


C++ offers better alternatives for handling variable arguments, such as variadic templates and initializer lists, which are type-safe and more flexible. I think it should work with __VA_ARGS__.
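For illustration, a minimal sketch of the variadic-template route, assuming C++17. `format_message` is a hypothetical helper, not something in this PR, and it concatenates arguments with `operator<<` instead of interpreting a printf-style format string.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: every argument is streamed with operator<<, so a
// type that cannot be printed is rejected at compile time instead of
// being misread through a mismatched printf specifier.
template <typename... Args>
std::string format_message(const Args&... args)
{
    std::ostringstream oss;
    (oss << ... << args); // C++17 fold expression over operator<<
    return oss.str();
}
```

A call such as `format_message("chunk ", 3, " of ", 10)` yields `"chunk 3 of 10"`.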

shlomnissan · 2024-09-04T21:13:04Z

src/internal/logger.cpp

+}
+
+std::string
+Logger::log(LogLevel level,


I don't think this code is thread-safe. If multiple threads log messages simultaneously, the output might become interleaved or corrupted.

shlomnissan · 2024-09-04T21:13:27Z

src/internal/logger.cpp

+
+std::string
+Logger::log(LogLevel level,
+            const char* file,


Can we use std::string_view instead of const char*?

shlomnissan · 2024-09-04T21:20:30Z

src/internal/logger.cpp

+            << std::setfill('0') << std::setw(3) << ms.count() << " " << prefix
+            << filename << ":" << line << " " << func << ": ";
+
+    char buffer[1024];


If the formatted log message exceeds this size, wouldn't it cause a buffer overflow?

Why not use std::string, which is the expected return type anyway?
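Whether the fixed buffer overflows or only truncates depends on the formatting call, which this hunk does not show: `vsprintf` can write past the end, while `vsnprintf` stops at the given size and silently drops the rest. Either way, a two-pass `vsnprintf` avoids the fixed size. A sketch, where `format_to_string` is a hypothetical helper and not part of the PR:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical helper: formats printf-style arguments into a std::string
// of exactly the required length.
std::string format_to_string(const char* format, ...)
{
    va_list args;
    va_start(args, format);
    va_list args_copy;
    va_copy(args_copy, args); // a va_list cannot be reused after one pass

    // First pass: a null buffer of size 0 only reports the needed length.
    const int length = std::vsnprintf(nullptr, 0, format, args);
    va_end(args);
    if (length < 0) {
        va_end(args_copy);
        return {};
    }

    // Second pass: write into the string; +1 leaves room for the
    // terminating null that vsnprintf always writes.
    std::string result(static_cast<std::size_t>(length), '\0');
    std::vsnprintf(&result[0], result.size() + 1, format, args_copy);
    va_end(args_copy);
    return result;
}
```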

shlomnissan · 2024-09-04T21:28:27Z

src/internal/logger.hh

@@ -0,0 +1,38 @@
+#include "zarr.h"


This design seems fragile. We're incorporating a C API wrapper into a logger class that should be generalized for accessing properties—properties that ought to be defined here initially. Are there any alternative approaches we could consider?

shlomnissan · 2024-09-04T21:33:00Z

src/internal/logger.cpp

+    auto now = std::chrono::system_clock::now();
+    auto time = std::chrono::system_clock::to_time_t(now);
+    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
+                now.time_since_epoch()) %
+              1000;


Nit: I think we can organize this code better. It would be an improvement to have a private member function that returns the current timestamp as a string, which would encapsulate this code, as well as std::put_time.
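Something along these lines, as a sketch. `format_timestamp` is a hypothetical name, it is written as a free function taking the time point as a parameter so it can be tested, and it formats in UTC, which may differ from what the PR does.

```cpp
#include <chrono>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats a time point as "YYYY-MM-DD HH:MM:SS.mmm"
// in UTC.
std::string format_timestamp(std::chrono::system_clock::time_point now)
{
    const std::time_t time = std::chrono::system_clock::to_time_t(now);
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      now.time_since_epoch()) %
                    1000;

    // std::gmtime returns a pointer to shared static storage, so call
    // this under the logger's lock if several threads log at once.
    const std::tm* utc = std::gmtime(&time);
    if (!utc) {
        return {};
    }

    std::ostringstream oss;
    oss << std::put_time(utc, "%Y-%m-%d %H:%M:%S") << '.'
        << std::setfill('0') << std::setw(3) << ms.count();
    return oss.str();
}
```

Inside the logger this would be called with `std::chrono::system_clock::now()`.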

shlomnissan · 2024-09-04T21:39:15Z

src/internal/logger.hh

+#define LOG_DEBUG(...)                                                         \
+    Logger::log(LogLevel_Debug, __FILE__, __LINE__, __func__, __VA_ARGS__)
+#define LOG_INFO(...)                                                          \
+    Logger::log(LogLevel_Info, __FILE__, __LINE__, __func__, __VA_ARGS__)
+#define LOG_WARNING(...)                                                       \
+    Logger::log(LogLevel_Warning, __FILE__, __LINE__, __func__, __VA_ARGS__)
+#define LOG_ERROR(...)                                                         \
+    Logger::log(LogLevel_Error, __FILE__, __LINE__, __func__, __VA_ARGS__)
+
+#define EXPECT(e, ...)                                                         \
+    do {                                                                       \
+        if (!(e)) {                                                            \
+            const std::string __err = LOG_ERROR(__VA_ARGS__);                  \
+            throw std::runtime_error(__err);                                   \
+        }                                                                      \
+    } while (0)
+#define CHECK(e) EXPECT(e, "Expression evaluated as false:\n\t%s", #e)


I feel like all these macros can be replaced with static member functions, for example:

    template <typename... Args>
    static void debug(const char* format, Args... args) {
        log(LogLevel_Debug, __FILE__, __LINE__, __func__, format, args...);
    }

This approach improves type checking (which you get from variadic templates), and it avoids all the issues that come with macro expansions.
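One caveat: `__FILE__`, `__LINE__`, and `__func__` written inside a function body refer to that function's own location, not the caller's, so the call site still has to be captured where the log call appears. A possible middle ground, sketched under the assumption of C++17, keeps one thin macro for location capture and moves everything else into a type-checked template. `log_at` and `LOG_DEBUG_AT` are hypothetical names; C++20's `std::source_location` is another way to get the same effect without a macro.

```cpp
#include <sstream>
#include <string>

// Hypothetical type-checked sink: receives the call site explicitly and
// streams every argument with operator<<.
template <typename... Args>
std::string log_at(const char* file,
                   int line,
                   const char* func,
                   const Args&... args)
{
    std::ostringstream oss;
    oss << file << ':' << line << ' ' << func << ": ";
    (oss << ... << args); // C++17 fold expression
    return oss.str();
}

// Thin macro whose only job is to capture the caller's location.
// It needs at least one message argument.
#define LOG_DEBUG_AT(...) log_at(__FILE__, __LINE__, __func__, __VA_ARGS__)
```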

shlomnissan

Another partial review focusing on stream.settings.

shlomnissan · 2024-09-05T15:18:59Z

src/internal/stream.settings.hh

+struct ZarrDimension_s
+{
+    std::string name; /* Name of the dimension */
+    uint8_t kind;     /* Type of dimension */


The order of member variables has implications for memory layout. A general rule of thumb is ordering members from largest to smallest type. It can potentially reduce padding and make the struct more memory efficient. This comment is applicable to all structs.
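A small illustration of the effect. The two structs are made up for the example and are not the ones in this PR; on a typical 64-bit ABI the first is 24 bytes and the second 16, though exact sizes are implementation-defined.

```cpp
#include <cstdint>

// Hypothetical layout: the 8-byte member sits between two 1-byte
// members, so the compiler pads after each of them.
struct Interleaved
{
    std::uint8_t a;
    std::uint64_t b;
    std::uint8_t c;
};

// Same members, largest first: the two small members share one padded
// tail.
struct LargestFirst
{
    std::uint64_t b;
    std::uint8_t a;
    std::uint8_t c;
};

static_assert(sizeof(LargestFirst) <= sizeof(Interleaved),
              "reordering should never make this struct larger");
```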

shlomnissan · 2024-09-05T15:21:33Z

src/internal/stream.settings.hh

+{
+    std::string store_path; /* Path to the Zarr store on the local filesystem */
+
+    std::string s3_endpoint;          /* Endpoint for the S3 service */


I think there's an opportunity to organize the code better by introducing more granular types here.

    struct ZarrStreamS3Config {
        std::string endpoint;
        std::string bucket_name;
        // ...
    };

    struct ZarrStreamCompressionConfig {
        uint8_t compressor;
        uint8_t compressor_codec;
        // ...
    };

shlomnissan · 2024-09-05T15:22:58Z

src/internal/stream.settings.hh

+    bool multiscale; /* Whether to stream to multiple resolutions */
+};
+
+bool


I think [[nodiscard]] is suitable here. It makes it explicit that the return value needs to be examined, as opposed to assuming it throws an exception.
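For example, assuming C++17. `validate_store_path` is a hypothetical function used only to show the attribute; the function in this hunk is not visible in the excerpt.

```cpp
#include <string>

// Hypothetical validator: reports failure through its return value
// rather than by throwing.
[[nodiscard]] bool validate_store_path(const std::string& path)
{
    return !path.empty();
}
```

With the attribute in place, a bare `validate_store_path(p);` statement that ignores the result draws a compiler warning.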

shlomnissan · 2024-09-05T15:29:03Z