Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deep copy a VectorSchemaRoot? #465

Closed
tisonkun opened this issue Dec 24, 2024 · 9 comments
Closed

Deep copy a VectorSchemaRoot? #465

tisonkun opened this issue Dec 24, 2024 · 9 comments
Labels
Type: enhancement New feature or request

Comments

@tisonkun
Copy link
Member

tisonkun commented Dec 24, 2024

Describe the enhancement requested

I'm writing a convertor method to convert a base64 encoded byte array into Arrow batches and returns it to the user.

public List<VectorSchemaRoot> readArrowBatches(String rows, BufferAllocator allocator) {
    final List<VectorSchemaRoot> batches = new ArrayList<>();
    final byte[] data = Base64.getDecoder().decode(rows);
    final ByteArrayInputStream stream = new ByteArrayInputStream(data);
    try (final ArrowStreamReader reader = new ArrowStreamReader(stream, allocator)) {
        while (reader.loadNextBatch()) {
            batches.add(new Table(reader.getVectorSchemaRoot()).toVectorSchemaRoot());
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    return batches;
}

Since ArrowStreamReader replace the batch referred by getVectorSchemaRoot in each iteration, I have to do a deepcopy of VectorSchemaRoot every time.

Currently, I use Table's method as a workaround, but wonder if VectorSchemaRoot deserves a copy method, or I implement such a typically use case in a wrong way.

@tisonkun tisonkun added the Type: enhancement New feature or request label Dec 24, 2024
@lidavidm
Copy link
Member

You should use VectorLoader/VectorUnloader to "move" the contents of the reader's root into your own

@tisonkun
Copy link
Member Author

That seems exactly what the inner of Table does. Do we have some util or a copy method for that. Or I just wrap by myself .. It seems quite a common usage and I don't want to hook outside of arrow-java.

while (reader.loadNextBatch()) {
    final VectorSchemaRoot source = reader.getVectorSchemaRoot();
    final VectorUnloader unloader = new VectorUnloader(source);
    final VectorSchemaRoot copy = VectorSchemaRoot.create(source.getSchema(), allocator);
    final VectorLoader loader = new VectorLoader(copy);
    loader.load(unloader.getRecordBatch());
    batches.add(copy);
}

@lidavidm
Copy link
Member

That is the intended usage. What is the problem?

(Note that you can also just keep an array of the batches from the unloader, and load/stream them through a root as necessary.)

@tisonkun
Copy link
Member Author

OK thanks. Yes it seems a list of ArrowRecordBatch owns the buffer and doesn't need to tune with the lifecycle of allocator.

@tisonkun tisonkun closed this as not planned Won't fix, can't repro, duplicate, stale Dec 26, 2024
@tisonkun
Copy link
Member Author

Emmm .. No. The ArrowRecordBatch's buffer is still bound to the allocator, and it doesn't have the schema info where we need to store elsewhere.

@lidavidm
Copy link
Member

Yes, there isn't really any way of untying things from an allocator (this is intentional). There are APIs to transfer memory between allocators (or you can just keep a single allocator across different contexts).

@tisonkun
Copy link
Member Author

@lidavidm Thanks for your information! Is there some docs/cookbook for copy VectorSchemaRoot? It seems challenging to ensure the lifetime of both data and allocator are aligned and I suppose some demo code would help a lot.

@tisonkun
Copy link
Member Author

For example, when I wrote:

        while (reader.loadNextBatch()) {
            final VectorSchemaRoot source = reader.getVectorSchemaRoot();
            final VectorSchemaRoot copy = VectorSchemaRoot.create(source.getSchema(), allocator);
            new VectorLoader(copy).load(new VectorUnloader(source).getRecordBatch());
            batches.add(copy);
        }

It seems the intermediate ArrowRecordBatch should be closed but it's very easy to get it wrong and receive a runtime exception ..

@lidavidm
Copy link
Member

Unfortunately not. You should do something like

try (var batch = unloader.getRecordBatch()) {
  loader.load(batch);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants