
idea: Introduce memory catalog #412

Closed
Xuanwo opened this issue Jun 20, 2024 · 11 comments · Fixed by #475

Comments

@Xuanwo
Member

Xuanwo commented Jun 20, 2024

Hi, I came up with this idea while trying to create quick demos showcasing the capabilities and cool features of iceberg-rust. However, I found that setting up the catalog initially consumes most of the time. This isn't ideal for attracting new users or contributors.

I propose introducing a short-lived, in-memory catalog as an ideal starting point for either testing iceberg-rust or using it statelessly.

The design details are currently unclear, and I would like to seek comments and feedback on this idea. What do you think? Would you find such a catalog useful?

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

User Story A:

I'm a user of Iceberg downstream. I'm attempting to integrate Iceberg into my project and need to conduct unit tests to ensure the accuracy of my Iceberg-related code. However, I've discovered that I must first connect to a catalog. Although setting up a REST catalog is quick, it doesn't suit my needs well.

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

User Story B:

I'm an external consumer of iceberg tables. My clients will store TiB of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these iceberg tables?


I know we have StaticTable, but it will need:

use iceberg::io::FileIO;
use iceberg::table::StaticTable;
use iceberg::TableIdent;

async fn example() {
    let metadata_file_location = "s3://bucket_name/path/to/metadata.json";
    let file_io = FileIO::from_path(metadata_file_location)
        .unwrap()
        .build()
        .unwrap();
    let static_identifier = TableIdent::from_strs(["static_ns", "static_table"]).unwrap();
    let static_table =
        StaticTable::from_metadata_file(metadata_file_location, static_identifier, file_io)
            .await
            .unwrap();
    println!("{:?}", static_table.metadata());
}

I want:

let table2 = catalog
    .load_table(&TableIdent::from_strs(["default", "t2"]).unwrap())
    .await
    .unwrap();
println!("{:?}", table2.metadata());
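Conceptually, an in-memory catalog like this is just a map from table identifiers to table metadata, living only for the lifetime of the process. A self-contained sketch of that idea, where all types and method names are hypothetical stand-ins and not the eventual iceberg-rust API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for iceberg's TableIdent.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TableIdent {
    namespace: Vec<String>,
    name: String,
}

impl TableIdent {
    fn from_strs(parts: &[&str]) -> Self {
        let (name, namespace) = parts.split_last().expect("at least one part");
        TableIdent {
            namespace: namespace.iter().map(|s| s.to_string()).collect(),
            name: name.to_string(),
        }
    }
}

// A minimal in-memory catalog: no external services, state is lost on exit.
// In practice the value type would be iceberg's TableMetadata, not a String.
#[derive(Default)]
struct MemoryCatalog {
    tables: HashMap<TableIdent, String>,
}

impl MemoryCatalog {
    fn create_table(&mut self, ident: TableIdent, metadata: String) {
        self.tables.insert(ident, metadata);
    }

    fn load_table(&self, ident: &TableIdent) -> Option<&String> {
        self.tables.get(ident)
    }
}

fn main() {
    let mut catalog = MemoryCatalog::default();
    let ident = TableIdent::from_strs(&["default", "t2"]);
    catalog.create_table(ident.clone(), "metadata-for-t2".to_string());
    let table2 = catalog.load_table(&ident).expect("table exists");
    println!("{:?}", table2);
}
```

The real implementation would also need namespace operations, concurrency control (e.g. a lock around the map), and atomic commit semantics, but the core is this simple.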

@JanKaul
Collaborator

JanKaul commented Jun 20, 2024

Great idea, I think this could be really useful. We should be able to get this kind of behavior with the SQL catalog and an in-memory SQLite database.

@liurenjie1024
Contributor

+1 for this idea.

@liurenjie1024
Contributor

Not only useful in unit tests, but also in our examples for demonstration.

@Fokko
Contributor

Fokko commented Jun 20, 2024

Great idea @liurenjie1024, I'm all for it!

I'm an external consumer of iceberg tables. My clients will store TiB of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these iceberg tables?

I'm not sure if this is the best example. Ideally when you have a fully functioning catalog, you should be able to expose the catalog with the right privileges (can be behind VPNs etc). It is a bad practice to register a table in multiple catalogs, since it won't track when a table is being updated across the catalogs.

I know we have StaticTable

StaticTable serves a different purpose, and is just meant to access read-only tables.

In PyIceberg we had a MemoryCatalog in tests for a long while, and at some point there was a discussion to move this outside of the test directory. In the end we did not do this, and we used the SQLCatalog with a SQLite backend. This can work both fully in-memory and persisted locally (for example in /tmp/). I think having the ability to have some persistence will benefit both testing and demonstration, since not all data will be gone after the process exits. Also, when we implement writing, we can leverage the locking mechanism from the DBMS.

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

This can work both fully in-memory, and also persisted locally (for example in /tmp/).

Seems a great idea!

The situation differs slightly from the Rust side as we might not want to depend on sqlite, which significantly increases our build time. Perhaps we could incorporate both: memory and sqlite.
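One common way to offer both backends without forcing the sqlite build cost on everyone is Cargo feature gating. A hypothetical manifest layout (feature and dependency names are illustrative, not the crate's actual manifest):

```toml
# Hypothetical feature flags for a catalog crate: the in-memory catalog
# is dependency-free, while the sqlite-backed one pulls in a database
# driver only when explicitly enabled.
[features]
default = ["memory"]
memory = []
sqlite = ["dep:sqlx"]

[dependencies]
sqlx = { version = "0.7", optional = true, features = ["sqlite", "runtime-tokio"] }
```

With this layout, `cargo build` stays fast by default, and users who want persistence opt in with `--features sqlite`.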

@Fokko
Contributor

Fokko commented Jun 24, 2024

As long as both of them are getting maintained :)

@fqaiser94
Contributor

fqaiser94 commented Jul 14, 2024

Since I didn't note any objections, I've started working on an in-memory implementation of Catalog.
I can take a look at writing a sqlite based implementation as well after that.

@liurenjie1024
Contributor

Since I didn't note any objections, I've started working on an in-memory implementation of Catalog. I can take a look at writing a sqlite based implementation as well after that.

Thanks!

@fqaiser94
Contributor

#475
