
idea: Introduce memory catalog #412

Closed
Xuanwo opened this issue Jun 20, 2024 · 11 comments · Fixed by #475

Comments

@Xuanwo
Member

Xuanwo commented Jun 20, 2024

Hi, I came up with this idea while trying to create quick demos showcasing the capabilities and cool features of iceberg-rust. However, I found that setting up the catalog initially consumes most of the time. This isn't ideal for attracting new users or contributors.

I propose introducing a short-lived, in-memory catalog as an ideal starting point for either testing iceberg-rust or using it statelessly.

The design details are currently unclear, and I would like to seek comments and feedback on this idea. What do you think? Would you find such a catalog useful?

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

User Story A:

I'm a user of Iceberg downstream. I'm attempting to integrate Iceberg into my project and need to conduct unit tests to ensure the accuracy of my Iceberg-related code. However, I've discovered that I must first connect to a catalog. Although setting up a REST catalog is quick, it doesn't suit my needs well.

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

User Story B:

I'm an external consumer of iceberg tables. My clients will store TiB of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these iceberg tables?


I know we have StaticTable, but it will need:

use iceberg::io::FileIO;
use iceberg::table::StaticTable;
use iceberg::TableIdent;

async fn example() {
    let metadata_file_location = "s3://bucket_name/path/to/metadata.json";
    let file_io = FileIO::from_path(metadata_file_location)
        .unwrap()
        .build()
        .unwrap();
    let static_identifier = TableIdent::from_strs(["static_ns", "static_table"]).unwrap();
    let static_table =
        StaticTable::from_metadata_file(metadata_file_location, static_identifier, file_io)
            .await
            .unwrap();
    println!("{:?}", static_table.metadata());
}

I want:

let table2 = catalog
    .load_table(&TableIdent::from_strs(["default", "t2"]).unwrap())
    .await
    .unwrap();
println!("{:?}", table2.metadata());
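Conceptually, an in-memory catalog like this is just a map from table identifiers to table metadata, living only for the lifetime of the process. A self-contained sketch of that idea, where all types and method names are hypothetical stand-ins and not the eventual iceberg-rust API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for iceberg's TableIdent.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TableIdent {
    namespace: Vec<String>,
    name: String,
}

impl TableIdent {
    fn from_strs(parts: &[&str]) -> Self {
        let (name, namespace) = parts.split_last().expect("at least one part");
        TableIdent {
            namespace: namespace.iter().map(|s| s.to_string()).collect(),
            name: name.to_string(),
        }
    }
}

// A minimal in-memory catalog: no external services, state is lost on exit.
// In practice the value type would be iceberg's TableMetadata, not a String.
#[derive(Default)]
struct MemoryCatalog {
    tables: HashMap<TableIdent, String>,
}

impl MemoryCatalog {
    fn create_table(&mut self, ident: TableIdent, metadata: String) {
        self.tables.insert(ident, metadata);
    }

    fn load_table(&self, ident: &TableIdent) -> Option<&String> {
        self.tables.get(ident)
    }
}

fn main() {
    let mut catalog = MemoryCatalog::default();
    let ident = TableIdent::from_strs(&["default", "t2"]);
    catalog.create_table(ident.clone(), "metadata-for-t2".to_string());
    let table2 = catalog.load_table(&ident).expect("table exists");
    println!("{:?}", table2);
}
```

The real implementation would also need namespace operations, concurrency control (e.g. a lock around the map), and atomic commit semantics, but the core is this simple.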

@JanKaul
Collaborator

JanKaul commented Jun 20, 2024

Great idea, I think this could be really useful. We should be able to get this kind of behavior with the SQL catalog and an in-memory SQLite database.

@liurenjie1024
Contributor

+1 for this idea.

@liurenjie1024
Contributor

Not only useful in unit tests, but also in our examples for demonstration.

@Fokko
Contributor

Fokko commented Jun 20, 2024

Great idea @liurenjie1024, I'm all for it!

I'm an external consumer of iceberg tables. My clients will store TiB of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these iceberg tables?

I'm not sure if this is the best example. Ideally when you have a fully functioning catalog, you should be able to expose the catalog with the right privileges (can be behind VPNs etc). It is a bad practice to register a table in multiple catalogs, since it won't track when a table is being updated across the catalogs.

I know we have StaticTable

StaticTable serves a different purpose, and is just meant to access read-only tables.

In PyIceberg we had a MemoryCatalog in tests for a long while, and at some point there was a discussion to move this outside of the test directory. In the end we did not do this, and we used the SQLCatalog with a SQLite backend. This can work both fully in-memory and persisted locally (for example in /tmp/). I think having the ability to have some persistence will benefit both testing and demonstration, since not all data will be gone after the process exits. Also, when we implement writing, we can leverage the locking mechanism from the DBMS.

@Xuanwo
Member Author

Xuanwo commented Jun 20, 2024

This can work both fully in-memory, and also persisted locally (for example in /tmp/).

Seems a great idea!

The situation differs slightly from the Rust side as we might not want to depend on sqlite, which significantly increases our build time. Perhaps we could incorporate both: memory and sqlite.
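One common way to offer both backends without forcing the sqlite build cost on everyone is Cargo feature gating. A hypothetical manifest layout (feature and dependency names are illustrative, not the crate's actual manifest):

```toml
# Hypothetical feature flags for a catalog crate: the in-memory catalog
# is dependency-free, while the sqlite-backed one pulls in a database
# driver only when explicitly enabled.
[features]
default = ["memory"]
memory = []
sqlite = ["dep:sqlx"]

[dependencies]
sqlx = { version = "0.7", optional = true, features = ["sqlite", "runtime-tokio"] }
```

With this layout, `cargo build` stays fast by default, and users who want persistence opt in with `--features sqlite`.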

@Fokko
Contributor

Fokko commented Jun 24, 2024

As long as both of them are getting maintained :)

@fqaiser94
Contributor

fqaiser94 commented Jul 14, 2024

Since I didn't note any objections, I've started working on an in-memory implementation of Catalog.
I can take a look at writing a sqlite based implementation as well after that.

@liurenjie1024
Contributor

Since I didn't note any objections, I've started working on an in-memory implementation of Catalog. I can take a look at writing a sqlite based implementation as well after that.

Thanks!

@fqaiser94
Contributor

#475
