Compatibility issues with org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.5.0
#338
Comments
The Rust SDK is released with all other SDKs, i.e. when 1.12.0/1.11.4 is released.
There were some discussions about doing separate SDK releases - https://lists.apache.org/thread/2rfnszd4dk36jxynpj382b1717gbyv1y - but nothing happened, mainly due to a lack of interest among the PMC members... I'm afraid it would be hard for me to get two more PMC members' +1s for a release of the Rust SDK.
Strangely, I was working with the Rust API on tables generated by Spark with no such issue, but when I tried to port to Rust some code that deals with tables generated by Trino, I got this exception as well.
It doesn't seem to be a Spark-specific issue.
@a-agmon It's possible that this is not a Spark-specific issue, since it is related to the Java Iceberg library.
@Fokko If #354 is applied, iceberg-rust will no longer be able to read the manifest list files created by pre-1.5.0 Spark and pre-#354 iceberg-rust, since iceberg-rust does not read the fields by field-id.
Thanks @Fokko and @zeodtr. The referenced manifest …
Another way to resolve this, in a less workaround-ish way, is to simply capture the fact that we have a V1 schema, a V2 schema, and a V2 compatibility schema that is identical to V2 except for the names of these three fields, which follow V1. Something like this, perhaps:

pub fn parse_with_version(
    bs: &[u8],
    version: FormatVersion,
    partition_type_provider: impl Fn(i32) -> Result<Option<StructType>>,
) -> Result<ManifestList> {
    match version {
        FormatVersion::V1 => {
            let reader = Reader::with_schema(&MANIFEST_LIST_AVRO_SCHEMA_V1, bs)?;
            let values = Value::Array(reader.collect::<std::result::Result<Vec<Value>, _>>()?);
            from_value::<_serde::ManifestListV1>(&values)?.try_into(partition_type_provider)
        }
        FormatVersion::V2 => {
            // First try the regular V2 schema...
            let reader = Reader::with_schema(&MANIFEST_LIST_AVRO_SCHEMA_V2, bs)?;
            let read_result = reader.collect::<std::result::Result<Vec<Value>, _>>();
            match read_result {
                Ok(records) => {
                    let values = Value::Array(records);
                    from_value::<_serde::ManifestListV2>(&values)?
                        .try_into(&partition_type_provider)
                }
                Err(e) => {
                    // ...and on failure, fall back to the compat schema whose
                    // three count fields carry the alternate names.
                    println!("Error reading values according to V2 schema, trying to fall back to V2_COMPAT: {:?}", e);
                    let reader = Reader::with_schema(&MANIFEST_LIST_AVRO_SCHEMA_V2_COMPAT, bs)?;
                    let records = reader.collect::<std::result::Result<Vec<Value>, _>>()?;
                    let values = Value::Array(records);
                    from_value::<_serde::ManifestListV2Compat>(&values)?
                        .try_into(&partition_type_provider)
                }
            }
        }
    }
}

Check out this branch in which I implement this for my use case (using a …
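For readers following along, here is a minimal sketch of how such a MANIFEST_LIST_AVRO_SCHEMA_V2_COMPAT could be derived. The JSON constant below is a hypothetical, heavily abridged stand-in for the real V2 schema source, and the rename direction is an assumption: depending on which spelling the primary V2 schema uses, the three count-field renames may need to go the other way (the two spellings come from apache/iceberg#5338).

use apache_avro::Schema;
use once_cell::sync::Lazy;

// Hypothetical, abridged stand-in for the JSON source of the regular V2
// manifest-list schema (the real one has many more fields).
static MANIFEST_LIST_V2_SCHEMA_JSON: &str = r#"{
    "type": "record",
    "name": "manifest_file",
    "fields": [
        {"name": "manifest_path", "type": "string", "field-id": 500},
        {"name": "added_data_files_count", "type": "int", "field-id": 504},
        {"name": "existing_data_files_count", "type": "int", "field-id": 505},
        {"name": "deleted_data_files_count", "type": "int", "field-id": 506}
    ]
}"#;

static MANIFEST_LIST_AVRO_SCHEMA_V2_COMPAT: Lazy<Schema> = Lazy::new(|| {
    // Identical to V2, with only the three count fields renamed to the
    // alternate spelling; flip the direction if the primary V2 schema
    // already uses the short names.
    let compat_json = MANIFEST_LIST_V2_SCHEMA_JSON
        .replace("\"added_data_files_count\"", "\"added_files_count\"")
        .replace("\"existing_data_files_count\"", "\"existing_files_count\"")
        .replace("\"deleted_data_files_count\"", "\"deleted_files_count\"");
    Schema::parse_str(&compat_json).expect("V2_COMPAT schema must parse")
});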
If this is the case, we should update the Rust implementation to do the lookup by field-id.
This is exactly why we should never look up fields by name but use the ID instead; this avoids breaking changes.
You need to look at the schema; you can do this, for example, using Avro tools:
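For instance, here is a minimal Rust sketch (not from the thread; the file name is a placeholder) that dumps the writer schema embedded in an Avro file using the apache_avro crate, showing exactly which field names the file was written with; avro-tools' getschema command gives the same information.

use std::fs::File;
use apache_avro::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: point this at a real manifest-list file.
    let file = File::open("snap-1234567890-1-manifest-list.avro")?;
    let reader = Reader::new(file)?;
    // The writer schema is embedded in the file header and records the exact
    // field names (e.g. added_files_count vs. added_data_files_count).
    println!("{}", reader.writer_schema().canonical_form());
    Ok(())
}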
It is not based on the table format version. The change was made because of its introduction in the V2 spec, but it was applied in Java later on: apache/iceberg#5338. It was merged just after the 1.4.0 release, so it is part of the 1.5.0 release. Trino uses the Java library, so the change will also arrive there upstream. The only way forward is doing the lookups by field-id.
@a-agmon Since the problem is in the schema, IMO checking the schema itself before reading the records is more appropriate. And since the read error could be something other than a schema mismatch, it cannot simply be assumed to be one. @Fokko Reading the fields by field-id …
Thanks, @Fokko and @zeodtr, for the clarifications and explanations! It's not the most elegant solution, but something along these lines seems to resolve the issue, at least in my testing:
FormatVersion::V2 => {
    // 1. Get a map from field-id to field-name in the manifest list's expected schema.
    let manifest_file_schema =
        Self::get_record_schema(MANIFEST_LIST_AVRO_SCHEMA_V2.clone())?;
    let manifest_file_schema_fields: HashMap<String, String> =
        Self::get_manifest_schema_fields_map(manifest_file_schema, true)?;
    // 2. Get a map from field-name to field-id in the schema of the Avro file being read.
    let reader = Reader::new(bs)?;
    let file_schema = Self::get_record_schema(reader.writer_schema().clone())?;
    let file_schema_fields: HashMap<String, String> =
        Self::get_manifest_schema_fields_map(file_schema, false)?;
    // 3. Get a vec of records from the Avro file;
    // each record is a map from field-id to field-value.
    let file_records = reader.collect::<std::result::Result<Vec<Value>, _>>()?;
    let file_records_values_map: Vec<HashMap<String, Value>> =
        Self::get_avro_records_as_map(file_records, file_schema_fields)?;
    // 4. For each record (manifest file) in the Avro file's record maps,
    // traverse the expected schema: for each field-id in the schema,
    // take the field value from the record.
    let manifest_records: Vec<Value> = file_records_values_map
        .into_iter()
        .map(|file_record_fields| {
            let fields_values: Vec<_> = manifest_file_schema_fields
                .iter()
                .filter_map(|(schema_field_id, schema_field_name)| {
                    file_record_fields
                        .get(schema_field_id)
                        .map(|value| (schema_field_name.clone(), value.clone()))
                })
                .collect();
            Value::Record(fields_values)
        })
        .collect();
    let values = Value::Array(manifest_records);
    let manifest = from_value::<_serde::ManifestListV2>(&values)?;
    manifest.try_into(partition_type_provider)
}

Please let me know what you think. I'm also posting this in Slack for visibility, as I think it's sufficiently important.
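The helper functions above (get_record_schema, get_manifest_schema_fields_map, get_avro_records_as_map) aren't shown in the thread. As a rough, hypothetical sketch of the central one: a field-id-to-name map can be built from the per-field custom attributes that apache_avro 0.17 preserves when parsing a schema.

use std::collections::HashMap;
use apache_avro::Schema;

// Hypothetical sketch: map each field's Iceberg "field-id" attribute to its
// name. Assumes the apache_avro 0.17 schema model, where Schema::Record wraps
// a record whose fields expose unknown JSON attributes via custom_attributes.
fn field_id_to_name(schema: &Schema) -> HashMap<String, String> {
    let mut map = HashMap::new();
    if let Schema::Record(record) = schema {
        for field in &record.fields {
            if let Some(id) = field.custom_attributes.get("field-id") {
                map.insert(id.to_string(), field.name.clone());
            }
        }
    }
    map
}

With two such maps (one for the expected schema, one for the file's writer schema), records can be re-keyed by field-id instead of by name, which is what steps 3 and 4 above do.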
I think creating a field-id-to-field-name map is a good (interim) solution. Keep in mind that the next Avro release is planned for this week: https://lists.apache.org/thread/6pn8jztkyom8tr5vbxr1pqgwx6bj0h4c According to #131, looking up the field-id should be resolved in the next release.
@a-agmon My concerns are as follows: …
@Fokko #131 only solves the 'saving' part. Currently iceberg-rust does not save field-ids to the Avro schema …
Thanks @zeodtr. Looking forward to your comments, @Fokko @liurenjie1024 @Xuanwo
Added a PR that proposes an interim, but more elegant, solution to the problem, I think. Here is the main modification:

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
pub(super) struct ManifestFileV2 {
    pub manifest_path: String,
    pub manifest_length: i64,
    pub partition_spec_id: i32,
    pub content: i32,
    pub sequence_number: i64,
    pub min_sequence_number: i64,
    pub added_snapshot_id: i64,
    // Accept both the pre-1.5.0 and the spec'd (apache/iceberg#5338) field names.
    #[serde(alias = "added_data_files_count", alias = "added_files_count")]
    pub added_data_files_count: i32,
    #[serde(alias = "existing_data_files_count", alias = "existing_files_count")]
    pub existing_data_files_count: i32,
    #[serde(alias = "deleted_data_files_count", alias = "deleted_files_count")]
    pub deleted_data_files_count: i32,
    pub added_rows_count: i64,
    pub existing_rows_count: i64,
    pub deleted_rows_count: i64,
    pub partitions: Option<Vec<FieldSummary>>,
    pub key_metadata: Option<ByteBuf>,
}
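As a standalone illustration of why the aliases help (a sketch using serde_json rather than the Avro path, since alias resolution happens inside serde itself): both spellings of a count field deserialize into the same struct member.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Counts {
    // The field's own name and its alias are both accepted by serde.
    #[serde(alias = "added_files_count")]
    added_data_files_count: i32,
}

fn main() {
    // Pre-1.5.0 spelling...
    let old: Counts = serde_json::from_str(r#"{"added_data_files_count": 1}"#).unwrap();
    // ...and the spec'd (1.5.0+) spelling both land in the same field.
    let new: Counts = serde_json::from_str(r#"{"added_files_count": 2}"#).unwrap();
    println!("{:?} {:?}", old, new);
}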
Hi,
I've been developing a query engine that uses the iceberg-rust crate. Upon checking Iceberg compatibility with org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.4.3, I didn't encounter any issues, at least not with my engine. However, when testing with org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.5.0, I did come across a few issues. I managed to address them, either through fixes or workarounds. Here's a summary of the issues encountered and the solutions applied:
Issue 1.
In the following scenario, …
The reason behind this is that iceberg-rust doesn't include "logicalType": "map" in the Avro schema for Iceberg maps with non-string keys, which are represented as Avro arrays. To address this, I applied the not-yet-official apache_avro 0.17 from GitHub and adjusted the iceberg-rust code to align with the changed Avro Rust API. (BTW, the API change was made by an iceberg-rust developer, perhaps to fix this kind of issue.) Then I added the logical type to the schema.
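For illustration (a sketch of the assumed schema shape, not code from the issue): per the Iceberg spec, a map with a non-string key is encoded as an Avro array of key/value records, and the array carries the "logicalType": "map" annotation that iceberg-rust was omitting. The field-id values here are arbitrary.

use apache_avro::Schema;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A map<int, long> as an array of key/value records; unknown attributes
    // such as "logicalType" and "field-id" are tolerated by the parser.
    let schema_json = r#"{
        "type": "array",
        "logicalType": "map",
        "items": {
            "type": "record",
            "name": "k1_v2",
            "fields": [
                {"name": "key", "type": "int", "field-id": 1},
                {"name": "value", "type": "long", "field-id": 2}
            ]
        }
    }"#;
    let schema = Schema::parse_str(schema_json)?;
    println!("{:?}", schema);
    Ok(())
}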
Issue 2.
In the following scenario, …
Once I applied apache_avro 0.17 and started writing field-id to the Avro schema, this issue was resolved.
Issue 3.
In the following scenario, …
This error is related to the Iceberg issue apache/iceberg#8684 and iceberg-rust's inability to read Avro data by field-id, relying on field names instead. In the aforementioned Iceberg issue, an Iceberg Java developer discovered inconsistencies between the Java source code and the specification regarding the field names of the manifest_file struct. Subsequently, the source code was modified to align with the specification. As a result, Iceberg Java's Avro writers started using different (correct) field names. This adjustment didn't affect Iceberg Java, as it reads Avro data by field-id rather than by field name. However, iceberg-rust reads the Avro schema by field name, which causes the current issue.
To address this, I examined the iceberg-rust and Avro Rust code. However, implementing the functionality to read Avro data by field-id seemed to require a fair amount of time (at least for me). As a temporary solution, I applied an ad hoc workaround in manifest_list.rs, after replacing all the incorrect field names in the code. It essentially replaces the 'wrong' field names with the correct ones. I perceive this as more of a workaround than a solution, but it serves its purpose for the time being. It would be nice if a more fundamental solution could be implemented in the future, such as reading Avro data by field-id.
Thank you.