Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude sign bit from bitpacked encoding if all values are negative #2714

Open
albertlockett opened this issue Aug 8, 2024 · 12 comments
Open
Assignees
Labels
good first issue Good for newcomers

Comments

@albertlockett
Copy link
Contributor

In #2662 we added support for bitpacking signed integers in LanceV2. In #2696, an optimization was made to exclude the sign bit if all the values for a signed type are positive.

We can make a further optimization to exclude the sign bit if all the values are negative.

The way to do this could be to:

  • change the Bitpacked encoding proto message to have a flag indicating all the values are negative
    message Bitpacked {
  • In bitpack_params_for_signed_type add logic to determine if all values are negative, and if so, don't add the sign bit to the number of bits. We can also modify the return type BitpackParams as suggested here:
    fn bitpack_params_for_signed_type<T>(arr: &PrimitiveArray<T>) -> Option<BitpackParams>
  • In the decoder, if all the bits are negative then determine this from the field on the encoding proto-message instead of checking the MSB of the encoded value like we do here:
    let is_negative = is_encoded_item_negative(
    src,
    src_idx,
    src_offset,
    self.bits_per_value as usize,
    );
@albertlockett albertlockett added the good first issue Good for newcomers label Aug 8, 2024
@thinh2
Copy link

thinh2 commented Aug 20, 2024

Hi, I want to tackle this issue.

@westonpace
Copy link
Contributor

Hi, I want to tackle this issue.

Great. I don't think anyone is actively working on this so feel free to create a PR.

@albertlockett
Copy link
Contributor Author

Thanks @thinh2! If you have any questions I'm happy to help. I can also be reached on our discord (my username on there is nbtel).

@thinh2
Copy link

thinh2 commented Sep 19, 2024

@westonpace @albertlockett

I think there is something wrong with the test_bitpack_primitive function.

I tried to set breakpoint inside the bitpack_params_for_signed_type<T> function and run the test_bitpack_primitive, the program doesn't stop at bitpack_params_for_signed_type.

The same happens for BitpackedPageDecoder.decode . Does it mean that test_bitpack_primitive doesn't use the bitpack encode/decode function ?

@westonpace
Copy link
Contributor

The same happens for BitpackedPageDecoder.decode . Does it mean that test_bitpack_primitive doesn't use the bitpack encode/decode function ?

I'm pretty sure, at one point, it did not. We switched from guarding bitpacking with an environment variable (LANCE_USE_BITPACKING) to guarding bitpacking with the version (2.1) and the tests didn't get switched over. I ran into this at one point and I can't remember if my fix for it ever got merged or is still part of my in-progress PRs. I will check.

Also, in related news, @broccoliSpicy has been working on bitpacking as well, to try and utilize the pack/unpack routines in the fastlanes crate: #2886

@thinh2
Copy link

thinh2 commented Sep 23, 2024

Hi @westonpace, how can I mitigate the test issue mentioned above ? have you found the fix for the test issue mention above?

@broccoliSpicy
Copy link
Contributor

broccoliSpicy commented Sep 23, 2024

I tried to set breakpoint inside the bitpack_params_for_signed_type function and run the test_bitpack_primitive, the program doesn't stop at bitpack_params_for_signed_type.

The same happens for BitpackedPageDecoder.decode . Does it mean that test_bitpack_primitive doesn't use the bitpack encode/decode function ?

yeah, in a recent PR cleaning environmental variables, some encodings have been disabled.

to enable bitpacking, you can do something similar to this in the choose_array_encoder function:

https://github.com/broccoliSpicy/lance/blob/697af4a88308df456a42a382edd688bf0ea9a3bd/rust/lance-encoding/src/encoder.rs#L340

            DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
                if version >= LanceFileVersion::V2_1 {
                    let compressed_bit_width = compute_compressed_bit_width_for_non_neg(arrays);
                    Ok(Box::new(BitpackedForNonNegArrayEncoder::new(
                        compressed_bit_width as usize,
                        data_type.clone(),
                    )))
                } else {
                    Ok(Box::new(BasicEncoder::new(Box::new(
                        ValueEncoder::default(),
                    ))))
                }
            }

            // for signed integers, I intend to make it a cascaded encoding, a sparse array for the negative values and very wide(bit-width) values,
            // then a bitpacked array for the narrow(bit-width) values, I need `BitpackedForNeg` to be merged first
            DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => {
                if version >= LanceFileVersion::V2_1 {
                    let compressed_bit_width = compute_compressed_bit_width_for_non_neg(arrays);
                    Ok(Box::new(BitpackedForNonNegArrayEncoder::new(
                        compressed_bit_width as usize,
                        data_type.clone(),
                    )))
                } else {
                    Ok(Box::new(BasicEncoder::new(Box::new(
                        ValueEncoder::default(),
                    ))))
                }
            }

for your testing purpose, you can omit the if version >= LanceFileVersion::V2_1 line, to use bitpack encoding, you can change the BitpackedForNonNegArrayEncoder to BitpackEncoder.

@thinh2
Copy link

thinh2 commented Sep 24, 2024

Hi @broccoliSpicy, thanks for your help.

After integrating the code above, my test can use bitpack scheme. However, the test_bitpack_primitive keep failing. I tried to use the code from main branch (adding the code to enable bitpacking) and the test still failed.

Do I need to update the code to enable bitpacking anywhere else ? Here is what I added to the choose_array_encoder function:

           DataType:: Int32 => {
                    let params = bitpack_params(arrays[0].as_ref()).unwrap();
                    Ok(Box::new(BitpackedArrayEncoder::new(
                        params.num_bits,
                        params.signed,
                        //params.all_negative,
                    )))
            }

Currently, the decoder can only decode the first element correctly, other elements are 0.

@broccoliSpicy
Copy link
Contributor

broccoliSpicy commented Sep 24, 2024

yeah, there is a mistake in

src_idx += src.len() - partial_bytes_written + to_next_byte;

which caused the behavior you described:

Currently, the decoder can only decode the first element correctly, other elements are 0.

I think you can try change this line to:

            src_idx += partial_bytes_written + to_next_byte;

and the issue you described should be fixed.

However, I think there are also other issues in the current bitpack implementation, for example, in the encode:

fn encode(
&self,
data: DataBlock,
_data_type: &DataType,
buffer_index: &mut u32,
) -> Result<EncodedArray> {
// calculate the total number of bytes we need to allocate for the destination.
// this will be the number of items in the source array times the number of bits.
let dst_bytes_total = ceil(data.num_values() as usize * self.num_bits as usize, 8);
let mut dst_buffer = vec![0u8; dst_bytes_total];
let mut dst_idx = 0;
let mut dst_offset = 0;
let DataBlock::FixedWidth(unpacked) = data else {
return Err(Error::InvalidInput {
source: "Bitpacking only supports fixed width data blocks".into(),
location: location!(),
});
};
pack_bits(
&unpacked.data,
self.num_bits,
&mut dst_buffer,
&mut dst_idx,
&mut dst_offset,
);
let packed = DataBlock::FixedWidth(FixedWidthDataBlock {
bits_per_value: self.num_bits,
data: LanceBuffer::Owned(dst_buffer),
num_values: unpacked.num_values,
});
let bitpacked_buffer_index = *buffer_index;
*buffer_index += 1;
let encoding = ProtobufUtils::bitpacked_encoding(
self.num_bits,
unpacked.bits_per_value,
bitpacked_buffer_index,
self.signed_type,
);
Ok(EncodedArray {
data: packed,
encoding,
})
}
}

the encoder might also need to handle the case when DataBlock is nullable and allnull, I think you can find some inspiration in the basic.rs's encoder to add support for nullable and allnul datablocks:

fn encode(
&self,
data: DataBlock,
data_type: &DataType,
buffer_index: &mut u32,
) -> Result<EncodedArray> {
match data {
DataBlock::AllNull(_) => {
let encoding = ProtobufUtils::basic_all_null_encoding();
Ok(EncodedArray { data, encoding })
}
DataBlock::Nullable(nullable) => {
let validity_buffer_index = *buffer_index;
*buffer_index += 1;
let validity_desc = ProtobufUtils::flat_encoding(
1,
validity_buffer_index,
/*compression=*/ None,
);
let encoded_values =
self.values_encoder
.encode(*nullable.data, data_type, buffer_index)?;
let encoding =
ProtobufUtils::basic_some_null_encoding(validity_desc, encoded_values.encoding);
let encoded = DataBlock::Nullable(NullableDataBlock {
data: Box::new(encoded_values.data),
nulls: nullable.nulls,
});
Ok(EncodedArray {
data: encoded,
encoding,
})
}
_ => {
let encoded_values = self.values_encoder.encode(data, data_type, buffer_index)?;
let encoding = ProtobufUtils::basic_no_null_encoding(encoded_values.encoding);
Ok(EncodedArray {
data: encoded_values.data,
encoding,
})
}
}
}
}

@broccoliSpicy
Copy link
Contributor

After integrating the code above, my test can use bitpack scheme.

actually, there might be some other issues come up after doing this kind of encoding selection, I filled a issue #2927 and I am currently trying to find a way to fix it.

@thinh2
Copy link

thinh2 commented Sep 25, 2024

@broccoliSpicy I tried to integrate the fix you mentioned above, but there is an error Bitpacking only supports fixed width data blocks . Will #2927 fix it?

@broccoliSpicy
Copy link
Contributor

for the error Bitpacking only supports fixed width data blocks, I think you can try add code similar to this in the bitpack's encode method:
[lance/rust/lance-encoding/src/encodings/physical/basic.rs]

fn encode(
&self,
data: DataBlock,
data_type: &DataType,
buffer_index: &mut u32,
) -> Result<EncodedArray> {
match data {
DataBlock::AllNull(_) => {
let encoding = ProtobufUtils::basic_all_null_encoding();
Ok(EncodedArray { data, encoding })
}
DataBlock::Nullable(nullable) => {
let validity_buffer_index = *buffer_index;
*buffer_index += 1;
let validity_desc = ProtobufUtils::flat_encoding(
1,
validity_buffer_index,
/*compression=*/ None,
);
let encoded_values =
self.values_encoder
.encode(*nullable.data, data_type, buffer_index)?;
let encoding =
ProtobufUtils::basic_some_null_encoding(validity_desc, encoded_values.encoding);
let encoded = DataBlock::Nullable(NullableDataBlock {
data: Box::new(encoded_values.data),
nulls: nullable.nulls,
});
Ok(EncodedArray {
data: encoded,
encoding,
})
}
_ => {
let encoded_values = self.values_encoder.encode(data, data_type, buffer_index)?;
let encoding = ProtobufUtils::basic_no_null_encoding(encoded_values.encoding);
Ok(EncodedArray {
data: encoded_values.data,
encoding,
})
}
}
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants