deny missing_docs on vortex-dtype (#1182)
lwwmanning authored Nov 1, 2024
1 parent 9152390 commit 45f69cb
Showing 11 changed files with 404 additions and 30 deletions.
9 changes: 4 additions & 5 deletions README.md
@@ -5,11 +5,11 @@
[![Documentation](https://docs.rs/vortex-array/badge.svg)](https://docs.rs/vortex-array)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/vortex-array)](https://pypi.org/project/vortex-array/)

Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
in-memory, on-disk, and over-the-wire.

Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.

Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
@@ -125,7 +125,7 @@ in-memory array implementation, allowing us to defer decompression. Currently, t
Vortex's default compression strategy is based on the
[BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted
(recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
the entire chunk. This sounds like it would be very expensive, but given the logical types and basic statistics about a
chunk, it is possible to cheaply prune many encodings and ensure the search space does not explode in size.
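
To make the selection step concrete, here is a minimal sketch of the BtrBlocks-style sample-and-pick loop. All names (`CandidateEncoding`, `estimate_size`, `choose_encoding`) are illustrative and are not the actual vortex-sampling-compressor API; the real implementation also recurses into cascaded encodings and prunes candidates using the chunk's logical type and statistics.

```rust
/// Illustrative sketch only: a candidate lightweight encoding that can
/// estimate its compressed size on a sample of the chunk.
trait CandidateEncoding {
    /// Estimated compressed size in bytes of `sample` under this encoding.
    fn estimate_size(&self, sample: &[u64]) -> usize;
}

/// Take roughly every 100th element as a ~1% sample of the chunk.
fn sample(chunk: &[u64]) -> Vec<u64> {
    chunk.iter().step_by(100).copied().collect()
}

/// Try each candidate on the sample and pick the one with the smallest
/// estimated size; the winner is then used to encode the entire chunk.
fn choose_encoding<'a>(
    chunk: &[u64],
    candidates: &'a [Box<dyn CandidateEncoding>],
) -> Option<&'a dyn CandidateEncoding> {
    let s = sample(chunk);
    candidates
        .iter()
        .map(|c| c.as_ref())
        .min_by_key(|c| c.estimate_size(&s))
}
```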
@@ -263,7 +263,6 @@ In particular, the following academic papers have strongly influenced developmen
* Dominik Durner, Viktor Leis, and Thomas Neumann. [Exploiting Cloud Object Storage for High-Performance
Analytics](https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf). PVLDB, 16(11): 2769-2782, 2023.

Additionally, we benefited greatly from:

* the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
9 changes: 7 additions & 2 deletions encodings/roaring/src/integer/mod.rs
@@ -43,9 +43,14 @@ impl RoaringIntArray {

let length = bitmap.statistics().cardinality as usize;
let max = bitmap.maximum();
if max.map(|mv| mv as u64 > ptype.max_value()).unwrap_or(false) {
if max
.map(|mv| mv as u64 > ptype.max_value_as_u64())
.unwrap_or(false)
{
vortex_bail!(
"RoaringInt maximum value is greater than the maximum value for the primitive type"
"Bitmap's maximum value ({}) is greater than the maximum value for the primitive type ({})",
max.vortex_expect("Bitmap has no maximum value despite having just checked"),
ptype
);
}

2 changes: 1 addition & 1 deletion vortex-array/src/array/chunked/compute/take.rs
@@ -78,7 +78,7 @@ fn take_strict_sorted(chunked: &ChunkedArray, indices: &Array) -> VortexResult<A
// Note. Indices might not have a dtype big enough to fit chunk_begin after cast,
// if it does cast the scalar otherwise upcast the indices.
let chunk_indices =
if chunk_begin < PType::try_from(chunk_indices.dtype())?.max_value() as usize {
if chunk_begin < PType::try_from(chunk_indices.dtype())?.max_value_as_u64() as usize {
subtract_scalar(
&chunk_indices,
&Scalar::from(chunk_begin).cast(chunk_indices.dtype())?,
114 changes: 106 additions & 8 deletions vortex-dtype/src/dtype.rs
@@ -3,45 +3,56 @@ use std::hash::Hash;
use std::sync::Arc;

use itertools::Itertools;
use vortex_error::{vortex_bail, vortex_err, VortexResult};
use vortex_error::{vortex_bail, vortex_err, vortex_panic, VortexResult};
use DType::*;

use crate::field::Field;
use crate::nullability::Nullability;
use crate::{ExtDType, PType};

/// A name for a field in a struct
pub type FieldName = Arc<str>;
/// An ordered list of field names in a struct
pub type FieldNames = Arc<[FieldName]>;

pub type Metadata = Vec<u8>;

/// Array logical types.
/// The logical types of elements in Vortex arrays.
///
/// Vortex arrays preserve a single logical type, while the encodings allow for multiple
/// physical types to encode that type.
/// physical ways to encode that type.
#[derive(Debug, Clone, PartialOrd, PartialEq, Eq, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub enum DType {
/// The logical null type (only has a single value, `null`)
Null,
/// The logical boolean type (`true` or `false` if non-nullable; `true`, `false`, or `null` if nullable)
Bool(Nullability),
/// Primitive, fixed-width numeric types (e.g., `u8`, `i8`, `u16`, `i16`, `u32`, `i32`, `u64`, `i64`, `f32`, `f64`)
Primitive(PType, Nullability),
/// UTF-8 strings
Utf8(Nullability),
/// Binary data
Binary(Nullability),
/// A struct is composed of an ordered list of fields, each with a corresponding name and DType
Struct(StructDType, Nullability),
/// A variable-length list type, parameterized by a single element DType
List(Arc<DType>, Nullability),
/// Extension types are user-defined types
Extension(ExtDType, Nullability),
}

impl DType {
/// The default DType for bytes
pub const BYTES: Self = Primitive(PType::U8, Nullability::NonNullable);

/// The default DType for indices
pub const IDX: Self = Primitive(PType::U64, Nullability::NonNullable);

/// Get the nullability of the DType
pub fn nullability(&self) -> Nullability {
self.is_nullable().into()
}

/// Check if the DType is nullable
pub fn is_nullable(&self) -> bool {
use crate::nullability::Nullability::*;

@@ -57,14 +68,17 @@ impl DType {
}
}

/// Get a new DType with `Nullability::NonNullable` (but otherwise the same as `self`)
pub fn as_nonnullable(&self) -> Self {
self.with_nullability(Nullability::NonNullable)
}

/// Get a new DType with `Nullability::Nullable` (but otherwise the same as `self`)
pub fn as_nullable(&self) -> Self {
self.with_nullability(Nullability::Nullable)
}

/// Get a new DType with the given nullability (but otherwise the same as `self`)
pub fn with_nullability(&self, nullability: Nullability) -> Self {
match self {
Null => Null,
@@ -78,34 +92,42 @@
}
}

/// Check if `self` and `other` are equal, ignoring nullability
pub fn eq_ignore_nullability(&self, other: &Self) -> bool {
self.as_nullable().eq(&other.as_nullable())
}

/// Check if `self` is a `StructDType`
pub fn is_struct(&self) -> bool {
matches!(self, Struct(_, _))
}

/// Check if `self` is an unsigned integer
pub fn is_unsigned_int(&self) -> bool {
PType::try_from(self).is_ok_and(PType::is_unsigned_int)
}

/// Check if `self` is a signed integer
pub fn is_signed_int(&self) -> bool {
PType::try_from(self).is_ok_and(PType::is_signed_int)
}

/// Check if `self` is an integer (signed or unsigned)
pub fn is_int(&self) -> bool {
PType::try_from(self).is_ok_and(PType::is_int)
}

/// Check if `self` is a floating point number
pub fn is_float(&self) -> bool {
PType::try_from(self).is_ok_and(PType::is_float)
}

/// Check if `self` is a boolean
pub fn is_boolean(&self) -> bool {
matches!(self, Bool(_))
}

/// Get the `StructDType` if `self` is a `StructDType`, otherwise `None`
pub fn as_struct(&self) -> Option<&StructDType> {
match self {
Struct(s, _) => Some(s),
@@ -146,43 +168,61 @@ impl Display for DType {
}
}

/// A struct dtype is a list of names and corresponding dtypes
#[derive(Debug, Clone, PartialOrd, PartialEq, Eq, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct StructDType {
names: FieldNames,
dtypes: Arc<[DType]>,
}

/// Information about a field in a struct dtype
pub struct FieldInfo<'a> {
/// The position index of the field within the enclosing struct
pub index: usize,
/// The name of the field
pub name: Arc<str>,
/// The dtype of the field
pub dtype: &'a DType,
}

impl StructDType {
/// Create a new `StructDType` from a list of names and dtypes
pub fn new(names: FieldNames, dtypes: Vec<DType>) -> Self {
if names.len() != dtypes.len() {
vortex_panic!(
"length mismatch between names ({}) and dtypes ({})",
names.len(),
dtypes.len()
);
}
Self {
names,
dtypes: dtypes.into(),
}
}

/// Get the names of the fields in the struct
pub fn names(&self) -> &FieldNames {
&self.names
}

/// Find the index of a field by name
/// Returns `None` if the field is not found
pub fn find_name(&self, name: &str) -> Option<usize> {
self.names.iter().position(|n| n.as_ref() == name)
}

/// Get information about the referenced field, either by name or index
/// Returns an error if the field is not found
pub fn field_info(&self, field: &Field) -> VortexResult<FieldInfo> {
let index = match field {
Field::Name(name) => self
.find_name(name)
.ok_or_else(|| vortex_err!("Unknown field: {}", name))?,
Field::Index(index) => *index,
};
if index > self.names.len() {
if index >= self.names.len() {
vortex_bail!("field index out of bounds: {}", index)
}
Ok(FieldInfo {
@@ -192,10 +232,13 @@ impl StructDType {
})
}

/// Get the dtypes of the fields in the struct
pub fn dtypes(&self) -> &Arc<[DType]> {
&self.dtypes
}

/// Project a subset of fields from the struct
/// Returns an error if any of the referenced fields are not found
pub fn project(&self, projection: &[Field]) -> VortexResult<Self> {
let mut names = Vec::with_capacity(projection.len());
let mut dtypes = Vec::with_capacity(projection.len());
@@ -216,19 +259,74 @@ mod test {
use std::mem;

use crate::dtype::DType;
use crate::{Nullability, StructDType};
use crate::field::Field;
use crate::{Nullability, PType, StructDType};

#[test]
fn size_of() {
assert_eq!(mem::size_of::<DType>(), 40);
}

#[test]
fn is_nullable() {
fn nullability() {
assert!(!DType::Struct(
StructDType::new(vec![].into(), Vec::new()),
Nullability::NonNullable
)
.is_nullable());

let primitive = DType::Primitive(PType::U8, Nullability::Nullable);
assert!(primitive.is_nullable());
assert!(!primitive.as_nonnullable().is_nullable());
assert!(primitive.as_nonnullable().as_nullable().is_nullable());
}

#[test]
fn test_struct() {
let a_type = DType::Primitive(PType::I32, Nullability::Nullable);
let b_type = DType::Bool(Nullability::NonNullable);

let dtype = DType::Struct(
StructDType::new(
vec!["A".into(), "B".into()].into(),
vec![a_type.clone(), b_type.clone()],
),
Nullability::Nullable,
);
assert!(dtype.is_nullable());
assert!(dtype.as_struct().is_some());
assert!(a_type.as_struct().is_none());

let sdt = dtype.as_struct().unwrap();
assert_eq!(sdt.names().len(), 2);
assert_eq!(sdt.dtypes().len(), 2);
assert_eq!(sdt.names()[0], "A".into());
assert_eq!(sdt.names()[1], "B".into());
assert_eq!(sdt.dtypes()[0], a_type);
assert_eq!(sdt.dtypes()[1], b_type);

let proj = sdt
.project(&[Field::Index(1), Field::Name("A".into())])
.unwrap();
assert_eq!(proj.names()[0], "B".into());
assert_eq!(proj.dtypes()[0], b_type);
assert_eq!(proj.names()[1], "A".into());
assert_eq!(proj.dtypes()[1], a_type);

let field_info = sdt.field_info(&Field::Name("B".into())).unwrap();
assert_eq!(field_info.index, 1);
assert_eq!(field_info.name, "B".into());
assert_eq!(field_info.dtype, &b_type);

let field_info = sdt.field_info(&Field::Index(0)).unwrap();
assert_eq!(field_info.index, 0);
assert_eq!(field_info.name, "A".into());
assert_eq!(field_info.dtype, &a_type);

assert!(sdt.field_info(&Field::Index(2)).is_err());

assert_eq!(sdt.find_name("A"), Some(0));
assert_eq!(sdt.find_name("B"), Some(1));
assert_eq!(sdt.find_name("C"), None);
}
}
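
For readers skimming the commit, a short usage sketch of the `DType`/`StructDType` API documented above, mirroring the new `test_struct` test. The import paths (`vortex_dtype::field::Field` and the root re-exports) are assumptions inferred from the `use` statements in the diff.

```rust
// Sketch only: import paths are assumed from the `use` statements shown in the diff.
use vortex_dtype::field::Field;
use vortex_dtype::{DType, Nullability, PType, StructDType};

fn main() {
    // A nullable struct dtype with two named fields.
    let dtype = DType::Struct(
        StructDType::new(
            vec!["id".into(), "score".into()].into(),
            vec![
                DType::Primitive(PType::U64, Nullability::NonNullable),
                DType::Primitive(PType::F64, Nullability::Nullable),
            ],
        ),
        Nullability::Nullable,
    );
    assert!(dtype.is_nullable());

    // Look up a field by name and project a subset of fields,
    // as exercised by the new test_struct test.
    let sdt = dtype.as_struct().unwrap();
    let info = sdt.field_info(&Field::Name("score".into())).unwrap();
    assert_eq!(info.index, 1);

    let projected = sdt.project(&[Field::Name("id".into())]).unwrap();
    assert_eq!(projected.names().len(), 1);
}
```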
8 changes: 8 additions & 0 deletions vortex-dtype/src/extension.rs
@@ -1,11 +1,13 @@
use std::fmt::{Display, Formatter};
use std::sync::Arc;

/// A unique identifier for an extension type
#[derive(Debug, Clone, PartialEq, Eq, Ord, PartialOrd, Hash)]
#[cfg_attr(feature = "serde", derive(::serde::Serialize, ::serde::Deserialize))]
pub struct ExtID(Arc<str>);

impl ExtID {
/// Constructs a new `ExtID` from a string
pub fn new(value: Arc<str>) -> Self {
Self(value)
}
@@ -29,11 +31,13 @@ impl From<&str> for ExtID {
}
}

/// Opaque metadata for an extension type
#[derive(Debug, Clone, PartialOrd, PartialEq, Eq, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct ExtMetadata(Arc<[u8]>);

impl ExtMetadata {
/// Constructs a new `ExtMetadata` from a byte slice
pub fn new(value: Arc<[u8]>) -> Self {
Self(value)
}
@@ -51,6 +55,7 @@ impl From<&[u8]> for ExtMetadata {
}
}

/// A type descriptor for an extension type
#[derive(Debug, Clone, PartialOrd, PartialEq, Eq, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct ExtDType {
@@ -59,15 +64,18 @@ pub struct ExtDType {
}

impl ExtDType {
/// Constructs a new `ExtDType` from an `ExtID` and optional `ExtMetadata`
pub fn new(id: ExtID, metadata: Option<ExtMetadata>) -> Self {
Self { id, metadata }
}

/// Returns the `ExtID` for this extension type
#[inline]
pub fn id(&self) -> &ExtID {
&self.id
}

/// Returns the `ExtMetadata` for this extension type, if it exists
#[inline]
pub fn metadata(&self) -> Option<&ExtMetadata> {
self.metadata.as_ref()