This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Update documentation #81

Merged
merged 32 commits into from Apr 13, 2018
Changes from 18 commits
eb6dd87
update public modules
sadikovi Apr 5, 2018
331fe7d
add docs
sadikovi Apr 5, 2018
cc88b2f
add doc build in readme
Apr 5, 2018
5801b6e
update lib doc
sadikovi Apr 5, 2018
b3cc916
update docs for basic.rs
sadikovi Apr 6, 2018
efc7c37
add license header
sadikovi Apr 6, 2018
ecc1bc9
add docs for basic and errors
sadikovi Apr 7, 2018
9d3cee0
add doc for data_type
sadikovi Apr 7, 2018
9da396e
add schema doc
sadikovi Apr 8, 2018
0092fca
update schema docs
sadikovi Apr 8, 2018
1c2420e
add record docs
sadikovi Apr 9, 2018
f37efd8
add column docs
sadikovi Apr 9, 2018
a7f3bb7
add file doc and some minor updates
sadikovi Apr 9, 2018
c2e1bdd
update readme
sadikovi Apr 9, 2018
99a1353
add docs.rs link
sadikovi Apr 9, 2018
5d34f72
update docs
sadikovi Apr 10, 2018
db3ae7a
update file docs
sadikovi Apr 10, 2018
08fd9e8
add comment for column/reader.rs tests
sadikovi Apr 10, 2018
655d5b6
add parquet version, minor updates
sadikovi Apr 11, 2018
c6f350c
update comment for types.rs
sadikovi Apr 11, 2018
54a1c0d
reexport modules
sadikovi Apr 11, 2018
c130834
add compression docs
sadikovi Apr 11, 2018
e6c9803
update docs
sadikovi Apr 11, 2018
2e59245
add memory docs
sadikovi Apr 11, 2018
5bc4cd6
update global doc
sadikovi Apr 11, 2018
d566524
add decoding docs
sadikovi Apr 12, 2018
b96050b
update docs
sadikovi Apr 12, 2018
605a1fa
update docs
sadikovi Apr 12, 2018
c78f550
fix doc tests
sadikovi Apr 12, 2018
af13421
update from_thrift comments
sadikovi Apr 12, 2018
51a4965
update parquet version in readme
sadikovi Apr 13, 2018
1a11586
update comments
sadikovi Apr 13, 2018
31 changes: 31 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -4,9 +4,37 @@
[![Coverage Status](https://coveralls.io/repos/github/sunchao/parquet-rs/badge.svg?branch=master)](https://coveralls.io/github/sunchao/parquet-rs?branch=master)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![](http://meritbadge.herokuapp.com/parquet)](https://crates.io/crates/parquet)
[![Documentation](https://docs.rs/parquet/badge.svg)](https://docs.rs/parquet)

An [Apache Parquet](https://parquet.apache.org/) implementation in Rust (work in progress)

## Usage
Add this to your Cargo.toml:
```toml
[dependencies]
parquet = "0.1"
```

and this to your crate root:
```rust
extern crate parquet;
```

Example usage:
```rust
use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open(&Path::new("/path/to/file")).unwrap();
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
while let Some(record) = iter.next() {
println!("{}", record);
}
```
See the crate documentation for the available API.

## Requirements
- Rust nightly
- Thrift 0.11.0 or higher
Review comment (Owner): Not quite related, but we may want to also point out the parquet format version that we support - currently it is 2.3.2?

@@ -43,5 +71,8 @@ be printed).
## Benchmarks
Run `cargo bench` for benchmarks.

## Docs
To build documentation, run `cargo doc --no-deps`. To compile and view in the browser, run `cargo doc --no-deps --open`.

## License
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.
17 changes: 17 additions & 0 deletions build.rs
@@ -1,3 +1,20 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use std::env;
use std::fs;
use std::process::Command;
145 changes: 138 additions & 7 deletions src/basic.rs
@@ -15,6 +15,9 @@
// specific language governing permissions and limitations
// under the License.

//! Contains Rust mappings for the Thrift definition.
//! See the `parquet.thrift` file for the raw definitions of the enums listed below.

use std::convert;
use std::fmt;
use std::result;
@@ -23,11 +26,17 @@ use std::str;
use errors::ParquetError;
use parquet_thrift::parquet;


// ----------------------------------------------------------------------
// Types from the Thrift definition

/// Mirrors `parquet::Type`
// ----------------------------------------------------------------------
// Mirrors `parquet::Type`

/// Physical types supported by Parquet.
/// These types are intended to be used in combination with the encodings to control
/// the on-disk storage format.
/// For example, INT16 is not included as a type, since a good encoding of INT32 would
/// handle it.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Type {
BOOLEAN,
@@ -40,56 +49,174 @@ pub enum Type {
FIXED_LEN_BYTE_ARRAY
}

/// Mirrors `parquet::ConvertedType`
// ----------------------------------------------------------------------
// Mirrors `parquet::ConvertedType`

/// Common (logical) types used by frameworks when using Parquet.
/// These help map types in those frameworks to the base types in Parquet.
/// This is only metadata and is not needed to read or write the data.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogicalType {
NONE,
/// A BYTE_ARRAY actually contains UTF8 encoded chars.
UTF8,

/// A map is converted as an optional field containing a repeated key/value pair.
MAP,

/// A key/value pair is converted into a group of two fields.
MAP_KEY_VALUE,

/// A list is converted into an optional field containing a repeated field for its
/// values.
LIST,

/// An enum is converted into a binary field.
ENUM,

/// A decimal value.
/// This may be used to annotate binary or fixed primitive types. The
/// underlying byte array stores the unscaled value encoded as two's
/// complement using big-endian byte order (the most significant byte is the
/// zeroth element).
///
/// This must be accompanied by a (maximum) precision and a scale in the
/// SchemaElement. The precision specifies the number of digits in the decimal
/// and the scale stores the location of the decimal point. For example 1.23
/// would have precision 3 (3 total digits) and scale 2 (the decimal point is
/// 2 digits over).
DECIMAL,

/// A date stored as days since Unix epoch, encoded as the INT32 physical type.
DATE,

/// The total number of milliseconds since midnight. The value is stored as an INT32
/// physical type.
TIME_MILLIS,

/// The total number of microseconds since midnight. The value is stored as an INT64
/// physical type.
TIME_MICROS,

/// Date and time recorded as milliseconds since the Unix epoch.
/// Recorded as a physical type of INT64.
TIMESTAMP_MILLIS,

/// Date and time recorded as microseconds since the Unix epoch.
/// The value is stored as an INT64 physical type.
TIMESTAMP_MICROS,

/// An unsigned 8 bit integer value stored as INT32 physical type.
UINT_8,

/// An unsigned 16 bit integer value stored as INT32 physical type.
UINT_16,

/// An unsigned 32 bit integer value stored as INT32 physical type.
UINT_32,

/// An unsigned 64 bit integer value stored as INT64 physical type.
UINT_64,

/// A signed 8 bit integer value stored as INT32 physical type.
INT_8,

/// A signed 16 bit integer value stored as INT32 physical type.
INT_16,

/// A signed 32 bit integer value stored as INT32 physical type.
INT_32,

/// A signed 64 bit integer value stored as INT64 physical type.
INT_64,

/// A JSON document embedded within a single UTF8 column.
JSON,

/// A BSON document embedded within a single BINARY column.
BSON,

/// An interval of time.
///
/// This type annotates data stored as a FIXED_LEN_BYTE_ARRAY of length 12.
/// This data is composed of three separate little endian unsigned integers.
/// Each stores a component of a duration of time. The first integer identifies
/// the number of months associated with the duration, the second identifies
/// the number of days associated with the duration and the third identifies
/// the number of milliseconds associated with the provided duration.
/// This duration of time is independent of any particular timezone or date.
INTERVAL
}
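The DECIMAL and INTERVAL annotations above describe concrete byte layouts, so they can be illustrated directly. The sketch below is standalone Rust with hypothetical helper names (not part of this crate); it assumes the DECIMAL's unscaled value fits in an `i64`.

```rust
// Decode a DECIMAL's unscaled two's-complement big-endian bytes into an i64,
// then apply the scale. Assumes the value fits in i64.
fn decimal_to_f64(unscaled_be: &[u8], scale: i32) -> f64 {
    // Seed with the sign so the value is sign-extended as bytes shift in.
    let mut v: i64 = if unscaled_be[0] & 0x80 != 0 { -1 } else { 0 };
    for &b in unscaled_be {
        v = (v << 8) | (b as i64);
    }
    v as f64 / 10f64.powi(scale)
}

// Decode a 12-byte INTERVAL: three little-endian u32s
// (months, days, milliseconds).
fn decode_interval(bytes: &[u8; 12]) -> (u32, u32, u32) {
    let le = |i: usize| {
        u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]])
    };
    (le(0), le(4), le(8))
}

fn main() {
    // 1.23 is stored as unscaled 123 (0x7B) with scale 2.
    assert_eq!(decimal_to_f64(&[0x7B], 2), 1.23);
    // 0xFF sign-extends to -1.
    assert_eq!(decimal_to_f64(&[0xFF], 0), -1.0);
    // 1 month, 2 days, 3 milliseconds.
    let raw = [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0];
    assert_eq!(decode_interval(&raw), (1, 2, 3));
}
```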

/// Mirrors `parquet::FieldRepetitionType`
// ----------------------------------------------------------------------
// Mirrors `parquet::FieldRepetitionType`

/// Representation of field repetition in a schema.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Repetition {
/// Field is required (cannot be null) and each record has exactly 1 value.
REQUIRED,
/// Field is optional (can be null) and each record has 0 or 1 values.
OPTIONAL,
/// Field is repeated and can contain 0 or more values.
REPEATED
}

/// Mirrors `parquet::Encoding`
// ----------------------------------------------------------------------
// Mirrors `parquet::Encoding`

/// Encodings supported by Parquet.
/// Not all encodings are valid for all types. These enums are also used to specify the
/// encoding of definition and repetition levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Encoding {
/// Default byte encoding.
/// - BOOLEAN - 1 bit per value, 0 is false; 1 is true.
/// - INT32 - 4 bytes per value, stored as little-endian.
/// - INT64 - 8 bytes per value, stored as little-endian.
/// - FLOAT - 4 bytes per value, stored as little-endian.
/// - DOUBLE - 8 bytes per value, stored as little-endian.
/// - BYTE_ARRAY - 4 byte length stored as little endian, followed by bytes.
/// FIXED_LEN_BYTE_ARRAY - just the bytes are stored.
Review comment (Owner): nit: missed `-` at the beginning.

PLAIN,

/// **Deprecated** dictionary encoding.
/// The values in the dictionary are encoded using PLAIN encoding.
/// Since it is deprecated, RLE_DICTIONARY encoding is used for data pages, and PLAIN
/// encoding is used for dictionary pages.
PLAIN_DICTIONARY,

/// Group packed run length encoding.
/// Usable for definition/repetition levels encoding and boolean values.
RLE,

/// Bit packed encoding.
/// This can only be used if the data has a known max width.
/// Usable for definition/repetition levels encoding.
BIT_PACKED,

/// Delta encoding for integers, either INT32 or INT64.
/// Works best on sorted data.
DELTA_BINARY_PACKED,

/// Encoding for byte arrays to separate the length values and the data.
/// The lengths are encoded using DELTA_BINARY_PACKED encoding.
DELTA_LENGTH_BYTE_ARRAY,

/// Incremental encoding for byte arrays.
/// Prefix lengths are encoded using DELTA_BINARY_PACKED encoding.
/// Suffixes are stored using DELTA_LENGTH_BYTE_ARRAY encoding.
DELTA_BYTE_ARRAY,

/// Dictionary encoding.
/// The ids are encoded using the RLE encoding.
RLE_DICTIONARY
}
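The PLAIN layout described above is simple enough to sketch directly. The following standalone Rust (illustrative only, not this crate's encoder) writes INT32 values as little-endian and a BYTE_ARRAY as a 4-byte little-endian length prefix followed by the bytes.

```rust
// PLAIN-encode a slice of i32 values: 4 bytes each, little-endian.
fn plain_encode_i32(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

// PLAIN-encode one BYTE_ARRAY value: 4-byte little-endian length, then bytes.
fn plain_encode_byte_array(value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + value.len());
    out.extend_from_slice(&(value.len() as u32).to_le_bytes());
    out.extend_from_slice(value);
    out
}

fn main() {
    // 1 -> [1, 0, 0, 0]; 256 -> [0, 1, 0, 0]
    assert_eq!(plain_encode_i32(&[1, 256]), vec![1, 0, 0, 0, 0, 1, 0, 0]);
    // length 2, then the bytes of "hi"
    assert_eq!(plain_encode_byte_array(b"hi"), vec![2, 0, 0, 0, b'h', b'i']);
}
```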

/// Mirrors `parquet::CompressionCodec`
// ----------------------------------------------------------------------
// Mirrors `parquet::CompressionCodec`

/// Supported compression algorithms.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Compression {
UNCOMPRESSED,
@@ -101,7 +228,11 @@ pub enum Compression {
ZSTD
}

/// Mirrors `parquet::PageType`
// ----------------------------------------------------------------------
// Mirrors `parquet::PageType`

/// Page types available in the Parquet file format.
/// Note that some page types may not be supported.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageType {
DATA_PAGE,
54 changes: 54 additions & 0 deletions src/bin/parquet-read.rs
@@ -1,3 +1,57 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! Binary file to read data from a Parquet file.
//!
//! # Install
//!
//! `parquet-read` can be installed using `cargo`:
//! ```
//! cargo install parquet
//! ```
//! After this `parquet-read` should be globally available:
//! ```
//! parquet-read XYZ.parquet
//! ```
//!
//! The binary can also be built from the source code and run as follows:
//! ```
//! cargo run --bin parquet-read XYZ.parquet
//! ```
//!
//! # Usage
//!
//! ```
//! parquet-read <file-path> [num-records]
//! ```
//! where `file-path` is the path to a Parquet file and `num-records` is an optional
//! numeric argument that specifies the number of records to read from the file.
//! When not provided, all records are read.
//!
//! Note that `parquet-read` reads the full file schema; no projection or filtering is
//! applied.
//!
//! For example,
//! ```
//! parquet-read data/alltypes_plain.snappy.parquet
//!
//! parquet-read data/alltypes_plain.snappy.parquet 4
//! ```

extern crate parquet;

use std::env;
35 changes: 35 additions & 0 deletions src/bin/parquet-schema.rs
@@ -15,6 +15,41 @@
// specific language governing permissions and limitations
// under the License.

//! Binary file to print the schema and metadata of a Parquet file.
//!
//! # Install
//!
//! `parquet-schema` can be installed using `cargo`:
//! ```
//! cargo install parquet
//! ```
//! After this `parquet-schema` should be globally available:
//! ```
//! parquet-schema XYZ.parquet
//! ```
//!
//! The binary can also be built from the source code and run as follows:
//! ```
//! cargo run --bin parquet-schema XYZ.parquet
//! ```
//!
//! # Usage
//!
//! ```
//! parquet-schema <file-path> [verbose]
//! ```
//! where `file-path` is the path to a Parquet file and `verbose` is an optional boolean
//! flag that prints only the schema when set to `false` (the default when not provided),
//! or the full file metadata when set to `true`.
//! For example,
//! ```
//! parquet-schema data/alltypes_plain.snappy.parquet
//!
//! parquet-schema data/alltypes_plain.snappy.parquet false
//!
//! parquet-schema data/alltypes_plain.snappy.parquet true
//! ```

extern crate parquet;

use std::env;