use prettier to format md files #367

Merged · 3 commits · May 24, 2021
4 changes: 2 additions & 2 deletions CODE_OF_CONDUCT.md

# Code of Conduct

- [Code of Conduct for The Apache Software Foundation][1]

[1]: https://www.apache.org/foundation/policies/conduct.html
72 changes: 36 additions & 36 deletions DEVELOPERS.md

This section describes how to get started developing DataFusion.

For information on developing with Ballista, see the
[Ballista developer documentation](ballista/docs/README.md).

### Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function:
- [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
- [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
- [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
- create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
- a new variant to `BuiltinScalarFunction`
- a new entry to `FromStr` with the name of the function as called by SQL
- a new line in `return_type` with the expected return type of the function, given an incoming type
- a new line in `signature` with the signature of the function (number and types of its arguments)
- a new line in `create_physical_expr` mapping the built-in to the implementation
- tests for the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
- In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
- a new invocation of the `unary_scalar_expr!` macro for the new function.
- In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
- a new entry in the `pub use expr::{}` set.
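As a rough sketch of the registration pattern described above, the enum variant, `FromStr` entry, and `return_type` line fit together like this. The function name `my_func` is hypothetical, and the types below are simplified stand-ins for the real DataFusion/Arrow ones:

```rust
use std::str::FromStr;

// Simplified stand-in for DataFusion's BuiltinScalarFunction enum.
#[derive(Debug, PartialEq)]
enum BuiltinScalarFunction {
    Sqrt,
    MyFunc, // new variant for the hypothetical function
}

// Stand-in for the Arrow DataType used by `return_type`.
#[derive(Debug, PartialEq)]
enum DataType {
    Float64,
    Utf8,
}

impl FromStr for BuiltinScalarFunction {
    type Err = String;
    // Maps the SQL-facing name to the enum variant.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(BuiltinScalarFunction::Sqrt),
            "my_func" => Ok(BuiltinScalarFunction::MyFunc),
            other => Err(format!("unknown function: {}", other)),
        }
    }
}

// Stand-in for the `return_type` mapping: given the function and its
// argument types, report the type the function returns.
fn return_type(fun: &BuiltinScalarFunction, _arg_types: &[DataType]) -> DataType {
    match fun {
        BuiltinScalarFunction::Sqrt => DataType::Float64,
        BuiltinScalarFunction::MyFunc => DataType::Utf8,
    }
}

fn main() {
    let fun = BuiltinScalarFunction::from_str("my_func").unwrap();
    assert_eq!(fun, BuiltinScalarFunction::MyFunc);
    assert_eq!(return_type(&fun, &[DataType::Utf8]), DataType::Utf8);
}
```

The real `signature` and `create_physical_expr` entries follow the same shape: one more `match` arm per function.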

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
- [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
- [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
- [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
- create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
- a new variant to `BuiltinAggregateFunction`
- a new entry to `FromStr` with the name of the function as called by SQL
- a new line in `return_type` with the expected return type of the function, given an incoming type
- a new line in `signature` with the signature of the function (number and types of its arguments)
- a new line in `create_aggregate_expr` mapping the built-in to the implementation
- tests for the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
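The accumulator half of the checklist can be sketched with a toy version of the interface. The real DataFusion trait works over Arrow arrays and scalar values; the names below are simplified stand-ins that only show the shape of the pattern:

```rust
// Toy sketch of the accumulator pattern: state is updated value by value,
// then evaluated once at the end of the group.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> f64;
}

// A minimal SUM accumulator.
#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
    }
    fn evaluate(&self) -> f64 {
        self.sum
    }
}

fn main() {
    let mut acc = SumAccumulator::default();
    for v in [1.0, 2.0, 3.5] {
        acc.update(v);
    }
    assert_eq!(acc.evaluate(), 6.5);
}
```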

## How to display plans graphically

Expand Down
115 changes: 52 additions & 63 deletions README.md
logical query plans as well as a query optimizer and execution engine
capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.

DataFusion also supports distributed query execution via the
[Ballista](ballista/README.md) crate.

## Use Cases
the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

## Known Uses

Here are some of the projects known to use DataFusion:

- [Ballista](ballista) Distributed Compute Platform
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
- [datafusion-python](https://pypi.org/project/datafusion)
- [delta-rs](https://github.com/delta-io/delta-rs)
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
- [ROAPI](https://github.com/roapi/roapi)
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [Squirtle](https://github.com/DSLAM-UMD/Squirtle)

(if you know of another project, please submit a PR to add a link!)

Both of these examples will produce
+---+--------+
```



## Using DataFusion as a library

DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
DataFusion also includes a simple command-line interactive SQL utility. See the
- [x] Parquet primitive types
- [ ] Parquet nested types


## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes


# Supported SQL

This library currently supports many SQL constructs, including


- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`

## Supported Functions

DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

## Schema Metadata / Information Schema Support

DataFusion supports showing metadata about the available tables. This information can be accessed using the views of the ISO SQL `information_schema` schema or the DataFusion-specific `SHOW TABLES` and `SHOW COLUMNS` commands.

More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html).


To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:

```sql
> show tables;
+---------------+--------------------+------------+--------------+
```

To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the `information_schema.columns` view:

```sql
> show columns from t;
+------------+-------------+------------------+-------------+-----------+
```



## Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query
execution. The SQL types from
[sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
are mapped to Arrow types according to the following table



| SQL Data Type | Arrow DataType |
| ------------- | ------------------------------- |
| `CHAR` | `Utf8` |
| `VARCHAR` | `Utf8` |
| `UUID` | _Not yet supported_ |
| `CLOB` | _Not yet supported_ |
| `BINARY` | _Not yet supported_ |
| `VARBINARY` | _Not yet supported_ |
| `DECIMAL` | `Float64` |
| `FLOAT` | `Float32` |
| `SMALLINT` | `Int16` |
| `INT` | `Int32` |
| `BIGINT` | `Int64` |
| `REAL` | `Float64` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `DATE` | `Date32` |
| `TIME` | `Time64(TimeUnit::Millisecond)` |
| `TIMESTAMP` | `Date64` |
| `INTERVAL` | _Not yet supported_ |
| `REGCLASS` | _Not yet supported_ |
| `TEXT` | _Not yet supported_ |
| `BYTEA` | _Not yet supported_ |
| `CUSTOM` | _Not yet supported_ |
| `ARRAY` | _Not yet supported_ |
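The table above can be read as a simple lookup from SQL type name to Arrow type name. As an illustrative (not exhaustive) sketch covering a few rows:

```rust
// Illustrative lookup over rows of the SQL-to-Arrow mapping table above;
// None marks types listed as not yet supported.
fn sql_to_arrow(sql_type: &str) -> Option<&'static str> {
    match sql_type {
        "CHAR" | "VARCHAR" => Some("Utf8"),
        "FLOAT" => Some("Float32"),
        "DECIMAL" | "REAL" | "DOUBLE" => Some("Float64"),
        "SMALLINT" => Some("Int16"),
        "INT" => Some("Int32"),
        "BIGINT" => Some("Int64"),
        "BOOLEAN" => Some("Boolean"),
        "DATE" => Some("Date32"),
        "TIME" => Some("Time64(TimeUnit::Millisecond)"),
        "TIMESTAMP" => Some("Date64"),
        // UUID, CLOB, BINARY, INTERVAL, ARRAY, ... not yet supported
        _ => None,
    }
}

fn main() {
    assert_eq!(sql_to_arrow("VARCHAR"), Some("Utf8"));
    assert_eq!(sql_to_arrow("TIMESTAMP"), Some("Date64"));
    assert_eq!(sql_to_arrow("UUID"), None);
}
```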

# Architecture Overview

There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact.


- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
- (February 2021): How DataFusion is used within the Ballista project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)

# Developer's guide

11 changes: 5 additions & 6 deletions ballista/README.md

# Ballista: Distributed Compute with Apache Arrow

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
data transfer between processes.
- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
April 2021 and should be considered experimental.

## Getting Started

The [Ballista Developer Documentation](docs/README.md) and the
[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide) are currently the
best sources of information for getting started with Ballista.

8 changes: 4 additions & 4 deletions ballista/docs/README.md
specific language governing permissions and limitations
under the License.
-->

# Ballista Developer Documentation

This directory contains documentation for developers who are contributing to Ballista. If you are looking for
end-user documentation for a published release, please start with the
[DataFusion User Guide](../../docs/user-guide) instead.

## Architecture & Design

- Read the [Architecture Overview](architecture.md) to get an understanding of the scheduler and executor
processes and how distributed query execution works.

## Build, Test, Release

- Setting up a [development environment](dev-env.md).
- [Integration Testing](integration-testing.md)
