use prettier to format md files
Jiayu Liu committed May 20, 2021
1 parent 2f73558 commit 8e972b0
Showing 15 changed files with 177 additions and 172 deletions.
4 changes: 2 additions & 2 deletions CODE_OF_CONDUCT.md
@@ -19,6 +19,6 @@

# Code of Conduct

- [Code of Conduct for The Apache Software Foundation][1]

[1]: https://www.apache.org/foundation/policies/conduct.html
72 changes: 36 additions & 36 deletions DEVELOPERS.md
@@ -21,57 +21,57 @@

This section describes how you can get started developing DataFusion.

For information on developing with Ballista, see the
[Ballista developer documentation](ballista/docs/README.md).

### Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function:
  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr` mapping the built-in to the implementation
  - tests to the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well-known data and returns the expected result.
- In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
  - a new entry of the `unary_scalar_expr!` macro for the new function.
- In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
  - a new entry in the `pub use expr::{}` set.
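
The registration steps above can be sketched as a self-contained miniature. This is a hypothetical, heavily simplified stand-in for the pattern (the real `BuiltinScalarFunction`, `FromStr` impl, and `return_type` in DataFusion operate on Arrow `DataType`s and have different signatures); `Reverse` is an illustrative function name:

```rust
use std::str::FromStr;

// Hypothetical, trimmed-down stand-in for DataFusion's enum of built-ins.
#[derive(Debug, PartialEq)]
enum BuiltinScalarFunction {
    Sqrt,
    // New variant for the function being added (illustrative name):
    Reverse,
}

impl FromStr for BuiltinScalarFunction {
    type Err = String;
    // Maps the SQL-facing name to the corresponding variant.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(BuiltinScalarFunction::Sqrt),
            "reverse" => Ok(BuiltinScalarFunction::Reverse),
            other => Err(format!("unknown function: {}", other)),
        }
    }
}

// Stand-in for the `return_type` entry: given the incoming argument type,
// report the type the function returns (types are strings here for brevity).
fn return_type(fun: &BuiltinScalarFunction, arg_type: &str) -> String {
    match fun {
        BuiltinScalarFunction::Sqrt => "Float64".to_string(),
        // reverse returns the same string type it receives
        BuiltinScalarFunction::Reverse => arg_type.to_string(),
    }
}

fn main() {
    let fun = BuiltinScalarFunction::from_str("reverse").unwrap();
    assert_eq!(fun, BuiltinScalarFunction::Reverse);
    assert_eq!(return_type(&fun, "Utf8"), "Utf8");
    assert!(BuiltinScalarFunction::from_str("no_such_fn").is_err());
    println!("ok");
}
```

The point of the pattern is that every lookup (`FromStr`, `return_type`, `signature`, `create_physical_expr`) is an exhaustive `match`, so the compiler flags any registration site you forget when adding a variant.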

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
  - a new variant to `BuiltinAggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests to the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well-known data and returns the expected result.
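
The accumulator shape can be sketched as follows. This is a hypothetical miniature of the idea, not DataFusion's actual `Accumulator` trait (which works on Arrow arrays and `ScalarValue` states and is fallible); it shows the update/merge/evaluate lifecycle that makes partitioned aggregation possible:

```rust
// Hypothetical simplified trait: real accumulators carry richer state types.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn merge(&mut self, other_state: f64);
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value; // accumulate one input row
    }
    fn merge(&mut self, other_state: f64) {
        self.sum += other_state; // combine partial aggregates across partitions
    }
    fn evaluate(&self) -> f64 {
        self.sum // produce the final aggregate value
    }
}

fn main() {
    // One accumulator per partition; partial results are merged at the end.
    let mut acc = SumAccumulator::default();
    for v in [1.0, 2.0, 3.0] {
        acc.update(v);
    }
    let mut partial = SumAccumulator::default();
    partial.update(4.0);
    acc.merge(partial.evaluate());
    assert_eq!(acc.evaluate(), 10.0);
    println!("ok");
}
```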

## How to display plans graphically

115 changes: 52 additions & 63 deletions README.md
@@ -30,7 +30,7 @@ logical query plans as well as a query optimizer and execution engine
capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.

DataFusion also supports distributed query execution via the
[Ballista](ballista/README.md) crate.

## Use Cases
@@ -42,24 +42,24 @@ the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

## Known Uses

Here are some of the projects known to use DataFusion:

- [Ballista](ballista) Distributed Compute Platform
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
- [datafusion-python](https://pypi.org/project/datafusion)
- [delta-rs](https://github.com/delta-io/delta-rs)
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
- [ROAPI](https://github.com/roapi/roapi)
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [Squirtle](https://github.com/DSLAM-UMD/Squirtle)

(if you know of another project, please submit a PR to add a link!)

@@ -122,8 +122,6 @@ Both of these examples will produce
+---+--------+
```



## Using DataFusion as a library

DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
@@ -230,7 +228,6 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
- [x] Parquet primitive types
- [ ] Parquet nested types


## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
@@ -242,35 +239,32 @@ DataFusion is designed to be extensible at all points. To that end, you can prov
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes


# Supported SQL

This library currently supports many SQL constructs, including

- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
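
Taken together, a query exercising several of these constructs might look like the following (the table and column names are hypothetical, for illustration only):

```sql
-- WHERE, GROUP BY, aggregation, ALIAS, CAST, and ORDER BY in one query
SELECT c1,
       MIN(c2) AS min_c2,
       AVG(CAST(c2 AS double)) AS avg_c2
FROM example
WHERE c2 >= 0
GROUP BY c1
ORDER BY min_c2 DESC NULLS LAST;
```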

## Supported Functions

DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

## Schema Metadata / Information Schema Support

DataFusion supports showing metadata about the tables available. This information can be accessed using the views of the ISO SQL `information_schema` schema or the DataFusion-specific `SHOW TABLES` and `SHOW COLUMNS` commands.

More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html).


To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:

```sql
> show tables;
@@ -291,7 +285,7 @@ To show tables available for use in DataFusion, use the `SHOW TABLES` command o
+---------------+--------------------+------------+--------------+
```

To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the `information_schema.columns` view:

```sql
> show columns from t;
@@ -313,50 +307,45 @@ To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or
+------------+-------------+------------------+-------------+-----------+
```



## Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query
execution. The SQL types from
[sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
are mapped to Arrow types according to the following table


| SQL Data Type | Arrow DataType |
| ------------- | ------------------------------- |
| `CHAR` | `Utf8` |
| `VARCHAR` | `Utf8` |
| `UUID` | _Not yet supported_ |
| `CLOB` | _Not yet supported_ |
| `BINARY` | _Not yet supported_ |
| `VARBINARY` | _Not yet supported_ |
| `DECIMAL` | `Float64` |
| `FLOAT` | `Float32` |
| `SMALLINT` | `Int16` |
| `INT` | `Int32` |
| `BIGINT` | `Int64` |
| `REAL` | `Float64` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `DATE` | `Date32` |
| `TIME` | `Time64(TimeUnit::Millisecond)` |
| `TIMESTAMP` | `Date64` |
| `INTERVAL` | _Not yet supported_ |
| `REGCLASS` | _Not yet supported_ |
| `TEXT` | _Not yet supported_ |
| `BYTEA` | _Not yet supported_ |
| `CUSTOM` | _Not yet supported_ |
| `ARRAY` | _Not yet supported_ |
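
For example, a `CAST` resolves according to the mappings above (hypothetical table and column names):

```sql
-- SMALLINT maps to Int16 and DOUBLE maps to Float64 per the table above.
SELECT CAST(c1 AS smallint) AS c1_i16,
       CAST(c2 AS double)   AS c2_f64
FROM example;
```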

# Architecture Overview

There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact.

- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
- (February 2021): How DataFusion is used within the Ballista project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)

# Developer's guide

13 changes: 10 additions & 3 deletions datafusion/docs/cli.md
@@ -45,16 +45,23 @@ docker run -it -v $(your_data_location):/data datafusion-cli
## Usage

```
DataFusion 4.0.0-SNAPSHOT
DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries
against CSV and Parquet files as well as querying directly against in-memory data.

USAGE:
    datafusion-cli [FLAGS] [OPTIONS]

FLAGS:
    -h, --help       Prints help information
    -q, --quiet      Reduce printing other than the results and work quietly
    -V, --version    Prints version information

OPTIONS:
    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
    -p, --data-path <data-path>      Path to your data, default to current directory
    -f, --file <file>                Execute commands from file, then exit
        --format <format>            Output format (possible values: table, csv, tsv, json) [default: table]
```

Type `exit` or `quit` to exit the CLI.
@@ -64,7 +71,7 @@ Type `exit` or `quit` to exit the CLI.
Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```
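
Once registered, the table can be queried like any other. The column name below is hypothetical; use whatever the Parquet file's schema actually defines:

```sql
SELECT passenger_count, COUNT(*) AS trips
FROM taxi
GROUP BY passenger_count;
```
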
3 changes: 2 additions & 1 deletion docs/user-guide/src/distributed/client-python.md
@@ -16,6 +16,7 @@
specific language governing permissions and limitations
under the License.
-->

# Python

Coming soon.
5 changes: 3 additions & 2 deletions docs/user-guide/src/distributed/client-rust.md
@@ -16,7 +16,8 @@
specific language governing permissions and limitations
under the License.
-->

## Ballista Rust Client

The Rust client supports a `DataFrame` API as well as SQL. See the
[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
1 change: 1 addition & 0 deletions docs/user-guide/src/distributed/clients.md
@@ -16,6 +16,7 @@
specific language governing permissions and limitations
under the License.
-->

## Clients

- [Rust](client-rust.md)