use prettier to format md files
Jiayu Liu committed May 20, 2021
1 parent 2f73558 commit 8e972b0
Showing 15 changed files with 177 additions and 172 deletions.
4 changes: 2 additions & 2 deletions CODE_OF_CONDUCT.md
@@ -19,6 +19,6 @@

# Code of Conduct

- [Code of Conduct for The Apache Software Foundation][1]

[1]: https://www.apache.org/foundation/policies/conduct.html
72 changes: 36 additions & 36 deletions DEVELOPERS.md
@@ -21,57 +21,57 @@

This section describes how you can get started developing DataFusion.

For information on developing with Ballista, see the
[Ballista developer documentation](ballista/docs/README.md).

### Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function:
  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr` mapping the built-in to the implementation
  - tests to the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well-known data and returns the expected result.
- In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
  - a new entry of the `unary_scalar_expr!` macro for the new function.
- In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
  - a new entry in the `pub use expr::{}` set.
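
The registration steps above can be sketched as a self-contained miniature. This is a hypothetical, heavily simplified stand-in for the pattern (the real `BuiltinScalarFunction`, `FromStr` impl, and `return_type` in DataFusion operate on Arrow `DataType`s and have different signatures); `Reverse` is an illustrative function name:

```rust
use std::str::FromStr;

// Hypothetical, trimmed-down stand-in for DataFusion's enum of built-ins.
#[derive(Debug, PartialEq)]
enum BuiltinScalarFunction {
    Sqrt,
    // New variant for the function being added (illustrative name):
    Reverse,
}

impl FromStr for BuiltinScalarFunction {
    type Err = String;
    // Maps the SQL-facing name to the corresponding variant.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(BuiltinScalarFunction::Sqrt),
            "reverse" => Ok(BuiltinScalarFunction::Reverse),
            other => Err(format!("unknown function: {}", other)),
        }
    }
}

// Stand-in for the `return_type` entry: given the incoming argument type,
// report the type the function returns (types are strings here for brevity).
fn return_type(fun: &BuiltinScalarFunction, arg_type: &str) -> String {
    match fun {
        BuiltinScalarFunction::Sqrt => "Float64".to_string(),
        // reverse returns the same string type it receives
        BuiltinScalarFunction::Reverse => arg_type.to_string(),
    }
}

fn main() {
    let fun = BuiltinScalarFunction::from_str("reverse").unwrap();
    assert_eq!(fun, BuiltinScalarFunction::Reverse);
    assert_eq!(return_type(&fun, "Utf8"), "Utf8");
    assert!(BuiltinScalarFunction::from_str("no_such_fn").is_err());
    println!("ok");
}
```

The point of the pattern is that every lookup (`FromStr`, `return_type`, `signature`, `create_physical_expr`) is an exhaustive `match`, so the compiler flags any registration site you forget when adding a variant.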

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
  - a new variant to `BuiltinAggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests to the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well-known data and returns the expected result.
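
The accumulator shape can be sketched as follows. This is a hypothetical miniature of the idea, not DataFusion's actual `Accumulator` trait (which works on Arrow arrays and `ScalarValue` states and is fallible); it shows the update/merge/evaluate lifecycle that makes partitioned aggregation possible:

```rust
// Hypothetical simplified trait: real accumulators carry richer state types.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn merge(&mut self, other_state: f64);
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value; // accumulate one input row
    }
    fn merge(&mut self, other_state: f64) {
        self.sum += other_state; // combine partial aggregates across partitions
    }
    fn evaluate(&self) -> f64 {
        self.sum // produce the final aggregate value
    }
}

fn main() {
    // One accumulator per partition; partial results are merged at the end.
    let mut acc = SumAccumulator::default();
    for v in [1.0, 2.0, 3.0] {
        acc.update(v);
    }
    let mut partial = SumAccumulator::default();
    partial.update(4.0);
    acc.merge(partial.evaluate());
    assert_eq!(acc.evaluate(), 10.0);
    println!("ok");
}
```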

## How to display plans graphically

115 changes: 52 additions & 63 deletions README.md
@@ -30,7 +30,7 @@ logical query plans as well as a query optimizer and execution engine
capable of parallel execution against partitioned data sources (CSV
and Parquet) using threads.

DataFusion also supports distributed query execution via the
[Ballista](ballista/README.md) crate.

## Use Cases
@@ -42,24 +42,24 @@ the convenience of an SQL interface or a DataFrame API.

## Why DataFusion?

- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

## Known Uses

Here are some of the projects known to use DataFusion:

- [Ballista](ballista) Distributed Compute Platform
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
- [datafusion-python](https://pypi.org/project/datafusion)
- [delta-rs](https://github.com/delta-io/delta-rs)
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
- [ROAPI](https://github.com/roapi/roapi)
- [Tensorbase](https://github.com/tensorbase/tensorbase)
- [Squirtle](https://github.com/DSLAM-UMD/Squirtle)

(if you know of another project, please submit a PR to add a link!)

@@ -122,8 +122,6 @@ Both of these examples will produce
+---+--------+
```



## Using DataFusion as a library

DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
@@ -230,7 +228,6 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
- [x] Parquet primitive types
- [ ] Parquet nested types


## Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
@@ -242,35 +239,32 @@ DataFusion is designed to be extensible at all points. To that end, you can prov
- [x] User Defined `LogicalPlan` nodes
- [x] User Defined `ExecutionPlan` nodes


# Supported SQL

This library currently supports many SQL constructs, including

- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
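
Taken together, a query exercising several of these constructs might look like the following (the table and column names are hypothetical, for illustration only):

```sql
-- WHERE, GROUP BY, aggregation, ALIAS, CAST, and ORDER BY in one query
SELECT c1,
       MIN(c2) AS min_c2,
       AVG(CAST(c2 AS double)) AS avg_c2
FROM example
WHERE c2 >= 0
GROUP BY c1
ORDER BY min_c2 DESC NULLS LAST;
```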

## Supported Functions

DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

## Schema Metadata / Information Schema Support

DataFusion supports showing metadata about the tables available. This information can be accessed using the views of the ISO SQL `information_schema` schema or the DataFusion-specific `SHOW TABLES` and `SHOW COLUMNS` commands.

More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html).


To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:

```sql
> show tables;
@@ -291,7 +285,7 @@ To show tables available for use in DataFusion, use the `SHOW TABLES` command o
+---------------+--------------------+------------+--------------+
```

To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the `information_schema.columns` view:

```sql
> show columns from t;
@@ -313,50 +307,45 @@ To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or
+------------+-------------+------------------+-------------+-----------+
```



## Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query
execution. The SQL types from
[sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
are mapped to Arrow types according to the following table


| SQL Data Type | Arrow DataType |
| ------------- | ------------------------------- |
| `CHAR` | `Utf8` |
| `VARCHAR` | `Utf8` |
| `UUID` | _Not yet supported_ |
| `CLOB` | _Not yet supported_ |
| `BINARY` | _Not yet supported_ |
| `VARBINARY` | _Not yet supported_ |
| `DECIMAL` | `Float64` |
| `FLOAT` | `Float32` |
| `SMALLINT` | `Int16` |
| `INT` | `Int32` |
| `BIGINT` | `Int64` |
| `REAL` | `Float64` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `DATE` | `Date32` |
| `TIME` | `Time64(TimeUnit::Millisecond)` |
| `TIMESTAMP` | `Date64` |
| `INTERVAL` | _Not yet supported_ |
| `REGCLASS` | _Not yet supported_ |
| `TEXT` | _Not yet supported_ |
| `BYTEA` | _Not yet supported_ |
| `CUSTOM` | _Not yet supported_ |
| `ARRAY` | _Not yet supported_ |
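
For example, a `CAST` resolves according to the mappings above (hypothetical table and column names):

```sql
-- SMALLINT maps to Int16 and DOUBLE maps to Float64 per the table above.
SELECT CAST(c1 AS smallint) AS c1_i16,
       CAST(c2 AS double)   AS c2_f64
FROM example;
```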

# Architecture Overview

There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact.

- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
- (February 2021): How DataFusion is used within the Ballista project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)

# Developer's guide

13 changes: 10 additions & 3 deletions datafusion/docs/cli.md
@@ -45,16 +45,23 @@ docker run -it -v $(your_data_location):/data datafusion-cli
## Usage

```
DataFusion 4.0.0-SNAPSHOT
DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries
against CSV and Parquet files as well as querying directly against in-memory data.

USAGE:
    datafusion-cli [FLAGS] [OPTIONS]

FLAGS:
    -h, --help       Prints help information
    -q, --quiet      Reduce printing other than the results and work quietly
    -V, --version    Prints version information

OPTIONS:
    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
    -p, --data-path <data-path>      Path to your data, default to current directory
    -f, --file <file>                Execute commands from file, then exit
        --format <format>            Output format (possible values: table, csv, tsv, json) [default: table]
```

Type `exit` or `quit` to exit the CLI.
@@ -64,7 +71,7 @@ Type `exit` or `quit` to exit the CLI.
Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```
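
Once registered, the table can be queried like any other. The column name below is hypothetical; use whatever the Parquet file's schema actually defines:

```sql
SELECT passenger_count, COUNT(*) AS trips
FROM taxi
GROUP BY passenger_count;
```
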
3 changes: 2 additions & 1 deletion docs/user-guide/src/distributed/client-python.md
@@ -16,6 +16,7 @@
specific language governing permissions and limitations
under the License.
-->

# Python

Coming soon.
5 changes: 3 additions & 2 deletions docs/user-guide/src/distributed/client-rust.md
@@ -16,7 +16,8 @@
specific language governing permissions and limitations
under the License.
-->

## Ballista Rust Client

The Rust client supports a `DataFrame` API as well as SQL. See the
[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
1 change: 1 addition & 0 deletions docs/user-guide/src/distributed/clients.md
@@ -16,6 +16,7 @@
specific language governing permissions and limitations
under the License.
-->

## Clients

- [Rust](client-rust.md)