[Docs] Adds tutorial and stepwise instructions for EXTERN (#17501)

* draft tutorial on extern: * updated draft * updates * add sidebar, fix reference wording * update reference * final updates to reference * update * add cloud info to tutorial * fix conflict * fix link * Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * fixes * make hyperlink to console * Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * fix typos --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
apache · Dec 19, 2024 · 4648a41 · 4648a41
1 parent bf6b0fd
commit 4648a41
Show file tree

Hide file tree

Showing 4 changed files with 265 additions and 50 deletions.
diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md
@@ -72,9 +72,8 @@ FROM TABLE(
    `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is
    used to map the external data into the SQL layer.
 
-Variation 2, with the input schema expressed in SQL using an `EXTEND` clause. (See the next
-section for more detail on `EXTEND`). This format also uses named arguments to make the
-SQL a bit easier to read:
+Variation 2, with the input schema expressed in SQL using an `EXTEND` clause. See the next
+section for more detail on `EXTEND`. This format also uses named arguments to make the SQL easier to read:
 
 ```sql
 SELECT
@@ -95,12 +94,12 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte
 
 #### `EXTERN` to export to a destination
 
-`EXTERN` can be used to specify a destination where you want to export data to.
-This variation of EXTERN requires one argument, the details of the destination as specified below.
-This variation additionally requires an `AS` clause to specify the format of the exported rows.
+You can use `EXTERN` to specify a destination to export data.
+This variation of `EXTERN` accepts the details of the destination as the only argument and requires an `AS` clause to specify the format of the exported rows.
+
+When you export data, Druid creates metadata files in a subdirectory named `_symlink_format_manifest`.
+Within the `_symlink_format_manifest/manifest` directory, the `manifest` file lists absolute paths to exported files using the symlink manifest format. For example:
 
-While exporting data, some metadata files will also be created at the destination in addition to the data. These files will be created in a directory `_symlink_format_manifest`.
-- `_symlink_format_manifest/manifest`: Lists the files which were created as part of the export. The file is in the symlink manifest format, and consists of a list of absolute paths to the files created.
 ```text
 s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv
 s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker1-partition1.csv
@@ -112,8 +111,8 @@ s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker0-par
 Keep the following in mind when using EXTERN to export rows:
 - Only INSERT statements are supported.
 - Only `CSV` format is supported as an export format.
-- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with export statements.
-- You can export to Amazon S3 or local storage.
+- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with EXTERN statements.
+- You can export to Amazon S3, Google GCS, or local storage.
 - The destination provided should contain no other files or directories.
 
 When you export data, use the `rowsPerPage` context parameter to restrict the size of exported files.
@@ -128,10 +127,14 @@ SELECT
 FROM <table>
 ```
 
-##### S3
+##### S3 - Amazon S3
+
+To export results to S3, pass the `s3()` function as an argument to the `EXTERN` function.
+Export to S3 requires the `druid-s3-extensions` extension.
+For a list of S3 permissions the MSQ task engine requires to perform export, see [Permissions for durable storage](./security.md#s3).
 
-Export results to S3 by passing the function `s3()` as an argument to the `EXTERN` function. Note that this requires the `druid-s3-extensions`.
-The `s3()` function is a Druid function that configures the connection. Arguments for `s3()` should be passed as named parameters with the value in single quotes like the following example:
+The `s3()` function configures the connection to AWS.
+Pass all arguments for `s3()` as named parameters with their values enclosed in single quotes. For example:
 
 ```sql
 INSERT INTO
@@ -146,25 +149,26 @@ FROM <table>
 
 Supported arguments for the function:
 
-| Parameter   | Required | Description                                                                                                                                                                                                                                                                    | Default |
-|-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
-| `bucket`    | Yes      | The S3 bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`.                                                                                                                     | n/a     |
-| `prefix`    | Yes      | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, then the query will fail. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a     |
+| Parameter | Required | Description | Default |
+|---|---|---|---|
+| `bucket` | Yes  | S3 bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a |
+| `prefix` | Yes  | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a |
 
-The following runtime parameters must be configured to export into an S3 destination:
+Configure the following runtime parameters to export to an S3 destination:
 
-| Runtime Parameter                            | Required | Description                                                                                                                                                                                                                          | Default |
-|----------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
-| `druid.export.storage.s3.allowedExportPaths` | Yes      | An array of S3 prefixes that are whitelisted as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]`    | n/a |
-| `druid.export.storage.s3.tempLocalDir`       | No       | Directory used on the local storage of the worker to store temporary files required while uploading the data. Uses the task temporary directory by default.                                                                          | n/a |
-| `druid.export.storage.s3.maxRetry`           | No       | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors.                                                                                                                                      | 10  |
-| `druid.export.storage.s3.chunkSize`          | No       | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+| Runtime parameter | Required | Description | Default |
+|---|---|---|---|
+| `druid.export.storage.s3.allowedExportPaths` | Yes | Array of S3 prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to S3. | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Maximum number of  attempts for S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to S3, but require more disk space to store temporary chunks. | 100MiB |
 
+##### GOOGLE - Google Cloud Storage
 
-##### GS
+To export query results to Google Cloud Storage (GCS), pass the `google()` function as an argument to the `EXTERN` function.
+Export to GCS requires the `druid-google-extensions` extension.
 
-Export results to GCS by passing the function `google()` as an argument to the `EXTERN` function. Note that this requires the `druid-google-extensions`.
-The `google()` function is a Druid function that configures the connection. Arguments for `google()` should be passed as named parameters with the value in single quotes like the following example:
+The `google()` function configures the connection to GCS. Pass the arguments for `google()` as named parameters with their values enclosed in single quotes. For example:
 
 ```sql
 INSERT INTO
@@ -179,29 +183,30 @@ FROM <table>
 
 Supported arguments for the function:
 
-| Parameter   | Required | Description                                                                                                                                                                                                                                                                        | Default |
-|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
-| `bucket`    | Yes      | The GS bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.google.allowedExportPaths`.                                                                                                                     | n/a     |
-| `prefix`    | Yes      | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, then the query will fail. The bucket and prefix combination should be whitelisted in `druid.export.storage.google.allowedExportPaths`. | n/a     |
+| Parameter   | Required | Description | Default |
+|---|---|---|---|
+| `bucket`    | Yes | GCS bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a |
+| `prefix` | Yes  | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a |
 
-The following runtime parameters must be configured to export into a GCS destination:
+Configure the following runtime parameters to export query results to a GCS destination:
 
-| Runtime Parameter                                | Required | Description                                                                                                                                                                                                                   | Default |
-|--------------------------------------------------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
-| `druid.export.storage.google.allowedExportPaths` | Yes      | An array of GS prefixes that are allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a     |
-| `druid.export.storage.google.tempLocalDir`       | No       | Directory used on the local storage of the worker to store temporary files required while uploading the data. Uses the task temporary directory by default.                                                                   | n/a     |
-| `druid.export.storage.google.maxRetry`           | No       | Defines the max number times to attempt GS API calls to avoid failures due to transient errors.                                                                                                                               | 10      |
-| `druid.export.storage.google.chunkSize`          | No       | Defines the size of each chunk to temporarily store in `tempDir`. A large chunk size reduces the API calls to GS; however, it requires more disk space to store the temporary chunks.                                         | 4MiB    |
+| Runtime parameter | Required | Description | Default |
+|---|---|---|---|
+| `druid.export.storage.google.allowedExportPaths` | Yes | Array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For example: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a     |
+| `druid.export.storage.google.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to GCS. | n/a |
+| `druid.export.storage.google.maxRetry` | No |  Maximum number of attempts for GCS API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks. | 4 MiB |
 
-##### LOCAL
 
-You can export to the local storage, which exports the results to the filesystem of the MSQ worker.
+##### LOCAL - local file storage
+
+You can export queries to local storage. This process writes the results to the filesystem on the MSQ worker.
 This is useful in a single node setup or for testing but is not suitable for production use cases.
 
-Export results to local storage by passing the function `LOCAL()` as an argument for the `EXTERN FUNCTION`.
-To use local storage as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the Indexer/Middle Manager.
-This value must be set to an absolute path on the local machine. Exporting data will be allowed to paths which match the prefix set by this value.
-Arguments to `LOCAL()` should be passed as named parameters with the value in single quotes in the following example:
+To export results to local storage, pass the `LOCAL()` function as an argument to the EXTERN function.
+You must configure the runtime property `druid.export.storage.baseDir` as an absolute path on the Indexer or Middle Manager to use local storage as an export destination.
+You can export data to paths that match this value as a prefix.
+Pass all arguments to `LOCAL()` as named parameters with values enclosed in single quotes. For example:
 
 ```sql
 INSERT INTO
@@ -214,13 +219,13 @@ SELECT
 FROM <table>
 ```
 
-Supported arguments to the function:
+Supported arguments for the function:
 
-| Parameter   | Required | Description                                                                                                                                                                                                                                              | Default |
-|-------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --|
-| `exportPath`  | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` used as the destination to export the results to. The export query expects the destination to be empty. If the location includes other files or directories, then the query will fail. | n/a |
+| Parameter | Required | Description | Default |
+|---|---|---|---|
+| `exportPath`  | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` where Druid exports the query results. The destination must be empty. If the location includes other files or directories, the query will fail. | n/a |
 
-For more information, see [Read external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern).
+For more information, see [Export external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern).
 
 ### `INSERT`
 

diff --git a/docs/multi-stage-query/security.md b/docs/multi-stage-query/security.md
@@ -81,3 +81,6 @@ The MSQ task engine needs the following permissions for pushing, fetching, and r
 - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write` to write files in durable storage.
 - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action` to create files in durable storage.
 - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete` to delete files when they're no longer needed.
+
+<!--TBD GCS-->
+