
feat(datasets): add SparkStreamingDataSet #198

Merged
merged 100 commits into main from add-stream-datasets on May 31, 2023

Commits (100)
46bb394
Fix links on GitHub issue templates (#150)
astrojuanlu Apr 11, 2023
c9421ae
add spark_stream_dataset.py
tingtingQB Apr 12, 2023
63f578a
Migrate most of `kedro-datasets` metadata to `pyproject.toml` (#161)
astrojuanlu Apr 12, 2023
4b387ff
restructure the strean dataset to align with the other spark dataset
tingtingQB Apr 13, 2023
39ad9fd
adding README.md for specification
tingtingQB Apr 13, 2023
69eb8be
Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py
tingtingQB Apr 13, 2023
3106068
rename the dataset
tingtingQB Apr 13, 2023
b8141a7
resolve comments
tingtingQB Apr 17, 2023
738625e
fix format and pylint
tingtingQB Apr 17, 2023
a54cc67
Update kedro-datasets/kedro_datasets/spark/README.md
tingtingQB Apr 17, 2023
b924ad6
add unit tests and SparkStreamingDataset in init.py
tingtingQB Apr 21, 2023
743b823
add unit tests
tingtingQB Apr 25, 2023
3bb3717
update test_save
tingtingQB Apr 25, 2023
ae3bc87
Upgrade Polars (#171)
astrojuanlu Apr 17, 2023
eb634a1
if release is failed, it return exit code and fail the CI (#158)
noklam Apr 17, 2023
115940b
Migrate `kedro-airflow` to static metadata (#172)
astrojuanlu Apr 18, 2023
35231af
Migrate `kedro-telemetry` to static metadata (#174)
astrojuanlu Apr 18, 2023
8c2ea1b
ci: port lint, unit test, and e2e tests to Actions (#155)
ankatiyar Apr 19, 2023
a73b216
Migrate `kedro-docker` to static metadata (#173)
astrojuanlu Apr 19, 2023
7f4527d
Introdcuing .gitpod.yml to kedro-plugins (#185)
noklam Apr 21, 2023
57a11d6
sync APIDataSet from kedro's `develop` (#184)
noklam Apr 24, 2023
11c3888
formatting
tingtingQB Apr 25, 2023
634d884
formatting
tingtingQB May 1, 2023
9e8f55c
formatting
tingtingQB May 1, 2023
dbdf19c
formatting
tingtingQB May 1, 2023
4e49fd9
Merge remote-tracking branch 'origin/add-stream-datasets' into add-st…
tingtingQB May 1, 2023
1a7a477
add spark_stream_dataset.py
tingtingQB Apr 12, 2023
e877944
restructure the strean dataset to align with the other spark dataset
tingtingQB Apr 13, 2023
09e9cf2
adding README.md for specification
tingtingQB Apr 13, 2023
2e30ec0
Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py
tingtingQB Apr 13, 2023
6147636
rename the dataset
tingtingQB Apr 13, 2023
29376e9
resolve comments
tingtingQB Apr 17, 2023
42ed37a
fix format and pylint
tingtingQB Apr 17, 2023
d93d9b9
Update kedro-datasets/kedro_datasets/spark/README.md
tingtingQB Apr 17, 2023
5b83444
add unit tests and SparkStreamingDataset in init.py
tingtingQB Apr 21, 2023
5b0630e
add unit tests
tingtingQB Apr 25, 2023
1433808
update test_save
tingtingQB Apr 25, 2023
c7778b5
formatting
tingtingQB Apr 25, 2023
7341429
formatting
tingtingQB May 1, 2023
d8d3bc2
formatting
tingtingQB May 1, 2023
be4a3e5
formatting
tingtingQB May 1, 2023
d3bc0d2
Merge remote-tracking branch 'origin/add-stream-datasets' into add-st…
tingtingQB May 1, 2023
e39c639
lint
tingtingQB May 2, 2023
66440f4
lint
tingtingQB May 2, 2023
0ed5b90
lint
tingtingQB May 2, 2023
04c623b
update test cases
tingtingQB May 2, 2023
a76f944
add negative test
tingtingQB May 2, 2023
30b002d
remove code snippets fpr testing
tingtingQB May 2, 2023
9bef3a2
lint
tingtingQB May 2, 2023
0bb5fe1
update tests
tingtingQB May 2, 2023
e0ebe27
update test and remove redundacy
tingtingQB May 4, 2023
5bb5766
linting
tingtingQB May 4, 2023
2075781
refactor file format
kuriantom369 May 4, 2023
e8ea0d3
fix read me file
kuriantom369 May 4, 2023
f08dd09
docs: Add community contributions (#199)
astrojuanlu May 4, 2023
24bb527
adding test for raise error
tingtingQB May 4, 2023
437e77e
update test and remove redundacy
tingtingQB May 4, 2023
a3fdbf6
linting
tingtingQB May 4, 2023
9d60f25
refactor file format
kuriantom369 May 4, 2023
ced007d
fix read me file
kuriantom369 May 4, 2023
0b88324
adding test for raise error
tingtingQB May 4, 2023
ed26aad
fix readme file
kuriantom369 May 4, 2023
170b092
fix readme
kuriantom369 May 4, 2023
e63a53a
fix conflicts
kuriantom369 May 4, 2023
d986c75
fix ci erors
kuriantom369 May 4, 2023
88e6ee4
Merge branch 'kedro-org:main' into add-stream-datasets
tingtingQB May 4, 2023
64232fa
fix lint issue
tingtingQB May 5, 2023
8a61b41
update class documentation
kuriantom369 May 5, 2023
37e66e8
add additional test cases
kuriantom369 May 16, 2023
07032a8
add s3 read test cases
kuriantom369 May 16, 2023
2470de1
add s3 read test cases
kuriantom369 May 16, 2023
c4e0f4e
add s3 read test case
kuriantom369 May 16, 2023
7e3555e
test s3 read
kuriantom369 May 16, 2023
6a0029d
remove redundant test cases
kuriantom369 May 17, 2023
e8f6696
fix streaming dataset configurations
kuriantom369 May 23, 2023
9a5ebad
update streaming datasets doc
tingtingQB May 25, 2023
eacdd46
resolve comments re documentation
tingtingQB May 25, 2023
68b6e1b
bugfix lint
tingtingQB May 25, 2023
5b2a479
update link
tingtingQB May 25, 2023
b94f211
revert the changes on CI
noklam May 26, 2023
9381816
test(docker): remove outdated logging-related step (#207)
noklam May 17, 2023
373e166
ci: ensure plugin requirements get installed in CI (#208)
deepyaman May 18, 2023
f033b95
ci: Migrate the release workflow from CircleCI to GitHub Actions (#203)
SajidAlamQB May 18, 2023
3fdb71c
build: Relax Kedro bound for `kedro-datasets` (#140)
merelcht May 18, 2023
b08aa6f
ci: don't run checks on both `push`/`pull_request` (#192)
deepyaman May 18, 2023
148b464
chore: delete extra space ending check-release.yml (#210)
deepyaman May 19, 2023
be2431c
ci: Create merge-gatekeeper.yml to make sure PR only merged when all …
noklam May 19, 2023
74a211f
ci: Remove the CircleCI setup (#209)
SajidAlamQB May 19, 2023
9d7820a
feat: Dataset API add `save` method (#180)
McDonnellJoseph May 22, 2023
36de4b9
ci: Automatically extract release notes for GitHub Releases (#212)
ankatiyar May 22, 2023
870e623
feat: Add metadata attribute to datasets (#189)
AhdraMeraliQB May 22, 2023
9d66cc8
feat: Add ManagedTableDataset for managed Delta Lake tables in Databr…
jmholzer May 22, 2023
0aaa922
docs: Update APIDataset docs and refactor (#217)
noklam May 22, 2023
ccec03b
feat: Release `kedro-datasets` version `1.3.0` (#219)
jmholzer May 22, 2023
c2a7128
docs: Fix APIDataSet docstring (#220)
astrojuanlu May 23, 2023
64446dc
Update kedro-datasets/tests/spark/test_spark_streaming_dataset.py
kuriantom369 May 30, 2023
497001d
Update kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py
kuriantom369 May 30, 2023
7f25f3c
Update kedro-datasets/setup.py
kuriantom369 May 30, 2023
bd88b99
Merge branch 'main' into add-stream-datasets
kuriantom369 May 30, 2023
c094db1
fix linting issue
kuriantom369 May 30, 2023
44 changes: 44 additions & 0 deletions kedro-datasets/kedro_datasets/spark/README.md
@@ -0,0 +1,44 @@
# Spark Streaming
Review comment (Member): Really like how concise this file is!

``SparkStreamingDataSet`` loads and saves data to streaming DataFrames.
See [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) for details.

To work with multiple streaming nodes, two hooks are required:
- one for integrating PySpark, see [Build a Kedro pipeline with PySpark](https://docs.kedro.org/en/stable/integrations/pyspark_integration.html) for details (a sketch is shown below)
- one for keeping the streaming queries running until they terminate on their own or raise an exception (see the `SparkStreamsHook` example below)

#### Supported file formats

Supported file formats are:

- Text
- CSV
- JSON
- ORC
- Parquet
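
#### Example SparkHooks for PySpark integration:

A minimal sketch of the PySpark integration hook, adapted from the pattern in the linked guide. The exact config-loader call and the session settings are assumptions and depend on your Kedro version and project setup.

```python
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config in conf/base/spark.yml."""
        # Load the Spark configuration via the project's config loader
        # (the call signature may differ between Kedro versions).
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise and register the SparkSession for the whole run.
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```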

#### Example SparkStreamsHook:

```python
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkStreamsHook:
    @hook_impl
    def after_pipeline_run(self) -> None:
        """Starts a spark streaming await session
        once the pipeline reaches the last node
        """
        spark = SparkSession.builder.getOrCreate()
        spark.streams.awaitAnyTermination()
```
To make the application work with the Kafka format, the respective Spark configuration needs to be added to ``conf/base/spark.yml``.

#### Example spark.yml:

```yaml
spark.driver.maxResultSize: 3g
spark.scheduler.mode: FAIR

```
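When reading from or writing to Kafka, the Kafka connector package also has to be made available to Spark. A hedged sketch of such an addition to ``spark.yml`` (the package coordinates and version are assumptions and must match your Spark and Scala versions):

```yaml
spark.driver.maxResultSize: 3g
spark.scheduler.mode: FAIR
# Assumed coordinates; pick the artifact matching your Spark/Scala versions.
spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0
```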
10 changes: 9 additions & 1 deletion kedro-datasets/kedro_datasets/spark/__init__.py
@@ -1,6 +1,12 @@
"""Provides I/O modules for Apache Spark."""

__all__ = ["SparkDataSet", "SparkHiveDataSet", "SparkJDBCDataSet", "DeltaTableDataSet"]
__all__ = [
"SparkDataSet",
"SparkHiveDataSet",
"SparkJDBCDataSet",
"DeltaTableDataSet",
"SparkStreamingDataSet",
]

from contextlib import suppress

@@ -12,3 +18,5 @@
from .spark_jdbc_dataset import SparkJDBCDataSet
with suppress(ImportError):
from .deltatable_dataset import DeltaTableDataSet
with suppress(ImportError):
from .spark_streaming_dataset import SparkStreamingDataSet
155 changes: 155 additions & 0 deletions kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py
@@ -0,0 +1,155 @@
"""SparkStreamingDataSet to load and save a PySpark Streaming DataFrame."""
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

from kedro.io.core import AbstractDataSet
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.utils import AnalysisException

from kedro_datasets.spark.spark_dataset import (
SparkDataSet,
_split_filepath,
_strip_dbfs_prefix,
)


class SparkStreamingDataSet(AbstractDataSet):
"""``SparkStreamingDataSet`` loads data into Spark Streaming Dataframe objects.
Example usage for the
`YAML API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
.. code-block:: yaml
raw.new_inventory:
type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet
Review suggestion (Member): use ``type: spark.SparkStreamingDataSet`` instead of ``type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet``.

filepath: data/01_raw/stream/inventory/
file_format: json
save_args:
output_mode: append
checkpoint: data/04_checkpoint/raw_new_inventory
header: True
load_args:
schema:
filepath: data/01_raw/schema/inventory_schema.json
"""

DEFAULT_LOAD_ARGS = {} # type: Dict[str, Any]
DEFAULT_SAVE_ARGS = {} # type: Dict[str, Any]

def __init__(
self,
filepath: str = "",
file_format: str = "",
save_args: Dict[str, Any] = None,
load_args: Dict[str, Any] = None,
) -> None:
"""Creates a new instance of SparkStreamingDataSet.
Args:
filepath: Filepath in POSIX format to a Spark dataframe. When using Databricks,
specify ``filepath``s starting with ``/dbfs/``. For message brokers such as
Kafka, a filepath is not required.
file_format: File format used during load and save
operations. The formats supported by the running
SparkContext include parquet, csv and delta. For a list of supported
formats, please refer to the Apache Spark documentation at
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
load_args: Load args passed to the Spark DataFrameReader load method.
They are dependent on the selected file format. You can find
a list of read options for each supported format
in the Spark DataFrame read documentation:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
Please note that a schema is mandatory for a streaming DataFrame
if ``schemaInference`` is not True.
save_args: Save args passed to the Spark DataFrame write options.
Similar to ``load_args``, they are dependent on the selected file
format. You can pass ``mode`` and ``partitionBy`` to specify
your overwrite mode and partitioning respectively. You can find
a list of options for each format in the Spark DataFrame
write documentation:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
"""
self._file_format = file_format
self._save_args = save_args
self._load_args = load_args

fs_prefix, filepath = _split_filepath(filepath)

self._fs_prefix = fs_prefix
self._filepath = PurePosixPath(filepath)

# Handle default load and save arguments
self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
if load_args is not None:
self._load_args.update(load_args)
self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
if save_args is not None:
self._save_args.update(save_args)

# Handle schema load argument
self._schema = self._load_args.pop("schema", None)
if self._schema is not None:
if isinstance(self._schema, dict):
Review comment (Member): Nit: I wonder why this is a nested if statement rather than using and? But it's the same in SparkDataSet, so I guess it's consistent. 🤷 Quite possibly I did it when adding schema handling.

Review comment (Contributor): I missed these comments; maybe it should just inherit the SparkDataSet class? #135 I think in general we need to look at all of SparkDataSet; a lot of it is weird, but it's quite tricky to remove the code. The path handling is particularly confusing because it's unique to Spark. @deepyaman

self._schema = SparkDataSet._load_schema_from_file(self._schema)
Review thread on lines +90 to +92:

Contributor: Is it possible to have an empty schema at all, given InferSchema is not enabled?

Contributor (author): An empty schema will throw an error, as structured streaming requires a schema by default.

Contributor: The problem is that streaming jobs will initially run with an empty schema, but when the job is killed and restarted a schema mismatch error is thrown (not always; mostly when dealing with timestamp columns and the like). Enforcing a schema inference file/struct prevents this issue.

Member: @noklam is it OK from a design perspective that SparkStreamingDataSet uses a private method of SparkDataSet? Feels a bit off to me, but perhaps there are no clear issues if their requirements are the same.

Member: Fine with requiring a schema; the schema concept is more critical in streaming anyway.

Contributor: @deepyaman the requirements are the same as in the case of SparkDataSet; we need to make it mandatory.
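
For illustration, one way such a schema file can be produced ahead of time. This is a hedged sketch: the paths are illustrative, and it assumes the schema file holds a ``StructType`` serialised with ``df.schema.json()``, the JSON representation that ``StructType.fromJson`` can read back and that the shared ``SparkDataSet`` schema handling is built around.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Infer the schema once from a static sample of the data (hypothetical path).
sample_df = spark.read.json("data/01_raw/stream/inventory_sample.json")

# Persist the inferred schema so the streaming dataset can load it explicitly.
with open("data/01_raw/schema/inventory_schema.json", "w", encoding="utf-8") as f:
    f.write(sample_df.schema.json())
```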


def _describe(self) -> Dict[str, Any]:
"""Returns a dict that describes attributes of the dataset."""
return {
"filepath": self._fs_prefix + str(self._filepath),
"file_format": self._file_format,
"load_args": self._load_args,
"save_args": self._save_args,
}

@staticmethod
def _get_spark():
return SparkSession.builder.getOrCreate()

def _load(self) -> DataFrame:
"""Loads data from filepath.
If the connector type is Kafka then no filepath is required; the schema needs to be
separated from ``load_args``.
Returns:
Data from filepath as a PySpark streaming DataFrame.
"""
load_path = _strip_dbfs_prefix(self._fs_prefix + str(self._filepath))
data_stream_reader = (
self._get_spark()
.readStream.schema(self._schema)
.format(self._file_format)
.options(**self._load_args)
)
return data_stream_reader.load(load_path)

def _save(self, data: DataFrame) -> None:
"""Saves pyspark dataframe.
Args:
data: PySpark streaming dataframe for saving
"""
save_path = _strip_dbfs_prefix(self._fs_prefix + str(self._filepath))
output_constructor = data.writeStream.format(self._file_format)

(
output_constructor.option(
"checkpointLocation", self._save_args.pop("checkpoint")
)
.option("path", save_path)
.outputMode(self._save_args.pop("output_mode"))
.options(**self._save_args)
.start()
)

def _exists(self) -> bool:
load_path = _strip_dbfs_prefix(self._fs_prefix + str(self._filepath))

try:
self._get_spark().readStream.schema(self._schema).load(
load_path, self._file_format
)
except AnalysisException as exception:
if (
exception.desc.startswith("Path does not exist:")
or "is not a Streaming data" in exception.desc
):
return False
raise
return True
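
For orientation, a minimal sketch of using the new dataset directly from Python, mirroring the YAML example in the class docstring (the paths are illustrative):

```python
from kedro_datasets.spark import SparkStreamingDataSet

dataset = SparkStreamingDataSet(
    filepath="data/01_raw/stream/inventory/",
    file_format="json",
    load_args={"schema": {"filepath": "data/01_raw/schema/inventory_schema.json"}},
    save_args={"checkpoint": "data/04_checkpoint/raw_new_inventory", "output_mode": "append"},
)

streaming_df = dataset.load()  # returns a streaming PySpark DataFrame
dataset.save(streaming_df)     # starts a writeStream query using the checkpoint location
```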
13 changes: 8 additions & 5 deletions kedro-datasets/setup.py
@@ -50,10 +50,15 @@ def _collect_requirements(requires):
"plotly.PlotlyDataSet": [PANDAS, "plotly>=4.8.0, <6.0"],
"plotly.JSONDataSet": ["plotly>=4.8.0, <6.0"],
}
polars_require = {"polars.CSVDataSet": [POLARS],}
polars_require = {
"polars.CSVDataSet": [POLARS]
}
redis_require = {"redis.PickleDataSet": ["redis~=4.1"]}
snowflake_require = {
"snowflake.SnowparkTableDataSet": ["snowflake-snowpark-python~=1.0.0", "pyarrow~=8.0"]
"snowflake.SnowparkTableDataSet": [
"snowflake-snowpark-python~=1.0.0",
"pyarrow~=8.0",
]
}
spark_require = {
"spark.SparkDataSet": [SPARK, HDFS, S3FS],
@@ -71,9 +76,7 @@
"tensorflow-macos~=2.0; platform_system == 'Darwin' and platform_machine == 'arm64'",
]
}
video_require = {
"video.VideoDataSet": ["opencv-python~=4.5.5.64"]
}
video_require = {"video.VideoDataSet": ["opencv-python~=4.5.5.64"]}
yaml_require = {"yaml.YAMLDataSet": [PANDAS, "PyYAML>=4.2, <7.0"]}

extras_require = {