Make the SQLQueryDataSet compatible with mssql. (kedro-org#101)

* [kedro-docker] Layers size optimization (kedro-org#92)
* [kedro-docker] Layers size optimization Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Adjust test requirements Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Skip coverage check on tests dir (some do not execute on Windows) Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Update .coveragerc with the setup Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Fix bandit so it does not scan kedro-datasets Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Fixed existence test Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Check why dir is not created Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Kedro starters are fixed now Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Increased no-output-timeout for long spark image build Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
* Spark image optimized Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Linting Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Switch to slim image always Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Trigger build Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Use textwrap.dedent for nicer indentation Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Revert "Use textwrap.dedent for nicer indentation" This reverts commit 3a1e3f8. Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Revert "Revert "Use textwrap.dedent for nicer indentation"" This reverts commit d322d35. Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
* Make tests read more lines (to skip all deprecation warnings) Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com> Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com> Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Release Kedro-Docker 0.3.1 (kedro-org#94)
* Add release notes for kedro-docker 0.3.1 Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update version in kedro_docker module Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Bump version and update release notes (kedro-org#96) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Make the SQLQueryDataSet compatible with mssql. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add one test + update RELEASE.md. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add missing pyodbc for tests. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Mock connection as well. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add more dates parsing for mssql backend (thanks to fgaudindelrieu@idmog.com) Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Fix an error in docstring of MetricsDataSet (kedro-org#98) Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Bump relax pyarrow version to work the same way as Pandas (kedro-org#100)
* Bump relax pyarrow version to work the same way as Pandas We only use PyArrow for `pandas.ParquetDataSet` as such I suggest we keep our versions pinned to the same range as [Pandas does](https://github.com/pandas-dev/pandas/blob/96fc51f5ec678394373e2c779ccff37ddb966e75/pyproject.toml#L100) for the same reason. As such I suggest we remove the upper bound as we have users requesting later versions in [support channels](https://kedro-org.slack.com/archives/C03RKP2LW64/p1674040509133529)
* Updated release notes Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add missing type in catalog example. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add one more unit tests for adapt_mssql. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* [FIX] Add missing mocker from date test. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* [TEST] Add a wrong input test. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Add pyodbc dependency. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* [FIX] Remove dict() in tests. Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Change check to check on plugin name (kedro-org#103) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Set coverage in pyproject.toml (kedro-org#105) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Move coverage settings to pyproject.toml (kedro-org#106) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Replace kedro.pipeline with modular_pipeline.pipeline factory (kedro-org#99)
* Add non-spark related test changes Replace kedro.pipeline.Pipeline with kedro.pipeline.modular_pipeline.pipeline factory. This is for symmetry with changes made to the main kedro library. Signed-off-by: Adam Farley <adamfrly@gmail.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Fix outdated links in Kedro Datasets (kedro-org#111)
* fix links
* fix dill links Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Fix docs formatting and phrasing for some datasets (kedro-org#107)
* Fix docs formatting and phrasing for some datasets Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
* Manually fix files not resolved with patch command Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
* Apply fix from kedro-org#98 Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> --------- Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Release `kedro-datasets` `version 1.0.2` (kedro-org#112)
* bump version and update release notes
* fix pylint errors Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Bump pytest to 7.2 (kedro-org#113) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Prefix Docker plugin name with "Kedro-" in usage message (kedro-org#57)
* Prefix Docker plugin name with "Kedro-" in usage message Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Keep Kedro-Docker plugin docstring from appearing in `kedro -h` (kedro-org#56)
* Keep Kedro-Docker plugin docstring from appearing in `kedro -h` Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* [kedro-datasets ] Add `Polars.CSVDataSet` (kedro-org#95) Signed-off-by: wmoreiraa <walber3@gmail.com> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* Remove deprecated `test_requires` from `setup.py` in Kedro-Docker (kedro-org#54) Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Yassine Alouini <yalouini@idmog.com>
* [FIX] Fix ds to data_set. Signed-off-by: Yassine Alouini <yalouini@idmog.com>

---------

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Mariusz Strzelecki <szczeles@gmail.com>
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: OKA Naoya <pn11@users.noreply.github.com>
Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Co-authored-by: adamfrly <45516720+adamfrly@users.noreply.github.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Walber Moreira <58264877+wmoreiraa@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
dannyrfar · Mar 21, 2023 · 324e552
1 parent 83a4591
Showing 5 changed files with 126 additions and 2 deletions.
diff --git a/kedro-datasets/RELEASE.md b/kedro-datasets/RELEASE.md
@@ -11,7 +11,7 @@
 | `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lightning fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
 
 ## Bug fixes and other changes
-
+* Add the `mssql` backend to `SQLQueryDataSet`, using the `pyodbc` library.
 
 # Release 1.0.2:
 

diff --git a/kedro-datasets/kedro_datasets/pandas/sql_dataset.py b/kedro-datasets/kedro_datasets/pandas/sql_dataset.py
@@ -1,6 +1,7 @@
 """``SQLDataSet`` to load and save data to a SQL backend."""
 
 import copy
+import datetime as dt
 import re
 from pathlib import PurePosixPath
 from typing import Any, Dict, NoReturn, Optional
@@ -22,6 +23,7 @@
     "psycopg2": "psycopg2",
     "mysqldb": "mysqlclient",
     "cx_Oracle": "cx_Oracle",
+    "mssql": "pyodbc",
 }
 
 DRIVER_ERROR_MESSAGE = """
@@ -321,7 +323,49 @@ class SQLQueryDataSet(AbstractDataSet[None, pd.DataFrame]):
         >>>                            credentials=credentials)
         >>>
         >>> sql_data = data_set.load()
+        >>>
+    Example of usage for mssql:
+    ::
+
+
+        >>> credentials = {"server": "localhost", "port": "1433",
+        >>>                "database": "TestDB", "user": "SA",
+        >>>                "password": "StrongPassword"}
+        >>> def _make_mssql_connection_str(
+        >>>    server: str, port: str, database: str, user: str, password: str
+        >>> ) -> str:
+        >>>    import pyodbc  # noqa
+        >>>    from sqlalchemy.engine import URL  # noqa
+        >>>
+        >>>    driver = pyodbc.drivers()[-1]
+        >>>    connection_str = (f"DRIVER={driver};SERVER={server},{port};DATABASE={database};"
+        >>>                      f"ENCRYPT=yes;UID={user};PWD={password};"
+        >>>                       "TrustServerCertificate=yes;")
+        >>>    return URL.create("mssql+pyodbc", query={"odbc_connect": connection_str})
+        >>> connection_str = _make_mssql_connection_str(**credentials)
+        >>> data_set = SQLQueryDataSet(credentials={"con": connection_str},
+        >>>                            sql="SELECT TOP 5 * FROM TestTable;")
+        >>> df = data_set.load()
+
+    In addition, here is an example of a catalog entry with date parsing:
+    ::
+
 
+        >>> mssql_dataset:
+        >>>    type: kedro_datasets.pandas.SQLQueryDataSet
+        >>>    credentials: mssql_credentials
+        >>>    sql: >
+        >>>       SELECT *
+        >>>       FROM  DateTable
+        >>>       WHERE date >= ? AND date <= ?
+        >>>       ORDER BY date
+        >>>    load_args:
+        >>>       params:
+        >>>        - ${begin}
+        >>>        - ${end}
+        >>>       index_col: date
+        >>>       parse_dates:
+        >>>         date: "%Y-%m-%d %H:%M:%S.%f0 %z"
     """
 
     # using Any because of Sphinx but it should be
@@ -413,6 +457,8 @@ def __init__(  # pylint: disable=too-many-arguments
         self._connection_str = credentials["con"]
         self._execution_options = execution_options or {}
         self.create_connection(self._connection_str)
+        if "mssql" in self._connection_str:
+            self.adapt_mssql_date_params()
 
     @classmethod
     def create_connection(cls, connection_str: str) -> None:
@@ -456,3 +502,26 @@ def _load(self) -> pd.DataFrame:
 
     def _save(self, data: None) -> NoReturn:
         raise DataSetError("'save' is not supported on SQLQueryDataSet")
+
+    # For mssql only
+    def adapt_mssql_date_params(self) -> None:
+        """We need to change the format of datetime parameters.
+        MSSQL expects datetime in the exact format %Y-%m-%dT%H:%M:%S.
+        Here, we also accept plain dates.
+        `pyodbc` does not accept named parameters, they must be provided as a list."""
+        params = self._load_args.get("params", [])
+        if not isinstance(params, list):
+            raise DataSetError(
+                "Unrecognized `params` format. It can be only a `list`, "
+                f"got {type(params)!r}"
+            )
+        new_load_args = []
+        for value in params:
+            try:
+                as_date = dt.date.fromisoformat(value)
+                new_val = dt.datetime.combine(as_date, dt.time.min)
+                new_load_args.append(new_val.strftime("%Y-%m-%dT%H:%M:%S"))
+            except (TypeError, ValueError):
+                new_load_args.append(value)
+        if new_load_args:
+            self._load_args["params"] = new_load_args
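For reviewers who want to try the conversion rule outside the dataset class, here is a minimal standalone sketch of the same logic. The function name `adapt_date_params` is hypothetical and not part of this commit; it mirrors the loop in `adapt_mssql_date_params` under the assumption that `params` is already a list.

```python
import datetime as dt
from typing import Any, List


def adapt_date_params(params: List[Any]) -> List[Any]:
    """Hypothetical standalone version of the loop in adapt_mssql_date_params.

    Strings that parse as plain ISO dates (e.g. "2023-01-01") are expanded to
    midnight datetimes in the format %Y-%m-%dT%H:%M:%S; everything else is
    passed through unchanged.
    """
    adapted = []
    for value in params:
        try:
            # Raises TypeError for non-strings and ValueError for strings
            # that are not plain dates (including full datetime strings).
            as_date = dt.date.fromisoformat(value)
        except (TypeError, ValueError):
            adapted.append(value)
            continue
        as_datetime = dt.datetime.combine(as_date, dt.time.min)
        adapted.append(as_datetime.strftime("%Y-%m-%dT%H:%M:%S"))
    return adapted


if __name__ == "__main__":
    print(adapt_date_params(["2023-01-01", "2023-01-01T20:26", "test", 1.0]))
```

Note that from Python 3.11 onwards `date.fromisoformat` also accepts other ISO 8601 date forms such as `YYYYMMDD`, so which strings count as a plain date depends on the interpreter version.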
diff --git a/kedro-datasets/setup.py b/kedro-datasets/setup.py
@@ -67,7 +67,7 @@ def _collect_requirements(requires):
     "pandas.JSONDataSet": [PANDAS],
     "pandas.ParquetDataSet": [PANDAS, "pyarrow>=6.0"],
     "pandas.SQLTableDataSet": [PANDAS, "SQLAlchemy~=1.2"],
-    "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2"],
+    "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2", "pyodbc~=4.0"],
     "pandas.XMLDataSet": [PANDAS, "lxml~=4.6"],
     "pandas.GenericDataSet": [PANDAS],
 }

diff --git a/kedro-datasets/test_requirements.txt b/kedro-datasets/test_requirements.txt
@@ -39,6 +39,7 @@ pre-commit>=2.9.2, <3.0  # The hook `mypy` requires pre-commit version 2.9.2.
 psutil==5.8.0
 pyarrow>=1.0, <7.0
 pylint>=2.5.2, <3.0
+pyodbc~=4.0.35
 pyproj~=3.0
 pyspark>=2.2, <4.0
 pytest-cov~=3.0

diff --git a/kedro-datasets/tests/pandas/test_sql_dataset.py b/kedro-datasets/tests/pandas/test_sql_dataset.py
@@ -11,6 +11,7 @@
 
 TABLE_NAME = "table_a"
 CONNECTION = "sqlite:///kedro.db"
+MSSQL_CONNECTION = "mssql+pyodbc://?odbc_connect=DRIVER%3DODBC+Driver+for+SQL"
 SQL_QUERY = "SELECT * FROM table_a"
 EXECUTION_OPTIONS = {"stream_results": True}
 FAKE_CONN_STR = "some_sql://scott:tiger@localhost/foo"
@@ -417,3 +418,56 @@ def test_create_connection_only_once(self, mocker):
         assert mock_engine.call_count == 2
         assert fourth.engines == first.engines
         assert len(first.engines) == 2
+
+    def test_adapt_mssql_date_params_called(self, mocker):
+        """Test that the adapt_mssql_date_params
+        function is called when mssql backend is used.
+        """
+        mock_adapt_mssql_date_params = mocker.patch(
+            "kedro_datasets.pandas.sql_dataset.SQLQueryDataSet.adapt_mssql_date_params"
+        )
+        mock_engine = mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
+        ds = SQLQueryDataSet(sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION})
+        mock_engine.assert_called_once_with(MSSQL_CONNECTION)
+        assert mock_adapt_mssql_date_params.call_count == 1
+        assert len(ds.engines) == 1
+
+    def test_adapt_mssql_date_params(self, mocker):
+        """Test that the adapt_mssql_date_params
+        function transforms the params as expected, i.e.
+        converting plain dates into the format %Y-%m-%dT%H:%M:%S
+        and leaving the other values unchanged.
+        """
+        mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
+        load_args = {
+            "params": ["2023-01-01", "2023-01-01T20:26", "2023", "test", 1.0, 100]
+        }
+        ds = SQLQueryDataSet(
+            sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION}, load_args=load_args
+        )
+        assert ds._load_args["params"] == [
+            "2023-01-01T00:00:00",
+            "2023-01-01T20:26",
+            "2023",
+            "test",
+            1.0,
+            100,
+        ]
+
+    def test_adapt_mssql_date_params_wrong_input(self, mocker):
+        """Test that the adapt_mssql_date_params
+        function fails with the correct error message
+        when given a wrong input
+        """
+        mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
+        load_args = {"params": {"value": 1000}}
+        pattern = (
+            "Unrecognized `params` format. It can be only a `list`, "
+            "got <class 'dict'>"
+        )
+        with pytest.raises(DataSetError, match=pattern):
+            SQLQueryDataSet(
+                sql=SQL_QUERY,
+                credentials={"con": MSSQL_CONNECTION},
+                load_args=load_args,
+            )
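The `MSSQL_CONNECTION` constant used in these tests looks like a URL-encoded ODBC string passed through the `odbc_connect` query parameter. As a rough sketch of how such a string can be produced with only the standard library (the helper name `make_odbc_url` is hypothetical and not part of this commit):

```python
from urllib.parse import quote_plus


def make_odbc_url(odbc_connection_string: str) -> str:
    """Hypothetical helper: wrap a raw ODBC connection string in a
    mssql+pyodbc URL by URL-encoding it into the odbc_connect parameter."""
    return "mssql+pyodbc://?odbc_connect=" + quote_plus(odbc_connection_string)


if __name__ == "__main__":
    # Reproduces the shape of the constant used in the tests above.
    print(make_odbc_url("DRIVER=ODBC Driver for SQL"))
```

The result is a plain `str`, so a substring check such as `"mssql" in connection_str` works on it directly.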