Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect against None components of universal pathlib xcom backend #41921

Merged
merged 1 commit into from
Sep 1, 2024

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Sep 1, 2024

fixes: #41723


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, I was hoping there is someone around with more background knowledge why the code is currently like it is... otherwise if none available.. OK to merge it as is.
Maybe we wait a few days hoping that bolke returns back.

airflow/io/path.py Show resolved Hide resolved
airflow/providers/common/io/xcom/backend.py Show resolved Hide resolved
@ap--
Copy link
Contributor

ap-- commented Sep 1, 2024

Regarding the overwrite argument in rename: The implementation of UPath.rename relies on AbstractFileSystem.mv. Below is a report for a selection of available fsspec filesystems, showing if they rely on the default implementation or if they override the method.

tldr: the default and most custom implementations throw away extra keyword arguments. The only one in this list that doesn't is SMBFileSystem, but it seems there's no overwrite keyword argument further down in smbclient.rename either. So unless you use a fsspec filesystem that is not listed here and requires the overwrite kwarg, it should be safe to remove.

See below for details on the different rename implementations:

fsspec subclasses: customized method report for 'mv'

Click to see the environment used to generate the report
adlfs==2024.7.0
aenum==3.1.15
aiobotocore==2.14.0
aiofile==3.8.8
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiohttp-retry==2.8.3
aioitertools==0.11.0
aiooss2==0.2.10
aiosignal==1.3.1
aliyun-python-sdk-core==2.15.2
aliyun-python-sdk-kms==2.16.5
amqp==5.2.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
appdirs==1.4.4
asyncssh==2.16.0
atpublic==5.0
attrs==24.2.0
azure-core==1.30.2
azure-datalake-store==0.0.53
azure-identity==1.17.1
azure-storage-blob==12.22.0
bcrypt==4.2.0
billiard==4.2.0
botocore==1.35.7
boxfs==0.3.0
boxsdk==3.13.0
cachetools==5.5.0
caio==0.9.17
celery==5.4.0
certifi==2024.8.30
cffi==1.17.0
charset-normalizer==3.3.2
circuitbreaker==2.0.0
click==8.1.7
click-didyoumean==0.3.1
click-plugins==1.1.1
click-repl==0.3.0
cloudpickle==3.0.0
colorama==0.4.6
configobj==5.0.8
crcmod==1.7
cryptography==42.0.8
dask==2024.8.2
decorator==5.1.1
dictdiffer==0.9.0
diskcache==5.6.3
distributed==2024.8.2
distro==1.9.0
docker-pycreds==0.4.0
dpath==2.2.0
dropbox==12.0.2
dropboxdrivefs==1.4.1
dulwich==0.22.1
dvc==3.53.1
dvc-data==3.15.2
dvc-http==2.32.0
dvc-objects==5.1.0
dvc-render==1.0.2
dvc-studio-client==0.21.0
dvc-task==0.4.0
entrypoints==0.4
filelock==3.15.4
flatten-dict==0.4.2
flufl.lock==8.1.0
frozenlist==1.4.1
fsspec==2024.6.1
fsspec_xrootd==0.3.0
funcy==2.0
gcsfs==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.19.2
google-auth==2.34.0
google-auth-oauthlib==1.2.1
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
grandalf==0.8
gto==1.7.1
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.23.5
hydra-core==1.3.2
idna==3.8
isodate==0.6.1
iterative-telemetry==0.0.8
Jinja2==3.1.4
jmespath==0.10.0
kombu==5.4.0
lakefs==0.7.1
lakefs-sdk==1.32.1
lakefs-spec==0.10.0
locket==1.0.0
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
morefs==0.2.2
msal==1.30.0
msal-extensions==1.2.0
msgpack==1.0.8
multidict==6.0.5
networkx==3.3
numpy==2.1.0
oauthlib==3.2.2
oci==2.133.0
ocifs==1.3.1
omegaconf==2.3.0
orjson==3.10.7
oss2==2.18.1
ossfs==2023.12.0
packaging==24.1
paramiko==3.4.1
partd==1.4.2
pathspec==0.12.1
platformdirs==3.11.0
ply==3.11
portalocker==2.10.1
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.28.0
psutil==6.0.0
pyarrow==17.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pycryptodome==3.20.0
pydantic==2.8.2
pydantic_core==2.20.1
pydot==3.0.1
pygit2==1.15.1
Pygments==2.18.0
pygtrie==2.5.0
PyJWT==2.9.0
PyNaCl==1.5.0
pyOpenSSL==24.2.1
pyparsing==3.1.4
pyspnego==0.11.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
requests==2.32.3
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rich==13.8.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3fs==2024.6.1
scmrepo==3.3.7
semver==3.0.2
sentry-sdk==2.13.0
setproctitle==1.3.3
setuptools==74.0.0
shellingham==1.5.4
shortuuid==1.0.13
shtab==1.7.1
six==1.16.0
smbprotocol==1.14.0
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
sqltrie==0.11.1
stone==3.3.1
tabulate==0.9.0
tblib==3.0.0
tomlkit==0.13.2
toolz==0.12.1
tornado==6.4.1
tqdm==4.66.5
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.0.7
vine==5.1.0
voluptuous==0.15.2
wandb==0.17.8
wandbfs==0.0.2
wcwidth==0.2.13
webdav4==0.10.0
wrapt==1.16.0
yarl==1.9.6
zc.lockfile==3.0.post1
zict==3.0.0

Default implementation

    def mv(self, path1, path2, recursive=False, maxdepth=None, **kwargs):
        """Move file(s) from one location to another"""
        if path1 == path2:
            logger.debug("%s mv: The paths are the same, so no files were moved.", self)
        else:
            # explicitly raise exception to prevent data corruption
            self.copy(
                path1, path2, recursive=recursive, maxdepth=maxdepth, onerror="raise"
            )
            self.rm(path1, recursive=recursive)

These filesystem classes do not customize the method 'mv':

  • AzureBlobFileSystem
  • AzureDatalakeFileSystem
  • CachingFileSystem
  • BoxFileSystem
  • DataFileSystem
  • DropboxDriveFileSystem
  • _DVCFileSystem
  • WholeFileCacheFileSystem
  • GCSFileSystem
  • GenericFileSystem
  • GitFileSystem
  • GithubFileSystem
  • HfFileSystem
  • HTTPFileSystem
  • JupyterFileSystem
  • LakeFSFileSystem
  • LibArchiveFileSystem
  • MemoryFileSystem
  • OCIFileSystem
  • OSSFileSystem
  • ReferenceFileSystem
  • XRootDFileSystem
  • S3FileSystem
  • SimpleCacheFileSystem
  • TarFileSystem
  • WandbFS
  • ZipFileSystem
  • DictFS
  • MemFS
  • OverlayFileSystem

Subclasses customizing 'mv'

HadoopFileSystem

HadoopFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    @wrap_exceptions
    def mv(self, path1, path2, **kwargs):
        path1 = self._strip_protocol(path1).rstrip("/")
        path2 = self._strip_protocol(path2).rstrip("/")
        self.fs.move(path1, path2)

AsyncLocalFileSystem

AsyncLocalFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    def mv(self, path1, path2, **kwargs):
        path1 = self._strip_protocol(path1)
        path2 = self._strip_protocol(path2)
        shutil.move(path1, path2)

DaskWorkerFileSystem

DaskWorkerFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, *args, **kwargs)
    def mv(self, *args, **kwargs):
        if self.worker:
            self.fs.mv(*args, **kwargs)
        else:
            self.rfs.mv(*args, **kwargs).compute()

DatabricksFileSystem

DatabricksFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, source_path, destination_path, recursive=False, maxdepth=None, **kwargs)
    def mv(
        self, source_path, destination_path, recursive=False, maxdepth=None, **kwargs
    ):
        """
        Move a source to a destination path.

        A note from the original [databricks API manual]
        (https://docs.databricks.com/dev-tools/api/latest/dbfs.html#move).

        When moving a large number of files the API call will time out after
        approximately 60s, potentially resulting in partially moved data.
        Therefore, for operations that move more than 10k files, we strongly
        discourage using the DBFS REST API.

        Parameters
        ----------
        source_path: str
            From where to move (absolute path)
        destination_path: str
            To where to move (absolute path)
        recursive: bool
            Not implemented to far.
        maxdepth:
            Not implemented to far.
        """
        if recursive:
            raise NotImplementedError
        if maxdepth:
            raise NotImplementedError

        try:
            self._send_to_api(
                method="post",
                endpoint="move",
                json={"source_path": source_path, "destination_path": destination_path},
            )
        except DatabricksException as e:
            if e.error_code == "RESOURCE_DOES_NOT_EXIST":
                raise FileNotFoundError(e.message)
            elif e.error_code == "RESOURCE_ALREADY_EXISTS":
                raise FileExistsError(e.message)

            raise e
        self.invalidate_cache(self._parent(source_path))
        self.invalidate_cache(self._parent(destination_path))

DirFileSystem

DirFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    def mv(self, path1, path2, **kwargs):
        return self.fs.mv(
            self._join(path1),
            self._join(path2),
            **kwargs,
        )

LocalFileSystem

LocalFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    def mv(self, path1, path2, **kwargs):
        path1 = self._strip_protocol(path1)
        path2 = self._strip_protocol(path2)
        shutil.move(path1, path2)

FTPFileSystem

FTPFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    def mv(self, path1, path2, **kwargs):
        path1 = self._strip_protocol(path1)
        path2 = self._strip_protocol(path2)
        self.ftp.rename(path1, path2)
        self.invalidate_cache(self._parent(path1))
        self.invalidate_cache(self._parent(path2))

SFTPFileSystem

SFTPFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, old, new)
    def mv(self, old, new):
        logger.debug("Renaming %s into %s", old, new)
        self.ftp.posix_rename(old, new)

SMBFileSystem

SMBFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, recursive=None, maxdepth=None, **kwargs)
    def mv(self, path1, path2, recursive=None, maxdepth=None, **kwargs):
        wpath1 = _as_unc_path(self.host, path1)
        wpath2 = _as_unc_path(self.host, path2)
        smbclient.rename(wpath1, wpath2, port=self._port, **kwargs)

WebdavFileSystem

WebdavFileSystem.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1: str, path2: str, recursive: bool = False, maxdepth: Optional[bool] = None, **kwargs: Any) -> None
    def mv(
        self,
        path1: str,
        path2: str,
        recursive: bool = False,
        maxdepth: Optional[bool] = None,
        **kwargs: Any,
    ) -> None:
        """Move a file/directory from one path to the other."""
        path1 = self._strip_protocol(path1)
        path2 = self._strip_protocol(path2)

        if recursive and not maxdepth and self.isdir(path1):
            return self.client.move(path1, path2)

        if not recursive and self.isdir(path1):
            return self.makedirs(path2)

        super().mv(path1, path2, recursive=recursive, maxdepth=maxdepth, **kwargs)
        return None

WebHDFS

WebHDFS.mv is customized and signature is different

  • base_cls: (self, path1, path2, recursive=False, maxdepth=None, **kwargs)
  • subclass: (self, path1, path2, **kwargs)
    def mv(self, path1, path2, **kwargs):
        self._call("RENAME", method="put", path=path1, destination=path2)

@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2024

Thanks @ap-- !

@potiuk potiuk merged commit 7a75f0a into apache:main Sep 1, 2024
109 checks passed
@potiuk potiuk deleted the fix-none-values-on-basepath branch September 1, 2024 20:18
@potiuk potiuk modified the milestones: Airflow 2.10.1, Airflow 2.10.2 Sep 1, 2024
potiuk added a commit to potiuk/airflow that referenced this pull request Sep 1, 2024
@potiuk potiuk modified the milestones: Airflow 2.10.2, Airflow 2.10.1 Sep 2, 2024
potiuk added a commit that referenced this pull request Sep 2, 2024
@utkarsharma2 utkarsharma2 added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) type:misc/internal Changelog: Misc changes that should appear in change log and removed changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) labels Sep 2, 2024
utkarsharma2 pushed a commit that referenced this pull request Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:common-io type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Universal-pathlib 0.2.3 seems to break compatibility with 0.2.2 (at least breaks mypy checks).
4 participants