Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO-507: Fixing imports with spaces #501

Merged
merged 4 commits into from
Dec 12, 2022

Conversation

owencjones
Copy link
Contributor

@owencjones owencjones commented Dec 2, 2022

  • Parsing filenames using pathlib.Path.suffixes was not working well, because we were joining all suffixes into a string to allow extensions like .nii.gz to work, and...
  • This stops filenames like filename.1.png from working as the extension is parsed as 1.png and that is not in the valid list (.png is)

Changes:

  • Deprecated is_extension_allowed, is_image_extension_allowed, and is_video_extension_allowed, because these accept only the extension
  • Introduced is_extension_allowed_by_filename and corresponding image and video equivalents, which accept the entire filename, and check it ends with a valid extension.
  • Included tests for all introduced and changed functions.

darwin/utils.py Outdated
@@ -210,13 +280,15 @@ def find_files(
for f in files:
path = Path(f)
if path.is_dir():
found_files.extend([f for f in path.glob(pattern) if is_extension_allowed("".join(f.suffixes))])
elif is_extension_allowed("".join(path.suffixes)):
found_files.extend([f for f in path.glob(pattern) if is_extension_allowed(str(path))])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes made only by formatter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to give visibility this line - "".join(path.suffixes) is there to handle the fact that some extensions have two dots in them e.g .nii.gz. I'm not sure this new logic works, as is_extension_allowed expects only the extension passed to it not the entire path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the new is_extension_allowed_by_filename but, maybe that should be used instead of is_extension_allowed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're absolutely right, not sure how I missed that!

Well, I am - I unit tested it, so tested the functions individually!

found_files.append(path)
else:
raise UnsupportedFileType(path)

return [f for f in found_files if f not in map(Path, files_to_exclude)]
files_to_exclude_full_paths = [str(Path(f)) for f in files_to_exclude]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned into two lines for readability, because applying the str as

[f for f in found file if str(f) not in map(str, map(Path, files_to_exclude))]

is not too readable, and import this will tell us it should be.

Comparing by string, because comparing PosixPath, WindowsPath, or generally PurePath objects is quite unreliable. Iteratively comparing path.parents and then path.name works, and casting to str provides ...parents/name in a single string - which is more efficient to compare.

Comparison of parents and then name is O(n)=n*(p+1) - where p is number of parents. Comparison of string output is O(n)=1 - at least on the surface - but under the hood, we should at least assume that pathlib is relatively optimised.


class TestIsExtensionAllowedByFilenameFunctions(FindFileTestCase):
@dataclass
class Dependencies:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency injection approach used because pytest already has is_allowed_by_filename in memory from the file header, and this guarantees a new memory pointer for each test - pytest runs 4 in parallel as default.

@@ -1,4 +1,5 @@
from unittest.mock import MagicMock, patch
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatted made changes to a file I saved in advertently.

@owencjones owencjones marked this pull request as ready for review December 2, 2022 11:41
@owencjones owencjones changed the title IO-507_fixing_imports_with_spaces IO-507: Fixing imports with spaces Dec 2, 2022
@linear
Copy link

linear bot commented Dec 2, 2022

IO-507 BUG: "Error: No files found" when importing folders that contain files with spaces in the filename

BUG: submission from @Rafal Zadlowski

Summary (describe an issue):
"Error: No files found" when importing folders that contain files with spaces in the filename

Share Loom/Screenshots with Console/Network opened:
In the thread

Darwin affected version
V2

Environment (production/staging; browser and OS version)
Darwin-py 0.8.3. Works fine on 0.8.2 version

Impact
Some (few customers are impacted)

Priority
High - important issue, but has workaround

Expected Behaviour
Files containing spaces in the filename are correctly recognized and imported.

Team & Dataset Link  
Spine AI: https://darwin.v7labs.com/super_admin/teams/1661

Intercom ticket
https://app.intercom.com/a/apps/pb9z3asr/inbox/inbox/5119030/conversations/150317

@rslota rslota requested a review from tomvars December 6, 2022 16:18
@owencjones owencjones merged commit e4ce4ed into master Dec 12, 2022
@owencjones owencjones deleted the IO-507_fixing_imports_with_spaces branch December 12, 2022 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants