Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add FileExtensionClassifier to previews #5514

Merged
merged 18 commits into from
Aug 15, 2023
Merged

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented Aug 4, 2023

What?

This PR introduces a new component, FileExtensionClassifier, to our preview. This component classifies files based on their file extensions. It takes a list of file paths as input and groups them by their file extensions. The list of extensions to consider is provided during the initialization of the component.

Why?

When working with a large number of files, it can be useful to categorize them based on their mime types. This component simplifies this task by automatically grouping files by their extensions.

How can it be used?

The FileExtensionClassifier can be used as a standalone component or as part of a pipeline. Here's a simple example of how to use it:

classifier = FileExtensionClassifier(mime_types=["text/plain", "application/pdf", "application/msword"])
classified_files = classifier.run({"paths": ["/path/to/file1.txt", "/path/to/file2.pdf", "/path/to/file3.doc"]})

In this example, classified_files will be a dictionary where the keys are the file mime types ("text/plain", "application/pdf", "application/msword"), and the values are lists of pathlib.Path objects, representing the file paths that match the corresponding extension.

How did you test it?

Tests for the FileExtensionClassifier are included in this PR. They verify that the component correctly classifies files based on their extensions and handles various edge cases, such as files without extensions or with unrecognized extensions.

Notes for the reviewer

Please pay special attention to the handling of edge cases and the format of the output data. We decided to use pathlib.Path objects in the output to make it easier to work with the file paths in subsequent steps of a pipeline. If you have any suggestions for improvements or see any potential issues, please let me know.

@vblagoje vblagoje requested review from a team as code owners August 4, 2023 14:43
@vblagoje vblagoje requested review from dfokina and ZanSara and removed request for a team August 4, 2023 14:43
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Aug 4, 2023
@vblagoje
Copy link
Member Author

vblagoje commented Aug 4, 2023

@ZanSara I don't know if io_utils is the best package for this component. Or if this is actually what you had in mind for this component to begin with 😅 Just LMK what else needs to be done. This is more of a preview than a finalized PR

@coveralls
Copy link
Collaborator

coveralls commented Aug 4, 2023

Pull Request Test Coverage Report for Build 5867619952

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 48.13%

Totals Coverage Status
Change from base Build 5859099232: 0.07%
Covered Lines: 11399
Relevant Lines: 23684

💛 - Coveralls

@vblagoje
Copy link
Member Author

vblagoje commented Aug 7, 2023

Moved to classifiers as we agreed @ZanSara , please have a look.

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a dynamic output definition for this component 🙂

haystack/preview/components/classifiers/file_classifier.py Outdated Show resolved Hide resolved
@vblagoje
Copy link
Member Author

vblagoje commented Aug 8, 2023

Ok @ZanSara - please see if e758131 is what you asked for.

@vblagoje
Copy link
Member Author

@ZanSara adjusted to use mime types throughout. Tests were adjusted as well. Note that since mime types have "/" and "-" in their name, we need to convert any non-alpha character to "_" to follow dataclass naming convention. Please see comments and adjusted tests for details.

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably simplify one step, overall it's near ready to merge!

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more nitpicks: let's not get swamped by this mimetype issue, I believe we can merge it today!

@vblagoje
Copy link
Member Author

@ZanSara can you please give it another look through

@vblagoje vblagoje changed the title feat: Add FileTypeClassifier to previews feat: Add FileExtensionClassifier to previews Aug 15, 2023
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you!

@vblagoje vblagoje merged commit 8652d00 into main Aug 15, 2023
54 checks passed
@vblagoje vblagoje deleted the file_type_classifier_v2 branch August 15, 2023 13:58
DosticJelena pushed a commit to smartcat-labs/haystack that referenced this pull request Aug 23, 2023
* Add FileExtensionClassifier preview component

* Add release note

* PR feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants