File Classifier by Type (v2) #5362

Closed
1 of 2 tasks
Tracked by #5265
ZanSara opened this issue Jul 14, 2023 · 3 comments
Labels: 2.x (Related to Haystack v2.0), good first issue (Good for newcomers)

Comments


ZanSara commented Jul 14, 2023

A very simple component that takes a list of paths and splits it into several output lists, one for each file type.

It could look like the following:

from pathlib import Path
from typing import List, Union

# `component` is the Haystack v2 preview decorator (import omitted in this sketch).
@component
class FileTypeClassifier:

    @component.input
    def input(self):
        class Input:
            paths: List[Union[str, Path]]
        return Input

    @component.output
    def output(self):
        class Output:
            # ... as many extensions as defined in the __init__,
            # like: "txt: List[Union[str, Path]]" ...
            ...
        return Output

    def __init__(self, extensions: List[str]):
        ...

    def run(self, data):
        # ... categorizes the files ...
        # One keyword argument per configured extension (list elided here).
        return self.output(txt=txt_files, pdf=pdf_files, wav=wav_files)

Note: during the implementation of the above we realized that extracting the MIME type of a file is a non-trivial task to carry out cross-platform in Python. The standard-library mimetypes module relies only on the file extension. The "correct" library to use, python-magic, is non-trivial to install on Windows because libmagic has to be shipped with it. There might be Python-based alternatives, such as polyfile or filetype, but we would need to check their respective accuracy.
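To illustrate the point about extension-based detection, here is a minimal sketch using the standard-library mimetypes module. The helper name guess_from_name and the file names are made up for this example; the files do not need to exist, because only the name is inspected.

```python
import mimetypes
from typing import Optional


def guess_from_name(path: str) -> Optional[str]:
    """Return the MIME type inferred from the file name alone (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type


# Nothing is read from disk: the guess depends only on the extension.
print(guess_from_name("report.pdf"))
print(guess_from_name("notes.txt"))

# A PDF saved under a misleading name is classified by its extension,
# so it gets the same answer as any other .txt file.
print(guess_from_name("actually_a_pdf.txt"))

# Without an extension there is nothing to go on.
print(guess_from_name("README"))
```

This is why an extension-based guess is quick but can be wrong for renamed or extension-less files.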

Therefore we decided to implement a simple FileExtensionClassifier anyway. It uses only the file extension: it's very easy to use, very quick, and does not try to read the file, which can be a plus in some scenarios. We'll then also implement a "real" FileTypeClassifier using one of the libraries mentioned above for more accurate results.
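As a rough sketch of the extension-based idea, independent of the component API: the function name classify_by_extension and the "unclassified" bucket are assumptions made for this example, not part of the proposal.

```python
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def classify_by_extension(
    paths: List[PathLike], extensions: List[str]
) -> Dict[str, List[PathLike]]:
    """Group paths by file extension without opening any file.

    Paths whose extension is not listed in `extensions` are collected
    under the hypothetical "unclassified" key.
    """
    # Normalize the configured extensions: lower case, no leading dot.
    groups: Dict[str, List[PathLike]] = {
        ext.lower().lstrip("."): [] for ext in extensions
    }
    groups["unclassified"] = []

    for path in paths:
        ext = Path(path).suffix.lower().lstrip(".")
        key = ext if ext in groups and ext != "unclassified" else "unclassified"
        groups[key].append(path)
    return groups


result = classify_by_extension(
    ["a.txt", Path("b.PDF"), "c.wav", "d.docx", "README"],
    ["txt", "pdf", "wav"],
)
print(result)
```

Matching is case-insensitive here, and only the last suffix is considered, so a name like archive.tar.gz would be looked up as gz.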

Tasks

ZanSara changed the title from "File Type Classifier component (v2)" to "FileTypeClassifier component (v2)" on Jul 14, 2023
ZanSara changed the title from "FileTypeClassifier component (v2)" to "FileTypeClassifier (v2)" on Jul 14, 2023

vblagoje commented Aug 3, 2023

What's the best package for this component in preview, @ZanSara? I put it in io, but I'm not sure how we want to structure packages under components. Suggestions welcome.


ZanSara commented Aug 7, 2023

We don't have specific rules yet, but for this component we may make a package called classifiers. It may end up including all types of classifiers (query classifiers, document classifiers, etc.) and imho that's ok. If it gets too big, we'll think later about how to split it.

ZanSara changed the title from "FileTypeClassifier (v2)" to "File Classifier by Type (v2)" on Aug 15, 2023
ZanSara added the "2.x" (Related to Haystack v2.0) label on Aug 25, 2023
masci added the "good first issue" (Good for newcomers) label on Sep 13, 2023
vblagoje commented

This issue was completed with #5514, wasn't it, @ZanSara?

masci closed this as completed on Oct 30, 2023