
removing data, models and logs directories from repo? #111

Closed
fmikaelian opened this issue Apr 29, 2019 · 13 comments
@fmikaelian
Collaborator

We could just let the script create them?

@andrelmfarias
Collaborator

@fmikaelian

I just made a pull request with the proposed changes. By the way, I think we should explain clearly in the README how to use download.py. I remember there was a section about it... did you remove it?

The correct way to use it should be:

  1. Navigate to the cdQA folder
  2. Run in the terminal: python ./cdqa/utils/download.py

@fmikaelian
Collaborator Author

fmikaelian commented May 7, 2019

Sorry, I removed the section because I was thinking about the ability to load the data directly from the package, either from disk or from the internet, instead of using a Python command.

Like they do in Keras for the data: https://keras.io/datasets/
And for the models: https://keras.io/applications/#xception
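
For illustration, the user-facing pattern in Keras looks roughly like this (standard Keras API, shown here only as the pattern we could mimic; files are downloaded on the first call and cached under ~/.keras afterwards):

    # Datasets: downloaded once, then read from the local ~/.keras/datasets cache
    from keras.datasets import boston_housing
    (x_train, y_train), (x_test, y_test) = boston_housing.load_data()

    # Pretrained models: weights downloaded once, cached under ~/.keras/models
    from keras.applications import Xception
    model = Xception(weights='imagenet')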

@andrelmfarias
Collaborator

Ah ok, I see. I think it is a good idea!

However, what about the creation of the data, models and logs folders locally? Do you think they should be created when the user loads a model or a dataset?

@fmikaelian
Collaborator Author

fmikaelian commented May 9, 2019

I think we can be inspired by Keras' get_file() and load_data() methods. The idea is to download a resource if it is not already cached in a .cdqa directory; otherwise, read it directly from that directory.

path = get_file(
        path,
        origin='https://s3.amazonaws.com/keras-datasets/boston_housing.npz',
        file_hash='f553886a1f8d56431e820c5b82552d9d95cfcb96d1e678153f8839538947dff5')
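
As a minimal sketch, a cdQA helper along those lines could look like this (the fetch_resource name, the cache layout and the arguments are hypothetical placeholders, not an existing API):

    import os
    import urllib.request

    def fetch_resource(url, filename, cache_subdir='data'):
        """Download a file into ~/.cdqa/<cache_subdir>/ unless it is already cached,
        and return its local path (hypothetical helper)."""
        cache_dir = os.path.join(os.path.expanduser('~'), '.cdqa', cache_subdir)
        os.makedirs(cache_dir, exist_ok=True)
        local_path = os.path.join(cache_dir, filename)
        if not os.path.exists(local_path):
            urllib.request.urlretrieve(url, local_path)
        return local_path

A trained model could then be fetched with something like fetch_resource(model_url, 'bert_qa.joblib', cache_subdir='models'), where model_url would point to a release asset (both names are placeholders here).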

@fmikaelian
Collaborator Author

Ideally, we can do the same for the datasets and the models. For the logs directory, we could maybe use the logging library and create .cdqa/logs with it to manage logs.
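
For example, with the standard logging module it could look roughly like this (the .cdqa/logs location is just the proposal above, not something that exists in the package today):

    import logging
    import os

    # Write log files under ~/.cdqa/logs instead of a logs/ folder inside the repo
    log_dir = os.path.join(os.path.expanduser('~'), '.cdqa', 'logs')
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger('cdqa')
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler(os.path.join(log_dir, 'cdqa.log')))
    logger.info('training started')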

@andrelmfarias
Collaborator

> I think we can be inspired by Keras' get_file() and load_data() methods:
>
> The idea is to download a resource if it is not already cached in a .cdqa directory; otherwise, read it directly from that directory.
>
>     path = get_file(
>             path,
>             origin='https://s3.amazonaws.com/keras-datasets/boston_housing.npz',
>             file_hash='f553886a1f8d56431e820c5b82552d9d95cfcb96d1e678153f8839538947dff5')

So we will actually create a cache directory .cdqa with subdirectories models and data? And the idea is to get rid of download.py?

> Ideally, we can do the same for the datasets and the models. For the logs directory, we could maybe use the logging library and create .cdqa/logs with it to manage logs.

I don't know this library, but I will take a look at it.

@fmikaelian
Collaborator Author

fmikaelian commented May 9, 2019

Yes.

It is just an idea, and I think this issue is not the most important at this stage. I think we should focus on #122, #91 and #115 for now. What do you think?

@andrelmfarias
Collaborator

I agree. We can do it later; let's focus on these other issues.

@fmikaelian
Collaborator Author

Came across this after the sklearn consortium; it might be a good source of inspiration for this issue: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html
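
For reference, its usage looks like this (real scikit-learn API; the data_home argument controls where the download is cached, which is close to the .cdqa cache idea, although it only covers datasets):

    from sklearn.datasets import fetch_openml

    # Downloaded from OpenML on the first call, read from the local cache afterwards
    mnist = fetch_openml('mnist_784', version=1, data_home='~/.cdqa/data')
    X, y = mnist.data, mnist.target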

@andrelmfarias
Collaborator

> Came across this after the sklearn consortium; it might be a good source of inspiration for this issue: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html

I am not sure it suits our needs, as our interest is in being able to download the models, and it works only for datasets. Do you also want to make it possible to download the BNP data?

@fmikaelian
Collaborator Author

I think the primary idea is to remove the data, models and logs directories from this repo. I usually don't see such directories across Python repositories. Users would therefore be able to download the models manually from the releases, then read them from anywhere and manage the data and logs as they wish.

The only thing is: if we remove those 3 directories now, what do we need to change? What is the impact? If none, let's remove them and close the issue.

@andrelmfarias
Collaborator

andrelmfarias commented Jun 14, 2019

Got it.

I do not think it will have an impact if we delete them, but I will run some tests to make sure.

The other point I want to address in my PR (which I am already working on) is #119.
