
Extend the CSVLoader class to read from different datasources/targets and different kinds of formats #70

Open
neomatrix369 opened this issue Oct 20, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@neomatrix369
Contributor

neomatrix369 commented Oct 20, 2020

Is your feature request related to a problem? Please describe.
At the moment it appears that CSVLoader can only load .csv files from disk or a local file system, which is a limitation both from a functionality point of view and from a provenance (metadata) recording point of view.

In the provenance data we see the path of the file given during the training process; this path could be invalid if the process was run in a Docker container or on other ephemeral infrastructure.

Other (non-Java) libraries allow loading .tgz, .zip, and similar formats. Although decompression may be just a single step, this can be a boon when managing multiple datasets.

Describe the solution you'd like
CSVLoader, through sub-class implementations, should allow loading:

  • files not just from the local file system but also over the web (secure and public sources, e.g. an S3 bucket or GitHub); see the sketch after this list
  • files stored in different formats, e.g. .tgz or .zip (mainly compressed formats)
  • data stored in datastores/databases (via login/password or other connection strings)
  • additional metadata about the dataset itself, e.g. field definitions and the background of the dataset, or links to such resources
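
(For the first point, a rough sketch of what I mean; this helper is hypothetical and not part of Tribuo. It fetches the remote csv to a local file once and then hands it to the existing CSVLoader, assuming its loadDataSource(Path, String) overload:)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.tribuo.DataSource;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;

public final class RemoteCSVExample {
    // Hypothetical helper: download the csv once, then reuse the existing loader.
    public static DataSource<Label> loadFromURL(String url) throws IOException {
        Path tmp = Files.createTempFile("remote-data", ".csv");
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        CSVLoader<Label> loader = new CSVLoader<>(new LabelFactory());
        // "species" is the response column, as in the iris example below.
        return loader.loadDataSource(tmp, "species");
    }
}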

Additional context
Maybe show these and other features of CSVLoader via notebook tutorials.

This request is actually twofold:

  • file format
  • data source (or target) location

Once any or all of these are established, the provenance can carry a more machine-independent description of how to replicate the data loading process.

For example:


TrainTestSplitter(
	class-name = org.tribuo.evaluation.TrainTestSplitter
	source = CSVLoader(
			class-name = org.tribuo.data.csv.CSVLoader
			outputFactory = LabelFactory(
					class-name = org.tribuo.classification.LabelFactory
				)
			response-name = species
			separator = ,
			quote = "
			path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
			file-modified-time = 2020-07-06T10:52:01.938-04:00
			resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
		)
	train-proportion = 0.7
	seed = 1
	size = 150
	is-train = true
)

From the above I could not easily recreate the model building process, or even just the data loading process, because path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data is local to an individual computer. If we could instead have paths like path = https://path/to/bezdekIris.data, the whole process would be a lot more machine-independent. It would also add value to the provenance metadata, as we would know the original source of the data.

neomatrix369 added the enhancement label Oct 20, 2020
@Craigacp
Member

The CSVLoader is designed to be a very simple and quick way of getting a numerical csv with a response column up off the disk and into Tribuo. The file format is moderately flexible: you can change the separator and quote characters, but I don't expect to expand CSVLoader beyond that. For anything more complex in terms of format, processing, or other information you should use CSVDataSource, IDXDataSource, JsonDataSource, SQLDataSource, or TextDataSource. If the format isn't covered by one of those then we could look at adding a new data source, but it wouldn't be in CSVLoader.

I agree it would be nice to have more provenance information in the data sources, and we could look at adding an instance field to the DataProvenance, but for the time being you can store extra information in the runProvenance argument to Trainer.train, as that is stored in the resulting model.
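
(A minimal sketch of that runProvenance route; the key name and URL below are placeholders:)

import java.util.HashMap;
import java.util.Map;

import com.oracle.labs.mlrg.olcut.provenance.Provenance;
import com.oracle.labs.mlrg.olcut.provenance.primitives.StringProvenance;

import org.tribuo.Dataset;
import org.tribuo.Model;
import org.tribuo.Trainer;
import org.tribuo.classification.Label;

public final class RunProvenanceExample {
    public static Model<Label> trainWithSource(Trainer<Label> trainer, Dataset<Label> train) {
        Map<String, Provenance> runProvenance = new HashMap<>();
        // Record the original data source so the model's provenance survives
        // ephemeral local paths ("original-data-url" is a placeholder key).
        runProvenance.put("original-data-url",
                new StringProvenance("original-data-url", "https://path/to/bezdekIris.data"));
        // The extra map is stored in the resulting model's provenance.
        return trainer.train(train, runProvenance);
    }
}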

With respect to transparently reading compressed formats, we could add that as an option to the previously mentioned data sources. For zip and tgz formats we'd need to cope with a potential directory structure and figure out which file to load from within the archive, which adds complexity.

It's intentional that the current loaders do not read from remote endpoints (apart from SQLDataSource, which connects to an external database). We could relax this, but it would have to be controlled by a flag, and it's a bit of an issue if a configuration file can make a web request to load something.

@Craigacp
Member

I'm finishing off a tutorial on RowProcessor which uses CSVDataSource and JsonDataSource to load more complex columnar data from csv and json files respectively.
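
(In the meantime, a minimal sketch of that style of loading; the column names are hypothetical:)

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.tribuo.MutableDataset;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.columnar.FieldProcessor;
import org.tribuo.data.columnar.RowProcessor;
import org.tribuo.data.columnar.processors.field.DoubleFieldProcessor;
import org.tribuo.data.columnar.processors.response.FieldResponseProcessor;
import org.tribuo.data.csv.CSVDataSource;

public final class ColumnarCSVExample {
    public static MutableDataset<Label> load(String csvPath) throws IOException {
        // Map the "species" column to the Label output; "UNK" marks missing responses.
        FieldResponseProcessor<Label> responseProcessor =
                new FieldResponseProcessor<>("species", "UNK", new LabelFactory());
        // Treat these (hypothetical) columns as numeric features.
        Map<String, FieldProcessor> fieldProcessors = new HashMap<>();
        fieldProcessors.put("sepal-length", new DoubleFieldProcessor("sepal-length"));
        fieldProcessors.put("petal-length", new DoubleFieldProcessor("petal-length"));
        RowProcessor<Label> rowProcessor = new RowProcessor<>(responseProcessor, fieldProcessors);
        CSVDataSource<Label> source =
                new CSVDataSource<>(Paths.get(csvPath), rowProcessor, /* outputRequired */ true);
        return new MutableDataset<>(source);
    }
}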

@neomatrix369
Contributor Author

neomatrix369 commented Oct 20, 2020

@Craigacp I agree with the points above; maybe I wasn't clear enough. I meant inheriting from CSVDataSource and creating separate implementations that do this, like the ones you mentioned: CSVDataSource, IDXDataSource, JsonDataSource, SQLDataSource, or TextDataSource.

The parent class is an implementation detail; providing more ways to load data and capturing the source is what I was alluding to.

@neomatrix369
Contributor Author

neomatrix369 commented Oct 20, 2020

About compressed files: directory/folder support won't be necessary; just detecting that a file is a compressed csv/json file is more than enough. There are use-cases for this, as I have already seen that when data files are shared, lots of different lightweight formats are sought after to solve read/write issues (storage and latency).

@Craigacp
Member

We already have something elsewhere in OLCUT that can transparently figure out whether a file is gzipped and return the appropriate input stream implementation. We could probably extend that to support zip, but I don't think we'd want to introduce a dependency inside OLCUT to get bzip support.
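
(For illustration, a sketch of how that kind of transparent detection generally works; this is not OLCUT's actual code. Peek at the stream's magic bytes and wrap it accordingly:)

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

public final class CompressionSniffer {
    public static InputStream open(Path path) throws IOException {
        BufferedInputStream in = new BufferedInputStream(Files.newInputStream(path));
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        if (b1 == 0x1F && b2 == 0x8B) {
            // gzip magic bytes
            return new GZIPInputStream(in);
        } else if (b1 == 'P' && b2 == 'K') {
            // zip magic bytes; position the stream at the first entry
            ZipInputStream zin = new ZipInputStream(in);
            zin.getNextEntry();
            return zin;
        }
        // Not a recognised compressed format, return the raw stream.
        return in;
    }
}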

@Craigacp
Member

Craigacp commented Oct 20, 2020

So concretely there would be:

  • optional loading of gzip or zip compressed files through the data sources
  • loading files over the web (most libraries that do this provide a caching mechanism, which would require design work, especially as the provenance hash currently reads the file a second time; it would be bad to download it twice — see the sketch after this list)
  • mechanism for adding additional metadata to a datasource (e.g. additional provenance information on construction? or something else)
  • support for other data formats
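
(On the second point, a sketch of how the download and the provenance hash could share a single pass over the bytes; the class name is hypothetical:)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class CachingDownloader {
    // Downloads the file once, hashing the bytes as they stream to the cache,
    // so the provenance hash does not require reading (or downloading) it again.
    public static String downloadWithHash(URL source, Path cacheFile)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(source.openStream(), digest)) {
            Files.copy(in, cacheFile, StandardCopyOption.REPLACE_EXISTING);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}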

For the last point I'm not clear what's required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support?

We use ColumnarDataSource as the base class for CSV, Json and SQL format data, so there could be other subclasses of that for other columnar inputs.
