Populate corpus dialog should allow to specify a pathname pattern #113

Open
johann-petrak opened this issue Feb 10, 2020 · 3 comments

@johann-petrak
Contributor

A corpus is often available as a large number of documents in a directory or directory tree, where the filename and/or the path within the tree conveys important information for filtering or selecting documents, e.g. the filename may contain a year, a topic, a classification label etc.

It would then be extremely useful if we could specify a regexp to match the path names to import, where pathnames would maybe best be represented as URLs (so that subdirectory separators would always be slashes, even on Windows, and not backslashes which are very clumsy to use in regexps).
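A minimal sketch of the kind of matching proposed here, in plain Java with no GATE API involved. The class name `PathPatternFilter` and its methods are hypothetical, and the sketch assumes the pattern is matched against the path relative to the chosen root directory, written with forward slashes, rather than against a full `file:` URL:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of GATE. */
final class PathPatternFilter {

    /** Path of file relative to root, joined with '/' on every platform. */
    static String toSlashPath(Path root, Path file) {
        List<String> parts = new ArrayList<>();
        // Iterating a Path yields its name elements, so no separator
        // character (slash or backslash) ever appears in the parts.
        for (Path part : root.relativize(file)) {
            parts.add(part.toString());
        }
        return String.join("/", parts);
    }

    /** True if the whole slash-separated path matches the pattern. */
    static boolean matches(Pattern pattern, String slashPath) {
        return pattern.matcher(slashPath).matches();
    }

    /** Walks the tree under root and returns the regular files that match. */
    static List<Path> select(Path root, Pattern pattern) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> matches(pattern, toSlashPath(root, p)))
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

With root `corpus` and the pattern `2019/.*\.txt`, a file at `corpus/2019/sports/doc1.txt` would be selected and one at `corpus/2020/sports/doc1.txt` would not, on Windows as well as on Linux.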

@greenwoodma
Contributor

Sounds like the kind of thing best done via the Groovy console, so you can do any matching you want and use the information in any way you want, e.g. naming the documents or the corpus from elements of the path. In fact I've done this a couple of times in previous projects, to match things like language codes in filenames and then set a document parameter.
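As a rough illustration of the extraction step only, here is a sketch in plain Java (which the Groovy console can largely run as-is). The naming convention `<name>_<two-letter language code>.<extension>` and the class `FilenameInfo` are assumptions made for the example; storing the result on a GATE document is left out because the thread does not show that part of the API:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of GATE. */
final class FilenameInfo {

    // Assumed convention: <name>_<lang>.<ext>, e.g. "report_de.txt".
    // The extension may not contain a dot or a slash.
    private static final Pattern LANG_IN_NAME =
        Pattern.compile("(?<name>.+)_(?<lang>[a-z]{2})\\.[^./]+");

    /** Returns the language code if the filename follows the convention. */
    static Optional<String> languageCode(String fileName) {
        Matcher m = LANG_IN_NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group("lang")) : Optional.empty();
    }
}
```

A script would call `FilenameInfo.languageCode(name)` for each file and, when a value is present, attach it to the loaded document in whatever way the project needs.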

@johann-petrak
Contributor Author

I think the effort to write this in Groovy is a lot bigger than entering a pattern in a dialog, even for people who occasionally use Groovy, and even more so for people who enjoy GATE because they can use a GUI instead of a scripting language. I think it is reasonable to expect that advanced GATE users know regexps, or can easily learn as much as they need for this (especially if we have examples in the manual), but less reasonable to expect that they know not just Groovy but also the API necessary to do this.

@greenwoodma
Contributor

I guess my point was that usually you are going to want to do more than select via a regex, in that you are probably going to want to do something with the information as well, and trying to cover all those options in a GUI would be a nightmare.

If you just want to select files based on the folder or file names, surely you could do that easily outside of GATE. I think file managers on both Windows and Linux allow you to select files according to regex patterns, and you could then move the selected files into a new folder before loading them into GATE. And of course you could easily do it on the command line.

I'm not saying it's a totally silly idea, I just think that in the majority of cases it won't take you far enough and you'd still need to write a script to deal with setting document features etc. based on the file/path info.
