A corpus is often available as a large number of documents in a directory or directory tree, where the filename and/or the path within the tree conveys information that matters for filtering or selecting documents, e.g. the filename may contain a year, a topic, or a classification label.
It would then be extremely useful to be able to specify a regexp that the path names to import must match. Path names would probably best be represented as URLs, so that subdirectory separators are always slashes, even on Windows, rather than backslashes, which are very clumsy to use in regexps.
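To make the request concrete, here is a minimal sketch of the selection step written outside GATE, in Python. It does not use any GATE API; the function name `select_paths` and the example directory layout are assumptions made up for illustration. Paths are taken relative to the corpus root and rendered with forward slashes on every platform, so the same pattern works on Windows and Linux.

```python
import re
from pathlib import Path, PureWindowsPath


def select_paths(root, pattern):
    """Return the files under `root` whose slash-separated relative path
    fully matches `pattern` (hypothetical helper, not part of GATE)."""
    root = Path(root)
    regex = re.compile(pattern)
    selected = []
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        # as_posix() always uses "/" as the separator, even on Windows.
        rel = p.relative_to(root).as_posix()
        if regex.fullmatch(rel):
            selected.append(rel)
    return selected


# A Windows-style path is normalised the same way:
windows_style = PureWindowsPath(r"news\2019\a.txt").as_posix()  # "news/2019/a.txt"
```

With an assumed layout of `topic/year/file`, a call such as `select_paths("corpus", r"news/(2019|2020)/.*\.txt")` would keep only the `.txt` files under `news/2019` and `news/2020`. `fullmatch` is used here so that the pattern describes the whole relative path; `search` would be the looser alternative.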
Sounds like the kind of thing best done via the Groovy console, so you can do any matching you want and use the information in any way you want, for example naming the documents or corpus from elements of the path. In fact I've done this a couple of times in previous projects, to match things like language codes in filenames and then set a document parameter.
I think the effort to write this in Groovy is a lot bigger than entering a pattern in a dialog, even for people who occasionally use Groovy, and even more so for people who enjoy GATE because they can use a GUI instead of a scripting language. I think it is reasonable to expect that advanced GATE users know regexps, or can easily learn as much as they need for this (especially if we have examples in the manual), but less reasonable to expect that they know not just Groovy but also the API necessary to do this.
I guess my point was that usually you are going to want to do more than select via a regex, in that you are probably going to want to do something with the information as well, and trying to cover all those options in a GUI would be a nightmare.
If you just want to select files based on the folder or file names, surely you could do that easily outside of GATE. I think file managers on both Windows and Linux allow you to select files according to regex patterns, which you could then move into a new folder before loading into GATE. And of course you could easily do it on the command line.
I'm not saying it's a totally silly idea, I just think that in the majority of cases it won't take you far enough and you'd still need to write a script to deal with setting document features etc. based on the file/path info.
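For what it's worth, the "do something with the information" part can be sketched with named groups, so one pattern both selects a file and yields the values you would set as document features. This is plain Python rather than Groovy and touches no GATE API; the `topic/year/name_lang.txt` layout, the pattern, and the function name `features_from_path` are all assumptions for illustration.

```python
import re

# Hypothetical layout: <topic>/<year>/<name>_<lang>.txt, with forward slashes.
PATH_PATTERN = re.compile(
    r"(?P<topic>[^/]+)/(?P<year>\d{4})/(?P<name>[^/]+)_(?P<lang>[a-z]{2})\.txt"
)


def features_from_path(rel_path):
    """Return a dict of features parsed from the path, or None if the
    path does not match (i.e. the file would not be selected)."""
    match = PATH_PATTERN.fullmatch(rel_path)
    return match.groupdict() if match else None


features = features_from_path("news/2019/report_en.txt")
```

Here `features` would hold the topic, year, name and language code as strings, ready to be attached to a document by whatever script does the loading; a path without a language suffix simply returns `None`.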