High level helper functions for extracting typesystem independent annotations #87

Open
aggarwalpiush opened this issue Nov 5, 2019 · 12 comments
@aggarwalpiush

Hi,

We are trying to extract token text strings and POS tags from CAS objects. However, different type systems return POS tags in different formats. @zesch, please correct me if I am wrong.
It would be great to have some helper functions (a few are sketched in the examples below) that cover these use cases.

For example, for a given CAS object:

  1. To return all the token texts:
 cas.get_token_strings() or cas.select(TOKEN).as_text()
  2. To return all POS tags in the PTB tag format:
cas.select(TOKEN).get_pos_tags(format='ptb')

We hope to see these helper functions as part of this API.

Thanks!!

@reckart
Member

reckart commented Nov 5, 2019

Hi @aggarwalpiush :)

The issue is that cassis is meant to be a generic, type-system-agnostic library, i.e. it should support any UIMA type system. In fact, we have users who use e.g. the cTAKES type system and may not work with DKPro Core at all. So we would need some way of

@jcklie and I have been throwing around a number of ideas, e.g.

  • passing a strategy to the CAS constructor which would monkey-patch the CAS instance and add convenience methods: cas = CAS(DKPro_Core); cas.get_tokens() - but there would be no IDE auto-completion support
  • using some kind of generic typing, e.g. cas = CAS(); cas.$(DKPro_Core).get_tokens() - where $ would be a method returning the type passed to it as an argument; but apparently Python doesn't support this kind of trick (Java does) and there would be no auto-completion in the IDE
  • an extension mechanism like the one Pandas has; but again no auto-complete support
  • simply using static functions: import dkpro_core.accessors; get_tokens(cas) - at least some IDE auto-complete support, but not necessarily a nice API
  • subclassing the CAS: cas = DKProCoreCAS() - has IDE auto-complete support, but honestly I don't like it because IMHO it doesn't separate concerns sufficiently. E.g. what if you want to use a CAS object with different type systems, e.g. DKPro Core plus your own type system? Nah...
  • wrapping the CAS with an accessor which implements the same interface as the CAS: cas = DKPro_Core(CAS()); cas.get_tokens() - has IDE auto-completion support and also you could wrap the same CAS object with different accessors if you wanted to work with multiple type systems

... so the wrapper approach seems to us the most promising one for the moment. Also, cassis doesn't need to be extended to support it.
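
A minimal sketch of what the wrapper idea could look like, for illustration only. The names DKProCoreAccessor, StubCas and Span are hypothetical and not part of cassis; StubCas only stands in for a CAS-like object that offers select() and get_covered_text(), so that the example runs on its own. The TOKEN constant is assumed to be the DKPro Core token type name.

```python
from dataclasses import dataclass

# Assumed DKPro Core type name for tokens, used as a plain string constant.
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


@dataclass
class Span:
    """Stand-in for an annotation with begin/end offsets (not a cassis class)."""
    begin: int
    end: int


class StubCas:
    """Stand-in for a CAS-like object (not the cassis class)."""

    def __init__(self, text, annotations):
        self.sofa_string = text
        self._annotations = annotations  # maps type name -> list of annotations

    def select(self, type_name):
        # Lazily yield all annotations of the requested type.
        yield from self._annotations.get(type_name, [])

    def get_covered_text(self, annotation):
        return self.sofa_string[annotation.begin:annotation.end]


class DKProCoreAccessor:
    """Hypothetical accessor that wraps a CAS and adds DKPro Core helpers."""

    def __init__(self, cas):
        self._cas = cas

    def __getattr__(self, name):
        # Only called for attributes not found on the accessor itself:
        # forward them to the wrapped CAS.
        return getattr(self._cas, name)

    def get_tokens(self):
        return list(self._cas.select(TOKEN))

    def get_token_strings(self):
        return [self._cas.get_covered_text(t) for t in self.get_tokens()]


cas = StubCas("Hello world", {TOKEN: [Span(0, 5), Span(6, 11)]})
dkpro = DKProCoreAccessor(cas)
print(dkpro.get_token_strings())  # ['Hello', 'world']
```

Delegating through __getattr__ keeps the sketch short, but IDEs will typically only auto-complete the methods defined explicitly on the accessor. Forwarding each CAS method by hand would be needed to get completion for the full interface. The same CAS object can be wrapped by several such accessors, one per type system.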

That said ...

cas.select(TOKEN).as_text()

This is something which I think would be really nice to have.

@zesch
Member

zesch commented Nov 5, 2019

Wouldn't that also be type system specific?

cas.select(TOKEN).as_text() # token.getCoveredText()
cas.select(LEMMA).as_text() # lemma.getValue()

@reckart
Member

reckart commented Nov 5, 2019

If we imagine TOKEN and LEMMA to be type name string constants - no.

@jcklie
Collaborator

jcklie commented Nov 5, 2019

How would cassis know what feature to use for as_text()?

@jcklie
Collaborator

jcklie commented Nov 5, 2019

In Python, one would normally just use a list comprehension for that, e.g.

values = [x.value for x in cas.select(LEMMA)]

@reckart
Member

reckart commented Nov 5, 2019

For as_text(), we would use get_covered_text(), not a feature value.
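
As a rough illustration of this point, as_text() can be written without knowing any feature names, because the covered text depends only on the begin/end offsets. The function as_text and the stand-in classes below are hypothetical. The sketch assumes a CAS-like object with select(type_name) and get_covered_text(annotation).

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for an annotation with begin/end offsets."""
    begin: int
    end: int


class StubCas:
    """Stand-in for a CAS-like object, only here to make the sketch runnable."""

    def __init__(self, text, annotations):
        self.sofa_string = text
        self._annotations = annotations  # maps type name -> list of annotations

    def select(self, type_name):
        yield from self._annotations.get(type_name, [])

    def get_covered_text(self, annotation):
        return self.sofa_string[annotation.begin:annotation.end]


def as_text(cas, type_name):
    """Hypothetical helper: covered text of every annotation of one type."""
    return [cas.get_covered_text(a) for a in cas.select(type_name)]


# The same helper works for any type name, because it never reads a feature.
cas = StubCas(
    "Dogs bark",
    {"TOKEN": [Span(0, 4), Span(5, 9)], "LEMMA": [Span(0, 4), Span(5, 9)]},
)
print(as_text(cas, "TOKEN"))  # ['Dogs', 'bark']
print(as_text(cas, "LEMMA"))  # ['Dogs', 'bark']
```

Note that both calls return the surface text. A lemma annotation would still yield "Dogs" rather than its lemma value.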

@zesch
Member

zesch commented Nov 5, 2019

This would somewhat diminish the usefulness, as many types beyond token would not return useful results. If we use an accessor, couldn't it decide to return different feature values depending on the type?

@reckart
Member

reckart commented Nov 5, 2019

It probably could, but it could be confusing. E.g. if as_text() returned the covered text for tokens but, say, the entity type for entities, I would find that confusing. How would I get the covered text of an entity? If you wanted to introduce a convenience accessor for "the most commonly used feature value", I would find it sensible for it to have a different name, e.g. as_value() - this could e.g. return the "value" feature for named entities (instead of the "identifier" feature) or the "PosValue" feature for POS tags (instead of the "CoarseValue").
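
One way such an as_value() could be sketched is with a per-type table of "most commonly used" features. Everything here is an assumption for illustration: the function name, the short type keys "NamedEntity" and "POS", and the use of SimpleNamespace objects in place of real feature structures. Only the feature names "value" and "PosValue" come from the comment above.

```python
from types import SimpleNamespace

# Hypothetical table: which feature counts as "the value" for each type.
DEFAULT_FEATURE = {
    "NamedEntity": "value",
    "POS": "PosValue",
}


def as_value(annotations, type_name, default_feature=DEFAULT_FEATURE):
    """Hypothetical helper: read the default feature of each annotation."""
    feature = default_feature[type_name]  # KeyError if the type is unknown
    return [getattr(a, feature) for a in annotations]


# Stand-ins for feature structures; real ones would come from cas.select(...).
pos_tags = [
    SimpleNamespace(PosValue="NNS", CoarseValue="NOUN"),
    SimpleNamespace(PosValue="VBP", CoarseValue="VERB"),
]
entities = [SimpleNamespace(value="PERSON", identifier="Q42")]

print(as_value(pos_tags, "POS"))          # ['NNS', 'VBP']
print(as_value(entities, "NamedEntity"))  # ['PERSON']
```

Keeping the table outside the function means a different type system could supply its own mapping, while as_text() stays reserved for covered text.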

@zesch
Member

zesch commented Nov 5, 2019

  1. There should be a way to access feature values of annotations.
  2. I would find it confusing if cas.select(TOKEN).as_text() and cas.select(POS).as_text() returned the same values (as they would now, right?)

@reckart
Member

reckart commented Nov 5, 2019

There is a way to access feature values, e.g. as @jcklie illustrated:

values = [x.value for x in cas.select(LEMMA)]

x.value reads the feature value on the feature structure x. You can also write to the feature: x.value = "value".

Right now, as_text() does not exist. cas.select(XXX) returns a generator, i.e. not a list, so evaluation is lazy. That is why we currently cannot easily add methods to it, and we also cannot easily check whether the result is non-empty. We have been looking e.g. at https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.peekable and have considered returning a list, but there is no final decision for the time being. I think it would be good if cas.select(xxx) returned something we can define methods on, maybe some kind of lazily evaluated iterable, to eventually allow mirroring the UIMAv3 select API, or at least a Pythonic version of it.
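
To make the idea of "a lazily evaluated iterable we can define methods on" more concrete, here is a small standard-library sketch. SelectResult, its methods, and the stand-in classes are hypothetical and not part of cassis. The emptiness check peeks at one element and pushes it back, similar in spirit to what a peekable iterator offers.

```python
from dataclasses import dataclass
from itertools import chain


@dataclass
class Span:
    """Stand-in for an annotation with begin/end offsets."""
    begin: int
    end: int


class StubCas:
    """Stand-in for a CAS-like object whose select() returns a generator."""

    def __init__(self, text, annotations):
        self.sofa_string = text
        self._annotations = annotations

    def select(self, type_name):
        yield from self._annotations.get(type_name, [])

    def get_covered_text(self, annotation):
        return self.sofa_string[annotation.begin:annotation.end]


class SelectResult:
    """Hypothetical lazy wrapper around the generator returned by select()."""

    def __init__(self, cas, annotations):
        self._cas = cas
        self._it = iter(annotations)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    def is_empty(self):
        # Pull at most one element, then put it back in front of the rest.
        sentinel = object()
        first = next(self._it, sentinel)
        if first is sentinel:
            return True
        self._it = chain([first], self._it)
        return False

    def as_text(self):
        # Consumes the remaining elements.
        return [self._cas.get_covered_text(a) for a in self]


cas = StubCas("Hello world", {"TOKEN": [Span(0, 5), Span(6, 11)]})
result = SelectResult(cas, cas.select("TOKEN"))
print(result.is_empty())  # False
print(result.as_text())   # ['Hello', 'world']
```

Like the underlying generator, a SelectResult can only be traversed once. After as_text() has run, it is exhausted and is_empty() returns True.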

@jcklie
Collaborator

jcklie commented Nov 5, 2019

I will track the extension mechanism in #83 and the extension methods you want here so that we do not mix up the issues.

@jcklie jcklie added this to the Backlog milestone Sep 29, 2021