Coreference Resolution #565

Open
oaguy1 opened this issue Feb 10, 2019 · 24 comments

Comments

@oaguy1
Contributor

oaguy1 commented Feb 10, 2019

Hello! I am looking into using coreference resolution in a project I am working on. There exists a reasonably easy (read: does not require a neural network and training) algorithm to do just this, and I was thinking of adding it to this library. I read the contributing guide and wanted to make an issue to test the waters before spending a lot of time working on this.

@oaguy1 oaguy1 changed the title from "Coreference Detection" to "Coreference Resolution" Feb 10, 2019
@oaguy1
Contributor Author

oaguy1 commented Feb 10, 2019

A simpler explanation and sample implementation of the algorithm I mentioned can be found here.

@spencermountain
Owner

spencermountain commented Feb 11, 2019

YESSSSS
go for it!
any ideas about how you'd like to handle the api for it? I'd be happy to help.

something like this?

let doc=nlp('Carrots are orange. They are delicious.')
doc.pronouns().data()
// [{text:'they', normal:'they', reference:'carrots'}]

doc.nouns().data()
//[{text:'carrots', normal:'carrots', references:['they']}]

something like that?
there is a term-id property, (i think) that you could use, too.
anyways, yeah. sounds great. go for it.

@spencermountain
Owner

it may be desirable too, to actually fetch the reference word(s), so that people can do whatever they want to the results, like replace them or something.

The only tricky-part i can imagine is tracking-down the reference word(s), and packing them into a Text object, so that a person can do doc.match('#Vegetable').nouns().references().match('#whatever').toUpperCase()... and so on.
This could get a little complicated. I'm happy to help

@oaguy1
Contributor Author

oaguy1 commented Feb 12, 2019

Glad you are excited! The algorithm I linked to can track down the references of the pronouns in a manner that is right most of the time (80%).

The way I was thinking about approaching this was adding an additional tagging step where we looked at each of the pronouns and then use Hobbs’ algorithm to find the best guess at the antecedent. With that in mind, my initial plan for the API was something like this:

// grabbing the antecedent of a pronoun
doc.match('#Pronoun').get(0).antecedent();

// grabbing the pronouns for person
doc.people().get(0).pronouns();

I think once we have the additional API built out for Terms, we could add something closer to what you initially suggested at the more macro/document level.

Let me know what you think! I plan on sitting down and putting some more time on this tomorrow.

@spencermountain
Owner

yeah cool!
to make it feel like the other methods, i'd do it like this

doc.match('#Pronoun').antecedents(0);
doc.people().pronouns(0);

either way, happy to see this in-action, then we can shove it around after.

been thinking past few weeks about breaking-up compromise into more micro-libraries, like d3 did. If we end up doing that, this work will end-up in a named-entity-plugin, or something like that
(just a heads-up)
thanks, lemme know if I can help with anything.

@oaguy1
Contributor Author

oaguy1 commented Feb 22, 2019

Sounds good! I definitely want to try to keep the API as close to the rest of the library as possible. I am hacking on this when I have time, but still won't have much to share for a while. Once I have a good working MVP with tests I will make a PR and we can really play with it.

@oaguy1
Contributor Author

oaguy1 commented Mar 7, 2019

@spencermountain As part of the algorithm I am implementing, I am trying to start from an individual Term (an instance of a pronoun) and then move to the previous Term in the sentence to see if it matches some criteria. Once the beginning of the sentence is reached, the "previous" term would be the last Term in the previous sentence. It would also be good to know if the previous term came from another sentence or paragraph. Is there support for such movement within the text currently in the lib? If not, where would be a good place to start for adding it?

@spencermountain
Owner

hey David, yeah you may want to just use the internal arrays of sentences, and terms.

let doc = nlp(myText)
doc.list //arrays of sentences
doc.list[0].terms // terms in each sentence

we don't have any support for paragraphs (right now)
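
A minimal sketch of the backwards walk asked about above. It uses plain nested arrays as a stand-in for the sentence and term arrays and does not touch the real `doc.list` objects. `previousTerms` and `crossedSentence` are hypothetical names, not part of compromise.

```javascript
// Hypothetical helper: walk backwards from a term, crossing sentence boundaries.
// `sentences` is assumed to be an array of sentences, each one an array of terms.
function* previousTerms(sentences, sentenceIndex, termIndex) {
  let s = sentenceIndex
  let t = termIndex - 1
  while (s >= 0) {
    for (; t >= 0; t -= 1) {
      yield {
        term: sentences[s][t],
        sentenceIndex: s,
        termIndex: t,
        // true once the walk has left the sentence it started in
        crossedSentence: s !== sentenceIndex,
      }
    }
    // move to the last term of the previous sentence, if there is one
    s -= 1
    if (s >= 0) {
      t = sentences[s].length - 1
    }
  }
}

const sentences = [
  ['Carrots', 'are', 'orange'],
  ['They', 'are', 'delicious'],
]
// start from 'are' in the second sentence
for (const step of previousTerms(sentences, 1, 1)) {
  console.log(step.term, step.crossedSentence)
}
```

A generator keeps the caller in control: it can stop as soon as a candidate matches, without walking the whole document.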

@oaguy1
Contributor Author

oaguy1 commented Mar 8, 2019

That is helpful, thank you so much for responding so quickly.

I was thinking, if you only have sentences of terms, how do you feel about adding some sort of index to the Term objects, so they are aware of their position within the document? I could add this during the build process, as an attribute named something like refPosition holding a two-item array [index of sentence, index of term].

Let me know what you think, I don't want to be too crazy adding things w/o checking in.
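
Roughly what I have in mind, as a sketch only. Terms are plain objects here, and `addRefPositions` and `comparePositions` are hypothetical helpers, not existing library functions. The positions are only valid as long as the document is not edited afterwards.

```javascript
// Hypothetical build step: stamp every term with [sentenceIndex, termIndex].
function addRefPositions(sentences) {
  sentences.forEach((terms, s) => {
    terms.forEach((term, t) => {
      term.refPosition = [s, t]
    })
  })
  return sentences
}

// Negative when position a comes before position b, positive when after, 0 when equal.
function comparePositions(a, b) {
  return a[0] - b[0] || a[1] - b[1]
}

const indexed = addRefPositions([
  [{ text: 'Carrots' }, { text: 'are' }, { text: 'orange' }],
  [{ text: 'They' }, { text: 'are' }, { text: 'delicious' }],
])
const pronoun = indexed[1][0]
const noun = indexed[0][0]
// a candidate antecedent has to come before the pronoun
console.log(comparePositions(noun.refPosition, pronoun.refPosition) < 0)
```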

@spencermountain
Owner

hey David, yeah this has been the hard-part of making compromise, that 'position within the document' changes considerably, and depends on where the user is zooming-in, cloning, etc.

I've started working on a major re-write, for v12, that you may be interested in, over here. It uses a linked-list model, so references, and indexes are more 'postmodern', and don't suffer any of the awkwardness you're going through.
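
To show why a linked list sidesteps the index problem, here is a toy sketch. It is an assumption about the general idea only and is not the actual v12 code; `linkTerms`, `insertAfter` and `toTexts` are made-up names.

```javascript
// Hypothetical linked-list terms: each term points at its neighbours by reference,
// so there is no numeric index to keep in sync.
function linkTerms(words) {
  const terms = words.map((text) => ({ text, prev: null, next: null }))
  terms.forEach((term, i) => {
    term.prev = terms[i - 1] || null
    term.next = terms[i + 1] || null
  })
  return terms[0] || null
}

// Splice a new term in after an existing one; only the two neighbours are touched.
function insertAfter(term, text) {
  const added = { text, prev: term, next: term.next }
  if (term.next) {
    term.next.prev = added
  }
  term.next = added
  return added
}

// Collect the texts from a starting term to the end of the list.
function toTexts(head) {
  const out = []
  for (let term = head; term; term = term.next) {
    out.push(term.text)
  }
  return out
}

const head = linkTerms(['They', 'are', 'delicious'])
const are = head.next
insertAfter(are, 'very')
console.log(toTexts(head).join(' '))
// the reference to 'are' is still valid after the edit, and its neighbour is the new term
console.log(are.next.text)
```

A stored `[sentence, term]` index for 'delicious' would be stale after that insert, while a stored reference to the term object still works.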

I'm also concerned that adding in co-reference resolution to v11 may be more complicated than it would be in v12. It's not very solid yet, and still moving-around in some circles..

How would you feel about me creating a compromise-coreference repo, and us working on it there?

That would give us an opportunity to implement that Hobbs paper, without worrying about api changes:

const nlp=require('compromise')
const ccr=require('compromise-coreference')

let doc=nlp(myText)
let json = ccr(doc)
/* {whatever json-structure you'd like} */

how's that?

@oaguy1
Contributor Author

oaguy1 commented Mar 8, 2019 via email

@spencermountain
Owner

hey, i've added you to a basic version of this here.
take it for a ride - feel free to commit directly to it, it's pretty-rough!
cheers

@oaguy1
Contributor Author

oaguy1 commented Mar 11, 2019 via email

@oaguy1
Contributor Author

oaguy1 commented Mar 14, 2019 via email

@spencermountain
Owner

i'd love to hear more about this idea, how do you imagine it working?

@oaguy1
Contributor Author

oaguy1 commented Mar 14, 2019 via email

@spencermountain
Owner

wanna just join the existing slack group?

@oaguy1
Contributor Author

oaguy1 commented Mar 14, 2019 via email

@au-re

au-re commented Jan 6, 2023

hi! sorry to comment on an old issue, but I was wondering if coreference resolution eventually did become part of compromise?

@spencermountain
Owner

spencermountain commented Jan 6, 2023

hey Aurélien - on my new-years resolutions this year.

There's actually an undocumented api for it here - i wouldn't recommend using it yet though.

will update this issue when it lands. Would love some help.
cheers

@spencermountain
Owner

spencermountain commented Jan 6, 2023

if you, (or anybody) was interested in working on it, the current implementation is here

it's a pretty-tricky problem. current version looks back 2 sentences for a 'he' or 'she'. i think i started to try 'they' and got overwhelmed. 'it' is most-likely the hardest.
it should also chain, so 'he' looks for previous 'he' references, etc.
cheers
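
For anyone picking this up, the look-back-and-chain idea can be sketched like this. It is not the code in the repo: terms are plain objects, the `gender` property on person terms is an assumption made for the example, and `resolvePronouns` is a hypothetical name.

```javascript
// Only 'he' and 'she' are handled, as described above.
const PRONOUN_GENDER = new Map([
  ['he', 'male'],
  ['she', 'female'],
])

// Returns a Map from each resolved pronoun term to its antecedent term.
function resolvePronouns(sentences, lookBack = 2) {
  const resolved = new Map()
  sentences.forEach((terms, s) => {
    terms.forEach((term, t) => {
      const gender = PRONOUN_GENDER.get(term.text.toLowerCase())
      if (!gender) return
      // scan right-to-left: current sentence first, then up to `lookBack` earlier ones
      for (let i = s; i >= Math.max(0, s - lookBack); i -= 1) {
        const start = i === s ? t - 1 : sentences[i].length - 1
        for (let j = start; j >= 0; j -= 1) {
          const candidate = sentences[i][j]
          let antecedent = null
          if (candidate.gender === gender) {
            antecedent = candidate
          } else if (PRONOUN_GENDER.get(candidate.text.toLowerCase()) === gender) {
            // chaining: an earlier matching pronoun hands over its own antecedent
            antecedent = resolved.get(candidate) || null
          }
          if (antecedent) {
            resolved.set(term, antecedent)
            return
          }
        }
      }
    })
  })
  return resolved
}

const story = [
  [{ text: 'Sarah', gender: 'female' }, { text: 'met' }, { text: 'Tom', gender: 'male' }],
  [{ text: 'She' }, { text: 'smiled' }],
  [{ text: 'It' }, { text: 'rained' }],
  [{ text: 'She' }, { text: 'left' }],
]
const links = resolvePronouns(story)
// 'Sarah' is three sentences back, but the second 'She' reaches her through the first one
console.log(links.get(story[3][0]).text) // 'Sarah'
```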

@au-re

au-re commented Jan 26, 2023

I've been reading a bit about the topic, turns out co-reference resolution is a whole field of research 😅 I found a paper describing a nice rule based algorithm that might be a good starting point https://aclanthology.org/J13-4004.pdf

It describes a series of sieves that are applied until all mentions in a text refer to some entity.

Maybe it could work something like this:

const text = "John is a musician. He played a new song. A girl was listening to the song. 'It is my favorite', John said to her."

nlp(text).coreference().json()
[
    { terms: [...], text: "John", coreference: { refs: [1]  } }, 
    { terms: [...], text: "he", coreference: { refs: [1]  } }, 
    { terms: [...], text: "a new song", coreference: { refs: [2]  } }, 
    { terms: [...], text: "It", coreference: { refs: [2]  } }, 
    { terms: [...], text: "A girl", coreference:{ refs: [3]  } }, 
    { terms: [...], text: "the song", coreference: { refs: [2]  } }, 
    { terms: [...], text: "my", coreference: { refs: [1] } },
    { terms: [...], text: "her", coreference: { refs: [3]  } }, 
]

Keeping an array of references might be useful for cases where one word might refer to several entities (e.g. "they")
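
As a rough sketch of how that `refs` array could be consumed (the result shape is only the proposal above, and `groupByEntity` is a hypothetical helper), grouping mentions by entity id would look something like this:

```javascript
// Group mention texts by entity id; a mention with several refs lands in several groups.
function groupByEntity(mentions) {
  const clusters = new Map()
  for (const mention of mentions) {
    for (const id of mention.coreference.refs) {
      if (!clusters.has(id)) {
        clusters.set(id, [])
      }
      clusters.get(id).push(mention.text)
    }
  }
  return clusters
}

const clusters = groupByEntity([
  { text: 'John', coreference: { refs: [1] } },
  { text: 'A girl', coreference: { refs: [2] } },
  { text: 'they', coreference: { refs: [1, 2] } },
])
console.log(clusters.get(1)) // [ 'John', 'they' ]
console.log(clusters.get(2)) // [ 'A girl', 'they' ]
```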

Here are some of the sieves described in the paper:

  1. Mention Detection
  2. Speaker Identification
  3. Exact Match
  4. Pronominal Coreference Resolution (I think this is what you have started working on)

For each mention we then try to find a matching antecedent by running it through every sieve, a sieve either resolves the match or leaves it for a later sieve.

Some additional methods might be useful to build the sieves:

nlp(text).mentions().json()
// [{ terms: [...], text: "John" }, { terms: [...], text: "It" }, { terms: [...], text: "A girl" }, { terms: [...], text: "my" }, ...]

nlp(text).speakers().json()
// [{ terms: [...], text: "John", speaker: { quote: "It is my favorite" } }]
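
A very reduced sketch of the sieve idea, to make the control flow concrete. It is not the paper's full algorithm: mention detection and speaker identification are skipped, mentions are hand-written objects, and the `pronoun` and `gender` fields plus the function names are assumptions for the example.

```javascript
// Sieve: exact string match with an earlier non-pronoun mention.
function exactMatch(mention, earlier) {
  if (mention.pronoun) return null
  const text = mention.text.toLowerCase()
  return earlier.find((m) => !m.pronoun && m.text.toLowerCase() === text) || null
}

// Sieve: nearest earlier non-pronoun mention with the same gender.
function pronounMatch(mention, earlier) {
  if (!mention.pronoun) return null
  for (let i = earlier.length - 1; i >= 0; i -= 1) {
    if (!earlier[i].pronoun && earlier[i].gender === mention.gender) {
      return earlier[i]
    }
  }
  return null
}

// Run the sieves in order; a sieve either resolves a mention or leaves it for a later one.
function runSieves(mentions, sieves) {
  const antecedents = new Array(mentions.length).fill(null)
  for (const sieve of sieves) {
    mentions.forEach((mention, i) => {
      if (antecedents[i] === null) {
        antecedents[i] = sieve(mention, mentions.slice(0, i))
      }
    })
  }
  // antecedents always come earlier, so their entity id is known by the time it is needed
  const entityIds = new Map()
  let nextId = 1
  return mentions.map((mention, i) => {
    const id = antecedents[i] ? entityIds.get(antecedents[i]) : nextId++
    entityIds.set(mention, id)
    return { text: mention.text, coreference: { refs: [id] } }
  })
}

const result = runSieves(
  [
    { text: 'John', gender: 'male' },
    { text: 'He', pronoun: true, gender: 'male' },
    { text: 'A girl', gender: 'female' },
    { text: 'John', gender: 'male' },
    { text: 'her', pronoun: true, gender: 'female' },
  ],
  [exactMatch, pronounMatch]
)
console.log(result.map((m) => `${m.text}:${m.coreference.refs[0]}`).join(' '))
// John:1 He:1 A girl:2 John:1 her:2
```

With only these two sieves, "a new song" and "the song" would still end up as separate entities; that case needs one of the looser matching sieves from the paper.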

@spencermountain
Owner

spencermountain commented Jan 26, 2023

hey Aurélien, thank you for sharing this. I'll read that paper this week, it looks really helpful. It would be great to work on this problem with someone.

I've got a few changes on the dev branch in advance of doing coreference. I can talk through them if you'd like, but it should land as a release next week. Mostly changes to .nouns() responses, for weird noun-phrases. There's also an awkwardly named people().guessGender() 😬.

I'm also trying to build-up a tag for people referred to not by name, called #Actor - for things like 'the bartender ... he ..', or 'my grandma ... she'. Right now it's just a bunch of professions, mostly.

i like the sketchup for the api. Let me read that paper and release these fixes then I'll ping you next week.
cheers

@spencermountain spencermountain pinned this issue Jan 26, 2023
@spencermountain spencermountain mentioned this issue Feb 4, 2023
Merged
@spencermountain
Owner

okay, #Actor stuff is released in 14.8.2. Ready to start reproducing this paper, if you wanted.
The api right now is this:

doc.pronouns().forEach(p=>{
  p.refersTo().debug()
})

The logic lives here and the half-passing tests are here

Lots to do! You're welcome to try something in a branch, or make a pr to dev or something. cheers

@spencermountain spencermountain unpinned this issue Nov 16, 2023
3 participants