Figure out Preprocessing #4

Open
ablodge opened this issue Jul 12, 2020 · 3 comments

@ablodge
Collaborator

ablodge commented Jul 12, 2020

Preprocessing in the parser is based on Zhang et al. 2019 and only works on AMRs. We need to figure out whether/how we want to handle preprocessing of UCCA, EDS, DRG, and PTG.

I think to get the preprocessing working on the new data, you only need to modify AMRIO to look more like AMRGraph.

One possible consequence of working without preprocessing:
AMRGraph.py apparently expects attributes to be in a particular format or else it ignores them (line 63). While working on the parser without preprocessing, this basically results in all attributes being ignored.
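To gauge how much is lost, a rough diagnostic could count the attributes that a strict format check would drop. This is only a sketch: the pattern `STRICT_ATTRIBUTE`, the function `count_dropped_attributes`, and the (name, value) pair representation are hypothetical and are not taken from what AMRGraph.py actually does at line 63.

```python
import re
from collections import Counter

# Hypothetical stand-in for a strict format check: quoted strings or plain numbers.
# The real condition in AMRGraph.py may well differ.
STRICT_ATTRIBUTE = re.compile(r'^(".*"|-?\d+(\.\d+)?)$')


def count_dropped_attributes(attributes):
    """Count kept vs. dropped attribute values under the strict pattern.

    `attributes` is an iterable of (name, value) string pairs.
    """
    counts = Counter()
    for _name, value in attributes:
        key = "kept" if STRICT_ATTRIBUTE.match(value) else "dropped"
        counts[key] += 1
    return counts


# Toy attributes: two match the assumed pattern, two do not.
example = [("op1", '"Obama"'), ("quant", "3"), ("tense", "past"), ("pos", "NN")]
print(count_dropped_attributes(example))
```

Running something like this per framework would show whether "all attributes ignored" is literal or just close to it.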

@ablodge ablodge assigned ablodge and jakpra and unassigned ablodge and jakpra Jul 12, 2020
@ablodge
Collaborator Author

ablodge commented Jul 12, 2020

@jakpra Do you want to work on this one?

@jakpra
Member

jakpra commented Jul 12, 2020

Sure. I'd like to go about this by looking for general (linguistic/structural?) patterns within each framework and across frameworks.
Like I said before, the Zhang+19 preprocessing looks very AMR-specific and not very principled. It has many special cases that handle just an individual word or construction. I'm all for handling long-tail phenomena, but I can't imagine that this style of preprocessing is worth spending a lot of time on.

I'll look into the attribute formatting; I guess a simple workaround could just be to comment out those lines that would ignore the "bad" ones... But most importantly, we should check what makes them "bad" and what the shared task has to say about that.
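Instead of commenting the lines out, the check could be gated by framework so that AMR behaviour stays unchanged. A minimal sketch, assuming the check can be wrapped in a callable; `STRICT_FRAMEWORKS`, `keep_attribute`, and `looks_quoted` are made-up names for illustration, not existing code in the parser.

```python
# Frameworks for which the (hypothetical) attribute check stays active.
STRICT_FRAMEWORKS = {"amr"}


def keep_attribute(framework, value, is_well_formed):
    """Return True if the attribute should be kept.

    `is_well_formed` is a callable standing in for the existing check;
    it is only applied to frameworks listed in STRICT_FRAMEWORKS.
    """
    if framework.lower() not in STRICT_FRAMEWORKS:
        return True  # do not filter attributes of other frameworks
    return bool(is_well_formed(value))


def looks_quoted(value):
    # Toy check used only for this example.
    return value.startswith('"') and value.endswith('"')


print(keep_attribute("ucca", "past", looks_quoted))
print(keep_attribute("amr", "past", looks_quoted))
```

This keeps the "bad" attributes for UCCA, EDS, DRG, and PTG while leaving the decision about what counts as "bad" in one place.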

@jakpra
Member

jakpra commented Jul 13, 2020

  • Disabled a bunch of AMR-specific well-formedness checks in AMRGraph.py for now so we don't lose anything from the other frameworks.

  • Have to check which of the checks should be re-enabled.

  • Ran stanza to add features.

  • Extracted vocabs.

  • Check what other (linguistically or otherwise) principled preprocessing steps we can do.

  • Implement additional preprocessing.

  • Run preprocessing.
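For reference, the vocabulary extraction step in the list above could look roughly like the following. This is a generic frequency-based sketch, not the parser's actual vocab code; `build_vocab`, its parameters, and the special tokens are assumptions.

```python
from collections import Counter


def build_vocab(sequences, min_count=1, specials=("<pad>", "<unk>")):
    """Map tokens to integer ids, with special tokens first.

    `sequences` is an iterable of token lists (e.g. node labels per graph).
    Tokens seen fewer than `min_count` times are left out.
    """
    counts = Counter(token for seq in sequences for token in seq)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


# Toy node labels from two graphs.
node_labels = [["want-01", "boy", "go-02"], ["boy", "girl"]]
vocab = build_vocab(node_labels, min_count=1)
print(vocab)
```

Keeping one such function for all frameworks would make it easy to compare vocabulary sizes across AMR, UCCA, EDS, DRG, and PTG.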
