Feature/feature augmentation #14

Merged
merged 12 commits into from
Jul 21, 2020


@Sarenne Sarenne commented Jul 15, 2020

Description

Added lexical features to measure language complexity:

  • The full set of POS-tag rates from the Universal POS Tag set
  • Dependency distance features, including total and average dependency distance across sentence objects
  • Number of dependencies, including total and average number of unique dependency relations across sentence objects.
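As a rough illustration of the first and third features, here is a minimal sketch that works on plain `(upos, deprel)` pairs rather than on Stanza objects. It assumes that a "rate" is a tag's count divided by the total token count, and that the dependency count is the number of unique relations per sentence, summed and averaged over sentences. The function names and the toy annotations are hypothetical and are not taken from this PR.

```python
from collections import Counter

# The 17 tags of the Universal POS tag set.
UPOS_TAGS = (
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
)


def pos_tag_rates(sentences):
    """Return {tag: share of tokens carrying that tag} for every UPOS tag.

    `sentences` is a list of sentences, each a list of (upos, deprel) pairs.
    Assumption: rate = tag count / total token count.
    """
    counts = Counter(upos for sent in sentences for upos, _ in sent)
    total = sum(counts.values())
    return {tag: (counts[tag] / total if total else 0.0) for tag in UPOS_TAGS}


def unique_dependency_counts(sentences):
    """Return (total, average) of unique dependency relations per sentence."""
    per_sentence = [len({deprel for _, deprel in sent}) for sent in sentences]
    total = sum(per_sentence)
    average = total / len(per_sentence) if per_sentence else 0.0
    return total, average


# Hand-annotated toy input (hypothetical, not parser output).
doc = [
    [("DET", "det"), ("NOUN", "nsubj"), ("VERB", "root"), ("PUNCT", "punct")],
    [("PRON", "nsubj"), ("VERB", "root"), ("DET", "det"), ("NOUN", "obj"),
     ("PUNCT", "punct")],
]

rates = pos_tag_rates(doc)
total_deps, avg_deps = unique_dependency_counts(doc)
```

With Stanza, the same pairs could be read from each word's `upos` and `deprel` attributes.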

Notes

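Dependency distance: Stanza's handling of punctuation dependency relations and of the root relation may be producing unexpected distances. Distance here is the absolute difference between a word's ID and its head's ID, so the root (head 0) gets a distance equal to its own position, and sentence-final punctuation attached to the root can span most of the sentence. This needs checking to confirm the implementation behaves as intended. Should punctuation relations be ignored?

e.g.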

ID - word - dep head : dep distance
1 The 2 : 1
2 picture 3 : 1
3 shows 0 : 3 (root)
4 a 5 : 1
5 boy 3 : 2
6 walking 5 : 1
7 to 9 : 2
8 the 9 : 1
9 kitchen 6 : 3
10 to 11 : 1
11 pick 6 : 5
12 a 13 : 1
13 cookie 11 : 2
14 from 17 : 3
15 the 17 : 2
16 cookie 17 : 1
17 jar 11 : 6
18 . 3 : 15 (punctuation)
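To make the question concrete, the sketch below recomputes the distances from the example above, with and without the root and punctuation rows. It is a standalone illustration, not the code in this PR: the function names and the `skip_root` / `skip_punct` flags are hypothetical, and punctuation is detected from the word text only so that the snippet has no dependencies. With Stanza one would more likely check `word.deprel == "punct"` or `word.upos == "PUNCT"`.

```python
import string

# (id, word, head) rows copied from the example; head 0 marks the root.
ROWS = [
    (1, "The", 2), (2, "picture", 3), (3, "shows", 0), (4, "a", 5),
    (5, "boy", 3), (6, "walking", 5), (7, "to", 9), (8, "the", 9),
    (9, "kitchen", 6), (10, "to", 11), (11, "pick", 6), (12, "a", 13),
    (13, "cookie", 11), (14, "from", 17), (15, "the", 17),
    (16, "cookie", 17), (17, "jar", 11), (18, ".", 3),
]


def is_punctuation(word):
    # Assumption: a token made only of ASCII punctuation is a punctuation token.
    return bool(word) and all(ch in string.punctuation for ch in word)


def dependency_distances(rows, skip_root=False, skip_punct=False):
    """Return |id - head| for each row, optionally dropping root/punctuation."""
    distances = []
    for word_id, word, head in rows:
        if skip_root and head == 0:
            continue
        if skip_punct and is_punctuation(word):
            continue
        distances.append(abs(word_id - head))
    return distances


def total_and_average(distances):
    total = sum(distances)
    return total, (total / len(distances) if distances else 0.0)


all_total, all_avg = total_and_average(dependency_distances(ROWS))
filtered_total, filtered_avg = total_and_average(
    dependency_distances(ROWS, skip_root=True, skip_punct=True)
)
```

On this sentence the unfiltered total is 51 (average about 2.83 over 18 words), while dropping the root and punctuation rows gives 33 (average about 2.06 over 16 words), so those two rows account for over a third of the total distance.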

- number of unique dependencies (total & average)
- dependency distance (total & average)
compute_features now reads a list rather than a tuple, so that multiple features can be passed.

Sarenne commented Jul 15, 2020

I fixed a small bug in compute_features and added some functionality to DocumentProcessor so that a user can now either split sentences on newlines, to introduce manual tokenization, or use the Stanza sentence tokenization as before.
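For readers unfamiliar with the newline option, this is roughly what splitting on newlines amounts to. It is a minimal standalone sketch: `split_on_newlines` is a hypothetical helper, not the actual DocumentProcessor code, and it assumes blank lines should be dropped.

```python
def split_on_newlines(text):
    """Treat each non-empty line as one sentence (hypothetical helper)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


transcript = "the picture shows a boy\n\nhe is walking to the kitchen\n"
sentences = split_on_newlines(transcript)
```

This gives the user full control over sentence boundaries, which can matter for transcripts where punctuation is missing or unreliable.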


Sarenne commented Jul 17, 2020

I was having issues using CoreNLP (const_parse_tree()) because I didn't have a JDK installed. I added this requirement to the README instructions on setting up CoreNLP.
