Feature/feature augmentation #14

Sarenne · 2020-07-15T16:00:04Z

Description

Added additional lexical features to measure language complexity:

- The full set of POS-tag rates from the Universal POS tag set
- Dependency distance features, including total and average dependency distance across sentence objects
- Number of dependencies, including total and average number of unique dependency relations across sentence objects
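
As a rough illustration of the first and third feature groups, here is a minimal sketch. It is not the code from this PR: the function names, the input shape (lists of `(word, upos, deprel)` triples), and the hand-written tags are all assumptions, and "rate" is assumed to mean tag count divided by total word count.

```python
from collections import Counter

# Hypothetical input: each sentence is a list of (word, upos, deprel) triples.
# The tags are hand-written for illustration, not real parser output.
sentences = [
    [("The", "DET", "det"), ("boy", "NOUN", "nsubj"),
     ("walks", "VERB", "root"), (".", "PUNCT", "punct")],
    [("He", "PRON", "nsubj"), ("picks", "VERB", "root"), ("a", "DET", "det"),
     ("cookie", "NOUN", "obj"), (".", "PUNCT", "punct")],
]


def pos_tag_rates(sentences):
    """Rate of each UPOS tag: tag count / total word count (assumed definition)."""
    counts = Counter(upos for sent in sentences for _, upos, _ in sent)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def unique_dependency_counts(sentences):
    """Total and per-sentence average of the number of distinct relations."""
    per_sentence = [len({deprel for _, _, deprel in sent}) for sent in sentences]
    total = sum(per_sentence)
    # Return NaN rather than dividing by zero when there are no sentences.
    average = total / len(per_sentence) if per_sentence else float("nan")
    return total, average


rates = pos_tag_rates(sentences)
total_unique, avg_unique = unique_dependency_counts(sentences)
```

Here the unique relations are counted per sentence and then summed or averaged; counting distinct relations over the whole document instead would be an equally plausible reading of the description.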

Notes

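Dependency distance: The way that Stanza handles punctuation dependency relations and the root relation could be producing unexpected distances. In the example below, the root's head index is 0, so its distance is simply its position in the sentence, and the final full stop attaches to the root verb, giving a distance of 15. Some consideration may be needed to confirm that this implementation behaves as intended: should punctuation relations be ignored?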

e.g.

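| ID | Word | Dep. head | Dep. distance |
|---|---|---|---|
| 1 | The | 2 | 1 |
| 2 | picture | 3 | 1 |
| **3** | **shows** | **0** | **3 (root)** |
| 4 | a | 5 | 1 |
| 5 | boy | 3 | 2 |
| 6 | walking | 5 | 1 |
| 7 | to | 9 | 2 |
| 8 | the | 9 | 1 |
| 9 | kitchen | 6 | 3 |
| 10 | to | 11 | 1 |
| 11 | pick | 6 | 5 |
| 12 | a | 13 | 1 |
| 13 | cookie | 11 | 2 |
| 14 | from | 17 | 3 |
| 15 | the | 17 | 2 |
| 16 | cookie | 17 | 1 |
| 17 | jar | 11 | 6 |
| **18** | **.** | **3** | **15 (punctuation)** |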
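
To make the question concrete, the sketch below recomputes the distances from the example above as `abs(id - head)`, once over all words and once with the root and punctuation relations skipped. It is an illustration only, not the PR's implementation: the function names and the `skip` parameter are hypothetical, and only the `root` and `punct` labels are taken from the example (other relations are left as `_` because they are not listed).

```python
# (id, word, head, deprel) rows copied from the example above. Only the
# "root" and "punct" relations are stated there; "_" marks unstated ones.
rows = [
    (1, "The", 2, "_"), (2, "picture", 3, "_"), (3, "shows", 0, "root"),
    (4, "a", 5, "_"), (5, "boy", 3, "_"), (6, "walking", 5, "_"),
    (7, "to", 9, "_"), (8, "the", 9, "_"), (9, "kitchen", 6, "_"),
    (10, "to", 11, "_"), (11, "pick", 6, "_"), (12, "a", 13, "_"),
    (13, "cookie", 11, "_"), (14, "from", 17, "_"), (15, "the", 17, "_"),
    (16, "cookie", 17, "_"), (17, "jar", 11, "_"), (18, ".", 3, "punct"),
]


def dependency_distances(rows, skip=()):
    """Distance |id - head| for every word whose relation is not in `skip`."""
    return [abs(word_id - head)
            for word_id, _word, head, deprel in rows
            if deprel not in skip]


def total_and_average(distances):
    """Total and mean distance; NaN average for an empty list."""
    total = sum(distances)
    return total, (total / len(distances) if distances else float("nan"))


all_total, all_avg = total_and_average(dependency_distances(rows))
kept_total, kept_avg = total_and_average(
    dependency_distances(rows, skip=("root", "punct")))
```

For this sentence the total drops from 51 to 33 and the average from about 2.83 to about 2.06 once the two highlighted rows are skipped, so the choice has a visible effect on the feature values.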


Sarenne · 2020-07-15T16:01:57Z

I fixed a small bug in compute_features and added an option to DocumentProcessor so that a user can now either split sentences on newlines, to set sentence boundaries manually, or use Stanza's sentence tokenization as before.
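
A minimal sketch of what newline-based splitting could look like is shown below. The function name is hypothetical and the actual flag on DocumentProcessor is not named in this thread. Stanza's own `tokenize_no_ssplit` option does something similar for text whose sentences are separated by blank lines, but whether this PR relies on it is not stated here.

```python
def split_on_newlines(text):
    """Treat every non-empty line as one sentence (manual segmentation)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


text = "The boy walks to the kitchen\nHe picks a cookie from the jar\n\n"
manual_sentences = split_on_newlines(text)
```

With this input, `manual_sentences` holds two entries, one per non-empty line, regardless of whether the lines end in punctuation.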

Sarenne · 2020-07-17T15:23:25Z

I was having issues using CoreNLP (const_parse_tree()) because I didn't have a JDK installed, so I added this requirement to the README instructions on setting up CoreNLP.
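
For anyone hitting the same problem, a quick pre-flight check along these lines can confirm that Java is reachable before CoreNLP is started. This is a sketch with a hypothetical helper name, not the check described in the README. It only verifies that a `java` executable on the PATH runs, which is an approximation of "a JDK is installed".

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on PATH and `java -version` runs."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        result = subprocess.run([java, "-version"],
                                capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not java_available():
    print("Java not found: install a JDK before using the CoreNLP features.")
```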

FEATURES.md

blabla/sentence_aggregators/lexico_semantic_fearture_aggregator.py

README.md

Sarenne added 7 commits July 14, 2020 14:54

✨ Adding additional pos tag rate features from UPOS

6585c18

✨ Adding more dependency-based features

c273659

- number of unique dependencies (total & average)
- dependency distance (total & average)

🚨

3e3df0e

📝 Update FEATURES.md

9bc07da

📝 Update FEATURES.md

d01ecb0

🐛 Updated compute_features arg type

7d73316

compute_features now takes a list rather than a tuple, so that multiple features can be passed.

⚡ Added flag to split sentences on newline or with stanza

d720aca

❇️ Update README to include JDK requirements

ee2a066

abhisheknovoic reviewed Jul 17, 2020

FEATURES.md (resolved)

abhisheknovoic reviewed Jul 17, 2020

blabla/sentence_aggregators/lexico_semantic_fearture_aggregator.py (outdated, resolved)

abhisheknovoic reviewed Jul 17, 2020

blabla/sentence_aggregators/lexico_semantic_fearture_aggregator.py (outdated, resolved)

abhisheknovoic reviewed Jul 17, 2020

README.md (outdated, resolved)

Sarenne added 4 commits July 20, 2020 11:45

📝 Update feature lists

0887136

🥅 Update some lexical features to check for nans

50277d4

Merge remote readme changes

ce6f1f9

📝 Update README with JDK check

3731ad1

abhisheknovoic merged commit 7b101db into novoic:dev Jul 21, 2020

abhisheknovoic mentioned this pull request Jul 22, 2020

Merging all changes for BlaBla V0.2 Release #16

Merged
