A Python package for clinical linguistic feature extraction in multiple languages
For information about contributing, citing, licensing (including commercial licensing) and getting in touch, please see our wiki.
Our documentation can be found here. Our paper can be found here. For a list of features and their language support, see FEATURES.md.
Note that BlaBla requires Python version 3.6 or later.
To install BlaBla from source:
git clone https://github.com/novoic/blabla.git
cd blabla
pip install .
To install BlaBla using PyPI:
pip install blabla
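To confirm the install worked, a quick import check is enough (a minimal sketch; printing the module path is just for verification and is not part of BlaBla's documented API):
import blabla

# If this import succeeds, the package is installed; the path shows where it lives.
print(blabla.__file__)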
For some features, BlaBla also requires Stanford CoreNLP to be installed. See FEATURES.md for a list of these features.
To set up CoreNLP version 4.0.0, run ./setup_corenlp.sh after changing corenlp_dir and lang if required. The valid lang parameters correspond to the languages available for CoreNLP:
english
arabic
chinese
french
german
spanish
After installation, or if you already have CoreNLP installed, let BlaBla know where to find it using export CORENLP_HOME=/path/to/corenlp.
CoreNLP also requires the Java Development Kit to be installed. To check whether it is already installed locally, run javac -version.
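Before running BlaBla, you can do a small pre-flight check of the environment (a minimal sketch using only the Python standard library; it is not part of BlaBla itself):
import os
import shutil

# Check that BlaBla will be able to find CoreNLP and that the JDK is on the PATH.
corenlp_home = os.environ.get('CORENLP_HOME')
print('CORENLP_HOME:', corenlp_home if corenlp_home else 'not set')
print('javac on PATH:', shutil.which('javac') is not None)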
Print the noun rate for some example text using Python (find the YAML configs inside the BlaBla repo):
from blabla.document_processor import DocumentProcessor

with DocumentProcessor('stanza_config/stanza_config.yaml', 'en') as doc_proc:
    content = open('example_configs/example_document.txt').read()
    doc = doc_proc.analyze(content, 'string')
    res = doc.compute_features(['noun_rate'])
    print(res)
Run BlaBla on a directory of text files and write the output to a csv file (find the YAML configs inside the BlaBla repo):
blabla compute-features -F example_configs/features.yaml -S stanza_config/stanza_config.yaml -i /path/to/text/files/dir -o blabla_features.csv -format string
For more details about usage, keep reading!
BlaBla uses two config files for feature extraction. One of them specifies settings for the CoreNLP Server and the other specifies the list of features.
BlaBla comes with a predefined config file for Stanza and CoreNLP, which can be found at stanza_config/stanza_config.yaml. You don't need to modify any of these values. However, if you want to run the CoreNLP Server on a port other than 9001, change the port number.
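If you do want to change the port, it can help to inspect the shipped config first (a minimal sketch assuming PyYAML is available; it only loads and prints the file, so no particular key layout is assumed):
import yaml  # PyYAML

# Load and print the predefined Stanza/CoreNLP server settings before editing them.
with open('stanza_config/stanza_config.yaml') as f:
    config = yaml.safe_load(f)
print(config)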
BlaBla supports two types of input: free-form text (a sentence or a paragraph) or an array of JSON objects.
You can process natural language represented as free text with BlaBla. A sample text file is provided at example_configs/example_document.txt. Note that we specify the input format "string" when we call the analyze method.
from blabla.document_processor import DocumentProcessor

with DocumentProcessor('stanza_config/stanza_config.yaml', 'en') as doc_proc:
    content = open('example_configs/example_document.txt').read()
    doc = doc_proc.analyze(content, 'string')
    res = doc.compute_features(['noun_rate', 'verb_rate'])
    print(res)
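The same DocumentProcessor can be reused for several documents. The sketch below assumes analyze can be called repeatedly inside one with block and uses a hypothetical my_texts/ directory:
import glob
from blabla.document_processor import DocumentProcessor

results = {}
with DocumentProcessor('stanza_config/stanza_config.yaml', 'en') as doc_proc:
    for path in glob.glob('my_texts/*.txt'):  # hypothetical input directory
        content = open(path).read()
        doc = doc_proc.analyze(content, 'string')
        results[path] = doc.compute_features(['noun_rate', 'verb_rate'])
print(results)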
BlaBla requires word-level timestamps for phonetic features. The input should be an array of JSON objects, one per sentence, each containing the words of that sentence and their timestamps. A sample format is provided in the example_configs/example_document.json file in this repository. Note that we specify the input format "json" when we call the analyze method.
from blabla.document_processor import DocumentProcessor

with DocumentProcessor('stanza_config/stanza_config.yaml', 'en') as doc_proc:
    content = open('example_configs/example_document.json').read()
    doc = doc_proc.analyze(content, 'json')
    res = doc.compute_features(['speech_rate'])
    print(res)
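If you are unsure how to structure your own JSON input, the quickest reference is the shipped example itself (a minimal sketch assuming the top level of the file is the array of sentences described above; it only loads and prints the file):
import json

# Each element of the array corresponds to one sentence.
with open('example_configs/example_document.json') as f:
    sentences = json.load(f)
print(sentences[0])  # inspect one sentence to see the expected fields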
Note: Please make sure the feature, language, and input format are compatible. If your input format is string and you ask for a feature that requires timestamps and is therefore only supported for JSON input (such as total_pause_time), the code will throw an exception. Refer to the FEATURES.md file for more information.
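If you want to handle such a mismatch gracefully instead of crashing, you can wrap the call in a generic try/except (a minimal sketch; BlaBla's specific exception class is not assumed here, so a broad Exception is caught):
from blabla.document_processor import DocumentProcessor

with DocumentProcessor('stanza_config/stanza_config.yaml', 'en') as doc_proc:
    content = open('example_configs/example_document.txt').read()
    doc = doc_proc.analyze(content, 'string')  # plain text: no timestamps available
    try:
        res = doc.compute_features(['total_pause_time'])
    except Exception as err:
        print(f'Feature/input mismatch: {err}')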
BlaBla can also be called using its command line interface (CLI). The sample command below shows how to use the CLI to process all the files in a directory and dump the output features as a CSV file.
blabla compute-features -F example_configs/features.yaml -S stanza_config/stanza_config.yaml -i /path/to/text/files/dir -o blabla_features.csv -format string
When running the above CLI, you will need to provide the following arguments:
-F: path to the features.yaml file defining the language and list of features.
-S: path to the stanza_config.yaml config file containing the default server settings for Stanza and CoreNLP.
-i: path to the input directory of text or JSON files.
-o: path to the output CSV file (blabla_features.csv in the example above).
-format: the format of the input files (either string or json).
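If you prefer to drive the CLI from a script, the same command can be launched via the standard library (a minimal sketch; the arguments are exactly those from the example above and /path/to/text/files/dir is a placeholder):
import subprocess

# Run the CLI command shown above programmatically.
subprocess.run(
    [
        'blabla', 'compute-features',
        '-F', 'example_configs/features.yaml',
        '-S', 'stanza_config/stanza_config.yaml',
        '-i', '/path/to/text/files/dir',
        '-o', 'blabla_features.csv',
        '-format', 'string',
    ],
    check=True,  # raise if BlaBla exits with a non-zero status
)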