Provides a library that can take `Communication` objects containing `Section`s and annotate them using the Stanford CoreNLP framework. This produces `Communication` objects with `Tokenization` objects, and optionally `EntityMention` and `Entity` objects.
```xml
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-stanford</artifactId>
  <version>x.y.z</version>
</dependency>
```
See `pom.xml` for the latest version.
All examples assume the input files contain `Communication` objects with, at minimum, `Section` objects underneath them and the `text` field set. This library will not produce useful output if there are no `Section` objects underneath the `Communication` objects that are run.
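If you are producing input yourself, the sketch below shows the minimum structure this library expects: a `Communication` with its `text` field set and at least one `Section` whose `textSpan` points into that text. It uses the Thrift-generated setters from concrete-java; the ids, tool name, and section kind are placeholder values, and helper details may vary between concrete versions.

```java
import edu.jhu.hlt.concrete.AnnotationMetadata;
import edu.jhu.hlt.concrete.Communication;
import edu.jhu.hlt.concrete.Section;
import edu.jhu.hlt.concrete.TextSpan;
import edu.jhu.hlt.concrete.UUID;

public class MinimalInputSketch {
  public static Communication buildExample() {
    String text = "John Smith visited Baltimore on Tuesday.";

    Communication comm = new Communication();
    comm.setId("example-doc-1");                       // placeholder id
    comm.setUuid(new UUID(java.util.UUID.randomUUID().toString()));
    comm.setType("document");
    comm.setText(text);
    comm.setMetadata(new AnnotationMetadata()
        .setTool("example-ingester")                   // placeholder tool name
        .setTimestamp(System.currentTimeMillis() / 1000L));

    // One Section covering the whole text; a real ingester would usually
    // create one Section per passage or paragraph.
    Section section = new Section();
    section.setUuid(new UUID(java.util.UUID.randomUUID().toString()));
    section.setKind("passage");
    section.setTextSpan(new TextSpan(0, text.length()));
    comm.addToSectionList(section);

    return comm;
  }
}
```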
There are two primary drivers: one that processes tokenized Concrete files, and one that does not. Each has its own requirements, described below.
If you have a directory of `.tar.gz` files you want to run through Stanford, see this script for how to do that via `qsub`. Make sure to build the project before running the script.
Load in a `Communication` whose `Section`s have `TextSpan`s set:
```java
// Sections are required for useful output
Communication withSections = ...;
// You need to know what language the Communication is written in
String language = "en";
```
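The `...` above elides deserialization. One way to fill it in is with the stock Thrift deserializer; this sketch assumes a single-`Communication` `.concrete` file written with the Thrift compact protocol (the usual convention for Concrete files), and the input path is a placeholder.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TCompactProtocol;

import edu.jhu.hlt.concrete.Communication;

// Read the raw bytes of one serialized Communication (placeholder path).
byte[] bytes = Files.readAllBytes(Paths.get("/path/to/input.concrete"));

// Communications are Thrift structs, so the stock Thrift deserializer works;
// the compact protocol is assumed here.
Communication withSections = new Communication();
new TDeserializer(new TCompactProtocol.Factory()).deserialize(withSections, bytes);
```

concrete-java also ships serialization helpers that wrap this up; use those if you prefer.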
Then create an annotator object, passing in the language of the `Communication`. The following example shows the `AnnotateNonTokenizedConcrete` tool.
```java
PipelineLanguage lang = PipelineLanguage.getEnumeration(language);
AnnotateNonTokenizedConcrete analytic = new AnnotateNonTokenizedConcrete(lang);
```
Run over the `Communication`:
```java
// Option 1: wrap the Communication in an appropriate wrapper to ensure prerequisites are handled.
// The constructor below throws a MiscommunicationException if there are no Sections or if there
// are Sentences within the Sections.
NonSentencedSectionedCommunication wc = new NonSentencedSectionedCommunication(withSections);
StanfordPostNERCommunication annotated = analytic.annotate(wc);
// Call 'getRoot()' to get the root, unwrapped Communication.
Communication unwrapped = annotated.getRoot();

// Option 2: do not wrap the Communication, and handle the possible exception.
// The call below will throw if the passed-in Communication 'withSections' is invalid
// for the analytic.
StanfordPostNERCommunication annotated = analytic.annotate(withSections);
Communication unwrapped = annotated.getRoot();
```
`annotated` is a `Communication` with the output of the system. This includes sentences and tokenizations and, depending on the annotator, entity mentions and entities as well. `StanfordPostNERCommunication` is a utility wrapper that allows easier access to members; see here for the implementations.
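To persist the result, serialize the unwrapped `Communication` back to bytes and write it out. A minimal sketch using the stock Thrift serializer, again assuming the compact protocol and a placeholder output path:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

// Serialize the annotated Communication and write it to a new .concrete file.
byte[] outBytes = new TSerializer(new TCompactProtocol.Factory()).serialize(unwrapped);
Files.write(Paths.get("/path/to/output.concrete"), outBytes);
```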
You can also run this tool as a command-line program: both `AnnotateTokenizedConcrete` and `AnnotateNonTokenizedConcrete` can be run via the command line.
- Argument 1: a path to a file on disk that is either a serialized Concrete `Communication` (ending with `.concrete`), a `.tar` file of serialized Concrete `Communication` objects, or a `.tar.gz` file of serialized Concrete `Communication` objects. Recall that each `Communication` must have `Section` objects and must have the `text` field set.
- Argument 2: a path that represents the desired output. The following are supported:
| Input | Result |
|---|---|
| `.concrete` or `.comm` file | Produces a single new `.concrete` or `.comm` file |
| `.tar` file with `Communication` objects | Produces a single `.tar` file with annotated `Communication`s |
| `.tar.gz` file with `Communication` objects | Produces a single `.tar.gz` file with annotated `Communication`s |
Alternatively, you can pass in a directory as output. If only a directory is used as output, the file name from the input will be used and the extension mirrored (e.g., if `.tar` is the input, `.tar.gz` will be the output).
- Argument 3 (optional): the language to use. Currently supported are `en` and `cn` (for English and Chinese). The default is `en`.
`concrete-stanford` can annotate both text that is pre-tokenized and text that is not. By default, all annotators add named entity recognition, part-of-speech tagging, lemmatization, a constituency parse, and three dependency parses (converted deterministically from the constituency parse).
The main annotator for non-tokenized input is `AnnotateNonTokenizedConcrete`. It requires sectioned data, and each section must have a valid `textSpan` set. In addition to the annotations above, `AnnotateNonTokenizedConcrete` will add entity mention identification and coreference.
The main annotator for tokenized input is `AnnotateTokenizedConcrete`. It requires fully tokenized data; each {`Section`, `Sentence`, `Token`} must have a valid `textSpan` set.
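No walkthrough is given above for the tokenized pathway; the sketch below mirrors Option 2 from the non-tokenized example. It assumes `AnnotateTokenizedConcrete` takes a `PipelineLanguage` in its constructor and exposes an `annotate(...)` method analogous to `AnnotateNonTokenizedConcrete`'s; check the class itself for the exact wrapper and return types before relying on this.

```java
// Sketch only: constructor and annotate(...) signatures are assumed to mirror
// the non-tokenized annotator shown earlier.
Communication withTokens = ...; // Sections, Sentences, and Tokens all present, with textSpans set
PipelineLanguage lang = PipelineLanguage.getEnumeration("en");
AnnotateTokenizedConcrete tokenizedAnalytic = new AnnotateTokenizedConcrete(lang);
StanfordPostNERCommunication annotatedTokens = tokenizedAnalytic.annotate(withTokens);
Communication result = annotatedTokens.getRoot();
```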
Replace the environment variables in the code below with paths that represent your input and output. The following should work in any `sh`-like shell. Be sure to change `[en | cn]` to either `en` or `cn`, depending on what language your documents are in.
```sh
export CONC_STAN_INPUT_FILE=/path/to/.concrete/or/.tar/or/.tar.gz
export CONC_STAN_OUTPUT_DIR=/path/to/output/dir
mvn clean compile assembly:single
java -cp target/*.jar edu.jhu.hlt.concrete.stanford.AnnotateNonTokenizedConcrete \
    $CONC_STAN_INPUT_FILE \
    $CONC_STAN_OUTPUT_DIR \
    [en | cn]
```
The Dockerfile stands up a server implementing Concrete's `AnnotateCommunicationService`. An image built from this Dockerfile is available on Docker Hub as hltcoe/concrete-stanford, and can be pulled using:

```sh
docker pull hltcoe/concrete-stanford
```
To see what command line flags are supported, run:
```sh
docker run hltcoe/concrete-stanford --help
```
At minimum, you must specify a language (currently, either `en` or `cn`) using the `--language` flag, e.g.:

```sh
docker run hltcoe/concrete-stanford --language en
```
The concrete-stanford `AnnotateCommunicationService` requires `Communication`s that have been at least section-segmented. See the "Known Annotators" section above for more details about the type of data concrete-stanford expects.
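For reference, a client talks to this service over Thrift. The sketch below assumes the generated `AnnotateCommunicationService.Client`, a host/port you have published when starting the container (the port here is a placeholder), and the framed transport with the compact protocol that Concrete services conventionally use; adjust all of these to match your deployment.

```java
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

import edu.jhu.hlt.concrete.Communication;
import edu.jhu.hlt.concrete.services.AnnotateCommunicationService;

// Placeholder host/port: publish the container's port and point the client at it.
TSocket socket = new TSocket("localhost", 9090);
TFramedTransport transport = new TFramedTransport(socket);
transport.open();

AnnotateCommunicationService.Client client =
    new AnnotateCommunicationService.Client(new TCompactProtocol(transport));

// 'sectioned' is a Communication with Sections, as described above.
Communication annotated = client.annotate(sectioned);
transport.close();
```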