Utilities for reading the Annotated NYT corpus.
<dependency>
<groupId>edu.jhu.hlt</groupId>
<artifactId>annotated-nyt</artifactId>
<version>1.1.5</version>
</dependency>
Create a NYTCorpusDocumentParser object:
NYTCorpusDocumentParser parser = new NYTCorpusDocumentParser();
Read a single .xml document from the annotated NYT corpus:
Path p = Paths.get("/your/path/.xml");
byte[] bytes = Files.readAllBytes(p);
NYTCorpusDocument ncd = parser.fromByteArray(bytes, false);
AnnotatedNYTDocument and = new AnnotatedNYTDocument(ncd);
All fields in AnnotatedNYTDocument objects are guaranteed to be non-null.
Many of the fields in the corpus can be empty or null in the documents themselves. These fields are represented in the wrapper object, AnnotatedNYTDocument, as Optional fields.
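As a rough illustration of consuming such a field, the sketch below handles an Optional value without ever touching null. It does not call the library: describeHeadline and the idea of a "headline" field are hypothetical stand-ins, chosen only to show the pattern with java.util.Optional.

```java
import java.util.Optional;

public class OptionalFieldSketch {
    // Hypothetical helper: the parameter stands in for any Optional field
    // exposed by the wrapper object. It is not part of the library's API.
    static String describeHeadline(Optional<String> headline) {
        // map() runs only when a value is present; orElse() supplies the fallback.
        return headline.map(h -> "Headline: " + h).orElse("(no headline)");
    }

    public static void main(String[] args) {
        System.out.println(describeHeadline(Optional.of("Example headline")));
        System.out.println(describeHeadline(Optional.empty()));
    }
}
```

The same map/orElse chain should apply to any Optional-typed getter, whatever its actual name in the wrapper class.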
Convenience methods expose naturally list-based items as lists (e.g., the body as a List of paragraphs). Many of these sections can be null in the source documents. In those cases, the API returns an empty List object; these lists will never be null.
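The null-to-empty-list behavior described here can be sketched as follows. This is not the library's implementation: paragraphsOrEmpty is a hypothetical helper that only shows one common way to guarantee a non-null list to callers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyListSketch {
    // Hypothetical helper (an assumption, not the library's code):
    // turns a possibly-null list into a never-null, unmodifiable one.
    static List<String> paragraphsOrEmpty(List<String> maybeNull) {
        if (maybeNull == null) {
            return Collections.emptyList();
        }
        // Defensive copy so later changes to the input do not leak through.
        return Collections.unmodifiableList(new ArrayList<>(maybeNull));
    }

    public static void main(String[] args) {
        System.out.println(paragraphsOrEmpty(null).size()); // prints 0
        System.out.println(paragraphsOrEmpty(
                Arrays.asList("First paragraph.", "Second paragraph.")).size()); // prints 2
    }
}
```

With this contract, callers can iterate over the result directly and skip the null check.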
The integration test can be executed with the following command:
mvn clean verify -Pitest -DanytDataPath=/path/to/your/LDC/corpus/data/dir
The anytDataPath property should point to the data directory of the extracted ANYT corpus. This directory contains many folders with numbers as names, representing years of annotated NYT data.
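Before running the integration test, it may help to confirm that the path really points at the year folders. The sketch below is a hypothetical sanity check, not part of this project. It assumes the year folders have four-digit names, which the description above suggests but does not state outright.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class YearDirsSketch {
    // Returns the sorted names of sub-directories that look like years.
    // Assumption: year folders are named with exactly four digits.
    static List<String> yearDirs(Path dataDir) {
        try (Stream<Path> entries = Files.list(dataDir)) {
            return entries
                    .filter(Files::isDirectory)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.matches("\\d{4}"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: YearDirsSketch <data-dir>");
            return;
        }
        List<String> years = yearDirs(Paths.get(args[0]));
        System.out.println(years.isEmpty()
                ? "No year folders found; check the path."
                : "Year folders: " + years);
    }
}
```

An empty result most likely means the path points one level too high or too low in the extracted corpus.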