Generating a NIF dataset using Java
This article describes how a developer can use the classes defined in our gerbil.nif.transfer library to generate a dataset using the Natural Language Processing Interchange Format (NIF). Note that this article is not a complete NIF tutorial, and our library does not handle all of the possibilities, classes and properties offered by the NIF ontology.
Throughout this article, we use the following text as a small example.
Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.
Inside this text, we would like to mark Japan and stratovolcanic archipelago as named entities. Furthermore, we would like to express that Japan is a country and a stratovolcanic archipelago, and that a stratovolcanic archipelago is a special type of archipelago. Additionally, we would like to add the topic (or tag) "Geography", since the text contains geographical information.
The implementation of the example can be found here.
We start by creating a document object using the text and a document URI.
String text = "Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";
Document document = new DocumentImpl(text, "http://example.org/document0");
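A document has a list of so-called Markings. These carry additional information that can be attached to the text, e.g., the occurrence of a named entity. A list of available Markings can be found here.
For our two named entities, we create TypedNamedEntity objects and add them to the document. The first two constructor arguments are the start position of the entity in the text and its length in characters, followed by the set of entity URIs and the set of type URIs.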
Set<String> uris = new HashSet<String>();
uris.add("http://example.org/Japan");
Set<String> types = new HashSet<String>();
types.add("http://example.org/Country");
types.add("http://example.org/StratovolcanicArchipelago");
document.addMarking(new TypedNamedEntity(0, 5, uris, types));
uris = new HashSet<String>();
uris.add("http://example.org/StratovolcanicArchipelago");
types = new HashSet<String>();
types.add("http://example.org/Archipelago");
types.add("http://www.w3.org/2000/01/rdf-schema#Class");
document.addMarking(new TypedNamedEntity(42, 26, uris, types));
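Start positions and lengths are easy to miscount by hand, so it can help to derive them from the text itself. The following is a minimal sketch that uses only the Java standard library; the class SpanOffsets and its method locate are hypothetical helper names and not part of our library. It assumes that positions are counted in Java char units, which agrees with the indexes shown in this article for the example text.

```java
class SpanOffsets {

    /**
     * Returns {start, length} of the first occurrence of surfaceForm in text.
     * Hypothetical helper for illustration only.
     */
    static int[] locate(String text, String surfaceForm) {
        int start = text.indexOf(surfaceForm);
        if (start < 0) {
            throw new IllegalArgumentException("Surface form not found: " + surfaceForm);
        }
        return new int[] { start, surfaceForm.length() };
    }

    public static void main(String[] args) {
        // The two Unicode escapes stand for the two kanji of the example text.
        String text = "Japan (Japanese: \u65E5\u672C Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";

        int[] japan = locate(text, "Japan");
        int[] archipelago = locate(text, "stratovolcanic archipelago");

        System.out.println(japan[0] + ", " + japan[1]);             // prints "0, 5"
        System.out.println(archipelago[0] + ", " + archipelago[1]); // prints "42, 26"
    }
}
```

The two returned values can be passed as the first two arguments of the TypedNamedEntity constructor. Note that indexOf only finds the first occurrence, so a surface form that appears several times in a text needs additional handling.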
The topic "Geography" is added using the Annotation class.
uris = new HashSet<String>();
uris.add("http://example.org/Geography");
document.addMarking(new Annotation(uris));
Since a "real" corpus comprises more than one document, we add our generated document to a list, to which further documents could be added.
List<Document> documents = new ArrayList<Document>();
documents.add(document);
Writing our new list of documents to an OutputStream, a Writer or a simple String can be done using an instance of the org.aksw.gerbil.io.nif.NIFWriter interface. In our example, we are using a writer for Turtle, i.e., an org.aksw.gerbil.io.nif.impl.TurtleNIFWriter object. (Note that there are no other implementations of this interface at the moment.)
NIFWriter writer = new TurtleNIFWriter();
String nifString = writer.writeNIF(documents);
System.out.println(nifString);
This would print the following Turtle to our console:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/document0#char=0,86>
a nif:RFC5147String , nif:String , nif:Context ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "86"^^xsd:nonNegativeInteger ;
nif:isString "Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands."^^xsd:string ;
nif:topic <http://example.org/document0#annotation0> .
<http://example.org/document0#char=0,5>
a nif:RFC5147String , nif:String ;
nif:anchorOf "Japan"^^xsd:string ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "5"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://example.org/document0#char=0,86> ;
itsrdf:taClassRef <http://example.org/Country> , <http://example.org/StratovolcanicArchipelago> ;
itsrdf:taIdentRef <http://example.org/Japan> .
<http://example.org/document0#char=42,68>
a nif:RFC5147String , nif:String ;
nif:anchorOf "stratovolcanic archipelago"^^xsd:string ;
nif:beginIndex "42"^^xsd:nonNegativeInteger ;
nif:endIndex "68"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://example.org/document0#char=0,86> ;
itsrdf:taClassRef <http://example.org/Archipelago> , rdfs:Class ;
itsrdf:taIdentRef <http://example.org/StratovolcanicArchipelago> .
<http://example.org/document0#annotation0>
a nif:Annotation ;
itsrdf:taIdentRef <http://example.org/Geography> .
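In the output above, each span is identified by a URI with a "#char=begin,end" fragment, where the end index equals the start position plus the length of the marking. The following sketch reproduces only this naming pattern as it appears in the printed Turtle; the class NifFragment and its method are hypothetical names, and this is not the code our writer uses internally.

```java
class NifFragment {

    /**
     * Builds a URI of the form documentUri#char=begin,end, where
     * end = start + length. Hypothetical helper for illustration only.
     */
    static String charFragmentUri(String documentUri, int start, int length) {
        int end = start + length;
        return documentUri + "#char=" + start + "," + end;
    }

    public static void main(String[] args) {
        String documentUri = "http://example.org/document0";

        // The whole document: 86 characters starting at position 0.
        System.out.println(charFragmentUri(documentUri, 0, 86));
        // prints "http://example.org/document0#char=0,86"

        // "Japan": start 0, length 5.
        System.out.println(charFragmentUri(documentUri, 0, 5));
        // prints "http://example.org/document0#char=0,5"

        // "stratovolcanic archipelago": start 42, length 26.
        System.out.println(charFragmentUri(documentUri, 42, 26));
        // prints "http://example.org/document0#char=42,68"
    }
}
```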
After generating a NIF corpus, it can be helpful to parse the NIF using a NIFParser instance. In our example, we can do this in the following way.
NIFParser parser = new TurtleNIFParser();
parser.parseNIF(nifString);
Our parser implementation checks the position of Span instances by checking their first and last characters as well as the single characters preceding and following the span. Changing the length of the named entity "Japan" to the incorrect value 6 would lead to the following warning message:
... WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an anormal marking that ends with a whitespace: "'Japan '(Japanese: 日本 Nippon...">
Parsing the created NIF and looking for such warnings can help to find mistakes.
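To illustrate the kind of check described above, the following sketch inspects the first and last character of a span and the characters directly before and after it. It is a simplified stand-in written for this article and not the implementation of NIFPositionHelper; the class name, method name and warning texts are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SpanBoundaryCheck {

    /** Returns a list of warnings for a span; an empty list means nothing suspicious was found. */
    static List<String> check(String text, int start, int length) {
        List<String> warnings = new ArrayList<>();
        int end = start + length;
        if (start < 0 || length <= 0 || end > text.length()) {
            warnings.add("span is outside of the text");
            return warnings;
        }
        char first = text.charAt(start);
        char last = text.charAt(end - 1);
        // First and last character of the span should not be whitespace.
        if (Character.isWhitespace(first)) {
            warnings.add("span starts with a whitespace");
        }
        if (Character.isWhitespace(last)) {
            warnings.add("span ends with a whitespace");
        }
        // A letter or digit on both sides of a boundary suggests the span cuts a word.
        if (start > 0 && Character.isLetterOrDigit(text.charAt(start - 1))
                && Character.isLetterOrDigit(first)) {
            warnings.add("span seems to start inside a word");
        }
        if (end < text.length() && Character.isLetterOrDigit(text.charAt(end))
                && Character.isLetterOrDigit(last)) {
            warnings.add("span seems to end inside a word");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // The two Unicode escapes stand for the two kanji of the example text.
        String text = "Japan (Japanese: \u65E5\u672C Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";
        System.out.println(check(text, 0, 5)); // prints "[]"
        System.out.println(check(text, 0, 6)); // prints "[span ends with a whitespace]"
    }
}
```

Running such a check before writing the corpus catches the same class of mistakes as the parser warnings, for example a length that is one character too long or a span that stops in the middle of a word.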
Instead of text containing the NIF information, a Jena RDF Model can be created.
DocumentListWriter listWriter = new DocumentListWriter();
Model nifModel = ModelFactory.createDefaultModel();
listWriter.writeDocumentsToModel(nifModel, documents);