# Batch Mode for RDF Generation
Karma can be used in batch mode to generate RDF for large datasets, using either the command-line utility OfflineRdfGenerator or the Karma RDF Generation API.
OfflineRdfGenerator is a command-line utility that loads a model and a source and then generates RDF. The source can be a JSON, XML, or CSV file, or a database. With a database source, rows are loaded 10,000 at a time.
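To make the 10,000-row batching concrete, here is a minimal Java sketch of reading rows in fixed-size batches. It illustrates the general pattern only and is not Karma's implementation; the class `BatchReader`, the method `nextBatch`, and the use of strings as rows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper (not part of Karma): pulls rows from an iterator in fixed-size batches. */
class BatchReader {
    // Batch size quoted in the text above.
    static final int BATCH_SIZE = 10_000;

    /** Returns up to batchSize rows; an empty list means the source is exhausted. */
    static <T> List<T> nextBatch(Iterator<T> rows, int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // 25,000 synthetic rows stand in for a database table.
        List<String> table = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            table.add("row-" + i);
        }
        Iterator<String> rows = table.iterator();
        List<String> batch;
        while (!(batch = nextBatch(rows, BATCH_SIZE)).isEmpty()) {
            System.out.println("processing " + batch.size() + " rows");
        }
    }
}
```

With a real database cursor as the iterator, only one batch would need to be held in memory at a time.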
To build the offline jar, go to the karma-offline subdirectory and execute the following:
cd karma-offline
mvn install -P shaded
This builds a standalone jar, karma-offline-0.0.1-SNAPSHOT-shaded.jar, in the target sub-folder of karma-offline. The jar can be used to generate RDF and JSON-LD in batch mode.
To generate RDF when the source is a file, go to the karma-offline/target sub-directory of Karma and execute the following command:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype <sourcetype> \
--filepath <filepath> \
--modelfilepath <modelfilepath> \
--sourcename <sourcename> \
--outputfile <outputfile>
Example invocation for a JSON file:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3
For a CSV file, you can specify additional parameters: the delimiter, the text qualifier, the header start index, and the data start index. Example invocation for a CSV file with tab as the delimiter and double quotes as the text qualifier:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype CSV \
--filepath "/files/data/wikipedia.csv" \
--delimiter TAB \
--textqualifier '\\\"' \
--headerindex 1 \
--dataindex 2 \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3
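When the batch job is launched from another Java program, the same flags can be assembled as an argument list. The sketch below builds the command for the CSV example above. The class `OfflineCsvCommand` and its methods are hypothetical names, the paths are the placeholder paths from the example, and the `--textqualifier` flag is left out because its escaping depends on the shell. The sketch only prints the command; the commented-out `ProcessBuilder` call would start it, assuming the jar is in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Karma) that assembles an OfflineRdfGenerator command line. */
class OfflineCsvCommand {

    static List<String> build(String jar, String csvPath, String modelPath,
                              String sourceName, String outputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(jar);
        cmd.add("edu.isi.karma.rdf.OfflineRdfGenerator");
        // Flags and values below are copied from the CSV example in the text.
        addOption(cmd, "--sourcetype", "CSV");
        addOption(cmd, "--filepath", csvPath);
        addOption(cmd, "--delimiter", "TAB");
        addOption(cmd, "--headerindex", "1");
        addOption(cmd, "--dataindex", "2");
        addOption(cmd, "--modelfilepath", modelPath);
        addOption(cmd, "--sourcename", sourceName);
        addOption(cmd, "--outputfile", outputFile);
        return cmd;
    }

    private static void addOption(List<String> cmd, String flag, String value) {
        cmd.add(flag);
        cmd.add(value);
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "karma-offline-0.0.1-SNAPSHOT-shaded.jar",
                "/files/data/wikipedia.csv",
                "/files/models/model-wikipedia.ttl",
                "wikipedia",
                "wikipedia-rdf.n3");
        System.out.println(String.join(" ", cmd));
        // To actually run it:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Because `ProcessBuilder` passes each list element as a separate argument, paths containing spaces need no shell quoting.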
To generate RDF from a database table, go to the karma-offline/target sub-directory of Karma and run the following command from a terminal:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype DB \
--modelfilepath <modelfilepath> \
--outputfile <outputfile> \
--dbtype <dbtype> \
--hostname <hostname> \
--username <username> \
--password <password> \
--portnumber <portnumber> \
--dbname <dbname> \
--tablename <tablename>
Valid values for dbtype are Oracle, MySQL, SQLServer, PostGIS, and Sybase.
In addition to the karma-offline jar, the JDBC driver for the database must be on the classpath.
Example invocation:
java -cp mysql-connector-java-5.0.8-bin.jar:karma-offline-0.0.1-SNAPSHOT-shaded.jar \
edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype DB \
--dbtype MySQL \
--hostname localhost \
--username root \
--password mypassword \
--portnumber 3306 \
--dbname karma \
--tablename offlineUsers \
--modelfilepath "/Users/dipsy/karma-projects/offlineUsers-model.ttl" \
--outputfile offlineUsers-rdf.n3
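The example above joins the two jars with `:`, which is the classpath separator on Linux and macOS; on Windows it is `;`. The following Java sketch builds the classpath portably with `File.pathSeparator`. The class `KarmaClasspath` and the method `join` are hypothetical names; the jar names are the ones from the example.

```java
import java.io.File;

/** Hypothetical helper (not part of Karma) for building a platform-correct classpath. */
class KarmaClasspath {

    /** Joins jar paths with the platform's classpath separator (":" or ";"). */
    static String join(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        String cp = join(
                "mysql-connector-java-5.0.8-bin.jar",
                "karma-offline-0.0.1-SNAPSHOT-shaded.jar");
        // The result is the value to pass after "java -cp".
        System.out.println(cp);
    }
}
```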
If the model requires a selection, the selection name 'DEFAULT_TEST' must be passed to OfflineRdfGenerator with the --selection command-line argument. This makes it possible to run the same model with or without a selection in offline mode.
Example invocation:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--selection "DEFAULT_TEST" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3
If Karma cannot accurately detect the encoding of the source, specify it with the --encoding option.
Example invocation:
java -cp karma-offline-0.0.1-SNAPSHOT-shaded.jar edu.isi.karma.rdf.OfflineRdfGenerator \
--sourcetype JSON \
--filepath "/files/data/wikipedia.json" \
--modelfilepath "/files/models/model-wikipedia.ttl" \
--selection "DEFAULT_TEST" \
--sourcename wikipedia \
--encoding "UTF-8" \
--outputfile wikipedia-rdf.n3
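One way to decide which value to pass to `--encoding` is to test whether the file's bytes decode cleanly under a candidate charset. The following Java sketch does that with the standard `CharsetDecoder`. It is independent of Karma, and the class `EncodingCheck` and the method `decodesCleanly` are hypothetical names. A clean decode does not prove that the charset is the right one (ISO-8859-1, for example, accepts any byte sequence), so treat the result as a sanity check only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Karma) for checking a candidate encoding. */
class EncodingCheck {

    /** Returns true if the bytes form a valid sequence in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the data file as the first argument; without one, a small sample is checked.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "sample text".getBytes(StandardCharsets.UTF_8);
        System.out.println("Valid UTF-8: " + decodesCleanly(data, StandardCharsets.UTF_8));
    }
}
```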
The Karma RDF Generation API is meant for repeated RDF generation from the same model. The models are loaded once at the beginning, and each subsequent request uses a loaded model to generate RDF. The input can be JSON, CSV, or XML, supplied as a File, String, or InputStream.
edu.isi.karma.rdf.GenericRDFGenerator
API to add a model to the RDF Generator
// modelIdentifier : Provides a name and location of the model file
void addModel(R2RMLMappingIdentifier modelIdentifier);
API to generate RDF for a request
// request: provides all inputs to the RDF generator, such as the input data and the provenance setting
void generateRDF(RDFGeneratorRequest request);
edu.isi.karma.rdf.RDFGeneratorRequest
API to set the input data
//inputData : Input Data as String
public void setInputData(String inputData)
//inputStream: Input data as a Stream
public void setInputStream(InputStream inputStream)
//inputFile: Input data file
public void setInputFile(File inputFile)
API to set the input data type
//dataType: Valid values: CSV,JSON,XML,AVRO
public void setDataType(InputType dataType)
Setting to generate provenance information
//addProvenance -> flag to indicate if provenance information should be added to the RDF
public void setAddProvenance(boolean addProvenance)
The writer for RDF
//writer -> Writer for the RDF output. This can be an N3KR2RMLRDFWriter or JSONKR2RMLRDFWriter or BloomFilterKR2RMLRDFWriter
public void addWriter(KR2RMLRDFWriter writer)
Example use:
GenericRDFGenerator rdfGenerator = new GenericRDFGenerator();
// Construct an R2RMLMappingIdentifier that provides a name for the model and the location of the model file, then add the model to the GenericRDFGenerator. You can add multiple models this way.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);
String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
N3KR2RMLRDFWriter writer = new N3KR2RMLRDFWriter(new URIFormatter(), pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
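// getTestResource(...) is not defined in this snippet; it stands for a helper that resolves the file name to a URL. Substitute your own file lookup.
request.setInputFile(new File(getTestResource(filename).toURI()));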
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String rdf = sw.toString();
System.out.println("Generated RDF: " + rdf);
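The example collects the generated RDF in a `StringWriter`, which keeps the whole output in memory. For larger inputs, a `PrintWriter` backed by a file is one alternative. The sketch below shows only the standard-library part. The class `FileBackedWriter` and the method `open` are hypothetical names, and passing the resulting `PrintWriter` to `N3KR2RMLRDFWriter` in place of `pw` is an assumption based on the constructor call shown above.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Karma) that opens a file-backed PrintWriter. */
class FileBackedWriter {

    /** Opens a UTF-8 PrintWriter on the given file, replacing any existing content. */
    static PrintWriter open(Path outputFile) throws IOException {
        return new PrintWriter(Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("people-rdf", ".n3");
        // try-with-resources flushes and closes the writer, so the file is complete afterwards.
        try (PrintWriter pw = open(out)) {
            // In the Karma example, pw would be handed to the RDF writer here (assumption).
            pw.println("# RDF output would be written here");
        }
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```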
If the model requires a selection, GenericRDFGenerator provides a constructor that takes the selection name 'DEFAULT_TEST' as its argument.
Example use:
GenericRDFGenerator rdfGenerator = new GenericRDFGenerator("DEFAULT_TEST");
// Construct an R2RMLMappingIdentifier that provides a name for the model and the location of the model file, then add the model to the GenericRDFGenerator. You can add multiple models this way.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);
String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
N3KR2RMLRDFWriter writer = new N3KR2RMLRDFWriter(new URIFormatter(), pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
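// getTestResource(...) is not defined in this snippet; it stands for a helper that resolves the file name to a URL. Substitute your own file lookup.
request.setInputFile(new File(getTestResource(filename).toURI()));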
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String rdf = sw.toString();
System.out.println("Generated RDF: " + rdf);