Extending the Web Harvester
The Web Harvester is an integral part of the Esri Geoportal Server. The purpose of the web harvester is to facilitate acquiring metadata content existing on a remote repository and storing it within the Geoportal catalog. See How to Publish Resources for more information on how to harvest remote repositories.
The Web Harvester itself is a background process with a UI that allows registering a remote repository, defining harvest options (documents' owner, harvest frequency, notifications), starting and stopping a manual harvest, and viewing harvest results. Out of the box, the Web Harvester can access different types of remote repositories through different communication protocols: CSW, WAF, OAI-PMH, Esri IMS, Esri ArcGIS, THREDDS, and ordinary URLs pointing to some resource.
'Extending the web harvester' refers to designing and implementing a plugin that can access repositories through additional protocols besides the ones available out-of-the-box. Typically, such a plugin will be implemented in the Java programming language, although using JNI (Java Native Interface) is also possible.
This topic discusses concepts important for understanding how the Web Harvester can be extended to include additional protocols, and steps through an example you can implement as a tutorial. If you work through the example, download the files available here.
To understand the design of the Web Harvester, you must understand its underlying modules, important design principles, and the basic structures of the components. These are described in turn here.
The Web Harvester consists of the following modules (Figure 1). These are described in detail below:
The engine encapsulates core functionality of the Web Harvester. It controls the lifecycle of all dependent modules, and provides error handling, statistic collection, and report generation. It also produces and sends notification emails.
The task queue is a FIFO queue designed to store upcoming harvest requests. It is physically represented by two database tables: GPT_HARVESTING_JOBS_PENDING, which stores pending tasks, and GPT_HARVESTING_JOBS_COMPLETED, which stores completed tasks.
The thread pool is a collection of worker threads that are created up front and put into a waiting state. Once a new task is detected in the task queue, the engine dispatches that task to the first available thread. If there are no available threads (e.g., all are busy processing previous requests), the harvesting request stays in the queue until one of the threads completes its job.
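The interplay between a FIFO task queue and a pool of waiting worker threads can be pictured with the standard java.util.concurrent classes. The sketch below is not the Geoportal engine code: the class name HarvestDispatchSketch and the task names are made up for illustration, and the in-memory queue of the executor stands in for the database tables.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of the Geoportal Server API.
class HarvestDispatchSketch {

  // Runs one task per name on a fixed pool of worker threads and returns
  // the names in the order in which the tasks were actually started.
  static List<String> dispatch(List<String> taskNames, int workers) {
    final List<String> started =
        Collections.synchronizedList(new ArrayList<String>());
    // A fixed pool keeps submitted tasks in an internal FIFO queue
    // until one of its threads becomes available.
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    for (final String name : taskNames) {
      pool.execute(new Runnable() {
        public void run() {
          started.add(name); // a real worker would harvest a repository here
        }
      });
    }
    pool.shutdown(); // accept no new tasks, but finish the queued ones
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
    return new ArrayList<String>(started);
  }
}
```

With a single worker the tasks run strictly in submission order; with more workers they are still taken from the queue in order, but may run concurrently.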
Each protocol client understands how to access and retrieve documents on a particular remote repository. There are several implemented clients, each suitable for a single repository type: WAF, CSW, OAI-PMH, THREDDS, Esri IMS, Esri ArcGIS, and a URL. The worker thread uses the protocol client to iterate over the content of the remote repository and retrieve documents.
The publishing agent validates harvested documents and stores them within the Geoportal Server catalog. It also produces generic information collected for harvesting report generation purposes.
The auto selector is a background process designed to automatically select harvesting repositories and place them into the task queue according to their schedule.
The watchdog is a background process designed to detect changes to the task queue in multi-machine deployment scenarios. It periodically checks the state of the task queue, finds newly added elements (typically tasks added manually through a user action), and notifies the engine about them.
Configuration comes from two places: a collection of predefined parameters within the gpt.xml configuration file, which drives various aspects of the Web Harvester's functionality, and the list of defined remote repositories, with the configuration for each stored in the Geoportal database. The configuration of each repository specifies the protocol and the URL, how often the site should be revisited, whether the description of the repository (title) should be updated every time it is harvested, the user ID on whose behalf all newly acquired records will be published, and whether notification emails will be sent to the user who defined the repository for harvesting.
There is a user interface (UI) designed to create and access the repository definition through the web browser. It also facilitates initiating manual harvesting and viewing harvesting reports. The UI controller is a mediator between the web page and web harvester internal mechanisms.
Any design of a new protocol client should take into account that harvesting is the most time-consuming and memory-consuming process in the Geoportal Server. Best practices to follow when designing a protocol client are:
- Use an iterator pattern if possible
- Break data into smaller chunks (pagination) if an iterator pattern cannot be used
- Release memory as soon as possible
- Use reasonable data caching
- Tune the usage of third-party libraries
- Use the 'update date' feature, if available, to avoid synchronizing the entire catalog when only a few records have been updated.
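The first two practices can be combined: an iterator that fetches one page at a time keeps only a single page in memory. The following is a minimal sketch under stated assumptions; PageSource and PagedIterator are hypothetical names that do not exist in the Geoportal code base, and records are represented as plain strings.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: fetches one page of records at a time.
interface PageSource {
  // Returns at most 'size' records starting at 'offset';
  // a short or empty list signals the end of the repository.
  List<String> fetch(int offset, int size);
}

// Iterator that holds only the current page in memory.
class PagedIterator implements Iterator<String> {
  private final PageSource source;
  private final int pageSize;
  private int offset = 0;
  private boolean exhausted = false;
  private Iterator<String> page = Collections.<String>emptyList().iterator();

  PagedIterator(PageSource source, int pageSize) {
    this.source = source;
    this.pageSize = pageSize;
  }

  public boolean hasNext() {
    // Load the next page lazily, only when the current one is used up.
    while (!page.hasNext() && !exhausted) {
      List<String> next = source.fetch(offset, pageSize);
      offset += next.size();
      exhausted = next.size() < pageSize;
      page = next.iterator(); // the previous page can now be garbage collected
    }
    return page.hasNext();
  }

  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return page.next();
  }

  public void remove() {
    throw new UnsupportedOperationException("read-only");
  }
}
```

The page size is the tuning knob: larger pages mean fewer round trips to the remote repository, smaller pages mean less memory held at once.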
Protocol client functionality is defined through a set of interfaces with a minimal number of records. These interfaces promote certain patterns but do not enforce any particular way of implementation. The entire definition is found in the 'com.esri.gpt.framework.resource' Java package. There are four sub-packages:
- adapters
- api
- common
- query
Although developers are free to choose their own design, it is a good idea to follow the patterns used in the native implementation of the built-in protocol types. The best way to start is to take the OAI protocol client or CSW protocol client as an example to follow; this is discussed later in this topic.
There are three main interfaces in the api package: Resource, Publishable and Native.
Resource is a basic resource associated with or obtained from the remote repository. In most cases it is an actual metadata record or some sort of folder descriptor.
Publishable represents an item which can be published to the Geoportal Server catalog. The most prominent method of this interface is getContent(). It has to provide a string with a valid metadata record. There is no definition of what that metadata is or how it is obtained; it could be harvested directly from the remote repository, be in a different format than the metadata obtained from the repository, or even be generated from some non-metadata information received from the repository.
Native is just like the Publishable interface except it carries information about the remote repository itself. The Native interface has no new methods or fields and serves only to identify the semantics of the native resource description.
The interfaces from the query package define the workflow.
The QueryBuilder interface creates queries based upon some query criteria. It also provides information about remote repositories in the form of Native metadata as well as gives information about repository capabilities.
The only function of the Query interface is to be executed. Once executed, it should provide a result that allows iterating over all records from the repository that match the criteria.
The Result interface provides a way to iterate over all the records from the repository matching query criteria.
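The chain from QueryBuilder through Query to Result can be sketched as follows. The interfaces here are simplified stand-ins written for this example; their names and signatures are assumptions and do not reproduce the real interfaces in com.esri.gpt.framework.resource.query. Records are plain strings and the criteria are reduced to a single date.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Simplified stand-ins for the real interfaces (assumed shapes).
interface SketchResult {
  Iterable<String> getRecords();
}

interface SketchQuery {
  SketchResult execute();
}

interface SketchQueryBuilder {
  SketchQuery newQuery(Date fromDate);
}

class WorkflowSketch {

  // Builds a query builder over a fixed in-memory "repository".
  // The criteria are ignored here; every record matches.
  static SketchQueryBuilder builderFor(final List<String> repository) {
    return new SketchQueryBuilder() {
      public SketchQuery newQuery(Date fromDate) {
        return new SketchQuery() {
          public SketchResult execute() {
            return new SketchResult() {
              public Iterable<String> getRecords() {
                return repository;
              }
            };
          }
        };
      }
    };
  }

  // The three steps a worker thread conceptually goes through.
  static List<String> harvest(SketchQueryBuilder builder) {
    SketchQuery query = builder.newQuery(null);  // 1. build a query from criteria
    SketchResult result = query.execute();       // 2. execute it
    List<String> harvested = new ArrayList<String>();
    for (String record : result.getRecords()) {  // 3. iterate over the matches
      harvested.add(record);
    }
    return harvested;
  }
}
```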
The system can preserve protocol persistency. Each time a user creates a site to harvest, the user declares a protocol, which is saved to the database together with all the site data.
The key element of protocol persistency is the Protocol interface from the com.esri.gpt.control.webharvest.protocol package. An abstract class, HarvestProtocol from the com.esri.gpt.catalog.harvest.protocols package, serves as a base implementation of that interface. Going further, AbstractHTTPHarvestProtocol from the same package extends HarvestProtocol and serves as a common base for all HTTP-based protocols.
Another element is the ProtocolFactories class from the 'com.esri.gpt.control.webharvest.protocol' package. It is a collection of factories, each capable of creating an instance of the Protocol interface. Each new implementation of the Protocol has to have its factory registered within ProtocolFactories.
There are two legitimate ways of registering a factory: by manually hardcoding the factory within the ProtocolFactories class, or by configuration through the gpt.xml file.
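The idea of a collection of factories keyed by protocol name can be sketched as a small registry. All types below are hypothetical stand-ins; the real Protocol, ProtocolFactory and ProtocolFactories live in com.esri.gpt.control.webharvest.protocol and may behave differently. In particular, the case-insensitive lookup is an assumption of this sketch, not a documented property of the real class.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for Protocol and ProtocolFactory.
interface SketchProtocol {
  String getKind();
}

interface SketchProtocolFactory {
  String getName();
  SketchProtocol newProtocol();
}

// A registry of factories keyed by protocol name.
class SketchProtocolFactories {
  // Assumption: names are compared without regard to case.
  private final Map<String, SketchProtocolFactory> factories =
      new TreeMap<String, SketchProtocolFactory>(String.CASE_INSENSITIVE_ORDER);

  // Registers a factory under the name it reports itself.
  void register(SketchProtocolFactory factory) {
    factories.put(factory.getName(), factory);
  }

  // Returns a new protocol instance, or null if the name is unknown.
  SketchProtocol create(String name) {
    if (name == null) {
      return null;
    }
    SketchProtocolFactory factory = factories.get(name);
    return factory != null ? factory.newProtocol() : null;
  }
}
```

Registering the factory under the name returned by its own getName() avoids the two values drifting apart.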
Any time a new protocol client is designed and added to the system, the current user interface has to be modified to support that protocol client. Typically this means providing the user with the ability to declare a site supporting that protocol, define additional parameters for the protocol client, and search sites supporting the new protocol. The first step in adapting the user interface to the new protocol is to let the system know that the new protocol is available. There are two classes which need to be modified:
- com.esri.gpt.control.publication.ManageMetadataController
- com.esri.gpt.control.harvest.HarvestController
- Create an implementation of Resource, Publishable and Native interfaces from the 'com.esri.gpt.framework.resource.api' package. Refer to the available examples in 'com.esri.gpt.control.webharvest.client' for best practices and design patterns.
- Create an implementation of QueryBuilder, Query, Result interfaces from the 'com.esri.gpt.framework.resource.query' package. Refer to the available examples in 'com.esri.gpt.control.webharvest.client' for best practices and design patterns.
- Create the persistency mechanism by implementing the Protocol interface from the 'com.esri.gpt.control.webharvest.protocol' package. Refer to examples in 'com.esri.gpt.catalog.harvest.protocols'.
- Create the protocol factory by manually hardcoding such a factory in the 'com.esri.gpt.control.webharvest.protocol.ProtocolFactories' class or by modifying the gpt.xml configuration file.
- Create the user interface by modifying the ManageMetadataController class and HarvestController class and by creating a new section in harvestBody.jsp.
An example of the CSW protocol client (package: 'com.esri.gpt.control.webharvest.client.csw') shows how to create the protocol client in a case when the actual third party component already exists. In other words, the example of the CSW protocol client is a wrapper over a preexisting library.
An example of the OAI protocol client (package: 'com.esri.gpt.control.webharvest.client.oai') shows how to create a protocol client when no third-party library providing the communication mechanism exists. In other words, the OAI example is a complete implementation of the protocol client designed from scratch.
- src\com\esri\gpt\framework\resource\api\Resource.java
- src\com\esri\gpt\framework\resource\api\Publishable.java
- src\com\esri\gpt\framework\resource\api\Native.java
- src\com\esri\gpt\framework\resource\api\SourceUri.java
- src\com\esri\gpt\framework\resource\query\QueryBuilder.java
- src\com\esri\gpt\framework\resource\query\Query.java
- src\com\esri\gpt\framework\resource\query\Result.java
- src\com\esri\gpt\framework\resource\query\Criteria.java
- src\com\esri\gpt\control\webharvest\protocol\Protocol.java
- src\com\esri\gpt\control\webharvest\protocol\ProtocolFactory.java
- src\com\esri\gpt\control\webharvest\protocol\ProtocolFactories.java
- src\com\esri\gpt\catalog\harvest\protocols\HarvestProtocol.java
- src\com\esri\gpt\control\harvest\HarvestController.java
- src\com\esri\gpt\control\publication\ManageMetadataController.java
- web\catalog\harvest\harvestBody.jsp
- src\gpt\resources\gpt.properties
The goal of this example is to give developers hands-on experience with developing a new harvest protocol plugin. This example will design and implement such a client to allow registering a local or a network folder using the Microsoft UNC folder path and harvesting all metadata stored within that folder or its subfolders.
The prerequisites for successful completion of the example are a basic knowledge of the Java programming language and familiarity with an IDE (Integrated Development Environment) or editor: NetBeans, Eclipse, Notepad, Vi, etc.
The sections below discuss updating requisite components to support the UNC protocol.
You will find some example files helpful to reference as you work through the example. Download the files from here.
We will start by creating a package for the newly developed module. The package will be: src\com\esri\gpt\control\webharvest\client\unc
UncFile represents a single metadata file ready to publish. Because it represents an entity which is publishable, it has to implement the Publishable interface from Geoportal Server. In fact, UncFile is an adaptor to the actual File class from Java SDK, which converts that class into Publishable type. Let's create a stub first. It will have all the necessary imports required later on.
package com.esri.gpt.control.webharvest.client.unc;

import com.esri.gpt.framework.resource.api.Publishable;
import com.esri.gpt.framework.resource.api.Resource;
import com.esri.gpt.framework.resource.api.SourceUri;
import com.esri.gpt.framework.resource.common.StringUri;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import javax.xml.transform.TransformerException;
import org.xml.sax.SAXException;

class UncFile implements Publishable {
  private File file;

  public UncFile(File file) {
    this.file = file;
  }
}
Note that this doesn't have to be a public class. Next, provide the implementation of all methods defined in the Publishable interface.
class UncFile implements Publishable {
  private File file;

  public UncFile(File file) {
    this.file = file;
  }

  // A file has no sub-nodes, so the list is always empty.
  public Iterable<Resource> getNodes() {
    return new ArrayList<Resource>();
  }

  // The absolute path serves as the unique identifier of the record.
  public SourceUri getSourceUri() {
    return new StringUri(file.getAbsolutePath());
  }

  // The file system modification time drives the incremental harvest.
  public Date getUpdateDate() {
    return new Date(file.lastModified());
  }

  public String getContent() throws IOException, TransformerException, SAXException {
    FileReader reader = new FileReader(file);
    try {
      StringBuilder sb = new StringBuilder();
      char[] buffer = new char[1024];
      int count;
      while ((count = reader.read(buffer)) >= 0) {
        sb.append(buffer, 0, count);
      }
      return sb.toString();
    } finally {
      reader.close(); // release the file handle even if reading fails
    }
  }
}
The getNodes() method lists all the nodes for a given resource. Since a file as a resource doesn't have any sub-nodes, this method returns an empty list.
The getSourceUri() method provides a unique URI identifying the resource. It has to be unique; otherwise there will be duplicated items in the Geoportal Server catalog. In our case an absolute path is unique and can be used for that purpose.
The getUpdateDate() method is a critical element of an incremental harvest. An incremental harvest is a mechanism which greatly increases the performance of consecutive harvest sessions by fetching only the records that have changed since the last harvest. The getUpdateDate() method could simply return a null value, but in that case the incremental harvest would be disabled. In our case, we read the value of the update date directly from the file system.
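The decision behind an incremental harvest boils down to a date comparison. The helper below is a hypothetical illustration of that idea only; in the Geoportal Server the harvesting engine performs the comparison itself, and its exact rules may differ from the ones assumed here.

```java
import java.util.Date;

// Hypothetical helper; not part of the Geoportal Server API.
class IncrementalHarvestSketch {

  // Decides whether a record should be fetched again.
  //   updateDate  - what a resource would report from getUpdateDate(); may be null
  //   lastHarvest - time of the previous harvest; null if there was none
  static boolean needsHarvest(Date updateDate, Date lastHarvest) {
    if (lastHarvest == null) {
      return true; // first harvest: take everything
    }
    if (updateDate == null) {
      return true; // unknown update date: the record cannot be skipped safely
    }
    return updateDate.after(lastHarvest);
  }
}
```

This also shows why returning null from getUpdateDate() effectively disables the incremental harvest: every record is then treated as potentially changed.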
The getContent() method provides content of the file as a string. We will just read the content of the entire file using a FileReader class from the Java SDK.
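One caveat is that FileReader decodes bytes using the platform default character set, which may not match the encoding of the metadata files. A possible alternative, sketched below, reads the file with an explicit charset through java.nio.file.Files. It assumes the files are UTF-8 encoded, which this tutorial does not state; the class and method names are made up for illustration. IOException is wrapped only to keep the sketch easy to call; inside getContent() it could simply propagate.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of the tutorial's downloadable files.
class ReadContentSketch {

  // Reads the whole file as text. Assumption: the file is UTF-8 encoded.
  static String readUtf8(Path path) {
    try {
      return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Usage example: writes 'text' to a temporary file and reads it back.
  static String roundTrip(String text) {
    try {
      Path tmp = Files.createTempFile("metadata", ".xml");
      try {
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
        return readUtf8(tmp);
      } finally {
        Files.delete(tmp); // clean up the temporary file
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```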
UncFolder represents a single folder which can contain multiple files or other folders. Similar to UncFile, this is an adaptor to the File class. However, a folder is not an entity being published to the catalog, and therefore it doesn't need to implement the Publishable interface; only the Resource interface is needed here.
package com.esri.gpt.control.webharvest.client.unc;

import com.esri.gpt.control.webharvest.IterationContext;
import com.esri.gpt.framework.resource.api.Resource;
import com.esri.gpt.framework.resource.query.Criteria;
import com.esri.gpt.framework.util.ReadOnlyIterator;
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;

class UncFolder implements Resource {
  private IterationContext context;
  private Criteria criteria;
  private File root;

  public UncFolder(IterationContext context, Criteria criteria, File root) {
    this.context = context;
    this.criteria = criteria;
    this.root = root;
  }
}
Unlike UncFile, the constructor of UncFolder takes two additional arguments besides the File argument: context and criteria. Although neither context nor criteria will be used in our example, it is worth mentioning their purpose.
context is an instance of an object implementing the IterationContext interface. It is created by the harvesting engine each time a new harvest session is started. The sole purpose of this argument is to let the algorithm report any exception which may occur during harvesting. Refer to the implementation of the CSW protocol client or OAI protocol client for a better understanding of how context can be used.
criteria carries information about harvest criteria. Currently, only the last update date criterion is used, and the harvesting engine takes care of that particular criterion. A protocol client needs to use criteria only when doing so would improve performance. At the moment, only the WAF protocol client actively uses criteria.
Let's implement the only remaining method from the Resource interface: getNodes(). But before we can do that, we need to create an iterator.
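class UncFolder implements Resource {
  ...
  private class UncIterator extends ReadOnlyIterator<Resource> {
    private Iterator<Resource> resources;

    public boolean hasNext() {
      if (resources == null) {
        // Populate the list lazily, on the first call only.
        ArrayList<Resource> content = new ArrayList<Resource>();
        // listFiles() returns null if root is not a readable directory.
        File[] files = root.listFiles();
        if (files != null) {
          for (File f : files) {
            if (f.isDirectory()) {
              content.add(new UncFolder(context, criteria, f));
            } else if (f.isFile() && f.getName().toLowerCase().endsWith(".xml")) {
              content.add(new UncFile(f));
            }
          }
        }
        resources = content.iterator();
      }
      return resources.hasNext();
    }

    public Resource next() {
      // Calling hasNext() first guarantees that resources is initialized.
      if (!hasNext()) {
        throw new java.util.NoSuchElementException();
      }
      return resources.next();
    }
  }
  ...
}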
First, notice that UncIterator extends ReadOnlyIterator. It could implement the Iterator interface from the Java SDK directly, but in that case it would also need to implement the remove() method. ReadOnlyIterator is an abstract class which already implements remove(), so extending it is a convenient way to provide a custom iterator without unnecessary overhead.
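The mutating method that java.util.Iterator declares is remove(), and a read-only base class typically implements it once by rejecting the call. The sketch below is a guess at the shape of such a class; the actual com.esri.gpt.framework.util.ReadOnlyIterator may be implemented differently, and both class names here are hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Assumed shape of a read-only iterator base class.
abstract class ReadOnlyIteratorSketch<T> implements Iterator<T> {
  // Implemented once here so that subclasses do not have to.
  public final void remove() {
    throw new UnsupportedOperationException("This iterator is read-only.");
  }
}

// Example subclass: only hasNext() and next() remain to be written.
class ArrayIteratorSketch extends ReadOnlyIteratorSketch<String> {
  private final String[] items;
  private int index = 0;

  ArrayIteratorSketch(String... items) {
    this.items = items;
  }

  public boolean hasNext() {
    return index < items.length;
  }

  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return items[index++];
  }
}
```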
UncIterator is an inner class of UncFolder and therefore has full access to UncFolder member variables. It declares its own variable called resources and populates it the first time the hasNext() method is called. It lists the content of the root folder and, depending on whether an entry is a subfolder or a file with the .xml extension, creates an instance of UncFolder or UncFile. These instances are stored in a list, and once the content of the root folder has been processed, an iterator over that list is assigned to the resources variable. From then on, the job of the next() method is simply to return the next element from resources.
Now, let's finish the implementation of UncFolder and create the remaining getNodes() method. According to the method signature, it should return an instance of Iterable. The only method in Iterable is iterator(), so we will create an anonymous class of Iterable creating an instance of UncIterator.
class UncFolder implements Resource {
  ...
  public Iterable<Resource> getNodes() {
    return new Iterable<Resource>() {
      public Iterator<Resource> iterator() {
        return new UncIterator();
      }
    };
  }

  private class UncIterator extends ReadOnlyIterator<Resource> {
    ...
  }
}
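To see how a consumer might use getNodes(), consider a depth-first walk over a tree of folders and files. The sketch below uses simplified stand-in types (NodeSketch, LeafSketch, FolderSketch) instead of the real Resource, UncFile and UncFolder, so that it runs without the Geoportal libraries; how the harvesting engine actually traverses resources is not described in this topic and is assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the Resource interface (assumption: getNodes() only).
interface NodeSketch {
  Iterable<NodeSketch> getNodes();
}

// A leaf, playing the role of UncFile.
class LeafSketch implements NodeSketch {
  final String name;

  LeafSketch(String name) {
    this.name = name;
  }

  public Iterable<NodeSketch> getNodes() {
    return Collections.<NodeSketch>emptyList();
  }
}

// A container, playing the role of UncFolder.
class FolderSketch implements NodeSketch {
  private final List<NodeSketch> children;

  FolderSketch(NodeSketch... children) {
    this.children = Arrays.asList(children);
  }

  public Iterable<NodeSketch> getNodes() {
    return children;
  }
}

class TreeWalkSketch {
  // Depth-first walk that collects the names of all leaves.
  static List<String> collectLeaves(NodeSketch root) {
    List<String> names = new ArrayList<String>();
    collect(root, names);
    return names;
  }

  private static void collect(NodeSketch node, List<String> names) {
    if (node instanceof LeafSketch) {
      names.add(((LeafSketch) node).name);
    }
    for (NodeSketch child : node.getNodes()) {
      collect(child, names); // folders are expanded only when reached
    }
  }
}
```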
UncQuery carries all the information about harvesting conditions (context, criteria, starting point), and at the time of the request it must provide a Result. A Result doesn't need to know about all of the items to be obtained from the repository. Instead, it can use lazy computation and, as in our example, provide a way to iterate over the entire catalog of the repository while obtaining information about folders, sub-folders and files as it goes.
Our implementation uses a helper class: CommonResult. This class can take a number of different types of arguments and convert them into a Result (in that sense CommonResult is an adaptor). In our case it takes a single instance of UncFolder representing the root folder.
package com.esri.gpt.control.webharvest.client.unc;

import com.esri.gpt.control.webharvest.IterationContext;
import com.esri.gpt.control.webharvest.common.CommonResult;
import com.esri.gpt.framework.resource.query.Criteria;
import com.esri.gpt.framework.resource.query.Query;
import com.esri.gpt.framework.resource.query.Result;
import java.io.File;

public class UncQuery implements Query {
  private IterationContext context;
  private Criteria criteria;
  private File root;

  public UncQuery(IterationContext context, Criteria criteria, File root) {
    this.context = context;
    this.criteria = criteria;
    this.root = root;
  }

  public Result execute() {
    return new CommonResult(new UncFolder(context, criteria, root));
  }
}
Note that during a harvesting session, context and criteria will be provided by the harvesting engine, while the root argument will come from the repository definition stored in the database.
UncQueryBuilder is a factory class. Its method newQuery() defined in the QueryBuilder interface creates an instance of a concrete class implementing Query; in our example this is an instance of UncQuery.
package com.esri.gpt.control.webharvest.client.unc;

import com.esri.gpt.control.webharvest.IterationContext;
import com.esri.gpt.control.webharvest.common.CommonCapabilities;
import com.esri.gpt.framework.resource.api.Native;
import com.esri.gpt.framework.resource.query.Capabilities;
import com.esri.gpt.framework.resource.query.Criteria;
import com.esri.gpt.framework.resource.query.Query;
import com.esri.gpt.framework.resource.query.QueryBuilder;
import java.io.File;

public class UncQueryBuilder implements QueryBuilder {
  private IterationContext context;
  private File root;

  public UncQueryBuilder(IterationContext context, File root) {
    this.context = context;
    this.root = root;
  }

  public Query newQuery(Criteria crt) {
    return new UncQuery(context, crt, root);
  }

  public Native getNativeResource() {
    return null;
  }

  public Capabilities getCapabilities() {
    return new CommonCapabilities();
  }
}
getNativeResource() should return an instance of a class implementing the Native interface. Native is just like the Publishable interface except that the getContent() method provides a descriptor XML of the repository itself instead of a metadata record. In our case there is no repository descriptor, so the method returns a null value. Refer to the CSW protocol client example to understand how it might be used.
getCapabilities() currently has no meaningful purpose and is not used anywhere by the harvest engine. This is legacy code, but because the QueryBuilder interface defines it, we need to provide some sort of implementation. Thus, we simply return an instance of the CommonCapabilities class.
UncHarvestProtocol represents information to be stored in the database. This is an extension of the HarvestProtocol abstract class. Both are placed in a preexisting package com.esri.gpt.catalog.harvest.protocols.
package com.esri.gpt.catalog.harvest.protocols;

import com.esri.gpt.catalog.harvest.clients.exceptions.HRConnectionException;
import com.esri.gpt.control.webharvest.IterationContext;
import com.esri.gpt.control.webharvest.client.unc.UncQueryBuilder;
import com.esri.gpt.framework.resource.query.QueryBuilder;
import java.io.File;

public class UncHarvestProtocol extends HarvestProtocol {
  public ProtocolType getType() {
    return null;
  }

  public String getKind() {
    return "UNC";
  }

  public QueryBuilder newQueryBuilder(IterationContext context, String url) {
    return new UncQueryBuilder(context, new File(url));
  }

  public void ping(String url) throws Exception {
    File file = new File(url);
    if (!file.isDirectory() || !file.canRead()) {
      throw new HRConnectionException("Cannot access: " + url);
    }
  }
}
getType() is a legacy method; we have to implement it, and we simply return a null value.
getKind() should return a unique name of the harvest protocol.
newQueryBuilder() is a factory method; it creates an instance of an appropriate QueryBuilder. The url parameter will be provided from the repository definition.
ping() is called any time a user clicks the 'Test' button on the repository definition page. This method should check whether the protocol definition is correct by using the provided url parameter and any attribute available from the getAttributeMap() method of the HarvestProtocol class. The values of both the url parameter and getAttributeMap() come from the protocol definition form. ping() must throw HRConnectionException if it cannot access the repository, or HRInvalidProtocolException if the repository definition is invalid. In our case we just check whether the url parameter points to a folder on a local or network disk and whether that folder is readable.
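The distinction between an invalid definition and an unreachable repository can be sketched as follows. The two exception classes are stand-ins for HRInvalidProtocolException and HRConnectionException, whose real constructors may differ; treating a blank path as an invalid definition is an assumption of this sketch and goes beyond what the tutorial's ping() does.

```java
import java.io.File;

// Stand-in for HRInvalidProtocolException (assumed shape).
class InvalidDefinitionSketch extends Exception {
  InvalidDefinitionSketch(String message) {
    super(message);
  }
}

// Stand-in for HRConnectionException (assumed shape).
class ConnectionFailedSketch extends Exception {
  ConnectionFailedSketch(String message) {
    super(message);
  }
}

class PingSketch {
  static void ping(String url)
      throws InvalidDefinitionSketch, ConnectionFailedSketch {
    // A blank path can never be valid: the definition itself is wrong.
    if (url == null || url.trim().isEmpty()) {
      throw new InvalidDefinitionSketch("Folder path is empty.");
    }
    // A well-formed path that cannot be read is a connection problem.
    File folder = new File(url);
    if (!folder.isDirectory() || !folder.canRead()) {
      throw new ConnectionFailedSketch("Cannot access: " + url);
    }
  }
}
```

Reporting the two cases separately lets the user interface tell the user whether to correct the form or to check the network share.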
UncProtocolFactory is a factory class used to create an instance of the class implementing the Protocol interface. In our case it will be an instance of the UncHarvestProtocol class.
package com.esri.gpt.control.webharvest.protocol.factories;

import com.esri.gpt.catalog.harvest.protocols.UncHarvestProtocol;
import com.esri.gpt.control.webharvest.protocol.Protocol;
import com.esri.gpt.control.webharvest.protocol.ProtocolFactory;

public class UncProtocolFactory implements ProtocolFactory {
  public String getName() {
    return "UNC";
  }

  public Protocol newProtocol() {
    return new UncHarvestProtocol();
  }
}
getName() returns a name of the harvest protocol just like getKind() from HarvestProtocol.
newProtocol() creates an instance of the Protocol.
ProtocolFactories is a collection of all known harvest protocol factories. It could be configured through the gpt.xml configuration file, but in our case we will use the alternative way and add the UNC protocol client by hardcoding it.
...
public class ProtocolFactories extends TreeMap<String, ProtocolFactory> {
  ...
  public void initDefault() {
    put("ArcIms" , new ArcImsProtocolFactory());
    put("CSW"    , new CswProtocolFactory());
    put("OAI"    , new OaiProtocolFactory());
    put("WAF"    , new WafProtocolFactory());
    put("RES"    , new ResourceProtocolFactory());
    put("ARCGIS" , new ArcGISProtocolFactory());
    put("AGP"    , new AgpProtocolFactory());
    put("THREDDS", new ThreddsProtocolFactory());
    put("UNC"    , new UncProtocolFactory());
  }
  ...
}
All we have to do is alter the initDefault() method and register the new protocol factory under exactly the name that getName() of UncProtocolFactory returns.
gpt.properties is a resource file stored in the src\gpt\resources folder. It holds all the localized strings for any UI element displayed to the user. We will add just one line at the end of the file.
...
catalog.harvest.manage.edit.protocol.unc = UNC
The HarvestController class handles input from the harvest repository registration page. It is stored in the com.esri.gpt.control.harvest package. The only thing we need to do is to inform this controller about the new harvest protocol client.
public class HarvestController extends BaseHarvestController {
  ...
  private static final ProtocolDef [] protocolDefs = {
    new ProtocolDef("res", "catalog.harvest.manage.edit.protocol.resource", false),
    new ProtocolDef("arcgis", "catalog.harvest.manage.edit.protocol.arcgis", true),
    new ProtocolDef("arcims", "catalog.harvest.manage.edit.protocol.arcims", false),
    new ProtocolDef("oai", "catalog.harvest.manage.edit.protocol.oai", false),
    new ProtocolDef("waf", "catalog.harvest.manage.edit.protocol.waf", false),
    new ProtocolDef("csw", "catalog.harvest.manage.edit.protocol.csw", false),
    new ProtocolDef("thredds", "catalog.harvest.manage.edit.protocol.thredds", false),
    new ProtocolDef("unc", "catalog.harvest.manage.edit.protocol.unc", false),
  };
  ...
}
The ManageMetadataController class handles input from the resources management page. It is stored in the com.esri.gpt.control.publication package. The only thing we need to do is to inform this controller about the new harvest protocol client.
public class ManageMetadataController extends BaseActionListener {
  ...
  private static final ProtocolDef [] protocolDefs = {
    new ProtocolDef("res", "catalog.harvest.manage.edit.protocol.resource", false),
    new ProtocolDef("arcgis", "catalog.harvest.manage.edit.protocol.arcgis", true),
    new ProtocolDef("arcims", "catalog.harvest.manage.edit.protocol.arcims", false),
    new ProtocolDef("oai", "catalog.harvest.manage.edit.protocol.oai", false),
    new ProtocolDef("waf", "catalog.harvest.manage.edit.protocol.waf", false),
    new ProtocolDef("csw", "catalog.harvest.manage.edit.protocol.csw", false),
    new ProtocolDef("thredds", "catalog.harvest.manage.edit.protocol.thredds", false),
    new ProtocolDef("unc", "catalog.harvest.manage.edit.protocol.unc", false),
  };
  ...
}
Last but not least is the harvestBody.jsp page stored in the web\catalog\harvest folder. The most important action associated with adding a new harvest protocol client is to add a new section handling the extra parameters required for that protocol to work, for example the 'CSW profile' parameter for the CSW protocol. Since the UNC protocol doesn't need any extra information besides the file system path, we only need to make sure that these sections are correctly shown or hidden when the user clicks the radio buttons that select a protocol. This is a small change to the JavaScript code on the page.
function selectSection(section) {
  enableSection("res", section=="res");
  enableSection("arcgis", section=="arcgis");
  enableSection("arcims", section=="arcims");
  enableSection("oai", section=="oai");
  enableSection("waf", section=="waf");
  enableSection("csw", section=="csw");
  enableSection("agp", section=="agp");
  enableSection("thredds", section=="thredds");
  enableSection("unc", section=="unc");
  ...
}
When you have completed this step, you will compile your new custom files, deploy them, and test.