Pure Java Nokogiri for JRuby
The pure Java version of Nokogiri is a Java port for JRuby. When Nokogiri 1.5.x is used on JRuby, the pure Java version is the one that runs. With Nokogiri 1.4.x and earlier, the FFI version of Nokogiri runs on JRuby via the FFI library, which requires libxml2/libxslt to be installed. The pure Java version, on the other hand, uses neither libxml2/libxslt nor the FFI library. It uses Apache Xerces, nekoHTML, and a couple of other pure Java APIs. Nokogiri’s methods that are implemented with C libraries have been reimplemented in Java. This means Nokogiri is no longer limited to environments where the native libraries are available. It works even in a pure Java environment such as Google App Engine, and it works on Windows painlessly. The pure Java version was finally released on Jul 1, 2011.
Give the pure Java version a try and let us know your impressions. If you find a bug, file it with the “pure-java” tag.
gem install nokogiri
- JDK 1.6.0 and later
- JRuby 1.5.1 and later
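The JRuby requirement above can be checked at runtime. The following is a minimal sketch, assuming you want to guard code that depends on the pure Java version. The helper `pure_java_nokogiri_supported?` and the constant `MINIMUM_JRUBY` are illustrative names, not part of Nokogiri; `RUBY_ENGINE` and `JRUBY_VERSION` are constants defined by the interpreter itself. The JDK requirement is not checked here.

```ruby
require "rubygems" # Gem::Version comes from RubyGems (loaded by default on current Rubies)

# Minimum JRuby version, taken from the requirements list above.
MINIMUM_JRUBY = Gem::Version.new("1.5.1")

# Hypothetical helper (not part of Nokogiri): true when the given engine and
# version satisfy the JRuby requirement of the pure Java version.
def pure_java_nokogiri_supported?(engine, jruby_version)
  return false unless engine == "jruby" && jruby_version

  Gem::Version.new(jruby_version) >= MINIMUM_JRUBY
end

# RUBY_ENGINE is missing on very old interpreters; JRUBY_VERSION exists only on JRuby.
engine = defined?(RUBY_ENGINE) ? RUBY_ENGINE : "ruby"
jruby_version = defined?(JRUBY_VERSION) ? JRUBY_VERSION : nil

if pure_java_nokogiri_supported?(engine, jruby_version)
  puts "JRuby #{jruby_version}: the pure Java version can be used"
else
  puts "Not running on a supported JRuby (engine: #{engine})"
end
```

Comparing with `Gem::Version` rather than plain strings avoids mistakes such as treating "1.10.0" as older than "1.5.1".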
Pure Java Nokogiri uses the Java APIs below:
- CyberNeko HTML Parser 1.9.12 – http://nekohtml.sourceforge.net/
- CyberNeko DTD Converter 0.1.11 – http://people.apache.org/~andyc/neko/doc/dtd/index.html
- Apache Xerces2 Java 2.9.0 – http://xerces.apache.org/
- Jing 20081028 – A RELAX NG validator in Java; http://www.thaiopensource.com/relaxng/jing.html
- iso-relax – RELAX Core; http://www.xml.gr.jp/relax/
Nokogiri 1.5.0.beta.3 or later works fine on Google App Engine. If you are using 1.5.0.beta.2, however, you need a small hack to run it with the google-appengine gem.
1) Comment out the five require xxx.jar lines in .gems/bundler_gems/jruby/1.8/gems/nokogiri-1.5.0.beta.2-java/lib/nokogiri.rb
# -*- coding: utf-8 -*-
# Modify the PATH on windows so that the external DLLs will get loaded.

require 'rbconfig'
ENV['PATH'] = [File.expand_path(
  File.join(File.dirname(__FILE__), "..", "ext", "nokogiri")
), ENV['PATH']].compact.join(';') if RbConfig::CONFIG['host_os'] =~ /(mswin|mingw)/i

if defined?(RUBY_ENGINE) && RUBY_ENGINE == "jruby"
  # require 'isorelax.jar'
  # require 'jing.jar'
  # require 'nekohtml.jar'
  # require 'nekodtd.jar'
  # require 'xercesImpl.jar'
  require 'nokogiri/nokogiri'
else
  require 'nokogiri/nokogiri'
end
2) Remove WEB-INF/lib/gems.jar (if you have this file)
3) Restart the server
Then Nokogiri will start working. This bug has been fixed in master, so the 1.5.0 final release won’t have this problem.
Please note that pure Java Nokogiri is not yet fully tested on Google App Engine. There might be GAE-specific problems.
The pure Java version has methods to handle an org.w3c.dom.Document object directly. You might want to manipulate an XML document using the Nokogiri API and then send it back to a Java API. For such cases, these two methods will help you.
Class Nokogiri::XML::Document
Public Class Method
wrap(document)
Wrap org.w3c.dom.Document object and return Nokogiri::XML::Document
Public Instance Method
to_java()
Return org.w3c.dom.Document object of this Nokogiri::XML::Document
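The round trip described above could look roughly like the following sketch. It only does real work on JRuby. The helper name `tag_root_via_nokogiri` is hypothetical, and the way the Java document is built (the JDK’s `DocumentBuilderFactory`) is just one assumed source of an org.w3c.dom.Document; only `wrap` and `to_java` come from Nokogiri.

```ruby
# Hypothetical helper (not part of Nokogiri): parses XML with the JDK's DOM
# parser, edits it through the Nokogiri API, and hands back the
# org.w3c.dom.Document for further use from Java.
def tag_root_via_nokogiri(xml, attribute, value)
  unless defined?(RUBY_ENGINE) && RUBY_ENGINE == "jruby"
    raise NotImplementedError, "wrap and to_java exist only in the pure Java version"
  end

  # Loaded lazily so that this file can still be read by other Ruby engines.
  require "java"
  require "nokogiri"

  # Build an org.w3c.dom.Document with plain Java APIs.
  factory = javax.xml.parsers.DocumentBuilderFactory.newInstance
  factory.setNamespaceAware(true)
  source = org.xml.sax.InputSource.new(java.io.StringReader.new(xml))
  dom = factory.newDocumentBuilder.parse(source)

  # Wrap the Java document so that the Nokogiri API can be used on it.
  doc = Nokogiri::XML::Document.wrap(dom)
  doc.root[attribute] = value

  # Return the org.w3c.dom.Document of the Nokogiri document.
  doc.to_java
end
```

On JRuby, `tag_root_via_nokogiri("<order/>", "status", "checked").getDocumentElement.getAttribute("status")` reads the attribute back through the Java DOM API.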
Porting to Java is not easy. Contributors have struggled with the differences in behavior between libxml2 and Xerces. Almost all of the Nokogiri API is implemented as is, but some parts were very hard to port. Thus, the pure Java version has a few specific rules. Please be aware of the following when you use the pure Java version.
Many users complain that the order of attributes is not the same as in the input document. This happens when Xerces parses the document. However, it is not a bug. The XML specification states that the order of attribute specifications is not significant, so a parser is not required to retain it. Xerces’ behavior regarding attribute order is correct in terms of XML processing: Xerces creates a “logically” correct DOM tree.
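Code and tests that compare elements are more portable when they do not depend on attribute order. The sketch below shows one way to compare attributes order-insensitively. The helper names are illustrative, not part of Nokogiri; the only assumption about the element is that it responds to `attribute_nodes` and that each attribute responds to `name` and `value`, as Nokogiri elements and attributes do.

```ruby
# Returns an element's attributes as a sorted list of [name, value] pairs, so
# the result does not depend on the order the parser happened to keep.
def sorted_attribute_pairs(element)
  element.attribute_nodes.map { |attr| [attr.name, attr.value] }.sort
end

# True when both elements carry the same attribute names and values,
# regardless of the order in which they appear.
def same_attributes?(left, right)
  sorted_attribute_pairs(left) == sorted_attribute_pairs(right)
end
```

For example, `same_attributes?(doc_a.root, doc_b.root)` can replace a comparison of serialized strings in a test. The sketch ignores namespaces, so two attributes with the same local name in different namespaces are not distinguished.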
Add the “dtdvalid” option when a document is read.
xml = Nokogiri::XML(File.open(XML_FILE)) {|cfg| cfg.dtdvalid}
list = xml.internal_subset.validate xml
The number of errors is not the same as in the libxml2 version. The Java version doesn’t report attribute errors for elements that have already reported errors.
Don’t forget to write the second parameter (the system identifier), even when it is empty.
<!DOCTYPE foo PUBLIC "bar" "">
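When DOCTYPE declarations are generated by code, the rule above is easy to enforce in one place. The following is a small sketch; `public_doctype` is a hypothetical helper, not part of Nokogiri, and it assumes the identifiers contain no double quotes.

```ruby
# Hypothetical helper (not part of Nokogiri): builds a PUBLIC DOCTYPE
# declaration and always emits the second (system) identifier, even when it
# is empty. Identifiers containing double quotes are not handled.
def public_doctype(root_name, public_id, system_id = "")
  %(<!DOCTYPE #{root_name} PUBLIC "#{public_id}" "#{system_id}">)
end

puts public_doctype("foo", "bar")
```

Defaulting `system_id` to an empty string keeps the declaration well-formed when only a public identifier is known.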
Xerces strictly checks whether a document is valid against a given schema, while libxml2 does not. This behavior cannot be adjusted to match that of libxml2. Keep this in mind if XML Schema validation fails only on the pure Java version.
The pure Java version doesn’t support the XSLT.register method. This method maps a namespace to a Ruby object so that XSLT extension functions work as intended. XSLT extension function processing in the Java API is very different from libxml2. The Java API needs specific rules; besides, the tools for XML processing must be chosen carefully. The feature works with some combinations of processors but not with others. If you are interested in the details, see my blog post, http://yokolet.blogspot.com/2010/10/pure-java-nokogiri-xslt-extension.html, which explains further why this feature was dropped from the pure Java version.
This JRuby bug was fixed in 1.6.7 and later. If you are using an older version of JRuby, the patch at https://github.com/tenderlove/nokogiri/issues/607 will help you.
If you want to help with pure Java Nokogiri, you need to build it after cloning the source. As of 1.5.0.beta.4, building pure Java Nokogiri has become much easier. See http://groups.google.com/group/nokogiri-talk/browse_thread/thread/8f58feca3b25fcdc for details.
With Nokogiri 1.5.0.beta.3 and earlier, the steps are a bit more complicated. Charlie wrote a nice, easy-to-follow blog entry,
Nokogiri Java Port: Help Us Finish It!, which will definitely help you.
Don’t forget: the codebase of pure Java Nokogiri has been merged into master. You don’t need to check out any branch.