Can't parse html of http://rdfa.info/ (plus relative URL nit) #50
So, to see whether this was something I could deal with using other parsers, I tried the following. First, in order to work at the lower level required, I wrote myself an implicit class that lets me construct the parser myself but still use it easily:

implicit class SesameParserExt(val parser: org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser) extends AnyVal {
def parse(rdfa: String, base: String) = {
import org.openrdf.model.impl.{LinkedHashModel,ValueFactoryImpl}
val model = new LinkedHashModel()
val collector = new org.openrdf.rio.helpers.ContextStatementCollector(model,ValueFactoryImpl.getInstance())
parser.setRDFHandler(collector)
parser.parse(new java.io.StringReader(rdfa),base)
model
}
}

Then, to make it easier to create new parsers and set preferences on them:

import org.semarglproject.sesame.rdf.rdfa._
def RdfaParser(setup: SesameRDFaParser => Unit): SesameRDFaParser = {
val p = new SesameRDFaParser()
setup(p)
p
}

Then I tried a couple of libraries that turn HTML into XML. First TagSoup, which has not changed in the past 5 years:

import scala.util.Try
import $ivy.`org.ccil.cowan.tagsoup:tagsoup:1.2.1`
val tagsoupParser = org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(null)
val attemptTS = Try{
RdfaParser(_.setXmlReader(tagsoupParser.getXMLReader())).parse(rdfa,"http://rdfa.info/")
}

This actually works.

import scala.collection.JavaConverters._
val triples = attemptTS.get.iterator().asScala.toList
browse(triples)

The last line gives the following:

List(
(http://rdfa.info/, doap:name, "RDFa"@en) [null],
(http://rdfa.info/, doap:shortdesc, "The Resource Description Framework in Attributes"@en) [null],
(http://rdfa.info/, doap:homepage, http://rdfa.info/) [null],
(http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],
(http://rdfa.info/, doap:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
"@en) [null],
(http://rdfa.info/, dc:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
"@en) [null]
)

There seem to be 6 triples in there. The RDFa distiller found the following:

@base <http://rdfa.info/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<> dcterms:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
"""@en;
doap:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
"""@en;
doap:homepage <>;
doap:name "RDFa"@en;
doap:shortdesc "The Resource Description Framework in Attributes"@en .

So it looks like Semargl found one extra statement, which is fine with me.
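
To see exactly which statement differs, the two results can be compared as sets. This is only a minimal sketch: `Triple` is a hypothetical, library-free stand-in (not a Sesame type), the long description literal is abbreviated, language tags are dropped, and it assumes the prefixed names in the Semargl output denote the same IRIs as in the distiller's Turtle, which the printed output alone does not prove.

```scala
// Hypothetical stand-in for a statement; not a Sesame or Semargl type.
final case class Triple(s: String, p: String, o: String)

val page = "http://rdfa.info/"
val doap = "http://usefulinc.com/ns/doap#"
val dct  = "http://purl.org/dc/terms/"
val desc = "RDFa is an extension to HTML5 ..." // literal abbreviated for this sketch

// The 5 statements reported by the RDFa distiller
val distiller: Set[Triple] = Set(
  Triple(page, dct + "description", desc),
  Triple(page, doap + "description", desc),
  Triple(page, doap + "homepage", page),
  Triple(page, doap + "name", "RDFa"),
  Triple(page, doap + "shortdesc", "The Resource Description Framework in Attributes")
)

// The 6 statements from Semargl + TagSoup (assumed equal apart from the stylesheet link)
val semargl: Set[Triple] = distiller + Triple(
  page,
  "http://www.w3.org/1999/xhtml/vocab#stylesheet",
  "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css"
)

// Statements present in one result but not in the other
val onlyInSemargl   = semargl -- distiller
val onlyInDistiller = distiller -- semargl
```

Under those assumptions, `onlyInSemargl` holds just the stylesheet statement and `onlyInDistiller` is empty.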
I don't seem to have the same luck with the NekoHTML parser:

import $ivy.`net.sourceforge.nekohtml:nekohtml:1.9.22`
val nekoParser = new org.cyberneko.html.parsers.SAXParser()
val attempt = Try{
RdfaParser(_.setXmlReader(nekoParser)).parse(rdfa,"http://rdfa.info/")
}

which captures a `NullPointerException`:

attempt.get
java.lang.NullPointerException
org.semarglproject.sesame.core.sink.SesameSink.convertNonLiteral(SesameSink.java:78)
org.semarglproject.sesame.core.sink.SesameSink.addPlainLiteral(SesameSink.java:94)
But if I set the RDFa version to 1.1, then it works:

import org.openrdf.rio.helpers.RDFaVersion
val attemptNK2 = Try(RdfaParser{p=>
p.setXmlReader(nekoParser);
p.setRdfaCompatibility(RDFaVersion.RDFA_1_1)
}.parse(rdfa,"http://rdfa.info/"))

and we get 6 statements again. Should parsing not set the version itself? How is one meant to know from the outside which version to use?
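
Until that question is answered, one possible workaround is to try several parser configurations in order and keep the first one that succeeds. The sketch below is library-agnostic: `firstSuccess` is a hypothetical helper, and the two attempts are stand-ins for the real parse calls (the failing default and the RDFa 1.1 setup shown above). Swallowing a `NullPointerException` this way hides a likely bug, so it is a stopgap rather than a fix.

```scala
import scala.util.{Failure, Try}

// Hypothetical helper: run the labelled attempts in order and return the
// first successful result together with its label. Later attempts are not
// evaluated once one has succeeded; if all fail, the last failure is returned.
def firstSuccess[A](attempts: Seq[(String, () => A)]): Try[(String, A)] =
  attempts.foldLeft[Try[(String, A)]](Failure(new NoSuchElementException("no attempts given"))) {
    case (acc, (label, run)) => acc.orElse(Try(label -> run()))
  }

// Stand-ins for the real calls, e.g.
//   () => RdfaParser(_.setXmlReader(nekoParser)).parse(rdfa, "http://rdfa.info/")
// Here the result type is just the number of statements parsed.
val outcome: Try[(String, Int)] = firstSuccess[Int](Seq(
  "default"  -> (() => throw new NullPointerException("stand-in for the failing default setup")),
  "RDFa 1.1" -> (() => 6)
))
```

With these stand-ins, `outcome` is a success labelled `"RDFa 1.1"`, which at least records which configuration was needed.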
It is easy to reproduce this using the Ammonite shell.
For future reference, I placed this file here: rdfa.info.txt
Btw, that throws an exception because of relative URIs which one could argue about.
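
For context on the relative-URI point: a relative reference only becomes a usable IRI once it is resolved against a base. A small sketch with `java.net.URI` follows; the relative references `play/` and `#about` are made-up examples, not taken from the page.

```scala
import java.net.URI

val pageBase = new URI("http://rdfa.info/")

// Hypothetical relative references of the kind an href attribute might contain
val relative     = new URI("play/")
val resolvedPath = pageBase.resolve(relative)   // http://rdfa.info/play/
val resolvedFrag = pageBase.resolve("#about")   // http://rdfa.info/#about

// Without a base, the reference is not an absolute URI
val isAbsoluteWithoutBase = relative.isAbsolute // false
```

Whether a parser should throw on such references when no base is supplied, or leave them unresolved, is the part one could argue about.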
But let us continue...
If I try the method recommended on the Sesame repository, having specified a base,
I get the following exception.
which seems to indicate that an XML parser is used there rather than an HTML parser.
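
To illustrate why a strict XML parser would trip over a typical HTML page, here is a minimal sketch using only the JDK's SAX parser. The two markup strings are made-up samples, not the content of rdfa.info, and `parsesAsXml` is a hypothetical helper name.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

// True if a strict, non-HTML-aware XML parser accepts the markup.
def parsesAsXml(markup: String): Boolean = Try {
  val parser = SAXParserFactory.newInstance().newSAXParser()
  // DefaultHandler rethrows fatal errors, so malformed input fails the Try
  parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler)
}.isSuccess

// HTML5 allows void elements such as <meta> to stay unclosed; XML does not.
val html5 = """<!DOCTYPE html><html><head><meta charset="utf-8"><title>t</title></head><body><p>hi</p></body></html>"""
val xhtml = """<html><head><meta charset="utf-8"/><title>t</title></head><body><p>hi</p></body></html>"""

val html5Accepted = parsesAsXml(html5) // false: <meta> is never closed
val xhtmlAccepted = parsesAsXml(xhtml) // true: well-formed XML
```

This is why the attempts above plug an HTML-tolerant reader such as TagSoup or NekoHTML into the RDFa parser via `setXmlReader`.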