Can't parse html of http://rdfa.info/ (plus relative URL nit) #50

bblfish · 2017-07-16T07:39:34Z

This is easy to reproduce using the Ammonite shell. First fetch the page:

$ curl http://rdfa.info/ > rdfa.info.txt

For future reference, I attached this file to the issue as rdfa.info.txt. Then, in Ammonite:

$ amm
@ import $ivy.`org.semarglproject:semargl-rdfa:0.7`
@ import $ivy.`org.semarglproject:semargl-sesame:0.7`
@ import org.openrdf.rio._
@ val rdfa = read("rdfa.info.txt")
@ def rd = new java.io.StringReader(rdfa)
@ Rio.parse(rd,"",RDFFormat.RDFA)

By the way, that last call throws an exception because the page contains relative URIs and the base passed in is empty. Whether that should be an error is debatable.
But let us continue...
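To illustrate why an empty base is a problem, here is a minimal sketch using only `java.net.URI` from the JDK. The relative reference `play/` is a made-up example, not taken from the page, and this does not show how Semargl itself resolves URIs.

```scala
import java.net.URI

// Hypothetical relative reference, as it might appear in an href attribute.
val relative = new URI("play/")

// On its own the reference is not absolute, so it cannot serve as an RDF IRI.
val absoluteWithoutBase = relative.isAbsolute

// Resolved against the document URL, it becomes an absolute IRI.
val resolved = new URI("http://rdfa.info/").resolve(relative)
```

This is the same base that is passed to `Rio.parse` in the next step.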

If I try the method recommended on the Sesame repository, having specified a base,
I get the following exception.

@ Rio.parse(rd,"http://rdfa.info/",RDFFormat.RDFA)
[Fatal Error] :51:5: The element type "link" must be terminated by the matching end-tag "</link>".
org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".
  org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
  org.openrdf.rio.Rio.parse(Rio.java:425)
  org.openrdf.rio.Rio.parse(Rio.java:323)
  ammonite.$sess.cmd16$.<init>(cmd16.sc:1)
  ammonite.$sess.cmd16$.<clinit>(cmd16.sc)
org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".

which seems to indicate that an XML parser is used there rather than an HTML parser: HTML allows `<link>` without a closing tag, XML does not.
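The same class of failure can be reproduced without Semargl at all. The following is a small sketch, assuming a stock JDK SAX parser and a made-up HTML snippet (not the real page), that feeds an unclosed `<link>` to an XML parser:

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

// Made-up snippet: <link> is a void element in HTML and has no end tag.
val html =
  """<html><head><link rel="stylesheet" href="a.css"></head><body/></html>"""

// A strict XML parser treats the missing </link> as a fatal well-formedness error.
val xmlAttempt = Try {
  SAXParserFactory.newInstance().newSAXParser()
    .parse(new InputSource(new StringReader(html)), new DefaultHandler)
}
```

Here `xmlAttempt` should be a `Failure` wrapping an `org.xml.sax.SAXParseException`, with a message along the lines of the one in the trace above (the exact wording depends on the JDK and locale).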

bblfish · 2017-07-16T12:03:17Z

To see whether I could work around this with other parsers, I tried the following.

First, in order to work at the lower level required, I wrote myself a class that lets me construct a parser myself but still use it easily:

implicit class SesameParserExt(val parser: org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser) extends AnyVal {
    def parse(rdfa: String, base: String) = {
       import org.openrdf.model.impl.{LinkedHashModel,ValueFactoryImpl}
       val model = new LinkedHashModel()
       val collector = new org.openrdf.rio.helpers.ContextStatementCollector(model,ValueFactoryImpl.getInstance())
       parser.setRDFHandler(collector)
       parser.parse(new java.io.StringReader(rdfa),base)
       model
    }
}

Then, to make it easier to create new parsers and set preferences on them, I added a small factory function:

import org.semarglproject.sesame.rdf.rdfa._
def RdfaParser(setup: SesameRDFaParser => Unit): SesameRDFaParser = {
   val p = new SesameRDFaParser()
   setup(p)
   p
}

Then I tried a couple of libraries that turn HTML into XML.

First TagSoup, which has not changed in the past 5 years:

import scala.util.Try
import $ivy.`org.ccil.cowan.tagsoup:tagsoup:1.2.1`
val tagsoupParser =  org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(null)
val attemptTS = Try{
   RdfaParser(_.setXmlReader(tagsoupParser.getXMLReader())).parse(rdfa,"http://rdfa.info/")
 }

This actually works.

import scala.collection.JavaConverters._
val triples = attemptTS.get.iterator().asScala.toList
browse(triples)

The last line gives the following:

List(
  (http://rdfa.info/, doap:name, "RDFa"@en) [null],
  (http://rdfa.info/, doap:shortdesc, "The Resource Description Framework in Attributes"@en) [null],
  (http://rdfa.info/, doap:homepage, http://rdfa.info/) [null],
  (http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],
  (http://rdfa.info/, doap:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null],
  (http://rdfa.info/, dc:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null]
)

There seem to be 6 triples in there. The RDFa distiller found the following:

@base <http://rdfa.info/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<> dcterms:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:homepage <>;
   doap:name "RDFa"@en;
   doap:shortdesc "The Resource Description Framework in Attributes"@en .

So it looks like Semargl found one extra statement, namely

(http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],

which is fine with me.

bblfish · 2017-07-16T12:21:53Z

I don't seem to have the same luck with the NekoHTML parser:

import $ivy.`net.sourceforge.nekohtml:nekohtml:1.9.22`
val nekoParser = new org.cyberneko.html.parsers.SAXParser()
val attempt = Try{
  RdfaParser(_.setXmlReader(nekoParser)).parse(rdfa,"http://rdfa.info/")
}

which captures a `NullPointerException`:

attempt.get
java.lang.NullPointerException
  org.semarglproject.sesame.core.sink.SesameSink.convertNonLiteral(SesameSink.java:78)
  org.semarglproject.sesame.core.sink.SesameSink.addPlainLiteral(SesameSink.java:94)

bblfish · 2017-07-16T12:55:52Z

But if I set the RDFa version to 1.1, then it works.

import org.openrdf.rio.helpers.RDFaVersion
val attemptNK2 = Try(RdfaParser{p=>
       p.setXmlReader(nekoParser);
       p.setRdfaCompatibility(RDFaVersion.RDFA_1_1)
    }.parse(rdfa,"http://rdfa.info/"))

and we get 6 statements again.

Should the parser not set the version itself while parsing? How is one meant to know from the outside which version to use?

bblfish changed the title ~~RDFa parsers barfs on relative URIs~~ Can't parse html of http://rdfa.info/ (plus relative URL nit) Jul 16, 2017
