Can't parse html of http://rdfa.info/ (plus relative URL nit) #50

Open
bblfish opened this issue Jul 16, 2017 · 3 comments

bblfish commented Jul 16, 2017

It is easy to reproduce this using the Ammonite shell:

$ curl http://rdfa.info/ > rdfa.info.txt

For future reference, I have placed this file here: rdfa.info.txt

$ amm
@ import $ivy.`org.semarglproject:semargl-rdfa:0.7`
@ import $ivy.`org.semarglproject:semargl-sesame:0.7`
@ import org.openrdf.rio._
@ val rdfa = read("rdfa.info.txt")
@ def rd = new java.io.StringReader(rdfa)
@ Rio.parse(rd,"",RDFFormat.RDFA)

Incidentally, this throws an exception because the page contains relative URIs and the base passed in is empty. Whether it should is debatable.
But let us continue.
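To illustrate why an empty base and relative URIs do not mix, here is a minimal sketch using only `java.net.URI` from the JDK. It is not what Semargl or Sesame do internally, and `css/site.css` is a made-up relative reference, since the actual relative links on the page are not shown in this thread.

```scala
import java.net.URI

object BaseResolution {
  // Hypothetical relative reference, used for illustration only.
  val relative = "css/site.css"

  // Resolve a relative reference against a base using the JDK's URI rules.
  def resolve(base: String, ref: String): URI = new URI(base).resolve(ref)

  // With a real base the result is an absolute URI, usable in a triple.
  val withBase: URI = resolve("http://rdfa.info/", relative)

  // With an empty base the result stays relative, so no absolute IRI can be formed.
  val withoutBase: URI = resolve("", relative)
}
```

`BaseResolution.withBase.isAbsolute` is true while `BaseResolution.withoutBase.isAbsolute` is false, which is the situation a parser faces when it has to turn a relative link into an RDF resource without a base.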

If I instead specify a base, as recommended on the Sesame repository,
I get the following exception:

@ Rio.parse(rd,"http://rdfa.info/",RDFFormat.RDFA)
[Fatal Error] :51:5: The element type "link" must be terminated by the matching end-tag "</link>".
org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".
  org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
  org.openrdf.rio.Rio.parse(Rio.java:425)
  org.openrdf.rio.Rio.parse(Rio.java:323)
  ammonite.$sess.cmd16$.<init>(cmd16.sc:1)
  ammonite.$sess.cmd16$.<clinit>(cmd16.sc)
org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 51; columnNumber: 5; The element type "link" must be terminated by the matching end-tag "</link>".

This seems to indicate that an XML parser is being used there rather than an HTML parser.
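The same class of error can be reproduced without Semargl at all. The sketch below feeds two tiny, made-up documents to the JDK's built-in XML SAX parser: one with an HTML-style unclosed `<link>` and one with an XHTML-style self-closed `<link/>`. The object and value names are illustrative only, not part of any library.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler
import scala.util.{Failure, Success, Try}

object XmlVsHtml {
  // Returns None if the JDK's XML SAX parser accepts the markup,
  // or Some(message) with the parser's complaint if it rejects it.
  def xmlError(markup: String): Option[String] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    Try(parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler)) match {
      case Success(_)                    => None
      case Failure(e: SAXParseException) => Some(e.getMessage)
      case Failure(other)                => throw other
    }
  }

  // HTML style: <link> is a void element and has no end tag.
  val htmlStyle = """<html><head><link rel="stylesheet" href="style.css"></head><body></body></html>"""

  // XHTML style: the same element, explicitly self-closed.
  val xhtmlStyle = """<html><head><link rel="stylesheet" href="style.css"/></head><body></body></html>"""
}
```

`XmlVsHtml.xmlError(XmlVsHtml.xhtmlStyle)` returns `None`, whereas the HTML-style input is rejected with a message about the `link` element, which matches the symptom in the stack trace above.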

bblfish changed the title from "RDFa parsers barfs on relative URIs" to "Can't parse html of http://rdfa.info/ (plus relative URL nit)" on Jul 16, 2017

bblfish commented Jul 16, 2017

To see whether this was something I could work around using other parsers, I tried the following.

First, in order to work at the lower level required, I wrote myself a class that lets me construct a parser myself but still use it easily.

// Adds a convenience parse method that returns the parsed statements as a model.
implicit class SesameParserExt(val parser: org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser) extends AnyVal {
    def parse(rdfa: String, base: String) = {
       import org.openrdf.model.impl.{LinkedHashModel,ValueFactoryImpl}
       val model = new LinkedHashModel()
       // The collector adds every statement the parser emits to the model.
       val collector = new org.openrdf.rio.helpers.ContextStatementCollector(model,ValueFactoryImpl.getInstance())
       parser.setRDFHandler(collector)
       parser.parse(new java.io.StringReader(rdfa),base)
       model
    }
}

Then, to make it easier to create new parsers and set preferences on them, I wrote this helper:

import org.semarglproject.sesame.rdf.rdfa._
def RdfaParser(setup: SesameRDFaParser => Unit): SesameRDFaParser = {
   val p = new SesameRDFaParser()
   setup(p)
   p
}

Then I tried a couple of libraries that convert HTML to XML.

First TagSoup, which has not changed in the past 5 years.

import scala.util.Try
import $ivy.`org.ccil.cowan.tagsoup:tagsoup:1.2.1`
val tagsoupParser =  org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(null)
val attemptTS = Try{
   RdfaParser(_.setXmlReader(tagsoupParser.getXMLReader())).parse(rdfa,"http://rdfa.info/")
 }

This actually works.

import scala.collection.JavaConverters._
val triples = attemptTS.get.iterator().asScala.toList
browse(triples) 

The last line gives the following:

List(
  (http://rdfa.info/, doap:name, "RDFa"@en) [null],
  (http://rdfa.info/, doap:shortdesc, "The Resource Description Framework in Attributes"@en) [null],
  (http://rdfa.info/, doap:homepage, http://rdfa.info/) [null],
  (http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],
  (http://rdfa.info/, doap:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null],
  (http://rdfa.info/, dc:description, "
RDFa is an extension to HTML5 that helps you markup things like People, Places,
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        "@en) [null]
)

There are 6 triples in there. The RDFa distiller found the following:

@base <http://rdfa.info/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<> dcterms:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:description """
RDFa is an extension to HTML5 that helps you markup things like People, Places, 
Events, Recipes and Reviews. Search Engines and Web Services use this markup
to generate better search listings and give you better visibility on the Web,
so that people can find your website more easily.
        """@en;
   doap:homepage <>;
   doap:name "RDFa"@en;
   doap:shortdesc "The Resource Description Framework in Attributes"@en .

So it looks like Semargl found one extra statement, namely

(http://rdfa.info/, http://www.w3.org/1999/xhtml/vocab#stylesheet, https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css) [null],

which is fine with me.
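The comparison can also be done mechanically. The sketch below copies the predicates from the two listings above (every statement has the subject http://rdfa.info/) and computes the difference. It assumes that `dc:` in the Semargl output and `dcterms:` in the distiller output abbreviate the same namespace, which the listings alone do not confirm.

```scala
object TripleDiff {
  // Predicates as printed in the Semargl listing.
  val semargl: Set[String] = Set(
    "doap:name", "doap:shortdesc", "doap:homepage",
    "http://www.w3.org/1999/xhtml/vocab#stylesheet",
    "doap:description", "dc:description")

  // Predicates as printed in the RDFa distiller listing.
  val distiller: Set[String] = Set(
    "dcterms:description", "doap:description",
    "doap:homepage", "doap:name", "doap:shortdesc")

  // Assumption: dc: and dcterms: denote the same namespace here.
  def normalize(p: String): String =
    if (p.startsWith("dc:")) "dcterms:" + p.stripPrefix("dc:") else p

  // Predicates present in the Semargl output but absent from the distiller's.
  val onlyInSemargl: Set[String] = semargl.map(normalize) -- distiller.map(normalize)
}
```

Under that assumption, `TripleDiff.onlyInSemargl` contains only the `stylesheet` predicate.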


bblfish commented Jul 16, 2017

I don't have the same luck with the NekoHTML parser:

import $ivy.`net.sourceforge.nekohtml:nekohtml:1.9.22`
val nekoParser = new org.cyberneko.html.parsers.SAXParser()
val attempt = Try{
  RdfaParser(_.setXmlReader(nekoParser)).parse(rdfa,"http://rdfa.info/")
}

which captures a NullPointerException:

attempt.get
java.lang.NullPointerException
  org.semarglproject.sesame.core.sink.SesameSink.convertNonLiteral(SesameSink.java:78)
  org.semarglproject.sesame.core.sink.SesameSink.addPlainLiteral(SesameSink.java:94)


bblfish commented Jul 16, 2017

But if I set the RDFa version to 1.1, it works:

import org.openrdf.rio.helpers.RDFaVersion
val attemptNK2 = Try(RdfaParser{p=>
       p.setXmlReader(nekoParser);
       p.setRdfaCompatibility(RDFaVersion.RDFA_1_1)
    }.parse(rdfa,"http://rdfa.info/"))

And we get 6 statements again.

Should the parser not set the version itself? How is one meant to know from the outside which version to use?
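Until that is answered, one possible workaround from the caller's side is to try several configurations in order and keep the first that succeeds. The sketch below is generic and uses only the Scala standard library. `firstSuccess` and the stubbed attempts in `demo` are hypothetical names standing in for the two parser setups tried in this thread, not part of Semargl or Sesame.

```scala
import scala.util.{Failure, Try}

object Fallback {
  // Runs each labelled attempt in order and returns the first success
  // together with its label, or the last failure if none succeeds.
  def firstSuccess[A](attempts: Seq[(String, () => A)]): Try[(String, A)] =
    attempts.foldLeft[Try[(String, A)]](Failure(new NoSuchElementException("no attempts given"))) {
      case (acc, (label, run)) => acc.orElse(Try(label -> run()))
    }

  // Stubs only: the first mimics the NullPointerException seen with the
  // default settings, the second mimics a run that yields 6 statements.
  def demo: Try[(String, Int)] = firstSuccess[Int](Seq(
    ("default settings", () => throw new NullPointerException("stub failure")),
    ("RDFa 1.1", () => 6)
  ))
}
```

In real use, each attempt would wrap a call such as the `RdfaParser { ... }.parse(rdfa, base)` expressions shown above. This only hides the question rather than answering it, since the caller still has to enumerate the versions.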
