Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JOIN LATERAL or Correlated Subquery #100

Open
VladimirAlexiev opened this issue Jul 17, 2019 · 66 comments
Open

JOIN LATERAL or Correlated Subquery #100

VladimirAlexiev opened this issue Jul 17, 2019 · 66 comments

Comments

@VladimirAlexiev
Copy link
Contributor

VladimirAlexiev commented Jul 17, 2019

Why?

GraphQL is a hot topic amongst developers and tool vendors.
There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular.
Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.

Assuming a simple companies graph structure and some reasonable order and limit syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem.
limit in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.

Previous work

This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:

  1. <child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>) aka using Windowing functions
  2. <parent-query> JOIN LATERAL (<child-query>), see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)
  3. WITH Common Table Expression
  4. JOIN (COUNT...GROUP BY) WHERE <=5

A Correlated Subquery is like LEFT JOIN LATERAL ... ON true, see StackOverflow: Difference between Lateral and Subquery:

  • Correlated subquery has limitations: it cannot return multiple columns and multiple rows
  • LATERAL is made for that: it's assumed automatically for cross-joins FROM x1,x2 where x2 is a table function (eg unnest)

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning).
However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster

Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

  • We've tried structuring with CONSTRUCT but it doesn't support ordering and we don't want to mess with rdf:List
  • StarDog's Extending the Solution with ARRAY may be relevant, but will require too large extension of SPARQL
  • So we're thinking of using SELECT with a UNION-style distribution of parent and child values

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

country city city_name population company company_name revenue COMMENT
geo:732800/ geo:727011/ Sofia 1152556
geo:727011/ co:123 Sofia Foo Co 987 Sofia company 1
geo:727011/ co:456 Sofia Bar Co 123 Sofia company 2
geo:732800/ geo:728193/ Plovdiv 340494
geo:728193/ co:789 Plovdiv Foo Co 987 Plovdiv company 1
geo:728193/ co:012 Plovdiv Bar Co 123 Plovdiv company 2

Assume this construct (other syntax suggestions are welcome!):

{foo} UNION LATERAL(?var) {bar}
  • ?var must be bound by foo (and exported in its select if it's a subquery)
  • ?var must also be exported by bar
  • bar is iterated for every binding of ?var
  • the results of bar are appended to the results of foo

Then we could use it to implement the query in question:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}

(It's more likely to have inverse links ?city x:country ?country and ?company x:city ?city but for simplicity we use straight links)

Considerations for backward compatibility

None?

@afs
Copy link
Collaborator

afs commented Jul 18, 2019

Related: "Query hints" #71

@VladimirAlexiev
Copy link
Contributor Author

VladimirAlexiev commented Sep 23, 2020

https://www.stardog.com/labs/blog/stored-query-service/ explains how to call stored queries in Stardog (similar to subqueries but providing better modularity).
It also includes ideas how to use a similar mechanism for correlated (LATERAL) joins and cites this issue.

  • @klinovp: I upvoted your correlated join issue for SPARQL 1.2 btw, it's good stuff. Let's turn it into a SEP maybe?
  • Vladimir: Do you have the nerve and stamina for it?

@kasei, @afs, @jeenbroekstra and @JervenBolleman:

  • Your feedback and suggestions will be highly appreciated here!
  • Do you find this issue interesting enough to help us with a SEP?

@rubensworks Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

@namedgraph
Copy link

Any example where this would be relevant that is not related to GraphQL?..

@klinovp
Copy link

klinovp commented Sep 23, 2020

An example how a correlated join is a useful addition to un-correlated ones? It's exactly the same situation as in SQL and there're lots of use cases, the classical one is something like "you have a database of products and a set of offers for each product, you want to get 5 highest price offers for each product".

GraphQL is one important use case here but this is a core SPARQL extension proposal which is perfectly valid outside of the GraphQL context.

@JervenBolleman
Copy link
Collaborator

JervenBolleman commented Sep 23, 2020 via email

@rubensworks
Copy link
Member

Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

I fully agree on the importance of this issue for SPARQL in general.

I wouldn't call it crucial for GraphQL per se, as different interpretations are possible on how to apply it to RDF data. But for the interpretation that Ontotext has picked, it definitely makes sense.

@lisp
Copy link
Contributor

lisp commented Oct 8, 2020

ignore sql machinations related to lateral and correlated subqueries, to concentrate just on the intent to produce from some dependent sub-expression some number of solutions for each solution of some dominant subexpression.

should not the general "sidewards information passing" mechanism suffice?

is this best accommodated with a union variant?

is it not more a matter of just declaring top-down rather than bottom-up reduction - analogous to the conflagration surrounding EXISTS?
in which case an explicit SUBSELECT operator and/or something like an EACH modifier would suffice to declare intent and leave it to the processor to arrange for the s-i-p data/control flow in any and all join forms.
as in (in simplified form)

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SUBSELECT * where { ?city x:company ?company. } limit 2}
}

or

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SELECT * where { ?city x:company ?company. } LIMIT (EACH 2) }
}

@jaw111
Copy link
Contributor

jaw111 commented Jun 9, 2021

@lisp from the perspective of a SPARQL user (rather than implementer) it would seem more intuitive to have a somewhat bottom-up phrasing for this. Based on your example:

select *
where { ?city x:company ?company. }
limit 2
for each {
  select *
  where { ?country x:city ?city. }
  limit 2
}

@lisp
Copy link
Contributor

lisp commented Jun 9, 2021

given the implicit control flow which applies to the rest of the language, this would be

for each {
  select *
  where { ?country x:city ?city. }
  limit 2 }
{ select *
  where { ?city x:company ?company. }
  limit 2 }

@VladimirAlexiev
Copy link
Contributor Author

https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077#7d71 AGNOS blog giving some examples, but in SQL using the "Northwind" MS SQL example database

@VladimirAlexiev
Copy link
Contributor Author

@lisp We need a syntax showing the dependency (major-minor) explicitly (as in your #100 (comment)).

A symmetric syntax like your #100 (comment) won't work because:

  • We want all major elements to be returned, even if some of them have no minor correlates (Sofia should be returned even if it had no companies in it)
  • You can nest them at any number of levels

@jaw111 Please let's not spell it minor-major. (Do you write RTL by any chance? ;-)

Here's an example of deeper nesting, and each level has a dependency on the previous level:

  • country
    • company
      • employee
      • product
        • rating
    • museum
      • painting
      • sculpture

@VladimirAlexiev
Copy link
Contributor Author

@klinovp on twitter: We added correlated subqueries to SPARQL: https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries It's similar to LATERAL in Postgres.

@lisp
Copy link
Contributor

lisp commented Oct 14, 2021

what are the circumstances that require the sqs:inputs mechanism?

@jaw111
Copy link
Contributor

jaw111 commented Oct 14, 2021

@klinovp no RTL experience here.

Looking back to the original proposed solution, I struggle with the UNION clause, but could that simply be omitted and use a syntax for LATERAL something of a mash-up of VALUES and FILTER EXISTS to indicate the variable(s) for which bindings are to be passed sideways/laterally to the inner pattern/subquery.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {
    select ?country ?city ?city_name ?population {
      bind(<http://sws.geonames.org/732770/> as ?country)
      ?country x:city ?city.
      ?city x:name ?city_name.
      ?city x:population ?population.
    }
    order by desc(?population)
    limit 2
  }
  LATERAL ?city {
    select ?city ?company ?company_name ?revenue {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
    }
    order by desc(?revenue)
    limit 2
  }
}

One could also pass in multiple variables and make the minor part optional. The outer part need not be a subquery:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
}

Similarly, what is in the lateral might just be a graph pattern rather than a subquery. This could be used to give hints to the query processor whether to pass in solutions it already has when evaluating an optional clause (nest loop join) or use bottom up approach.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
      ?company x:headquarteredIn ?country .
    }
  }
}

@VladimirAlexiev
Copy link
Contributor Author

VladimirAlexiev commented Oct 15, 2021

@jaw111 thanks for your syntax proposals!

  1. If the main is not a subquery, would we put order/limit at the bottom as for a normal query?
select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
} order by desc(?population) limit 2
  1. We should always think of the case of multiple nestings:
     major
       minor
         sub-minor

So I'd favor syntaxes that reflect that nesting in the query structure.

@lisp
Copy link
Contributor

lisp commented Oct 15, 2021

the same question applies to the LATERAL proposal as to sqs:inputs: what are the circumstances which require the declaration for the corresponding variables?

@VladimirAlexiev
Copy link
Contributor Author

@lisp minor should execute for each solution of major.
If they are merely patterns (not subqueries) then I think you are right, it's not necessary to declare the lateral variables.
If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

@Aklakan
Copy link

Aklakan commented Aug 26, 2022

For the just released Jena 4.6.0 I contributed the "service enhancer" plugin which has lateral joins as one of its features: https://jena.apache.org/documentation/query/service_enhancer.html
It's the first release so consider it experimental.

Example: Fetch one English label for every entity:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?s ?l {
  VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 } 
  SERVICE <loop:https://query.wikidata.org/sparql> { # SERVICE <loop:> {...} lateral joins with active dataset
    SELECT ?l {
      ?s rdfs:label ?l
      FILTER(langMatches(lang(?l), 'en'))
    } ORDER BY ?l LIMIT 1
  }
}

right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

That was also my finding that its not necessary to declare the variables. In principle, lateral joins require different scoping rules as described in the linked document.

@frensjan
Copy link

frensjan commented Dec 13, 2022

Question: could someone comment on the semantics of multiple / consecutive lateral joins? Does lateral join n extend lateral join n-1? Or would this be based on shared variables?

An example use case could be to

  • select titles of threads in a forum and for each thread
    • select the last n posts and
    • the m most active users in each thread.
Example query
select ?thread ?title
       ?postTitle ?postAuthor
       ?topAuthor
{
    ?thread a :Thread ;
            :title ?title .

    lateral {
        select (?title as ?postTitle)
               (?author as ?postAuthor)
        {
            ?post :postedIn ?thread ;
                  :title ?title ;
                  :author ?author ;
                  :postedOn ?postedOn .
        }
        order by desc(?postedOn)
        limit 5
    }

    lateral {
        select (?author as ?topAuthor)
        {
            ?post :postedIn ?thread ;
                  :author ?author .
        }
        group by ?author
        order by desc(count(*))
        limit 3
    }
}

The key here is that the two 'lateral clauses' are meant to be unrelated to each other and only be related to the top-level query.

@Tpt
Copy link
Contributor

Tpt commented Dec 13, 2022

@frensjan If I understand @afs proposal properly, LATERAL works just like OPTIONAL i.e. variables bound inside of the first LATERAL clause will be considered when evaluating the second LATERAL clause. To use SPARQL SSE notation, X LATERAL { Y } LATERAL { Z } is parsed to (lateral (lateral X Y) Z)

@afs
Copy link
Collaborator

afs commented Dec 13, 2022

@frensjan - interesting example.

In the current proposal, it is as @Tpt describes. The second lateral takes the results from the first as input. With no shared variables, that is in effect a cross-product.

What shape of result set are you trying to produce? there can be multiple lines for the same?thread, ?title pairs in each lateral.

The description feels like it is a an "details for A, details for B" (A and B themselves not connected) which suggests a UNION somewhere - that does not get information from each branch on the same row. But if there are 5 rows from the first lateral, for one (thread,title) what is the desired outcome for the 3 rows from the second for that (thread,title)? There isn't a nested table feature in SPARQL - GROUP_CONCAT can be used but it isn't ideal.

@frensjan
Copy link

Thanks for the responses @afs and @Tpt; understood and as I expected. For as long as the size of the 'legs' is small, the costs of such a cross-product could be acceptable.

The use case would be to show related information in an entity listing / table or in an entity detail view. I.e. the name and number of posts for the OP and the number of posts per thread of a listing of threads.

Using a pseudo GraphQL query this could be something like:

{
    threads(limit:10) {
        title
        creationDate
        postCount       # lateral join to count(?post) over ?post :parent ?thread
        creator {       # lateral join to get ?creator :created ?thread
            name        # fetch ?creator :name ?name
            postCount   # count(?post) over ?creator :created ?post
        }
    }
}
A more elaborate example for vulnerabilities
{
    cve(id:"https://nvd.nist.gov/vuln/detail/CVE-2022-42003") {
        title
        description
        ...

        link {
            url
            tags {
                label
            }
        }

        weaknesses {
            id
            name
            source
        }

        affected {
            cpe {
                pattern
                version {
                    from
                    to
                }
                instances(limit:10) {
                    part
                    vendor
                    product
                    version
                    update
                    edition
                    ...
                }
            }
        }
    }
}

Use cases where information is tracked over time it would be useful to show the latest value for a number of attributes; e.g. based on a timestamp associated with the graph such triple occurs in. If the application can assume these attributes change values atomically in this example in the same named graph, this could be solved by one lateral join to the named graph with the last changes. But reality can be stubborn in my experience.

GROUP_CONCAT would only work for primitive values and not for more elaborate (sub-)solution. Ideally, this could be combined with the FRAMED ideas that are also discussed for SPARQL 1.2.

Note that while I'm using GraphQL here as examples, it is intended as pseudo code / a means to convey intent. I think these use-cases hold regardless of whether GraphQL is part of the technology stack.

@frensjan
Copy link

With regards to LATERAL enhancement and its companion enhancement for inject, how should the following query be interpreted?

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

The intent of the query is to select the ?os that have a :value ?w larger than the :value ?v of ?s.

However, based on SEP-0007 it would seem that ?v isn't bound in the evaluation of the sub-select because it isn't in its projection. This can obviously be fixed:

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?v ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

However, this is a bit awkward, as ?v can never have the output role of the sub-select to use the wording from Correlation and Substitution in SPARQL.

@afs
Copy link
Collaborator

afs commented Jan 22, 2024

@frensjan - nice - this brings out the difference between LATERAL and parameterised queries by variable substitution.

There are two different ?v variable that never meet (i.e. get joined).
As a query, the inner ?v can be replaced by any variable not used elsewhere and the query results are the same.
(Jena renames the inner variable so it does not have to have code for scoping issues in other query execution code.)

This is a difference between LATERAL and parameterised queries by variable substitution. I believe they are both are valid use cases - the same machinery applies but there are slight differences in how it is used, here how to treat scope-hidden variables.

It doesn't even have to involve LATERAL:

As a parameterized query:

    SELECT ?o ?w {
         ?o :value ?w .
         FILTER(?w > ?v)
     }

makes sense. As a query, it rejects everything because the FILTER is error.

#57 and others.

@lisp
Copy link
Contributor

lisp commented Jan 22, 2024

the change to scoping rules is one of the reasons for LATERAL.
see https://observablehq.com/d/dba772c1bf5ed10f

@frensjan
Copy link

frensjan commented Jan 22, 2024

Thanks for the reactions. So would you @afs and @lisp say my second query should work as intended?

If so, don't you agree that it is awkward to select a variable ?v (which is typically used to expressed the variable is in the output role of the (sub-)select) to convey that it is in the input role?


@afs, what did you mean by

It doesn't even have to involve LATERAL:

The query that I gave as an example translates to something like:

SELECT * {
    ?s :value ?v
    ?o :value ?w .
    FILTER(?w > ?v)
}

But that's not really the issue I wanted to address.

A more involved example:

PREFIX : <http://example.org>

SELECT * WHERE {

    # messages and when they were created
    ?message1 a :Message .
    ?message1 :created ?referenceTimestamp .

    # for each message, the last ten messages that were created before it
    LATERAL {
        SELECT ?referenceTimestamp ?message2 {
            ?message2 a :Message .
            ?message2 :created ?created .
            FILTER( ?created < ?referenceTimestamp)
        }
        ORDER BY DESC( ?created )
        LIMIT 10
    }

}

Is the consensus that (given SEP-0006 and 0007) this would work as intended (as per the comments)?

The algebra according to Arq (click to expand)
  (lateral
    (bgp
      (triple ?message1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.orgMessage>)
      (triple ?message1 <http://example.orgcreated> ?referenceTimestamp)
    )
    (project (?referenceTimestamp ?message2)
      (top (10 (desc ?/created))
        (filter (< ?/created ?referenceTimestamp)
          (bgp
            (triple ?message2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.orgMessage>)
            (triple ?message2 <http://example.orgcreated> ?/created)
          )))))
I assume that the query doesn't work as intended if ?referenceTimestamp isn't selected in the sub-select. Although I would like this to work as I think this would be a lot easier to explain to my users! (click to expand)
PREFIX : <http://example.org>

SELECT * WHERE {

    # messages and when they were created
    ?message1 a :Message .
    ?message1 :created ?referenceTimestamp .

    # for each message, the last ten messages that were created before it
    LATERAL {
        SELECT ?message2 {
            ?message2 a :Message .
            ?message2 :created ?created .
            FILTER( ?created < ?referenceTimestamp)
        }
        ORDER BY DESC( ?created )
        LIMIT 10
    }

}

@afs
Copy link
Collaborator

afs commented Jan 22, 2024

my second query should work as intended

Yes. ?v has to be in-scope at the top of the LHS (left hand side) of LATERAL.

@afs
Copy link
Collaborator

afs commented Jan 22, 2024

# for each message, the last ten messages that were created before it

Yes! That's a very important application. The "up to 10 of" for each item is very hard to do otherwise.

?/created

FWIW That's a renamed variable.

@frensjan
Copy link

We're fleshing out the implementation of LATERAL for RDF4J; another question to make sure we get the intended semantics right:

In the Oxigraph tests there are two queries:

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

and

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    LATERAL { OPTIONAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

From the expected results, it seems that in the first query ?s is considered out of scope for evaluation for each ?s a ex:T. while in the second query it is in scope.

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

I would consider { P } OPTIONAL { { } LATERAL Q } to be equivalent to { P } LATERAL { { } OPTIONAL Q }.


I did noticed a subtle difference in the algebra for Jena:

  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

vs

  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

Note the conditional vs leftjoin operators.

I haven't tested the actual behaviour of Jena yet. But as Jena uses variable renaming, I would expect ?s to be renamed to ?/s if it were out of scope.

@Tpt
Copy link
Contributor

Tpt commented Jan 25, 2024

We're fleshing out the implementation of LATERAL for RDF4J

Amazing!

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

This is the trap. OPTIONAL do not have "nested loop" semantic but a "left join" semantic. So, in A OPTIONAL { B }, A and B are supposed to be evaluated independently then joined. See this definition in the SPARQL spec. For example the query

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(false) } }

because ?a is not bound inside of the OPTIONAL
and so will return only a=1
wheras

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(true) } }

because ?a is now bound inside of the LATERAL
and so will return a=1 b=2.

This is why rewriting OPTIONAL into nested loops is hard because one needs to take care of not changing the results.

@frensjan
Copy link

Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that.

@frensjan
Copy link

This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below.

(I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here)

$ arq -version
Apache Jena version 4.10.0
$ cat data.ttl
@prefix ex: <http://example.org/> .

ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T .
$ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      LATERAL
        { OPTIONAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:39 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------
$ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      OPTIONAL
        { LATERAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:45 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

@afs
Copy link
Collaborator

afs commented Jan 25, 2024

Hi @frensjan -- thanks for providing the clear details.

The results look like the same rows to me. And I get the same results locally.

What's the issue you've noticed?

conditional is a particular way to execute leftjoin.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

@frensjan
Copy link

What's the issue you've noticed?

Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!

conditional is a particular way to execute leftjoin.

That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

That's also my initial 'stance'. But apparently in Oxigraph this is different?

@afs
Copy link
Collaborator

afs commented Jan 25, 2024

Please could could record the Oxigraph results here for the record?

Jena5 gets the same results, different order:

--------------
| s     | o  |
==============
| ex:s3 |    |
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

(there is a different default in-memory graph implementation)

@frensjan
Copy link

The Oxigraph tests are on Github. The relevant files are:

  • subselect_inside_optional.rq
  • subselect_inside_optional.srx
  • subselect_outside_optional.rq
  • subselect_outside_optional.srx

@afs
Copy link
Collaborator

afs commented Jan 25, 2024

tested the Oxigraph queries and they both yield the same results given the test data; see below.

I misread that as both queries give equal results

Those two are different results:

subselect_inside_optional.srx :

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

subselect_outside_optional.srx:

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
| ex:s3 |    |
--------------

@afs
Copy link
Collaborator

afs commented Jan 25, 2024

Jena gets the same as Oxigraph those if you use the "ref" query engine (--engine=ref) or if you switch off the optimizer (--optimize=off) so Jena is wrong for the normal engine. It's likely the conditional.

The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step.

apache/jena#2218

@frensjan
Copy link

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS).

I've raised the issue at RDF4J (which is larger than just optional lateral).


Maybe not the easiest to explain to users (n=1)

I must say that this does strike me as odd ... I need to explain this to my users:

PREFIX : <http://example.org/>

SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
    }
}

yields

---------------
| x  | y  | z |
===============
| :x | :y |   |
---------------

(I intentionally wrote the FILTER clause first in the optional block for dramatic effect)

I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1).

Blazegraph

Note that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:

--------------
| x  | y | z |
==============
| :x |   |   |
--------------

Virtuoso

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

@afs
Copy link
Collaborator

afs commented Jan 26, 2024

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different.
The only asymmetry is the top 2 subselect.

I think that LATERAL rules apply inside the subselect because ?s is not a hidden variable. I can only explain the results

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

if the subselect executes with free ?s (no lateral injection) when 11 and 12 are first in the ORDER BY

What am I missing?
When it's ?s=:s2, LATERAL should making all execution within it's right hand side act like ?s is :s2 only.

SELECT ?s ?o { BIND(ex:s2 AS ?s) ?s ex:p ?o .} ORDER BY ?o LIMIT 2 is 21 and 22 which I'd expect to be leftjoin'ed into the answers.

SELECT * WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of conditional.
The unoptimized form is definitive and it agrees with Oxigraph.

Unoptimized evaluation does fails to propagate ?s back leading to the effect that the subselect is not having the LATERAL binding any more. That is a bug. (@Tpt will understand where public work time has gone.) The optimization, by accident, puts it back in this case by something unrelated to LATERAL.

@lisp
Copy link
Contributor

lisp commented Jan 26, 2024

@frensjan, in

But it's not that intuitive (for me, n=1).

"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

@TallTed
Copy link
Member

TallTed commented Jan 26, 2024

@frensjan wrote --

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the /sparql query form, "Strict checking of void variables", which is ticked by default (resulting in your error). Unticking that box (or adding &signal_unconnected=on to a results URI) gets this --

----------------
| x  | y  | z  |
================
| :x |    |    |
----------------

I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project.

@frensjan
Copy link

frensjan commented Jan 26, 2024

@frensjan, in
But it's not that intuitive (for me, n=1).
"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not).

Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'.

Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in SELECT * { ?s :p ?x OPTIONAL { ?s :q ?y } }, ?s will always have the same values on the LHS as on the RHS (if any compatible solution exists of course).

But for e.g. BIND this isn't the case. So you can write:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        FILTER( ?price * (1 - ?discount) < 10 )
    }
}

But this query has different semantics:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        BIND( ?price * (1 - ?discount) AS ?effectivePrice )
        FILTER( ?effectivePrice < 10 )
    }
}

@frensjan
Copy link

But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL.

If RDF4J would be 'fixed' though, I would need to train my users in the difference between
P { LATERAL { OPTIONAL Q } } and P { OPTIONAL { LATERAL Q }.

E.g., this query gives, for every ?movie, the average rating of all movies:

SELECT * {
    ?movie a :Movie
    OPTIONAL { LATERAL {
        { SELECT (AVG(?rating) as ?averageRating) {
            ?movie :rating ?rating
        } }
    } }
}

That's just hard to grok for me.

@afs
Copy link
Collaborator

afs commented Jan 26, 2024

What am I (@afs) missing?

That the OPTIONAL is has normal evaluation so the RHS is

{ LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2 }

No LHS ?s until after LATERAL is evaluated. The LHS of the LATERAL is the unit { }.

@VladimirAlexiev
Copy link
Contributor Author

https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan
Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark
sec 4.1. Merging CONSTRUCT Queries using LATERAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests