JOIN LATERAL or Correlated Subquery #100

VladimirAlexiev · 2019-07-17T11:13:15Z

Why?

GraphQL is a hot topic amongst developers and tool vendors.
There are several implementations of GraphQL over RDF (HyperGraphQL, Comunica, TopQuadrant, StarDog), and we at Onto are also working on an implementation.

GraphQL queries are hierarchical and quite regular.
Following the logic of GraphQL resolvers, you do the parent-level query then turn to the child-level.

Assuming a simple companies graph structure and some reasonable order and limit syntax (GraphQL "input objects"), a query "Give me the top 2 BG cities, and the top 2 companies in each" could be expressed like this in GraphQL:

{
  country(id:"...bulgaria") {
    city (order: {population:DESC}, limit: 2) {
      id name population
      company (order: {revenue:DESC}, limit: 2) {
        id name revenue
      }
    }
  }
}

If you try to implement this with a SPARQL subquery, you'll run into what I call the "distributed limit" problem.
limit in the company subquery will apply globally, so even if you use a limit of 2*2 or even 50k, the first city (Sofia) will gobble up all companies, leaving none for the other city.

We at Onto believe that to implement this efficiently, you need the subquery to run in a loop for every row of the parent query.

Previous work

This is a common problem in databases. Eg see StackOverflow: Grouped LIMIT in PostgreSQL: show the first N rows for each group?, which gives the following solutions:

<child-order> OVER (PARTITION BY <parent-id> ORDER BY <parent-order>) aka using Windowing functions
<parent-query> JOIN LATERAL (<child-query>), see PostgreSQL (FROM, Lateral, SELECT, heap.io blog Dec 2014), SQL Server (cross apply)
WITH Common Table Expression
JOIN (COUNT...GROUP BY) WHERE <=5

A Correlated Subquery is like LEFT JOIN LATERAL ... ON true, see StackOverflow: Difference between Lateral and Subquery:

Correlated subquery has limitations: it cannot return multiple columns and multiple rows
LATERAL is made for that: it's assumed automatically for cross-joins FROM x1,x2 where x2 is a table function (eg unnest)

Somewhat related SPARQL issues were posted before: #47 (windowing), #9 (partitioning).
However, LATERAL is usually faster than Windowing functions (depending on data and indexing).

Two of the leading GraphQL implementations for RDMBS use LATERAL: Hasura and Join Monster

Proposed solution

A key question is how to return the results, to ensure that child rows don't mess up the limit on parent rows.

We've tried structuring with CONSTRUCT but it doesn't support ordering and we don't want to mess with rdf:List
StarDog's Extending the Solution with ARRAY may be relevant, but will require too large extension of SPARQL
So we're thinking of using SELECT with a UNION-style distribution of parent and child values

(Edited Sep 2020 to interleave the Company rows with the City rows, which makes reconstructing the nested objects easier and enables potential streaming. Previously I had all city rows, then all company rows)

country	city	city_name	population	company	company_name	revenue	COMMENT
geo:732800/	geo:727011/	Sofia	1152556
	geo:727011/			co:123	Sofia Foo Co	987	Sofia company 1
	geo:727011/			co:456	Sofia Bar Co	123	Sofia company 2
geo:732800/	geo:728193/	Plovdiv	340494
	geo:728193/			co:789	Plovdiv Foo Co	987	Plovdiv company 1
	geo:728193/			co:012	Plovdiv Bar Co	123	Plovdiv company 2

Assume this construct (other syntax suggestions are welcome!):

{foo} UNION LATERAL(?var) {bar}

?var must be bound by foo (and exported in its select if it's a subquery)
?var must also be exported by bar
bar is iterated for every binding of ?var
the results of bar are appended to the results of foo

Then we could use it to implement the query in question:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {select ?country ?city ?city_name ?population {
    bind(<http://sws.geonames.org/732770/> as ?country)
    ?country x:city ?city.
    ?city x:name ?city_name.
    ?city x:population ?population.
  } order by desc(?population) limit 2}
  UNION LATERAL(?city)
  {select ?city ?company ?company_name ?revenue {
    ?city x:company ?company.
    ?company x:name ?company_name.
    ?company x:revenue ?revenue
  } order by desc(?revenue) limit 2}
}

(It's more likely to have inverse links ?city x:country ?country and ?company x:city ?city but for simplicity we use straight links)

Considerations for backward compatibility

None?

The text was updated successfully, but these errors were encountered:

afs · 2019-07-18T11:29:06Z

Related: "Query hints" #71

VladimirAlexiev · 2020-09-23T08:25:52Z

https://www.stardog.com/labs/blog/stored-query-service/ explains how to call stored queries in Stardog (similar to subqueries but providing better modularity).
It also includes ideas how to use a similar mechanism for correlated (LATERAL) joins and cites this issue.

@klinovp: I upvoted your correlated join issue for SPARQL 1.2 btw, it's good stuff. Let's turn it into a SEP maybe?
Vladimir: Do you have the nerve and stamina for it?

@kasei, @afs, @jeenbroekstra and @JervenBolleman:

Your feedback and suggestions will be highly appreciated here!
Do you find this issue interesting enough to help us with a SEP?

@rubensworks Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

namedgraph · 2020-09-23T08:45:16Z

Any example where this would be relevant that is not related to GraphQL?..

klinovp · 2020-09-23T09:00:43Z

An example how a correlated join is a useful addition to un-correlated ones? It's exactly the same situation as in SQL and there're lots of use cases, the classical one is something like "you have a database of products and a set of offers for each product, you want to get 5 highest price offers for each product".

GraphQL is one important use case here but this is a core SPARQL extension proposal which is perfectly valid outside of the GraphQL context.

JervenBolleman · 2020-09-23T09:20:59Z

I think this issue is interesting enough to do a SEP on. Reading through this `LATERAL` proposal, I suspect that it overlaps with the `PREFER` SEP (proposed for #13) and maybe that can be extended. The syntax/usecase is different, but execution wise I suspect the overlap is much higher. As in `PREFER` is a `LATERAL` query that must have an `ORDER BY` and fixed `LIMIT` of 1. Which I had not thought of, and considering I only had my first coffee I am willing to be corrected on ;)

rubensworks · 2020-09-24T06:54:36Z

Do you agree with Onto that this issue is quite crucial for implementing GraphQL fully and correctly?

I fully agree on the importance of this issue for SPARQL in general.

I wouldn't call it crucial for GraphQL per se, as different interpretations are possible on how to apply it to RDF data. But for the interpretation that Ontotext has picked, it definitely makes sense.

lisp · 2020-10-08T10:38:33Z

ignore sql machinations related to lateral and correlated subqueries, to concentrate just on the intent to produce from some dependent sub-expression some number of solutions for each solution of some dominant subexpression.

should not the general "sidewards information passing" mechanism suffice?

is this best accommodated with a union variant?

is it not more a matter of just declaring top-down rather than bottom-up reduction - analogous to the conflagration surrounding EXISTS?
in which case an explicit SUBSELECT operator and/or something like an EACH modifier would suffice to declare intent and leave it to the processor to arrange for the s-i-p data/control flow in any and all join forms.
as in (in simplified form)

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SUBSELECT * where { ?city x:company ?company. } limit 2}
}

or

select * {
  {select * where { ?country x:city ?city. } limit 2}
  {SELECT * where { ?city x:company ?company. } LIMIT (EACH 2) }
}

jaw111 · 2021-06-09T22:08:14Z

@lisp from the perspective of a SPARQL user (rather than implementer) it would seem more intuitive to have a somewhat bottom-up phrasing for this. Based on your example:

select *
where { ?city x:company ?company. }
limit 2
for each {
  select *
  where { ?country x:city ?city. }
  limit 2
}

lisp · 2021-06-09T23:44:15Z

given the implicit control flow which applies to the rest of the language, this would be

for each {
  select *
  where { ?country x:city ?city. }
  limit 2 }
{ select *
  where { ?city x:company ?company. }
  limit 2 }

VladimirAlexiev · 2021-08-20T14:11:14Z

https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077#7d71 AGNOS blog giving some examples, but in SQL using the "Northwind" MS SQL example database

VladimirAlexiev · 2021-08-20T14:21:25Z

@lisp We need a syntax showing the dependency (major-minor) explicitly (as in your #100 (comment)).

A symmetric syntax like your #100 (comment) won't work because:

We want all major elements to be returned, even if some of them have no minor correlates (Sofia should be returned even if it had no companies in it)
You can nest them at any number of levels

@jaw111 Please let's not spell it minor-major. (Do you write RTL by any chance? ;-)

Here's an example of deeper nesting, and each level has a dependency on the previous level:

country
- company
  - employee
  - product
    - rating
- museum
  - painting
  - sculpture

VladimirAlexiev · 2021-10-14T08:39:13Z

@klinovp on twitter: We added correlated subqueries to SPARQL: https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries It's similar to LATERAL in Postgres.

lisp · 2021-10-14T09:36:04Z

what are the circumstances that require the sqs:inputs mechanism?

jaw111 · 2021-10-14T10:26:45Z

@klinovp no RTL experience here.

Looking back to the original proposed solution, I struggle with the UNION clause, but could that simply be omitted and use a syntax for LATERAL something of a mash-up of VALUES and FILTER EXISTS to indicate the variable(s) for which bindings are to be passed sideways/laterally to the inner pattern/subquery.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  {
    select ?country ?city ?city_name ?population {
      bind(<http://sws.geonames.org/732770/> as ?country)
      ?country x:city ?city.
      ?city x:name ?city_name.
      ?city x:population ?population.
    }
    order by desc(?population)
    limit 2
  }
  LATERAL ?city {
    select ?city ?company ?company_name ?revenue {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
    }
    order by desc(?revenue)
    limit 2
  }
}

One could also pass in multiple variables and make the minor part optional. The outer part need not be a subquery:

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
}

Similarly, what is in the lateral might just be a graph pattern rather than a subquery. This could be used to give hints to the query processor whether to pass in solutions it already has when evaluating an optional clause (nest loop join) or use bottom up approach.

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      ?city x:company ?company.
      ?company x:name ?company_name.
      ?company x:revenue ?revenue.
      ?company x:headquarteredIn ?country .
    }
  }
}

VladimirAlexiev · 2021-10-15T09:20:53Z

@jaw111 thanks for your syntax proposals!

If the main is not a subquery, would we put order/limit at the bottom as for a normal query?

select ?country ?city ?city_name ?population ?company ?company_name ?revenue {
  bind(<http://sws.geonames.org/732770/> as ?country)
  ?country x:city ?city.
  ?city x:name ?city_name.
  ?city x:population ?population.
  optional {
    LATERAL ?city ?country {
      select ?city ?company ?company_name ?revenue {
        ?city x:company ?company.
        ?company x:name ?company_name.
        ?company x:revenue ?revenue.
        ?company x:headquarteredIn ?country .
      }
      order by desc(?revenue)
      limit 2
    }
  }
} order by desc(?population) limit 2

We should always think of the case of multiple nestings:

     major
       minor
         sub-minor

So I'd favor syntaxes that reflect that nesting in the query structure.

lisp · 2021-10-15T11:17:09Z

the same question applies to the LATERAL proposal as to sqs:inputs: what are the circumstances which require the declaration for the corresponding variables?

VladimirAlexiev · 2022-02-06T20:01:15Z

@lisp minor should execute for each solution of major.
If they are merely patterns (not subqueries) then I think you are right, it's not necessary to declare the lateral variables.
If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

Aklakan · 2022-08-26T09:02:59Z

For the just released Jena 4.6.0 I contributed the "service enhancer" plugin which has lateral joins as one of its features: https://jena.apache.org/documentation/query/service_enhancer.html
It's the first release so consider it experimental.

Example: Fetch one English label for every entity:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?s ?l {
  VALUES ?s { wd:Q1686799 wd:Q54837 wd:Q54872 wd:Q54871 wd:Q108379795 } 
  SERVICE <loop:https://query.wikidata.org/sparql> { # SERVICE <loop:> {...} lateral joins with active dataset
    SELECT ?l {
      ?s rdfs:label ?l
      FILTER(langMatches(lang(?l), 'en'))
    } ORDER BY ?l LIMIT 1
  }
}

right, it's not necessary to declare the lateral variables. If they are subqueries, they declare their exposed variables, and again it's not necessary to declare them in LATERAL.

That was also my finding that its not necessary to declare the variables. In principle, lateral joins require different scoping rules as described in the linked document.

frensjan · 2022-12-13T12:51:12Z

Question: could someone comment on the semantics of multiple / consecutive lateral joins? Does lateral join n extend lateral join n-1? Or would this be based on shared variables?

An example use case could be to

select titles of threads in a forum and for each thread
- select the last n posts and
- the m most active users in each thread.

Example query

select ?thread ?title
       ?postTitle ?postAuthor
       ?topAuthor
{
    ?thread a :Thread ;
            :title ?title .

    lateral {
        select (?title as ?postTitle)
               (?author as ?postAuthor)
        {
            ?post :postedIn ?thread ;
                  :title ?title ;
                  :author ?author ;
                  :postedOn ?postedOn .
        }
        order by desc(?postedOn)
        limit 5
    }

    lateral {
        select (?author as ?topAuthor)
        {
            ?post :postedIn ?thread ;
                  :author ?author .
        }
        group by ?author
        order by desc(count(*))
        limit 3
    }
}

The key here is that the two 'lateral clauses' are meant to be unrelated to each other and only be related to the top-level query.

Tpt · 2022-12-13T15:05:57Z

@frensjan If I understand @afs proposal properly, LATERAL works just like OPTIONAL i.e. variables bound inside of the first LATERAL clause will be considered when evaluating the second LATERAL clause. To use SPARQL SSE notation, X LATERAL { Y } LATERAL { Z } is parsed to (lateral (lateral X Y) Z)

afs · 2022-12-13T15:48:16Z

@frensjan - interesting example.

In the current proposal, it is as @Tpt describes. The second lateral takes the results from the first as input. With no shared variables, that is in effect a cross-product.

What shape of result set are you trying to produce? there can be multiple lines for the same?thread, ?title pairs in each lateral.

The description feels like it is a an "details for A, details for B" (A and B themselves not connected) which suggests a UNION somewhere - that does not get information from each branch on the same row. But if there are 5 rows from the first lateral, for one (thread,title) what is the desired outcome for the 3 rows from the second for that (thread,title)? There isn't a nested table feature in SPARQL - GROUP_CONCAT can be used but it isn't ideal.

frensjan · 2022-12-13T23:28:14Z

Thanks for the responses @afs and @Tpt; understood and as I expected. For as long as the size of the 'legs' is small, the costs of such a cross-product could be acceptable.

The use case would be to show related information in an entity listing / table or in an entity detail view. I.e. the name and number of posts for the OP and the number of posts per thread of a listing of threads.

Using a pseudo GraphQL query this could be something like:

{
    threads(limit:10) {
        title
        creationDate
        postCount       # lateral join to count(?post) over ?post :parent ?thread
        creator {       # lateral join to get ?creator :created ?thread
            name        # fetch ?creator :name ?name
            postCount   # count(?post) over ?creator :created ?post
        }
    }
}

A more elaborate example for vulnerabilities

{
    cve(id:"https://nvd.nist.gov/vuln/detail/CVE-2022-42003") {
        title
        description
        ...

        link {
            url
            tags {
                label
            }
        }

        weaknesses {
            id
            name
            source
        }

        affected {
            cpe {
                pattern
                version {
                    from
                    to
                }
                instances(limit:10) {
                    part
                    vendor
                    product
                    version
                    update
                    edition
                    ...
                }
            }
        }
    }
}

Use cases where information is tracked over time it would be useful to show the latest value for a number of attributes; e.g. based on a timestamp associated with the graph such triple occurs in. If the application can assume these attributes change values atomically in this example in the same named graph, this could be solved by one lateral join to the named graph with the last changes. But reality can be stubborn in my experience.

GROUP_CONCAT would only work for primitive values and not for more elaborate (sub-)solution. Ideally, this could be combined with the FRAMED ideas that are also discussed for SPARQL 1.2.

Note that while I'm using GraphQL here as examples, it is intended as pseudo code / a means to convey intent. I think these use-cases hold regardless of whether GraphQL is part of the technology stack.

frensjan · 2024-01-22T11:16:54Z

With regards to LATERAL enhancement and its companion enhancement for inject, how should the following query be interpreted?

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

The intent of the query is to select the ?os that have a :value ?w larger than the :value ?v of ?s.

However, based on SEP-0007 it would seem that ?v isn't bound in the evaluation of the sub-select because it isn't in its projection. This can obviously be fixed:

SELECT * {
    ?s :value ?v
    LATERAL {
        SELECT DISTINCT ?v ?o {
            ?o :value ?w .
            FILTER(?w > ?v)
        }
    }
}

However, this is a bit awkward, as ?v can never have the output role of the sub-select to use the wording from Correlation and Substitution in SPARQL.

afs · 2024-01-22T16:54:04Z

@frensjan - nice - this brings out the difference between LATERAL and parameterised queries by variable substitution.

There are two different ?v variable that never meet (i.e. get joined).
As a query, the inner ?v can be replaced by any variable not used elsewhere and the query results are the same.
(Jena renames the inner variable so it does not have to have code for scoping issues in other query execution code.)

This is a difference between LATERAL and parameterised queries by variable substitution. I believe they are both are valid use cases - the same machinery applies but there are slight differences in how it is used, here how to treat scope-hidden variables.

It doesn't even have to involve LATERAL:

As a parameterized query:

    SELECT ?o ?w {
         ?o :value ?w .
         FILTER(?w > ?v)
     }

makes sense. As a query, it rejects everything because the FILTER is error.

#57 and others.

lisp · 2024-01-22T17:17:43Z

the change to scoping rules is one of the reasons for LATERAL.
see https://observablehq.com/d/dba772c1bf5ed10f

frensjan · 2024-01-22T18:11:29Z

Thanks for the reactions. So would you @afs and @lisp say my second query should work as intended?

If so, don't you agree that it is awkward to select a variable ?v (which is typically used to expressed the variable is in the output role of the (sub-)select) to convey that it is in the input role?

@afs, what did you mean by

It doesn't even have to involve LATERAL:

The query that I gave as an example translates to something like:

SELECT * {
    ?s :value ?v
    ?o :value ?w .
    FILTER(?w > ?v)
}

But that's not really the issue I wanted to address.

A more involved example:

PREFIX : <http://example.org>

SELECT * WHERE {

    # messages and when they were created
    ?message1 a :Message .
    ?message1 :created ?referenceTimestamp .

    # for each message, the last ten messages that were created before it
    LATERAL {
        SELECT ?referenceTimestamp ?message2 {
            ?message2 a :Message .
            ?message2 :created ?created .
            FILTER( ?created < ?referenceTimestamp)
        }
        ORDER BY DESC( ?created )
        LIMIT 10
    }

}

Is the consensus that (given SEP-0006 and 0007) this would work as intended (as per the comments)?

The algebra according to Arq (click to expand)

  (lateral
    (bgp
      (triple ?message1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.orgMessage>)
      (triple ?message1 <http://example.orgcreated> ?referenceTimestamp)
    )
    (project (?referenceTimestamp ?message2)
      (top (10 (desc ?/created))
        (filter (< ?/created ?referenceTimestamp)
          (bgp
            (triple ?message2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.orgMessage>)
            (triple ?message2 <http://example.orgcreated> ?/created)
          )))))

I assume that the query doesn't work as intended if ?referenceTimestamp isn't selected in the sub-select. Although I would like this to work as I think this would be a lot easier to explain to my users! (click to expand)

PREFIX : <http://example.org>

SELECT * WHERE {

    # messages and when they were created
    ?message1 a :Message .
    ?message1 :created ?referenceTimestamp .

    # for each message, the last ten messages that were created before it
    LATERAL {
        SELECT ?message2 {
            ?message2 a :Message .
            ?message2 :created ?created .
            FILTER( ?created < ?referenceTimestamp)
        }
        ORDER BY DESC( ?created )
        LIMIT 10
    }

}

afs · 2024-01-22T19:09:12Z

my second query should work as intended

Yes. ?v has to be in-scope at the top of the LHS (left hand side) of LATERAL.

afs · 2024-01-22T19:13:37Z

# for each message, the last ten messages that were created before it

Yes! That's a very important application. The "up to 10 of" for each item is very hard to do otherwise.

?/created

FWIW That's a renamed variable.

frensjan · 2024-01-25T09:51:51Z

We're fleshing out the implementation of LATERAL for RDF4J; another question to make sure we get the intended semantics right:

In the Oxigraph tests there are two queries:

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

and

PREFIX ex: <http://example.org/>

SELECT ?s ?o WHERE {
    ?s a ex:T.
    LATERAL { OPTIONAL {SELECT ?s ?o WHERE { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

From the expected results, it seems that in the first query ?s is considered out of scope for evaluation for each ?s a ex:T. while in the second query it is in scope.

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

I would consider { P } OPTIONAL { { } LATERAL Q } to be equivalent to { P } LATERAL { { } OPTIONAL Q }.

I did noticed a subtle difference in the algebra for Jena:

  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

vs

  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))

Note the conditional vs leftjoin operators.

I haven't tested the actual behaviour of Jena yet. But as Jena uses variable renaming, I would expect ?s to be renamed to ?/s if it were out of scope.

Tpt · 2024-01-25T11:33:04Z

We're fleshing out the implementation of LATERAL for RDF4J

Amazing!

For me it's surprising that the semantics of these two queries differ. I consider both OPTIONAL and LATERAL to have 'nested loop' like semantics; i.e., for each solution / binding set from the left-hand side, variable bindings are 'injected into' the right-hand side graph pattern.

This is the trap. OPTIONAL do not have "nested loop" semantic but a "left join" semantic. So, in A OPTIONAL { B }, A and B are supposed to be evaluated independently then joined. See this definition in the SPARQL spec. For example the query

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } OPTIONAL { BIND(2 AS ?b) FILTER(false) } }

because ?a is not bound inside of the OPTIONAL
and so will return only a=1
wheras

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(BOUND(?a)) } }

can be rewritten

SELECT * WHERE { BIND(1 AS ?a } LATERAL { BIND(2 AS ?b) FILTER(true) } }

because ?a is now bound inside of the LATERAL
and so will return a=1 b=2.

This is why rewriting OPTIONAL into nested loops is hard because one needs to take care of not changing the results.

frensjan · 2024-01-25T11:46:18Z

Thanks @Tpt. This maybe a bit broader of an issue in RDF4J perhaps ... I'l look into that.

frensjan · 2024-01-25T11:50:39Z

This may also be an issue with the Jena implementation then @afs ? I've tested the Oxigraph queries and they both yield the same results given the test data; see below.

(I could have perhaps created an issue for Jena, but as this concerns the semantics of LATERAL somewhat I reported it here)

$ arq -version
Apache Jena version 4.10.0

$ cat data.ttl
@prefix ex: <http://example.org/> .

ex:s1 a ex:T ; ex:p 11 , 12 , 13 .
ex:s2 a ex:T ; ex:p 21 , 22 , 23 .
ex:s3 a ex:T .

$ arq --explain --data data.ttl --query optional-in-lateral.rq
12:44:39 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      LATERAL
        { OPTIONAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:39 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (lateral
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (leftjoin
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:39 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:39 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:39 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

$ arq --explain --data data.ttl --query lateral-in-optional.rq
12:44:45 INFO  exec            :: QUERY
  PREFIX  ex:   <http://example.org/>

  SELECT  ?s ?o
  WHERE
    { ?s  a  ex:T
      OPTIONAL
        { LATERAL
            { SELECT  ?s ?o
              WHERE
                { ?s  ex:p  ?o }
              ORDER BY ?o
              LIMIT   2
            }
        }
    }
12:44:45 INFO  exec            :: ALGEBRA
  (project (?s ?o)
    (conditional
      (bgp (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/T>))
      (lateral
        (table unit)
        (project (?s ?o)
          (top (2 ?o)
            (bgp (triple ?s <http://example.org/p> ?o)))))))
12:44:45 INFO  exec            :: BGP ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: Reorder/generic ::   ?s rdf:type <http://example.org/T>
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s1> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s3> <http://example.org/p> ?o
12:44:45 INFO  exec            :: BGP ::   <http://example.org/s2> <http://example.org/p> ?o
12:44:45 INFO  exec            :: Reorder/generic ::   <http://example.org/s2> <http://example.org/p> ?o
--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s3 |    |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

afs · 2024-01-25T12:36:34Z

Hi @frensjan -- thanks for providing the clear details.

The results look like the same rows to me. And I get the same results locally.

What's the issue you've noticed?

conditional is a particular way to execute leftjoin.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

frensjan · 2024-01-25T12:59:01Z

What's the issue you've noticed?

Well, the issue is that it's a different result than when you run these queries with Oxigraph apparently!

conditional is a particular way to execute leftjoin.

That's what I've read in the javadoc. I could imagine that it's backed by a different evaluation algorithm and would explain differences.

Renaming only happens if a variable does not remain in-scope up the evaluation - ?s does remain in-scope in this case.

That's also my initial 'stance'. But apparently in Oxigraph this is different?

afs · 2024-01-25T13:31:20Z

Please could could record the Oxigraph results here for the record?

Jena5 gets the same results, different order:

--------------
| s     | o  |
==============
| ex:s3 |    |
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
--------------

(there is a different default in-memory graph implementation)

frensjan · 2024-01-25T15:13:48Z

The Oxigraph tests are on Github. The relevant files are:

subselect_inside_optional.rq
subselect_inside_optional.srx
subselect_outside_optional.rq
subselect_outside_optional.srx

afs · 2024-01-25T17:41:05Z

tested the Oxigraph queries and they both yield the same results given the test data; see below.

I misread that as both queries give equal results

Those two are different results:

subselect_inside_optional.srx :

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

subselect_outside_optional.srx:

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 | 21 |
| ex:s2 | 22 |
| ex:s3 |    |
--------------

afs · 2024-01-25T17:41:35Z

Jena gets the same as Oxigraph those if you use the "ref" query engine (--engine=ref) or if you switch off the optimizer (--optimize=off) so Jena is wrong for the normal engine. It's likely the conditional.

The "ref" engine is very simple, runs exactly the algebra and materializes each intermediate step.

apache/jena#2218

frensjan · 2024-01-26T08:55:03Z

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

RDF4J evaluates all left-joins in this manner (bindings from the LHS are used for variables with the same name on the LHS).

I've raised the issue at RDF4J (which is larger than just optional lateral).

Maybe not the easiest to explain to users (n=1)

I must say that this does strike me as odd ... I need to explain this to my users:

PREFIX : <http://example.org/>

SELECT * WHERE {
    VALUES ?x { :x }
    OPTIONAL {
        FILTER( BOUND(?x) )
        BIND( :y as ?y )
        BIND( ?x as ?z )
    }
}

yields

---------------
| x  | y  | z |
===============
| :x | :y |   |
---------------

(I intentionally wrote the FILTER clause first in the optional block for dramatic effect)

I understand that this is as per the specification (filters in a group are applied to the group as a whole and if a left join has a group with a filter on the RHS, this filter is applied for the merged solutions from left and right that are compatible). But it's not that intuitive (for me, n=1).

Blazegraph

Note that querying the Wikidata SPARQL endpoint (which is still backed by Blazegraph if I'm not mistaken) gives yet another result:

--------------
| x  | y | z |
==============
| :x |   |   |
--------------

Virtuoso

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

afs · 2024-01-26T13:32:05Z

Thanks for investigating this further @afs. I assume choosing the conditional is an incorrect optimisation by Jena.

Looking at subselect_inside_optional.srx - it is symmetric in :s1 and :s2 but the answers are different.
The only asymmetry is the top 2 subselect.

I think that LATERAL rules apply inside the subselect because ?s is not a hidden variable. I can only explain the results

--------------
| s     | o  |
==============
| ex:s1 | 11 |
| ex:s1 | 12 |
| ex:s2 |    |
| ex:s3 |    |
--------------

if the subselect executes with free ?s (no lateral injection) when 11 and 12 are first in the ORDER BY

What am I missing?
When it's ?s=:s2, LATERAL should making all execution within it's right hand side act like ?s is :s2 only.

SELECT ?s ?o { BIND(ex:s2 AS ?s) ?s ex:p ?o .} ORDER BY ?o LIMIT 2 is 21 and 22 which I'd expect to be leftjoin'ed into the answers.

SELECT * WHERE {
    ?s a ex:T.
    OPTIONAL { LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2} }
}

This is not to argue Jena results for subselect_inside_optional are correct. If they are, it's an accident of conditional.
The unoptimized form is definitive and it agrees with Oxigraph.

Unoptimized evaluation does fails to propagate ?s back leading to the effect that the subselect is not having the LATERAL binding any more. That is a bug. (@Tpt will understand where public work time has gone.) The optimization, by accident, puts it back in this case by something unrelated to LATERAL.

lisp · 2024-01-26T14:13:55Z

@frensjan, in

But it's not that intuitive (for me, n=1).

"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

TallTed · 2024-01-26T16:50:43Z

@frensjan wrote --

Virtuoso is somewhat more aligned with the specification it seems when querying dbpedia. But instead of ?x not being bound it 'just' throws an error:

Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'x' is used in the query result set but not assigned

It should perhaps be noted that this error is not mandatory behavior in Virtuoso. There's a checkbox on the /sparql query form, "Strict checking of void variables", which is ticked by default (resulting in your error). Unticking that box (or adding &signal_unconnected=on to a results URI) gets this --

----------------
| x  | y  | z  |
================
| :x |    |    |
----------------

I confess to not yet fully understanding this issue. As it seem that the latter solution set is an error, I would appreciate it if someone who does fully understand this issue (and so can evaluate any patch for it) would log an issue on the Virtuoso project.

frensjan · 2024-01-26T20:44:28Z

@frensjan, in
But it's not that intuitive (for me, n=1).
"it's" is to mean the recommendation's algebra or the result which it produces in this instance?

I meant the algebra, the semantics of the query (and thus the results produced). This may very well be biased by the fact that I do a lot of imperative programming. But the curly braces around the RHS of OPTIONAL for a lot of people probably provide some kind of hint that what's in it has the same 'data' available (or not).

Filters in OPTIONAL are special, as they have results from both LHS and RHS 'in scope'.

Variables in statement patterns on the RHS shared with the LHS form the join key, so they feel like they're 'in scope'. E.g. in SELECT * { ?s :p ?x OPTIONAL { ?s :q ?y } }, ?s will always have the same values on the LHS as on the RHS (if any compatible solution exists of course).

But for e.g. BIND this isn't the case. So you can write:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        FILTER( ?price * (1 - ?discount) < 10 )
    }
}

But this query has different semantics:

SELECT ?price ?discount {
    VALUES ?price { 10 }
    OPTIONAL {
        VALUES ?discount { 0.10 }
        BIND( ?price * (1 - ?discount) AS ?effectivePrice )
        FILTER( ?effectivePrice < 10 )
    }
}

frensjan · 2024-01-26T20:57:48Z

But perhaps I digress too much. This issue is about LATERAL, not OPTIONAL.

If RDF4J would be 'fixed' though, I would need to train my users in the difference between
P { LATERAL { OPTIONAL Q } } and P { OPTIONAL { LATERAL Q }.

E.g., this query gives, for every ?movie, the average rating of all movies:

SELECT * {
    ?movie a :Movie
    OPTIONAL { LATERAL {
        { SELECT (AVG(?rating) as ?averageRating) {
            ?movie :rating ?rating
        } }
    } }
}

That's just hard to grok for me.

afs · 2024-01-26T21:25:06Z

What am I (@afs) missing?

That the OPTIONAL is has normal evaluation so the RHS is

{ LATERAL {SELECT ?s ?o { ?s ex:p ?o } ORDER BY ?o LIMIT 2 }

No LHS ?s until after LATERAL is evaluated. The LHS of the LATERAL is the unit { }.

VladimirAlexiev · 2024-03-15T09:27:36Z

https://ceur-ws.org/Vol-3471/paper8.pdf by @Aklakan
Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark
sec 4.1. Merging CONSTRUCT Queries using LATERAL

Aklakan mentioned this issue Oct 22, 2020

Extended CONSTRUCT query forms to declare graph fragments with a starting point #127

Closed

Aklakan mentioned this issue Sep 20, 2021

Entity-based Construct Queries #128

Open

VladimirAlexiev mentioned this issue Oct 12, 2021

SPARQL Service Description: add "features/dialects" and a client protocol #152

Open

VladimirAlexiev closed this as completed Oct 15, 2021

VladimirAlexiev reopened this Oct 15, 2021

hobofan mentioned this issue Oct 25, 2022

Implement Jena-like loop: "service" to support lateral joins oxigraph/oxigraph#267

Closed

afs mentioned this issue Nov 12, 2022

LATERAL join apache/jena#1615

Closed

Jamie-SA mentioned this issue Dec 5, 2022

SEP-0006 for Lateral Join. SEP-0007 for variable substitution #165

Merged

edwardanderson mentioned this issue Jan 5, 2024

Are _label properties supposed to be expanded into rdfs:label triples? linked-art/linked.art#539

Closed

frensjan mentioned this issue Jan 19, 2024

Investigate SPARQL CG Lateral proposal-006 eclipse-rdf4j/rdf4j#4315

Open

afs mentioned this issue Jan 25, 2024

LATERAL query evaluation error apache/jena#2218

Closed

frensjan mentioned this issue Jan 26, 2024

Optional incorrectly binds values from LHS in the RHS graph pattern eclipse-rdf4j/rdf4j#4882

Open

VladimirAlexiev mentioned this issue Feb 2, 2024

SEP-0006 issue that may be very confusing for users #194

Open

VladimirAlexiev mentioned this issue Mar 15, 2024

SPARQL binding paradoxes #195

Open

JOIN LATERAL or Correlated Subquery #100

JOIN LATERAL or Correlated Subquery #100

Comments

VladimirAlexiev commented Jul 17, 2019 • edited Loading

Why?

Previous work

Proposed solution

Considerations for backward compatibility

afs commented Jul 18, 2019

VladimirAlexiev commented Sep 23, 2020 • edited Loading

namedgraph commented Sep 23, 2020

klinovp commented Sep 23, 2020

JervenBolleman commented Sep 23, 2020 via email • edited Loading

rubensworks commented Sep 24, 2020

lisp commented Oct 8, 2020 • edited Loading

jaw111 commented Jun 9, 2021

lisp commented Jun 9, 2021

VladimirAlexiev commented Aug 20, 2021

VladimirAlexiev commented Aug 20, 2021

VladimirAlexiev commented Oct 14, 2021

lisp commented Oct 14, 2021

jaw111 commented Oct 14, 2021 • edited Loading

VladimirAlexiev commented Oct 15, 2021 • edited Loading

lisp commented Oct 15, 2021

VladimirAlexiev commented Feb 6, 2022

Aklakan commented Aug 26, 2022 • edited Loading

frensjan commented Dec 13, 2022 • edited Loading

Tpt commented Dec 13, 2022

afs commented Dec 13, 2022

frensjan commented Dec 13, 2022

frensjan commented Jan 22, 2024

afs commented Jan 22, 2024

lisp commented Jan 22, 2024

frensjan commented Jan 22, 2024 • edited Loading

afs commented Jan 22, 2024

afs commented Jan 22, 2024 • edited Loading

frensjan commented Jan 25, 2024

Tpt commented Jan 25, 2024 • edited Loading

frensjan commented Jan 25, 2024

frensjan commented Jan 25, 2024

afs commented Jan 25, 2024

frensjan commented Jan 25, 2024

afs commented Jan 25, 2024 • edited Loading

frensjan commented Jan 25, 2024

afs commented Jan 25, 2024 • edited Loading

afs commented Jan 25, 2024 • edited Loading

frensjan commented Jan 26, 2024

Maybe not the easiest to explain to users (n=1)

Blazegraph

Virtuoso

afs commented Jan 26, 2024 • edited Loading

lisp commented Jan 26, 2024

TallTed commented Jan 26, 2024

frensjan commented Jan 26, 2024 • edited Loading

frensjan commented Jan 26, 2024

afs commented Jan 26, 2024 • edited Loading

VladimirAlexiev commented Mar 15, 2024

VladimirAlexiev commented Jul 17, 2019 •

edited

Loading

VladimirAlexiev commented Sep 23, 2020 •

edited

Loading

JervenBolleman commented Sep 23, 2020 via email •

edited

Loading

lisp commented Oct 8, 2020 •

edited

Loading

jaw111 commented Oct 14, 2021 •

edited

Loading

VladimirAlexiev commented Oct 15, 2021 •

edited

Loading

Aklakan commented Aug 26, 2022 •

edited

Loading

frensjan commented Dec 13, 2022 •

edited

Loading

frensjan commented Jan 22, 2024 •

edited

Loading

afs commented Jan 22, 2024 •

edited

Loading

Tpt commented Jan 25, 2024 •

edited

Loading

afs commented Jan 25, 2024 •

edited

Loading

afs commented Jan 25, 2024 •

edited

Loading

afs commented Jan 25, 2024 •

edited

Loading

afs commented Jan 26, 2024 •

edited

Loading

frensjan commented Jan 26, 2024 •

edited

Loading

afs commented Jan 26, 2024 •

edited

Loading