Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support window functions #47

Open
kasei opened this issue Apr 3, 2019 · 22 comments
Open

Support window functions #47

kasei opened this issue Apr 3, 2019 · 22 comments
Labels
function More custom function improvements or inbuilds query Extends the Query spec

Comments

@kasei
Copy link
Collaborator

kasei commented Apr 3, 2019

SPARQL should add support for window functions. This would increase expressivity and address some existing use cases such as "limit per resource".

Why

Window functions would allow computing values that are unavailable in SPARQL 1.1 queries:

  • row numbering and ranking
  • quantiles
  • moving averages
  • running totals

These can be used to address use cases such as limiting the result set to a specific number of results for each resource ("limit per resource"). For example, consider a query to retrieve information about web posts:

SELECT ?post ?title ?author ?date WHERE {
	?post a sioc:Post ;
		dc:title ?title ;
		sioc:has_creator ?author
}

Given that a post can have any number of titles and authors, we might wish to restrict our query to only providing information about at most 3 authors for any individual post. This isn't easily done using standard SPARQL, but can be addressed using window functions.

Previous work

  • I've implemented window functions (with the strawman syntax shown below) in Kineo.
  • Window functions in SQLServer
  • Window functions in SQLite
  • Window functions in PostgreSQL

Proposed solution

Using a RANK window function, we can filter the result set of the example query above with a HAVING clause:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?post ?title ?author ?date WHERE {
	?post a sioc:Post ;
		dc:title ?title ;
		sioc:has_creator ?author
}
HAVING (RANK() OVER (PARTITION BY ?post ORDER BY ?author) <= 2)

This will take the the result set from matching the basic graph pattern, and partition it into groups based on the value of ?post. Within each partition, rows will be sorted by ?author, and then assigned an increasing integer rank. Finally, these rows will be filtered to keep only those with a rank less than or equal to 2. The final result set will be the concatenation of rows in each partition.

Beyond this use case, existing aggregates (e.g. AVG and SUM) can be used with windows to support things like moving averages and running totals.

Considerations for backward compatibility

None.

@RickMoynihan
Copy link

You can achieve this already with a little more syntax by using a sub SELECT with the LIMIT on to return the subjects, then not setting a limit on the outer query.

@kasei
Copy link
Collaborator Author

kasei commented Apr 3, 2019

@RickMoynihan for specific cases that's true, but I believe window functions can more naturally express some of these use cases and definitely increases expressivity for others. I used this specific example only because it was the one used in the original LimitPerResource feature request for SPARQL 1.1.

@kasei
Copy link
Collaborator Author

kasei commented Apr 3, 2019

(Updated the example to use ROW_NUMBER instead of RANK because this example doesn't use partition ordering. However, RANK is an example of a function that would be hard to approximate using just SPARQL 1.1 due to its handling of ties.)

@dydra
Copy link

dydra commented Apr 3, 2019

window functions can more naturally express some of these use cases and definitely increases expressivity for others

please describe such cases.

@kasei
Copy link
Collaborator Author

kasei commented Apr 4, 2019

@dydra I believe the example above is a prime example where it is more naturally expressed using a window function; having to duplicate the BGP in this query simply to pull out a distinct list of posts would be syntactically much more verbose, and this is a fairly straightforward example. SPARQL 1.1 saw precedent for adding features that in part allowed simplifying existing queries that were unnecessarily complex (property paths, negation, etc.). While I think window functions' increased expressivity alone make them worthwhile to discuss for inclusion, the simplification of some types of queries would also represent a big improvement.

Also, as mentioned, computing a moving average is an example that cannot be addressed with SPARQL 1.1. Another common window function I've used in production and would love to see come to SPARQL is quantile computation.

@kasei
Copy link
Collaborator Author

kasei commented Apr 4, 2019

@dydra @RickMoynihan

Apologies. In my haste to compile and submit issues today I seem to have botched the LimitPerResource example as well as my (obviously untested) query showing how to apply window functions to it. This was compounded by just skimming the original feature request page for SPARQL 1.1, which didn't really cover the real use cases as I have understood them in the intervening years.

My current understanding of the "limit per resource" use case would be this: you want to limit the number of results you get from the query for each distinct value of some set of variables. For the SIOC posts example that started this off, this would be something like "for each post, I'm interested in getting data for at most 3 authors." I hope it will be clear why this is a more interesting and complicated example. For this, window functions can be used by partitioning the results by the ?post variable, assigning a rank within each partition ordered by author bound to a new variable ?rank, and then filtering all results such that ?rank <= 3.

I hope that helps to clarify how window functions could address this specific use case. I'll try to update the example in the issue text to more accurately represent this use case. As I said, though, I think their value goes well beyond this use case, and can provide some nice features that we currently don't have.

@JervenBolleman JervenBolleman added function More custom function improvements or inbuilds query Extends the Query spec labels Apr 4, 2019
@VladimirAlexiev
Copy link
Contributor

@kasei Fully support. Can you craft more examples of the uses you mentioned, eg moving averages?

@kasei
Copy link
Collaborator Author

kasei commented Apr 7, 2019

@VladimirAlexiev

Again, using the strawman syntax I've implemented, here are couple of examples:

# Moving average over the past 4 quarters
SELECT ?quarter ?value (AVG(?value) OVER (ORDER BY ?quarter ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ?movingAverage) WHERE {
    VALUES (?quarter ?value) {
        ("2017Q1" 1)        # 1.0   = avg(1) ; only 1 Q of data
        ("2017Q2" 2)        # 1.5   = avg(1, 2) ; only 2 Q of data
        ("2017Q3" 1)        # 1.33  = avg(1, 2, 1) ; only 3 Q of data
        ("2017Q4" 2)        # 1.5   = avg(1, 2, 1, 2)
        ("2018Q1" 3)        # 2.0   = avg(2, 1, 2, 3)
        ("2018Q2" 2)        # 2.0   = avg(1, 2, 3, 2)
        ("2018Q3" 2)        # 2.25  = avg(2, 3, 2, 2)
        ("2018Q4" 4)        # 2.75  = avg(3, 2, 2, 4)
        ("2019Q1" 5)        # 3.25  = avg(2, 2, 4, 5)
    }
}
ORDER BY ?quarter
# 3 photos from each country
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?image ?country WHERE {
	?image a foaf:Image ;
		dcterms:coverage [ foaf:name ?country ; dcterms:type "Country" ] ;
		.
}
HAVING (ROW_NUMBER() OVER (PARTITION BY ?country) <= 3)
ORDER BY ?country

I can't guarantee it's uptime, but these can be run at this endpoint.

@kjetilk
Copy link

kjetilk commented Apr 7, 2019

I was the source of LimitPerResource way back before the SPARQL 1.1 work started, and it was a part of the initial work of the WG, but didn't make it in, something I have suffered ever since.

The initial use case was that we had a database of articles, where each article could be categorized into a number of categories and have any number of authors, and it had a bunch of optional fields. All this could yield a cyclic graph with a bit of heterogeneity when we CONSTRUCTed results, and what we found (this was before subqueries too), was that while SPARQL 1.0 would often work over heterogeneous data, but apart from OPTIONAL, wasn't very well adapted to it. Subqueries helped quite a lot, but I noticed when moving from industry to academia, that students didn't think of subqueries when they were faced with this problem:

you don't care about the number of solutions to query, you care about the number of "things"

So, the subquery route to achieving such a simple requirement quickly turned out to be very developer unfriendly. We had a hard time implementing paging of results as pure LIMIT and OFFSET couldn't not be used. This is a very common use case, and so I think we really need to fix this going forward.

Then the question is if we should fix this by some surface syntax, but then, @kasei came up with this idea of a windowing function. I find it intriguing, as I would assume it has use cases well beyond mine, event stream processing, as implemented by several extensions, would seem to benefit as well by the introduction of this algebra.

Thus, a windowing function shouldn't be considered just for my original case, even though I think that is an important one, but should be considered more generally.

@lisp
Copy link
Contributor

lisp commented Apr 7, 2019

one way to read this is that some properties of the execution process - such as ordinal in a given solution sequence, should be available as first-class values and that the aggregation operation should be extended with operations which use them.

@kasei
Copy link
Collaborator Author

kasei commented Apr 7, 2019

@lisp Yes, that could solve some (but not all) of the cases for which window functions are useful. However, window functions go beyond the expressivity of just having access to an ordinal number. For example, this can be seen in cases where the value can represent ties (RANK vs. ROW_NUMBER), where the window frame is smaller than the partition group (e.g. while computing a moving average), or where the computed value uses both the ordinal of the current row and information from the entire partition (e.g. the NTILE function). Some of these might be able to be computed with syntactically cumbersome repetition using just aggregates and existing graph operators, but not all of them, and certainly not in as concise a fashion.

@lisp
Copy link
Contributor

lisp commented Apr 8, 2019

ok.
are there any execution properties which do not devolve to an ordinal in some solution sequence?

@kasei
Copy link
Collaborator Author

kasei commented Apr 8, 2019

@lisp

(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)

Window functions can do anything they want on the rows in the window frame, so I imagine there's a lot of potential for ones that don't devolve to ordinals. Amongst the common window functions, I'd think that anything that dealt with ties would not be equivalent to ordinal-based approaches. This includes things like RANK which might produce a sequence like (1, 1, 1, 4, 4, 6) (as opposed to ROW_NUMBER which would just be 1, 2, 3, 4, 5, 6).

Ties can also impact the rows that comprise the window frame. Instead of defining the window frame as a range of row offsets, it can be defined as operating over groupings of rows that are deemed equivalent based on the window ordering. (Hard to explain in a quick comment here; this can be looked up as the difference between using ROWS and RANGE.) I've not worked with use cases that needed this functionality but I suspect that this would be incredibly difficult or impossible to do based just on row ordinals.

@dydra
Copy link

dydra commented Apr 8, 2019 via email

@kasei
Copy link
Collaborator Author

kasei commented Apr 9, 2019

@dydra

that is described (elsewhere) as though it devolves to a sequence number, just not directly.

everything which i have found from sql and nosql sources describes (a) value(s) which devolve(s) to (a) position(s) in the equivalent of (a) solution sequence(s).

Could you point me at some of these descriptions? I'm not clear how some of these can be reduced to just sequence numbers. For example, using RANK() on data with ties:

SELECT ?row (RANK() OVER (PARTITION BY ?partition ORDER BY ?value) AS ?rank) WHERE {
	VALUES (?row ?partition ?value) {
		(1 1 1)
		(2 1 2)
		(3 1 2)
		(4 1 2)
		(5 2 3)
		(6 2 3)
	}
}

[run]

The resulting ?rank row ordinals with corresponding ?rank values are:

row rank
1 1
1 2
3 2
4 2
5 1
6 1

I'm not sure how that could be reduced to just the ordinals, but maybe you have some ideas.

A more complicated example (and for which I don't have an implementation) would be using window framing using RANGE instead of ROWS. I've found the SQLite Window Functions Introduction to be one of the most accessible descriptions of the complexity involved here.

Perhaps we can discuss specific examples? Using a strawman syntax for accessing sequence numbers, can you show or think of an approach to implementing some of the examples above without window functions? My instinct is that many queries (like moving average, or photos per country) might be possible but will quickly become syntactically cumbersome when making small changes (e.g. changing the number of photos per country from 3 to 10), to such an extent that they become either unmanageable or so complex that query runtime becomes prohibitive (or both).

@lisp
Copy link
Contributor

lisp commented Apr 10, 2019

Could you point me at some of these descriptions?

starting simply with a search for the difference btw rank and row number, various descriptions read very much as when the results reflect positions in some sequence. as does the sqllite description. this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.

@kasei
Copy link
Collaborator Author

kasei commented Apr 10, 2019

@lisp I found that second link rather unhelpful. I couldn't reconcile the description with my understanding of RANK or the actual results shown in the example. Did you take a look at the SQLite documentation by any chance? It's not perfect, but I found its definitions and description of peer groups to be much better than most other documentation.

@lisp
Copy link
Contributor

lisp commented Apr 10, 2019

note, please, above, "as does the sqllite description".

@kasei
Copy link
Collaborator Author

kasei commented Apr 10, 2019

Apologies, I had missed the SQLite mention in your last response.

this sequence need not be the original solution sequence, but intermediate results of window operations are solution sequences just the same.

Perhaps I've lost track of what your concerns are. I think ordinal numbers are used in a few places in window functions and in the examples we've discussed. They are the result of ROW_NUMBER, and are required for computing things like RANK and DENSE_RANK. They are also required for defining a window frame (either directly in cases like ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, or indirectly after peer groups have been computed).

However, I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set. I don't believe this is the case, at least not in the general case and/or without tremendous complexity in the query to achieve similar semantics. Window function syntax allows concise declaration of the intended function, window, and optional framing without having to jump through hoops to do the same thing (e.g. without framing, many window queries could probably be alternatively computed by using aggregate subqueries to compute the window partitioning and adding extra joins to get the partition key into each solution. I'm not sure just how far this approach could take you in approximating larger subsets of window functions' expressivity, though.)

@lisp
Copy link
Contributor

lisp commented Apr 10, 2019

I thought your original suggestion was that the same results could be achieved without window functions if we had the ability to access or compute the ordinal number of solutions in a result set.

that was way back at the start and has long since been supplanted by my response to

(I should say that I'm not an expert here, but I've found window functions useful enough to implement in my SPARQL systems.)

(neither am i, but i suppose, that may have to change. that is why i am trying to deconstruct them.)

in connection with which intent, i conjectured (in my response on 08.04) as to alternatives to ordinals. in addition to which list, i wonder about operators which incorporate specific alternative preceding or succeeding solutions.

@mbarbieri77
Copy link

Hello guys, I’ve been working on a few articles that I have published under agnos.ai on Medium.com that I would like to share with you.
I think that the great thing about these articles is that they use a RDF triplestore that was migrated from a well-known sample relational database, which allows people to run SQL and SPARQL queries side by side. This can be very useful for learning SPARQL, when you already know SQL, or to validate new feature proposals for SPARQL 1.2, for example, where you can show what you want in SQL, and then validate future implementations by running SPARQL against a equivalent triplestore.
https://medium.com/agnos-ai/sql-features-not-supported-in-sparql-1-1-eb34e3519077
Let me know what you think and any feedback is very welcome. Thank you. Marcelo.

@TallTed
Copy link
Member

TallTed commented Mar 12, 2021

Sounds to me like Scrollable Cursors (see ODBC, JDBC, SQL, etc.) are being reinvented...

Hard to handle in a RESTian world, but quite valuable if specified and implemented well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
function More custom function improvements or inbuilds query Extends the Query spec
Projects
None yet
Development

No branches or pull requests

9 participants