Feature/optimize member queries #22

akariv · 2016-08-29T15:15:31Z

No description provided.

vitorbaptista · 2017-07-13T15:06:09Z

@akariv Anything missing here apart from the broken build?

akariv · 2017-07-16T17:31:39Z

Well, this is a optimisation patch meant to tackle the fact that postgres distinct queries are very slow.
I never meant to merge this PR, as I don't like the solution here.
There's an alternate solution which consists of normalising the data as it's being uploaded which is still wip - I'll make sure it's pushed here as well.

coveralls · 2017-08-04T15:51:28Z

Coverage increased (+0.07%) to 94.73% when pulling 540a99c on feature/optimize-member-queries into fb8eddc on master.

vitorbaptista · 2017-08-04T15:57:25Z

@akariv The tests were failing because we were trying doing queries like

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY cap_or_cur.label DESC NULLS LAST;  -- This column wasn't part of the select

This raises the error:

ERROR:  column "cap_or_cur.label" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...FROM cap_or_cur GROUP BY cap_or_cur.code ORDER BY cap_or_cur...

The cap_or_cur schema is:

                               Table "public.cap_or_cur"
 Column |       Type        |                        Modifiers                         
--------+-------------------+----------------------------------------------------------
 _id    | integer           | not null default nextval('cap_or_cur__id_seq'::regclass)
 code   | character varying | 
 label  | character varying | 
Indexes:
    "cap_or_cur_pkey" PRIMARY KEY, btree (_id)

I expected that we would have a unique index between for code, but we don't. I'm assuming this is because we never added it, not because there could be repeated codes. If that's the case, my solution was to simply GROUP BY cap_or_cur.code, cap_or_cur.label. There will always be a single (code, label) pair anyway, so there's no need to use max(). All tests pass fine (see 540a99c).

If that's not the case, and there can be multiple cap_or_cur.code, then we'd have to somehow use the same aggregate functions in the ORDER BY, which can get messy quickly.

akariv · 2017-08-04T16:03:04Z

I think that if we make the query to look like

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY "cap_or_cur.label" DESC NULLS LAST;

The point of this query is to retrieve only one row per code - just grouping by code and label might bring multiple instances for each code.

vitorbaptista · 2017-08-04T16:45:40Z

So, my aim is to change the query to:

SELECT *
FROM (
  SELECT
    cap_or_cur.code AS "cap_or_cur.code",
    max(cap_or_cur.label) AS "cap_or_cur.label"
  FROM cap_or_cur
  GROUP BY cap_or_cur.code
) AS cap_or_cur
ORDER BY "cap_or_cur.label" DESC NULLS LAST;

This avoids having to add max() to the ORDER BY clause. My (failed) try was:

def members(self, ref, cuts=None, order=None, page=None, page_size=None):
    """ List all the distinct members of the given reference, filtered and
    paginated. If the reference describes a dimension, all attributes are
    returned. """
    q = select()
    bindings = []
    cuts, q, bindings = Cuts(self).apply(q, bindings, cuts)
    fields, q, bindings = Fields(self).apply(q, bindings, ref, distinct=True)
    q = select([q.alias('cap_or_cur')]) # <---- Subquery (This is the only line added)
    ordering, q, bindings = Ordering(self).apply(q, bindings, order)
    q = self.restrict_joins(q, bindings)
    count = count_results(self, q)

    # ...

This generates the query (it fails before pagination, so that wasn't added yet):

SELECT
  cap_or_cur."cap_or_cur.code",
  cap_or_cur."cap_or_cur.label" 
FROM (
  SELECT
    cap_or_cur.code AS "cap_or_cur.code",
    cap_or_cur.label AS "cap_or_cur.label"
  FROM cap_or_cur
  GROUP BY cap_or_cur.code, cap_or_cur.label
) AS cap_or_cur
ORDER BY cap_or_cur.label DESC NULLS LAST

Which fails because the actual colum codes are cap_or_cur.cap_or_cur.code. If this was manually written, I would simply remove the aliases in the subquery and be done with it. As it's not, and the Fields() code is used in many other places, I'm not sure what to do.

coveralls · 2017-08-04T17:01:35Z

Coverage increased (+0.1%) to 94.785% when pulling 558880c on feature/optimize-member-queries into fb8eddc on master.

akariv · 2017-08-04T17:02:00Z

@vitorbaptista I think I got it - take a look

coveralls · 2017-08-04T17:04:24Z

Coverage increased (+0.1%) to 94.785% when pulling 95eb17c on feature/optimize-member-queries into fb8eddc on master.

coveralls · 2017-08-04T17:04:24Z

Coverage increased (+0.1%) to 94.785% when pulling 95eb17c on feature/optimize-member-queries into fb8eddc on master.

vitorbaptista · 2017-08-07T16:22:35Z

@akariv Cool! If I understood correctly, your patch makes the order use

Wow! Super interesting. The original query that was raising the error was:

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY cap_or_cur.label DESC NULLS LAST;

The query created now after you modified the code is:

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY "cap_or_cur.label" DESC NULLS LAST;  -- The only difference are the double quotes here

The difference is just the double quotes in the ORDER BY. From what I understood, this is because an unquoted identifier has a limited set of valid characters, while quoted identifiers don't. For example, they can't have whitespaces, while quoted identifiers can. According to https://www.postgresql.org/docs/9.4/static/sql-syntax-lexical.html (section 4.1.1), they can't have a dot. So ORDER BY "cap_or_cur.label" is different from ORDER BY cap_or_cur.label. I'm not sure how the .label means in the second example, where cap_or_cur would be the identifier. However, the solution is to somehow make SQLAlchemy add quotes to the column in ORDER BY.

Your solution works, although I'm not sure I understand why 😅 I tried to understand how your changes resulted in this new SQL query, but couldn't. Regardless, all tests are passing and it worked with a few examples I tried locally, so LGTM 👍

akariv mentioned this pull request Oct 29, 2016

Distinct values from a cube #24

Open

vitorbaptista force-pushed the feature/optimize-member-queries branch from ca4c401 to 3b78074 Compare August 4, 2017 15:02

Replace slow distinct queries with a more optimized approach

92c6cb1

akariv force-pushed the feature/optimize-member-queries branch from 540a99c to 558880c Compare August 4, 2017 16:57

Fix ordering for member queries

95eb17c

akariv force-pushed the feature/optimize-member-queries branch from 558880c to 95eb17c Compare August 4, 2017 17:01

akariv merged commit a250a8e into master Aug 4, 2017

akariv deleted the feature/optimize-member-queries branch August 4, 2017 17:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/optimize member queries #22

Feature/optimize member queries #22

akariv commented Aug 29, 2016

vitorbaptista commented Jul 13, 2017

akariv commented Jul 16, 2017

coveralls commented Aug 4, 2017 •

edited

Loading

vitorbaptista commented Aug 4, 2017 •

edited

Loading

akariv commented Aug 4, 2017 •

edited by vitorbaptista

Loading

vitorbaptista commented Aug 4, 2017 •

edited

Loading

coveralls commented Aug 4, 2017 •

edited

Loading

akariv commented Aug 4, 2017

coveralls commented Aug 4, 2017

coveralls commented Aug 4, 2017

vitorbaptista commented Aug 7, 2017

Feature/optimize member queries #22

Feature/optimize member queries #22

Conversation

akariv commented Aug 29, 2016

vitorbaptista commented Jul 13, 2017

akariv commented Jul 16, 2017

coveralls commented Aug 4, 2017 • edited Loading

vitorbaptista commented Aug 4, 2017 • edited Loading

akariv commented Aug 4, 2017 • edited by vitorbaptista Loading

vitorbaptista commented Aug 4, 2017 • edited Loading

coveralls commented Aug 4, 2017 • edited Loading

akariv commented Aug 4, 2017

coveralls commented Aug 4, 2017

coveralls commented Aug 4, 2017

vitorbaptista commented Aug 7, 2017

coveralls commented Aug 4, 2017 •

edited

Loading

vitorbaptista commented Aug 4, 2017 •

edited

Loading

akariv commented Aug 4, 2017 •

edited by vitorbaptista

Loading

vitorbaptista commented Aug 4, 2017 •

edited

Loading

coveralls commented Aug 4, 2017 •

edited

Loading