Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/optimize member queries #22

Merged
merged 2 commits into from
Aug 4, 2017
Merged

Conversation

akariv
Copy link
Member

@akariv akariv commented Aug 29, 2016

No description provided.

@vitorbaptista
Copy link
Contributor

@akariv Anything missing here apart from the broken build?

@akariv
Copy link
Member Author

akariv commented Jul 16, 2017

Well, this is a optimisation patch meant to tackle the fact that postgres distinct queries are very slow.
I never meant to merge this PR, as I don't like the solution here.
There's an alternate solution which consists of normalising the data as it's being uploaded which is still wip - I'll make sure it's pushed here as well.

@vitorbaptista vitorbaptista force-pushed the feature/optimize-member-queries branch from ca4c401 to 3b78074 Compare August 4, 2017 15:02
@coveralls
Copy link

coveralls commented Aug 4, 2017

Coverage Status

Coverage increased (+0.07%) to 94.73% when pulling 540a99c on feature/optimize-member-queries into fb8eddc on master.

@vitorbaptista
Copy link
Contributor

vitorbaptista commented Aug 4, 2017

@akariv The tests were failing because we were trying doing queries like

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY cap_or_cur.label DESC NULLS LAST;  -- This column wasn't part of the select

This raises the error:

ERROR:  column "cap_or_cur.label" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...FROM cap_or_cur GROUP BY cap_or_cur.code ORDER BY cap_or_cur...

The cap_or_cur schema is:

                               Table "public.cap_or_cur"
 Column |       Type        |                        Modifiers                         
--------+-------------------+----------------------------------------------------------
 _id    | integer           | not null default nextval('cap_or_cur__id_seq'::regclass)
 code   | character varying | 
 label  | character varying | 
Indexes:
    "cap_or_cur_pkey" PRIMARY KEY, btree (_id)

I expected that we would have a unique index between for code, but we don't. I'm assuming this is because we never added it, not because there could be repeated codes. If that's the case, my solution was to simply GROUP BY cap_or_cur.code, cap_or_cur.label. There will always be a single (code, label) pair anyway, so there's no need to use max(). All tests pass fine (see 540a99c).

If that's not the case, and there can be multiple cap_or_cur.code, then we'd have to somehow use the same aggregate functions in the ORDER BY, which can get messy quickly.

@akariv
Copy link
Member Author

akariv commented Aug 4, 2017

I think that if we make the query to look like

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY "cap_or_cur.label" DESC NULLS LAST; 

The point of this query is to retrieve only one row per code - just grouping by code and label might bring multiple instances for each code.

@vitorbaptista
Copy link
Contributor

vitorbaptista commented Aug 4, 2017

So, my aim is to change the query to:

SELECT *
FROM (
  SELECT
    cap_or_cur.code AS "cap_or_cur.code",
    max(cap_or_cur.label) AS "cap_or_cur.label"
  FROM cap_or_cur
  GROUP BY cap_or_cur.code
) AS cap_or_cur
ORDER BY "cap_or_cur.label" DESC NULLS LAST; 

This avoids having to add max() to the ORDER BY clause. My (failed) try was:

def members(self, ref, cuts=None, order=None, page=None, page_size=None):
    """ List all the distinct members of the given reference, filtered and
    paginated. If the reference describes a dimension, all attributes are
    returned. """
    q = select()
    bindings = []
    cuts, q, bindings = Cuts(self).apply(q, bindings, cuts)
    fields, q, bindings = Fields(self).apply(q, bindings, ref, distinct=True)
    q = select([q.alias('cap_or_cur')]) # <---- Subquery (This is the only line added)
    ordering, q, bindings = Ordering(self).apply(q, bindings, order)
    q = self.restrict_joins(q, bindings)
    count = count_results(self, q)

    # ...

This generates the query (it fails before pagination, so that wasn't added yet):

SELECT
  cap_or_cur."cap_or_cur.code",
  cap_or_cur."cap_or_cur.label" 
FROM (
  SELECT
    cap_or_cur.code AS "cap_or_cur.code",
    cap_or_cur.label AS "cap_or_cur.label"
  FROM cap_or_cur
  GROUP BY cap_or_cur.code, cap_or_cur.label
) AS cap_or_cur
ORDER BY cap_or_cur.label DESC NULLS LAST

Which fails because the actual colum codes are cap_or_cur.cap_or_cur.code. If this was manually written, I would simply remove the aliases in the subquery and be done with it. As it's not, and the Fields() code is used in many other places, I'm not sure what to do.

@akariv akariv force-pushed the feature/optimize-member-queries branch from 540a99c to 558880c Compare August 4, 2017 16:57
@coveralls
Copy link

coveralls commented Aug 4, 2017

Coverage Status

Coverage increased (+0.1%) to 94.785% when pulling 558880c on feature/optimize-member-queries into fb8eddc on master.

@akariv akariv force-pushed the feature/optimize-member-queries branch from 558880c to 95eb17c Compare August 4, 2017 17:01
@akariv
Copy link
Member Author

akariv commented Aug 4, 2017

@vitorbaptista I think I got it - take a look

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 94.785% when pulling 95eb17c on feature/optimize-member-queries into fb8eddc on master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 94.785% when pulling 95eb17c on feature/optimize-member-queries into fb8eddc on master.

@akariv akariv merged commit a250a8e into master Aug 4, 2017
@akariv akariv deleted the feature/optimize-member-queries branch August 4, 2017 17:04
@vitorbaptista
Copy link
Contributor

@akariv Cool! If I understood correctly, your patch makes the order use

Wow! Super interesting. The original query that was raising the error was:

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY cap_or_cur.label DESC NULLS LAST;

The query created now after you modified the code is:

SELECT
  cap_or_cur.code AS "cap_or_cur.code",
  max(cap_or_cur.label) AS "cap_or_cur.label"
FROM cap_or_cur
GROUP BY cap_or_cur.code
ORDER BY "cap_or_cur.label" DESC NULLS LAST;  -- The only difference are the double quotes here

The difference is just the double quotes in the ORDER BY. From what I understood, this is because an unquoted identifier has a limited set of valid characters, while quoted identifiers don't. For example, they can't have whitespaces, while quoted identifiers can. According to https://www.postgresql.org/docs/9.4/static/sql-syntax-lexical.html (section 4.1.1), they can't have a dot. So ORDER BY "cap_or_cur.label" is different from ORDER BY cap_or_cur.label. I'm not sure how the .label means in the second example, where cap_or_cur would be the identifier. However, the solution is to somehow make SQLAlchemy add quotes to the column in ORDER BY.

Your solution works, although I'm not sure I understand why 😅 I tried to understand how your changes resulted in this new SQL query, but couldn't. Regardless, all tests are passing and it worked with a few examples I tried locally, so LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants