
[server] Faster query for reports #3316

Merged (1 commit, May 20, 2021)
Conversation

@bruntib (Contributor) commented May 17, 2021

The query of reports has the following format:

SELECT col1, col2 FROM reports WHERE col IN
(SELECT col FROM reports WHERE conditions);

This is the same(?) as

SELECT col1, col2 FROM reports WHERE conditions;

SQL queries are translated to query strategies by PostgreSQL. This
strategy determines the algorithm for gathering the resulting records
from the database. The first query generates a strategy where the inner
select is implemented with a nested loop. Such a nested loop on the
"reports" table is so slow that it times out. In this commit the query
is rewritten to the second form.
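The chosen strategy can be inspected with EXPLAIN. Below is a minimal sketch using SQLite's in-memory engine as a stand-in (the deployment in question is PostgreSQL, whose planner and EXPLAIN output differ; the schema here is simplified and hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (bug_id TEXT, run_id INTEGER)")
conn.execute("INSERT INTO reports VALUES ('hash1', 1)")

# Ask the engine how it would execute the nested form. On PostgreSQL the
# analogous command is EXPLAIN (or EXPLAIN ANALYZE) on the same statement.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT bug_id, run_id FROM reports WHERE bug_id IN "
    "(SELECT bug_id FROM reports WHERE run_id = 1)").fetchall()
for row in plan:
    print(row)
```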

But even if the second form is used, PostgreSQL may generate a strategy
with an inner loop in some cases. The generated strategy depends not
only on the query but also on table size and many other things. So it
is possible that this timeout issue will come back later. One technique
for minimizing nested loops is avoiding table JOINs where possible. For
this reason the query has been split into two queries in order to avoid
joining the "files" table. The results of the two queries are merged in
Python code. To avoid unnecessary table joins, we join only those
tables which occur in some query condition. Earlier there were some
unnecessary joins; for example, the columns of the "files" table are
not used anywhere in the query:

SELECT col1, col2 FROM reports LEFT JOIN files ON ... WHERE col1 = 42;

# Querying the file table separately results in a query strategy for the
# previous query which doesn't contain such an inner loop. See
# EXPLAIN SELECT columns FROM ...
all_files = dict(session.query(File.id, File.filepath).all())
Contributor:

This is not efficient. Let's assume that my product contains 50000 files and I want to get the reports in a given source file (a.cpp). So I set the file path filter to this single source file, but here we will fetch all the source files.

In query_result we already know which file ids occur, so here we can fetch file paths for only those file ids. We do the same above to get the details for the reports. We can also use the chunks util function which was introduced by you for this:

def chunks(iterator, n):

For example, if my database contains 50000 files and the query_result has 700 unique file ids, we will split it into two chunks/queries (each with at most 500 elements) and the performance will be better.
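For reference, a minimal sketch of how such a chunks helper can work (this re-implementation is illustrative; the project's actual util function may differ):

```python
from itertools import islice

def chunks(iterator, n):
    """Yield successive lists of at most n items from an iterable."""
    it = iter(iterator)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# 700 unique file ids, chunk size 500 -> two batches of 500 and 200,
# i.e. two IN queries instead of one oversized one.
file_ids = range(1, 701)
batches = list(chunks(file_ids, 500))
# → len(batches) == 2, sizes 500 and 200
```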

Contributor Author:

I can't see how this chunks() could be used here. Could you please elaborate a little more? Thanks.


return and_(Report.bug_id.in_(query_base),
Report.run_id.in_(query_base_runs))
return and_(diff_filter), necessary_tables
Contributor:

In your commit you mentioned that the following queries are the same:

SELECT col1, col2 FROM reports WHERE col IN (SELECT col FROM reports WHERE conditions);

SELECT col1, col2 FROM reports WHERE conditions;

This is not a true statement. In the first query, col is bug_id in our case, and multiple reports can have the same bug_id.

Let's assume that you stored a run yesterday (myrun1) and today you store the same result set as a different run (myrun2). So myrun1 and myrun2 contain the same results but with different detection dates. Now if I set the outstanding report date filter to yesterday midnight and don't set any other filters, the first query will return all the results from both myrun1 and myrun2 (because of the hashes). In the second case (second query) it will return the results only from the first run.

I think the new behaviour is good, but it should be mentioned in the commit message that we changed this behaviour and that the two queries are not identical.
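The behavioural difference can be reproduced with a tiny in-memory example (sqlite3 is used here only for illustration; the table layout is simplified and hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (bug_id TEXT, run_id INTEGER, detected TEXT)")
conn.executemany("INSERT INTO reports VALUES (?, ?, ?)",
                 [("hash1", 1, "2021-05-16"),   # myrun1, stored yesterday
                  ("hash1", 2, "2021-05-17")])  # myrun2, same result, stored today

# Nested form: the inner select picks the hash, the outer query then
# returns every report carrying that hash -- both runs match.
nested = conn.execute(
    "SELECT bug_id, run_id FROM reports WHERE bug_id IN "
    "(SELECT bug_id FROM reports WHERE detected < '2021-05-17')").fetchall()

# Flat form: the date condition applies to the returned rows directly --
# only the report stored yesterday matches.
flat = conn.execute(
    "SELECT bug_id, run_id FROM reports "
    "WHERE detected < '2021-05-17'").fetchall()

print(len(nested), len(flat))  # → 2 1
```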

Contributor Author:

I extended the commit message.

@@ -788,18 +820,21 @@ def create_count_expression(report_filter):
return func.count(literal_column('*'))


def apply_report_filter(q, filter_expression):
def apply_report_filter(q, filter_expression, necessary_tables):
Contributor:

For me it is not obvious what the variable name necessary_tables means. Maybe join_tables would be more expressive. What do you think?

Contributor Author:

Done.


sort_types, sort_type_map, order_type_map = \
get_sort_map(sort_types)
Report.path_length, Report.analyzer_name)
Contributor:

What is strange for me is that this query on a big psql database takes 10 sec to execute, but if I execute the raw query directly in the database it takes 3 sec. So the raw query is 3-4x faster than this one. I thought the reason behind this was that sqlalchemy needs to create objects from the result set, but if I limit the query to return only 1 result the slowness is the same. Maybe the reason is that the SELECT statement has "too many" columns?

This is just a side comment, but it would be good if we could increase the performance of executing queries through the ORM.

Comment on lines 1513 to 1515
file_ids = set(map(lambda r: r[2], query_result))
all_files = dict(session.query(File.id, File.filepath)
.filter(File.id.in_(file_ids)).all())
Contributor:

In my previous comment I suggested using our chunks util function because the number of variables inside an IN statement is limited in SQLite (default: SQLITE_MAX_VARIABLE_NUMBER=999, https://www.sqlite.org/limits.html#max_variable_number). So if your database contains more files, this query will fail.
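A sketch of staying under that limit by batching the IN list (function name, schema and chunk size are illustrative; real code would reuse the project's chunks util):

```python
import sqlite3
from itertools import islice

def query_paths_in_chunks(conn, file_ids, n=500):
    """Fetch an id -> filepath mapping, binding at most n variables
    per IN clause (below SQLite's default limit of 999)."""
    result = {}
    ids = iter(file_ids)
    while True:
        batch = list(islice(ids, n))
        if not batch:
            return result
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT id, filepath FROM files WHERE id IN (%s)" % placeholders,
            batch).fetchall()
        result.update(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, filepath TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [(i, "src/f%d.cpp" % i) for i in range(1, 1201)])
paths = query_paths_in_chunks(conn, range(1, 1201))
print(len(paths))  # → 1200
```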

The query of reports has the following format:

```.sql
SELECT col1, col2 FROM reports WHERE col IN
(SELECT col FROM reports WHERE conditions);
```

This is almost the same as

```.sql
SELECT col1, col2 FROM reports WHERE conditions;
```

The difference shows up when the reports of two runs stored on
different dates are queried. The first version returns all reports,
from every run, whose bug hash has been selected by the filter; the
second one selects only those reports which match the condition
directly.

SQL queries are translated to query strategies by PostgreSQL. This
strategy determines the algorithm for gathering the resulting records
from the database. The first query generates a strategy where the inner
select is implemented with a nested loop. Such a nested loop on the
"reports" table is so slow that it times out. In this commit the query
is rewritten to the second form.

But even if the second form is used, PostgreSQL may generate a strategy
with an inner loop in some cases. The generated strategy depends not
only on the query but also on table size and many other things. So it
is possible that this timeout issue will come back later. One technique
for minimizing nested loops is avoiding table JOINs where possible. For
this reason the query has been split into two queries in order to avoid
joining the "files" table. The results of the two queries are merged in
Python code. To avoid unnecessary table joins, we join only those
tables which occur in some query condition. Earlier there were some
unnecessary joins; for example, the columns of the "files" table are
not used anywhere in the query:

```.sql
SELECT col1, col2 FROM reports LEFT JOIN files ON ... WHERE col1 = 42;
```
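The merge of the two query results in Python then reduces to a dictionary lookup; a minimal sketch with hypothetical row shapes:

```python
# query_result rows: (bug_id, run_id, file_id); all_files maps id -> filepath.
# Both names are placeholders standing for the two separate query results.
query_result = [("hash1", 1, 10), ("hash2", 1, 11)]
all_files = {10: "src/a.cpp", 11: "src/b.cpp"}

# Attach the file path to each report row without any SQL JOIN.
reports = [(bug_id, run_id, all_files[file_id])
           for bug_id, run_id, file_id in query_result]
print(reports)
```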
@csordasmarton (Contributor) left a comment:

LGTM!

@csordasmarton csordasmarton merged commit 176987e into Ericsson:master May 20, 2021
@bruntib bruntib deleted the fast_query branch May 20, 2021 12:49