[server] Faster query for reports #3316
Conversation
# separate query of file table results a query strategy for the
# previous query which doesn't contain such an inner loop. See
# EXPLAIN SELECT columns FROM ...
all_files = dict(session.query(File.id, File.filepath).all())
This is not efficient. Let's assume that my product contains 50000 files and I want to get reports in a given source file (a.cpp). So I set the file path filter to only this source file, but here we still fetch all the source files. In query_result we already have the file ids of the reports, so here we can get the file paths for only those file ids. We do the same above to get the details of the reports. We can also use the chunks util function which was introduced by you for this:
codechecker/codechecker_common/util.py (line 154 in 26538cf):
def chunks(iterator, n):
For example, if my database contains 50000 files and the query_result has 700 unique file ids, we will split them into two chunks/queries (at most 500 elements each) and the performance will be better.
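For reference, a minimal sketch of what a helper with this signature might look like; the actual implementation in codechecker_common/util.py may differ:

```python
from itertools import islice

def chunks(iterator, n):
    """Yield successive chunks of at most n items from the given iterator."""
    it = iter(iterator)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk
```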
I can't see how this chunks() could be used here. Could you please elaborate a little more? Thanks.
return and_(Report.bug_id.in_(query_base),
            Report.run_id.in_(query_base_runs))
return and_(diff_filter), necessary_tables
In your commit you mentioned that the following queries are the same:
SELECT col1, col2 FROM reports WHERE col IN (SELECT col FROM reports WHERE conditions);
SELECT col1, col2 FROM reports WHERE conditions;
This is not a true statement. In the first query col is bug_id in our case, and multiple reports can have the same bug_id.
Let's assume that you stored a run yesterday (myrun1) and then you store the same result set to a different run (myrun2). So myrun1 and myrun2 contain the same results but with different detection dates. Now if I set the outstanding report date filter to yesterday midnight and don't set any other filters, the first query will return all the results from both myrun1 and myrun2 (because of the shared bug hashes), while the second query will return the results only from the first run.
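To make the difference concrete, here is a small self-contained demonstration with a hypothetical schema (table and column names are illustrative, not CodeChecker's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (run_id INT, bug_id TEXT, detection_date TEXT)")
conn.executemany("INSERT INTO reports VALUES (?, ?, ?)", [
    (1, "deadbeef", "2021-05-01"),  # myrun1, stored yesterday
    (2, "deadbeef", "2021-05-02"),  # myrun2, same result stored today
])

# First form: the subquery matches only myrun1's row, but the outer IN on
# bug_id also pulls in myrun2's report, because it shares the same hash.
print(conn.execute(
    "SELECT run_id FROM reports WHERE bug_id IN "
    "(SELECT bug_id FROM reports WHERE detection_date < '2021-05-02')"
).fetchall())  # [(1,), (2,)]

# Second form: the condition is applied directly, so only myrun1 matches.
print(conn.execute(
    "SELECT run_id FROM reports WHERE detection_date < '2021-05-02'"
).fetchall())  # [(1,)]
```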
I think the new behaviour is good, but it should be mentioned in the commit message that we changed this behaviour and that the two queries are not identical.
I extended the commit message.
@@ -788,18 +820,21 @@ def create_count_expression(report_filter):
     return func.count(literal_column('*'))


-def apply_report_filter(q, filter_expression):
+def apply_report_filter(q, filter_expression, necessary_tables):
For me it is not obvious what the variable name necessary_tables means. Maybe join_tables would be more expressive. What do you think?
Done.
sort_types, sort_type_map, order_type_map = \
    get_sort_map(sort_types)
Report.path_length, Report.analyzer_name)
What is strange for me is that this query on a big psql database takes 10 seconds to execute, but if I execute the raw query directly in the database it takes 3 seconds, so the raw query is 3-4x faster than this one. I thought the reason was that SQLAlchemy needs to create objects from the result set, but even if I limit the query to return only 1 result the slowness is the same. Maybe the reason is that we have "too many" columns in the SELECT statement?
This is just a side comment, but it would be good if we could increase the performance of executing the query through the ORM.
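One way to investigate (a sketch assuming a SQLAlchemy Query object q and a PostgreSQL backend) is to print the compiled SQL so it can be EXPLAIN ANALYZE'd by hand, separating ORM overhead from the query plan itself:

```python
from sqlalchemy.dialects import postgresql

# Compile the ORM query to a literal SQL string for the psql dialect.
compiled = q.statement.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={"literal_binds": True})
print(compiled)  # paste into psql as: EXPLAIN ANALYZE <this query>;
```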
file_ids = set(map(lambda r: r[2], query_result))
all_files = dict(session.query(File.id, File.filepath)
                 .filter(File.id.in_(file_ids)).all())
In my previous comment I mentioned using our chunks util function because the number of variables that can appear inside an IN statement is limited in SQLite (default: SQLITE_MAX_VARIABLE_NUMBER=999, https://www.sqlite.org/limits.html#max_variable_number). So if your database contains more files than that, this query will fail.
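A minimal sketch of a chunked version of the query above, assuming the chunks() helper yields groups of at most n items (the chunk size of 500 is illustrative):

```python
all_files = {}
file_ids = set(map(lambda r: r[2], query_result))
# Fetch at most 500 file ids per IN clause to stay well below
# SQLITE_MAX_VARIABLE_NUMBER.
for id_chunk in chunks(iter(file_ids), 500):
    all_files.update(
        session.query(File.id, File.filepath)
               .filter(File.id.in_(id_chunk))
               .all())
```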
The commit message:

The query of reports has the following format:

```sql
SELECT col1, col2 FROM reports WHERE col IN (SELECT col FROM reports WHERE conditions);
```

This is almost the same as

```sql
SELECT col1, col2 FROM reports WHERE conditions;
```

The difference comes when the reports of two runs on different dates are queried. The first version returns all reports from all runs of which the bug hash has been selected by the filter; the second one selects only those reports which match the condition.

SQL queries are translated to query strategies by PostgreSQL. This strategy determines the algorithm for gathering the resulting records from the database. The first query generates a strategy where the inner select is implemented with a nested loop. Such a nested loop on the "reports" table is so slow that it times out. In this commit the query is rewritten to the second form.

But even if the second form is used, PostgreSQL may generate a strategy with an inner loop in some cases. The generated strategy depends not only on the query but on table size and many other things too. So it is possible that this timeout issue will come back later. One technique for minimizing nested loops is not using table JOINs if possible. For this reason a query has been split into two queries in order to avoid joining the "files" table. The results of the two queries are merged in Python code.

In order to avoid unnecessary table joins, we join only those tables which occur in some query condition. Earlier there were some unnecessary joins; for example, the columns of "files" are not used anywhere in this query:

```sql
SELECT col1, col2 FROM reports LEFT JOIN files ON ... WHERE col1 = 42;
```
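A minimal sketch of the conditional-join idea from the last paragraph, with a hypothetical join condition (File and Report are the SQLAlchemy models used above; the real apply_report_filter may differ):

```python
def apply_report_filter(q, filter_expression, join_tables):
    """Join only the tables that the current filter actually references."""
    # Hypothetical: join "files" only when a file path filter is present,
    # so unfiltered report queries avoid the LEFT JOIN entirely.
    if File in join_tables:
        q = q.outerjoin(File, File.id == Report.file_id)
    return q.filter(filter_expression)
```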
LGTM!