[server] Faster query for reports
The report query has the following form:

```.sql
SELECT col1, col2 FROM reports WHERE col IN
(SELECT col FROM reports WHERE conditions);
```

This is almost the same as

```.sql
SELECT col1, col2 FROM reports WHERE conditions;
```

The difference shows up when the reports of two runs on different dates
are queried. The first form returns every report, from any run, whose
bug hash was selected by the filter. The second form returns only the
reports that match the conditions themselves.
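
The difference can be reproduced on a toy table. The following is a
minimal sketch using an in-memory SQLite database instead of
PostgreSQL; the column names (run_id, bug_hash, detected_on) and the
date condition are assumptions made for illustration, not the real
schema.

```python
import sqlite3

# Hypothetical, simplified "reports" table; not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (id INTEGER PRIMARY KEY, run_id INTEGER,
                          bug_hash TEXT, detected_on TEXT);
    INSERT INTO reports (run_id, bug_hash, detected_on) VALUES
        (1, 'abc', '2021-05-01'),
        (2, 'abc', '2021-05-10'),
        (2, 'def', '2021-05-10');
""")

# Illustrative filter: only reports detected on or after this date.
condition = "detected_on >= '2021-05-05'"

# First form: the filter selects bug hashes, then every report with
# one of those hashes is returned, including the one from run 1.
first_form = conn.execute(
    "SELECT run_id, bug_hash FROM reports WHERE bug_hash IN "
    f"(SELECT bug_hash FROM reports WHERE {condition}) ORDER BY id"
).fetchall()

# Second form: only the reports that match the filter are returned.
second_form = conn.execute(
    f"SELECT run_id, bug_hash FROM reports WHERE {condition} ORDER BY id"
).fetchall()

print(first_form)   # three rows: the 'abc' report of run 1 is included
print(second_form)  # two rows: only the reports of run 2
```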

PostgreSQL translates each SQL query into a query strategy (an
execution plan). This strategy determines the algorithm used to gather
the result rows from the database. For the first form, PostgreSQL
generates a strategy in which the inner select is implemented as a
nested loop. Such a nested loop over the "reports" table is so slow
that the query times out. This commit rewrites the query to the second
form.
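
In PostgreSQL the chosen strategy can be inspected with EXPLAIN. As a
sketch that runs without a server, the snippet below uses SQLite's
EXPLAIN QUERY PLAN as a stand-in. This is an assumption made for
illustration only: SQLite's planner is not PostgreSQL's, so the
printed plans will not match what PostgreSQL generates. The table
layout and the run_id condition are hypothetical as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not the real schema.
conn.execute(
    "CREATE TABLE reports (id INTEGER PRIMARY KEY, bug_hash TEXT, "
    "run_id INTEGER)"
)


def plan(sql):
    """Return the planner's step descriptions for a query (SQLite only)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable description.
    return [row[-1] for row in rows]


first_form_plan = plan(
    "SELECT id FROM reports WHERE bug_hash IN "
    "(SELECT bug_hash FROM reports WHERE run_id = 2)"
)
second_form_plan = plan("SELECT id FROM reports WHERE run_id = 2")

for step in first_form_plan:
    print("first form: ", step)
for step in second_form_plan:
    print("second form:", step)
```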

Even with the second form, PostgreSQL may still generate a strategy
with a nested loop in some cases. The generated strategy depends not
only on the query but also on table sizes and many other factors, so
the timeout issue may come back later. One technique for minimizing
nested loops is to avoid table JOINs where possible. For this reason
one query has been split into two in order to avoid joining the
"files" table; the results of the two queries are merged in Python
code. To avoid unnecessary joins in general, we now join only those
tables that occur in some query condition. Earlier there were some
unnecessary joins. For example, the columns of "files" are not used
anywhere in this query:

```.sql
SELECT col1, col2 FROM reports LEFT JOIN files ON ... WHERE col1 = 42;
```
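
The split-and-merge idea can be sketched as follows. This is not the
code from this commit, only an illustration under assumptions: it uses
an in-memory SQLite database, and the columns (file_id, filepath,
bug_hash) are hypothetical. The first query reads "reports" without a
join, the second fetches only the referenced rows of "files", and the
two results are merged in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables; not the real schema.
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, filepath TEXT);
    CREATE TABLE reports (id INTEGER PRIMARY KEY, file_id INTEGER,
                          bug_hash TEXT);
    INSERT INTO files (id, filepath) VALUES (1, 'a.c'), (2, 'b.c');
    INSERT INTO reports (id, file_id, bug_hash) VALUES
        (10, 1, 'abc'), (11, 2, 'def'), (12, 1, 'abc');
""")

# Query 1: filter the reports without joining "files".
reports = conn.execute(
    "SELECT id, file_id, bug_hash FROM reports "
    "WHERE bug_hash = ? ORDER BY id",
    ("abc",),
).fetchall()

# Query 2: fetch only the files referenced by the selected reports.
file_ids = sorted({file_id for _, file_id, _ in reports})
paths = {}
if file_ids:
    placeholders = ", ".join("?" for _ in file_ids)
    paths = dict(conn.execute(
        f"SELECT id, filepath FROM files WHERE id IN ({placeholders})",
        file_ids,
    ).fetchall())

# Merge in Python: attach the file path to each report row.
merged = [
    (report_id, bug_hash, paths.get(file_id))
    for report_id, file_id, bug_hash in reports
]
print(merged)
```

The trade-off is one extra round trip to the database in exchange for
two simple queries whose plans are less likely to involve a nested
loop.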
bruntib committed May 19, 2021
1 parent 46cca7e commit 5f2ac9b
Showing 1 changed file with 127 additions and 87 deletions.