[server] Faster query for reports #3316
Conversation
# separate query of file table results a query strategy for the
# previous query which doesn't contain such an inner loop. See
# EXPLAIN SELECT columns FROM ...
all_files = dict(session.query(File.id, File.filepath).all())
This is not efficient. Let's assume that my product contains 50000 files and I want to get reports in a given source file (a.cpp). So I set the file path filter to only this source file, but here we still fetch all the source files. In query_result we already have the file ids of the reports, so here we can get the file paths for only those file ids. We do the same above to get the details of the reports. We can also use the chunks util function which was introduced by you for this:
codechecker/codechecker_common/util.py (line 154 in 26538cf):
def chunks(iterator, n):
For example, if my database contains 50000 files and the query_result has 700 unique file ids, we will split them into two chunks/queries (at most 500 elements each) and the performance will be better.
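For reference, a minimal sketch of what a helper with this signature might look like; the actual implementation in codechecker_common/util.py may differ:

```python
from itertools import islice

def chunks(iterator, n):
    """Yield successive chunks of at most n items from the given iterator."""
    it = iter(iterator)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk
```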
I can't see how this chunks() could be used here. Could you please elaborate a little more? Thanks.
return and_(Report.bug_id.in_(query_base),
            Report.run_id.in_(query_base_runs))
return and_(diff_filter), necessary_tables
In your commit you mentioned that the following queries are the same:
SELECT col1, col2 FROM reports WHERE col IN (SELECT col FROM reports WHERE conditions);
SELECT col1, col2 FROM reports WHERE conditions;
This is not a true statement. In the first query col is bug_id in our case, and multiple reports can have the same bug_id.
Let's assume that you stored a run yesterday (myrun1) and then you store the same result set to a different run (myrun2). So myrun1 and myrun2 contain the same results but with different detection dates. Now if I set the outstanding report date filter to yesterday midnight and don't set any other filters, the first query will return all the results from both myrun1 and myrun2 (because of the shared bug hashes), while the second query will return the results only from the first run.
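To make the difference concrete, here is a small self-contained demonstration with a hypothetical schema (table and column names are illustrative, not CodeChecker's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (run_id INT, bug_id TEXT, detection_date TEXT)")
conn.executemany("INSERT INTO reports VALUES (?, ?, ?)", [
    (1, "deadbeef", "2021-05-01"),  # myrun1, stored yesterday
    (2, "deadbeef", "2021-05-02"),  # myrun2, same result stored today
])

# First form: the subquery matches only myrun1's row, but the outer IN on
# bug_id also pulls in myrun2's report, because it shares the same hash.
print(conn.execute(
    "SELECT run_id FROM reports WHERE bug_id IN "
    "(SELECT bug_id FROM reports WHERE detection_date < '2021-05-02')"
).fetchall())  # [(1,), (2,)]

# Second form: the condition is applied directly, so only myrun1 matches.
print(conn.execute(
    "SELECT run_id FROM reports WHERE detection_date < '2021-05-02'"
).fetchall())  # [(1,)]
```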
I think the new behaviour is good, but it should be mentioned in the commit message that we changed this behaviour and that the two queries are not identical.
I extended the commit message.
@@ -788,18 +820,21 @@ def create_count_expression(report_filter):
     return func.count(literal_column('*'))


-def apply_report_filter(q, filter_expression):
+def apply_report_filter(q, filter_expression, necessary_tables):
For me it is not obvious what the variable name necessary_tables means. Maybe join_tables would be more expressive. What do you think?
Done.
sort_types, sort_type_map, order_type_map = \
    get_sort_map(sort_types)
Report.path_length, Report.analyzer_name)
What is strange for me is that this query on a big psql database takes 10 seconds to execute, but if I execute the raw query directly in the database it takes 3 seconds, so the raw query is 3-4x faster than this one. I thought the reason was that SQLAlchemy needs to create objects from the result set, but even if I limit the query to return only 1 result the slowness is the same. Maybe the reason is that we have "too many" columns in the SELECT statement?
This is just a side comment, but it would be good if we could increase the performance of executing the query through the ORM.
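One way to investigate (a sketch assuming a SQLAlchemy Query object q and a PostgreSQL backend) is to print the compiled SQL so it can be EXPLAIN ANALYZE'd by hand, separating ORM overhead from the query plan itself:

```python
from sqlalchemy.dialects import postgresql

# Compile the ORM query to a literal SQL string for the psql dialect.
compiled = q.statement.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={"literal_binds": True})
print(compiled)  # paste into psql as: EXPLAIN ANALYZE <this query>;
```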
file_ids = set(map(lambda r: r[2], query_result))
all_files = dict(session.query(File.id, File.filepath)
                 .filter(File.id.in_(file_ids)).all())
In my previous comment I mentioned using our chunks util function because the number of variables that can appear inside an IN statement is limited in SQLite (default: SQLITE_MAX_VARIABLE_NUMBER=999, https://www.sqlite.org/limits.html#max_variable_number). So if your database contains more files than that, this query will fail.
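A minimal sketch of a chunked version of the query above, assuming the chunks() helper yields groups of at most n items (the chunk size of 500 is illustrative):

```python
all_files = {}
file_ids = set(map(lambda r: r[2], query_result))
# Fetch at most 500 file ids per IN clause to stay well below
# SQLITE_MAX_VARIABLE_NUMBER.
for id_chunk in chunks(iter(file_ids), 500):
    all_files.update(
        session.query(File.id, File.filepath)
               .filter(File.id.in_(id_chunk))
               .all())
```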
The commit message:

The query of reports has the following format:

```sql
SELECT col1, col2 FROM reports WHERE col IN (SELECT col FROM reports WHERE conditions);
```

This is almost the same as

```sql
SELECT col1, col2 FROM reports WHERE conditions;
```

The difference comes when the reports of two runs on different dates are queried. The first version returns all reports from all runs of which the bug hash has been selected by the filter; the second one selects only those reports which match the condition.

SQL queries are translated to query strategies by PostgreSQL. This strategy determines the algorithm for gathering the resulting records from the database. The first query generates a strategy where the inner select is implemented with a nested loop. Such a nested loop on the "reports" table is so slow that it times out. In this commit the query is rewritten to the second form.

But even if the second form is used, PostgreSQL may generate a strategy with an inner loop in some cases. The generated strategy depends not only on the query but on table size and many other things too. So it is possible that this timeout issue will come back later. One technique for minimizing nested loops is not using table JOINs if possible. For this reason a query has been split into two queries in order to avoid joining the "files" table. The results of the two queries are merged in Python code.

In order to avoid unnecessary table joins, we join only those tables which occur in some query condition. Earlier there were some unnecessary joins; for example, the columns of "files" are not used anywhere in this query:

```sql
SELECT col1, col2 FROM reports LEFT JOIN files ON ... WHERE col1 = 42;
```
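A minimal sketch of the conditional-join idea from the last paragraph, with a hypothetical join condition (File and Report are the SQLAlchemy models used above; the real apply_report_filter may differ):

```python
def apply_report_filter(q, filter_expression, join_tables):
    """Join only the tables that the current filter actually references."""
    # Hypothetical: join "files" only when a file path filter is present,
    # so unfiltered report queries avoid the LEFT JOIN entirely.
    if File in join_tables:
        q = q.outerjoin(File, File.id == Report.file_id)
    return q.filter(filter_expression)
```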
LGTM!