
Slow queries against table 'extras_objectchange' #3392

Closed
ir0n-curta1n opened this issue Aug 5, 2019 · 2 comments
Labels
  • status: accepted (This issue has been accepted for implementation)
  • type: bug (A confirmed report of unexpected behavior in the application)

Comments

@ir0n-curta1n

Environment

  • Python version: 3.6
  • NetBox version: 2.6.1

Steps to Reproduce

  1. Populate table 'extras_objectchange' with a large number of rows. In my case it was one million or so; the point at which the issue becomes really noticeable will vary with disk type, the amount/type of memory, and CPUs.
  2. Send multiple read/update requests in parallel to the NetBox web API via pynetbox. In my case, this was a script reading from our old DCIM source and pushing that data to the NetBox web API concurrently with asyncio (a rough sketch follows this list).
  3. Enable PostgreSQL statement logging and notice long-running statements against table 'extras_objectchange'.
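
For illustration, here is a minimal sketch of the kind of parallel write load described in step 2. The NetBox URL, API token, and device names are placeholders, and since pynetbox itself is synchronous, the concurrency in this sketch comes from a thread pool rather than asyncio:

# Sketch only: URL, token, and device names are placeholders.
from concurrent.futures import ThreadPoolExecutor

import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")

def sync_device(name):
    # Fetch a device by name and update it if it already exists;
    # every change made here also produces a row in extras_objectchange.
    device = nb.dcim.devices.get(name=name)
    if device:
        device.comments = "updated by migration script"
        device.save()

device_names = [f"migrated-device-{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(sync_device, device_names))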

Expected Behavior

I expected things to function without issues.

Observed Behavior

I believe the web API issues SQL statements against table 'extras_objectchange' filtered on column 'time' (a column with no index), creating a bottleneck due to repeated sequential scans of that table. Table 'extras_objectchange' needs an index on column 'time'. Here's the long version of what was observed:

My company is migrating off a DCIM product built completely in-house. As one of the steps in the migration, we've written Python scripts that read from our old DCIM tool and populate NetBox with the fetched data. The migration script uses pynetbox to send requests to the NetBox web API. Since we haven't gotten the migrated data just right on the first, second, or n'th try, we've produced a large number of rows in table 'extras_objectchange'. The CHANGELOG_RETENTION value is kept at the default of 90 days.

We noticed that at a certain point the migration script started running much slower. On investigation, gunicorn workers were queueing requests, consuming memory, and eventually causing swapping. Enabling PostgreSQL statement logging showed a large number of long-running statements against table 'extras_objectchange', such as:
SELECT "extras_objectchange"."id", "extras_objectchange"."time", "extras_objectchange"."user_id", "extras_objectchange"."user_name", "extras_objectchange"."request_id", "extras_objectchange"."action", "extras_objectchange"."changed_object_type_id", "extras_objectchange"."changed_object_id", "extras_objectchange"."related_object_type_id", "extras_objectchange"."related_object_id", "extras_objectchange"."object_repr", "extras_objectchange"."object_data" FROM "extras_objectchange" WHERE "extras_objectchange"."time" < '2019-04-26T14:44:52.229708+00:00'::timestamptz;

The static timestamp value used in the above statement was exactly 90 days before the time of execution (our retention period, which is the default). Although I haven't traced these statements to a definite source, I don't see any PostgreSQL table triggers that would produce them, so I assume they are generated internally by NetBox. In any case, we see a lot of these statements when running our sync script. The problem is that their WHERE clause compares the 'time' column to a static value, and there appears to be no index on the 'time' column of table 'extras_objectchange'. Combined with the number of rows we have, this makes for really poor performance: we're sequentially scanning a 1,000,000-row table.

explain SELECT "extras_objectchange"."id", "extras_objectchange"."time", "extras_objectchange"."user_id", "extras_objectchange"."user_name", "extras_objectchange"."request_id", "extras_objectchange"."action", "extras_objectchange"."changed_object_type_id", "extras_objectchange"."changed_object_id", "extras_objectchange"."related_object_type_id", "extras_objectchange"."related_object_id", "extras_objectchange"."object_repr", "extras_objectchange"."object_data" FROM "extras_objectchange" WHERE "extras_objectchange"."time" < '2019-04-26T14:44:52.229708+00:00'::timestamptz;
                                   QUERY PLAN     
--------------------------------------------------------------------------------
 Seq Scan on extras_objectchange  (cost=0.00..130200.30 rows=948049 width=364)
   Filter: ("time" < '2019-04-26 10:44:52.229708-04'::timestamp with time zone)
(2 rows)
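
As a side note, a quick way to confirm which indexes actually exist on the table is to query PostgreSQL's pg_indexes view; here's a small sketch assuming access to a Django shell in the NetBox environment (e.g. manage.py shell):

# Sketch: list the indexes defined on extras_objectchange.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "SELECT indexname, indexdef FROM pg_indexes "
        "WHERE tablename = 'extras_objectchange';"
    )
    for indexname, indexdef in cursor.fetchall():
        print(indexname, indexdef)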

Thanks for all the hard work creating and maintaining an awesome DCIM/IPAM product.


jeremystretch commented Aug 28, 2019

The static timestamp value used in the above statement was exactly 90 days before the time of execution (our retention period, which is the default). Although I haven't traced these statements to a definite source, I don't see any PostgreSQL table triggers that would produce them, so I assume they are generated internally by NetBox.

This is probably the auto-purge rule triggering. Every time the change logging middleware runs, there's a 1% chance that NetBox will purge expired Changelog entries. Reviewing the code, it occurs to me that this routine is being called far more often than it should be. The 1% chance is applied for every request rather than only for requests which effect changes. So that's one problem, and I've opened #3459 for that to be addressed separately.

Edit: Disregard. This only runs if a change has been logged.
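
To illustrate the behavior being described (this is a hypothetical sketch, not NetBox's actual middleware code): once a change has been logged, there is roughly a 1-in-100 chance that expired changelog entries are also purged, and the resulting delete filters on ObjectChange.time:

# Hypothetical sketch of the purge behavior described above; the function
# name is made up, while ObjectChange and CHANGELOG_RETENTION come from the
# discussion.
import random
from datetime import timedelta

from django.conf import settings
from django.utils import timezone

from extras.models import ObjectChange

def maybe_purge_changelog():
    # Roughly a 1-in-100 chance per logged change.
    if random.randint(1, 100) > 1:
        return
    cutoff = timezone.now() - timedelta(days=settings.CHANGELOG_RETENTION)
    # Django's delete() typically selects the matching rows first, which
    # would match the unindexed "time" < <timestamp> query in the Postgres logs.
    ObjectChange.objects.filter(time__lt=cutoff).delete()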

If you had asked me whether Django DateTimeFields get an index created automatically, I would have said yes, but it appears that's not the case. We'll need to add an index for the field manually.
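
For reference, here is a minimal sketch of what such a fix might look like, assuming it is done by adding db_index=True to the model field (the other field arguments here are illustrative, not NetBox's actual definition):

# Sketch only: adding db_index=True so the changelog retention query can
# use an index instead of a sequential scan.
from django.db import models

class ObjectChange(models.Model):
    time = models.DateTimeField(auto_now_add=True, editable=False, db_index=True)
    # ... remaining fields unchanged ...

Running manage.py makemigrations would then generate an AlterField migration that creates the index on the existing table.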

jeremystretch added the status: accepted and type: bug labels on Aug 28, 2019
@ir0n-curta1n (Author)

Awesome. Thank you for fixing this.

lock bot locked as resolved and limited conversation to collaborators on Jan 15, 2020