Lot of disk space wasted by unused indices #2762
Comments
Most welcome. I think it's especially important for the move to JSONB to take this list into account. I don't understand how this is automatically fixed: if you keep all the indices (within JSONB), then the problem just remains, doesn't it? And if you remove all indices, you also remove the ones that are used.
I think the main question is how big the index would be for a JSONB field (and whether it then becomes useful/used). Indeed this is something to test. @szoupanos, it would be good if before merging you could get a dump from @lekah (no need for the file repository, I guess), import it and run the migration script to 1) see if the migration script can run in a reasonable time and 2) check the index sizes and the speed of queries, where maybe @lekah can provide a few that are relevant for his research and potentially complex/slow.
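(A quick way to check whether a given query actually hits an index is PostgreSQL's EXPLAIN ANALYZE; the table and filter below are only illustrative and not taken from this thread:

EXPLAIN ANALYZE
SELECT id FROM db_dbnode
WHERE node_type = 'data.dict.Dict.';  -- illustrative filter
-- "Index Scan using <index> on db_dbnode" in the resulting plan means an
-- index was used; "Seq Scan on db_dbnode" means it was not.

)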
Note @giovannipizzi that if the repository is not there and the database contains …
@lekah One question: are you saying that on your database you ran queries that filtered e.g. for attributes of nodes etc. and these indices are still not used?
E.g. I just imported an aiida export file in 0.12; the total database is 420 MB, but there are 1370 MB of indices. That doesn't seem like a healthy choice.
I did not run specific queries, I just monitor during everyday usage. As for "If yes, could you let me know what I should fix in 0.12 to remove them": if you want to remove them for yourself, use DROP INDEX [index_name], see https://www.postgresql.org/docs/11/sql-dropindex.html
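(A minimal sketch of that manual cleanup; the index name below is hypothetical, and it is worth confirming via pg_stat_user_indexes that the index really is unused before dropping it:

-- check how often the index has been scanned since statistics were last reset
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname = 'db_dbattribute_key_idx';  -- hypothetical index name

-- drop it; CONCURRENTLY avoids a write lock on the table,
-- but must be run outside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS db_dbattribute_key_idx;

)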
@ltalirz Just after you are sure which indexes to remove (by dropping them manually), you will have to create the corresponding migration that drops them (I suppose that this is obvious, but OK..)
Results, this time from an aiida 1.0 DB:
here, "External size" is the total size minus the data size, i.e. essentially the indices.
P.S. For comparison, here also the output of the queries used above: …
Is there anything happening on this issue? This is not just a problem for the total size of the database (which seems to double), but also for speed: indices slow down write operations. So if importing or creating nodes takes a lot of time, the many indices AiiDA creates are probably the problem (and quite easy to fix).
@lekah You're very welcome to have a go at this!
@ltalirz very kind of you. I will try to join one of the next meetings to first discuss how to proceed and what the constraints are, if that's ok.
We have "coding day" right now - if you have any questions concerning this, feel free to ping Gio/Seb/me on slack |
Hey guys, in relation to this issue, I have started to look at profiling aiida-core performance. After speaking about this with @lekah, I think it might be good to have a simple way for developers/users to generate some pre-defined "telemetry" data that we can collate, to get a better understanding of how AiiDA databases are generally being used (including e.g. which indices are unused). I would envisage something like:

$ verdi database telemetry

which would (a) generate a JSON blob with relevant statistics for the database (and Python environment) and maybe (b) try to automatically send it to a server somewhere. Let me know what you think?
hi @chrisjsewell, this sounds like a good idea. This script works on aiida 0.x and 1.x; this type of backwards compatibility is not needed for your package, though.
P.S. Of course we already have people's replies, so you might also want to have a look at the statistics people sent in.
Thanks @ltalirz, I will take a look.
Maybe this should be a separate command (e.g. …)
Now that the Django backend has been dropped, it would be good to take a few (big) production databases and run the query provided in the OP. We can then see which indices are rarely being hit while occupying significant space. We should be careful to take databases that have been used in a way that represents, as well as possible, the variety of ways in which databases can be used. The fact that indices on …
@giovannipizzi @chrisjsewell maybe we should include this query as something to be run after a successful migration at the upcoming testing/coding day?
Yep sounds good; note there is now the verdi devel run-sql command, e.g.

$ verdi devel run-sql "select schemaname as table_schema, relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) as total_size, pg_size_pretty(pg_relation_size(relid)) as data_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc, pg_relation_size(relid) desc limit 10;"
('public', 'db_dbnode', '225 MB', '64 MB', '161 MB')
('public', 'db_dbgroup_dbnodes', '29 MB', '9480 kB', '20 MB')
('public', 'db_dblink', '23 MB', '10 MB', '13 MB')
('public', 'db_dbgroup', '144 kB', '8192 bytes', '136 kB')
('public', 'db_dbcomputer', '80 kB', '8192 bytes', '72 kB')
('public', 'db_dbuser', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dbsetting', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dblog', '64 kB', '0 bytes', '64 kB')
('public', 'db_dbcomment', '40 kB', '0 bytes', '40 kB')
('public', 'db_dbauthinfo', '40 kB', '0 bytes', '40 kB')
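(For a per-index breakdown of the "external size" shown above, a similar call against PostgreSQL's pg_stat_user_indexes view should work; this is a sketch, not taken from the thread:

$ verdi devel run-sql "select indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) as index_size, idx_scan from pg_catalog.pg_stat_user_indexes order by pg_relation_size(indexrelid) desc limit 10;"

)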
This would now be …
also note there are now the "schema reflection" regression files that clarify exactly what indexes are in the database, e.g. …
I've collected some statistics for my high-throughput project with AiiDA.
It's version 0.10.1, but some results could still be of interest.
Essentially, a few indices take a lot of space but are useless.
One example is the set of indices on the keys of the attributes, which uses 2x13 GB but is never scanned in this case.
The query I used is given below, and I attach index_usage.txt, the result returned by the DB.
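(The original query is not preserved in this transcript; a standard unused-index query along these lines, using PostgreSQL's pg_stat_user_indexes statistics view, produces the kind of per-index report described:

-- indices that have never been scanned since statistics were last reset,
-- largest first
SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_catalog.pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

)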