Lot of disk space wasted by unused indices #2762

Open
lekah opened this issue Apr 16, 2019 · 21 comments

Comments

@lekah
Contributor

lekah commented Apr 16, 2019

I've collected some statistics for my high-throughput project with AiiDA.
It's version 0.10.1, but some results could still be of interest.

Essentially, a few indices take a lot of space but are useless.
One example is the indices on the keys of the attributes, which use 2x13 GB but are never scanned in this case.

The query I used is given below; I attach index_usage.txt, which contains the result returned by the database.

SELECT
    pt.tablename AS TableName
    ,t.indexname AS IndexName
    ,pc.reltuples AS TotalRows
    ,pg_size_pretty(pg_relation_size(quote_ident(pt.tablename)::text)) AS TableSize
    ,pg_size_pretty(pg_relation_size(quote_ident(t.indexrelname)::text)) AS IndexSize
    ,t.idx_scan AS TotalNumberOfScan
    ,t.idx_tup_read AS TotalTupleRead
    ,t.idx_tup_fetch AS TotalTupleFetched
    ,pgi.indexdef AS IndexDef
FROM pg_tables AS pt
LEFT OUTER JOIN pg_class AS pc 
	ON pt.tablename=pc.relname
LEFT OUTER JOIN
( 
	SELECT 
		pc.relname AS TableName
		,pc2.relname AS IndexName
		,psai.idx_scan
		,psai.idx_tup_read
		,psai.idx_tup_fetch
		,psai.indexrelname 
	FROM pg_index AS pi
	JOIN pg_class AS pc 
		ON pc.oid = pi.indrelid
	JOIN pg_class AS pc2 
		ON pc2.oid = pi.indexrelid
	JOIN pg_stat_all_indexes AS psai 
		ON pi.indexrelid = psai.indexrelid 
) AS T
    ON pt.tablename = T.TableName
LEFT OUTER JOIN pg_indexes AS pgi
    ON T.indexname = pgi.indexname
WHERE pt.schemaname = 'public'
ORDER BY 1;
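
One caveat: the idx_scan, idx_tup_read and idx_tup_fetch counters accumulate since the last time the statistics were reset, so to judge index usage over a specific period one could reset the counters beforehand (this needs sufficient privileges on the database):

SELECT pg_stat_reset();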
@giovannipizzi
Member

Thanks @lekah! I think that with the work @sphuber is finalising, where we move to JSONB, this should be "automatically" fixed. Let's keep this open for now anyway, until JSONB is merged.

@lekah
Contributor Author

lekah commented Apr 16, 2019

Most welcome. I think it's especially important, when moving to JSONB, to take this list into account. I don't understand how this is automatically fixed, though. If you keep all the indices (within JSONB), then the problem just remains, doesn't it? And if you remove all indices, you also remove the ones that are used.

@giovannipizzi
Member

I think the main question is how big the index would be for a JSONB field (and whether it then becomes useful/used). Indeed this is something to test. @szoupanos it would be good if, before merging, you could get a dump from @lekah (no need for the file repo, I guess), import it and run the migration script to 1) see if the migration script can run in a reasonable time and 2) check the index sizes and the speed of queries, where maybe @lekah can provide a few queries that are relevant for his research and potentially complex/slow.

@szoupanos szoupanos changed the title Lot of disk space wasted by unused indices [Testing] Lot of disk space wasted by unused indices Apr 22, 2019
@szoupanos szoupanos changed the title [Testing] Lot of disk space wasted by unused indices Lot of disk space wasted by unused indices Apr 22, 2019
@sphuber
Contributor

sphuber commented May 30, 2019

Note @giovannipizzi that if the repository is not there and the database contains TrajectoryData nodes (which I guess is the case for @lekah's database), the migration is going to fail. The migration of those classes requires data from the repository.

@ltalirz
Member

ltalirz commented Jun 5, 2019

@lekah One question

One example are the indices on the keys of the attributes, which use 2x13GB but are never scanned in this case.

Are you saying that on your database you ran queries that filtered e.g. for attributes of nodes etc., and these indices were still not used?
If yes, could you let me know what I should fix in 0.12 to remove them?

E.g. I just imported an aiida export file in 0.12; the total database is 420 MB, but there are 1370 MB of indices. That doesn't seem like a healthy choice:

cofdb-discover-django_leopold=# SELECT pg_size_pretty(sum(pg_relation_size(pg_class.oid))::bigint), nspname,
CASE pg_class.relkind WHEN 'r' THEN 'table' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 'v' THEN 'view' WHEN 't' THEN 'toast' ELSE pg_class.relkind::text END
FROM pg_class
LEFT OUTER JOIN pg_namespace ON (pg_namespace.oid = pg_class.relnamespace)
GROUP BY pg_class.relkind, nspname
ORDER BY sum(pg_relation_size(pg_class.oid)) DESC;
 pg_size_pretty |      nspname       | relkind
----------------+--------------------+----------
 1369 MB        | public             | index
 420 MB         | public             | table
 3368 kB        | pg_catalog         | index
 3208 kB        | pg_catalog         | table
 416 kB         | pg_toast           | toast
 296 kB         | pg_toast           | index
 208 kB         | public             | sequence
 96 kB          | information_schema | table
 0 bytes        | information_schema | view
 0 bytes        | pg_catalog         | view
(10 rows)

@lekah
Contributor Author

lekah commented Jun 5, 2019

I did not run specific queries, I just monitored during everyday usage.
Now, for example, I checked again, and I have a few hits on the db_dbattribute key index. I attach the new file
index_usage2.txt. The most important index is the (dbnode_id, key) composite index. There are still no hits on the index on attribute.datatype, one example of an index to get rid of. You could also get rid of all indices in auth_group, auth_group_permissions, auth_permissions, db_dbsetting, db_dbuser, and db_dbuser_groups... I would remove all indices that have 0 hits for now. It wouldn't hurt to check another big database just to be sure that these numbers are not specific to my project.

"If yes, could you let me know what I should fix in 0.12 to remove them" If you want to remove them for yourself: DROP INDEX [index_name], see https://www.postgresql.org/docs/11/sql-dropindex.html

@szoupanos
Contributor

@ltalirz Just one note: after being sure which indexes to remove (by dropping them manually and testing), you will have to create the corresponding migration that drops them (I suppose that it is obvious, but OK...).

@ltalirz
Member

ltalirz commented Jan 15, 2020

Results, this time from an aiida 1.0 DB

select schemaname as table_schema, relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) as total_size, pg_size_pretty(pg_relation_size(relid)) as data_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc, pg_relation_size(relid) desc limit 10;

 table_schema |     table_name      | total_size | data_size  | external_size
--------------+---------------------+------------+------------+---------------
 public       | db_dbnode           | 1494 MB    | 672 MB     | 822 MB
 public       | db_dblink           | 711 MB     | 158 MB     | 553 MB
 public       | db_dblog            | 408 MB     | 312 MB     | 97 MB
 public       | db_dbgroup_dbnodes  | 256 MB     | 73 MB      | 182 MB
 public       | db_dbgroup          | 40 MB      | 13 MB      | 28 MB

here, "External size" is pg_total_relation_size(relid) - pg_relation_size(relid).
I.e. indices are roughly the same size as the actual data in this example.
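
To see which individual indices account for that space, a query along these lines (using the standard pg_stat_user_indexes view) could be used:

SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;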

P.S. For comparison, here is also the output of the query used earlier in this thread:

 pg_size_pretty |      nspname       | relkind
----------------+--------------------+----------
 1415 MB        | public             | index
 1227 MB        | public             | table
 265 MB         | pg_toast           | toast
 3416 kB        | pg_catalog         | index
 3248 kB        | pg_catalog         | table
 3224 kB        | pg_toast           | index
 120 kB         | public             | sequence
 96 kB          | information_schema | table
 0 bytes        | information_schema | view
 0 bytes        | pg_catalog         | view

@lekah
Contributor Author

lekah commented Apr 1, 2020

Is there anything happening on this issue? This is not just a problem for the total size of the database (which seems to double), but also for speed: indices slow down write operations. So if importing or creating nodes takes a lot of time, the many indices AiiDA creates are probably part of the problem (and quite easy to fix).

@ltalirz
Member

ltalirz commented Apr 1, 2020

@lekah You're very welcome to have a go at this!

@lekah
Contributor Author

lekah commented Apr 1, 2020

@ltalirz very kind of you. I will try to join one of the next meetings to first discuss how to proceed and what the constraints are, if that's ok.

@ltalirz
Member

ltalirz commented Apr 1, 2020

We have "coding day" right now - if you have any questions concerning this, feel free to ping Gio/Seb/me on slack

@chrisjsewell
Member

Hey guys, in relation to this issue, I have started to look at profiling aiida-core performance.
To this end, I have created a small package: https://github.com/chrisjsewell/aiida-profiling,
which includes some of the queries you mention above (see db_stats.py).

After speaking about this with @lekah I think it might be good to have a simple way for developers/users to generate some pre-defined "telemetry" data that we can collate, to get a better understanding of how AiiDA databases are generally being used (including e.g. what indices are unused). I would envisage something like:

$ verdi database telemetry

which would (a) generate a JSON blob with relevant statistics for the database (and python environment) and maybe (b) try to automatically send it to a server somewhere.

Let me know what you think.
If this sounds agreeable I will write up a JSON schema proposal for what the blob should contain.

@ltalirz
Member

ltalirz commented Apr 21, 2020

Hi @chrisjsewell, this sounds like a good idea.
You can take inspiration from the simple script that we wrote to collect statistics for the AiiDA-powered research questionnaire, hosted here: https://github.com/ltalirz/aiida-statistics-query

This script works on aiida 0.x and 1.x; this type of backwards compatibility is not needed for your package though.

@ltalirz
Member

ltalirz commented Apr 21, 2020

P.S. Of course we already have people's replies, so you might also want to have a look at the statistics they sent in.

@chrisjsewell
Member

Thanks @ltalirz I will take a look

@greschd
Member

greschd commented Apr 21, 2020

(b) try to automatically send it to a server somewhere.

Maybe this should be a separate command (e.g. verdi database telemetry upload), as people might want to inspect the JSON to check what they're uploading before doing so.

@sphuber
Contributor

sphuber commented Feb 17, 2022

Now that the Django backend has been dropped, it would be good to take a few (big) production databases and run the query provided in the OP. We can then see which indices are rarely being hit while occupying significant space. We should be careful to take databases that have been used in a way that represents, as well as possible, the variety of ways in which databases can be used. It may well be the case that indices on mtime, ctime and user_id are not used for typical production databases, but for databases that are mostly queried, through let's say the REST API, I would expect this to look a lot different.

@giovannipizzi @chrisjsewell maybe we should include this query as something to be run after a successful migration on the upcoming testing/coding day?

@chrisjsewell
Member

Yep, sounds good. Note that there is now the verdi devel run-sql command that people can run,

e.g.

$ verdi devel run-sql "select schemaname as table_schema, relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) as total_size, pg_size_pretty(pg_relation_size(relid)) as data_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc, pg_relation_size(relid) desc limit 10;"
('public', 'db_dbnode', '225 MB', '64 MB', '161 MB')
('public', 'db_dbgroup_dbnodes', '29 MB', '9480 kB', '20 MB')
('public', 'db_dblink', '23 MB', '10 MB', '13 MB')
('public', 'db_dbgroup', '144 kB', '8192 bytes', '136 kB')
('public', 'db_dbcomputer', '80 kB', '8192 bytes', '72 kB')
('public', 'db_dbuser', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dbsetting', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dblog', '64 kB', '0 bytes', '64 kB')
('public', 'db_dbcomment', '40 kB', '0 bytes', '40 kB')
('public', 'db_dbauthinfo', '40 kB', '0 bytes', '40 kB')
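
The same command could also be used for the index-usage check itself, e.g. with a query along these lines (standard PostgreSQL statistics views; note that indices backing primary keys or unique constraints should not be dropped even if they are never scanned):

$ verdi devel run-sql "SELECT s.relname, s.indexrelname, pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size, s.idx_scan FROM pg_stat_user_indexes AS s JOIN pg_index AS i ON i.indexrelid = s.indexrelid WHERE s.idx_scan = 0 AND NOT i.indisunique AND NOT i.indisprimary ORDER BY pg_relation_size(s.indexrelid) DESC;"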

@chrisjsewell
Member

I would envisage something like: verdi database telemetry

This would now be verdi storage telemetry, but I still feel it could be a good idea to add (potentially also incorporating data from the disk-objectstore).

@chrisjsewell
Member

Also note that there are now the "schema reflection" regression files that clarify exactly what indexes are in the database.
