Lot of disk space wasted by unused indices #2762

Open
lekah opened this issue Apr 16, 2019 · 21 comments

Comments

@lekah
Contributor

lekah commented Apr 16, 2019

I've collected some statistics for my high-throughput project with AiiDA.
It's version 0.10.1, but some results could still be of interest.

Essentially, a few indices take a lot of space but are useless.
One example is the indices on the keys of the attributes, which use 2x13 GB but are never scanned in this case.

The query I used is given below; I attach index_usage.txt, which contains the result returned by the database.

SELECT
    pt.tablename AS TableName
    ,t.indexname AS IndexName
    ,pc.reltuples AS TotalRows
    ,pg_size_pretty(pg_relation_size(quote_ident(pt.tablename)::text)) AS TableSize
    ,pg_size_pretty(pg_relation_size(quote_ident(t.indexrelname)::text)) AS IndexSize
    ,t.idx_scan AS TotalNumberOfScan
    ,t.idx_tup_read AS TotalTupleRead
    ,t.idx_tup_fetch AS TotalTupleFetched
    ,pgi.indexdef AS IndexDef
FROM pg_tables AS pt
LEFT OUTER JOIN pg_class AS pc 
	ON pt.tablename=pc.relname
LEFT OUTER JOIN
( 
	SELECT 
		pc.relname AS TableName
		,pc2.relname AS IndexName
		,psai.idx_scan
		,psai.idx_tup_read
		,psai.idx_tup_fetch
		,psai.indexrelname 
	FROM pg_index AS pi
	JOIN pg_class AS pc 
		ON pc.oid = pi.indrelid
	JOIN pg_class AS pc2 
		ON pc2.oid = pi.indexrelid
	JOIN pg_stat_all_indexes AS psai 
		ON pi.indexrelid = psai.indexrelid 
) AS T
    ON pt.tablename = T.TableName
LEFT OUTER JOIN pg_indexes AS pgi
    ON T.indexname = pgi.indexname
WHERE pt.schemaname = 'public'
ORDER BY 1;
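
One caveat: the idx_scan, idx_tup_read and idx_tup_fetch counters accumulate since the last time the statistics were reset, so to judge index usage over a specific period one could reset the counters beforehand (this needs sufficient privileges on the database):

SELECT pg_stat_reset();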
@giovannipizzi
Member

Thanks @lekah! I think that with the work @sphuber is finalising, where we move to JSONB, this should be "automatically" fixed. Let's keep this open for now anyway, until JSONB is merged.

@lekah
Contributor Author

lekah commented Apr 16, 2019

Most welcome. I think it's especially important, when moving to JSONB, to take this list into account. I don't understand how this is automatically fixed, though. If you keep all the indices (within JSONB), then the problem just remains, doesn't it? And if you remove all indices, you also remove the ones that are used.

@giovannipizzi
Member

I think the main question is how big the index would be for a JSONB field (and whether it then becomes useful/used). Indeed this is something to test. @szoupanos it would be good if, before merging, you could get a dump from @lekah (no need for the file repo, I guess), import it and run the migration script to 1) see if the migration script can run in a reasonable time and 2) check the index sizes and the speed of queries, where maybe @lekah can provide a few queries that are relevant for his research and potentially complex/slow.

@szoupanos szoupanos changed the title Lot of disk space wasted by unused indices [Testing] Lot of disk space wasted by unused indices Apr 22, 2019
@szoupanos szoupanos changed the title [Testing] Lot of disk space wasted by unused indices Lot of disk space wasted by unused indices Apr 22, 2019
@sphuber
Contributor

sphuber commented May 30, 2019

Note @giovannipizzi that if the repository is not there and the database contains TrajectoryData nodes (which I guess is the case for @lekah's database), the migration is going to fail. The migration of those classes requires data from the repository.

@ltalirz
Member

ltalirz commented Jun 5, 2019

@lekah One question

One example are the indices on the keys of the attributes, which use 2x13GB but are never scanned in this case.

Are you saying that on your database you ran queries that filtered e.g. for attributes of nodes etc., and these indices were still not used?
If yes, could you let me know what I should fix in 0.12 to remove them?

E.g. I just imported an aiida export file in 0.12; the total database is 420 MB, but there are 1370 MB of indices. That doesn't seem like a healthy choice:

cofdb-discover-django_leopold=# SELECT pg_size_pretty(sum(pg_relation_size(pg_class.oid))::bigint), nspname,
CASE pg_class.relkind WHEN 'r' THEN 'table' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 'v' THEN 'view' WHEN 't' THEN 'toast' ELSE pg_class.relkind::text END
FROM pg_class
LEFT OUTER JOIN pg_namespace ON (pg_namespace.oid = pg_class.relnamespace)
GROUP BY pg_class.relkind, nspname
ORDER BY sum(pg_relation_size(pg_class.oid)) DESC;
 pg_size_pretty |      nspname       | relkind
----------------+--------------------+----------
 1369 MB        | public             | index
 420 MB         | public             | table
 3368 kB        | pg_catalog         | index
 3208 kB        | pg_catalog         | table
 416 kB         | pg_toast           | toast
 296 kB         | pg_toast           | index
 208 kB         | public             | sequence
 96 kB          | information_schema | table
 0 bytes        | information_schema | view
 0 bytes        | pg_catalog         | view
(10 rows)

@lekah
Contributor Author

lekah commented Jun 5, 2019

I did not run specific queries, I just monitored during everyday usage.
Now, for example, I checked again, and I have a few hits on the db_dbattribute key index. I attach the new file
index_usage2.txt. The most important index is the (dbnode_id, key) composite index. There are still no hits on the index on attribute.datatype, one example of an index to get rid of. You could also get rid of all indices in auth_group, auth_group_permissions, auth_permissions, db_dbsetting, db_dbuser, and db_dbuser_groups... I would remove all indices that have 0 hits for now. It wouldn't hurt to check another big database just to be sure that these numbers are not specific to my project.

"If yes, could you let me know what I should fix in 0.12 to remove them" If you want to remove them for yourself: DROP INDEX [index_name], see https://www.postgresql.org/docs/11/sql-dropindex.html

@szoupanos
Contributor

@ltalirz Just one note: after being sure which indexes to remove (by dropping them manually and testing), you will have to create the corresponding migration that drops them (I suppose that it is obvious, but OK...).

@ltalirz
Member

ltalirz commented Jan 15, 2020

Results, this time from an aiida 1.0 DB

select schemaname as table_schema, relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) as total_size, pg_size_pretty(pg_relation_size(relid)) as data_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc, pg_relation_size(relid) desc limit 10;

 table_schema |     table_name      | total_size | data_size  | external_size
--------------+---------------------+------------+------------+---------------
 public       | db_dbnode           | 1494 MB    | 672 MB     | 822 MB
 public       | db_dblink           | 711 MB     | 158 MB     | 553 MB
 public       | db_dblog            | 408 MB     | 312 MB     | 97 MB
 public       | db_dbgroup_dbnodes  | 256 MB     | 73 MB      | 182 MB
 public       | db_dbgroup          | 40 MB      | 13 MB      | 28 MB

here, "External size" is pg_total_relation_size(relid) - pg_relation_size(relid).
I.e. indices are roughly the same size as the actual data in this example.
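
To see which individual indices account for that space, a query along these lines (using the standard pg_stat_user_indexes view) could be used:

SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;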

P.S. For comparison, here is also the output of the query used earlier in this thread:

 pg_size_pretty |      nspname       | relkind
----------------+--------------------+----------
 1415 MB        | public             | index
 1227 MB        | public             | table
 265 MB         | pg_toast           | toast
 3416 kB        | pg_catalog         | index
 3248 kB        | pg_catalog         | table
 3224 kB        | pg_toast           | index
 120 kB         | public             | sequence
 96 kB          | information_schema | table
 0 bytes        | information_schema | view
 0 bytes        | pg_catalog         | view

@lekah
Contributor Author

lekah commented Apr 1, 2020

Is there anything happening on this issue? This is not just a problem for the total size of the database (which seems to double), but also for speed: indices slow down write operations. So if importing or creating nodes takes a lot of time, the many indices AiiDA creates are probably part of the problem (and quite easy to fix).

@ltalirz
Member

ltalirz commented Apr 1, 2020

@lekah You're very welcome to have a go at this!

@lekah
Contributor Author

lekah commented Apr 1, 2020

@ltalirz very kind of you. I will try to join one of the next meetings to first discuss how to proceed and what the constraints are, if that's ok.

@ltalirz
Member

ltalirz commented Apr 1, 2020

We have "coding day" right now - if you have any questions concerning this, feel free to ping Gio/Seb/me on slack

@chrisjsewell
Member

Hey guys, in relation to this issue, I have started to look at profiling aiida-core performance.
To this end, I have created a small package: https://github.com/chrisjsewell/aiida-profiling,
which includes some of the queries you mention above (see db_stats.py).

After speaking about this with @lekah I think it might be good to have a simple way for developers/users to generate some pre-defined "telemetry" data that we can collate, to get a better understanding of how AiiDA databases are generally being used (including e.g. what indices are unused). I would envisage something like:

$ verdi database telemetry

which would (a) generate a JSON blob with relevant statistics for the database (and python environment) and maybe (b) try to automatically send it to a server somewhere.

Let me know what you think.
If this sounds agreeable I will write up a JSON schema proposal for what the blob should contain.

@ltalirz
Member

ltalirz commented Apr 21, 2020

Hi @chrisjsewell, this sounds like a good idea.
You can take inspiration from the simple script that we wrote to collect statistics for the AiiDA-powered research questionnaire, hosted here: https://github.com/ltalirz/aiida-statistics-query

This script works on aiida 0.x and 1.x; this type of backwards compatibility is not needed for your package though.

@ltalirz
Member

ltalirz commented Apr 21, 2020

P.S. Of course we already have people's replies, so you might also want to have a look at the statistics they sent in.

@chrisjsewell
Member

Thanks @ltalirz I will take a look

@greschd
Member

greschd commented Apr 21, 2020

(b) try to automatically send it to a server somewhere.

Maybe this should be a separate command (e.g. verdi database telemetry upload), as people might want to inspect the JSON to check what they're uploading before doing so.

@sphuber
Contributor

sphuber commented Feb 17, 2022

Now that the Django backend has been dropped, it would be good to take a few (big) production databases and run the query provided in the OP. We can then see which indices are rarely being hit while occupying significant space. We should be careful to take databases that have been used in a way that represents, as well as possible, the variety of ways in which databases can be used. It may well be the case that indices on mtime, ctime and user_id are not used for typical production databases, but for databases that are mostly queried, through let's say the REST API, I would expect this to look a lot different.

@giovannipizzi @chrisjsewell maybe we should include this query as something to be run after a successful migration on the upcoming testing/coding day?

@chrisjsewell
Member

Yep, sounds good. Note that there is now the verdi devel run-sql command that people can run,

e.g.

$ verdi devel run-sql "select schemaname as table_schema, relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) as total_size, pg_size_pretty(pg_relation_size(relid)) as data_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc, pg_relation_size(relid) desc limit 10;"
('public', 'db_dbnode', '225 MB', '64 MB', '161 MB')
('public', 'db_dbgroup_dbnodes', '29 MB', '9480 kB', '20 MB')
('public', 'db_dblink', '23 MB', '10 MB', '13 MB')
('public', 'db_dbgroup', '144 kB', '8192 bytes', '136 kB')
('public', 'db_dbcomputer', '80 kB', '8192 bytes', '72 kB')
('public', 'db_dbuser', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dbsetting', '64 kB', '8192 bytes', '56 kB')
('public', 'db_dblog', '64 kB', '0 bytes', '64 kB')
('public', 'db_dbcomment', '40 kB', '0 bytes', '40 kB')
('public', 'db_dbauthinfo', '40 kB', '0 bytes', '40 kB')
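
The same command could also be used for the index-usage check itself, e.g. with a query along these lines (standard PostgreSQL statistics views; note that indices backing primary keys or unique constraints should not be dropped even if they are never scanned):

$ verdi devel run-sql "SELECT s.relname, s.indexrelname, pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size, s.idx_scan FROM pg_stat_user_indexes AS s JOIN pg_index AS i ON i.indexrelid = s.indexrelid WHERE s.idx_scan = 0 AND NOT i.indisunique AND NOT i.indisprimary ORDER BY pg_relation_size(s.indexrelid) DESC;"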

@chrisjsewell
Member

I would envisage something like: verdi database telemetry

This would now be verdi storage telemetry, but I still feel it could be a good idea to add (potentially also incorporating data from the disk-objectstore).

@chrisjsewell
Member

Also note that there are now the "schema reflection" regression files that clarify exactly what indexes are in the database.
