New sql schema #53
darix
commented
Jul 15, 2020
+1 from my side.
The comment patch is now already in the opensuse branch, so it can be ignored here.
for storing file information and mirror mapping all the mirr_* functions are not ported yet. For the details see https://github.com/openSUSE/mirrorbrain/wiki/Roadmap
and have consistent naming of all functions
Some comments are less important here, but we should build common understanding on some of the notes.
CREATE TABLE filemetadata
(
    id integer GENERATED ALWAYS AS IDENTITY,
I suggest going with bigint: the current max id in the db is above 900M, which is close to 50% of the int capacity (2^31 - 1, about 2.1 billion), so the limit may be reached in a few years.
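To make the headroom argument concrete, here is a rough back-of-the-envelope sketch. It assumes a current max id of exactly 900,000,000 (the comment only says "above 900M"), so the real figure is somewhat higher.

```python
# Rough check of how much of the id range is already used.
# current_max_id is an assumption: the comment above only says "above 900M".
INT4_MAX = 2**31 - 1   # upper bound of a PostgreSQL integer
INT8_MAX = 2**63 - 1   # upper bound of a PostgreSQL bigint
current_max_id = 900_000_000


def used_fraction(max_id: int, capacity: int) -> float:
    """Fraction of the positive id range consumed so far."""
    return max_id / capacity


int_used = used_fraction(current_max_id, INT4_MAX)
bigint_used = used_fraction(current_max_id, INT8_MAX)
ids_left = INT4_MAX - current_max_id

print(f"integer: {int_used:.1%} used, {ids_left:,} ids left")
print(f"bigint:  {bigint_used:.2e} of the range used")
```

With bigint the same id count uses a negligible share of the range, at the cost of 4 extra bytes per row.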
will do
DELETE FROM filemetadata
WHERE id IN (
SELECT filemetadata_id
FROM filemetadata_mirror_count
The following query should do the same job and is much lighter, because it doesn't calculate exact counts (m.id will be NULL only for those rows which don't have any mirror):
DELETE FROM filemetadata
USING filemetadata AS f
LEFT OUTER JOIN mirrors AS m ON filemetadata_id = f.id
WHERE filemetadata.id = f.id AND m.id is NULL AND f.mtime < (now()-'3 months'::interval)
I know that the counts are pre-calculated anyway, but maybe we can simplify the workflow and not maintain the MATERIALIZED VIEW at all?
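The anti-join idea can be tried out in isolation. The sketch below uses Python's sqlite3 module as a stand-in for PostgreSQL, so it is not the query from this PR: SQLite has no DELETE ... USING, so the same "rows with no mirror" condition is written with NOT EXISTS, and mtime is stored as a Unix timestamp. The column layout and sample rows are assumptions made for illustration only.

```python
import sqlite3
import time

# Hypothetical, simplified schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE filemetadata (id INTEGER PRIMARY KEY, path TEXT, mtime INTEGER);
    CREATE TABLE mirrors (server_id INTEGER, filemetadata_id INTEGER);
    """
)

now = int(time.time())
old = now - 120 * 86400  # roughly four months ago
conn.executemany(
    "INSERT INTO filemetadata (id, path, mtime) VALUES (?, ?, ?)",
    [
        (1, "old/with-mirror", old),
        (2, "old/no-mirror", old),
        (3, "new/no-mirror", now),
    ],
)
conn.execute("INSERT INTO mirrors (server_id, filemetadata_id) VALUES (10, 1)")

# Stand-in for now() - '3 months'::interval.
cutoff = now - 90 * 86400

# Anti-join: delete old files that no mirror references, without counting.
conn.execute(
    """
    DELETE FROM filemetadata
    WHERE mtime < ?
      AND NOT EXISTS (
          SELECT 1 FROM mirrors AS m
          WHERE m.filemetadata_id = filemetadata.id
      )
    """,
    (cutoff,),
)

remaining = [row[0] for row in conn.execute("SELECT id FROM filemetadata ORDER BY id")]
print(remaining)
```

Only the old file without a mirror is removed; the old file that still has a mirror and the recent file without one both stay.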
explain (analyze, buffers) delete from filemetadata using filemetadata AS f LEFT OUTER JOIN mirrors AS m ON filemetadata_id = f.id WHERE filemetadata_id = f.id AND m.server_id is NULL AND f.mtime < (now()-'3 months'::interval);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Delete on filemetadata (cost=1.00..1116601.47 rows=2862434 width=18) (actual time=0.638..0.638 rows=0 loops=1)
Buffers: shared read=4
-> Nested Loop (cost=1.00..1116601.47 rows=2862434 width=18) (actual time=0.638..0.638 rows=0 loops=1)
Buffers: shared read=4
-> Nested Loop (cost=1.00..16.55 rows=1 width=12) (actual time=0.637..0.637 rows=0 loops=1)
Buffers: shared read=4
-> Index Scan using idx_mirrors_server_id on mirrors m (cost=0.56..8.08 rows=1 width=10) (actual time=0.636..0.636 rows=0 loops=1)
Index Cond: (server_id IS NULL)
Buffers: shared read=4
-> Index Scan using pk_filemetadata on filemetadata f (cost=0.44..8.46 rows=1 width=10) (never executed)
Index Cond: (id = m.filemetadata_id)
Filter: (mtime < (now() - '3 mons'::interval))
-> Seq Scan on filemetadata (cost=0.00..888591.96 rows=22799296 width=6) (never executed)
Planning Time: 5.143 ms
JIT:
Functions: 14
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.544 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 1.544 ms
Execution Time: 2.353 ms
(19 rows)
Oh, it probably should be WHERE filemetadata.id instead of WHERE filemetadata_id.
sha1pieces bytea,
btih bytea,
pgp text,
zblocksize smallint,
We should use INT instead of SMALLINT, which is limited to 32K: the zsync algorithm doesn't have a practical limit for block size, and e.g. a block size of 1M is much more suitable for huge files (see #47).
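A quick range check illustrates the point. The block sizes below are illustrative picks, not values taken from the code base; only the 1M case comes from the comment above.

```python
# PostgreSQL smallint is a signed 16-bit value, integer a signed 32-bit value.
SMALLINT_MAX = 2**15 - 1
INT_MAX = 2**31 - 1


def fits(value: int, upper: int) -> bool:
    """True if a non-negative block size can be stored in the column type."""
    return 0 <= value <= upper


# Hypothetical block sizes in bytes.
block_sizes = {"2 KiB": 2048, "32 KiB": 32 * 1024, "1 MiB": 1024 * 1024}

# Maps label -> (fits in smallint, fits in integer).
report = {
    label: (fits(size, SMALLINT_MAX), fits(size, INT_MAX))
    for label, size in block_sizes.items()
}

for label, (in_smallint, in_int) in report.items():
    print(f"{label:>6}: smallint={in_smallint} integer={in_int}")
```

Even a 32 KiB block is one past the smallint maximum of 32767, while integer leaves room up to about 2 GiB.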
zblocksize,
zhashlens,
zsums,
encode(zsums, 'hex') AS zsumshex
We need the path column here as well, so the view can be queried without a join.
I thought about it, but right now it is 2 queries anyway.
https://github.com/openSUSE/mirrorbrain/blob/opensuse/mod_mirrorbrain/mod_mirrorbrain.c#L139..L163
I mean instead of the subquery in the hash query
mirrorbrain/mod_mirrorbrain/mod_mirrorbrain.c
Lines 157 to 159 in f33ca26
"WHERE file_id = (SELECT id " \
"FROM filearr " \
"WHERE path = %s) " \
the condition could be just WHERE path = %s if the view had a path column.
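The difference between the two lookups can be sketched with a small self-contained example. It uses sqlite3 instead of PostgreSQL, and apart from filearr, file_id and path, every name here (the hash table layout and both view names) is hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE filearr (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    CREATE TABLE hash (file_id INTEGER PRIMARY KEY, sha1 BLOB);

    -- Hypothetical view without path: callers must resolve the id first.
    CREATE VIEW hexhash_no_path AS
        SELECT file_id, hex(sha1) AS sha1hex FROM hash;

    -- Hypothetical view that exposes path, so callers can filter on it.
    CREATE VIEW hexhash_with_path AS
        SELECT f.path AS path, h.file_id AS file_id, hex(h.sha1) AS sha1hex
        FROM hash AS h
        JOIN filearr AS f ON f.id = h.file_id;

    INSERT INTO filearr (id, path)
        VALUES (1, 'distribution/a.iso'), (2, 'distribution/b.iso');
    INSERT INTO hash (file_id, sha1) VALUES (1, x'aa'), (2, x'bb');
    """
)

path = "distribution/b.iso"

# Current shape: subquery against filearr to find the id.
via_subquery = conn.execute(
    "SELECT sha1hex FROM hexhash_no_path "
    "WHERE file_id = (SELECT id FROM filearr WHERE path = ?)",
    (path,),
).fetchone()

# Proposed shape: the view carries path, so the filter is direct.
via_path = conn.execute(
    "SELECT sha1hex FROM hexhash_with_path WHERE path = ?",
    (path,),
).fetchone()

print(via_subquery, via_path)
```

Both lookups return the same row; the second one keeps the join inside the view definition, so the calling code stays simpler.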
This is how it will look if the column is inside the view: https://github.com/openSUSE/mirrorbrain/pull/52/files#diff-a44eaf51cd9831129dc4b515db0bd2aeL157-R156
All your changes should be addressed.