Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ttl eviction for RecycleStore #15513

Merged
merged 1 commit into from
Feb 25, 2021

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Feb 24, 2021

Problem

(mostly copied from #15188 )

recycling AppendVec seems to accmulate RSS usage over time. That's because RecycleStore's entries gradually occupied larger AppendVecs over time.

The followings are top excerpts over the time for easier illustration. (I have somewhat more detailed metrics on the tested machines, if wanted)

With recycling (current code; bad;):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 881078 ubuntu    20   0   47.2g  33.0g   6.1g S  1094  26.7   9795:01 solana-validato                                                                                               
 881078 ubuntu    20   0   49.1g  35.3g   6.4g S  1325  28.6  16664:41 solana-validato                                                                                               
 881078 ubuntu    20   0   52.1g  37.8g   6.3g S  1086  30.6  33565:52 solana-validato                                                                                                                      
 881078 ubuntu    20   0   52.4g  38.2g   6.5g S 286.7  30.9  37263:25 solana-validato                                                                                               
 881078 ubuntu    20   0   56.4g  42.0g   8.7g S 693.8  34.0  43183:25 solana-valida                                                                                  
 881078 ubuntu    20   0   67.5g  52.6g  17.2g S 787.5  42.5  51273:52 solana-validato         
 881078 ubuntu    20   0   70.5g  52.2g  16.5g S  1050  42.2  54339:02 solana-validato
 881078 ubuntu    20   0   81.6g  47.7g   9.8g S 773.3  38.6  65919:56 solana-valida                                                                                   
 881078 ubuntu    20   0   85.4g  52.3g  13.9g S 712.5  42.3  70357:35 solana-valida                                                                                   
 881078 ubuntu    20   0   87.9g  53.4g  14.9g S 944.2  43.2  75894:56 solana-valida                                                                                                                    
 881078 ubuntu    20   0   94.6g  46.9g   7.8g S  1100  37.9  88728:44 solana-valida                                                                                                                     
 881078 ubuntu    20   0   98.3g  51.4g  12.3g S  1812  41.6 101525:57 solana-valida                                                                                                                    

Summary of Changes

Almost same approch is inherited from #15139.

As a context, #15139 is superceded by #15320 because #15139 had a unsolved unbounded Recyler growth problem.

However, RecycleStores shouldn't have such problem (it's capped to MAX_RECYCLE_STORES and it doesn't need to be dynamic and thus doesn't need to shrink.)..

So, I think the ttl thing can be applied here this time really. The benefit is simpler implementation compared to full-fledged dynamic implementation a la #15320.

Fixes #14366

@codecov
Copy link

codecov bot commented Feb 24, 2021

Codecov Report

Merging #15513 (c8971e5) into master (bc01a33) will decrease coverage by 0.0%.
The diff coverage is 74.1%.

@@            Coverage Diff            @@
##           master   #15513     +/-   ##
=========================================
- Coverage    80.2%    80.2%   -0.1%     
=========================================
  Files         406      406             
  Lines      104172   104258     +86     
=========================================
+ Hits        83603    83669     +66     
- Misses      20569    20589     +20     

Copy link
Member

@sakridge sakridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ryoqun
Copy link
Member Author

ryoqun commented Feb 25, 2021

Here's basic metrics observations for this pr, I ran validator with this pr for 12 hours:

accounts_db_store_timings2.current_recycle_store_bytes

BEFORE

Slightly hard to look; but ba2 leaked another 7.5G. (moreover it has been leaking whopping 75G)

image

AFTER

constant around ~5G.

image

@ryoqun
Copy link
Member Author

ryoqun commented Feb 25, 2021

clean_purge_slots_stats.drop_storage_entries_elapsed now sees constant cleanings

BEFORE

image

AFTER

image

@ryoqun
Copy link
Member Author

ryoqun commented Feb 25, 2021

clean_purge_slots_stats.recycle_stores_write_elapsed

BEFORE

image

AFTER

We see now constant (write) lock time, this time also includes the actual expiration loop.

This lock can block the blockstore_processor (= important) threads. but should be ok. That's because 150us per every 10min. (so small overhead).

This also implies that each iteration of this loop (https://github.com/solana-labs/solana/pull/15513/files#diff-1090394420d51617f3233275c2b65ed706b35b53b115fe65f82c682af8134a6fR650) takes around 150ns.

image

@ryoqun
Copy link
Member Author

ryoqun commented Feb 25, 2021

accounts_db_store_timings2.current_recycle_store_count

BEFORE

image

AFTER

As seen on this, new behavior (roughtly 1/3 is expired each 10 min) can be confirmed.

image

@ryoqun
Copy link
Member Author

ryoqun commented Feb 25, 2021

It seems that system-performance-tests has very variance and now I can't run it for some reason... I'll prioritize release over it.

@ryoqun ryoqun merged commit 21b4300 into solana-labs:master Feb 25, 2021
mergify bot pushed a commit that referenced this pull request Feb 25, 2021
mergify bot added a commit that referenced this pull request Feb 25, 2021
(cherry picked from commit 21b4300)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

solana-validator leaks memory (but at very slow pace)
2 participants