This repository has been archived by the owner on Oct 13, 2023. It is now read-only.

[19.03 backport] Added garbage collector for image layers #268

Conversation

thaJeztah
Member

@thaJeztah thaJeztah commented Jun 7, 2019

Based on top of #267 - only the last commit is new, but marked it WIP.

backport of

Minor conflict, because moby@072400f (moby#38377) is not in the 19.03 branch:

diff --cc internal/test/daemon/daemon.go
index bb8451db8c,4705d2f4ee..0000000000
--- a/internal/test/daemon/daemon.go
+++ b/internal/test/daemon/daemon.go
@@@ -60,16 -61,18 +61,31 @@@ type Daemon struct
        UseDefaultHost    bool
        UseDefaultTLSHost bool

++<<<<<<< HEAD
 +      id            string
 +      logFile       *os.File
 +      cmd           *exec.Cmd
 +      storageDriver string
 +      userlandProxy bool
 +      execRoot      string
 +      experimental  bool
 +      init          bool
 +      dockerdBinary string
 +      log           logT
++=======
+       id                         string
+       logFile                    *os.File
+       cmd                        *exec.Cmd
+       storageDriver              string
+       userlandProxy              bool
+       defaultCgroupNamespaceMode string
+       execRoot                   string
+       experimental               bool
+       init                       bool
+       dockerdBinary              string
+       log                        logT
+       imageService               *images.ImageService
++>>>>>>> 213681b66a... First step to implement full garbage collector for image layers

        // swarm related field
        swarmListenAddr string

- What I did
Added garbage collector for image layers

Closes moby#38488 and most probably moby#39247 too; together with moby#38401, it also closes docker/for-win#745.

- How I did it
Added logic which marks layers as being under removal by renaming the layer digest folder to include a -<int>-removing suffix. That way we can keep track of which layers are no longer complete (and therefore need to be re-downloaded if needed), and we can also run a cleanup routine for them.
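For illustration, here is a minimal sketch of that marking step, assuming a hypothetical markLayerForRemoval helper and layerdb layout (not the exact code from this PR):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// markLayerForRemoval renames a layer metadata folder so that it carries a
// "-<n>-removing" suffix. The counter avoids collisions if the same digest
// is marked more than once before the cleanup routine runs. The helper name
// and layout are illustrative only, not the PR's actual implementation.
func markLayerForRemoval(layerdbRoot, algorithm, digest string) (string, error) {
	src := filepath.Join(layerdbRoot, algorithm, digest)
	for n := 0; ; n++ {
		dst := fmt.Sprintf("%s-%d-removing", src, n)
		if _, err := os.Stat(dst); os.IsNotExist(err) {
			return dst, os.Rename(src, dst)
		}
	}
}
```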

- How to verify it
Reproduce the issue as described in moby#38488;
also note the bug with missing content in comment moby#38488 (comment).

Run the daemon in debug mode and stop it.

You will see a log like this, and the orphan layers will disappear from disk.

DEBU[2019-05-17T02:22:21.349871543+03:00] start clean shutdown of all containers with a 15 seconds timeout... 
DEBU[2019-05-17T02:22:21.350859348+03:00] found 2 orphan layers                        
DEBU[2019-05-17T02:22:21.350918187+03:00] removing orphan layer, chain ID: sha256:3fc64803ca2de7279269048fe2b8b3c73d4536448c87c32375b2639ac168a48b , cache ID: 5ccabfe0f839144bcbb30b7eae3d65e89fd93e34b411a9ba37b753c6cadc939c 
DEBU[2019-05-17T02:22:21.352871349+03:00] Removing folder: /var/lib/docker/image/overlay2/layerdb/sha256/3fc64803ca2de7279269048fe2b8b3c73d4536448c87c32375b2639ac168a48b-0-removing 
DEBU[2019-05-17T02:22:21.353409212+03:00] Removing folder: /var/lib/docker/image/overlay2/layerdb/sha256/3fc64803ca2de7279269048fe2b8b3c73d4536448c87c32375b2639ac168a48b-1-removing 
DEBU[2019-05-17T02:22:21.353901755+03:00] removing orphan layer, chain ID: sha256:3fc64803ca2de7279269048fe2b8b3c73d4536448c87c32375b2639ac168a48b , cache ID: 4594ad5ca4f2ae9046141506d857fe9e94d68727809813d653b98353a43dd5fc 
DEBU[2019-05-17T02:22:21.356055301+03:00] Unix socket /var/run/docker/libnetwork/60b87c485747cb08641761e12ba27e169f4a9430efd4748121f1ce4d0fd3d536.sock doesn't exist. cannot accept client connections 
DEBU[2019-05-17T02:22:21.356174192+03:00] Cleaning up old mountid : start.             
INFO[2019-05-17T02:22:21.356225827+03:00] stopping event stream following graceful shutdown  error="<nil>" module=libcontainerd namespace=moby
DEBU[2019-05-17T02:22:21.356871850+03:00] Cleaning up old mountid : done.              
DEBU[2019-05-17T02:22:21.357264017+03:00] Clean shutdown succeeded 

If you do the same test on Windows, this is what you will see:

time="2019-05-16T23:31:11.505667200Z" level=debug msg="start clean shutdown of all containers with a 35 seconds timeout..."
time="2019-05-16T23:31:11.510504700Z" level=debug msg="found 1 orphan layers"
time="2019-05-16T23:31:11.511193900Z" level=debug msg="removing orphan layer, chain ID: sha256:cc83dc73fcbf0e8b0ce6bce79e23643e9ffc7773af69f831e7d1f2f5764d44b2 , cache ID: bc365c16a135afa6ee813b7f550d7723d135157a0080dc1c25489314f938b656"
time="2019-05-16T23:31:11.511193900Z" level=debug msg="hcsshim::GetComputeSystems - Begin Operation"
time="2019-05-16T23:31:11.512310200Z" level=debug msg="HCS ComputeSystem Query" json="{}"
time="2019-05-16T23:31:11.513192900Z" level=debug msg="hcsshim::GetComputeSystems - End Operation - Success"
time="2019-05-16T23:31:11.514189800Z" level=debug msg="hcsshim::DestroyLayer" path="C:\\ProgramData\\docker\\windowsfilter\\bc365c16a135afa6ee813b7f550d7723d135157a0080dc1c25489314f938b656-removing"
time="2019-05-16T23:31:11.678774400Z" level=debug msg="hcsshim::DestroyLayer - succeeded" path="C:\\ProgramData\\docker\\windowsfilter\\bc365c16a135afa6ee813b7f550d7723d135157a0080dc1c25489314f938b656-removing"
time="2019-05-16T23:31:11.678774400Z" level=debug msg="Removing folder: C:\\ProgramData\\docker\\image\\windowsfilter\\layerdb\\sha256\\cc83dc73fcbf0e8b0ce6bce79e23643e9ffc7773af69f831e7d1f2f5764d44b2-0-removing"
time="2019-05-16T23:31:11.687751900Z" level=debug msg="Clean shutdown succeeded"

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah thaJeztah added this to the 19.03.1 milestone Jun 7, 2019
@thaJeztah
Member Author

Added this to the 19.03.1 milestone so that we can further discuss whether we want this backported for that patch release.

/cc @olljanat @cpuguy83

@olljanat

olljanat commented Jun 7, 2019

I would like to get this backported as we see these issues in production. On the other hand, they have been there with Windows containers from day one (the logic I needed to change was introduced by moby#17924, long before Windows containers were even supported), so if no other users are reporting this issue then it is probably not that important to get backported.

It is also worth noting that, as it took me a while to get the fix implemented, we have also created a workaround by adding some randomness to the image build scripts so that layers are not re-used on newer versions (an ugly way, but at least it works).

@thaJeztah thaJeztah modified the milestones: 19.03.1, 19.03.2 Jul 30, 2019
@olljanat

olljanat commented Aug 1, 2019

@thaJeztah @cpuguy83 so how about it, can we carry this as part of 19.03.2?

@thaJeztah thaJeztah force-pushed the 19.03_backport_38488_layer_garbage_collector branch from c868302 to 3767a57 on August 9, 2019 01:34
@thaJeztah thaJeztah changed the title [WIP][19.03 backport] Added garbage collector for image layers [19.03 backport] Added garbage collector for image layers Aug 9, 2019

@ddebroy ddebroy left a comment


I don't know this codebase much - pointed out a couple of concerns.

func (fms *fileMetadataStore) getOrphan() ([]roLayer, error) {
	var orphanLayers []roLayer
	for _, algorithm := range supportedAlgorithms {
		fileInfos, err := ioutil.ReadDir(filepath.Join(fms.root, string(algorithm)))

Are there any scenarios where the paths here may be volume GUID prefixed?


No. Those layer metadata folders are in the format:
/var/lib/docker/image/<storage driver>/layerdb/<hash algorithm>/<digest>
example: /var/lib/docker/image/overlay2/layerdb/sha256/ffc4c11463ee21b7532b63abd6079393c619a5d0f4b00397a4b9d1cf9efc4d9b

and when the deleteLayer() function is called, it renames the folder to:
/var/lib/docker/image/overlay2/layerdb/sha256/ffc4c11463ee21b7532b63abd6079393c619a5d0f4b00397a4b9d1cf9efc4d9b-removing
which can then be found by getOrphan().
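As a rough illustration of the discovery side, here is a sketch of scanning for those renamed folders (a hypothetical findOrphanDirs helper; the actual getOrphan() works on the store's internal roLayer type):

```go
package main

import (
	"io/ioutil"
	"path/filepath"
	"strings"
)

// findOrphanDirs lists layer metadata folders that carry the "-removing"
// suffix, i.e. layers that were marked for removal but whose cleanup has
// not completed yet. Names here are illustrative, not the exact PR code.
func findOrphanDirs(layerdbRoot string, algorithms []string) ([]string, error) {
	var orphans []string
	for _, algo := range algorithms {
		fileInfos, err := ioutil.ReadDir(filepath.Join(layerdbRoot, algo))
		if err != nil {
			return nil, err
		}
		for _, fi := range fileInfos {
			if fi.IsDir() && strings.HasSuffix(fi.Name(), "-removing") {
				orphans = append(orphans, filepath.Join(layerdbRoot, algo, fi.Name()))
			}
		}
	}
	return orphans, nil
}
```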

@@ -305,6 +305,47 @@ func (fms *fileMetadataStore) GetMountParent(mount string) (ChainID, error) {
return ChainID(dgst), nil
}

func (fms *fileMetadataStore) getOrphan() ([]roLayer, error) {

Is it possible to have a unit-test around the logic in getOrphan in filestore_test.go please?


Sure. Implemented it in moby#39715.
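For illustration only, a test along these lines could create a marked folder under a temporary layerdb root and assert it is picked up. This sketches the idea against the hypothetical findOrphanDirs helper above, not the real test added to filestore_test.go in moby#39715:

```go
package main

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"
)

func TestFindOrphanDirs(t *testing.T) {
	root, err := ioutil.TempDir("", "layerdb")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(root)

	// One healthy layer folder and one that was marked for removal.
	healthy := filepath.Join(root, "sha256", "aaaa")
	orphan := filepath.Join(root, "sha256", "bbbb-0-removing")
	for _, dir := range []string{healthy, orphan} {
		if err := os.MkdirAll(dir, 0755); err != nil {
			t.Fatal(err)
		}
	}

	got, err := findOrphanDirs(root, []string{"sha256"})
	if err != nil {
		t.Fatal(err)
	}
	if len(got) != 1 || got[0] != orphan {
		t.Fatalf("expected only the -removing folder, got %v", got)
	}
}
```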

@thaJeztah thaJeztah modified the milestones: 19.03.2, 19.03.3 Aug 30, 2019
@thaJeztah thaJeztah force-pushed the 19.03_backport_38488_layer_garbage_collector branch from 3767a57 to 8b1458a on September 12, 2019 08:33
@thaJeztah thaJeztah changed the title [19.03 backport] Added garbage collector for image layers [WIP][19.03 backport] Added garbage collector for image layers Sep 12, 2019
@thaJeztah
Member Author

Waiting for #268 to be merged, so that I can cherry-pick that one

@thaJeztah thaJeztah force-pushed the 19.03_backport_38488_layer_garbage_collector branch from 8b1458a to 5cc822d on September 16, 2019 11:55
@thaJeztah
Member Author

cherry-picked moby#39715; removing "WIP"

@thaJeztah thaJeztah changed the title [WIP][19.03 backport] Added garbage collector for image layers [19.03 backport] Added garbage collector for image layers Sep 16, 2019
@andrewhsu andrewhsu modified the milestones: 19.03.3, 19.03.4 Sep 23, 2019
@thaJeztah thaJeztah force-pushed the 19.03_backport_38488_layer_garbage_collector branch 2 times, most recently from 5c86c87 to 2fa6110 on September 30, 2019 11:15
@thaJeztah
Member Author

ping @ddebroy @vikramhh PTAL - I think we have all parts in this PR now

@vikramhh

LGTM

@thaJeztah
Member Author

@kolyshkin @andrewhsu PTAL

@thaJeztah thaJeztah modified the milestones: 19.03.4, 19.03.5 Oct 11, 2019
@andrewhsu andrewhsu modified the milestones: 19.03.5, 19.03.6 Nov 12, 2019
@olljanat

ping @ddebroy @kolyshkin @andrewhsu
I found the broken layer issue once again on our servers and would like to move this one forward.

@ddebroy

ddebroy commented Nov 26, 2019

I had a couple of concerns that were addressed above. So LGTM from me (not a maintainer and don't have a lot of context around this code).

olljanat added a commit to olljanat/moby that referenced this pull request Nov 27, 2019
@cpuguy83

Just to voice a concern I have so far only expressed in Slack...

I am worried about bringing this change into a mature release.

@thaJeztah
Member Author

So, should we close this one as being too risky?

(if so, remind me to revisit #384 and see if I can modify that one to work without this PR)

@olljanat

Fair enough. Let's close this one here then, and make sure that it works correctly in the next major release.
Most probably moby#38401 (or a modified version of it) will also be needed.

And FYI for anyone who might read this: I created a custom package with this change included (https://github.com/olljanat/moby/releases/tag/19.03.5-olljanat1) and we will start using it in our environments.

As a side comment: based on testing today, I noticed in one environment that at least Windows Defender causes issues with layer handling from time to time, even when I have exclusions for the C:\ProgramData\docker folder and the dockerd.exe/docker.exe processes (which is one of the reasons the garbage collector is needed after all).

@Drewster727

@thaJeztah any idea on a target date or version for this? I have Windows nodes chewing up disk space that require regular maintenance :)

olljanat and others added 4 commits January 25, 2020 14:05
Refactored existing logic so that layers are first marked as being under
removal; if the actual removal fails, they can be found on disk and
cleaned up.

A full garbage collector will be implemented as part of the containerd
migration.

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
(cherry picked from commit 213681b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
1. Reduce complexity due to nested if blocks by using early
return/continue
2. Improve logging

Changes suggested as a part of code review comments in 39748

Signed-off-by: Vikram bir Singh <vikrambir.singh@docker.com>
(cherry picked from commit ebf12db)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
(cherry picked from commit d6cbeee)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
(cherry picked from commit 8660330)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah thaJeztah force-pushed the 19.03_backport_38488_layer_garbage_collector branch from 2fa6110 to 43abde3 on January 25, 2020 13:06
@thaJeztah
Member Author

moved to moby#40412

@thaJeztah thaJeztah closed this Jan 25, 2020
@thaJeztah thaJeztah removed this from the 19.03.6 milestone Jan 25, 2020
olljanat added a commit to olljanat/moby that referenced this pull request Sep 14, 2020
olljanat added a commit to olljanat/moby that referenced this pull request Oct 26, 2020
@thaJeztah thaJeztah deleted the 19.03_backport_38488_layer_garbage_collector branch March 24, 2022 23:35