Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing config file lock mechanism #352

Closed
wants to merge 13 commits into from

Conversation

sverdlov93
Copy link
Contributor

@sverdlov93 sverdlov93 commented Mar 9, 2022

  • All tests passed. If this feature is not already covered by the tests, I added new tests.
  • This pull request is on the dev branch.
  • I used gofmt for formatting the code before submitting the pull request.

Fix config commands locking mechanism.
Make all config commands that write to config file - atomic.

Also, get lock's file modified time from os, instead of using time.Now() (as suggested by @HWoidt)

should fix #350

@sverdlov93 sverdlov93 requested a review from eyalbe4 March 9, 2022 12:39
@HWoidt
Copy link

HWoidt commented Mar 9, 2022

Hi @sverdlov93
Thank you for your fast response! On first sight your fix looks good to me. If you provide me with build-instructions or an ubuntu-x64 executable of jfrog-cli including this change, I can test whether it solves the issue on our side.

I created a short reproducer script:

#!/bin/bash
trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

for i in 1 2 3 4 5 6 7 8 9 10; do
    jf config remove --quiet testserver$i &>/dev/null
    (while true; do
        jf config add testserver$i --url test.local --user blub --interactive=false
        jf config remove --quiet testserver$i
    done
    ) &
done

wait

This typically triggers the corruption within a few seconds up to one Minute or so (be sure to back up your jfrog-cli.conf ;) ):

$ ./test_jfrog_race.sh
14:48:31 [Error] Server ID 'testserver4' already exists.
14:48:31 [Error] Server ID 'testserver1' already exists.
14:48:31 [Info] "testserver2" configuration could not be found.

14:48:31 [Error] Server ID 'testserver4' already exists.
14:48:31 [Error] Server ID 'testserver6' already exists.
14:48:31 [Error] Server ID 'testserver8' already exists.
14:48:31 [Info] "testserver6" configuration could not be found.

14:48:31 [Info] "testserver8" configuration could not be found.

14:48:31 [Info] "testserver7" configuration could not be found.

14:48:31 [Error] Server ID 'testserver8' already exists.
14:48:31 [Info] "testserver4" configuration could not be found.

14:48:31 [Info] "testserver2" configuration could not be found.

14:48:31 [Info] "testserver3" configuration could not be found.

14:48:31 [Info] "testserver5" configuration could not be found.

14:48:31 [Info] "testserver8" configuration could not be found.

14:48:32 [Error] Server ID 'testserver4' already exists.
14:48:32 [Error] Server ID 'testserver5' already exists.
14:48:32 [Error] invalid character 't' after top-level value
14:48:32 [Error] invalid character 't' after top-level value
14:48:32 [Error] invalid character 't' after top-level value
14:48:32 [Error] invalid character 't' after top-level value

@sverdlov93
Copy link
Contributor Author

Hi @HWoidt,
First of all, Thanks for reporting this issue and for your very informative logs and data.
Our apologies for the inconvenience caused by this.

I am currently working on the fix, and have some more changes and tests to add.
I will update here as soon as it's done and ready for merge.

@sverdlov93 sverdlov93 changed the title Fixing lock sorting mechanism issue Fixing config file loc mechanism Mar 14, 2022
@sverdlov93 sverdlov93 changed the title Fixing config file loc mechanism Fixing config file lock mechanism Mar 14, 2022
@sverdlov93
Copy link
Contributor Author

sverdlov93 commented Mar 15, 2022

Hi, @HWoidt.
The above PR is waiting for review and contains:

  • Fix config commands locking mechanism.
  • Make all config commands that write to config file - atomic.
  • Also, (although, eventually it wasn't the cause for the issue) get lock's file modified time from os, instead of using time.Now().

If you want to test it on your machine you can pull JFrog CLI PR and run ./build/build.sh to get the executable.

common/commands/config.go Outdated Show resolved Hide resolved
common/commands/config.go Outdated Show resolved Hide resolved
Copy link

@HWoidt HWoidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sverdlov93 ,
Sorry for letting you wait. I had other stuff on my plate.
I have now tested your changes on our Ubuntu 20.04 machine with the bash script I provided earlier:

  1. When running tasks concurrently the command often fails because the Stat-call (lock.go:169) on an already deleted lock-file causes the command to fail instead of being retried. I changed this locally and made the following observations with the Stat error ignored.
  2. There sometimes is a deadlock, both processes are waiting until their lock-retry loop runs out after ~2 minutes and then return with a failure. I suspect this to be due to two files having the same timestamp. This tie can occur with higher frequency than before because the filesystem timestamps have lower resolution.
  3. The corruption can still be observed, although with vastly lower probability. This occurred maybe 2 times in the 3h that I left it running with 40 concurrent processes. I fear that we are running into the same conceptual problem as with the previous solution: The file-creation and timestamp generation are not within the same critical section. Probably the filesystem first prepares all file meta-data (including the timestamp) and only then commits the changes to disk, allowing for a time window in which another process can run through the whole sequence of creating the file and checking all timestamps. Since this occurs within the same system call, the time between timestamp-creation and visibility of the new file is shorter, resulting in a lower probability of hitting this issue.

With your changes I see much reduced probability for this to occur. We initially ran into this after about 1 Month, while starting three concurrent jf-config processes at the same time once a day. Considering that in the tests the error now happened after more than 30min compared to the 10-20s it was before and with 4 times as many threads, we can assume at least a factor of 100x reduction in probability. On our CI server we have circumvented this issue now by assigning individual config files per worker, using the JFROG_CLI_HOME_DIR environment variable.

For correctness it would probably be best to use a mechanism explicitly designed for file locking (i.e. LockFileEx on windows and flock on linux). There is already a cross platform go implementation used for exactly the same usecase of protecting configfiles in the "go" commandline tool: go.internal.lockedfile. Unfortunately this is an internal package. There is a project containing useful internal packages from the go standard library which exports this though: https://github.com/rogpeppe/go-internal

}
// Last element is the timestamp.
time, err := strconv.ParseInt(splitted[4], 10, 64)
fileInfo, err := os.Stat(path)
if err != nil {
return nil, errorutils.CheckError(err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between getting the filesList and calling stat on the file, the other process may well be already done. In that case the file does not exist anymore which is normal and no cause for concern. As such, this error should be ignored here:

Suggested change
return nil, errorutils.CheckError(err)
continue

Copy link
Contributor Author

@sverdlov93 sverdlov93 Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HWoidt,
Actually, when I worked on the fix I tried using LockFileEx/flock/lockedfile and some others.
Some worked for specific OS only and some were deprecated or unmaintained.
BTW, I saw that Go has already accepted exporting part of the lockedfile code but it's not merged yet.

I improved the Stat section and I will try to look again at the timing issue.

Copy link
Contributor Author

@sverdlov93 sverdlov93 Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HWoidt
I created also #367 with the same improved config file commands but with the old time.now() mechanism.
Can you try that one also to see if it's better?

case Clear:
return cc.clear()
default:
return fmt.Errorf("Not supported config command type: " + string(cc.cmdType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not supported... --> Unsupported...

@@ -211,7 +234,7 @@ func (cc *ConfigCommand) prepareConfigurationData() ([]*config.ServerDetails, er
return configurations, err
}

/// Returning the first non empty value:
/// Returning the first non-empty value:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// --> //

@eyalbe4 eyalbe4 added the bug Something isn't working label May 14, 2022
@sverdlov93 sverdlov93 closed this Jun 12, 2022
@arcanericky arcanericky mentioned this pull request Jun 24, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants