Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error writing to datastore on Windows 10 #7115

Closed
hsn10 opened this issue Apr 8, 2020 · 27 comments · Fixed by #7133 or #7178
Closed

Error writing to datastore on Windows 10 #7115

hsn10 opened this issue Apr 8, 2020 · 27 comments · Fixed by #7133 or #7178
Assignees
Labels
kind/bug A bug in existing code (including security flaws)
Milestone

Comments

@hsn10
Copy link

hsn10 commented Apr 8, 2020

Version information:

0.5 rc1

Description:

New flatfs datastore version introduced in 0.5 has problems with writing blocks on Windows 10.

2020-04-07T16:46:32.917+0200    �[31mERROR�[0m  bitswap go-bitswap@v0.2.8/bitswap.go:342
Error writing 3 blocks to datastore: 
  rename C:\tmp\ipfs2\blocks\.temp\temp-043813542
 C:\tmp\ipfs2\blocks\AJ\CIQCCVMFRWGRR2S7OUU7F3JWGSYDBLXX7SJKXT343KSIH2XRLPM3AJI.data: Access is denied.
github.com/ipfs/go-bitswap.(*Bitswap).receiveBlocksFrom
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/bitswap.go:342
github.com/ipfs/go-bitswap.(*Bitswap).ReceiveMessage
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/bitswap.go:419
github.com/ipfs/go-bitswap/network.(*impl).handleNewStream
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/network/ipfs_impl.go:271
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1
        pkg/mod/github.com/libp2p/go-libp2p@v0.7.4/p2p/host/basic/basic_host.go:417

There is a very large number of temp-* files in blocks\.temp directory which means that this error happens quite often. It does not fail writing for all blocks, some blocks are written correctly. My guess is that file has to be closed first before it can be renamed on windows platform.

@hsn10 hsn10 added the kind/bug A bug in existing code (including security flaws) label Apr 8, 2020
@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 8, 2020

We do close the files before renaming.

This suggests that some other process (like an antivirus) might be opening them though.

It would be helpful to try this golang/go#36568 (comment) and see if it's indeed the antivirus.

The change we made was to create the temp files in specific temp-folder that can be cleaned up later, rather than creating in their final folder and rename them there. I guess the fact that it has to move cross-folders causes issues.

Workarounds:

  • See if an antivirus is touching those files and prevent it from doing it
  • Switch to badger datastore instead of flatfs

@Stebalien I don't see a good fix here, should we revert the temp file thing for the moment while we rethink it?

@Stebalien
Copy link
Member

Stebalien commented Apr 8, 2020

@hsn10 does this reproduce on go-ipfs 0.4.23

@hsanjuan I don't see how that could be affecting this. Maybe we need to give these files a .data extension to get the AV to ignore it? edit: nevermind, that shouldn't be related

@hsn10
Copy link
Author

hsn10 commented Apr 8, 2020

Older IPFS works correctly. I checked antivirus and there are no exceptions defined for old IPFS data directory.

@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 8, 2020

I checked antivirus and there are no exceptions defined for old IPFS data directory.

But does it still happen if you add such exceptions?

Stebalien added a commit that referenced this issue Apr 9, 2020
@hsn10
Copy link
Author

hsn10 commented Apr 9, 2020

I added exception to AVG antivirus and there is still number of files left in blocks\.temp and occasional exceptions - No change from previous behaviour, I didn't get Error writing to datastore stack trace every time.

I am not sure if defined exception in AVG actually works because while browsing http://docs.ipfs.io.ipns.localhost:8080/guides/concepts/dht/ I got error:

internalWebError: open C:\tmp\IPFS2\blocks\52\AFYBEIADW54ULQFTVNGZ2A4RHZCY77MD5VFGOKUYKSKG6BOEZTW4W7O52A.data: The process cannot access the file because it is being used by another process.

To be sure I disabled AVG completely and these errors are still there. No change - still temp files stuck and stack traces.

@Stebalien
Copy link
Member

Can you build and use a custom branch: https://github.com/ipfs/go-ipfs/tree/test/flatfs?

This branch reverts to a version of flatfs before the change, to make sure we're on the right track.

@Stebalien
Copy link
Member

Note: the easiest way to build on windows is to run:

GOPROXY=direct go get github.com/ipfs/go-ipfs/cmd/ipfs@test/flatfs

Alternatively, you can use a pre-built binary.

@hsn10
Copy link
Author

hsn10 commented Apr 9, 2020

I did the same testing at your provided binary, IPFS directory added to antivirus exceptions.

This error is still there:

internalWebError: open C:\tmp\IPFS2\blocks\GG\AFKREIGC7KSDCN3EL747P3XF4JNA5VQ3ENROYJPDYZITFED7FDEL3SSGGQ.data: The process cannot access the file because it is being used by another process.

There is lot of put-DDDDD files in repository. I checked ipfs repo from stable branch and no put-DDDD files are there. After inspecting blocks and checking file sizes of put-DDDDD and an .data block file it looks like everything what is in put-DDDDD file left behind is also in .data file, so it looks like cleanup problem.

Sample of such directory, look carefully at file dates:

$ ls --full-time
total 20
-rw-r--r-- 1 hsn 197609  816 2020-04-09 10:55:15.785266500 +0200 AFYBEIA465WMGALGEBK5FC53RZO3H7PCAR4TBPWTYZLHIFFZGMIB6YQ2AU.data
-rw-r--r-- 1 hsn 197609 5796 2020-04-09 10:45:45.701833100 +0200 CIQMEQ4UZMWVETA2EP6YGXPVBJFBNCT2X7Y5FA2IJHNAANDLCEYX2AA.data
-rw-r--r-- 1 hsn 197609 5796 2020-04-09 10:45:45.703783400 +0200 put-764934526

and more interesting case:

$ ls --full-time
total 16
-rw-r--r-- 1 hsn 197609 3202 2020-04-09 10:57:00.550484900 +0200 AFYBEIC6CKGQFF3ALSQPXGOCSIZYJACBR4HOTEGA7D7JA7E3LFGQYHBA2A.data
-rw-r--r-- 1 hsn 197609 3202 2020-04-09 10:57:00.553414300 +0200 put-310544969
-rw-r--r-- 1 hsn 197609 3202 2020-04-09 10:57:00.557319000 +0200 put-343986019
-rw-r--r-- 1 hsn 197609 3202 2020-04-09 10:57:00.552437300 +0200 put-348454676

It looks like put file left behind is written after block already finished downloading from other source, rename to an existing file fails but there is no cleanup procedure for deleting stale put-DDDDDD.

@Stebalien
Copy link
Member

Yeah, ok, so it's likely due to some change in timing in bitswap. It looks like we should be opening all files in SHARE_DELETE mode.

@Stebalien
Copy link
Member

Could you try the build linked in #7133 (comment)? (or just build that branch).

@hsn10
Copy link
Author

hsn10 commented Apr 10, 2020

Testing go-ipfs version: 0.5.0-rc1-924e870 shows no change in behaviour

@Stebalien
Copy link
Member

Ok, well, we're now opening these files with FILE_SHARE_DELETE. There's still a potential issue where a get could fail because a put is in progress, but we're seeing the reverse here which should be impossible given only IPFS.

Do you use any local search/indexing tools? Are you storing your IPFS repo in something like Dropbox?

Given the fact that we're somehow failing to delete the temporary file, something must be keeping these files open.

@hsn10
Copy link
Author

hsn10 commented Apr 10, 2020

I would not blame external influence because old 0.4 version works just fine.

@willscott
Copy link
Contributor

An updated stack trace / error output from the latest version would be useful for continued investigation, if you can reliably reproduce

@hsn10
Copy link
Author

hsn10 commented Apr 10, 2020

2020-04-10T08:31:20.393+0200    �[31mERROR�[0m  bitswap go-bitswap@v0.2.8/bitswap.go:342
Error writing 1 blocks to datastore: rename C:\tmp\IPFS2\blocks\.temp\temp-728509826
C:\tmp\IPFS2\blocks\RX\CIQHHYWDJDJTLTI6E23HFHS3NYHRAHANXG36R7F3RJBHJQJGESQIRXA.data: Access is denied.
github.com/ipfs/go-bitswap.(*Bitswap).receiveBlocksFrom
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/bitswap.go:342
github.com/ipfs/go-bitswap.(*Bitswap).ReceiveMessage
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/bitswap.go:419
github.com/ipfs/go-bitswap/network.(*impl).handleNewStream
        pkg/mod/github.com/ipfs/go-bitswap@v0.2.8/network/ipfs_impl.go:271
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1
        pkg/mod/github.com/libp2p/go-libp2p@v0.7.4/p2p/host/basic/basic_host.go:417

That exception do not relate 1:1 to every file left behind. Its about 15 files left behind : 1 exception.

@Stebalien
Copy link
Member

I would not blame external influence because old 0.4 version works just fine.

Are you still seeing temporary files get left behind? We delete the temporary file on failure so that should only be possible if some outside application is opening them.

Unfortunately, this working in previous go-ipfs versions could simply be a timing difference.

One thing I have noticed is that we're now using batch puts in bitswap instead of putting each node sequentially. I'll put together a patch to revert that and see if it helps.


Things I've checked:

  1. Has & GetSize shouldn't prevent a Put. While they may, internally, call windows.CreateFile (inside os.Stat), they'll do so with a 0 access mode which should be fine (unless syscall: ReadLink() & Utimes() arguments for access-mode & share-mode are wrong golang/go#35150 (comment) is wrong).
  2. Get shouldn't prevent a Put because we open files with FILE_SHARE_DELETE.
  3. Put/Delete shouldn't prevent a Put because we only allow one write operation to a given key at any time.
  4. Put could prevent a Get/Has, temporarily. However, that's not relevant here.
  • os.Rename is not atomic on windows. The file might not be there.
  • os.Rename may not open the target file with FILE_SHARE_DELETE. (not sure)

@Stebalien
Copy link
Member

Stebalien commented Apr 10, 2020

Hm. I do have an alternative explanation for why we might not be deleting temporary files. I need to check it.

edit: yes, we had a bug there. But that still doesn't explain this behavior.

@Stebalien
Copy link
Member

Could you try the test/flatfs branch again, or ipfs.zip? This version will just ignore the put errors if the put was idempotent anyways (which is always the case here).

@hsn10
Copy link
Author

hsn10 commented Apr 11, 2020

No change

@Stebalien
Copy link
Member

Could you be more specific?

  1. Are you still getting the error from Error writing to datastore on Windows 10 #7115 (comment)?
  2. Are temporary files still being left behind?

The second issue should definitely have been fixed.

If you're still seeing the first issue, something is preventing us from opening the target block for reading, not just for writing.

@hsn10
Copy link
Author

hsn10 commented Apr 11, 2020

Temporary files are left behind, I didn't tested long enough to wait for stacktrace it needs about 10 minutes to show.

@hsn10
Copy link
Author

hsn10 commented Apr 16, 2020

What about to make old 0.4.X version with new datastore for testing purposes?

@Stebalien
Copy link
Member

Here's a test binary of the previous release with the new datastore. It's not perfect because I needed to upgrade multiple datsatores (interface changes) but it should (hopefully) work:
ipfs.zip

I expect this version will work, but you're right in that we should still test it.

I believe this issue is as follows:

In the previous release, when we wrote blocks, we'd do the following repeatedly:

  1. Check to see if we already have the block. If so, return success.
  2. Write the block to a temporary file.
  3. Move the block into place.

Now, we collect a batch of blocks, then:

  1. Filter the blocks to figure out which one's we're missing.
  2. Write temporary files for all the blocks.
  3. Close and sync each temporary file.
  4. Move the temporary files into place.

If we're writing multiple temporary files at once, this opens up a larger window for thrid-party applications to open these temporary files and start messing with them.

That's why we asked you to disable your AV (and I'd still like you to try to completely disable all AVs). I've tried to reproduce this issue on my own Windows machine but couldn't.

As for how to proceed:

  1. Tell Windows users with this problem to use the badger datastore.
  2. Repeatedly try to rename these files with small sleeps. This is the standard approach to this problem.
  3. Create temporary files with DELETE access and keep them open until after we've moved them into place. However, that complicates the code a bit as we'd need to keep temporary files open longer.

I'm currently leaning towards 2 as it should "just work" in most cases and won't have any performance impact in the normal non-error case.

@hsn10
Copy link
Author

hsn10 commented Apr 16, 2020

This works ok. No file left behind and no exceptions after 20 minutes of testing

Stebalien added a commit that referenced this issue Apr 17, 2020
Another attempt to fix #7115.
@Stebalien Stebalien reopened this Apr 17, 2020
@Stebalien
Copy link
Member

Ok, we've updated flatfs to retry more aggressively. Could you try ipfs.zip?

@hsn10
Copy link
Author

hsn10 commented Apr 18, 2020

it seems to be fixed now. I also noticed that CPU usage is significantly higher compared to 0.4.X but it may not be caused by filesystem fix.

@Stebalien
Copy link
Member

It's probably but I'd love to take a look and see if we have a different issue. If you have some time, I'd appreciate it if you could follow the debug guide (while observing the increased CPU usage) and file an issue with the results.

Anyways, thanks for all your help in debugging this!

ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
Another attempt to fix ipfs#7115.
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
Another attempt to fix ipfs#7115.
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
Another attempt to fix ipfs#7115.
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
Another attempt to fix ipfs#7115.
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
Another attempt to fix ipfs#7115.
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
Another attempt to fix ipfs#7115.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants