Add option to not copy files when using share files and web ui/cli #3542

Closed
2 tasks done
shortcipher3 opened this issue Aug 11, 2021 · 2 comments

@shortcipher3
Contributor

My actions before raising this issue

When importing image files from the share, I do not have Copy data into CVAT checked, but the images are still zipped and copied into the cvat_server Docker container under the paths /home/django/data/data/<id>/original/*.zip and /home/django/data/data/<id>/compressed/*.zip. With the checkbox checked, images are also copied to /home/django/data/data/<id>/raw/. As a result, when importing images that are already stored on my server into CVAT, I end up with 3-4 copies of each image: the original images, the original images zipped, the compressed images zipped, and the raw images (if the checkbox is checked). Many people would like to have only one copy of the images on disk.

Notes:

The compressed/0.zip is used by the UI to load images when labeling. Maybe we want to keep a compressed version around for the web UI? A more storage-friendly solution might be using progressive decoding or JPEG XL, where only part of the original image file is transferred if a lower resolution is desired (and the original file supports it).

The original/0.zip is used when exporting the dataset. Zipping the files doesn't do anything to reduce their size and it seems equally valid to provide the original images if they are available on the persistent share volume.

#2377 removed a single copy of the images in the raw folder.
#2862 would like to see the functionality from #2377 applied to CLI usage.
#204 asked for this same functionality. Maybe I should re-open it, but enough time has passed and enough other developments have been made that I opted to create a new issue.

Expected Behaviour

While importing image files from the share with Copy data into CVAT unchecked, no images should be copied into the Docker container, and available disk space should stay approximately the same. There should be a flag to duplicate this behavior when importing with the CLI tool.

Current Behaviour

While importing image files from the share with Copy data into CVAT unchecked, images are zipped and copied into the cvat_server Docker container under:

  • /home/django/data/data/<id>/original/0.zip
  • /home/django/data/data/<id>/compressed/0.zip

The checkbox does stop images from being copied to /home/django/data/data/<id>/raw/, but the CLI does not have a flag to duplicate this behavior (duplicate of #2862).
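To put numbers on the duplication, the per-folder disk usage of a task's data directory can be summed up. This is only a minimal sketch: `folder_sizes` is a hypothetical helper name, and it assumes the directory has already been copied out of the container (or is otherwise readable from the host).

```python
import os
from pathlib import Path


def folder_sizes(task_dir):
    """Return {subfolder name: total bytes} for each direct subfolder of task_dir."""
    sizes = {}
    for sub in sorted(Path(task_dir).iterdir()):
        if not sub.is_dir():
            continue
        total = 0
        # Walk the subfolder recursively and add up the size of every file.
        for root, _dirs, files in os.walk(sub):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[sub.name] = total
    return sizes
```

Calling `folder_sizes("./<id>")` on a copied task directory would report the bytes held in `original/`, `compressed/` and, if present, `raw/`, which can then be compared with the size of the source images on the share.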

Possible Solution

Could create soft-links to the original files and add support for serving image files instead of just zip.

Could keep track of where the files live in the share volume and serve them directly from there. For a compressed version, could support lower-resolution images for file formats that support progressive decoding (e.g. JPEG XL, FLIF).
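The soft-link idea could look roughly like the sketch below. It is not CVAT code: `link_share_files` and its parameters are hypothetical names, and it assumes the share is mounted at the same path wherever the links are later resolved.

```python
import os
from pathlib import Path


def link_share_files(share_root, relative_paths, task_dir):
    """Create symlinks in task_dir that point at files under share_root.

    Returns the list of created link paths. No image bytes are copied.
    """
    share_root = Path(share_root).resolve()
    task_dir = Path(task_dir)
    links = []
    for rel in relative_paths:
        source = (share_root / rel).resolve()
        # Refuse paths that escape the share (for example "../../etc/passwd").
        if share_root not in source.parents:
            raise ValueError(f"{rel} is outside the share")
        if not source.is_file():
            raise FileNotFoundError(source)
        link = task_dir / rel
        link.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(source, link)
        links.append(link)
    return links
```

Each link costs only a directory entry, so disk usage stays roughly constant regardless of how many images are imported. The server would still need to be able to serve individual image files rather than only zip chunks.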

Steps to Reproduce (for bugs)

  1. Create a share volume (instructions)
  2. http://localhost:8080/tasks/create
  • select connected file share
  • select an image file
  • under advanced make sure Copy data into CVAT unchecked
  • submit
  3. Run docker exec -ti cvat ls -1v /home/django/data/data | tail -n1 to get the id
  4. Run docker cp cvat:/home/django/data/data/<id> ./ to copy the files for local inspection. Unzipping <id>/original/0.zip and <id>/compressed/0.zip shows that their contents are derived from the original file.
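The last check can also be scripted instead of unzipping by hand. The sketch below compares SHA-256 digests of the zip members with the original files. `members_matching_originals` is a hypothetical helper, and a byte-for-byte match is only expected if the chunk stores the files unmodified, which is an assumption here. Members of compressed/0.zip are re-encoded, so they would normally not match.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_bytes(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def members_matching_originals(zip_path, original_paths):
    """Return {zip member name: original path} for members whose bytes equal an original file."""
    originals = {sha256_bytes(Path(p).read_bytes()): str(p) for p in original_paths}
    matches = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            digest = sha256_bytes(archive.read(info))
            if digest in originals:
                matches[info.filename] = originals[digest]
    return matches
```

An empty result for original/0.zip would not prove the images are unrelated, only that they were rewritten on the way in.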

Context

When importing images using the CLI in an automated fashion, I found my import had halted because the hard drive ran out of space, even though I had 100 GB of free disk space before starting. The import also took much longer than expected, given that the files were already on the server.

Your Environment

  • Git hash commit (git log -1): commit 472d535
  • Docker version docker version (e.g. Docker 17.0.05): 20.10.8
  • Are you using Docker Swarm or Kubernetes? - no
  • Operating System and version (e.g. Linux, Windows, MacOS): Ubuntu 21.04
@Marishka17
Contributor

@shortcipher3, Hi,
Currently, there are 2 ways to create an annotation task:

  1. You do not enable the checkbox Use cache. In this case the necessary chunks (e.g. /home/django/data/data/<id>/original/0.zip, /home/django/data/data/<id>/compressed/0.zip) are prepared during task creation and saved in the folders ../original/ and ../compressed/
  2. You enable the checkbox Use cache and in this case, the task is created on the fly, no data copies are created in the folders (original/compressed), and the necessary chunks are prepared as needed and stored in the cache.
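As a rough illustration of the second mode (this is not CVAT's actual implementation), chunks can be built from the source files on first request and kept in a cache. `LazyChunkCache` is a hypothetical name, and a plain in-memory dict stands in for whatever cache backend is really used.

```python
import io
import zipfile
from pathlib import Path


class LazyChunkCache:
    """Build zip chunks from image files on first request and keep them in memory."""

    def __init__(self, image_paths, chunk_size):
        self.image_paths = [Path(p) for p in image_paths]
        self.chunk_size = chunk_size
        self._cache = {}
        self.builds = 0  # number of chunks that were actually built

    def get_chunk(self, index):
        """Return the bytes of chunk `index`, building it only if it is not cached."""
        if index not in self._cache:
            start = index * self.chunk_size
            paths = self.image_paths[start:start + self.chunk_size]
            if not paths:
                raise IndexError(f"no chunk {index}")
            buffer = io.BytesIO()
            # ZIP_STORED: image formats are already compressed, so skip deflate.
            with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
                for offset, path in enumerate(paths):
                    archive.write(path, arcname=f"{start + offset:06d}{path.suffix}")
            self._cache[index] = buffer.getvalue()
            self.builds += 1
        return self._cache[index]
```

Nothing is written at task creation in this scheme. A chunk only takes up space once somebody actually asks for it.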

@shortcipher3
Contributor Author

Thanks for clearing that up.
