Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix predefined sorting for task data #5083

Merged
merged 103 commits into from
Jun 8, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
103 commits
Select commit Hold shift + click to select a range
ae7cd3c
Add test
zhiltsov-max Oct 11, 2022
b9366cb
Implement file ordering support in task data
zhiltsov-max Oct 11, 2022
a5b4083
test stub
zhiltsov-max Oct 12, 2022
c133822
Reduce default status check period in CLI
zhiltsov-max Oct 12, 2022
0865096
v1
zhiltsov-max Oct 13, 2022
a790d74
v2
zhiltsov-max Oct 13, 2022
5f0dfbf
check ci
zhiltsov-max Oct 13, 2022
0066548
Update tests
zhiltsov-max Oct 14, 2022
3fbe420
Generate manifest in sdk on data submission with predefined sorting
zhiltsov-max Oct 17, 2022
b2e2ca6
Remove extra imports
zhiltsov-max Oct 17, 2022
66c588b
Fix rename
zhiltsov-max Oct 18, 2022
15ec41c
add sdk test
zhiltsov-max Oct 19, 2022
adaa769
Implement server support for v3
zhiltsov-max Oct 21, 2022
304f22c
t
zhiltsov-max Oct 21, 2022
6404855
Fix tests
zhiltsov-max Oct 24, 2022
6141279
Drop the old behavior
zhiltsov-max Oct 24, 2022
3a42c50
Remove rest api tests hacks
zhiltsov-max Oct 24, 2022
a5284f5
Add schema docs and require the file list
zhiltsov-max Oct 24, 2022
1281c26
Make the input specification optional
zhiltsov-max Oct 24, 2022
4fb6511
Remove unnecessary changes
zhiltsov-max Oct 24, 2022
6ae83cc
Update docs pages
zhiltsov-max Oct 25, 2022
2b65807
Simplify some code, add more docs
zhiltsov-max Oct 25, 2022
499d1dd
Add tests for local zip uploads with manifest
zhiltsov-max Oct 25, 2022
f136107
Fix server tests with manifests
zhiltsov-max Oct 26, 2022
15daa2f
Add tus multirequest uploading tests
zhiltsov-max Oct 27, 2022
972b9d2
Update sdk
zhiltsov-max Oct 27, 2022
09347f6
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Oct 27, 2022
e3d3e53
Remove extra request serializer from schema
zhiltsov-max Oct 27, 2022
a9cc165
Fix sdk tests
zhiltsov-max Oct 27, 2022
a550070
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Oct 27, 2022
bd1c567
Update changelog
zhiltsov-max Oct 27, 2022
84180fd
Remove test code
zhiltsov-max Oct 27, 2022
ae8e00f
Fix formatting
zhiltsov-max Oct 27, 2022
63937a5
Fix problem with chunk cache in tests
zhiltsov-max Oct 28, 2022
9836904
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Oct 31, 2022
c476196
Update manifest instructions
zhiltsov-max Nov 3, 2022
69a1886
Install dataset_manifest dependencies in the server image
zhiltsov-max Nov 3, 2022
7f03a6a
Fix dataset_manifest
zhiltsov-max Nov 3, 2022
5f5c9e7
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Nov 3, 2022
523436c
Merge remote-tracking branch 'origin/zm/fix-predefined-sorting' into …
zhiltsov-max Nov 3, 2022
a82aacb
Merge remote-tracking branch 'origin/develop' into zm/fix-predefined-…
zhiltsov-max Nov 3, 2022
4203569
Fix remark
zhiltsov-max Nov 3, 2022
9522ed4
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Dec 21, 2022
e45ee7b
Fix merge
zhiltsov-max Dec 21, 2022
c3f81c0
Apply suggestions from code review
zhiltsov-max Dec 21, 2022
132a4d7
t
zhiltsov-max Feb 16, 2023
af6d2a9
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Feb 16, 2023
06fc61d
remove some extra changes and fix some comments
zhiltsov-max Feb 16, 2023
582702e
Renaming and fixes
zhiltsov-max Feb 16, 2023
fdfd516
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 14, 2023
a48d675
Remove unnecessary python shebang change
zhiltsov-max Mar 14, 2023
b4328b7
Simplify dataset_manifest error taxonomy
zhiltsov-max Mar 14, 2023
ede18f7
Add IndexError
zhiltsov-max Mar 14, 2023
169be76
Refactor dataset_manifest
zhiltsov-max Mar 14, 2023
dd5ea62
Fix merge for manifest detection
zhiltsov-max Mar 14, 2023
5296c27
Fix cache location in tests
zhiltsov-max Mar 14, 2023
83a978a
Fix merge
zhiltsov-max Mar 14, 2023
f38b650
Remove support for omitted fields for server files
zhiltsov-max Mar 14, 2023
da4e5f4
Fix merge, refactor, fix tests
zhiltsov-max Mar 14, 2023
e2375dd
Fix cache cleaning in tests
zhiltsov-max Mar 14, 2023
d5f2e61
Fix imports
zhiltsov-max Mar 14, 2023
2c5e0b5
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 15, 2023
ca95dc6
Fix manifest handling
zhiltsov-max Mar 15, 2023
c0def68
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 16, 2023
69d27df
Move the new request description into the existing to avoid api changes
zhiltsov-max Mar 16, 2023
ae7e47a
Refactor some code, improve comments, descriptions, errors and naming
zhiltsov-max Mar 16, 2023
3d71a9f
Fix tests
zhiltsov-max Mar 16, 2023
8c20272
Fix remaining comments
zhiltsov-max Mar 16, 2023
fb64592
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 16, 2023
fccd9c2
Fix linter
zhiltsov-max Mar 16, 2023
d4adea2
Fix linter
zhiltsov-max Mar 16, 2023
79b56c5
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 17, 2023
efd9d3e
Update schema
zhiltsov-max Mar 17, 2023
6e413b1
Improve endpoint description
zhiltsov-max Mar 17, 2023
ee8cdd0
Merge branch 'develop' into zm/fix-predefined-sorting
nmanovic Mar 19, 2023
348d5b7
Rename field
zhiltsov-max Mar 20, 2023
07cc300
Mention Upload-Length has POST method in CVAT
zhiltsov-max Mar 20, 2023
e906604
Update tests
zhiltsov-max Mar 20, 2023
9f4eb1e
Address comments
zhiltsov-max Mar 20, 2023
5786244
Update schema
zhiltsov-max Mar 20, 2023
d1b0d65
Fix comparison
zhiltsov-max Mar 20, 2023
d78f127
Fix tests
zhiltsov-max Mar 20, 2023
6365dc4
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 20, 2023
cd5c4db
Add missing migration
zhiltsov-max Mar 20, 2023
84ec2ae
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Mar 21, 2023
d79dcf0
Merge branch 'develop' into zm/fix-predefined-sorting
nmanovic Mar 21, 2023
be9931e
Fix merge
zhiltsov-max Mar 21, 2023
bc2fcb0
Merge branch 'develop' into zm/fix-predefined-sorting
nmanovic Mar 24, 2023
f715675
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Apr 9, 2023
ee4d070
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Apr 18, 2023
2284320
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Apr 18, 2023
13cbbc3
Apply renaming to new uses
zhiltsov-max Apr 18, 2023
6815448
Revert some changes
zhiltsov-max Apr 18, 2023
da3c833
Update migrations
zhiltsov-max Apr 18, 2023
3b5c636
Merge branch 'develop' into zm/fix-predefined-sorting
zhiltsov-max Jun 5, 2023
db875bb
Move ordering parameter to the final data request
zhiltsov-max Jun 6, 2023
31bc9e2
Fix imports
zhiltsov-max Jun 6, 2023
313bbf1
Fix changelog
zhiltsov-max Jun 6, 2023
b874d26
Fix uploading in sdk
zhiltsov-max Jun 6, 2023
4acf217
Fix tests
zhiltsov-max Jun 6, 2023
f731d25
Fix formatting
zhiltsov-max Jun 6, 2023
0fc7213
Remove dead code
zhiltsov-max Jun 7, 2023
924abd0
Make implementation more robust
zhiltsov-max Jun 7, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .remarkignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
cvat-sdk/docs/
cvat-sdk/README.md
.env/
site/themes/
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## \[2.5.0] - Unreleased
### Added
- \[Server API\] An option to supply custom file ordering for task data uploads (<https://github.com/opencv/cvat/pull/5083>)

### Changed
- Allowed to use dataset manifest for the `predefined` sorting method for task data (<https://github.com/opencv/cvat/pull/5083>)
- TBD

### Changed
Expand All @@ -25,6 +29,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Deletion of uploaded files, including annotations and backups,
after they have been uploaded to the server using the TUS protocol but before an RQ job has been initiated. (<https://github.com/opencv/cvat/pull/5909>)
- Simultaneous creation of tasks or projects with identical names from backups by multiple users.(<https://github.com/opencv/cvat/pull/5909>)
- \[Server API\] The `predefined` sorting method for task data uploads (<https://github.com/opencv/cvat/pull/5083>)

### Security
- TDB
Expand Down
7 changes: 5 additions & 2 deletions cvat-sdk/cvat_sdk/core/uploading.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (C) 2022 CVAT.ai Corporation
# Copyright (C) 2022-2023 CVAT.ai Corporation
#
# SPDX-License-Identifier: MIT

Expand Down Expand Up @@ -350,6 +350,10 @@ def upload_files(
if pbar is not None:
pbar.start(total_size, desc="Uploading data")

if str(kwargs.get("sorting_method")).lower() == "predefined":
# Request file ordering, because we reorder files to send more efficiently
kwargs.setdefault("upload_file_order", [p.name for p in resources])

self._tus_start_upload(url)

for group, group_size in bulk_file_groups:
Expand All @@ -374,7 +378,6 @@ def upload_files(
pbar.advance(group_size)

for filename in separate_files:
# TODO: check if basename produces invalid paths here, can lead to overwriting
self._upload_file_data_with_tus(
url,
filename,
Expand Down
18 changes: 15 additions & 3 deletions cvat/apps/engine/media_extractors.py
Original file line number Diff line number Diff line change
Expand Up @@ -381,12 +381,24 @@ def get_zip_filename(self):

def get_path(self, i):
if self._zip_source.filename:
return os.path.join(os.path.dirname(self._zip_source.filename), self._source_path[i]) \
if not self.extract_dir else os.path.join(self.extract_dir, self._source_path[i])
prefix = self._get_extract_prefix()
return os.path.join(prefix, self._source_path[i])
else: # necessary for mime_type definition
return self._source_path[i]

def __contains__(self, media_file):
return super().__contains__(os.path.relpath(media_file, self._get_extract_prefix()))

def _get_extract_prefix(self):
return self.extract_dir or os.path.dirname(self._zip_source.filename)

def reconcile(self, source_files, step=1, start=0, stop=None, dimension=DimensionType.DIM_2D, sorting_method=None):
if source_files:
# file list is expected to be a processed output of self.get_path()
# which returns files with the output directory prefix
prefix = self._get_extract_prefix()
source_files = [os.path.relpath(fn, prefix) for fn in source_files]

super().reconcile(
source_files=source_files,
step=step,
Expand All @@ -397,7 +409,7 @@ def reconcile(self, source_files, step=1, start=0, stop=None, dimension=Dimensio
)

def extract(self):
self._zip_source.extractall(self.extract_dir if self.extract_dir else os.path.dirname(self._zip_source.filename))
self._zip_source.extractall(self._get_extract_prefix())
if not self.extract_dir:
os.remove(self._zip_source.filename)

Expand Down
37 changes: 37 additions & 0 deletions cvat/apps/engine/migrations/0068_auto_20230418_0901.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Generated by Django 3.2.18 on 2023-04-18 09:01

from django.db import migrations


class Migration(migrations.Migration):

dependencies = [
('engine', '0067_alter_cloudstorage_credentials_type'),
]

operations = [
migrations.AlterModelOptions(
name='clientfile',
options={'default_permissions': (), 'ordering': ('id',)},
),
migrations.AlterModelOptions(
name='relatedfile',
options={'default_permissions': (), 'ordering': ('id',)},
),
migrations.AlterModelOptions(
name='remotefile',
options={'default_permissions': (), 'ordering': ('id',)},
),
migrations.AlterModelOptions(
name='serverfile',
options={'default_permissions': (), 'ordering': ('id',)},
),
migrations.AlterUniqueTogether(
name='remotefile',
unique_together={('data', 'file')},
),
migrations.AlterUniqueTogether(
name='serverfile',
unique_together={('data', 'file')},
),
]
75 changes: 60 additions & 15 deletions cvat/apps/engine/mixins.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import base64
import json
import os
import os.path
import uuid
from dataclasses import asdict, dataclass
from distutils.util import strtobool
Expand Down Expand Up @@ -151,9 +152,30 @@ def __init__(self, request):
self.size = int(request.META.get("CONTENT_LENGTH", settings.TUS_DEFAULT_CHUNK_SIZE))
self.content = request.body

# This upload mixin is implemented using tus
# tus is open protocol for file uploads (see more https://tus.io/)
class UploadMixin:
"""
Implements file uploads to the server. Allows to upload single and multiple files, suspend
and resume uploading. Uses the TUS open file uploading protocol (https://tus.io/).

Implements the following protocols:
a. A single Data request

and

b.1. An Upload-Start request
b.2.a. The regular TUS protocol requests (Upload-Length + Chunks)
b.2.b. Upload-Multiple requests
b.3. An Upload-Finish request

Requests:
- Data - POST, no extra headers or 'Upload-Start' + 'Upload-Finish' headers
- Upload-Start - POST, has an 'Upload-Start' header
- Upload-Length - POST, has an 'Upload-Length' header (read the TUS protocol)
- Chunk - HEAD/PATCH (read the TUS protocol)
- Upload-Finish - POST, has an 'Upload-Finish' header
- Upload-Multiple - POST, has a 'Upload-Multiple' header
"""

_tus_api_version = '1.0.0'
_tus_api_version_supported = ['1.0.0']
_tus_api_extensions = []
Expand Down Expand Up @@ -204,11 +226,11 @@ def upload_data(self, request):
if one_request_upload or finish_upload:
return self.upload_finished(request)
elif start_upload:
return Response(status=status.HTTP_202_ACCEPTED)
return self.upload_started(request)
elif tus_request:
return self.init_tus_upload(request)
elif bulk_file_upload:
return self.append(request)
return self.append_files(request)
else: # backward compatibility case - no upload headers were found
return self.upload_finished(request)

Expand All @@ -218,7 +240,7 @@ def init_tus_upload(self, request):
else:
metadata = self._get_metadata(request)
filename = metadata.get('filename', '')
if not self.validate_filename(filename):
if not self.is_valid_uploaded_file_name(filename):
return self._tus_response(status=status.HTTP_400_BAD_REQUEST,
data="File name {} is not allowed".format(filename))

Expand Down Expand Up @@ -310,32 +332,55 @@ def append_tus_chunk(self, request, file_id):
extra_headers={'Upload-Offset': tus_file.offset,
'Upload-Filename': tus_file.filename})

def validate_filename(self, filename):
def is_valid_uploaded_file_name(self, filename: str) -> bool:
"""
Checks the file name to be valid.
SpecLad marked this conversation as resolved.
Show resolved Hide resolved
Returns True if the filename is valid, otherwise returns False.
"""

upload_dir = self.get_upload_dir()
file_path = os.path.join(upload_dir, filename)
return os.path.commonprefix((os.path.realpath(file_path), upload_dir)) == upload_dir

def get_upload_dir(self):
def get_upload_dir(self) -> str:
return self._object.data.get_upload_dirname()

def get_request_client_files(self, request):
def _get_request_client_files(self, request):
serializer = DataSerializer(self._object, data=request.data)
serializer.is_valid(raise_exception=True)
data = {k: v for k, v in serializer.validated_data.items()}
return data.get('client_files', None)
return serializer.validated_data.get('client_files')

def append(self, request):
client_files = self.get_request_client_files(request)
def append_files(self, request):
"""
Processes a single or multiple files sent in a single request inside
a file uploading session.
"""

client_files = self._get_request_client_files(request)
if client_files:
upload_dir = self.get_upload_dir()
for client_file in client_files:
with open(os.path.join(upload_dir, client_file['file'].name), 'ab+') as destination:
filename = client_file['file'].name
if not self.is_valid_uploaded_file_name(filename):
return Response(status=status.HTTP_400_BAD_REQUEST,
data=f"File name {filename} is not allowed", content_type="text/plain")

with open(os.path.join(upload_dir, filename), 'ab+') as destination:
destination.write(client_file['file'].read())
return Response(status=status.HTTP_200_OK)

# override this to do stuff after upload
def upload_started(self, request):
"""
Allows to do actions before upcoming file uploading.
"""
return Response(status=status.HTTP_202_ACCEPTED)

def upload_finished(self, request):
raise NotImplementedError('You need to implement upload_finished in UploadMixin')
"""
Allows to process uploaded files.
"""

raise NotImplementedError('Must be implemented in the derived class')

class AnnotationMixin:
def export_annotations(self, request, db_obj, export_func, callback, get_data=None):
Expand Down
23 changes: 18 additions & 5 deletions cvat/apps/engine/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -264,11 +264,6 @@ def make_dirs(self):
os.makedirs(self.get_original_cache_dirname())
os.makedirs(self.get_upload_dirname())

def get_uploaded_files(self):
upload_dir = self.get_upload_dirname()
uploaded_files = [os.path.join(upload_dir, file) for file in os.listdir(upload_dir) if os.path.isfile(os.path.join(upload_dir, file))]
represented_files = [{'file':f} for f in uploaded_files]
return represented_files

class Video(models.Model):
data = models.OneToOneField(Data, on_delete=models.CASCADE, related_name="video", null=True)
Expand Down Expand Up @@ -424,13 +419,22 @@ class Meta:
default_permissions = ()
unique_together = ("data", "file")

# Some DBs can shuffle the rows. Here we restore the insertion order.
# https://github.com/opencv/cvat/pull/5083#discussion_r1038032715
ordering = ('id', )

# For server files on the mounted share
class ServerFile(models.Model):
data = models.ForeignKey(Data, on_delete=models.CASCADE, null=True, related_name='server_files')
file = models.CharField(max_length=1024)

class Meta:
default_permissions = ()
unique_together = ("data", "file")

# Some DBs can shuffle the rows. Here we restore the insertion order.
# https://github.com/opencv/cvat/pull/5083#discussion_r1038032715
ordering = ('id', )

# For URLs
class RemoteFile(models.Model):
Expand All @@ -439,6 +443,11 @@ class RemoteFile(models.Model):

class Meta:
default_permissions = ()
unique_together = ("data", "file")

# Some DBs can shuffle the rows. Here we restore the insertion order.
# https://github.com/opencv/cvat/pull/5083#discussion_r1038032715
ordering = ('id', )


class RelatedFile(models.Model):
Expand All @@ -451,6 +460,10 @@ class Meta:
default_permissions = ()
unique_together = ("data", "path")

# Some DBs can shuffle the rows. Here we restore the insertion order.
# https://github.com/opencv/cvat/pull/5083#discussion_r1038032715
ordering = ('id', )

class Segment(models.Model):
task = models.ForeignKey(Task, on_delete=models.CASCADE)
start_frame = models.IntegerField()
Expand Down
34 changes: 28 additions & 6 deletions cvat/apps/engine/serializers.py
Original file line number Diff line number Diff line change
Expand Up @@ -763,12 +763,33 @@ class DataSerializer(serializers.ModelSerializer):
"""))
job_file_mapping = JobFileMapping(required=False, write_only=True)

upload_file_order = serializers.ListField(
child=serializers.CharField(max_length=1024),
default=list, allow_empty=True, write_only=True,
help_text=textwrap.dedent("""\
Allows to specify file order for client_file uploads.
Only valid with the "{}" sorting method selected.

To state that the input files are sent in the correct order,
pass an empty list.

If you want to send files in an arbitrary order
and reorder them afterwards on the server,
pass the list of file names in the required order.
""".format(models.SortingMethod.PREDEFINED))
)

class Meta:
model = models.Data
fields = ('chunk_size', 'size', 'image_quality', 'start_frame', 'stop_frame', 'frame_filter',
'compressed_chunk_type', 'original_chunk_type', 'client_files', 'server_files', 'server_files_exclude','remote_files', 'use_zip_chunks',
'cloud_storage_id', 'use_cache', 'copy_data', 'storage_method', 'storage', 'sorting_method', 'filename_pattern',
'job_file_mapping')
fields = (
'chunk_size', 'size', 'image_quality', 'start_frame', 'stop_frame', 'frame_filter',
'compressed_chunk_type', 'original_chunk_type',
'client_files', 'server_files', 'remote_files',
'use_zip_chunks', 'server_files_exclude',
'cloud_storage_id', 'use_cache', 'copy_data', 'storage_method',
'storage', 'sorting_method', 'filename_pattern',
'job_file_mapping', 'upload_file_order',
)
extra_kwargs = {
'chunk_size': { 'help_text': "Maximum number of frames per chunk" },
'size': { 'help_text': "The number of frames" },
Expand Down Expand Up @@ -858,8 +879,9 @@ def _pop_data(self, validated_data):
server_files = validated_data.pop('server_files')
remote_files = validated_data.pop('remote_files')

validated_data.pop('job_file_mapping', None) # optional
validated_data.pop('server_files_exclude', None) # optional
validated_data.pop('job_file_mapping', None) # optional, not present in Data
validated_data.pop('upload_file_order', None) # optional, not present in Data
validated_data.pop('server_files_exclude', None) # optional, not present in Data

for extra_key in { 'use_zip_chunks', 'use_cache', 'copy_data' }:
validated_data.pop(extra_key)
Expand Down
Loading