Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix predefined sorting for task data #5083

Merged
merged 103 commits into from
Jun 8, 2023
Merged

Conversation

zhiltsov-max
Copy link
Contributor

@zhiltsov-max zhiltsov-max commented Oct 11, 2022

Motivation and context

Fixes #5061, #4179

  • Added a way to declare custom file ordering for the local task data uploads via TUS protocol
  • Added an option to use a manifest to support the predefined sorting method
    • This file is required for the predefined sorting mode with image archives
  • Fixed file ordering when tasks are created from SDK or CLI in the predefined sorting mode
  • Added more tests for task data uploading API

The uploading protocol is implemented:

The user specifies sorting_method=predefined if the task creation request. Then the data is uploaded.

  1. Client files uploading
    1.1. The files are uploaded as separate files (using the TUS protocol) or grouped files (using the Upload-Multiple requests).
    1.2. The Upload-Finish request comes (or its unlabeled legacy equivalent). The new optional field can be supplied: upload_file_order - a list of strings. It allows to override the input file order, if necessary, and is only valid with the predefined sorting method specified.
    1.2.1. If the field is empty or missing, the client files in the data requests are considered ordered.
    1.2.2. If the field is not empty, a list containing the file list in the required order is expected in the upload_file_order field.
    1.2.2.1. If there are client_files in the request, the files are sorted
    1.2.2.2. If file lists mismatch, an explanatory error is raised.

  2. Data processing
    2.1. At this point, all *_files are considered ordered as requested.
    2.2. Require a metafile for zip uploads with predefined sorting. The file is expected to accompany the uploaded zip file, not to be inside of the archive.
    2.3. If there is a metafile in the input data, files are ordered after the metafile.
    2.3.1. If the data is extracted from cloud, only the specified subset of the files is kept in the manifest.
    2.3.2. If the upload data doesn't exist in the metafile, an error is raised.
    2.3.3. A job_file_mapping has higher priority than metafile, if specified.

How has this been tested?

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

cvat/apps/engine/views.py Fixed Show fixed Hide fixed
@zhiltsov-max zhiltsov-max changed the title [WIP] Fix predefined sorting for task data [DO NOT MERGE] Fix predefined sorting for task data Oct 13, 2022
@nmanovic nmanovic changed the title [DO NOT MERGE] Fix predefined sorting for task data [WIP] Fix predefined sorting for task data Oct 14, 2022
@zhiltsov-max
Copy link
Contributor Author

zhiltsov-max commented Oct 20, 2022

Problems:

  1. If we send data with TUS, the data is stored in a directory, and then it is returned unordered. It can also be sent unordered. Then, when we try to get the list of the uploaded files, we get the files in an unspecified order (cvat-cli --sorting-method predefined results in seemingly random order of images #5061).

  2. If we send an archive, the data is returned as the archive reports, i.e. the file order is unspecified.

Possible solutions:

  1. Require a metadata file when sorting_method=predefined

REJECTED: Too obligatory for users. We want to avoid using metafiles until really necessary

2.1. Require an upload spec (a request) for TUS uploads
2.1.1. Introduce a new optional request body in the Upload-Start request. It has 2 fields: ordered (boolean, default=False), files (list of strings, optional)
2.1.1.1. If ordered=True, the files are considered ordered. Write the input ordering into a special metafile (e.g. upload_dir/__upload_auto_file_ordering__) and update it on each input file request.
2.1.1.2. If ordered=False and "files" is not empty, the "files" field is expected to contain the file list in the required order. A temporary metafile (e.g. upload_dir/__upload_custom_file_ordering__) is generated from the input data.
2.1.1.3. If ordered=False and "files" is empty or missing, the files are considered unordered. No extra actions are done.
2.1.1.4. The files are being uploaded as separate files and grouped files. If the input file has a reserved name, fail the request with an error. If a temporary metafile for auto ordering is found in the upload location, the file is updated.
2.1.1.5. The Upload-Finished request comes.
2.1.1.5.1. If no metafiles found in the upload location, do not report any error. We don't raise errors to support existing cases where a manifest file is sent in the uploaded data, but is not supposed for sorting.
2.1.1.5.2. Sort the input files (the client_files field) using any of the temporary metafiles found. Raise an error if file lists mismatch.

2.2. Require a metafile for zip uploads with predefined sorting. The file is expected to accompany the uploaded zip file, not to be inside of the archive.
2.3. SDK and UI are updated to send the "files" field with the list of upcoming files.

BEING TESTED

@nmanovic , @SpecLad , @azhavoro - do you have any comments about the new solution?

@SpecLad
Copy link
Contributor

SpecLad commented Oct 20, 2022

Is the ordered field necessary? Seems like the file list should be enough.

@zhiltsov-max
Copy link
Contributor Author

zhiltsov-max commented Oct 21, 2022

Is the ordered field necessary? Seems like the file list should be enough.

It's hard to say, probably no. We could encode ordered=True into the default value for the array, but it looks like there are some clients that depend on the existing behavior (e.g. that the files are either in the direct or reversed order), and I'm not sure if it is good to break their code. The existing behavior is different - the files are returned as os.listdir() returns (which is unspecified, but in practice it gives 2 variants for some reason). In the proposed variant, the existing behavior is kept, and the user needs to explicitly request their sorting expectations.

cvat/apps/engine/views.py Fixed Show fixed Hide fixed
cvat/apps/engine/views.py Fixed Show fixed Hide fixed
@zhiltsov-max
Copy link
Contributor Author

I decided to remove the backward compatibility mode. It turned out that Django was sorting the client_files field lexicographically in all cases (with and without TUS), so the predefined sorting was never really working for local uploads. The old behavior can be reproduced with just the lexicographical sorting method now. For predefined, I fixed behavior in single-request uploads, and I implemented a way to optionally specify the input file order. I decided not to require this just for TUS to be consistent with the simple, single-request version.

@zhiltsov-max zhiltsov-max changed the title [WIP] Fix predefined sorting for task data Fix predefined sorting for task data Jun 6, 2023
cvat/apps/engine/views.py Show resolved Hide resolved
cvat/apps/engine/views.py Show resolved Hide resolved
@SpecLad SpecLad merged commit 8df8872 into develop Jun 8, 2023
@SpecLad SpecLad deleted the zm/fix-predefined-sorting branch June 8, 2023 12:32
@azhavoro azhavoro mentioned this pull request Jun 9, 2023
mikhail-treskin pushed a commit to retailnext/cvat that referenced this pull request Jul 1, 2023
Extracted from cvat-ai#5083 
Related cvat-ai#5096 

- Improved dataset manifest docs
- Dataset manifest requirements are now installed in the server image
- Package dependencies are aligned with the server
mikhail-treskin pushed a commit to retailnext/cvat that referenced this pull request Jul 1, 2023
Extracted from cvat-ai#5083

- Added a default arg for task data uploading
- Added an option to wait for the data processing in task data uploading
- Moved data splitting by requests for TUS closer to the point of use
mikhail-treskin pushed a commit to retailnext/cvat that referenced this pull request Jul 1, 2023
Extracted from cvat-ai#5083 

- Fixed test data placement in tests
- Fixed cache cleaning in tests
- Refactored server API tests and dataset_manifest
- Added more tests
hoshimura added a commit to hoshimura/fiftyone that referenced this pull request Nov 20, 2023
keeping old behavior for cvat versions < 2.4.6 because
of cvat bug fix cvat-ai/cvat#5083

Change-Id: I542695437c05c09bb48023d598af479965abb384
hoshimura added a commit to hoshimura/fiftyone that referenced this pull request Nov 21, 2023
keeping old behavior for cvat versions < 2.4.6 because
of cvat bug fix cvat-ai/cvat#5083

Change-Id: I542695437c05c09bb48023d598af479965abb384
ehofesmann pushed a commit to voxel51/fiftyone that referenced this pull request Nov 28, 2023
keeping old behavior for cvat versions < 2.4.6 because
of cvat bug fix cvat-ai/cvat#5083

Change-Id: I542695437c05c09bb48023d598af479965abb384
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cvat-cli --sorting-method predefined results in seemingly random order of images
4 participants