Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype: Folder items scheduler - evaluation attribute sorted tree #5349

Closed
wants to merge 2 commits into from

Conversation

mrow4a
Copy link
Contributor

@mrow4a mrow4a commented Dec 1, 2016

Case Scenario 1:
Carlos and his Zombies just went back from holidays, he opens the computer and tries to sync all his photos - 4GB. With his link, to upload all the photos will take around 1 hour. He starts working since next day is monday, and he puts in the various folders his new files.

Case Scenario 2:
Carlos didnt use sync client for longer time, he want to do the initial sync of his files. He starts sync of a lot of folders with a lot of files. Total sync time will be around 30 minutes. Within all these files, there are some folders on which he collaborates on the document. He want to share the document he is editing during the sync, and this folder is not big (it contains only documents).

EXPECTATIONS:
In the very nested structure, there is a folder with only small changes, and folder with a lot of new files, which sync time will be very long. It is expected, that the folders with small changes will be synced first. If I pause-start sync, my small file schould by synced first

REALITY:
OC client currently is "I just found folder on filesystem and I will try to sync its contents to the cloud" structured. This means, that if you place a small change in some nexted folder tree, and before it there is another nested tree with a lot of files, your document will wait till sync comes to your folder

Solution:
Effect 1:
I removed the caching of the Finished Jobs, and I just remove them from the stack. I am using also Iterator instead of for loop.

Effect 2:
Oc sync is structurising the sync in the following manner, not regarding on the size of the folders:

THE TREES BELOW ARE TREES OF CHANGES - These folders could contain a lot of unchanged files in each. Tree is being build only basing on size of changes within specific subtree.

Important note: all file operations(IGNORE, MOVE, ETC) except for PUT and GET, are assigned with HighPriority, it means they will be synced with insertion order (the same as previously). PUT and GET predicates are based on their file sizes and are assigned NormalPriority. Folders (subtrees) are assigned ContainerPriority, which means their predicate is size of the changes within this subtree. REMOVED folders are HighPriority and will be synced old way.

ownCloud Root
├── less-important-folder-10MB/
       ├── nested-folderes-files 9MB
       ├── less-important-file 1MB
       └── nested-folderes-files 4MB
├── Home&Work-10GB/
        ├── folder with nested 10GB of photos to be sync
        ├── very-very-important folder with single documents to sync - 2MB
        └── less-important-document - 1MB
├── important-folder-12MB/
└── less-important-folder - 5MB/
        └── important-file-1MB

However, this algorithm (not using loop for all the items, only for folders within folder - complexity 0(n) where n is number of folders in the folder) builds it as follows, looking on size predicate for files and size of items within folders:

ownCloud new Execution Tree:
├── less-important-folder-5MB/
       └── important-file-1MB
├── less-important-folder-10MB/
       ├── less-important-file 1MB
       ├── nested-folderes-files 4MB
       └── nested-folderes-files 9MB
├── important-folder-12MB/
├── Home&Work-10GB/
        ├── less-important-document - 1MB
        ├── very-very-important folder with single documents to sync - 2MB
        └── folder with nested 10GB of photos to be sync

I have tested it on similar structure and it works really nice :> @ogoffart @DeepDiver1975 @guruz @felixboehm

@mention-bot
Copy link

@mrow4a, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ogoffart, @ckamm and @jturcotte to be potential reviewers.

@mrow4a mrow4a changed the title Add folder items scheduler with specific predicate - item sizes ascending Folder items scheduler with specific predicate - item sizes ascending Dec 1, 2016
@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 1, 2016

This algorithm live sorts the items withing all the folders according to its size or the total size of the contained items. It CREATES ON FLY a specific sorted tree, while ADDING items to the folder using specific data structure QMap with key being predicate and value being Job (not using sorting loop! this will be inefficient for a lot of files, it will loop only for folders, which number is not expected to be exceeding milion I guess).

Quick reminder, QMap is characterised that it sorts its item by the key. If the key is the same, e.g. 0, items are sorted in the insertion order.

@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 1, 2016

This should also solve #1633 since this will By The Way balance upload/download by placing the files/folders of the same sizes close to each other. Our 3 separate threads should create nice pipe utilization in this case.

Please close the issue after merge

@mrow4a mrow4a force-pushed the scheduler branch 4 times, most recently from 56afdbf to 786478a Compare December 1, 2016 21:03
// TODO: The call for getJobPredicateValue should recursively traverse all kids in _containerJobs of _containerjobs,
// in order to update their predicates. Root job should call updatePredicates in owncloudPropagator as a last step before run
// , and save flag _folderPredicateUpdated = true
_subJobs.insertMulti(job->getJobPredicateValue(), job);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should basicaly be moved to the moment before root get started, and the function _root->updateContainersPredicates() should perform left-deep traversal of all the folders with changed files ( _containerJobs), and be added to _subJobs list with getJobPredicateValue(), job>. This has to be done like that, because files are being added to containers and we have to update _subJobs list of all folders starting from deepest ones.

@mrow4a mrow4a force-pushed the scheduler branch 2 times, most recently from d319e86 to d252113 Compare December 4, 2016 18:45

// This uses recursion to perform Depth-First Traversal of the directories with changes trees
// If the given (this) directory contains _containerJobs, it will call updateJob on that child dir job, otherwise does nothing
void updateJobPredicateValues();
Copy link
Contributor Author

@mrow4a mrow4a Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is initiated (run first) for _root, after all jobs are appended. It will execute Post-Order Depth-First Tree Traversal in order to visit directories kids (_containerJobs) of given _root, starting from bottom, and visiting their respecive directories kids (_containerJobs).

@phil-davis
Copy link
Contributor

See #5352 to fix a few comment typos.

@guruz guruz added this to the 2.3.0 milestone Dec 5, 2016
}

void append(PropagatorJob *subJob) {
_subJobs.append(subJob);
_subJobsPriority += subJob->getJobPredicateValue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see getJobPredicateValue is called here.. but there is also updateJobPredicateValues... does this not cause too much load? Maybe I misunderstand..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two different functions. getJobPredicateValue will just get the present value of the sync item (ignore file etc, sync file (file size)) and for directories, we insert them to _containerJobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateJobPredicateValue is recursively traversing only _containerJobs tree, thus only directories with changes. This one will be called once for the _root job, and recursively visit all "changed" directories. _subJobs are untouched, and I will not touch that, because this could contain thousands of items, while _containerJobs will never be deep.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more commenting..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guruz Ok, on the weekend will write algorithm documentation for that in https://github.com/owncloud/documentation/tree/master/developer_manual

if(subJob->_priority == JobPriority::ContainerItemsPriority){
_containerJobs.append(subJob);
} else {
_subJobs.insertMulti(subJob->getJobPredicateValue(), subJob);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the intent clear, change QMap to QMultiMap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, didnt know about this one. Nice


for (int i = _firstUnfinishedSubJob; i < subJobsCount; ++i) {
if (_subJobs.at(i)->_state == Finished) {
QMutableMapIterator<quint64, PropagatorJob *> subJobsIterator(_subJobs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment that this iterates by predicate and also iterated over the multiple values of a multi map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isnt it just iterates over the items it has in the sorting order? I thought it is self explaining.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it was not...

@@ -396,9 +396,12 @@ void OwncloudPropagator::start(const SyncFileItemVector& items)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I added you on several issues here in github, please mention them in the commit message if you think they are handled by this

// all the sub files or sub directories.
QVector<PropagatorJob *> _subJobs;
// all the sub files or sub directories. This map has to be updated with _containerJobs
// QMap is ordered by the key value, or in case of equal keys (e.g 0) by insertion order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify what the key value actually is here in the comment (it's the predicate value aka file sizes right?)

// QMap is ordered by the key value, or in case of equal keys (e.g 0) by insertion order.
QMap<quint64, PropagatorJob *> _subJobs;

QVector<PropagatorJob *> _containerJobs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add detailed comment of the destinction between a container job and a sub job.

Copy link
Contributor

@guruz guruz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, interesting..
Needs more comments.. :)
What does smashbox say?

@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 5, 2016

@guruz Whom to contact for smashbox tests? How do you do that normally?

@owncloud/qa do you automatize smashbox tests also for client?

EDIT: submitted corrections

@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 5, 2016

Please after acceptance, close #4498 (comment)
and also #5077 (comment)

@guruz
Copy link
Contributor

guruz commented Dec 7, 2016

Please after acceptance, close

@mrow4a That's why I wrote "please mention them in the commit message if you think they are handled by this".. then they are linked.

@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 7, 2016

@guruz Ok, I misunderstood you, though you mean just comment here. Will correct it.

@mrow4a
Copy link
Contributor Author

mrow4a commented Dec 8, 2016

Ohh, so I should just run it localy, I though you had some QA server on which you test your branches. Ok, will test it localy agains the tests from ownCloud CI

Copy link
Member

@jturcotte jturcotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My list of changes I'd like to see after the discussion today:

  • Rename "Predicate" to something like "Function"
  • Set the sort function of a folder vs. its sibblings to be the most recent modification time of all its children

@mrow4a mrow4a changed the title Folder items scheduler with specific predicate - item sizes ascending Folder items scheduler by evaluation attribute Dec 14, 2016

// FIXME! we should probably cache this result
//this will get the _container jobs within this parent directory and iterate over them
QMutableVectorIterator<PropagatorJob *> containerJobsIterator(_containerJobs);
Copy link
Contributor Author

@mrow4a mrow4a Dec 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I use List? @ogoffart because of the remove?

@@ -48,7 +48,7 @@ class PropagateRemoteDelete : public PropagateItemJob {
QPointer<DeleteJob> _job;
public:
PropagateRemoteDelete (OwncloudPropagator* propagator,const SyncFileItemPtr& item)
: PropagateItemJob(propagator, item) {}
: PropagateItemJob(propagator, item, JobPriority::LastOutPriority) {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok?

@@ -40,7 +40,8 @@ static const char checkSumAdlerC[] = "Adler32";
class PropagateLocalRemove : public PropagateItemJob {
Q_OBJECT
public:
PropagateLocalRemove (OwncloudPropagator* propagator,const SyncFileItemPtr& item) : PropagateItemJob(propagator, item) {}
PropagateLocalRemove (OwncloudPropagator* propagator,const SyncFileItemPtr& item)
: PropagateItemJob(propagator, item, JobPriority::FirstOutPriority) {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok?

@mrow4a mrow4a changed the title Folder items scheduler by evaluation attribute Folder items scheduler - evaluation attribute sorted tree Dec 18, 2016
@felixboehm
Copy link
Contributor

@mrow4a Please add unit tests, you are rising complexity without testing.
Please check performance impact with large amount of files.

@mrow4a mrow4a changed the title Folder items scheduler - evaluation attribute sorted tree Prototype: Folder items scheduler - evaluation attribute sorted tree Dec 20, 2016
@ogoffart ogoffart modified the milestones: 3.0.0, 2.3.0 Jan 11, 2017
@ogoffart
Copy link
Contributor

We have not established yet best way to schedule the file.
I think the priority goes to get bundling in.
And this issue should not block 2.3.0

@mrow4a
Copy link
Contributor Author

mrow4a commented Jan 11, 2017

Yes, I think this is yet another feature, independent from other things, but worth consideration.

@guruz guruz modified the milestones: 2.4.0, 3.0.0 Mar 16, 2017
@guruz
Copy link
Contributor

guruz commented Mar 23, 2017

AFAIK this only makes sense if #5440 is done?

@ogoffart ogoffart removed this from the 2.4.0 milestone Apr 28, 2017
@ogoffart ogoffart removed their assignment Apr 28, 2017
@ogoffart
Copy link
Contributor

I removed the milestone because this would still need more work, and I don't think it is worth it at all. But I leave it open for the sake of discussion.

@mrow4a
Copy link
Contributor Author

mrow4a commented Apr 28, 2017

I still think that sorting by modification time and syncing that way might be a good idea.

@guruz
Copy link
Contributor

guruz commented May 3, 2017

and I don't think it is worth it at all.

I think it might be.

So let's see after the tests/benchmarks about this.
No rush now for 2.4 or 3.0..

@ogoffart
Copy link
Contributor

Closing outdated pull request

@ogoffart ogoffart closed this Oct 30, 2019
@TheOneRing TheOneRing deleted the scheduler branch December 2, 2019 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants