
Fix hashing of aggregate classes, rename reverse_order to sort_ascending #351

Merged
merged 10 commits into feature/enable-aggregation from feature/aggregation-sort-hash on Sep 8, 2020

Conversation

bdice
Member

@bdice bdice commented Sep 2, 2020

Description

Fixes a few minor issues in #348 that I caught after it was merged.

Motivation and Context

I will add comments in the PR discussing the motivation for my changes.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@bdice bdice requested review from a team as code owners September 2, 2020 03:23
@bdice bdice self-assigned this Sep 2, 2020
@bdice bdice requested review from b-butler and cbkerr and removed request for a team September 2, 2020 03:23
@bdice bdice added the aggregation and bug labels Sep 2, 2020
@bdice bdice changed the base branch from master to feature/enable-aggregation September 2, 2020 03:23
States if the jobs are to be sorted in reverse order.
The default value is False.
:type reverse_order:
:param sort_ascending:
Member Author

We should call this sort_ascending with a default value of True so it matches Pandas' sort_values(by=..., ascending=True, ...). I find "ascending" unambiguous, while "reverse order" is less clear. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
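
For reference, a minimal pandas sketch of the convention being cited (the DataFrame and column name here are made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
df.sort_values(by="a")                    # ascending=True is the default
df.sort_values(by="a", ascending=False)   # descending order must be requested explicitly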

Collaborator

We sort in ascending order by default, so we'll always have to use the not keyword. While this will definitely work, my preference here would be sort_descending.

Member Author

I believe I only had to invert its value twice: once in the call to sorted, and once in the check for “default values”, where I replaced not reverse_order with sort_ascending. So there's an equal number of “not”s in each direction, which seems no more complex to me.
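
A rough sketch of the two inversions described above; the function and variable names here are illustrative stand-ins, not flow's actual internals:

def _sorted_aggregates(jobs, sort_by, sort_ascending=True):
    # Inversion #1: sorted() takes reverse=, which has the opposite sense.
    return sorted(jobs, key=lambda job: job.sp[sort_by],
                  reverse=not sort_ascending)

# Inversion #2: the "default values" check, which previously tested
# `not reverse_order`, now simply tests `sort_ascending`.
def _is_default_sort(sort_by, sort_ascending):
    return sort_by is None and sort_ascending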

Contributor

I don't have a strong opinion either way, but this usage is distinct enough from anything you would do with pandas that I don't think there's any reason for confusion. I'm assuming the performance difference is negligible, but if that's not the case and the default sorting is the opposite of what pandas does, I think choosing performance over exact API consistency with pandas is unnecessary here. I'm fine with whatever decision is made, though.

Collaborator

@kidrahahjo kidrahahjo Sep 7, 2020

@bdice @vyasr Then we can stick with how @bdice implemented this.

Member

I don't feel this needs to conform to pandas, but I do think the naming is clearer with sort_ascending or sort_descending.

Comment on lines +225 to +226
return hash(func.__code__.co_code)
except AttributeError: # Cannot access function's compiled bytecode
Member Author

We should hash the bytecode before returning, since this is purely used as a unique identifier. We also shouldn't catch Exception! We know this should raise an AttributeError if it can't access the attribute.
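
A minimal sketch of the helper with both suggestions applied (hash the bytecode, catch only AttributeError). It is written as a free function here for illustration, and the fallback branch is an assumption rather than the merged code:

def _get_unique_function_id(func):
    """Return an identifier for *func* based on its bytecode, when available."""
    try:
        # Hash the compiled bytecode so the id is a plain integer rather
        # than the raw bytes object.
        return hash(func.__code__.co_code)
    except AttributeError:
        # The callable has no __code__ attribute (e.g. a builtin or a
        # callable class instance); fall back to hashing the object itself.
        # (Illustrative fallback; the actual code may handle this differently.)
        return hash(func)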

Contributor

See my comment above (where this function is used).

Comment on lines +215 to +217
return hash((self._sort_by, self._sort_ascending, self._is_aggregate,
self._get_unique_function_id(self._aggregator_function),
self._get_unique_function_id(self._select)))
Member Author

We're severely over-complicating things here. Just take the hash of a tuple of hashable objects that uniquely define this instance. There's no need to do all the string operations and sha1 digests.

The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.
https://docs.python.org/3/reference/datamodel.html#object.__hash__
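
A generic, self-contained illustration of that tuple-hashing pattern; SortSpec and its attributes are made up and are not flow's aggregator class:

class SortSpec:
    """Toy class showing the tuple-hashing pattern from the Python docs."""

    def __init__(self, sort_by, sort_ascending):
        self._sort_by = sort_by
        self._sort_ascending = sort_ascending

    def __eq__(self, other):
        return (type(self) is type(other)
                and self._sort_by == other._sort_by
                and self._sort_ascending == other._sort_ascending)

    def __hash__(self):
        # Pack the same components that __eq__ compares into a tuple and
        # hash that tuple; no string building or sha1 digests needed.
        return hash((self._sort_by, self._sort_ascending))


assert SortSpec("a", True) == SortSpec("a", True)
assert hash(SortSpec("a", True)) == hash(SortSpec("a", True))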

Contributor

Does _get_unique_function_id even need to exist? @kidrahahjo, why were you hashing just the bytecode of the function (which is what _get_unique_function_id returns for functions) rather than hashing the function directly? I honestly don't remember thinking through whether it was necessary; I think I just assumed there was a reason you bothered to define it rather than hashing the function directly, and moved on. Any object with a __code__ attribute should already be hashable, though, right?

Collaborator

@kidrahahjo kidrahahjo Sep 7, 2020

@vyasr Since aggregator_function is defined internally, the hash would depend on its location. I believe this was the main reason for not hashing the method directly.

Edit: I'll document this and add a commit now.

Member Author

This comment is just for others' reference. In my understanding, _get_unique_function_id is useful for the same reason we use obj.__code__.co_code for the "tags" on _condition objects in flow. It allows us to know that two function objects are the same if their bytecode is equivalent (even if they're not actually located in the same object). Hashing the bytecode (when possible) is a helpful way to uniquely identify the aggregator/select function.

Demonstration of how hashes of functions are not equal (even though the functions are equivalent), but hashing the bytecode shows their equivalence:

>>> def foo():
...   print('asdf')
...
>>> def bar():
...   print('asdf')
...
>>>
>>> foo is bar
False
>>> hash(foo)
8793327702568
>>> hash(bar)
8793327702577
>>> foo.__code__.co_code
b't\x00d\x01\x83\x01\x01\x00d\x00S\x00'
>>> bar.__code__.co_code
b't\x00d\x01\x83\x01\x01\x00d\x00S\x00'
>>> assert hash(foo.__code__.co_code) == hash(bar.__code__.co_code)

Contributor

@kidrahahjo @bdice yes, that is true, but we're no longer using hashing to check equality, or depending on being able to check equality at all, since, as per our discussions, checking co_code is insufficient. For instance:

>>> def f(x):
...     return x**2
...
>>> def g(x):
...     return x**3
...
>>> f.__code__.co_code
b'|\x00d\x01\x13\x00S\x00'
>>> g.__code__.co_code
b'|\x00d\x01\x13\x00S\x00'

Although I pointed out some ways to get around the particular problem above (as well as other, even more subtle ways in which this comparison can fail), the point is that this id isn't really unique, right? So do we gain anything by defining a function whose name is not accurate and whose primary purpose is no longer valid? If Aggregate.__eq__ only returns True if (roughly) a is b, then wouldn't it make sense for our hashes to also be distinct?

Contributor

Right. Actually the situation gets even worse than that: I discussed some cases where comparing __code__.co_consts is also not sufficient (and possible ways to get around those edge cases as well). However, overall I think there are enough such possibilities that it's safest to avoid the problem altogether and just require strict object equality, which is also what we changed __eq__ to do. @kidrahahjo could you make the relevant change here (on a separate branch/PR at this point)? Basically, just remove the usage of this helper function and directly hash the object, unless this would cause any other problems. @b-butler feel free to comment if you think I'm missing something here, but I think this is directly in line with my suggestion for __eq__.

Member

@b-butler b-butler Sep 11, 2020

@bdice @vyasr The __hash__ method does not have to distinguish all elements that are not equivalent; it only requires that a == b implies hash(a) == hash(b). We can do this for now, but I think eventually having a hash method that can distinguish between commonly used aggregates would be useful. https://docs.python.org/3/reference/datamodel.html#object.__hash__

I don't think there is any safety concern here, since if a is b then a == b and hash(a) == hash(b) in our model. __hash__ is not necessarily meant in Python to provide a unique id; it is more for hashable collections like dict and set. We can safely give objects the same hash when they are not equal, as long as our __eq__ gives no false positives, which the is operator ensures.

Edit: Fixed link
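
A small, purely illustrative demonstration of that contract: a constant hash is still valid for sets and dicts as long as __eq__ gives no false positives.

class Identity:
    def __eq__(self, other):
        return self is other   # no false positives

    def __hash__(self):
        return 0               # every instance collides


a, b = Identity(), Identity()
assert hash(a) == hash(b)      # hashes collide...
assert a != b                  # ...but equality is still correct
assert len({a, b}) == 2        # and a set still keeps both objects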

Contributor

I agree: in general, hashes of unequal objects don't need to be distinct, and we only depend on __eq__. I just figured that for now we should keep the two roughly in sync.

signac at present basically assumes no hash collisions... but that's another story.

Member Author

@bdice bdice Sep 13, 2020

To be clear, hash collisions in signac would be caught and raise errors. I believe we have tests for this behavior. (Just didn’t want readers of the above comment to fear unexpected behavior.)

Contributor

Yes, hash collisions where two statepoints refer to the same job ID will raise an error. Nothing will break.

AFAIK hash collisions will raise the same DestinationExistsError as is raised when actually creating two identical jobs though. I'm not aware of any logic that explicitly checks for the reason and indicates to the user that there's a hash collision, so the error will look the same as if the user did something like

j1 = project.open_job(dict(a=1)).init()
j2 = project.open_job(dict(a=1, b=2)).init()
del j2.sp.b  # This will raise a DestinationExistsError because now j2 and j1 have the same statepoint.

In the event that a user somehow managed to trigger a collision (which is obviously highly unlikely, and not something that I think we need to deal with), the resulting error message would prevent any corruption of the data space; it would just be very confusing. That's all I was referring to; nothing needs to be done there.

Comment on lines +279 to +280
self._aggregates = []
self._aggregate_ids = {}
Member Author

Initialize empty lists/dicts with object literals. It's faster and more Pythonic.
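
A quick way to sanity-check the speed claim with the standard library (illustrative only; absolute timings vary by machine):

import timeit

# Literals compile to single BUILD_LIST / BUILD_MAP instructions, while the
# constructors also require a global name lookup and a function call.
assert timeit.timeit("[]") < timeit.timeit("list()")
assert timeit.timeit("{}") < timeit.timeit("dict()")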

an operation function aggregated by the default aggregates i.e.
:py:class:`aggregator.groupsof(1)`.
an operation function using the default aggregator, i.e.
``aggregator.groupsof(1)``.
Member Author

This probably won't be parsed correctly by Sphinx. :py:class: will create a cross-link to the corresponding class, but we just want a code snippet.
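
Roughly, the difference between the two reST forms (shown here as plain Python strings for illustration):

# The role form asks Sphinx to resolve a cross-reference; since
# "aggregator.groupsof(1)" is a call expression rather than a class name,
# the reference cannot resolve cleanly.
role_form = ":py:class:`aggregator.groupsof(1)`"

# The inline-literal form simply renders a code snippet, which is all that
# is wanted here.
literal_form = "``aggregator.groupsof(1)``"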


def __hash__(self):
    blob = self._project.get_id()        # removed
    return int(_hash(blob), 16)          # removed
    return hash(repr(self._project))     # added
Member Author

This was not a safe hash function. The project id is just a name, and isn't actually a unique identifier (two projects in different locations can have the same name, and thus the same id). For the Project's __eq__ method, we use repr(self) == repr(other), because the repr includes a unique identifier: the project's path on disk. I use the same approach here.
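
A schematic of that pairing using a made-up stand-in class (signac's real Project is more involved; this only shows the __eq__/__hash__ relationship):

class FakeProject:
    """Stand-in illustrating the repr-based __eq__/__hash__ pairing."""

    def __init__(self, name, root):
        self._name = name   # the project "id": just a name, not unique
        self._root = root   # the path on disk: this is what makes it unique

    def __repr__(self):
        return f"FakeProject(name={self._name!r}, root={self._root!r})"

    def __eq__(self, other):
        return repr(self) == repr(other)

    def __hash__(self):
        # Hash the same value that __eq__ compares, so equal objects are
        # guaranteed to hash equal.
        return hash(repr(self))


p1 = FakeProject("MyProject", "/path/a")
p2 = FakeProject("MyProject", "/path/b")
assert p1 != p2              # same name, but different location on disk
assert hash(p1) != hash(p2)  # hashing only the name would collide here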

Comment on lines +424 to +425
hash_ = md5(blob.encode('utf-8')).hexdigest()
return f'agg-{hash_}'
Member Author

The separate function was a little bit overkill -- it's only used once.

Contributor

My original intent in requesting a separate function was to also include the logic for blob construction (which was being done in many places in this file), but as you've pointed out, you can always just hash a tuple of things rather than build up that object, which is better :)

@bdice bdice requested review from kidrahahjo and removed request for cbkerr September 2, 2020 03:30
Collaborator

@kidrahahjo kidrahahjo left a comment

Thank you @bdice for reviewing #348 and fixing the issues through this PR. I added a few commits which address your comments related to using fully qualified import statements. Apart from that, it looks good to me. Thanks once again.

Member

@b-butler b-butler left a comment

Thanks for the changes @bdice. They look good.

@bdice bdice merged commit 4dab42a into feature/enable-aggregation Sep 8, 2020
@bdice bdice deleted the feature/aggregation-sort-hash branch September 8, 2020 00:49
bdice added a commit that referenced this pull request Jan 20, 2021
… and documentation. (#427)

* Change the submission ID for aggregate logic (#334)

Changes the submission id generation to prepare for aggregation. The `__str__` method of `_JobOperation` is also changed to better handle job aggregation. The unique id generated using a hash of all the `_JobOperation` object's data is still the same, so the uniqueness should not be in question; the only change is to the readable part of the id.

The change to the `str` implementation mimics NumPy, and the submission id favors brevity.

* Add aggregator classes in flow (#348)

Add a new decorator by which operations can be tagged for various types of aggregation, along with classes to store actual aggregates once they have been generated.

* Fix hashing of aggregate classes, rename reverse_order to sort_ascending (#351)

* Use sort_ascending instead of reverse_order.

* Fix and simplify hash methods.

* Clarify docstrings.

* Initialize list/dict with literals.

* Use one-line function instead of defining a separate method.

* Don't wrap line.

* class -> object.

* use itertools reference while using groupby

* Use reference for zip_longest too

* Document _get_unique_function_id

Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Add default aggregate support to flow (#335)

Change the internals of flow so that everything operates on default aggregates (aggregates of size 1) rather than individual jobs.

* Use tqdm.auto for better notebook compatibility. (#371)

* Remove unused attributes/methods. (#380)

* Remove unused __init__, the implicit parent constructor is equivalent.

* Use public .job attribute.

* Remove unused argument from _add_operation_selection_arg_group.

* Use logger.warning instead of deprecated logger.warn.

* Refactor error variables and messages. (#373)

* Remove unused **kwargs from _FlowProjectClass metaclass constructor.

* Raise ImportError instead of RuntimeWarning if pprofile is not installed (it's a hard error).

* Deprecate CPUEnvironment and GPUEnvironment (#381)

* Deprecate CPUEnvironment and GPUEnvironment. The classes are unused and untested.

* Remove unused private functions for importing from files.

* Update changelog.

* Make _verify_group_compatibility a class method. (#379)

* Use isinstance check for IgnoreConditions. (#378)

* Remove dangling else/elif clauses after raise/return statements. (#377)

Introduces stylistic changes suggested by pylint.

* Use f-string.

* Linting dicts. (#374)

* Refactor dicts, OrderedDict, and comprehensions. Resolves #323.

* Import Mapping from collections.abc.

* Use {}

* Use descriptive variable names. (#376)

* Use longer/clearer variable names.

* Update _no_aggregation.

* Rename _operation_to/from_tuple to _job_operation_to/from_tuple

Co-authored-by: Brandon Butler <butlerbr@umich.edu>

* Make _to_hashable a private function. (#384)

* Make _to_hashable private.

* Copy _to_hashable method from signac.

* Adding/improving docstrings and comments. (#375)

* Add and improve docstrings.

* Minor edits to docstrings.

* Use shorter one-line summaries in some long docstrings.

* Cleanup docstrings, variable names. (#372)

* Clean up docs and variables relating to pre-conditions and post-conditions.

* Clean up docstrings, variable names, defaults, remove obsolete helper functions.

* Use f-strings, avoid second-person "you".

* Use full name for template_filters.

* Docstring revisions.

* Use job_id consistently (adds to #363). (#386)

* Fix other instances of job_id missed in #363.

* Update changelog.

* Use descriptive variable name job_id.

* Add all missing docstrings, enforce pydocstyle. (#387)

Validates docstrings using pydocstyle and adds all the necessary documentation for this to pass.

* Revise Sphinx docs. (#389)

* Enforce pydocstyle more broadly.

* Ignore hidden folders.

* Use temporary config that allows missing docstrings.

* Apply changes from pydocstyle and minor edits for clarity.

* Add more docstrings.

* Re-enable D102.

* Add all missing docstrings for public methods.

* Update changelog.

* Remove duplicate summary of FlowProject.

* Don't use syntax highlighting in CLI help message.

* Revise environment docstrings.

* Update link for Comet docs.

* Document class property ALIASES.

* Revise copy_from.

* Revise label docstring.

* Add operations and base environment classes to docs.

* Standardize references to signac Jobs.

* Update APIs and references.

* Revise docstrings.

* Rename dummy state to placeholder.

* Update ComputeEnvironmentType docstrings.

* Explicitly list classes/functions to document.

* Remove all API from root of scheduling subpackage.

* Revise aggregation docs.

* Improve IgnoreConditions docs.

* Remove :py: (role is unnecessary).

* Docstring revisions.

* Apply suggestions from code review

Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Apply suggestions from code review

Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Update flow/project.py

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Update flow/project.py

* Clarify simple-scheduler purpose.

Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Use NumPy style docstrings. (#392)

* Use napoleon to render numpydoc style.

* NumPy-doc-ify aggregates.

* Fix name of aggregator class.

* Revisions to environment docstrings.

* Revisions to aggregate docstrings.

* NumPy-doc-ify INCITE environments.

* NumPy-doc-ify remaining environments.

* NumPy-doc-ify render_status.

* NumPy-doc-ify scheduler base classes.

* NumPy-doc-ify LSF scheduler.

* NumPy-doc-ify simple scheduler.

* NumPy-doc-ify SLURM scheduler.

* NumPy-doc-ify Torque scheduler.

* NumPy-doc-ify template.

* NumPy-doc-ify testing.

* NumPy-doc-ify util.

* NumPy-doc-ify project.

* Fix remaining sphinx-style docstrings.

* Updated docstrings with velin (https://github.com/Carreau/velin).

* Adopt numpy convention for pydocstyle.

* Remove methods that are implemented identically in the parent class.

* Add call operators to docs.

* Fix FlowGroup reference.

* Reformat examples.

* Remove trailing colons where no type is specified.

* Fix class reference.

* Fix (deprecated) JobOperation references.

* Use preconditions/postconditions.

* Add changelog line.

* Apply suggestions from code review

Co-authored-by: Brandon Butler <butlerbr@umich.edu>

* Move env argument.

* Fix typo.

* Apply suggested changes.

* Add docs to simple-scheduler, don't git ignore the bin directory containing simple-scheduler.

Co-authored-by: Brandon Butler <butlerbr@umich.edu>

* Optimization: Don't use min_len_unique_id in str(_JobOperation). (#400)

* Don't call min_len_unique_id when converting _JobOperation to a string. It's O(N) and is called O(N) times, leading to O(N^2) behavior and significant performance penalties.

* Update template reference data to use long job ids.

* Evaluate run commands lazily. (#396)

Cache lazily generated run commands to avoid side effects associated with the generation.
Also avoids introducing any side effects due to f-string evaluation when generating logging entries.

* Add missing raise statement.

* Initialize variables early to prevent unbound values.

* Use same boolean logic as elsewhere for determining eligibility.

* Use job id instead of string call.

* Try to return immediately instead of calling both __contains__ and __getitem__.

* Add codecov support to CI. (#405)

* Add codecov support.

* Remove mock (not used).

* Get executable directive at call time instead of definition time (#402)

* Make DefaultAggregateStore picklable (#383)

* Make default aggregate class picklable

* Refactor pickling

* Refactor code comments

* Add __getstate__, __setstate__ methods to FlowProject

* Make cloudpickle a hard dependency

* Fix linting issues

* Minor changes to variable names.

* Use f-string.

* Solve circular reference issue during unpickling by pre-computing hash value.

* Cache actual hash value.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Show total for aggregate status info.

* Simplify implementation of __iter__, avoid duplicated logic.

* Optimize _get_aggregate_from_id to avoid open_job call. Speeds up function by about 2x for default aggregators.

* Fail gracefully when expanding bundles if no jobs are on the scheduler (e.g. using TestEnvironment with a fake scheduler).

* Minor changes to variable names and docstrings.

* Fix Iteration Bug on next and add AggregatesCursor to flow (#390)

* Change method name, fix return for _select_jobs_from_args

* Add AggregatesCursor class to flow

* make AggregatesCursor private

* Fix import error

* Simplify iteration logic.

* Update docstring.

* Fix iteration order.

* Use set comprehension.

* Parse filter arguments in the CLI, pass through filters to JobsCursor.

* Rename internal variable to _jobs_cursor.

* Fix docstring errors.

* Refactor jobs-to-aggregates conversion.

* Update changelog.

* Optimize __contains__ for case with no filters.

* Simplify tqdm bar description.

* Update comment.

* Update comment.

* Update comment.

* Update variable name.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Small updates to docstrings and small refactors.

* Optimization: use cached status everywhere. (#410)

Prevents repeatedly checking the project document's file for status, by using a cached status for all methods that require submission status.

Squashed commit messages:
* Ensure that all status-fetching operations use a cached status dictionary.

* Use cached status in script command.

* Use cached_status and keyword arguments in tests.

* Fix test of submission operations.

* Update changelog.

* Remove FlowGroup._get_status. Resolves #368.

* Ensure that job operations are created.

* Update docstring.

* Remove pass-through argument for cached_status.

* Optimize job id generation to avoid expensive JSON encoding step (it's a plain string).

* Use context manager functionality of pools (requires Python 3.4+ and hadn't been updated).

* Add explicit test for threaded parallelism.

* Unify group and aggregate storage with a bi-map data structure. (#415)

* Unify group and aggregate storage with a bidirectional map data structure.

* Update comment.

* Swap iteration order of generating run operations.

* Improve __eq__ implementation.

* Update comment.

* Improve bidict.__delitem__.

* Fix bidict docstring.

* Only create aggregate stores once.

* Add tests for bidict.

* Make bidict private.

* Make bidict a MutableMapping, add more bidict tests.

* Add comment about insertion ordering.

* Optimization: don't deepcopy directives (#421)

* Modify _JobOperation construction to not deepcopy directives

Deepcopying directives is expensive since it is not a pure data
structure. This commit just converts the directives to a pure data
structure (i.e. a ``dict``). In addition, we pass the user specified
directives (as opposed to the environment specified directives) to
``_JobOperation``. This should improve the performance of run and
submit.

* Only evaluate directives when unevaluated.

Adds a check to prevent the same directive from being evaluated twice.
This should speed up the evaluation of multi-operation groups.

* Add back (fixed) comment about user directives.

* Update docstring.

* Use dict constructor instead of dict comprehension.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Unify group/aggregate selection (#416)

* Update argument name for get_aggregate_id.

* Update hashing.

* Add shared logic for group/aggregate selection.

* Reduce redundancy in methods using group/aggregate selection.

* Revert to by-job ordering in pending operations.

* Rename test.

* Add optimization if all selected groups share an aggregate store.

* Refactor group aggregate store optimization.

* Yield aggregate_id during selection iteration. Rename jobs to aggregate. Unify more selection logic.

* Use aggregate id in group id generation.

* Simplify _SubmissionJobOperation constructor.

* Make order of run operations deterministic (avoid set subtraction).

* Remove duplicate code for label fetching.

* Update template reference data.

* Remove _is_selected_aggregate method (replaced with "in" check).

* Skip postconditions check for eligibility if the group/aggregate is known to be incomplete.

* Rename variable.

* Only consider selected aggregates.

* Fix comment.

* Rename blob to id_string.

* Move check for zero selected groups.

* Add TODO.

* Add comment explaining eligibility ignoring postconditions.

* Remove index parameter (#423)

* Remove index parameter (resolves #418).

* Update template reference data.

* Remove the use of arguments scheduled to be removed in 0.12 (#424)

* Remove the use of  argument

* Remove the use of env attribute while submit

* Pass environment as keyword argument.

* Remove env argument in a few missed places.

* Fix template reference data. It was using mpiexec (default) and should use the cluster-specific MPI commands. I don't know why the reference data was wrong.

* Remove extra comment about status parallelization.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* Update tqdm requirement.

* Deprecate unbuffered mode (#425)

* Always buffer, add deprecation warning if use_buffered_mode=False.

* Remove unbuffered tests.

* Reduce number of jobs to speed up tests.

* Update changelog.

* Update deprecation notice.

* Refactoring tasks (#426)

* Refactor status rendering into a private function.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Renamed fakescheduler to fake_scheduler.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Make environment metaclass private.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Remove unused status.py file.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Remove out argument from templates.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Refactor scheduler classes, unify code.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Remove metaclass from docs.

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Update docstrings.

* Update docstrings.

* Update changelog.

* Update docstrings.

* Update changelog.txt

* Update changelog.txt

Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>

* Refactor aggregation (#422)

Cleans up various internals associated with aggregation, introducing additional abstraction layers and adding docstrings.  

* Refactor aggregation.

* Fix failing tests from refactor.

* Improve clarity and coverage of tests.

* Add more tests to enforce that aggregates are tuples.

* Expand tests of default aggregator.

* Update tests.

* Unify __iter__.

* __getitem__ should raise a KeyError if not found.

* Update error message.

* Update flow/aggregates.py

Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Simplify _register_aggregates method. (#430)

* New _fetch_status code. (#417)

Rewrite the _fetch_status method, which was extremely large and unwieldy. The new code is more efficient since it avoids making multiple passes through the jobs. It also uses some newly introduced functions and context managers as well as newer tqdm functions to clean up existing code.

* Initial rewrite of _fetch_status.

* Update _fetch_scheduler_status to require aggregates.

* Clarify docstring for status_parallelization.

* Experiment with callback approach (only iterate over aggregates/groups once).

* Skip evaluation of post-conditions during eligibility if known to be incomplete.

* Change operations to groups.

* Use context manager for updating the cached status in bulk.

* Separate scheduler query from cached status update.

* Fix docstring.

* Use context manager for _fetch_scheduler_status cache update.

* Use cached status context manager for _fetch_status.

* Refactor status updates so the scheduler query happens on-the-fly and only considers selected groups/aggregates.

* Intermediate commit with notes.

* Simplify eligibility check.

* Update docstring.

* Update tests to account for removed method.

* Rename cid to cluster_id.

* Define parallel executor.

* Only update status if the update dictionary has data.

* Add parallel execution.

* Remove status callback.

* Clean up status fetching.

* Remove unnecessary internal methods.

* Fix test (scheduler is queried internally).

* Rename jobs -> aggregate.

* Fix bug in tests (the dynamically altered jobs should NOT be resubmitted, this was probably an error due to the use of cached status rather than querying the scheduler).

* Show aggregate id in str of _JobOperation.

* Fix script output.

* Remove keyword argument.

* Reset MockScheduler in setUp for all tests.

* Mock root directory (can't override Project.root_directory() because it is used for all job document paths and buffering, and must reflect an actual path on disk).

* Update template reference data.

* Refactor mocked root.

* Pickle function to be executed in parallel using cloudpickle so that it can be sent across threads.

* Fix pre-commit hooks.

* Fix root directory mocking.

* Improve progress iterators.

* Use chunked job label fetching.

* Refactor parallel executor to accept one iterable and use Pool for process parallelism.

* Use integers in the cached status update.

* Fix mocking of system executable. Resolves #413.

* Update changelog.

* Mock environment directly rather than using **kwargs for compatibility with signac 1.0.0.

* Buffer during project status update.

* Use ordered results.

* Don't buffer during status update (no performance difference).

* Refactor job labels so that the list of individual jobs is generated during the same single loop over the project to generate aggregates.

* Update flow/util/misc.py

Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Fix function parameters in test, use kwargs for _fetch_status.

* Use process_map.

* Use MOCK_EXECUTABLE.

* Add comments explaining use of FakeScheduler.

* Collect errors that occur during status evaluation.

* Mock the id generation method instead of injecting mocked attributes.

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>

* Change reference name from fakescheduler to fake_scheduler (#431)

* Make class aggregator and method get_aggregate_id private (#432)

* Make aggregator and get_aggregate_id private

* Remove reference from docs

* Use private name for _aggregator in test_aggregates.py.

* Remove aggregates from documentation, add TODOs.

* Fix docstring reference.

* Update changelog.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

Co-authored-by: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>
Co-authored-by: Vyas Ramasubramani <vramasub@umich.edu>
Co-authored-by: Brandon Butler <butlerbr@umich.edu>
Co-authored-by: Alyssa Travitz <atravitz@umich.edu>
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>