
Conflicting Namespace- Using the SLURM scheduler to check the status of a project tries to find bundled operations from other users and/or instances of FlowProject #758

Closed
CalCraven opened this issue Jul 24, 2023 · 17 comments · Fixed by #832
Labels: bug (Something isn't working)

Comments

@CalCraven

Description

When submitting jobs to a SLURM scheduler with the default template, the job status is returned with the [A], [Q], etc. markers to keep track of these jobs. The jobs themselves are labeled in the scheduler via their NAME attribute, which looks something like Project_name/bundle/job_hash. The issue arises when two users use the same Project_name when creating their FlowProject subclasses. An error will be raised because the status check sees that there's a job under your project name, but that job_hash will not exist in your .bundles directory. A possible solution might be to append the username to the Project_name when the job is posted to the scheduler, i.e. Project_name-user_name/bundle/job_hash.
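A minimal sketch of that proposed naming scheme (this is the issue's suggestion, not flow's current behavior; the helper name and the truncated hash are hypothetical):

```python
import getpass

def scheduler_job_name(project_name, bundle_hash, user=None):
    # Hypothetical helper folding the submitting user's name into the
    # scheduler job name, as suggested above; flow does not do this today.
    if user is None:
        user = getpass.getuser()
    return f"{project_name}-{user}/bundle/{bundle_hash}"

# Two users with identically named projects would no longer collide:
a = scheduler_job_name("Project", "9bbc851e", user="quachcd")
b = scheduler_job_name("Project", "9bbc851e", user="craven76")
print(a)  # Project-quachcd/bundle/9bbc851e
print(b)  # Project-craven76/bundle/9bbc851e
```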

To reproduce

./project1/project1.py

class Project(FlowProject):
    """Subclass of FlowProject to provide custom methods and attributes."""
    def __init__(self):
        super().__init__()
if __name__ == "__main__":
    Project().main()

./project2/project2.py

class Project(FlowProject):
    """Subclass of FlowProject to provide custom methods and attributes."""
    def __init__(self):
        super().__init__()
if __name__ == "__main__":
    Project().main()
Then, from the shell:

python project1.py submit
cd ../project2
python project2.py status

Error output

(membrane) [quachcd@head minimized_surfaces]$ python src/project.py status 
/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
  warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/raid6/homes/quachcd/Documents/env/science_env/mbuild/mbuild/recipes/__init__.py:13: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  entry_points = metadata.entry_points()["mbuild.plugins"]
/raid6/homes/quachcd/Documents/env/science_env/gmso/gmso/formats/mol2.py:79: UserWarning: The record type indicator Meta is not supported. Skipping current section and moving to the next RTI header.
  warnings.warn(
/raid6/homes/quachcd/Documents/env/science_env/gmso/gmso/formats/mol2.py:79: UserWarning: The record type indicator @<TRIPOS>SUBSTRUCTURE is not supported. Skipping current section and moving to the next RTI header.
  warnings.warn(
Using environment configuration: Rahman
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
Querying scheduler...
ERROR:flow.project:Error during status update: [Errno 2] No such file or directory: '/raid6/homes/quachcd/Documents/science/membrane/solvents-separation-membrane/screening_workflow/minimized_surfaces/.bundles/Project/bundle/9bbc851ef6ac321ec66fe0a8d7c94d735673411d'
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
  File "/raid6/homes/quachcd/Documents/science/membrane/solvents-separation-membrane/screening_workflow/minimized_surfaces/src/project.py", line 227, in <module>
    project.main()
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 5120, in main
    args.func(args)
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 4767, in _main_status
    raise error
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 4761, in _main_status
    self.print_status(jobs=aggregates, **args)
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2980, in print_status
    status_results, job_labels, individual_jobs = self._fetch_status(
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2646, in _fetch_status
    scheduler_info = self._query_scheduler_status(
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2554, in _query_scheduler_status
    return {
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2554, in <dictcomp>
    return {
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2198, in scheduler_jobs
    yield from self._expand_bundled_jobs(scheduler.jobs())
  File "/raid6/homes/quachcd/.conda/envs/membrane/lib/python3.10/site-packages/flow/project.py", line 2172, in _expand_bundled_jobs
    with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/raid6/homes/quachcd/Documents/science/membrane/solvents-separation-membrane/screening_workflow/minimized_surfaces/.bundles/Project/bundle/9bbc851ef6ac321ec66fe0a8d7c94d735673411d'

System configuration

Please complete the following information:

  • Operating System: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.10
  • Version of Python: 3.8.15
  • Version of signac: 2.0.0
  • Version of signac-flow: 0.25.1
@bdice (Member) commented Jul 24, 2023

Here's some information that may be helpful for tracking down the cause of the bug.

For individual jobs (not bundles), signac-flow includes the project's full path in the hash when generating the submission names.

signac-flow/flow/project.py

Lines 970 to 973 in c0f44b2

aggregate_id = get_aggregate_id(aggregate)
full_name = f"{project.path}%{aggregate_id}%{op_string}"
# The job_op_id is a hash computed from the unique full name.
job_op_id = md5(full_name.encode("utf-8")).hexdigest()

For bundles, the bundle ID is dependent on the ids of the input job-operations. In my understanding this should already incorporate the project's full path (which we use to ensure uniqueness across projects). If the users are operating on separate signac data spaces, I would not have expected this to be a problem.

_id = sha1(".".join(op.id for op in operations).encode("utf-8")).hexdigest()
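For illustration, that bundle id computation can be reproduced in isolation. The operation ids below are shaped like the breakpoint output later in this thread (ClassName/aggregate_id/operation/job_op_id); because the trailing job_op_id is an md5 over a string containing the project path, the resulting bundle id is path-dependent even though the bundle prefix is not:

```python
from hashlib import sha1

# Operation ids in the shape seen in the logging output below; the final
# component of each is the md5-based job_op_id, which already encodes the
# project path.
op_ids = [
    "TestProject/e535d5106574b0506407aebaec71318e/wait_1min/19d14bb5f93e4a7089005cd827fc69eb",
    "TestProject/c7426fb8d903c232eeb81f6454c81069/wait_1min/f99b4a74c99916888f94527db2002770",
]
bundle_id = sha1(".".join(op_ids).encode("utf-8")).hexdigest()
print(bundle_id)  # a 40-character hex digest
```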

Can you confirm if the projects share a path? Are the users operating on independent signac projects, or two FlowProjects pointing to the same signac project? Maybe you can insert a breakpoint and share the output of [op.id for op in operations] in _store_bundled for such a case?

@bdice (Member) commented Jul 24, 2023

I would lean away from solutions that insert the username into the job/bundle identifier, because theoretically two users operating on a shared signac project should be able to submit and query over the same jobs without conflict. Adding a username to the mix would mean that signac-flow can't identify whether duplicate work is being submitted by both users acting on the same data space.

@CalCraven (Author)

This is a fair point about not inserting the username for that use case.

For the example error message posted above, the two projects were at different paths with the two FlowProjects pointing to different signac projects, but still conflicted when checking their status, allowing neither to be checked while the other had jobs in the queue or active on nodes. I'll generate a toy example, such as what I posted above and see if I can replicate the error while also inserting that breakpoint.

Thanks for your help!

@CalCraven (Author)

Okay, here's a toy example with the following file tree:

./
├── project1
│   ├── init_jobs.py
│   └── project.py
└── project2
    ├── init_jobs.py
    └── project.py

Code to reproduce the error

craven76@head :~$ ls ./
project1 project2
craven76@head :~$ cd project1
craven76@head :~$ python init_jobs.py
craven76@head :~$ python project.py submit -n 4 --bundle 2
Using environment configuration: Rahman
Querying scheduler...
Submitting cluster job 'TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780':
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
 - Group: wait_1min(89e412cabc2300a734873ff43b2dd367)
 - Group: wait_1min(f3e56c602771e9541aef61d502562b89)
Submitting cluster job 'TestProject/bundle/0333205b13dc702ff1adafb5fe7a1ba309c90a2c':
 - Group: wait_1min(e535d5106574b0506407aebaec71318e)
 - Group: wait_1min(c7426fb8d903c232eeb81f6454c81069)
craven76@head :~$ cd ../project2
craven76@head :~$ python project.py status
Using environment configuration: Rahman
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
Querying scheduler...
ERROR:flow.project:Error during status update: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
  File "project.py", line 30, in <module>
    TestProject().main()
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 5120, in main
    args.func(args)
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4767, in _main_status
    raise error
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4761, in _main_status
    self.print_status(jobs=aggregates, **args)
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2980, in print_status
    status_results, job_labels, individual_jobs = self._fetch_status(
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2646, in _fetch_status
    scheduler_info = self._query_scheduler_status(
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2554, in _query_scheduler_status
    return {
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2554, in <dictcomp>
    return {
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2198, in scheduler_jobs
    yield from self._expand_bundled_jobs(scheduler.jobs())
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2172, in _expand_bundled_jobs
    with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'

@CalCraven (Author)

And then I went and added the breakpoint [op.id for op in operations] within signac-flow/flow/project.py.

With the result of:

craven76@head :~$ python project.py submit -n 4 --bundle 2
WARNING:root:The operations for the space are ['TestProject/e535d5106574b0506407aebaec71318e/wait_1min/19d14bb5f93e4a7089005cd827fc69eb', 'TestProject/c7426fb8d903c232eeb81f6454c81069/wait_1min/f99b4a74c99916888f94527db2002770']
craven76@head :~$ cd ../project2 
craven76@head :~$ python project.py status
Using environment configuration: Rahman
WARNING:flow.project:Unable to load template from package. Original Error '__main__.__spec__ is None'.
Querying scheduler...
ERROR:flow.project:Error during status update: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
  File "project.py", line 30, in <module>
    TestProject().main()
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 5124, in main
    args.func(args)
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4771, in _main_status
    raise error
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 4765, in _main_status
    self.print_status(jobs=aggregates, **args)
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2984, in print_status
    status_results, job_labels, individual_jobs = self._fetch_status(
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2650, in _fetch_status
    scheduler_info = self._query_scheduler_status(
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2558, in _query_scheduler_status
    return {
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2558, in <dictcomp>
    return {
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2202, in scheduler_jobs
    yield from self._expand_bundled_jobs(scheduler.jobs())
  File "/raid6/homes/craven76/.conda/envs/switchable38/lib/python3.8/site-packages/flow/project.py", line 2176, in _expand_bundled_jobs
    with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/raid6/homes/craven76/test_signac_bug/project2/.bundles/TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780'

@CalCraven (Author)

Original .py files for generating the issues above.
signac_namespace_bug.zip

@b-butler (Member) commented Oct 13, 2023

@CalCraven I am not sure how this is happening with two users, given

if user is None:
    user = getpass.getuser()
cmd = ["squeue", "-u", user, "-h", "--format=%2t%100j"]
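As a rough sketch of why this query is per-user rather than per-project (the sample output and fixed-width parsing below are assumptions about the `--format=%2t%100j` shape, not flow's exact code):

```python
# Fabricated squeue output: a 2-character status column ("%2t") followed
# by the job name ("%100j"). Every job the *user* owns is returned,
# regardless of which project submitted it.
sample = (
    "R TestProject/bundle/2b7a0a6de2e0451ac1a2805d7623eefa12d1d780\n"
    "PDTestProject/bundle/0333205b13dc702ff1adafb5fe7a1ba309c90a2c\n"
)

# Split each line into (name, status) by the fixed 2-character status field.
jobs = [(line[2:].strip(), line[:2].strip()) for line in sample.splitlines()]
# Both entries match the "TestProject/bundle/" prefix, so a status check
# from *any* project whose class is named TestProject will claim them.
```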

but for a single user with two projects, I think it would error due to

signac-flow/flow/project.py

Lines 2169 to 2174 in 74de383

bundle_prefix = self._bundle_prefix
for job in scheduler_jobs:
    if job.name().startswith(bundle_prefix):
        with open(self._fn_bundle(job.name())) as file:
            for line in file:
                yield ClusterJob(line.strip(), job.status())

where self._bundle_prefix is "ClassName/bundle/".

Thus, I am surprised that someone else's submissions are causing errors, but it makes sense that one person's submissions could. We assume that if it has the same bundle prefix it is the same, but the prefix is not at all expected to be unique.

Solutions:

  • Add some spice to bundle_prefix to prevent matching.
  • Iterate over the bundles directory and check for job existence, not the other way around.
  • Add new metadata about extant bundles/scheduler jobs and iterate over that.
  • Fail silently on a bundle file that does not exist.

Out of these, the second is most appealing from a design perspective (we would just need to place the jobs in a {jobid: scheduler_job} dictionary first).
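A sketch of that second option, assuming a flat directory of bundle files named by their hash (the function and argument names are made up; this is not the fix that eventually shipped):

```python
import os

def expand_bundled_jobs(scheduler_jobs, bundles_dir, bundle_prefix):
    # Place the scheduler jobs in a name -> job mapping first ...
    by_name = {job.name(): job for job in scheduler_jobs}
    # ... then drive the loop from bundles this project actually wrote.
    # A scheduler job with no bundle file here belongs to some other
    # project and is never touched, so no FileNotFoundError can occur.
    for fname in os.listdir(bundles_dir):
        job = by_name.get(bundle_prefix + fname)
        if job is None:
            continue
        with open(os.path.join(bundles_dir, fname)) as file:
            for line in file:
                yield (line.strip(), job.status())
```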

@seoulfood (Contributor)

I also accidentally triggered this error, using two FlowProjects that have the same class name (MyProject in my case). Debugging with @cbkerr! A workaround that I'm using is to name the class after the specific project I'm working on and hope that no one else is using that name.

There is a slurm job named MyProject/bundle/3572942c6bf210551200f6013ddfac753c662383, and when I created another separate project with class name MyProject, the status check tried to find this bundle, which only exists in the other project.

The following traceback is generated when calling print_status():

Querying scheduler...
Traceback (most recent call last):
  File "/gpfs/accounts/sglotzer_root/sglotzer0/gabs/why_is_signac_not_working/testSignac.py", line 35, in <module>
    project.print_status(detailed=True, parameters=['n'])
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2980, in print_status
    status_results, job_labels, individual_jobs = self._fetch_status(
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2646, in _fetch_status
    scheduler_info = self._query_scheduler_status(
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2554, in _query_scheduler_status
    return {
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2554, in <dictcomp>
    return {
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2198, in scheduler_jobs
    yield from self._expand_bundled_jobs(scheduler.jobs())
  File "/home/gabs/anaconda3/envs/HOOMDTester/lib/python3.9/site-packages/flow/project.py", line 2172, in _expand_bundled_jobs
    with open(self._fn_bundle(job.name())) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/accounts/sglotzer_root/sglotzer0/gabs/why_is_signac_not_working/.bundles/MyProject/bundle/3572942c6bf210551200f6013ddfac753c662383'

@cbkerr cbkerr added the bug Something isn't working label Jan 30, 2024
@CalCraven (Author)

Thanks for reminding me of this thread @seoulfood! Yeah, I use the same workaround and hope my naming is unique. Will keep an eye on this if you and @cbkerr can think up a more permanent fix. I can also test anything on our cluster if need be.

@b-butler (Member)

@CalCraven @seoulfood I posted some solutions above. The problem is that we check all scheduler jobs by a user, regardless of project. The bundle prefix, which determines if we assume the scheduler job comes from the current project, only depends on the project class name.

@CalCraven (Author)

@CalCraven I am not sure how this is happening with two users, given

Yeah, that seems strange; it could be something unique to the setup of our cluster. I haven't been able to replicate it, so I'm not especially worried about that part. My intuition says it could easily have come from me and another person each using the default "FlowProject" label for two unique projects, with the status checks being limited to the same bug per user. I will dig a little more to see if that's the case, but I think you're right that cross-user conflict shouldn't be an issue.

Out of these, the second is most appealing from a design perspective (we would just need to place the jobs in a {jobid: scheduler_job} dictionary first).

I agree with your opinion, although I know little about the effort the other three solutions would take to implement. Especially in the case where one is operating on a huge cluster with thousands of jobs, it makes sense to operate only on jobs you know exist in the current project.

@joaander (Member)

Out of these, the second is most appealing from a design perspective (we would just need to place the jobs in a {jobid: scheduler_job} dictionary first).

I agree with your opinion, although I know little about the effort the other three solutions would take to implement. Especially in the case where one is operating on a huge cluster with thousands of jobs, it makes sense to operate only on jobs you know exist in the current project.

I don't think this alone completely solves the problem. If the same user has two projects of the same name AND submits the same operations on the same jobs in each, then there is no way to tell them apart.

I agree that looping only over bundles known to the project is a good improvement, but I think we additionally need to disambiguate the bundle ids. Perhaps add a hash of the project's absolute path? In signac 2.0, it is not possible to store two projects at the same path. We wouldn't necessarily need to add additional hash characters to an already long bundle id; we could include the project path as a salt in the hash that is already computed.
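A sketch of that salting idea (illustrative only; the separator and names are made up, and the eventual fix may differ):

```python
from hashlib import sha1

def bundle_id(op_ids, project_path):
    # Same sha1 over the joined operation ids as today, but salted with
    # the project's absolute path. Since signac 2.0 forbids two projects
    # at the same path, identically named projects at different locations
    # now get distinct bundle ids without lengthening the id itself.
    salted = f"{project_path}%" + ".".join(op_ids)
    return sha1(salted.encode("utf-8")).hexdigest()

ops = ["op-a", "op-b"]  # hypothetical operation ids
assert bundle_id(ops, "/home/u/project1") != bundle_id(ops, "/home/u/project2")
assert len(bundle_id(ops, "/home/u/project1")) == 40  # id length unchanged
```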

@bdice (Member) commented Feb 23, 2024

Perhaps add a hash of the project's absolute path? [...] we could include the project path as a salt in the hash that is already computed.

I agree this sounds like the right solution. However, when I looked at this issue in the past, I was confused because I saw there is already a project path being included here:

full_name = f"{project.path}%{aggregate_id}%{op_string}"

Perhaps we're missing this somewhere else?

@b-butler (Member)

We currently assume that if a job exists in the cluster's scheduler with the bundle prefix, which only includes the project class name and the word "bundle", then the scheduler job belongs to that project, and we attempt to open the file corresponding to that bundle; however, that file may not exist if two projects with the same name exist in the same place.

@joaander You are right, I didn't think about false positives the other way around. We do need to disambiguate the bundle prefix more regardless of going from file to scheduler job or scheduler job to bundle.

@joaander (Member)

@joaander You are right, I didn't think about false positives the other way around. We do need to disambiguate the bundle prefix more regardless of going from file to scheduler job or scheduler job to bundle.

@bdice Pointed out that this is already done. I failed to read the entire comment thread or look at the code.

@b-butler (Member)

This is true of scheduler jobs, but as I mentioned above and can also be seen here,

signac-flow/flow/project.py

Lines 2129 to 2132 in 74de383

@property
def _bundle_prefix(self):
    sep = getattr(self._environment, "JOB_ID_SEPARATOR", "/")
    return f"{self.__class__.__name__}{sep}bundle{sep}"
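The collision is easy to demonstrate with a minimal stand-in for that property (the class and attribute names below mimic the snippet but are otherwise hypothetical):

```python
class _Env:
    JOB_ID_SEPARATOR = "/"  # stand-in for the environment object

class _FlowProjectLike:
    _environment = _Env

    @property
    def _bundle_prefix(self):
        # Same logic as the property quoted above.
        sep = getattr(self._environment, "JOB_ID_SEPARATOR", "/")
        return f"{self.__class__.__name__}{sep}bundle{sep}"

def make_project():
    # Simulates two users independently defining a subclass with the
    # same name in unrelated directories.
    class MyProject(_FlowProjectLike):
        pass
    return MyProject()

p1, p2 = make_project(), make_project()
# No path, user, or project id enters the prefix, so they collide:
assert p1._bundle_prefix == p2._bundle_prefix == "MyProject/bundle/"
```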

and here,

signac-flow/flow/project.py

Lines 2169 to 2174 in 74de383

bundle_prefix = self._bundle_prefix
for job in scheduler_jobs:
    if job.name().startswith(bundle_prefix):
        with open(self._fn_bundle(job.name())) as file:
            for line in file:
                yield ClusterJob(line.strip(), job.status())

when it comes to searching for extant bundles, bundle_prefix does not contain disambiguating information, and thus we attempt to open non-existent files when two projects with the same name both use bundles.

@cbkerr cbkerr changed the title Conflicting Namespace- Using the SLURM scheduler to check the status of a project conflicts with jobs from other users Conflicting Namespace- Using the SLURM scheduler to check the status of a project tries to find bundled operations from other users and/or instances of FlowProject Mar 15, 2024
@bdice bdice mentioned this issue Mar 18, 2024
bdice added a commit that referenced this issue Mar 18, 2024
@bdice (Member) commented Mar 18, 2024

I was cleaning out old work and wanted to follow up on this. I filed #832 which I believe solves the problem that @b-butler identified. Please take a look when you are able.

@cbkerr cbkerr added this to the v0.29.0 milestone Mar 21, 2024