Only restore jobIds added within last 3 days #983

roulaoregan-spi
Contributor

Summarize your change.
A common complaint from our users is that if they added 1000s of jobs to their previous Job Monitor Tree session, opencuetopia takes a very long time to load, especially if the jobs were moved to the historical database since their last session. They also receive many warnings in the shell, which further bogs down the app.

To avoid the restore lag for such scenarios:

  • Only restore jobs from config.ini that were added within the last 3 days.
  • Add a timestamp to the config.ini file for each monitored job. This uses the time the user added the job to the monitor rather than the job's database startTime(), because a job could be running on the farm for 3+ days or be dependent on other jobs. Such a job could have a database start time more than 3 days old (and hence would not be restored) while still being in progress and of interest to the user.
  • This change does not prevent the user from adding 1000s of jobs to be monitored; it only limits the restore to the most recent 200 jobs.
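
The restore rule described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `select_jobs_to_restore`, `SECONDS_PER_DAY`, and the constant names are hypothetical, and the input is assumed to be a list of `(job_id, timestamp)` tuples with timestamps in seconds since the epoch.

```python
import time

# Hypothetical constants for illustration; the values follow the PR description.
JOB_RESTORE_THRESHOLD_DAYS = 3
JOB_RESTORE_MAX_JOBS = 200
SECONDS_PER_DAY = 24 * 60 * 60


def select_jobs_to_restore(saved_jobs, now=None):
    """Return the job ids worth restoring, newest first.

    saved_jobs: list of (job_id, added_timestamp) tuples, where the
    timestamp records when the user added the job to the monitor.
    """
    now = time.time() if now is None else now
    cutoff = now - JOB_RESTORE_THRESHOLD_DAYS * SECONDS_PER_DAY
    # Keep only jobs added within the threshold window.
    recent = [(job_id, ts) for job_id, ts in saved_jobs if ts >= cutoff]
    # Newest first, then cap the number of restored jobs.
    recent.sort(key=lambda item: item[1], reverse=True)
    return [job_id for job_id, _ in recent[:JOB_RESTORE_MAX_JOBS]]
```

Filtering on the time the job was added, rather than on the job's start time, is what keeps a long-running job in the restored set as long as the user added it recently.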

Collaborator

@bcipriano bcipriano left a comment

General question here -- why do we need this if jobs are auto-archived? Won't archived jobs fail to load during the restore process and drop off the monitored list?

@@ -46,7 +47,8 @@
PLUGIN_DESCRIPTION = "Monitors a list of jobs"
PLUGIN_PROVIDES = "MonitorJobsDockWidget"
REGEX_EMPTY_STRING = re.compile("^$")

TIME_DELTA = 3
Collaborator

This constant name is a bit generic IMO and it's generally preferred for the name to contain the time unit if it's a duration -- maybe something like JOB_RESTORE_THRESHOLD_DAYS?

Collaborator

Sorry, I was suggesting to rename TIME_DELTA -> JOB_RESTORE_THRESHOLD_DAYS. I'm confused now about what each of these does.

Based on the code it looks like these two lines should be:

# Maximum number of jobs that will be restored on CueGUI restart.
JOB_RESTORE_MAX_JOBS = 200
# Maximum age for a job to be restored on CueGUI restart.
JOB_RESTORE_THRESHOLD_DAYS = 3

Do I have that right?

"""
today = datetime.datetime.now()
limit = JOB_LOAD_LIMIT if len(jobIds) > JOB_LOAD_LIMIT else len(jobIds)
msg = """
Collaborator

nit: For multiline strings I think we should use implicit joins instead of the triple quotes -- this adds extra whitespace to the log message, right?

Maybe something like:

msg = ('Unable to load previously loaded job since it was moved '
       'to the historical database: {0}')
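
To make the whitespace concern concrete, here is a small sketch comparing the two styles. The job id is the example value from the docstring in this PR; the variable names are only illustrative.

```python
job_id = "Job.f156be87-987a-48b9-b9da-774cd58674a3"

# Triple-quoted: the line break and the indentation of the second line
# become part of the string itself.
triple_quoted = """Unable to load previously loaded job since it was moved
                to the historical database: {0}""".format(job_id)

# Implicit join: adjacent string literals are concatenated into one string,
# so the message stays on a single line.
implicit_join = ('Unable to load previously loaded job since it was moved '
                 'to the historical database: {0}').format(job_id)

print(repr(triple_quoted))
print(repr(implicit_join))
```

Printing with `repr` makes the embedded newline and the run of spaces in the triple-quoted version visible.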

(jobs are moved to historical database)

:param jobIds: monitored jobs ids and their timestamp from previous working state (loaded from config.ini file)
:ptype: list of tuples (ex: [("Job.f156be87-987a-48b9-b9da-774cd58674a3", 1612482716.170947),...
Collaborator

nit: I believe the format for argument types should be :type jobIds: list[tuple]. Additional explanation, examples, etc. should be either on the :param line or above in the description.
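
A docstring following that layout might look like the sketch below. The function name `restore_job_ids` and its body are hypothetical placeholders; only the field layout is the point.

```python
def restore_job_ids(jobIds):
    """Restore monitored jobs from the previous working state.

    Each entry pairs a job key with the time the user added it, for example
    ``("Job.f156be87-987a-48b9-b9da-774cd58674a3", 1612482716.170947)``.

    :param jobIds: monitored job ids and their timestamps, loaded from config.ini
    :type jobIds: list[tuple]
    """
    # Placeholder body: return only the job keys, dropping the timestamps.
    return [job_id for job_id, _timestamp in jobIds]
```

The example value lives in the description, so the `:type` line carries nothing but the type.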

@roulaoregan-spi
Contributor Author

@bcipriano - to answer your two questions:
why do we need this if jobs are auto-archived?
Won't archived jobs fail to load during the restore process and drop off the monitored list?

Cuegui uses the config.ini file to load the saved state, which is not ideal.

Some more context: there is another PR that I still need to submit for a bug in the autoload on reopen. Currently, opencue doesn't autoload jobs that are not owned by the user, because the Utils.py findJob regex needs to prepend ^Job since it uses the JobKey to save to the config.ini file. For example: Plugins_Opened="Monitor Jobs::{\"jobs\": [\"Job.00000000-0000-0000-00000-000000000000\"
Once this was fixed, it had the side effect of loading "every" job saved historically in the config file.
The issue is in how failing to load archived jobs is implemented: it loops over each job ID in the config.ini list, which caused a problem for users who have 1000+ jobs to load from the config.ini file. Opencue issues the warning "Unable to load previously loaded job since it was moved to the historical database:" to the shell for each job that it was not able to add to the Monitor Jobs Plugin widget, causing a bottleneck when launching the app. Users would see 1000+ warnings in the shell, the app would not load, and they would just exit before the restore completed because iterating over so many jobs takes so long. Our quick fix was to delete the config file and have them open the app again. However, when they added another 1000+ jobs to monitor, closed the app, and relaunched 4-5 days later, the same issue occurred again.

Although this is a "fix", the real issue is that the restore state needs to be redesigned so that it does not prevent the app from launching if it can't load the job IDs. I am also okay with deferring that from this PR to a separate issue for redesigning the restore state. Thoughts?
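
One possible direction for such a redesign, sketched under assumptions rather than taken from CueGUI: collect the jobs that fail to load and emit a single summary warning instead of one per job. `restore_jobs` and the `load_job` callable are hypothetical names. This only addresses the flood of shell warnings; it does not remove the per-job lookup cost.

```python
import logging

logger = logging.getLogger(__name__)


def restore_jobs(job_ids, load_job):
    """Try to load each job; never raise, and warn once for all failures.

    load_job is a hypothetical callable that returns a job object, or
    raises an exception when the job can no longer be found.
    """
    restored, missing = [], []
    for job_id in job_ids:
        try:
            restored.append(load_job(job_id))
        except Exception:  # broad on purpose: startup must not be blocked
            missing.append(job_id)
    if missing:
        # One summary line instead of one warning per missing job.
        logger.warning(
            'Unable to load %d previously monitored job(s); they may have '
            'been moved to the historical database.', len(missing))
    return restored, missing
```

Returning the list of missing ids lets the caller drop them from the saved state, so they are not retried on the next launch.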

Collaborator

@bcipriano bcipriano left a comment

Ok, thanks for that, I get it now -- the extra load is just during startup, not during the entire runtime. I agree this makes sense as a stopgap fix.

Approved, but looks like there's a unit test failing.

@roulaoregan-spi
Contributor Author

Thanks Brian.
I will look at the failure

@bcipriano
Collaborator

Ping -- any assistance needed to resolve this test failure?

@roulaoregan-spi roulaoregan-spi force-pushed the refactor-restore-jobIds branch 8 times, most recently from 1150174 to 86ad2b3 Compare July 20, 2021 01:55
@roulaoregan-spi
Contributor Author

@bcipriano - I fixed the pylint errors and the pipeline now passes.

@bcipriano
Collaborator

Thanks!

There are a couple of open comments in the code review, could you take a look at those please?

@roulaoregan-spi
Contributor Author

@bcipriano - will do, apologies that I missed them previously!

@roulaoregan-spi roulaoregan-spi force-pushed the refactor-restore-jobIds branch 3 times, most recently from 440ae37 to 9d995fb Compare December 4, 2021 00:49
@@ -226,23 +226,41 @@ def setLoadMine(self, value):
@type value: boolean or QtCore.Qt.Checked or QtCore.Qt.Unchecked"""
self.__loadMine = (value is True or value == QtCore.Qt.Checked)

- def addJob(self, job):
+ def addJob(self, job, timestamp=None):
"""Adds a job to the list. With locking"
@param job: Job can be None, a job object, or a job name.
@type job: job, string, None"""
Collaborator

Could you add a line to the docstring explaining what the new timestamp param does?

Contributor Author

Just added timestamp to the docstring

@roulaoregan-spi
Contributor Author

@bcipriano, sorry about that. Yes, my previous commit was incorrect. I updated the constants: one for the age threshold in days and the other for the job limit. I hope this fixes the issue.

@roulaoregan-spi
Contributor Author

roulaoregan-spi commented Dec 5, 2021

@bcipriano I noticed that the build tests are failing on:

Warning, treated as error:
node class 'meta' is already registered, its visitors will be overridden
Error: Process completed with exit code 2.

I think it might be related to: sphinx-doc/sphinx#9841
Any suggestions on what I could do to fix this?

Thanks in advance

Collaborator

@bcipriano bcipriano left a comment

Looks good, thanks!

Yeah, that documentation build failure is due to an upstream issue. Tracked by #1065, we'll get that fixed up soon.

@bcipriano
Collaborator

#1065 is fixed now, I re-ran the checks in this PR but it's still failing. I think you'll just need to sync from master.

roulaoregan-spi and others added 5 commits December 6, 2021 08:00
Changed constant to more descriptive name,
changed multi line text to use implicit joins,
fixed arg type.
Previous commit mixed constant job threshold's
limit with days, corrected it. Added comment
description for timestamp in addJob.
@roulaoregan-spi roulaoregan-spi force-pushed the refactor-restore-jobIds branch from d23b978 to e313aff Compare December 6, 2021 16:00
@roulaoregan-spi
Contributor Author

@bcipriano - thank you! I rebased the code from master and all tests pass.

@DiegoTavares DiegoTavares merged commit 5122518 into AcademySoftwareFoundation:master Dec 14, 2021