Fix issue 12040 #12155

Open: wants to merge 8 commits into master

vkuznet
Contributor

@vkuznet vkuznet commented Oct 22, 2024

Fixes #12040

Status

In development

Description

Introduce new logic to update sites and associated rules:

  • Add a new POST API for the transferor to MSManager; the JSON payload is POSTed to /ms-transferor/data/transferor
  • Add a new updateSites API to handle the POST request and return a status to the upstream caller (ReqMgr2)
  • Use the local file system as persistent storage for the JSON payloads
  • Define the JSON payload as a list with the following data structure:
[{"workflow": <name>, "SiteWhiteList": [T1,...], "SiteBlackList": [T2, ...]}, {...}]
  • Define the response from MSTransferor to the upstream caller as follows:
[{"workflow": <name>, "error": <error>}, {...}]
  • Implement saveData and readData to perform IO operations on the provided JSON payload and handle its persistent storage. So far these APIs rely on the local file system, storing each JSON record as a file named after the workflow. If we decide to use other storage, e.g. a database, only these two APIs will need to change
  • Provide the business logic of the _updateSites API, which will be executed by the execute API of the MSTransferor daemon.
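
To make the file-per-workflow idea concrete, here is a minimal sketch of how the payload could be persisted and acknowledged. It is not the code in this PR: the standalone function signatures, the `storage` argument and the `"unknown"` placeholder name are assumptions made for illustration, and the workflow and site names are placeholders.

```python
import json
import os
import tempfile


def saveData(storage, doc):
    """Write one payload record to <storage>/<workflow name> as JSON."""
    os.makedirs(storage, exist_ok=True)
    fname = os.path.join(storage, doc["workflow"])
    with open(fname, "w", encoding="utf-8") as ostream:
        json.dump(doc, ostream)
    return fname


def readData(storage, workflow):
    """Read back the payload record stored for the given workflow name."""
    fname = os.path.join(storage, workflow)
    with open(fname, "r", encoding="utf-8") as istream:
        return json.load(istream)


def updateSites(storage, payload):
    """Persist every record; return a list of per-workflow errors (empty if none)."""
    errors = []
    for doc in payload:
        try:
            saveData(storage, doc)
        except (KeyError, OSError, TypeError) as exc:
            # "unknown" is a placeholder for records without a usable name
            name = doc.get("workflow", "unknown") if isinstance(doc, dict) else "unknown"
            errors.append({"workflow": name, "error": str(exc)})
    return errors


# usage with placeholder workflow and site names
storage = tempfile.mkdtemp()
payload = [
    {"workflow": "wf_a", "SiteWhiteList": ["T1_A"], "SiteBlackList": ["T2_B"]},
    {"SiteWhiteList": ["T1_A"]},  # malformed: no workflow name
]
errors = updateSites(storage, payload)
record = readData(storage, "wf_a")
```

Keeping all file access inside `saveData` and `readData` is what would let a later switch to a database touch only those two functions.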

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@vkuznet vkuznet requested a review from amaltaro October 22, 2024 12:40
@vkuznet
Contributor Author

vkuznet commented Oct 22, 2024

@amaltaro, this is the initial logic based on the provided requirements. I would appreciate it if you could review it and let me know whether it has the expected behavior. In particular, I need a decision on the persistent storage and on the format of the acknowledgement responses to the upstream caller. Once we settle on these, the rest is only the implementation of the site update/rules.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 293 new failures
    • 4 tests deleted
    • 13 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15353/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 13 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15354/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 17 warnings and errors that must be fixed
    • 13 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15355/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 13 warnings
    • 69 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15357/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 13 warnings
    • 69 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15361/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 13 warnings
    • 69 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15366/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 13 warnings
    • 69 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15367/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Valentin, although I did not cover 100% of your changes, I left some comments along the code.

In addition, regarding the information persisted in the filesystem: if we decide to keep writing one file per workflow, we then need to implement:

  • deleting that file once data replacement has been successful
  • listing all files pending for data replacement

In my opinion, the filesystem would provide only the name of the workflow that needs replacement. We then fetch the workflow from ReqMgr2 (similar to what is done by getRequestRecords()) and let it go through the service.
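
The two housekeeping operations listed above could look roughly like the sketch below. The helper names `listPending` and `markDone` are hypothetical, and the sketch assumes one plain file per workflow, named after the workflow, in a single directory.

```python
import os
import tempfile


def listPending(storage):
    """Return the workflow names that still have a file in the storage area."""
    if not os.path.isdir(storage):
        return []
    return sorted(name for name in os.listdir(storage)
                  if os.path.isfile(os.path.join(storage, name)))


def markDone(storage, workflow):
    """Delete the file of a workflow; return False if it was already gone."""
    try:
        os.remove(os.path.join(storage, workflow))
    except FileNotFoundError:
        return False
    return True


# usage: two workflows pending, one of them gets its data re-placed
storage = tempfile.mkdtemp()
for name in ("wf_b", "wf_a"):
    with open(os.path.join(storage, name), "w", encoding="utf-8"):
        pass
before = listPending(storage)
removed = markDone(storage, "wf_a")
after = listPending(storage)
```

Deleting the file only after a successful replacement means a failed cycle leaves the workflow pending, so it is retried in the next cycle.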

@@ -72,6 +74,11 @@ def __init__(self, msConfig, logger=None):
"""
super(MSTransferor, self).__init__(msConfig, logger=logger)

# persistent area for site list processing
wdir = '{}/storage'.format(os.getcwd())
self.storage = self.msConfig.get('persistentArea', wdir)
Contributor

We need to ensure that this area is persistent across POD restarts, so we do not lose data accidentally.
I remember we used to use something like /data/srv/state/ for database related data.
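
As a sketch of the point above, the storage area can be resolved from the service configuration and validated at start-up, so that a misconfigured or read-only location fails early instead of losing data later. The `persistentArea` key comes from the diff; the helper name `resolveStorage` and the fail-fast check are assumptions. Whether the chosen path actually survives a POD restart depends on the deployment (e.g. a mounted volume) and cannot be verified from code.

```python
import os
import tempfile


def resolveStorage(msConfig, default):
    """Pick the configured persistent area (or the default) and ensure it is usable."""
    storage = msConfig.get("persistentArea", default)
    os.makedirs(storage, exist_ok=True)
    if not os.access(storage, os.W_OK):
        raise PermissionError("storage area is not writable: {}".format(storage))
    return storage


# usage with a temporary directory standing in for the persistent volume
base = tempfile.mkdtemp()
configured = os.path.join(base, "state", "ms-transferor")
fromConfig = resolveStorage({"persistentArea": configured}, base)
fromDefault = resolveStorage({}, base)
```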

@@ -195,6 +202,13 @@ def execute(self, reqStatus):
self.logger.info("%d requests information completely processed.", len(reqResults))

for wflow in reqResults:
    # perform site list updates
Contributor

I think this code has to be placed outside of this for loop (L197). Otherwise it will only get executed when there are other workflows in the queue for data placement (workflows sitting in assigned).
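
A minimal sketch of that control flow, with the site-list updates handled independently of the regular queue. The function `runCycle` and its callback arguments are hypothetical stand-ins, not MSTransferor methods.

```python
def runCycle(reqResults, pendingUpdates, placeData, updateSites):
    """
    One execution cycle: site-list updates are handled on their own,
    so they run even when reqResults is empty.
    """
    # hypothetical callbacks: each returns True on success
    updated = [wflow for wflow in pendingUpdates if updateSites(wflow)]
    placed = [wflow for wflow in reqResults if placeData(wflow)]
    return updated, placed


# no workflows sitting in assigned, yet the pending update is still processed
updated, placed = runCycle([], ["wf_a"], lambda w: True, lambda w: True)
```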

@@ -195,6 +202,13 @@ def execute(self, reqStatus):
self.logger.info("%d requests information completely processed.", len(reqResults))

for wflow in reqResults:
    # perform site list updates
    errors = self._updateSites(wflow)
    if len(errors) == 0:
Contributor

In practice, you are overwriting this metric with the outcome of the very last workflow.
Instead, the way it has been used so far is to provide a summary of the microservice execution cycle.

That said, my suggestion would be to define it as an integer counting how many workflows have been re-placed.
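
For illustration, a per-cycle count could be accumulated as in the sketch below. The function name and the metric keys are made up for this example and are not existing MSTransferor report fields.

```python
def summarize(outcomes):
    """Build a per-cycle summary from (workflow name, list of errors) pairs."""
    replaced = sum(1 for _name, errors in outcomes if not errors)
    return {
        "num_workflows_replaced": replaced,
        "num_workflows_failed": len(outcomes) - replaced,
    }


# usage: two workflows re-placed cleanly, one failed
summary = summarize([("wf_a", []), ("wf_b", ["rule creation failed"]), ("wf_c", [])])
```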

"""
Update sites API provides asynchronous update of Site info.

:param doc: JSON payload with the following data structures:
Contributor

From the source code, it looks like we only save the workflow name. I think that is correct, but we then need to update this docstring.

Contributor Author

@vkuznet vkuznet Oct 31, 2024

We receive this record {'workflow': <wflow name>, 'SiteWhiteList': ['T1', ...], 'SiteBlackList': ['T2',...]} from upstream, and this is what is saved into a file with the workflow name as the file name. This allows us to keep the site lists for when we need to run the business logic, and avoids extra calls to the upstream service.

:return: acknowledge dict to upstream caller (ReqMgr2)
"""
# preserve provided payload to local file system
errors = []
Contributor

If this API is supposed to receive a single workflow per HTTP call (and I would say this is what we should implement), then we should convert errors from list to string type.
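
A sketch of what a single-workflow variant with a string error could look like. The `status`, `ok` and `fail` strings follow the diff under review; the `save` callable, the `error` key and the validation message are assumptions, and the final response format is still open in this thread.

```python
def updateSites(doc, save):
    """
    Handle a single-workflow payload and return the acknowledgement.
    `save` is a hypothetical callable that persists the record.
    """
    error = ""
    if not isinstance(doc, dict) or "workflow" not in doc:
        error = "payload must be a dict with a 'workflow' key"
    else:
        try:
            save(doc)
        except OSError as exc:
            error = str(exc)
    if error:
        return {"status": "fail", "error": error}
    return {"status": "ok"}


def failingSave(doc):
    """Stand-in for a storage failure."""
    raise OSError("disk full")


ok = updateSites({"workflow": "wf_a"}, lambda doc: None)
failed = updateSites({"workflow": "wf_a"}, failingSave)
```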

# send acknowledged message back to upstream caller
resp = {'status': 'ok'}
if len(errors) != 0:
    resp = {'status': 'fail', 'errors': errors}
Contributor

I would suggest using the same string as we use in CouchDB, just so we keep error strings as consistent as possible. Please check out the CMSCouch.py module, which I believe sets the non-ok answer.

Contributor Author

This API is used by the HTTP end-point to return to the upstream caller. Please clearly define how the HTTP end-point should behave in both success and failure modes. In other words, if this API succeeds, what should it return: a code, nothing? And if it fails, what should it return to the upstream code: a string? How can an error be detected in the upstream code from the return value of this API?

Contributor Author

And, on a separate note, why should MSTransferor, or for that matter any MS service, be compliant with how CMSCouch returns errors? I'm not criticizing but rather trying to understand. Bottom line, I'm asking how any MS service should report success and failure. Is it standardized across all MS services?

Contributor

I was probably mixing things up and ended up thinking that this data structure was written to couchdb, hence reporting any potential errors from the backend database back to the user.

Seeing that I was wrong, I would suggest that you look into MSPileup (or perhaps pick a different MS service) to see how the server responds to the client, and which data format and content are returned, just so we try to keep services as consistent as possible.

data = json.load(istream)  # json.load expects a file object, not the string from read()
return data

def _updateSites(self, wflow):
Contributor

I would remove all this code and rely on what has already been implemented in MSTransferor, hence, just let the workflow go through the standard algorithm.

When removing this module, please do not squash commits though. Just in case I am missing any detail that would make that not possible.

Contributor Author

I'm not sure why I need to remove it, since it is the business logic of the requested feature. How will the standard algorithm execute logic which is not there? So far, the default algorithm does not deal with sites in white/black lists. I don't understand what you want me to do here. Please elaborate.

Contributor

That is why I am suggesting to have only a list of workflows that need dedicated data placement (instead of having the site lists as well).
You will, of course, have to modify the standard algorithm such that it can also consider a list of workflow(s) retrieved from somewhere else. Other than that, the rest of the logic is already implemented and there is no need for all this code duplication.
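
One way to read this suggestion, sketched under assumptions: the names of workflows pending re-placement (for instance, read from the storage area) are merged with the names the standard algorithm already picks up, and the combined list goes through the existing logic. The function name and both inputs are hypothetical.

```python
def workflowsToProcess(assigned, pendingUpdates):
    """Merge both sources, preserving order and dropping duplicates."""
    return list(dict.fromkeys(list(assigned) + list(pendingUpdates)))


# usage: wf_b appears in both sources and is processed only once
names = workflowsToProcess(["wf_a", "wf_b"], ["wf_b", "wf_c"])
```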

@vkuznet
Contributor Author

vkuznet commented Nov 10, 2024

@amaltaro, I asked a few questions about your comments and I'm not sure you saw them, but in order to proceed with this PR I need your response. Please have a look at the PR threads where I posted my questions.

@vkuznet
Contributor Author

vkuznet commented Nov 18, 2024

@amaltaro, this is a kind ping: in order to move forward, I'm awaiting a response to my questions in this PR.

@amaltaro
Contributor

@vkuznet you need to refresh the review request through the "Reviewers" option on the top right side, otherwise I cannot see it in the GitHub filters. In addition, please update the title of this PR and if needed amend the commit message as well.

@vkuznet
Contributor Author

vkuznet commented Nov 18, 2024

Alan, this is not a review request since I didn't make any changes; rather, it is a request to answer my questions so that I can proceed. Since I didn't update any code, I thought I should not request a review. Please see my questions in the open threads and reply to each of them directly within the thread.

Development

Successfully merging this pull request may close these issues.

Remake input data placement upon site list changes
4 participants