Fix issue 12040 #12155

vkuznet · 2024-10-22T12:40:07Z

Fixes #12040

Status

In development

Description

Introduce new logic to update sites and associated rules:

Add a new POST API for transferor to MSManager; the JSON payload is POSTed to /ms-transferor/data/transferor
Add a new updateSites API to handle the POST request and return the status to the upstream caller (ReqMgr2)
Use the local file system as persistent storage for the JSON payloads
Define the JSON payload as a list with the following data structure:

[{"workflow": <name>, "SiteWhiteList": [T1,...], "SiteBlackList": [T2, ...]}, {...}]

Define the response from MSTransferor to the upstream caller as follows:

[{"workflow": <name>, "error": <error>}, {...}]

Implement saveData and readData to perform IO operations on the provided JSON payload and handle its persistent storage. So far these APIs rely on the local file system, where each JSON record is stored as a file named after the workflow. If we decide to use other storage, e.g. a database, only these two APIs will need to change.
Provide the business logic of the _updateSites API, which will be executed by the execute API of the MSTransferor daemon.
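A minimal sketch of the storage and acknowledgement flow described above, not the code in this PR. The function names come from the description, but the signatures, the `storage` argument, and the use of an empty string for "no error" are assumptions made for illustration. Note that `json.load` takes a file object, while `json.loads` takes a string.

```python
import json
import os


def saveData(storage, doc):
    """Write one payload record to <storage>/<workflow>; return an error string ("" on success)."""
    wflow = doc.get("workflow")
    # assumption: the workflow name is used verbatim as the file name,
    # so reject empty names and anything that looks like a path
    if not wflow or os.path.basename(wflow) != wflow:
        return "payload record has no usable workflow name"
    try:
        os.makedirs(storage, exist_ok=True)
        with open(os.path.join(storage, wflow), "w", encoding="utf-8") as ostream:
            json.dump(doc, ostream)
    except OSError as exc:
        return str(exc)
    return ""


def readData(storage, wflow):
    """Read the stored record of a workflow; return an empty dict if there is none."""
    fname = os.path.join(storage, wflow)
    if not os.path.exists(fname):
        return {}
    with open(fname, "r", encoding="utf-8") as istream:
        return json.load(istream)


def updateSites(storage, payload):
    """Persist every record of the payload list and build the per-workflow response list."""
    return [{"workflow": doc.get("workflow"), "error": saveData(storage, doc)}
            for doc in payload]
```

Keeping all file access inside saveData and readData is what would make a later switch to a database a local change.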

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

External dependencies / deployment changes

vkuznet · 2024-10-22T12:41:47Z

@amaltaro, this is the initial logic based on the provided requirements. I would appreciate it if you could review it and let me know whether it has the expected behavior. In particular, I need a decision about the persistent storage and an overview of the acknowledged responses to the upstream caller. Once we settle on this, the rest would be the implementation of the site update/rules only.

cmsdmwmbot · 2024-10-22T13:18:02Z

Jenkins results:

Python3 Unit tests: failed
- 293 new failures
- 4 tests deleted
- 13 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15353/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-22T14:20:20Z

Jenkins results:

Python3 Unit tests: failed
- 4 new failures
- 1 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: failed
- 14 warnings and errors that must be fixed
- 13 warnings
- 65 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15354/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-22T15:16:31Z

Jenkins results:

Python3 Unit tests: failed
- 5 new failures
- 1 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: failed
- 17 warnings and errors that must be fixed
- 13 warnings
- 65 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15355/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-22T16:21:52Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 3 warnings and errors that must be fixed
- 13 warnings
- 69 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15357/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-23T14:19:32Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
Python3 Pylint check: failed
- 1 warnings and errors that must be fixed
- 13 warnings
- 69 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15361/artifact/artifacts/PullRequestReport.html

khurtado · 2024-10-23T22:18:51Z

test this please

cmsdmwmbot · 2024-10-23T22:30:16Z

Jenkins results:

Python3 Unit tests: failed
- 4 new failures
Python3 Pylint check: failed
- 1 warnings and errors that must be fixed
- 13 warnings
- 69 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15366/artifact/artifacts/PullRequestReport.html

khurtado · 2024-10-24T12:52:03Z

test this please

cmsdmwmbot · 2024-10-24T13:23:44Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 2 changes in unstable tests
Python3 Pylint check: failed
- 1 warnings and errors that must be fixed
- 13 warnings
- 69 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15367/artifact/artifacts/PullRequestReport.html

amaltaro

Valentin, despite not covering 100% of your changes, I left some comments along the code.

In addition, regarding the information persisted in the filesystem, if we decide to keep writing one file per workflow, we then need to implement:

deleting that file once data replacement has been successful
listing all files pending for data replacement

In my opinion, the filesystem will provide only the name of the workflow that needs replacement. We then fetch the workflow from ReqMgr2 (similar to what is done by getRequestRecords()) and let it go through the service.
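The two filesystem operations listed above could look roughly like this. It is a sketch under the assumption that every regular file in the storage area is named after one pending workflow; `listPending` and `markDone` are hypothetical names, not existing WMCore functions.

```python
import os


def listPending(storage):
    """Return the names of the workflows that still have a file in the storage area."""
    if not os.path.isdir(storage):
        return []
    return sorted(name for name in os.listdir(storage)
                  if os.path.isfile(os.path.join(storage, name)))


def markDone(storage, wflow):
    """Delete the file of a workflow once its data replacement succeeded."""
    try:
        os.remove(os.path.join(storage, wflow))
    except FileNotFoundError:
        # nothing to clean up, e.g. the file was already removed in an earlier cycle
        return False
    return True
```

Deleting only after a successful replacement means a workflow that failed in one cycle is simply picked up again in the next one.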

amaltaro · 2024-10-30T20:23:09Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

@@ -72,6 +74,11 @@ def __init__(self, msConfig, logger=None):
        """
        super(MSTransferor, self).__init__(msConfig, logger=logger)

+        # persistent area for site list processing
+        wdir = '{}/storage'.format(os.getcwd())
+        self.storage = self.msConfig.get('persistentArea', wdir)


We need to ensure that this area is persistent across POD restarts, so we do not lose data accidentally.
I remember we used to use something like /data/srv/state/ for database related data.
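For illustration, the storage area could be resolved and checked once at start-up, as sketched below. Only the `persistentArea` key and the working-directory fallback come from the diff; `resolveStorage` is a hypothetical helper. Whether the directory actually survives a POD restart depends on how the volume is mounted in the deployment, which code like this cannot verify.

```python
import os


def resolveStorage(msConfig, default=None):
    """Return the directory used for persisted payloads, creating it if needed."""
    if default is None:
        # same fallback as in the diff: <current working directory>/storage
        default = os.path.join(os.getcwd(), "storage")
    storage = msConfig.get("persistentArea", default)
    os.makedirs(storage, exist_ok=True)
    if not os.access(storage, os.W_OK):
        raise RuntimeError("storage area %s is not writable" % storage)
    return storage
```

Failing early on an unwritable area would surface a misconfigured mount at service start rather than on the first POST request.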

amaltaro · 2024-10-30T20:24:45Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

@@ -195,6 +202,13 @@ def execute(self, reqStatus):
            self.logger.info("%d requests information completely processed.", len(reqResults))

            for wflow in reqResults:
+                # perform site list updates


I think this code has to be placed outside of this for loop (L197). Otherwise it will only get executed when there are other workflows in the queue for data placement (workflows sitting in assigned).

amaltaro · 2024-10-30T20:27:26Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

@@ -195,6 +202,13 @@ def execute(self, reqStatus):
            self.logger.info("%d requests information completely processed.", len(reqResults))

            for wflow in reqResults:
+                # perform site list updates
+                errors = self._updateSites(wflow)
+                if len(errors) == 0:


In practice, you are overwriting this metric with the very last workflow outcome.
Instead, the way it has been used so far is to provide a summary of the microservice execution cycle.

That said, my suggestion would be to define it as an integer counting how many workflows have been re-placed.
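Both review points on this hunk (run the site updates outside the placement loop, and report a per-cycle count instead of the last outcome) can be sketched as follows. `runCycle`, its arguments, and the summary keys are hypothetical; the real execute method and its metric names are not shown in this thread.

```python
def runCycle(reqResults, pendingUpdates, updateSites):
    """
    Sketch of one service cycle.

    reqResults: workflows picked up for the regular data placement
    pendingUpdates: workflows with a pending site list update
    updateSites: callable returning a list of error strings for one workflow
    """
    summary = {"num_replaced": 0, "num_failed": 0}

    for _wflow in reqResults:
        # the regular data placement for workflows in assigned would happen here
        pass

    # outside the loop above, so it also runs when reqResults is empty
    for wflow in pendingUpdates:
        errors = updateSites(wflow)
        if errors:
            summary["num_failed"] += 1
        else:
            summary["num_replaced"] += 1
    return summary
```

The summary accumulates over the whole cycle, so a failure on the last workflow no longer hides earlier successes.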

amaltaro · 2024-10-30T20:48:43Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

+        """
+        Update sites API provides asynchronous update of Site info.
+
+        :param doc: JSON payload with the following data structures:


From the source code, it looks like we only save the workflow name. I think that is correct, but we then need to update this docstring.

vkuznet · 2024-10-31

We receive the record {'workflow': <wflow name>, 'SiteWhiteList': ['T1', ...], 'SiteBlackList': ['T2', ...]} from upstream, and this is what is saved into a file named after the workflow. This lets us keep the site lists for when we need to run the business logic and avoids extra calls to the upstream service.

amaltaro · 2024-10-30T20:49:40Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

+        :return: acknowledge dict to upstream caller (ReqMgr2)
+        """
+        # preserve provided payload to local file system
+        errors = []


If this API is supposed to receive a single workflow per HTTP call (and I would say this is what we should implement), then we should convert errors from list to string type.
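With one workflow per HTTP call, the error can be a single string rather than a list. A minimal sketch of such a check, assuming the record layout from the PR description; `validateRecord` and its messages are hypothetical.

```python
def validateRecord(doc):
    """Check a single-workflow payload; return "" if it is valid, else an error string."""
    if not isinstance(doc, dict):
        return "payload must be a JSON object"
    wflow = doc.get("workflow")
    if not isinstance(wflow, str) or not wflow:
        return "missing or empty 'workflow'"
    for key in ("SiteWhiteList", "SiteBlackList"):
        # assumption: an absent list is treated as empty
        sites = doc.get(key, [])
        if not isinstance(sites, list) or not all(isinstance(s, str) for s in sites):
            return "'%s' must be a list of site names" % key
    return ""
```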

amaltaro · 2024-10-30T20:50:44Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

+        # send acknowledged message back to upstream caller
+        resp = {'status': 'ok'}
+        if len(errors) != 0:
+            resp = {'status': 'fail', 'errors': errors}


I would suggest using the same string as we use in CouchDB, just so we keep error strings as consistent as possible. Please check out the CMSCouch.py module, which I believe sets the non-ok answer.

vkuznet · 2024-10-31

This API is used by the HTTP end-point to return a response to the upstream caller. Please clearly define how the HTTP end-point should behave in both success and failure modes. In other words, if this API succeeds, what should it return: a code, nothing? And if it fails, what should it return to the upstream code: a string? How can an error be detected in the upstream code from the return value of this API?

And, on a separate note, why should MSTransferor, or for that matter any MS service, be compliant with how CMSCouch returns errors? I'm not criticizing but rather trying to understand. Bottom line, I'm asking how any MS service should return successes and failures. Is it standardized across all MS services?

amaltaro · 2024-11-19

I was probably mixing things up and ended up thinking that this data structure was written to CouchDB, hence reporting any potential errors from the backend database back to the user.

Seeing that I was wrong, I would suggest you look into MSPileup (or perhaps pick a different MS service) to see how the server responds to the client, and which data format and content are returned. Just so we try to keep services as consistent as possible.

amaltaro · 2024-10-30T20:52:28Z

src/python/WMCore/MicroService/MSTransferor/MSTransferor.py

+            data = json.load(istream.read())
+        return data
+
+    def _updateSites(self, wflow):


I would remove all this code and rely on what has already been implemented in MSTransferor, hence, just let the workflow go through the standard algorithm.

When removing this module, please do not squash commits though. Just in case I am missing any detail that would make that not possible.

vkuznet · 2024-10-31

I'm not sure why I need to remove it, since it is the business logic of the requested feature. How will the standard algorithm execute logic which is not there? So far, the default algorithm does not deal with sites in white/black lists? I don't understand what you want me to do here. Please elaborate.

amaltaro · 2024-11-19

That is why I am suggesting to have only a list of workflows that need dedicated data placement (instead of having the site lists as well).
You will, of course, have to modify the standard algorithm such that it can also consider a list of workflow(s) retrieved from somewhere else. Other than that, the rest of the logic is already implemented and there is no need to have all this code duplication.
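One way to read this suggestion: the storage only contributes workflow names, and their records are fetched and merged into the set that the standard algorithm already processes. The sketch below assumes records are keyed by workflow name; `collectWorkflows` and `fetchRecord` are hypothetical stand-ins (the latter for a ReqMgr2 lookup), not existing MSTransferor methods.

```python
def collectWorkflows(assigned, pendingNames, fetchRecord):
    """
    Merge workflows from the regular query with workflows flagged for re-placement.

    assigned: dict of workflow name -> request record from the standard query
    pendingNames: workflow names read from the persistent storage
    fetchRecord: callable returning the request record for a name, or None
    """
    records = dict(assigned)
    for name in pendingNames:
        if name in records:
            # already part of this cycle, no extra lookup needed
            continue
        record = fetchRecord(name)
        if record is not None:
            records[name] = record
    return records
```

Everything downstream of this merge would stay the existing placement logic, which is the code duplication the review asks to avoid.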

vkuznet · 2024-11-10T14:14:40Z

@amaltaro, I asked a few questions about your comments and I'm not sure you saw them, but in order to proceed with this PR I need your response. Please have a look at the PR threads where I posted my questions.

vkuznet · 2024-11-18T12:56:36Z

@amaltaro, this is a kind ping: in order to move forward, I'm awaiting a response to my questions in this PR.

amaltaro · 2024-11-18T14:36:22Z

@vkuznet you need to refresh the review request through the "Reviewers" option on the top right side, otherwise I cannot see it in the GitHub filters. In addition, please update the title of this PR and if needed amend the commit message as well.

vkuznet · 2024-11-18T15:07:19Z

Alan, this is not a review request since I didn't make any changes; rather, it is a request to answer my questions so that I can proceed. Since I didn't update any code, I thought I should not request a review. Please see my questions in the open threads and reply to each of them directly within the thread.

vkuznet added 2 commits October 22, 2024 08:32

Add new POST API for transferor data-service

f0f7415

Implement initial logic of update sites APIs

26dfac0

vkuznet requested a review from amaltaro October 22, 2024 12:40

vkuznet added 2 commits October 22, 2024 10:10

fix else if

c85f111

expand logic of _updateSites API

5934460

complete business logic

40332e4

vkuznet added 2 commits October 22, 2024 11:21

fix issues from pylint

7cddee3

Add errors handling

1e4b043

updates from pylint

6d8a89c

amaltaro requested changes Oct 30, 2024
