Patching production machine
Sometimes we need to patch a production WMAgent directly, because it is not always possible to redeploy the production machine. In that case, we SHOULDN'T edit the source code of the production agent by hand.
The procedure goes like this:
1. Identify the problem and, if needed, create the patch. The patch must be created against the production version of the agent,
e.g. if the production version is 0.9.7, create the patch against 0.9.7, not against the master branch.
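A minimal sketch of step 1, assuming a local WMCore clone in which the production release (0.9.7 in the example above) exists as a git tag. The function name `start_patch_branch` and the branch name are hypothetical. It starts the fix branch from the production tag instead of from master:

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a fix branch that starts at the production tag.
# Usage: start_patch_branch REPO_DIR PROD_TAG NEW_BRANCH
start_patch_branch() {
  local repo="$1" prod_tag="$2" branch="$3"
  # Refuse to continue if the production tag is not known locally.
  if ! git -C "$repo" rev-parse -q --verify "refs/tags/${prod_tag}" >/dev/null; then
    echo "tag ${prod_tag} not found in ${repo}; fetch tags first" >&2
    return 1
  fi
  # Branch from the production tag, NOT from master.
  git -C "$repo" checkout -q -b "$branch" "refs/tags/${prod_tag}"
}
```

For example, `start_patch_branch ~/WMCore 0.9.7 fix-0.9.7` after fetching the tags. The fix committed on that branch is then a diff against the code actually running in production.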
2. Thoroughly test the patch, then apply it to production:
$ curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d /data/srv/wmagent/current/apps/wmagent/lib/python2.7/site-packages/ -p3
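One way to make "test before applying" concrete is to download the patch to a file and let `patch --dry-run` check that every hunk applies before anything is modified. A sketch with the hypothetical helper name `apply_patch_checked`; `--forward` makes `patch` refuse, rather than offer to reverse, a patch that looks already applied:

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply a patch only if a dry run succeeds first.
# Usage: apply_patch_checked PATCH_FILE TARGET_DIR STRIP_LEVEL
apply_patch_checked() {
  local patch_file="$1" target_dir="$2" strip="$3"
  # Dry run: reports what would happen, modifies nothing.
  if ! patch --dry-run --forward -d "$target_dir" -p"$strip" < "$patch_file"; then
    echo "dry run failed; code in ${target_dir} left untouched" >&2
    return 1
  fi
  # Real run, only reached if the dry run succeeded.
  patch --forward -d "$target_dir" -p"$strip" < "$patch_file"
}
```

For example, download with `curl -fsSL -o /tmp/[PR_NUMBER].patch https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch`, then run `apply_patch_checked /tmp/[PR_NUMBER].patch /data/srv/wmagent/current/apps/wmagent/lib/python2.7/site-packages/ 3`. The patch file is redirected on stdin, so its path is opened before `-d` changes directory.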
For a Tier0 agent, the code lives in the location below; replace the values in [] with the current Tier0 and WMAgent versions, respectively:
/data/tier0/srv/wmagent/[2.1.2]/sw/slc7_amd64_gcc630/cms/wmagent/[1.1.12.patch2]/lib/python2.7/site-packages/
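To avoid typos when filling in the two version numbers, the path can be assembled from variables. A sketch: the helper name is hypothetical, and the `slc7_amd64_gcc630` architecture and `python2.7` parts are copied from the path above, so they may differ on other Tier0 releases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Tier0 site-packages directory for given versions.
# Architecture (slc7_amd64_gcc630) and python2.7 are taken from the example
# path and are assumptions for other Tier0 releases.
# Usage: tier0_site_packages TIER0_VERSION WMAGENT_VERSION
tier0_site_packages() {
  local tier0_version="$1" wmagent_version="$2"
  printf '/data/tier0/srv/wmagent/%s/sw/slc7_amd64_gcc630/cms/wmagent/%s/lib/python2.7/site-packages/\n' \
    "$tier0_version" "$wmagent_version"
}
```

With the versions from the example path, the patch command becomes `curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d "$(tier0_site_packages 2.1.2 1.1.12.patch2)" -p3`.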
2.1 Adding a couch (couchapp) patch:
$ curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d /data/srv/wmagent/current/apps/wmagent/data -p2
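The Python and couch commands use different `-p` values because GitHub `.patch` files name each file with git's `a/` or `b/` prefix followed by the repository path. Python code lives under `src/python/` in WMCore, so `-p3` leaves `WMCore/...` relative to `site-packages/`; couchapps live under `src/couchapps/`, so `-p2` leaves `couchapps/...` relative to the `data` directory (which, as the command above implies, is assumed to hold a `couchapps/` subdirectory). The hypothetical helper below mimics how `patch -pN` strips leading components, which helps double-check a strip level before running it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop N leading path components, like `patch -pN` does
# for simple relative paths such as "a/src/python/WMCore/...".
# Usage: strip_components PATH N
strip_components() {
  local path="$1" n="$2" i
  for ((i = 0; i < n; i++)); do
    # Remove everything up to and including the first "/".
    path="${path#*/}"
  done
  printf '%s\n' "$path"
}
```

For instance, `strip_components a/src/python/WMCore/Services/Foo.py 3` prints `WMCore/Services/Foo.py`, and `strip_components a/src/couchapps/WMStats/views/x/map.js 2` prints `couchapps/WMStats/views/x/map.js` (both file names are made up for illustration).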
2.2 Push the couchapp to the right application (execute-reqmgr, execute-workqueue, execute-wmagent), e.g. for the agent:
$ $manage execute-agent wmagent-couchapp-init
3. Restart the components affected by the patch and monitor them, e.g.:
$ $manage execute-agent wmcoreD --restart --component AnalyticsDataCollector
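For the "monitor" part, a quick check is to look for errors in the tail of the restarted component's log. A sketch; the helper name is hypothetical, and so is the log location used in the example below (`install/wmagent/<Component>/ComponentLog` under the current deployment), so verify it on your agent first:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count ERROR/Traceback lines in the last N lines of a log.
# Usage: count_recent_errors LOGFILE [N]   (N defaults to 200)
count_recent_errors() {
  local log="$1" lines="${2:-200}"
  # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps
  # that case from looking like a failure to the caller.
  tail -n "$lines" "$log" | grep -cE 'ERROR|Traceback' || true
}
```

For example, `count_recent_errors /data/srv/wmagent/current/install/wmagent/AnalyticsDataCollector/ComponentLog` (assumed path) a few minutes after the restart; running `tail -f` on the same file shows the component live.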
4. Update the patch list in the twiki.
The same rules as above also apply when patching several CERN agents at once, of course.
If you have the cmst1 user password, you can ssh to lxplus and run the long command below... but wait, before running it, make sure to:
- update the list/regex of host names (possibly including the relval node)
- update the pull request number (replace PR_NUMBER by the correct number)
- update the --shutdown and --restart commands to properly reflect the component that needs to be restarted
Then the skeleton command is as follows (again, from lxplus as cmst1):
for h in vocms0{250,251,252,253,254,255,256,257}; do echo ""; ssh cmst1@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n ********** Patching `hostname` ************";
curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
$manage execute-agent wmcoreD --shutdown --components=DBS3Upload,JobCreator;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components=DBS3Upload,JobCreator'; done
Check the stdout and make sure the patch was properly applied and the components were restarted.
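Since the host list, PR number and component list each have to be edited in several places, the skeleton command can also be wrapped in a function that takes them as arguments. A sketch: the function name `patch_agents` and the `DRY_RUN` switch are hypothetical, while the remote commands are copied from the skeleton above. With `DRY_RUN=1` it only prints what it would run, a cheap way to review the command before touching the agents:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the skeleton command.
# Usage: patch_agents SSH_USER PR_NUMBER COMPONENTS HOST...
# Set DRY_RUN=1 to print the ssh commands instead of running them.
patch_agents() {
  local user="$1" pr="$2" components="$3"
  shift 3
  # Text inside single quotes (env.sh, `hostname`, $manage) is expanded on
  # the remote host; ${pr} and ${components} are filled in locally.
  local remote_cmd='source /data/admin/wmagent/env.sh;
echo -e "\n\n ********** Patching `hostname` ************";
curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/'"${pr}"'.patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
$manage execute-agent wmcoreD --shutdown --components='"${components}"';
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components='"${components}"
  local h
  for h in "$@"; do
    echo ""
    if [ "${DRY_RUN:-0}" = 1 ]; then
      printf 'ssh %s@%s\n%s\n' "$user" "$h" "$remote_cmd"
    else
      ssh "${user}@${h}" "$remote_cmd"
    fi
  done
}
```

With `PR_NUMBER` set to the pull request number, `DRY_RUN=1 patch_agents cmst1 "$PR_NUMBER" DBS3Upload,JobCreator vocms0{250,251,252,253,254,255,256,257}` prints the commands for review; the same call without `DRY_RUN=1` runs them. Passing `cmsdataops` and the `cmsgwms-submit` hosts would cover the FNAL loop in the same way.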
The same rules as above also apply when patching the FNAL agents, of course. As cmsdataops, ssh to one of the FNAL agents (e.g. submit1) and run the long command below... but wait, before running it, make sure to:
- update the list/regex of host names (possibly including the relval node)
- update the pull request number (replace PR_NUMBER by the correct number)
- update the --shutdown and --restart commands to properly reflect the component that needs to be restarted
Then the skeleton command is as follows (this time from one of the FNAL schedd nodes):
for h in cmsgwms-submit{3,4,5,6}; do echo ""; ssh cmsdataops@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n ********** Patching `hostname` ************";
curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
$manage execute-agent wmcoreD --shutdown --components=DBS3Upload,JobCreator;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components=DBS3Upload,JobCreator'; done
Check the stdout and make sure the patch was properly applied and the components were restarted.
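Reading the stdout by eye is error-prone when many agents are patched at once. One option is to capture it, e.g. by appending `2>&1 | tee /tmp/patch-run.log` after the final `done` (the file name is only an example), and then scan the file for failure messages that GNU `patch` prints. A sketch with a hypothetical helper name; the patterns cover failed hunks, already-applied (reversed) patches and missing files, but they are not an exhaustive list:

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan captured output of the patching loop for
# common GNU patch failure messages; prints matches and returns 1 if any.
# Usage: patch_log_ok LOGFILE
patch_log_ok() {
  local log="$1"
  local pattern='FAILED|Reversed \(or previously applied\)|can.t find file to patch'
  if grep -nE "$pattern" "$log"; then
    echo "patch problems found in ${log}" >&2
    return 1
  fi
  echo "no patch failures found in ${log}"
}
```

A clean result only means none of these messages appeared; the component restarts still need to be checked separately.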