Releases: PanDAWMS/pilot
73.18
- Added merged_lhef._0.events-new and Process to the list of files that are removed before the job log is created. Discussed in JIRA ticket: https://its.cern.ch/jira/browse/ATLASJT-391
- Now using function getmtime() instead of getctime() in the looping job killer; problem reported by David Cameron
73.17
Pilot 1 was updated this morning to fix a problem in the looping job algorithm that affected some user jobs. The problem occurred when getting the modification time of files that no longer exist; it is not clear how those files disappear, since they are known to exist less than a second before the modification time is requested.
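A minimal sketch of how a looping-job check might tolerate files that vanish between being listed and having their modification time read. The helper names safe_mtime and most_recent_mtime are hypothetical and are not taken from the pilot source; only os.path.getmtime() is named in these notes.

```python
import os


def safe_mtime(path):
    """Return the modification time of path, or None if the file is gone."""
    try:
        return os.path.getmtime(path)
    except OSError:
        # The file may have been removed after it was listed
        return None


def most_recent_mtime(paths):
    """Return the newest modification time among the paths that still exist."""
    mtimes = [m for m in map(safe_mtime, paths) if m is not None]
    return max(mtimes) if mtimes else None
```

A caller could compare the returned time with the current time to decide whether a job has stopped writing output; a result of None means no candidate file could be inspected.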
73.16
Looping job update:
- Fixed a bug in the looping job killer, which now identifies the most recently modified file (discovered by R. Walker)
Wrong architecture:
- Added new error code 1248, ERR_WRONGARCHITECTURE: "Job built on wrong architecture"
- Added new error code 1249, ERR_RUNGENFAILURE: "RunGen failure (consult log file)"
Max work dir size:
- Added a 10% grace margin to size checks against maxwdir
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-482
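The grace margin on the maxwdir check can be illustrated with a short sketch. The function name, the byte units and the strict comparison are assumptions made for illustration; the notes only state that a 10% margin was added.

```python
def exceeds_max_workdir_size(size_bytes, maxwdir_bytes, grace=0.10):
    """Return True if the work dir size exceeds maxwdir plus the grace margin."""
    # With grace=0.10 the effective limit is 110% of maxwdir
    return size_bytes > maxwdir_bytes * (1.0 + grace)
```

With a limit of 1000 bytes, a work directory of 1100 bytes would still pass under these assumptions, while 1101 bytes would not.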
No shell code returned for proxy error:
- Added NOPROXY error code to list in shellExitCode converter (discovered by P. Svirin)
73.15
- Containers update for Pilot 2 that has implications for Pilot 1
- Pilot 1 discards any @-syntax that might be present in cmtConfig from job definition (used to communicate ALRB_USER_PLATFORM to atlasSetup)
- Updated the Rucio client location detection function (used in combination with list_replicas())
- Now setting the site name using 'gstat' instead of 'gocname' as the latter is not always correctly set
Contributions from A. Anisenkov, P. Nilsson
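As an illustration of discarding the @-syntax from cmtConfig, the sketch below assumes that the extra information is appended after an "@" character. The function name and the example value are hypothetical and are not taken from the pilot source.

```python
def strip_at_syntax(cmtconfig):
    """Return cmtConfig with anything from the first '@' onwards removed."""
    # Assumption: the part after '@' carries the platform hint for atlasSetup
    return cmtconfig.split("@", 1)[0]
```

For a hypothetical value such as "x86_64-slc6-gcc62-opt@centos7" this would keep only "x86_64-slc6-gcc62-opt"; a value without "@" is returned unchanged.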
73.14
- The location detection function used with list_replicas() has been updated and should now be IPv6 compatible. The previous version did not work on IPv6-only WNs.
73.13
Exception handling:
- Added exception handling to findProcessesInGroup() function to avoid crash when ps command has failed (leading to 'invalid literal for int()')
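The kind of failure described above can be sketched as follows. This is a standalone illustration, not the actual findProcessesInGroup() code; the function name parse_pids and the assumption that the PID is the first column of the ps output are hypothetical.

```python
def parse_pids(ps_output):
    """Extract integer PIDs from the first column of ps-style output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            pids.append(int(fields[0]))
        except ValueError:
            # Header lines or error text from a failed ps call end up here
            continue
    return pids
```

Catching ValueError around int() means that unexpected text from a failed ps command is skipped instead of raising 'invalid literal for int()'.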
Debug update (requested by R. Walker):
- Pilot now looks for both .log and log. file patterns when looking for latest updated log file
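A minimal sketch of looking for the latest updated log file with both naming patterns. The glob patterns "*.log" and "log.*" and the function name are assumptions; the notes do not give the exact patterns used by the pilot.

```python
import glob
import os


def latest_log_file(directory):
    """Return the most recently modified file matching *.log or log.*"""
    candidates = set()
    for pattern in ("*.log", "log.*"):
        candidates.update(glob.glob(os.path.join(directory, pattern)))
    # Ignore directories that happen to match the patterns
    files = [f for f in candidates if os.path.isfile(f)]
    return max(files, key=os.path.getmtime) if files else None
```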
Event service related:
- Fixed a bug related to the reporting of corrupt files
Nordugrid patch (already released on Nordugrid):
- Moved back offending rucio imports from module header to where they are used (rucio_sitemover)
Code contributions from W. Guan, P. Nilsson.
73.12
- Checksum verification update. For stage-out checksum verification, the pilot now uses the checksum value from the XML, which is calculated right after the payload finishes, rather than the value recalculated after the stage-out has finished. This change will detect any file corruption that happens between the end of payload execution and the end of stage-out.
- Now reporting corrupted ES files
- Rucio logger update; previously, after an exception, the rucio log was not propagated correctly
Code contributions from Wen Guan, Tomas Javurek, Paul Nilsson.
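The checksum verification idea in this release can be sketched as follows: compute a reference checksum once, right after the payload finishes, and compare it later with the checksum reported for the transferred file. The use of adler32 and both function names are assumptions; the notes do not name the algorithm or the functions involved.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file as an 8-digit hex string."""
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def verify_stageout(reference_checksum, remote_checksum):
    """Compare the checksum recorded after the payload with the remote one."""
    return reference_checksum.lower() == remote_checksum.lower()
```

Because the reference value is taken before the transfer starts, a file that is corrupted locally between the end of the payload and the end of stage-out would no longer match it.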
73.11
Corrected benchmark termination (a missing break). Previously the benchmark loop could not be terminated if the command ran for longer than the allowed time (due to a stuck benchmark command which is unrelated to the pilot). This led to large stdout in some jobs.
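A minimal sketch of a polling loop that leaves as soon as the allowed time is exceeded, which is the behaviour the missing break prevented. The function name, its parameters and the use of subprocess are assumptions for illustration and do not reproduce the pilot's benchmark code.

```python
import subprocess
import time


def run_with_time_limit(cmd, max_seconds, poll_interval=0.1):
    """Run cmd; kill it and stop polling once max_seconds have passed."""
    proc = subprocess.Popen(cmd)
    start = time.time()
    timed_out = False
    while True:
        if proc.poll() is not None:
            break  # the command finished on its own
        if time.time() - start > max_seconds:
            proc.kill()
            timed_out = True
            break  # leave the loop once the time limit is exceeded
        time.sleep(poll_interval)
    proc.wait()
    return proc.returncode, timed_out
```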
73.10
- Stage-out verification has been switched off for OS transfers. The site mover inherits from the rucio site mover, which recently added this type of verification, which led to problems with OSs and with sim jobs at P1
- Time-outs have been removed from Rucio API transfer calls since they led to problems on the server side
I also squeezed in an update that avoids port 9256, which might otherwise be opened at random by a TCP server; this prevents problems with local monitoring, which uses this port. Requested by A. De Salvo
Code contributions from W. Guan, T. Javurek, P. Nilsson.
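The port 9256 exclusion mentioned in this release can be sketched as a random port choice that skips a reserved set. How the pilot actually selects its port is not described here, so the function name, the port range and the RESERVED_PORTS constant are hypothetical.

```python
import random

RESERVED_PORTS = frozenset({9256})  # port used by local monitoring


def pick_port(low=1024, high=65535, reserved=RESERVED_PORTS, rng=random):
    """Pick a random port in [low, high] that is not in the reserved set."""
    candidates = [p for p in range(low, high + 1) if p not in reserved]
    if not candidates:
        raise ValueError("no usable port in the given range")
    return rng.choice(candidates)
```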
73.8
The davs update from the previous pilot version was rolled back since it caused some problems with direct access on a couple of sites (but later fixed in AGIS by Rod Walker). If there is a need to continue testing with davs, a special pilot version can be provided.